从字符串中提取信息并转换为列表

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31, [Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue [Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)

3条回答

网友

1楼 · 编辑于 2024-07-07 08:38:43

re解决方案：

import re

input = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue",
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]

def extract(s):
    match = re.search("(X=\d+(?:\.\d*)?).*?\](.*?)$",s)
    return match.groups()

output = [extract(item) for item in input]
print(output)

输出：

[
    ('X=250.44', 'DECEMBER 31,'),
    ('X=307.5', 'respectively. The net decrease in the revenue'),
    ('X=49.5', '(US$ in millions)'),
]

说明：

\d。。。数字
\d+。。。一个或多个数字
(?:...)。。。非捕获（“正常”）括号
\.\d*。。。点后跟零或多个数字
(?:\.\d*)?。。。可选（零或一）“小数部分”
(X=\d+(?:\.\d*)?)。。。第一组，X=number
.*?。。。任何字符的零个或多个（非贪婪）
\]]符号
$。。。结束
\](.*?)$。。。第二组，介于]和字符串结尾之间的任何内容

网友

2楼 · 编辑于 2024-07-07 08:38:43

试试这个：

(X=[^,]*)(?:.*])(.*)

import re

source = """[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,
[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue
[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)""".split('\n')

pattern = r"(X=[^,]*)(?:.*])(.*)"

for line in source:
    print(re.search(pattern, line).groups())

输出：

('X=250.44', 'DECEMBER 31,')
('X=307.5', 'respectively. The net decrease in the revenue')
('X=49.5', '(US$ in millions)')

在所有捕获之前都有X=，所以我只是做了一个捕获组，如果有必要，可以随意添加非捕获组

网友

3楼 · 编辑于 2024-07-07 08:38:43

使用带有命名组的正则表达式捕获相关位：

>>> line = "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
>>> m = re.search(r'(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$', line)
>>> m.groups()
('250.44', 'DECEMBER 31,')
>>> m['x_coord']
'250.44'
>>> m['text']
'DECEMBER 31,'

相关问题更多 >

编程相关推荐

热门问题

热门文章