Extracting information from a string and converting it to a list

Posted 2024-07-07 08:38:43


I have a string that looks like this:

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,

[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)

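I want to extract the value of "X" together with the text that follows it on each line, and collect the results as lists. The expected output is shown below.

Expected output: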

['X=250.44','DECEMBER 31,']
['X=307.5','respectively. The net decrease in the revenue']
['X=49.5','(US$ in millions)']

How can this be done in Python?

My approach:

mylist = []
for line in data.split("\n"):
    if line.strip():
        x_coord = re.findall('^(X=.*)\,$', line)
        text = re.findall('^(]\w +)', line)
        mylist.append([x_coord, text])

My approach does not find any value for either x_coord or text.


3 Answers

A solution using re:

import re

# Named "lines" rather than "input" so the built-in input() is not shadowed
lines = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue",
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]

def extract(s):
    # Raw string so that \d and \] reach the regex engine unchanged
    match = re.search(r"(X=\d+(?:\.\d*)?).*?\](.*?)$", s)
    return match.groups()

output = [extract(item) for item in lines]
print(output)

Output:

[
    ('X=250.44', 'DECEMBER 31,'),
    ('X=307.5', 'respectively. The net decrease in the revenue'),
    ('X=49.5', '(US$ in millions)'),
]

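Explanation:

  • \d ... a digit
  • \d+ ... one or more digits
  • (?:...) ... non-capturing ("plain") parentheses
  • \.\d* ... a dot followed by zero or more digits
  • (?:\.\d*)? ... an optional (zero or one) "decimal part"
  • (X=\d+(?:\.\d*)?) ... the first group, X=number
  • .*? ... zero or more of any character (non-greedy)
  • \] ... a literal ] character
  • $ ... end of the string
  • \](.*?)$ ... the second group, anything between ] and the end of the string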
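
The breakdown above can be kept next to the pattern itself. The following is a minimal sketch, assuming the same pattern as in this answer, compiled with re.VERBOSE so each piece carries its own comment. The names X_AND_TEXT and extract_pair are hypothetical, and returning a list (or None when a line does not match) is a choice made here to mirror the expected output in the question, not part of the original answer.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE so that each
# piece from the explanation sits next to its comment.
X_AND_TEXT = re.compile(
    r"""
    (X=\d+(?:\.\d*)?)   # group 1: X= plus a number with an optional fraction
    .*?                 # non-greedy filler up to the next closing bracket
    \]                  # the closing bracket of the coordinate block
    (.*?)$              # group 2: everything from there to the end of the string
    """,
    re.VERBOSE,
)

def extract_pair(line):
    """Return [x, text] for one line, or None if the line does not match."""
    match = X_AND_TEXT.search(line)
    return list(match.groups()) if match else None

# Shortened sample line in the same layout as the question's data
sample = "[Base Font : A, Font Size : 1.0, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5 width=2.5](US$ in millions)"
print(extract_pair(sample))            # ['X=49.5', '(US$ in millions)']
print(extract_pair("no match here"))   # None
```

Checking for None before calling groups() avoids an AttributeError on blank or malformed lines.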

Try this pattern:

(X=[^,]*)(?:.*])(.*)

import re

source = """[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,
[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue
[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)""".split('\n')

pattern = r"(X=[^,]*)(?:.*])(.*)"

for line in source:
    print(re.search(pattern, line).groups())

Output:

('X=250.44', 'DECEMBER 31,')
('X=307.5', 'respectively. The net decrease in the revenue')
('X=49.5', '(US$ in millions)')

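Every value in your expected output is preceded by X=, so I simply kept it inside the capture group. Feel free to add non-capturing groups if you need something different.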
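
If only the number is wanted, X= can sit outside the first capture group. A rough sketch of that variation follows; NUMBER_AND_TEXT and parse_line are hypothetical names, and converting the value with float() assumes the text after X= is always a plain decimal number, as it is in the sample data.

```python
import re

# "X=" is outside the first capture group, so group 1 is the bare number.
# The greedy ".*" followed by "]" runs to the LAST "]" on the line.
NUMBER_AND_TEXT = re.compile(r"X=([^,]*).*\](.*)")

def parse_line(line):
    """Return [x_as_float, text], or None if the line does not match."""
    match = NUMBER_AND_TEXT.search(line)
    if match is None:
        return None
    x_value, text = match.groups()
    # Assumption: x_value is a plain decimal number; float() raises otherwise
    return [float(x_value), text]

line = "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue"
print(parse_line(line))
```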

Use a regular expression with named groups to capture the relevant bits:

>>> import re
>>> line = "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
>>> m = re.search(r'(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$', line)
>>> m.groups()
('250.44', 'DECEMBER 31,')
>>> m['x_coord']
'250.44'
>>> m['text']
'DECEMBER 31,'
