Using regex to extract elements from text and add them to a dictionary

Published 2024-09-30 01:31:27


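I'm trying to build a dictionary from the text below, which I retrieved from a website, using some kind of loop and a regex. I'd like the dictionary to look like this: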

{36:30281, 36 2/3:30282, 37:30283, 37 1/3: 30283, 38:30284 etc..}

Here is the text I retrieved from the website:

[option value="-1">Choose size</option>, option value="30281">\r\n\t\t\t\t\t\t\t\t\t36\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option>, option value="30282">\r\n\t\t\t\t\t\t\t\t\t36 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30283"\r\n\t\t\t\t\t\t\t\t\t37 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30284">\r\n\t\t\t\t\t\t\t\t\t38\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30285">\r\n\t\t\t\t\t\t\t\t\t38 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30286">\r\n\t\t\t\t\t\t\t\t\t39 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30287">\r\n\t\t\t\t\t\t\t\t\t40\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30288">\r\n\t\t\t\t\t\t\t\t\t40 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30289">\r\n\t\t\t\t\t\t\t\t\t41 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>]

I'm not very good with regular expressions. Could anyone suggest a solution that would help me do this?

Thanks


2 answers

Here is a regex that works:

import re

re.findall(r'\t(\d{2}\s+\d/\d)\r\n', '[option value="-1">Choose size</option>, option value="30281">\r\n\t\t\t\t\t\t\t\t\t36\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option>, option value="30282">\r\n\t\t\t\t\t\t\t\t\t36 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30283"\r\n\t\t\t\t\t\t\t\t\t37 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30284">\r\n\t\t\t\t\t\t\t\t\t38\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30285">\r\n\t\t\t\t\t\t\t\t\t38 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30286">\r\n\t\t\t\t\t\t\t\t\t39 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30287">\r\n\t\t\t\t\t\t\t\t\t40\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30288">\r\n\t\t\t\t\t\t\t\t\t40 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30289">\r\n\t\t\t\t\t\t\t\t\t41 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>]')

This outputs:

['36 2/3', '37 1/3', '38 2/3', '39 1/3', '40 2/3', '41 1/3']

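It is mostly based on your own attempt, with the following changes. I removed everything except the first part of your regex. I changed the 13 to \d, which matches any digit rather than only 1 and 3. Then I added a match for the trailing \r\n (carriage return and line feed) at the end. That part is not strictly necessary, so you can drop it if you like, but I figured it gives you a bit of extra safety. Note that this pattern only picks up the sizes that include a fraction.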
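As a small variation that is not part of the answer itself, the fraction can be made optional so that whole sizes such as 36 are matched as well. This is only a sketch: the `sample` string is a shortened, hypothetical stand-in for the scraped text.

```python
import re

# Shortened stand-in for the scraped text (hypothetical sample)
sample = (
    'option value="30281">\r\n\t\t\t36\r\n\t\t\t/option>, '
    'option value="30282">\r\n\t\t\t36 2/3\r\n\t\t\t/option>'
)

# (?:\s+\d/\d)? makes the " 2/3" part optional, so "36" matches too
sizes = re.findall(r'\t(\d{2}(?:\s+\d/\d)?)\r\n', sample)
print(sizes)
```

This still returns only the sizes, not the ids, so it does not build the dictionary on its own.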

You could use this pattern (demo):

value=\"(\d+)\"\D*(\d+(?:\ [\d/]+)?)


In Python this would be (using a dict comprehension):
import re 

junk_string = """
[option value="-1">Choose size</option>, option value="30281">\r\n\t\t\t\t\t\t\t\t\t36\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option>, option value="30282">\r\n\t\t\t\t\t\t\t\t\t36 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30283"\r\n\t\t\t\t\t\t\t\t\t37 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30284">\r\n\t\t\t\t\t\t\t\t\t38\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30285">\r\n\t\t\t\t\t\t\t\t\t38 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30286">\r\n\t\t\t\t\t\t\t\t\t39 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30287">\r\n\t\t\t\t\t\t\t\t\t40\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30288">\r\n\t\t\t\t\t\t\t\t\t40 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30289">\r\n\t\t\t\t\t\t\t\t\t41 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>]
"""

rx = re.compile(r'value=\"(\d+)\"\D*(\d+(?:\ [\d/]+)?)')
result = {m.group(2): m.group(1) 
            for m in rx.finditer(junk_string)}

print(result)
# {'36': '30281', '36 2/3': '30282', '37 1/3': '30283', '38': '30284', '38 2/3': '30285', '39 1/3': '30286', '40': '30287', '40 2/3': '30288', '41 1/3': '30289'}
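To make the pattern easier to read, here is a sketch of the same expression written in verbose mode, with each piece commented. The short `sample` string and the conversion of the ids to `int` are assumptions added for illustration, not part of the answer.

```python
import re

rx = re.compile(r'''
    value="(\d+)"          # group 1: the numeric id inside value="..."
    \D*                    # skip all non-digits up to the start of the size
    (\d+(?:\ [\d/]+)?)     # group 2: the size, e.g. 36 or 36 2/3
''', re.VERBOSE)

# Shortened stand-in for the scraped text (hypothetical sample)
sample = (
    'option value="-1">Choose size</option>, '
    'option value="30281">\r\n\t\t36\r\n\t\t/option>, '
    'option value="30282">\r\n\t\t36 2/3\r\n'
)

# Keys stay strings because sizes like "36 2/3" are not integers
sizes = {m.group(2): int(m.group(1)) for m in rx.finditer(sample)}
print(sizes)
```

The "Choose size" entry is skipped because `value="-1"` does not match `(\d+)`, which allows digits only.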

But as noted in the comments, this is not really plain text but part of the DOM, so you should at least consider using a parser.
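As a rough sketch of the parser route, Python's standard-library `html.parser` can collect each option's value and label. The `html` snippet below is a hypothetical, well-formed version of the select element (the scraped text in the question is missing some angle brackets), and the class name `OptionCollector` is made up for this example.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collects {option label: value attribute} for every <option>."""

    def __init__(self):
        super().__init__()
        self.options = {}
        self._value = None  # value of the <option> currently open
        self._text = []     # text chunks seen inside that <option>

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._value is not None:
            label = "".join(self._text).strip()
            self.options[label] = self._value
            self._value = None


# Hypothetical well-formed markup, assumed for illustration
html = (
    '<select>'
    '<option value="-1">Choose size</option>'
    '<option value="30281">\r\n\t\t36\r\n\t\t</option>'
    '<option value="30282">\r\n\t\t36 2/3\r\n\t\t</option>'
    '</select>'
)

parser = OptionCollector()
parser.feed(html)
parser.close()

# Drop the "Choose size" placeholder, whose value is -1
sizes = {label: value for label, value in parser.options.items() if value != "-1"}
print(sizes)
```

If the page is already being scraped with a third-party HTML library, its own element lookup would probably be shorter than this hand-written collector.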
