Regular expression for extracting fields from wiki template markup

In [1]: markup = get_wikipedia_markup('United States presidential election, 2012')

In [2]: markup
Out[2]: u"{{ | nominee1 = '''[[Barack Obama]]'''\n | party1 = Democratic Party (United States)\n | home_state1 = [[Illinois]]\n | running_mate1 = '''[[Joe Biden]]'''\n | nominee2 = [[Mitt Romney]]\n | party2 = Republican Party (United States)\n | home_state2 = [[Massachusetts]]\n | running_mate2 = [[Paul Ryan]]\n }}"

3 Answers

User

Answer 1 · edited 2024-10-04 03:20:20

For infobox data like this, it is best to use DBpedia. They have already done all the extraction work for you :)

http://wiki.dbpedia.org/Downloads38

See the "Ontology Infobox Properties" file. You don't have to be an ontology expert. A simple TSV parser is enough to find the information you need!
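As a rough illustration of the "simple TSV parser" idea, here is a minimal sketch. The three-column layout (subject, property, value), the function name find_properties, and the sample rows are all assumptions made for this example; the actual DBpedia download may be laid out differently, so check the file before relying on this.

```python
import csv
import io


def find_properties(tsv_file, subject, wanted):
    """Return {property: value} for one subject from a tab-separated file.

    Assumes (hypothetically) that each row is: subject, property, value.
    """
    found = {}
    # QUOTE_NONE treats quote characters in the data as ordinary text
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        row_subject, prop, value = row[0], row[1], row[2]
        if row_subject == subject and prop in wanted:
            found[prop] = value
    return found


# Made-up sample rows standing in for a real dump file
sample = io.StringIO(
    "United_States_presidential_election,_2012\tnominee1\tBarack Obama\n"
    "United_States_presidential_election,_2012\tnominee2\tMitt Romney\n"
    "Some_other_page\tnominee1\tSomeone Else\n"
)
result = find_properties(
    sample,
    "United_States_presidential_election,_2012",
    {"nominee1", "nominee2"},
)
print(result)
```

With a real file you would pass an open file handle instead of the io.StringIO sample.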

User

Answer 2 · edited 2024-10-04 03:20:20

Use mwparserfromhell! It condenses the code and captures the results more reliably. Usage for this example:

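import mwparserfromhell as mw

# get_wikipedia_markup is the helper from the question; it returns the page's wikitext
text = get_wikipedia_markup('United States presidential election, 2012')
code = mw.parse(text)

nominee1 = nominee2 = None
for template in code.filter_templates():
    # matches() strips surrounding whitespace and normalizes the first letter's case
    if template.name.matches('Infobox election'):
        nominee1 = template.get('nominee1').value
        nominee2 = template.get('nominee2').value
        break

print(nominee1)
print(nominee2)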

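That makes capturing the results very simple.

User

Answer 3 · edited 2024-10-04 03:20:20

Here it should be easier to capture the strings you want in a group than to use lookarounds. (In fact, a lookbehind would not work here with Python's regex engine, because the optional whitespace makes the expression variable-width.)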
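To see the fixed-width restriction in practice, here is a small sketch using Python's standard re module. The two patterns and the helper name compiles are made up for illustration; they are not the pattern this answer recommends.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Variable-width lookbehind: \s* can match any number of spaces
variable_lookbehind = r"(?<=nominee1\s*=\s*)\[\[[^]]+\]\]"
# Same idea with a capturing group instead of a lookbehind
capturing_group = r"nominee1\s*=\s*\[\[([^]]+)\]\]"

lookbehind_ok = compiles(variable_lookbehind)
group_ok = compiles(capturing_group)
print(lookbehind_ok, group_ok)
```

The lookbehind version is rejected at compile time, while the capturing-group version compiles normally.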

Try this regular expression:

\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?

Result: on the markup from the question, the capture group matches Barack Obama and Mitt Romney.
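To try the pattern yourself, here is a minimal sketch with re.findall. The markup string is copied from the question; the names NOMINEE_RE and nominees are chosen only for this example.

```python
import re

# Markup string copied from the question
markup = (
    "{{ | nominee1 = '''[[Barack Obama]]'''\n"
    " | party1 = Democratic Party (United States)\n"
    " | home_state1 = [[Illinois]]\n"
    " | running_mate1 = '''[[Joe Biden]]'''\n"
    " | nominee2 = [[Mitt Romney]]\n"
    " | party2 = Republican Party (United States)\n"
    " | home_state2 = [[Massachusetts]]\n"
    " | running_mate2 = [[Paul Ryan]]\n"
    " }}"
)

# The optional ''' on either side covers bolded and plain links alike
NOMINEE_RE = re.compile(r"\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?")

# findall returns only the text captured by the single group
nominees = NOMINEE_RE.findall(markup)
print(nominees)
```

Note that a piped link such as [[Target|label]] would be captured whole, pipe included, so it would need extra handling.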
