How to parse a sentence into tokens using regex or a toolkit

Published 2024-06-13 15:13:52


How can I use regex, or a toolkit such as beautifulsoup or lxml, to parse a sentence like this:

input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

into this:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

I can't use re.findall("<person>(.*?)</person>", input), because the tag names differ.


2 Answers

Try this regex:

>>> import re
>>> input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> print(re.sub(r"<[^>]*?[^/]\s*>[^<]*?</.*?>", r"\n\g<0>\n", input))
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

>>> 

See a demo of the regex here
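
Note that the substitution above keeps "drove to" on one line, whereas the question asks for "drove" and "to" as separate tokens. A minimal sketch of a pattern that produces the requested tokens directly is shown here. It assumes the tags are flat (not nested) and carry no attributes; the names TOKEN_RE and tokenize are hypothetical, chosen only for illustration.

```python
import re

# Either a whole <tag>...</tag> element (the backreference \1 forces the
# closing tag to match the opening one), or a run of characters that
# contains neither whitespace nor "<".
# Assumption: tags are flat and have no attributes.
TOKEN_RE = re.compile(r"<(\w+)>.*?</\1>|[^<\s]+")


def tokenize(text):
    """Return tagged elements and plain words as a flat list of tokens."""
    # finditer + group(0) is used because findall would return only the
    # captured tag name whenever the pattern contains a group.
    return [match.group(0) for match in TOKEN_RE.finditer(text)]


sentence = "Yesterday<person>Peter Smith</person>drove to<location>New York</location>"
for token in tokenize(sentence):
    print(token)
```

Because the tag name is captured rather than hard-coded, the same pattern covers person, location, and any other tag name made of word characters.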

See how simple this is with BeautifulSoup:

from bs4 import BeautifulSoup

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    print(item)

Prints:

Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

UPD (split the non-tag items on whitespace and print each part on its own line):

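from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    if not isinstance(item, Tag):
        # Plain text between tags: split on whitespace, one word per line
        for part in item.split():
            print(part)
    else:
        # A tag element: print it whole, markup included
        print(item)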

Prints:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

Hope that helps.
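
If installing a third-party package is not an option, the same split can be sketched with the standard library's html.parser module. This is an illustration under stated assumptions, not part of either answer: tags are flat and have no attributes, and the names TokenCollector and tokenize_with_parser are hypothetical. HTMLParser lowercases tag names, so the rebuilt markup may differ in case from the input.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collect whole tagged elements and individual plain-text words."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._open_tag = None   # name of the tag currently open, if any
        self._buffer = []       # text seen inside the open tag

    def handle_starttag(self, tag, attrs):
        # Assumption: tags are flat, so a new start tag simply begins
        # a fresh element; attributes are ignored.
        self._open_tag = tag
        self._buffer = []

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            inner = "".join(self._buffer)
            self.tokens.append("<{0}>{1}</{0}>".format(tag, inner))
            self._open_tag = None

    def handle_data(self, data):
        if self._open_tag is not None:
            # Text inside a tag stays together, spaces included.
            self._buffer.append(data)
        else:
            # Text outside any tag is split into separate words.
            self.tokens.extend(data.split())


def tokenize_with_parser(text):
    parser = TokenCollector()
    parser.feed(text)
    parser.close()
    return parser.tokens


data = "Yesterday<person>Peter Smith</person>drove to<location>New York</location>"
for token in tokenize_with_parser(data):
    print(token)
```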
