Extracting text inside XML tags using Python (while avoiding <p> tags) - Q&A

<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>

2 answers

User

#1 · Edited 2024-06-28 19:41:38

Is this something that can be easily parsed by regex?

Don't!

Use an XML parser such as lxml.

ex = """
<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
</body.content>"""

from lxml import etree
root = etree.fromstring(ex)
# findtext returns the text of the first element matching the path
print(root.findtext('./block/p'))

Output:

LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
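
Note that findtext only returns the text of the first matching <p>. If a block can hold several paragraphs, or you want to pick a block by its class, something like the following may help. It is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml (the calls used here exist in both); the helper name block_text and the shortened SAMPLE document are made up for illustration and are not from the question.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the document in the question.
SAMPLE = """<body.content>
  <block class="lead_paragraph">
    <p>LEAD: first paragraph.</p>
  </block>
  <block class="full_text">
    <p>LEAD: first paragraph.</p>
    <p>Second paragraph.</p>
  </block>
</body.content>"""


def block_text(xml_string, block_class):
    """Return all text inside the first <block> with the given class,
    with the <p> wrappers dropped, or None if no such block exists."""
    root = ET.fromstring(xml_string)
    block = root.find(f"./block[@class='{block_class}']")
    if block is None:
        return None
    # itertext() walks every text node under the block in document order,
    # so the <p> tags themselves never show up in the result.
    pieces = (text.strip() for text in block.itertext())
    return " ".join(piece for piece in pieces if piece)


print(block_text(SAMPLE, "full_text"))
```

Whitespace-only nodes between the tags are skipped, so the paragraphs come back joined by single spaces.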

User

#2 · Edited 2024-06-28 19:41:38

You can also use the BeautifulSoup parser.

>>> from bs4 import BeautifulSoup
>>> s = '''<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>'''
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid a warning
>>> for i in soup.find_all('block', class_="full_text"):
...     print(i.text)



LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
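
The blank lines in that output come from the whitespace around the <p> tags, which .text keeps. If that is unwanted, get_text with strip=True is one way to tidy it up. A minimal sketch, assuming bs4 is installed; the helper name block_texts and the shortened SAMPLE markup are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the document in the question.
SAMPLE = """<body.content>
  <block class="lead_paragraph">
    <p>LEAD: first paragraph.</p>
  </block>
  <block class="full_text">
    <p>LEAD: first paragraph.</p>
    <p>Second paragraph.</p>
  </block>
</body.content>"""


def block_texts(markup, block_class):
    """Return the cleaned text of every <block> that has the given class."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        # strip=True trims each text node and drops the empty ones;
        # the " " separator keeps adjacent paragraphs from running together.
        block.get_text(" ", strip=True)
        for block in soup.find_all("block", class_=block_class)
    ]


for text in block_texts(SAMPLE, "full_text"):
    print(text)
```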
