Extracting the text inside XML tags in Python (while avoiding the <p> tags)

Posted 2024-06-28 19:41:38


I'm working with the NYT corpus in Python and am trying to extract only the content of the "full_text" class from each .xml article file. For example:

<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>

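Ideally, I'd like to parse out just the string, yielding "LEAD: Two police officers responding to a reported robbery...", but I'm not sure what the best approach is. Is this something that can be easily parsed by regex? If so, nothing I've tried seems to work.

Any advice would be greatly appreciated!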


2 Answers

Is this something that can be easily parsed by regex?

Don't!

Use an XML parser such as lxml.

ex = """
<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
</body.content>"""

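from lxml import etree

doc = etree.fromstring(ex)
# Select the <p> inside the block whose class is "full_text";
# a bare './block/p' would match the lead_paragraph block first.
print(doc.findtext("./block[@class='full_text']/p"))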

Output:

LEAD: Two police officers responding to a reported robbery at a 
Brooklyn tavern early yesterday were themselves held up by the robbers, who
took their revolvers and herded them into a back room with patrons, the 
police said.
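
If lxml is not installed, the standard library's xml.etree.ElementTree can do the same job. The following is only a minimal sketch, not the answerer's code: extract_full_text and the shortened SAMPLE document are hypothetical names made up for illustration. It picks the block by its class attribute and joins every <p> inside it, on the assumption that a full_text block may hold more than one paragraph.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up sample in the same shape as the question's excerpt.
SAMPLE = """<body.content>
  <block class="lead_paragraph">
    <p>LEAD: first paragraph.</p>
  </block>
  <block class="full_text">
    <p>LEAD: first paragraph.</p>
    <p>Second paragraph.</p>
  </block>
</body.content>"""


def extract_full_text(xml_string, block_class="full_text"):
    """Return the text of every <p> in blocks of the given class, one per line."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for block in root.iter("block"):
        # Skip blocks such as lead_paragraph that carry a different class.
        if block.get("class") != block_class:
            continue
        for p in block.iter("p"):
            # itertext() also collects text nested in inline child tags.
            paragraphs.append("".join(p.itertext()).strip())
    return "\n".join(paragraphs)


print(extract_full_text(SAMPLE))
```

Passing "lead_paragraph" as the second argument would return the lead block instead, so the same helper covers both classes shown in the question.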

You can also use the BeautifulSoup parser.

>>> from bs4 import BeautifulSoup
>>> s = '''<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>'''
>>> soup = BeautifulSoup(s, "html.parser")  # name the parser explicitly to avoid a warning
>>> for i in soup.find_all('block', class_="full_text"):
...     print(i.text)



LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
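
Since the question is about doing this for each .xml article file, either approach can be wrapped in a loop over a directory. Here is a rough sketch using only the standard library. The names full_text_from_file and full_text_by_file and the corpus_dir argument are hypothetical, and the sketch assumes the full_text block may sit anywhere below the root element of each file.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def full_text_from_file(path):
    """Return the full_text paragraphs of one article file, or None if absent."""
    root = ET.parse(path).getroot()
    # ElementTree's limited XPath supports [@attribute='value'] predicates.
    block = root.find(".//block[@class='full_text']")
    if block is None:
        return None
    return "\n".join("".join(p.itertext()).strip() for p in block.iter("p"))


def full_text_by_file(corpus_dir):
    """Map each .xml file name under corpus_dir (recursively) to its full text."""
    texts = {}
    for path in sorted(Path(corpus_dir).rglob("*.xml")):
        texts[path.name] = full_text_from_file(path)
    return texts
```

Files without a full_text block map to None here rather than raising, so a long run over the corpus is not interrupted by one unusual article. Note that keying on the file name alone assumes names are unique across subdirectories.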
