I'm working with the NYT corpus in Python and trying to extract only the content inside the "full_text" class of each .xml article file. For example:
<body.content>
<block class="lead_paragraph">
<p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
</block>
<block class="full_text">
<p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
</block>
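Ideally, I'd like to parse out just the string, yielding "LEAD: Two police officers responding to a reported robbery...", but I'm not sure what the best approach is. Is this something that a regex can parse easily? If so, nothing I've tried seems to work.
Any advice would be greatly appreciated!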
Don't!
Use an XML parser such as lxml.
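A minimal sketch of the lxml approach, assuming the article is well-formed XML. The sample is abbreviated from the question, with a closing `</body.content>` tag added so the fragment parses; the name `full_text_paragraphs` is made up for illustration.

```python
from lxml import etree

# Abbreviated sample based on the question. The closing </body.content>
# tag is an addition so that the fragment is well-formed.
SAMPLE = """<body.content>
<block class="lead_paragraph">
<p>LEAD: Two police officers responding to a reported robbery.</p>
</block>
<block class="full_text">
<p>LEAD: Two police officers responding to a reported robbery.</p>
</block>
</body.content>"""


def full_text_paragraphs(xml_source):
    """Return the text of every <p> inside <block class="full_text">."""
    root = etree.fromstring(xml_source)
    # XPath: any <block> whose class attribute is exactly "full_text",
    # then its direct <p> children.
    paragraphs = root.xpath('//block[@class="full_text"]/p')
    # itertext() also collects text nested inside inline child elements.
    return ["".join(p.itertext()).strip() for p in paragraphs]


if __name__ == "__main__":
    for text in full_text_paragraphs(SAMPLE):
        print(text)
```

For a file on disk, `etree.parse(path).getroot()` can replace the `etree.fromstring` call. Real corpus files typically start with an XML declaration, so read them as bytes or let `etree.parse` open them rather than passing a decoded `str`.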
You can also use the BeautifulSoup parser.
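A rough sketch of the BeautifulSoup route, assuming the `bs4` package is installed. It uses Python's built-in `html.parser` backend, which is lenient and lowercases tag names (harmless for this markup); the function name and the abbreviated sample are made up for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated sample based on the question.
SAMPLE = """<body.content>
<block class="lead_paragraph">
<p>LEAD: Two police officers responding to a reported robbery.</p>
</block>
<block class="full_text">
<p>LEAD: Two police officers responding to a reported robbery.</p>
</block>
</body.content>"""


def full_text_with_bs4(markup):
    """Return the text of each <p> found under a full_text block."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    # class_ is BeautifulSoup's keyword for filtering on the class attribute.
    for block in soup.find_all("block", class_="full_text"):
        for p in block.find_all("p"):
            texts.append(p.get_text().strip())
    return texts


if __name__ == "__main__":
    for text in full_text_with_bs4(SAMPLE):
        print(text)
```

If lxml is available, passing `"xml"` instead of `"html.parser"` makes BeautifulSoup parse the input as XML and keeps tag names case-sensitive.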