Trying to find all text between multiple span tags using BeautifulSoup

Posted 2024-09-27 22:23:32

I am trying to extract a passage of text from an article (http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK). Below is the specific part of the markup I want to get:

<span id="midArticle_start"></span>

<span id="midArticle_0"></span>
<span class="focusParagraph"><p><span class="articleLocation">YANGON</span>
  Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law.</p></span>
<span id="midArticle_1"></span><p>Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly.</p>
<span id="midArticle_2"></span><p>President Barack Obama's sanctions policy on Myanmar, updated on Tuesday, aims to strike a balance between targeting individuals without undermining development or deterring U.S. businesses eying the country as it opens up to global trade.</p>
<span id="midArticle_3"></span><p>Underlining how tricky that balance is, Law may actually gain commercially from the latest changes, even if they do make it harder for him to portray himself as an internationally accepted businessman close to the new democratic government.</p>
<span id="midArticle_4"></span><p>"Though (sanctions) are not meant to have a blanket effect on the country, their intended targets often play outsize roles ... controlling critical infrastructure impacting trade and business for ordinary citizens," said Nyantha Maw Lin, managing director at consultancy Vriens & Partners in Yangon.</p>
<span id="midArticle_5"></span><p>On Tuesday, Washington eased some restrictions on Myanmar but also strengthened measures against Law by adding six firms connected to him and his conglomerate, Asia World, to the Treasury blacklist.</p>
<span id="midArticle_6"></span><p>Yet the blacklisting, which attracted considerable attention in Myanmar, looks like a formality given that the companies were already covered by sanctions, because they were owned 50 percent or more by Law or Asia World. Law was sanctioned in 2008 for alleged ties to Myanmar's military.</p>
<span id="midArticle_7"></span><p>More important for Law was the U.S. decision to further ease restrictions on trading through his shipping port and airports, extending a temporary six month allowance set in December to an indefinite one.</p>
<span id="midArticle_8"></span><p></p>
<span id="midArticle_9"></span><p>PORTS BACK IN FAVOR</p>
<span id="midArticle_10"></span><p>Law is one of the most powerful and well-connected businessmen in Myanmar with close ties to China.</p>
<span id="midArticle_11"></span><p>He is not, however, universally popular at home or abroad because of alleged ties to the military, which ruled Myanmar with an iron fist until 2011.</p>
<span id="midArticle_12"></span>

The end goal is to have each sentence as a separate object that I can use later, for example:

print(sentence1)

~Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law.

print(sentence2)

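~Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly.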

My code retrieves only the first sentence and nothing usable beyond it, as shown here:

import requests
from bs4 import BeautifulSoup
z = requests.get("http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK/")
url2 = 'http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK'
response2 = requests.get(url2)

soup2 = BeautifulSoup(response2.content, "html.parser")
first_sentence = soup2.p.get_text()
print(first_sentence)
second_sentence = soup2.p.find_all_next()
print(second_sentence)

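I would appreciate it if someone could help me figure out how to separate out all the sentences. I have already tried the approaches discussed in other Stack Overflow questions: "Finding next occuring tag and its enclosed text with Beautiful Soup" and "Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)".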


3 answers

You can use the CSS selector #articleText p to return all <p> elements inside the <span> whose id is "articleText":

>>> import requests
>>> from bs4 import BeautifulSoup
>>> url2 = 'http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK'
>>> response2 = requests.get(url2)
>>> soup2 = BeautifulSoup(response2.content, "html.parser")
>>> for sentence in soup2.select("#articleText p"):
...     print(sentence.get_text())
...     print()
... 
YANGON Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law.

Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly.

President Barack Obama's sanctions policy on Myanmar, updated on Tuesday, aims to strike a balance between targeting individuals without undermining development or deterring U.S. businesses eying the country as it opens up to global trade.

Underlining how tricky that balance is, Law may actually gain commercially from the latest changes, even if they do make it harder for him to portray himself as an internationally accepted businessman close to the new democratic government.

......
......
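
The same selector can be tried offline. The following is a minimal sketch, assuming a trimmed-down, hypothetical copy of the markup (SAMPLE_HTML and article_paragraphs are illustrative names, not part of the page or of BeautifulSoup). It also normalises whitespace and skips empty <p></p> elements such as the one after midArticle_8 in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the article markup in the question.
SAMPLE_HTML = """
<span id="articleText">
<span id="midArticle_start"></span>
<span id="midArticle_0"></span>
<span class="focusParagraph"><p><span class="articleLocation">YANGON</span>
  Standing among the party.</p></span>
<span id="midArticle_1"></span><p>Only the day before.</p>
<span id="midArticle_2"></span><p></p>
<span id="midArticle_3"></span><p>PORTS BACK IN FAVOR</p>
</span>
<p>Footer text outside the article.</p>
"""


def article_paragraphs(html):
    """Return the non-empty paragraph texts found inside #articleText."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for p in soup.select("#articleText p"):
        # Collapse newlines and runs of spaces into single spaces.
        text = " ".join(p.get_text().split())
        if text:  # skip empty <p></p> elements
            paragraphs.append(text)
    return paragraphs


paragraphs = article_paragraphs(SAMPLE_HTML)
for paragraph in paragraphs:
    print(paragraph)
```

Because the selector is scoped to #articleText, the footer paragraph outside that <span> is not picked up.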

You can try: soup2.p.find_all_next(text=True)

Like this:

second_sentence = soup2.p.find_all_next(text=True)
for item in second_sentence:
    print(item.split('\n'))
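
A small offline sketch of this approach, assuming a hypothetical three-paragraph sample (SAMPLE_HTML is made up for illustration). It uses string=True, the newer spelling of text=True available in Beautiful Soup 4.4 and later, and drops whitespace-only nodes so that only real text remains.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real page contains far more markup.
SAMPLE_HTML = """
<p>First paragraph.</p>
<span id="midArticle_1"></span><p>Second paragraph.</p>
<span id="midArticle_2"></span><p>Third paragraph.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_p = soup.p

# find_all_next walks everything after the opening <p> tag in document
# order, so the first paragraph's own text is included as well.
following_text = [
    node.strip()
    for node in first_p.find_all_next(string=True)
    if node.strip()  # ignore the "\n" nodes between tags
]
print(following_text)
```

On a full page this returns every text node after the first <p>, including navigation and footer text, so some further filtering would likely be needed.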

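Your problem is probably that find_all_next() returns every match that appears after the starting element (the <p> matched earlier), and because you did not specify a tag to match, it matches everything.

If you change it to soup2.p.find_all_next("p"), you get all the remaining <p> tags on the page, which you can then iterate over with something like the following (or assign explicitly, if you prefer):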

soup2 = BeautifulSoup(response2.content, "html.parser")
first_sentence = soup2.p.get_text()
print(first_sentence)
for sentence in soup2.p.find_all_next("p"):
    print(sentence.get_text())

It is even simpler if you drop the extra variable and just use findAll():

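soup2 = BeautifulSoup(response2.content, "html.parser")
for sentence in soup2.findAll("p"):
    print(sentence.get_text())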
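
To check offline that the two approaches agree, here is a minimal sketch on a hypothetical sample (SAMPLE_HTML and the variable names are illustrative). It compares "first <p> plus find_all_next" with a single find_all("p") call, then keeps the non-empty paragraphs in a list so that each one can be used as a separate object, as the question asks.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the span/p layout from the question.
SAMPLE_HTML = """
<span id="midArticle_0"></span><p>First paragraph.</p>
<span id="midArticle_1"></span><p>Second paragraph.</p>
<span id="midArticle_2"></span><p></p>
<span id="midArticle_3"></span><p>Fourth paragraph.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Approach 1: the first <p>, then every <p> that follows it.
stepwise = [soup.p.get_text()] + [p.get_text() for p in soup.p.find_all_next("p")]

# Approach 2: one find_all call (findAll is the older alias of find_all).
direct = [p.get_text() for p in soup.find_all("p")]

# Keep only non-empty paragraphs so each item holds real text.
sentences = [text.strip() for text in direct if text.strip()]
sentence1, sentence2 = sentences[0], sentences[1]

print(stepwise == direct)
print(sentence1)
print(sentence2)
```

Indexing into the list (sentences[0], sentences[1], and so on) avoids creating one variable per paragraph when the number of paragraphs is not known in advance.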
