How can I get all headings from a website using BeautifulSoup?

Posted 2024-10-03 11:21:37


I'm trying to get all the headings from a simple website. My attempt:

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "http://nypost.com/business"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
soup.find_all('h')

soup.find_all('h') returns [], but if I do something like soup.h1 or soup.h2, it returns the corresponding data. Am I just calling the method incorrectly?


3 answers

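You need to pass an actual tag name, for example soup.find_all('h1'). find_all('h') looks for a tag literally named h, and HTML has no such tag, so the result is empty.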

You can loop over the tag names you want:

for name in ["h1", "h2"]:
    print(soup.find_all(name))  # one result list per tag name
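
If a single result list in document order is preferable to one list per tag name, find_all also accepts a list of names. A minimal sketch, assuming a small made-up HTML string in place of the live page and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<h1>Business</h1>
<p>Intro paragraph</p>
<h2>Markets</h2>
<h3>Retail chain files for bankruptcy</h3>
<header>Not a heading tag</header>
"""

soup = BeautifulSoup(html, "html.parser")

# A list matches any tag whose name equals one of the items.
heading_names = ["h1", "h2", "h3", "h4", "h5", "h6"]
headings = soup.find_all(heading_names)

for tag in headings:
    print(tag.name, tag.get_text(strip=True))
```

Because names are compared exactly, the header element in the sample is not picked up.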

Filter with a regular expression:

import re
soup.find_all(re.compile(r'^h[1-6]$'))

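This regular expression matches every tag whose name starts with h, is followed by exactly one digit from 1 to 6, and ends right after that digit.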
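
To see what the pattern does and does not match, here is a small sketch run against a made-up HTML string (the markup is an assumption for illustration, not taken from the site) with the built-in html.parser:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: real headings plus look-alike tag names.
html = "<h1>Top</h1><h2>Sub</h2><header>Banner</header><hr><h7>Bogus</h7>"
soup = BeautifulSoup(html, "html.parser")

# The anchors ^ and $ force the whole tag name to be h plus one digit 1-6.
heading_pattern = re.compile(r"^h[1-6]$")
matched = [tag.name for tag in soup.find_all(heading_pattern)]
print(matched)
```

Without the anchors, names such as header or hr would still be rejected by the digit class, but a name like h10 would slip through, so keeping both anchors is the safer choice.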

If you don't want to use a regex, you could do something like this:

from bs4 import BeautifulSoup
import requests

url = "http://nypost.com/business"

page = BeautifulSoup(requests.get(url).text, "lxml")
for headlines in page.find_all("h3"):
    print(headlines.text.strip())

Result:

The epitome of chic fashion is the latest victim of retail's collapse
Rent-a-Center shares soar after rejecting takeover bid
NFL ad revenue may go limp with loss of erectile-dysfunction ads
'Pharma Bro' talked about sex with men to get my money, investor says

(and so on)
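
Another regex-free option, not mentioned in the answers above, is a CSS selector list passed to select. A minimal sketch, assuming made-up markup in place of the real page and the built-in html.parser rather than lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page's story list.
html = """
<div class="story"><h3> First headline </h3></div>
<div class="story"><h3>Second headline</h3></div>
<h2>Section</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# "h1, h2, h3" selects any of the three heading levels, in document order.
titles = [tag.get_text(strip=True) for tag in soup.select("h1, h2, h3")]
print(titles)
```

Extend the selector with h4, h5 and h6 if the page uses deeper heading levels.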
