How do I get all headings from a website using BeautifulSoup?

User

#1 · Edited 2024-10-03 11:21:37

You need to call soup.find_all('h1').

To cover more than one heading level, you can loop over the tag names:

headings = []
for tag_name in ["h1", "h2"]:
    # Keep the matches instead of discarding them.
    headings.extend(soup.find_all(tag_name))
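
The loop above makes one pass per tag name, so matches come back grouped by level rather than in page order. As a minimal sketch, find_all also accepts a list of tag names and returns the matches in document order. The sample HTML string here is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
html = """
<html><body>
  <h1>Site title</h1>
  <h2>First section</h2>
  <p>Some text</p>
  <h3>Subsection</h3>
  <h2>Second section</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Passing a list matches any of the named tags, in document order.
heading_tags = ["h1", "h2", "h3", "h4", "h5", "h6"]
headings = soup.find_all(heading_tags)

for h in headings:
    print(h.name, h.get_text(strip=True))
```

The built-in "html.parser" is used so the sketch has no dependency beyond bs4 itself.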

User

#2 · Edited 2024-10-03 11:21:37

Filter with a regular expression:

import re
soup.find_all(re.compile('^h[1-6]$'))

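This regular expression matches every tag whose name starts with h, is followed by exactly one digit from 1 to 6, and ends right after that digit.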
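
To see why the anchors and the digit range matter, here is a small sketch run against a made-up HTML fragment. It contains tags whose names begin with h but are not headings (header, hr) and a non-standard h7, none of which should match.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with look-alike tag names, for illustration only.
html = "<header><h1>Title</h1></header><hr><h2>Part</h2><h7>Not a heading</h7>"
soup = BeautifulSoup(html, "html.parser")

# ^ and $ pin the pattern to the whole tag name; [1-6] limits the digit.
heading_re = re.compile(r"^h[1-6]$")
matched = [tag.name for tag in soup.find_all(heading_re)]

print(matched)
```

Without the ^ and $ anchors, a pattern such as h[1-6] would also match any tag name that merely contains that sequence.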

User

#3 · Edited 2024-10-03 11:21:37

If you would rather not use a regex, you can do something like this:

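from bs4 import BeautifulSoup
import requests

url = "http://nypost.com/business"

# Download the page and parse it with the lxml parser.
page = BeautifulSoup(requests.get(url).text, "lxml")
# Print the text of every <h3> element on the page.
for headline in page.find_all("h3"):
    print(headline.text.strip())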

Result:

The epitome of chic fashion is the latest victim of retail's collapse
Rent-a-Center shares soar after rejecting takeover bid
NFL ad revenue may go limp with loss of erectile-dysfunction ads
'Pharma Bro' talked about sex with men to get my money, investor says

... and so on.
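
The snippet above collects only h3 elements, which is presumably where that page put its headlines. As a regex-free sketch for pages that use several heading levels, a CSS selector group passed to select can gather all six at once. The extract_headings helper and the sample string are hypothetical names introduced here for illustration.

```python
from bs4 import BeautifulSoup


def extract_headings(html):
    """Return (level, text) pairs for every h1-h6 element, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    found = soup.select("h1, h2, h3, h4, h5, h6")
    # tag.name is e.g. "h3"; the second character is the heading level.
    return [(int(tag.name[1]), tag.get_text(strip=True)) for tag in found]


# Hypothetical sample markup, used only for illustration.
sample = "<h1>News</h1><h3> Markets rally </h3><p>body</p><h2>Sports</h2>"

# Indent each heading by its level to print a rough outline.
for level, text in extract_headings(sample):
    print("  " * (level - 1) + text)
```

For a live page, the same helper could be called as extract_headings(requests.get(url).text), with url defined as in the answer above.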
