How can I get the visible text on a web page using web scraping?

Posted 2024-09-29 08:19:32

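Here is the link to the page I want to scrape: https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html

I also applied an additional filter by clicking the circled heading 1

This is how the page looks after clicking heading 2

I want to get the names of all the places shown on the page, but I seem to be running into a problem because the URL does not change when the filter is applied. I am using Python's urllib for this. Here is my code: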

from urllib.request import urlopen

url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)

1 answer

User

Answer #1 · posted 2024-09-29 08:19:32

You can use bs4 (Beautiful Soup 4), a Python module for pulling specific content out of a web page. This gets the text from the page:

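from bs4 import BeautifulSoup as bs

# html is the decoded page source from the question's code
soup = bs(html, features='html5lib')  # needs the html5lib package; 'html.parser' is built in
text = soup.get_text()  # all text in the document, concatenated
print(text)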
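
Depending on the installed bs4 version, `get_text()` may also return the contents of `<script>` and `<style>` elements, which a browser never displays. Below is a minimal sketch of one way to keep only human-visible text, assuming it is enough to drop script, style, noscript and template elements. The `visible_text` helper and the sample markup are made up for illustration and are not part of bs4.

```python
from bs4 import BeautifulSoup


def visible_text(html: str) -> str:
    """Return the text of an HTML document without script/style content."""
    # "html.parser" ships with Python, so no extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents a browser does not render as text.
    for tag in soup(["script", "style", "noscript", "template"]):
        tag.decompose()
    # Put each text fragment on its own line and drop blank ones.
    lines = (s.strip() for s in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample page (an assumption, not real site markup).
sample = """<html><head><style>p {color: red}</style></head>
<body><h1>Places</h1><script>var x = 1;</script><p>First <b>place</b></p></body></html>"""

print(visible_text(sample))
```

In the question's code, the same helper could be called as `visible_text(html)` on the decoded page source.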

If you want specific elements rather than all of the text, bs4 can do that too:

soup.find_all('p')  # all <p> tags
soup.find_all('p', class_='Title')  # all <p> tags with the class "Title"

Find the tag and class that hold the place names, then use the method above to get all of them.
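
That last step can be sketched on a small, made-up HTML fragment. The tag name `a`, the class `place-name`, and the `extract_names` helper are hypothetical placeholders, not TripAdvisor's real markup. The actual tag and class have to be looked up by inspecting the page in the browser's developer tools.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a listing page; the real tags and classes will differ.
sample = """
<div class="listing"><a class="place-name" href="/r1">Cafe One</a><span>4.5</span></div>
<div class="listing"><a class="place-name" href="/r2">Cafe Two</a><span>4.0</span></div>
<div class="ad"><a href="/promo">Sponsored link</a></div>
"""


def extract_names(html: str, tag: str = "a", css_class: str = "place-name") -> list[str]:
    """Collect the text of every `tag` element that carries `css_class`."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ (with a trailing underscore) filters on the CSS class attribute.
    return [el.get_text(strip=True) for el in soup.find_all(tag, class_=css_class)]


print(extract_names(sample))
```

If the names are missing from the HTML that `urlopen` returns, the page is probably filling them in with JavaScript after it loads, in which case parsing the static HTML will not find them.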

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
