Trying to extract a movie genre using Beautiful Soup

Posted 2024-09-30 22:20:55


I am learning web scraping and trying to extract a movie's genre from Google search results. My code is below. This is the element that contains the part I want to extract:

<div class="wwUB2c PZPZlf" data-attrid="subtitle"><span data-ved="2ahUKEwizlJiu9OLoAhXFgeYKHXzvAlMQ2kooAjAlegQIJhAN">1999 ‧ Romance/Comedy ‧ 2h 4m</span></div>

I want to extract the "Romance/Comedy" part.

import requests
from bs4 import BeautifulSoup as bs

url = requests.get("https://www.google.com/search?biw=1920&bih=1008&ei=uwqTXuyUB-Ov8QPIvbKACQ&q=notting+hill+&oq=notting+hill+&gs_lcp=CgZwc3ktYWIQAzIECCMQJzIHCAAQgwEQQzIECAAQQzIECAAQQzIECAAQQzIECAAQQzIECAAQQzIECAAQQzIFCAAQgwEyBAgAEEM6BAgAEEdKDQgXEgkxMC0xOThnMThKCggYEgYxMC0xZzNQwN0XWMDdF2CW3xdoAHADeACAAasBiAGrAZIBAzAuMZgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwis3vbz8uLoAhXjV3wKHcieDJAQ4dUDCAw&uact=5")


soup = bs(url.text, "lxml")

soup.select(".subtitle")  # returns an empty list
soup.find("div", {"class": "wwUB2c PZPZlf"})  # also finds nothing
soup.find("span", {"data-ved": "2ahUKEwizlJiu9OLoAhXFgeYKHXzvAlMQ2kooAjAlegQIJhAN"})  # also finds nothing



3 answers

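First, you need to make the request.

Wrong:

soup = bs(url.text, "lxml")

Right:

soup = bs(requests.get(url).text, "lxml")

(This form applies when url holds the URL string. In the question's code, url is already the response returned by requests.get, so url.text works there as written.)

Second, this data (Romance/Comedy) is loaded via AJAX, so it is not in the response to the plain Google search request.

I will update this answer if I find a way to help.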
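One way to check whether the element is in a given response at all is to test for it on a saved copy of the HTML. Below is a minimal offline sketch, not a confirmed fix. The helper name find_subtitle and both sample strings are made up for illustration, and the first sample is trimmed from the snippet in the question. It also shows why .subtitle finds nothing: that selector matches class="subtitle", whereas "subtitle" here is the value of data-attrid.

```python
from bs4 import BeautifulSoup

# Sample markup trimmed from the question; a live response may not contain it.
WITH_SUBTITLE = (
    '<div class="wwUB2c PZPZlf" data-attrid="subtitle">'
    '<span>1999 ‧ Romance/Comedy ‧ 2h 4m</span></div>'
)
# Hypothetical page without the info box, standing in for a bare response.
WITHOUT_SUBTITLE = '<div class="other">no info box here</div>'


def find_subtitle(html):
    """Return the subtitle text, or None when the element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # [data-attrid="subtitle"] matches the attribute value;
    # ".subtitle" would look for class="subtitle" instead.
    node = soup.select_one('div[data-attrid="subtitle"]')
    return node.get_text(strip=True) if node else None


print(find_subtitle(WITH_SUBTITLE))
print(find_subtitle(WITHOUT_SUBTITLE))
```

If the helper returns None for the text of a real response, the element was not delivered with that request, and no selector will find it.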

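Many of the class names look dynamically generated. Instead, you could rely on the relationship between elements in the DOM and narrow it down with :contains, on the assumption that "‧" appears in the text of interest, then take only the first matching node.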

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.google.com/search?biw=1920&bih=1008&ei=uwqTXuyUB-Ov8QPIvbKACQ&q=notting+hill+&oq=notting+hill+&gs_lcp=CgZwc3ktYWIQAzIECCMQJzIHCAAQgwEQQzIECAAQQzIECAAQQzIECAAQQzIECAAQQzIECAAQQzIECAAQQzIFCAAQgwEyBAgAEEM6BAgAEEdKDQgXEgkxMC0xOThnMThKCggYEgYxMC0xZzNQwN0XWMDdF2CW3xdoAHADeACAAasBiAGrAZIBAzAuMZgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwis3vbz8uLoAhXjV3wKHcieDJAQ4dUDCAw&uact=5')
soup = bs(r.content, 'lxml')
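# First div with a class attribute, nested inside a span, whose text contains "‧".
# :-soup-contains needs soupsieve 2.1+; older versions spell it :contains.
node = soup.select_one('span div[class]:-soup-contains("‧")')
# The text looks like "1999 ‧ Romance/Comedy ‧ 2h 4m"; the genre is the second field.
print(node.text.split('‧')[1].strip() if node else None)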
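To try the text-based selector without hitting Google, here is a small sketch run against a static snippet. The markup is an assumption modeled on the element in the question (span inside div), plus one made-up decoy element. The real response may nest things differently, which is why the selector above uses span div[class] instead.

```python
from bs4 import BeautifulSoup

# Static sample; the second div is a hypothetical decoy without the separator.
html = '''
<div class="wwUB2c PZPZlf" data-attrid="subtitle">
  <span>1999 ‧ Romance/Comedy ‧ 2h 4m</span>
</div>
<div class="x"><span>Notting Hill</span></div>
'''

soup = BeautifulSoup(html, "html.parser")
# First span inside a div whose text contains the "‧" separator.
node = soup.select_one('div span:-soup-contains("‧")')
# Split on the separator and keep the middle field, if anything matched.
genre = node.get_text().split("‧")[1].strip() if node else None
print(genre)
```

Matching on text rather than on class names should keep working when the generated class names change, as long as the separator character stays the same.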

Try this:

from simplified_scrapy import SimplifiedDoc,req,utils
url = "https://www.google.com/search?biw=1920&bih=1008&ei=uwqTXuyUB-Ov8QPIvbKACQ&q=notting+hill+&oq=notting+hill+&gs_lcp=CgZwc3ktYWIQAzIECCMQJzIHCAAQgwEQQzIECAAQQzIECAAQQzIECAAQQzIECAAQQzIECAAQQzIECAAQQzIFCAAQgwEyBAgAEEM6BAgAEEdKDQgXEgkxMC0xOThnMThKCggYEgYxMC0xZzNQwN0XWMDdF2CW3xdoAHADeACAAasBiAGrAZIBAzAuMZgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwis3vbz8uLoAhXjV3wKHcieDJAQ4dUDCAw&uact=5"
html = req.get(url)
#html = '''
#<div class="wwUB2c PZPZlf" data-attrid="subtitle">
#   <span data-ved="2ahUKEwizlJiu9OLoAhXFgeYKHXzvAlMQ2kooAjAlegQIJhAN">1999 ‧ Romance/Comedy ‧ 2h 4m</span>
#</div>
#'''
doc = SimplifiedDoc(html)
# Look the element up by its data-attrid value rather than by class
text = doc.getElement("div",attr="data-attrid",value="subtitle").text
print(text)

Result:

1999 ‧ Romance/Comedy ‧ 2h 4m
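The output above is still the whole subtitle line, so the genre has to be cut out of it. Here is a rough sketch of that last step in plain Python. The function name parse_subtitle and the three-field "year ‧ genres ‧ runtime" layout are assumptions based only on this one example; other titles may have a different layout.

```python
def parse_subtitle(text, sep="‧"):
    """Split a 'year ‧ genres ‧ runtime' line into its fields.

    The three-field layout is an assumption taken from the single
    example in this thread.
    """
    parts = [p.strip() for p in text.split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {text!r}")
    year, genres, runtime = parts
    return {
        "year": int(year),
        "genres": genres.split("/"),  # "Romance/Comedy" -> two entries
        "runtime": runtime,
    }


info = parse_subtitle("1999 ‧ Romance/Comedy ‧ 2h 4m")
print(info["genres"])
```

Raising on an unexpected field count makes a layout change show up as an error instead of silently returning the wrong field.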
