How do I identify the right bit of HTML so I can scrape episode data with Python

Posted 2024-05-20 20:21:59


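I'm trying to improve my Python by working with the BeautifulSoup and requests modules. I've gone through a few tutorials and successfully scraped data from various places, but I can't get this one to work. I know IMDb offers a ready-made product for accessing its data, but I like using the site to practice Python.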

I'm trying to scrape the title of each episode on this page, but my code just gives me an empty list:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.imdb.com/title/tt0094525/episodes?season=5&ref_=tt_eps_sn_5'

headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

page = requests.get(URL, headers=headers)
pageTree = requests.get(URL, headers=headers)
soup = BeautifulSoup(pageTree.content, 'html.parser')

print(soup) #testing its working
print(soup.title.string)

episodes_list = []

episodes = soup.find_all("a", class_="title")

for episode in episodes:
    episodeName = episodes.find("a").get_text()
    episodes_list.append(episodeName)
print(episodes_list)

I know the problem is with the episodes variable, but trial and error hasn't given me the answer.


2 Answers

You can try something like this. It selects only the episode titles for the season and puts them in the episodes list:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.imdb.com/title/tt0094525/episodes?season=5&ref_=tt_eps_sn_5'

headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

# Fetch the page once and parse the response body
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

episodes_list = []

# Each episode's title link sits inside a <div class="info">
episodes = soup.find_all("div", {"class": "info"})

# Collect the text of the first link in each container
for episode in episodes:
    episodes_list.append(episode.a.text)

print(episodes_list)

The output looks like this:

['The Adventure of the Egyptian Tomb', 'The Underdog', 'The Yellow Iris', 'The Case of the Missing Will', 'The Adventure of the Italian Nobleman', 'The Chocolate Box', "Dead Man's Mirror", 'Jewel Robbery at the Grand Metropolitan']
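To see why the container-based lookup works where the class-based one does not, here is a minimal offline sketch. The HTML fragment is a hypothetical reconstruction modeled on the `a` element quoted in this thread; the `list_item` and `image` class names, the `itemprop="name"` value, and the `tt0000000` ID are placeholders, and the real IMDb markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment (not copied from IMDb); used only to illustrate the lookup.
SAMPLE_HTML = """
<div class="list_item">
  <div class="image">
    <a href="/title/tt0676164/" title="The Adventure of the Egyptian Tomb" itemprop="url"><img alt=""></a>
  </div>
  <div class="info">
    <strong><a href="/title/tt0676164/" title="The Adventure of the Egyptian Tomb" itemprop="name">The Adventure of the Egyptian Tomb</a></strong>
  </div>
</div>
<div class="list_item">
  <div class="info">
    <strong><a href="/title/tt0000000/" title="The Underdog" itemprop="name">The Underdog</a></strong>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ filters on the class attribute, which these links do not have,
# so this lookup matches nothing.
by_class = soup.find_all("a", class_="title")

# Selecting the container first, then its first link, finds the titles.
by_container = [
    div.a.get_text(strip=True)
    for div in soup.find_all("div", class_="info")
    if div.a is not None  # skip any container that has no link
]

print(by_class)
print(by_container)
```

The `div.a is not None` guard is an addition for safety: `episode.a.text` raises an `AttributeError` if any matching `div` happens to contain no link.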

You are searching for elements with class=title, but if you look at the HTML, the a elements you want have no class attribute. For example:

<a href="/title/tt0676164/"
title="The Adventure of the Egyptian Tomb" itemprop="url">...</a>

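There is a title attribute, but no class attribute. Reading through the BeautifulSoup documentation, it looks like you can use a regular expression as an attribute filter, so we can probably do something like this: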

import re

episodes = soup.find_all("a", title=re.compile('.'))

This finds every a element with a non-empty title attribute, which seems to be what you want:

>>> episodes = soup.find_all("a", title=re.compile('.'))
>>> [x.get('title') for x in episodes]
['The Adventure of the Egyptian Tomb', 'The Adventure of the Egyptian Tomb', 
'The Underdog', 'The Underdog', 'The Yellow Iris', 'The Yellow Iris', 
'The Case of the Missing Will', 'The Case of the Missing Will', 
'The Adventure of the Italian Nobleman', 'The Adventure of the Italian Nobleman', 
'The Chocolate Box', 'The Chocolate Box', "Dead Man's Mirror", 
"Dead Man's Mirror", 'Jewel Robbery at the Grand Metropolitan', 
'Jewel Robbery at the Grand Metropolitan', 'Share on Facebook', 
'Share on Twitter', 'Share the page', 'Facebook', 'Instagram', 'Twitch',
'Twitter', 'YouTube']
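The result above lists each episode twice and also picks up unrelated links such as the share buttons. One possible way to narrow it down, sketched here on a hypothetical fragment, is to also filter on `href` and then drop duplicates while preserving order. The assumption that episode links start with `/title/` comes from the single `a` element quoted earlier; the other links and the `tt0000000` ID are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: two links for one episode, one for another,
# plus two unrelated links that also carry a title attribute.
SAMPLE_HTML = """
<a href="/title/tt0676164/" title="The Adventure of the Egyptian Tomb" itemprop="url"><img alt=""></a>
<a href="/title/tt0676164/" title="The Adventure of the Egyptian Tomb">The Adventure of the Egyptian Tomb</a>
<a href="/title/tt0000000/" title="The Underdog">The Underdog</a>
<a href="#" title="Share on Facebook">Share</a>
<a href="https://example.com/" title="YouTube">YouTube</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# title=True keeps only links that have a title attribute;
# the href pattern keeps only links pointing at a /title/ page.
links = soup.find_all("a", title=True, href=re.compile(r"^/title/"))

# dict.fromkeys removes duplicates and keeps first-seen order.
titles = list(dict.fromkeys(link["title"] for link in links))

print(titles)
```

On the live page the `href` pattern might still match links that are not episodes, so it is worth checking the result against the page before relying on it.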
