How do I identify the right bit of HTML to scrape episode data with Python?

import requests
from bs4 import BeautifulSoup

URL = 'https://www.imdb.com/title/tt0094525/episodes?season=5&ref_=tt_eps_sn_5'

headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

page = requests.get(URL, headers=headers)
pageTree = requests.get(URL, headers=headers)
soup = BeautifulSoup(pageTree.content, 'html.parser')

print(soup)  # testing its working
print(soup.title.string)

episodes_list = []
episodes = soup.find_all("a", class_="title")
for episode in episodes:
    episodeName = episodes.find("a").get_text()
    episodes_list.append(episodeName)

print(episodes_list)

2 Answers

User

Answer 1 · edited 2024-05-20 20:21:59

You could try something like this. It selects only the episode titles and puts them in the episodes list.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.imdb.com/title/tt0094525/episodes?season=5&ref_=tt_eps_sn_5'

headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

# Fetch the page once; the original requested the same URL twice
pageTree = requests.get(URL, headers=headers)
soup = BeautifulSoup(pageTree.content, 'html.parser')



episodes_list = []

episodes = soup.find_all("div",{"class": "info"})

# Collect the text of the first link inside each "info" block
for episode in episodes:
    episodes_list.append(episode.a.text)


print(episodes_list)

The output looks like this:

['The Adventure of the Egyptian Tomb', 'The Underdog', 'The Yellow Iris', 'The Case of the Missing Will', 'The Adventure of the Italian Nobleman', 'The Chocolate Box', "Dead Man's Mirror", 'Jewel Robbery at the Grand Metropolitan']
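
To check the selector without hitting the network, the same lookup can be run against a small hand-written HTML fragment. This is only a minimal sketch: the markup below is an assumption modeled loosely on the div class="info" blocks this answer targets, not a copy of the real page, and extract_titles is a hypothetical helper name. It also skips any block that contains no link, a case where episode.a would be None and episode.a.text would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page layout may differ.
# tt0000002 is a placeholder id, not a real episode link.
SAMPLE_HTML = """
<div class="info"><strong><a href="/title/tt0676164/">The Adventure of the Egyptian Tomb</a></strong></div>
<div class="info"><strong><a href="/title/tt0000002/">The Underdog</a></strong></div>
<div class="info"><span>No link in this block</span></div>
"""


def extract_titles(html):
    """Return the text of the first link in each <div class="info">."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for block in soup.find_all("div", class_="info"):
        link = block.find("a")
        # Skip blocks without a link instead of failing on None
        if link is not None:
            titles.append(link.get_text(strip=True))
    return titles


print(extract_titles(SAMPLE_HTML))
```

Once the fragment gives the expected list, the same function can be pointed at pageTree.content from the request above.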

User

Answer 2 · edited 2024-05-20 20:21:59

You are searching for elements with class=title, but if you look at the HTML, the a elements you want have no class attribute. For example:

<a href="/title/tt0676164/"
title="The Adventure of the Egyptian Tomb" itemprop="url">...</a>

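There is a title attribute, but no class attribute. Reading through the BeautifulSoup documentation, it looks like attribute filters accept regular expressions, so (with import re at the top of the script) we can probably do the following: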

episodes = soup.find_all("a", title=re.compile('.'))

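This finds every a element with a non-empty title attribute, which is close to what you want, although as the output shows, each episode appears twice and the page's share and social links are matched too: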

>>> episodes = soup.find_all("a", title=re.compile('.'))
>>> [x.get('title') for x in episodes]
['The Adventure of the Egyptian Tomb', 'The Adventure of the Egyptian Tomb', 
'The Underdog', 'The Underdog', 'The Yellow Iris', 'The Yellow Iris', 
'The Case of the Missing Will', 'The Case of the Missing Will', 
'The Adventure of the Italian Nobleman', 'The Adventure of the Italian Nobleman', 
'The Chocolate Box', 'The Chocolate Box', "Dead Man's Mirror", 
"Dead Man's Mirror", 'Jewel Robbery at the Grand Metropolitan', 
'Jewel Robbery at the Grand Metropolitan', 'Share on Facebook', 
'Share on Twitter', 'Share the page', 'Facebook', 'Instagram', 'Twitch',
'Twitter', 'YouTube']
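
The list above contains each episode twice and also picks up the sharing links. One possible way to narrow it down is to additionally filter on the href and then drop duplicates while keeping order. The following is a sketch under assumptions: the sample markup is hand-written (only the first link is taken from the snippet quoted earlier, tt0000002 is a placeholder), the /title/tt.../ href pattern is inferred from that one snippet, and episode_titles is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup imitating the duplicate and social links
SAMPLE_HTML = """
<a href="/title/tt0676164/" title="The Adventure of the Egyptian Tomb" itemprop="url">...</a>
<a href="/title/tt0676164/" title="The Adventure of the Egyptian Tomb" itemprop="name">The Adventure of the Egyptian Tomb</a>
<a href="/title/tt0000002/" title="The Underdog" itemprop="name">The Underdog</a>
<a href="#" title="Share on Facebook">Share</a>
"""


def episode_titles(html):
    """Return unique title attributes of links that point at /title/tt.../."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all(
        "a",
        title=True,                          # the attribute must be present
        href=re.compile(r"^/title/tt\d+/"),  # assumed episode link pattern
    )
    # dict.fromkeys removes duplicates and preserves first-seen order
    return list(dict.fromkeys(link["title"] for link in links))


print(episode_titles(SAMPLE_HTML))
```

Whether the href pattern is strict enough depends on the real page, so it is worth comparing the result against the season's episode list before relying on it.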
