(BeautifulSoup4) AttributeError: 'NoneType' object has no attribute 'get_text'

Posted 2024-10-04 17:20:59


Blog: Analyzing IMDb's Top 250 Movies, Part 1: Let's Scrape Some Data

For details: https://medium.com/analytics-vidhya/analyzing-imdbs-top-250-movies-part-1-let-scrape-some-data-a422adc3eb8d

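The problem is that whenever I try to retrieve the individual page links from IMDb's Top 250 movies, I get an error: AttributeError: 'NoneType' object has no attribute 'get_text'. I understand this means the HTML does not contain the class name or element we are looking for. But the HTML does contain the class name I am passing. I implemented the same functionality as the blog, but I cannot retrieve the individual movies and get their data. The following code is the same as the code in the blog: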

import requests                 # Simpler HTTP requests 
from bs4 import BeautifulSoup   # Python package for pulling data out of HTML and XML files
import pandas as pd             # Python package for data manipulation and analysis
import re                       # regular expressions
import json                     # Python package used to work with JSON data
from tqdm import tqdm           # displays a progress bar
from datetime import datetime

url = 'https://www.imdb.com/chart/top'
url_text = requests.get(url).text
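url_soup = BeautifulSoup(url_text, 'html.parser')
template = 'https://www.imdb.com%s'
title_links = [template % a.attrs.get('href') for a in url_soup.select('td.titleColumn a')]

# page_soup was not defined in the posted snippet; presumably it is the parsed page of one movie
page_soup = BeautifulSoup(requests.get(title_links[0]).text, 'html.parser')

# This line raises the AttributeError when find() returns None
movie_name = (page_soup.find("div", {"class": "title_wrapper"}).get_text(strip=True).split('|')[0]).split('(')[0]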
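The error described above can be reproduced without touching the network. The following is a minimal sketch, assuming a made-up HTML snippet (`sample_html`, not taken from IMDb) that has no `div` with class `title_wrapper`. It shows that `find` returns `None` when nothing matches, that calling `get_text` on `None` raises the `AttributeError`, and that checking the result first makes the missing element visible.

```python
from bs4 import BeautifulSoup

# Hypothetical page content: note there is no <div class="title_wrapper">.
sample_html = "<html><body><h1>The Godfather</h1></body></html>"
page_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns None when no element matches the tag and class.
wrapper = page_soup.find("div", {"class": "title_wrapper"})
print(wrapper)

# Calling a method on None is what triggers the error from the question.
try:
    wrapper.get_text(strip=True)
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# Guarding against None avoids the crash and signals the missing element.
movie_name = wrapper.get_text(strip=True) if wrapper is not None else None
print(movie_name)
```

If `wrapper` is `None` for a real page, printing part of the downloaded HTML is a quick way to check what the server actually returned.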



2 Answers

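The data you need is easy to find in the td tags whose class name is titleColumn. You can extract the movie name and link from there.

Here I show the first 10 movies. You can modify this code to suit your requirements.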

import requests
import bs4 as bs

url = "https://www.imdb.com/chart/top"
response = requests.get(url)
html = response.text
soup = bs.BeautifulSoup(html, 'lxml')

t = soup.find_all('td', class_='titleColumn')
for i in range(10):
    a_tag = t[i].find('a')
    # href already starts with '/', so the base URL has no trailing slash
    link = 'https://www.imdb.com' + a_tag['href']
    title = a_tag.text

    print(f'Link: {link}\nMovie: {title}\n')

Sample Output:

Link: https://www.imdb.com/title/tt0111161/
Movie: The Shawshank Redemption

Link: https://www.imdb.com/title/tt0068646/
Movie: The Godfather

Link: https://www.imdb.com/title/tt0071562/
Movie: The Godfather: Part II

Link: https://www.imdb.com/title/tt0468569/
Movie: The Dark Knight
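
When building absolute links from `href` values, plain string concatenation makes it easy to end up with a doubled or missing slash. As a minimal sketch, `urllib.parse.urljoin` from the standard library resolves a relative path against a base URL regardless of how the slashes line up. The helper name `absolute_link` and the constant `BASE_URL` are illustrative and not part of the answer above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com/"


def absolute_link(href: str) -> str:
    """Resolve a site-relative href against BASE_URL."""
    return urljoin(BASE_URL, href)


# The href values below are the ones shown in the sample output.
print(absolute_link("/title/tt0111161/"))
print(absolute_link("/title/tt0068646/"))
```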

There is a simpler way to get the title. Every page has a <title> element that contains exactly the information you need:

import requests                 # Simpler HTTP requests
from bs4 import BeautifulSoup   # Python package for pulling data out of HTML and XML files
#import pandas as pd             # Python package for data manipulation and analysis
import re                       # regular expressions
import json                     # Python package used to work with JSON data
#from tqdm import tqdm           # python for displaying progressbar
from datetime import datetime

url = 'https://www.imdb.com/chart/top'
url_text = requests.get(url).text
url_soup = BeautifulSoup(url_text, 'html.parser')
template = 'https://www.imdb.com%s'
title_links = [template % a.attrs.get('href') for a in url_soup.select( 'td.titleColumn a' )]

movie_names = []

for title_link in title_links:
    page_soup = BeautifulSoup(requests.get(title_link).text, 'html.parser')

    movie_names.append(page_soup.title.get_text(strip=True).split(' (')[0])


print(movie_names)
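
The `<title>` approach can still fail if a response has no `<title>` element at all, for example when the server returns an error page. Below is a minimal sketch of a defensive variant, tested on inline HTML rather than live pages. The function name `movie_name_from_html` and the sample title format `The Godfather (1972) - IMDb` are assumptions made for illustration.

```python
from typing import Optional

from bs4 import BeautifulSoup


def movie_name_from_html(html: str) -> Optional[str]:
    """Return the <title> text up to the first ' (', or None if there is no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True).split(" (")[0]


# Hypothetical page with a title in the assumed "Name (Year) - IMDb" form.
sample_page = "<html><head><title>The Godfather (1972) - IMDb</title></head></html>"
print(movie_name_from_html(sample_page))

# Hypothetical response without a <title>; the helper returns None instead of raising.
print(movie_name_from_html("<html><body>Forbidden</body></html>"))
```

Inside the loop above, a `None` result could be skipped or logged together with `title_link` to see which pages did not parse as expected.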
