BeautifulSoup not working after the first page

Posted 2024-06-13 20:03:07


I am trying to use Python's BeautifulSoup to extract data from the following website. The data on the site is split across four pages, each with a unique link (i.e., http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/0 for the first page, http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/1 for the second page, and so on). I can successfully scrape the data on the first page, but when I try to scrape the second page, the result comes back empty. Here is the code I am using:

# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd

# Define URL and request the webpage
# (the set index in the URL is zero-based, so page = 1 requests the second page)
season = 2019
page = 1
url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")

# Scrape all of the data in the table
rows = page_soup.find_all('tr')[1:]
player_stats = [[td.get_text() for td in row.find_all('td')]
                for row in rows]

# Get the column headers
headers = player_stats[0]

# Remove the first row
player_stats.pop(0)

# Convert to pandas dataframe
df = pd.DataFrame(player_stats, columns = headers)

# Remove all rows where Name = None
df = df[~df['NAME'].isnull()]

# Remove PLAYER column because it's empty
df = df.drop(columns='PLAYER')
df 
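For reference, the set index in the URL appears to be zero-based, which is easy to trip over when looping through pages. A small helper (the function name is mine, not anything from ESPN) captures the mapping I'm assuming:

```python
def top100_url(season, set_index):
    """Build the top-100 results URL.

    set_index is zero-based: 0 is the first page, 1 is the second, etc.
    """
    return ("http://insider.espn.com/nbadraft/results/top100"
            "/_/year/{}/set/{}".format(season, set_index))
```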

Any suggestions would be greatly appreciated! I'm fairly new to BeautifulSoup, so apologies in advance if the code isn't particularly good or efficient.
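One way to narrow the problem down, independent of the BeautifulSoup parsing above, is to count how many `<tr>` tags the raw HTML actually contains for each page; if the second page's response has zero, the server simply isn't sending the table. A stdlib-only sketch (the class and function names are mine):

```python
from html.parser import HTMLParser

class RowCounter(HTMLParser):
    """Count <tr> start tags in a chunk of HTML."""
    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows += 1

def count_rows(html_text):
    """Return the number of <tr> tags found in html_text."""
    counter = RowCounter()
    counter.feed(html_text)
    return counter.rows
```

Feeding `webpage.decode()` for each set index to `count_rows` would show whether the empty result happens at the HTTP level or in the parsing.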

Update: the links only work when opened in Chrome, which may be what's causing the problem. Is there any way around this?
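If the server is filtering out non-browser clients, one thing worth trying is sending a fuller set of browser-like headers than the lone User-Agent above. This is only a sketch; the header values are typical Chrome-style examples, not anything confirmed to satisfy ESPN. (If the page builds its table with JavaScript instead, header tweaks won't help and a browser-driving tool such as Selenium would be needed.)

```python
from urllib.request import Request

# Illustrative Chrome-style headers; values are assumptions, not
# requirements confirmed by the site.
BROWSER_HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}

def browser_request(url):
    # Build the Request without sending it, so the headers can be
    # inspected or the object passed to urlopen() later.
    return Request(url, headers=BROWSER_HEADERS)
```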

