Web scraping with Python gives HTTP Error 404: Not Found

Posted 2024-06-28 20:41:11


I'm new to Python and not very good at it yet. I'm trying to scrape a site called Transfermarkt (I'm a football fan), but when I try to extract the data it gives me an HTTP Error 404. Here is my code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

for che in chelsea:
    player = che.tbody.tr.td.table.tbody.tr.td["spielprofil_tooltip tooltipstered"]

print("player: " + player)

This is the error it gives me:

urllib.error.HTTPError: HTTP Error 404: Not Found

Any help would be greatly appreciated, thanks guys x


1 Answer

Forum user
#1 · Posted 2024-06-28 20:41:11

As mentioned above, your user agent has likely been rejected by the server.

Try extending your code with the following:

import urllib.request  # we are going to need to generate a Request object
from bs4 import BeautifulSoup as soup

my_url = "https://www.transfermarkt.com/chelsea-fc/leihspielerhistorie/verein/631/plus/1?saison_id=2018&leihe=ist"

# here we define the headers for the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0'}

# this request object will integrate your URL and the headers defined above
req = urllib.request.Request(url=my_url, headers=headers)

# calling urlopen this way will automatically handle closing the request
with urllib.request.urlopen(req) as response:
    page_html = response.read()

With the code above in place, you can proceed with your parsing. The Python documentation has some useful pages on this topic:

https://docs.python.org/3/library/urllib.request.html#examples

https://docs.python.org/3/library/urllib.request.html

Mozilla's documentation has plenty of user-agent strings you can try:

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
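To continue the parsing step mentioned above, here is a minimal sketch of pulling player names out of the downloaded HTML. It assumes the player links carry the `spielprofil_tooltip` class referenced in the question's code; Transfermarkt's markup may have changed since, so check the live page source first.

```python
from bs4 import BeautifulSoup

def extract_players(page_html):
    """Collect player names from anchors with the spielprofil_tooltip class.

    The class name is taken from the question's code and is an assumption
    about Transfermarkt's markup; adjust it if the site has changed.
    """
    page_soup = BeautifulSoup(page_html, "html.parser")
    # class_ matches any element whose class list contains this value,
    # even if the element also has other classes (e.g. "tooltipstered")
    return [a.get_text(strip=True)
            for a in page_soup.find_all("a", class_="spielprofil_tooltip")]
```

After the `with urllib.request.urlopen(req)` block above, calling `extract_players(page_html)` returns a list of names. This is also more robust than the chained `.tbody.tr.td` attribute access in the question, which breaks whenever the table nesting changes.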
