Web scraping lands on a 403 page

Posted 2024-05-17 06:34:04


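I'm a beginner at web scraping and need to scrape https://mirror-h.org/archive/page/1 with BeautifulSoup. The script fails with an error and ends up on a 403 page. How can I fix this? I'd really appreciate your help.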

Here is my code:

import requests
from bs4 import BeautifulSoup
import pandas

url = "https://mirror-h.org/archive/page/1"
page = pandas.read_html(url)
headers = {
    'user-agent:' 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
    }
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

The error I get is:

 raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

1 Answer
User
#1 · Posted 2024-05-17 06:34:04
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup


# Pass the headers as a dict. Your original code put the colon inside the
# string ('user-agent:'), so Python built a set instead of a name/value mapping.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
}


def main(url):
    # Send the headers with the request.
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # Prints <Response [200]>.
    print(r)

    # To use pandas, parse the text already fetched with the headers above
    # rather than passing the URL to read_html directly.
    # StringIO avoids the deprecation warning that newer pandas versions
    # emit for raw HTML strings.
    # This call raises "ValueError: No tables found": the site is rendered
    # with JavaScript behind Cloudflare protection, so try Selenium instead.
    df = pd.read_html(StringIO(r.text))
    print(df)


main('https://mirror-h.org/archive/page/1')
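
To see why the original script failed, here is a small offline sketch. It makes no network calls. The User-Agent value is shortened for illustration, and the variable names are my own. Two things seem to be going on: the headers literal in the question is actually a set, and the traceback comes from urllib, which suggests that pandas.read_html(url) downloaded the page on its own, without any custom headers, before requests.get was ever reached.

```python
import urllib.request

# The literal from the question: the colon sits inside the first string, so
# Python joins the two adjacent string literals into one element and builds
# a set, not a dict.
broken_headers = {
    'user-agent:' 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

# The corrected form maps a header name to its value.
fixed_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

print(type(broken_headers).__name__)
print(type(fixed_headers).__name__)

# A urllib request only carries a browser-like User-Agent when one is attached
# explicitly. Constructing a Request object does not contact the server.
url = 'https://mirror-h.org/archive/page/1'
bare_request = urllib.request.Request(url)
browser_request = urllib.request.Request(url, headers=fixed_headers)

# urllib stores header names capitalized as 'User-agent'.
print(bare_request.get_header('User-agent'))
print(browser_request.get_header('User-agent'))
```

So the fix has two parts: write the headers as a real dict, and fetch the page once with requests so that every later parsing step (BeautifulSoup or pandas) works on the response you already have.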
