Getting a 403 page when web scraping

import requests
from bs4 import BeautifulSoup
import pandas

url = "https://mirror-h.org/archive/page/1"
page = pandas.read_html(url)
headers = {
    'user-agent:' 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

1 answer

User

#1 · Posted 2024-05-17 06:34:04

from io import StringIO

import requests
import pandas as pd


# The headers must be a dict. Your original code put the colon inside the key
# string ('user-agent:'), so no key/value pair was ever created.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
}


def main(url):
    # Send the browser-like headers with the request.
    r = requests.get(url, headers=headers)
    # With the headers in place this should print <Response [200]>.
    print(r)

    # Hand pandas the HTML that was already fetched with those headers,
    # instead of letting it download the URL on its own without them.
    # StringIO avoids the deprecation warning newer pandas versions emit
    # for literal HTML strings.
    # This still fails with "ValueError: No tables found": the site builds its
    # content with JavaScript behind Cloudflare protection, so the table is
    # not in the raw HTML. Try Selenium instead.
    df = pd.read_html(StringIO(r.text))
    print(df)


main('https://mirror-h.org/archive/page/1')
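
The header problem pointed out in the answer's first comment can be reproduced without any network access. The snippet below is only an illustration; the shortened user-agent string `ua` and the names `broken` and `fixed` are made up for the demo. Python joins adjacent string literals, so with the colon inside the quotes the braces hold a single string and build a set rather than a dict.

```python
# Shortened user-agent string, used only for this demonstration.
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Colon inside the quotes: the two adjacent literals are concatenated into
# one string, and braces around a single value create a set.
broken = {'user-agent:' 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Colon outside the quotes: a real key/value pair, so this is a dict.
fixed = {'User-Agent': ua}

print(type(broken).__name__)  # set
print(type(fixed).__name__)   # dict
print(next(iter(broken)))     # user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

`requests` expects a mapping for `headers`, so only the `fixed` form is suitable for `requests.get(url, headers=...)`.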

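The answer's closing suggestion to try Selenium could look roughly like the sketch below. It is not taken from the answer: the function name `fetch_rendered_html` and its parameters `make_driver`, `timeout`, and `poll` are hypothetical, it assumes Selenium 4 with a local Chrome installation, and Cloudflare may still block an automated browser.

```python
import time


def fetch_rendered_html(url, make_driver=None, timeout=15.0, poll=0.5):
    """Load `url` in a browser and return the HTML after JavaScript has run.

    `make_driver` is a hypothetical hook: any zero-argument callable that
    returns a Selenium-style driver. It defaults to selenium.webdriver.Chrome.
    """
    if make_driver is None:
        # Imported lazily so this module loads even without Selenium installed.
        from selenium import webdriver
        make_driver = webdriver.Chrome

    driver = make_driver()
    try:
        driver.get(url)
        deadline = time.monotonic() + timeout
        # Poll until at least one <table> element exists.
        # "tag name" is the string value of selenium's By.TAG_NAME.
        while not driver.find_elements("tag name", "table"):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no <table> appeared within {timeout} seconds")
            time.sleep(poll)
        return driver.page_source
    finally:
        # Always close the browser, even when the wait times out.
        driver.quit()
```

The returned string can then be passed to `pd.read_html(StringIO(html))` in place of `r.text`. Taking the driver factory as a parameter is a design choice for this sketch: it lets a different browser, or a stub in tests, be swapped in without changing the function.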