Fixing error 408 on the first POST request when scraping data

Posted 2024-10-08 22:23:49


I am trying to scrape a website with BS4. This is the site:

https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.html

I want to collect the URLs of all the news articles on this page. If I simply fetch the page URL with the requests library, the article URLs are not in the response. But when I inspect the page and look at the Network tab, there is a POST request whose response is HTML containing all the URLs (the href attributes). So I have to reproduce that POST request to get all the article URLs, but I always get error 408:

url = 'https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.filter.html?tx_wslfilter_filter%5Baction%5D=ajax&tx_wslfilter_filter%5Bcontroller%5D=Filter&cHash=88a50dfb12c7c7e03ce68f244dbfda20'

headers = {
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Content-Length': '757',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Host': 'www.wsl.ch',
            'Origin': 'https://www.wsl.ch',
            'Referer': 'https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.html',
            'Sec-Fetch-Dest': 'empty',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Site': 'same-origin',
            'Server-Timing': 'miss, db;dur=63, app;dur=55.2'}


response = requests.post(url, headers = headers)
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

I have tried it with and without the headers, but the result is the same. What should I do?


1 Answer

#1 · Posted 2024-10-08 22:23:49
  • You are not sending a body with your POST request. Your headers declare Content-Length: 757, so the server waits for a request body that never arrives and eventually responds with 408 (Request Timeout).
  • I have corrected your code below; with the body included you will no longer get the 408.
from bs4 import BeautifulSoup
import requests

url = 'https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.filter.html?tx_wslfilter_filter%5Baction%5D=ajax&tx_wslfilter_filter%5Bcontroller%5D=Filter&cHash=88a50dfb12c7c7e03ce68f244dbfda20'

headers = {
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Host': 'www.wsl.ch',
            'Origin': 'https://www.wsl.ch',
            'Referer': 'https://www.wsl.ch/de/ueber-die-wsl/news/alle-news.html',
            'Sec-Fetch-Dest': 'empty',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Site': 'same-origin'}
# 'Content-Length' is left out on purpose: requests computes it from the body,
# and a hard-coded value that does not match the body causes exactly the kind
# of timeout you saw. 'Server-Timing' is dropped too - it is a response header.

# URL-encoded form body copied from the browser's Network tab: it asks the
# news filter for page 1, 10 items per page, with all categories and tags.
data='tx_wslfilter_filter%5Btype%5D=news&tx_wslfilter_filter%5Bslf%5D=0&tx_wslfilter_filter%5Blang%5D=0&tx_wslfilter_filter%5Bpage%5D=1&tx_wslfilter_filter%5Bperpage%5D=10&tx_wslfilter_filter%5Bkeyword%5D=&tx_wslfilter_filter%5Ball%5D=1&tx_wslfilter_filter%5Bcategory%5D%5B10%5D=10&tx_wslfilter_filter%5Bcategory%5D%5B11%5D=11&tx_wslfilter_filter%5Bcategory%5D%5B12%5D=12&tx_wslfilter_filter%5Bcategory%5D%5B13%5D=13&tx_wslfilter_filter%5Bcategory%5D%5B1%5D=1&tx_wslfilter_filter%5Btag%5D%5B76%5D=76&tx_wslfilter_filter%5Btag%5D%5B1%5D=1&tx_wslfilter_filter%5Btag%5D%5B11%5D=11&tx_wslfilter_filter%5Btag%5D%5B7%5D=7&tx_wslfilter_filter%5Btag%5D%5B9%5D=9&tx_wslfilter_filter%5Btag%5D%5B8%5D=8&tx_wslfilter_filter%5Btag%5D%5B52%5D=52&tx_wslfilter_filter%5Byear%5D=0'
response = requests.post(url, data=data, headers=headers)
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)
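Once the POST succeeds, the article URLs can be pulled out of the returned fragment with BeautifulSoup's find_all. A minimal sketch; the HTML snippet below is illustrative stand-in markup, not the site's actual structure, so adjust the selector to what the real response contains:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the HTML the POST request returns;
# the real response carries the news links as <a href="..."> tags.
html = '''
<div class="news-list">
  <a href="/de/ueber-die-wsl/news/2020/artikel-1.html">Artikel 1</a>
  <a href="/de/ueber-die-wsl/news/2020/artikel-2.html">Artikel 2</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Collect every href and resolve the relative paths against the site root.
links = ['https://www.wsl.ch' + a['href'] for a in soup.find_all('a', href=True)]
print(links)
```

For the real response you would pass response.content instead of the sample string; the tx_wslfilter_filter[page] parameter in the form body can then be incremented to walk through the remaining result pages.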
