Working around a blocked GET request in Python

import pandas as pd
from requests import get
import bs4 as bs
import re

# works
# baseURL = 'https://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=ky119sb&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&make=TOYOTA&model=VERSO&year-from=1990&year-to=2017&minimum-mileage=0&maximum-mileage=200000&body-type=MPV&fuel-type=Diesel&minimum-badge-engine-size=1.6&maximum-badge-engine-size=4.5&maximum-seats=8'

# doesn't work
baseURL = 'https://www.autotrader.com/cars-for-sale/Certified+Cars/cars+under+50000/Jeep/Grand+Cherokee/Seattle+WA-98101?extColorsSimple=BURGUNDY%2CRED%2CWHITE&maxMileage=45000&makeCodeList=JEEP&listingTypes=CERTIFIED%2CUSED&interiorColorsSimple=BEIGE%2CBROWN%2CBURGUNDY%2CTAN&searchRadius=0&modelCodeList=JEEPGRAND&trimCodeList=JEEPGRAND%7CSRT%2CJEEPGRAND%7CSRT8&zip=98101&maxPrice=50000&startYear=2015&marketExtension=true&sortBy=derivedpriceDESC&numRecords=25&firstRecord=0'

a = get(baseURL)
soup = bs.BeautifulSoup(a.content, 'html.parser')
info = soup.find_all('div', class_='information-container')
price = soup.find_all('div', class_='vehicle-price')

d = []
for idx, i in enumerate(info):
    ii = i.find_next('ul').find_all('li')
    year_ = ii[0].text
    miles = re.sub(r"[^0-9\.]", "", ii[2].text)
    engine = ii[3].text
    hp = re.sub(r"[^\d\.]", "", ii[4].text)
    p = re.sub(r"[^\d\.]", "", price[idx].text)
    d.append([year_, miles, engine, hp, p])

df = pd.DataFrame(d, columns=['year', 'miles', 'engine', 'hp', 'price'])

1 answer

User

#1 · Posted on 2024-09-28 01:24:02

By default, requests sends its own distinctive User-Agent header with every request:

>>> import requests
>>> r = requests.get('https://google.com')
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
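The default User-Agent can also be inspected without sending anything over the network. The sketch below assumes the `requests` package is installed and uses its `requests.utils.default_user_agent()` helper together with the default headers of a fresh `Session`; the version number in the printed string depends on the installed release.

```python
import requests

# The User-Agent string requests uses when none is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A fresh Session carries the same value in its default headers.
session = requests.Session()
print(session.headers['User-Agent'])
```

A server only needs to look for the `python-requests/` prefix in this header to recognise the client as a script.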

The site you are scraping is probably trying to keep crawlers out by rejecting the python-requests User-Agent.

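To get around this, you can change your user agent when sending the request. Since the page works in your browser, just copy your browser's User-Agent (you can search for it online, or log a request to a web page and copy the User-Agent from there). For me it is Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 (what a mouthful), so I set my user agent to: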

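>>> headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}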

Then send the request with the new headers (they are added to the default headers and only replace a default header when they share its name):

>>> r = requests.get('https://google.com', headers=headers)  # Using the custom headers we defined above
>>> r.request.headers
{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
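The merging behaviour can be checked offline as well. This is a minimal sketch, assuming `requests` is installed: it builds a prepared request through a `Session` instead of sending it, so nothing is transmitted. The variable names are illustrative only.

```python
import requests

# Browser User-Agent string taken from the answer above.
headers = {
    'user-agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
    )
}

session = requests.Session()
request = requests.Request('GET', 'https://google.com', headers=headers)

# prepare_request() merges the session's default headers with ours
# but does not send anything.
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f'{name}: {value}')
```

Header names are matched case-insensitively, so the lowercase `user-agent` key replaces the default `User-Agent` while `Accept`, `Accept-Encoding` and `Connection` are kept.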

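Now we can see that the request was sent with our preferred headers, and hopefully the site can no longer tell the difference between this request and one from a browser.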
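Applied to the script in the question, one possible arrangement is to set the browser User-Agent once on a `Session` and reuse it for every page. The sketch below assumes `requests` is installed; `BROWSER_UA`, `make_session`, `fetch` and the 10-second timeout are hypothetical names and values chosen for illustration, not part of the original code.

```python
import requests

# Hypothetical constant: the browser User-Agent quoted in the answer.
BROWSER_UA = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
)


def make_session(user_agent=BROWSER_UA):
    """Return a Session whose requests all carry the given User-Agent."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


def fetch(session, url, timeout=10):
    """Fetch a page body, raising on HTTP errors such as 403."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # surfaces a rejection instead of parsing an error page
    return response.content


session = make_session()
print(session.headers['User-Agent'])
```

In the question's script, `a = get(baseURL)` followed by `a.content` would then become `content = fetch(session, baseURL)`, with `content` passed to `bs.BeautifulSoup`. Whether this is enough depends on the site, since some use checks beyond the User-Agent header.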
