I'm trying to get phone numbers from the car listings on gumtree.com. The problem is that when I send GET requests to gumtree.com, the first 4-5 requests return the full content (HTML + JavaScript), but after that it returns only JavaScript.
If I then wait 5 minutes and retry, it again returns the full content (HTML + JavaScript) for 4-5 requests, and then again only JavaScript. I don't understand why this happens. They haven't banned my IP: I can still send requests and get the full content back for some of them, but I can't get the full content (HTML + JavaScript) for every request.
This is the link I'm scraping: https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private — it contains the car listings. I collect all the ad links, then iterate over each one to get the phone number from each ad.
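To check whether a given response is one of the "full" ones before parsing it, I use something like the sketch below. The marker class `list-listing-maxi` is taken from the search page; the retry count and wait time are just my guesses, not known Gumtree limits:

```python
import time


def has_listings(html: str) -> bool:
    """True if the page body contains the listing container (i.e. full HTML)."""
    return 'list-listing-maxi' in html


def fetch_with_backoff(get, url, attempts=3, wait=60):
    """Call get(url) until has_listings() passes or attempts run out."""
    for _ in range(attempts):
        response = get(url)
        if has_listings(response.text):
            return response
        time.sleep(wait)  # give whatever is throttling me time to reset
    return None
```

With `wait` set to a few minutes this would match the roughly 5-minute reset I'm seeing, but it makes the scrape very slow.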
What I've tried
Here is the code:
import time

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
headers = {
    'authority': 'www.gumtree.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4391.1 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cookie': 'gt_ab=ln:MzhhaQ==; gt_p=id:OWYzY2I3Y2MtZTIxYS00OWQ1LWIzYjUtYzNhMDRiYjk4MDNi; gt_appBanner=',
}

# Fetch the search results page that lists the ads
response = session.get('https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private', headers=headers, timeout=(5, 27))
soup = BeautifulSoup(response.content, 'html.parser')
ul = soup.find("ul", attrs={"class": "list-listing-maxi"})
lis = ul.findChildren("li", recursive=False)
for li in lis:
    ad = li.find("article", attrs={"class": "listing-maxi"})
    adLink = ad.find("a", attrs={"class": "listing-link"}).get('href')
    # Relative ad links ("/p/...") need the site prefix
    if 'gumtree' not in adLink:
        adLink = 'https://www.gumtree.com' + adLink
    # The ad ID is the numeric part of the data-q attribute
    adID = ad.get('data-q').split('-')[1]
    revealUrl = adLink
    response = session.get(revealUrl, timeout=(10, 27))
    soup = BeautifulSoup(response.content, 'html.parser')
    # Dump the last response so I can inspect what actually came back
    with open("abc.html", "wb") as f:
        f.write(response.content)
    fone = soup.find("div", attrs={"class": "seller-phone-reveal"})
    if fone is None:
        fone = soup.find("div", attrs={"class": "space-mbs"})
    fone = fone.find("a").get('href')
    fone = fone.split('&')
    try:
        fone = fone[1]
    except IndexError:
        fone = soup.find("a", attrs={"id": "reply-panel-reveal-btn"}).get('href')
    token = fone.replace("rt=", "")
    print(token)
    print('-----------------')
    time.sleep(50)
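For reference, the href handling at the end of that loop could be done with `urllib.parse` instead of manual `split()` calls. This sketch assumes the hrefs are either `tel:` links or carry an `rt=` query token — that is my reading of the markup, not a documented Gumtree format:

```python
from urllib.parse import parse_qs, urlparse


def extract_phone_or_token(href: str):
    """Return a phone number from a tel: link, an rt= query token, or None."""
    if href.startswith('tel:'):
        return href[len('tel:'):]
    query = parse_qs(urlparse(href).query)
    if 'rt' in query:
        return query['rt'][0]
    return None
```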
It's not a login issue — the page is accessible without logging in. Could Scrapy handle this?
Can anyone look into this?
Thanks