Can't get the full page content on consecutive (looped) requests to gumtree.com with Python

Posted 2024-07-05 14:54:33


I'm trying to get the phone numbers from the car listings on gumtree.com. The problem is that when I send GET requests to gumtree.com, it returns the full content (HTML + JavaScript) for the first 4-5 requests, but after that it returns only JavaScript.

If I then wait 5 minutes and retry, it again returns the full content (HTML + JavaScript) for 4-5 requests, and then only JavaScript again. I don't understand why this happens. They haven't banned my IP: I can still send requests and get the full content back; the problem is that I can't get the full content (HTML + JavaScript) for every request.
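The only reliable way I've found to tell the two response types apart is to check for the normal markup before parsing. Here is a minimal sketch for the search page (the helper names, the marker element, and the backoff delays are my own guesses, not anything Gumtree documents):

import time
from bs4 import BeautifulSoup

def looks_blocked(html):
    # The full responses contain the listings <ul>; the JS-only
    # responses do not, so its absence marks a bad response.
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("ul", attrs={"class": "list-listing-maxi"}) is None

def get_with_backoff(session, url, headers, max_tries=5):
    # Retry with an increasing pause whenever the response
    # looks like the JS-only page.
    for attempt in range(max_tries):
        response = session.get(url, headers=headers, timeout=(5, 27))
        if not looks_blocked(response.text):
            return response
        time.sleep(60 * (attempt + 1))  # wait longer after each failure
    return None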

This is the link I'm scraping: https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private, which contains the car listings. I collect all the ad links and then iterate over each one to get the phone number from each ad.

What I have tried

  1. I tried requests-html in Python: https://requests.readthedocs.io/projects/requests-html/en/latest/
  2. I used Selenium with a headless Chrome browser (see the sketch after this list), but after 4-5 requests the browser also shows an empty page.
  3. I used the IP-rotation trick, but some IPs work and others don't. Besides, I don't think this is an IP problem, because I'm still sending requests from the same IP, and I even added a 50-second sleep between requests.
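For point 2, this is roughly the Selenium setup I used (a minimal sketch from memory; the exact options shown are assumptions about my configuration):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
# Reuse the same user agent as the plain-requests version below.
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4391.1 Safari/537.36")

driver = webdriver.Chrome(options=options)
driver.get("https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private")
html = driver.page_source  # after 4-5 requests this comes back effectively empty
driver.quit()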

Here is the code

import time
from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()

headers = {
    'authority': 'www.gumtree.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4391.1 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cookie': 'gt_ab=ln:MzhhaQ==; gt_p=id:OWYzY2I3Y2MtZTIxYS00OWQ1LWIzYjUtYzNhMDRiYjk4MDNi; gt_appBanner=',
}

# Fetch the search results page.
response = session.get('https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private', headers=headers, timeout=(5, 27))

soup = BeautifulSoup(response.content, 'html.parser')

# One <li> per ad inside the listings container.
ul = soup.find("ul", attrs={"class": "list-listing-maxi"})
lis = ul.findChildren("li", recursive=False)

for li in lis:
    ad = li.find("article", attrs={"class": "listing-maxi"})
    adLink = ad.find("a", attrs={"class": "listing-link"}).get('href')

    # Relative links need the domain prepended.
    if 'gumtree' not in adLink:
        adLink = 'https://www.gumtree.com' + adLink

    # Ad ID parsed out of the data-q attribute.
    adID = ad.get('data-q').split('-')[1]

    # Send the same headers on every request, not only the first one.
    response = session.get(adLink, headers=headers, timeout=(10, 27))
    soup = BeautifulSoup(response.content, 'html.parser')

    # Dump the raw response for debugging.
    with open("abc.html", "wb") as f:
        f.write(response.content)

    fone = soup.find("div", attrs={"class": "seller-phone-reveal"})
    if fone is None:
        fone = soup.find("div", attrs={"class": "space-mbs"})
    if fone is None:
        # Neither container is present: this is one of the JS-only responses.
        print(adID, 'returned no phone markup, skipping')
        print('-----------------')
        continue

    fone = fone.find("a").get('href')
    fone = fone.split('&')

    try:
        fone = fone[1]
    except IndexError:
        # No "&" in the href: fall back to the reveal button's link.
        fone = soup.find("a", attrs={"id": "reply-panel-reveal-btn"}).get('href')

    token = fone.replace("rt=", "")
    print(token)
    print('-----------------')

    time.sleep(50)

This is not a login issue; the page can be accessed without logging in. Could Scrapy handle this?
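In case Scrapy is the answer, this is roughly what I would try (a minimal sketch; the CSS selectors mirror my BeautifulSoup code above, and the throttle settings are guesses). As far as I know Scrapy doesn't render JavaScript either, so I'm not sure it gets around the blank responses:

import scrapy

class GumtreeCarsSpider(scrapy.Spider):
    name = "gumtree_cars"
    start_urls = [
        "https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private"
    ]
    custom_settings = {
        # Crawl slowly and let AutoThrottle adapt the delay to the server.
        "DOWNLOAD_DELAY": 10,
        "AUTOTHROTTLE_ENABLED": True,
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4391.1 Safari/537.36",
    }

    def parse(self, response):
        # Follow every ad link inside the listings container.
        for href in response.css("ul.list-listing-maxi a.listing-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_ad)

    def parse_ad(self, response):
        # Grab the href that carries the phone-reveal token.
        href = response.css("div.seller-phone-reveal a::attr(href)").get()
        yield {"url": response.url, "reveal_href": href}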

Could someone look into this?

Thanks

