Can't get the full page content on consecutive (looped) requests to gumtree.com with Python

Posted 2024-07-05 14:54:33


I'm trying to get the phone numbers from the car listings on gumtree.com. The problem is that when I send GET requests to gumtree.com, it returns the full content (HTML + JavaScript) for the first 4-5 requests, but after that it returns only JavaScript.

If I then wait 5 minutes and retry, it again returns the full content (HTML + JavaScript) for 4-5 requests, and then only JavaScript again. I don't understand why this happens. They haven't banned my IP: I can still send requests and get the full content back; the problem is that I can't get the full content (HTML + JavaScript) for every request.
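The only reliable way I've found to tell the two response types apart is to check for the normal markup before parsing. Here is a minimal sketch for the search page (the helper names, the marker element, and the backoff delays are my own guesses, not anything Gumtree documents):

import time
from bs4 import BeautifulSoup

def looks_blocked(html):
    # The full responses contain the listings <ul>; the JS-only
    # responses do not, so its absence marks a bad response.
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("ul", attrs={"class": "list-listing-maxi"}) is None

def get_with_backoff(session, url, headers, max_tries=5):
    # Retry with an increasing pause whenever the response
    # looks like the JS-only page.
    for attempt in range(max_tries):
        response = session.get(url, headers=headers, timeout=(5, 27))
        if not looks_blocked(response.text):
            return response
        time.sleep(60 * (attempt + 1))  # wait longer after each failure
    return None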

This is the link I'm scraping: https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private, which contains the car listings. I collect all the ad links and then iterate over each one to get the phone number from each ad.

What I have tried

  1. I tried requests-html in Python: https://requests.readthedocs.io/projects/requests-html/en/latest/
  2. I used Selenium with a headless Chrome browser (see the sketch after this list), but after 4-5 requests the browser also shows an empty page.
  3. I used the IP-rotation trick, but some IPs work and others don't. Besides, I don't think this is an IP problem, because I'm still sending requests from the same IP, and I even added a 50-second sleep between requests.
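For point 2, this is roughly the Selenium setup I used (a minimal sketch from memory; the exact options shown are assumptions about my configuration):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
# Reuse the same user agent as the plain-requests version below.
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4391.1 Safari/537.36")

driver = webdriver.Chrome(options=options)
driver.get("https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private")
html = driver.page_source  # after 4-5 requests this comes back effectively empty
driver.quit()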

Here is the code

import time
from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()

headers = {
    'authority': 'www.gumtree.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4391.1 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cookie': 'gt_ab=ln:MzhhaQ==; gt_p=id:OWYzY2I3Y2MtZTIxYS00OWQ1LWIzYjUtYzNhMDRiYjk4MDNi; gt_appBanner=',
}

# Fetch the search results page.
response = session.get('https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private', headers=headers, timeout=(5, 27))

soup = BeautifulSoup(response.content, 'html.parser')

# One <li> per ad inside the listings container.
ul = soup.find("ul", attrs={"class": "list-listing-maxi"})
lis = ul.findChildren("li", recursive=False)

for li in lis:
    ad = li.find("article", attrs={"class": "listing-maxi"})
    adLink = ad.find("a", attrs={"class": "listing-link"}).get('href')

    # Relative links need the domain prepended.
    if 'gumtree' not in adLink:
        adLink = 'https://www.gumtree.com' + adLink

    # Ad ID parsed out of the data-q attribute.
    adID = ad.get('data-q').split('-')[1]

    # Send the same headers on every request, not only the first one.
    response = session.get(adLink, headers=headers, timeout=(10, 27))
    soup = BeautifulSoup(response.content, 'html.parser')

    # Dump the raw response for debugging.
    with open("abc.html", "wb") as f:
        f.write(response.content)

    fone = soup.find("div", attrs={"class": "seller-phone-reveal"})
    if fone is None:
        fone = soup.find("div", attrs={"class": "space-mbs"})
    if fone is None:
        # Neither container is present: this is one of the JS-only responses.
        print(adID, 'returned no phone markup, skipping')
        print('-----------------')
        continue

    fone = fone.find("a").get('href')
    fone = fone.split('&')

    try:
        fone = fone[1]
    except IndexError:
        # No "&" in the href: fall back to the reveal button's link.
        fone = soup.find("a", attrs={"id": "reply-panel-reveal-btn"}).get('href')

    token = fone.replace("rt=", "")
    print(token)
    print('-----------------')

    time.sleep(50)

This is not a login issue; the page can be accessed without logging in. Could Scrapy handle this?
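In case Scrapy is the answer, this is roughly what I would try (a minimal sketch; the CSS selectors mirror my BeautifulSoup code above, and the throttle settings are guesses). As far as I know Scrapy doesn't render JavaScript either, so I'm not sure it gets around the blank responses:

import scrapy

class GumtreeCarsSpider(scrapy.Spider):
    name = "gumtree_cars"
    start_urls = [
        "https://www.gumtree.com/search?search_category=cars&search_location=ls14jj&distance=1000&seller_type=private"
    ]
    custom_settings = {
        # Crawl slowly and let AutoThrottle adapt the delay to the server.
        "DOWNLOAD_DELAY": 10,
        "AUTOTHROTTLE_ENABLED": True,
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4391.1 Safari/537.36",
    }

    def parse(self, response):
        # Follow every ad link inside the listings container.
        for href in response.css("ul.list-listing-maxi a.listing-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_ad)

    def parse_ad(self, response):
        # Grab the href that carries the phone-reveal token.
        href = response.css("div.seller-phone-reveal a::attr(href)").get()
        yield {"url": response.url, "reveal_href": href}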

Could someone look into this?

Thanks

