Python BeautifulSoup web scraping: return only new data?

    from bs4 import BeautifulSoup
    import requests

    # URL and headers so the site thinks we are a browser
    url = "https://www.autotrader.co.uk/car-search?search-target=usedcars&is-quick-search=true&radius=&onesearchad=used&onesearchad=nearlynew&onesearchad=new&make=AC&model=&price-from=&price-to=&postcode=sw65bg"
    headers = {'User-Agent': 'Mozilla/5.0'}

    # Request. headers must be passed by keyword: the second positional
    # argument of requests.get() is params, not headers.
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the name boxes
    name_box = soup.find_all('h2', attrs={'class': 'listing-title'})

    # Print the name_box results to see them
    for listing in name_box:
        print(listing.text)

       A
    0  AC Cobra 6.3 2dr
    1  AC Cobra 4.9 MK IV 2dr
    2  AC Cobra 3.5 2dr
    3  AC Cobra 3.5 2dr
    4  AC Cobra 5.3 2dr
    5  AC Cobra 5.7
    6  AC Cobra 4736 Built By Gardner Douglas 4.7 2dr
    7  AC Cobra 5.7
    8  AC Cobra 5.7 2dr
    9  AC Cobra 5.8

1 answer

User

#1 · Posted 2024-09-28 01:25:02

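If the page sends an ETag header (basically a checksum of the page), you can store it in a database and send it back with the next request in an If-None-Match header. If nothing has changed, the server returns a 304 (Not Modified) and you can stop.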
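The ETag round trip described above could look roughly like the sketch below. The function name `fetch_if_changed` and its `session` parameter are illustrative assumptions, not part of the question's code; `session` is expected to behave like a `requests.Session`. The sketch only helps if the server actually sends an ETag and honours If-None-Match.

```python
def fetch_if_changed(session, url, cached_etag=None):
    """Fetch url, sending a stored ETag back as If-None-Match.

    `session` is assumed to behave like requests.Session (hypothetical
    parameter, so the function can be exercised without network access).
    Returns (body, etag); body is None when the server answers 304.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    if cached_etag:
        # Ask the server to reply 304 if its ETag still matches ours
        headers["If-None-Match"] = cached_etag

    response = session.get(url, headers=headers)
    if response.status_code == 304:
        # Not Modified: nothing new to scrape
        return None, cached_etag

    response.raise_for_status()
    # Remember the new ETag (may be None if the server sends none)
    return response.text, response.headers.get("ETag")
```

With the real library this would be called as `fetch_if_changed(requests.Session(), url, etag_from_db)`, and parsing with BeautifulSoup would be skipped whenever the returned body is `None`.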

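If the page sends a Last-Modified header, you can store it in a database and compare it with the Last-Modified header of the next response. To save processing, check the header before scraping. If the page rarely changes, you can save bandwidth by downloading only the headers (a HEAD request).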
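A minimal sketch of the header-only check, assuming the server sends Last-Modified at all. `page_changed` and its `session` parameter are hypothetical names; `session` stands in for a `requests.Session`. When either timestamp is missing, the sketch conservatively reports the page as changed.

```python
from email.utils import parsedate_to_datetime


def page_changed(session, url, stored_last_modified):
    """Issue a HEAD request and compare Last-Modified with the stored value.

    `session` is assumed to behave like requests.Session (hypothetical
    parameter). Returns (changed, current_last_modified).
    """
    response = session.head(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
        allow_redirects=True,  # HEAD does not follow redirects by default
    )
    current = response.headers.get("Last-Modified")
    if current is None or stored_last_modified is None:
        # No timestamp to compare against: assume the page may have changed
        return True, current

    # HTTP dates parse into timezone-aware datetimes, so they compare directly
    changed = parsedate_to_datetime(current) > parsedate_to_datetime(stored_last_modified)
    return changed, current
```

Only when `changed` is true would the full page be downloaded and parsed; the returned timestamp is the value to store for the next run.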

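Or, better still, send the request with an If-Modified-Since header. The server should then return either 304 or 200 (the full response), depending on whether the page is newer than your last timestamp.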
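The timestamp-based conditional request might be sketched as follows. `fetch_if_modified_since` and `session` are illustrative names (the latter standing in for a `requests.Session`), and `last_fetch` is assumed to be a timezone-aware UTC `datetime` recorded at the previous scrape.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def fetch_if_modified_since(session, url, last_fetch):
    """Fetch url only if the server reports it changed after last_fetch.

    `session` is assumed to behave like requests.Session (hypothetical
    parameter); `last_fetch` must be a UTC-aware datetime.
    Returns the page body, or None when the server answers 304.
    """
    headers = {
        "User-Agent": "Mozilla/5.0",
        # HTTP expects an RFC-style date such as "Wed, 21 Oct 2015 07:28:00 GMT"
        "If-Modified-Since": format_datetime(last_fetch, usegmt=True),
    }
    response = session.get(url, headers=headers)
    if response.status_code == 304:
        return None  # Not Modified since last_fetch
    response.raise_for_status()
    return response.text


# Example timestamp from a previous run (made-up value for illustration)
previous_run = datetime(2015, 10, 21, 7, 28, 0, tzinfo=timezone.utc)
```

Compared with the HEAD approach, this needs a single request: the server either sends the new page or a body-less 304.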

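Of course, all of this depends on the server or page owner being friendly enough to send and honour these headers. Unfortunately, I don't see an ETag or Last-Modified header coming back with the example page.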

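Ultimately, the only way to be sure there is no new data is to compare what you scrape with what is already in your database. You can keep that process as cheap as possible by writing streamlined scraping and DB code.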
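One way to do that comparison is to let the database reject rows it has already seen. The sketch below uses SQLite and keys on the listing title purely for illustration; the table name `listings` and the function `store_new_listings` are assumptions, and since the sample output contains repeated titles, a real scraper would more likely key on a listing URL or ID.

```python
import sqlite3


def store_new_listings(conn, titles):
    """Insert titles not yet in the database and return only the new ones.

    The `listings` table and its single `title` key are hypothetical;
    a unique listing URL or ID would be a more reliable key.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS listings (title TEXT PRIMARY KEY)")
    new_titles = []
    for title in titles:
        # INSERT OR IGNORE skips rows whose primary key already exists
        cursor = conn.execute(
            "INSERT OR IGNORE INTO listings (title) VALUES (?)", (title,)
        )
        if cursor.rowcount == 1:  # a row was actually inserted
            new_titles.append(title)
    conn.commit()
    return new_titles
```

Passing it the scraped titles, for example `store_new_listings(sqlite3.connect("cars.db"), [h2.text for h2 in name_box])`, would return just the listings that were not stored on an earlier run (`cars.db` being a made-up file name).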
