Making a POST request with Scrapy on a page that uses AJAX

import scrapy


class MedizinfuchsSpider(scrapy.Spider):
    name = "medizinfuchs"
    start_urls = [
        'https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html'
    ]

    def parse(self, response):
        for apotheke in response.css('div.apotheke'):
            yield {
                'name': apotheke.css('a.name::text').getall(),
                'single': apotheke.css('div.single::text').getall(),
                'shipping': apotheke.css('div.shipping::text').getall(),
            }

1 answer

User
#1 · Posted 2024-09-30 22:22:32

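If you are open to a suggestion that uses only requests and BeautifulSoup, you can:
Use requests.Session() to store cookies and make a first call to the URL with s.get(url). This sets a cookie named product_history whose value equals the product ID.
Use requests.post to call the API you found in Chrome DevTools, passing the ID in the form data.
The following example iterates over a list of products and runs the flow described above: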
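To make the first step more concrete, here is a minimal standard-library sketch of what reading the product_history cookie amounts to. The header string is a hypothetical example (the real response may carry different attributes), and cookie_value is an illustrative helper, not part of requests.

```python
from http.cookies import SimpleCookie
from typing import Optional

# Hypothetical Set-Cookie header; only the cookie name comes from the answer.
SAMPLE_SET_COOKIE = "product_history=1104114918; Path=/; HttpOnly"


def cookie_value(set_cookie_header: str, name: str) -> Optional[str]:
    """Return the value of cookie `name` from one Set-Cookie header, or None."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)  # parse the raw header string
    morsel = jar.get(name)
    return morsel.value if morsel is not None else None


print(cookie_value(SAMPLE_SET_COOKIE, "product_history"))
```

With requests, the session does this parsing for you, which is why s.cookies.get_dict()["product_history"] is enough.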
import requests
from bs4 import BeautifulSoup
import pandas as pd

products = [
    "https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html",
    "https://www.medizinfuchs.de/preisvergleich/alcohol-pads-b.braun-100-st-b.-braun-melsungen-ag-pzn-629703.html"
]

results = []

for url in products:
    # get the product id from the product_history cookie
    s = requests.Session()
    r = s.get(url)
    product_id = s.cookies.get_dict()["product_history"]
    soup = BeautifulSoup(r.text, "html.parser")
    pzn = soup.find("li", {"class": "pzn"}).text[5:]
    print(f'pzn: {pzn}')

    # make the call to the AJAX endpoint
    r = requests.post("https://www.medizinfuchs.de/ajax_apotheken", data={
        "params[ppn]": product_id,
        "params[entry_order]": "single_asc",
        "params[filter][rating]": "",
        "params[filter][country]": 7,
        "params[filter][favorit]": 0,
        "params[filter][products_from][de]": 0,
        "params[filter][products_from][at]": 0,
        "params[filter][send]": 1,
        "params[limit]": 300,
        "params[merkzettel_sel]": "",
        "params[merkzettel_reload]": "",
        "params[apo_id]": ""
    })
    soup = BeautifulSoup(r.text, "html.parser")
    data = [
        {
            "name": t.find("a").text.strip(),
            "single": t.find("div", {"class": "single"}).text.strip(),
            "shipping": t.find("div", {"class": "shipping"}).text.strip().replace("\t", "").replace("\n", " "),
        }
        for t in soup.find_all("div", {"class": "apotheke"})
    ]
    # tag every offer with the PZN of the product it belongs to
    for t in data:
        results.append({
            "pzn": pzn,
            **t
        })

df = pd.DataFrame(results)
df.to_csv('result.csv', index=False)
print(df)
Live example: https://replit.com/@bertrandmartel/ScrapeMedicinFuchs
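Note that in the solution above I use requests.Session() only to obtain the product_history cookie. The session is not needed for the subsequent call. This way I get the product ID directly, without having to run a regular expression over the HTML/JS. There may be a better way to get the product ID, though. We cannot take it from the URL as-is, because the URL contains only part of the product ID, 4114918, rather than the full 1104114918 (unless you are willing to hard-code the 110 prefix).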
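For comparison, the hard-coded alternative mentioned above could look roughly like the sketch below. It is only an illustration: product_id_from_url and the regular expression are hypothetical, and the 110 prefix is confirmed by the answer for the aspirin product only, so it may not hold for other products (for example, ones with a shorter PZN).

```python
import re

# Matches the trailing "pzn-<digits>.html" part of a product URL.
PZN_IN_URL = re.compile(r"pzn-(\d+)\.html$")


def product_id_from_url(url: str, prefix: str = "110") -> str:
    """Build a product ID by prepending an assumed prefix to the PZN in the URL."""
    match = PZN_IN_URL.search(url)
    if match is None:
        raise ValueError(f"no PZN found in URL: {url}")
    return prefix + match.group(1)


url = (
    "https://www.medizinfuchs.de/preisvergleich/"
    "aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html"
)
print(product_id_from_url(url))
```

Reading the cookie, as in the solution above, avoids having to guess how the ID is constructed, at the cost of one extra GET request per product.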
