Scraping a page with dropdown menus and a button using Python

from bs4 import BeautifulSoup
import requests

pagem = requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content, "lxml")
lst = soupm.find_all('a', href=True)
url = lst[-1]['href']
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
xin = soup.find("select", {"id": "_id0:selectOneFechaIni"})
xfn = soup.find("select", {"id": "_id0:selectOneFechaFin"})
ino = list(xin.stripped_strings)
fino = list(xfn.stripped_strings)
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni': '07/03/2019', '_id0:selectOneFechaFin': '14/03/2019', "_id0:accion": "_id0:accion"}
respo = requests.post(url, data, headers=headers)
print(respo.url)

3 Answers

User

#1 · edited 2024-05-19 14:31:13

Last time I checked, you cannot submit a form by clicking a button with BeautifulSoup and Python. There are two typical approaches I see often:

Reverse engineer the form

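If the form makes AJAX calls (i.e. it sends requests in the background, which is common for SPAs written in React or Angular), the best approach is to open the network requests tab in Chrome or another browser and find out what the endpoint is and what the payload looks like. Once you have those answers, you can use the requests library to send a POST request to that endpoint with data=your_payload_dictionary (i.e. do by hand what the form does behind the scenes). Read this post for a more detailed tutorial.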
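As a rough illustration of the "do by hand what the form does" step, here is a minimal sketch that collects the fields a browser would submit and URL-encodes them. It uses only the standard library so it runs offline. The `SAMPLE_FORM` markup is hypothetical (it is not copied from the real page), and the hidden `javax.faces.ViewState` field is an assumption about what a JSF page may contain; the `_id0:...` names come from the question. `FormFieldCollector` and `build_payload` are names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect the name/value pairs a browser would submit for a form."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current_select = None  # name of the <select> being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""
        elif tag == "select" and "name" in attrs:
            self._current_select = attrs["name"]
        elif tag == "option" and self._current_select is not None:
            name = self._current_select
            # Keep the option marked "selected", otherwise the first one seen.
            if "selected" in attrs or name not in self.fields:
                self.fields[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None


# Hypothetical markup, only for demonstration.
SAMPLE_FORM = """
<form id="_id0" method="post">
  <input type="hidden" name="javax.faces.ViewState" value="j_id1" />
  <select name="_id0:selectOneFechaIni">
    <option value="28/02/2019">28/02/2019</option>
    <option value="07/03/2019" selected>07/03/2019</option>
  </select>
  <select name="_id0:selectOneFechaFin">
    <option value="14/03/2019">14/03/2019</option>
  </select>
</form>
"""


def build_payload(html, overrides):
    """Start from the form's own defaults, then apply explicit overrides."""
    collector = FormFieldCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(overrides)
    return payload


payload = build_payload(SAMPLE_FORM, {"_id0:accion": "_id0:accion"})
body = urlencode(payload)  # what ends up in the POST body
print(body)
```

The resulting dictionary is what you would pass as `data=` to `requests.post`. Starting from the form's own hidden fields rather than a hand-typed dictionary makes it less likely that you leave out something the server expects.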

Use a headless browser

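If the site is written in ASP.NET or a similar MVC framework, the best approach is to use a headless browser to fill in the form and click submit. A popular framework for this is Selenium, which simulates a normal browser. Read this post for a more detailed tutorial.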

From a quick look at the page you are working with, I recommend the second approach.
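A minimal sketch of the headless-browser route might look like the following. The select ids come from the question's code. Treating `_id0:accion` as the id of the submit button is an assumption (the question only shows it as a form field name), as is the idea that each option's `value` attribute equals the date string. `submit_date_range` and `run_in_headless_chrome` are hypothetical helper names, and the second one is defined but not called because it needs Selenium and a local Chrome.

```python
def submit_date_range(driver, url, start, end):
    """Pick both dates in the dropdowns, click the button, return the HTML."""
    driver.get(url)
    for select_id, value in (
        ("_id0:selectOneFechaIni", start),
        ("_id0:selectOneFechaFin", end),
    ):
        # An attribute selector avoids having to escape the ":" in the ids.
        css = f'select[id="{select_id}"] option[value="{value}"]'
        driver.find_element("css selector", css).click()
    # Assumption: the submit button's id is "_id0:accion".
    driver.find_element("id", "_id0:accion").click()
    return driver.page_source


def run_in_headless_chrome(url):
    """Not called here: requires selenium and a local Chrome install."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # flag for recent Chrome versions
    driver = webdriver.Chrome(options=options)
    try:
        return submit_date_range(driver, url, "07/03/2019", "14/03/2019")
    finally:
        driver.quit()
```

Passing the driver in as an argument keeps the form-filling logic separate from browser setup, so the same function works with Firefox or with a stub driver in tests.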

User

#2 · edited 2024-05-19 14:31:13

The page you want to scrape is:

http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces

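In the payload, send the date you want to query and the JSESSIONID from your cookies, and put Referer, User-Agent and all the other usual stuff in the request headers.

Example: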

import requests
import pandas as pd

cl = requests.Session()  # a Session keeps cookies between requests
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"


payload = {
    # JSESSIONID must come from your own cookies; the value below is only an example.
    "JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
    "fechaAConsultar": "21/03/2019"
}

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}
response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
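A hardcoded JSESSIONID will stop working once the server-side session expires. As a sketch of one way around that, the id can be read from a URL that carries it as `;jsessionid=...`, like the Referer in the example above. `extract_jsessionid` and `build_payload` are hypothetical helper names. Whether the server accepts an id obtained this way is an assumption that has to be checked against the live site.

```python
import re


def extract_jsessionid(url):
    """Return the session id embedded as ';jsessionid=...' in a URL, or None."""
    match = re.search(r";jsessionid=([^?#;]+)", url, flags=re.IGNORECASE)
    return match.group(1) if match else None


def build_payload(session_id, date):
    """Build the form fields used in the example above."""
    return {"JSESSIONID": session_id, "fechaAConsultar": date}


referer = (
    "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces"
    ";jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
)
session_id = extract_jsessionid(referer)
payload = build_payload(session_id, "21/03/2019")
print(payload)
```

In a live run, the URL would be `response.url` after a GET on the landing page. If the server sets the id as a cookie instead, `cl.cookies.get("JSESSIONID")` on the requests session should return it.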

User

#3 · edited 2024-05-19 14:31:13

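Just from clicking through the page, there seems to be some kind of cookie/session handling going on, which may be hard to account for when using requests.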

(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000)

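It may be easier to write this with selenium, since that automates a browser (and takes care of all the headers and so on). You still have access to the html to pull out what you need, and you can probably reuse a lot of what you have already written.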
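To illustrate the "you still have access to the html" point: whatever produces the HTML string (for instance selenium's `driver.page_source`), the parsing step stays the same. Below is a stdlib-only sketch that pulls table rows out of such a string. The `SAMPLE_HTML` table and its values are made up for the demonstration, and `TableRowParser` is a hypothetical helper. BeautifulSoup or `pandas.read_html` could be applied to the same string instead.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being parsed
        self._cell = None  # text fragments of the cell being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up table standing in for the HTML a browser session would return.
SAMPLE_HTML = """
<table>
  <tr><th>Sector</th><th>Monto</th></tr>
  <tr><td>Sector A</td><td> 100 </td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```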
