Web scraping: clicking a download button

Posted 2024-07-04 08:49:38


I want to automate the following task with Python, given file ID 8426 and date 03312021:

  1. Go to the following site: https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx?ds=call&idType=fdiccert&id=8426&date=03312021
  2. Click "Download PDF"
  3. Save the file to a directory

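I did some research and found the Python module requests: https://docs.python-requests.org/en/master/user/quickstart/

It looks like I should be able to declare a data object and pass it in when sending the request: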

import requests

r = requests.post('https://my_url', data={'key': 'value'})
with open("test.pdf", "wb") as f:
    f.write(r.content)

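However, in this case I am struggling to find the right attributes to put in the data object. I have tried a few, but could not get the PDF file I need. Any help would be appreciated.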


2 Answers

I know you asked about requests, but I think this is easy with Selenium. If you like, try the following:

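from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

# Avoid shadowing the built-in id()
file_id = input("id: ")
date = input("date: ")

url = f"https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx?ds=call&idType=fdiccert&id={file_id}&date={date}"

browser = webdriver.Chrome()
browser.get(url)
# find_element_by_id was removed in Selenium 4; use find_element with By.ID
el = browser.find_element(By.ID, "Download_PDF_2")
el.click()
# Give the download time to finish before closing the browser
sleep(5)
browser.quit()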

You can also change how the id and date values are obtained, and how long the script sleeps.

Make sure chromedriver is available on your PATH, or keep it in the same directory as the script.
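The fixed sleep in the script above is a guess at how long the download takes. A minimal sketch of an alternative is to poll the download folder until a finished PDF appears. It assumes Chrome's convention of giving in-progress downloads a .crdownload suffix. The name wait_for_download and its parameters are illustrative and not part of Selenium.

```python
import tempfile
import time
from pathlib import Path


def wait_for_download(directory, timeout=30.0, poll_interval=0.5):
    """Return the first finished .pdf in `directory`, or None on timeout.

    Assumption: in-progress downloads carry a ".crdownload" suffix
    (Chrome's convention); other browsers use different temporary names.
    """
    directory = Path(directory)
    deadline = time.monotonic() + timeout
    while True:
        in_progress = list(directory.glob("*.crdownload"))
        finished = sorted(directory.glob("*.pdf"))
        if finished and not in_progress:
            return finished[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


if __name__ == "__main__":
    # A temporary directory stands in for the browser's download folder.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "report.pdf").write_bytes(b"%PDF-1.4 placeholder")
        print(wait_for_download(tmp, timeout=1.0))
```

In the Selenium script, a call such as wait_for_download(download_dir) could replace the sleep, where download_dir is whatever folder your browser saves to (a hypothetical variable here).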

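So, for the requests.post() method, the data parameter is a dictionary that represents the key-value pairs of the HTML POST form. To find them, open DevTools in your browser (Ctrl+Shift+I in Chrome and Firefox), go to the Network tab, and submit the form you want to inspect. In your case the form is a single <input type="submit" ... > element (styled as the "Download PDF" button). After you click this input, the browser sends a well-formed POST request to the server, and the Network tab shows that request's HTTP headers and key-value pairs. Just grab them and build two dicts in your Python script: the first with the headers, the second with the POST form values.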

An example for the URL you posted earlier:

import requests

# HTTP headers
headers = {
 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Content-Length': '1017',
 'Content-Type': 'application/x-www-form-urlencoded',
 'Cookie': 'ASP.NET_SessionId=okonm4wfhg5ddup5e0wkp0ur; BIGipServerfdic_Forward_prod_80=172495532.20480.0000; _ga=GA1.2.77529009.1621351450; _gid=GA1.2.1620156842.1621351450',
 'Host': 'cdr.ffiec.gov',
 'Origin': 'https://cdr.ffiec.gov',
 'Referer': 'https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx?ds=call&idType=fdiccert&id=8426&date=03312021',
 'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
 'sec-ch-ua-mobile': '?0',
 'Sec-Fetch-Dest': 'document',
 'Sec-Fetch-Mode': 'navigate',
 'Sec-Fetch-Site': 'same-origin',
 'Sec-Fetch-User': '?1',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36' }

# Form data
post_form_data = {
"__EVENTTARGET": "",
"__EVENTARGUMENT": "",
"__VIEWSTATE": "/wEPDwULLTE0NTY3MjMzNTQPFggeHVZpZXdQREZGYWNzaW1pbGVfU3VibWlzc2lvbklEApTmYR4UVmlld1BERkZhY3NpbWlsZU1vZGULKX1DZHIuUGRkLlVJLkNvbnRyb2xzLlVJSGVscGVyK1ZpZXdGYWNzaW1pbGVNb2RlLCBDZHIuUGRkLlVJLlByb2Nlc3NlcywgVmVyc2lvbj03LjEuMTMzLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49bnVsbAAeBkZJTmFtZQV4MVNUIFNVTU1JVCBCQU5LICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgHg5GRElDQ2VydE51bWJlcgUEODQyNhYCZg9kFgICAQ9kFgICBw9kFgYCAQ9kFgQCAQ8PFgYeBFRleHRkHghDc3NDbGFzcwUJZG9jaGVhZGVyHgRfIVNCAgJkZAIDDw8WBh8EZB8FBQZoZWFkZXIfBgICZGQCAw9kFgICAQ8UKwACZBQrAAUUKwAIaAUFUHJpbnRoaGRoZ2QUKwAIZwUNRG93bmxvYWQgWEJSTGdoZGhnZBQrAAhnBQxEb3dubG9hZCBQREZnaGRoZ2QUKwAIZwUMRG93bmxvYWQgU0RGZ2hkaGdkFCsACGcFEURvd25sb2FkIFRheG9ub215Z2hkaGhkZAIFDw8WAh4HVmlzaWJsZWhkZGTtXpFTz1TYX73fKLF2ros5Z2CvJ/pDUy88F6s57Qs97Q==",
"__VIEWSTATEGENERATOR": "A250BEAE",
"ctl00$MainContentHolder$viewTabStrip$Download_PDF_2": "Download PDF" }

# url to submit the form
url = 'https://cdr.ffiec.gov/Public/ViewPDFFacsimile.aspx?ds=call&idType=fdiccert&id=8426&date=03312021'

# making request
resp = requests.post(url, headers=headers, data=post_form_data)

# writing the file from response content
with open('file_name.pdf', 'wb') as file:
    file.write(resp.content)
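The __VIEWSTATE value above was copied from one browser session by hand. A minimal sketch of reading such hidden fields from the page's HTML instead, using only the standard library, is shown below. The sample markup is made up for illustration and is not the real page source, and whether the server accepts values obtained this way is an assumption that would need testing. The names HiddenFieldParser and extract_hidden_fields are hypothetical.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in `html`."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up sample markup, not the real page source.
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="A250BEAE" />
  <input type="submit" name="ctl00$MainContentHolder$viewTabStrip$Download_PDF_2" value="Download PDF" />
</form>
"""

form_data = extract_hidden_fields(sample_html)
# The submit button is not a hidden field, so add it explicitly.
form_data["ctl00$MainContentHolder$viewTabStrip$Download_PDF_2"] = "Download PDF"
print(form_data)
```

With a real page, the HTML would come from a GET request to the viewer URL, and the resulting dict would be passed as the data argument of the POST.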

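To look up a document with a specific file ID and date:
This information is given in the URL parameters: ... /ViewPDFFacsimile.aspx?ds=call&idType=fdiccert&id=8426&date=03312021. You can also find it in the Network tab (Chrome calls it "Query String Parameters"). To pass it in the request, use the params argument of the requests.post() method: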

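url_params = {
    "ds": "call",
    "idType": "fdiccert",
    "id": "8426",
    "date": "03312021",
}

# params= appends the query string, so pass the URL without one here
base_url = 'https://cdr.ffiec.gov/Public/ViewPDFFacsimile.aspx'
resp = requests.post(base_url, headers=headers, data=post_form_data, params=url_params)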
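To see what the params argument amounts to, here is a small sketch that builds the same URL by hand with the standard library. The helper name build_facsimile_url is hypothetical. Reading id as an FDIC certificate number and the date as MMDDYYYY is an inference from the example URL, not something the site documents here.

```python
from urllib.parse import urlencode

BASE_URL = "https://cdr.ffiec.gov/Public/ViewPDFFacsimile.aspx"


def build_facsimile_url(cert_id, date):
    """Build the request URL for one ID and date.

    Assumption: `date` uses the same format as the example, e.g. "03312021".
    """
    query = urlencode(
        {"ds": "call", "idType": "fdiccert", "id": cert_id, "date": date}
    )
    return f"{BASE_URL}?{query}"


print(build_facsimile_url("8426", "03312021"))
# https://cdr.ffiec.gov/Public/ViewPDFFacsimile.aspx?ds=call&idType=fdiccert&id=8426&date=03312021
```

Passing the dictionary through params lets requests do this encoding for you, which is less error-prone when the values change between runs.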
