How to web-scrape an ASPX page that requires authentication

Posted 2024-09-26 04:52:14


Using Python's Requests library, I am trying to scrape an ASPX site (https://cei.bmfbovespa.com.br/CEI_Responsivo/home.aspx) that requires logging in first (https://cei.bmfbovespa.com.br/CEI_Responsivo/login.aspx).

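These are the steps I am trying to perform:

  1. Create a session with Requests to handle the cookies (is this the right way to do it?)
  2. Update the headers with everything Chrome DevTools shows under "Request Headers" (except the cookie information, since the session handles that)
  3. Do a GET on the login page to obtain the input values for the POST
  4. Do the POST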
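
Step 3 can be sketched as follows. This is a minimal illustration rather than the code from the question: it uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra dependencies, and `FormFieldCollector`, `collect_form_fields`, and the shortened `sample_html` page are hypothetical names and data made up for the example. The idea is to carry every named input of the login form into the POST data, so that hidden ASP.NET fields such as `__VIEWSTATE` are not left out by accident.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name -> value for every <input> tag that has a name attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<input ... />) also end up here by default.
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # An input without a value attribute is stored as an empty string.
            self.fields[name] = attributes.get("value") or ""


def collect_form_fields(html):
    """Return a dict of all named input fields found in the given HTML."""
    parser = FormFieldCollector()
    parser.feed(html)
    return parser.fields


# Hypothetical, shortened login page used only for illustration.
sample_html = """
<form method="post" action="login.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtLogin" />
  <input type="submit" name="ctl00$ContentPlaceHolder1$btnLogar" value="Entrar" />
</form>
"""

fields = collect_form_fields(sample_html)
fields["ctl00$ContentPlaceHolder1$txtLogin"] = "my-user"  # placeholder credential
```

With a real page, the HTML would come from the GET in step 3 (for example `r.text`) and the resulting dict would be passed as `data=` in step 4. The sketch assumes the form has a single submit button; a page with several would need the unclicked ones removed from the dict.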

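When I do this manually in Chrome, after a successful login I get a 302 response and am redirected to the home page. With Python, however, after the POST I get a 200 response and I am still on the login page.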

import requests
from bs4 import BeautifulSoup
from requests.packages.urllib3 import add_stderr_logger

add_stderr_logger()

s = requests.Session()

url_login = 'https://cei.bmfbovespa.com.br/CEI_Responsivo/login.aspx'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36',
    'Upgrade-Insecure-Requests':'1',
    'Host':'cei.bmfbovespa.com.br',
    'Connection':'keep-alive',
    'Accept-Language':'pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4',
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}
s.headers.update(headers)

r = s.get(url_login, verify=False)
soup = BeautifulSoup(r.content)

viewstate = soup.find(id="__VIEWSTATE")['value']
viewgen = soup.find(id="__VIEWSTATEGENERATOR")['value']
eventvalid = soup.find(id="__EVENTVALIDATION")['value']

login_data = {          
        '__VIEWSTATE' : viewstate,
        '__VIEWSTATEGENERATOR' : viewgen,
        '__EVENTVALIDATION' : eventvalid,
        'ctl00$ContentPlaceHolder1$txtLogin' : '*',
        'ctl00$ContentPlaceHolder1$txtSenha' : '*',
        'tl00$ContentPlaceHolder1$btnLogar': 'Entrar'
}

resp = s.post(url_login, data=login_data, verify=False)
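
One thing that may be worth checking in the payload above: the submit-button key starts with `tl00$`, while the two text-field keys start with `ctl00$`. Whether that is what causes the 200 response is not confirmed here. The sketch below is one way to spot such mismatches by comparing the payload keys against the names the form defines. `find_unknown_keys` is a hypothetical helper, and the list of form field names is an assumption (it simply follows the `ctl00$` prefix of the other keys), not something read from the real page.

```python
import difflib


def find_unknown_keys(form_field_names, payload):
    """Map each payload key the form does not define to its closest form name (or None)."""
    known = list(form_field_names)
    unknown = {}
    for key in payload:
        if key not in known:
            matches = difflib.get_close_matches(key, known, n=1)
            unknown[key] = matches[0] if matches else None
    return unknown


# Assumed field names; on the real page they would be read from the login form.
form_field_names = [
    "__VIEWSTATE",
    "__VIEWSTATEGENERATOR",
    "__EVENTVALIDATION",
    "ctl00$ContentPlaceHolder1$txtLogin",
    "ctl00$ContentPlaceHolder1$txtSenha",
    "ctl00$ContentPlaceHolder1$btnLogar",
]

# Same keys as login_data in the question, with placeholder values.
payload = {
    "__VIEWSTATE": "...",
    "__VIEWSTATEGENERATOR": "...",
    "__EVENTVALIDATION": "...",
    "ctl00$ContentPlaceHolder1$txtLogin": "*",
    "ctl00$ContentPlaceHolder1$txtSenha": "*",
    "tl00$ContentPlaceHolder1$btnLogar": "Entrar",
}

suspicious = find_unknown_keys(form_field_names, payload)
for key, suggestion in suspicious.items():
    print("not in form:", key, "| closest form field:", suggestion)
```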

If I then try a GET with the session anyway, I am redirected to the login page:

url_carteira = 'https://cei.bmfbovespa.com.br/CEI_Responsivo/home.aspx'
response = s.get(url_carteira, verify=False)

This is the output I get:

2016-02-11 22:07:07,476 INFO Starting new HTTPS connection (1): cei.bmfbovespa.com.br
2016-02-11 22:07:07,823 DEBUG "GET /CEI_Responsivo/login.aspx HTTP/1.1" 200 4522
2016-02-11 22:07:07,898 DEBUG "POST /CEI_Responsivo/login.aspx HTTP/1.1" 200 4534
C:\Users\luciano\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\connectionpool.py:791: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
  InsecureRequestWarning)
2016-02-11 22:07:10,470 DEBUG "GET /CEI_Responsivo/home.aspx HTTP/1.1" 302 147
2016-02-11 22:07:10,510 DEBUG "GET /CEI_Responsivo/login.aspx HTTP/1.1" 200 4522
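
Requests follows redirects by default, so the final response of a successful login would normally be the 200 for the home page, with the 302 recorded in `response.history`. A rough check along those lines is sketched below. `login_looks_successful` is a hypothetical helper, and the two stand-in objects only imitate the `status_code`, `url`, and `history` attributes of a real `requests.Response` so that the example runs without network access.

```python
from types import SimpleNamespace


def login_looks_successful(response, login_path="/login.aspx"):
    """Heuristic: a 302 happened along the way and we did not end up on the login page."""
    redirected = any(step.status_code == 302 for step in response.history)
    still_on_login = response.url.lower().endswith(login_path)
    return redirected and not still_on_login


base = "https://cei.bmfbovespa.com.br/CEI_Responsivo"

# Stand-in for the POST result in the log above: 200, no redirect, still on login.aspx.
failed = SimpleNamespace(status_code=200, url=base + "/login.aspx", history=[])

# Stand-in for what the browser shows: a 302 followed by the home page.
succeeded = SimpleNamespace(
    status_code=200,
    url=base + "/home.aspx",
    history=[SimpleNamespace(status_code=302)],
)

print(login_looks_successful(failed))
print(login_looks_successful(succeeded))
```

With the real session, the same function could be called on the result of `s.post(...)`. Passing `allow_redirects=False` to the request is another option if the raw 302 itself needs to be inspected.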

I am using Python 3.5.1.

Any idea why I cannot log in successfully and reach the home page?

