Python requests: server asks to enable cookies/JavaScript

Posted 2024-10-02 20:40:55


I am trying to download an Excel file from a specific website. On my local machine it works perfectly:

>>> r = requests.get('http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls')
>>> r.status_code
200
>>> r.content
b'\xd0\xcf\x11\xe0\xa1\xb1...\x00\x00' # Long binary string

But when I run the same request from a remote Ubuntu server, I get back a message about enabling cookies/JavaScript instead.

^{pr2}$

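Locally I am running on macOS with Chrome installed (my script does not use it, but perhaps it is relevant?). Remotely I am running Ubuntu on DigitalOcean, with no GUI browser installed.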


1 answer
User
#1 · Posted 2024-10-02 20:40:55

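The behavior of requests has nothing to do with the browsers installed on the system; it does not depend on them or interact with them in any way.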

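The problem here is that the resource you are requesting has some kind of "bot mitigation" mechanism enabled to block this sort of access. It returns some JavaScript containing logic that has to be evaluated, and the result of that logic is then sent with a follow-up request to "prove" that you are not a bot.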
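To tell the two kinds of response apart before parsing anything, a check along these lines may help. It is only a sketch: it assumes the challenge page contains the string `X-AA-Challenge` (the marker the code further down tests for) and that a real legacy .xls body starts with the OLE2 compound-file signature, whose first bytes match the `r.content` shown in the question. The name `classify_response` is made up for illustration.

```python
# OLE2 compound-file signature: legacy .xls files begin with these 8 bytes.
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'

# Assumption: the challenge page embeds this marker somewhere in its body.
CHALLENGE_MARKER = b'X-AA-Challenge'


def classify_response(content):
    """Classify a raw response body (bytes) as 'xls', 'challenge' or 'unknown'."""
    if content.startswith(OLE2_MAGIC):
        return 'xls'
    if CHALLENGE_MARKER in content:
        return 'challenge'
    return 'unknown'
```

Calling `classify_response(r.content)` on both machines would show whether the remote server is really receiving the challenge page rather than the file.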

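Fortunately, this particular mitigation mechanism appears to have been solved before, and I was able to get the request working quickly by reusing the challenge-solving functions from that code: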

from math import cos, pi, floor

import requests

URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls'


def parse_challenge(page):
    """
    Parse a challenge given by mmi and mavat's web servers, forcing us to solve
    some math stuff and send the result as a header to actually get the page.
    This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
    """
    top = page.split('<script>')[1].split('\n')
    challenge = top[1].split(';')[0].split('=')[1]
    challenge_id = top[2].split(';')[0].split('=')[1]
    return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


def get_challenge_answer(challenge):
    """
    Solve the math part of the challenge and get the result
    """
    arr = list(challenge)
    last_digit = int(arr[-1])
    arr.sort()
    min_digit = int(arr[0])
    subvar1 = (2 * int(arr[2])) + int(arr[1])
    subvar2 = str(2 * int(arr[2])) + arr[1]
    power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
    x = (int(challenge) * 3 + subvar1)
    y = cos(pi * subvar1)
    answer = x * y
    answer -= power
    answer += (min_digit - last_digit)
    answer = str(int(floor(answer))) + subvar2
    return answer


def main():
    s = requests.Session()
    r = s.get(URL)

    if 'X-AA-Challenge' in r.text:
        challenge = parse_challenge(r.text)
        r = s.get(URL, headers={
            'X-AA-Challenge': challenge['challenge'],
            'X-AA-Challenge-ID': challenge['challenge_id'],
            'X-AA-Challenge-Result': challenge['challenge_result']
        })

        yum = r.cookies
        r = s.get(URL, cookies=yum)

    print(r.content)


if __name__ == '__main__':
    main()
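
The script above only prints `r.content`. Since the goal in the question is to download the Excel file, the bytes would normally be written to disk in binary mode. A minimal sketch, where the helper name `save_binary` and the output file name are hypothetical:

```python
from pathlib import Path


def save_binary(content, path):
    """Write raw response bytes to `path` and return the number of bytes written."""
    path = Path(path)
    path.write_bytes(content)  # binary write, so the .xls data is not re-encoded
    return path.stat().st_size
```

In `main()` this could replace the final `print`, for example `save_binary(r.content, 'IWER01_2004.xls')`.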
