How can I use BeautifulSoup to get the HTML content of a protected, redirecting website?

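2 answers

User

Answer 1 · edited 2024-10-01 07:51:29

This site checks whether the request carries a Referer header pointing at their own site; if it does not, it responds with 403. You can get around this easily by setting a Referer header.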

import requests
ref='https://tmofans.com'
headers = { 'Referer': ref }
r = requests.get('https://tmofans.com/goto/347231',headers=headers)
print(r.url)
print(r.status_code)

Output:

[output not preserved in the source]
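To make the mechanism concrete, here is a minimal, self-contained sketch of a Referer check running against a toy local server. It is an illustration under stated assumptions, not the real site's logic: `RefererGate`, `fetch_status`, and `demo` are hypothetical names, and the client uses the standard library's `urllib.request` instead of `requests` so the snippet has no third-party dependencies.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RefererGate(BaseHTTPRequestHandler):
    """Toy stand-in for a protected site: 403 unless Referer is its own origin."""

    def do_GET(self):
        origin = "http://" + self.headers.get("Host", "")
        referer = self.headers.get("Referer", "")
        if referer.startswith(origin):
            status, body = 200, b"<html><body><h1>ok</h1></body></html>"
        else:
            status, body = 403, b"forbidden"
        self.send_response(status)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default per-request logging


# An opener with no proxies, so requests to 127.0.0.1 always go direct.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_status(url, headers=None):
    """GET the URL and return the HTTP status code, including error statuses."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with _opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def demo():
    """Request the same page without and with a matching Referer header."""
    server = HTTPServer(("127.0.0.1", 0), RefererGate)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        without = fetch_status(f"{base}/goto/1")
        with_referer = fetch_status(f"{base}/goto/1", {"Referer": base})
    finally:
        server.shutdown()
        server.server_close()
    return without, with_referer


if __name__ == "__main__":
    print(demo())
```

Once a request like this succeeds against the real site, the response body can be handed to BeautifulSoup for parsing in the usual way.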

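User

Answer 2 · edited 2024-10-01 07:51:29

I once managed to scrape some protected pages using http.client and my browser.

I first navigated to the page I needed to access, then used the browser's developer tools to copy the request headers and used them in my script. That way, your script accesses the resource the same way the browser does.

The two functions below can help you: the first parses a raw HTTP request to extract the headers (the request line and body may also be useful, depending on your case), and the second downloads the file.

You may need to tweak them to get them working for your case.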

import re
from http.client import HTTPSConnection

def parse_headers(http_post):
    """Converts a header string to a dictionary of its attributes."""

    # Regex to extract headers
    req_line = re.compile(r'(?P<method>GET|POST)\s+(?P<resource>.+?)\s+(?P<version>HTTP/1.1)')
    field_line = re.compile(r'\s*(?P<key>.+\S)\s*:\s+(?P<value>.+\S)\s*')

    first_line_end = http_post.find('\n')
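    headers_end = http_post.find('\n\n')
    if headers_end == -1:
        # No blank line found: treat everything after the request line as headers
        headers_end = len(http_post)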
    request = req_line.match(http_post[:first_line_end]).groupdict()
    headers = dict(field_line.findall(http_post[first_line_end:headers_end]))
    body = http_post[headers_end + 2:]

    return request, headers, body


def get_file(url, domain, headers, temp_directory):
    """
    Fetches the file located at the provided URL and returns the content.
    Uses `headers` to bypass auth.
    """
    conn = HTTPSConnection(domain)
    conn.request('GET', url, headers=headers)
    response = conn.getresponse()
    content_type = response.getheader('Content-Type')
    content_disp = response.getheader('Content-Disposition')

    # Change to whatever content type you need
    if content_type != 'application/pdf':
        conn.close()
        return
    else:
        file_content = response.read()
        conn.close()
        return file_content
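One thing to watch in `get_file`: the strict comparison against `'application/pdf'` rejects a response whose Content-Type header carries parameters (for example a `charset`) or uses different letter casing. A small helper along these lines could normalize the value before comparing. `media_type` is a hypothetical name and is not part of the original answer.

```python
def media_type(content_type):
    """Return the bare media type, lowercased, without parameters such as charset."""
    if not content_type:
        # getheader() returns None when the header is absent
        return ""
    return content_type.split(";", 1)[0].strip().lower()


if __name__ == "__main__":
    print(media_type("Application/PDF; charset=binary"))
```

With this in place, the check in `get_file` could read `if media_type(content_type) != 'application/pdf':`.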

The header string should look like this:

[sample not preserved in the source]

It may vary from site to site, but using these functions let me download files that sit behind a login.
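As a rough illustration of the kind of input `parse_headers` expects, the sketch below runs it on a made-up raw request. The host, path, and cookie value are placeholders, not values from the original answer. The function is repeated here, with `import re` added, only so that the snippet runs on its own.

```python
import re


def parse_headers(http_post):
    """Split a raw HTTP request into request line, headers dict, and body."""
    req_line = re.compile(
        r'(?P<method>GET|POST)\s+(?P<resource>.+?)\s+(?P<version>HTTP/1.1)')
    field_line = re.compile(r'\s*(?P<key>.+\S)\s*:\s+(?P<value>.+\S)\s*')

    first_line_end = http_post.find('\n')
    headers_end = http_post.find('\n\n')
    request = req_line.match(http_post[:first_line_end]).groupdict()
    headers = dict(field_line.findall(http_post[first_line_end:headers_end]))
    body = http_post[headers_end + 2:]
    return request, headers, body


# Hypothetical request text, shaped like what a browser's developer tools
# show for a request: a request line, one header per line, then a blank line.
RAW_REQUEST = (
    "GET /files/report.pdf HTTP/1.1\n"
    "Host: example.com\n"
    "User-Agent: Mozilla/5.0\n"
    "Cookie: sessionid=abc123\n"
    "\n"
)

request, headers, body = parse_headers(RAW_REQUEST)

if __name__ == "__main__":
    print(request)
    print(headers)
```

The resulting `headers` dictionary is what would be passed to `get_file`, together with `request['resource']` as the URL path and the `Host` value as the domain.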
