GET request fails with Python

Posted 2024-10-03 21:27:00


I am reading "The Ultimate Guide to Web Scraping".

The code for running a first HTTP GET request is as follows:

import requests 
url = "https://scrapethissite.com/pages/simple/" 
r = requests.get(url) 
print("We got a {} response code from {}".format(r.status_code, url))

I received this error message:

HTTPSConnectionPool(host='scrapethissite.com', port=443): Max retries exceeded with url: /pages/simple/ (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)')))

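I understand that my request doesn't go to the right port. Is it related to the website using HTTPS (vs HTTP) as its protocol? I am not sure, but that seems to be part of the problem.

I am using Python 3.8 on PyCharm. My SSL version is: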

OpenSSL 1.1.1g 21 Apr 2020

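I am a beginner at web scraping. That is why I chose to run a different piece of code for my HTTP GET request, one that lets me choose the port and protocol myself (source: https://pythonprogramming.net/python-sockets/):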

import socket
import ssl    

context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()

server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)

request = "GET / HTTP/1.1\nHost: "+server+"\n\n"

s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)

while (len(result) > 0):
    print(result)
    result = s.recv(4096)

I got an HTTP 200 OK status response, so it works fine. I get the following output in the PyCharm terminal:

b'HTTP/1.1 200 OK\r\nDate: Tue, 12 Jan 2021 14:59:35 GMT\r\nContent-Type: text/html; charset=utf-8\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nSet-Cookie: __cfduid=d205b0b8e8ce061174412767189bf10b41610463575; expires=Thu, 11-Feb-21 14:59:35 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax\r\nCF-Cache-Status: DYNAMIC\r\ncf-request-id: 0798b515a60000ea04f707d000000001\r\nExpect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"\r\nReport-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=%2FROG7Z2JWZJBMeVNn1IgnJh2TZsqJCi9TJOL3zau98btlLo1nPg4WhGlmOz2SZ6PRep6%2BKZfv0M81fqKOw1l6%2BRbc5M9dErdtyeTsei9Ee%2F2jc0%3D"}],"group":"cf-nel","max_age":604800}\r\nNEL: {"report_to":"cf-nel","max_age":604800}\r\nServer: cloudflare\r\nCF-RAY: 6107be029e27ea04-IAD\r\n\r\n1fb5\r\n<!doctype html>\n\n \n \n

Scrape This Site | A public sandbox for learning web scraping\n \n\n \n \n\n \n \n \n\n \n\n \n\n \n \n \n \n \n \n \n \n Scrape This Site\n \n \n \n \n \n Sandbox\n \n \n \n \n \n Lesson' b's\n \n \n \n \n \n FAQ\n \n \n \n \n \n Login\n \n \n \n \n \n \n \n\n \n var path = document.location.pathname;\n var tab = undefined;\n if (path === "/"){\n tab = document.querySelector("#nav-homepage");\n } else if (path.indexOf("/faq/") === 0){\n tab = document.querySelector("#nav-faq");\n } else if (path.indexOf("/lessons/") === 0){\n tab = document.querySelector("#nav-lessons");\n } else if (path.indexOf("/pages/") === 0) {\n tab = document.querySelector("#nav-sandbox");\n } else if (path.indexOf("/login/") === 0) {\n tab = do' b'cument.querySelector("#nav-login");\n }\n tab.classList.add("active")\n \n\n \n\n \n\n \n \n \n \n \n

\n Scrape This Site\n

\n \n The internet\'s best resource for learning web scraping.\n \n


\n Explore Sandbox\n \n \n Begin Lessons →\n \n \n \n \n \n\n \n\n\n \n \n \n \n Lessons and Videos © Hartley Bro' b'dy 2018\n \n \n \n \n \n\n \n \n\n \n\n \n \n\n \n \n \n PNotify.prototype.options.styling = "bootstrap3";\n $(function(){\n \n });\n \n\n $(function () {\n $(\'[data-toggle="tooltip"]\').tooltip()\n })\n \n\n \n \n $("video").hover(function() {\n $(this).prop("controls", true);\n }, function() {\n $(this).prop("controls", false);\n });\n\n $("video").click(function() {\n if( this.paused){\n this.play();\n }\n else {\n this.pause();\n }\n });\n \n\n \n \n (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n })(window,document,\'script\',\'https://www.google-analytics.com/analytics.js\',\'ga\');\n\n ga(\'create\', \'UA-41551755-8\', \'auto\');\n ga(\'send\', \'pageview\');\n \n\n \n \n !function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod?\n n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n;\n n.push=n;n.loaded=!0;n.version=\'2.0\';n.queue=[];t=b.createElement(e);t.async=!0;\n t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window,\n document,\'script\',\'https://connect.facebook.net/en_US/fbevents.js\');\n\n fbq(\'init\', \'764287443701341\');\n fbq(\'track\', "PageView");\n \n \n\n \n \n /* */\n \n \n \n \n \n \n \n \n\n \n \n \n window.dataLayer = window.dataLayer || [];\n function gtag(){dataLayer.push(arguments);}\n gtag(\'js\', new Date());\n\n gtag(\'config\', \'AW-950945448\');\n \n\n\r\n' b'0\r\n\r\n'

The only problem is that I want to scrape this page:

https://scrapethissite.com/pages/simple/

rather than:

https://scrapethissite.com

When I replace

server = 'scrapethissite.com'

with:

server = 'scrapethissite.com/pages/simple/'

in the previous code, I receive the following new error message:

socket.gaierror: [Errno 11001] getaddrinfo failed

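My understanding is that the problem is related to a proxy. It helps to know that the problem may relate to ports, sockets, proxies, and so on, but I am not sure how to fix the code, because it works for one page and not for the other.

Any help is much appreciated. Thank you all!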


Following OneCricketeer's reply, the code is now:

import socket
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()

server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)

request = "GET /pages/simple HTTP/1.1\nHost: "+server+"\n\n"

s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)

while (len(result) > 0):
    print(result)
    result = s.recv(4096)

I get an HTTP 301 Moved Permanently status response:

b'HTTP/1.1 301 MOVED PERMANENTLY\r\nDate: Tue, 12 Jan 2021 15:34:15 GMT\r\nContent-Type: text/html; charset=utf-8\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nSet-Cookie: __cfduid=d6e32136f617c0b90e7f92a3e391c159f1610465655; expires=Thu, 11-Feb-21 15:34:15 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax\r\nLocation: https://scrapethissite.com/pages/simple/\r\nCF-Cache-Status: DYNAMIC\r\ncf-request-id: 0798d4d0d700002550fc1c3000000001\r\nExpect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"\r\nReport-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=2moOTvTDPvS65D6d0LvsiZTLDqYcv8OFZvtunIQDq6H%2FKLucm1LOOlMABcnCUjUO9fK4bwd%2BVDiescQ0NyHbu3DxhTCkOUHTvMcilkM%2BdcZnz3A%3D"}],"group":"cf-nel","max_age":604800}\r\nNEL: {"report_to":"cf-nel","max_age":604800}\r\nServer: cloudflare\r\nCF-RAY: 6107f0c7bb432550-IAD\r\n\r\n11f\r\n\nRedirecting...\n

Redirecting...

\n

You should be redirected automatically to target URL: https://scrapethissite.com/pages/simple/. If not click the link.\r\n' b'0\r\n\r\n'

Am I missing something?
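The 301 response above already names the target in its Location header. As a rough sketch of how a raw socket client could read that header, the hypothetical helper parse_response_head below (not part of the original code) splits the status line and headers out of the received bytes. The sample is an abbreviated copy of the response shown above.

```python
def parse_response_head(raw: bytes):
    """Return (status_code, headers) parsed from the start of a raw HTTP response."""
    # Everything before the first blank line is the status line plus headers.
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like: HTTP/1.1 301 MOVED PERMANENTLY
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Simplification: a repeated header (e.g. Set-Cookie) overwrites the earlier one.
        headers[name.strip().lower()] = value.strip()
    return status_code, headers


# Abbreviated copy of the 301 response from the question.
sample = (
    b"HTTP/1.1 301 MOVED PERMANENTLY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Location: https://scrapethissite.com/pages/simple/\r\n"
    b"Server: cloudflare\r\n"
    b"\r\n"
    b"11f\r\n"
)

status, headers = parse_response_head(sample)
if status in (301, 302, 307, 308):
    print("Redirected to", headers["location"])
```

Here the redirect target differs from the requested path only by the trailing slash.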


1 answer
User
Reply #1 · posted 2024-10-03 21:27:00

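"I am using Python 3.8 on PyCharm"

Judging from your print usage, you might actually be running Python 2, although the parenthesized print() calls and the b'...' output shown in the question are what Python 3 produces.

In any case, this workaround may work for the requests approach. Note that verify=False turns off certificate verification, so it is only suitable for testing: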

import requests 
url = "https://scrapethissite.com/pages/simple/" 
r = requests.get(url, verify=False) 

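If you want to use the socket approach, change GET / to GET /pages/simple and keep server as the bare domain name. The path belongs in the request line, not in the host name passed to the socket. The 301 response in the question points to /pages/simple/ in its Location header, so requesting the path with the trailing slash avoids the redirect.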
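The getaddrinfo failure in the question happens because only a host name can be resolved, and 'scrapethissite.com/pages/simple/' also contains a path. A minimal sketch of the split, assuming the standard library's urllib.parse is acceptable here: build_get_request is a hypothetical helper name, and the request uses CRLF line endings as HTTP/1.1 specifies (the original used bare "\n", which this server happened to tolerate).

```python
from urllib.parse import urlsplit


def build_get_request(url: str):
    """Return (host, port, request_bytes) for a simple HTTP/1.1 GET."""
    parts = urlsplit(url)
    host = parts.hostname  # only this part may be passed to connect()
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"  # the path goes in the request line instead
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"  # ask the server to close, so recv() eventually returns b""
        "\r\n"
    )
    return host, port, request.encode("ascii")


host, port, request = build_get_request("https://scrapethissite.com/pages/simple/")
print(host, port)
print(request)
```

The returned host and port can then be used in s.connect((host, port)) and as server_hostname, with the request bytes sent unchanged.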

"I understand that my request doesn't go to the right port."

443 is the correct HTTPS port. The error says the SSL version is incorrect.
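The socket code in the question pins ssl.PROTOCOL_TLSv1 (TLS 1.0), which is deprecated. A hedged sketch of a more current setup follows, assuming Python 3.7+ and using ssl.create_default_context(), which turns on certificate and host name checks and lets client and server negotiate the protocol version. The helper names make_tls_context and negotiated_tls_version are made up for illustration. WRONG_VERSION_NUMBER often appears when whatever answers on the port is not speaking TLS at all, for example an intercepting proxy, but nothing in the question confirms that this is the cause here.

```python
import socket
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Build a client context that verifies certificates and host names."""
    context = ssl.create_default_context()
    # Refuse anything older than TLS 1.2; newer versions are negotiated automatically.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def negotiated_tls_version(host: str, port: int = 443, timeout: float = 10.0):
    """Connect to host:port and return the TLS version string that was negotiated."""
    context = make_tls_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.version()


context = make_tls_context()
print(context.check_hostname, context.verify_mode)
# The call below needs network access, so it is left commented out:
# print(negotiated_tls_version("scrapethissite.com"))
```

If the commented call fails with the same SSLError, the problem is likely on the network path rather than in the request code.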
