urllib.error.HTTPError: HTTP Error 405: Not Allowed in Python 3.x: how do I get past bot detection?

from urllib.request import urlopen
from bs4 import BeautifulSoup
import urllib.request
import re
import numpy as np

# Opening the Builder website
html = "http://www.builderonline.com"
req = urllib.request.Request(html, headers={'User-Agent': "Mozilla/5.0"})
soup = BeautifulSoup(urlopen(req).read(), "html.parser")
print("end")

Error Messages:

Traceback (most recent call last):
  File "test3.py", line 9, in <module>
    soup = BeautifulSoup(urlopen(req).read(),"html.parser")
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 405: Not Allowed

2 Answers

User

Answer 1 · edited 2024-09-30 12:23:26

This page is protected by a CAPTCHA and by a check for clients that do not run JavaScript. Try the following code:

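import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it (the 'lxml' parser requires the lxml package)
request_page = requests.get('http://www.builderonline.com')
soup = BeautifulSoup(request_page.text, 'lxml')

# Print the text of every <li> element
for i in soup.find_all('li'):
    print(i.text)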

If you want to scrape data from the site, I recommend using Selenium together with PhantomJS (a headless browser).
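A minimal sketch of the headless-browser approach follows. It swaps in headless Chrome for PhantomJS, because PhantomJS is no longer maintained and Selenium 4 dropped its driver. The sketch assumes the `selenium` package and a Chrome installation are available; `fetch_rendered_html` is a hypothetical helper name, not part of any library.

```python
def fetch_rendered_html(url):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so this sketch can be loaded even where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # the DOM serialized after JavaScript ran
    finally:
        driver.quit()  # always release the browser process


# Usage (needs network access, Selenium and Chrome):
#   html = fetch_rendered_html("http://www.builderonline.com")
#   soup = BeautifulSoup(html, "html.parser")
```

Because a real browser executes the page's JavaScript, this may get past checks that reject plain HTTP clients, though it will not solve a CAPTCHA.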

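Regarding error 405, a client communicating with a web server goes through these steps:

Open an IP socket connection to the server's IP address.
Write an HTTP data stream through that socket.
Receive an HTTP data stream back from the web server in response. This data stream contains a status code whose value is determined by the HTTP protocol.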

This error occurs in the final step above when the client receives an HTTP status code that it recognises as '405'.
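To see that status code from Python, the `HTTPError` raised by `urlopen` can be caught and inspected instead of being left to crash the script. The sketch below is only an illustration: `PickyHandler` is a hypothetical local stand-in for a site that rejects non-browser clients (the real site's rules are unknown), and `fetch` is a helper name invented here.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that rejects non-browser clients."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if agent.startswith("Mozilla/5.0 ("):
            status, body = 200, b"<html><body>ok</body></html>"
        else:
            status, body = 405, b"Not Allowed"
        self.send_response(status)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(url, headers=None):
    """Return (status_code, body), reporting HTTP errors instead of raising."""
    request = urllib.request.Request(url, headers=headers or {})
    # An empty ProxyHandler bypasses any configured proxy for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object: code, reason, headers, read().
        return err.code, err.read()


server = HTTPServer(("127.0.0.1", 0), PickyHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = "http://127.0.0.1:%d/" % server.server_address[1]

bare = fetch(base_url)  # urllib's default User-Agent
browser_like = fetch(base_url, {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) "
                  "Gecko/20100101 Firefox/53.0",
})
print(bare[0], browser_like[0])

server.shutdown()
server.server_close()
```

Against a real site, the same `fetch` helper shows which status the server chose, and `err.headers` can hint at why the request was refused.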

Great tutorial HERE

User
Answer 2 · edited 2024-09-30 12:23:26

Using requests and BeautifulSoup, I was able to scrape the list tags easily:
>>> import requests
>>> from pprint import pprint  # for readability
>>> from bs4 import BeautifulSoup as BS
>>> headers = {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0"}
>>> response = requests.get('http://www.builderonline.com', headers=headers)
>>> soup = BS(response.text, 'lxml')
And the output (using pprint): [output not preserved in the source]
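Maybe it has to do with how your headers are formatted. The site may be set up to check for malformed or incomplete headers. Try opening https://httpbin.org/headers in your browser and using the User-Agent value listed there in your script.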
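The same check can be run from the script side, to compare what urllib sends by default with what it sends once browser-like headers are set. This is a sketch under assumptions: `EchoHeadersHandler` is a hypothetical local stand-in for https://httpbin.org/headers so the example runs offline, and the `Accept` and `Accept-Language` values are illustrative, not taken from the thread.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Local stand-in for httpbin's /headers: echoes request headers as JSON."""

    def do_GET(self):
        # Lower-case the names so lookups do not depend on the client's casing.
        received = {name.lower(): value for name, value in self.headers.items()}
        body = json.dumps({"headers": received}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def sent_headers(url, headers=None):
    """Return the headers the server actually received for this request."""
    request = urllib.request.Request(url, headers=headers or {})
    # An empty ProxyHandler bypasses any configured proxy for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))["headers"]


# User-Agent copied from the answer above; the other two values are assumptions.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) "
                  "Gecko/20100101 Firefox/53.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

server = HTTPServer(("127.0.0.1", 0), EchoHeadersHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
echo_url = "http://127.0.0.1:%d/headers" % server.server_address[1]

default_seen = sent_headers(echo_url)
custom_seen = sent_headers(echo_url, BROWSER_HEADERS)
print("default:", default_seen["user-agent"])
print("custom: ", custom_seen["user-agent"])

server.shutdown()
server.server_close()
```

Pointing `sent_headers` at https://httpbin.org/headers instead of the local server should give a comparable view, with the caveat that httpbin preserves header casing rather than lower-casing the names.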

