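I have recently been trying to build a web scraper with Python's requests module.
At first it worked, but then I started getting a 403 response. When I went back and tested the sites I had already scraped, I got a 200 response. Does anyone know why this happens?
In the code below, I get a 200 response for collect_omers and collect_real, but a 403 response for collect_bdc. Thanks.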
import requests, bs4


def collect_omers():
    acquired_list = []
    logo_list = []
    omers_html = requests.get('https://www.omersventures.com/portfolio-summary')
    print(omers_html)
    omers_soup = bs4.BeautifulSoup(omers_html.text, "html.parser")
    omers_tags = omers_soup.select('.field-content a')
    for logo in omers_tags:
        if "portfolio" in str(logo) and logo.get_text() != "":
            if "acquired" in logo.get_text().lower():
                acquired_list.append(logo.get_text())
            else:
                logo_list.append(logo.get_text())


def collect_real():
    acquired_list = []
    logo_list = []
    real_html = requests.get('https://realventures.com/backing/')
    print(real_html)
    real_soup = bs4.BeautifulSoup(real_html.text, "html.parser")
    real_tags = real_soup.select('.company-list__grid-item')
    count = 1
    for logo in real_tags:
        listed = logo.get_text().strip().split("\n")
        if len(listed) > 3:
            acquired_list.append(listed[0].strip() + " " + "(" + listed[3] + ")")
        else:
            logo_list.append(listed[0].strip())


def collect_bdc():
    acquired_list = []
    logo_list = []
    bdc_html = requests.get('https://www.inovia.vc/portfolio/')
    print(bdc_html)
    bdc_soup = bs4.BeautifulSoup(bdc_html.text, "html.parser")
    bdc_tags = bdc_soup.select('.row')
    count = 1
    for logo in bdc_tags:
        print(logo.get_text())


collect_real()
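A 200 response is good: it means your request went through and a response came back.
The 403 response from the third site does mean something went wrong. From a quick look, the third site seems to automatically reject GET requests that do not carry a browser-like User-Agent header (by default, requests identifies itself as python-requests). To find your own User-Agent header, press F12 in Chrome, click the Network tab, navigate to the site, and click the corresponding request in the list. The User-Agent header is listed under the "Request Headers" section. You must supply this header through the headers keyword argument of requests.get().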
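The advice above could be sketched roughly as follows. This is a minimal illustration, not the answerer's original snippet: the User-Agent string is only an example value to be replaced with the one from your own browser, and `HEADERS` and `fetch_html` are hypothetical names introduced here. Whether the site accepts the request still depends on its current policy.

```python
import requests

# Example value only (an assumption): replace it with the User-Agent string
# shown under "Request Headers" in your own browser's Network tab.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}


def fetch_html(url, headers=HEADERS, timeout=10):
    """Hypothetical helper: GET a page while sending an explicit User-Agent."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses such as 403,
    # instead of silently parsing an error page.
    response.raise_for_status()
    return response.text
```

In collect_bdc, the same idea amounts to calling requests.get('https://www.inovia.vc/portfolio/', headers=HEADERS) instead of the bare requests.get(...) call.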