Trying to scrape Manta but getting nothing

import requests
from bs4 import BeautifulSoup

# Get the different pages to begin scraping data from
url = "http://www.manta.com/mb_41_ALL_19/louisiana"

headers = {
    'Origin': 'http://www.manta.com',
    'Referer': 'http://www.manta.com/mb_41_ALL_19/louisiana',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.8',
    'Content-Type': 'text/html; charset=utf-8',
    'Host': None,
}

newurl = requests.get(url, headers=headers)
soup = BeautifulSoup(newurl.text, "html.parser")
print(soup)

1 answer

User

#1 · Posted 2024-09-27 04:13:05

The bad news is in the body of the response you get back. Look at what it contains:

<div id="distil_ident_block"></div>

The "distil" prefix is the hallmark of the Distil Networks anti-web-scraping service. Manta has its reasons for using it. Quoting the "Terms of Service":

We give you a limited right to access and use Manta. You are not authorized to access Manta or its computers, servers and databases to scrape or “data mine” our data.

Technically, you could try to get past Distil, but legally you should not.
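To confirm that an empty-looking result is the block page rather than a parsing mistake, the response body can be checked for the marker element. The following is only a minimal sketch: the helper name is_distil_block_page and the two sample HTML strings are made up for illustration, and it assumes the marker id stays exactly distil_ident_block as in the body quoted in this answer. It uses only the standard library and makes no network requests.

```python
from html.parser import HTMLParser

# Marker id taken from the response body quoted in the answer.
DISTIL_MARKER_ID = "distil_ident_block"


class _IdCollector(HTMLParser):
    """Collects every non-empty id attribute seen in the document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def is_distil_block_page(html):
    """Return True if the HTML contains the Distil marker element.

    Hypothetical helper, not part of requests or BeautifulSoup.
    """
    collector = _IdCollector()
    collector.feed(html)
    return DISTIL_MARKER_ID in collector.ids


# Made-up sample bodies for illustration only.
blocked = '<html><body><div id="distil_ident_block"></div></body></html>'
normal = '<html><body><div id="listing">Acme Inc.</div></body></html>'

print(is_distil_block_page(blocked))  # True
print(is_distil_block_page(normal))   # False
```

In the script from the question, the same check could be applied to newurl.text before parsing, so the script reports that it was blocked instead of printing a page with no listings.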
