Why check your URLs by hand? In Python 3 you can use `urllib.robotparser` and do something like this:
```
import urllib.error
import urllib.request
import urllib.robotparser as urobot
from bs4 import BeautifulSoup

url = "http://example.com"
rp = urobot.RobotFileParser()
rp.set_url(url + "/robots.txt")
rp.read()
if rp.can_fetch("*", url):
    site = urllib.request.urlopen(url)
    sauce = site.read()
    soup = BeautifulSoup(sauce, "html.parser")
    # base URL of the page we actually landed on (after any redirects)
    actual_url = site.geturl()[:site.geturl().rfind("/")]
    my_list = soup.find_all("a", href=True)
    for i in my_list:
        # rather than testing for "#" here, you could filter the list before looping over it
        if i["href"] != "#":
            newurl = actual_url + "/" + i["href"]
            try:
                if rp.can_fetch("*", newurl):
                    site = urllib.request.urlopen(newurl)
                    # do what you want with each authorized page
            except urllib.error.URLError:
                pass
else:
    print("cannot scrape")
```
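
If you only need the robots.txt check itself, here is a minimal sketch (the base URL and path are hypothetical placeholders; since Python 3.6, `RobotFileParser` can also report a site's `Crawl-delay`, returning `None` when none is declared):

```
import urllib.parse
import urllib.robotparser

base = "https://example.com"  # hypothetical target site

rp = urllib.robotparser.RobotFileParser()
rp.set_url(urllib.parse.urljoin(base, "/robots.txt"))
rp.read()

# can_fetch() tests a user-agent string against the parsed rules
print(rp.can_fetch("*", urllib.parse.urljoin(base, "/some/page")))

# honor the site's requested delay between requests, if it declares one
delay = rp.crawl_delay("*")
if delay is not None:
    print(f"robots.txt asks crawlers to wait {delay}s between requests")
```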