回答此问题可获得 20 贡献值,回答如果被采纳可获得 50 分。
<p>我试图实现一个简单的网络爬虫,我已经写了一个简单的代码开始:有两个模块<strong>抓取器.py</strong>和<strong>爬虫.py</strong>。以下是文件:</p>
<p>在抓取器.py公司名称:</p>
<pre><code> import urllib2
import re
def fetcher(s):
"fetch a web page from a url"
try:
req = urllib2.Request(s)
urlResponse = urllib2.urlopen(req).read()
except urllib2.URLError as e:
print e.reason
return
p,q = s.split("//")
d = q.split("/")
fdes = open(d[0],"w+")
fdes.write(str(urlResponse))
fdes.seek(0)
return fdes
if __name__ == "__main__":
defaultSeed = "http://www.python.org"
print fetcher(defaultSeed)
</code></pre>
<p>在爬虫.py以下内容:</p>
^{pr2}$
<p>问题是当我跑<strong>爬虫.py</strong>对于前4-5个链接,它可以正常工作,然后会挂起,一分钟后会出现以下错误:</p>
<pre><code>[Errno 110] Connection timed out
Traceback (most recent call last):
File "crawler.py", line 37, in <module>
crawler("http://www.python.org/",7)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 33, in crawler
newSeed = parse(fdes,newLinks.tell())
File "crawler.py", line 11, in parse
soup = BeautifulSoup(fd)
File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
self.builder.prepare_markup(markup, from_encoding))
File "/usr/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
dammit = UnicodeDammit(markup, try_encodings, is_html=True)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 191, in __init__
self._detectEncoding(markup, is_html)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 362, in _detectEncoding
xml_encoding_match = xml_encoding_re.match(xml_data)
TypeError: expected string or buffer
</code></pre>
<p>有谁能帮我这个忙吗?我对python很陌生,我不知道为什么在一段时间后连接超时?在</p>