Hi everyone! I wrote a small web crawler function, but I am new to multithreading and don't know how to optimize it. My code is:
import hashlib

alreadySeenURLs = dict()  # maps URL fingerprint -> URL for every URL already crawled
candidates = set()  # the set of URL candidates to crawl

def initializeCandidates(url):
    # gets page with urllib2
    page = getPage(url)
    # parses page with BeautifulSoup
    parsedPage = getParsedPage(page)
    # function which returns all links from the parsed page as a set
    initialURLsFromRoot = getLinksFromParsedPage(parsedPage)
    return initialURLsFromRoot

def updateCandidates(oldCandidates, newCandidates):
    return oldCandidates.union(newCandidates)

candidates = initializeCandidates(rootURL)
# pop URLs one at a time: "for url in candidates" would only ever visit the
# initial set, because rebinding candidates inside the loop does not change
# the set the for loop is already walking
while candidates:
    url = candidates.pop()
    print(len(candidates))
    # fingerprint of URL
    fp = hashlib.sha1(url).hexdigest()
    # checking whether url is in alreadySeenURLs
    if fp in alreadySeenURLs:
        continue
    alreadySeenURLs[fp] = url
    # do some processing
    print(url)
    page = getPage(url)
    parsedPage = getParsedPage(page, fix=True)
    newCandidates = getLinksFromParsedPage(parsedPage)
    candidates = updateCandidates(candidates, newCandidates)

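As you can see, it takes only one URL from candidates at a time. I want to make this script multithreaded so that it can take at least N URLs from candidates at once and process them in parallel. Can anyone guide me? Any links or suggestions?
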
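You can start with these two links:
The basic Python threading reference: http://docs.python.org/library/threading.html
A Python tutorial that implements a multithreaded URL fetcher: http://www.ibm.com/developerworks/aix/library/au-threadingpython/
Also, there is already a crawler framework for Python: http://scrapy.org/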
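
To make the worker-thread idea from those links concrete, here is a minimal sketch rather than a definitive implementation. It assumes Python 3 (the module is called Queue in Python 2), and it makes up two names: crawl, and a fetchLinks callable that stands in for your getPage + getParsedPage + getLinksFromParsedPage chain. It also keeps seen URLs in a plain set instead of SHA-1 fingerprints, purely to keep the example short.

```python
import queue  # this module is named Queue in Python 2
import threading


def crawl(rootURL, fetchLinks, numThreads=4):
    """Crawl outward from rootURL with numThreads worker threads.

    fetchLinks(url) is a hypothetical callable returning the set of links
    found on that page; it is not part of any library.
    """
    alreadySeenURLs = {rootURL}
    seenLock = threading.Lock()  # guards alreadySeenURLs
    candidates = queue.Queue()   # thread-safe queue of URLs still to crawl
    candidates.put(rootURL)

    def worker():
        while True:
            url = candidates.get()
            if url is None:  # sentinel: no more work for this thread
                candidates.task_done()
                return
            try:
                newCandidates = fetchLinks(url)
            except Exception:
                newCandidates = set()  # skip pages that fail to download or parse
            for link in newCandidates:
                with seenLock:
                    if link in alreadySeenURLs:
                        continue
                    alreadySeenURLs.add(link)
                candidates.put(link)
            # mark this URL done only after its links have been queued
            candidates.task_done()

    threads = [threading.Thread(target=worker) for _ in range(numThreads)]
    for t in threads:
        t.start()
    candidates.join()  # returns once every queued URL has been processed
    for _ in threads:
        candidates.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return alreadySeenURLs


# Usage with an in-memory stand-in for real pages (no network access).
fakeSite = {
    "http://example.com/": {"http://example.com/a", "http://example.com/b"},
    "http://example.com/a": {"http://example.com/b", "http://example.com/c"},
    "http://example.com/b": {"http://example.com/"},
    "http://example.com/c": set(),
}
seen = crawl("http://example.com/", lambda url: fakeSite.get(url, set()), numThreads=3)
print(len(seen))
```

The queue replaces the candidates set, so N threads can each pull a URL without stepping on one another, and the lock makes the "have we seen this?" check and the insert a single atomic step. For a real crawl you would probably also want a page limit and some politeness delay, which this sketch leaves out.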