Optimizing a Python script with multithreading

Posted 2024-10-01 09:17:12


Hi everyone! I wrote a small web crawler function, but I am new to multithreading and don't know how to optimize it. My code is:

import hashlib

alreadySeenURLs = dict()  # fingerprint -> URL for every URL already crawled
candidates = set()  # the set of URL candidates to crawl

def initializeCandidates(url):
    # fetch the page with urllib2
    page = getPage(url)

    # parse the page with BeautifulSoup
    parsedPage = getParsedPage(page)

    # collect all links from the parsed page as a set
    initialURLsFromRoot = getLinksFromParsedPage(parsedPage)

    return initialURLsFromRoot

def updateCandidates(oldCandidates, newCandidates):
    return oldCandidates.union(newCandidates)

candidates = initializeCandidates(rootURL)

# take URLs one at a time; popping from the live set means links
# found during the crawl are visited too
while candidates:
    url = candidates.pop()

    print(len(candidates))

    # fingerprint of the URL (sha1 expects bytes, so encode first)
    fp = hashlib.sha1(url.encode("utf-8")).hexdigest()

    # skip URLs that were already crawled
    if fp in alreadySeenURLs:
        continue

    alreadySeenURLs[fp] = url

    # do some processing
    print(url)

    page = getPage(url)
    parsedPage = getParsedPage(page, fix=True)
    newCandidates = getLinksFromParsedPage(parsedPage)

    candidates = updateCandidates(candidates, newCandidates)

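As you can see, it takes one URL from candidates at a time. I want to make this script multithreaded so that it takes at least N URLs from candidates at once and does the work on them in parallel. Can anyone guide me? Any links or suggestions?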


1 Answer

User

#1 · Posted 2024-10-01 09:17:12

You can start with the following two links:

Basic Python threading reference: http://docs.python.org/library/threading.html
A Python tutorial that implements multithreaded URL fetching: http://www.ibm.com/developerworks/aix/library/au-threadingpython/
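
To make the threading approach from those links more concrete, here is a minimal sketch (not the asker's code, and not taken from either link). N worker threads pull URLs from a thread-safe `queue.Queue`, and a lock protects the dictionary of seen fingerprints. The names `FAKE_WEB`, `fetch_links`, and `crawl` are hypothetical; `fetch_links` stands in for the question's `getPage`, `getParsedPage`, and `getLinksFromParsedPage`, and reads from an in-memory link graph so the example runs without network access.

```python
import hashlib
import queue
import threading

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": {"http://example.com/a", "http://example.com/b"},
    "http://example.com/a": {"http://example.com/b", "http://example.com/c"},
    "http://example.com/b": {"http://example.com/"},
    "http://example.com/c": set(),
}


def fetch_links(url):
    """Assumed stand-in for getPage + getParsedPage + getLinksFromParsedPage."""
    return FAKE_WEB.get(url, set())


def crawl(root_url, num_workers=4, fetch=fetch_links):
    seen = {}                     # fingerprint -> URL, like alreadySeenURLs
    seen_lock = threading.Lock()  # guards `seen` across worker threads
    todo = queue.Queue()          # thread-safe queue of URLs to crawl

    def enqueue(url):
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        with seen_lock:
            if fp in seen:
                return            # already queued or crawled
            seen[fp] = url
        todo.put(url)

    def worker():
        while True:
            url = todo.get()
            try:
                if url is None:   # sentinel: time to shut down
                    return
                for link in fetch(url):
                    enqueue(link)
            except Exception as exc:
                print("failed to crawl", url, exc)
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for t in threads:
        t.start()

    enqueue(root_url)
    todo.join()                   # wait until every queued URL is processed
    for _ in threads:
        todo.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return set(seen.values())
```

Calling `crawl("http://example.com/", num_workers=4)` returns the set of every URL reachable from the root. With real pages you would pass your own fetching function as `fetch`; because fetching is I/O-bound, threads can overlap the waiting time even under the GIL.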

Also, there is already a crawler framework for Python: http://scrapy.org/
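
If a full framework is more than the script needs, the standard library's `concurrent.futures` is another option. The following is only a sketch under assumptions: `FAKE_LINKS`, `fetch_links`, and `crawl_in_batches` are hypothetical names, and `fetch_links` again replaces the question's page-fetching helpers with an in-memory lookup. It crawls the current batch of candidates with up to N threads, then gathers the new links in the main thread, so no lock is needed.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory link graph used instead of real HTTP requests.
FAKE_LINKS = {
    "http://example.com/": {"http://example.com/a", "http://example.com/b"},
    "http://example.com/a": {"http://example.com/b", "http://example.com/c"},
    "http://example.com/b": {"http://example.com/"},
    "http://example.com/c": set(),
}


def fetch_links(url):
    """Assumed stand-in for getPage + getParsedPage + getLinksFromParsedPage."""
    return FAKE_LINKS.get(url, set())


def crawl_in_batches(root_url, max_workers=4, fetch=fetch_links):
    seen = {root_url}      # only the main thread touches this set
    frontier = [root_url]  # URLs to fetch in the current round
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # up to max_workers URLs from the frontier are fetched at once
            results = pool.map(fetch, frontier)
            next_frontier = []
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return seen
```

The trade-off of this design is that each round waits for its slowest page before the next round starts; the queue-based worker pattern avoids that at the cost of explicit locking.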
