How can I speed up web scraping with nested urllib2.urlopen() calls in Python?

Posted 2024-06-25 22:42:29

Male | A programmer who enjoys writing Python code.

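I have the following code to collect the number of words in each chapter of a book. In short, it opens the URL of each book, and then opens the URL of every chapter belonging to that book.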

import urllib2
from bs4 import BeautifulSoup
import re

def scrapeBook(bookId):
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    try:
        words = []
        # No second argument here: passing data would turn the request into a POST.
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        try:
            chapters = soup.find_all('a', rel='nofollow')  # find all relevant chapters
            for chapter in chapters:                       # loop through chapters
                if 'title' in chapter.attrs:
                    link = chapter['href']                 # go to chapter to find words
                    htmlTemp = urllib2.urlopen(link).read()
                    soupTemp = BeautifulSoup(htmlTemp)

                    # find out how many words there are in each chapter
                    spans = soupTemp.find_all('span')
                    for span in spans:
                        content = span.string
                        if content is not None:
                            # u'\u5b57\u6570' is the "word count" label on the page
                            if u'\u5b57\u6570' in content:
                                word = re.sub("[^0-9]", "", content)
                                words.append(word)
        except Exception:
            pass  # a failure partway through keeps the counts gathered so far

        return words

    except Exception:
        print 'Book ' + str(bookId) + ' does not exist'

Here is a sample run:

[sample run not preserved in the source]

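Unsurprisingly, the code is very slow. A major reason is that I need to open the URL of every book and, for each book, the URL of every chapter. Is there a way to make this process faster?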

Here is another bookId whose result is not empty: 3052409. It has hundreds of chapters, and the code runs forever.


1 answer

User

#1 · posted 2024-06-25 22:42:29

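The fact that you need to open every book and every chapter is dictated by the views the server exposes. What you can do is implement a parallel client. Create a thread pool and offload the HTTP requests as jobs to the worker threads, or do something similar with coroutines.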
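The thread-pool idea could look roughly like the sketch below. It is only an illustration, written for Python 3, where urllib2.urlopen is available as urllib.request.urlopen. The names fetch, fetch_all, fetch_one and max_workers are hypothetical, and the pool size of 16 is an arbitrary assumption that should be tuned to what the server tolerates.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def fetch(url, timeout=10):
    """Download one page with a plain GET and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=16):
    """Fetch many URLs concurrently; results come back in input order.

    A URL whose download raises is mapped to None, so one bad chapter
    does not abort the whole batch. (Hypothetical helper, not part of
    the original question's code.)
    """
    def safe_fetch(url):
        try:
            return fetch_one(url)
        except Exception:
            return None

    # The worker threads spend most of their time waiting on the network,
    # so many requests can be in flight at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_fetch, urls))
```

In scrapeBook, one way to use this would be to collect all chapter links first, call fetch_all(links) once, and then parse each returned page (skipping None entries) for the word count, instead of downloading the chapters one by one inside the loop.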

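Then there is the choice of HTTP client library. I have found libcurl and {} to be more CPU-efficient than urllib or any other module in the Python standard library.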
