How can I combine Requests and BeautifulSoup in Python to speed up web scraping?

Posted 2024-06-02 17:44:01


The goal is to scrape multiple pages with BeautifulSoup, feeding it the HTML returned by requests.get.

The steps are:

First, load the HTML with requests:

page = requests.get('https://oatd.org/oatd/' + url_to_pass)

Then scrape the HTML content with the following function:

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

Say we have one hundred unique URLs to scrape, illustrated here as ['record?record=handle\:11012\%2F16478&q=eeg'] * 100; the whole process can be done with the code below:

import requests
from bs4 import BeautifulSoup as Soup

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100  # In practice there will be 100 different unique sub-hrefs; the URL is duplicated here purely for illustration
all_website_scrape = []
for url_to_pass in list_of_url:
    page = requests.get('https://oatd.org/oatd/' + url_to_pass)
    if page.status_code == 200:
        all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

However, each URL is requested and scraped one at a time, so in principle this is very time-consuming.

I would like to know whether there is another way to improve the performance of the code above, but I am not aware of one.


2 Answers

You could perhaps use the threading module to make the script multithreaded so it runs faster: https://docs.python.org/3/library/threading.html
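
For illustration, a minimal sketch of what that could look like with plain threading.Thread objects; the round-robin split of the URL list, the thread count, and the lock around the shared results list are my own assumptions, not part of this answer:

    import threading
    import requests
    from bs4 import BeautifulSoup as Soup

    def get_each_page(page_soup):
        return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                    paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

    results = []
    results_lock = threading.Lock()

    def scrape_urls(urls):
        # each thread works through its own slice of the URL list
        for url_to_pass in urls:
            page = requests.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
            if page.status_code == 200:
                record = get_each_page(Soup(page.text, 'html.parser'))
                with results_lock:  # list.append is atomic in CPython, but a lock makes the intent explicit
                    results.append(record)

    list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100  # URL list taken verbatim from the question
    n_threads = 5
    chunks = [list_of_url[i::n_threads] for i in range(n_threads)]  # round-robin split of the URLs

    threads = [threading.Thread(target=scrape_urls, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"Scraped {len(results)} pages")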

But if you are open to a change of approach, I would recommend Scrapy.
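
To give a rough idea of what switching to Scrapy might look like, here is a minimal spider sketch; the spider name, the example start URL (the question's record URL with the stray backslashes removed) and the CSS selector on itemprop="name" are illustrative assumptions, not something from the original answer:

    import scrapy

    class OatdSpider(scrapy.Spider):
        name = "oatd"
        # one example record URL from the question; in practice you would list all 100 here
        start_urls = ["https://oatd.org/oatd/record?record=handle:11012%2F16478&q=eeg"]

        def parse(self, response):
            # Scrapy fetches scheduled requests concurrently, so many pages are downloaded at once
            yield {
                "paper_author": response.css('[itemprop="name"]::text').get(),
                "paper_title": response.css('[itemprop="name"]::text').get(),
            }

Saved as, say, oatd_spider.py, this could be run with scrapy runspider oatd_spider.py -o results.json; Scrapy then takes care of concurrency, retries and throttling.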

realpython.com has an article on speeding up Python scripts with concurrency:

https://realpython.com/python-concurrency/

Using their threading example, you can set the number of worker threads (max_workers), which increases the number of requests that can be in flight at once:

    from bs4 import BeautifulSoup as Soup
    import concurrent.futures
    import requests
    import threading
    import time

    def get_each_page(page_soup):
        return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                    paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

    def get_session():
        # one requests.Session per thread, stored in thread-local storage
        if not hasattr(thread_local, "session"):
            thread_local.session = requests.Session()
        return thread_local.session

    def download_site(url_to_pass):
        session = get_session()
        page = session.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
        print(f"{page.status_code}: {page.reason}")
        if page.status_code == 200:
            all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

    def download_all_sites(sites):
        # max_workers controls how many requests run at the same time
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(download_site, sites)

    if __name__ == "__main__":
        list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100  # In practice there will be 100 different unique sub-hrefs; the URL is duplicated here purely for illustration
        all_website_scrape = []
        thread_local = threading.local()
        start_time = time.time()
        download_all_sites(list_of_url)
        duration = time.time() - start_time
        print(f"Downloaded {len(all_website_scrape)} in {duration} seconds")
