How can I combine Requests and BeautifulSoup in Python to speed up web scraping?

Posted 2024-06-02 17:44:01


The goal is to scrape multiple pages with BeautifulSoup, feeding it the HTML returned by requests.get.

The steps are:

First, load the HTML with requests:

page = requests.get('https://oatd.org/oatd/' + url_to_pass)

Then scrape the HTML content with the following function:

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

Say we have one hundred unique URLs to scrape, illustrated here as ['record?record=handle\:11012\%2F16478&q=eeg'] * 100; the whole process can be done with the code below:

import requests
from bs4 import BeautifulSoup as Soup

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100  # In practice there will be 100 different unique sub-hrefs; the URL is duplicated here purely for illustration
all_website_scrape = []
for url_to_pass in list_of_url:
    page = requests.get('https://oatd.org/oatd/' + url_to_pass)
    if page.status_code == 200:
        all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

However, each URL is requested and scraped one at a time, so in principle this is very time-consuming.

I would like to know whether there is another way to improve the performance of the code above, but I am not aware of one.


2 Answers

You could perhaps use the threading module to make the script multithreaded so it runs faster: https://docs.python.org/3/library/threading.html
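
For illustration, a minimal sketch of what that could look like with plain threading.Thread objects; the round-robin split of the URL list, the thread count, and the lock around the shared results list are my own assumptions, not part of this answer:

    import threading
    import requests
    from bs4 import BeautifulSoup as Soup

    def get_each_page(page_soup):
        return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                    paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

    results = []
    results_lock = threading.Lock()

    def scrape_urls(urls):
        # each thread works through its own slice of the URL list
        for url_to_pass in urls:
            page = requests.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
            if page.status_code == 200:
                record = get_each_page(Soup(page.text, 'html.parser'))
                with results_lock:  # list.append is atomic in CPython, but a lock makes the intent explicit
                    results.append(record)

    list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100  # URL list taken verbatim from the question
    n_threads = 5
    chunks = [list_of_url[i::n_threads] for i in range(n_threads)]  # round-robin split of the URLs

    threads = [threading.Thread(target=scrape_urls, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"Scraped {len(results)} pages")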

But if you are open to a change of approach, I would recommend Scrapy.
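
To give a rough idea of what switching to Scrapy might look like, here is a minimal spider sketch; the spider name, the example start URL (the question's record URL with the stray backslashes removed) and the CSS selector on itemprop="name" are illustrative assumptions, not something from the original answer:

    import scrapy

    class OatdSpider(scrapy.Spider):
        name = "oatd"
        # one example record URL from the question; in practice you would list all 100 here
        start_urls = ["https://oatd.org/oatd/record?record=handle:11012%2F16478&q=eeg"]

        def parse(self, response):
            # Scrapy fetches scheduled requests concurrently, so many pages are downloaded at once
            yield {
                "paper_author": response.css('[itemprop="name"]::text').get(),
                "paper_title": response.css('[itemprop="name"]::text').get(),
            }

Saved as, say, oatd_spider.py, this could be run with scrapy runspider oatd_spider.py -o results.json; Scrapy then takes care of concurrency, retries and throttling.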

realpython.com has an article on speeding up Python scripts with concurrency:

https://realpython.com/python-concurrency/

Using their threading example, you can set the number of worker threads (max_workers), which increases the number of requests that can be in flight at once:

    from bs4 import BeautifulSoup as Soup
    import concurrent.futures
    import requests
    import threading
    import time

    def get_each_page(page_soup):
        return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                    paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

    def get_session():
        # one requests.Session per thread, stored in thread-local storage
        if not hasattr(thread_local, "session"):
            thread_local.session = requests.Session()
        return thread_local.session

    def download_site(url_to_pass):
        session = get_session()
        page = session.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
        print(f"{page.status_code}: {page.reason}")
        if page.status_code == 200:
            all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

    def download_all_sites(sites):
        # max_workers controls how many requests run at the same time
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(download_site, sites)

    if __name__ == "__main__":
        list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100  # In practice there will be 100 different unique sub-hrefs; the URL is duplicated here purely for illustration
        all_website_scrape = []
        thread_local = threading.local()
        start_time = time.time()
        download_all_sites(list_of_url)
        duration = time.time() - start_time
        print(f"Downloaded {len(all_website_scrape)} in {duration} seconds")
