<p>I'd suggest multiprocessing. Your computer sits mostly idle while it waits for the server to respond to each request. Depending on the server I'm scraping, I can get a 10x-20x speedup by using multiprocessing.</p>
<p>First, I'd convert your loop into a function that takes a url as its argument and returns:
<code>[gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM]</code>.</p>
<p>Here's a rough outline:</p>
<pre><code>import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import csv
# multiprocessing.dummy exposes the multiprocessing API but is backed by
# threads, which is what you want for I/O-bound work like http requests
from multiprocessing.dummy import Pool

def scrape_gas_data(url):
    # your main code here
    return [gas_unit_values,gas_capacity_values,gas_commissioned_date,gas_decommissioned_date,gas_HRSG_OEM,gas_turbine_OEM,gas_generator_OEM]

url_list = ["http://www.globalenergyobservatory.com/form.php?pid={}".format(i) for i in range(1,46624)]

# Since http requests can sit idle for some time, you might be able to get away
# with passing a large number to Pool (say 50) even though your machine probably
# can't run 50 threads at once
my_pool = Pool()
results = my_pool.map(scrape_gas_data, url_list)
</code></pre>
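<p><code>Pool.map</code> returns the results in input order, so once the pool finishes you can hand them straight to pandas. A minimal sketch, assuming the column names below (they're illustrative, not taken from your code):</p>
<pre><code>columns = ["unit", "capacity", "commissioned", "decommissioned",
           "HRSG_OEM", "turbine_OEM", "generator_OEM"]  # hypothetical names
df = pd.DataFrame(results, columns=columns)
df.to_csv("gas_data.csv", index=False)
</code></pre>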
<p>The BeautifulSoup documentation mentions that the <code>lxml</code> parser is faster than <code>html.parser</code>. I'm not sure that parsing is the rate-limiting step here, but since changing the parser is usually easy, I'll mention it as well.</p>
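<p>The swap is a one-liner wherever you construct the soup, assuming <code>lxml</code> is installed (<code>pip install lxml</code>):</p>
<pre><code>html = urllib.request.urlopen(url).read()
# soup = BeautifulSoup(html, "html.parser")  # default pure-Python parser
soup = BeautifulSoup(html, "lxml")           # typically faster, needs the lxml package
</code></pre>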
<p>Also, as a note on good practice, you're reassigning the loop variable <code>i</code> inside the loop, which isn't very clean.</p>
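<p>I can't point to the exact line, but the pattern to avoid looks something like this (hypothetical example):</p>
<pre><code>for i in range(1, 46624):
    ...
    i = soup.find("div")  # clobbers the loop counter; use a new, descriptive name
</code></pre>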