<p>Here is an answer that combines my suggestions with Evie's:</p>
<pre><code>import csv
from multiprocessing import Pool

import numpy as np
import pandas as pd

keys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1, 4), columns=['a', 'b', 'c', 'd'])
    # .ix is deprecated/removed in modern pandas; use .iloc for positional access
    test_list = test_df.iloc[0].tolist()
    return test_list

if __name__ == "__main__":
    try:
        pool = Pool(processes=8)
        resultset = pool.imap(key_loop, keys, chunksize=200)
        # newline='' prevents the csv module from writing blank rows on Windows
        with open("C:\\Users\\mp_streaming_test.csv", 'w', newline='') as f:
            writer = csv.writer(f)
            for listitem in resultset:
                writer.writerow(listitem)
        print("finished load")
    except Exception:
        print('There was a problem multiprocessing the key Pool')
        raise
</code></pre>
<p>Again, the changes here are:</p>
<ol>
<li>Iterating over <code>resultset</code> directly, rather than needlessly copying it into a list first.</li>
<li>Passing the <code>keys</code> list directly to <code>pool.imap</code>, instead of building a generator expression from it.</li>
<li>Supplying a <code>chunksize</code> larger than the default of 1 to <code>pool.imap</code>. A larger <code>chunksize</code> reduces the inter-process communication cost of passing the values in <code>keys</code> to the pool's worker processes, which <a href="https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.imap" rel="nofollow noreferrer">can give big performance boosts</a> when <code>keys</code> is very large. You should experiment with different values of <code>chunksize</code> (try values much larger than 200, such as 5000, etc.) and see how it affects performance. 200 is just a wild guess on my part, though it's certainly better than 1.</li>
</ol>
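<p>To make the <code>chunksize</code> effect from point 3 concrete, here is a minimal, self-contained timing sketch. The <code>work</code> function and the task count <code>N</code> are hypothetical stand-ins for <code>key_loop</code> and a large <code>keys</code> list; because each task is cheap, IPC overhead dominates and the difference between chunk sizes becomes visible.</p>

```python
import time
from multiprocessing import Pool

N = 20_000  # hypothetical number of tasks, standing in for a large keys list

def work(x):
    # Trivial stand-in for key_loop: per-task work is cheap,
    # so inter-process communication cost dominates the runtime.
    return x * x

def timed_imap(chunksize):
    """Return (elapsed seconds, checksum) for one pool.imap run."""
    with Pool(processes=4) as pool:
        start = time.perf_counter()
        total = sum(pool.imap(work, range(N), chunksize=chunksize))
        elapsed = time.perf_counter() - start
    return elapsed, total

if __name__ == "__main__":
    for cs in (1, 200, 5000):
        elapsed, _ = timed_imap(cs)
        print(f"chunksize={cs:>5}: {elapsed:.3f}s")
```

<p>On a typical machine, <code>chunksize=1</code> is noticeably slower than the larger values for a cheap worker like this; with an expensive worker the gap shrinks, which is why measuring with your real workload matters.</p>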