<p>Prompted by this question and @unutbu's answer, I wrote a parallel version of map on <a href="https://github.com/fashandge/partools" rel="nofollow">github</a>. The function is suited to parallel processing of a read-only big data structure with the multiple cores of a single machine. The basic idea is the one @unutbu suggested: use a temporary global variable to hold the big data structure (e.g., a dataframe) and pass its "name", rather than the variable itself, to the workers. All of this is encapsulated in a map function, so, with help from the pathos package, it is almost a drop-in replacement for the standard map function (a bare-bones sketch of the underlying mechanism follows the usage example below). Example usage:</p>
<pre><code>import numpy as np
import pandas as pd
# `map` below is the parallel map from the partools package linked above,
# which shadows the builtin; the exact import path may differ, e.g.
# from partools import map

# Suppose we process a big dataframe with millions of rows
# (10**7 rows here, so the example fits in memory).
size = 10**7
df = pd.DataFrame(np.random.randn(size, 4),
                  columns=['column_01', 'column_02',
                           'column_03', 'column_04'])

# Divide df into sections of 10000 rows; each section will be
# processed by one worker at a time.
section_size = 10000
sections = [xrange(start, start + section_size)
            for start in xrange(0, size, section_size)]

# The worker function that processes one section of the df.
# The key assumption is that a child process does NOT modify
# the dataframe, but does some analysis or aggregation and
# returns some result.
def func(section, df):
    return some_processing(df.iloc[section])

num_cores = 4
# sections (local_args) specify which parts of the big object to process;
# global_arg holds the big object itself, to avoid unnecessary copies;
# results is a list of objects, each of which is the processing result
# of one part of the big object (i.e., one element of the iterable
# sections), in order.
results = map(func, sections, global_arg=df,
              chunksize=10,
              processes=num_cores)

# Reduce the results (assume it is a list of dataframes).
result = pd.concat(results)
</code></pre>
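<p>For reference, here is a minimal sketch of the mechanism this map encapsulates, using only the standard library's <code>multiprocessing</code>. The names <code>_shared</code>, <code>init_worker</code>, and <code>worker</code> are illustrative, not partools' actual internals, and <code>df</code>, <code>sections</code>, <code>num_cores</code>, and <code>some_processing</code> are reused from the example above:</p>
<pre><code>import multiprocessing as mp

# Temporary module-level storage; each worker process gets its own copy.
_shared = {}

def init_worker(big_df):
    # Runs once per worker when the pool starts; with the POSIX fork
    # start method the dataframe is inherited, not re-pickled per task.
    _shared['df'] = big_df

def worker(section):
    # Each task carries only the lightweight range of row positions;
    # the big dataframe is looked up by "name" inside the worker.
    return some_processing(_shared['df'].iloc[section])

pool = mp.Pool(processes=num_cores,
               initializer=init_worker, initargs=(df,))
try:
    results = pool.map(worker, sections, chunksize=10)
finally:
    pool.close()
    pool.join()
result = pd.concat(results)
</code></pre>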
<p>In some of my text-mining tasks, a naive parallel implementation that passes df directly to the worker function turned out even slower than the single-threaded version, because copying the big dataframe is expensive. On those tasks, however, the implementation above achieves more than a 3x speedup with 4 cores, which looks very close to true lightweight multithreading.</p>
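<p>For contrast, the naive variant mentioned above might look like the following sketch (reusing <code>func</code>, <code>df</code>, <code>sections</code>, and <code>num_cores</code> from earlier): because <code>df</code> is baked into the task callable, it is pickled and copied to a worker with every chunk of tasks, and that serialization is what makes it slower than single-threaded code on a big dataframe:</p>
<pre><code>import multiprocessing as mp
from functools import partial

pool = mp.Pool(processes=num_cores)
try:
    # partial(func, df=df) drags the whole dataframe along with each
    # pickled task chunk; this repeated copying is the bottleneck.
    results = pool.map(partial(func, df=df), sections, chunksize=10)
finally:
    pool.close()
    pool.join()
result = pd.concat(results)
</code></pre>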