<p>First of all, you may want to use <code>itertools.combinations</code> and <code>random.sample</code> to get unique pairs in the future, but it won't work in this case due to memory issues. Next, multiprocessing is not multithreading: spawning a new process involves considerable system overhead, so it makes little sense to spawn one process per individual task. A task must be big enough to justify starting a new process, hence you'd better split all the work into a few separate jobs (one per core you want to use). Also, don't forget that the <code>multiprocessing</code> implementation serialises the entire namespace and loads it into memory N times, where N is the number of processes. This can lead to heavy swapping if there isn't enough RAM to store N copies of your huge array, so you may want to reduce the number of cores.</p>
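<p>For instance, the splitting step alone can be sketched like this (a minimal illustration; <code>split_into_chunks</code> is a made-up name, not part of the solution below):</p>
<pre><code>import math

def split_into_chunks(tasks, n_chunks):
    """Split tasks into at most n_chunks contiguous chunks."""
    chunk_size = int(math.ceil(len(tasks) / float(n_chunks)))
    return [tasks[i:i + chunk_size] for i in range(0, len(tasks), chunk_size)]

print(split_into_chunks(list(range(10)), 4))  # -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
</code></pre>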
<p><strong>Updated</strong> to restore the initial order, as you requested.</p>
<p>I made a test dataset of identical vectors, hence <code>cosine</code> must return a vector of zeros.</p>
<pre><code>from __future__ import division, print_function
import math
import multiprocessing as mp
from scipy.spatial.distance import cosine
from operator import itemgetter
import itertools


def worker(enumerated_comps):
    # keep the original index alongside each distance so the order
    # can be restored after the chunks are merged
    return [(ind, cosine(artist_topic_probs[a], artist_topic_probs[b]))
            for ind, (a, b) in enumerated_comps]


def slice_iterable(iterable, chunk):
    """
    Slices an iterable into chunks of size ``chunk``
    :param chunk: the number of items per slice
    :type chunk: int
    :type iterable: collections.Iterable
    :rtype: collections.Generator
    """
    _it = iter(iterable)
    return itertools.takewhile(
        bool, (tuple(itertools.islice(_it, chunk)) for _ in itertools.count(0))
    )


# Test data: identical vectors, so every cosine distance is (numerically) zero
artist_topic_probs = [range(10) for _ in range(10)]
comps = tuple(enumerate([(1, 2), (1, 3), (1, 4), (1, 5)]))

n_cores = 2
chunksize = int(math.ceil(len(comps) / n_cores))
jobs = tuple(slice_iterable(comps, chunksize))

pool = mp.Pool(processes=n_cores)
work_res = pool.map_async(worker, jobs)
c_dists = list(map(itemgetter(1), sorted(itertools.chain(*work_res.get()))))
print(c_dists)
</code></pre>
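<p>The order-restoration trick used above (tagging each pair with <code>enumerate</code> before the work is split, then sorting the merged results by that tag) can be seen in isolation on toy data:</p>
<pre><code>from operator import itemgetter
import itertools

# results come back chunk by chunk, each item tagged with its original index
chunk_a = [(2, 'c'), (3, 'd')]
chunk_b = [(0, 'a'), (1, 'b')]

merged = sorted(itertools.chain(chunk_a, chunk_b))  # sort by the index tag
values = list(map(itemgetter(1), merged))           # drop the tags
print(values)  # -> ['a', 'b', 'c', 'd']
</code></pre>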
<p>The output is a list of four values on the order of machine epsilon, i.e. numerically zero, as expected for identical vectors.</p>
<p>P.S.</p>
<p>From the <code>multiprocessing.Pool.apply</code> documentation:</p>
<blockquote>
<p>Equivalent of the <code>apply()</code> built-in function. It <strong>blocks until the
result is ready</strong>, so <code>apply_async()</code> is better suited for performing
work in parallel. Additionally, func is only executed in one of the
workers of the pool.</p>
</blockquote>
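<p>A quick sketch of the difference the docs describe (toy function, illustrative only): <code>apply()</code> blocks on each call, so a loop of such calls runs serially, while <code>apply_async()</code> submits all calls first and collects the results afterwards:</p>
<pre><code>import multiprocessing as mp

def square(x):
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    # apply() blocks until each result is ready: the loop is effectively serial
    serial = [pool.apply(square, (i,)) for i in range(4)]
    # apply_async() returns immediately; the calls may run in parallel
    parallel = [r.get() for r in [pool.apply_async(square, (i,)) for i in range(4)]]
    pool.close()
    pool.join()
    print(serial, parallel)  # both give [0, 1, 4, 9]
</code></pre>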