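<p>First, you're doing yourself a bit of a disservice by not explicitly nesting the for loops. You end up repeating a lot of work without saving any memory. Once the loops are nested, you can move some of the computations from one level to another and work out which inner loops can be vectorized.</p>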
<pre><code>import numpy as np

def supports_5_loop(data, num_series, resolutions, buffer_size, vocab_size):
    ratios = np.full((num_series, vocab_size, num_series, vocab_size, resolutions), 0.0)
    # range replaces xrange so the code also runs on Python 3
    for res in range(resolutions):
        for s0 in range(num_series):
            # Find the positions where s0==v0
            for v0 in np.unique(data[s0, res]):
                # only need to find indices once for each series and value
                found0 = np.where(data[s0, res, :] == v0)[0]
                for s1 in range(num_series):
                    # Check how often s1==v1 right before s0==v0
                    candidates = (s1, res, (found0 - 1 + buffer_size) % buffer_size)
                    total01 = np.logical_or(data[s0, res, :] >= 0, data[s1, res, :] >= 0).sum()
                    # can skip inner loops if there are no candidates
                    if total01 == 0:
                        continue
                    for v1 in range(vocab_size):
                        found01 = np.count_nonzero(data[candidates] == v1)
                        if found01 == 0:
                            continue
                        ratio = (float(found01) / total01)
                        ratios[(s0, v0, s1, v1, res)] = ratio
    return ratios
</code></pre>
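<p>You'll see in the timings that most of the speedup comes from not duplicating work.</p>
<p>Once you have the nested structure, you can start looking at vectorization and other optimizations.</p>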
^{pr2}$
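<p>Unfortunately, I could only vectorize the innermost loop, and that only bought roughly an additional 10% speedup. Outside the innermost loop you can't guarantee that all the vectors are the same size, so you can't build a single array.</p>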
<pre><code>In [121]: (np.all(supports_loop(data, num_series, resolutions, buffer_size, vocab_size) == supports_5_loop(data, num_series, resolutions, buffer_size, vocab_size)))
Out[121]: True
In [122]: (np.all(supports_loop(data, num_series, resolutions, buffer_size, vocab_size) == supports_4_loop(data, num_series, resolutions, buffer_size, vocab_size)))
Out[122]: True
In [123]: %timeit(supports_loop(data, num_series, resolutions, buffer_size, vocab_size))
2.29 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [124]: %timeit(supports_5_loop(data, num_series, resolutions, buffer_size, vocab_size))
949 µs ± 5.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [125]: %timeit(supports_4_loop(data, num_series, resolutions, buffer_size, vocab_size))
843 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
</code></pre>
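<p>As a rough illustration of the innermost-loop vectorization mentioned above, here is a minimal sketch that replaces the per-<code>v1</code> <code>np.count_nonzero</code> calls with one <code>np.bincount</code> pass. The helper names <code>count_values</code> and <code>count_values_loop</code> are hypothetical and are not the original <code>supports_4_loop</code>; the sketch assumes the values are integer codes and that anything outside <code>0..vocab_size-1</code> (such as a negative "missing" marker) should be ignored.</p>

```python
import numpy as np

def count_values(candidate_values, vocab_size):
    """Count how often each value 0..vocab_size-1 occurs, in a single pass.

    Hypothetical helper: values outside that range are dropped first,
    because np.bincount rejects negative inputs.
    """
    vals = np.asarray(candidate_values)
    keep = (vals >= 0) & (vals < vocab_size)
    return np.bincount(vals[keep].astype(np.intp), minlength=vocab_size)

def count_values_loop(candidate_values, vocab_size):
    """Reference version: one np.count_nonzero call per vocabulary value."""
    vals = np.asarray(candidate_values)
    return np.array([np.count_nonzero(vals == v1) for v1 in range(vocab_size)])

# Small check that both versions agree on the same input.
vals = np.array([2, 0, 2, -1, 3, 2])
counts = count_values(vals, 4)
print(counts)
print(np.array_equal(counts, count_values_loop(vals, 4)))
```

<p>Inside the <code>s1</code> loop, the <code>v1</code> loop could then plausibly become a single slice assignment such as <code>ratios[s0, v0, s1, :, res] = count_values(data[candidates], vocab_size) / float(total01)</code>. Zero counts simply write <code>0.0</code>, which matches the array's initial fill value.</p>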