<h2>NumPy broadcasting</h2>
<p>If you are not limited by memory, the first step in optimizing nested loops in <code>numpy</code> is to use broadcasting and perform the operations in a vectorized way:</p>
<pre><code>import numpy as np
def mb_r(forecasted_array, observed_array):
"""Returns the Mielke-Berry R value."""
assert len(observed_array) == len(forecasted_array)
total = np.abs(forecasted_array[:, np.newaxis] - observed_array).sum() # Broadcasting
return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total[0])
</code></pre>
<p>In this case, however, the looping is done in C rather than in Python, at the cost of allocating an array of shape (N, N).</p>
<h2>Broadcasting is not a silver bullet: try unrolling the inner loop</h2>
<p>As noted above, broadcasting implies a huge memory overhead, so it should be used with care and is not always the right approach. While your first impression may be to use it everywhere: <strong>don't</strong>. Not long ago this fact confused me too; see my question <a href="https://stackoverflow.com/questions/41325427/numpy-ufuncs-speed-vs-for-loop-speed">Numpy ufuncs speed vs for loop speed</a>. To keep this short, I will show it on your example:</p>
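<p>The code block defining the two variants was garbled in this copy. A minimal sketch of what they plausibly look like (<code>mb_r_bcast</code> broadcasts over both axes at once, while <code>mb_r_unroll</code> keeps a Python loop over the outer axis and vectorizes only the inner one; <code>mae</code> is assumed to be the mean absolute error):</p>

```python
import numpy as np

def mb_r_bcast(forecasted_array, observed_array):
    """Full broadcasting: materializes an (N, N) temporary."""
    mae = np.mean(np.abs(forecasted_array - observed_array))
    total = np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()
    return 1 - mae * forecasted_array.size ** 2 / total

def mb_r_unroll(forecasted_array, observed_array):
    """Python loop over the outer axis, vectorized inner axis: O(N) memory."""
    mae = np.mean(np.abs(forecasted_array - observed_array))
    total = 0.0
    for observed in observed_array:
        total += np.abs(forecasted_array - observed).sum()
    return 1 - mae * forecasted_array.size ** 2 / total
```

Both compute the same double sum; only the amount of temporary memory differs.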
<p><strong><em>Small arrays (broadcasting is faster)</em></strong></p>
<pre><code>forecasted = np.random.rand(100)
observed = np.random.rand(100)
%timeit mb_r_bcast(forecasted, observed)
57.5 µs ± 359 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit mb_r_unroll(forecasted, observed)
1.17 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
</code></pre>
<p><strong><em>Medium-size arrays (about equal)</em></strong></p>
<pre><code>forecasted = np.random.rand(1000)
observed = np.random.rand(1000)
%timeit mb_r_bcast(forecasted, observed)
15.6 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mb_r_unroll(forecasted, observed)
16.4 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</code></pre>
<p><strong><em>Large arrays (broadcasting is slower)</em></strong></p>
<pre><code>forecasted = np.random.rand(10000)
observed = np.random.rand(10000)
%timeit mb_r_bcast(forecasted, observed)
1.51 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mb_r_unroll(forecasted, observed)
377 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>
<p>As you can see, for small arrays the broadcast version is <strong>20x</strong> faster than the unrolled one, for medium-size arrays they are roughly equal, but for large arrays it is <strong>4x slower</strong>, because the memory overhead takes its toll.</p>
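<p>To put a number on that overhead: for N = 10,000 float64 values, the broadcasted difference materializes a full (N, N) temporary before <code>.sum()</code> reduces it, so the allocation grows quadratically with N:</p>

```python
import numpy as np

# The broadcasted difference forecasted[:, np.newaxis] - observed creates
# a full (N, N) float64 temporary before .sum() reduces it.
N = 10_000
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element
temp_bytes = N * N * itemsize
print(temp_bytes / 1e9)  # 0.8 -> roughly 0.8 GB for N = 10_000
```

The unrolled version only ever holds one row of that temporary, i.e. N elements, at a time.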
<h2>Numba jit and parallelization</h2>
<p>Another approach is to use <code>numba</code> and its powerful <code>@jit</code> function decorator. In this case, only a slight modification of the initial code is necessary. Additionally, to make the loops parallel, you should change <code>range</code> to <code>prange</code> and provide the <code>parallel=True</code> keyword argument. In the snippet below I use the <code>@njit</code> decorator, which is the same as <code>@jit(nopython=True)</code>:</p>
<pre><code>from numba import njit, prange

@njit(parallel=True)
def mb_r_njit(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = 0.
    size = len(forecasted_array)
    for i in prange(size):
        observed = observed_array[i]
        for j in prange(size):
            total += abs(forecasted_array[j] - observed)
    return 1 - (mae(forecasted_array, observed_array) * size ** 2 / total)
</code></pre>
<p>You didn't provide the <code>mae</code> function, but to run the code in <code>njit</code> mode you must also decorate the <code>mae</code> function, or, if it is just a number, pass it to the jitted function as an argument.</p>
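<p>Since <code>mae</code> was not shown in the question, here is a plausible plain-NumPy definition (an assumption on my part); it could either be decorated with <code>@njit</code> as well, or evaluated once and its value passed into the jitted function:</p>

```python
import numpy as np

def mae(forecasted_array, observed_array):
    """Mean absolute error between the two series (assumed definition)."""
    return np.mean(np.abs(forecasted_array - observed_array))
```

Precomputing it once and passing the scalar avoids having to compile <code>mae</code> itself.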
<h3>Other options</h3>
<p>Python's scientific ecosystem is huge; I'll just mention a few other possible options for speeding this up: <code>Cython</code>, <code>Nuitka</code>, <code>Pythran</code>, <code>bottleneck</code>, and many others. Perhaps you are interested in <code>gpu computing</code>, but that is really a separate story.</p>
<h2>Timings</h2>
<p>Unfortunately, on my computer the timings are:</p>
<pre><code>import numpy as np
import numexpr as ne
forecasted_array = np.random.rand(10000)
observed_array = np.random.rand(10000)
</code></pre>
<p><strong>Initial version</strong></p>
<pre><code>%timeit mb_r(forecasted_array, observed_array)
23.4 s ± 430 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>
<p><strong><code>numexpr</code></strong></p>
<pre><code>%%timeit
forecasted_array2d = forecasted_array[:, np.newaxis]
ne.evaluate('sum(abs(forecasted_array2d - observed_array))')[()]
784 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>
<p><strong>Broadcasting version</strong></p>
<pre><code>%timeit mb_r_bcast(forecasted_array, observed_array)
1.47 s ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>
<p><strong>Inner loop unrolled version</strong></p>
<pre><code>%timeit mb_r_unroll(forecasted_array, observed_array)
389 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>
<p><strong>numba <code>njit(parallel=True)</code> version</strong></p>
<pre><code>%timeit mb_r_njit(forecasted_array, observed_array)
32 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>
<p>As you can see, the <code>njit</code> approach is <strong>730x</strong> faster than the initial solution, and also 24.5x faster than the <code>numexpr</code> solution (perhaps you would need Intel's Vector Math Library to accelerate that one further). The simple approach of unrolling the inner loop also gives a <strong>60x</strong> speedup over the initial version. My specs are:</p>
<p><em>Intel(R) Core(TM)2 Quad CPU Q9550 2.83 GHz</em><br/>
<em>Python 3.6.3<br/>
numpy 1.13.3<br/>
numba 0.36.1<br/>
numexpr 2.6.4</em></p>
<h2>Final notes</h2>
<p>Strangely enough, plain iteration over <code>numpy</code> arrays turned out to be very slow in this test:</p>
<pre><code>arr = np.arange(1000)
ls = arr.tolist()
%timeit for i in arr: pass
69.5 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit for i in ls: pass
13.3 µs ± 81.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit for i in range(len(arr)): arr[i]
167 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit for i in range(len(ls)): ls[i]
90.8 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
</code></pre>
<p>It turns out you were right: iterating over lists is <strong>2-5x</strong> faster. Of course, these results should be taken with a certain amount of irony :)</p>