Column sum vs. row sum: why don't I see a difference using NumPy?



<p>I tested an example demonstrated in this <a href="http://www.pytables.org/docs/LargeDataAnalysis.pdf" rel="nofollow noreferrer">talk</a> [pytables] (page 20/57) using numpy.</p>
<p>It states that <code>a[:,1].sum()</code> takes 9.3 ms, whereas <code>a[1,:].sum()</code> takes only 72 us.</p>
<p>I tried to reproduce this, but failed. Am I measuring it wrong? Or has NumPy changed since 2010?</p>
<pre><code>$ python2 -m timeit -n1000 --setup \
  'import numpy as np; a = np.random.randn(4000,4000);' 'a[:,1].sum()'
1000 loops, best of 3: 16.5 usec per loop
$ python2 -m timeit -n1000 --setup \
  'import numpy as np; a = np.random.randn(4000,4000);' 'a[1,:].sum()'
1000 loops, best of 3: 13.8 usec per loop
$ python2 --version
Python 2.7.7
$ python2 -c 'import numpy; print numpy.version.version'
1.8.1
</code></pre>
<p>Although I can measure a small benefit for the second version (presumably fewer cache misses, because numpy uses C-style row ordering), I don't see the drastic difference reported by the pytables contributor.</p>
<p>Also, the column sum does not seem to produce more cache misses than the row sum.</p>
<hr/>
<p><strong>EDIT</strong></p>
<ul>
<li><p>So far, my insight is that I was using the <code>timeit</code> module the wrong way. Repeated runs over the same array (or the same row/column of it) will almost certainly be served from cache (I have <code>32KiB</code> of L1 data cache, so one row fits: <code>4000 * 8 bytes ≈ 31.3 KiB &lt; 32 KiB</code>).</p></li>
<li><p>Using the script from @alim's <a href="https://stackoverflow.com/a/24738454/543411">answer</a> with a single loop (<code>nloop=1</code>) and ten trials (<code>nrep=10</code>), and varying the size of the random array (<code>n x n</code>), I measure</p> ^{pr2}$ <p>*<code>n=10k</code> and larger no longer fit into the L1d cache.</p></li>
</ul>
<p>I'm still not sure I have found the cause, since <code>perf</code> shows about the same cache-miss rate (sometimes even a higher one) for the faster row sum.</p>
<h2><code>Perf</code> data:</h2>
<p><code>nloop = 2</code> and <code>nrep=2</code>, so I expect some of the data to still be in the cache for the second run.</p>
<h3>Row sum <code>n=10k</code></h3>
<pre><code> perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2&gt;&amp;1 | sed 's/^/ /g'
row sum: 103.593 us

 Performance counter stats for './answer1.py':

      25850670 cache-references                                         [30.04%]
       1321945 cache-misses            # 5.114 % of all cache refs      [20.04%]
    5706371393 L1-dcache-loads                                          [20.00%]
      11733777 L1-dcache-load-misses   # 0.21% of all L1-dcache hits    [19.97%]
    2401264190 L1-dcache-stores                                         [20.04%]
     131964213 L1-dcache-store-misses                                   [20.03%]
       2007640 L1-dcache-prefetches                                     [20.04%]
   21894150686 cycles                                                   [20.02%]
   24582770606 instructions            # 1.12 insns per cycle           [30.06%]
    3534308182 branches                                                 [30.01%]
          3767 faults
             6 migrations

   7.331092823 seconds time elapsed
</code></pre>
<h3>Column sum <code>n=10k</code></h3>
<pre><code> perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2&gt;&amp;1 | sed 's/^/ /g'
column sum: 377.059 us

 Performance counter stats for './answer1.py':

      26673628 cache-references                                         [30.02%]
       1409989 cache-misses            # 5.286 % of all cache refs      [20.07%]
    5676222625 L1-dcache-loads                                          [20.06%]
      11050999 L1-dcache-load-misses   # 0.19% of all L1-dcache hits    [19.99%]
    2405281776 L1-dcache-stores                                         [20.01%]
     126425747 L1-dcache-store-misses                                   [20.02%]
       2128076 L1-dcache-prefetches                                     [20.04%]
   21876671763 cycles                                                   [20.00%]
   24607897857 instructions            # 1.12 insns per cycle           [30.00%]
    3536753654 branches                                                 [29.98%]
          3763 faults
             9 migrations

   7.327833360 seconds time elapsed
</code></pre>
<hr/>
<p><strong>EDIT2</strong> I think I have understood some aspects, but I don't think the question has been answered yet. At the moment I think this summation example reveals nothing about CPU caches at all. To eliminate the uncertainty introduced by numpy/python, I tried doing the summation in <em>C</em> and measuring it with <code>perf</code>; the results are in an answer below.</p>
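<p>The cache-size argument in the EDIT can be checked with a little arithmetic. The following is a minimal sketch, not part of the original measurements: it assumes 8-byte float64 elements in C order and a 64-byte cache line (the line size is an assumption, the question only gives the 32 KiB L1d size), and the helper names <code>lines_touched</code> and <code>report</code> are made up for illustration.</p>

```python
import math

ITEMSIZE = 8            # bytes per float64 (the dtype np.random.randn returns)
L1D_BYTES = 32 * 1024   # L1d size quoted in the question
LINE_BYTES = 64         # ASSUMPTION: cache-line size, not stated in the source


def lines_touched(n, column):
    """Cache lines touched when summing one row or one column of a
    C-ordered n x n float64 array (slice assumed to start on a line boundary)."""
    if column:
        stride = n * ITEMSIZE               # bytes between consecutive column elements
        span = (n - 1) * stride + ITEMSIZE  # bytes from first to last element
        return min(n, math.ceil(span / LINE_BYTES))
    return math.ceil(n * ITEMSIZE / LINE_BYTES)


def report(n):
    """Return (name, lines, footprint_bytes, fits_in_L1d) for row and column."""
    rows = []
    for name, column in (("row", False), ("column", True)):
        lines = lines_touched(n, column)
        footprint = lines * LINE_BYTES      # bytes of cache occupied by those lines
        rows.append((name, lines, footprint, footprint <= L1D_BYTES))
    return rows


for n in (4000, 10000):
    for name, lines, footprint, fits in report(n):
        print(f"n={n:>6} {name:>6}: {lines:>6} lines, "
              f"{footprint / 1024:8.1f} KiB, fits L1d: {fits}")
```

<p>Under these assumptions a 4000-element row occupies 500 cache lines and fits into L1d, while a 4000-element column touches 4000 separate lines (250 KiB), which no longer fits into L1d but may still fit into a larger L2/L3 cache. That would be one possible reason why the repeated <code>timeit</code> loops show so little difference.</p>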


1 Answer

Anonymous, 1 day ago


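<p>I wrote the summation example in <em>C</em>. The results are reported as <a href="https://stackoverflow.com/a/459704/543411">CPU time</a> measurements, and I always compiled with <code>gcc -O1 using-c.c</code> (gcc version 4.9.0 20140604). The source code is given below.</p>
<p>I chose a matrix size of <code>n x n</code>. For <code>n&lt;2k</code> there is no measurable difference between the row-wise and the column-wise sum (6-7 us per run for <code>n=2k</code>).</p>
<h3>Row-wise sum</h3>
<pre><code>n     first/us   converged/us
1k        5           4
4k       19          12
10k      35          31
20k      70          61
30k     130          90
</code></pre>
<p>e.g. <code>n=20k</code></p> ^{pr2}$
<h3>Column-wise sum</h3>
<pre><code>n     first/us   converged/us
1k        5           4
4k      112          14
10k     228          32
20k     550         246
30k    1000         300
</code></pre>
<p>e.g. <code>n=20k</code></p>
<pre><code>Run 0 taken 552 cycles. 0 ms 552 us
Run 1 taken 358 cycles. 0 ms 358 us
Run 2 taken 291 cycles. 0 ms 291 us
Run 3 taken 264 cycles. 0 ms 264 us
Run 4 taken 252 cycles. 0 ms 252 us
Run 5 taken 275 cycles. 0 ms 275 us
Run 6 taken 262 cycles. 0 ms 262 us
Run 7 taken 249 cycles. 0 ms 249 us
Run 8 taken 249 cycles. 0 ms 249 us
Run 9 taken 246 cycles. 0 ms 246 us
</code></pre>
<h3>Discussion</h3>
<p>The row-wise sum is faster. It hardly benefits from caching, i.e. repeated sums are not much faster than the initial one. The column-wise sum is much slower, but it steadily speeds up over 5-8 iterations. The speed-up is most pronounced between <code>n=4k</code> and {<cd8>}, where caching helps to increase the speed about tenfold. For larger arrays the speed-up is only about a factor of 2. I also observe that while the row-wise sum converges very quickly (after one or two trials), the column-wise sum needs many more iterations (5 or more) to converge.</p>
<p>Take-away lessons for me:</p>
<ul>
<li>For large arrays (<code>n</code> above roughly 2k) there is a difference in summation speed. I believe this is due to a synergy when fetching data from RAM into the L1d cache. Although I don't know the block/line size of a single read, I assume it is larger than 8 bytes, so the next element to be summed is already in the cache.</li>
<li>The column-wise sum is limited primarily by memory bandwidth. The CPU seems to be starved for data while scattered chunks are read from RAM.</li>
<li>When the summation is repeated, one expects that some of the data no longer has to be fetched from RAM because it is already present in the L2/L1d cache. For the row-wise sum this only becomes noticeable for <code>n&gt;30k</code>; for the column-wise sum it is already noticeable for <code>n&gt;2k</code>.</li>
</ul>
<p>Using <code>perf</code>, I don't see any big difference. However, most of the work of the C program goes into filling the array with random data, and I don't know how to exclude this "setup" work from the counters.</p>
<p>Here is the <em>C</em> code for this example:</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt; // see `man random`
#include &lt;time.h&gt;   // man time.h, info clock

int main (void) {
  // seed
  srandom(62);
  //printf ("test %g\n", (double)random()/(double)RAND_MAX);
  const size_t SIZE = 20E3;
  const size_t RUNS = 10;
  double (*b)[SIZE];

  // size_t values are printed with %zu
  printf ("Array size: %zux%zu, each %zu bytes. slice = %f KiB\n", SIZE, SIZE,
      sizeof(double), ((double)SIZE)*sizeof(double)/1024);

  b = malloc(sizeof *b * SIZE);
  if (b == NULL) { // 20k x 20k doubles need about 3.2 GB
    fprintf(stderr, "malloc failed\n");
    return 1;
  }
  //double a[SIZE][SIZE]; // too large!

  size_t i, j, run;
  for (i = 0; i &lt; SIZE; i++) {
    for (j = 0; j &lt; SIZE; j++) {
      b[i][j] = (double)random()/(double)RAND_MAX;
    }
  }

  double sum = 0;
  clock_t start, diff;
  long usec;
  for (run = 0; run &lt; RUNS; run++) {
    start = clock();
    for (i = 0; i &lt; SIZE; i++) {
      // column wise (slower?)
      sum += b[i][1];
      // row wise (faster?)
      //sum += b[1][i];
    }
    diff = clock() - start; // clock ticks of CPU time, not CPU cycles
    usec = (long)(((double) diff*1e6) / CLOCKS_PER_SEC); // https://stackoverflow.com/a/459704/543411
    printf("Run %zu taken %ld cycles. %ld ms %ld us\n", run, (long)diff, usec/1000, usec%1000);
  }
  printf("Sum: %g\n", sum);
  free(b);
  return 0;
}
</code></pre>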
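<p>The <code>first/us</code> and <code>converged/us</code> columns in the tables above can be read off the per-run output of the C program. Here is a small sketch of that step, using the column-wise <code>n=20k</code> log quoted in the answer. It assumes that "converged" means the time of the last run, which the answer does not state explicitly, and the function name <code>summarize</code> is made up for illustration.</p>

```python
import re

# Per-run output of the C program for the column-wise sum, n=20k (from the answer)
RUN_LOG = """\
Run 0 taken 552 cycles. 0 ms 552 us
Run 1 taken 358 cycles. 0 ms 358 us
Run 2 taken 291 cycles. 0 ms 291 us
Run 3 taken 264 cycles. 0 ms 264 us
Run 4 taken 252 cycles. 0 ms 252 us
Run 5 taken 275 cycles. 0 ms 275 us
Run 6 taken 262 cycles. 0 ms 262 us
Run 7 taken 249 cycles. 0 ms 249 us
Run 8 taken 249 cycles. 0 ms 249 us
Run 9 taken 246 cycles. 0 ms 246 us
"""


def summarize(log):
    """Return (first, converged) run times in microseconds.

    ASSUMPTION: 'converged' is taken to be the time of the last run.
    """
    times = []
    for match in re.finditer(r"(\d+) ms (\d+) us", log):
        ms, us = int(match.group(1)), int(match.group(2))
        times.append(ms * 1000 + us)
    return times[0], times[-1]


first, converged = summarize(RUN_LOG)
print(f"first: {first} us, converged: {converged} us, "
      f"speed-up: {first / converged:.1f}x")
```

<p>For this log the sketch gives 552 us and 246 us, which agrees with the 20k row of the column-wise table (550 / 246) after rounding and with the "factor 2" speed-up mentioned in the discussion.</p>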
