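<p>This seems to depend heavily on the array size and "sparseness" (probably due to the magic of hash tables).</p>
<p>The answer from <a href="https://stackoverflow.com/questions/8317022/get-intersecting-rows-across-two-2d-numpy-arrays">Get intersecting rows across two 2D numpy arrays</a> appears here as the <code>so_8317022</code> function.</p>
<p>The takeaways (on my machine) seem to be:</p>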
<ul>
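<li>The Pandas approach has the edge for large, sparse sets</li>
<li>Set intersection is very, very fast for small array sizes (though it returns a set, not a numpy array)</li>
<li>The other Numpy answer can be faster than set intersection for larger array sizes</li>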
</ul>
<pre><code>from collections import defaultdict
import numpy as np
import pandas as pd
import timeit
import matplotlib.pyplot as plt


def pandas_merge(a, b):
    # Inner merge on all columns keeps the rows present in both arrays.
    return pd.DataFrame(a).merge(pd.DataFrame(b)).to_numpy()


def set_intersection(a, b):
    # Returns a set of tuples, not a numpy array.
    return set(map(tuple, a.tolist())) & set(map(tuple, b.tolist()))


def so_8317022(a, b):
    # View each row as one structured element so intersect1d compares whole rows.
    nrows, ncols = a.shape
    dtype = {
        "names": ["f{}".format(i) for i in range(ncols)],
        "formats": ncols * [a.dtype],
    }
    C = np.intersect1d(a.view(dtype), b.view(dtype))
    return C.view(a.dtype).reshape(-1, ncols)


def test_fn(f, a, b):
    # autorange() returns (number of loops, total time); the ratio is ops/sec.
    number, time_taken = timeit.Timer(lambda: f(a, b)).autorange()
    return number / time_taken


def test(size, max_coord):
    a = np.random.default_rng().integers(0, max_coord, size=(size, 2))
    b = np.random.default_rng().integers(0, max_coord, size=(size, 2))
    return {fn.__name__: test_fn(fn, a, b) for fn in (pandas_merge, set_intersection, so_8317022)}


series = []
datas = defaultdict(list)
for size in (100, 1000, 10000, 100000):
    for max_coord in (50, 500, 5000):
        print(size, max_coord)
        series.append((size, max_coord))
        for fn, result in test(size, max_coord).items():
            datas[fn].append(result)
print("size", "sparseness", "func", "ops/sec")
for fn, values in datas.items():
    for (size, max_coord), value in zip(series, values):
        print(size, max_coord, fn, int(value))
</code></pre>
<p>The results on my machine are:</p>
<div class="s-table-container">
^{tb1}$
</div>
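<p>Since the set-based approach returns a set of tuples rather than a numpy array, here is a minimal sketch of one way to convert the result back. The helper name <code>set_intersection_as_array</code> is hypothetical (it is not part of the benchmark above), and the conversion adds overhead that the timings above do not include.</p>

```python
import numpy as np


# Hypothetical helper, not part of the benchmark: wraps the set-based
# intersection and converts the result back into a 2D numpy array.
def set_intersection_as_array(a, b):
    common = set(map(tuple, a.tolist())) & set(map(tuple, b.tolist()))
    if not common:
        # Keep the column count so the result is still 2D when empty.
        return np.empty((0, a.shape[1]), dtype=a.dtype)
    # Sorting the tuples gives a deterministic, lexicographic row order.
    return np.array(sorted(common), dtype=a.dtype)


a = np.array([[1, 2], [3, 4], [5, 6], [3, 4]])
b = np.array([[3, 4], [7, 8], [1, 2]])
result = set_intersection_as_array(a, b)
print(result)
```

<p>Sorting is optional; it is only there to make the row order reproducible, since sets are unordered.</p>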