<p>Tucked away in a backwater of the <code>numpy</code> code there is a simple solution: <code>recfunctions.join_by</code>.</p>
<pre><code>import numpy as np
A=[('2015', '1', '1', 0.0, 'G06', 46.29),
('2015', '1', '1', 0.0, 'G12', 444.344),
('2015', '1', '1', 0.0, 'G14', -99.269),
('2015', '1', '1', 0.0, 'G20', 6.874),
('2015', '1', '1', 0.0, 'G24', 158.488),
('2015', '1', '1', 0.0, 'G25', -60.831),
('2015', '1', '1', 0.0, 'G31', -48.234),
('2015', '1', '1', 0.0, 'R07', -6.243)]
B=[('2015', '1', '1', 0.0, 'G06', '0.000'),
('2015', '1', '1', 0.0, 'G12', '0.000'),
('2015', '1', '1', 0.0, 'G14', '0.000'),
('2015', '1', '1', 0.0, 'G24', '0.000'),
('2015', '1', '1', 0.0, 'G25', '0.000'),
('2015', '1', '1', 0.0, 'G29', '0.000'),
('2015', '1', '1', 0.0, 'G31', '0.000')]
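# Fields a-e form the join key; f holds the value carried by each input
dt=[('a', 'S4'), ('b', 'S1'), ('c', 'S1'), ('d',float), ('e', 'S3'), ('f',float)]
aA=np.array(A,dt)
aB=np.array(B,dt)
flds=list('abcde')  # names of the key fields
from numpy.lib import recfunctions
# Inner join on the key fields; usemask=False returns a plain ndarray instead of a masked array
mrgd = recfunctions.join_by(flds, aA, aB, usemask=False)
print(mrgd)
print(mrgd.dtype)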
</code></pre>
<p>produces</p>
<pre><code>[('2015', '1', '1', 0.0, 'G06', 46.29, 0.0)
('2015', '1', '1', 0.0, 'G12', 444.344, 0.0)
('2015', '1', '1', 0.0, 'G14', -99.269, 0.0)
('2015', '1', '1', 0.0, 'G24', 158.488, 0.0)
('2015', '1', '1', 0.0, 'G25', -60.831, 0.0)
('2015', '1', '1', 0.0, 'G31', -48.234, 0.0)]
[('a', 'S4'), ('b', 'S1'), ('c', 'S1'), ('d', '<f8'), ('e', 'S3'), ('f1', '<f8'), ('f2', '<f8')]
</code></pre>
<p>With the current organization of <code>numpy</code>, <code>recfunctions</code> has to be imported separately.
<a href="https://stackoverflow.com/a/33680606/901925">https://stackoverflow.com/a/33680606/901925</a></p>
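<p>The output dtype above shows that the colliding non-key field <code>f</code> is renamed to <code>f1</code> and <code>f2</code>. As a minimal sketch, assuming a much smaller made-up dataset with a single key field, the <code>r1postfix</code> and <code>r2postfix</code> arguments of <code>join_by</code> can be used to choose clearer names. The arrays <code>left</code> and <code>right</code> and the postfixes <code>_A</code>/<code>_B</code> are illustrative only and are not part of the original data.</p>

```python
import numpy as np
from numpy.lib import recfunctions

# Hypothetical toy data: 'e' is the key, 'f' is a value present in both inputs
dt = [('e', 'U3'), ('f', float)]
left = np.array([('G06', 46.29), ('G12', 444.344), ('R07', -6.243)], dtype=dt)
right = np.array([('G06', 0.5), ('G12', 1.5), ('G29', 2.5)], dtype=dt)

# Inner join on 'e'; the colliding field 'f' gets the given postfixes
joined = recfunctions.join_by('e', left, right, jointype='inner',
                              r1postfix='_A', r2postfix='_B', usemask=False)

print(joined.dtype.names)
print(joined)
```

<p>Only the keys present in both inputs (<code>G06</code> and <code>G12</code> here) survive an inner join; <code>jointype</code> also accepts <code>'outer'</code> and <code>'leftouter'</code>.</p>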
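<p>We would have to read the code to see how it is actually implemented. Without further timings, I don't know how its speed compares with the equivalent <code>pandas</code> operation.</p>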
<hr/>
<p>For this small sample, <code>recfunctions</code> is faster than <code>pandas</code>, especially if the time needed to create the dataframes is included.</p>
<pre><code>In [302]: %%timeit
.....: a = pd.DataFrame(A)
.....: b = pd.DataFrame(B)
.....: c = pd.merge(a, b, 'inner', left_on=[0,1,2,3,4], right_on=[0,1,2,3,4])
.....:
100 loops, best of 3: 8.01 ms per loop
In [303]: %%timeit
.....: aA=np.array(A,dt)
.....: aB=np.array(B,dt)
.....: aC=recfunctions.join_by(flds, aA, aB,usemask=False)
.....:
100 loops, best of 3: 3.35 ms per loop
</code></pre>
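<p>Both are slow compared with a <code>numpy</code> set operation such as <code>in1d</code> or <code>intersect1d</code>, which does not attempt to merge anything:</p>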
<pre><code>In [308]: timeit np.intersect1d(aA[flds],aB[flds])
1000 loops, best of 3: 326 µs per loop
</code></pre>
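<p>A set operation only finds the common keys, but it can serve as the first step of a hand-rolled merge. The following is a rough sketch, not how <code>join_by</code> is implemented: it assumes a single key field with unique values in each input, and the arrays <code>left</code>, <code>right</code> and the field name <code>g</code> are made up for illustration. It relies on the <code>return_indices</code> option of <code>np.intersect1d</code>.</p>

```python
import numpy as np

# Hypothetical toy data; the key 'e' is assumed unique within each array
left = np.array([('G06', 46.29), ('G12', 444.344), ('R07', -6.243)],
                dtype=[('e', 'U3'), ('f', float)])
right = np.array([('G12', 1.5), ('G29', 2.5), ('G06', 0.5)],
                 dtype=[('e', 'U3'), ('g', float)])

# Sorted common keys plus the positions where they occur in each input
common, i_left, i_right = np.intersect1d(left['e'], right['e'],
                                         return_indices=True)

# Assemble the merged structured array field by field
merged = np.empty(common.shape, dtype=[('e', 'U3'), ('f', float), ('g', float)])
merged['e'] = common
merged['f'] = left['f'][i_left]
merged['g'] = right['g'][i_right]

print(merged)
```

<p>With repeated keys, <code>return_indices</code> reports only the first occurrence of each common value, so this sketch would silently drop the other rows.</p>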