<p>For a speed-up, use pure <a href="https://numpy.org/doc/stable/user/basics.broadcasting.html" rel="nofollow noreferrer"><code>numpy broadcasting</code></a>:</p>
<pre><code># convert to a NumPy array first: [:, None] on a Series fails in recent pandas
diffs = np.not_equal(df.filter(like='rater'), df['right_answer'].to_numpy()[:, None])
diffs = np.sum(diffs, axis=1) >= 2
df[diffs]
   right_answer  rater1  rater2  rater3  item
1             1       1       2       2   S02
2             2       1       2       1   S03
</code></pre>
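<p>To make the broadcasting step easier to follow, here is a minimal, self-contained sketch. The sample frame is an assumption: rows 1 and 2 match the output above, while rows 0 and 3 and the names <code>raters</code>, <code>answers</code>, <code>mismatch</code> and <code>n_wrong</code> are made up for illustration.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows 1 and 2 match the output shown above,
# rows 0 and 3 are invented so that the example is runnable.
df = pd.DataFrame({
    "right_answer": [1, 1, 2, 3],
    "rater1": [1, 1, 1, 3],
    "rater2": [1, 2, 2, 3],
    "rater3": [1, 2, 1, 2],
    "item": ["S01", "S02", "S03", "S04"],
})

raters = df.filter(like="rater").to_numpy()        # shape (4, 3)
answers = df["right_answer"].to_numpy()[:, None]   # shape (4, 1)

# Comparing (4, 3) with (4, 1): the single answer column is stretched
# across the three rater columns, giving a (4, 3) boolean array.
mismatch = raters != answers

# Count the disagreements per row and keep rows with at least two.
n_wrong = mismatch.sum(axis=1)
result = df[n_wrong >= 2]
print(result)
```

<p>The <code>[:, None]</code> index turns the 1-D answers into a column, which is what lets NumPy line each answer up against every rater in the same row.</p>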
<hr/>
<p><strong>Let's time it</strong></p>
<pre><code># create dataframe with 4 million rows
dfbig = pd.concat([df]*1000000, ignore_index=True)
dfbig.shape
# (4000000, 5)
</code></pre>
<pre><code>def numpy_broadcasting(data):
    # compare every rater column against the answer column via broadcasting
    diffs = np.not_equal(data.filter(like='rater'), data['right_answer'].to_numpy()[:, None])
    diffs = np.sum(diffs, axis=1) >= 2
    return diffs

def pandas_method(data):
    # same comparison using pandas methods only
    mask = (
        data.filter(like='rater')
        .ne(data['right_answer'], axis=0).sum(axis=1).ge(2)
    )
    return mask

%%timeit
numpy_broadcasting(dfbig)
# 92.5 ms ± 789 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
pandas_method(dfbig)
# 296 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>
<p><code>numpy broadcasting</code> is about <code>296 / 92.5 = 3.2</code> times faster.</p>
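<p><code>%%timeit</code> only works in IPython or Jupyter. Outside a notebook, the comparison could be sketched with the standard-library <code>timeit</code> module, as below. The sample frame, the smaller row count (40,000 rather than 4 million) and the repeat settings are assumptions chosen to keep the run short, so the measured ratio will likely differ from the figures above.</p>

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data, repeated to 40,000 rows to keep the run short.
df = pd.DataFrame({
    "right_answer": [1, 1, 2, 3],
    "rater1": [1, 1, 1, 3],
    "rater2": [1, 2, 2, 3],
    "rater3": [1, 2, 1, 2],
    "item": ["S01", "S02", "S03", "S04"],
})
dfbig = pd.concat([df] * 10_000, ignore_index=True)


def numpy_broadcasting(data):
    raters = data.filter(like="rater").to_numpy()
    answers = data["right_answer"].to_numpy()[:, None]
    return (raters != answers).sum(axis=1) >= 2


def pandas_method(data):
    return (
        data.filter(like="rater")
        .ne(data["right_answer"], axis=0)
        .sum(axis=1)
        .ge(2)
    )


# Check that both approaches select the same rows before comparing speed.
same = np.array_equal(numpy_broadcasting(dfbig), pandas_method(dfbig).to_numpy())

# Best of 3 repeats, 5 calls each, reported as seconds per call.
t_numpy = min(timeit.repeat(lambda: numpy_broadcasting(dfbig), number=5, repeat=3)) / 5
t_pandas = min(timeit.repeat(lambda: pandas_method(dfbig), number=5, repeat=3)) / 5

print(f"numpy:  {t_numpy * 1e3:.2f} ms per call")
print(f"pandas: {t_pandas * 1e3:.2f} ms per call")
print(f"ratio:  {t_pandas / t_numpy:.1f}x")
```

<p>Taking the minimum over the repeats is a common way to reduce noise from other processes; the ratio is only meaningful once both functions have been confirmed to return the same mask.</p>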