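<p>Looking up values in the index is faster than looking them up in columns. I don't know the implementation details (the lookup time appears to depend on the number of rows). Here is a performance comparison:</p>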
<pre><code>import numpy as np
import pandas as pd

def test_value_matches(df, v1, v2):
    # Return True if some row has c1 == v1 and c2 == v2, else False.
    # Reduce the boolean mask with .any(); the built-in any() on a DataFrame
    # iterates over column labels and is therefore always truthy here.
    return bool(((df.c1 == v1) & (df.c2 == v2)).any())

def test_index_matches(df, v1, v2):
    # Return True if (v1, v2) is found in the (multi-)index, else False.
    return (v1, v2) in df.index

# Test how the functions above depend on the number of rows in df
# (%timeit is an IPython magic, so run this in IPython or Jupyter):
for n in [int(j) for j in [1e4, 1e5, 1e6, 1e7]]:
    df = pd.DataFrame(np.random.random(size=(n, 2)), columns=["c1", "c2"])
    v1, v2 = df.sample(n=1).iloc[0]
    %timeit test_value_matches(df, v1, v2)
    # create an index based on column values:
    df2 = df.set_index(["c1", "c2"])
    %timeit test_index_matches(df2, v1, v2)
</code></pre>
<p>Output:</p>
<pre><code>421 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
10.5 µs ± 175 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
557 µs ± 5.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
10.3 µs ± 143 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.77 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.5 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
22.4 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
28.1 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>
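<p>Note that this ignores the time needed to build the index itself, which may be significant; this approach probably works best when you do repeated lookups on the same df. For <code>n=1e7</code>, the column-value lookup on my machine performs about like the problem you describe; the indexed version is roughly 1000x faster (although its lookup time also appears to grow with <code>n</code>).</p>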
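<p>To check whether building the index pays off for a given workload, the build cost can be timed alongside the lookups. The following is only a rough sketch, not a careful benchmark: <code>compare_lookup_costs</code> and its parameters are hypothetical names, it uses <code>time.perf_counter</code> instead of <code>%timeit</code>, and it counts the first index lookup as part of the build time on the assumption that pandas may set up its lookup structures lazily on first use.</p>

```python
import time

import numpy as np
import pandas as pd


def compare_lookup_costs(n_rows, n_lookups, seed=0):
    """Time n_lookups membership tests via column scans and via a MultiIndex.

    Hypothetical helper for illustration; returns a dict of timings in seconds.
    """
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.random(size=(n_rows, 2)), columns=["c1", "c2"])

    # Pick (c1, c2) pairs that are known to exist in df.
    rows = rng.integers(0, n_rows, size=n_lookups)
    pairs = list(zip(df["c1"].to_numpy()[rows], df["c2"].to_numpy()[rows]))

    # Column-value lookups: each one scans both columns.
    start = time.perf_counter()
    col_hits = [bool(((df.c1 == a) & (df.c2 == b)).any()) for a, b in pairs]
    column_seconds = time.perf_counter() - start

    # Index build, including one warm-up lookup (assumed to trigger any
    # lazily built lookup structures).
    start = time.perf_counter()
    df2 = df.set_index(["c1", "c2"])
    _ = pairs[0] in df2.index
    build_seconds = time.perf_counter() - start

    # Index lookups on the already-built index.
    start = time.perf_counter()
    idx_hits = [pair in df2.index for pair in pairs]
    index_seconds = time.perf_counter() - start

    return {
        "column_seconds": column_seconds,
        "build_seconds": build_seconds,
        "index_seconds": index_seconds,
        "all_found": all(col_hits) and all(idx_hits),
    }


if __name__ == "__main__":
    print(compare_lookup_costs(n_rows=100_000, n_lookups=20))
```

<p>If <code>column_seconds</code> is smaller than <code>build_seconds + index_seconds</code> for your number of lookups, building the index is probably not worth it for that workload.</p>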