Fastest way to look up index/column value pairs

            Type  ColA      ColB  ColC  ColD    ColE  ColF
2021-01-19     B  83.0   -122.15   0.0  11.0  11.000  11.0
2021-01-19     D  83.0  -1495.48   0.0  11.0  11.000  11.0
2021-03-25     D  83.0    432.00   0.0  11.0  11.000  11.0
2021-04-14     D  83.0    646.00   0.0  11.0  11.000  11.0
2021-04-16     A  20.0     11.00   0.0  30.0  11.000  11.0
2021-04-25     D  83.0    -26.82   0.0  11.0  11.000  11.0
2021-04-28     B  83.0   -651.00   0.0  11.0  11.000  11.0

2 answers

User

Answer #1 · edited 2024-10-01 05:01:38

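Looking up values in an index is faster than looking them up in columns. I don't know the implementation details (the lookup time appears to depend on the number of rows). Here is a performance comparison: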

import numpy as np
import pandas as pd

def test_value_matches(df, v1, v2):
    # Return True if some row has c1 == v1 and c2 == v2, else False.
    # (any(df[mask]) would iterate over the column labels, so it is True
    # even when no row matches; reduce the boolean mask instead.)
    return bool(((df.c1 == v1) & (df.c2 == v2)).any())

def test_index_matches(df, v1, v2):
    # Return True if (v1, v2) is found in the (Multi)Index, else False.
    return (v1, v2) in df.index

# Test how the functions above depend on the number of rows in df
# (%timeit is an IPython/Jupyter magic):
for n in [10**4, 10**5, 10**6, 10**7]:
    df = pd.DataFrame(np.random.random(size=(n, 2)), columns=["c1", "c2"])
    v1, v2 = df.sample(n=1).iloc[0]
    %timeit test_value_matches(df, v1, v2)

    # Create an index based on the column values:
    df2 = df.set_index(["c1", "c2"])
    %timeit test_index_matches(df2, v1, v2)

Output

421 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
10.5 µs ± 175 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

557 µs ± 5.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
10.3 µs ± 143 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

3.77 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.5 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

22.4 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
28.1 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

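Note that this ignores the time spent building the index, which can be significant; this approach probably pays off most when you do repeated lookups against the same df. For n=1e7, the performance on my machine is roughly comparable to the problem you describe; the indexed version is about 1000x faster (although the gap evidently grows with n).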
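To make the caveat about index-building time concrete, here is a minimal sketch that times the one-off set_index call (plus the first lookup) separately from the per-lookup cost, using plain timeit instead of the %timeit magic. The size n, the seed, and the helper names value_matches, index_matches and build_indexed are assumptions made for illustration; the numbers will differ from machine to machine.

```python
import time
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # hypothetical size, chosen only to keep the run short
df = pd.DataFrame(rng.random(size=(n, 2)), columns=["c1", "c2"])
v1, v2 = df.iloc[0]  # a pair that is known to be present


def value_matches(frame, a, b):
    # Boolean-mask scan over both columns.
    return bool(((frame["c1"] == a) & (frame["c2"] == b)).any())


def index_matches(indexed, a, b):
    # Membership test against the MultiIndex.
    return (a, b) in indexed.index


def build_indexed(frame):
    # Build the MultiIndex and do one lookup, so that any setup work
    # triggered by the first lookup is counted as part of the build cost.
    indexed = frame.set_index(["c1", "c2"])
    index_matches(indexed, v1, v2)
    return indexed


start = time.perf_counter()
df2 = build_indexed(df)
build_s = time.perf_counter() - start

reps = 20
scan_s = timeit.timeit(lambda: value_matches(df, v1, v2), number=reps) / reps
lookup_s = timeit.timeit(lambda: index_matches(df2, v1, v2), number=reps) / reps

print(f"build index: {build_s:.6f} s (one-off)")
print(f"column scan: {scan_s:.6f} s per lookup")
print(f"index check: {lookup_s:.6f} s per lookup")
if scan_s > lookup_s:
    # Number of lookups after which building the index has paid for itself.
    print(f"break-even after about {build_s / (scan_s - lookup_s):.1f} lookups")
```

If only a handful of lookups are needed, the plain column scan may well be cheaper overall; the indexed version wins once the build cost is spread over many lookups against the same frame.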

User

Answer #2 · edited 2024-10-01 05:01:38

Try using Index.difference with a MultiIndex:

It looks like compiledData already has a date index, so append Type to the index:
```
compiledData = compiledData.set_index('Type', append=True)
```
It looks like newData has Date as a regular column, so set its index to ['Date', 'Type']:
```
newData = newData.set_index(['Date', 'Type'])
```
Now that both have a Date/Type MultiIndex, use their index difference to get the index entries that appear only in newData:
```
unique = newData.index.difference(compiledData.index)
```

So you can add the newData.loc[unique] rows with DataFrame.append (deprecated in pandas 1.4 and removed in pandas 2.0):

compiledData.append(newData.loc[unique]).reset_index(level=1)

Or with pd.concat:

pd.concat([compiledData, newData.loc[unique]]).reset_index(level=1)
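The steps above can be put together in a small end-to-end sketch. The two frames below are hypothetical stand-ins built from a few rows of the question's sample data (only ColA and ColB are kept), and the intermediate names compiled_idx and new_idx are introduced here for illustration. It uses pd.concat, since DataFrame.append is not available in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical stand-in for compiledData: dates in the index, Type as a column.
compiledData = pd.DataFrame(
    {"Type": ["B", "D"], "ColA": [83.0, 83.0], "ColB": [-122.15, -1495.48]},
    index=pd.to_datetime(["2021-01-19", "2021-01-19"]).rename("Date"),
)

# Hypothetical stand-in for newData: Date is a regular column.
# The first row duplicates an existing (Date, Type) pair; the second is new.
newData = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-01-19", "2021-03-25"]),
        "Type": ["D", "D"],
        "ColA": [83.0, 83.0],
        "ColB": [-1495.48, 432.00],
    }
)

# Give both frames the same (Date, Type) MultiIndex.
compiled_idx = compiledData.set_index("Type", append=True)
new_idx = newData.set_index(["Date", "Type"])

# Index entries that exist in newData but not in compiledData.
unique = new_idx.index.difference(compiled_idx.index)

# Append only the new rows, then move Type back to a regular column.
result = pd.concat([compiled_idx, new_idx.loc[unique]]).reset_index(level=1)
print(result)
```

Only the 2021-03-25 row is added here, because the (2021-01-19, D) pair is already present in compiledData.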
