Is a pandas DataFrame search linear time or constant time?

Posted 2024-10-01 17:28:32


I have a DataFrame object df with more than 15,000 rows, such as:

anime_id          name              genre    rating
1234      Kimi no nawa    Romance, Comedy     9.31
5678       Stiens;Gate             Sci-fi     8.92

I am trying to find the row with a particular anime_id:

[code snippet missing from the source]

I just want to know whether this search runs in constant time (like a dictionary lookup) or in linear time (like scanning a list).


3 answers

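I can't tell you how it is implemented, but after running a small test, a DataFrame boolean mask looks linear: it takes about as long as scanning every key in a Python list comprehension.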

>>> timeit.timeit('dict_data[key]',setup=setup,number = 10000)
0.0005770014540757984
>>> timeit.timeit('df[df.val==key]',setup=setup,number = 10000)
17.583375428628642
>>> timeit.timeit('[i == key for i in dict_data ]',setup=setup,number = 10000)
16.613936403242406
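
The session above depends on a `setup` string that is not shown. As a rough, self-contained sketch of the same comparison, the script below times a dict lookup and a boolean-mask lookup at several sizes. The helper name `time_lookups`, the column name `val`, and the chosen sizes are assumptions for illustration, not the original setup. If the mask is linear, its time per call should grow with the number of rows while the dict time stays roughly flat.

```python
import timeit

import numpy as np
import pandas as pd


def time_lookups(n_rows, number=100):
    """Return (dict_seconds, mask_seconds) per call for n_rows of data."""
    keys = np.arange(n_rows)
    df = pd.DataFrame({"val": keys})
    dict_data = {int(k): int(k) for k in keys}
    key = n_rows // 2  # a key that is known to exist

    # Names visible to the timed statements.
    env = {"df": df, "dict_data": dict_data, "key": key}
    t_dict = timeit.timeit("dict_data[key]", globals=env, number=number)
    t_mask = timeit.timeit("df[df.val == key]", globals=env, number=number)
    return t_dict / number, t_mask / number


for n in (10_000, 100_000, 1_000_000):
    t_dict, t_mask = time_lookups(n)
    print(f"n={n:>9,}  dict: {t_dict:.2e} s  mask: {t_mask:.2e} s")
```

Absolute numbers will vary by machine and pandas version; only the trend across sizes is meaningful.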

This is a very interesting question!

I think it depends on the following aspects:

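Accessing a single row by index (index is sorted and unique) should have runtime O(m), where m << n_rows

Accessing a single row by index (index is not unique and not sorted) should have runtime O(n_rows)

Accessing a single row by index (index is not unique, but is sorted) should have runtime O(m), where m < n_rows

Accessing row(s) by boolean indexing (independent of the index) should have runtime O(n_rows)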
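
To see which of these cases a given DataFrame falls into, you can inspect the index directly. The sketch below builds one frame per case and prints `is_unique` and `is_monotonic_increasing` for each; the variable names, seed, and sizes are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000
values = rng.random((n_rows, 2))

# Case 1: unique and sorted index (the default RangeIndex).
unique_sorted = pd.DataFrame(values, columns=["a", "b"])

# Case 2: non-unique, unsorted index (labels drawn with repetition).
labels = rng.integers(0, 10_000, n_rows)
dup_unsorted = pd.DataFrame(values, columns=["a", "b"], index=labels)

# Case 3: the same non-unique labels, but sorted.
dup_sorted = dup_unsorted.sort_index()

for name, frame in [
    ("unique, sorted", unique_sorted),
    ("non-unique, unsorted", dup_unsorted),
    ("non-unique, sorted", dup_sorted),
]:
    idx = frame.index
    print(f"{name:22s} is_unique={idx.is_unique} "
          f"is_monotonic_increasing={idx.is_monotonic_increasing}")

label = int(labels[0])  # a label known to exist in all three frames

row = unique_sorted.loc[label]             # one row, returned as a Series
rows_unsorted = dup_unsorted.loc[[label]]  # every row carrying this label
rows_sorted = dup_sorted.loc[[label]]      # same rows, found in a sorted index

# Case 4: a boolean mask is evaluated over the whole column, whatever the index.
mask_rows = unique_sorted[unique_sorted["a"] > 0.9]
```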


Demo:

Index is sorted and unique:

In [49]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'))

In [50]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 27.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 331 µs per loop

In [51]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 275 µs per loop

In [52]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.84 ms per loop

In [53]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.96 ms per loop

Index is not sorted and not unique:

[code snippet missing from the source]

Index is not unique, but sorted:

In [64]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5)).sort_index()

In [65]: df.index.is_monotonic_increasing
Out[65]: True

In [66]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 478 µs per loop

In [67]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop

In [68]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.81 ms per loop

In [69]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.95 ms per loop

Note that even when your index is unique, iloc is still about two orders of magnitude slower than a hashmap:

df = pd.DataFrame(np.random.randint(0, 10**7, 10**5), columns=['a'])
%timeit df.iloc[random.randint(0, 10**5 - 1)]
10000 loops, best of 3: 51.5 µs per loop

s = set(np.random.randint(0, 10**7, 10**5))
%timeit random.randint(0,10**7) in s
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 615 ns per loop
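
If you need many repeated lookups by key, one option is to build a plain dict once and query that instead. The sketch below reuses the sample rows from the question and assumes that `anime_id` values are unique; the names `genre_by_id` and `indexed` are hypothetical. It also shows the `set_index` plus `.loc` route, which stays inside pandas.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "anime_id": [1234, 5678],
    "name": ["Kimi no nawa", "Stiens;Gate"],
    "genre": ["Romance, Comedy", "Sci-fi"],
    "rating": [9.31, 8.92],
})

# Assumption: anime_id is unique, so each key maps to exactly one genre.
# Building the dict is a one-time pass over the data; each lookup afterwards
# is an ordinary dict lookup.
genre_by_id = dict(zip(df["anime_id"].tolist(), df["genre"].tolist()))
genre = genre_by_id[5678]

# Staying in pandas: make anime_id the index and look rows up by label.
indexed = df.set_index("anime_id")
row = indexed.loc[5678]
```

The dict only pays off when the same table is queried many times, since building it touches every row once.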
