Querying pandas data efficiently

2 Answers

User

Answer #1 · edited 2024-09-29 06:32:29

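Columns A through D can be converted to the category dtype, because their values are non-unique and drawn from a limited set.

The example below is based on the df you provided in the OP.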

# Original data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A                10 non-null object
B                10 non-null object
C                10 non-null object
D                10 non-null object
important_col    10 non-null int64
dtypes: int64(1), object(4)
memory usage: 480.0+ bytes

# Convert to category
df['A'] = df.A.astype('category')
df['B'] = df.B.astype('category')
df['C'] = df.C.astype('category')
df['D'] = df.D.astype('category')

# Modified data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A                10 non-null category
B                10 non-null category
C                10 non-null category
D                10 non-null category
important_col    10 non-null int64
dtypes: category(4), int64(1)
memory usage: 360.0 bytes

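You should see a benefit in memory usage (the values are replaced by integers and mapped through a small lookup table) as well as in selection speed (a lookup on integer values is faster than the same lookup on string values).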
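To make the "integers plus a small lookup table" point concrete, here is a minimal sketch. The frame below is a hypothetical stand-in for the OP's df (the column names follow the `df.info()` output above, the values are made up), and `deep=True` is used so that the string objects themselves are counted.

```python
import pandas as pd

# Hypothetical stand-in for the OP's df: four string columns and one integer column.
df = pd.DataFrame({
    "A": ["A1", "A2", "A3", "A1", "A4"] * 2000,
    "B": ["BA1", "BA2", "BA3", "BA1", "BA4"] * 2000,
    "C": ["CA1", "CA2", "CA3", "CA1", "CA4"] * 2000,
    "D": ["D1", "D2", "D3", "D1", "D4"] * 2000,
    "important_col": [900, 900, 500, 700, 300] * 2000,
})

# deep=True measures the stored strings, not only the references to them.
bytes_before = df.memory_usage(deep=True).sum()

# Convert all four string columns in one call; astype returns a new frame.
cat_cols = ["A", "B", "C", "D"]
df_cat = df.astype({col: "category" for col in cat_cols})
bytes_after = df_cat.memory_usage(deep=True).sum()

# The lookup table (categories) and the small integer codes that replace the strings.
categories = list(df_cat["A"].cat.categories)
codes = df_cat["A"].cat.codes

print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
print("lookup table for A:", categories)
print("codes dtype for A:", codes.dtype)
```

With only four distinct values per column, the codes fit in the smallest integer type, which is where most of the saving comes from.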

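Update

I created a Jupyter notebook to show the improvement from converting the columns to categories.

Using a sample of 1,000,000 rows (with the same structure as defined in the OP) and the example queries from the OP, memory usage improved significantly: the size dropped from 232.7 MB to 11.4 MB (a reduction of about 95%).

In addition, the example queries show a speed advantage: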

Query 1: 83% improvement (57 ms -> 9.36 ms)
Query 2: 91% improvement (80.9 ms -> 6.97 ms)
Query 3: 92% improvement (119 ms -> 9.37 ms)

I ran the same tests with a sample of 8 million rows and saw similar improvements in both speed and resource usage.
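If you want to reproduce this kind of comparison yourself, something along these lines should work. It is only a sketch, not the notebook mentioned above: the random data, the row count `n`, and the helper `query` are assumptions made for illustration, and the timings will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # assumption: fewer rows than the 1,000,000 above, to keep the run short

# Hypothetical data with the same column layout as the OP's df.
df_obj = pd.DataFrame({
    "A": rng.choice(["A1", "A2", "A3", "A4"], size=n),
    "B": rng.choice(["BA1", "BA2", "BA3", "BA4"], size=n),
    "C": rng.choice(["CA1", "CA2", "CA3", "CA4"], size=n),
    "D": rng.choice(["D1", "D2", "D3", "D4"], size=n),
    "important_col": rng.integers(100, 1000, size=n),
})
df_cat = df_obj.astype({col: "category" for col in ["A", "B", "C", "D"]})


def query(frame):
    """Mean of important_col for the rows matching three equality filters."""
    mask = (frame["A"] == "A4") & (frame["C"] == "CA4") & (frame["D"] == "D4")
    return frame.loc[mask, "important_col"].mean()


# Best of three repeats, five calls each, for the string and the category frame.
t_obj = min(timeit.repeat(lambda: query(df_obj), number=5, repeat=3))
t_cat = min(timeit.repeat(lambda: query(df_cat), number=5, repeat=3))

print(f"string columns:   {t_obj:.4f} s for 5 runs")
print(f"category columns: {t_cat:.4f} s for 5 runs")
```

Both frames return the same mean, so only the time taken differs.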

User

Answer #2 · edited 2024-09-29 06:32:29

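@Kristof's answer is a good starting point. I saw less than a 2x speedup with that suggestion. For large DataFrames, another thing to keep in mind is the order in which you apply expressions (for example, do you need to create a new DataFrame in order to select a Series, or can you produce the new Series directly?). When you don't need the rich pandas methods, you can also work directly with NumPy types.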

Extending your example:

In [58]: # DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame
In [59]: df_big = pd.concat([df] * 1000)
In [61]: len(df_big)
Out[61]: 10000

In [62]: dfr = df_big.to_records()

In [63]: dfr
Out[63]: 
rec.array([(0, 'A1', 'BA1', 'CA1', 'D1', 900), (1, 'A2', 'BA2', 'CA2', 'D2', 900),
 (2, 'A3', 'BA3', 'CA3', 'D3', 500), ...,
 (7, 'A1', 'BA1', 'CA1', 'D1', 700), (8, 'A4', 'BA4', 'CA4', 'D4', 300),
 (9, 'A4', 'BA4', 'CA4', 'D4', 500)], 
          dtype=[('index', '<i8'), ('A', '|O'), ('B', '|O'), ('C', '|O'), ('D', '|O'), ('important_col', '<i8')])


In [71]: %timeit df_big[(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')]['important_col'].mean() 
100 loops, best of 3: 2.91 ms per loop

In [72]: %timeit df_big['important_col'][(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')].mean()
100 loops, best of 3: 2.46 ms per loop

In [73]: df_big[(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')]['important_col'].mean()

In [74]: %timeit dfr['important_col'][(dfr['A']== 'A4') & (dfr['C'] == 'CA4') & (dfr['D'] == 'D4')].mean()
1000 loops, best of 3: 877 µs per loop
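For anyone who wants to run the three variants outside of my session, here is a self-contained sketch. The small `df` is a made-up stand-in (not the OP's exact data), and the variable names for the three results are only for illustration. It checks that filtering the frame first, selecting the Series first, and going through the NumPy record array all give the same answer.

```python
import pandas as pd

# Hypothetical stand-in for the OP's df; the values are made up for illustration.
df = pd.DataFrame({
    "A": ["A1", "A2", "A3", "A1", "A4", "A4"],
    "B": ["BA1", "BA2", "BA3", "BA1", "BA4", "BA4"],
    "C": ["CA1", "CA2", "CA3", "CA1", "CA4", "CA4"],
    "D": ["D1", "D2", "D3", "D1", "D4", "D4"],
    "important_col": [900, 900, 500, 700, 300, 500],
})
df_big = pd.concat([df] * 1000)

mask = (df_big["A"] == "A4") & (df_big["C"] == "CA4") & (df_big["D"] == "D4")

# Variant 1: filter the whole frame, then pick the column (builds an intermediate frame).
mean_frame_first = df_big[mask]["important_col"].mean()

# Variant 2: pick the column first, then filter it (only a Series is built).
mean_series_first = df_big["important_col"][mask].mean()

# Variant 3: the same filter on a NumPy record array, bypassing pandas indexing.
dfr = df_big.to_records()
mask_rec = (dfr["A"] == "A4") & (dfr["C"] == "CA4") & (dfr["D"] == "D4")
mean_records = dfr["important_col"][mask_rec].mean()

print(mean_frame_first, mean_series_first, mean_records)
```

Wrapping each of the three expressions in `%timeit` (or `timeit.repeat`) then gives a comparison like the one in my session.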
