Answers to the question: Efficiently querying Pandas data



1 answer

Anonymous, 1 day ago

Skills: python, mysql, java

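Columns A through D can be converted to categoricals, because their values are non-unique and drawn from a limited set. The example below is based on the df you provided in your question.
<pre><code># Original data frame
df.info()

&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A                10 non-null object
B                10 non-null object
C                10 non-null object
D                10 non-null object
important_col    10 non-null int64
dtypes: int64(1), object(4)
memory usage: 480.0+ bytes

# Convert to category
df['A'] = df.A.astype('category')
df['B'] = df.B.astype('category')
df['C'] = df.C.astype('category')
df['D'] = df.D.astype('category')

# Modified data frame
df.info()

&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A                10 non-null category
B                10 non-null category
C                10 non-null category
D                10 non-null category
important_col    10 non-null int64
dtypes: category(4), int64(1)
memory usage: 360.0 bytes
</code></pre>
You should see a benefit in memory usage (the values are replaced by integers and mapped through a small lookup table) as well as in selection speed (a lookup on integer values is faster than the same lookup on string values).
<h2>Update</h2>
I created a <a href="https://gist.github.com/kspeeckaert/957c966a5332fc6bc544617c2efad4e9" rel="nofollow">Jupyter notebook</a> to demonstrate the improvement from converting the columns to categoricals.
With a sample of 1,000,000 rows (same structure as defined by the OP) and the example queries provided in the OP, memory usage improved substantially: it dropped from 232.7 MB to 11.4 MB (a 95% reduction).
The example queries also show a speed benefit:
<ul>
<li>Query 1: 83% improvement (57 ms to 9.36 ms)</li>
<li>Query 2: 91% improvement (80.9 ms to 6.97 ms)</li>
<li>Query 3: 92% improvement (119 ms to 9.37 ms)</li>
</ul>
I ran the same test with a sample of 8 million rows, which showed similar improvements in speed and resource usage.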
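To try the idea without the OP's data, here is a minimal sketch on synthetic data. The row count, the four string values, and the filter are assumptions made up for illustration; they are not the OP's data or queries, and the exact savings will depend on how many distinct values each column holds.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the OP's frame (values and size are hypothetical).
rng = np.random.default_rng(0)
n_rows = 100_000
choices = ["alpha", "beta", "gamma", "delta"]
cat_cols = ["A", "B", "C", "D"]

# Build the string columns as plain object dtype, like the original df.
df = pd.DataFrame(
    {col: rng.choice(choices, size=n_rows) for col in cat_cols},
    dtype=object,
)
df["important_col"] = rng.integers(0, 100, size=n_rows)

# deep=True counts the Python string objects, not just the pointers.
before = df.memory_usage(deep=True).sum()

# Convert all four columns in one call; astype returns a new frame.
df_cat = df.astype({col: "category" for col in cat_cols})
after = df_cat.memory_usage(deep=True).sum()

print(f"object columns:   {before / 1e6:.1f} MB")
print(f"category columns: {after / 1e6:.1f} MB")

# The same boolean filter works on both frames and selects the same rows.
sum_obj = df.loc[(df["A"] == "alpha") & (df["B"] == "beta"), "important_col"].sum()
sum_cat = df_cat.loc[
    (df_cat["A"] == "alpha") & (df_cat["B"] == "beta"), "important_col"
].sum()
```

To compare query speed, wrap each filter in `%timeit` in a notebook or use the standard `timeit` module; timings vary by machine and pandas version, so none are quoted here.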
