Why is int conversion so much slower than float in Pandas? Answers


1 answer

Anonymous, 1 day ago


Update: my laptop has 16GB of RAM, so I'll test with a DataFrame 4 times smaller (64GB/16GB=4).

Setup:
<pre><code>In [1]: df = pd.DataFrame(np.random.randint(0, 10*6, (12000, 47395)), dtype=np.int32)

In [2]: df.shape
Out[2]: (12000, 47395)

In [3]: %timeit -n 1 -r 1 df.to_csv('c:/tmp/big.csv', chunksize=1000)
1 loop, best of 1: 5min 34s per loop
</code></pre>
We also save this DataFrame in Feather format and then read it back:
<pre><code>In [10]: %timeit -n 1 -r 1 df = feather.read_dataframe('c:/tmp/big.feather')
1 loop, best of 1: 17.4 s per loop  # reading is reasonably fast as well
</code></pre>
Reading from the CSV file in chunks is much slower, but it still does not give me a <code>MemoryError</code>:
<pre><code>In [2]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int64')]
1 loop, best of 1: 9min 25s per loop
</code></pre>
Now let's explicitly specify <code>dtype=np.int32</code>:
<pre><code>In [1]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int32')]
1 loop, best of 1: 10min 38s per loop
</code></pre>
Testing HDF storage:
<pre><code>In [10]: %timeit -n 1 -r 1 df.to_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 22.5 s per loop

In [11]: del df

In [12]: %timeit -n 1 -r 1 df = pd.read_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 1.04 s per loop
</code></pre>
<h2>Conclusion:</h2>
If you have a chance to change your storage file format, by all means do not use CSV files. Use HDF5 (.h5) or Feather format instead.

Old answer:

I would simply use the native Pandas <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html" rel="nofollow noreferrer">read_csv()</a> method:
<pre><code>import pandas as pd

chunksize = 10**6
reader = pd.read_csv(filename, index_col=0, chunksize=chunksize)
# pd.concat() has no ignore_indexes argument. The valid ignore_index=True
# would discard the index read via index_col=0, so it is left out here.
df = pd.concat([chunk for chunk in reader])
</code></pre>
According to your code:
<blockquote>
<pre><code>tag = row[0]
df.loc[tag] = np.array(row[1:], dtype=dftype)
</code></pre>
</blockquote>
it looks like you want to use the first column of the CSV file as the index, hence <code>index_col=0</code>.
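The chunked-reading pattern from the old answer can be tried on a small synthetic file. This is only a minimal sketch, not a reproduction of the benchmark: the helper name `read_csv_in_chunks`, the `tag0`, `tag1`, ... labels, the column names and the tiny frame size are all hypothetical, chosen for illustration. It assumes the first CSV column holds string tags, so the `int32` dtype is applied only to the value columns through a dict rather than to every column.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def read_csv_in_chunks(path, chunksize, value_dtype=np.int32):
    """Read a CSV chunk by chunk and concatenate the chunks once at the end."""
    # Read only the header row to learn the names of the value columns.
    columns = pd.read_csv(path, index_col=0, nrows=0).columns
    dtypes = {name: value_dtype for name in columns}
    reader = pd.read_csv(path, index_col=0, chunksize=chunksize, dtype=dtypes)
    # One concat over all chunks avoids copying the growing frame per iteration.
    return pd.concat(reader)


# Small synthetic frame (hypothetical data, far smaller than the benchmark).
rng = np.random.default_rng(0)
small = pd.DataFrame(
    rng.integers(0, 60, size=(25, 4)).astype(np.int32),
    index=[f"tag{i}" for i in range(25)],
    columns=["a", "b", "c", "d"],
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "small.csv")
    small.to_csv(csv_path)
    restored = read_csv_in_chunks(csv_path, chunksize=10)

print(restored.shape)
print(restored.dtypes.unique())
```

Passing the reader straight to `pd.concat` builds the result in one step, whereas the timing loops above call `pd.concat([df, chunk])` once per chunk and copy the accumulated frame each time. How much that matters on the full-size data is not measured here.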
