Pandas/Python memory spike while reading a 3.2 GB file

I have been trying to read a 3.2 GB file into memory with the pandas <code>read_csv</code> function, but I keep running into what looks like a memory leak: my memory usage spikes to <code>90%+</code>.

As alternatives:
<ol>
<li>I tried defining <code>dtype</code> to avoid keeping the data in memory as strings, but saw similar behaviour.</li>
<li>I tried numpy's CSV reading, thinking I would get different results, but I was definitely wrong about that.</li>
<li>I tried reading line by line, which ran into the same problem and was also really slow.</li>
<li>I recently moved to Python 3, so I thought there might be a bug there, but I saw similar results on Python 2 + pandas.</li>
</ol>

The file in question is the train.csv file from the Kaggle competition <a href="https://www.kaggle.com/c/grupo-bimbo-inventory-demand/" rel="nofollow noreferrer">grupo bimbo</a>.

System info: <code>RAM: 16GB, Processor: i7 8cores</code>

Let me know if you would like to know anything else. Thanks :)

EDIT 1: It is a memory spike, not a leak (sorry, my bad).

EDIT 2: Sample of the CSV file

<pre><code>Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
</code></pre>

EDIT 3: Number of rows in the file: 74180465

Other than a simple <code>pd.read_csv('filename', low_memory=False)</code>, I have tried [code snippet not preserved in the source]

UPDATE: The code below just worked, but I still want to get to the bottom of this; there must be something wrong.

<pre><code>import pandas as pd
import gc

data = pd.DataFrame()
data_iterator = pd.read_csv('data/train.csv', chunksize=100000)
for sub_data in data_iterator:
    data.append(sub_data)
    gc.collect()
</code></pre>

<a href="https://i.stack.imgur.com/EFcup.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/EFcup.png" alt="enter image description here"/></a>
<a href="https://i.stack.imgur.com/rnslk.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/rnslk.png" alt="enter image description here"/></a>

EDIT: Piece of code that worked. Thanks for all the help, guys. I had messed up my dtypes by adding Python dtypes instead of numpy ones. Once I fixed that, the code below worked like a charm.

<pre><code>import numpy as np
import pandas as pd

# numpy dtypes (not Python built-in types), one per column
dtypes = {'Semana': np.int8,
          'Agencia_ID': np.int8,
          'Canal_ID': np.int8,
          'Ruta_SAK': np.int8,
          'Cliente_ID': np.int8,
          'Producto_ID': np.int8,
          'Venta_uni_hoy': np.int8,
          'Venta_hoy': np.float16,
          'Dev_uni_proxima': np.int8,
          'Dev_proxima': np.float16,
          'Demanda_uni_equil': np.int8}
data = pd.read_csv('data/train.csv', dtype=dtypes)
</code></pre>

This brought the memory consumption down to just under 4 GB.
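Two details in the snippets above are worth a hedged illustration. First, `DataFrame.append` returned a new frame instead of modifying `data` in place (and was removed in pandas 2.0), so the chunked loop never accumulated any rows, which may be why it appeared to run cheaply. Second, `int8` only holds values from -128 to 127, so IDs such as 15766 in the sample rows do not fit. The sketch below is not the asker's final code: the column widths in `DTYPES` are assumptions guessed from the five sample rows only, the helper names `read_in_chunks` and `estimated_bytes` are made up for this example, and an in-memory string stands in for `data/train.csv`.

```python
import io

import numpy as np
import pandas as pd

# The sample rows quoted in the question, used as a stand-in for train.csv.
SAMPLE_CSV = """Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
"""

# Assumed widths: 32-bit everywhere, half the size of the int64/float64
# defaults. Verify the value ranges of the full file before going narrower.
INT_COLUMNS = ["Semana", "Agencia_ID", "Canal_ID", "Ruta_SAK", "Cliente_ID",
               "Producto_ID", "Venta_uni_hoy", "Dev_uni_proxima",
               "Demanda_uni_equil"]
FLOAT_COLUMNS = ["Venta_hoy", "Dev_proxima"]
DTYPES = {**{name: np.int32 for name in INT_COLUMNS},
          **{name: np.float32 for name in FLOAT_COLUMNS}}


def read_in_chunks(source, dtypes, chunksize=100_000):
    """Read a CSV chunk by chunk and concatenate the pieces once at the end."""
    chunks = pd.read_csv(source, dtype=dtypes, chunksize=chunksize)
    return pd.concat(list(chunks), ignore_index=True)


def estimated_bytes(n_rows, dtypes):
    """Size of the column data alone (no index, no parser overhead)."""
    return n_rows * sum(np.dtype(t).itemsize for t in dtypes.values())


df = read_in_chunks(io.StringIO(SAMPLE_CSV), DTYPES, chunksize=2)
print(df.dtypes)
print(estimated_bytes(74_180_465, DTYPES))
```

For the 74,180,465 rows mentioned in the question, eleven 8-byte columns come to 6,527,880,920 bytes of column data, while eleven 4-byte columns come to 3,263,940,460 bytes. That is only the final frame; whatever the parser allocates temporarily while reading comes on top of it.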

1 answer

Anonymous, 1 day ago

Based on your second chart, it looks as though there is a brief period in which your machine allocates an additional 4.368 GB of memory, which is approximately the size of your 3.2 GB dataset (assuming 1 GB of overhead, which might be a stretch).

I tried to track down a place where this could happen and was not successful. Perhaps you can find it, though, if you are motivated. Here is the path I took:

<a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L908" rel="nofollow">This line</a> reads:

<pre><code>def read(self, nrows=None):
    if nrows is not None:
        if self.options.get('skip_footer'):
            raise ValueError('skip_footer not supported for iteration')

    ret = self._engine.read(nrows)
</code></pre>

Here, <code>_engine</code> refers to <a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1674" rel="nofollow">this parser engine</a>. That, in turn, calls <a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L2359" rel="nofollow">this routine</a>, which makes calls on the data <a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L2379" rel="nofollow">here</a>. It seems to read the input in as strings from something fairly standard (see <a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1741" rel="nofollow">here</a>), such as a <a href="https://docs.python.org/2/library/io.html#io.TextIOWrapper" rel="nofollow">TextIOWrapper</a>.

So things get read in as standard text and then converted, which explains the slow ramp.

What about the spike? I think it is explained by <a href="https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L908-L916" rel="nofollow">these lines</a>, where <code>ret</code> becomes all the components of a data frame. <code>self._create_index()</code> breaks <code>ret</code> apart into these components:

<pre><code>def _create_index(self, ret):
    index, columns, col_dict = ret
    return index, columns, col_dict
</code></pre>

So far, everything can be done by reference, and the call to <code>DataFrame()</code> continues that trend (see <a href="https://github.com/pydata/pandas/blob/master/pandas/core/frame.py#L251" rel="nofollow">here</a>).

So, if my theory is correct, either <code>DataFrame()</code> copies the data somewhere, or another call along the path I have identified does.
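One way to probe the copying theory on a small scale is to build a frame from a dictionary of NumPy arrays, mirroring the `col_dict` handed to `DataFrame()`, and ask NumPy whether each resulting column still shares memory with its input array. This is only a rough sketch: the helper name `frame_shares_memory` and the toy columns are made up for illustration, and the outcome can differ between pandas versions, so no particular result is claimed here.

```python
import numpy as np
import pandas as pd


def frame_shares_memory(n_rows=1_000):
    """Report, per column, whether the DataFrame reuses the input buffer."""
    # Toy stand-in for the parser's col_dict: column name -> NumPy array.
    col_dict = {
        "a": np.arange(n_rows, dtype=np.int32),
        "b": np.ones(n_rows, dtype=np.float32),
    }
    df = pd.DataFrame(col_dict, columns=["a", "b"])
    # np.shares_memory is True only if the two arrays overlap in memory,
    # i.e. the constructor did not copy that column.
    return {
        name: bool(np.shares_memory(arr, df[name].to_numpy()))
        for name, arr in col_dict.items()
    }


result = frame_shares_memory()
print(result)
```

A `False` for every column would mean the constructor copied the data, which for a multi-gigabyte `col_dict` would briefly hold two copies and so be consistent with a spike of roughly the dataset's size.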
