Merging pandas dataframes uses too much memory



I'm doing <a href="https://www.kaggle.com/c/competitive-data-science-predict-future-sales" rel="nofollow noreferrer">this Kaggle competition</a> as the final project for a course I'm taking. For it, I am trying to replicate <a href="https://www.kaggle.com/dlarionov/feature-engineering-xgboost" rel="nofollow noreferrer">this notebook</a>, but the author uses a function to build lag features that takes far too much memory for me. This is his code:

<pre><code>def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id',col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df
</code></pre>

After failing to run his code, I made some small changes to try to reduce memory usage, and I switched to Google Colab because it has more memory than my laptop. My code is as follows:

[code block not preserved in the source]

It still uses too much memory, to the point that this one function call exhausts the 10 GB that Google provides:

<pre><code>sales_train = lag_feature(sales_train, [1, 2, 3, 12, 20], 'item_cnt_month')
</code></pre>

Is there a way to reduce my memory usage? This is my dataframe:

<pre><code>Int64Index: 2829445 entries, 0 to 3134798
Data columns (total 8 columns):
date                object
date_block_num      int8
item_cnt_day        float16
item_id             int16
item_price          float16
shop_id             int8
item_cnt_month      float16
item_category_id    int8
dtypes: float16(3), int16(1), int8(3), object(1)
memory usage: 152.9+ MB
</code></pre>

To add more information: the column 'date_block_num' holds a number identifying the month in which the row's sale occurred. What I want to do is bring some data from a previous month into each row. So a lag of 1 means I want to take the value from one month earlier and add it in another column named 'feature_lag_1'. For example, for this dataframe:

<pre><code>         date  date_block_num  item_cnt_day  item_id  item_price  shop_id  \
0  14.09.2013               8           1.0     2848        99.0       24
1  14.09.2013               8           1.0     2848        99.0       24
2  14.09.2013               8           1.0     2848        99.0       24
3  01.09.2013               8           1.0     2848        99.0       24
4  01.09.2013               8           1.0     2848        99.0       24

   item_cnt_month  item_category_id
0             2.0                30
1             2.0                30
2             2.0                30
3             2.0                30
4             2.0                30
</code></pre>

and this function call:

<pre><code>sales_train = lag_feature(sales_train, [1], 'item_cnt_month')
</code></pre>

I want this output:

<pre><code>         date  date_block_num  item_cnt_day  item_id  item_price  shop_id  \
0  14.09.2013               8           1.0     2848        99.0       24
1  14.09.2013               8           1.0     2848        99.0       24
2  14.09.2013               8           1.0     2848        99.0       24
3  01.09.2013               8           1.0     2848        99.0       24
4  01.09.2013               8           1.0     2848        99.0       24

   item_cnt_month  item_category_id  item_cnt_month_lag_1
0             2.0                30                   3.0
1             2.0                30                   3.0
2             2.0                30                   3.0
3             2.0                30                   3.0
4             2.0                30                   3.0
</code></pre>


1 answer

Anonymous, 1 day ago

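The memory problem you are facing is probably caused by holding several (partial) copies of the same dataframe at once. There is no need to do that in pandas: as others have pointed out, the <code>.shift</code> function does what you need.

First, create a pandas dataframe with two shops, 24 and 25.

<pre><code>import pandas as pd

df = pd.DataFrame({'shop_id': [24, 24, 24, 24, 24, 25, 25, 25, 25, 25],
                   'item_id': [2000, 2000, 2000, 3000, 3000, 1000, 1000, 1000, 1000, 1000],
                   'date_block_num': [7, 8, 9, 7, 8, 5, 6, 7, 8, 9],
                   'item_cnt_month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

+-------+-------+--------------+--------------+
|shop_id|item_id|date_block_num|item_cnt_month|
+-------+-------+--------------+--------------+
|     24|   2000|             7|             1|
|     24|   2000|             8|             2|
|     24|   2000|             9|             3|
|     24|   3000|             7|             4|
|     24|   3000|             8|             5|
|     25|   1000|             5|             6|
|     25|   1000|             6|             7|
|     25|   1000|             7|             8|
|     25|   1000|             8|             9|
|     25|   1000|             9|            10|
+-------+-------+--------------+--------------+
</code></pre>

Shop 24 has items 2000 and 3000. Item 2000 has a count of 1 in date block 7, a count of 2 in date block 8, and so on. The goal is to create an item_cnt_month lag column for each item in each shop, holding the value of item_cnt_month from n months earlier.

To create the lag features, the following function can be used:

[code block not preserved in the source]

Call it like this:

<pre><code>lags = [1, 2]
group_cols = ['shop_id', 'item_id']
shift_col = 'item_cnt_month'
order_col = 'date_block_num'

df = df.sort_values(by=group_cols+[order_col], ascending=True)
df = lag_features(df, lags, group_cols, shift_col)
</code></pre>

The result is:

<pre><code>+-------+-------+--------------+--------------+--------------------+--------------------+
|shop_id|item_id|date_block_num|item_cnt_month|item_cnt_month_lag_1|item_cnt_month_lag_2|
+-------+-------+--------------+--------------+--------------------+--------------------+
|     24|   2000|             7|             1|                 NaN|                 NaN|
|     24|   2000|             8|             2|                 1.0|                 NaN|
|     24|   2000|             9|             3|                 2.0|                 1.0|
|     24|   3000|             7|             4|                 NaN|                 NaN|
|     24|   3000|             8|             5|                 4.0|                 NaN|
|     25|   1000|             5|             6|                 NaN|                 NaN|
|     25|   1000|             6|             7|                 6.0|                 NaN|
|     25|   1000|             7|             8|                 7.0|                 6.0|
|     25|   1000|             8|             9|                 8.0|                 7.0|
|     25|   1000|             9|            10|                 9.0|                 8.0|
+-------+-------+--------------+--------------+--------------------+--------------------+
</code></pre>

Note that because there is no explicit join, the dataframe must be sorted correctly first with <code>.sort_values(all key columns and date column)</code>.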
