I have two dataframes whose merge produces a roughly 50 GB file, which is too much for Python; I can't even perform the merge in Python and have had to do it in SQLite.
Here is what the two datasets look like.
First dataset:
a_id c_consumed
0 sam oil
1 sam bread
2 sam soap
3 harry shoes
4 harry oil
5 alice eggs
6 alice pen
7 alice eggroll
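The code that builds this first table didn't survive in this copy of the post; a minimal sketch that reproduces the table above, mirroring the style used for the second dataset below:

```python
import pandas as pd

# Rebuild the first dataset shown above: one row per (person, consumed item).
df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'c_consumed': 'oil bread soap shoes oil eggs pen eggroll'.split()})
```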
Second dataset:
a_id b_received brand_id type_received date
0 sam soap bill edibles 2011-01-01
1 sam oil chris utility 2011-01-02
2 sam brush dan grocery 2011-01-01
3 harry oil chris clothing 2011-01-02
4 harry shoes nancy edibles 2011-01-03
5 alice beer peter breakfast 2011-01-03
6 alice brush dan cleaning 2011-01-02
7 alice eggs jaju edibles 2011-01-03
Code that generates this dataset:
import pandas as pd

df_id = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                      'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                      'brand_id': 'bill chris dan chris nancy peter dan jaju'.split(),
                      'type_received': 'edibles utility grocery clothing edibles breakfast cleaning edibles'.split()})
date3 = ['2011-01-01', '2011-01-02', '2011-01-01', '2011-01-02', '2011-01-03', '2011-01-03', '2011-01-02', '2011-01-03']
date3 = pd.to_datetime(date3)
df_id['date'] = date3
I merge the datasets with this code:
combined = pd.merge(df_id, df, on='a_id', how='left')
This is the resulting dataset:
a_id b_received brand_id type_received date c_consumed
0 sam soap bill edibles 2011-01-01 oil
1 sam soap bill edibles 2011-01-01 bread
2 sam soap bill edibles 2011-01-01 soap
3 sam oil chris utility 2011-01-02 oil
4 sam oil chris utility 2011-01-02 bread
5 sam oil chris utility 2011-01-02 soap
6 sam brush dan grocery 2011-01-01 oil
7 sam brush dan grocery 2011-01-01 bread
8 sam brush dan grocery 2011-01-01 soap
9 harry oil chris clothing 2011-01-02 shoes
10 harry oil chris clothing 2011-01-02 oil
11 harry shoes nancy edibles 2011-01-03 shoes
12 harry shoes nancy edibles 2011-01-03 oil
13 alice beer peter breakfast 2011-01-03 eggs
14 alice beer peter breakfast 2011-01-03 pen
15 alice beer peter breakfast 2011-01-03 eggroll
16 alice brush dan cleaning 2011-01-02 eggs
17 alice brush dan cleaning 2011-01-02 pen
18 alice brush dan cleaning 2011-01-02 eggroll
19 alice eggs jaju edibles 2011-01-03 eggs
20 alice eggs jaju edibles 2011-01-03 pen
21 alice eggs jaju edibles 2011-01-03 eggroll
What I want to know is whether a person consumed the product they received. I need to keep the rest of the information, because later I need to see whether this is influenced by the brand or by the type of product. To do that, I create a new column with the code below, which gives the following result.
Code:
combined['output']= (combined.groupby('a_id')
.apply(lambda x : x['b_received'].isin(x['c_consumed']).astype('i4'))
.reset_index(level='a_id', drop=True))
The resulting dataframe is:
a_id b_received brand_id type_received date c_consumed output
0 sam soap bill edibles 2011-01-01 oil 1
1 sam soap bill edibles 2011-01-01 bread 1
2 sam soap bill edibles 2011-01-01 soap 1
3 sam oil chris utility 2011-01-02 oil 1
4 sam oil chris utility 2011-01-02 bread 1
5 sam oil chris utility 2011-01-02 soap 1
6 sam brush dan grocery 2011-01-01 oil 0
7 sam brush dan grocery 2011-01-01 bread 0
8 sam brush dan grocery 2011-01-01 soap 0
9 harry oil chris clothing 2011-01-02 shoes 1
10 harry oil chris clothing 2011-01-02 oil 1
11 harry shoes nancy edibles 2011-01-03 shoes 1
12 harry shoes nancy edibles 2011-01-03 oil 1
13 alice beer peter breakfast 2011-01-03 eggs 0
14 alice beer peter breakfast 2011-01-03 pen 0
15 alice beer peter breakfast 2011-01-03 eggroll 0
16 alice brush dan cleaning 2011-01-02 eggs 0
17 alice brush dan cleaning 2011-01-02 pen 0
18 alice brush dan cleaning 2011-01-02 eggroll 0
19 alice eggs jaju edibles 2011-01-03 eggs 1
20 alice eggs jaju edibles 2011-01-03 pen 1
21 alice eggs jaju edibles 2011-01-03 eggroll 1
As you can see, the output values are wrong. What I really want is a dataset more like this:
a_id b_received brand_id c_consumed type_received date output
0 sam soap bill oil edibles 2011-01-01 1
1 sam oil chris NaN utility 2011-01-02 1
2 sam brush dan soap grocery 2011-01-03 0
3 harry oil chris shoes clothing 2011-01-04 1
4 harry shoes nancy oil edibles 2011-01-05 1
5 alice beer peter eggs breakfast 2011-01-06 0
6 alice brush dan brush cleaning 2011-01-07 1
7 alice eggs jaju NaN edibles 2011-01-08 1
I could handle the duplicates with drop_duplicates after the merge, but the merged dataframe is too large to even create.
I really need to handle the duplication during the merge, or before it, because the resulting dataframe is too big for Python to handle and gives me a MemoryError.
Any suggestions on how to improve my merge, or any other way to compute the output column without merging?
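For concreteness, here is the kind of merge-free approach I mean (a sketch only): build the set of items each person consumed, then test each received item for membership, so the huge cross join never materializes.

```python
import pandas as pd

# The two small example datasets from above.
df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'c_consumed': 'oil bread soap shoes oil eggs pen eggroll'.split()})
df_id = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                      'b_received': 'soap oil brush oil shoes beer brush eggs'.split()})

# One set of consumed items per person -- stays small even for large data.
consumed = df.groupby('a_id')['c_consumed'].agg(set)

# 1 if the received product appears in that person's consumed set, else 0.
df_id['output'] = [int(item in consumed.get(person, set()))
                   for person, item in zip(df_id['a_id'], df_id['b_received'])]
```

Note that this applies the stated rule literally, so alice's brush comes out 0 here, since brush does not appear in alice's consumed list in the first dataset.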
In the end I only need the date and output columns, to calculate log odds and build a time series. But because of the file size, I'm stuck at the merge.
Note: I performed two groupby operations to get the output table. I added b_received to the grouping keys and took the first value in the second groupby, since at that grouping level all the duplicated values are identical.