Large files to merge: how do I prevent duplicates in a large pandas merge?

    a_id c_consumed
0    sam        oil
1    sam      bread
2    sam       soap
3  harry      shoes
4  harry        oil
5  alice       eggs
6  alice        pen
7  alice    eggroll

    a_id b_received brand_id type_received       date
0    sam       soap     bill       edibles 2011-01-01
1    sam        oil    chris       utility 2011-01-02
2    sam      brush      dan       grocery 2011-01-01
3  harry        oil    chris      clothing 2011-01-02
4  harry      shoes    nancy       edibles 2011-01-03
5  alice       beer    peter     breakfast 2011-01-03
6  alice      brush      dan      cleaning 2011-01-02
7  alice       eggs     jaju       edibles 2011-01-03

import pandas as pd

df_id = pd.DataFrame({
    'a_id': 'sam sam sam harry harry alice alice alice'.split(),
    'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
    'brand_id': 'bill chris dan chris nancy peter dan jaju'.split(),
    'type_received': 'edibles utility grocery clothing edibles breakfast cleaning edibles'.split(),
})
date3 = ['2011-01-01', '2011-01-02', '2011-01-01', '2011-01-02',
         '2011-01-03', '2011-01-03', '2011-01-02', '2011-01-03']
date3 = pd.to_datetime(date3)
df_id['date'] = date3

     a_id b_received brand_id type_received       date c_consumed
0     sam       soap     bill       edibles 2011-01-01        oil
1     sam       soap     bill       edibles 2011-01-01      bread
2     sam       soap     bill       edibles 2011-01-01       soap
3     sam        oil    chris       utility 2011-01-02        oil
4     sam        oil    chris       utility 2011-01-02      bread
5     sam        oil    chris       utility 2011-01-02       soap
6     sam      brush      dan       grocery 2011-01-01        oil
7     sam      brush      dan       grocery 2011-01-01      bread
8     sam      brush      dan       grocery 2011-01-01       soap
9   harry        oil    chris      clothing 2011-01-02      shoes
10  harry        oil    chris      clothing 2011-01-02        oil
11  harry      shoes    nancy       edibles 2011-01-03      shoes
12  harry      shoes    nancy       edibles 2011-01-03        oil
13  alice       beer    peter     breakfast 2011-01-03       eggs
14  alice       beer    peter     breakfast 2011-01-03        pen
15  alice       beer    peter     breakfast 2011-01-03    eggroll
16  alice      brush      dan      cleaning 2011-01-02       eggs
17  alice      brush      dan      cleaning 2011-01-02        pen
18  alice      brush      dan      cleaning 2011-01-02    eggroll
19  alice       eggs     jaju       edibles 2011-01-03       eggs
20  alice       eggs     jaju       edibles 2011-01-03        pen
21  alice       eggs     jaju       edibles 2011-01-03    eggroll
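The 22-row table above is consistent with an inner merge of the two frames on a_id, where every received row is paired with every consumed row of the same person. A minimal sketch of that step, assuming the first table is held in a frame called df_a (a hypothetical name; the post never names it) and that the result is the `combined` frame used in the answer:

```python
import pandas as pd

# df_a is a hypothetical name for the first table in the question.
df_a = pd.DataFrame({
    'a_id': 'sam sam sam harry harry alice alice alice'.split(),
    'c_consumed': 'oil bread soap shoes oil eggs pen eggroll'.split(),
})

df_id = pd.DataFrame({
    'a_id': 'sam sam sam harry harry alice alice alice'.split(),
    'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
    'brand_id': 'bill chris dan chris nancy peter dan jaju'.split(),
    'type_received': ('edibles utility grocery clothing edibles '
                      'breakfast cleaning edibles').split(),
})
df_id['date'] = pd.to_datetime([
    '2011-01-01', '2011-01-02', '2011-01-01', '2011-01-02',
    '2011-01-03', '2011-01-03', '2011-01-02', '2011-01-03',
])

# Many-to-many merge on a_id: each person's rows multiply
# (3*3 for sam, 2*2 for harry, 3*3 for alice).
combined = df_id.merge(df_a, on='a_id')

# Per-key row counts on both sides predict the merged size
# before the merge is actually run.
expected_rows = (df_id['a_id'].value_counts()
                 * df_a['a_id'].value_counts()).sum()
print(len(combined), expected_rows)
```

On large inputs the same count check can be run first to see how far the merge would inflate the data.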

     a_id b_received brand_id type_received       date c_consumed  output
0     sam       soap     bill       edibles 2011-01-01        oil       1
1     sam       soap     bill       edibles 2011-01-01      bread       1
2     sam       soap     bill       edibles 2011-01-01       soap       1
3     sam        oil    chris       utility 2011-01-02        oil       1
4     sam        oil    chris       utility 2011-01-02      bread       1
5     sam        oil    chris       utility 2011-01-02       soap       1
6     sam      brush      dan       grocery 2011-01-01        oil       0
7     sam      brush      dan       grocery 2011-01-01      bread       0
8     sam      brush      dan       grocery 2011-01-01       soap       0
9   harry        oil    chris      clothing 2011-01-02      shoes       1
10  harry        oil    chris      clothing 2011-01-02        oil       1
11  harry      shoes    nancy       edibles 2011-01-03      shoes       1
12  harry      shoes    nancy       edibles 2011-01-03        oil       1
13  alice       beer    peter     breakfast 2011-01-03       eggs       0
14  alice       beer    peter     breakfast 2011-01-03        pen       0
15  alice       beer    peter     breakfast 2011-01-03    eggroll       0
16  alice      brush      dan      cleaning 2011-01-02       eggs       0
17  alice      brush      dan      cleaning 2011-01-02        pen       0
18  alice      brush      dan      cleaning 2011-01-02    eggroll       0
19  alice       eggs     jaju       edibles 2011-01-03       eggs       1
20  alice       eggs     jaju       edibles 2011-01-03        pen       1
21  alice       eggs     jaju       edibles 2011-01-03    eggroll       1
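Judging from the values, the output column appears to be 1 when b_received occurs among that person's c_consumed values and 0 otherwise. Under that reading (an assumption, since the post does not state the rule), the column can be sketched on a cut-down stand-in for the merged frame without using apply:

```python
import pandas as pd

# Cut-down stand-in for the merged frame: sam's rows only,
# and only the three columns the rule needs.
combined = pd.DataFrame({
    'a_id': ['sam'] * 9,
    'b_received': ['soap'] * 3 + ['oil'] * 3 + ['brush'] * 3,
    'c_consumed': ['oil', 'bread', 'soap'] * 3,
})

# 1 on the rows where the received item equals the consumed item.
is_match = (combined['b_received'] == combined['c_consumed']).astype(int)

# Broadcast the group-wise maximum back to every row of the
# (a_id, b_received) group, so one match flags the whole group.
combined['output'] = is_match.groupby(
    [combined['a_id'], combined['b_received']]
).transform('max')

print(combined)
```

The soap and oil groups each contain one matching row and are flagged 1 throughout; brush never matches and stays 0, as in rows 0 to 8 of the table.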

    a_id b_received brand_id c_consumed type_received       date  output
0    sam       soap     bill        oil       edibles 2011-01-01       1
1    sam        oil    chris        NaN       utility 2011-01-02       1
2    sam      brush      dan       soap       grocery 2011-01-03       0
3  harry        oil    chris      shoes      clothing 2011-01-04       1
4  harry      shoes    nancy        oil       edibles 2011-01-05       1
5  alice       beer    peter       eggs     breakfast 2011-01-06       0
6  alice      brush      dan      brush      cleaning 2011-01-07       1
7  alice       eggs     jaju        NaN       edibles 2011-01-08       1

1 Answer

User

#1 · Posted 2024-07-03 06:52:12

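Note that I perform two groupby operations to get the output table. I add b_received to the grouping keys, and in the second groupby I take the first value, because at that grouping level all values are identical.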

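# combined is assumed to be the merged frame shown in the question
# (one row per received/consumed pair for each a_id).
output = ((combined
           .groupby(['a_id', 'b_received'])
           # b_received is constant within a group, so isin() checks whether
           # that item occurs among the group's c_consumed values
           .apply(lambda x: x['b_received'].isin(x['c_consumed'])
                  .astype(int)))
          # every row of a group carries the same flag; keep one per group
          .groupby(level=[0, 1])
          .first())

output.name = 'output'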

>>> (df_id[['a_id', 'b_received', 'date']]
     .merge(output.reset_index(), on=['a_id', 'b_received']))
    a_id b_received       date  output
0    sam       soap 2011-01-01       1
1    sam        oil 2011-01-02       1
2    sam      brush 2011-01-01       0
3  harry        oil 2011-01-02       1
4  harry      shoes 2011-01-03       1
5  alice       beer 2011-01-03       0
6  alice      brush 2011-01-02       0
7  alice       eggs 2011-01-03       1
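Since the question is about avoiding the row blow-up on large inputs, one alternative (not the answer's method) is to skip the a_id-only merge and look up each (a_id, b_received) pair directly among the consumed pairs. A sketch, assuming the first table lives in a hypothetical frame df_a and reproducing only the key columns of df_id:

```python
import pandas as pd

# df_a is a hypothetical name for the first table in the question.
df_a = pd.DataFrame({
    'a_id': 'sam sam sam harry harry alice alice alice'.split(),
    'c_consumed': 'oil bread soap shoes oil eggs pen eggroll'.split(),
})

# Only the key columns of df_id are reproduced here.
df_id = pd.DataFrame({
    'a_id': 'sam sam sam harry harry alice alice alice'.split(),
    'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
})

# De-duplicate the consumed pairs and rename so both sides share the keys.
consumed = (df_a.drop_duplicates()
                .rename(columns={'c_consumed': 'b_received'}))

# Left merge on both keys; validate='m:1' raises if the right side
# still had duplicate keys, and indicator=True records where each row matched.
flagged = df_id.merge(consumed, on=['a_id', 'b_received'],
                      how='left', validate='m:1', indicator=True)

# 'both' means the received item was also consumed by the same person.
flagged['output'] = (flagged['_merge'] == 'both').astype(int)
flagged = flagged.drop(columns='_merge')

print(flagged)
```

Because the right side is unique on the merge keys, the left merge cannot add rows, so the result keeps the eight rows of df_id and no intermediate 22-row frame is built.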
