使用Pandas在Python中通过查找连接2个df

Name ID1 StartDate1 EndDate1 StartDate2 EndDate2 ab 111 1/1/2017 3/1/2017 2/1/2017 4/1/2017 bc 112 2/1/2017 3/1/2017 2/1/2017 4/1/2017 cd 113 1/1/2017 2/1/2017 2/1/2017 4/1/2017 df 114 2/1/2017 4/1/2017 2/1/2017 4/1/2017 fe 115 4/1/2017 5/1/2017 2/1/2017 4/1/2017

ID1 ID2 Date Id1_num id2_num 111 10 1/1/2017 10 11 111 10 2/1/2017 10 111 10 3/1/2017 13 13 111 10 4/1/2017 15 13 112 20 2/1/2017 11 19 112 20 3/1/2017 12 112 20 4/1/2017 19 113 20 1/1/2017 17 5 113 20 2/1/2017 17 14 114 30 2/1/2017 11 13 114 30 3/1/2017 10 114 30 4/1/2017 18 13 115 30 4/1/2017 7 5 115 30 5/1/2017 13 19

Name ID1 ID2 StartDate1 EndDate1 StartDate2 EndDate2 ID1Date1Count ID1Date1Average ID1Date2Count ID1Date2Average ID2Date1Count ID2Date1Average ID2Date2Count ID2Date2Average ab 111 10 1/1/2017 3/1/2017 2/1/2017 4/1/2017 2 11.5 2 14 3 11.33 3 12 bc 112 20 2/1/2017 3/1/2017 2/1/2017 4/1/2017 2 11.5 2 11.5 1 19 2 19 cd 113 20 1/1/2017 2/1/2017 2/1/2017 4/1/2017 2 17 1 17 1 14 1 14 df 114 30 2/1/2017 4/1/2017 2/1/2017 4/1/2017 3 13 3 13 2 13 2 13 fe 115 30 4/1/2017 5/1/2017 2/1/2017 4/1/2017 1 19 0 0 1 19 0 0

2条回答

网友

1楼 · 编辑于 2024-09-30 01:32:12

不是100%清楚你真正想要什么（例如什么是“num<；10”，哪个num？Id1\u num还是id2\u num？）但我相信下面的内容会让你更接近。你知道吗

第一组df2并将其格式化为日期间隔：

df2_grouped = df2.groupby(['ID1', 'ID2']).\
    apply(lambda x: pd.Series([x['Date'].min(), x['Date'].max(), 
                               x['id2_num'].count()],
                              index=['StartDate2', 'EndDate2', 'ID1Date1Count']))
df2_grouped = df2_grouped.reset_index()

print(df2_grouped)

    ID1 ID2 StartDate2  EndDate2    ID1Date1Count
0   111 10  1/1/2017    4/1/2017    3
1   112 20  2/1/2017    4/1/2017    1
2   113 20  1/1/2017    2/1/2017    2
3   114 30  2/1/2017    4/1/2017    2
4   115 30  4/1/2017    5/1/2017    2

然后将其与df1合并：

df1[['ID1', 'StartDate1', 'EndDate1']].merge(df2_grouped, on='ID1')
df1 = df1.sort_values(by=['StartDate2'])

print(df1)

    ID1 StartDate1  EndDate1    ID2 StartDate2  EndDate2    ID1Date1Count
0   111 1/1/2017    3/1/2017    10  1/1/2017    4/1/2017    3
2   113 1/1/2017    2/1/2017    20  1/1/2017    2/1/2017    2
1   112 2/1/2017    3/1/2017    20  2/1/2017    4/1/2017    1
3   114 2/1/2017    4/1/2017    30  2/1/2017    4/1/2017    2
4   115 4/1/2017    5/1/2017    30  4/1/2017    5/1/2017    2

同样，输出与所需的输出并不完全匹配，但非常接近。你知道吗

网友

2楼 · 编辑于 2024-09-30 01:32:12

我分解了步骤，在我们得到dd1之后，我们可以merge它回到你的df1，你可以用fillna(0)替换NaN

Newdf=pd.wide_to_long(df1,stubnames=['StartDate','EndDate'],i=['Name','ID1'],j='nnumer')
Newdf.StartDate=pd.to_datetime(Newdf.StartDate)
Newdf.EndDate=pd.to_datetime(Newdf.EndDate)
Newdf['New']=Newdf[['StartDate','EndDate']].apply(lambda x : [pd.date_range(x['StartDate'], x['EndDate'],freq='MS').tolist()],1)['StartDate']
Newdf=Newdf.set_index(['StartDate','EndDate'],append=True).New.apply(pd.Series).stack()
Newdf=Newdf.to_frame('Date').reset_index()
df2.Date=pd.to_datetime(df2.Date)
dd=Newdf.merge(df2,on=['ID1','Date'])
dd[['Id1_num','id2_num']]=dd[['Id1_num','id2_num']].mask(dd[['Id1_num','id2_num']]<10)
dd1=dd.groupby(['ID1','nnumer','StartDate','EndDate'])['Id1_num','id2_num'].agg(['mean','count']).unstack(1).groupby(level='ID1').ffill().bfill().reset_index(['StartDate','EndDate'],drop=True).drop_duplicates()

dd1
Out[626]: 
       Id1_num                     id2_num                 
          mean       count            mean       count     
nnumer       1     2     1    2          1     2     1    2
ID1                                                        
111       11.5  14.0   2.0  2.0  11.333333  12.0   3.0  3.0
112       11.5  11.5   2.0  2.0  19.000000  19.0   1.0  2.0
113       17.0  17.0   2.0  1.0  14.000000  14.0   1.0  1.0
114       13.0  13.0   3.0  3.0  13.000000  13.0   2.0  2.0
115       13.0   NaN   1.0  0.0  19.000000   NaN   1.0  0.0

更新

dd1.columns=dd1.columns.map(''.join)
dd1
Out[650]: 
     Id1_nummean1  Id1_nummean2  Id1_numcount1  Id1_numcount2  id2_nummean1  \
ID1                                                                           
111          11.5          14.0            2.0            2.0     11.333333   
112          11.5          11.5            2.0            2.0     19.000000   
113          17.0          17.0            2.0            1.0     14.000000   
114          13.0          13.0            3.0            3.0     13.000000   
115          13.0           NaN            1.0            0.0     19.000000   
     id2_nummean2  id2_numcount1  id2_numcount2  
ID1                                              
111          12.0            3.0            3.0  
112          19.0            1.0            2.0  
113          14.0            1.0            1.0  
114          13.0            2.0            2.0  
115           NaN            1.0            0.0

相关问题更多 >

编程相关推荐

热门问题

热门文章