How can I group by the start and end time ranges in one df and look up each vehicle slot's start and end location from another df in Python?

Posted 2024-07-01 07:43:09

I have a DataFrame, df1, that holds the vehicle slots:

    CompanyID   RegistrationNo  slotStartTime           slotEndTime
1   602         veh1            2020-07-27 21:12:00 2020-07-27 22:12:00
2   602         veh1            2020-07-27 21:30:00 2020-07-27 22:30:00
3   602         veh2            2020-07-28 22:16:00 2020-07-28 23:16:00

And another, df2. From this data I want to find the start location and end location of each slot:

    RegistrationNo  GPSTime         Location
0   veh1            2020-07-27 21:12:00 loc1
1   veh1            2020-07-27 21:15:00 loc2
2   veh1            2020-07-27 21:20:00 loc3
3   veh1            2020-07-27 21:30:00 loc4
4   veh1            2020-07-27 21:45:00 loc5
5   veh1            2020-07-27 22:15:00 loc6
6   veh1            2020-07-27 22:29:00 loc7
4   veh2            2020-07-28 21:45:00 loc8
5   veh2            2020-07-28 22:15:00 loc9
6   veh2            2020-07-28 22:29:00 loc10 
7   veh2            2020-07-28 22:50:00 loc11 
7   veh2            2020-07-28 23:16:00 loc12 

Expected result:

    CompanyID   RegistrationNo  slotStartTime           slotEndTime      slotStartloc slotEndLoc
1   602         veh1            2020-07-27 21:12:00 2020-07-27 22:12:00  loc1         loc5
2   602         veh1            2020-07-27 21:30:00 2020-07-27 22:30:00  loc4         loc7
3   602         veh2            2020-07-28 22:16:00 2020-07-28 23:16:00  loc10        loc12

I tried grouping by date range, but I guess that because another df is involved, it does not work and throws an error.


3 Answers

def compare(reg, start, end):
    # Select this vehicle's GPS fixes inside the slot, both endpoints included.
    # Series.between takes inclusive='both' since pandas 1.3 (older versions used inclusive=True).
    mask = (df2['RegistrationNo'] == reg) & df2['GPSTime'].between(start, end, inclusive='both')
    locations = df2.loc[mask, 'Location']  # column name as shown in the question's df2
    # df2 is sorted by GPSTime below, so the first and last rows are the start and end fixes.
    return locations.iloc[0], locations.iloc[-1]

df2.sort_values('GPSTime', ascending=True, inplace=True)
df1[['slotStartloc', 'slotEndLoc']] = df1.apply(lambda x: compare(x['RegistrationNo'], x['slotStartTime'], x['slotEndTime']), axis=1, result_type='expand')

df1
    CompanyID   RegistrationNo  slotStartTime           slotEndTime      slotStartloc slotEndLoc
1   602         veh1            2020-07-27 21:12:00 2020-07-27 22:12:00  loc1         loc5
2   602         veh1            2020-07-27 21:30:00 2020-07-27 22:30:00  loc4         loc7
3   602         veh2            2020-07-28 22:16:00 2020-07-28 23:16:00  loc10        loc12
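
The apply-based answer above filters df2 once per slot. As a possible alternative for larger frames, here is a minimal sketch built on pandas.merge_asof, which finds the nearest GPS fix at or after each slot start and at or before each slot end. The data is copied from the question; the names `slots` and `gps` are only illustrative and do not come from any of the answers.

```python
import pandas as pd

# Data copied from the question; `slots` and `gps` are illustrative names.
slots = pd.DataFrame({
    "CompanyID": [602, 602, 602],
    "RegistrationNo": ["veh1", "veh1", "veh2"],
    "slotStartTime": pd.to_datetime(["2020-07-27 21:12:00", "2020-07-27 21:30:00",
                                     "2020-07-28 22:16:00"]),
    "slotEndTime": pd.to_datetime(["2020-07-27 22:12:00", "2020-07-27 22:30:00",
                                   "2020-07-28 23:16:00"]),
})
gps = pd.DataFrame({
    "RegistrationNo": ["veh1"] * 7 + ["veh2"] * 5,
    "GPSTime": pd.to_datetime([
        "2020-07-27 21:12:00", "2020-07-27 21:15:00", "2020-07-27 21:20:00",
        "2020-07-27 21:30:00", "2020-07-27 21:45:00", "2020-07-27 22:15:00",
        "2020-07-27 22:29:00", "2020-07-28 21:45:00", "2020-07-28 22:15:00",
        "2020-07-28 22:29:00", "2020-07-28 22:50:00", "2020-07-28 23:16:00",
    ]),
    "Location": [f"loc{i}" for i in range(1, 13)],
}).sort_values("GPSTime")  # merge_asof needs the time keys sorted

# First fix at or after the slot start, matched per vehicle.
with_start = pd.merge_asof(
    slots.sort_values("slotStartTime"),
    gps.rename(columns={"Location": "slotStartloc"}),
    left_on="slotStartTime", right_on="GPSTime",
    by="RegistrationNo", direction="forward",
).drop(columns="GPSTime")

# Last fix at or before the slot end, matched per vehicle.
result = pd.merge_asof(
    with_start.sort_values("slotEndTime"),
    gps.rename(columns={"Location": "slotEndLoc"}),
    left_on="slotEndTime", right_on="GPSTime",
    by="RegistrationNo", direction="backward",
).drop(columns="GPSTime")

print(result)
```

One caveat with this sketch: merge_asof only looks for the nearest fix in the given direction and does not check that the fix falls inside the slot, so a slot with no GPS fixes at all would silently pick up a location from outside its time range.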

Here is an approach that uses iterrows() and writes into the DataFrame with .at[]:

df['start_loc'] = ''
df['end_loc'] = ''

for index, row in df.iterrows():
    start = row.slotStartTime
    end = row.slotEndTime
    reg = row.RegistrationNo
    
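    mask = ((df2['RegistrationNo'] == reg) &
            (start <= df2['GPSTime']) & (df2['GPSTime'] <= end))

    # Take the earliest and latest fix by GPSTime. min()/max() on the location
    # strings would compare them alphabetically (for example 'loc10' < 'loc9').
    locs = df2.loc[mask].sort_values('GPSTime')['location']
    df.at[index, 'start_loc'] = locs.iloc[0]
    df.at[index, 'end_loc'] = locs.iloc[-1]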
    
print(df[['start_loc', 'end_loc']])   # other columns omitted to save space

  start_loc end_loc
0      loc1    loc5
1      loc4    loc7
2     loc10   loc12

import pandas as pd

df = pd.DataFrame({
    'CompanyID': [602, 602, 202],
    'RegistrationNo': ['veh1', 'veh1', 'veh2'],
    'slotStartTime': ['2020-07-27 21:12:00', '2020-07-27 21:30:00',
                      '2020-07-28 22:16:00'],
    'slotEndTime': ['2020-07-27 22:12:00', '2020-07-27 22:30:00',
                    '2020-07-28 23:16:00']
})
df2 = pd.DataFrame({
    'RegistrationNo': ['veh1', 'veh1', 'veh1', 'veh1', 'veh1', 'veh1', 'veh1',
                       'veh2', 'veh2', 'veh2', 'veh2', 'veh2'],
    'GPSTime': ['2020-07-27 21:12:00', '2020-07-27 21:15:00',
                '2020-07-27 21:20:00', '2020-07-27 21:30:00',
                '2020-07-27 21:45:00', '2020-07-27 22:15:00', '2020-07-27 22:29:00',
                '2020-07-28 21:45:00', '2020-07-28 22:15:00', '2020-07-28 22:29:00',
                '2020-07-28 22:50:00', '2020-07-28 23:16:00'],
    'location': ['loc1', 'loc2', 'loc3', 'loc4', 'loc5', 'loc6', 'loc7', 'loc8',
                 'loc9', 'loc10', 'loc11', 'loc12']
})

# Parse the timestamp strings so they can be compared as datetimes.
df['slotStartTime'] = pd.to_datetime(df['slotStartTime'])
df['slotEndTime'] = pd.to_datetime(df['slotEndTime'])
df2['GPSTime'] = pd.to_datetime(df2['GPSTime'])

# For each row in df, merge with df2 and keep the GPS fixes inside the slot's time range,
# then take the first and last of those rows as the start and end location.
parts = []

for index in df.index:
    row = df.loc[[index]]  # one-row DataFrame; keeps the column dtypes
    df_main = pd.merge(row, df2, on='RegistrationNo')
    df_main = df_main[
        (df_main.slotEndTime > df_main.slotStartTime) &
        (df_main.slotStartTime <= df_main.GPSTime) &
        (df_main.GPSTime <= df_main.slotEndTime)
    ].sort_values(by='GPSTime')  # chronological, so first/last row are start/end
    if df_main.empty:
        continue  # no GPS fix inside this slot
    df_main['start_loc'] = df_main.iloc[0]['location']
    df_main['end_loc'] = df_main.iloc[-1]['location']
    parts.append(df_main)

# DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat.
result = pd.concat(parts)
# 'result' now holds the locations; next, attach them to the original df.

df=df.merge(result,on=['slotStartTime','slotEndTime','RegistrationNo','CompanyID'],how='inner').drop_duplicates(
    keep='last',subset=['slotStartTime','slotEndTime','RegistrationNo']
)
del df['location']
print(df)
   CompanyID RegistrationNo       slotStartTime         slotEndTime             GPSTime start_loc end_loc
4        602           veh1 2020-07-27 21:12:00 2020-07-27 22:12:00 2020-07-27 21:45:00      loc1    loc5
8        602           veh1 2020-07-27 21:30:00 2020-07-27 22:30:00 2020-07-27 22:29:00      loc4    loc7
11       202           veh2 2020-07-28 22:16:00 2020-07-28 23:16:00 2020-07-28 23:16:00     loc10   loc12
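
The per-row merge above can also be written as one merge followed by a filter and a groupby, which is close to the "group by date range" idea in the question. The following is only a sketch under a few assumptions: the data is copied from the question, and the names `slots`, `gps` and `slot_id` are illustrative and do not appear in any answer.

```python
import pandas as pd

# Data copied from the question; `slots`, `gps` and `slot_id` are illustrative names.
slots = pd.DataFrame({
    "CompanyID": [602, 602, 602],
    "RegistrationNo": ["veh1", "veh1", "veh2"],
    "slotStartTime": pd.to_datetime(["2020-07-27 21:12:00", "2020-07-27 21:30:00",
                                     "2020-07-28 22:16:00"]),
    "slotEndTime": pd.to_datetime(["2020-07-27 22:12:00", "2020-07-27 22:30:00",
                                   "2020-07-28 23:16:00"]),
})
gps = pd.DataFrame({
    "RegistrationNo": ["veh1"] * 7 + ["veh2"] * 5,
    "GPSTime": pd.to_datetime([
        "2020-07-27 21:12:00", "2020-07-27 21:15:00", "2020-07-27 21:20:00",
        "2020-07-27 21:30:00", "2020-07-27 21:45:00", "2020-07-27 22:15:00",
        "2020-07-27 22:29:00", "2020-07-28 21:45:00", "2020-07-28 22:15:00",
        "2020-07-28 22:29:00", "2020-07-28 22:50:00", "2020-07-28 23:16:00",
    ]),
    "Location": [f"loc{i}" for i in range(1, 13)],
})

# Pair every slot with every GPS fix of the same vehicle, remembering the slot's index.
merged = slots.rename_axis("slot_id").reset_index().merge(gps, on="RegistrationNo")

# Keep only the fixes that fall inside the slot (both endpoints included).
in_slot = merged[(merged["GPSTime"] >= merged["slotStartTime"]) &
                 (merged["GPSTime"] <= merged["slotEndTime"])]

# Per slot, the chronologically first and last location.
locs = (in_slot.sort_values("GPSTime")
               .groupby("slot_id")["Location"]
               .agg(slotStartloc="first", slotEndLoc="last"))

# Attach the locations to the slots; a slot without any fix gets NaN in both columns.
result = slots.join(locs)
print(result)
```

The merge builds one row per (slot, fix) pair for each vehicle, so memory grows with the number of slots times the number of fixes per vehicle; for very large GPS logs a per-vehicle loop or an as-of join may be a better fit.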
