pd.merge_asof with multiple matches per time period?

Posted 2024-10-05 14:30:43


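I am trying to merge two dataframes by time, with multiple matches. I am looking for all instances of df2 whose timestamp is 7 days or less before endofweek in df1. More than one record may fit, and in that case I want all of the matches, not just the first or last one (which is what pd.merge_asof returns).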

import pandas as pd
df1 = pd.DataFrame({'endofweek': ['2019-08-31', '2019-08-31', '2019-09-07', '2019-09-07', '2019-09-14', '2019-09-14'], 'GroupCol': [1234,8679,1234,8679,1234,8679]})
df2 = pd.DataFrame({'timestamp': ['2019-08-30 10:00', '2019-08-30 10:30', '2019-09-07 12:00', '2019-09-08 14:00'], 'GroupVal': [1234, 1234, 8679, 1234], 'TextVal': ['1234_1', '1234_2', '8679_1', '1234_3']})
df1['endofweek'] = pd.to_datetime(df1['endofweek'])
df2['timestamp'] = pd.to_datetime(df2['timestamp'])

I tried

pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='backward', left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')

but I got

   endofweek  GroupCol           timestamp  GroupVal TextVal
0 2019-08-31      1234 2019-08-30 10:30:00    1234.0  1234_2
1 2019-08-31      8679                 NaT       NaN     NaN
2 2019-09-07      1234                 NaT       NaN     NaN
3 2019-09-07      8679                 NaT       NaN     NaN
4 2019-09-14      1234 2019-09-08 14:00:00    1234.0  1234_3
5 2019-09-14      8679 2019-09-07 12:00:00    8679.0  8679_1

The TextVal 1234_1 is lost. Is there a way to do a kind of outer join with pd.merge_asof, so that I keep all instances of df2 and not just the first or last one?

My ideal result would look like this (assuming the endofweek times are treated as 00:00:00 on that date):

   endofweek  GroupCol           timestamp  GroupVal TextVal
0 2019-08-31      1234 2019-08-30 10:00:00    1234.0  1234_1
1 2019-08-31      1234 2019-08-30 10:30:00    1234.0  1234_2
2 2019-08-31      8679                 NaT       NaN     NaN
3 2019-09-07      1234                 NaT       NaN     NaN
4 2019-09-07      8679                 NaT       NaN     NaN
5 2019-09-14      1234 2019-09-08 14:00:00    1234.0  1234_3
6 2019-09-14      8679 2019-09-07 12:00:00    8679.0  8679_1

3 Answers

You should change direction to 'nearest':

pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='nearest'
              , left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')
Out[106]: 
   endofweek  GroupCol           timestamp  GroupVal TextVal
0 2019-08-31      1234 2019-08-30 10:30:00    1234.0  1234_2
1 2019-08-31      8679                 NaT       NaN     NaN
2 2019-09-07      1234 2019-09-08 14:00:00    1234.0  1234_3
3 2019-09-07      8679 2019-09-07 12:00:00    8679.0  8679_1
4 2019-09-14      1234 2019-09-08 14:00:00    1234.0  1234_3
5 2019-09-14      8679 2019-09-07 12:00:00    8679.0  8679_1
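Note that the output above still has six rows and still lacks 1234_1. The sketch below, which rebuilds the question's data, is one way to check this: it runs pd.merge_asof with each of the three direction values and records the row count and the TextVal values that survive. The names row_counts and text_vals are illustrative only.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    'endofweek': pd.to_datetime(['2019-08-31', '2019-08-31', '2019-09-07',
                                 '2019-09-07', '2019-09-14', '2019-09-14']),
    'GroupCol': [1234, 8679, 1234, 8679, 1234, 8679],
})
df2 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-08-30 10:00', '2019-08-30 10:30',
                                 '2019-09-07 12:00', '2019-09-08 14:00']),
    'GroupVal': [1234, 1234, 8679, 1234],
    'TextVal': ['1234_1', '1234_2', '8679_1', '1234_3'],
})

row_counts = {}  # hypothetical helper: direction -> number of result rows
text_vals = {}   # hypothetical helper: direction -> matched TextVal values

for direction in ('backward', 'forward', 'nearest'):
    out = pd.merge_asof(df1, df2,
                        tolerance=pd.Timedelta('7d'), direction=direction,
                        left_on='endofweek', right_on='timestamp',
                        left_by='GroupCol', right_by='GroupVal')
    row_counts[direction] = len(out)
    text_vals[direction] = set(out['TextVal'].dropna())

print(row_counts)
print(text_vals)
```

Because merge_asof picks at most one right-hand row for each left-hand row, changing direction changes which match is kept but cannot return both 1234_1 and 1234_2 for the same endofweek.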

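pd.merge_asof only performs a left join. After a lot of frustration trying to speed up the groupby/merge_ordered example, I found it more intuitive and faster to run pd.merge_asof on the two data sources in opposite directions and then combine the results with an outer join.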

left_merge = pd.merge_asof(df1, df2,
    tolerance=pd.Timedelta('7d'), direction='backward', 
    left_on='endofweek', right_on='timestamp', 
    left_by='GroupCol', right_by='GroupVal')

right_merge = pd.merge_asof(df2, df1, 
    tolerance=pd.Timedelta('7d'), direction='forward', 
    left_on='timestamp', right_on='endofweek',
    left_by='GroupVal', right_by='GroupCol')

merged = (left_merge.merge(right_merge, how="outer")
    .sort_values(['endofweek', 'GroupCol', 'timestamp'])
    .reset_index(drop=True))

merged

   endofweek  GroupCol           timestamp  GroupVal TextVal
0 2019-08-31      1234 2019-08-30 10:00:00    1234.0  1234_1
1 2019-08-31      1234 2019-08-30 10:30:00    1234.0  1234_2
2 2019-08-31      8679                 NaT       NaN     NaN
3 2019-09-07      1234                 NaT       NaN     NaN
4 2019-09-07      8679                 NaT       NaN     NaN
5 2019-09-14      1234 2019-09-08 14:00:00    1234.0  1234_3
6 2019-09-14      8679 2019-09-07 12:00:00    8679.0  8679_1

It is also much faster than my other answer:

import time
n=1000
start=time.time()
for i in range(n):
    left_merge = pd.merge_asof(df1, df2,
        tolerance=pd.Timedelta('7d'), direction='backward', 
        left_on='endofweek', right_on='timestamp', 
        left_by='GroupCol', right_by='GroupVal')
    right_merge = pd.merge_asof(df2, df1, 
        tolerance=pd.Timedelta('7d'), direction='forward', 
        left_on='timestamp', right_on='endofweek',
        left_by='GroupVal', right_by='GroupCol')
    merged = (left_merge.merge(right_merge, how="outer")
        .sort_values(['endofweek', 'GroupCol', 'timestamp'])
        .reset_index(drop=True))

end = time.time()

end-start
15.040804386138916
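A manual time.time() loop like the one above also counts whatever else the machine happens to be doing. As a hedged alternative, the timing could be sketched with the standard-library timeit module, taking the best of several repeats. The function name two_way_asof and the repeat counts are assumptions made for illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    'endofweek': pd.to_datetime(['2019-08-31', '2019-08-31', '2019-09-07',
                                 '2019-09-07', '2019-09-14', '2019-09-14']),
    'GroupCol': [1234, 8679, 1234, 8679, 1234, 8679],
})
df2 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-08-30 10:00', '2019-08-30 10:30',
                                 '2019-09-07 12:00', '2019-09-08 14:00']),
    'GroupVal': [1234, 1234, 8679, 1234],
    'TextVal': ['1234_1', '1234_2', '8679_1', '1234_3'],
})


def two_way_asof():
    """Hypothetical wrapper around the two merge_asof calls plus outer join."""
    left_merge = pd.merge_asof(df1, df2,
                               tolerance=pd.Timedelta('7d'), direction='backward',
                               left_on='endofweek', right_on='timestamp',
                               left_by='GroupCol', right_by='GroupVal')
    right_merge = pd.merge_asof(df2, df1,
                                tolerance=pd.Timedelta('7d'), direction='forward',
                                left_on='timestamp', right_on='endofweek',
                                left_by='GroupVal', right_by='GroupCol')
    return (left_merge.merge(right_merge, how='outer')
            .sort_values(['endofweek', 'GroupCol', 'timestamp'])
            .reset_index(drop=True))


number = 100  # calls per repeat (arbitrary choice)
best = min(timeit.repeat(two_way_asof, number=number, repeat=3))
per_call = best / number
print(f'{per_call * 1000:.3f} ms per call')
```

Taking the minimum of the repeats is the usual timeit convention, since slower runs mostly reflect interference from other processes rather than the code being measured.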

One approach I tried is to use groupby on one dataframe and then, inside pd.merge_ordered, subset the other dataframe:

merged = (
    df1.groupby(['GroupCol', 'endofweek'])
    .apply(lambda x: pd.merge_ordered(
        x,
        df2[(df2['GroupVal'] == x.name[0])
            & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d'))],
        left_on='endofweek', right_on='timestamp'))
)

merged

                       endofweek  GroupCol           timestamp  GroupVal TextVal
GroupCol endofweek
1234     2019-08-31 0        NaT       NaN 2019-08-30 10:00:00    1234.0  1234_1
                    1        NaT       NaN 2019-08-30 10:30:00    1234.0  1234_2
                    2 2019-08-31    1234.0                 NaT       NaN     NaN
         2019-09-07 0 2019-09-07    1234.0                 NaT       NaN     NaN
         2019-09-14 0        NaT       NaN 2019-09-08 14:00:00    1234.0  1234_3
                    1 2019-09-14    1234.0                 NaT       NaN     NaN
8679     2019-08-31 0 2019-08-31    8679.0                 NaT       NaN     NaN
         2019-09-07 0 2019-09-07    8679.0                 NaT       NaN     NaN
         2019-09-14 0        NaT       NaN 2019-09-07 12:00:00    8679.0  8679_1
                    1 2019-09-14    8679.0                 NaT       NaN     NaN

# DataFrame.bfill() replaces fillna(method="bfill"), which newer pandas versions deprecate
merged[['endofweek', 'GroupCol']] = merged[['endofweek', 'GroupCol']].bfill()

merged.reset_index(drop=True, inplace=True)

merged
   endofweek  GroupCol           timestamp  GroupVal TextVal
0 2019-08-31    1234.0 2019-08-30 10:00:00    1234.0  1234_1
1 2019-08-31    1234.0 2019-08-30 10:30:00    1234.0  1234_2
2 2019-08-31    1234.0                 NaT       NaN     NaN
3 2019-09-07    1234.0                 NaT       NaN     NaN
4 2019-09-14    1234.0 2019-09-08 14:00:00    1234.0  1234_3
5 2019-09-14    1234.0                 NaT       NaN     NaN
6 2019-08-31    8679.0                 NaT       NaN     NaN
7 2019-09-07    8679.0                 NaT       NaN     NaN
8 2019-09-14    8679.0 2019-09-07 12:00:00    8679.0  8679_1
9 2019-09-14    8679.0                 NaT       NaN     NaN

However, this turns out to be very slow, in my opinion:

import time
n=1000
start=time.time()
for i in range(n):
    merged = (
        df1.groupby(['GroupCol', 'endofweek'])
        .apply(lambda x: pd.merge_ordered(
            x,
            df2[(df2['GroupVal'] == x.name[0])
                & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d'))],
            left_on='endofweek', right_on='timestamp'))
    )

end = time.time()

end-start
40.72932052612305

I would greatly appreciate any improvements!
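One possible direction, offered as a sketch rather than a tested replacement, is to skip the per-group apply: join the two frames on the group key with an ordinary merge, keep only the pairs that fall inside the 7-day window, and left-join the survivors back onto df1 so that unmatched rows remain. The names window, pairs, matches and all_matches are illustrative. The sketch assumes each (endofweek, GroupCol) pair occurs only once in df1 and that the intermediate pairs frame, which grows with the product of the group sizes, fits in memory.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    'endofweek': pd.to_datetime(['2019-08-31', '2019-08-31', '2019-09-07',
                                 '2019-09-07', '2019-09-14', '2019-09-14']),
    'GroupCol': [1234, 8679, 1234, 8679, 1234, 8679],
})
df2 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-08-30 10:00', '2019-08-30 10:30',
                                 '2019-09-07 12:00', '2019-09-08 14:00']),
    'GroupVal': [1234, 1234, 8679, 1234],
    'TextVal': ['1234_1', '1234_2', '8679_1', '1234_3'],
})

window = pd.Timedelta('7d')

# Pair every df1 row with every df2 row of the same group.
pairs = df1.merge(df2, left_on='GroupCol', right_on='GroupVal', how='inner')

# Keep pairs whose timestamp is at most 7 days before endofweek.
delta = pairs['endofweek'] - pairs['timestamp']
matches = pairs[(delta >= pd.Timedelta(0)) & (delta <= window)]

# Left-join back so df1 rows without any match survive with NaT/NaN.
all_matches = (df1.merge(matches, on=['endofweek', 'GroupCol'], how='left')
               .sort_values(['endofweek', 'GroupCol', 'timestamp'])
               .reset_index(drop=True))

print(all_matches)
```

On the sample data this keeps both 1234_1 and 1234_2 for the 2019-08-31 row of group 1234, along with the three df1 rows that have no match.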
