Finding an efficient way to compare and filter overlapping date ranges

Posted 2024-10-04 09:19:42


I have a DataFrame in which two columns contain date ranges.

The raw data looks like this:

df1:
personid, visitto, intime, outtime
8, 8, 2017-07-01 06:00:00, 2017-07-01 08:00:00
17, 8, 2017-07-02 09:00:00, 2017-07-02 10:00:00

df2:
location, open, close
8, 2017-07-01 04:00:00, 2017-07-01 13:00:00

Here is what I have done so far:

  1. Merge the two DataFrames on visitto -> location

merged_df:
personid, visitto, intime, outtime, location, open, close
8, 8, 2017-07-01 06:00:00, 2017-07-01 08:00:00, 8, 2017-07-01 04:00:00, 2017-07-01 13:00:00
17, 8, 2017-07-02 09:00:00, 2017-07-02 10:00:00, 8, 2017-07-01 04:00:00, 2017-07-01 13:00:00

  2. Convert the four date columns into intervals
personid, visitto, visittime, opentime
8, 8, [2017-07-01 06:00:00, 2017-07-01 08:00:00], [2017-07-01 04:00:00, 2017-07-01 13:00:00]
17, 8, [2017-07-02 09:00:00, 2017-07-02 10:00:00], [2017-07-01 04:00:00, 2017-07-01 13:00:00]

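I now want to filter this merged interval DataFrame so that it returns only the rows where the two datetime ranges overlap. My expected result is: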

personid, visitto, visittime, opentime
8, 8, [2017-07-01 06:00:00, 2017-07-01 08:00:00], [2017-07-01 04:00:00, 2017-07-01 13:00:00]

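By using iterrows() and comparing the overlaps row by row I managed to get the desired result, but this is very verbose. I would prefer to do something like this (non-working example):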

merged_df.loc[merged_df['visittime'].overlaps(merged_df['opentime'])]

Alternatively, do it during the DataFrame merge. Something like (non-working example):

merge_df = pd.merge(df1[["personid", "visitto", "intime", "outtime"]], df2[["location", "open", "close"]], how='inner', left_on='visitto', right_on='location') #WHERE intime, outtime OVERLAPS open, close

Is it possible to do this with a function rather than a loop? This is easy to do with SQL in a database, but I am struggling to find an efficient way to do it on a DataFrame.


3 answers

Although this is still essentially a loop, it does get the job done.

merged_df[merged_df.apply(lambda x: x['visittime'].overlaps(x['opentime']), axis=1)]
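The one-liner above only works if each cell holds a pd.Interval object rather than a list of two timestamps. A minimal sketch of that setup, assuming the sample rows from the question and the column names visittime and opentime (the closed="both" choice is an assumption, so that touching endpoints count as an overlap):

```python
import pandas as pd

# Sample rows copied from the question's merged_df.
merged_df = pd.DataFrame({
    "personid": [8, 17],
    "visitto": [8, 8],
    "intime": pd.to_datetime(["2017-07-01 06:00:00", "2017-07-02 09:00:00"]),
    "outtime": pd.to_datetime(["2017-07-01 08:00:00", "2017-07-02 10:00:00"]),
    "open": pd.to_datetime(["2017-07-01 04:00:00", "2017-07-01 04:00:00"]),
    "close": pd.to_datetime(["2017-07-01 13:00:00", "2017-07-01 13:00:00"]),
})

# Build one pd.Interval per row (assumption: closed on both ends).
merged_df["visittime"] = [
    pd.Interval(start, end, closed="both")
    for start, end in zip(merged_df["intime"], merged_df["outtime"])
]
merged_df["opentime"] = [
    pd.Interval(start, end, closed="both")
    for start, end in zip(merged_df["open"], merged_df["close"])
]

# Row-wise check: still a Python-level loop under the hood.
mask = merged_df.apply(lambda row: row["visittime"].overlaps(row["opentime"]), axis=1)
result = merged_df[mask]
```

Because apply with axis=1 calls the lambda once per row, this is likely to be slow on large frames. It is mainly useful when the Interval objects are needed for something else as well.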

If your version of pandas supports IntervalIndex:

df1['visittime'] = pd.IntervalIndex.from_arrays(left=df1.intime,
                                                right=df1.outtime,
                                                closed='both')

df2['worktime'] = pd.IntervalIndex.from_arrays(left=df2['open'],
                                               right=df2['close'],
                                               closed='both')

new_df = df1.merge(df2, left_on='visitto', right_on='location', how='left')
new_df.loc[[a.overlaps(b) for a, b in zip(new_df['visittime'], new_df['worktime'])]]
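The list comprehension above still visits every row in Python. One possible way to avoid that, sketched here under the assumption that both intervals are closed on both ends, is to compare the interval endpoints as whole arrays through the left and right attributes of pd.arrays.IntervalArray. The sample frames are rebuilt from the question's data, and an inner join is used so that no row is left without opening hours:

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "personid": [8, 17],
    "visitto": [8, 8],
    "intime": pd.to_datetime(["2017-07-01 06:00:00", "2017-07-02 09:00:00"]),
    "outtime": pd.to_datetime(["2017-07-01 08:00:00", "2017-07-02 10:00:00"]),
})
df2 = pd.DataFrame({
    "location": [8],
    "open": pd.to_datetime(["2017-07-01 04:00:00"]),
    "close": pd.to_datetime(["2017-07-01 13:00:00"]),
})

new_df = df1.merge(df2, left_on="visitto", right_on="location", how="inner")

# Build the interval arrays after the merge so they line up row for row.
visit = pd.arrays.IntervalArray.from_arrays(
    new_df["intime"], new_df["outtime"], closed="both"
)
work = pd.arrays.IntervalArray.from_arrays(
    new_df["open"], new_df["close"], closed="both"
)

# Closed intervals overlap when each starts no later than the other ends.
mask = (visit.left <= work.right) & (work.left <= visit.right)
overlapping = new_df.loc[mask]
```

For intervals that are open on one or both ends, the comparisons would need strict inequalities where appropriate, so this sketch should not be reused unchanged in that case.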

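Use the built-in pandas datetime functionality (together with another linked answer) rather than creating columns of compound objects (such as lists), which you will always have to iterate over: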

from io import StringIO
import pandas as pd

csv_buff = StringIO("""personid,visitto,intime,outtime,location,open,close
8,8,2017-07-01 06:00:00,2017-07-01 08:00:00,8,2017-07-01 04:00:00,2017-07-01 13:00:00
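17,8,2017-07-02 09:00:00,2017-07-02 10:00:00,8,2017-07-01 04:00:00,2017-07-01 13:00:00""")
# astype needs an explicit unit: recent pandas versions reject the unit-less "datetime64"
dtypes = {
    "intime": "datetime64[ns]",
    "outtime": "datetime64[ns]",
    "open": "datetime64[ns]",
    "close": "datetime64[ns]"
}
df = pd.read_csv(csv_buff).astype(dtypes)

# Two closed ranges overlap when each one starts no later than the other ends
df[(df["intime"] <= df["close"]) & (df["open"] <= df["outtime"])]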

Result:

   personid visitto              intime             outtime  location                open               close
0         8       8 2017-07-01 06:00:00 2017-07-01 08:00:00         8 2017-07-01 04:00:00 2017-07-01 13:00:00
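The question also asks for something close to a SQL join with an OVERLAPS condition. pandas.merge only joins on equality, so a rough equivalent is to merge on the key first and apply the overlap test immediately afterwards. The sketch below wraps the two steps in a helper; overlapping_visits is a hypothetical name, not a pandas function, and the data is the sample from the question:

```python
import pandas as pd


def overlapping_visits(visits, locations):
    """Hypothetical helper: equality join on the location key, then keep
    only rows whose visit range overlaps the opening range (closed ends)."""
    merged = visits.merge(
        locations, how="inner", left_on="visitto", right_on="location"
    )
    mask = (merged["intime"] <= merged["close"]) & (merged["open"] <= merged["outtime"])
    return merged[mask].reset_index(drop=True)


# Sample data copied from the question.
df1 = pd.DataFrame({
    "personid": [8, 17],
    "visitto": [8, 8],
    "intime": pd.to_datetime(["2017-07-01 06:00:00", "2017-07-02 09:00:00"]),
    "outtime": pd.to_datetime(["2017-07-01 08:00:00", "2017-07-02 10:00:00"]),
})
df2 = pd.DataFrame({
    "location": [8],
    "open": pd.to_datetime(["2017-07-01 04:00:00"]),
    "close": pd.to_datetime(["2017-07-01 13:00:00"]),
})

result = overlapping_visits(df1, df2)
print(result[["personid", "visitto", "intime", "outtime"]])
```

Note that the merge materializes every visit/location pair that shares a key before the filter runs, so memory use grows with the number of matching pairs rather than with the number of overlapping ones.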
