A fast, efficient way to compute time differences between groups of rows in a large pandas DataFrame?

Posted 2024-09-25 02:34:23

Suppose I have this table in a DataFrame, with refill dates for several cars:

+-------+-------------+
| carId | refill_date |
+-------+-------------+
|     1 |  2020-03-01 |
+-------+-------------+
|     1 |  2020-03-12 |
+-------+-------------+
|     1 |  2020-04-04 |
+-------+-------------+
|     2 |  2020-03-07 |
+-------+-------------+
|     2 |  2020-03-26 |
+-------+-------------+
|     2 |  2020-04-01 |
+-------+-------------+

I want to add a third column, time_elapsed, containing the time elapsed since the same car's previous refill:

+-------+-------------+--------------+
| carId | refill_date | time_elapsed |
+-------+-------------+--------------+
|     1 |  2020-03-01 |              |
+-------+-------------+--------------+
|     1 |  2020-03-12 |           11 |
+-------+-------------+--------------+
|     1 |  2020-04-04 |           23 |
+-------+-------------+--------------+
|     2 |  2020-03-07 |              |
+-------+-------------+--------------+
|     2 |  2020-03-26 |           19 |
+-------+-------------+--------------+
|     2 |  2020-04-01 |            6 |
+-------+-------------+--------------+

Here is what I have so far:

import pandas as pd

data = [
    {
        "carId": 1,
        "refill_date": "2020-3-1"
    },
    {
        "carId": 1,
        "refill_date": "2020-3-12"
    },
    {
        "carId": 1,
        "refill_date": "2020-4-4"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-7"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-26"
    },
    {
        "carId": 2,
        "refill_date": "2020-4-1"
    }
]

df = pd.DataFrame(data)

df['refill_date'] = pd.to_datetime(df['refill_date'])

# Loop over each car and diff that car's refill dates separately
for c in df['carId'].unique():
    df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                      'refill_date'].diff()

It returns the expected result:

+---+-------+-------------+--------------+
|   | carId | refill_date | time_elapsed |
+---+-------+-------------+--------------+
| 0 |     1 |  2020-03-01 |          NaT |
+---+-------+-------------+--------------+
| 1 |     1 |  2020-03-12 |      11 days |
+---+-------+-------------+--------------+
| 2 |     1 |  2020-04-04 |      23 days |
+---+-------+-------------+--------------+
| 3 |     2 |  2020-03-07 |          NaT |
+---+-------+-------------+--------------+
| 4 |     2 |  2020-03-26 |      19 days |
+---+-------+-------------+--------------+
| 5 |     2 |  2020-04-01 |       6 days |
+---+-------+-------------+--------------+

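So everything seems to work, but here is the catch: in my real-world case the DataFrame has 3.5 million rows, and processing takes a very long time, even though it is a purely numeric in-memory computation with "only" 1,711 groups to loop over.

Is there a faster way to do this?

Thanks.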

3 Answers

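You just need to use .groupby:

# Select the date column explicitly, so that any other column
# (for example an existing time_elapsed) is not diffed as well.
df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

Output (the computed differences):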

  refill_date
0         NaT
1     11 days
2     23 days
3         NaT
4     19 days
5      6 days
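The desired table in the question shows plain day counts (11, 23, ...) rather than timedeltas. Here is a minimal sketch of one way to get there, assuming whole-day resolution is enough. The `days_elapsed` column name is made up for this illustration, and the nullable `Int64` dtype is just one choice for keeping each car's first refill as a missing value.

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime(
        ["2020-03-01", "2020-03-12", "2020-04-04",
         "2020-03-07", "2020-03-26", "2020-04-01"]
    ),
})

# Per-car difference between consecutive refill dates (timedelta values).
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()

# .dt.days extracts the whole-day component; rows holding NaT come back as
# NaN, so cast to the nullable Int64 dtype to keep integers next to gaps.
df["days_elapsed"] = df["time_elapsed"].dt.days.astype("Int64")

print(df)
```

If a float column with NaN is acceptable, the final `astype("Int64")` can be dropped.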

Use shift to get each car's previous refill date, then subtract it from the current refill date to get the elapsed time:

(
    df.assign(
        refill_date=pd.to_datetime(df.refill_date),
        time_shift=lambda x: x.groupby("carId").refill_date.shift(),
        time_elapsed=lambda x: x.time_shift.sub(x.refill_date).abs(),
    )
)

The other answers using diff are better, since they are more concise and, I believe, faster.
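To make the relationship between the two approaches concrete, the sketch below computes the elapsed time both ways on the question's sample data and checks that they agree. The variable names (`previous`, `via_shift`, `via_diff`, `same`) are illustrative only, and the data is assumed to be already ordered by date within each car.

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime(
        ["2020-03-01", "2020-03-12", "2020-04-04",
         "2020-03-07", "2020-03-26", "2020-04-01"]
    ),
})

# Shift approach: previous refill date of the same car, then subtract.
previous = df.groupby("carId")["refill_date"].shift()
via_shift = df["refill_date"] - previous

# Diff approach: the same subtraction in a single call.
via_diff = df.groupby("carId")["refill_date"].diff()

# Series.equals treats missing values in matching positions as equal.
same = via_shift.equals(via_diff)
print(same)
```

Subtracting in the order "current minus previous" gives positive durations directly, so no `abs()` is needed in this variant.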

Using native pandas methods on df.groupby should give a significant performance improvement over a "native Python" loop:

df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
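One caveat worth sketching: `diff()` subtracts each row from the one before it in the group's existing row order, so the result only means "time since the previous refill" if each car's rows are in chronological order. The question's sample is already sorted. The shuffled input below is a hypothetical case, and sorting first is an assumption about what a real dataset might need.

```python
import pandas as pd

# Hypothetical input: same refills as in the question, but in arbitrary order.
df = pd.DataFrame({
    "carId": [2, 1, 1, 2, 1, 2],
    "refill_date": pd.to_datetime(
        ["2020-04-01", "2020-03-12", "2020-03-01",
         "2020-03-07", "2020-04-04", "2020-03-26"]
    ),
})

# Order rows by car, then by date, so consecutive rows are consecutive refills.
df = df.sort_values(["carId", "refill_date"]).reset_index(drop=True)

df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()
print(df)
```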

Here is a small benchmark (on my laptop, YMMV...) using 99 cars (IDs 1 to 99) with 31 days each, showing a speedup of roughly 9x:

import pandas as pd
import timeit

data = [{"carId": carId, "refill_date": "2020-3-"+str(day)} for carId in range(1,100) for day in range(1,32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])

def original_method():
    for c in df['carId'].unique():
        df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                          'refill_date'].diff()

def using_groupby():
    df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)

print(time1)
print(time2)
print(time1/time2)

Output:

16.6183732
1.7910263000000022
9.278687420726307
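A timing comparison is only meaningful if both methods produce the same values, so here is a small sanity check on the same synthetic data. It is a sketch rather than part of the original benchmark: `loop_diff` is a hypothetical helper that mirrors the per-car loop but writes into a preallocated Series, and the comparison is done on whole days.

```python
import pandas as pd

# Same synthetic data as the benchmark: car IDs 1..99, 31 days each.
data = [{"carId": car_id, "refill_date": "2020-3-" + str(day)}
        for car_id in range(1, 100) for day in range(1, 32)]
df = pd.DataFrame(data)
df["refill_date"] = pd.to_datetime(df["refill_date"])


def loop_diff(frame):
    """Per-car loop, writing each car's diffs into a preallocated Series."""
    out = pd.Series(pd.NaT, index=frame.index, dtype="timedelta64[ns]")
    for c in frame["carId"].unique():
        mask = frame["carId"] == c
        out.loc[mask] = frame.loc[mask, "refill_date"].diff()
    return out


loop_result = loop_diff(df)
groupby_result = df.groupby("carId")["refill_date"].diff()

# Compare as day counts; missing values in matching positions count as equal.
matches = loop_result.dt.days.equals(groupby_result.dt.days)
print(matches)
```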