A fast, efficient way to compute time differences between rows within groups of a large Pandas DataFrame?

+-------+-------------+
| carId | refill_date |
+-------+-------------+
| 1     | 2020-03-01  |
+-------+-------------+
| 1     | 2020-03-12  |
+-------+-------------+
| 1     | 2020-04-04  |
+-------+-------------+
| 2     | 2020-03-07  |
+-------+-------------+
| 2     | 2020-03-26  |
+-------+-------------+
| 2     | 2020-04-01  |
+-------+-------------+

+-------+-------------+--------------+
| carId | refill_date | time_elapsed |
+-------+-------------+--------------+
| 1     | 2020-03-01  |              |
+-------+-------------+--------------+
| 1     | 2020-03-12  | 11           |
+-------+-------------+--------------+
| 1     | 2020-04-04  | 23           |
+-------+-------------+--------------+
| 2     | 2020-03-07  |              |
+-------+-------------+--------------+
| 2     | 2020-03-26  | 19           |
+-------+-------------+--------------+
| 2     | 2020-04-01  | 6            |
+-------+-------------+--------------+

import pandas as pd

data = [
    {"carId": 1, "refill_date": "2020-3-1"},
    {"carId": 1, "refill_date": "2020-3-12"},
    {"carId": 1, "refill_date": "2020-4-4"},
    {"carId": 2, "refill_date": "2020-3-7"},
    {"carId": 2, "refill_date": "2020-3-26"},
    {"carId": 2, "refill_date": "2020-4-1"},
]
df = pd.DataFrame(data)

# Parse the date strings so that .diff() yields Timedelta values
df['refill_date'] = pd.to_datetime(df['refill_date'])

# Current approach: compute the difference one car at a time (slow on large frames)
for c in df['carId'].unique():
    df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c, 'refill_date'].diff()

+---+-------+-------------+--------------+
|   | carId | refill_date | time_elapsed |
+---+-------+-------------+--------------+
| 0 | 1     | 2020-03-01  | NaT          |
+---+-------+-------------+--------------+
| 1 | 1     | 2020-03-12  | 11 days      |
+---+-------+-------------+--------------+
| 2 | 1     | 2020-04-04  | 23 days      |
+---+-------+-------------+--------------+
| 3 | 2     | 2020-03-07  | NaT          |
+---+-------+-------------+--------------+
| 4 | 2     | 2020-03-26  | 19 days      |
+---+-------+-------------+--------------+
| 5 | 2     | 2020-04-01  | 6 days       |
+---+-------+-------------+--------------+

3 Answers

User

#1 · edited 2024-09-25 02:34:23

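You just need to use .groupby. Note that this form works only while refill_date is the only column besides carId; if the frame has other columns, select it first with ['refill_date']: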

df['time_elapsed'] = df.groupby('carId').diff()

Output:

  refill_date
0         NaT
1     11 days
2     23 days
3         NaT
4     19 days
5      6 days
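
The question's expected output shows plain day counts (11, 23, ...) rather than Timedelta values. A minimal sketch of one way to get there, assuming whole-day differences are all that is needed, is to take the .dt.days component of the grouped diff. The variable name elapsed is only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "carId": [1, 1, 1, 2, 2, 2],
        "refill_date": pd.to_datetime(
            ["2020-03-01", "2020-03-12", "2020-04-04",
             "2020-03-07", "2020-03-26", "2020-04-01"]
        ),
    }
)

# Per-car difference between consecutive refill dates (Timedelta values).
elapsed = df.groupby("carId")["refill_date"].diff()

# .dt.days keeps only the whole-day component; the first row of each car
# has no previous refill, so it stays missing (NaN).
df["time_elapsed"] = elapsed.dt.days

print(df)
```

Because the first row of each car is missing, the resulting column is floating point rather than integer.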

User

#2 · edited 2024-09-25 02:34:23

Get the elapsed time by using shift and subtracting the shifted value from the refill date:

(
    df.assign(
        refill_date=pd.to_datetime(df.refill_date),
        time_shift=lambda x: x.groupby("carId").refill_date.shift(),
        time_elapsed=lambda x: x.time_shift.sub(x.refill_date).abs(),
    )
)

The other answers that use diff are better, since diff is more concise and, I believe, faster.
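
To see that the shift-based and diff-based approaches agree, here is a small sketch on the question's sample data. It subtracts the shifted value from the current one (so no .abs() is needed) and compares the result with the grouped diff. The names via_shift, via_diff, and same are illustrative only:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "carId": [1, 1, 1, 2, 2, 2],
        "refill_date": pd.to_datetime(
            ["2020-03-01", "2020-03-12", "2020-04-04",
             "2020-03-07", "2020-03-26", "2020-04-01"]
        ),
    }
)

grouped = df.groupby("carId")["refill_date"]

# Current refill date minus the previous refill date of the same car.
via_shift = df["refill_date"] - grouped.shift()

# The same quantity computed directly with the grouped diff.
via_diff = grouped.diff()

# Series.equals treats missing values in matching positions as equal.
same = via_shift.equals(via_diff)
print(same)
```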

User

#3 · edited 2024-09-25 02:34:23

Using native pandas methods on df.groupby should give a significant performance improvement over a "native Python" loop:

df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

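Here is a small benchmark (on my laptop, YMMV...) using about 100 cars (carId 1 through 99) with 31 days each, which shows an almost 10x performance improvement: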

import pandas as pd
import timeit

data = [{"carId": carId, "refill_date": "2020-3-"+str(day)} for carId in range(1,100) for day in range(1,32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])

def original_method():
    for c in df['carId'].unique():
        df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                          'refill_date'].diff()

def using_groupby():
    df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)

print(time1)
print(time2)
print(time1/time2)

Output:

16.6183732
1.7910263000000022
9.278687420726307
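
Both the loop and the groupby version take differences in row order, so they assume each car's refills are already in chronological order. If that is not guaranteed, a hedged sketch is to sort by carId and refill_date first. The shuffled sample below is made up for illustration:

```python
import pandas as pd

# Same refills as in the question, but deliberately out of order.
df = pd.DataFrame(
    {
        "carId": [2, 1, 1, 2, 1, 2],
        "refill_date": pd.to_datetime(
            ["2020-04-01", "2020-03-12", "2020-03-01",
             "2020-03-07", "2020-04-04", "2020-03-26"]
        ),
    }
)

# Sort so that each car's refills appear in chronological order.
df = df.sort_values(["carId", "refill_date"]).reset_index(drop=True)

# The grouped diff now compares each refill with the previous one in time.
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()

print(df)
```

Sorting adds some cost, so it is worth skipping when the data is known to arrive in order.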
