Suppose I have this table in a DataFrame, with the refill dates of several cars:
+-------+-------------+
| carId | refill_date |
+-------+-------------+
| 1 | 2020-03-01 |
+-------+-------------+
| 1 | 2020-03-12 |
+-------+-------------+
| 1 | 2020-04-04 |
+-------+-------------+
| 2 | 2020-03-07 |
+-------+-------------+
| 2 | 2020-03-26 |
+-------+-------------+
| 2 | 2020-04-01 |
+-------+-------------+
I would like to add a third column, time_elapsed, containing the number of days between consecutive refills of the same car:
+-------+-------------+--------------+
| carId | refill_date | time_elapsed |
+-------+-------------+--------------+
| 1 | 2020-03-01 | |
+-------+-------------+--------------+
| 1 | 2020-03-12 | 11 |
+-------+-------------+--------------+
| 1 | 2020-04-04 | 23 |
+-------+-------------+--------------+
| 2 | 2020-03-07 | |
+-------+-------------+--------------+
| 2 | 2020-03-26 | 19 |
+-------+-------------+--------------+
| 2 | 2020-04-01 | 6 |
+-------+-------------+--------------+
Here is what I did:
import pandas as pd

data = [
    {"carId": 1, "refill_date": "2020-3-1"},
    {"carId": 1, "refill_date": "2020-3-12"},
    {"carId": 1, "refill_date": "2020-4-4"},
    {"carId": 2, "refill_date": "2020-3-7"},
    {"carId": 2, "refill_date": "2020-3-26"},
    {"carId": 2, "refill_date": "2020-4-1"},
]
df = pd.DataFrame(data)

# Parse the date strings into datetime64 values
df['refill_date'] = pd.to_datetime(df['refill_date'])

# For each car, compute the difference between consecutive refill dates
for c in df['carId'].unique():
    df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                      'refill_date'].diff()
It returns the expected result:
+---+-------+-------------+--------------+
| | carId | refill_date | time_elapsed |
+---+-------+-------------+--------------+
| 0 | 1 | 2020-03-01 | NaT |
+---+-------+-------------+--------------+
| 1 | 1 | 2020-03-12 | 11 days |
+---+-------+-------------+--------------+
| 2 | 1 | 2020-04-04 | 23 days |
+---+-------+-------------+--------------+
| 3 | 2 | 2020-03-07 | NaT |
+---+-------+-------------+--------------+
| 4 | 2 | 2020-03-26 | 19 days |
+---+-------+-------------+--------------+
| 5 | 2 | 2020-04-01 | 6 days |
+---+-------+-------------+--------------+
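So everything seems to work, but here is the catch: in my real-life case the DataFrame contains 3.5 million rows, and processing takes a very long time, even though it is a purely numeric, in-memory computation with "only" 1711 groups to loop over.
Is there a faster way to do this?
Thanks.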
You only need to use .groupby and take the diff of refill_date within each carId group.
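The answer's original code was not preserved here, so the following is only a minimal sketch of what a groupby-based version might look like. It assumes the rows are already sorted by refill_date within each car (otherwise sort first), and the days_elapsed column is a hypothetical addition to get plain day counts as in the desired table.

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
})

# diff() runs within each carId group, so the first refill of each car is NaT
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()

# Optional (assumed extra step): timedeltas as day counts, NaN for the first refill
df["days_elapsed"] = df["time_elapsed"].dt.days

print(df)
```

This replaces the explicit Python loop over carId values with a single grouped operation.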
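Get the elapsed time by using shift within each group and subtracting the shifted value from refill_date.
The other answer, which uses diff, is better: it is more concise and, I believe, faster.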
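For comparison, the shift-based variant could be sketched as follows. This is a reconstruction under the same assumption as before (rows sorted by date within each car), not the answer's original code.

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
})

# shift() moves each car's dates down by one row within its group,
# so previous_refill is NaT on the first row of every car
previous_refill = df.groupby("carId")["refill_date"].shift()

# Current refill minus previous refill gives the elapsed time
df["time_elapsed"] = df["refill_date"] - previous_refill

print(df)
```

The result should match what diff produces; shift is mainly useful when the previous value itself is needed for something else.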
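Using the native pandas methods on df.groupby should give a significant performance improvement over a "native Python" loop. In a small benchmark (on my laptop, YMMV) with 100 cars and 31 days per car, it was almost 10 times faster.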
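The benchmark code itself did not survive, so here is a rough sketch of how such a comparison might be set up with 100 cars and 31 daily refills each. The helper names (loop_diff, groupby_diff, best_time) are hypothetical, and the timings will vary by machine and pandas version.

```python
import time
import pandas as pd

N_CARS, N_DAYS = 100, 31  # sizes quoted in the answer

# Synthetic data: every car refills once a day for N_DAYS days
dates = pd.date_range("2020-03-01", periods=N_DAYS, freq="D")
df = pd.DataFrame({
    "carId": [car for car in range(N_CARS) for _ in range(N_DAYS)],
    "refill_date": list(dates) * N_CARS,
})

def loop_diff(frame):
    """Per-car Python loop, following the approach in the question."""
    out = frame.copy()
    for c in out["carId"].unique():
        mask = out["carId"] == c
        out.loc[mask, "time_elapsed"] = out.loc[mask, "refill_date"].diff()
    return out

def groupby_diff(frame):
    """Single grouped diff, following the approach in the answers."""
    out = frame.copy()
    out["time_elapsed"] = out.groupby("carId")["refill_date"].diff()
    return out

def best_time(func, frame, repeats=5):
    """Return the fastest of several wall-clock runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(frame)
        timings.append(time.perf_counter() - start)
    return min(timings)

print(f"loop:    {best_time(loop_diff, df):.4f} s")
print(f"groupby: {best_time(groupby_diff, df):.4f} s")
```

With 1711 groups and 3.5 million rows the gap may differ from the figure quoted above, so it is worth measuring on a sample of the real data.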