Aggregating data based on a condition

Posted 2024-06-23 19:20:07


I have a dataset that looks like this:

pd.DataFrame({
 'car_id': ['1', '1', '1', '1', '1', '1', '1', '1', '1','1','1','1'],
 'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
 'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
 'need_maintanince': [0,0,1,0,0,0,1,0,1,0,1,0]
 })

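Basically, I want to add up the odometer differences of every observation since the last time a maintenance need was triggered (need_maintanince = 1).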

So I would like the result to look like this:

pd.DataFrame({
 'car_id': ['1', '1', '1', '1', '1', '1', '1', '1', '1','1','1','1'],
 'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
 'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
 'need_maintanince': [0,0,1,0,0,0,1,0,1,0,1,0],
 'miles_since_maint': [0,0,0,4,9,30,80,12,18,15,75,24]})

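Basically, it looks at each observation and accumulates the miles driven since an observation for the same car_id was flagged as needing maintenance. It then keeps accumulating the miles since that maintenance, and the count starts over after the next flagged observation.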
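
To make the rule concrete, here is a minimal plain-Python sketch of it for a single car. The function name `miles_since_maint` and its loop are only an illustration of the expected output above, not a proposed solution; it assumes the rows are already in chronological order.

```python
def miles_since_maint(starts, ends, flags):
    """Running miles since the last flagged row, for one car's rows in order."""
    out = []
    running = 0
    seen_maint = False  # rows up to and including the first flag report 0
    for start, end, flag in zip(starts, ends, flags):
        if seen_maint:
            running += end - start
            out.append(running)
        else:
            out.append(0)
        if flag == 1:
            # The flagged row still reports its own total; the count restarts after it.
            seen_maint = True
            running = 0
    return out


odometer_start = [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182]
odometer_end = [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206]
need_maintanince = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

result = miles_since_maint(odometer_start, odometer_end, need_maintanince)
print(result)  # [0, 0, 0, 4, 9, 30, 80, 12, 18, 15, 75, 24]
```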

For reference, I am trying to predict how many miles a car travels before it needs maintenance.

Does anyone know how to do this?

Edit:

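I don't think I made the expected output clear enough. I have updated it to match what I need, and simplified the DataFrame, because having multiple car ids was confusing, even to me.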


3 answers

Similar to Quang Hoang's answer, but as a one-liner without numpy:

df['miles_since_last_maint'] = df.groupby('car_id')['odometer_start'].diff().where(df.need_maintanince==1,0).astype(int)

Result (on the multi-car data from before the question was edited):

   car_id  need_maintanince  odometer_start  miles_since_last_maint
0       1                 0               0                       0
1       2                 0               5                       0
2       2                 0               9                       0
3       3                 0               1                       0
4       3                 1               3                       2
5       3                 0               8                       0
6       3                 1              19                      11
7       3                 1              52                      33
8       1                 0              11                       0
9       2                 0              22                       0
10      2                 1              64                      42
11      4                 0             132                       0
12      4                 1             144                      12
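
One caveat with the one-liner above: if a car's first row is itself flagged, `diff()` leaves a NaN there and `astype(int)` raises. A hedged variant that fills those gaps with 0 first is sketched below; the small DataFrame is made up purely to trigger that case.

```python
import pandas as pd

# Hypothetical data: car '1' is flagged on its very first row.
df = pd.DataFrame({
    'car_id': ['1', '1', '2', '2'],
    'odometer_start': [0, 7, 5, 9],
    'need_maintanince': [1, 1, 0, 1],
})

# Per-car difference to the previous row; NaN on each car's first row.
diffs = df.groupby('car_id')['odometer_start'].diff()

# Keep the difference only on flagged rows, then replace leftover NaN with 0
# so the cast to int cannot fail.
df['miles_since_last_maint'] = (
    diffs.where(df['need_maintanince'] == 1, 0).fillna(0).astype(int)
)
print(df)
```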

IIUC:

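import numpy as np

# Per-car difference to the previous odometer reading
s = df.groupby('car_id')['odometer_start'].diff()
# Keep the difference only on rows flagged for maintenance, else 0
df['miles_since_last_maint'] = np.where(df['need_maintanince'], s, 0)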

which gives:

   car_id  odometer_start  need_maintanince  miles_since_last_maint
0       1               0                 0                     0.0
1       2               5                 0                     0.0
2       2               9                 0                     0.0
3       3               1                 0                     0.0
4       3               3                 1                     2.0
5       3               8                 0                     0.0
6       3              19                 1                    11.0
7       3              52                 1                    33.0
8       1              11                 0                     0.0
9       2              22                 0                     0.0
10      2              64                 1                    42.0
11      4             132                 0                     0.0
12      4             144                 1                    12.0

This seems to give the result you are looking for:

df = pd.DataFrame({
 'car_id': ['1', '2', '2', '3', '3', '3', '3', '3', '1','2','2','4','4'],
 'odometer_start': [0, 5, 9, 1,3, 8,19,52,11,22,64,132, 144],
 'need_maintanince': [0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
 })

df['miles_since_maint'] = (df.groupby('car_id')['odometer_start'].diff() 
                            * df['need_maintanince']).fillna(0)

Output:

   car_id        ...          miles_since_maint
0       1        ...                        0.0
1       2        ...                        0.0
2       2        ...                        0.0
3       3        ...                        0.0
4       3        ...                        2.0
5       3        ...                        0.0
6       3        ...                       11.0
7       3        ...                       33.0
8       1        ...                        0.0
9       2        ...                        0.0
10      2        ...                       42.0
11      4        ...                        0.0
12      4        ...                       12.0

Edit, per the comments:

df = pd.DataFrame({
 'car_id': ['1', '1', '1', '1', '1', '1', '1', '1', '1','1','1','1'],
 'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
 'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
 'need_maintanince': [0,0,1,0,0,0,1,0,1,0,1,0],
 'miles_since_maint': [0,0,0,4,9,30,80,12,18,15,75,24]})

df['odo_chg'] = df['odometer_end'] - df['odometer_start']
maint_group = df['need_maintanince'].shift().cumsum().fillna(0)
df['miles_since_maint_2'] = (df.groupby(['car_id', maint_group])['odo_chg'].cumsum())
# Reassign initial group to 0 per desired output
df.loc[maint_group == 0, 'miles_since_maint_2'] = 0
df.T

which gives (transposed for easier viewing):

                    0  1  2   3   4   5   6    7    8    9    10   11
car_id               1  1  1   1   1   1   1    1    1    1    1    1
odometer_start       0  3  6   9  13  18  39   89  101  107  122  182
odometer_end         3  6  9  13  18  39  89  101  107  122  182  206
need_maintanince     0  0  1   0   0   0   1    0    1    0    1    0
miles_since_maint    0  0  0   4   9  30  80   12   18   15   75   24
odo_chg              3  3  3   4   5  21  50   12    6   15   60   24
miles_since_maint_2  0  0  0   4   9  30  80   12   18   15   75   24
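
Note that `maint_group` in the snippet above is computed over the whole frame, so with several cars the `shift()` would carry a flag across a car boundary. A minimal sketch of a per-car variant follows; the helper name `add_miles_since_maint` and the rows for car '2' are assumptions for illustration, and it assumes each car's rows are in chronological order.

```python
import pandas as pd


def add_miles_since_maint(df):
    """Return a copy of df with a per-car 'miles_since_maint' column."""
    out = df.copy()
    out['odo_chg'] = out['odometer_end'] - out['odometer_start']
    # Number of maintenance flags seen on earlier rows of the same car.
    prior_flags = (
        out.groupby('car_id')['need_maintanince']
           .transform(lambda s: s.shift(fill_value=0).cumsum())
           .rename('maint_group')
    )
    # Running mileage within each (car, maintenance interval) pair.
    out['miles_since_maint'] = (
        out.groupby(['car_id', prior_flags])['odo_chg'].cumsum()
    )
    # Rows before a car's first flag report 0, per the desired output.
    out.loc[prior_flags == 0, 'miles_since_maint'] = 0
    return out


df = pd.DataFrame({
    'car_id': ['1'] * 12 + ['2'] * 4,  # car '2' is hypothetical
    'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182,
                       0, 10, 25, 40],
    'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206,
                     10, 25, 40, 70],
    'need_maintanince': [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
                         1, 0, 0, 1],
})
result = add_miles_since_maint(df)
print(result[['car_id', 'need_maintanince', 'odo_chg', 'miles_since_maint']])
```

Grouping by both `car_id` and the per-car flag count keeps each car's intervals separate, so the input does not have to be limited to a single car.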
