Elegant Pandas group-by and update?

Published 2024-10-03 19:23:12


I have the following pandas.DataFrame object:

       offset                      ts               op    time
0    0.000000 2015-10-27 18:31:40.318       Decompress   2.953
1    0.000000 2015-10-27 18:31:40.318  DeserializeBond   0.015
32   0.000000 2015-10-27 18:31:40.318         Compress  17.135
33   0.000000 2015-10-27 18:31:40.318       BuildIndex  19.494
34   0.000000 2015-10-27 18:31:40.318      InsertIndex   0.625
35   0.000000 2015-10-27 18:31:40.318         Compress  16.970
36   0.000000 2015-10-27 18:31:40.318       BuildIndex  18.954
37   0.000000 2015-10-27 18:31:40.318      InsertIndex   0.047
38   0.000000 2015-10-27 18:31:40.318         Compress  16.017
39   0.000000 2015-10-27 18:31:40.318       BuildIndex  17.814
40   0.000000 2015-10-27 18:31:40.318      InsertIndex   0.047
77   4.960683 2015-10-27 18:36:37.959       Decompress   2.844
78   4.960683 2015-10-27 18:36:37.959  DeserializeBond   0.000
108  4.960683 2015-10-27 18:36:37.959         Compress  17.758
109  4.960683 2015-10-27 18:36:37.959       BuildIndex  19.742
110  4.960683 2015-10-27 18:36:37.959      InsertIndex   0.110
111  4.960683 2015-10-27 18:36:37.959         Compress  16.267
112  4.960683 2015-10-27 18:36:37.959       BuildIndex  18.111
113  4.960683 2015-10-27 18:36:37.959      InsertIndex   0.062

I want to group by the (offset, ts, op) fields and sum the time values:

df = df.groupby(['offset', 'ts', 'op']).sum()

So far so good:

                                                    time
offset   ts                      op                     
0.000000 2015-10-27 18:31:40.318 BuildIndex       56.262
                                 Compress         50.122
                                 Decompress        2.953
                                 DeserializeBond   0.015
                                 InsertIndex       0.719
4.960683 2015-10-27 18:36:37.959 BuildIndex       37.853
                                 Compress         34.025
                                 Decompress        2.844
                                 DeserializeBond   0.000
                                 InsertIndex       0.172

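For reference, the grouping step can be reproduced on a small slice of the data (a runnable sketch; the values are taken from the rows above):

```python
import pandas as pd

# A few rows shaped like the question's data (values from the table above).
df = pd.DataFrame({
    "offset": [0.0, 0.0, 0.0, 4.960683, 4.960683],
    "ts": pd.to_datetime(["2015-10-27 18:31:40.318"] * 3
                         + ["2015-10-27 18:36:37.959"] * 2),
    "op": ["Compress", "Compress", "BuildIndex", "Compress", "BuildIndex"],
    "time": [17.135, 16.970, 19.494, 17.758, 19.742],
})

# Group on the three key columns and sum the durations within each group.
grouped = df.groupby(["offset", "ts", "op"]).sum()
print(grouped)
```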
The problem is that, within each group, I have to subtract the Compress time from the BuildIndex time. It was recommended that I use DataFrame.xs(), and I came up with the following:

diff = df.xs("BuildIndex", level="op") - df.xs("Compress", level="op")
diff['op'] = 'BuildIndex'
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)
df.update(diff)

It does work, but I have a strong feeling that there must be a more elegant solution to this problem.

Can anyone suggest a better approach?


1 Answer

#1

Note that your line:

diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)

is unnecessary, because diff is unchanged by it (its (offset, ts, op) index is already unique after the previous groupby).
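This is easy to check: after a groupby(...).sum() the group keys form a unique index, so grouping on the same keys again cannot merge any rows. A small sketch, using values from the grouped output above:

```python
import pandas as pd

# Two rows shaped like the grouped output above.
df = pd.DataFrame({
    "offset": [0.0, 0.0],
    "ts": pd.to_datetime(["2015-10-27 18:31:40.318"] * 2),
    "op": ["BuildIndex", "Compress"],
    "time": [56.262, 50.122],
}).set_index(["offset", "ts", "op"])

# The index is unique, so re-grouping on the same keys is a round trip.
print(df.index.is_unique)
regrouped = df.reset_index().groupby(["offset", "ts", "op"]).sum()
print(regrouped.equals(df))
```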


A slight trick is to use drop_level=False together with .values (so the index is ignored during the subtraction). This is a little cheeky, since it assumes each group has exactly one "BuildIndex" and one "Compress" row, which may not be safe.

In [11]: diff = df1.xs("BuildIndex", level="op", drop_level=False) - df1.xs("Compress", level="op").values

In [12]: diff
Out[12]:
                                              time
offset   ts                      op
0.000000 2015-10-27 18:31:40.318 BuildIndex  6.140
4.960683 2015-10-27 18:36:37.959 BuildIndex  3.828

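If that assumption feels too fragile, a safer variant (a sketch, not part of the original answer) is to drop the op level from both slices and let pandas align the subtraction on the shared (offset, ts) index, so a missing row produces NaN instead of a silent misalignment:

```python
import pandas as pd

# A frame shaped like the grouped df1 above (only the two relevant ops).
idx = pd.MultiIndex.from_tuples(
    [(0.0, "2015-10-27 18:31:40.318", "BuildIndex"),
     (0.0, "2015-10-27 18:31:40.318", "Compress"),
     (4.960683, "2015-10-27 18:36:37.959", "BuildIndex"),
     (4.960683, "2015-10-27 18:36:37.959", "Compress")],
    names=["offset", "ts", "op"])
df1 = pd.DataFrame({"time": [56.262, 50.122, 37.853, 34.025]}, index=idx)

# Both slices are indexed by (offset, ts), so the subtraction aligns on it.
build = df1.xs("BuildIndex", level="op")
comp = df1.xs("Compress", level="op")
diff = build - comp
print(diff)
```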
I would be tempted to unstack here, since the data really is two-dimensional:

In [21]: res = df1.unstack("op")

In [22]: res
Out[22]:
                                       time
op                               BuildIndex Compress Decompress DeserializeBond InsertIndex
offset   ts
0.000000 2015-10-27 18:31:40.318     56.262   50.122      2.953           0.015       0.719
4.960683 2015-10-27 18:36:37.959     37.853   34.025      2.844           0.000       0.172

It is not obvious that you want the MultiIndex columns here, though, so you can flatten them to just the op level:

In [23]: res.columns = res.columns.get_level_values(1)

In [24]: res
Out[24]:
op                               BuildIndex  Compress  Decompress  DeserializeBond  InsertIndex
offset   ts
0.000000 2015-10-27 18:31:40.318     56.262    50.122       2.953            0.015        0.719
4.960683 2015-10-27 18:36:37.959     37.853    34.025       2.844            0.000        0.172

Then the subtraction is much simpler:

In [25]: res["BuildIndex"] - res["Compress"]
Out[25]:
offset    ts
0.000000  2015-10-27 18:31:40.318    6.140
4.960683  2015-10-27 18:36:37.959    3.828
dtype: float64

In [26]: res["BuildIndex"] = res["BuildIndex"] - res["Compress"]

I think this is the most elegant...
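Put together, the unstack approach runs end to end on toy data shaped like the grouped frame (a sketch; the values are taken from the output above):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(0.0, "2015-10-27 18:31:40.318", "BuildIndex"),
     (0.0, "2015-10-27 18:31:40.318", "Compress"),
     (4.960683, "2015-10-27 18:36:37.959", "BuildIndex"),
     (4.960683, "2015-10-27 18:36:37.959", "Compress")],
    names=["offset", "ts", "op"])
df1 = pd.DataFrame({"time": [56.262, 50.122, 37.853, 34.025]}, index=idx)

res = df1.unstack("op")                        # one column per op
res.columns = res.columns.get_level_values(1)  # ("time", op) -> op
res["BuildIndex"] = res["BuildIndex"] - res["Compress"]
print(res["BuildIndex"])
```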
