如何按多个列分组并聚合不同列上的差异?

2024-06-26 00:26:03 发布

您现在位置:Python中文网/ 问答频道 /正文

我在这里寻求有关如何在Python/Panda中执行此操作的帮助:

我希望获得原始数据(如下),并通过一组多个COL(州、县和日期)找到多个COL(cnt_a和cnt_b)的每日差异

我一直在尝试不同的方法,但我似乎无法通过“检查重复”的问题

df.cnt_a = df.sort_values(['state','county','date']).groupby['state','county','date','cnt_a'].diff(-1)

尝试将其拆分以一次修复一件事情:

df1 = df.sort_values(['state','county','date'])

df2 = df1.groupby(['state','county'])['cnt_a'].diff()

原始数据=&燃气轮机;df

date        county  state       cnt_a    cnt_b
2020-06-13  Bergen  New Jersey   308     11
2020-06-14  Bergen  New Jersey   308     11
2020-06-15  Bergen  New Jersey   320     15
2020-06-12  Union   New Jersey   100     3
2020-06-13  Union   New Jersey   130     4
2020-06-14  Union   New Jersey   150     5
2020-06-12  Bronx   New York     200     100
2020-06-13  Bronx   New York     210     200

想要的输出

date        county  state       cnt_a   cnt_b   daydiff_a    daydiff_b
2020-06-13  Bergen  New Jersey   308     11        0            0 
2020-06-14  Bergen  New Jersey   308     11        0            0
2020-06-15  Bergen  New Jersey   320     15        12           4
2020-06-12  Union   New Jersey   100     3         0            0
2020-06-13  Union   New Jersey   130     4         30           1
2020-06-14  Union   New Jersey   150     5         20           1
2020-06-12  Bronx   New York     200     100       0            0 
2020-06-13  Bronx   New York     210     200       10           100

Tags: dfnew原始数据datecolsortstatevalues
1条回答
网友
1楼 · 发布于 2024-06-26 00:26:03
  • df进行排序很重要,因为df.groupby将被排序。如果df未首先排序,则.groupby中的联接列的顺序将与df的顺序不同。
    • 一定要df,按'state''country''date'的顺序,但是.groupby中的'date'列被忽略
  • .groupby'state''country'上,并将.diff聚合到所需的列
  • fillna.join{}到groupby对象,基于索引。
    • 指定rsuffix,或使用.rename更改列标题
import pandas as pd

# setup the test dataframe
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
        'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
        'state': ['New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New York', 'New York'],
        'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
        'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}

df = pd.DataFrame(data)

# set the date column to a datetime format
df.date = pd.to_datetime(df.date)

# sort the values
df = df.sort_values(['state', 'county', 'date'])

# groupby and join back to dataframe df
df = df.join(df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']].diff().fillna(0), rsuffix='_diff')

# display(df)
        date  county       state  cnt_a  cnt_b  cnt_a_diff  cnt_b_diff
0 2020-06-13  Bergen  New Jersey    308     11         0.0         0.0
1 2020-06-14  Bergen  New Jersey    308     11         0.0         0.0
2 2020-06-15  Bergen  New Jersey    320     15        12.0         4.0
3 2020-06-12   Union  New Jersey    100      3         0.0         0.0
4 2020-06-13   Union  New Jersey    130      4        30.0         1.0
5 2020-06-14   Union  New Jersey    150      5        20.0         1.0
6 2020-06-12   Bronx    New York    200    100         0.0         0.0
7 2020-06-13   Bronx    New York    210    200        10.0       100.0

相关问题 更多 >