Merging two dataframes of different sizes after a groupby

Posted 2024-09-30 10:31:28


I'm trying to relate the dataframe produced by a groupby back to the original dataframe. After the groupby I lose some columns that were in the original dataframe, and I want to bring back the respective STATE and CITY values. But when I try to join the two, the result grows back to the original size of 18 rows. I just want the intersection of the original dataframe with the final dataframe, which contains 9 rows.

Original dataframe:

    |  COD      |STATE| CITY  |   AZIM | SET|TEC|
0   |ALAAD_0001 |AL   |MAC    |0       |1   |4  |
1   |ALAAD_0001 |AL   |MAC    |110     |2   |4  |
2   |ALAAD_0001 |AL   |ARA    |120     |2   |4  |
3   |ALAAD_0001 |AL   |MAC    |220     |3   |4  |
4   |ALAAD_0001 |AL   |MAC    |240     |3   |4  |
5   |BAPID_0001 |BA   |SAL    |20      |1   |2  |
6   |BAPID_0001 |BA   |SAL    |20      |1   |2  |
7   |BAPID_0001 |BA   |VIT    |100     |2   |2  |
8   |BAPID_0001 |BA   |SAL    |100     |2   |2  |
9   |BAPID_0001 |BA   |SAL    |210     |3   |2  |
10  |BAPID_0001 |BA   |SAL    |250     |3   |2  |
11  |BAPID_0001 |BA   |SAL    |250     |3   |2  |
12  |CEMBC_0003 |CE   |FOR    |90      |1   |4  |
13  |CEMBC_0003 |CE   |FOR    |80      |1   |4  |
14  |CEMBC_0003 |CE   |CAU    |160     |2   |4  |
15  |CEMBC_0003 |CE   |FOR    |160     |2   |4  |
16  |CEMBC_0003 |CE   |FOR    |170     |2   |4  |
17  |CEMBC_0003 |CE   |FOR    |280     |3   |4  |

After groupby:

df_cut = (
    df.groupby(["COD", "TEC", "SET"])
        .AZIM
        .agg(lambda x: pd.Series.mode(x).max())
        .reset_index()
)
    | COD       |TEC     |SET |AZIM|
0   |ALAAD_0001 |4       |1   |0   |
1   |ALAAD_0001 |4       |2   |120 | 
2   |ALAAD_0001 |4       |3   |240 | 
3   |BAPID_0001 |2       |1   |20  | 
4   |BAPID_0001 |2       |2   |100 | 
5   |BAPID_0001 |2       |3   |250 |
6   |CEMBC_0003 |4       |1   |90  | 
7   |CEMBC_0003 |4       |2   |160 | 
8   |CEMBC_0003 |4       |3   |280 | 

Expected output:

    COD        TEC  SET AZIM    STATE   CITY
0   ALAAD_0001  4   1   0       AL      MAC
1   ALAAD_0001  4   2   120     AL      ARA
2   ALAAD_0001  4   3   240     AL      MAC
3   BAPID_0001  2   1   20      BA      SAL
4   BAPID_0001  2   2   100     BA      VIT
5   BAPID_0001  2   3   250     BA      SAL
6   CEMBC_0003  4   1   90      CE      FOR
7   CEMBC_0003  4   2   160     CE      CAU
8   CEMBC_0003  4   3   280     CE      FOR
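
The growth back to 18 rows can be reproduced with a minimal sketch. It assumes the join was done on the three grouping columns only; the question does not show the merge call, so that part is a guess. Each row of df_cut then matches every original row in its group:

```python
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    "COD": ["ALAAD_0001"] * 5 + ["BAPID_0001"] * 7 + ["CEMBC_0003"] * 6,
    "STATE": ["AL"] * 5 + ["BA"] * 7 + ["CE"] * 6,
    "CITY": ["MAC", "MAC", "ARA", "MAC", "MAC",
             "SAL", "SAL", "VIT", "SAL", "SAL", "SAL", "SAL",
             "FOR", "FOR", "CAU", "FOR", "FOR", "FOR"],
    "AZIM": [0, 110, 120, 220, 240,
             20, 20, 100, 100, 210, 250, 250,
             90, 80, 160, 160, 170, 280],
    "SET": [1, 2, 2, 3, 3,
            1, 1, 2, 2, 3, 3, 3,
            1, 1, 2, 2, 2, 3],
    "TEC": [4] * 5 + [2] * 7 + [4] * 6,
})

keys = ["COD", "TEC", "SET"]
df_cut = df.groupby(keys)["AZIM"].agg(lambda x: x.mode().max()).reset_index()

# Assumed merge: joining on the group keys alone matches each df_cut row
# with every original row of its group, so all original rows come back.
too_wide = df_cut.merge(df[keys + ["STATE", "CITY"]], on=keys)

# Adding AZIM to the join key restricts matches to rows holding the chosen value.
narrower = df_cut.merge(df, on=keys + ["AZIM"])

print(len(df_cut), len(too_wide), len(narrower))
```

On the sample data this prints 9, 18 and 13: including AZIM in the key helps, but rows that share the same key (for example the two BAPID_0001 rows with AZIM 100) still need de-duplicating to reach 9.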

3 answers

Use groupby transform + drop_duplicates + reset_index:

cols = ["COD", "TEC", "SET"]
df_cut = (
    df[df['AZIM'].eq(
        df.groupby(cols)['AZIM'].transform(lambda x: x.mode().max())
    )].drop_duplicates(cols).reset_index(drop=True)
)

df_cut

          COD STATE CITY  AZIM  SET  TEC
0  ALAAD_0001    AL  MAC     0    1    4
1  ALAAD_0001    AL  ARA   120    2    4
2  ALAAD_0001    AL  MAC   240    3    4
3  BAPID_0001    BA  SAL    20    1    2
4  BAPID_0001    BA  VIT   100    2    2
5  BAPID_0001    BA  SAL   250    3    2
6  CEMBC_0003    CE  FOR    90    1    4
7  CEMBC_0003    CE  CAU   160    2    4
8  CEMBC_0003    CE  FOR   280    3    4

Explanation:

groupby transform places the max mode on every row of each group:

df.groupby(["COD", "TEC", "SET"])['AZIM'].transform(lambda x: x.mode().max())
0       0
1     120
2     120
3     240
4     240
5      20
6      20
7     100
8     100
9     250
10    250
11    250
12     90
13     90
14    160
15    160
16    160
17    280
Name: AZIM, dtype: int64

Comparing it with the 'AZIM' column creates a boolean index that marks the rows where the max mode occurs:

df['AZIM'].eq(
    df.groupby(["COD", "TEC", "SET"])['AZIM']
        .transform(lambda x: x.mode().max())
)
0      True
1     False
2      True
3     False
4      True
5      True
6      True
7      True
8      True
9     False
10     True
11     True
12     True
13    False
14     True
15     True
16    False
17     True
Name: AZIM, dtype: bool

This is used to filter df:

df[df['AZIM'].eq(
    df.groupby(["COD", "TEC", "SET"])['AZIM']
        .transform(lambda x: x.mode().max())
)]
           COD STATE CITY  AZIM  SET  TEC
0   ALAAD_0001    AL  MAC     0    1    4
2   ALAAD_0001    AL  ARA   120    2    4
4   ALAAD_0001    AL  MAC   240    3    4
5   BAPID_0001    BA  SAL    20    1    2
6   BAPID_0001    BA  SAL    20    1    2
7   BAPID_0001    BA  VIT   100    2    2
8   BAPID_0001    BA  SAL   100    2    2
10  BAPID_0001    BA  SAL   250    3    2
11  BAPID_0001    BA  SAL   250    3    2
12  CEMBC_0003    CE  FOR    90    1    4
14  CEMBC_0003    CE  CAU   160    2    4
15  CEMBC_0003    CE  FOR   160    2    4
17  CEMBC_0003    CE  FOR   280    3    4

Finally, drop_duplicates + reset_index remove the duplicates and clean up the index:

df[df['AZIM'].eq(
    df.groupby(["COD", "TEC", "SET"])['AZIM']
        .transform(lambda x: x.mode().max())
)].drop_duplicates(["COD", "TEC", "SET"]).reset_index(drop=True)
          COD STATE CITY  AZIM  SET  TEC
0  ALAAD_0001    AL  MAC     0    1    4
1  ALAAD_0001    AL  ARA   120    2    4
2  ALAAD_0001    AL  MAC   240    3    4
3  BAPID_0001    BA  SAL    20    1    2
4  BAPID_0001    BA  VIT   100    2    2
5  BAPID_0001    BA  SAL   250    3    2
6  CEMBC_0003    CE  FOR    90    1    4
7  CEMBC_0003    CE  CAU   160    2    4
8  CEMBC_0003    CE  FOR   280    3    4
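
The steps above can be folded into one helper. This is only a sketch: the name `rows_at_max_mode` and its parameters are made up for illustration, and the small frame is a subset of the question's data (the BAPID_0001 rows).

```python
import pandas as pd

def rows_at_max_mode(df, keys, value):
    """Keep, per group, the first row whose `value` equals the group's largest mode."""
    # Broadcast each group's largest mode back onto every row of the group
    target = df.groupby(keys)[value].transform(lambda s: s.mode().max())
    # Keep the matching rows, then one row per group, in original order
    return df[df[value].eq(target)].drop_duplicates(keys).reset_index(drop=True)

# Subset of the question's data: the BAPID_0001 rows only
sample = pd.DataFrame({
    "COD": ["BAPID_0001"] * 7,
    "STATE": ["BA"] * 7,
    "CITY": ["SAL", "SAL", "VIT", "SAL", "SAL", "SAL", "SAL"],
    "AZIM": [20, 20, 100, 100, 210, 250, 250],
    "SET": [1, 1, 2, 2, 3, 3, 3],
    "TEC": [2] * 7,
})

result = rows_at_max_mode(sample, ["COD", "TEC", "SET"], "AZIM")
print(result)
```

Because drop_duplicates keeps the first matching row, a tie such as VIT and SAL at AZIM 100 resolves to whichever row appears first in df.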

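gp: your groupby dataframe

orig: your original dataframe

```python
# `or` is a reserved word in Python, so the original dataframe is called `orig` here.
# Keep one row per (COD, TEC, SET, AZIM) so each gp row matches at most one CITY.
or_drop = orig.drop_duplicates(subset=['COD', 'TEC', 'SET', 'AZIM'], keep='first')
# One STATE per COD; without drop_duplicates() this merge would repeat rows.
expected = gp.merge(orig[['COD', 'STATE']].drop_duplicates(), how='inner', on=['COD'])
expected = expected.merge(or_drop[['COD', 'TEC', 'SET', 'AZIM', 'CITY']], how='inner', on=['COD', 'TEC', 'SET', 'AZIM'])
```

Sorry, I haven't tested this.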

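You can do an inner join between your df_cut and the original df to bring in the city and state. Try:

Update:

The rows for the join key below are not unique in df, so drop_duplicates() on the key is needed.

key = ['COD', 'TEC', 'SET', 'AZIM']
result = pd.merge(df, df_cut, on=key, how='inner').drop_duplicates(subset=key)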
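
A variation on the join above, as a sketch rather than a tested recipe: de-duplicate df on the join key before merging, and pass validate="one_to_one" so that pandas raises a MergeError if the key is still not unique on either side. The sample frame is a subset of the question's data (the BAPID_0001 rows).

```python
import pandas as pd

# Subset of the question's data: the BAPID_0001 rows only
sample = pd.DataFrame({
    "COD": ["BAPID_0001"] * 7,
    "STATE": ["BA"] * 7,
    "CITY": ["SAL", "SAL", "VIT", "SAL", "SAL", "SAL", "SAL"],
    "AZIM": [20, 20, 100, 100, 210, 250, 250],
    "SET": [1, 1, 2, 2, 3, 3, 3],
    "TEC": [2] * 7,
})

key = ["COD", "TEC", "SET", "AZIM"]
df_cut = (
    sample.groupby(["COD", "TEC", "SET"])["AZIM"]
    .agg(lambda x: x.mode().max())
    .reset_index()
)

# One row per join key on the left, so each df_cut row can match at most once
left = sample.drop_duplicates(subset=key)

# validate="one_to_one" asks pandas to check the key is unique in both frames
result = pd.merge(left, df_cut, on=key, how="inner", validate="one_to_one")
print(result)
```

De-duplicating before the merge keeps the intermediate result small, and the validate check turns a silent row blow-up into an explicit error.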
