Python-Pandas:groupby 3个带有字符串、数值和NaN值的数据帧,使用一个公共列创建一个新的数据帧

2024-09-20 22:53:46 发布

您现在位置:Python中文网/ 问答频道 /正文

我想分组(合并?)使用公共列local_authority_name的3个数据集,并考虑所有其他列和值(NaN、string、percentage、integer等):

数据集的某些行的示例:

1df_total_per_dif-->

             local_authority_name dif_per_2013-2016
0            Barking and Dagenham             61.0%
1                          Barnet             -8.4%
2                        Barnsley            -42.8%
3    Bath and North East Somerset            -45.1%
4                          Bexley              6.7%
5                      Birmingham            -14.0%

2election2016-->

        local_authority_name pre2016 post2016
0                       Adur     CON      CON
1               Amber Valley     CON      CON
2                     Barnet     NOC      NOC
3      Basingstoke and Deane     CON      CON
4                  Brentwood     CON      CON

3 IMD2015-->

                              IMD_decile
local_authority_name                    

Barking and Dagenham            2.600000
Barnet                          5.914692
Barnsley                        3.931973
Barrow-in-Furness               4.040816
Basildon                        5.209091
Basingstoke and Deane           7.678899
Bassetlaw                       5.042857
Bath and North East Somerset    7.400000

我想得到如下结果:

   local_authority_name    dif_per_2013-2016    pre2016   post2016    IMD_decile
(...)
5    Barking and Dagenham             61.0%     NaN       NaN         2.600000
6                  Barnet             -8.4%     NOC       NOC         5.914692
7                Barnsley            -42.8%     NaN       NaN         NaN
8       Barrow-in-Furness               NaN     NaN       NaN         4.040816
9    Basingstoke and Deane              NaN     CON       CON         7.678899
(...)

我试过:

df_list = [df_total_per_dif, election2016, IMD2015]
df_concat = pd.concat(df_list)
df_final = df_concat.groupby('local_authority_name')['dif_per_2013-2016',' pre2016', 'post2016', 'IMD_decile'].apply(lambda x: ','.join(x)).reset_index()

但我不知道如何更改函数lambda以获得所需的内容。你知道吗


Tags: andnamegtdflocalnanconper
1条回答
网友
1楼 · 发布于 2024-09-20 22:53:46

local_authority_name设置为每个数据帧的索引,然后调用pd.concat

df_list = [df_total_per_dif, election2016, IMD2015]
df_list = [df.set_index('local_authority_name') for df in df_list]
df_final = pd.concat(df_list, axis=1)

例如

import pandas as pd
pd.options.display.width = 10000
df_total_per_dif = pd.DataFrame( {'dif_per_2013-2016': ['61.0%', '-8.4%', '-42.8%', '-45.1%', '6.7%', '-14.0%'], 'local_authority_name': ['Barking and Dagenham', 'Barnet', 'Barnsley', 'Bath and North East Somerset', 'Bexley', 'Birmingham']})

election2016 = pd.DataFrame({'local_authority_name': ['Adur', 'Amber Valley', 'Barnet', 'Basingstoke and Deane', 'Brentwood'], 'post2016': ['CON', 'CON', 'NOC', 'CON', 'CON'], 'pre2016': ['CON', 'CON', 'NOC', 'CON', 'CON']})

IMD2015 = pd.DataFrame({'IMD_decile': [2.6, 5.914692, 3.931973, 4.040816, 5.209091, 7.678899, 5.042857, 7.4], 'local_authority_name': ['Barking and Dagenham', 'Barnet', 'Barnsley', 'Barrow-in-Furness', 'Basildon', 'Basingstoke and Deane', 'Bassetlaw', 'Bath and North East Somerset']})


df_list = [df_total_per_dif, election2016, IMD2015]
df_list = [df.set_index('local_authority_name') for df in df_list]
df_final = pd.concat(df_list, axis=1)

print(df_final)

收益率

                             dif_per_2013-2016 post2016 pre2016  IMD_decile
Adur                                       NaN      CON     CON         NaN
Amber Valley                               NaN      CON     CON         NaN
Barking and Dagenham                     61.0%      NaN     NaN    2.600000
Barnet                                   -8.4%      NOC     NOC    5.914692
Barnsley                                -42.8%      NaN     NaN    3.931973
Barrow-in-Furness                          NaN      NaN     NaN    4.040816
Basildon                                   NaN      NaN     NaN    5.209091
Basingstoke and Deane                      NaN      CON     CON    7.678899
Bassetlaw                                  NaN      NaN     NaN    5.042857
Bath and North East Somerset            -45.1%      NaN     NaN    7.400000
Bexley                                    6.7%      NaN     NaN         NaN
Birmingham                              -14.0%      NaN     NaN         NaN
Brentwood                                  NaN      CON     CON         NaN

或者

df_list = [df_total_per_dif, election2016, IMD2015]
df_final = df_list[0]
for df in df_list[1:]:
    df_final = pd.merge(df_final, df, on='local_authority_name', how='outer')

同样有效。注意,但是结果有一点不同,第一个方法返回索引中带有local_authority_name的数据帧,而第二个方法返回带有local_authority_name列的数据帧。只要在第一个结果上调用reset_index,就可以将local_authority_name移动到一列中。你知道吗


对于小数据帧,例如您的示例,第一个方法(using_concat)更快。 对于较大的数据帧,迭代合并可能更快。例如,对于以下100000行数据帧:

import numpy as np
import pandas as pd
pd.options.display.width = 10000
N = 100000
def make_names(N):
    names = np.arange(2*N)
    np.random.shuffle(names)
    names = list(map(str, names[:N]))
    return names

df_total_per_dif = pd.DataFrame(
    {'dif_per_2013-2016': np.random.randint(10, size=N), 
     'local_authority_name': make_names(N)})

election2016 = pd.DataFrame({
    'local_authority_name': make_names(N), 
    'post2016': ['CON']*N, 'pre2016': ['CON']*N})

IMD2015 = pd.DataFrame({
    'IMD_decile': np.random.randint(10, size=N), 
    'local_authority_name': make_names(N)})

def using_concat(df_list):
    df_list = [df.set_index('local_authority_name') for df in df_list]
    df_final = pd.concat(df_list, axis=1)
    return df_final

def using_merge(df_list):
    df_final = df_list[0]
    for df in df_list[1:]:
        df_final = pd.merge(df_final, df, on='local_authority_name', how='outer')
    return df_final

df_list = [df_total_per_dif, election2016, IMD2015]

using_mergeusing_concat快约70%:

In [88]: %timeit using_merge(df_list)
1 loop, best of 3: 264 ms per loop

In [89]: %timeit using_concat(df_list)
1 loop, best of 3: 455 ms per loop

In [90]: 455/264
Out[90]: 1.7234848484848484

对您来说,更快的方法可能取决于许多其他因素。如果 性能很重要,一定要在自己的机器上用 你自己的数据。你知道吗

相关问题 更多 >

    热门问题