Creating a new column in pandas based on a key-value relationship

Posted 2024-10-06 07:10:11


I have a DataFrame that contains a Series like this:

0    CollgCr
1    Veenker
2    CollgCr
3    Crawfor
4    NoRidge
5    Mitchel
6    Somerst
7     NWAmes
8    OldTown
9    BrkSide

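Based on this Series, I want to create a new column (feature) by grouping the values.

For example, if the value is CollgCr or Veenker, the value in the new column should be "Middle".

I tried the following code: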

df_full['NeighborGrp'] = "Upper"
df_full['NeighborGrp'].loc[df_full["Neighborhood"] == "CollgCr"] = "Middle"
df_full['NeighborGrp'].loc[df_full["Neighborhood"] == ["Mitchel", "OldTown", "BrkSide", "Sawyer", "NAmes", "IDOTRR",
                                                          "MeadowV", "Edwards", "NPkVill", "BrDale", "SWISU", "Blueste"]] = "Lower"

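The first and second lines run fine, but the third line raises the error "ValueError: Arrays were different lengths".

Is there any special syntax in pandas that lets me create a new column based on a multi-value condition like this?

Thanks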


3 answers

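If you have a table that expresses the relationship between City and Type, merge would be a more straightforward approach (there is no need to hard-code every City in the script):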

In [52]: # In reality you probably should prepare the table elsewhere and read it in as a pandas dataframe
df_types = pd.DataFrame({'CollgCr': 'Middle',
                         'Veenker': 'Middle',
                         "Mitchel": 'Lower',
                         "OldTown": 'Lower',
                         "BrkSide": 'Lower',
                         "Sawyer": 'Lower',
                         "NAmes": 'Lower',
                         "IDOTRR": 'Lower',
                         "MeadowV": 'Lower',
                         "Edwards": 'Lower',
                         "NPkVill": 'Lower',
                         "BrDale": 'Lower',
                         "SWISU": 'Lower',
                         "Blueste": 'Lower'}, index=['Type']).T

df = pd.DataFrame({'city': ['CollgCr', 'Veenker', 'CollgCr', 'Crawfor',
                            'NoRidge', 'Mitchel', 'Somerst', 'NWAmes',
                            'OldTown', 'BrkSide']})

df.merge(df_types, left_on='city', right_index=True, how='left').fillna('Upper')

Out[52]:
      city    Type
0  CollgCr  Middle
1  Veenker  Middle
2  CollgCr  Middle
3  Crawfor   Upper
4  NoRidge   Upper
5  Mitchel   Lower
6  Somerst   Upper
7   NWAmes   Upper
8  OldTown   Lower
9  BrkSide   Lower
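
As the comment in the code above suggests, the lookup table would normally be prepared elsewhere and read in. A minimal sketch of that idea follows, assuming a hypothetical two-column CSV (city, Type) that is inlined here as a string so the example runs on its own. The shortened table, the column names, and the use of validate='m:1' are illustrative choices, not part of the original answer.

```python
import io

import pandas as pd

# Hypothetical lookup table; in practice this might be a CSV file on disk.
csv_text = """city,Type
CollgCr,Middle
Veenker,Middle
Mitchel,Lower
OldTown,Lower
BrkSide,Lower
"""
df_types = pd.read_csv(io.StringIO(csv_text))

df = pd.DataFrame({'city': ['CollgCr', 'Crawfor', 'Mitchel']})

# validate='m:1' makes merge raise an error if a city is listed
# more than once in the lookup table.
out = df.merge(df_types, on='city', how='left', validate='m:1')

# Cities missing from the lookup table get NaN; fill only the Type column.
out['Type'] = out['Type'].fillna('Upper')
print(out)
```

Filling only the Type column, rather than calling fillna on the whole frame, avoids overwriting unrelated missing values in other columns.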

Use map with a dictionary, and fillna for the values that have no match:

d =  {'CollgCr': 'Middle',
      'Veenker': 'Middle',
      "Mitchel": 'Lower',
      "OldTown": 'Lower',
      "BrkSide": 'Lower',
       "Sawyer": 'Lower',
        "NAmes": 'Lower',
       "IDOTRR": 'Lower',
      "MeadowV": 'Lower',
      "Edwards": 'Lower',
      "NPkVill": 'Lower',
       "BrDale": 'Lower',
        "SWISU": 'Lower',
      "Blueste": 'Lower'}

Or build the dictionary dynamically:

Mi = ['CollgCr', 'Veenker']
Lo = ["Mitchel", "OldTown", "BrkSide", "Sawyer", "NAmes", "IDOTRR",
      "MeadowV", "Edwards", "NPkVill", "BrDale", "SWISU", "Blueste"]

d = {**dict.fromkeys(Lo, 'Lower'), **dict.fromkeys(Mi, 'Middle')}

df_full['new'] = df_full['city'].map(d).fillna('Upper')
print (df_full)
        city      new
0     CollgCr  Middle
1     Veenker  Middle
2     CollgCr  Middle
3     Crawfor   Upper
4     NoRidge   Upper
5     Mitchel   Lower
6     Somerst   Upper
7      NWAmes   Upper
8     OldTown   Lower
9     BrkSide   Lower
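
To make the role of fillna clearer, here is a small sketch of what map returns before the fill step. The Series s and the shortened Lo list are assumptions made for illustration; the dictionary is built the same way as above.

```python
import pandas as pd

Mi = ['CollgCr', 'Veenker']
Lo = ['Mitchel', 'OldTown', 'BrkSide']  # shortened list, for illustration only

# Same construction as above: every key in each list gets one label.
d = {**dict.fromkeys(Lo, 'Lower'), **dict.fromkeys(Mi, 'Middle')}

s = pd.Series(['CollgCr', 'Crawfor', 'Mitchel'], name='city')

mapped = s.map(d)                 # values not found in d become NaN
filled = mapped.fillna('Upper')   # NaN is then replaced by the default label

print(mapped.isna().tolist())
print(filled.tolist())
```

Only 'Crawfor' is absent from the dictionary, so it is the only value that ends up as 'Upper'.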

The result depends on the data, but map was the fastest in this test:

In [25]: %timeit (jez(df_full.copy()))
15 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [26]: %timeit (raf(df_full.copy()))
20.3 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [27]: %timeit (ct(df_full.copy()))
26.9 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing code:

import numpy as np
import pandas as pd

df_full = pd.DataFrame({'city': ['CollgCr', 'Veenker', 'CollgCr', 'Crawfor',
                            'NoRidge', 'Mitchel', 'Somerst', 'NWAmes',
                            'OldTown', 'BrkSide']})

#[100000 rows x 1 columns]
df_full = pd.concat([df_full] * 10000, ignore_index=True)


def jez(df_full):

    d =  {'CollgCr': 'Middle',
         'Veenker': 'Middle',
         "Mitchel": 'Lower',
         "OldTown": 'Lower',
         "BrkSide": 'Lower',
         "Sawyer": 'Lower',
         "NAmes": 'Lower',
         "IDOTRR": 'Lower',
         "MeadowV": 'Lower',
         "Edwards": 'Lower',
         "NPkVill": 'Lower',
         "BrDale": 'Lower',
         "SWISU": 'Lower',
         "Blueste": 'Lower'}

    df_full['new'] = df_full['city'].map(d).fillna('Upper')
    return df_full

def raf(df):

    m = ['CollgCr', 'Veenker']
    l = ["Mitchel", "OldTown", "BrkSide", "Sawyer", "NAmes", 
         "IDOTRR","MeadowV", "Edwards", "NPkVill", "BrDale", "SWISU", "Blueste"]

    df['new_col'] = np.select([df.city.isin(l), df.city.isin(m)],
                              ['lower', 'middle'], default='upper')
    return df

def ct(df):
    df_types = pd.DataFrame({'CollgCr': 'Middle',
                             'Veenker': 'Middle',
                             "Mitchel": 'Lower',
                             "OldTown": 'Lower',
                             "BrkSide": 'Lower',
                             "Sawyer": 'Lower',
                             "NAmes": 'Lower',
                             "IDOTRR": 'Lower',
                             "MeadowV": 'Lower',
                             "Edwards": 'Lower',
                             "NPkVill": 'Lower',
                             "BrDale": 'Lower',
                             "SWISU": 'Lower',
                             "Blueste": 'Lower'}, index=['Type']).T



    return df.merge(df_types, left_on='city', right_index=True, how='left').fillna('Upper')

print (jez(df_full.copy()))
print (raf(df_full.copy()))
print (ct(df_full.copy()))

Use np.select with isin:

m = ['CollgCr', 'Veenker']
l = ["Mitchel", "OldTown", "BrkSide", "Sawyer", "NAmes", "IDOTRR","MeadowV", "Edwards", "NPkVill", "BrDale", "SWISU", "Blueste"]

df['new_col'] = np.select([df.city.isin(l), df.city.isin(m)], ['lower', 'middle'], default='upper')


      city new_col
0  CollgCr  middle
1  Veenker  middle
2  CollgCr  middle
3  Crawfor   upper
4  NoRidge   upper
5  Mitchel   lower
6  Somerst   upper
7   NWAmes   upper
8  OldTown   lower
9  BrkSide   lower
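
The error in the question arises because == between a Series and a list is an element-by-element comparison, which requires both sides to have the same length; isin tests membership instead. A minimal sketch of the asker's original approach rewritten with isin is shown below. The small sample frame and the list names middle and lower are assumptions made for illustration. It also uses a single .loc[mask, column] call rather than df['col'].loc[mask], since the chained form may not reliably write back to the original frame.

```python
import pandas as pd

# Small sample frame, assumed for illustration.
df_full = pd.DataFrame(
    {'Neighborhood': ['CollgCr', 'Veenker', 'Crawfor', 'Mitchel', 'OldTown']}
)

middle = ['CollgCr', 'Veenker']
lower = ['Mitchel', 'OldTown', 'BrkSide', 'Sawyer', 'NAmes', 'IDOTRR',
         'MeadowV', 'Edwards', 'NPkVill', 'BrDale', 'SWISU', 'Blueste']

# Start with the default label, then overwrite the rows that match each group.
df_full['NeighborGrp'] = 'Upper'
df_full.loc[df_full['Neighborhood'].isin(middle), 'NeighborGrp'] = 'Middle'
df_full.loc[df_full['Neighborhood'].isin(lower), 'NeighborGrp'] = 'Lower'

print(df_full)
```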
