Follow-up: creating new columns based on the value of another column

Posted 2024-05-04 21:28:48


This is a follow-up to my previous question: Creating new columns based on value from another column in pandas

What I want to end up with now:

   Code  Name           Level1  Level1Name  Level2  Level2Name     Level3  Level3Name
0  A     USA            A       USA
1  AM    Massachusetts  A       USA         AM      Massachusetts
2  AMB   Boston         A       USA         AM      Massachusetts  AMB     Boston
3  AMS   Springfield    A       USA         AM      Massachusetts  AMS     Springfield
4  D     Germany        D       Germany
5  DB    Brandenburg    D       Germany     DB      Brandenburg
6  DBB   Berlin         D       Germany     DB      Brandenburg    DBB     Berlin
7  DBD   Dresden        D       Germany     DB      Brandenburg    DBD     Dresden

Following Scott Boston's guidance, this is what I have so far:

match   0   1   2
0       A   A   A
1       A   AM  AM
2       A   AM  AMB
3       A   AM  AMS
4       D   D   D
5       D   DB  DB
6       D   DB  DBB
7       D   DB  DBD

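My idea is to loop over each column and drop the rows whose value has a different length from the rest of the values in that column, but I can't work out the logic.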

Sample code:

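import pandas as pd

df = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx')

# Split each code into a list of its characters, e.g. 'AMB' -> ['A', 'M', 'B']
df['Codes'] = [[*i] for i in df['Code']]
# One column per character position, then concatenate cumulatively to build the prefixes
df_level = df['Code'].str.extractall('(.)')[0].unstack('match').fillna('').cumsum(axis=1)
df_level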

Thanks for your help!


3 Answers

Let's try:

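# Split each code into its characters (kept from the question; dropped again below)
df['Codes'] = [[*i] for i in df['Code']]
# One column per character position; missing positions become ''
df_level = df['Code'].str.extractall('(.)')[0].unstack('match', fill_value='')
# Cumulative concatenation builds the prefixes; mask positions beyond the code's length
df_level = df_level.cumsum(axis=1).mask(df_level == '')
# Map each code to its name
s_map = df.explode('Codes').drop_duplicates('Code', keep='last').set_index('Code')['Name']
df_level.columns = [f'Level{i+1}' for i in df_level.columns]
# Look up the name for every level column
df_level_names = pd.concat([df_level[i].map(s_map) for i in df_level.columns],
                           axis=1,
                           keys=df_level.columns+'Name')
df_out = df.join([df_level, df_level_names]).drop('Codes', axis=1)
df_out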

Output:

  Code           Name Level1 Level2 Level3 Level1Name     Level2Name   Level3Name
0    A            USA      A    NaN    NaN        USA            NaN          NaN
1   AM  Massachusetts      A     AM    NaN        USA  Massachusetts          NaN
2  AMB         Boston      A     AM    AMB        USA  Massachusetts       Boston
3  AMS    Springfield      A     AM    AMS        USA  Massachusetts  Springfield
4    D        Germany      D    NaN    NaN    Germany            NaN          NaN
5   DB    Brandenburg      D     DB    NaN    Germany    Brandenburg          NaN
6  DBB         Berlin      D     DB    DBB    Germany    Brandenburg       Berlin
7  DBD        Dresden      D     DB    DBD    Germany    Brandenburg      Dresden
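
For comparison, the same prefix logic can be sketched with plain string slicing instead of extractall and cumsum. This is an alternative sketch, not the code from the answer above. It assumes every code is a string, each level adds exactly one character, and every prefix also appears as its own Code row. The sample frame is rebuilt inline here instead of being read from the Excel file, and the names name_by_code, code_len and out are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt inline (assumption: same shape as the Excel sheet)
df = pd.DataFrame({
    'Code': ['A', 'AM', 'AMB', 'AMS', 'D', 'DB', 'DBB', 'DBD'],
    'Name': ['USA', 'Massachusetts', 'Boston', 'Springfield',
             'Germany', 'Brandenburg', 'Berlin', 'Dresden'],
})

# Lookup from code to name, e.g. 'AM' -> 'Massachusetts'
name_by_code = df.set_index('Code')['Name']
code_len = df['Code'].str.len()

out = df.copy()
for level in range(1, int(code_len.max()) + 1):
    # Prefix of length `level`, kept only where the code is at least that long
    prefix = df['Code'].str[:level].where(code_len >= level)
    out[f'Level{level}'] = prefix
    out[f'Level{level}Name'] = prefix.map(name_by_code)

print(out)
```

Unlike the answer's output, this version interleaves each LevelN column with its LevelNName column, which matches the layout asked for in the question.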

This approach uses apply with helper functions:

import pandas as pd
l = ['A', 'AM', 'AMB', 'AMS', 'D', 'DB', 'DBB', 'DBD']
df = pd.DataFrame(l).rename(columns={0:'code'})

def level2(col):
  if len(col) == 1:
    return ''
  elif len(col) >= 2:
    return col[:2]

def level3(col):
  if len(col) <= 2:
    return ''
  elif len(col) > 2:
    return col[:3]

df['Level1'] = df['code'].apply(lambda col: col[0])
df['Level2'] = df['code'].apply(level2)
df['Level3'] = df['code'].apply(level3)

print(df)

Output:

  code Level1 Level2 Level3
0    A      A              
1   AM      A     AM       
2  AMB      A     AM    AMB
3  AMS      A     AM    AMS
4    D      D              
5   DB      D     DB       
6  DBB      D     DB    DBB
7  DBD      D     DB    DBD

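These functions could be refactored into a single function, but you get the idea. I'd suggest apply over other pandas methods here because it is easier to remember and customize. Hope this helps.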
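
As a rough sketch of that refactoring, the per-level helpers can be collapsed into one function that takes the level as a parameter. The name level_prefix and the max_level value are assumptions made for this example, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'code': ['A', 'AM', 'AMB', 'AMS', 'D', 'DB', 'DBB', 'DBD']})

def level_prefix(code, n):
    """Return the first n characters of code, or '' if code is shorter than n."""
    return code[:n] if len(code) >= n else ''

max_level = 3  # assumed depth; adjust to the longest code in your data
for n in range(1, max_level + 1):
    # args passes n through to level_prefix as its second positional argument
    df[f'Level{n}'] = df['code'].apply(level_prefix, args=(n,))

print(df)
```

Adding a fourth level then only means raising max_level, not writing another function.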

I took a different approach: loop over the code lengths, on the assumption that you won't have too many levels.

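import pandas as pd
df=pd.DataFrame({
    'Code':['A','AM','AMB'],
    'Name':['USA','Massachusetts',"Boston"]
})
# prepare: empty lookup table that accumulates the level columns per code
res=pd.DataFrame({
    'Code':[]
})
df['len']=df['Code'].str.len()
cols=[]
for x in range(df['len'].max()):
    # rows whose code has exactly x+1 characters
    dfX=df[df['len']==x+1].copy()
    # parent code: the first x characters
    dfX['prefix']=dfX['Code'].str.slice(stop=x)

    # pull in the parent's level columns from the lookup table
    dfX=dfX.merge(res,how='left',left_on='prefix',right_on='Code')

    dfX[f'Level{x+1}']=dfX['Code_x']
    dfX[f'Level{x+1}Name']=dfX['Name']
    dfX['Code']=dfX['Code_x']
    cols+=[f'Level{x+1}',f'Level{x+1}Name']
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    res=pd.concat([res,dfX[['Code']+cols]],sort=False)

res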

   Code  Level1  Level1Name  Level2  Level2Name     Level3  Level3Name
0  A     A       USA         NaN     NaN            NaN     NaN
0  AM    A       USA         AM      Massachusetts  NaN     NaN
0  AMB   A       USA         AM      Massachusetts  AMB     Boston

The idea is to add level 1 to the lookup table first, then level 2, then level 3, and so on. The code looks ugly, but hopefully it is easy to follow.
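
The same incremental-lookup idea can be sketched with a plain dictionary instead of repeated merges. This is a simplified illustration, not the code above. It assumes each code's parent is the code minus its last character, and the names levels_by_code and rows are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Code': ['A', 'AM', 'AMB'],
    'Name': ['USA', 'Massachusetts', 'Boston'],
})

levels_by_code = {}  # code -> level columns accumulated for that code
rows = []
# Process shorter codes first so a parent is in the lookup before its children
for code, name in sorted(zip(df['Code'], df['Name']), key=lambda pair: len(pair[0])):
    # Start from a copy of the parent's levels (empty for top-level codes)
    levels = dict(levels_by_code.get(code[:-1], {}))
    levels[f'Level{len(code)}'] = code
    levels[f'Level{len(code)}Name'] = name
    levels_by_code[code] = levels
    rows.append({'Code': code, **levels})

# Keys missing from a row become NaN in the resulting frame
res = pd.DataFrame(rows)
print(res)
```

A parent that is missing from the data simply contributes no levels, so its children keep only their own level columns.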
