如果存在重复的列名,如何匹配2个数据帧?

2024-10-16 22:33:33 发布

您现在位置:Python中文网/ 问答频道 /正文

从我前面的问题Is there any method to match tabular list with pivot list format?中,我找到了使用DataFrames.melt方法获得结果的方法。但是,我发现我在excel中的数据有重复的标题名称(颜色),pandas会自动添加.1,我想检查是否有其他方法不需要标题为相同的名称来匹配它

匹配的标准是,如果颜色-粉色、黄色、蓝色、红色与M列表的列匹配,则获取代码号和品牌号

详情如下:

df = pd.DataFrame({'M': ['M1','M2','M3','M4','M5','M6','M7','M8','M9','M10'],
                    'Color' : ['pink','yellow','yellow','yellow','blue','pink','pink','pink','red','blue']})

df1 = pd.DataFrame({'M_list_1': ['S3','M73','M18','M12','M1','M2','M7','S5','S17','S18','S20','M78','Z4','M6','S8','M10','M99','S9'],
                    'M_list_2': ['M3','Z73','Z18','Z12','K1','S2','Z7','M5','M17','M18','M20','M78','M4','Z6','M8','S10','M99','M9'],
                    'pink': [23255,1755,7897,6666,1311,78946,44893,148791,18783,487913,79815,88888,149643,154894,165496,1351979,15648,5222],
                    'yellow': [1000,99999,78945,15543,10000,20000,14613,77778,45139,11776,78,1231,789613,10009,70009,88009,13036,789746],
                    'blue': [2000,8000,9000,6000,104560,204890,146103,70078,40009,11006,78789,13531,71113,10809,70999,889,36,7846],
                    'red': [300000,82200,90,60330,11560,21120,14783,1578,31109,91006,9719,9131,1113,1009,799,871,10116,1146],
                    'pink.1': ["Brand9","Brand9","Brand9","Brand13","Brand77","-","Brand1","Brand1"
                               ,"Brand1","Brand51","Brand71","Brand16","Brand96","Brand15","Brand61","-","Brand13","-"],
                    'yellow.1': ["Brand9","-","-","-","Brand17","Brand61","Brand13","-"
                               ,"Brand8","Brand9","Brand1","-","Brand77","-","Brand1","-","Brand11","Brand1"],
                    'blue.1': ["-","Brand13","Brand9","-","Brand1","Brand1","-","-"
                               ,"Brand1","Brand16","Brand1","-","Brand9","Brand1","Brand1","-","-","-"],
                    'red.1': ["Brand8","Brand91","Brand9","Brand13","Brand59","Brand1","Brand1","Brand1"
                               ,"Brand2","-","Brand1","-","Brand9","Brand77","Brand1","Brand19","-","Brand2"]})

我的期望结果如下:

     M   Color    Code     Brand
0   M1    pink    1311   Brand77
1   M2  yellow   20000   Brand61
2   M3  yellow    1000   Brand9
3   M4  yellow  789613   Brand77
4   M5    blue   70078       -
5   M6    pink  154894   Brand15
6   M7    pink   44893   Brand1
7   M8    pink  165496   Brand61
8   M9     red    1146   Brand2
9  M10    blue     889       -

Tags: 方法bluelistm3m5yellowm4m1
2条回答

从您的DataFrames,我们构建df_full

>>> df_full = pd.concat([pd.merge(df, df1, left_on='M', right_on='M_list_1'),
                         pd.merge(df, df1, left_on='M', right_on='M_list_2')]).reset_index()
>>> df_full
    index   M   Color   M_list_1    M_list_2    pink    yellow  blue    red     pink.1  yellow.1    blue.1  red.1
0   0       M1  pink    M1          K1          1311    10000   104560  11560   Brand77 Brand17 Brand1  Brand59
1   1       M2  yellow  M2          S2          78946   20000   204890  21120   -   Brand61 Brand1  Brand1
2   2       M6  pink    M6          Z6          154894  10009   10809   1009    Brand15 -   Brand1  Brand77
3   3       M7  pink    M7          Z7          44893   14613   146103  14783   Brand1  Brand13 -   Brand1
4   4       M10 blue    M10         S10         1351979 88009   889     871     -   -   -   Brand19
5   0       M3  yellow  S3          M3          23255   1000    2000    300000  Brand9  Brand9  -   Brand8
6   1       M4  yellow  Z4          M4          149643  789613  71113   1113    Brand96 Brand77 Brand9  Brand9
7   2       M5  blue    S5          M5          148791  77778   70078   1578    Brand1  -   -   Brand1
8   3       M8  pink    S8          M8          165496  70009   70999   799     Brand61 Brand1  Brand1  Brand1
9   4       M9  red     S9          M9          5222    789746  7846    1146    -   Brand1  -   Brand2

我们首先从Color列获取pinkyellowbluered列中的值,以使用^{}构建Code列,如下所示:

df_full['Code'] = df_full.lookup(df_full.index, df_full['Color'])

然后我们对一个新列Color1执行相同的操作,它允许我们构建Brand并获得预期的结果:

>>> df_full["Color1"] = df_full["Color"] + ".1"
>>> df_full['Brand'] = df_full.lookup(df_full.index, df_full['Color1'])
>>> df_full[['M', 'Color', 'Code', 'Brand']]
    M   Color   Code    Brand
0   M1  pink    1311    Brand77
1   M2  yellow  20000   Brand61
2   M6  pink    154894  Brand15
3   M7  pink    44893   Brand1
4   M10 blue    889     -
5   M3  yellow  1000    Brand9
6   M4  yellow  789613  Brand77
7   M5  blue    70078   -
8   M8  pink    165496  Brand61
9   M9  red     1146    Brand2

编辑:

正如@bbbbbb所评论的,lookup现在产生一个未来警告,因为它将被贬值。使用pandas>=1.1.0我们必须使用^{}

>>> df_full.melt(id_vars='Color', value_vars=['pink', 'yellow', 'blue', 'red'], ignore_index=False).query('Color == variable').loc[df.index, 'value']
0      1311
1     20000
2    154894
3     44893
4       889
5      1000
6    789613
7     70078
8    165496
9      1146
Name: value, dtype: int64

Brand相同:

>>> df_full["Color1"] = df_full["Color"] + ".1"
>>> df_full['Brand'] = df_full.melt(id_vars='Color1', value_vars=['pink.1', 'yellow.1', 'blue.1', 'red.1'], ignore_index=False).query('Color1 == variable').loc[df.index, 'value']
>>> df_full[['M', 'Color', 'Code', 'Brand']]
    M   Color   Code    Brand
0   M1  pink    1311    Brand77
1   M2  yellow  20000   Brand61
2   M6  pink    154894  Brand15
3   M7  pink    44893   Brand1
4   M10 blue    889     -
5   M3  yellow  1000    Brand9
6   M4  yellow  789613  Brand77
7   M5  blue    70078   -
8   M8  pink    165496  Brand61
9   M9  red     1146    Brand2

此解决方案首先由所有列使用split,而不首先由.使用M,因此需要在MultiIndex中替换NaN,然后由^{}^{}重新设置值,以便在最后一步使用merge

df2 = df1.set_index(['M_list_1','M_list_2'])
f = lambda x: x if pd.notna(x) else 'Code'
df2.columns = df2.columns.str.split('.', expand=True)
df2 = (df2.rename(columns = f)
          .stack(0)
          .reset_index()
          .rename(columns={'level_2':'Color','1':'Brand'})
          .melt(['Color','Code','Brand'], value_name='M')
          .drop('variable', axis=1))
print (df2)
      Color    Code    Brand    M
0      blue    2000        -   S3
1      pink   23255   Brand9   S3
2       red  300000   Brand8   S3
3    yellow    1000   Brand9   S3
4      blue    8000  Brand13  M73
..      ...     ...      ...  ...
139  yellow   13036  Brand11  M99
140    blue    7846        -   M9
141    pink    5222        -   M9
142     red    1146   Brand2   M9
143  yellow  789746   Brand1   M9

[144 rows x 4 columns]

df = df.merge(df2, how='left', on=['M','Color'])
print (df)
     M   Color    Code    Brand
0   M1    pink    1311  Brand77
1   M2  yellow   20000  Brand61
2   M3  yellow    1000   Brand9
3   M4  yellow  789613  Brand77
4   M5    blue   70078        -
5   M6    pink  154894  Brand15
6   M7    pink   44893   Brand1
7   M8    pink  165496  Brand61
8   M9     red    1146   Brand2
9  M10    blue     889        -

相关问题 更多 >