Generating unique keys from multiple DataFrames based on name

Published 2024-09-25 02:27:53


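I have two DataFrames. As you can see, the function concatenates them correctly, but the result is wrong, because Carid must be unique and must not be assigned twice. How can I solve this? A Carid may appear several times within one DataFrame, but it must stay unique across both datasets. So Carid = 1 should mean Mercedes-Benz in every record, not Carid = 1 = Mercedes-Benz & Citroen.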

import pandas as pd

d = {'Carid ': [1, 2, 3, 1], 'Carname': ['Mercedes-Benz', 'Audi', 'BMW', 'Mercedes-Benz'], 'model': ['S-Klasse AMG 63s', 'S6', 'X6 M-Power', 'Maybach']}
df = pd.DataFrame(data=d)
display(df.head())


d2 = {'Carid ': [4, 1, 5], 'Carname': ['VW', 'Citroen', 'Opel'], 'model': ['GTI', 'S', 'Corsa']}
df2 = pd.DataFrame(data=d2)
display(df2.head())


dfs = []
dfs.append(df)
dfs.append(df2)
pd.concat(dfs)
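
As a minimal sketch of how to see the problem in the concatenated data, the snippet below lists every Carid that is claimed by more than one Carname. It assumes the column is named 'Carid' without the trailing space used above, and the variable names `combined`, `names_per_id` and `conflicting_ids` are only illustrative.

```python
import pandas as pd

# Same data as in the question, but with the column named 'Carid' (no trailing space)
df = pd.DataFrame({'Carid': [1, 2, 3, 1],
                   'Carname': ['Mercedes-Benz', 'Audi', 'BMW', 'Mercedes-Benz'],
                   'model': ['S-Klasse AMG 63s', 'S6', 'X6 M-Power', 'Maybach']})
df2 = pd.DataFrame({'Carid': [4, 1, 5],
                    'Carname': ['VW', 'Citroen', 'Opel'],
                    'model': ['GTI', 'S', 'Corsa']})

combined = pd.concat([df, df2], ignore_index=True)

# Count how many distinct car names share each Carid
names_per_id = combined.groupby('Carid')['Carname'].nunique()

# Keep only the ids that are claimed by more than one name
conflicting_ids = names_per_id[names_per_id > 1].index.tolist()
print(conflicting_ids)
```

Here only Carid 1 is reported, because it is used by both Mercedes-Benz and Citroen.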


What I want

[Image: desired result]


3 Answers

Method 1

If you don't mind the keys becoming floats, the first approach is to increment them using cumcount

df3 = pd.concat([df,df2])

s = df3.groupby('Carname',sort=False)['Carid'].first().to_frame()
s['Carid'] = s['Carid']  + s.groupby('Carid').cumcount() / 10

new_ids = s.to_dict(orient='dict')['Carid']

df3['Carid'] = df3['Carname'].map(new_ids)

  Carid        Carname             model
0    1.0  Mercedes-Benz  S-Klasse AMG 63s
1    2.0           Audi                S6
2    3.0            BMW        X6 M-Power
3    1.0  Mercedes-Benz           Maybach
0    4.0             VW               GTI
1    1.1        Citroen                 S
2    5.0           Opel             Corsa
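
To make the cumcount step easier to follow, here is a small sketch of what it does in isolation. The frame `s` below is a hand-written stand-in for the first-seen id per car name (an assumption for illustration, not output copied from the answer).

```python
import pandas as pd

# Hypothetical first-seen ids per car name; Citroen collides with Mercedes-Benz
s = pd.DataFrame({'Carid': [1, 2, 3, 4, 1, 5]},
                 index=['Mercedes-Benz', 'Audi', 'BMW', 'VW', 'Citroen', 'Opel'])

# cumcount numbers the rows inside each Carid group: 0 for the first name
# that uses an id, 1 for the second, and so on
offset = s.groupby('Carid').cumcount()
print(offset.tolist())

# Dividing by 10 turns that counter into a decimal suffix on the id
s['Carid'] = s['Carid'] + offset / 10
print(s['Carid'].to_dict())
```

One consequence of dividing by 10: once eleven names share the same id, the suffix reaches 1.0 and the result can run into the next integer id, so this trick is best suited to a small number of collisions.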

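Method 2: a functional approach using dictionaries

Assumptions

The function's logic is predicated on each DataFrame having a single, unique Carid per car name

Your IDs are sequential, so generating new numbers with max() makes the most sense. If you have a list of Carids such as [1, 2, 3, 200], this may produce numbers that are not sequential

In that case Citroen would get a new unique Carid of 201, because the ID 200 already exists and is owned by a car manufacturer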
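
As a tiny illustration of that assumption, the sketch below shows how a max-based scheme behaves when the existing ids have a gap. The names `existing_ids` and `colliding_names` are hypothetical inputs, not part of the function that follows.

```python
existing_ids = [1, 2, 3, 200]      # non-sequential ids, as in the assumption above
colliding_names = ['Citroen']      # names that need a fresh id

# New ids continue after the largest existing id, so gaps are not filled
next_id = max(existing_ids) + 1
new_ids = {name: next_id + i for i, name in enumerate(colliding_names)}
print(new_ids)  # {'Citroen': 201}
```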

Function

import pandas as pd
import numpy as np
from collections import ChainMap


def generate_new_keys(*args,key='Carid',name='Carname'):
    """
    Takes in a number of dataframes and returns a dict that maps every name
    whose id collides with another name to a new unique id.
    groupby columns default to Carid and Carname.
    """
    # collect one {id: name} dictionary per dataframe into a single list.
    dicts_ = [arg.groupby(key)[name].first().to_dict() for arg in args]
    # merge dicts on id; on a collision the earliest dataframe keeps the id.
    merged_dicts = dict(ChainMap(*dicts_))
    # names that lost their id in the merge need a new one
    # (dict.fromkeys drops repeated names while keeping their order).
    kept_names = set(merged_dicts.values())
    delta = list(dict.fromkeys(v for each_dict in dicts_ for v in each_dict.values() if v not in kept_names))
    # get the max sequence key
    start_key =  max(merged_dicts.keys()) + 1
    # create a new sequence, one id per name in delta
    sequence = range(start_key, start_key + len(delta))
    # return a dictionary.
    return {car_name : number for car_name,number in zip(delta,sequence)}
    

In action

new_keys = generate_new_keys(df,df2)

print(new_keys)
{'Citroen': 6}

df3 = pd.concat([df,df2])

df3['Carid'] = np.where(df3['Carname'].isin(new_keys.keys()),
         df3['Carname'].map(new_keys), df3['Carid'])

print(df3)

   Carid        Carname             model
0    1.0  Mercedes-Benz  S-Klasse AMG 63s
1    2.0           Audi                S6
2    3.0            BMW        X6 M-Power
0    4.0             VW               GTI
1    6.0        Citroen                 S
2    5.0           Opel             Corsa

Testing with an additional DataFrame

new_df = pd.DataFrame({'Carid' : [1,2,3],
             'Carname' : ['Mercedes-Benz', 'Toyota','BMW'] })


new_keys = generate_new_keys(df,df2,new_df)

print(new_keys)
{'Citroen': 6, 'Toyota': 7}

df3 = pd.concat([df,df2,new_df])

df3['Carid'] = np.where(df3['Carname'].isin(new_keys.keys()),
         df3['Carname'].map(new_keys), df3['Carid'])

print(df3)

  Carid        Carname             model
0    1.0  Mercedes-Benz  S-Klasse AMG 63s
1    2.0           Audi                S6
2    3.0            BMW        X6 M-Power
0    4.0             VW               GTI
1    6.0        Citroen                 S #< new id
2    5.0           Opel             Corsa
0    1.0  Mercedes-Benz               NaN
1    7.0         Toyota               NaN #< new id
2    3.0            BMW               NaN
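
Whichever method is used, it can help to verify the result afterwards. The following is a minimal sketch of such a check; `ids_are_consistent` is a hypothetical helper name, and the two small frames are abbreviated stand-ins for the data before and after re-keying.

```python
import pandas as pd

def ids_are_consistent(frame, key='Carid', name='Carname'):
    """Return True if every key maps to one name and every name to one key."""
    one_name_per_id = frame.groupby(key)[name].nunique().le(1).all()
    one_id_per_name = frame.groupby(name)[key].nunique().le(1).all()
    return bool(one_name_per_id and one_id_per_name)

# Abbreviated rows: Citroen shares id 1 before the fix and owns id 6 after it
before = pd.DataFrame({'Carid': [1, 4, 1],
                       'Carname': ['Mercedes-Benz', 'VW', 'Citroen']})
after = pd.DataFrame({'Carid': [1, 4, 6],
                      'Carname': ['Mercedes-Benz', 'VW', 'Citroen']})

print(ids_are_consistent(before))  # False
print(ids_are_consistent(after))   # True
```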

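One possible approach is to do some data processing before merging

You could loop through the smaller DataFrame and check which Carid values conflict, then assign new unique values to those

I came up with this, although it could be optimized a lot: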

import pandas as pd

d2 = {'Carid': [1, 2, 3, 1], 'Carname': ['Mercedes-Benz', 'Audi', 'BMW', 'Mercedes-Benz'], 'model': ['S-Klasse AMG 63s', 'S6', 'X6 M-Power', 'Maybach']}
df2 = pd.DataFrame(data=d2)

# 'model' is lower case here so that the columns line up when concatenating
d = {'Carid': [0,1,2],'Carname': ['VW','Citroen','Opel'],'model':['GTI','S','Corsa']}
df = pd.DataFrame(data=d)

#We fetch all the ids that are present in the bigger dataframe once, before
#the loop, so that ids assigned inside the loop are remembered
ids = list(df2['Carid'].values)

#We loop through the smaller dataframe, whose ids may have to change
for indx in df.index:
    #The id the row currently has
    old_id = df.loc[indx, 'Carid']

    #Check if the id is already present
    if old_id in ids:
        #If it is, we take the highest id used in either dataframe and up it by one
        top_id = max(ids + list(df['Carid'].values))
        new_id = top_id+1

        #Add it to the existing ids
        ids.append(new_id)

        #Set the new value (.loc avoids chained assignment)
        df.loc[indx, 'Carid'] = new_id

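Note that I did change the column name from 'Carid ' (with a trailing space) to 'Carid'

Once this is done, all cars should have unique IDs, which means you can merge the two DataFrames with concat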
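
The same idea can also be sketched without an explicit loop. This is only one possible vectorized variant, under the assumption that each car name appears once in the smaller frame; `big`, `small`, `clash` and `start` are illustrative names.

```python
import pandas as pd

big = pd.DataFrame({'Carid': [1, 2, 3, 1],
                    'Carname': ['Mercedes-Benz', 'Audi', 'BMW', 'Mercedes-Benz']})
small = pd.DataFrame({'Carid': [0, 1, 2],
                      'Carname': ['VW', 'Citroen', 'Opel']})

# Rows of the smaller frame whose id is already taken by the bigger frame
clash = small['Carid'].isin(big['Carid'])

# Fresh ids start above the largest id used by either frame
start = int(max(big['Carid'].max(), small['Carid'].max())) + 1
small.loc[clash, 'Carid'] = list(range(start, start + int(clash.sum())))

print(small)
```

With this data Citroen and Opel clash with ids in the bigger frame and receive 4 and 5, while VW keeps 0.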

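Here is my take. Note that it would be much easier if the Carid in df and df2 could simply be replaced with a new unique id. But sticking to the question as asked, here we go

First, we create a mapping cm between Carname and Carid for the first df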

d = {'Carid': [1, 2, 3, 1], 'Carname': ['Mercedes-Benz', 'Audi', 'BMW', 'Mercedes-Benz'], 'model': ['S-Klasse AMG 63s', 'S6', 'X6 M-Power', 'Maybach']}
df = pd.DataFrame(data=d)
display(df.head())
cm = {name : id for name, id in zip(df['Carname'], df['Carid'])}
cm

Then we do the same for the second df

d2 = {'Carid': [4, 1, 5], 'Carname': ['VW', 'Citroen', 'Opel'], 'model': ['GTI', 'S', 'Corsa']}
df2 = pd.DataFrame(data=d2)
display(df2.head())
cm2= {name : id for name, id in zip(df2['Carname'], df2['Carid'])}
cm2

Then comes the main step: combine the two mappings, keeping the original IDs unless there is a collision, in which case we assign a new unique ID

unique_id = max(list(cm.values()) + list(cm2.values()))+1
for new_name in df2['Carname']:
    if new_name in cm:
        # already included
        pass
    elif cm2[new_name] not in cm.values():
        # unique carid
        cm[new_name] = cm2[new_name]
    else:
        # the new_name is not in cm but its id is

        cm[new_name] = unique_id
        unique_id += 1

print(cm)

Now cm holds a unique id for every Carname, keeping the ids that were used originally unless they collided:

{'Mercedes-Benz': 1, 'Audi': 2, 'BMW': 3, 'VW': 4, 'Citroen': 6, 'Opel': 5}

Now remap the IDs

# every Carname is a key of cm, so map() looks each name up directly
df['Carid'] = df['Carname'].map(cm)
df2['Carid'] = df2['Carname'].map(cm)

Finally, concatenate them

dfs = []
dfs.append(df)
dfs.append(df2)
pd.concat(dfs)

The result is

|    |   Carid | Carname       | model            |
|---:|--------:|:--------------|:-----------------|
|  0 |       1 | Mercedes-Benz | S-Klasse AMG 63s |
|  1 |       2 | Audi          | S6               |
|  2 |       3 | BMW           | X6 M-Power       |
|  3 |       1 | Mercedes-Benz | Maybach          |
|  0 |       4 | VW            | GTI              |
|  1 |       6 | Citroen       | S                |
|  2 |       5 | Opel          | Corsa            |
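
This answer mentions that things would be simpler if the existing Carid values could be replaced outright. As a hedged sketch of that alternative (not what the answer above implements), `pd.factorize` can assign a brand-new id per Carname. Note that this discards the original ids rather than preserving them.

```python
import pandas as pd

df = pd.DataFrame({'Carid': [1, 2, 3, 1],
                   'Carname': ['Mercedes-Benz', 'Audi', 'BMW', 'Mercedes-Benz']})
df2 = pd.DataFrame({'Carid': [4, 1, 5],
                    'Carname': ['VW', 'Citroen', 'Opel']})

combined = pd.concat([df, df2], ignore_index=True)

# factorize assigns 0, 1, 2, ... in order of first appearance; +1 starts ids at 1
codes, names = pd.factorize(combined['Carname'])
combined['Carid'] = codes + 1

print(combined)
```

With this data Citroen ends up with id 5 and Opel with id 6, so the approach only fits when nothing else depends on the old ids.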
