Comparing and matching values across two columns and multiple columns

Published 2024-05-05 21:53:13


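I have two DataFrames with data on popular stores and the districts they are located in. Each store is a chain and may have several district IDs (for example, "Store1" has multiple shops in different places).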

The first df has the top 5 most popular stores and their district IDs, separated by semicolons, for example:

store_name | district_id
Store1 |  1;2;3;4;5
Store2 |  1;2
Store3 |  3
Store4 |  4;7;10;15
Store5 |  12;15;

The second df has only two columns and lists every district in the city; each row is a unique district ID and its name.


district_id | district_name
1           |  District1
2           |  District2
3           |  District3
4           |  District4
5           |  District5
6           |  District6
7           |  District7
8           |  District8
9           |  District9
10          | District10
etc.

The goal is to create columns in df1 for each of the top-5 stores and match every district ID with its district name.

First, I split df1 into this form:

store_name district_id 0   1   2   3   4   5 
Store1    |    1     | 2 | 3 | 4 | 5
Store2    |    1     | 2 |   |   |  
Store3    |    3     |   |   |   |
Store4    |    4     | 7 | 10| 15| 
Store5    |    12    | 15|

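But now I am stuck: I do not know how to match each value in df1 against df2 and get the district name for each ID. The empty cells are None, because the number of columns is set by the store with the most districts.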

I would like to end up with a df like this:

store_name district_name district_name2 district_name3 district_name4 district_name5 
Store1     | District1   | District2   | District3   | District4     | District5
Store2     | District1   | District2   |             |               |   
Store3     | District3   |             |             |               |
Store4     | District4   | District7   | District10  | District15    | 
Store5     | District12  | District15  |             |               |

Thanks in advance.


3 answers

There are many ways to do this; this is just one of them. Assuming you have the two DataFrames stored as df1 and df2:

First, normalize the district_id column in df1 so that every entry splits into the same number of fields:

# make all strings the same size when split
def return_full_string(text):
    l = len(text.split(';'))
    for _ in range(5 - l):
        text = f"{text};"
    return text

df1['district_id'] = df1.district_id.apply(return_full_string)

Then split the text column into separate columns and drop the original column:

# split district id's into different columns
district_columns = [f"district_name{n+1}" for n in range(5)]
df1[district_columns] = list(df1.district_id.str.split(';'))
# drop() defaults to rows (axis=0), so name the column explicitly
df1.drop(columns='district_id', inplace=True)

Then build a mapping from the IDs in df2 to their names, and use it to replace the values in the new columns:

id_to_name = {str(ii): nn for ii, nn in zip(df2['district_id'], df2['district_name'])}
for col in district_columns:
    df1[col] = df1[col].apply(id_to_name.get)

As I said, I am sure there are other ways to do this, but this should work.
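As a variation on the steps above, here is a minimal sketch that skips the manual padding by letting `str.split(expand=True)` pad short rows itself. The sample data is made up for illustration and only mirrors the shape described in the question; `parts`, `names` and `result` are names introduced here, not part of the original answer.

```python
import pandas as pd

# Illustrative data in the shape described in the question.
df1 = pd.DataFrame({
    "store_name": ["Store1", "Store2", "Store3"],
    "district_id": ["1;2;3", "1;2", "3;"],
})
df2 = pd.DataFrame({
    "district_id": [1, 2, 3],
    "district_name": ["District1", "District2", "District3"],
})

# Lookup keyed by the ID as a string, because the split pieces are strings.
id_to_name = dict(zip(df2["district_id"].astype(str), df2["district_name"]))

# expand=True returns one column per piece and pads shorter rows.
parts = df1["district_id"].str.split(";", expand=True)

# Map each piece to its name; empty strings and missing pieces become NaN.
names = parts.apply(lambda col: col.str.strip().map(id_to_name))
names = names.dropna(axis=1, how="all")
names.columns = [f"district_name{i + 1}" for i in range(names.shape[1])]

result = pd.concat([df1[["store_name"]], names], axis=1)
print(result)
```

The number of output columns follows the longest row rather than a hard-coded 5, which may or may not be what you want if the final table must always have five name columns.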

You can stack the first DataFrame, cast it to float, map it through the column of the second DataFrame, then unstack and finally add_prefix:

df1.stack().astype(float).map(df2['district_name']).unstack().add_prefix('district_name')

Output:

           district_name0 district_name1  ... district_name3 district_name4
store_name                                ...                              
Store1          District1      District2  ...      District4      District5
Store2          District1      District2  ...            NaN            NaN
Store3          District3            NaN  ...            NaN            NaN
Store4          District4      District7  ...            NaN            NaN
Store5                NaN            NaN  ...            NaN            NaN

DataFrames used for the code above:

>>> df1
             0    1    2    3    4
store_name                        
Store1       1    2    3    4    5
Store2       1    2  NaN  NaN  NaN
Store3       3  NaN  NaN  NaN  NaN
Store4       4    7   10   15  NaN
Store5      12   15  NaN  NaN  NaN

>>> df2
            district_name
district_id              
1               District1
2               District2
3               District3
4               District4
5               District5
6               District6
7               District7
8               District8
9               District9
10             District10
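The one-liner above expects df1 to be a wide frame indexed by store_name and df2 to be indexed by district_id. A minimal sketch of getting there from the semicolon-separated form in the question might look like this; `raw1`, `raw2` and `out` are hypothetical names, and the data is a small illustrative subset.

```python
import pandas as pd

# Illustrative raw inputs in the shape described in the question.
raw1 = pd.DataFrame({
    "store_name": ["Store1", "Store2", "Store5"],
    "district_id": ["1;2;3", "1;2", "12;15;"],
})
raw2 = pd.DataFrame({
    "district_id": [1, 2, 3],
    "district_name": ["District1", "District2", "District3"],
})

# Wide frame indexed by store; '' and missing pieces are coerced to NaN.
df1 = (
    raw1.set_index("store_name")["district_id"]
    .str.split(";", expand=True)
    .apply(pd.to_numeric, errors="coerce")
)

# Index the lookup by district_id so map() can look IDs up by index label.
df2 = raw2.set_index("district_id")

out = (
    df1.stack()                  # long Series, one entry per (store, position)
    .map(df2["district_name"])   # ID -> name; IDs missing from df2 become NaN
    .unstack()                   # back to one row per store
    .add_prefix("district_name")
)
print(out)
```

As in the output shown above, a store whose IDs do not appear in df2 (Store5 here) ends up with NaN in every name column.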
import pandas as pd

df1=pd.DataFrame(data={'store_name':['store1','store2','store3','store4','store5'],
                   'district_id':[[1,2,3,4,5], [1,2], 3, [4,7,10], [8,10]]})
df2=pd.DataFrame(data={'district_id':[1,2,3,4,5,6,7,8,9,10],
                       'district_name':['District1', 'District2', 'District3', 'District4', 'District5', 'District6', 'District7', 'District8', 'District9', 'District10']})

Step 1: use explode() to split the values into rows

df3=df1.explode('district_id').reset_index(drop=True)

Step 2: use merge() with on='district_id'

df4=pd.merge(df3,df2, on='district_id' )

Step 3: use groupby() and agg() to get columns that contain lists

# group by store so each store gets one row of collected IDs and names
df5=df4.groupby('store_name').agg(list).reset_index()
    store_name  district_id                       district_name
0   store1  [1, 2, 3, 4, 5]   [District1,District2,District3,District4,District5]
1   store2  [1, 2]            [District1,District2]
2   store3  [3]               [District3]
3   store4  [4, 7, 10]        [District4,District7,District10]
4   store5  [10, 8]           [District10,District8]

After that, the lists can be split into columns as needed.
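One possible way to do that final split, sketched under the assumption that df5 has one row per store with a list in district_name (the frame below is an illustrative stand-in, and `wide` is a name introduced here):

```python
import pandas as pd

# Illustrative stand-in for df5: one row per store, names collected in a list.
df5 = pd.DataFrame({
    "store_name": ["store1", "store2", "store3"],
    "district_name": [
        ["District1", "District2", "District3"],
        ["District1", "District2"],
        ["District3"],
    ],
})

# A DataFrame built from uneven lists pads the shorter rows with missing values.
wide = pd.DataFrame(df5["district_name"].tolist(), index=df5["store_name"])
wide.columns = [f"district_name{i + 1}" for i in range(wide.shape[1])]
wide = wide.reset_index()
print(wide)
```

This gives one district_name column per list position, which is close to the layout requested in the question.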
