Combining 2 data frames to produce a file with repeated values

Posted 2024-09-24 22:29:25


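I have an .XLSX file with 2 columns, <Name> and <Item To Package>.

The <Item To Package> column stores multiple values linked to <Name>, separated by semicolons. I need to reshape this dataset, and I'm struggling to decide on the best approach.

I'm considering replacing the semicolons with commas and then packing the data into a dictionary where <Name> is the key and the list of <Item To Package> values is the value.

However, I'm not sure this is the most efficient route.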

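import pandas as pd

d = {'<Name>':['FIL9791','FIL9799','FIL4056','FIL4056','FIL4057'],'<Item To Package>':['package_113572195;package_113594355','package_113572197;package_113594357','package_113566689;package_113591417','package_113566688;package_113591416','package_113566690;package_113591418']}

df = pd.DataFrame(data=d)
df.head()

sku = df['<Name>']
upc = df['<Item To Package>']

# Map each SKU to the list of its package codes.
# A Series cannot be used as a dict key, so iterate row by row;
# a repeated SKU (e.g. FIL4056) accumulates the codes from all of its rows.
PartToUPC = {}
for name, packages in zip(sku, upc):
    PartToUPC.setdefault(name, []).extend(packages.split(';'))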

Here is how I need the file to look:

All SKUs in column A and their respective package codes in column B

FIL9791 | package_113572195

FIL9791 | package_113594355

FIL9799 | package_113572197

FIL9799 | package_113594357

and so on.


2 Answers

I think this is what you need:

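import pandas as pd

d = {'<Name>':['FIL9791','FIL9799','FIL4056','FIL4056','FIL4057'],'<Item To Package>':['package_113572195;package_113594355','package_113572197;package_113594357','package_113566689;package_113591417','package_113566688;package_113591416','package_113566690;package_113591418']}
df = pd.DataFrame(data=d)
# Split each string into a list, spread the pieces across columns (indexed by <Name>),
# then stack the columns into one long Series keyed by (<Name>, piece position).
new_df = pd.DataFrame(df["<Item To Package>"].str.split(';').tolist(), index=df["<Name>"]).stack()
# Drop the piece-position level, then turn <Name> back into a regular column.
new_df = new_df.reset_index(level=1, drop=True).reset_index()
new_df.columns = ['<Name>', '<Item To Package>']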

You can solve this by splitting the column into a list of values and then calling .explode() on it:

import pandas as pd

d = {'<Name>':['FIL9791','FIL9799','FIL4056','FIL4056','FIL4057'],'<Item To Package>':['package_113572195;package_113594355','package_113572197;package_113594357','package_113566689;package_113591417','package_113566688;package_113591416','package_113566690;package_113591418']}

df = pd.DataFrame(data=d)
# Turn each semicolon-separated string into a list of package codes.
df['<Item To Package>'] = df['<Item To Package>'].str.split(';')
# Emit one row per list element; the other columns are repeated.
df = df.explode('<Item To Package>')
print(df)

Output:

    <Name>  <Item To Package>
0  FIL9791  package_113572195
0  FIL9791  package_113594355
1  FIL9799  package_113572197
1  FIL9799  package_113594357
2  FIL4056  package_113566689
2  FIL4056  package_113591417
3  FIL4056  package_113566688
3  FIL4056  package_113591416
4  FIL4057  package_113566690
4  FIL4057  package_113591418

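Keep in mind that explode() preserves the original index of each row it expands, so every exploded row carries the index of the row it came from. If you don't need the index to match the original, you can reset it by adding: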

df = df.reset_index(drop=True)
print(df)

Output:

    <Name>  <Item To Package>
0  FIL9791  package_113572195
1  FIL9791  package_113594355
2  FIL9799  package_113572197
3  FIL9799  package_113594357
4  FIL4056  package_113566689
5  FIL4056  package_113591417
6  FIL4056  package_113566688
7  FIL4056  package_113591416
8  FIL4057  package_113566690
9  FIL4057  package_113591418
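
The split, explode, and index reset steps can also be folded into one helper. The following is only a sketch: the function name split_packages and its parameters are made up for illustration, and it assumes pandas 1.1.0 or later, where explode() accepts ignore_index=True.

```python
import pandas as pd


def split_packages(df, name_col="<Name>", item_col="<Item To Package>", sep=";"):
    """Return one row per (name, package) pair; the input frame is not modified."""
    return (
        # Replace the delimited string column with a column of lists.
        df.assign(**{item_col: df[item_col].str.split(sep)})
        # One row per list element, with a fresh 0..n-1 index.
        .explode(item_col, ignore_index=True)
    )


sample = pd.DataFrame({
    "<Name>": ["FIL9791", "FIL9799"],
    "<Item To Package>": [
        "package_113572195;package_113594355",
        "package_113572197;package_113594357",
    ],
})
result = split_packages(sample)
print(result)
```

Since the question asks for a file, the result could then be written out with something like result.to_excel("output.xlsx", index=False). The file name here is a placeholder, and writing .xlsx needs an Excel engine such as openpyxl to be installed.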
