用可变数量的元素和前导文本扩展数据框的列

2024-09-29 23:17:24 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在尝试展开数据帧的一列 (请参见下面示例中的列段。) 我能把它分解成由两个部件分开的部件; 但是,正如您所看到的,列中的某些行确实如此 没有所有的元素。那么,现在发生的是 应该进入Geo列的数据最终会进入 BusSeg柱,因为没有Geo柱;还是数据 应该在ProdServ列中,最后在Geo列中。 理想情况下,我只希望有数据,而不是指标 在每个单元格中正确放置。所以 在Geo栏中,它应该说“NonUs”。不是“Geo=NonUs” 那是正确分离后,我想删除文本 最多包括每个中的“=”符号。我该怎么做? 代码如下:

import pandas as pd

company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
                    'Subseg=Tr1',
                    'BusSeg=Pharma',
                    'Geo=China;Prd=Alpha;Subseg=Tr4;',
                    'Prd=Beta;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
                    'BusSeg=Pharma;Geo=NonUs;']
print("\ndf1:")
df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True)
print(df1)
print(df1[['BusSeg','Geo','ProdServ','Sub','Misc']])
print(df1.dtypes)
print()

Tags: 数据alpha部件revgeopddf1print
3条回答

你的数据

import pandas as pd

company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
                    'Subseg=Tr1',
                    'BusSeg=Pharma',
                    'Geo=China;Prd=Alpha;Subseg=Tr4;',
                    'Prd=Beta;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
                    'BusSeg=Pharma;Geo=NonUs;']

df:


    company     clv     date    line    segments
0   Rev     500     20191231    1   BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1
1   Rev     200     20191231    3   BusSeg=Dev;Prd=Alpha;Subseg=Tr1
2   Rev     3000    20191231    2   BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2
3   Rev     400     20181231    1   Subseg=Tr1
4   Rev     10      20181231    3   BusSeg=Pharma
5   Rev     300     20181231    2   Geo=China;Prd=Alpha;Subseg=Tr4;
6   Rev     560     20171231    1   Prd=Beta;Subseg=Tr1
7   Rev     500     20171231    3   BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;
8   Rev     600     20171231    2   BusSeg=Pharma;Geo=NonUs;

在代码中注释这一行df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True),然后添加这两行

d = pd.DataFrame(df1['segments'].str.split(';').apply(lambda x:{i.split("=")[0] : i.split("=")[1] for i in x if i}).to_dict()).T
df = pd.concat([df1, d], axis=1)

df:

  company   clv      date  line                                      segments  BusSeg    Geo    Prd Subseg
0     Rev   500  20191231     1  BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1  Pharma  NonUs  Alpha    Tr1
1     Rev   200  20191231     3               BusSeg=Dev;Prd=Alpha;Subseg=Tr1     Dev    NaN  Alpha    Tr1
2     Rev  3000  20191231     2     BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2  Pharma     US  Alpha    Tr2
3     Rev   400  20181231     1                                    Subseg=Tr1     NaN    NaN    NaN    Tr1
4     Rev    10  20181231     3                                 BusSeg=Pharma  Pharma    NaN    NaN    NaN
5     Rev   300  20181231     2               Geo=China;Prd=Alpha;Subseg=Tr4;     NaN  China  Alpha    Tr4
6     Rev   560  20171231     1                           Prd=Beta;Subseg=Tr1     NaN    NaN   Beta    Tr1
7     Rev   500  20171231     3    BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;  Pharma     US  Delta    Tr1
8     Rev   600  20171231     2                      BusSeg=Pharma;Geo=NonUs;  Pharma  NonUs    NaN    NaN

我建议,要逐个填充列,而不是使用split,类似于下面的代码:

col = ['BusSeg', 'Geo', 'ProdServ', 'Sub'] # Columns Names.
var = ['BusSeg', 'Geo', 'Prd', 'Subseg'] # Variables Name in 'Subseg' column.
for c, v in zip(col, var):
    df1[c] = df1['segments'].str.extract(r'' + v + '=(\w*);')

这里有一个建议:

df1.segments = (df1.segments.str.split(';')
                   .apply(lambda s:
                          dict(t.split('=') for t in s if t.strip() != '')))
df2 = pd.DataFrame({col: [dict_.get(col, '') for dict_ in df1.segments]
                    for col in set().union(*df1.segments)},
                   index=df1.index)
df1.drop(columns=['segments'], inplace=True)
df1 = pd.concat([df1, df2], axis='columns')

结果:

  company   clv      date  line Subseg    Geo  BusSeg    Prd
0     Rev   500  20191231     1    Tr1  NonUs  Pharma  Alpha
1     Rev   200  20191231     3    Tr1            Dev  Alpha
2     Rev  3000  20191231     2    Tr2     US  Pharma  Alpha
3     Rev   400  20181231     1    Tr1                      
4     Rev    10  20181231     3                Pharma       
5     Rev   300  20181231     2    Tr4  China          Alpha
6     Rev   560  20171231     1    Tr1                  Beta
7     Rev   500  20171231     3    Tr1     US  Pharma  Delta
8     Rev   600  20171231     2         NonUs  Pharma       

相关问题 更多 >

    热门问题