Using wide_to_long based on the input rows

Posted 2024-09-30 14:38:26

I'm having trouble removing null values. My input DataFrame:

name    no     city     tr1_0  tr2_0    tr3_0     tr1_1  tr2_1      tr3_1   tr1_2   tr2_2   tr3_2 
John    11     edi      boa    51        110      cof      52       220   
Rick    12     new      cof    61        100      dcu      61       750   
Mat     t1     nyc

My desired output:

     name    no city  tr1  tr3  tr2   
0    John    11  edi  boa  110   51  
1    John    11  edi  cof  220   52    
2    Rick    12  new  cof  100   61   
3    Rick    12  new  dcu  750   61  
4    Matt    13  wil  nan  nan  nan

I used the following code:

import re
import pandas as pd

df1 = pd.read_fwf(inputFileName, widths=widths, names=names, dtype=str, index_col=False)

feature_models = [col for col in df1.columns if re.match("tr[0-9]_[0-9]",col) is not None]

features = list(set([ re.sub("_[0-9]","",feature_model) for feature_model in feature_models]))

df1 = pd.wide_to_long(df1,i=['name', 'no', 'city',],j='ModelID',stubnames=features,sep="_")

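My current output is below. Row 2 is meaningless in my use case, so I don't want it generated at all. If there are no trailers, I want just one row (row 6). If there are 2 trailers, I want only 2 rows, but I get 3 (rows 2 and 5 are extra). I tried dropna, but it didn't help. Also, in my case the missing values print as nan rather than NaN.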

     name    no city  tr1  tr3  tr2
0    John    11  edi  boa  110   51
1    John    11  edi  cof  220   52
2    John    11  edi  nan  nan  nan
3    Rick    12  new  cof  100   61
4    Rick    12  new  dcu  750   61
5    Rick    12  new  nan  nan  nan
6    Matt    13  wil  nan  nan  nan
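
The behaviour described in the question can be reproduced with a small hand-built frame. This is only a sketch: the DataFrame is typed in by hand as a stand-in for the read_fwf result (the input file, widths and names are not available), missing cells are assumed to be real NaN values, and the stub names are listed explicitly rather than derived with re.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the frame read with pd.read_fwf in the question.
# Assumption: missing cells are real NaN; dtype=object keeps the rest as strings.
wide = pd.DataFrame({
    "name": ["John", "Rick", "Mat"],
    "no": ["11", "12", "t1"],
    "city": ["edi", "new", "nyc"],
    "tr1_0": ["boa", "cof", np.nan],
    "tr2_0": ["51", "61", np.nan],
    "tr3_0": ["110", "100", np.nan],
    "tr1_1": ["cof", "dcu", np.nan],
    "tr2_1": ["52", "61", np.nan],
    "tr3_1": ["220", "750", np.nan],
    "tr1_2": [np.nan, np.nan, np.nan],
    "tr2_2": [np.nan, np.nan, np.nan],
    "tr3_2": [np.nan, np.nan, np.nan],
}, dtype=object)

long_rows = pd.wide_to_long(
    wide,
    stubnames=["tr1", "tr2", "tr3"],
    i=["name", "no", "city"],
    j="ModelID",
    sep="_",
)

# wide_to_long emits one row per id and suffix, whether or not it holds data.
all_nan = long_rows.isnull().all(axis=1)
print(len(long_rows), "rows,", int(all_nan.sum()), "of them entirely NaN")
```

Every suffix that appears in any column name produces a row for every person, which is where the extra all-NaN rows come from.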

1 answer

Answer 1 · posted 2024-09-30 14:38:26

You can use this alternative solution with str.split and stack:

df1 = df1.set_index(['name', 'no', 'city'])
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack(1, dropna=False).reset_index(level=3, drop=True)

mask = df1.index.duplicated() & df1.isnull().all(axis=1)

df1 = df1[~mask].reset_index()
print (df1)
   name  no city  tr1   tr2    tr3
0  John  11  edi  boa  51.0  110.0
1  John  11  edi  cof  52.0  220.0
2  Rick  12  new  cof  61.0  100.0
3  Rick  12  new  dcu  61.0  750.0
4   Mat  t1  nyc  NaN   NaN    NaN
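
To make the column-splitting step concrete, here is a minimal sketch on a hypothetical one-row frame (the frame name demo and its values are made up for illustration). It only shows how str.split with expand=True turns labels such as tr1_0 into a two-level column index, which is what lets stack move the suffix level into the rows.

```python
import pandas as pd

# Hypothetical one-row frame; only the column labels matter here.
demo = pd.DataFrame(
    [["John", "11", "edi", "boa", "51", "cof", "52"]],
    columns=["name", "no", "city", "tr1_0", "tr2_0", "tr1_1", "tr2_1"],
).set_index(["name", "no", "city"])

# Split "tr1_0" into ("tr1", "0"); expand=True returns a MultiIndex.
demo.columns = demo.columns.str.split("_", expand=True)

print(demo.columns.nlevels)
print(demo.columns.get_level_values(0).tolist())  # stub names
print(demo.columns.get_level_values(1).tolist())  # numeric suffixes, as strings
```

Level 0 holds the stub names and level 1 the suffixes, so stacking level 1 leaves one column per stub.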

With your solution:

df1 = pd.wide_to_long(df1,i=['name', 'no', 'city'],j='ModelID',stubnames=features,sep="_")

To remove the all-NaN rows whose MultiIndex value is a duplicate, filter with boolean indexing:

# remove the counting level (ModelID)
df1 = df1.reset_index(level=3, drop=True)
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
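
Put together, the wide_to_long route might look like the sketch below. The input frame is again a hand-typed stand-in for the read_fwf result, with missing cells assumed to be real NaN, and the names wide, long_rows and result are illustrative only.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the question's input (assumption: real NaN values).
wide = pd.DataFrame({
    "name": ["John", "Rick", "Mat"],
    "no": ["11", "12", "t1"],
    "city": ["edi", "new", "nyc"],
    "tr1_0": ["boa", "cof", np.nan],
    "tr2_0": ["51", "61", np.nan],
    "tr3_0": ["110", "100", np.nan],
    "tr1_1": ["cof", "dcu", np.nan],
    "tr2_1": ["52", "61", np.nan],
    "tr3_1": ["220", "750", np.nan],
    "tr1_2": [np.nan, np.nan, np.nan],
    "tr2_2": [np.nan, np.nan, np.nan],
    "tr3_2": [np.nan, np.nan, np.nan],
}, dtype=object)

long_rows = pd.wide_to_long(
    wide, stubnames=["tr1", "tr2", "tr3"],
    i=["name", "no", "city"], j="ModelID", sep="_",
)

# Drop the suffix level so all rows of one person share the same index value.
long_rows = long_rows.reset_index(level="ModelID", drop=True)

# True for all-NaN rows that repeat an index value already seen earlier.
mask = long_rows.index.duplicated() & long_rows.isnull().all(axis=1)

result = long_rows[~mask].reset_index()
print(result)
```

The first row of each person is never flagged, so someone with no data at all still keeps a single all-NaN row. One thing worth checking: if the missing cells are the literal string 'nan' rather than real NaN (the question mentions lowercase nan in its output, but the cause is not confirmed), isnull() will not flag them and they would need converting first, for example with replace('nan', np.nan).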

Details:

Check for duplicated index values with Index.duplicated:

print (df1.index.duplicated())
[False  True False  True False  True]

Then use isnull with all(axis=1) to flag the rows in which every value is missing:

print (df1.isnull().all(axis=1))
name  no  city
John  11  edi     False
          edi     False
Rick  12  new     False
          new     False
Mat   t1  nyc      True
          nyc      True
dtype: bool

Chain the two conditions with bitwise AND (&):

mask = df1.index.duplicated() & df1.isnull().all(axis=1)
print (mask)
name  no  city
John  11  edi     False
          edi     False
Rick  12  new     False
          new     False
Mat   t1  nyc     False
          nyc      True
dtype: bool

Invert the boolean mask with ~:

print (~mask)
name  no  city
John  11  edi      True
          edi      True
Rick  12  new      True
          new      True
Mat   t1  nyc      True
          nyc     False
dtype: bool
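
The three mask steps can be tried in isolation on a small frame built to match the six rows printed in these details. The frame and the variable names (frame, is_repeat, is_empty, kept) are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Six rows, two per person, mirroring the index shown in the details.
idx = pd.MultiIndex.from_tuples(
    [("John", "11", "edi")] * 2
    + [("Rick", "12", "new")] * 2
    + [("Mat", "t1", "nyc")] * 2,
    names=["name", "no", "city"],
)
frame = pd.DataFrame(
    {
        "tr1": ["boa", "cof", "cof", "dcu", np.nan, np.nan],
        "tr2": [51, 52, 61, 61, np.nan, np.nan],
    },
    index=idx,
)

is_repeat = frame.index.duplicated()    # second and later rows of each person
is_empty = frame.isnull().all(axis=1)   # rows where every column is missing
mask = is_repeat & is_empty             # rows to drop
kept = frame[~mask]

print(is_repeat.tolist())
print(is_empty.tolist())
print(mask.tolist())
print(len(kept))
```

Only the second Mat row satisfies both conditions, so it is the only row removed.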
