Number of entries lost when merging DataFrames

Posted 2024-09-25 00:21:58


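In an exercise I was asked to merge 3 DataFrames with an inner join (df1+df2+df3=mergedDf), and then in another question I was asked how many entries were lost when performing this three-way merge.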

import pandas as pd

#DataFrame1
df1 = pd.DataFrame(columns=["Goals","Medals"],data=[[5,2],[1,0],[3,1]])
df1.index = ['Argentina','Angola','Bolivia']
print(df1)
            Goals    Medals
Argentina       5         2
Angola          1         0
Bolivia         3         1

#DataFrame2
df2 = pd.DataFrame(columns=["Dates","Medals"],data=[[1,0],[2,1],[2,2]])
df2.index = ['Venezuela','Africa','Argentina']
print(df2)
            Dates    Medals
Venezuela       1         0
Africa          2         1
Argentina       2         2

#DataFrame3
df3 = pd.DataFrame(columns=["Players","Goals"],data=[[11,5],[11,1],[10,0]])
df3.index = ['Argentina','Australia','Belgica']
print(df3)
           Players    Goals
Argentina       11        5
Australia       11        1
Belgica         10        0

#mergedDf
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
print(mergedDf)
           Goals_x  Medals_x  Dates  Medals_y  Players  Goals_y
Argentina        5         2      2         2       11        5

#Calculate number of lost entries by code

I tried merging everything with an outer join and then subtracting mergedDf, but I don't know how to do it. Can anyone help?


3 answers

You can pass True to the indicator parameter of merge:

df1=pd.DataFrame({'A':[1,2,3],'B':[1,1,1]})
df2=pd.DataFrame({'A':[2,3],'B':[1,1]})
df1.merge(df2,on='A',how='inner')
Out[257]: 
   A  B_x  B_y
0  2    1    1
1  3    1    1
df1.merge(df2,on='A',how='outer',indicator =True)
Out[258]: 
   A  B_x  B_y     _merge
0  1    1  NaN  left_only
1  2    1  1.0       both
2  3    1  1.0       both
mergedf=df1.merge(df2,on='A',how='outer',indicator =True)

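Then value_counts on the _merge column tells you how many rows you lose with an inner join, because only the rows marked both are kept when how='inner':

mergedf['_merge'].value_counts()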

With 3 DataFrames you end up with two merge indicator columns; filter on the value both in each of them:

df1.merge(df2, on='A',how='outer',indicator =True).rename(columns={'_merge':'merge'}).merge(df3, on='A',how='outer',indicator =True)
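The chained call above stops before the counting step. A minimal sketch of the full idea follows. It assumes a made-up third frame df3 that shares the key column A (the answer does not define one), and it names the indicator columns m12 and m123 directly instead of renaming _merge. A row survives the three-way inner join only when both indicators read both.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
df2 = pd.DataFrame({'A': [2, 3], 'B': [1, 1]})
# df3 is a hypothetical frame, invented here for illustration only
df3 = pd.DataFrame({'A': [3, 4], 'C': [9, 9]})

# Outer-join step by step, giving each indicator column its own name
step1 = df1.merge(df2, on='A', how='outer', indicator='m12')
step2 = step1.merge(df3, on='A', how='outer', indicator='m123')

# Rows present in all three frames are 'both' in both indicator columns
kept = (step2['m12'] == 'both') & (step2['m123'] == 'both')
lost = int((~kept).sum())
print(lost)
```

Passing a string to indicator avoids the name clash that the rename in the one-liner works around.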

I found a simple but effective solution:

Merge the 3 DataFrames, once with inner joins and once with outer joins:

df1 = Df1()
df2 = Df2()
df3 = Df3()
inner = pd.merge(pd.merge(df1,df2,on='<Common column>',how='inner'),df3,on='<Common column>',how='inner')
outer = pd.merge(pd.merge(df1,df2,on='<Common column>',how='outer'),df3,on='<Common column>',how='outer')

The number of lost entries (rows) is then the number of rows in outer minus the number of rows in inner.
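As a sketch of that subtraction on the data from the question: the frames there have no common key column, so this version merges on the index instead of on='<Common column>'. The helper merge_all is a hypothetical name, df2's index is completed to three labels to match its printed output, and the count assumes every index label is unique within its frame.

```python
import pandas as pd

df1 = pd.DataFrame({'Goals': [5, 1, 3], 'Medals': [2, 0, 1]},
                   index=['Argentina', 'Angola', 'Bolivia'])
df2 = pd.DataFrame({'Dates': [1, 2, 2], 'Medals': [0, 1, 2]},
                   index=['Venezuela', 'Africa', 'Argentina'])
df3 = pd.DataFrame({'Players': [11, 11, 10], 'Goals': [5, 1, 0]},
                   index=['Argentina', 'Australia', 'Belgica'])

def merge_all(how):
    """Merge df1, df2 and df3 on their index with the given join type."""
    step = pd.merge(df1, df2, how=how, left_index=True, right_index=True)
    return pd.merge(step, df3, how=how, left_index=True, right_index=True)

inner = merge_all('inner')
outer = merge_all('outer')

# Every row of the outer result that is absent from the inner result was lost
lost = len(outer) - len(inner)
print(lost)  # 6
```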

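A solution with outer joins and the indicator parameter: at the end, count the rows that are not marked both in the indicator columns a and b, by summing the True values (each True counts as 1):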

mergedDf = pd.merge(df1,df2,how='outer',left_index=True, right_index=True, indicator='a')
mergedDf = pd.merge(mergedDf,df3,how='outer',left_index=True, right_index=True, indicator='b')
print(mergedDf)
           Goals_x  Medals_x  Dates  Medals_y           a  Players  Goals_y  \
Africa         NaN       NaN    2.0       1.0  right_only      NaN      NaN   
Angola         1.0       0.0    NaN       NaN   left_only      NaN      NaN   
Argentina      5.0       2.0    2.0       2.0        both     11.0      5.0   
Australia      NaN       NaN    NaN       NaN         NaN     11.0      1.0   
Belgica        NaN       NaN    NaN       NaN         NaN     10.0      0.0   
Bolivia        3.0       1.0    NaN       NaN   left_only      NaN      NaN   
Venezuela      NaN       NaN    1.0       0.0  right_only      NaN      NaN   

                    b  
Africa      left_only  
Angola      left_only  
Argentina        both  
Australia  right_only  
Belgica    right_only  
Bolivia     left_only  
Venezuela   left_only

# a row survives the inner joins only if both indicators are 'both', so count every other row
missing = ((mergedDf['a'] != 'both') | (mergedDf['b'] != 'both')).sum()
print (missing)
6

Another solution uses the inner join and, for each DataFrame, sums the index labels that do not appear in mergedDf.index.
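The code for this variant did not survive on this page, so the following is only a sketch of what the sentence describes, not the answerer's original. It uses Index.isin to flag, per frame, the labels missing from the inner-join result, with the question's data and df2's index completed to three labels.

```python
import pandas as pd

df1 = pd.DataFrame({'Goals': [5, 1, 3], 'Medals': [2, 0, 1]},
                   index=['Argentina', 'Angola', 'Bolivia'])
df2 = pd.DataFrame({'Dates': [1, 2, 2], 'Medals': [0, 1, 2]},
                   index=['Venezuela', 'Africa', 'Argentina'])
df3 = pd.DataFrame({'Players': [11, 11, 10], 'Goals': [5, 1, 0]},
                   index=['Argentina', 'Australia', 'Belgica'])
dfs = [df1, df2, df3]

mergedDf = pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf, df3, how='inner', left_index=True, right_index=True)

# For each frame, count the rows whose label did not survive the inner join
missing = sum(int((~x.index.isin(mergedDf.index)).sum()) for x in dfs)
print(missing)  # 6
```

Note that this counts a label once for every frame it appears in. A label present in two of the three frames would add 2 here but only 1 to the outer-minus-inner row count, so the two measures agree only when each lost label occurs in a single frame, as in this data.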

If the values in each index are unique, another solution:

dfs = [df1, df2, df3]
L = [set(x.index) for x in dfs]

#https://stackoverflow.com/a/25324329/2901002
missing = len(set.union(*L) - set.intersection(*L))
print (missing)
6
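The set-based count silently collapses duplicate labels, which is why it depends on unique indexes. One way to make that assumption explicit is sketched below. The function name count_lost_labels and the small frames a and b are hypothetical, introduced only for this example.

```python
import pandas as pd

def count_lost_labels(dfs):
    """Count index labels that are not shared by every DataFrame in dfs."""
    for position, df in enumerate(dfs):
        # Sets would hide repeated labels, so refuse to guess in that case
        if not df.index.is_unique:
            raise ValueError(f"DataFrame at position {position} has duplicate index labels")
    label_sets = [set(df.index) for df in dfs]
    return len(set.union(*label_sets) - set.intersection(*label_sets))

a = pd.DataFrame({'x': [1, 2]}, index=['p', 'q'])
b = pd.DataFrame({'y': [3, 4]}, index=['q', 'r'])
print(count_lost_labels([a, b]))  # 2
```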
