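<p>“I want my Fraud column to have null values when there are duplicates in CustomerMail.”</p>
<p>So in your expected output you forgot to add <code>name_4</code> to <code>customerEmail</code>, because it is duplicated as well.</p>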
<pre><code>import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'customerEmail': ['name0', 'name1', 'name2', 'name3', 'name4', 'name1'],
    'Fraud': [False, True, True, True, False, False],
})
df2 = pd.DataFrame({
    'customerEmail': ['name0', 'name1', 'name2', 'name3', 'name4', 'name1'],
    'ID': [0, 1, 2, 3, 4, 5],
})
# name1 appears twice in both frames, so the merge yields 2 x 2 = 4 rows for it (8 rows in total)
df3 = pd.merge(df1, df2, on='customerEmail', how='left')

# Keep one row per customerEmail (the last occurrence); the Fraud value of these rows is blanked below
df_duplicates = df3.drop_duplicates(subset=['customerEmail'], keep='last')
print(df_duplicates)
#   customerEmail  Fraud  ID
# 0         name0  False   0
# 3         name2   True   2
# 4         name3   True   3
# 5         name4  False   4
# 7         name1  False   5

# Now set Fraud to NaN on those rows (assign returns a new frame, which avoids SettingWithCopyWarning)
df_duplicates = df_duplicates.assign(Fraud=np.nan)
print(df_duplicates)
#   customerEmail  Fraud  ID
# 0         name0    NaN   0
# 3         name2    NaN   2
# 4         name3    NaN   3
# 5         name4    NaN   4
# 7         name1    NaN   5

# You now have two frames: df_duplicates, whose Fraud values are NaN (above),
# and df3, which holds the original merged values.
# Stack them row-wise with concat, then use drop_duplicates to remove
# rows that repeat the same (customerEmail, Fraud) pair.
result = pd.concat([df3, df_duplicates]).drop_duplicates(subset=['customerEmail', 'Fraud'])
# After concat the Fraud column is no longer a plain bool column (True/False may show up as 1.0/0.0),
# so convert it to the nullable 'boolean' dtype to get True/False back alongside the missing values
result['Fraud'] = result['Fraud'].astype('boolean')
print(result)
#   customerEmail  Fraud  ID
# 0         name0  False   0
# 1         name1   True   1
# 3         name2   True   2
# 4         name3   True   3
# 5         name4  False   4
# 6         name1  False   1
# 0         name0   &lt;NA&gt;   0
# 3         name2   &lt;NA&gt;   2
# 4         name3   &lt;NA&gt;   3
# 5         name4   &lt;NA&gt;   4
# 7         name1   &lt;NA&gt;   5
</code></pre>
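<p>If the goal is read literally (blank out <code>Fraud</code> only on rows whose <code>customerEmail</code> occurs more than once), a shorter route may be a boolean mask from <code>Series.duplicated(keep=False)</code>. This is a different technique from the concat approach above, not a rewrite of it. The sketch below assumes a single frame named <code>df</code> (a name chosen here for illustration) holding the same sample data as <code>df1</code>.</p>

```python
import pandas as pd

# Illustrative frame (same sample values as df1 above; the name `df` is an assumption)
df = pd.DataFrame({
    "customerEmail": ["name0", "name1", "name2", "name3", "name4", "name1"],
    "Fraud": [False, True, True, True, False, False],
})

# Use the nullable boolean dtype so the column can hold True, False and <NA>
df["Fraud"] = df["Fraud"].astype("boolean")

# keep=False flags every occurrence of a repeated email, not only the later ones
is_dup = df["customerEmail"].duplicated(keep=False)

# Blank out Fraud only on the rows whose email is repeated
df.loc[is_dup, "Fraud"] = pd.NA

print(df)
```

<p>With this sample data only the two <code>name1</code> rows are flagged, so the other rows keep their original <code>Fraud</code> values and no rows are added or removed.</p>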