How to iterate over rows between two different DataFrames and capture the whole row

Posted 2024-10-06 11:19:34


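Is there a simpler way to capture the data differences between two DataFrames? I am trying to compare two DataFrames and find the missing data. I found the linked post below, but it does not meet my requirement. That post compares by Id, whereas I want to compare entire rows of the two DataFrames and find the rows that do not match, whether they come from A or B.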

Iterate over different dataframe

DF 1:

country    from         to       flag
('GB',  'GB390065', 'GB66903',  'Y')
('FR',  'FR418458', 'FR765617', 'Y')
('FR',  'FR629810', 'FR855277', 'Y')
('FR',  'FR494010', 'FR332891', 'Y')
('FR',  'FR740500', 'FR907878', 'Y')
('FI',  'NB444135', 'NB234471', 'Y')

DF 2:

country    from         to       flag
('GB',  'GB390065', 'GB66903',  'Y')
('FR',  'FR418458', 'FR7656',   'Y')
('FR',  'FR629810', 'FR855277', 'Y')
('FR',  'FR4910',   'FR33891',  'Y')
('FR',  'FR740500', 'FR907878', 'Y')
('FI',  'NB444135', 'NB234471', 'Y')

My expected output:

country    from         to       flag   Available
('FR',  'FR418458', 'FR7656',   'Y'       df2)
('FR',  'FR4910',   'FR33891',  'Y'       df2)

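The challenge I face is that the from column and the to column hold the same kind of data, so when I try the merge function it picks up only one of the two columns and gives a result that does not meet my requirement.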

The result I get is:

               country_code from_cust_id  ... to_cust_guid Merge_status
       322443         DE       DE180556  ...          NaN    left_only
       322444         DE        DE22191  ...          NaN    left_only
       322445         DE      DE2625168  ...          NaN    left_only
       322446         DE        DE17705  ...          NaN    left_only
       322447         DE      DE2556758  ...          NaN    left_only

      [5 rows x 7 columns]
              country_code from_cust_id  ... to_cust_guid Merge_status
       0                DE          NaN  ...     DE485137   right_only
       1                DE          NaN  ...     DE467209   right_only
       2                DE          NaN  ...     DE651068   right_only
       3                DE          NaN  ...     DE459729   right_only
       4                DE          NaN  ...     DE448683   right_only

3 Answers

Try this:

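import pandas as pd

# Tag each frame with its source before stacking them
df1['Available'] = 'df1'
df2['Available'] = 'df2'

# keep=False drops every row whose (from, to) pair occurs more than once,
# leaving only the rows that have no counterpart in the other frame
available = pd.concat([df1, df2]).drop_duplicates(subset=['from', 'to'], keep=False)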
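
The snippet above compares only the from and to columns and adds a column to the original frames. A minimal sketch of the same idea applied to whole rows, assuming the sample data from the question; the helper name `unmatched_rows` and the `COLUMNS` constant are illustrative and not part of the original answer:

```python
import pandas as pd

COLUMNS = ["country", "from", "to", "flag"]

df1 = pd.DataFrame(
    [
        ("GB", "GB390065", "GB66903", "Y"),
        ("FR", "FR418458", "FR765617", "Y"),
        ("FR", "FR629810", "FR855277", "Y"),
        ("FR", "FR494010", "FR332891", "Y"),
        ("FR", "FR740500", "FR907878", "Y"),
        ("FI", "NB444135", "NB234471", "Y"),
    ],
    columns=COLUMNS,
)
df2 = pd.DataFrame(
    [
        ("GB", "GB390065", "GB66903", "Y"),
        ("FR", "FR418458", "FR7656", "Y"),
        ("FR", "FR629810", "FR855277", "Y"),
        ("FR", "FR4910", "FR33891", "Y"),
        ("FR", "FR740500", "FR907878", "Y"),
        ("FI", "NB444135", "NB234471", "Y"),
    ],
    columns=COLUMNS,
)


def unmatched_rows(left, right, columns):
    """Return the rows found in only one frame, tagged with their source."""
    # assign() returns tagged copies, so the input frames stay unchanged
    tagged = pd.concat(
        [left.assign(Available="df1"), right.assign(Available="df2")],
        ignore_index=True,
    )
    # keep=False removes every row that has a duplicate on the compared columns
    return tagged.drop_duplicates(subset=columns, keep=False)


result = unmatched_rows(df1, df2, COLUMNS)
print(result)
```

One caveat of this approach: a row that is repeated inside a single frame is also treated as a duplicate and dropped, so it is only reliable when each frame has no internal duplicates.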

Here is an approach that converts each DataFrame into a set of tuples:

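import numpy as np
import pandas as pd

# Each row becomes a tuple; the set holds the distinct rows of the frame
a1 = set(df1.apply(tuple, axis=1))
a2 = set(df2.apply(tuple, axis=1))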

print(a1)

{('FI', 'NB444135', 'NB234471', 'Y'),
 ('FR', 'FR418458', 'FR765617', 'Y'),
 ('FR', 'FR494010', 'FR332891', 'Y'),
 ('FR', 'FR629810', 'FR855277', 'Y'),
 ('FR', 'FR740500', 'FR907878', 'Y'),
 ('GB', 'GB390065', 'GB66903', 'Y')}


print(a2)

{('FI', 'NB444135', 'NB234471', 'Y'),
 ('FR', 'FR418458', 'FR7656', 'Y'),
 ('FR', 'FR4910', 'FR33891', 'Y'),
 ('FR', 'FR629810', 'FR855277', 'Y'),
 ('FR', 'FR740500', 'FR907878', 'Y'),
 ('GB', 'GB390065', 'GB66903', 'Y')}

# now do set difference (in df2 not in df1)
diffs = pd.DataFrame(np.array([x for x in a2 - a1]))
diffs['Available'] = 'df2'

print(diffs)

    0         1        2  3 Available
0  FR  FR418458   FR7656  Y       df2
1  FR    FR4910  FR33891  Y       df2
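
The result above loses the column names and covers only one direction (in df2, not in df1). A minimal sketch that keeps the names and reports both directions, assuming the sample data from the question; `row_set` and `COLUMNS` are illustrative names, not from the original answer:

```python
import pandas as pd

COLUMNS = ["country", "from", "to", "flag"]

df1 = pd.DataFrame(
    [
        ("GB", "GB390065", "GB66903", "Y"),
        ("FR", "FR418458", "FR765617", "Y"),
        ("FR", "FR629810", "FR855277", "Y"),
        ("FR", "FR494010", "FR332891", "Y"),
        ("FR", "FR740500", "FR907878", "Y"),
        ("FI", "NB444135", "NB234471", "Y"),
    ],
    columns=COLUMNS,
)
df2 = pd.DataFrame(
    [
        ("GB", "GB390065", "GB66903", "Y"),
        ("FR", "FR418458", "FR7656", "Y"),
        ("FR", "FR629810", "FR855277", "Y"),
        ("FR", "FR4910", "FR33891", "Y"),
        ("FR", "FR740500", "FR907878", "Y"),
        ("FI", "NB444135", "NB234471", "Y"),
    ],
    columns=COLUMNS,
)


def row_set(df):
    """Return the distinct rows of df as a set of plain tuples."""
    # index=False leaves out the index; name=None yields tuples, not namedtuples
    return set(df.itertuples(index=False, name=None))


rows1, rows2 = row_set(df1), row_set(df2)

# sorted() gives a stable row order, since set iteration order is arbitrary
only_in_df2 = pd.DataFrame(sorted(rows2 - rows1), columns=COLUMNS)
only_in_df2["Available"] = "df2"
only_in_df1 = pd.DataFrame(sorted(rows1 - rows2), columns=COLUMNS)
only_in_df1["Available"] = "df1"

diffs = pd.concat([only_in_df2, only_in_df1], ignore_index=True)
print(diffs)
```

Because sets hold each row once, repeated rows inside a frame collapse into a single entry, and every cell value has to be hashable.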

Try this:

import pandas as pd

# Mylist1 and Mylist2 are the lists of row tuples shown in the question
df1 = pd.DataFrame(Mylist1, columns=['country', 'from', 'to', 'flag'])
df2 = pd.DataFrame(Mylist2, columns=['country', 'from', 'to', 'flag'])
# Positional comparison: both frames must have the same shape and index
mismatch = (df1 != df2).any(axis=1)
# .copy() avoids SettingWithCopyWarning when the label column is added
df3 = df2.loc[mismatch].copy()
df4 = df1.loc[mismatch].copy()
# Label each row with the frame it actually comes from
df3['Available'] = 'df2'
df4['Available'] = 'df1'
df_row_reindex = pd.concat([df3, df4], ignore_index=False)
print(df_row_reindex)


country from    to         flag Available
FR  FR418458    FR7656     Y    df2
FR  FR4910      FR33891    Y    df2
FR  FR418458    FR765617   Y    df1
FR  FR494010    FR332891   Y    df1
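
The element-wise comparison above only works when both frames have the same shape and row order. Since the question mentions the merge function, here is a hedged sketch of an outer merge on all columns with `indicator=True`, which does not depend on row position. It assumes the sample data from the question; `COLUMNS` and `unmatched` are illustrative names:

```python
import pandas as pd

COLUMNS = ["country", "from", "to", "flag"]

df1 = pd.DataFrame(
    [
        ("GB", "GB390065", "GB66903", "Y"),
        ("FR", "FR418458", "FR765617", "Y"),
        ("FR", "FR629810", "FR855277", "Y"),
        ("FR", "FR494010", "FR332891", "Y"),
        ("FR", "FR740500", "FR907878", "Y"),
        ("FI", "NB444135", "NB234471", "Y"),
    ],
    columns=COLUMNS,
)
df2 = pd.DataFrame(
    [
        ("GB", "GB390065", "GB66903", "Y"),
        ("FR", "FR418458", "FR7656", "Y"),
        ("FR", "FR629810", "FR855277", "Y"),
        ("FR", "FR4910", "FR33891", "Y"),
        ("FR", "FR740500", "FR907878", "Y"),
        ("FI", "NB444135", "NB234471", "Y"),
    ],
    columns=COLUMNS,
)

# Joining on every column means a row matches only if all four values agree.
# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'.
merged = df1.merge(df2, how="outer", on=COLUMNS, indicator=True)

unmatched = merged[merged["_merge"] != "both"].copy()
# Convert the categorical indicator to plain strings before relabelling
unmatched["Available"] = (
    unmatched["_merge"].astype(str).map({"left_only": "df1", "right_only": "df2"})
)
unmatched = unmatched.drop(columns="_merge").reset_index(drop=True)
print(unmatched)
```

Passing all columns to `on` avoids the NaN-filled from/to columns seen in the question's output, which typically appear when only some columns are used as join keys.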
