How to compare two columns of data to make sure none of the values match

2 answers

User

Answer 1 · edited 2024-05-20 22:03:36

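It makes sense to use the account number as the index of the result DataFrame and to store the row numbers in columns. The simplest solution is to check every pair of indices in df1 and df2 and store the row numbers in df3; its complexity is O(n^2).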

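Edit: It looks like you can improve performance by filtering df1 and df2 with .isin, although I have only tested this with mock data. It is still O(n^2), but n is now the number of matching accounts rather than the total number of rows.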

import pandas as pd
d1 = {'account': [1234, 5678, 9101, 1121]}
d2 = {'account': [3141, 5161, 7181, 9202, 1222, 1234]}
d3 = {'r1': [], 'r2': []}

df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
df3 = pd.DataFrame(data = d3)

match1 = df1.account.isin(df2.account.values)
match2 = df2.account.isin(df1.account.values)
for r1 in df1[match1].index:
    for r2 in df2[match2].index:
        if df1.account[r1] == df2.account[r2]:
            idx = df1.account[r1]
            row = {'r1': r1, 'r2': r2}
            df3.loc[idx] = row

Edit 2: I get better performance with this version, and it is simpler:

match1 = df1.account.isin(df2.account.values)

for r1 in df1[match1].index:
    idx = df1.account[r1]
    r2 = df2[df2.account == idx].index[0]
    row = {'r1': r1, 'r2': r2}
    df3.loc[idx] = row
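The per-row lookup in Edit 2 can probably be replaced by a single vectorized mapping. The following is only a sketch under the same assumption as Edit 2 (account numbers are unique within df2); the `lookup` and `matched` names are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'account': [1234, 5678, 9101, 1121]})
df2 = pd.DataFrame({'account': [3141, 5161, 7181, 9202, 1222, 1234]})

# Map each account number in df2 to its row number.
# Assumption: accounts are unique in df2, otherwise map() below raises.
lookup = pd.Series(df2.index, index=df2['account'])

# Keep only the df1 rows whose account also appears in df2.
matched = df1[df1['account'].isin(df2['account'])]

# Result is indexed by account number, with the row numbers from each frame.
df3 = pd.DataFrame(
    {'r1': matched.index.to_numpy(),
     'r2': matched['account'].map(lookup).to_numpy()},
    index=matched['account'].to_numpy(),
)
print(df3)
```

This avoids growing df3 one row at a time, which tends to be slow for large inputs.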

Edit 3: If account numbers are not unique in df1 and df2, you cannot use the account number as the index:

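rows = []  # collect matches in a list; DataFrame.append was removed in pandas 2.0
match1 = df1.account.isin(df2.account.values)

for r1 in df1[match1].index:
    idx = df1.account[r1]
    # one output row per matching row in df2, since accounts may repeat
    for r2 in df2[df2.account == idx].index:
        rows.append({'account': idx, 'r1': r1, 'r2': r2})

df3 = pd.DataFrame(rows, columns=['account', 'r1', 'r2'])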
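For the non-unique case, the nested loop can likely be expressed as one inner merge on the account column after exposing the row numbers as ordinary columns. This is a sketch with made-up sample data containing duplicates; the `left`, `right`, and `pairs` names are illustrative and do not come from the answer.

```python
import pandas as pd

# Sample data with repeated account numbers (illustrative only).
df1 = pd.DataFrame({'account': [1234, 5678, 9101, 1234]})
df2 = pd.DataFrame({'account': [3141, 1234, 7181, 1234]})

# reset_index() turns the default row index into a column named 'index'.
left = df1.reset_index().rename(columns={'index': 'r1'})
right = df2.reset_index().rename(columns={'index': 'r2'})

# An inner merge yields one row per (r1, r2) pair with equal account numbers.
pairs = left.merge(right, on='account', how='inner')[['account', 'r1', 'r2']]
print(pairs)
```

With duplicates on both sides the merge produces every combination, which matches what the nested loop collects.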

User

Answer 2 · edited 2024-05-20 22:03:36

You can merge on the column and then use the output to look up the problem rows in the original datasets:

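target_col = 'Account Number'
# inner merge keeps only account numbers present in both frames;
# selecting the column before .values gives a 1-D array for isin
matching_account_nos = pd.merge(df1[[target_col]], df2[[target_col]], on=target_col, how='inner')[target_col].values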

# now use this to look up the rows in the original dataframes
problem_rows_df1 = df1[df1[target_col].isin(matching_account_nos)]
problem_rows_df2 = df2[df2[target_col].isin(matching_account_nos)]

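The merge returns a DataFrame containing the rows where 'Account Number' is equal in both inputs. .values converts it to a NumPy array, which you can then use to find the rows you need in the original DataFrames.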
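If the goal is only to confirm that no value appears in both columns, a direct check may be enough before looking up any rows. The sketch below uses a plain set intersection rather than the merge described above; `shared_values` is a hypothetical helper, and the sample data is made up.

```python
import pandas as pd

def shared_values(left: pd.Series, right: pd.Series) -> list:
    """Return the sorted distinct values that occur in both Series (NaN ignored)."""
    return sorted(set(left.dropna()) & set(right.dropna()))

df1 = pd.DataFrame({'Account Number': [1234, 5678, 9101, 1121]})
df2 = pd.DataFrame({'Account Number': [3141, 5161, 7181, 9202, 1222, 1234]})

overlap = shared_values(df1['Account Number'], df2['Account Number'])
if overlap:
    print("Columns are not disjoint; shared values:", overlap)
else:
    print("No account number appears in both DataFrames.")
```

An empty list means the two columns have nothing in common; otherwise the returned values can be passed to .isin to pull out the offending rows.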
