Comparing the values in one row of a table with all other rows

       NAME              PROFILE URL               Final Addres
0  ProfileA    appexample.co/userxyz         http://example.com
1  ProfileB  appexample.co/userxyz_1         http://example.com
2  ProfileC    appexample.co/userabc  http://anotherexample.com
3  ProfileD  appexample.co/userabc_3  http://anotherexample.com
4  ProfileE    appexample.co/userjyl      http://example123.com

    possible_dup = []
    # Compare every row with every other row of the DataFrame `test`
    for _, row in test.iterrows():
        first = str(row['PROFILE URL'])
        first_url = str(row['Final Addres'])
        for _, sec_row in test.iterrows():
            second = str(sec_row['PROFILE URL'])
            second_url = str(sec_row['Final Addres'])
            # Skip the comparison of a profile with itself
            if first == second:
                continue
            # Possible duplicate: one URL contains the other and both resolve to the same address
            if first in second and first_url == second_url:
                possible_dup.append('{} , {}'.format(first, second))

2 answers

User

#1 · Edited 2024-10-01 09:26:56

Take a look at the duplicated() method. From the documentation:

Return boolean Series denoting duplicate rows.

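Particularly useful is the optional subset parameter, which restricts the comparison to a chosen set of columns. Depending on your exact goal, you can do several things with duplicated():

To flag the duplicated rows (keep=False marks every member of a duplicate group), use

duplicates = test.duplicated(subset=['PROFILE URL', 'Final Addres'], keep=False)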

To select the duplicated users (with keep='first', every occurrence after the first one), use

    duplicate_users = test[test.duplicated(subset=['PROFILE URL', 'Final Addres'], keep='first')]

To get a DataFrame without duplicates (each formerly duplicated row now appears only once):

duplicates = test.duplicated(subset=['PROFILE URL', 'Final Addres'])
duplicate_free_df = test.loc[~duplicates]
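One point worth checking before relying on these calls: duplicated() tests for exact equality on the subset columns. The sketch below rebuilds the sample table from the question (the column name 'Final Addres' is taken from the table header as shown) and compares the variants. It is only an illustration; the names both_cols, addr_all, addr_later and deduplicated are made up for this example.

```python
import pandas as pd

# Sample data copied from the question
test = pd.DataFrame({
    "NAME": ["ProfileA", "ProfileB", "ProfileC", "ProfileD", "ProfileE"],
    "PROFILE URL": [
        "appexample.co/userxyz",
        "appexample.co/userxyz_1",
        "appexample.co/userabc",
        "appexample.co/userabc_3",
        "appexample.co/userjyl",
    ],
    "Final Addres": [
        "http://example.com",
        "http://example.com",
        "http://anotherexample.com",
        "http://anotherexample.com",
        "http://example123.com",
    ],
})

# Exact match on both columns: every PROFILE URL is unique, so nothing is flagged
both_cols = test.duplicated(subset=["PROFILE URL", "Final Addres"], keep=False)

# Exact match on the address only, flagging every member of a duplicate group
addr_all = test.duplicated(subset=["Final Addres"], keep=False)

# Same subset, but only occurrences after the first one are flagged
addr_later = test.duplicated(subset=["Final Addres"], keep="first")

# Rows that remain once the later occurrences are dropped
deduplicated = test.loc[~addr_later]

print(both_cols.tolist())             # [False, False, False, False, False]
print(addr_all.tolist())              # [True, True, True, True, False]
print(addr_later.tolist())            # [False, True, False, True, False]
print(deduplicated["NAME"].tolist())  # ['ProfileA', 'ProfileC', 'ProfileE']
```

On this particular sample, including 'PROFILE URL' in the subset flags no rows, because the profile URLs only partially overlap (userxyz vs. userxyz_1). Whether that matters depends on whether a partial URL match should count as a duplicate.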

User

#2 · Edited 2024-10-01 09:26:56

Use duplicated() with the keep parameter set to False, which lets us identify every duplicate:

df2 = df[df.duplicated(subset=['Final Addres'],keep=False)]

print(df2)


       NAME              PROFILE URL               Final Addres
0  ProfileA    appexample.co/userxyz         http://example.com
1  ProfileB  appexample.co/userxyz_1         http://example.com
2  ProfileC    appexample.co/userabc  http://anotherexample.com
3  ProfileD  appexample.co/userabc_3  http://anotherexample.com
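If the pairs themselves are needed, as in the question's loop, one possible approach (not taken from either answer) is to merge the table with itself on the address column and then keep only the pairs where one profile URL contains the other. This is a rough sketch under that assumption; the suffixes "_first" and "_second" and the variable names are chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "NAME": ["ProfileA", "ProfileB", "ProfileC", "ProfileD", "ProfileE"],
    "PROFILE URL": [
        "appexample.co/userxyz",
        "appexample.co/userxyz_1",
        "appexample.co/userabc",
        "appexample.co/userabc_3",
        "appexample.co/userjyl",
    ],
    "Final Addres": [
        "http://example.com",
        "http://example.com",
        "http://anotherexample.com",
        "http://anotherexample.com",
        "http://example123.com",
    ],
})

# Self-merge: pair up all rows that share the same final address
pairs = df.merge(df, on="Final Addres", suffixes=("_first", "_second"))

# Drop pairs that match a profile with itself
is_other_row = pairs["PROFILE URL_first"] != pairs["PROFILE URL_second"]

# Same substring test as the original loop: `first in second`
contains = pd.Series(
    [
        first in second
        for first, second in zip(pairs["PROFILE URL_first"], pairs["PROFILE URL_second"])
    ],
    index=pairs.index,
)

matches = pairs.loc[is_other_row & contains]
possible_dup = list(zip(matches["PROFILE URL_first"], matches["PROFILE URL_second"]))

for first, second in possible_dup:
    print("{} , {}".format(first, second))
```

The merge only builds pairs within each address group, so it avoids comparing every row against every other row, although a very large group of rows sharing one address would still produce many pairs.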
