Remove duplicates from a Pandas DataFrame if the duplicate values are in different columns on the next line

term_x     Intersections  term_y
boxers     1              briefs
briefs     1              boxers
babies     6              costumes
costumes   6              babies
babies     12             clothes
clothes    12             babies
babies     1              clothings
clothings  1              babies

2 Answers

User

#1 · Edited 2024-10-03 17:29:02

You can zip the two columns together, sort each pair, and then drop duplicate rows based on those sorted pairs:

df['together'] = [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]

df.drop_duplicates(subset=['together'])
Out[11]: 
   term_x  Intersections     term_y          together
0  boxers              1     briefs     boxers,briefs
2  babies              6   costumes   babies,costumes
4  babies             12    clothes    babies,clothes
6  babies              1  clothings  babies,clothings
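For readers who want to run this end to end, here is a minimal self-contained sketch of the same sorted-pair idea. The sample data is copied from the question; the names `pair_key` and `deduped` are my own, and using `Series.duplicated` as a mask (so the helper column never gets added to the frame) is a small variation, not what the answer itself does.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "term_x": ["boxers", "briefs", "babies", "costumes",
               "babies", "clothes", "babies", "clothings"],
    "Intersections": [1, 1, 6, 6, 12, 12, 1, 1],
    "term_y": ["briefs", "boxers", "costumes", "babies",
               "clothes", "babies", "clothings", "babies"],
})

# Build an order-independent key per row:
# ('briefs', 'boxers') and ('boxers', 'briefs') both become 'boxers,briefs'.
pair_key = pd.Series(
    [",".join(sorted(pair)) for pair in zip(df["term_x"], df["term_y"])],
    index=df.index,
)

# duplicated() marks every repeat after the first occurrence of a key,
# so negating it keeps only the first row of each mirrored pair.
deduped = df[~pair_key.duplicated()]
print(deduped)
```

Note that joining with a comma assumes the terms themselves contain no commas; if they might, a tuple key such as `tuple(sorted(pair))` avoids collisions.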

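Edit: You said that time is an important factor for this problem. Here are some timings comparing my solution with Allen's on a 200,000-row DataFrame:

[timing output missing from the source]

As you can see, my approach is more than 98% faster. pandas.DataFrame.apply is slow in many cases.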
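A comparison like this can be reproduced roughly as follows. This is only a sketch: the synthetic data, the vocabulary size, and the helper names `key_with_zip` and `key_with_apply` are assumptions of mine, not the original benchmark, and the numbers you get will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption, not the original benchmark): 200,000 rows
# of random term pairs drawn from a small hypothetical vocabulary.
rng = np.random.default_rng(0)
vocab = np.array([f"term{i}" for i in range(1000)])
n_rows = 200_000
bench_df = pd.DataFrame({
    "term_x": rng.choice(vocab, size=n_rows),
    "term_y": rng.choice(vocab, size=n_rows),
})


def key_with_zip(frame):
    """Sorted-pair key built with zip and a list comprehension."""
    return [",".join(p) for p in map(sorted, zip(frame["term_x"], frame["term_y"]))]


def key_with_apply(frame):
    """Sorted-pair key built with a row-wise DataFrame.apply."""
    return frame.apply(lambda r: str(sorted([r.term_x, r.term_y])), axis=1)


# Time a single run of each approach and report wall-clock seconds.
t_zip = timeit.timeit(lambda: key_with_zip(bench_df), number=1)
t_apply = timeit.timeit(lambda: key_with_apply(bench_df), number=1)
print(f"zip + sorted:  {t_zip:.3f} s")
print(f"apply(axis=1): {t_apply:.3f} s")
```

Only the key-building step is timed here, since the `drop_duplicates` call afterwards is the same for both approaches.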

User

#2 · Edited 2024-10-03 17:29:02

import pandas as pd

df = pd.DataFrame({'Intersections': {0: 1, 1: 1, 2: 6, 3: 6, 4: 12, 5: 12, 6: 1, 7: 1},
 'term_x': {0: 'boxers',1: 'briefs',2: 'babies',3: 'costumes',4: 'babies',
  5: 'clothes',6: 'babies',7: 'clothings'}, 'term_y': {0: 'briefs',1: 'boxers',
  2: 'costumes',3: 'babies',4: 'clothes',5: 'babies',6: 'clothings',7: 'babies'}})

# Create a column that combines term_x and term_y in sorted order.
df['team_xy'] = df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1)
# Drop duplicates on the combined field.
df.drop_duplicates(subset='team_xy',inplace=True)

df
Out[916]: 
   Intersections  term_x     term_y                  team_xy
0              1  boxers     briefs     ['boxers', 'briefs']
2              6  babies   costumes   ['babies', 'costumes']
4             12  babies    clothes    ['babies', 'clothes']
6              1  babies  clothings  ['babies', 'clothings']
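One thing worth noting is that deduplicating on the pair key alone ignores the Intersections value. In the question's data the mirrored rows always carry the same count, so it makes no difference there. The sketch below uses a small hypothetical frame (the third row and its count of 7 are invented for illustration) to show how adding the count to `subset` changes which rows survive, and how the helper column can be dropped afterwards.

```python
import pandas as pd

# Hypothetical data: the last row mirrors the first pair but has a different count.
df = pd.DataFrame({
    "term_x": ["boxers", "briefs", "briefs"],
    "term_y": ["briefs", "boxers", "boxers"],
    "Intersections": [1, 1, 7],
})

# Sorted-pair key stored in a helper column.
df["team_xy"] = [str(sorted(pair)) for pair in zip(df["term_x"], df["term_y"])]

# Deduplicate on the pair alone: only the first row survives.
by_pair = df.drop_duplicates(subset="team_xy")

# Deduplicate on the pair plus the count: the row whose count differs is kept too.
by_pair_and_count = df.drop_duplicates(subset=["team_xy", "Intersections"])

# Remove the helper column once it has served its purpose.
by_pair = by_pair.drop(columns="team_xy")
by_pair_and_count = by_pair_and_count.drop(columns="team_xy")
print(by_pair)
print(by_pair_and_count)
```

Which variant is appropriate depends on whether a mirrored pair with a different count should be treated as a distinct record.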
