Remove reverse duplicates from a data frame

import pandas as pd

# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
    A   B
0   0  50
1  10  22
2  11  35
3  21   5
4  22  10
5  35  11
6   5  21
7  50   0

# Desired output with "duplicates" removed.
data2 = pd.DataFrame({'A': [0, 5, 10, 11],
                      'B': [50, 21, 22, 35]})
data2
    A   B
0   0  50
1   5  21
2  10  22
3  11  35

3 Answers

User

Answer 1 · edited 2024-06-23 20:06:53

You can sort each row of the data frame before dropping the duplicates:

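# result_type='broadcast' (pandas 0.23+) keeps the A and B columns when the lambda returns a list
data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates()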

#   A    B
#0  0   50
#1  10  22
#2  11  35
#3  5   21

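If you want the result sorted by column A:

# result_type='broadcast' (pandas 0.23+) keeps the A and B columns when the lambda returns a list
data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates().sort_values('A')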

#   A    B
#0  0   50
#3  5   21
#1  10  22
#2  11  35
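
For anyone who wants to check this against the desired output from the question, here is a minimal end-to-end sketch. It assumes a pandas version that supports `result_type='broadcast'` (0.23 or later); the names `sorted_rows`, `result` and `expected` are mine, and the final `reset_index(drop=True)` is an addition so the index matches `data2`.

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Sort the values inside each row; 'broadcast' keeps the original
# shape and column labels even though the lambda returns a plain list.
sorted_rows = data.apply(lambda r: sorted(r), axis=1, result_type='broadcast')

# Drop repeated (A, B) pairs, order by A, and renumber the index.
result = (sorted_rows.drop_duplicates()
                     .sort_values('A')
                     .reset_index(drop=True))

# Compare with the desired frame from the question (data2 there).
expected = pd.DataFrame({'A': [0, 5, 10, 11], 'B': [50, 21, 22, 35]})
print(result)
print(result.equals(expected))
```

Note that this rewrites each row so the smaller value ends up in A, which happens to be what the desired output asks for.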

User

Answer 2 · edited 2024-06-23 20:06:53

Here is an uglier but faster solution (it assumes NumPy has been imported as np):

In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
    A   B
0   0  50
1  10  22
2  11  35
3   5  21
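
One thing to be aware of: the NumPy approach above returns the sorted values, so a row like (21, 5) comes back as (5, 21). If you would rather keep the first occurrence of each pair exactly as it appears in the data, a possible variation (my own sketch, not part of the answer above) is to use the sorted array only to build a boolean mask. The names `sorted_vals`, `is_repeat` and `kept` are just illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Sort the two values in every row so (21, 5) and (5, 21) look identical.
sorted_vals = np.sort(data.to_numpy(), axis=1)

# Flag rows whose sorted pair has already been seen further up.
is_repeat = pd.DataFrame(sorted_vals, index=data.index).duplicated()

# Keep the first occurrence of each pair with its original A/B order.
kept = data[~is_repeat]
print(kept)
```

Whether you want the sorted or the original orientation depends on the use case; the desired output in the question corresponds to the sorted form.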

Timing for an 8K-row DataFrame:

In [50]: big = pd.concat([data] * 10**3, ignore_index=True)

In [51]: big.shape
Out[51]: (8000, 2)

In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop

In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop

In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
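
If you are not working in IPython, a rough way to repeat this comparison with the standard `timeit` module might look like the sketch below. The helper names `with_apply` and `with_numpy` are hypothetical, the `apply` variant assumes pandas 0.23+ for `result_type='broadcast'`, and the numbers you get will depend on your machine and library versions, so do not expect them to match the figures above.

```python
import timeit

import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
big = pd.concat([data] * 10**3, ignore_index=True)  # 8000 rows


def with_apply(df):
    # Row-wise Python sort; 'broadcast' keeps the frame shape and labels.
    return df.apply(lambda r: sorted(r), axis=1,
                    result_type='broadcast').drop_duplicates()


def with_numpy(df):
    # Vectorised sort of the underlying array, then rebuild the frame.
    return pd.DataFrame(np.sort(df.to_numpy(), axis=1),
                        columns=df.columns).drop_duplicates()


for fn in (with_apply, with_numpy):
    runs = 3
    seconds = timeit.timeit(lambda: fn(big), number=runs) / runs
    print(f"{fn.__name__}: {seconds:.4f} s per call")
```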

User

Answer 3 · edited 2024-06-23 20:06:53

df.T.apply(sorted).T.drop_duplicates()
