Pandas filter performs slowly because of too many groups

    pborderid   pbcarid wsid  to_wpadr  colli          pk_end_time
10   76079450  61838497  hp1  523-369p      1  2016-07-01 00:00:38
11   76079450  61838504  hp1  523-370p      1  2016-07-01 00:00:47
12   76079450  61838110  hp1  523-372p      1  2016-07-01 00:01:05
13   76079450  61838225  hp1  523-372p      2  2016-07-01 00:01:13
14   76079450  61838504  hp1  523-372p      3  2016-07-01 00:01:30
15   76079450  61838497  hp1  523-373p      1  2016-07-01 00:01:45
16   76079450  61838504  hp1  523-377p      1  2016-07-01 00:01:55
17   76079450  61838110  hp1  523-376p      5  2016-07-01 00:02:26
18   76079450  61838225  hp1  523-376p      1  2016-07-01 00:02:33
19   76079450  61838497  hp1  523-376p      6  2016-07-01 00:02:55

    pborderid   pbcarid wsid  to_wpadr  colli          pk_end_time
12   76079450  61838110  hp1  523-372p      1  2016-07-01 00:01:05
13   76079450  61838225  hp1  523-372p      2  2016-07-01 00:01:13
14   76079450  61838504  hp1  523-372p      3  2016-07-01 00:01:30
17   76079450  61838110  hp1  523-376p      5  2016-07-01 00:02:26
18   76079450  61838225  hp1  523-376p      1  2016-07-01 00:02:33
19   76079450  61838497  hp1  523-376p      6  2016-07-01 00:02:55
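The question's own code is not reproduced above. As a hedged sketch, assuming the slow pattern the title refers to is a groupby(...).filter(...) call keyed on pborderid and to_wpadr (an assumption, not taken from the source), the second table can be obtained from the first roughly like this. Only the two key columns are rebuilt here, and the names df, keys, slow and fast are illustrative.

```python
import pandas as pd

# Key columns of the ten sample rows, keeping the original index labels 10..19.
df = pd.DataFrame(
    {
        "pborderid": [76079450] * 10,
        "to_wpadr": [
            "523-369p", "523-370p", "523-372p", "523-372p", "523-372p",
            "523-373p", "523-377p", "523-376p", "523-376p", "523-376p",
        ],
    },
    index=range(10, 20),
)

keys = ["pborderid", "to_wpadr"]

# Assumed baseline: the lambda is called once per group in Python, which
# becomes expensive when there are many small groups.
slow = df.groupby(keys).filter(lambda g: len(g) > 1)

# Same selection without a per-group Python call: broadcast each group's
# size back onto its rows and keep rows whose group has more than one member.
fast = df[df.groupby(keys)["to_wpadr"].transform("size") > 1]

print(slow.index.tolist())  # [12, 13, 14, 17, 18, 19]
```

Both selections keep the rows in their original order, so they can be compared directly with the second table.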

2 answers

User

Answer 1 · edited 2024-09-28 21:06:56

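Inspired by a related solution, the grouping step can also be replaced in this case. The implementation would look something like this: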

import numpy as np

# Create numerical IDs for relevant columns and a combined one
ID1 = np.unique(df['pborderid'],return_inverse=True)[1]
ID2 = np.unique(df['to_wpadr'],return_inverse=True)[1]
ID = np.column_stack((ID1,ID2))

# Convert to linear indices
lidx = np.ravel_multi_index(ID.T,ID.max(0)+1)

# Get unique IDs for each element based on grouped uniqueness and group counts
_,ID,count = np.unique(lidx,return_inverse=True,return_counts=True)

# Look for counts>1 and collect respective IDs and thus respective rows off df
df_out = df[np.isin(ID,np.where(count>1)[0])]

Sample run: [output not preserved in the source]
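As a hedged illustration of what such a run might look like, the NumPy steps can be applied to the key columns of the ten sample rows from the question. The helper name rows_in_repeated_groups is hypothetical, and the final mask uses count[group_id] > 1, which should select the same rows as the membership test in the answer.

```python
import numpy as np
import pandas as pd

# Key columns of the sample rows, index labels 10..19 as in the question.
suffixes = (369, 370, 372, 372, 372, 373, 377, 376, 376, 376)
df = pd.DataFrame(
    {"pborderid": 76079450, "to_wpadr": [f"523-{n}p" for n in suffixes]},
    index=range(10, 20),
)


def rows_in_repeated_groups(frame, col_a, col_b):
    """Hypothetical helper: keep rows whose (col_a, col_b) pair occurs more than once."""
    # Integer code for every value in each key column.
    id_a = np.unique(frame[col_a].to_numpy(), return_inverse=True)[1]
    id_b = np.unique(frame[col_b].to_numpy(), return_inverse=True)[1]

    # Fold the two codes into a single integer per pair.
    shape = (id_a.max() + 1, id_b.max() + 1)
    lidx = np.ravel_multi_index((id_a, id_b), shape)

    # group_id maps each row to its pair; count holds the size of each pair.
    _, group_id, count = np.unique(lidx, return_inverse=True, return_counts=True)
    return frame[count[group_id] > 1]


df_out = rows_in_repeated_groups(df, "pborderid", "to_wpadr")
print(df_out.index.tolist())  # [12, 13, 14, 17, 18, 19]
```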

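Runtime tests on my end do not seem to show any improvement for this approach over the groupby approach listed in the other solution. So it looks like df.groupby would be the preferred way to go!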

User

Answer 2 · edited 2024-09-28 21:06:56

I don't know whether it will be faster, but you can try using DataFrame.duplicated to keep only the duplicated rows.

ap = ot[ot.duplicated(subset=['pborderid', 'to_wpadr'], keep=False)]
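The keep=False argument matters here. A minimal sketch on a made-up four-row frame (toy and its values are illustrative, not from the question) shows the difference from the default keep='first', which would leave the first row of every repeated group unmarked.

```python
import pandas as pd

# Illustrative data: rows 0 and 1 share the same (pborderid, to_wpadr) pair.
toy = pd.DataFrame(
    {"pborderid": [1, 1, 1, 2], "to_wpadr": ["a", "a", "b", "a"]}
)
keys = ["pborderid", "to_wpadr"]

# Default keep='first': only later occurrences are flagged.
later_only = toy.duplicated(subset=keys)
print(later_only.tolist())  # [False, True, False, False]

# keep=False: every member of a repeated group is flagged.
all_members = toy.duplicated(subset=keys, keep=False)
print(all_members.tolist())  # [True, True, False, False]
```

With keep=False the boolean mask selects whole groups of size two or more, which is what the filter in the question needs.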

Timings on a DataFrame with 1M rows: [timings not preserved in the source]
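Since the original timings are not available, a comparison can be rerun locally. The sketch below is one possible harness under stated assumptions: the data is synthetic, the sizes are deliberately small so that it finishes quickly, and the groupby(...).filter(...) baseline is assumed rather than taken from the question. It prints whatever it measures and makes no claim about which approach wins.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): many small groups.
rng = np.random.default_rng(0)
n_rows = 10_000  # raise towards 1_000_000 to approximate the original test
ot = pd.DataFrame(
    {
        "pborderid": rng.integers(0, 1_000, size=n_rows),
        "to_wpadr": rng.integers(0, 10, size=n_rows).astype(str),
    }
)
keys = ["pborderid", "to_wpadr"]


def with_duplicated():
    # Approach from the answer above.
    return ot[ot.duplicated(subset=keys, keep=False)]


def with_groupby_filter():
    # Assumed baseline: one Python call per group.
    return ot.groupby(keys).filter(lambda g: len(g) > 1)


# Check that both approaches select the same rows before timing them.
assert with_duplicated().equals(with_groupby_filter())

for fn in (with_duplicated, with_groupby_filter):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per call")
```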
