Answers to the question: grouping duplicate column IDs in a pandas DataFrame


1 answer

Anonymous, 1 day ago

Here is one approach using NumPy:

<pre><code>import numpy as np
import pandas as pd

def group_duplicate_cols(df):
    a = df.values
    # Sort the columns lexicographically so identical columns become adjacent
    sidx = np.lexsort(a)
    b = a[:, sidx]

    # Mask of adjacent column pairs that match in every row, padded with False
    m = np.concatenate(([False], (b[:, 1:] == b[:, :-1]).all(0), [False]))

    # Start/stop positions of each run of identical columns
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]
</code></pre>

Sample run: [output missing from the source]

To do the same for rows (the index), we only need to switch the operations to the other axis:

<pre><code>def group_duplicate_rows(df):
    a = df.values
    # Sort the rows lexicographically so identical rows become adjacent
    sidx = np.lexsort(a.T)
    b = a[sidx]

    # Mask of adjacent row pairs that match in every column, padded with False
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))

    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.index[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]
</code></pre>

Sample run:

<pre><code>In [260]: df2
Out[260]:
   a1  a2  a3  a4  a5
A   3   5   3   4   5
B   1   1   1   1   1
C   3   5   3   4   5
D   2   9   2   1   9
E   2   2   2   1   2
F   1   1   1   1   1

In [261]: group_duplicate_rows(df2)
Out[261]: [['B', 'F'], ['A', 'C']]
</code></pre>

<hr/>

<h2>Benchmarking</h2>

Approaches:

<pre><code># @John Galt's soln-1
from itertools import combinations

def combinations_app(df):
    return [x for x in combinations(df.columns, 2)
            if (df[x[0]] == df[x[-1]]).all()]

# @Abdou's soln
def pandas_groupby_app(df):
    return [tuple(d.index) for _, d in df.T.groupby(list(df.T.columns))
            if len(d) > 1]

# @COLDSPEED's soln
def triu_app(df):
    c = df.columns.tolist()
    i, j = np.triu_indices(len(c), 1)
    x = [(c[_i], c[_j]) for _i, _j in zip(i, j)
         if (df[c[_i]] == df[c[_j]]).all()]
    return x

# @cmaher's soln
def lambda_set_app(df):
    return list(filter(lambda x: len(x) > 1,
                       list(set([tuple([x for x in df.columns if all(df[x] == df[y])])
                                 for y in df.columns]))))
</code></pre>

Note: <code>@John Galt's soln-2</code> is not included because, for inputs of size <code>(8000,500)</code>, the <code>broadcasting</code> proposed there would blow up memory.

Timings:

<pre><code>In [179]: # Setup inputs with sizes as mentioned in the question
     ...: df = pd.DataFrame(np.random.randint(0,10,(8000,500)))
     ...: df.columns = ['C'+str(i) for i in range(df.shape[1])]
     ...: idx0 = np.random.choice(df.shape[1], df.shape[1]//2, replace=False)
     ...: idx1 = np.random.choice(df.shape[1], df.shape[1]//2, replace=False)
     ...: df.iloc[:,idx0] = df.iloc[:,idx1].values
     ...:

# @John Galt's soln-1
In [180]: %timeit combinations_app(df)
1 loops, best of 3: 24.6 s per loop

# @Abdou's soln
In [181]: %timeit pandas_groupby_app(df)
1 loops, best of 3: 3.81 s per loop

# @COLDSPEED's soln
In [182]: %timeit triu_app(df)
1 loops, best of 3: 25.5 s per loop

# @cmaher's soln
In [183]: %timeit lambda_set_app(df)
1 loops, best of 3: 27.1 s per loop

# Proposed in this post
In [184]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 188 ms per loop
</code></pre>

<hr/>

<h2>Super boost with NumPy's view functionality</h2>

NumPy's view functionality lets us treat each group of elements (here, a whole column) as a single item of one dtype. Sorting on that view gives a further noticeable performance boost:

<pre><code>def view1D(a):  # a is a 2D array
    # View each row of a as one opaque void item spanning the row's bytes
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def group_duplicate_cols_v2(df):
    a = df.values
    # Transpose so that each column becomes one void item, then sort those
    sidx = view1D(a.T).argsort()
    b = a[:, sidx]

    m = np.concatenate(([False], (b[:, 1:] == b[:, :-1]).all(0), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]
</code></pre>

Timings:

<pre><code>In [322]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 185 ms per loop

In [323]: %timeit group_duplicate_cols_v2(df)
10 loops, best of 3: 69.3 ms per loop
</code></pre>

Just crazy speedups!
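As a quick check of the column-grouping idea described in this answer, here is a minimal, self-contained sketch. The frame `df1` is hypothetical sample data (it is the transpose of the `df2` values shown in the row example, so columns `A`/`C` and `B`/`F` are duplicates), and the function body repeats the answer's `group_duplicate_cols` so the block runs on its own. The order of the groups depends on the lexicographic sort, so the result is normalized before printing.

```python
import numpy as np
import pandas as pd


def group_duplicate_cols(df):
    a = df.values
    # Lexicographic sort of the columns puts identical columns side by side
    sidx = np.lexsort(a)
    b = a[:, sidx]
    # True where a sorted column equals its left neighbour in every row
    m = np.concatenate(([False], (b[:, 1:] == b[:, :-1]).all(0), [False]))
    # Edges of each run of True values mark the start/stop of a group
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]


# Hypothetical sample data: A == C and B == F, while D and E are unique
df1 = pd.DataFrame({
    "A": [3, 5, 3, 4, 5],
    "B": [1, 1, 1, 1, 1],
    "C": [3, 5, 3, 4, 5],
    "D": [2, 9, 2, 1, 9],
    "E": [2, 2, 2, 1, 2],
    "F": [1, 1, 1, 1, 1],
})

groups = group_duplicate_cols(df1)
# Normalize ordering so the result does not depend on the sort order
normalized = sorted(sorted(g) for g in groups)
print(normalized)
```

Columns that have no duplicate, such as `D` and `E` here, do not appear in any group, and a frame without duplicate columns yields an empty list.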
