Grouping duplicate column IDs in a pandas DataFrame

Posted 2024-10-17 16:22:03


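There are many similar questions, but most of them answer how to drop duplicate columns. Instead, I want to know how to build a list of tuples in which each tuple contains the names of columns that are duplicates of one another. I'm assuming every column has a unique name. To illustrate my question: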

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5],'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2],'F': [1, 1, 1, 1, 1]},
                  index = ['a1', 'a2', 'a3', 'a4', 'a5'])

I then want the output:

[('A', 'C'), ('B', 'D')]

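And if you're feeling great today, extend the same question to rows: how do I get a list of tuples in which each tuple contains the labels of duplicate rows?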


3 answers

This should also do it:

[tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1]

Output:

[('A', 'C'), ('B', 'D')]
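To see why grouping the transposed frame works, here is a minimal sketch that prints every group, including the ones holding a single column, before filtering. It reuses the `df` from the question; the names `t`, `members` and `dupes` are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

t = df.T  # original columns 'A'..'F' become the row labels

# Group the transposed rows by all of their values at once, so rows with
# identical values (identical original columns) land in the same group.
members = {key: tuple(g.index) for key, g in t.groupby(list(t.columns))}
for key, names in members.items():
    print(key, names)

# Keep only the groups that hold more than one original column.
dupes = [names for names in members.values() if len(names) > 1]
print(dupes)  # [('A', 'C'), ('B', 'D')]
```

Because `groupby` sorts the group keys by default, the groups come out ordered by column values rather than by column position.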

Here's a NumPy-based approach -

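import numpy as np

def group_duplicate_cols(df):
    a = df.values
    # Sort the columns so that identical columns end up next to each other
    sidx = np.lexsort(a)
    b = a[:,sidx]

    # Flag adjacent sorted columns that match, padded with False at both ends
    m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
    # Edges of the True runs give the start/stop of each duplicate group
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]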

Sample run (on the df from the question) -

group_duplicate_cols(df)
# [['A', 'C'], ['B', 'D']]
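The `m` / `idx` step in `group_duplicate_cols` is the least obvious part, so here it is in isolation as a small sketch. The helper name `runs_from_flags` is hypothetical; the flags below are the adjacent-equality results for the question's `df`, whose columns sort into the order F, E, A, C, B, D.

```python
import numpy as np

def runs_from_flags(eq, labels):
    """Turn adjacent-equality flags into groups of labels (hypothetical helper)."""
    # Pad with False so every run of True has a rising and a falling edge.
    m = np.concatenate(([False], eq, [False]))
    # Positions where the mask changes: even entries start a run, odd ones end it.
    idx = np.flatnonzero(m[1:] != m[:-1])
    # A run of k True flags covers k + 1 labels, hence the +1 on the stop index.
    return [labels[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

labels = ['F', 'E', 'A', 'C', 'B', 'D']
eq = np.array([False, False, True, False, True])  # A == C and B == D
print(runs_from_flags(eq, labels))  # [['A', 'C'], ['B', 'D']]

# Three identical neighbours produce two True flags and one group of three.
print(runs_from_flags(np.array([True, True, False]), ['X', 'Y', 'Z', 'W']))
# [['X', 'Y', 'Z']]
```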

To do the same for rows (the index), we just need to switch the operations to the other axis, like so -

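def group_duplicate_rows(df):
    a = df.values
    # Same idea along the other axis: sort rows so identical rows are adjacent
    sidx = np.lexsort(a.T)
    b = a[sidx]

    # Flag adjacent sorted rows that match, then locate the edges of each run
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.index[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]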

Sample run -

In [260]: df2
Out[260]: 
   a1  a2  a3  a4  a5
A   3   5   3   4   5
B   1   1   1   1   1
C   3   5   3   4   5
D   2   9   2   1   9
E   2   2   2   1   2
F   1   1   1   1   1

In [261]: group_duplicate_rows(df2)
Out[261]: [['B', 'F'], ['A', 'C']]
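As a cross-check for the row case, a pandas-only sketch can first keep the rows flagged by `DataFrame.duplicated(keep=False)` and then group what is left. This is an alternative for comparison, not the method proposed in this answer; `df2` is rebuilt here from the sample run above, and `dup` and `groups` are illustrative names.

```python
import pandas as pd

df2 = pd.DataFrame([[3, 5, 3, 4, 5],
                    [1, 1, 1, 1, 1],
                    [3, 5, 3, 4, 5],
                    [2, 9, 2, 1, 9],
                    [2, 2, 2, 1, 2],
                    [1, 1, 1, 1, 1]],
                   index=list('ABCDEF'),
                   columns=['a1', 'a2', 'a3', 'a4', 'a5'])

# keep=False marks every row that has at least one identical twin.
dup = df2[df2.duplicated(keep=False)]

# Group the remaining rows by all of their values to collect the labels.
groups = [list(g.index) for _, g in dup.groupby(list(dup.columns))]
print(groups)  # [['B', 'F'], ['A', 'C']]
```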

Benchmarking

Approaches -

# @John Galt's soln-1
from itertools import combinations
def combinations_app(df):
    return [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]

# @Abdou's soln
def pandas_groupby_app(df):
    return [tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1]

# @COLDSPEED's soln
def triu_app(df):
    c = df.columns.tolist()
    i, j = np.triu_indices(len(c), 1)
    x = [(c[_i], c[_j]) for _i, _j in zip(i, j) if (df[c[_i]] == df[c[_j]]).all()]
    return x

# @cmaher's soln
def lambda_set_app(df):
    return list(filter(lambda x: len(x) > 1, list(set([tuple([x for x in df.columns if all(df[x] == df[y])]) for y in df.columns]))))

Note: @John Galt's soln-2 wasn't included because inputs of size (8000,500) would blow up memory with the broadcasting it proposes.

Timings -

In [179]: # Setup inputs with sizes as mentioned in the question
     ...: df = pd.DataFrame(np.random.randint(0,10,(8000,500)))
     ...: df.columns = ['C'+str(i) for i in range(df.shape[1])]
     ...: idx0 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
     ...: idx1 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
     ...: df.iloc[:,idx0] = df.iloc[:,idx1].values
     ...: 

# @John Galt's soln-1
In [180]: %timeit combinations_app(df)
1 loops, best of 3: 24.6 s per loop

# @Abdou's soln
In [181]: %timeit pandas_groupby_app(df)
1 loops, best of 3: 3.81 s per loop

# @COLDSPEED's soln
In [182]: %timeit triu_app(df)
1 loops, best of 3: 25.5 s per loop

# @cmaher's soln
In [183]: %timeit lambda_set_app(df)
1 loops, best of 3: 27.1 s per loop

# Proposed in this post
In [184]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 188 ms per loop
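Before trusting timings like these, it is worth confirming that the approaches return the same groups. Below is a rough sketch of such a check plus a small timing loop using the standard `timeit` module. The input is deliberately smaller than the (8000,500) frame above, and `as_sets`, the seed and the sizes are assumptions made for illustration.

```python
import timeit
import numpy as np
import pandas as pd

def group_duplicate_cols(df):
    a = df.values
    sidx = np.lexsort(a)
    b = a[:, sidx]
    m = np.concatenate(([False], (b[:, 1:] == b[:, :-1]).all(0), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

def pandas_groupby_app(df):
    return [tuple(d.index) for _, d in df.T.groupby(list(df.T.columns)) if len(d) > 1]

def as_sets(groups):
    # Ignore group order and list-vs-tuple differences when comparing results.
    return {frozenset(g) for g in groups}

# Small random input with some columns copied over others (sizes are arbitrary).
rng = np.random.default_rng(0)
n_rows, n_cols = 100, 40
a = rng.integers(0, 10, (n_rows, n_cols))
src = rng.choice(n_cols, n_cols // 2, replace=False)
dst = rng.choice(n_cols, n_cols // 2, replace=False)
a[:, dst] = a[:, src]
df = pd.DataFrame(a, columns=['C' + str(i) for i in range(n_cols)])

# Both approaches should find the same duplicate groups.
assert as_sets(group_duplicate_cols(df)) == as_sets(pandas_groupby_app(df))

for func in (pandas_groupby_app, group_duplicate_cols):
    secs = timeit.timeit(lambda: func(df), number=5) / 5
    print(f'{func.__name__}: {secs * 1e3:.2f} ms per call')
```

The absolute numbers will differ from the ones quoted above, since they depend on the machine, the library versions and the input size.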

Super boost with NumPy's view functionality

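By leveraging NumPy's view functionality, which lets us view each group of elements as a single dtype, we can get an even more noticeable performance boost, like so -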

def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def group_duplicate_cols_v2(df):
    a = df.values
    sidx = view1D(a.T).argsort()
    b = a[:,sidx]

    m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
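To make the view trick more concrete, the sketch below applies `view1D` to a tiny array and checks that identical rows become identical single elements. The array `a` and the variable names are made up for illustration. One caveat: the order produced by sorting these byte views is generally not numeric order, but identical rows still end up adjacent, which is all the grouping step needs.

```python
import numpy as np

def view1D(a):  # a is a 2D array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [1, 2, 3]], dtype=np.int64)

v = view1D(a)
print(v.shape)           # (3,) : one opaque element per row
print(v.dtype.itemsize)  # 24   : 3 values * 8 bytes each

# Rows 0 and 2 hold the same bytes, so their views are identical.
print(v[0].tobytes() == v[2].tobytes())  # True
print(v[0].tobytes() == v[1].tobytes())  # False

# Sorting the 1D view places identical rows next to each other.
order = v.argsort()
pos = np.empty_like(order)
pos[order] = np.arange(len(order))  # pos[i] = rank of row i after sorting
print(abs(pos[0] - pos[2]))         # 1
```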

Timings -

In [322]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 185 ms per loop

In [323]: %timeit group_duplicate_cols_v2(df)
10 loops, best of 3: 69.3 ms per loop

Just crazy speedups!

Here is a one-liner

In [22]: from itertools import combinations

In [23]: [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]
Out[23]: [('A', 'C'), ('B', 'D')]

Or, use NumPy broadcasting. Better yet, look at Divakar's solution.
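A broadcasting version could look roughly like the sketch below. This is a reconstruction of the general idea under stated assumptions, not necessarily the code originally posted; `eq`, `i`, `j` and `pairs` are illustrative names. It compares every column against every other column at once, so the intermediate boolean array has n_rows * n_cols**2 elements and gets very large on wide frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

a = df.values

# eq[i, j] is True when column i equals column j in every row.
eq = (a[:, :, None] == a[:, None, :]).all(axis=0)

# Keep only the strict upper triangle so each pair appears once (i < j).
i, j = np.nonzero(np.triu(eq, k=1))
pairs = list(zip(df.columns[i], df.columns[j]))
print(pairs)  # [('A', 'C'), ('B', 'D')]
```

Like the `combinations` one-liner, this yields pairs, so three identical columns would show up as three pairs rather than one group.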

