Identifying duplicates and their corresponding indices

import pandas as pd

df = pd.DataFrame({"Chromosome": [1, 1, 1, 1, 1],
                   "Position": [100, 220, 300, 100, 220],
                   "Gene": ["CHD1", "BRCA2", "TP53", "CHD1", "BRCA2"],
                   "SAMPLE": ["A1", "A2", "A3", "A4", "A5"]})
df

Output:

   Chromosome  Position   Gene SAMPLE
0           1       100   CHD1     A1
1           1       220  BRCA2     A2
2           1       300   TP53     A3
3           1       100   CHD1     A4
4           1       220  BRCA2     A5

df_new

Output:

   Chromosome  Position   Gene   SAMPLES  Count
0           1       100   CHD1  [A1, A4]      2
1           1       220  BRCA2  [A2, A5]      2
2           1       300   TP53        A3      1

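import numpy as np

# Sample IDs and the first three (key) columns as string arrays
Samples = np.array(master_df['Sample_ID'], dtype=str)
temp_array = np.array(master_df[master_df.columns[0:3]], dtype=str)

# Unique key rows, first-occurrence indices, inverse mapping and counts
temp_unq, ind1, inv1, cnts1 = np.unique(temp_array, return_index=True,
                                        return_inverse=True,
                                        return_counts=True, axis=0)

# Collect the sample IDs that belong to each unique key row
s1 = [[] for i in cnts1]
for i in range(temp_unq.shape[0]):
    whr = np.where(inv1 == i)[0]
    s1[i].extend(Samples[whr])  # extend (not append) to avoid a nested list

unq_combo = master_df.iloc[ind1]
unq_combo = unq_combo.reset_index(drop=True)
unq_combo['Counts'] = pd.Series(cnts1)
unq_combo['Samples#'] = pd.Series(s1)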

2 Answers

User

#1 · Edited on 2024-09-29 17:14:05

I used groupby with an aggregation dict to return the members of each group as a list (see SO post).

Create the data (using the code from the OP)

df = pd.DataFrame({"Chromosome":[1, 1, 1, 1, 1],
               'Position': [100, 220,300,100,220],
               "Gene":["CHD1","BRCA2","TP53","CHD1", "BRCA2"], 
               "SAMPLE":["A1","A2","A3","A4", "A5"]})
print(df)
   Chromosome  Position   Gene SAMPLE
0           1       100   CHD1     A1
1           1       220  BRCA2     A2
2           1       300   TP53     A3
3           1       100   CHD1     A4
4           1       220  BRCA2     A5

Perform the groupby with the aggregation dict

agg_dict = {'SAMPLE':[list, 'count']}
grouped = (
    df.groupby(['Chromosome','Position','Gene'], as_index=False)
    .agg(agg_dict)
    )
grouped.columns = grouped.columns.map(' '.join).str.strip()
print(grouped)

   Chromosome  Position   Gene SAMPLE list  SAMPLE count
0           1       100   CHD1    [A1, A4]             2
1           1       220  BRCA2    [A2, A5]             2
2           1       300   TP53        [A3]             1

Edit

Modified to reflect the change to the sample data in the OP.
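
The column-flattening step in the groupby above could probably be skipped by using pandas named aggregation, which lets each output column be named directly. The following is only a sketch, assuming pandas 0.25 or newer; the output names SAMPLES and Count are taken from the desired df_new in the question, not from this answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Chromosome": [1, 1, 1, 1, 1],
    "Position": [100, 220, 300, 100, 220],
    "Gene": ["CHD1", "BRCA2", "TP53", "CHD1", "BRCA2"],
    "SAMPLE": ["A1", "A2", "A3", "A4", "A5"],
})

# Named aggregation: output_column=(input_column, aggregation function).
# The result has flat column names, so no MultiIndex flattening is needed.
df_new = df.groupby(["Chromosome", "Position", "Gene"], as_index=False).agg(
    SAMPLES=("SAMPLE", list),
    Count=("SAMPLE", "count"),
)
print(df_new)
```

With this form a group that has a single sample still gets a one-element list, so every entry of SAMPLES has the same type.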

User

#2 · Edited on 2024-09-29 17:14:05

Using groupby and agg:

df.groupby(['Chromosome', 'Position', 'Gene']).SAMPLE.agg([list, 'count'])
                               list  count
Chromosome Position Gene                  
1          100      CHD1   [A1, A4]      2
           220      BRCA2  [A2, A5]      2
           300      TP53       [A3]      1

(df.groupby(['Chromosome', 'Position', 'Gene']).SAMPLE
   .agg([list, 'count'])
   .reset_index())

   Chromosome  Position   Gene      list  count
0           1       100   CHD1  [A1, A4]      2
1           1       220  BRCA2  [A2, A5]      2
2           1       300   TP53      [A3]      1
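
The title of the question also asks for the indices of the duplicated rows, which the groupby calls above do not return. One possible way to get them, sketched here under the assumption that the DataFrame has the default integer index, is to move the index into a column with reset_index and aggregate it into a list alongside the samples. The names keys, dup_mask, out and Rows are illustrative and do not come from the original posts.

```python
import pandas as pd

df = pd.DataFrame({
    "Chromosome": [1, 1, 1, 1, 1],
    "Position": [100, 220, 300, 100, 220],
    "Gene": ["CHD1", "BRCA2", "TP53", "CHD1", "BRCA2"],
    "SAMPLE": ["A1", "A2", "A3", "A4", "A5"],
})

keys = ["Chromosome", "Position", "Gene"]

# True for every row whose key combination occurs more than once.
dup_mask = df.duplicated(subset=keys, keep=False)

# reset_index() exposes the row labels as a column named "index",
# which can then be collected per group like any other column.
out = (
    df.reset_index()
      .groupby(keys, as_index=False)
      .agg(Rows=("index", list),
           SAMPLES=("SAMPLE", list),
           Count=("SAMPLE", "count"))
)
print(df[dup_mask])
print(out)
```

Here dup_mask selects the duplicated rows themselves, while the Rows column of out records which original row positions fall into each group.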
