How to count repeated elements in a DataFrame and give each one a running number

Posted 2024-05-01 14:07:57


import pandas as pd

data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
                     'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'], 
           'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
                       'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12']}
df1 = pd.DataFrame(data)
df1

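For the sample code above, I want to count the duplicates in the "Sample" column within each location group and assign each one a repeat number in a new "Repeat Number" column. For example, location group A contains four S1 entries; the first S1 should get repeat number 1, the second S1 repeat number 2, and so on. Location group B contains three S1 entries; again the first gets repeat number 1, the second gets 2, and so on.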

The desired result looks like this:

data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
                      'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'], 
            'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
                        'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12'],
       'Repeat Number':['1', '2', '3' ,'4' ,'1' ,'2' ,'1' ,'2', '1', '1', '1', '2',
                      '1', '2', '3' ,'1' ,'2' ,'3' ,'1' ,'1', '2', '1', '2', '1',]}
df1 = pd.DataFrame(data)
df1

3 answers

@ansev's (original) answer only works when the Sample column is already sorted within each Location group, because it compares Sample with Sample.shift().

If that is not the case, either call sort_values first, or group by both the Sample column and the result of df1['Location'].str.extract('(^[A-Z])'):

df1['Repeat Number'] = df1.groupby(['Sample', df1['Location'].str.extract('(^[A-Z])')[0]]).cumcount() + 1
print(df1)

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
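The sort_values route mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: the five-row frame and the helper names (group, key, ordered) are hypothetical and not part of the question. The idea is to sort so that equal (location group, sample) pairs become adjacent, number the consecutive runs, and then rely on index alignment to put the numbers back in the original row order.

```python
import pandas as pd

# Hypothetical data (not from the question): S1 and Negative alternate,
# so equal samples are not adjacent.
df = pd.DataFrame({
    'Sample':   ['S1', 'Negative', 'S1', 'Negative', 'S1'],
    'Location': ['A1', 'A2', 'A3', 'A4', 'B1'],
})

# Leading letter of Location ('A' or 'B') identifies the location group.
group = df['Location'].str.extract(r'^([A-Z])')[0]

# One combined key per row; mergesort is stable, so rows with the same key
# keep their original relative order.
key = group + '|' + df['Sample']
ordered = df.assign(key=key).sort_values('key', kind='mergesort')

# After sorting, equal keys are adjacent, so shift-based blocks are safe.
blocks = ordered['key'].ne(ordered['key'].shift()).cumsum()
ordered['Repeat Number'] = ordered.groupby(blocks).cumcount() + 1

# Assignment aligns on the index, restoring the original row order.
df['Repeat Number'] = ordered['Repeat Number']
print(df)
```

Grouping directly on the two keys, as in the one-liner above, avoids the sort entirely and is usually the simpler choice.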

Here is an approach that uses .factorize(), .groupby() and .rank(). I created temporary columns to simplify the groupby() statement.

# pull 'A' or 'B' out of the Location column
df1['location_group'] = df1['Location'].str.extract(r'([A-Za-z]+)')

# convert Sample to integer
df1['x'] = df1['Sample'].factorize()[0]

# use .rank(method='first') so that every entry has a unique number
df1['Repeat Number'] = (
    df1.groupby(['location_group', 'Sample'])['x'].rank(method='first')
    .astype(int))

# clean up
df1 = df1.drop(columns=['location_group', 'x'])

# show results
print(df1)


      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1

I then named the expected result df2 and verified it with:

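# .all has to be called; the bare method object is always truthy.
# The expected frame in the question stores 'Repeat Number' as strings,
# so both frames are compared as strings here.
assert (df1.astype(str) == df2.astype(str)).all().all()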
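A stricter check could use pandas.testing.assert_frame_equal, which also compares dtypes, index and column labels. The sketch below uses two hypothetical miniature frames (computed and expected) standing in for df1 and df2, and assumes the expected values are stored as strings, as they are in the question.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical stand-ins for df1 (computed) and df2 (expected).
computed = pd.DataFrame({'Sample': ['S1', 'S1'], 'Repeat Number': [1, 2]})
expected = pd.DataFrame({'Sample': ['S1', 'S1'], 'Repeat Number': ['1', '2']})

# Cast the expected strings to int64 so the dtypes match before comparing.
expected['Repeat Number'] = expected['Repeat Number'].astype('int64')

# Raises AssertionError with a readable message if the frames differ.
assert_frame_equal(computed, expected)

# DataFrame.equals gives a plain boolean for the same comparison.
matches = computed.equals(expected)
print(matches)
```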

We can try GroupBy.cumcount:

blocks = df1['Sample'].ne(df1['Sample'].shift()).cumsum()
df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1)
# if you want str type
#df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1).astype(str)

blocks increases by one each time Sample differs from the previous row:

print(blocks)

0      1
1      1
2      1
3      1
4      2
5      2
6      3
7      3
8      4
9      5
10     6
11     7
12     8
13     8
14     8
15     9
16     9
17     9
18    10
19    11
20    11
21    12
22    12
23    13
Name: Sample, dtype: int64
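Because blocks labels consecutive runs, a sample that reappears after a different value starts a new block, which is what happens to Negative at A10 and A12 (blocks 5 and 7 in the listing above). A minimal sketch on a hypothetical three-row excerpt of those rows compares run-based and value-based counting; the names by_run and by_value are illustrative only.

```python
import pandas as pd

# Hypothetical excerpt mirroring rows A10 to A12 of the question.
df = pd.DataFrame({
    'Sample':   ['Negative', 'Positive', 'Negative'],
    'Location': ['A10', 'A11', 'A12'],
})

# Consecutive-run labels: a new block starts whenever Sample changes.
blocks = df['Sample'].ne(df['Sample'].shift()).cumsum()
by_run = df.groupby(blocks).cumcount().add(1)

# Grouping on the value itself counts repeats regardless of adjacency.
by_value = df.groupby('Sample').cumcount().add(1)

print(by_run.tolist())    # [1, 1, 1]
print(by_value.tolist())  # [1, 1, 2]
```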

Another option is:

location_blocks = df1['Location'].str[0]
df1['Repeat Number'] = df1.groupby([location_blocks, 'Sample']).cumcount().add(1)

Output

print(df1)

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
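Taking only the first character with .str[0] assumes single-letter location groups. If locations could carry longer prefixes such as 'AA1' (an assumption, since the question only shows A and B), a regex that captures all leading letters keeps those groups apart. The helper add_repeat_number and the demo frame below are hypothetical and only sketch the idea.

```python
import pandas as pd

def add_repeat_number(df, sample_col='Sample', location_col='Location'):
    """Return a copy of df with a 1-based 'Repeat Number' column,
    counted per (leading letters of location, sample) pair."""
    # Leading letters of the location, e.g. 'A' from 'A10' or 'AA' from 'AA3'.
    prefix = df[location_col].str.extract(r'^([A-Za-z]+)', expand=False)
    out = df.copy()
    out['Repeat Number'] = out.groupby([prefix, sample_col]).cumcount() + 1
    return out

# Hypothetical data with a two-letter prefix.
demo = pd.DataFrame({
    'Sample':   ['S1', 'S1', 'S1', 'S1'],
    'Location': ['A1', 'A2', 'AA1', 'AA2'],
})
result = add_repeat_number(demo)
print(result['Repeat Number'].tolist())  # [1, 2, 1, 2]
```

With .str[0], all four rows in this demo would fall into the same group 'A' and be numbered 1 to 4.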
