How to count repeated elements in a DataFrame and give each one a running count

import pandas as pd

data = {
    'Sample': ['S1', 'S1', 'S1', 'S1', 'S2', 'S2', 'S3', 'S3', 'S4', 'Negative', 'Positive', 'Negative',
               'S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S3', 'S4', 'S4', 'Positive', 'Positive', 'Negative'],
    'Location': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12',
                 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10', 'B11', 'B12'],
}
df1 = pd.DataFrame(data)
df1

# Expected result: the same data with a 'Repeat Number' column
data = {
    'Sample': ['S1', 'S1', 'S1', 'S1', 'S2', 'S2', 'S3', 'S3', 'S4', 'Negative', 'Positive', 'Negative',
               'S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S3', 'S4', 'S4', 'Positive', 'Positive', 'Negative'],
    'Location': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12',
                 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10', 'B11', 'B12'],
    'Repeat Number': ['1', '2', '3', '4', '1', '2', '1', '2', '1', '1', '1', '2',
                      '1', '2', '3', '1', '2', '3', '1', '1', '2', '1', '2', '1'],
}
df1 = pd.DataFrame(data)
df1

3 Answers

User

Answer 1 · edited 2024-05-01 14:07:57

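@ansev's (original) answer only works if the Sample column is already ordered consistently with the Location column, i.e. identical samples sit on consecutive rows, because it compares Sample with Sample.shift().

If that is not the case, either sort with sort_values first, or group by both the Sample column and the result of df1['Location'].str.extract('(^[A-Z])'):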

df1['Repeat Number'] = df1.groupby(['Sample', df1['Location'].str.extract('(^[A-Z])')[0]]).cumcount() + 1
print(df1)

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
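The sort_values route mentioned above is not shown, so here is a minimal sketch of how it could look. The toy data, the helper column name `key`, and the `'|'` separator are my own assumptions for illustration, not part of the original answer. The idea is to sort so that equal (row letter, Sample) pairs become adjacent, apply the shift/cumsum trick, and let index alignment restore the original row order.

```python
import pandas as pd

# Toy data (assumed): the two 'Negative' rows are NOT adjacent
df = pd.DataFrame({
    'Sample': ['S1', 'Negative', 'S1', 'Negative', 'S1', 'S1'],
    'Location': ['A1', 'A2', 'A3', 'A4', 'B1', 'B2'],
})

# Hypothetical helper column: row letter of Location plus the Sample name
df['key'] = df['Location'].str[0] + '|' + df['Sample']

# Stable sort on the single key so ties keep their original relative order
ordered = df.sort_values('key', kind='stable')

# A new block starts whenever the key differs from the previous row
blocks = ordered['key'].ne(ordered['key'].shift()).cumsum()
ordered['Repeat Number'] = ordered.groupby(blocks).cumcount() + 1

# Column assignment aligns on the index, so the original order is kept
df['Repeat Number'] = ordered['Repeat Number']
df = df.drop(columns='key')
print(df)
```

In practice the groupby on both keys shown above is shorter; the sort is only needed if you specifically want to keep the shift-based comparison.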

User

Answer 2 · edited 2024-05-01 14:07:57

Here is an approach using .factorize() and .groupby().rank(). I created temporary columns to simplify the groupby() statement.

# pull 'A' or 'B' out of the Location column
df1['location_group'] = df1['Location'].str.extract(r'([A-Za-z]+)', expand=False)

# convert Sample to integer
df1['x'] = df1['Sample'].factorize()[0]

# use .rank(method='first') so that every entry has a unique number
df1['Repeat Number'] = (
    df1.groupby(['location_group', 'Sample'])['x'].rank(method='first')
    .astype(int))

# clean up
df1 = df1.drop(columns=['location_group', 'x'])

# show results
print(df1)


      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
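To see why this works, it may help to look at the two building blocks on a tiny example. This is only a sketch: the `samples` series and the `toy` frame are made up for illustration. factorize() numbers each label in order of first appearance, so every row in one Sample group carries the same code, and rank(method='first') then breaks those ties by row order.

```python
import pandas as pd

# Made-up input for illustration
samples = pd.Series(['S1', 'S2', 'S1', 'S1', 'S2'])

# factorize() returns (codes, uniques); codes follow order of first appearance
codes, uniques = samples.factorize()
print(codes.tolist())   # [0, 1, 0, 0, 1]
print(list(uniques))    # ['S1', 'S2']

toy = pd.DataFrame({'Sample': samples, 'x': codes})

# Inside one Sample group every x is tied, so method='first'
# hands out 1, 2, 3, ... in the order the rows appear
toy['Repeat Number'] = (
    toy.groupby('Sample')['x'].rank(method='first').astype(int))
print(toy)
```

Since all values inside a group are tied anyway, the rank is effectively a position counter, which is also what cumcount() + 1 gives.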

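I then stored the expected result as df2 and verified it as shown below. Note that .all has to be called: without the parentheses the assert only tests a bound method, which is always truthy. If df2 keeps 'Repeat Number' as strings, as in the question, cast that column to int before comparing.

assert (df1 == df2).all().all()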
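A small sketch of the comparison pitfalls, using made-up frames named `result` and `expected` (both names are my assumption, standing in for df1 and df2). It shows that an uncalled .all is always truthy, that integer counts never equal string counts, and one way to compare after casting with pd.testing.assert_frame_equal.

```python
import pandas as pd

# Made-up stand-ins for df1 (computed) and df2 (expected)
result = pd.DataFrame({'Sample': ['S1', 'S1', 'S2'], 'Repeat Number': [1, 2, 1]})
expected = pd.DataFrame({'Sample': ['S1', 'S1', 'S2'], 'Repeat Number': ['1', '2', '1']})

# Without parentheses this is a bound method object, which is always truthy
always_truthy = bool((result == expected).all)

# Reduce over rows, then over columns; int 1 never equals str '1'
naive_match = bool((result == expected).all().all())

# Cast the string counts to integers, then compare values and dtypes
expected['Repeat Number'] = expected['Repeat Number'].astype('int64')
pd.testing.assert_frame_equal(result, expected)

print(always_truthy, naive_match)
```

assert_frame_equal raises an AssertionError with a readable diff when the frames differ, which tends to be easier to debug than a bare assert.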

User

Answer 3 · edited 2024-05-01 14:07:57

We can try GroupBy.cumcount:

blocks = df1['Sample'].ne(df1['Sample'].shift()).cumsum()
df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1)
# if you want the str type instead
# df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1).astype(str)

blocks increases by 1 every time Sample differs from the value in the previous row:

print(blocks)

0      1
1      1
2      1
3      1
4      2
5      2
6      3
7      3
8      4
9      5
10     6
11     7
12     8
13     8
14     8
15     9
16     9
17     9
18    10
19    11
20    11
21    12
22    12
23    13
Name: Sample, dtype: int64
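Broken into its intermediate steps on a made-up four-row series, the shift/cumsum idiom looks roughly like this. The variable names `previous` and `changed` are mine, introduced only to show each stage.

```python
import pandas as pd

# Made-up input: the last 'S1' is separated from the first two
sample = pd.Series(['S1', 'S1', 'S2', 'S1'], name='Sample')

previous = sample.shift()        # NaN, 'S1', 'S1', 'S2'
changed = sample.ne(previous)    # True, False, True, True
blocks = changed.cumsum()        # 1, 1, 2, 3

# Count positions inside each run of identical consecutive values
repeat = sample.groupby(blocks).cumcount().add(1)
print(repeat.tolist())           # [1, 2, 1, 1]
```

The last S1 starts again at 1 because it is not adjacent to the earlier S1 rows, so this idiom counts consecutive runs rather than all occurrences.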

Another option is:

location_blocks = df1['Location'].str[0]
df1['Repeat Number'] = df1.groupby([location_blocks, 'Sample']).cumcount().add(1)

Output

print(df1)

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
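For anyone who wants to check both options against the expected 'Repeat Number' values from the question, here is a rough sketch. The list comprehension that builds the Location values and the names `by_blocks`, `by_group` and `mismatch_rows` are conveniences I introduced; the expected list is the question's column converted to integers.

```python
import pandas as pd

samples = ['S1', 'S1', 'S1', 'S1', 'S2', 'S2', 'S3', 'S3', 'S4', 'Negative', 'Positive', 'Negative',
           'S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S3', 'S4', 'S4', 'Positive', 'Positive', 'Negative']
# A1..A12 followed by B1..B12, as in the question
locations = [f'{row}{col}' for row in 'AB' for col in range(1, 13)]
df = pd.DataFrame({'Sample': samples, 'Location': locations})

# Expected counts from the question, as integers
expected = [1, 2, 3, 4, 1, 2, 1, 2, 1, 1, 1, 2,
            1, 2, 3, 1, 2, 3, 1, 1, 2, 1, 2, 1]

# Option 1: runs of consecutive identical Sample values
blocks = df['Sample'].ne(df['Sample'].shift()).cumsum()
by_blocks = df.groupby(blocks).cumcount().add(1)

# Option 2: group by the row letter of Location together with Sample
by_group = df.groupby([df['Location'].str[0], 'Sample']).cumcount().add(1)

# Rows where option 1 disagrees with the expected values
mismatch_rows = df.index[by_blocks.ne(expected)].tolist()
print(mismatch_rows)
print(by_group.tolist() == expected)
```

On this data the consecutive-run version differs only at row 11 (Negative at A12), because the two Negative samples in row A are separated by a Positive one, while the grouped version matches every expected value.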
