Remove duplicates only within groups

A     B       C
spec  first   second
test  text1   text2
act   text12  text13
act   text14  text15
test  text32  text33
act   text34  text35
test  text85  text86
act   text87  text88
test  text1   text2
act   text12  text13
act   text14  text15
test  text85  text86
act   text87  text88
spec  third   fourth
test  text1   text2
act   text12  text13
act   text14  text15
test  text85  text86
act   text87  text88
test  text1   text2
act   text12  text13
act   text14  text15
test  text85  text86
act   text87  text88

A     B       C
spec  first   second
test  text1   text2
act   text12  text13
act   text14  text15
test  text32  text33
act   text34  text35
test  text85  text86
act   text87  text88
spec  third   fourth
test  text1   text2
act   text12  text13
act   text14  text15
test  text85  text86
act   text87  text88

dfList = df.index[df["A"] == "spec"].tolist()
dfList = np.asarray(dfList)
for dfL in dfList:
    idx = np.where(dfList == dfL)
    if idx[0][0] != (len(dfList) - 1):
        df.loc[dfList[idx[0][0]]:dfList[idx[0][0]+1]-1] = df.loc[dfList[idx[0][0]]:dfList[idx[0][0]+1]-1].drop_duplicates()
    else:
        df.loc[dfList[idx[0][0]]:] = df.loc[dfList[idx[0][0]]:].drop_duplicates()

3 Answers

User

#1 · Edited 2024-10-05 10:35:35

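This should work, with one caveat: it compares rows across the whole DataFrame rather than within each "spec" group: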

df2 = df.drop_duplicates(subset=['A', 'B','C'])
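
As a quick check of how this call behaves, here is a minimal sketch on a small hypothetical frame (not the question's data). It suggests that `drop_duplicates` with a `subset` compares rows across the entire DataFrame, so a row that reappears in a later "spec" block is dropped as well.

```python
import pandas as pd

# Hypothetical miniature of the question's layout: two "spec" blocks.
df = pd.DataFrame({
    "A": ["spec", "test", "test", "spec", "test"],
    "B": ["first", "text1", "text1", "third", "text1"],
    "C": ["second", "text2", "text2", "fourth", "text2"],
})

df2 = df.drop_duplicates(subset=["A", "B", "C"])

# Row 2 repeats row 1 inside the first block and is dropped.
# Row 4 is the first such row in the second block, yet it is dropped too,
# because the comparison is not limited to its own block.
kept = df2.index.tolist()
print(kept)
```

If first occurrences in later blocks must survive, the comparison needs some per-block label in addition to columns A, B and C.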

User

#2 · Edited 2024-10-05 10:35:35

Use groupby + duplicated:

df[~df.groupby(df.A.eq('spec').cumsum()).apply(lambda x: x.duplicated()).values]

       A       B       C
0   spec   first  second
1   test   text1   text2
2    act  text12  text13
3    act  text14  text15
4   test  text32  text33
5    act  text34  text35
6   test  text85  text86
7    act  text87  text88
13  spec   third  fourth
14  test   text1   text2
15   act  text12  text13
16   act  text14  text15
17  test  text85  text86
18   act  text87  text88

Details

We use cumsum to find all rows that belong to a given "spec" entry. The group labels are:

df.A.eq('spec').cumsum()

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    2
14    2
15    2
16    2
17    2
18    2
19    2
20    2
21    2
22    2
23    2
Name: A, dtype: int64

Then group on that series and compute the duplicates within each group:

df.groupby(df.A.eq('spec').cumsum()).apply(lambda x: x.duplicated()).values

array([False, False, False, False, False, False, False, False,  True,
        True,  True,  True,  True, False, False, False, False, False,
       False,  True,  True,  True,  True,  True])

From there, all that remains is to keep the rows that correspond to "False" (i.e., not duplicated).
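
For anyone who wants to reproduce this end to end, here is a self-contained sketch with the question's sample data typed in by hand. Note that it swaps in a slightly different technique from the one-liner above: instead of `groupby(...).apply(...)`, it attaches the cumsum label as a temporary column (named `group_id` here, an assumed name) and calls `duplicated()` once, which should give the same mask for this data.

```python
import pandas as pd

# Sample data from the question: two "spec" blocks, 24 rows in total.
rows = [
    ("spec", "first", "second"),
    ("test", "text1", "text2"),
    ("act", "text12", "text13"),
    ("act", "text14", "text15"),
    ("test", "text32", "text33"),
    ("act", "text34", "text35"),
    ("test", "text85", "text86"),
    ("act", "text87", "text88"),
    ("test", "text1", "text2"),
    ("act", "text12", "text13"),
    ("act", "text14", "text15"),
    ("test", "text85", "text86"),
    ("act", "text87", "text88"),
    ("spec", "third", "fourth"),
    ("test", "text1", "text2"),
    ("act", "text12", "text13"),
    ("act", "text14", "text15"),
    ("test", "text85", "text86"),
    ("act", "text87", "text88"),
    ("test", "text1", "text2"),
    ("act", "text12", "text13"),
    ("act", "text14", "text15"),
    ("test", "text85", "text86"),
    ("act", "text87", "text88"),
]
df = pd.DataFrame(rows, columns=["A", "B", "C"])

# Group label: goes up by one at every "spec" row.
group_id = df["A"].eq("spec").cumsum()

# A row counts as a duplicate only if the same (A, B, C, group_id)
# combination appeared earlier, i.e. earlier within the same block.
is_dup = df.assign(group_id=group_id).duplicated()

result = df[~is_dup]
print(result)
```

Keeping the label in a temporary column avoids relying on the row order that `apply` returns, at the cost of building one extra column.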

User

#3 · Edited 2024-10-05 10:35:35

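Another possible solution: keep a counter and build a new column from column A out of the counter's value, incrementing the counter each time the value "spec" is encountered.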

counter = 0

def counter_fun(val):
    # Without this declaration, counter += 1 raises UnboundLocalError.
    global counter
    if val == 'spec':
        counter += 1
    return counter

df['new_col'] = df.A.apply(counter_fun)

Then group on the new column and drop the duplicates.
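
The last step is not spelled out in the answer, so here is one way it might look, as a sketch on a small hypothetical frame (not the question's data). The helper `block_labels` is an assumed name; it keeps the counter local instead of global, and the deduplication uses `drop_duplicates` with the new column included in `subset` rather than an explicit groupby.

```python
import pandas as pd

# Hypothetical miniature of the question's layout: two "spec" blocks.
df = pd.DataFrame({
    "A": ["spec", "test", "test", "spec", "test", "test"],
    "B": ["first", "text1", "text1", "third", "text1", "text1"],
    "C": ["second", "text2", "text2", "fourth", "text2", "text2"],
})


def block_labels(values, marker="spec"):
    """Return one label per value; the label increases at every marker."""
    counter = 0
    labels = []
    for val in values:
        if val == marker:
            counter += 1
        labels.append(counter)
    return labels


df["new_col"] = block_labels(df["A"])

# With new_col in the subset, identical rows in different blocks
# no longer count as duplicates of each other.
deduped = df.drop_duplicates(subset=["new_col", "A", "B", "C"])

# Drop the helper column once it has done its job.
result = deduped.drop(columns="new_col")
print(result)
```

Here rows 2 and 5 are removed as repeats inside their own blocks, while row 4 survives even though it matches row 1 from the first block.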
