How do I find duplicates in a DataFrame under a given condition?

Published 2024-09-30 06:33:36


I have a pandas DataFrame:

RTYPE  PERIOD_ID    STORE_ID                       MKT MTYPE  RGROUP  RZF  RXF
0    MKT   20171411  3102300001  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
1    MKT   20171411  3102300002  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
2    MKT   20171411  3104001193              PM Provision  CELL     NaN  NaN  NaN
3    MKT   20171411  3104001193  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
4    MKT   20171411  3104001193    Provision including MM  CELL     NaN  NaN  NaN
5    MKT   20171411  3104001641              PM Provision  CELL     NaN  NaN  NaN
6    MKT   20171411  3104001641  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
7    MKT   20171411  3104001641    Provision including MM  CELL     NaN  NaN  NaN
8    MKT   20171411  3104001682              PM Provision  CELL     NaN  NaN  NaN
9    MKT   20171411  3104001682  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
10   MKT   20171411  3104001682    Provision including MM  CELL     NaN  NaN  NaN
11   MKT   20171412  3104001682                   Alcohol  CELL     NaN  NaN  NaN
12   MKT   20171412  3104001682                      Fish  CELL     NaN  NaN  NaN
13   MKT   20171412  3104001684                   Alcohol  CELL     NaN  NaN  NaN
14   MKT   20171412  3104001684                      Fish  CELL     NaN  NaN  NaN

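I need to find duplicate MKTs based on the following condition: if two or more MKTs have exactly the same set of STORE_IDs within a given PERIOD_ID, those MKTs are duplicates. So in this case, for period 20171411 the duplicates are 'PM Provision' and 'Provision including MM', and for period 20171412 the duplicates are 'Alcohol' and 'Fish'.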
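To make the condition concrete, here is a minimal sketch that collects the set of STORE_IDs for each (PERIOD_ID, MKT) pair. It rebuilds only the three columns the condition uses from the sample above; the name `store_sets` is made up for illustration.

```python
import pandas as pd

# Reduced copy of the sample rows: only PERIOD_ID, STORE_ID and MKT matter here.
rows = [
    (20171411, 3102300001, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3102300002, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001193, "PM Provision"),
    (20171411, 3104001193, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001193, "Provision including MM"),
    (20171411, 3104001641, "PM Provision"),
    (20171411, 3104001641, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001641, "Provision including MM"),
    (20171411, 3104001682, "PM Provision"),
    (20171411, 3104001682, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001682, "Provision including MM"),
    (20171412, 3104001682, "Alcohol"),
    (20171412, 3104001682, "Fish"),
    (20171412, 3104001684, "Alcohol"),
    (20171412, 3104001684, "Fish"),
]
newdf = pd.DataFrame(rows, columns=["PERIOD_ID", "STORE_ID", "MKT"])

# One frozenset of STORE_IDs per (PERIOD_ID, MKT) pair.
store_sets = {
    key: frozenset(stores)
    for key, stores in newdf.groupby(["PERIOD_ID", "MKT"])["STORE_ID"]
}

for (period, mkt), stores in store_sets.items():
    print(period, mkt, sorted(stores))
```

Two MKTs count as duplicates in a period when their printed store lists are identical. 'PM KA+PM PROV+SMKT+PETRO' covers two extra stores in 20171411, so it is not a duplicate of the other two.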

This is what I have tried so far:

df1 = newdf[newdf.duplicated(['PERIOD_ID','STORE_ID'], keep=False)]
d1 = {k:tuple(set(v)) for k, v in df1.groupby('PERIOD_ID')['MKT']}
print (d1)

which returns:

{20171411L: ('Provision including MM', 'PM Provision', 'PM KA+PM PROV+SMKT+PETRO'), 20171412L: ('Fish', 'Alcohol')}

The output above does not give the duplicates; it only gives the set of unique MKTs for each period.

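What I need is something like this, with the period as the key and that period's duplicate MKTs as the value, where "duplicate" means the condition described above: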

{20171411L: ('Provision including MM', 'PM Provision'), 20171412L: ('Fish', 'Alcohol')}

I am really new to pandas and have only basic knowledge of Python. Any help would be appreciated.


3 Answers

This should work for your case. I simply removed the unique MKTs from the duplicated MKTs you had already found.

duplicate = {k:set(v) for k, v in newdf[newdf.duplicated(['PERIOD_ID','STORE_ID'], 
                                                         keep=False)].groupby('PERIOD_ID')['MKT']}
unique = {k:set(v) for k, v in newdf[newdf.duplicated(['PERIOD_ID','STORE_ID'], 
                                                      keep=False) == False].groupby('PERIOD_ID')['MKT']}

final = dict()
for k in duplicate:
    if k in unique:
        final[k] = tuple(duplicate[k] - unique[k])
    else:
        final[k] = tuple(duplicate[k])

print(final)

I hope I understood you correctly. If I missed or misunderstood something, feel free to comment.

df_grouped = df.groupby(['PERIOD_ID','STORE_ID','MKT'],
                    as_index=False)\
                    .agg({'MTYPE':'count'})\
                    .rename(columns={'MTYPE': 'count'})

df_grouped[df_grouped['count'] > 1]\
           .groupby('PERIOD_ID')\
           .agg({'MKT':lambda x: list(set(x))}).to_dict()['MKT']

I was able to solve this with the following code:

    df1=df[['PERIOD_ID','STORE_ID','MKT']]
    df1=df1.sort_values(['PERIOD_ID','STORE_ID'],ascending=True)
    duplicatedf = df1.groupby(['PERIOD_ID','MKT'])['STORE_ID'].agg(lambda STORE_ID: ','.join(STORE_ID.astype(str).replace(' ','').unique())).reset_index()
    duplicates =duplicatedf[ duplicatedf.duplicated(['PERIOD_ID','STORE_ID'],keep='first') | duplicatedf.duplicated(['PERIOD_ID','STORE_ID'],keep='last')]
    duplicates= duplicates.groupby(['PERIOD_ID','STORE_ID']).agg(lambda MKT: ','.join(MKT.astype(str))).reset_index()
    print (duplicates)


    # Convert the DataFrame into a dict
    dupdictdf=duplicates[['PERIOD_ID','MKT']]
    dicta=dupdictdf.to_dict("records")
    print (dicta)
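
For comparison, the condition from the question can also be applied directly: build the set of STORE_IDs for each (PERIOD_ID, MKT) pair, then report the MKTs that share a set within the same period. The following is a minimal sketch, not taken from any of the answers above; the function name `find_duplicate_mkts` is hypothetical, and it assumes the frame has the PERIOD_ID, STORE_ID and MKT columns shown in the question.

```python
from collections import defaultdict

import pandas as pd


def find_duplicate_mkts(df):
    """Map each PERIOD_ID to the MKTs whose STORE_ID set equals another MKT's."""
    # (PERIOD_ID, frozenset of STORE_IDs) -> MKTs covering exactly that store set
    by_store_set = defaultdict(list)
    for (period, mkt), stores in df.groupby(["PERIOD_ID", "MKT"])["STORE_ID"]:
        by_store_set[(period, frozenset(stores))].append(mkt)

    # Keep only the store sets shared by more than one MKT.
    duplicates = defaultdict(list)
    for (period, _stores), mkts in by_store_set.items():
        if len(mkts) > 1:
            duplicates[period].extend(mkts)

    return {period: tuple(mkts) for period, mkts in duplicates.items()}
```

Calling `find_duplicate_mkts(df)` returns a dict keyed by period, in the shape the question asks for. One design choice to be aware of: if a period contains several independent groups of duplicates, this sketch merges them into a single tuple, so keep the groups separate if that distinction matters.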
