Pandas: dropping duplicates with groupby under different conditions per column

Published 2024-10-04 01:23:50


I have a DataFrame:

df = pd.read_csv(...)

a   b   c      d     e     f      
1  two  adc   aaaa   Nan   mmm    
2  one  Nan   aaa    Nan   nnn    
1  one  ab    Nan    Nan   ww     
1  two  abcd  aaa    ff    uiww  
1  two  a     aaa    d     iii 

I want to drop duplicates based on 'a' and 'b'.

df = df.drop_duplicates(['a', 'b'])
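  1. But in columns 'c', 'd' and 'e' I want to keep the value with the maximum length.
  2. In column 'f' I want to keep a value that does not contain 'm' or 'n', or that contains 'w' or 'y'. If no value meets these conditions, take any value.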

I want to get this result:

a   b   c      d     e     f      
1  two  abcd  aaaa   ff    uiww   
2  one  Nan   aaa    Nan   nnn    
1  one  ab    Nan    Nan   ww     

I tried using transform and apply, but could not reduce them to a single solution. What is the most efficient way to achieve this?


2 Answers

Create functions based on your conditions, then use agg with groupby:

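def yourfunc1(x):
    # Return the longest string in the group (the first one on ties).
    return x.loc[x.str.len().idxmax()]

def yourfunc2(x):
    # Prefer values containing 'w' or 'y', or containing neither 'm' nor 'n'.
    mask = x.str.contains('w|y') | ~x.str.contains('m|n')
    if mask.any():
        return x.loc[mask].iloc[0]
    # No value qualifies, so fall back to the first one.
    return x.iloc[0]

# 'Nan' is a literal string in the data; blank it so it counts as length 0.
df = df.replace({'Nan': ''})
s = df.groupby(['a', 'b'], as_index=False).agg({'c': yourfunc1, 'd': yourfunc1, 'e': yourfunc1, 'f': yourfunc2})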
   a    b     c     d   e     f
0  1  one    ab              ww
1  1  two  abcd  aaaa  ff  uiww
2  2  one         aaa       nnn
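
For anyone who wants to run this without the original CSV, here is a minimal self-contained sketch of the same idea. The sample frame is typed in by hand from the question, and the names `longest` and `preferred_f` are my own placeholders, not pandas functions. It assumes 'Nan' is a literal string in the data, as it appears in the question.

```python
import pandas as pd

# Sample data copied by hand from the question; 'Nan' is a literal string.
df = pd.DataFrame({
    'a': [1, 2, 1, 1, 1],
    'b': ['two', 'one', 'one', 'two', 'two'],
    'c': ['adc', 'Nan', 'ab', 'abcd', 'a'],
    'd': ['aaaa', 'aaa', 'Nan', 'aaa', 'aaa'],
    'e': ['Nan', 'Nan', 'Nan', 'ff', 'd'],
    'f': ['mmm', 'nnn', 'ww', 'uiww', 'iii'],
})

def longest(x):
    # Longest string in the group (the first one on ties).
    return x.loc[x.str.len().idxmax()]

def preferred_f(x):
    # Prefer values containing 'w' or 'y', or containing neither 'm' nor 'n'.
    ok = x.str.contains('w|y') | ~x.str.contains('m|n')
    return x[ok].iloc[0] if ok.any() else x.iloc[0]

out = (
    df.replace({'Nan': ''})  # blank the placeholder so it has length 0
      .groupby(['a', 'b'], as_index=False)
      .agg({'c': longest, 'd': longest, 'e': longest, 'f': preferred_f})
)
print(out)
```

Note that each column is aggregated independently, so the values in one output row can come from different input rows (here 'aaaa' and 'abcd' for the group (1, 'two')).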

If you do not specifically need groupby (which can be slow on large DataFrames), you can do the following instead:

def custom_drop_duplicates(dataframe):
    localDF = dataframe.copy()

    criteria_list = []
    for i, col in enumerate(['c', 'd', 'e']):  # length criteria apply to 'c', 'd' and 'e'
        localDF.loc[:, 'criteria{}'.format(i)] = [len(x) for x in localDF[col]]
        criteria_list.append('criteria{}'.format(i))

    localDF.loc[:, 'criteria{}'.format(i+1)] = [all(x not in y for x in ['m', 'n']) or any(x in y for x in ['w', 'y']) for y in localDF['f']]
    criteria_list.append('criteria{}'.format(i+1))

    # here you have a judgement call: if criteria are in conflict, you need to order them. I just assume they are ordered in the way you described them.

    localDF.sort_values(by=criteria_list, ascending=True, inplace=True)
    localDF.drop_duplicates(subset=['a', 'b'], keep='last', inplace=True)

    localDF.drop(columns=criteria_list, inplace=True)

    return localDF
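
One thing worth checking before relying on a sort-then-drop approach: it keeps one whole row per (a, b), so it cannot combine the longest 'c' from one row with the longest 'd' from another. The sketch below is a compact variant written to show that on the question's data. The helper name `drop_duplicates_by_rank` and the temporary column names are my own, and it assumes 'Nan' is a literal string that should count as empty.

```python
import pandas as pd

# Sample data copied by hand from the question; 'Nan' is a literal string.
df = pd.DataFrame({
    'a': [1, 2, 1, 1, 1],
    'b': ['two', 'one', 'one', 'two', 'two'],
    'c': ['adc', 'Nan', 'ab', 'abcd', 'a'],
    'd': ['aaaa', 'aaa', 'Nan', 'aaa', 'aaa'],
    'e': ['Nan', 'Nan', 'Nan', 'ff', 'd'],
    'f': ['mmm', 'nnn', 'ww', 'uiww', 'iii'],
})

def drop_duplicates_by_rank(frame):
    # Hypothetical helper: rank whole rows, then keep the best row per (a, b).
    ranked = frame.replace({'Nan': ''})
    # Sort keys in priority order: lengths of c, d, e, then the 'f' rule.
    keys = {f'len_{col}': ranked[col].str.len() for col in ['c', 'd', 'e']}
    f = ranked['f']
    keys['f_ok'] = f.str.contains('w|y') | ~f.str.contains('m|n')
    ranked = ranked.assign(**keys).sort_values(list(keys))
    # After an ascending sort the best row of each group comes last.
    ranked = ranked.drop_duplicates(subset=['a', 'b'], keep='last')
    return ranked.drop(columns=list(keys)).sort_index()

result = drop_duplicates_by_rank(df)
print(result)
```

On this data the (1, 'two') group keeps the row with c='abcd', whose 'd' is 'aaa' rather than the 'aaaa' the question asks for. If per-column maxima are required, an aggregation per column is needed instead.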

Hope this helps.
