Pandas: dropping duplicates with groupby according to different conditions

df = pd.read_csv(...)

a    b    c     d     e    f
1    two  adc   aaaa  Nan  mmm
2    one  Nan   aaa   Nan  nnn
1    one  ab    Nan   Nan  ww
1    two  abcd  aaa   ff   uiww
1    two  a     aaa   d    iii

2 answers

User

#1 · edited 2024-10-04 01:23:50

Write one function per condition, then pass them to agg after a groupby:

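import pandas as pd

def yourfunc1(x):
    # Keep the longest string in the group (the first one on ties).
    return x.loc[x.str.len().idxmax()]

def yourfunc2(x):
    # Prefer a value that contains 'w' or 'y', or contains neither 'm' nor 'n';
    # otherwise fall back to the first value in the group.
    mask = x.str.contains('w|y') | ~x.str.contains('m|n')
    return x.loc[mask].iloc[0] if mask.any() else x.iloc[0]

# 'Nan' is a literal string in the data; make it empty so its length is 0.
df = df.replace({'Nan': ''})
s = df.groupby(['a', 'b'], as_index=False).agg(
    {'c': yourfunc1, 'd': yourfunc1, 'e': yourfunc1, 'f': yourfunc2})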
   a    b     c     d   e     f
0  1  one    ab              ww
1  1  two  abcd  aaaa  ff  uiww
2  2  one         aaa       nnn
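
To try this without the original CSV, here is a minimal self-contained sketch. It assumes the question's sample table rebuilt by hand as an in-memory DataFrame; the names `longest` and `preferred` are hypothetical stand-ins for the two functions above, not anything from pandas.

```python
import pandas as pd

# Sample rows copied from the question; 'Nan' is a literal string there.
df = pd.DataFrame({
    'a': [1, 2, 1, 1, 1],
    'b': ['two', 'one', 'one', 'two', 'two'],
    'c': ['adc', 'Nan', 'ab', 'abcd', 'a'],
    'd': ['aaaa', 'aaa', 'Nan', 'aaa', 'aaa'],
    'e': ['Nan', 'Nan', 'Nan', 'ff', 'd'],
    'f': ['mmm', 'nnn', 'ww', 'uiww', 'iii'],
})

def longest(x):
    # x is the Series of one column within one (a, b) group.
    return x.loc[x.str.len().idxmax()]

def preferred(x):
    # First value with 'w'/'y', or without 'm'/'n'; else the first value.
    mask = x.str.contains('w|y') | ~x.str.contains('m|n')
    return x.loc[mask].iloc[0] if mask.any() else x.iloc[0]

df = df.replace('Nan', '')  # empty strings have length 0
result = df.groupby(['a', 'b'], as_index=False).agg(
    {'c': longest, 'd': longest, 'e': longest, 'f': preferred}
)
print(result)
```

Each column is reduced independently, so one output row can combine values that came from different input rows (for group 1/two, 'abcd' and 'aaaa' come from different rows).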

User

#2 · edited 2024-10-04 01:23:50

Unless you specifically need groupby (which can be slow on large DataFrames), you could do the following instead:

import pandas as pd

def custom_drop_duplicates(dataframe):
    localDF = dataframe.copy()

    # Length-based criteria: longer strings in 'c', 'd', 'f' rank higher.
    # Literal 'Nan' strings count as length 3 unless replaced with '' first.
    criteria_list = []
    for i, col in enumerate(['c', 'd', 'f']):
        name = 'criteria{}'.format(i)
        localDF[name] = localDF[col].str.len()
        criteria_list.append(name)

    # Boolean criterion on 'f': no 'm' or 'n' at all, or at least one 'w' or 'y'.
    name = 'criteria{}'.format(len(criteria_list))
    localDF[name] = [all(x not in y for x in ['m', 'n']) or any(x in y for x in ['w', 'y']) for y in localDF['f']]
    criteria_list.append(name)

    # Judgement call: if the criteria conflict, they need an order. I just
    # assume they are ordered the way you described them.
    localDF = localDF.sort_values(by=criteria_list, ascending=True)
    localDF = localDF.drop_duplicates(subset=['a', 'b'], keep='last')

    return localDF.drop(columns=criteria_list)
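
One thing to keep in mind: sorting and then calling drop_duplicates keeps whole rows, whereas the agg approach picks each column's value separately. A minimal sketch of the difference, assuming the question's sample data rebuilt in memory with 'Nan' replaced by '' and a single hypothetical ranking column `c_len` (only the length of 'c' is used here, to keep it short):

```python
import pandas as pd

# Sample rows copied from the question, with literal 'Nan' turned into ''.
df = pd.DataFrame({
    'a': [1, 2, 1, 1, 1],
    'b': ['two', 'one', 'one', 'two', 'two'],
    'c': ['adc', 'Nan', 'ab', 'abcd', 'a'],
    'd': ['aaaa', 'aaa', 'Nan', 'aaa', 'aaa'],
    'e': ['Nan', 'Nan', 'Nan', 'ff', 'd'],
    'f': ['mmm', 'nnn', 'ww', 'uiww', 'iii'],
}).replace('Nan', '')

# Rank rows by the length of 'c' (c_len is an illustrative helper column),
# then keep the best-ranked row of every (a, b) pair.
ranked = df.assign(c_len=df['c'].str.len()).sort_values('c_len')
kept = (
    ranked.drop_duplicates(subset=['a', 'b'], keep='last')
          .drop(columns='c_len')
          .sort_index()
)
print(kept)
```

For the group a=1, b='two', the surviving row is the one with c='abcd', so its d is 'aaa' rather than the 'aaaa' that the per-column agg returns. Which behaviour is right depends on whether the result should be a real row from the input.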

Hope this helps.
