Group by key and aggregate with a custom condition

import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 'a', 'web'], ['A', 'b', 'mobile'], ['B', 'c', 'web'], ['C', 'd', 'web'], ['D', 'e', 'mobile'], ['D', 'f', 'web'], ['D', 'g', 'web'], ['D', 'g', 'web']], columns=['seller_id', 'item_id', 'selling_channel'])

2 answers

User

Answer 1 · edited 2024-09-28 13:25:08

This isn't an elegant approach, but it gets the job done:

temp = df.groupby(["seller_id", "selling_channel"])\
         .count().reset_index()\
         .groupby("seller_id")["item_id"].agg(["max", "sum"])
temp

    max sum
seller_id       
A   1   2
B   1   1
C   1   1
D   3   4

top_channel = df.groupby("seller_id")["selling_channel"]\
                .apply(lambda x: x.value_counts().index[0])
top_channel

seller_id
A    mobile
B       web
C       web
D       web

temp["selling_channel"] = top_channel
final = temp.apply(lambda r: "mixed" if r["max"]/r["sum"]<0.75 else r["selling_channel"], axis=1).to_frame().reset_index()
final.columns = ["seller_id", "main_selling_channel"]

    seller_id   main_selling_channel
0   A   mixed
1   B   web
2   C   web
3   D   web
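
The 0.75 cutoff above can be checked directly by computing each seller's top-channel share. This is only a minimal sketch, not part of the answer: the names `counts` and `share` are introduced here for illustration, and the data is the question's `df`.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 'a', 'web'], ['A', 'b', 'mobile'], ['B', 'c', 'web'], ['C', 'd', 'web'],
     ['D', 'e', 'mobile'], ['D', 'f', 'web'], ['D', 'g', 'web'], ['D', 'g', 'web']],
    columns=['seller_id', 'item_id', 'selling_channel'],
)

# Number of rows per (seller, channel) pair.
counts = df.groupby(["seller_id", "selling_channel"]).size()

# Share of the most frequent channel within each seller (max / sum).
per_seller = counts.groupby(level="seller_id")
share = per_seller.max() / per_seller.sum()
print(share)
```

Seller D has a share of exactly 3/4, so the strict comparison `< 0.75` does not mark it as mixed, while seller A (1/2) falls below the cutoff.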

User

Answer 2 · edited 2024-09-28 13:25:08

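Here is one approach: use df.groupby with value_counts(normalize=True) to get each channel's share within each seller, then check whether that share is greater than or equal to 0.75. np.where keeps the channel name where the check is True and sets it to mixed otherwise. Finally, df.groupby() with idxmax selects one row per seller: the channel that passed the check, or mixed if none did.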

a = (df.groupby('seller_id')['selling_channel'].value_counts(normalize=True).ge(0.75)
       .rename('Pct').reset_index())

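# Keep the channel name where its share passed the check, otherwise label it 'mixed';
# then pick one row per seller and drop the helper column (keyword form of drop).
out = (a.assign(selling_channel=np.where(a['Pct'], a['selling_channel'], 'mixed'))
       .loc[lambda x: x.groupby('seller_id')['Pct'].idxmax()].drop(columns='Pct'))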

print(out)

  seller_id selling_channel
0         A           mixed
2         B             web
3         C             web
4         D             web
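
Both answers apply the same rule, so it can also be written as one small function applied per seller. The following is a hedged sketch rather than either author's method: the helper `main_channel` and its `threshold` parameter are hypothetical names introduced here, and the data is the question's `df`.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 'a', 'web'], ['A', 'b', 'mobile'], ['B', 'c', 'web'], ['C', 'd', 'web'],
     ['D', 'e', 'mobile'], ['D', 'f', 'web'], ['D', 'g', 'web'], ['D', 'g', 'web']],
    columns=['seller_id', 'item_id', 'selling_channel'],
)


def main_channel(channels: pd.Series, threshold: float = 0.75) -> str:
    """Return the dominant channel, or 'mixed' if its share is below threshold."""
    # Relative frequency of each channel within this seller's rows.
    shares = channels.value_counts(normalize=True)
    top = shares.idxmax()
    return top if shares.loc[top] >= threshold else "mixed"


result = (
    df.groupby("seller_id")["selling_channel"]
      .apply(main_channel)
      .rename("main_selling_channel")
      .reset_index()
)
print(result)
```

Keeping the rule in a named function makes the cutoff explicit and easy to change, at the cost of being slower than the vectorized versions on large frames.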
