Merging lists across rows that share the same id

location_id  city         all_poi
6265981      Port Severn  [Mount St. Louis Moonstone, Horseshoe Valley, Lake Muskoka]
6265981      Port Severn  [Mount St. Louis Moonstone, Little Lake Park, Bamboo Spa, Lake Huron]

3 Answers

User

Answer 1 · edited 2024-09-29 22:24:04

You can build a set inside a custom function passed to agg:

# Flatten each group's lists and deduplicate with a set (order is not preserved)
f = lambda x: list(set(z for y in x for z in y))
df = df.groupby(['location_id', 'city'])['all_poi'].agg(f).reset_index()
print(df)
  location_id    city                                            all_poi
0        Port  Severn  [Bamboo Spa, Mount St.Louis Moonstone, Lake Hu...
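
For reference, here is a minimal self-contained sketch of the set-based approach. It assumes the question's two rows are rebuilt by hand, and the `merge_unique` helper name is illustrative rather than taken from the answer. A set has no defined order, so the example sorts the result before printing it.

```python
import pandas as pd

# Assumed reconstruction of the two rows shown in the question.
df = pd.DataFrame({
    "location_id": [6265981, 6265981],
    "city": ["Port Severn", "Port Severn"],
    "all_poi": [
        ["Mount St. Louis Moonstone", "Horseshoe Valley", "Lake Muskoka"],
        ["Mount St. Louis Moonstone", "Little Lake Park", "Bamboo Spa", "Lake Huron"],
    ],
})

def merge_unique(lists):
    """Flatten the lists of one group and drop duplicates (order not preserved)."""
    return list(set(item for sub in lists for item in sub))

merged = (
    df.groupby(["location_id", "city"])["all_poi"]
      .agg(merge_unique)
      .reset_index()
)

# One row per (location_id, city); sort because set order is arbitrary.
print(sorted(merged.loc[0, "all_poi"]))
```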

If order matters and performance is important, use a dict to remove the duplicates:

f = lambda x: list(dict.fromkeys([z for y in x for z in y]).keys())
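
A quick sketch of why the dict version keeps order: dict keys preserve insertion order (Python 3.7+), so each value stays at the position where it first appeared. The toy frame below is an assumption made for illustration, not the question's data.

```python
import pandas as pd

# Toy data, assumed for illustration.
df = pd.DataFrame({
    "location_id": [1, 1],
    "city": ["New York", "New York"],
    "all_poi": [["a", "b", "c"], ["a", "d"]],
})

# dict.fromkeys keeps the first occurrence of every value, in order.
f = lambda x: list(dict.fromkeys(z for y in x for z in y))

out = df.groupby(["location_id", "city"])["all_poi"].agg(f).reset_index()
print(out.loc[0, "all_poi"])  # ['a', 'b', 'c', 'd']
```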

Another idea is to use pd.unique:

f = lambda x: pd.unique([z for y in x for z in y]).tolist()
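
A small sketch of the `pd.unique` variant, which returns values in order of first appearance. Wrapping the flattened values in a `pd.Series` is my own assumption: as far as I know, newer pandas releases warn when `pd.unique` receives a plain Python list, and a Series avoids that. The `unique_in_order` name and the toy data are illustrative.

```python
import pandas as pd

# Toy data, assumed for illustration.
df = pd.DataFrame({
    "location_id": [1, 1],
    "city": ["New York", "New York"],
    "all_poi": [["a", "b", "c"], ["a", "d"]],
})

def unique_in_order(lists):
    flat = [z for y in lists for z in y]
    # dtype=object keeps the result a plain NumPy array of Python strings.
    return pd.unique(pd.Series(flat, dtype=object)).tolist()

out = (
    df.groupby(["location_id", "city"])["all_poi"]
      .agg(unique_in_order)
      .reset_index()
)
print(out.loc[0, "all_poi"])
```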

Edit:

If there are several columns and you need the first value of each group:

df.groupby('location_id').agg({'city': 'first', 'all_poi': f}).reset_index()

If you need other aggregation methods such as sum, mean, or join:

df.groupby('location_id').agg({'city': 'first', 
                               'all_poi': f, 
                               'cols1':'sum', 
                               'vals': ','.join, 
                               'vals1': lambda x: list(x)}).reset_index()
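
To make the mixed aggregation concrete, here is a hedged sketch on made-up data. The extra columns `cols1` and `vals` are hypothetical stand-ins for whatever other columns your frame has.

```python
import pandas as pd

df = pd.DataFrame({
    "location_id": [1, 1, 2],
    "city": ["New York", "New York", "Boston"],
    "all_poi": [["a", "b"], ["b", "c"], ["x"]],
    "cols1": [10, 5, 7],      # hypothetical numeric column
    "vals": ["p", "q", "r"],  # hypothetical string column
})

# Order-preserving merge of the lists in each group.
f = lambda x: list(dict.fromkeys(z for y in x for z in y))

out = df.groupby("location_id").agg({
    "city": "first",    # keep the first city per group
    "all_poi": f,       # merge and deduplicate the lists
    "cols1": "sum",     # add up the numeric column
    "vals": ",".join,   # concatenate the strings
}).reset_index()
print(out)
```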

User

Answer 2 · edited 2024-09-29 22:24:04

How about a simple sum():

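# "sum" concatenates the lists within each group
res = df.groupby(["location_id"], as_index=False).agg({"city": "last", "all_poi": "sum"})
# Converting to a set drops the duplicates (and the original order)
res["all_poi"] = res["all_poi"].map(set)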

Output:

Before:
   location_id  ...                                                                all_poi
0  6265981      ...  [Mount St. Louis Moonstone, Horseshoe Valley, Lake Muskoka]
1  6265981      ...  [Mount St. Louis Moonstone, Little Lake Park, Bamboo Spa, Lake Huron]

After:
   location_id  ...                                                                                                all_poi
0  6265981      ...  {Horseshoe Valley, Lake Muskoka, Lake Huron, Bamboo Spa, Little Lake Park, Mount St. Louis Moonstone}
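
If you want a list back rather than a set, one possible variation (a sketch on assumed toy data, not part of the answer as posted) is to sort the set, which also makes the result reproducible:

```python
import pandas as pd

# Toy data, assumed for illustration.
df = pd.DataFrame({
    "location_id": [1, 1],
    "city": ["New York", "New York"],
    "all_poi": [["a", "b", "c"], ["a", "d"]],
})

res = df.groupby("location_id", as_index=False).agg({"city": "last", "all_poi": "sum"})
# set() removes duplicates; sorted() turns the set into a list with a stable order.
res["all_poi"] = res["all_poi"].map(lambda pois: sorted(set(pois)))
print(res)
```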

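User

Answer 3 · edited 2024-09-29 22:24:04

The answer below looks more concise, but you can also apply sum together with groupby to combine the lists, then build a set to eliminate the duplicates and convert the set back to a list.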

import pandas as pd

df = pd.DataFrame([['1' ,'New York', ['a','b','c']], ['1', 'New York', ['a','d']]],
                   columns = ['location_id', 'city','all_poi'])

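# Concatenate the lists per group, deduplicate via set, then convert back to list.
# .sum() is used instead of .apply(sum): recent pandas versions warn about passing the builtin.
df.groupby('location_id')['all_poi'].sum().apply(set).apply(list)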
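
One caveat with summing lists is that every addition builds a new list, which can get slow for large groups. A hedged alternative sketch, using `itertools.chain.from_iterable` from the standard library (my substitution, not something this answer proposes):

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame([['1', 'New York', ['a', 'b', 'c']], ['1', 'New York', ['a', 'd']]],
                  columns=['location_id', 'city', 'all_poi'])

# chain.from_iterable walks all lists of a group once, without intermediate copies.
merged = (
    df.groupby('location_id')['all_poi']
      .agg(lambda lists: sorted(set(chain.from_iterable(lists))))
)
print(merged.loc['1'])  # ['a', 'b', 'c', 'd']
```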
