Python: performant most-frequent combination with tie resolution

| id | string_col_A | string_col_B | creation_date |
|-------|--------------|--------------|---------------|
| x12ga | STR_X1 | STR_Y1 | 2020-11-01 |
| x12ga | STR_X1 | STR_Y1 | 2020-10-10 |
| x12ga | STR_X2 | STR_Y2 | 2020-11-06 |
| x21ab | STR_X4 | STR_Y4 | 2020-11-06 |
| x21ab | STR_X5 | STR_Y5 | 2020-11-02 |
| x11aa | STR_X3 | STR_Y3 | None |

| id | string_col_A | string_col_B |
|-------|--------------|--------------|
| x12ga | STR_X1 | STR_Y1 |
| x21ab | STR_X4 | STR_Y4 |
| x11aa | STR_X3 | STR_Y3 |

def reducer(id_group):
    id_with_sizes = id_group.groupby(
        ["id", "string_col_A", "string_col_B"], dropna=False
    ).agg({'creation_date': [len, max]}).reset_index()
    id_with_sizes.columns = [
        "id", "string_col_A", "string_col_B", "row_count", "recent_date"
    ]
    id_with_sizes.sort_values(
        by=["row_count", "recent_date"], ascending=[False, False], inplace=True
    )
    return id_with_sizes.head(1).drop(["recent_date", "row_count"], axis=1)
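The reducer's rule (largest row count per id, ties broken by the most recent creation_date) can also be expressed without calling a function once per id. The following is a minimal sketch, not the asker's code: the DataFrame is rebuilt from the sample table, `creation_date` is assumed to be parsed as datetimes, and the names `keys`, `stats` and `result` are illustrative.

```python
import pandas as pd

# Sample data from the question; parsing creation_date as datetime is an assumption.
df = pd.DataFrame({
    "id": ["x12ga", "x12ga", "x12ga", "x21ab", "x21ab", "x11aa"],
    "string_col_A": ["STR_X1", "STR_X1", "STR_X2", "STR_X4", "STR_X5", "STR_X3"],
    "string_col_B": ["STR_Y1", "STR_Y1", "STR_Y2", "STR_Y4", "STR_Y5", "STR_Y3"],
    "creation_date": pd.to_datetime(
        ["2020-11-01", "2020-10-10", "2020-11-06", "2020-11-06", "2020-11-02", None]
    ),
})
keys = ["id", "string_col_A", "string_col_B"]

# One pass over the whole frame: row count and latest date per combination.
stats = (
    df.groupby(keys, dropna=False)
      .agg(row_count=("creation_date", "size"),
           recent_date=("creation_date", "max"))
      .reset_index()
)

# Sort so the best combination comes first, then keep the first row seen per id.
result = (
    stats.sort_values(["row_count", "recent_date"], ascending=[False, False])
         .drop_duplicates("id")
         .drop(columns=["row_count", "recent_date"])
)
print(result)
```

Because the sort is global and `drop_duplicates` keeps the first row per id, each id ends up with its highest-count, most recent combination; missing dates sort last by default.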

3 Answers

User

#1 · Edited 2024-10-02 10:18:37

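Create a Series `s` that combines the two string columns.
Get the index of the maximum count within each id.
Filter the DataFrame by that index.

Note: on earlier versions of pandas, remove `, sort=False` from the `.groupby` call and sort at the end instead.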

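s = df['string_col_A'] + df['string_col_B']
# Number of rows sharing each (id, combined string) pair
df['max'] = df.groupby(['id', s])['id'].transform('count')
# idxmax returns index labels, so select with .loc rather than .iloc
df = df.loc[df.groupby('id', sort=False)['max'].idxmax().values].drop(['max', 'creation_date'], axis=1)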
df
Out[1]: 
      id string_col_A string_col_B
0  x12ga       STR_X1       STR_Y1
3  x21ab       STR_X4       STR_Y4
5  x11aa       STR_X3       STR_Y3
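Because `idxmax` returns the first label among equal maxima, ties in this approach are decided by row order rather than by creation_date. One way to make the tie-break explicit is to sort by date first. The sketch below is hedged: the datetime parsing and the helper column name `n` are assumptions, and grouping on both string columns is swapped in for the concatenated series.

```python
import pandas as pd

# Sample data from the question; parsing creation_date as datetime is an assumption.
df = pd.DataFrame({
    "id": ["x12ga", "x12ga", "x12ga", "x21ab", "x21ab", "x11aa"],
    "string_col_A": ["STR_X1", "STR_X1", "STR_X2", "STR_X4", "STR_X5", "STR_X3"],
    "string_col_B": ["STR_Y1", "STR_Y1", "STR_Y2", "STR_Y4", "STR_Y5", "STR_Y3"],
    "creation_date": pd.to_datetime(
        ["2020-11-01", "2020-10-10", "2020-11-06", "2020-11-06", "2020-11-02", None]
    ),
})

# Newest rows first; missing dates go last.
ordered = df.sort_values(
    "creation_date", ascending=False, na_position="last", kind="stable"
)

# Rows per (id, string_col_A, string_col_B) combination.
ordered["n"] = (
    ordered.groupby(["id", "string_col_A", "string_col_B"])["id"].transform("count")
)

# Within each id, idxmax picks the first (newest) row among those tied on the count.
best = ordered.groupby("id", sort=False)["n"].idxmax()
result = ordered.loc[best, ["id", "string_col_A", "string_col_B"]]
print(result)
```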

User

#2 · Edited 2024-10-02 10:18:37

Let's use groupby with transform to count how often each combination occurs, then sort_values followed by drop_duplicates.

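# Number of rows sharing each (id, string_col_A, string_col_B) combination
df['help'] = df.groupby(['id','string_col_A','string_col_B'])['string_col_A'].transform('count')
# Sort ascending so the highest count and latest date come last, then keep the last row per id
out = df.sort_values(['help','creation_date'],na_position='first').drop_duplicates('id',keep='last').drop(['help','creation_date'],axis=1)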
out
Out[122]: 
      id string_col_A string_col_B
3  x21ab       STR_X4       STR_Y4
5  x11aa       STR_X3       STR_Y3
0  x12ga       STR_X1       STR_Y1

User

#3 · Edited 2024-10-02 10:18:37

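You only need to group by the id column and find the most frequent value (the mode) within each group.

To simplify things, you can create another column, combined_str: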

df['combined_str'] = df['string_col_A'] + df['string_col_B']
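One caveat with plain concatenation is that two different pairs can produce the same combined string. The sketch below illustrates this with made-up values (not from the question) and shows that a separator, here an arbitrarily chosen `|` assumed not to occur in the data, keeps the pairs distinct.

```python
import pandas as pd

# Hypothetical values chosen to collide: "AB" + "C" and "A" + "BC" are both "ABC".
pairs = pd.DataFrame({
    "string_col_A": ["AB", "A"],
    "string_col_B": ["C", "BC"],
})

plain = pairs["string_col_A"] + pairs["string_col_B"]
with_sep = pairs["string_col_A"] + "|" + pairs["string_col_B"]

print(plain.nunique())     # 1: the two pairs are indistinguishable
print(with_sep.nunique())  # 2: the separator keeps them apart
```

Grouping on both columns directly avoids the issue without needing a separator at all.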

Group by `id` and reduce with the `pd.Series.mode` function:

df = df.sort_values(by=['creation_date'])
# Named aggregation with (column, func) tuples must be called on the DataFrame groupby, not on a selected column
df = df.groupby(['id']).agg(most_common=('combined_str', pd.Series.mode))
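Note that `pd.Series.mode` returns every value tied for the highest count, so an id with a tie (such as x21ab in the sample data) does not reduce to a single combination on its own. A small sketch of that behavior, using the two combined strings for x21ab; the variable names are illustrative:

```python
import pandas as pd

# For id x21ab, each combined string occurs exactly once, so both are modes.
tied = pd.Series(["STR_X5STR_Y5", "STR_X4STR_Y4"])
modes = tied.mode()

print(len(modes))  # 2
print(modes.tolist())
```

If a single row per id is required, the tie still has to be resolved separately, for example by creation_date.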
