How can I improve the performance of groupBy and aggregate in PySpark?

Posted 2024-09-28 22:22:01

Male | A programmer who enjoys writing Python code.

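I am trying to convert a DataFrame to an RDD and then aggregate by key, where key = (AccountKey, x) and the aggregates are the maximum of column 1, the maximum of column 2, and the maximum of column 3. Unfortunately, I still don't understand how to make that work. I can do the same thing on the DataFrame, but performance is poor because of the shuffle (I tried repartitioning, but it didn't help). How can I improve the performance?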

Here is the code I use to run the groupBy and aggregate on the DataFrame:

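from pyspark.sql import functions as F

def operation_xy(df):
    # Group by the account key and x; the sample rows name the column AccountKey
    group_cols = ['AccountKey', 'x']
    # One max() per flag column; the alias keeps the original column name
    exprs = [F.max(F.col(c)).alias(c) for c in ['1', '2', '3']]

    return df.groupBy(group_cols).agg(*exprs)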

Here is my input:

[Row(AccountKey='5878', x=32.0, 1=False, 2=False, 3=False)]
[Row(AccountKey='5178', x=24.0, 1=False, 2=False, 3=True)]
[Row(AccountKey='5178', x=24.0, 1=False, 2=True, 3=False)]
[Row(AccountKey='5178', x=32.0, 1=False, 2=False, 3=False)]
[Row(AccountKey='5878', x=32.0, 1=True, 2=False, 3=True)]

Expected output:

[Row(AccountKey='5878', x=32.0, 1=True, 2=False, 3=True)]
[Row(AccountKey='5178', x=24.0, 1=False, 2=True, 3=True)]
[Row(AccountKey='5178', x=32.0, 1=False, 2=False, 3=False)]
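
As a rough illustration of the aggregation described in the question, the sketch below computes the per-key element-wise maximum over the sample rows in plain Python, without Spark. The `Record` type, the column names `c1`, `c2`, `c3`, the helper functions, and the split into two partitions are all assumptions made for this example; none of them come from the original code.

```python
from collections import namedtuple

# Hypothetical stand-in for the Spark rows in the question; the columns
# named 1, 2, 3 are renamed c1, c2, c3 so they are valid Python identifiers.
Record = namedtuple("Record", ["AccountKey", "x", "c1", "c2", "c3"])

rows = [
    Record("5878", 32.0, False, False, False),
    Record("5178", 24.0, False, False, True),
    Record("5178", 24.0, False, True, False),
    Record("5178", 32.0, False, False, False),
    Record("5878", 32.0, True, False, True),
]


def merge_max(a, b):
    """Element-wise maximum of two equally long tuples of values."""
    return tuple(max(pair) for pair in zip(a, b))


def combine(pairs):
    """Reduce (key, values) pairs to one values tuple per key."""
    combined = {}
    for key, values in pairs:
        combined[key] = merge_max(combined[key], values) if key in combined else values
    return combined


def to_pair(row):
    """Turn a row into ((AccountKey, x), (c1, c2, c3))."""
    return (row.AccountKey, row.x), (row.c1, row.c2, row.c3)


# Assumed split of the rows into two partitions, for illustration only.
partitions = [rows[:3], rows[3:]]

# Stage 1: combine inside each partition, before any data would move.
partials = [combine(map(to_pair, part)) for part in partitions]

# Stage 2: merge the per-partition results by key.
result = combine(pair for partial in partials for pair in partial.items())

for (account_key, x), flags in sorted(result.items()):
    print(account_key, x, flags)
```

A key that appears in only one input row, such as ('5178', 32.0), comes out with that row's values unchanged. In PySpark, a function like `merge_max` could be passed to `RDD.reduceByKey`, which also combines values inside each partition before the shuffle. The DataFrame `groupBy(...).agg(...)` path generally performs a similar partial aggregation, so converting to an RDD is not guaranteed to reduce the shuffle volume; `df.explain()` shows the plan that is actually used.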

I'm a Spark beginner, so please be gentle :-)


0 answers

No answers yet.
