PySpark数据帧上分组数据的Pandasstyle变换

nameToMean = {...} f = lambda category, value: value - nameToMean[category] categoryDemeaned = pyspark.sql.functions.udf(f, pyspark.sql.types.DoubleType()) df = df.withColumn("DemeanedValue", categoryDemeaned(df.Category, df.Value))

3条回答

网友

1楼 · 编辑于 2024-09-28 03:18:31

您可以使用Window来执行此操作

即

import pyspark.sql.functions as F
from pyspark.sql.window import Window

window_var = Window().partitionBy('Categroy')
df = df.withColumn('DemeanedValues', F.col('Values') - F.mean('Values').over(window_var))

网友

2楼 · 编辑于 2024-09-28 03:18:31

实际上，在Spark中有一种惯用的方法来实现这一点，使用HiveOVER表达式。在

即

df.registerTempTable('df')
with_category_means = sqlContext.sql('select *, mean(Values) OVER (PARTITION BY Category) as category_mean from df')

在发动机罩下，这是使用车窗功能。不过，我不确定这是否比你的解决方案快

网友

3楼 · 编辑于 2024-09-28 03:18:31

I understand, each category requires a full scan of the DataFrame.

不，它不是。数据帧聚合是使用类似于aggregateByKey的逻辑执行的。请参阅DataFrame groupBy behaviour/optimization较慢的部分是join，它需要排序/洗牌。但它仍然不需要每个组扫描。在

如果这是一个确切的代码，那么使用它是很慢的，因为您没有提供连接表达式。因为它只是执行笛卡尔积。因此，它不仅效率低下，而且是不正确的。你想要这样的东西：

from pyspark.sql.functions import col

means = df.groupBy("Category").mean("Values").alias("means")
df.alias("df").join(means, col("df.Category") == col("means.Category"))

I think (but have not verified) that I can speed this up a great deal if I collect the result of the group-by/mean into a dictionary, and then use that dictionary in a UDF

这是可能的，尽管性能会因具体情况而有所不同。使用Python udf的一个问题是它必须在Python之间来回移动数据。不过，这绝对值得一试。但是，您应该考虑为nameToMean使用广播变量。在

Is there an idiomatic way to express this type of operation without sacrificing performance?

在PySpark 1.6中，您可以使用broadcast函数：

^{pr2}$

但在<；=1.5中不可用。在

相关问题更多 >

编程相关推荐

热门问题

热门文章