Computing variance over a window in Spark

Posted 2024-09-27 07:22:19

My DataFrame is structured as follows:

+------------------+
|   id  |     value|
+------------------+
|  user0|     100  |
|  user1|     102  |
|  user0|     109  |
|  user2|     103  |
|  user1|     108  |
|  user0|     119  |
|  user0|     140  |
|  user0|     142  |
+------------------+

For each id, I want to compute the variance over each row and the rows that precede it. To do that, I tried the following code:

from pyspark.sql import Window
import pyspark.sql.functions as F

w_vv = Window.partitionBy('id')
df = df.withColumn('variances', F.round(F.var_pop("value"), 2).over(w_vv.rowsBetween(Window.unboundedPreceding, 0)))

This is the desired output:

+--------------------------------------------------------------+
|   User|  value|                                     variances|
+--------------------------------------------------------------+
|  user0| value1|         -                                    |
|  user1| value1|         -                                    |
|  user0| value2|  variance(value2,value1)                     |
|  user1| value2|  variance(value2,value1)                     |
|  user1| value3|  variance(value3,value2,value1)              |
|  user0| value3|  variance(value3,value2,value1)              |
|  user0| value4|  variance(value4,value3,value2,value1)       |
|  user0| value5|  variance(value5,value4,value3,value2,value1)|
+--------------------------------------------------------------+

The same output with example numbers:

+---------------------------+
|   User|  value|  variances|
+---------------------------+
|  user0|      2|          -|
|  user1|      4|          -|
|  user0|      3|       0.25|
|  user1|      3|       0.25|
|  user1|      9|        6.9|
|  user0|      7|        4.7|
|  user0|      3|        3.7|
|  user0|      4|          3|
+---------------------------+
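
To make the numbers in the table above easy to check, here is a minimal plain-Python sketch of the same calculation: a running population variance per id, taken in row order. The function name `running_population_variance` is made up for illustration, and the "-" for the first row of each id is represented as `None`. It rounds to two decimals as in the attempted code, so some values carry one more digit than the table shows.

```python
from collections import defaultdict
from statistics import pvariance

# (id, value) pairs in the same order as the numeric example table.
rows = [
    ("user0", 2), ("user1", 4), ("user0", 3), ("user1", 3),
    ("user1", 9), ("user0", 7), ("user0", 3), ("user0", 4),
]


def running_population_variance(rows):
    """Return (id, value, variance) where variance covers all rows seen so far for that id."""
    seen = defaultdict(list)  # id -> values seen so far, in row order
    out = []
    for key, value in rows:
        seen[key].append(value)
        history = seen[key]
        # A single value has no meaningful spread; the table shows "-" for it.
        variance = round(pvariance(history), 2) if len(history) > 1 else None
        out.append((key, value, variance))
    return out


for row in running_population_variance(rows):
    print(row)
```

The results agree with the table once rounded to one decimal, which suggests the intended statistic is the population variance (`var_pop`) rather than the sample variance.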

However, the code returns the following error:

grouping expressions sequence is empty, and '`timestamp`' is not an aggregate function.  
Wrap '(var_pop(CAST(`value` AS DOUBLE)) AS `_w0`)' in windowing function(s) or wrap  
'`timestamp`' in first() (or first_value) if you don't care which value you get.;;

I know aggregate functions are meant to be used with groupBy, but I don't know how to write the code so that it works. Any ideas? Thanks.

