My DataFrame has the following structure:
+------+------+
|    id| value|
+------+------+
| user0|   100|
| user1|   102|
| user0|   109|
| user2|   103|
| user1|   108|
| user0|   119|
| user0|   140|
| user0|   142|
+------+------+
I want to compute, for each row, the variance over that row and all preceding rows with the same id. To do that, I tried the following code:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w_vv = Window.partitionBy('id')
df = df.withColumn('variances',
                   F.round(F.var_pop("value"), 2).over(w_vv.rowsBetween(Window.unboundedPreceding, 0)))
This is the desired output:
+------+-------+---------------------------------------------+
|  User| value | variances                                   |
+------+-------+---------------------------------------------+
| user0| value1| -                                           |
| user1| value1| -                                           |
| user0| value2| variance(value2,value1)                     |
| user1| value2| variance(value2,value1)                     |
| user1| value3| variance(value3,value2,value1)              |
| user0| value3| variance(value3,value2,value1)              |
| user0| value4| variance(value4,value3,value2,value1)       |
| user0| value5| variance(value5,value4,value3,value2,value1)|
+------+-------+---------------------------------------------+
The same output with concrete numbers as an example:
+------+------+----------+
|  User| value| variances|
+------+------+----------+
| user0|     2|         -|
| user1|     4|         -|
| user0|     3|      0.25|
| user1|     3|      0.25|
| user1|     9|       6.9|
| user0|     7|       4.7|
| user0|     3|       3.7|
| user0|     4|         3|
+------+------+----------+
However, the code returns the following error:
grouping expressions sequence is empty, and '`timestamp`' is not an aggregate function.
Wrap '(var_pop(CAST(`value` AS DOUBLE)) AS `_w0`)' in windowing function(s) or wrap
'`timestamp`' in first() (or first_value) if you don't care which value you get.;;
I know that aggregate functions are normally used with groupBy, but I can't figure out how to write this so it works. Any ideas? Thanks.
You should attach the window to var_pop, not to round: