How do I compute the average within a window whose range depends on a column's value?

Posted 2024-09-24 02:19:27


I have the following data:

columns = ['aircraft_id',      'Liftoff',    'timestamp',      'value']

l =[
(   '0003177d',1550000476500,1550000467000,  -80.15625),
(   '0003177d',1550000476500,1550000467500,  -80.15625),
(   '0003177d',1550000476500,1550000468000,  -80.15625),
(   '0003177d',1550000476500,1550000468500,  -80.15625),
(   '0003177d',1550000476500,1550000469000,-79.8046875),
(   '0003177d',1550000476500,1550000469500,-79.8046875),
(   '0003177d',1550000476500,1550000470000,-79.8046875),
(   '0003177d',1550000476500,1550000470500,-79.8046875),
(   '0003177d',1550000476500,1550000471000,-79.8046875),
(   '0003177d',1550000476500,1550000471500,-79.8046875),
(   '0003177d',1550000476500,1550000472000,  -80.15625),
(   '0003177d',1550000476500,1550000472500,-80.5078125),
(   '0003177d',1550000476500,1550000473000, -80.859375),
(   '0003177d',1550000476500,1550000473500, -80.859375),
(   '0003177d',1550000476500,1550000474000, -80.859375),
(   '0003177d',1550000476500,1550000474500, -80.859375),
(   '0003177d',1550000476500,1550000475000, -80.859375),
(   '0003177d',1550000476500,1550000475500, -80.859375),
(   '0003177d',1550000476500,1550000476000, -80.859375),
(   '0003177d',1550000476500,1550000476500,-80.5078125)]

df=spark.createDataFrame(l, columns)

df.show()
+-----------+-------------+-------------+-----------+
|aircraft_id|      Liftoff|    timestamp|      value|
+-----------+-------------+-------------+-----------+
|   0003177d|1550000476500|1550000467000|  -80.15625|
|   0003177d|1550000476500|1550000467500|  -80.15625|
|   0003177d|1550000476500|1550000468000|  -80.15625|
|   0003177d|1550000476500|1550000468500|  -80.15625|
|   0003177d|1550000476500|1550000469000|-79.8046875|
|   0003177d|1550000476500|1550000469500|-79.8046875|
|   0003177d|1550000476500|1550000470000|-79.8046875|
|   0003177d|1550000476500|1550000470500|-79.8046875|
|   0003177d|1550000476500|1550000471000|-79.8046875|
|   0003177d|1550000476500|1550000471500|-79.8046875|
|   0003177d|1550000476500|1550000472000|  -80.15625|
|   0003177d|1550000476500|1550000472500|-80.5078125|
|   0003177d|1550000476500|1550000473000| -80.859375|
|   0003177d|1550000476500|1550000473500| -80.859375|
|   0003177d|1550000476500|1550000474000| -80.859375|
|   0003177d|1550000476500|1550000474500| -80.859375|
|   0003177d|1550000476500|1550000475000| -80.859375|
|   0003177d|1550000476500|1550000475500| -80.859375|
|   0003177d|1550000476500|1550000476000| -80.859375|
|   0003177d|1550000476500|1550000476500|-80.5078125|
+-----------+-------------+-------------+-----------+

I want to compute the mean of value over a window whose range depends on the current timestamp and the Liftoff timestamp. The Liftoff value differs for each aircraft.

I tried:

from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn('val', F.mean('value').over(Window.partitionBy('aircraft_id','ini_TO','Liftoff').orderBy('timestamp').rangeBetween(df['timestamp'], df['Liftoff'])))

But it doesn't work. Is there a way to do this?

Expected result:

  • For the first row, the window runs from 1550000467000 to 1550000476500, so the mean is the sum of 20 values divided by 20 (-80.33203).
  • For the second row, the window runs from 1550000467500 to 1550000476500, so the mean is the sum of 19 values divided by 19 (-80.34128).
  • And so on (a quick sanity check of these numbers follows the table below).
+-----------+-------------+-------------+---------+---------+
|aircraft_id|      Liftoff|    timestamp|    value|      val|
+-----------+-------------+-------------+---------+---------+
|   0003177d|1550000476500|1550000467000|-80.15625|-80.33203|
|   0003177d|1550000476500|1550000467500|-80.15625|-80.34128|
|   0003177d|1550000476500|1550000468000|-80.15625|-80.35156|
|   0003177d|1550000476500|1550000468500|-80.15625|-80.36305|
|   0003177d|1550000476500|1550000469000|-79.80469|-80.37598|
|   0003177d|1550000476500|1550000469500|-79.80469|-80.41406|
|   0003177d|1550000476500|1550000470000|-79.80469|-80.45759|
|   0003177d|1550000476500|1550000470500|-79.80469|-80.50781|
|   0003177d|1550000476500|1550000471000|-79.80469|-80.56641|
|   0003177d|1550000476500|1550000471500|-79.80469|-80.63565|
|   0003177d|1550000476500|1550000472000|-80.15625|-80.71875|
|   0003177d|1550000476500|1550000472500|-80.50781|-80.78125|
|   0003177d|1550000476500|1550000473000|-80.85938|-80.81543|
|   0003177d|1550000476500|1550000473500|-80.85938|-80.80915|
|   0003177d|1550000476500|1550000474000|-80.85938|-80.80078|
|   0003177d|1550000476500|1550000474500|-80.85938|-80.78906|
|   0003177d|1550000476500|1550000475000|-80.85938|-80.77148|
|   0003177d|1550000476500|1550000475500|-80.85938|-80.74219|
|   0003177d|1550000476500|1550000476000|-80.85938|-80.68359|
|   0003177d|1550000476500|1550000476500|-80.50781|-80.50781|
+-----------+-------------+-------------+---------+---------+
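
A quick plain-Python sanity check of the first two expected averages, computed directly from the list l defined above (a sketch assuming the 20-row, single-aircraft sample shown):

# Recompute the first two expected means by hand from the sample list `l`.
values = [row[3] for row in l]
print(sum(values) / len(values))          # -80.33203125  (first row, 20 values)
print(sum(values[1:]) / len(values[1:]))  # -80.34128289473684  (second row, 19 values)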

Tags: columns, from, import, id, df, sql, value, time
1 Answer

User
#1 · Posted 2024-09-24 02:19:27

I think you are almost there. rangeBetween expects frame boundaries as offsets relative to the current row's ordering value (or the Window boundary constants), not per-row Column values, so you cannot pass df['timestamp'] and df['Liftoff'] directly. Instead, set the frame in the window spec to start at the current row (Window.currentRow) and extend to the end of the frame (Window.unboundedFollowing), like this:

Note: ini_TO is not present in the sample dataset, so it was dropped from partitionBy for testing.

wind_spec = Window.partitionBy('aircraft_id','Liftoff').orderBy('timestamp').rangeBetween(Window.currentRow, Window.unboundedFollowing)

The window spec above produces the desired output:

df.withColumn('val', F.mean('value').over(wind_spec)).show()
+-----------+-------------+-------------+-----------+------------------+
|aircraft_id|      Liftoff|    timestamp|      value|               val|
+-----------+-------------+-------------+-----------+------------------+
|   0003177d|1550000476500|1550000467000|  -80.15625|      -80.33203125|
|   0003177d|1550000476500|1550000467500|  -80.15625|-80.34128289473684|
|   0003177d|1550000476500|1550000468000|  -80.15625|       -80.3515625|
|   0003177d|1550000476500|1550000468500|  -80.15625|-80.36305147058823|
|   0003177d|1550000476500|1550000469000|-79.8046875|    -80.3759765625|
|   0003177d|1550000476500|1550000469500|-79.8046875|       -80.4140625|
|   0003177d|1550000476500|1550000470000|-79.8046875|-80.45758928571429|
|   0003177d|1550000476500|1550000470500|-79.8046875|       -80.5078125|
|   0003177d|1550000476500|1550000471000|-79.8046875|      -80.56640625|
|   0003177d|1550000476500|1550000471500|-79.8046875| -80.6356534090909|
|   0003177d|1550000476500|1550000472000|  -80.15625|         -80.71875|
|   0003177d|1550000476500|1550000472500|-80.5078125|         -80.78125|
|   0003177d|1550000476500|1550000473000| -80.859375|    -80.8154296875|
|   0003177d|1550000476500|1550000473500| -80.859375|-80.80915178571429|
|   0003177d|1550000476500|1550000474000| -80.859375|      -80.80078125|
|   0003177d|1550000476500|1550000474500| -80.859375|       -80.7890625|
|   0003177d|1550000476500|1550000475000| -80.859375|     -80.771484375|
|   0003177d|1550000476500|1550000475500| -80.859375|       -80.7421875|
|   0003177d|1550000476500|1550000476000| -80.859375|      -80.68359375|
|   0003177d|1550000476500|1550000476500|-80.5078125|       -80.5078125|
+-----------+-------------+-------------+-----------+------------------+
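
Note that Window.unboundedFollowing works here because Liftoff coincides with the last timestamp in each partition. If the data could also contain rows after Liftoff (not the case in this sample), one common workaround is to aggregate conditionally so that only rows up to Liftoff contribute to the mean; a minimal sketch under that assumption, reusing the same column names:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Frame from the current row to the end of the partition, as above.
w = (Window.partitionBy('aircraft_id', 'Liftoff')
           .orderBy('timestamp')
           .rangeBetween(Window.currentRow, Window.unboundedFollowing))

# Only rows with timestamp <= Liftoff produce a non-null value, and mean/avg
# ignores nulls, so the frame end effectively depends on the Liftoff column.
df.withColumn(
    'val',
    F.mean(F.when(F.col('timestamp') <= F.col('Liftoff'), F.col('value'))).over(w)
).show(truncate=False)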
