Transposing a PySpark DataFrame

Published 2024-06-25 05:19:40


How can I transpose the following PySpark DataFrame?

Here is the PySpark DataFrame:

+----+------+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------+--------+------+
|srab|srsbtp|avgm1|avgm2|avgm3|avgm4|avgm5|avgm6|avgm7|avgm8|avgm9|          avgm10|  avgm11|avgm12|
+----+------+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------+--------+------+
|2389|     D| null| null| null| null| null| null| null| null| null|            null|    null|  null|
|2389|     C| null| null| null| null| null| null| null| null| null|54674.1935483871|156820.0|  null|
+----+------+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------+--------+------+

I want to convert the DataFrame above into the table below.

Expected output:

srab    month   D        C
2389    avgm1   null    null
2389    avgm2   null    null
2389    avgm3   null    null
2389    avgm4   null    null
2389    avgm5   null    null
2389    avgm6   null    null
2389    avgm7   null    null
2389    avgm8   null    null
2389    avgm9   null    null
2389    avgm10  null    54674.19355
2389    avgm11  null    156820
2389    avgm12  null    null

2 answers

In Spark SQL, you can unpivot with union all and then pivot with conditional aggregation:

select srab, month,
    max(case when srsbtp = 'D' then value end) as d,
    max(case when srsbtp = 'C' then value end) as c
from (
    select srab, srsbtp, 'avgm1' as month, avgm1 as value from mytable
    union all select srab, srsbtp, 'avgm2', avgm2 from mytable
    union all select srab, srsbtp, 'avgm3', avgm3 from mytable
    ...
) t
group by srab, month
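
The remaining union all branches follow the same pattern, so the query text can be generated rather than typed out. Below is a minimal sketch in plain Python. The helper name build_transpose_sql is hypothetical, the 12-month range is taken from the avgm1..avgm12 columns in the question, and aliasing the unpivoted column as value is an assumption made for readability.

```python
# Hypothetical helper: builds the unpivot/pivot query for avgm1..avgm{n_months}.
# "mytable" and the column names come from the answer above; the function name
# and the "value" alias are assumptions made for illustration.
def build_transpose_sql(table="mytable", n_months=12):
    # One select per avgm column; each emits the column name as `month`.
    branches = [
        f"    select srab, srsbtp, 'avgm{i}' as month, avgm{i} as value from {table}"
        for i in range(1, n_months + 1)
    ]
    inner = "\n    union all\n".join(branches)
    return (
        "select srab, month,\n"
        "    max(case when srsbtp = 'D' then value end) as d,\n"
        "    max(case when srsbtp = 'C' then value end) as c\n"
        "from (\n"
        f"{inner}\n"
        ") t\n"
        "group by srab, month"
    )


query = build_transpose_sql()
print(query)
```

Once the DataFrame is registered as a temp view under the same table name, the resulting string could presumably be passed to spark.sql(query).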

First we can use stack to turn the avgm columns into rows, then use pivot to turn the srsbtp rows into columns.

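from pyspark.sql import functions as F

# Register the DataFrame so it can be queried with Spark SQL
# (assumes an existing SparkSession named `spark`).
df.createOrReplaceTempView('table')

# Build the argument list for stack(): alternating label / column pairs.
col_list = ', '.join(f"'avgm{i}', avgm{i}" for i in range(1, 13))
## col_list is a string
## "'avgm1', avgm1, 'avgm2', avgm2, 'avgm3', avgm3, 'avgm4', avgm4, 'avgm5', avgm5, 'avgm6', avgm6, 'avgm7', avgm7, 'avgm8', avgm8, 'avgm9', avgm9, 'avgm10', avgm10, 'avgm11', avgm11, 'avgm12', avgm12"

# stack() unpivots the 12 avgm columns into (month, value) rows;
# pivot('srsbtp') then turns the C / D rows into columns.
result = spark.sql(f"select srab, srsbtp, stack(12, {col_list}) as (month, value) from table") \
              .groupBy('srab', 'month') \
              .pivot('srsbtp') \
              .agg(F.sum('value')) \
              .orderBy('month')
result.show()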
+----+------+----------------+----+
|srab| month|               C|   D|
+----+------+----------------+----+
|2389| avgm1|            null|null|
|2389|avgm10|54674.1935483871|null|
|2389|avgm11|        156820.0|null|
|2389|avgm12|            null|null|
|2389| avgm2|            null|null|
|2389| avgm3|            null|null|
|2389| avgm4|            null|null|
|2389| avgm5|            null|null|
|2389| avgm6|            null|null|
|2389| avgm7|            null|null|
|2389| avgm8|            null|null|
|2389| avgm9|            null|null|
+----+------+----------------+----+
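
Note that the rows above are ordered lexicographically by month (avgm10 comes right after avgm1), while the expected output in the question is in numeric order. To make the two steps and the ordering concrete, here is a small plain-Python sketch on the sample data. It does not use Spark at all: None stands in for null, and the variable names and the sort_key helper are assumptions for illustration only.

```python
# Plain-Python illustration of "stack, then pivot" on the sample rows.
# No Spark involved; None stands in for null.
months = [f"avgm{i}" for i in range(1, 13)]

rows = [
    {"srab": 2389, "srsbtp": "D", **{m: None for m in months}},
    {"srab": 2389, "srsbtp": "C", **{m: None for m in months},
     "avgm10": 54674.1935483871, "avgm11": 156820.0},
]

# Step 1 (stack): one (srab, srsbtp, month, value) record per avgm column.
stacked = [
    (r["srab"], r["srsbtp"], m, r[m])
    for r in rows
    for m in months
]

# Step 2 (pivot): group by (srab, month), one output column per srsbtp value.
pivoted = {}
for srab, srsbtp, month, value in stacked:
    cell = pivoted.setdefault((srab, month), {"C": None, "D": None})
    if value is not None:
        # Add up non-null values; cells with only nulls stay None.
        cell[srsbtp] = (cell[srsbtp] or 0.0) + value


def sort_key(key):
    # Order by srab, then by the numeric suffix so avgm2 precedes avgm10.
    srab, month = key
    return (srab, int(month[len("avgm"):]))


for srab, month in sorted(pivoted, key=sort_key):
    cell = pivoted[(srab, month)]
    print(srab, month, cell["D"], cell["C"])
```

In the Spark version, a comparable ordering would likely need a sort on the numeric part of month rather than on the string itself.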
