How do I group data by a column?

Posted 2024-10-03 09:07:39


I'm learning PySpark, and I'm using it to read a CSV file into a DataFrame:

>>> df = spark.read.option("header",True).csv('example.csv')
>>> df.show(n=4)
+-------+------+------+
|main_id|    id| price|
+-------+------+------+
|    100|aaaaa1|190000|
|    101| bbbbb|216000|
|    100|aaaaa2|276000|
|    100|aaaaa3|340000|
+-------+------+------+
only showing top 4 rows
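
Note: read this way (header only, no schema), every column, including price, comes back as a string. If numeric prices are wanted in the JSON output, as in the example below, one option (an assumption, not in the original post) is to let Spark infer the types:

>>> df = spark.read.option("header", True).option("inferSchema", True).csv('example.csv')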

How can I group the data on the first column, main_id, collect the id and price values into an array, and then convert the DataFrame to newline-delimited JSON? Something like:

{"main_id": "100", "items": [{"id": "aaaaa1", "price": 190000},{"id": "aaaaa2", "price": 276000},{"id": "aaaaa3", "price": 340000}]}
{"main_id": "101", "id": "bbbbb", "price": 216000}
...
...

Tags: file, csv, data, id, df, read, main, price
1 answer
User
#1 · Posted 2024-10-03 09:07:39

You can use struct together with groupby and collect_list, like this:

from pyspark.sql import functions as f

# Wrap id and price into a single struct column, then collect one
# array of {id, price} structs per main_id.
df.select(f.col("main_id"), f.struct(f.col("id"), f.col("price")).alias("items"))\
    .groupby("main_id")\
    .agg(f.collect_list("items").alias("items"))

Or:

from pyspark.sql import functions as f

# Collect ids and prices into two parallel arrays, then zip them
# element-wise into an array of {id, price} structs.
df.groupby("main_id") \
    .agg(f.collect_list("id").alias("id"), f.collect_list("price").alias("price"))\
    .select("main_id", f.arrays_zip(f.col("id"), f.col("price")).alias("items"))

Output:

{"main_id":100,"items":[{"id":"aaaaa1","col2":{"price":190000}},{"id":"aaaaa2","col2":{"price":276000}},{"id":"aaaaa3","col2":{"price":340000}}]}
{"main_id":101,"items":[{"id":"bbbbb","col2":{"price":216000}}]}
