I'm learning PySpark, and I used it to read a CSV file into a DataFrame:
>>> df = spark.read.option("header",True).csv('example.csv')
>>> df.show(n=4)
+-------+------+------+
|main_id|    id| price|
+-------+------+------+
|    100|aaaaa1|190000|
|    101| bbbbb|216000|
|    100|aaaaa2|276000|
|    100|aaaaa3|340000|
+-------+------+------+
only showing top 4 rows
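How can I group the data by the first column, main_id, collect id and price into an array for each group, and then convert the DataFrame to newline-delimited JSON? For example: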
{"main_id": "100", "items": [{"id": "aaaaa1", "price": 190000},{"id": "aaaaa2", "price": 276000},{"id": "aaaaa3", "price": 340000}]}
{"main_id": "101", "id": "bbbbb", "price": 216000}
...
...
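You can combine struct and groupby with collect_list: build a struct from id and price, group by main_id, and collect the structs into an array.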