Converting a PySpark DataFrame to a dictionary: result differs from what I expected

Published 2024-06-01 19:07:54


Suppose I have the following PySpark DataFrame:

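from pyspark.sql import SparkSession

# Reuse the active session, or start one if none exists
spark = SparkSession.builder.getOrCreate()

data = [
    ("USA", 20, 40, 60),
    ("India", 50, 40, 30),
    ("Nepal", 20, 50, 30),
    ("Ireland", 40, 60, 70),
    ("Norway", 50, 50, 60),
]

columns = ["country", "A", "B", "C"]

df = spark.createDataFrame(data=data, schema=columns)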

To build a dictionary from it, I used the following approach:

# Each Row becomes a dict that contains every column, including 'country'
list_test = [row.asDict() for row in df.collect()]
dict_test = {country['country']: country for country in list_test}

This is the result:

{'USA': {'country': 'USA', 'A': 20, 'B': 40, 'C': 60}, 'India': {'country': 'India', 'A': 50, 'B': 40, 'C': 30}, 'Nepal': {'country': 'Nepal', 'A': 20, 'B': 50, 'C': 30}, 'Ireland': {'country': 'Ireland', 'A': 40, 'B': 60, 'C': 70}, 'Norway': {'country': 'Norway', 'A': 50, 'B': 50, 'C': 60}}

However, what I want is:

{'USA': {'A': 20, 'B': 40, 'C': 60}, 'India': {'A': 50, 'B': 40, 'C': 30}, 'Nepal': {'A': 20, 'B': 50, 'C': 30}, 'Ireland': {'A': 40, 'B': 60, 'C': 70}, 'Norway': {'A': 50, 'B': 50, 'C': 60}}

How can I get this? I'm not sure I understand what I did wrong.


2 Answers

You can use a nested dict comprehension to drop the unwanted key:

list_test = [row.asDict() for row in df.collect()]
dict_test = {country['country']: {k:v for k,v in country.items() if k != 'country'} for country in list_test}

print(dict_test)
#{'USA': {'A': 20, 'B': 40, 'C': 60}, 'India': {'A': 50, 'B': 40, 'C': 30}, 'Nepal': {'A': 20, 'B': 50, 'C': 30}, 'Ireland': {'A': 40, 'B': 60, 'C': 70}, 'Norway': {'A': 50, 'B': 50, 'C': 60}}
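If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: `rows_to_nested_dict` is a hypothetical name, not part of PySpark, and the plain dicts below stand in for what `row.asDict()` returns so the snippet runs without a Spark session.

```python
def rows_to_nested_dict(rows, key):
    """Map each row's `key` value to a dict of its remaining fields.

    Hypothetical helper for illustration; `rows` is any iterable of dicts.
    """
    result = {}
    for row in rows:
        fields = dict(row)        # copy, so the caller's row is not mutated
        name = fields.pop(key)    # remove the key column and keep its value
        result[name] = fields
    return result


# Plain dicts standing in for [row.asDict() for row in df.collect()]
list_test = [
    {"country": "USA", "A": 20, "B": 40, "C": 60},
    {"country": "India", "A": 50, "B": 40, "C": 30},
]

dict_test = rows_to_nested_dict(list_test, "country")
print(dict_test)
# {'USA': {'A': 20, 'B': 40, 'C': 60}, 'India': {'A': 50, 'B': 40, 'C': 30}}
```

With a real DataFrame you would pass `(row.asDict() for row in df.collect())` as `rows`. Note that if two rows share the same key value, the later row overwrites the earlier one.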

Another approach is to collect a JSON string directly from the DataFrame after a few transformations, then use json.loads to get the dict object:

import json

from pyspark.sql.functions import to_json, collect_list, struct, map_from_arrays

dict_test = json.loads(
    df.groupBy().agg(
        collect_list("country").alias("countries"),
        collect_list(struct("A", "B", "C")).alias("values")
    ).select(
        to_json(map_from_arrays("countries", "values")).alias("json_str")
    ).collect()[0].json_str
)

print(dict_test)

#{'USA': {'A': 20, 'B': 40, 'C': 60}, 'India': {'A': 50, 'B': 40, 'C': 30}, 'Nepal': {'A': 20, 'B': 50, 'C': 30}, 'Ireland': {'A': 40, 'B': 60, 'C': 70}, 'Norway': {'A': 50, 'B': 50, 'C': 60}}
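To see what that pipeline is doing, here is a rough plain-Python analogue, not the Spark code itself. The two lists are assumed stand-ins for the outputs of the two `collect_list` calls, `dict(zip(...))` plays the role of `map_from_arrays`, and `json.dumps` plays the role of `to_json`.

```python
import json

# Stand-in for collect_list("country")
countries = ["USA", "India"]
# Stand-in for collect_list(struct("A", "B", "C"))
values = [
    {"A": 20, "B": 40, "C": 60},
    {"A": 50, "B": 40, "C": 30},
]

# Pair the i-th key with the i-th value, then serialize to a JSON string
json_str = json.dumps(dict(zip(countries, values)))

# Back on the driver, parse the string into an ordinary Python dict
dict_test = json.loads(json_str)
print(dict_test)
```

One difference to keep in mind: `dict(zip(...))` silently keeps the last value for a duplicate key, which may not match how Spark handles duplicate map keys, so this is only meant to illustrate the shape of the data.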
