Spark: How to parse and transform a JSON string from Spark DataFrame rows

from pyspark.sql import Row
from pyspark.sql import functions as F

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'
jstr2 = '{"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}'

# sqlContext is assumed to already exist in the session
df = sqlContext.createDataFrame([Row(json=jstr1), Row(json=jstr2)])

# Infer the schema from the first row only
schema = F.schema_of_json(df.select(F.col("json")).take(1)[0].json)
df2 = df.withColumn('json', F.from_json(F.col('json'), schema))
df2.show()

Expected output:

+---+---+------+
| a | b | id   |
+---+---+------+
| 1 | 2 | id_1 |
| 3 | 4 | id_1 |
| 5 | 6 | id_2 |
| 7 | 8 | id_2 |
+---+---+------+

1 Answer

User

#1 · Posted 2024-10-03 15:31:06

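The second row is null because you inferred the schema from the first row only, and the second row has a different schema (its key is id_2, not id_1). Instead, you can parse the JSON as a MapType whose keys are strings and whose values are arrays of structs: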

schema = "map<string, array<struct<a:int,b:int>>>"

df = df.withColumn('json', F.from_json(F.col('json'), schema))

df.printSchema()
# root
#  |-- json: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: array (valueContainsNull = true)
#  |    |    |-- element: struct (containsNull = true)
#  |    |    |    |-- a: integer (nullable = true)
#  |    |    |    |-- b: integer (nullable = true)
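
As for why a schema inferred from the first row fails on the second, the effect can be imitated without Spark. The following is a rough plain-Python model, not Spark's implementation: a struct schema only knows the field names it was built from, so a row that lacks those fields comes back empty. The helper `parse_with_struct_fields` is hypothetical and exists only for this illustration.

```python
import json

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'
jstr2 = '{"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}'


def parse_with_struct_fields(json_str, field_names):
    """Hypothetical helper: keep only the named fields, None when absent."""
    parsed = json.loads(json_str)
    return {name: parsed.get(name) for name in field_names}


# A schema taken from the first row only knows the field "id_1".
fields_from_first_row = list(json.loads(jstr1).keys())

row1 = parse_with_struct_fields(jstr1, fields_from_first_row)
row2 = parse_with_struct_fields(jstr2, fields_from_first_row)

print(row1)  # the "id_1" array is found
print(row2)  # "id_1" is missing from the second row, so its value is None
```

A map type avoids this because the key names are treated as data rather than as part of the schema.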

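Then, with a few simple transformations, you can get the expected output:

The id column is the key of the map, which you can get with the map_keys function.
The structs <a:int, b:int> are the values, which you can get with the map_values function.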

output1 = df.withColumn("id", F.map_keys("json").getItem(0)) \
            .withColumn("json", F.map_values("json").getItem(0))

output1.show(truncate=False)

# +----------------+----+
# |json            |id  |
# +----------------+----+
# |[[1, 2], [3, 4]]|id_1|
# |[[5, 6], [7, 8]]|id_2|
# +----------------+----+

output2 = output1.withColumn("attr", F.explode("json")) \
    .select("id", "attr.*")

output2.show(truncate=False)

# +----+---+---+
# |id  |a  |b  |
# +----+---+---+
# |id_1|1  |2  |
# |id_1|3  |4  |
# |id_2|5  |6  |
# |id_2|7  |8  |
# +----+---+---+
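
To check the reshaping logic independently of Spark, the same three steps (parse into a map, split into key and value, explode the array) can be sketched in plain Python. This is only an illustration under the assumption that every value is a list of objects with fields a and b; the names `json_rows` and `flat_rows` are made up for the example.

```python
import json

json_rows = [
    '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}',
    '{"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}',
]

flat_rows = []
for json_str in json_rows:
    mapping = json.loads(json_str)        # parse the string into a key -> list mapping
    for key, structs in mapping.items():  # the key becomes id, the value is the array
        for struct in structs:            # one output row per array element
            flat_rows.append((key, struct["a"], struct["b"]))

for row in flat_rows:
    print(row)
```

Note that this loop visits every key in each JSON object, whereas getItem(0) in the Spark code above keeps only the first key and value of each map.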
