Spark: remove trailing nulls from from_json, or get only the value from the JSON

df = spark.createDataFrame(
    [
        (1, '{"a": "hello"}'),
        (2, '{"b": ["foo", "bar"]}'),
        (3, '{"c": {"cc": "baz"}}'),
        (4, '{"d": [{"dd": "foo"}, {"dd": "bar"}]}'),
    ],
    schema=['id', 'jsonData'],
)
df.show()
+---+--------------------+
| id|            jsonData|
+---+--------------------+
|  1|      {"a": "hello"}|
|  2|{"b": ["foo", "ba...|
|  3|{"c": {"cc": "baz"}}|
|  4|{"d": [{"dd": "fo...|
+---+--------------------+

from pyspark.sql.functions import from_json

# Infer one schema from all JSON strings, then parse every row with it.
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df = df.withColumn("jsonParsedData", from_json("jsonData", json_schema))
df.show()
+---+--------------------+--------------------+
| id|            jsonData|      jsonParsedData|
+---+--------------------+--------------------+
|  1|      {"a": "hello"}|          [hello,,,]|
|  2|{"b": ["foo", "ba...|    [, [foo, bar],,]|
|  3|{"c": {"cc": "baz"}}|         [,, [baz],]|
|  4|{"d": [{"dd": "fo...|[,,, [[foo], [bar]]]|
+---+--------------------+--------------------+

Desired output:

+---+--------------------+--------------------+
| id|            jsonData|      jsonParsedData|
+---+--------------------+--------------------+
|  1|      {"a": "hello"}|               hello|
|  2|{"b": ["foo", "ba...|          [foo, bar]|
|  3|{"c": {"cc": "baz"}}|       {"cc": "baz"}|
|  4|{"d": [{"dd": "fo...|[{"dd": "foo"}, {...|
+---+--------------------+--------------------+

1 answer

User

Answer #1 · posted 2024-10-02 00:38:01

Try using regexp_extract to pull the value out of the JSON string:

import pyspark.sql.functions as F

df2 = df.withColumn('jsonParsedData', F.regexp_extract('jsonData', '\\{"[^"]+": (.*)\\}', 1))

df2.show(truncate=False)
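+---+-------------------------------------+------------------------------+
|id |jsonData                             |jsonParsedData                |
+---+-------------------------------------+------------------------------+
|1  |{"a": "hello"}                       |"hello"                       |
|2  |{"b": ["foo", "bar"]}                |["foo", "bar"]                |
|3  |{"c": {"cc": "baz"}}                 |{"cc": "baz"}                 |
|4  |{"d": [{"dd": "foo"}, {"dd": "bar"}]}|[{"dd": "foo"}, {"dd": "bar"}]|
+---+-------------------------------------+------------------------------+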
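
To see what the pattern above captures and where it stops working, here is a minimal sketch in plain Python using the standard `re` module rather than Spark. The helper name `extract_value` is hypothetical. It returns an empty string on a failed match, which is how regexp_extract behaves as far as I know.

```python
import re

# Same pattern as in the regexp_extract call above, written as a raw string.
PATTERN = re.compile(r'\{"[^"]+": (.*)\}')


def extract_value(json_text: str) -> str:
    """Return capture group 1, or '' when the pattern does not match."""
    match = PATTERN.search(json_text)
    return match.group(1) if match else ""


# Single top-level key with ": " after it: the value is captured as raw text.
print(extract_value('{"a": "hello"}'))        # "hello"  (quotes are kept)
print(extract_value('{"c": {"cc": "baz"}}'))  # {"cc": "baz"}

# No space after the colon: the pattern does not match at all.
print(repr(extract_value('{"a":"hello"}')))   # ''

# Two top-level keys: the greedy (.*) swallows everything up to the last brace.
print(extract_value('{"a": 1, "b": 2}'))      # 1, "b": 2
```

So the regex approach depends on each string having exactly one top-level key and the exact `": "` spacing shown in the question. String values also keep their surrounding quotes.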

Another, possibly better, approach is to use from_json with a map<string,string> schema:

import pyspark.sql.functions as F

df2 = df.withColumn('jsonParsedData', F.map_values(F.from_json('jsonData', 'map<string,string>'))[0])

df2.show(truncate=False)
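+---+-------------------------------------+---------------------------+
|id |jsonData                             |jsonParsedData             |
+---+-------------------------------------+---------------------------+
|1  |{"a": "hello"}                       |hello                      |
|2  |{"b": ["foo", "bar"]}                |["foo","bar"]              |
|3  |{"c": {"cc": "baz"}}                 |{"cc":"baz"}               |
|4  |{"d": [{"dd": "foo"}, {"dd": "bar"}]}|[{"dd":"foo"},{"dd":"bar"}]|
+---+-------------------------------------+---------------------------+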
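
For comparison, the same "take the first value and keep nested values as compact JSON text" idea can be sketched in plain Python with the standard `json` module. This is not part of the answer above. The function name `first_value` is hypothetical, and returning None for invalid or empty input is an assumption made for this sketch.

```python
import json
from typing import Optional


def first_value(json_text: Optional[str]) -> Optional[str]:
    """Return the first top-level value of a JSON object as a string.

    Plain strings are returned unquoted; nested arrays and objects are
    re-serialized as compact JSON. Invalid or empty input yields None.
    """
    if json_text is None:
        return None
    try:
        parsed = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not parsed:
        return None
    value = next(iter(parsed.values()))
    if value is None:
        return None
    if isinstance(value, str):
        return value
    # Compact separators give output without spaces after ',' and ':'.
    return json.dumps(value, separators=(",", ":"), ensure_ascii=False)


print(first_value('{"a": "hello"}'))         # hello
print(first_value('{"b": ["foo", "bar"]}'))  # ["foo","bar"]
print(first_value('{"c": {"cc": "baz"}}'))   # {"cc":"baz"}
```

A function like this could be wrapped with `F.udf(first_value, "string")` and applied to the jsonData column. Python UDFs are generally slower than built-in functions, so the from_json and map_values version is likely the better default. The UDF mainly helps when extra handling is needed, such as the malformed-row case here.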
