Inferring a schema from a JSON string

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

cSchema = StructType([
    StructField("id1", StringType()),
    StructField("id2", StringType()),
    StructField("params", StringType()),
    StructField("Col2", IntegerType()),
])
test_list = [
    [1, 2, '{"param1": "val1", "param2": "val2"}', 1],
    [1, 3, '{"param1": "val4", "param2": "val5"}', 3],
]
df = spark.createDataFrame(test_list, schema=cSchema)
df.show()

+---+---+--------------------+----+
|id1|id2|              params|Col2|
+---+---+--------------------+----+
|  1|  2|{"param1": "val1"...|   1|
|  1|  3|{"param1": "val4"...|   3|
+---+---+--------------------+----+

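from pyspark.sql.functions import col, from_json

# Hand-written schema for the JSON stored in the "params" column
schema2 = StructType([
    StructField("param1", StringType()),
    StructField("param2", StringType()),
])

df.withColumn(
    "params", from_json("params", schema2)
).select(
    col("id1"), col("id2"), col("Col2"), col("params.*")
).show()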

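2 answers

User

Answer #1 · edited 2024-09-27 21:26:34

Here is how it can be done in Scala; hopefully you can adapt it to Python.

Use schema_of_json on a sample value to derive the schema dynamically, then parse the column with from_json.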

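import org.apache.spark.sql.functions.{from_json, schema_of_json}
import spark.implicits._  // enables the $"..." column syntax

// Derive the schema from the JSON string in the first row
val schema = schema_of_json(df.first().getAs[String]("params"))
df.withColumn("params", from_json($"params", schema))
  .select("id1", "id2", "Col2", "params.*")
  .show(false)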

User

Answer #2 · edited 2024-09-27 21:26:34

In PySpark, the syntax would be:

import pyspark.sql.functions as F
schema = F.schema_of_json(df.select('params').head()[0])

df2 = df.withColumn(
  "params", F.from_json("params", schema)
).select(
  'id1', 'id2', 'Col2', 'params.*'
)

df2.show()
+---+---+----+------+------+
|id1|id2|Col2|param1|param2|
+---+---+----+------+------+
|  1|  2|   1|  val1|  val2|
|  1|  3|   3|  val4|  val5|
+---+---+----+------+------+
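
One caveat with sampling only the first row: schema_of_json sees just the keys present in that one string, so a key that appears only in later rows is not part of the schema and from_json will leave it out. The pure-Python sketch below only illustrates that effect; it is not how Spark infers schemas. The helpers infer_fields and merge_fields, the coarse type names, and the extra param3 key are all hypothetical, made up for this example.

```python
import json

def infer_fields(json_str):
    """Map each top-level key of one JSON object to a coarse type name.

    Simplification: nested objects, arrays and nulls all fall back to "string".
    """
    names = {str: "string", bool: "boolean", int: "bigint", float: "double"}
    parsed = json.loads(json_str)
    return {key: names.get(type(value), "string") for key, value in parsed.items()}

def merge_fields(json_strings):
    """Union the fields seen across all samples; the first-seen type wins."""
    merged = {}
    for s in json_strings:
        for key, type_name in infer_fields(s).items():
            merged.setdefault(key, type_name)
    return merged

rows = [
    '{"param1": "val1", "param2": "val2"}',
    '{"param1": "val4", "param2": "val5", "param3": 7}',  # hypothetical extra key
]

first_row_only = infer_fields(rows[0])  # what sampling a single row can see
all_rows = merge_fields(rows)           # what scanning every row can see

print(first_row_only)  # {'param1': 'string', 'param2': 'string'}
print(all_rows)        # {'param1': 'string', 'param2': 'string', 'param3': 'bigint'}
```

If the JSON keys can vary between rows, one option in Spark itself is to let the JSON reader infer the schema over the whole column, for example spark.read.json(df.rdd.map(lambda r: r.params)).schema, and pass that schema to from_json. This costs an extra pass over the data, so it is a trade-off rather than a drop-in improvement.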
