PySpark: accessing and exploding nested items of a JSON


I am very new to Spark. I am trying to parse a JSON file that contains data to aggregate, but I cannot navigate its contents. I looked for other solutions, but I did not find anything that works for my case.

Here is the schema of the DataFrame built from the imported JSON:

root
  |-- Urbandataset: struct (nullable = true)
  |    |-- context: struct (nullable = true)
  |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |-- format: string (nullable = true)
  |    |    |    |-- height: long (nullable = true)
  |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |-- longitude: double (nullable = true)
  |    |    |-- language: string (nullable = true)
  |    |    |-- producer: struct (nullable = true)
  |    |    |    |-- id: string (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |-- timeZone: string (nullable = true)
  |    |    |-- timestamp: string (nullable = true)
  |    |-- specification: struct (nullable = true)
  |    |    |-- id: struct (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |    |-- value: string (nullable = true)
  |    |    |-- name: string (nullable = true)
  |    |    |-- properties: struct (nullable = true)
  |    |    |    |-- propertyDefinition: array (nullable = true)
  |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |-- codeList: string (nullable = true)
  |    |    |    |    |    |-- dataType: string (nullable = true)
  |    |    |    |    |    |-- propertyDescription: string (nullable = true)
  |    |    |    |    |    |-- propertyName: string (nullable = true)
  |    |    |    |    |    |-- subProperties: struct (nullable = true)
  |    |    |    |    |    |    |-- propertyName: array (nullable = true)
  |    |    |    |    |    |    |    |-- element: string (containsNull = true)
  |    |    |    |    |    |-- unitOfMeasure: string (nullable = true)
  |    |    |-- uri: string (nullable = true)
  |    |    |-- version: string (nullable = true)
  |    |-- values: struct (nullable = true)
  |    |    |-- line: array (nullable = true)
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |    |    |-- format: string (nullable = true)
  |    |    |    |    |    |-- height: double (nullable = true)
  |    |    |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |    |    |-- longitude: double (nullable = true)
  |    |    |    |    |-- id: long (nullable = true)
  |    |    |    |    |-- period: struct (nullable = true)
  |    |    |    |    |    |-- end_ts: string (nullable = true)
  |    |    |    |    |    |-- start_ts: string (nullable = true)
  |    |    |    |    |-- property: array (nullable = true)
  |    |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |    |-- name: string (nullable = true)
  |    |    |    |    |    |    |-- val: string (nullable = true)

A subset of the whole JSON is attached here.

My goal is to retrieve the structure from this schema and to manipulate/aggregate the val fields located at values.line.element.property.element.val.
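Before exploding anything, the nested array can be reached with dot notation; a quick check like the following (assuming the DataFrame is named df, as in the answer below) confirms that the path exists:

df.select("Urbandataset.values.line").printSchema()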

I also tried to explode it, in order to get each field as a flat "csv-style" column, but I got an error:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'array(UrbanDataset.context, UrbanDataset.specification, UrbanDataset.values)' due to data type mismatch: input to function array should all be the same type

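The snippet that produced this error was lost during extraction. A hypothetical call of the following shape reproduces the same AnalysisException, because array() requires all of its inputs to share a single type, and the three top-level structs each have a different type:

from pyspark.sql import functions as F

# hypothetical reconstruction: array() rejects inputs of differing struct types
df.withColumn(
    "exploded",
    F.explode(F.array("Urbandataset.context",
                      "Urbandataset.specification",
                      "Urbandataset.values")),
)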

Thanks


1 Answer

You cannot access a nested array directly; you have to use explode first, which creates one row for each element of the array. Note that in this schema values is a struct, so the explode has to target the array inside it, values.line:

from pyspark.sql import functions as F

# "line" is the array inside the "values" struct; one row per element
df = df.withColumn("line", F.explode("Urbandataset.values.line"))
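Applied to the schema above, a sketch of the full chain down to val might look like this (the avg aggregation and the columns kept for context are illustrative choices, not part of the original answer):

from pyspark.sql import functions as F

# one row per line element
lines = df.select(F.explode("Urbandataset.values.line").alias("line"))

# one row per property element, keeping id and start_ts for context
props = lines.select(
    "line.id",
    "line.period.start_ts",
    F.explode("line.property").alias("property"),
)

# val is a string in the schema, so cast it before aggregating
props.groupBy("property.name") \
     .agg(F.avg(F.col("property.val").cast("double")).alias("avg_val")) \
     .show()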
