PySpark: accessing and exploding nested items of a JSON


I am very new to Spark. I am trying to parse a JSON file that contains data to aggregate, but I cannot navigate its contents. I looked for other solutions, but I did not find anything that works for my case.

Here is the schema of the DataFrame built from the imported JSON:

root
  |-- Urbandataset: struct (nullable = true)
  |    |-- context: struct (nullable = true)
  |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |-- format: string (nullable = true)
  |    |    |    |-- height: long (nullable = true)
  |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |-- longitude: double (nullable = true)
  |    |    |-- language: string (nullable = true)
  |    |    |-- producer: struct (nullable = true)
  |    |    |    |-- id: string (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |-- timeZone: string (nullable = true)
  |    |    |-- timestamp: string (nullable = true)
  |    |-- specification: struct (nullable = true)
  |    |    |-- id: struct (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |    |-- value: string (nullable = true)
  |    |    |-- name: string (nullable = true)
  |    |    |-- properties: struct (nullable = true)
  |    |    |    |-- propertyDefinition: array (nullable = true)
  |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |-- codeList: string (nullable = true)
  |    |    |    |    |    |-- dataType: string (nullable = true)
  |    |    |    |    |    |-- propertyDescription: string (nullable = true)
  |    |    |    |    |    |-- propertyName: string (nullable = true)
  |    |    |    |    |    |-- subProperties: struct (nullable = true)
  |    |    |    |    |    |    |-- propertyName: array (nullable = true)
  |    |    |    |    |    |    |    |-- element: string (containsNull = true)
  |    |    |    |    |    |-- unitOfMeasure: string (nullable = true)
  |    |    |-- uri: string (nullable = true)
  |    |    |-- version: string (nullable = true)
  |    |-- values: struct (nullable = true)
  |    |    |-- line: array (nullable = true)
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |    |    |-- format: string (nullable = true)
  |    |    |    |    |    |-- height: double (nullable = true)
  |    |    |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |    |    |-- longitude: double (nullable = true)
  |    |    |    |    |-- id: long (nullable = true)
  |    |    |    |    |-- period: struct (nullable = true)
  |    |    |    |    |    |-- end_ts: string (nullable = true)
  |    |    |    |    |    |-- start_ts: string (nullable = true)
  |    |    |    |    |-- property: array (nullable = true)
  |    |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |    |-- name: string (nullable = true)
  |    |    |    |    |    |    |-- val: string (nullable = true)

A subset of the whole JSON is attached here.

My goal is to retrieve the structure from this schema and to manipulate/aggregate the val fields located at values.line.element.property.element.val.
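Before exploding anything, the nested array can be reached with dot notation; a quick check like the following (assuming the DataFrame is named df, as in the answer below) confirms that the path exists:

df.select("Urbandataset.values.line").printSchema()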

I also tried to explode it, in order to get each field as a flat "csv-style" column, but I got an error:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'array(UrbanDataset.context, UrbanDataset.specification, UrbanDataset.values)' due to data type mismatch: input to function array should all be the same type

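The snippet that produced this error was lost during extraction. A hypothetical call of the following shape reproduces the same AnalysisException, because array() requires all of its inputs to share a single type, and the three top-level structs each have a different type:

from pyspark.sql import functions as F

# hypothetical reconstruction: array() rejects inputs of differing struct types
df.withColumn(
    "exploded",
    F.explode(F.array("Urbandataset.context",
                      "Urbandataset.specification",
                      "Urbandataset.values")),
)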

Thanks


1 Answer

You cannot access a nested array directly; you have to use explode first, which creates one row for each element of the array. Note that in this schema values is a struct, so the explode has to target the array inside it, values.line:

from pyspark.sql import functions as F

# "line" is the array inside the "values" struct; one row per element
df = df.withColumn("line", F.explode("Urbandataset.values.line"))
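Applied to the schema above, a sketch of the full chain down to val might look like this (the avg aggregation and the columns kept for context are illustrative choices, not part of the original answer):

from pyspark.sql import functions as F

# one row per line element
lines = df.select(F.explode("Urbandataset.values.line").alias("line"))

# one row per property element, keeping id and start_ts for context
props = lines.select(
    "line.id",
    "line.period.start_ts",
    F.explode("line.property").alias("property"),
)

# val is a string in the schema, so cast it before aggregating
props.groupBy("property.name") \
     .agg(F.avg(F.col("property.val").cast("double")).alias("avg_val")) \
     .show()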
