I'm new to Spark. I'm trying to parse a JSON file containing data to aggregate, but I can't navigate its contents. I've looked at other solutions, but none of them works for my case.
Here is the schema of the DataFrame produced by importing the JSON:
root
|-- Urbandataset: struct (nullable = true)
| |-- context: struct (nullable = true)
| | |-- coordinates: struct (nullable = true)
| | | |-- format: string (nullable = true)
| | | |-- height: long (nullable = true)
| | | |-- latitude: double (nullable = true)
| | | |-- longitude: double (nullable = true)
| | |-- language: string (nullable = true)
| | |-- producer: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- schemeID: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| | |-- timestamp: string (nullable = true)
| |-- specification: struct (nullable = true)
| | |-- id: struct (nullable = true)
| | | |-- schemeID: string (nullable = true)
| | | |-- value: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- properties: struct (nullable = true)
| | | |-- propertyDefinition: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- codeList: string (nullable = true)
| | | | | |-- dataType: string (nullable = true)
| | | | | |-- propertyDescription: string (nullable = true)
| | | | | |-- propertyName: string (nullable = true)
| | | | | |-- subProperties: struct (nullable = true)
| | | | | | |-- propertyName: array (nullable = true)
| | | | | | | |-- element: string (containsNull = true)
| | | | | |-- unitOfMeasure: string (nullable = true)
| | |-- uri: string (nullable = true)
| | |-- version: string (nullable = true)
| |-- values: struct (nullable = true)
| | |-- line: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- coordinates: struct (nullable = true)
| | | | | |-- format: string (nullable = true)
| | | | | |-- height: double (nullable = true)
| | | | | |-- latitude: double (nullable = true)
| | | | | |-- longitude: double (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- period: struct (nullable = true)
| | | | | |-- end_ts: string (nullable = true)
| | | | | |-- start_ts: string (nullable = true)
| | | | |-- property: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- name: string (nullable = true)
| | | | | | |-- val: string (nullable = true)
A subset of the whole JSON is attached here.
My goal is to retrieve the values struct from this schema and manipulate/aggregate the values located at line.element.property.element.val.
I also tried exploding it, to get every field in a "CSV-style" column, but I get an error:
pyspark.sql.utils.AnalysisException: u"cannot resolve
'array(UrbanDataset.context, UrbanDataset.specification, UrbanDataset.values)'
due to data type mismatch: input to function array should all be the same type
Thanks
You can't access a nested array directly; you need to explode it first. explode creates one row for each element of the array.