I am having trouble reading multi-line JSON in PySpark. For example:
{
"_index": "kl.service-log.2021.04.06",
"_type": "_doc",
"_id": "hZ3SpHgBhp2ht1Q8n8ym",
"_version": 1,
"_score": null,
"_source": {
"publishTime": "2021-04-06T01:36:09.422Z",
"client_ips": "2601:247:c580:3337:45c0:dd63:35e0:9247",
"body": {
"events": "[{\"key\":\"Key Launched\",\"count\":1,\"timestamp\":1617672914673,\"sum\":0},{\"key\":\"Viewed Screen\",\"count\":1,\"timestamp\":1617672969301,\"sum\":0}]",
"sdk_name": "java-native-android",
"tz": "-300"
}
}
}
The schema is as follows:
root
|-- _id: string (nullable = true)
|-- _index: string (nullable = true)
|-- _score: string (nullable = true)
|-- _source: struct (nullable = true)
| |-- body: struct (nullable = true)
| | |-- events: string (nullable = true)
| | |-- sdk_name: string (nullable = true)
| | |-- tz: string (nullable = true)
| |-- client_ips: string (nullable = true)
| |-- publishTime: string (nullable = true)
|-- _type: string (nullable = true)
|-- _version: long (nullable = true)
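Under _source.body.events the data type is string, but the value is actually a JSON array containing two distinct records. I would like those records to become separate rows, each with its own specific columns.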
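You can use from_json to parse the events column and rebuild the _source column. If you then want to explode the array into separate rows, you can do so on the df2 obtained above.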