将嵌套json读入数据帧

2024-09-28 03:23:12 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个从Firebase导出的json文件,如下所示

{
  "reports" : {
    "Google-Pixel 2 XL" : {
      "-MIoCtD9YUF2G9Esfrfz" : {
        "message" : "04 Oct 2020 23:25:17:047 onCreate MainActivity",
        "timestamp" : 1601825117067
      },
      "-MIoCtFVOxu8wdEHtm6q" : {
        "message" : "04 Oct 2020 23:25:17:214 onCreate Service",
        "timestamp" : 1601825117216
      },
      "-MIoCyBtKMQqQzUHEXsW" : {
        "message" : "04 Oct 2020 23:25:37:682 onStartCommand Service",
        "timestamp" : 1601825137685
      },
      "-MIoFWll9r3qwzWNoGMn" : {
        "message" : "04 Oct 2020 23:36:47:687: (1.3212517, 103.860314)",
        "timestamp" : 1601825807693
      }
    },
    "Vivo 1820" : {
      "-MIoF14JUm6JMZrOzDlL" : {
        "message" : "04 Oct 2020 23:34:37:623 onCreate MainActivity",
        "timestamp" : 1601825677653
      },
      "-MIoF1A9ZZNqTu5W-rQD" : {
        "message" : "04 Oct 2020 23:34:38:016 onCreate Service",
        "timestamp" : 1601825678026
      },
      "-MIoF2gNDua9FfLBTg6q" : {
        "message" : "04 Oct 2020 23:34:44:235 onCreate MainActivity",
        "timestamp" : 1601825684248
      }
    }
  }
}

我正在尝试将其转换为一个4列数据帧,如下所示

model             | id                   | message                               | timestamp
Google-Pixel 2 XL | -MIoCtD9YUF2G9Esfrfz | 04 Oct 2020 23:25:17:047 onCreate...  | 1601825117067
Vivo 1820         | -MIoF14JUm6JMZrOzDlL | 04 Oct 2020 23:34:37:623 onCreate...  | 1601825677653

我该怎么做?我尝试了各种方法和正常化,但似乎没有得到它

data = pd.read_json("firebase-file.json")
df = pd.json_normalize(data, record_path="reports")

多谢各位


Tags: jsonmessagedataservicegoogletimestampoctreports
3条回答

我认为你需要自己把它弄平,幸好它并不复杂:

s = [[k, i, *j.values()] for k,v in data["reports"].items() for i, j in v.items()]

print (pd.DataFrame(s))

                   0                     1                                                  2              3
0  Google-Pixel 2 XL  -MIoCtD9YUF2G9Esfrfz     04 Oct 2020 23:25:17:047 onCreate MainActivity  1601825117067
1  Google-Pixel 2 XL  -MIoCtFVOxu8wdEHtm6q          04 Oct 2020 23:25:17:214 onCreate Service  1601825117216
2  Google-Pixel 2 XL  -MIoCyBtKMQqQzUHEXsW    04 Oct 2020 23:25:37:682 onStartCommand Service  1601825137685
3  Google-Pixel 2 XL  -MIoFWll9r3qwzWNoGMn  04 Oct 2020 23:36:47:687: (1.3212517, 103.860314)  1601825807693
4          Vivo 1820  -MIoF14JUm6JMZrOzDlL     04 Oct 2020 23:34:37:623 onCreate MainActivity  1601825677653
5          Vivo 1820  -MIoF1A9ZZNqTu5W-rQD          04 Oct 2020 23:34:38:016 onCreate Service  1601825678026
6          Vivo 1820  -MIoF2gNDua9FfLBTg6q     04 Oct 2020 23:34:44:235 onCreate MainActivity  1601825684248

试试这个(见我上面的评论)

import pandas as pd


data = []
for k, v in test['reports'].items():
    model_name = k
    for model in v.items():
        _data = {}
        _data['model'] = model_name
        _data['id'] = model[0]
        _data['message'] = model[1]['message']
        _data['timestamp'] = model[1]['timestamp']
        data.append(_data)
df = pd.DataFrame(data)

其中test是您的数据,因此test['reports']访问要解析的嵌套信息

根据pd.json_normalize()official docs,它假定一个数组(列表)输入。然而,原始的json与dict列表不同,最重要的是,键“id”不存在。因此,我认为绝对需要一个手工制作的解析器

代码

import pandas as pd
import json

file_path = "/mnt/ramdisk/in.json"

with open(file_path) as f:
    dic = json.load(f)

# discard the redundant "report" layer
dic = dic["reports"]

# produce a flattened list of dict
ls = []
for k1, v1 in dic.items():
    # k1 = model
    for k2, v2 in v1.items():
        # k2 = the hash-like id
        v2["model"] = k1
        v2["id"] = k2
        ls.append(v2)

df = pd.json_normalize(ls)

输出

# Trim the message for printing purpose
df2 = df.copy()
df2["message"] = df["message"].apply(lambda s: s[:10])
df2

Out[28]: 
      message      timestamp              model                    id
0  04 Oct 202  1601825117067  Google-Pixel 2 XL  -MIoCtD9YUF2G9Esfrfz
1  04 Oct 202  1601825117216  Google-Pixel 2 XL  -MIoCtFVOxu8wdEHtm6q
2  04 Oct 202  1601825137685  Google-Pixel 2 XL  -MIoCyBtKMQqQzUHEXsW
3  04 Oct 202  1601825807693  Google-Pixel 2 XL  -MIoFWll9r3qwzWNoGMn
4  04 Oct 202  1601825677653          Vivo 1820  -MIoF14JUm6JMZrOzDlL
5  04 Oct 202  1601825678026          Vivo 1820  -MIoF1A9ZZNqTu5W-rQD
6  04 Oct 202  1601825684248          Vivo 1820  -MIoF2gNDua9FfLBTg6q

注意:进入像散列一样的id所在的层的深度似乎是必要的。这是因为id最初是keys,但似乎必须将它们重新格式化为values,才能被pd.json_normalize正确地解释为值。我在互联网上的简单调查也没有发现使用简单的内置方法解析这种递归结构的例子

相关问题 更多 >

    热门问题