json_normalize raises a KeyError when trying to extract certain attributes

d = {'data': {'questions': [{'id': 6574, 'text': 'Question #1', 'instructionalText': '', 'minimumResponses': 0, 'maximumResponses': None, 'sortOrder': 1, 'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None}, {'id': 362950, 'text': 'Answer #2', 'parentId': None}, {'id': 362951, 'text': 'Answer #3', 'parentId': None}, {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

from pandas import json_normalize
import json

fields = ['text', 'answers.text']

with open(R'response.json') as f:
    d = json.load(f)

data = json_normalize(d['data'], ['questions'], errors='ignore')
data = data[fields]
print(data)

2 Answers

User

#1 · Edited 2024-09-27 09:30:29

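Use record_prefix together with record_path and meta, so that d can be normalized in a single call.
- Without record_prefix, overlapping key names between record_path and meta result in a ValueError, and both 'id' and 'text' appear in both.
- ValueError: Conflicting metadata name id, need distinguishing prefix occurs when record_prefix is not used.
The KeyError occurs because 'answers.text' is not in d. Dotted names like this are created by .json_normalize(), and only for nested dicts, not for a list of dicts such as 'answers'.
If any top-level keys are not needed in df, remove them from meta.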
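The points above can be reproduced with a small sketch. It uses a trimmed copy of the question's data (two answers, fewer fields), which is an assumption made for brevity. It also shows meta_prefix, a json_normalize parameter not used in this answer, as another way to keep the names apart.

```python
import pandas as pd

# Trimmed copy of the question's data (assumption: fewer fields and answers)
d = {"data": {"questions": [{
    "id": 6574,
    "text": "Question #1",
    "answers": [
        {"id": 362949, "text": "Answer #1", "parentId": None},
        {"id": 362950, "text": "Answer #2", "parentId": None},
    ],
}]}}

# The question's call: 'answers' stays one column holding a list of dicts,
# so no 'answers.text' column exists and data[fields] raises KeyError.
flat = pd.json_normalize(d["data"], ["questions"], errors="ignore")
print(list(flat.columns))  # ['id', 'text', 'answers']

# Without any prefix, 'id' and 'text' exist in both the records and the meta.
try:
    pd.json_normalize(d["data"]["questions"],
                      record_path=["answers"], meta=["id", "text"])
    conflict = None
except ValueError as exc:
    conflict = str(exc)
print(conflict)

# meta_prefix renames the meta columns instead of the record columns.
df = pd.json_normalize(d["data"]["questions"], record_path=["answers"],
                       meta=["id", "text"], meta_prefix="question_")
print(list(df.columns))
```

Whether to prefix the record columns or the meta columns is mostly a naming preference. Either one removes the conflict.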

import pandas as pd

# normalize d
df = pd.json_normalize(data=d['data']['questions'],
                       record_path= ['answers'],
                       meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
                       record_prefix='answers_')

# display(df)
   answers_id answers_text answers_parentId    id         text     instructionalText minimumResponses maximumResponses sortOrder
0      362949    Answer #1             None  6574  Question #1                                      0             None         1
1      362950    Answer #2             None  6574  Question #1                                      0             None         1
2      362951    Answer #3             None  6574  Question #1                                      0             None         1
3      362952    Answer #4             None  6574  Question #1                                      0             None         1
4      262949    Answer #1             None  4756  Question #2  No cheating, cheater                0             None         1
5      262950    Answer #2             None  4756  Question #2  No cheating, cheater                0             None         1
6      262951    Answer #3             None  4756  Question #2  No cheating, cheater                0             None         1
7      262952    Answer #4             None  4756  Question #2  No cheating, cheater                0             None         1

Extended test data

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
                            {'id': 4756,
                             'text': 'Question #2',
                             'instructionalText': 'No cheating, cheater',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 262950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 262951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}

Regarding the other answer: using .apply(pd.Series) is not recommended because it is very slow.
- See the timing analysis in SO: Splitting dictionary/list inside a Pandas Column into Separate Columns
- 53 minutes for 10M rows
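For comparison, here is a minimal sketch of expanding the exploded column without .apply(pd.Series), by passing the list of dicts to the DataFrame constructor in one call. This is a swapped-in technique rather than something either answer shows. The trimmed data and the answers_ prefix are assumptions, and it presumes every question has at least one answer (an empty list would explode to NaN and would need handling first).

```python
import pandas as pd

# Trimmed sample data (assumption: two answers, fewer fields)
questions = [{
    "id": 6574,
    "text": "Question #1",
    "answers": [
        {"id": 362949, "text": "Answer #1", "parentId": None},
        {"id": 362950, "text": "Answer #2", "parentId": None},
    ],
}]

# One row per answer; 'answers' now holds a single dict per row
df = pd.json_normalize(questions).explode("answers").reset_index(drop=True)

# Build all answer columns at once instead of one Series per row
answers = pd.DataFrame(df["answers"].tolist(), index=df.index).add_prefix("answers_")

result = df.drop(columns="answers").join(answers)
print(result)
```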

User

#2 · Edited 2024-09-27 09:30:29

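This is the technique I usually use:

1. json_normalize() the top-level list
2. explode() the sub-list, then reset_index() in preparation for step #3
3. Expand the dicts in the sub-list with apply(pd.Series)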

d = {'data': {'questions': [{'id': 6574,
    'text': 'Question #1',
    'instructionalText': '',
    'minimumResponses': 0,
    'maximumResponses': None,
    'sortOrder': 1,
    'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
     {'id': 362950, 'text': 'Answer #2', 'parentId': None},
     {'id': 362951, 'text': 'Answer #3', 'parentId': None},
     {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

import pandas as pd

# One row per answer, then expand each answer dict into its own columns
df = pd.json_normalize(d["data"]["questions"]).explode("answers").reset_index(drop=True)
df = df.join(df["answers"].apply(pd.Series), rsuffix="_ans").drop(columns="answers")
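The rsuffix argument is needed because the question and answer dicts share the keys 'id' and 'text'. The sketch below, on a trimmed copy of the data (an assumption for brevity), prints the column names this approach produces.

```python
import pandas as pd

# Trimmed sample data (assumption: two answers, fewer fields)
questions = [{
    "id": 6574,
    "text": "Question #1",
    "answers": [
        {"id": 362949, "text": "Answer #1", "parentId": None},
        {"id": 362950, "text": "Answer #2", "parentId": None},
    ],
}]

df = pd.json_normalize(questions).explode("answers").reset_index(drop=True)
expanded = df["answers"].apply(pd.Series)  # columns: id, text, parentId

# Overlapping names from the right-hand frame receive the '_ans' suffix;
# 'parentId' does not overlap, so it keeps its name.
out = df.join(expanded, rsuffix="_ans").drop(columns="answers")
print(list(out.columns))  # ['id', 'text', 'id_ans', 'text_ans', 'parentId']
```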

