Optimizing the parsing of a file of JSON objects into a pandas DataFrame when some rows may be missing keys

{"location":{"town":"Rome","groupe":"Advanced", "school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}}, "id":"145", "Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2, "Father":{"FatherName":"Peter","FatherAge":"51"}, "Teacher":["MrCrock","MrDaniel"],"Field":"Marketing", "season":["summer","spring"]}

df = pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open('path/to/file') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName}, ignore_index=True)

2 answers

User

Answer 1 · edited 2024-10-03 11:13:41

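You will get the best performance if the DataFrame is built in a single step at initialization. DataFrame.from_records takes a sequence of tuples, and you can feed it those tuples from a generator that reads one record at a time. You can also parse the data faster with get, which returns a default value when the key is not found. I created an empty dict called dummy to stand in for a missing intermediate {}, so the chained get calls keep working.

I created a 1000-record dataset, and on my junky laptop the time went from 18 seconds to 0.06 seconds. Not bad.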

import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """ convert 1 json dict to records for import"""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan), 
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
# Column names follow the order of the tuple returned by extract_data
with open('file.json') as f:
    df = pd.DataFrame.from_records(map(extract_data, f),
        columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time()-start)

#
# The original way
#

# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0,
# so this baseline only runs on older pandas versions.
start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        # Appending one row at a time copies the whole frame on every iteration
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time()-start)
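
As a small follow-up, here is a minimal sketch of the same get-based extraction run on three in-memory lines, so the missing-key behavior is visible without a file. The sample records and the names `extract_row` and `demo_df` are made up for illustration. It also assumes a key might hold an explicit `null`; a plain `get(key, dummy)` would return `None` in that case and the chained `get` would fail, so the sketch uses `or {}` instead.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical sample input: three JSON lines, two of them incomplete.
SAMPLE = io.StringIO(
    '{"location": {"groupe": "Advanced"}, "id": "145", '
    '"Mother": {"MotherName": "Helen"}, "Father": {"FatherName": "Peter"}}\n'
    '{"location": {"town": "Rome"}, "id": "146", "Father": {}}\n'
    '{"id": "147", "Mother": null}\n'
)


def extract_row(line):
    """Turn one JSON line into a (groupe, id, MotherName, FatherName) tuple."""
    rec = json.loads(line)
    # `or {}` covers both a missing key and an explicit null value.
    location = rec.get("location") or {}
    mother = rec.get("Mother") or {}
    father = rec.get("Father") or {}
    return (
        location.get("groupe", np.nan),
        rec.get("id", np.nan),
        mother.get("MotherName", np.nan),
        father.get("FatherName", np.nan),
    )


# from_records consumes the iterator once and builds the frame in one step.
demo_df = pd.DataFrame.from_records(
    map(extract_row, SAMPLE),
    columns=["groupe", "id", "MotherName", "FatherName"],
)
print(demo_df)
```

Rows with a missing or null parent object simply end up with NaN in the corresponding column.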

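User

Answer 2 · edited 2024-10-03 11:13:41

The key point is not to append each row to the DataFrame inside the loop. Collect the values in a list or dict container and build the DataFrame from them once at the end. You can also simplify the if/else structure with a plain get, which returns a default value (e.g. np.nan) when the key is not found in the dictionary.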

import json
import numpy as np
import pandas as pd

with open('path/to/file') as f:
    # Keys must match the names used in the append calls below
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))

    # Build the DataFrame once, after all rows have been collected
    df = pd.DataFrame(d)
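
Another option in the same spirit (collect everything, build once) is to let pandas flatten the nested dicts itself with `pd.json_normalize`. This is only a sketch and not something either answer above proposed: the sample lines and the `wanted` mapping are made up, and I have not benchmarked it against the get-based versions. Since it flattens every key in every record, it may well be slower when only a few fields are needed.

```python
import io
import json

import pandas as pd

# Hypothetical sample input standing in for the real file.
SAMPLE = io.StringIO(
    '{"location": {"groupe": "Advanced"}, "id": "145", '
    '"Mother": {"MotherName": "Helen"}, "Father": {"FatherName": "Peter"}}\n'
    '{"location": {"town": "Rome"}, "id": "146"}\n'
)

# Parse every non-empty line, then flatten all records in one call.
records = [json.loads(line) for line in SAMPLE if line.strip()]
flat = pd.json_normalize(records)  # nested keys become dotted column names

# Dotted source column -> name wanted in the final frame (illustrative mapping).
wanted = {
    "location.groupe": "groupe",
    "id": "id",
    "Mother.MotherName": "MotherName",
    "Father.FatherName": "FatherName",
}

# reindex keeps only the wanted columns and adds any that no record
# supplied, filled with NaN; rename then gives them short names.
result = flat.reindex(columns=list(wanted)).rename(columns=wanted)
print(result)
```

Keys that are absent from a record show up as NaN in that row, so no explicit if/else or get is needed here.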
