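<p>You get the best performance when the dataframe is built in a single step at initialization. <code>DataFrame.from_records</code> takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can parse the data faster with <code>dict.get</code>, which returns a default value when the key is not found. I created an empty <code>dict</code> called <code>dummy</code> to use as the default for the intermediate <code>get</code> calls, so that the chained <code>get</code> always has a dict to work on.</p>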
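<p>As a minimal sketch of the chained <code>get</code> idea on its own, the snippet below uses only the standard library. The function name <code>extract</code> and the two sample JSON strings are made up for illustration, and <code>math.nan</code> stands in for <code>np.nan</code>.</p>

```python
import json
import math


def extract(line):
    """Return (groupe, id) for one JSON line, using NaN for missing values."""
    record = json.loads(line)
    empty = {}  # default for a missing intermediate dict, so the second get still works
    return (
        record.get('location', empty).get('groupe', math.nan),
        record.get('id', math.nan),
    )


# Hypothetical sample records: one complete, one with no 'location' key at all.
full = extract('{"id": 7, "location": {"groupe": "A"}}')
missing = extract('{"id": 8}')
```

<p>Note that this only covers keys that are absent. If a key is present but holds <code>null</code>, the first <code>get</code> returns <code>None</code> and the chained call raises <code>AttributeError</code>, so such data would need an extra check.</p>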
<p>I created a dataset of 1000 records, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.</p>
<pre><code>import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """Convert one JSON line to a tuple of fields for import."""
    dummy = {}  # default for missing intermediate dicts
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
# Column names follow the order of the tuple returned by extract_data.
with open('file.json') as f:
    df = pd.DataFrame.from_records(map(extract_data, f),
                                   columns=['group', 'id', 'Mother', 'Father'])
print('New algorithm', time.time()-start)

#
# The original way
# (DataFrame.append was removed in pandas 2.0, so this baseline needs an older pandas)
#
start = time.time()
df = pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        # Appending one row at a time builds a new dataframe on every iteration.
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time()-start)
</code></pre>
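<p>For trying the single-step approach without a <code>file.json</code> on disk, here is a self-contained sketch that feeds <code>DataFrame.from_records</code> from a generator over in-memory lines. The sample records, the name <code>iter_records</code> and the blank-line skip are assumptions added for illustration, not part of the timing test above.</p>

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical sample data: one JSON object per line, with some fields missing.
SAMPLE = io.StringIO(
    '{"id": 1, "location": {"groupe": "A"}, '
    '"Mother": {"MotherName": "Ann"}, "Father": {"FatherName": "Bob"}}\n'
    '{"id": 2, "location": {}, "Mother": {}, "Father": {"FatherName": "Carl"}}\n'
    '{"id": 3}\n'
)


def iter_records(lines):
    """Yield one (group, id, mother, father) tuple per non-empty JSON line."""
    empty = {}  # default for missing intermediate dicts
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing in json.loads
        rec = json.loads(line)
        yield (
            rec.get('location', empty).get('groupe', np.nan),
            rec.get('id', np.nan),
            rec.get('Mother', empty).get('MotherName', np.nan),
            rec.get('Father', empty).get('FatherName', np.nan),
        )


# The dataframe is built in one call; the column order matches the tuple order.
df = pd.DataFrame.from_records(iter_records(SAMPLE),
                               columns=['group', 'id', 'Mother', 'Father'])
```

<p>Replacing <code>SAMPLE</code> with an open file handle should give the same behavior on a real file, since both are iterated line by line.</p>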