Optimizing the parsing of a file of JSON objects into a pandas dataframe, where keys may be missing in some rows

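I would like to optimize the code below. It takes about 5 seconds, which is far too slow for a file of only 1000 lines. I have a large file in which every line contains valid JSON. Each JSON object looks like the following (the real data is much larger and nested, so this snippet is only an illustration):
<pre><code>{"location":{"town":"Rome","groupe":"Advanced", "school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}}, "id":"145", "Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2, "Father":{"FatherName":"Peter","FatherAge":"51"}, "Teacher":["MrCrock","MrDaniel"],"Field":"Marketing", "season":["summer","spring"]}
</code></pre>
I need to parse this file and extract only a few key values from each JSON object, to get a resulting dataframe with one row per object holding the group, the id, and the father's and mother's names. However, some of the keys I need in the dataframe are missing from some of the JSON objects, so I have to check whether each key exists and, if it does not, fill the corresponding value with null. I am using the following approach:
<pre><code>import json

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open('path/to/file') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        # Keys match the column names declared above.
        # Note: DataFrame.append was removed in pandas 2.0.
        df = df.append({"group": groupe, "id": id, "Mother": MotherName, "Father": FatherName}, ignore_index=True)
</code></pre>
I need to bring the runtime for the whole 1000-line file down to 2 seconds or less. In Perl the same parsing function takes less than 1 second, but I need to implement it in Python.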

1 Answer

Anonymous, 1 day ago

You will get the best performance if you build the dataframe in a single step at initialization. <code>DataFrame.from_records</code> takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can parse the data faster with <code>get</code>, which returns a default value when the item is not found. I created an empty <code>dict</code> called <code>dummy</code> to pass as the default for the intermediate <code>get</code> calls, so that the chained <code>get</code> is guaranteed to work. I created a 1000-record data set, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.
<pre><code>import json
import time

import numpy as np
import pandas as pd


def extract_data(data):
    """Convert one JSON line to a record tuple for import."""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))


start = time.time()
# Column order must match the order of the tuple returned by extract_data.
with open('file.json') as f:
    df = pd.DataFrame.from_records(map(extract_data, f),
                                   columns=['group', 'id', 'Mother', 'Father'])
print('New algorithm', time.time() - start)

#
# The original way (DataFrame.append was removed in pandas 2.0)
#
start = time.time()
df = pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"group": groupe, "id": id, "Mother": MotherName, "Father": FatherName}, ignore_index=True)
print('original', time.time() - start)
</code></pre>
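To see what the chained <code>get</code> does when a key really is missing, here is a minimal sketch that runs without any file on disk. The two sample records, the <code>SAMPLE</code> buffer, and the names <code>EMPTY</code> and <code>extract_record</code> are made up for illustration and are not part of the original data. The second record has no <code>Mother</code> object and no <code>groupe</code> key, so those cells should come out as NaN.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical sample data: the second record lacks "Mother" and "groupe".
SAMPLE = io.StringIO(
    '{"location": {"town": "Rome", "groupe": "Advanced"}, "id": "145", '
    '"Mother": {"MotherName": "Helen"}, "Father": {"FatherName": "Peter"}}\n'
    '{"location": {"town": "Rome"}, "id": "146", '
    '"Father": {"FatherName": "Paul"}}\n'
)

# Default for a missing nested object, so the second .get() still has a dict to call.
EMPTY = {}


def extract_record(line):
    """Turn one JSON line into a (group, id, mother, father) tuple."""
    record = json.loads(line)
    return (
        record.get("location", EMPTY).get("groupe", np.nan),
        record.get("id", np.nan),
        record.get("Mother", EMPTY).get("MotherName", np.nan),
        record.get("Father", EMPTY).get("FatherName", np.nan),
    )


# Build the dataframe in one step from an iterator of tuples.
df = pd.DataFrame.from_records(
    map(extract_record, SAMPLE),
    columns=["group", "id", "Mother", "Father"],
)
print(df)
```

Note that this only guards against keys that are absent. If a record contained something like <code>"Mother": null</code>, the first <code>get</code> would return <code>None</code> rather than the default, and the chained call would fail, so such data would need an extra check.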
