I have this JSON file:
[[
{
"company name": "MICROMUSE INC",
"cik_number": "1036425",
"form_id": "10-Q",
"date": "20030213",
"file_url": "https://www.sec.gov/Archives/edgar/data/1036425/0001021408-03-002741.txt"
}, {
"company name": "VENTURE LENDING & LEASING II INC",
"cik_number": "1039802",
"form_id": "10-Q",
"date": "20030213",
"file_url": "https://www.sec.gov/Archives/edgar/data/1039802/0001039802-03-000002.txt"
}, {
"company name": "PHARSIGHT CORP",
"cik_number": "1040853",
"form_id": "10-Q",
"date": "20030213",
"file_url": "https://www.sec.gov/Archives/edgar/data/1040853/0001104659-03-002127.txt"
}
]]
I'm very new to JSON, but my understanding is that each of these is called a JSON object:
{
"company name": "PHARSIGHT CORP",
"cik_number": "1040853",
"form_id": "10-Q",
"date": "20030213",
"file_url": "https://www.sec.gov/Archives/edgar/data/1040853/0001104659-03-002127.txt"
}
I wanted to count them, so I did this in Python:
import json
with open('file.json', 'r') as f:
    urls_dict = json.load(f)
itr = iter(urls_dict)
len(list(itr))
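I expected the result to be 3, but I got 1. I have a few questions:
I ask because although my JSON file is only 160MB, memory pressure climbs to 36GB when I try to open it in a text editor. I also estimate there are 1,000,000 URLs, each containing a 35MB XML table, so 1,000,000 x 35MB = 35TB. (Is that many files enough to be called big data? :D)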
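Edit:
Following Shashank Bharadwaj's suggestion, I tried removing the [] to avoid the inner list, but it looks like json.load does not decode multiple JSON objects.
I think my JSON should be structured like this: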
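To illustrate that behaviour, here is a minimal sketch (not part of the original post). json.loads raises json.JSONDecodeError on text that holds several top-level objects, while json.JSONDecoder.raw_decode can read them one at a time. The helper name iter_json_objects and the sample string are made up for this example.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value found in text, skipping separators."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace and the commas left over after removing the brackets.
        if text[pos] in " \t\r\n,":
            pos += 1
            continue
        # raw_decode returns the decoded value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Made-up sample: two objects with no enclosing list.
sample = '{"form_id": "10-Q"}, {"form_id": "10-K"}'

try:
    json.loads(sample)
    load_failed = False
except json.JSONDecodeError:
    load_failed = True

objects = list(iter_json_objects(sample))
print(load_failed, len(objects))
```

Keeping the outer [] so the file stays a single valid JSON document is usually simpler than decoding object by object.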
{
"url's": [
{"company name":"MICROMUSE INC",
"cik_number": "1036425",
"form_id": "10-Q",
"date": "20030213",
"file_url": "https://www.sec.gov/Archives/edgar/data/1036425/0001021408-03-002741.txt"
}, {
"company name": "VENTURE LENDING & LEASING II INC",
"cik_number": "1039802",
"form_id": "10-Q",
"date": "20030213",
"file_url": "https://www.sec.gov/Archives/edgar/data/1039802/0001039802-03-000002.txt"
}, {
"company name": "PHARSIGHT CORP",
"cik_number": "1040853",
"form_id": "10-Q",
"date": "20030213",
"file_url": "https://www.sec.gov/Archives/edgar/data/1040853/0001104659-03-002127.txt"}
]
}
This is how I create the JSON file:
def url_ext(url):
    #some code to read urls, request those urls and create index
    .
    .
    .
    #loop through each document in the master list.
    for index, document in enumerate(master_data):
        # create a dictionary for each document in the master list
        document_dict = {}
        document_dict['cik_number'] = document[0]
        document_dict['company_name'] = document[1]
        document_dict['form_id'] = document[2]
        document_dict['date'] = document[3]
        document_dict['file_url'] = document[4]
        master_data[index] = document_dict
    jsonList = []
    for document_dict in master_data:
        # if it's a 10-K document pull the url and the name.
        if document_dict['form_id'] == '10-K':
            # get the components
            data = {}
            data['company name'] = document_dict['company_name']
            data['cik_number'] = document_dict['cik_number']
            data['form_id'] = document_dict['form_id']
            data['date'] = document_dict['date']
            data['file_url'] = document_dict['file_url']
            jsonList.append(data)
        if document_dict['form_id'] == '10-Q':
            # get the components
            data = {}
            data['company name'] = document_dict['company_name']
            data['cik_number'] = document_dict['cik_number']
            data['form_id'] = document_dict['form_id']
            data['date'] = document_dict['date']
            data['file_url'] = document_dict['file_url']
            jsonList.append(data)
        if document_dict['form_id'] == 'NT 10-K':
            # get the components
            data = {}
            data['company name'] = document_dict['company_name']
            data['cik_number'] = document_dict['cik_number']
            data['form_id'] = document_dict['form_id']
            data['date'] = document_dict['date']
            data['file_url'] = document_dict['file_url']
            jsonList.append(data)
        if document_dict['form_id'] == 'NT 10-Q':
            # get the components
            data = {}
            data['company name'] = document_dict['company_name']
            data['cik_number'] = document_dict['cik_number']
            data['form_id'] = document_dict['form_id']
            data['date'] = document_dict['date']
            data['file_url'] = document_dict['file_url']
            jsonList.append(data)
    return jsonList
This is how I call the function:
with open("SECmasterURLs.txt",'r') as f:
    byte_data = f.read()
    master_urls = byte_data.splitlines()
JSON_file = open("urls.JSON", 'w')
jsonList = []
for line in master_urls:
    data = url_ext(line)
    jsonList.append(data)
JSON_file.write(json.dumps(jsonList))
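How should I modify my code? I feel that what I wrote is very complicated, especially the last part where I filter for 10-K and 10-Q, and I don't know how to make it simpler.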
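One possible simplification of the filtering step, offered as a sketch rather than a drop-in replacement: test membership in a set of wanted form types instead of repeating four near-identical if blocks. The names WANTED_FORMS, FIELDS and filter_documents are hypothetical, and the second sample row is invented for illustration.

```python
# Hypothetical names: WANTED_FORMS, FIELDS and filter_documents are not from the original code.
WANTED_FORMS = {"10-K", "10-Q", "NT 10-K", "NT 10-Q"}
FIELDS = ("cik_number", "company_name", "form_id", "date", "file_url")

def filter_documents(master_data):
    """Turn raw rows into dicts and keep only the wanted form types."""
    json_list = []
    for row in master_data:
        # Pair each positional value with its field name.
        document = dict(zip(FIELDS, row))
        if document["form_id"] in WANTED_FORMS:
            json_list.append({
                "company name": document["company_name"],
                "cik_number": document["cik_number"],
                "form_id": document["form_id"],
                "date": document["date"],
                "file_url": document["file_url"],
            })
    return json_list

rows = [
    ["1036425", "MICROMUSE INC", "10-Q", "20030213",
     "https://www.sec.gov/Archives/edgar/data/1036425/0001021408-03-002741.txt"],
    # Made-up row with a form type that should be filtered out.
    ["0000001", "EXAMPLE CORP", "8-K", "20030213", "https://example.com/ignored.txt"],
]
result = filter_documents(rows)
print(result)
```

A set lookup keeps the list of accepted form types in one place, so adding another type means changing a single line.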
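The structure inside a JSON file can vary. In your case, there is a list inside a list. So when you load the JSON file into urls_dict, the JSON objects sit inside two nested lists. You can access the inner list and then get the result you want.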
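A minimal sketch of that access. The JSON string is a shortened, made-up stand-in for the file so the example is self-contained, and json.loads is used in place of json.load for the same reason.

```python
import json

# Shortened stand-in for the file's contents: a list nested inside a list.
text = '[[{"form_id": "10-Q"}, {"form_id": "10-Q"}, {"form_id": "10-Q"}]]'
urls_dict = json.loads(text)

outer_count = len(urls_dict)      # the outer list holds one inner list
inner_count = len(urls_dict[0])   # the inner list holds the objects
print(outer_count, inner_count)   # prints: 1 3
```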
So when you do urls_dict[0], you access the inner list (at index 0), which gets rid of the nesting.
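The nesting comes from the calling code, which appends each list returned by url_ext to another list. Here is a sketch of two ways to end up with a flat list, assuming every top-level element is itself a list: use list.extend instead of append when building the file, or flatten an existing file after loading it with itertools.chain.from_iterable. fake_url_ext and the example.com URLs are made-up stand-ins for the real url_ext and its input.

```python
import json
from itertools import chain

def fake_url_ext(line):
    # Made-up stand-in for url_ext: returns a list of dicts for one index URL.
    return [{"file_url": f"{line}/a.txt"}, {"file_url": f"{line}/b.txt"}]

master_urls = ["https://example.com/idx1", "https://example.com/idx2"]

# append() nests: one inner list per index URL.
nested = []
for line in master_urls:
    nested.append(fake_url_ext(line))

# extend() keeps a single flat list of objects.
flat = []
for line in master_urls:
    flat.extend(fake_url_ext(line))

# A file that was already written with nesting can be flattened after loading.
reloaded = json.loads(json.dumps(nested))
flattened = list(chain.from_iterable(reloaded))

print(len(nested), len(flat), len(flattened))  # prints: 2 4 4
```

With more than one line in SECmasterURLs.txt, urls_dict[0] alone would only return the objects from the first index URL, so flattening covers the general case.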