Trouble writing URLs to a JSON file with Python

Posted 2024-05-19 19:18:35


I have this JSON file:

[[ 
{
    "company name": "MICROMUSE INC",
    "cik_number": "1036425",
    "form_id": "10-Q",
    "date": "20030213",
    "file_url": "https://www.sec.gov/Archives/edgar/data/1036425/0001021408-03-002741.txt"
}, {
    "company name": "VENTURE LENDING & LEASING II INC",
    "cik_number": "1039802",
    "form_id": "10-Q",
    "date": "20030213",
    "file_url": "https://www.sec.gov/Archives/edgar/data/1039802/0001039802-03-000002.txt"
}, {
    "company name": "PHARSIGHT CORP",
    "cik_number": "1040853",
    "form_id": "10-Q",
    "date": "20030213",
    "file_url": "https://www.sec.gov/Archives/edgar/data/1040853/0001104659-03-002127.txt"
}
]]

I am very new to JSON, but my understanding is that each of these is called a JSON object:

{
        "company name": "PHARSIGHT CORP",
        "cik_number": "1040853",
        "form_id": "10-Q",
        "date": "20030213",
        "file_url": "https://www.sec.gov/Archives/edgar/data/1040853/0001104659-03-002127.txt"
    }

I wanted to count them, so I did this in Python:

import json
with open('file.json', 'r') as f:
    urls_dict = json.load(f)

itr = iter(urls_dict)

len(list(itr))

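I expected 3, but I got 1. I have a few questions:

  1. Is there something wrong with my JSON structure? (I generated the file with my own code.)
  2. The real JSON file contains millions of URLs, and I need to iterate over them to download the targets. Is it a bad idea to handle these downloads with Python?

I ask because although my JSON file is only 160 MB, memory pressure climbs to 36 GB when I try to open it in a text editor. I also estimate there are 1,000,000 URLs, each pointing to about 35 MB of XML tables, so 1,000,000 x 35 MB = 35 TB. (Is that many files enough to be called big data? :D)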
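A quick back-of-the-envelope check of the estimate above. The file count and per-file size come from the question; the 100 MB/s sustained throughput is purely an assumed figure for illustration, not a measurement.

```python
# Figures taken from the question.
num_urls = 1_000_000      # estimated number of URLs
mb_per_file = 35          # estimated size of each download, in MB

total_mb = num_urls * mb_per_file
total_tb = total_mb / 1_000_000          # decimal units: 1 TB = 1,000,000 MB

# ASSUMPTION: a sustained download rate of 100 MB/s (hypothetical).
assumed_mb_per_second = 100
transfer_days = total_mb / assumed_mb_per_second / 86_400  # seconds per day

print(f"total size: {total_tb:.0f} TB")
print(f"transfer time at {assumed_mb_per_second} MB/s: {transfer_days:.1f} days")
```

Under these assumptions the limiting factors would be bandwidth and storage rather than the language used to drive the downloads.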


Edit:

Following Shashank Bharadwaj's suggestion, I tried removing the [] to avoid the inner list, but it looks like json.load does not decode multiple JSON objects.
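json.load expects exactly one top-level JSON value, so a file that holds several objects side by side fails to parse. Keeping a single pair of brackets around the objects is the simplest fix. If the file really does contain several values separated by commas or whitespace, one possible workaround is json.JSONDecoder.raw_decode, which parses one value and reports where it ended. The helper name iter_json_objects and the sample string below are made up for this sketch.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a string that holds several
    values separated by whitespace and/or commas."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading separators, so do it here.
        if text[pos].isspace() or text[pos] == ",":
            pos += 1
            continue
        # Returns the decoded value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical sample: three objects with no enclosing list.
sample = '{"form_id": "10-Q"}, {"form_id": "10-K"}\n{"form_id": "10-Q"}'
records = list(iter_json_objects(sample))
print(len(records))
```

This reads the whole text into memory first, so for very large files writing one object per line and parsing line by line may be a better fit.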

I think my JSON should be structured like this:

{
      "url's": [
           {"company name":"MICROMUSE INC",
            "cik_number": "1036425",
            "form_id": "10-Q",
            "date": "20030213",
            "file_url": "https://www.sec.gov/Archives/edgar/data/1036425/0001021408-03-002741.txt"
        }, {
            "company name": "VENTURE LENDING & LEASING II INC",
            "cik_number": "1039802",
            "form_id": "10-Q",
            "date": "20030213",
            "file_url": "https://www.sec.gov/Archives/edgar/data/1039802/0001039802-03-000002.txt"
        }, {
            "company name": "PHARSIGHT CORP",
            "cik_number": "1040853",
            "form_id": "10-Q",
            "date": "20030213",
            "file_url": "https://www.sec.gov/Archives/edgar/data/1040853/0001104659-03-002127.txt"}
               ]
    }

This is how I create the JSON file:

def url_ext(url):
    # some code to read urls, request those urls and create index
    # ...

    # loop through each document in the master list.
    for index, document in enumerate(master_data):

        # create a dictionary for each document in the master list
        document_dict = {}
        document_dict['cik_number'] = document[0]
        document_dict['company_name'] = document[1]
        document_dict['form_id'] = document[2]
        document_dict['date'] = document[3]
        document_dict['file_url'] = document[4]

        master_data[index] = document_dict

    jsonList = []
    for document_dict in master_data:

        # if it's a 10-K document pull the url and the name.
        if document_dict['form_id'] == '10-K':
            # get the components
            data = {}
            data['company name'] = document_dict['company_name']
            data['cik_number'] = document_dict['cik_number']
            data['form_id'] = document_dict['form_id']
            data['date'] = document_dict['date']
            data['file_url'] = document_dict['file_url']
            jsonList.append(data)
        if document_dict['form_id'] == '10-Q':
            # get the components
            data = {}
            data['company name'] = document_dict['company_name']
            data['cik_number'] = document_dict['cik_number']
            data['form_id'] = document_dict['form_id']
            data['date'] = document_dict['date']
            data['file_url'] = document_dict['file_url']
            jsonList.append(data)
        if document_dict['form_id'] == 'NT 10-K':
            # get the components
            data = {}
            data['company name'] = document_dict['company_name']
            data['cik_number'] = document_dict['cik_number']
            data['form_id'] = document_dict['form_id']
            data['date'] = document_dict['date']
            data['file_url'] = document_dict['file_url']
            jsonList.append(data)
        if document_dict['form_id'] == 'NT 10-Q':
            # get the components
            data = {}
            data['company name'] = document_dict['company_name']
            data['cik_number'] = document_dict['cik_number']
            data['form_id'] = document_dict['form_id']
            data['date'] = document_dict['date']
            data['file_url'] = document_dict['file_url']
            jsonList.append(data)

    return jsonList

This is how I call the function:

with open("SECmasterURLs.txt",'r') as f:
    byte_data = f.read()

master_urls = byte_data.splitlines()
JSON_file = open("urls.JSON", 'w')
jsonList = []

for line in master_urls:

    data = url_ext(line)
    jsonList.append(data)

JSON_file.write(json.dumps(jsonList))
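A likely source of the extra brackets: url_ext returns a list, and list.append adds that whole list as a single element, producing a list of lists. list.extend adds the items one by one instead. The sketch below uses a hypothetical stand-in for url_ext and made-up example.com URLs, only to show the difference.

```python
import json

def url_ext(line):
    # Hypothetical stand-in for the real url_ext: returns a list of dicts.
    return [{"file_url": line + "/a.txt"}, {"file_url": line + "/b.txt"}]

master_urls = ["https://example.com/one", "https://example.com/two"]

nested = []
flat = []
for line in master_urls:
    data = url_ext(line)
    nested.append(data)  # adds the list itself as ONE element: [[...], [...]]
    flat.extend(data)    # adds each dict individually: [{...}, {...}, ...]

print(len(nested), len(flat))
print(json.dumps(nested)[:2])  # serialized form starts with two brackets
print(json.dumps(flat)[:2])    # serialized form starts with a bracket and a brace
```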

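How should I modify my code? I feel the code I wrote is overly complicated, especially the last part where I filter for 10-K and 10-Q, and I don't know how to make it simpler.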
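One possible simplification of the filtering step, as a sketch: the four if blocks build the same dictionary, so a single membership test against a set of form types can replace them. The names WANTED_FORMS and filter_documents are invented here, and the rows are assumed to have the same field order as document[0] through document[4] in the code above; the second sample row is made up.

```python
# Form types to keep; a set gives a single membership test.
WANTED_FORMS = {"10-K", "10-Q", "NT 10-K", "NT 10-Q"}

def filter_documents(master_data):
    """master_data: iterable of 5-item rows in the order
    (cik_number, company_name, form_id, date, file_url)."""
    json_list = []
    for cik_number, company_name, form_id, date, file_url in master_data:
        if form_id in WANTED_FORMS:
            json_list.append({
                "company name": company_name,
                "cik_number": cik_number,
                "form_id": form_id,
                "date": date,
                "file_url": file_url,
            })
    return json_list

rows = [
    ("1036425", "MICROMUSE INC", "10-Q", "20030213",
     "https://www.sec.gov/Archives/edgar/data/1036425/0001021408-03-002741.txt"),
    # Made-up row with a form type that should be filtered out.
    ("0000001", "EXAMPLE CO", "8-K", "20030213", "https://example.com/ignored.txt"),
]
filtered = filter_documents(rows)
print(len(filtered))
```

This also removes the intermediate pass that rewrites master_data in place, since the rows are unpacked directly in the loop.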


Tags: name, form, txt, id, json, url, number, data
1 answer
User
Answer 1 · Posted 2024-05-19 19:18:35

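The structure inside a JSON file can vary. In your case there is a list inside a list, so when you load the file into urls_dict, the JSON objects sit two lists deep. Access the inner list first and you get the result you expect: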

import json
with open('file.json', 'r') as f:
    urls_dict = json.load(f)

urls_dict = urls_dict[0]
itr = iter(urls_dict)

len(list(itr))

Indexing with urls_dict[0] selects the inner list (at index 0), which removes the nesting.
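Note that indexing with [0] only reaches the first inner list. The sample has a single inner list, but since the writing code appends one list per line of SECmasterURLs.txt, the full file may well contain many. A hedged sketch of flattening one level with itertools.chain.from_iterable, using a small inline string in place of the real file:

```python
import json
from itertools import chain

# Stand-in for json.load(f): an outer list holding two inner lists.
urls_nested = json.loads(
    '[[{"form_id": "10-Q"}, {"form_id": "10-K"}], [{"form_id": "10-Q"}]]'
)

inner_lists = len(urls_nested)       # number of inner lists
first_only = len(urls_nested[0])     # records in the first inner list only

# Flatten one level so every record is counted.
urls_flat = list(chain.from_iterable(urls_nested))
print(inner_lists, first_only, len(urls_flat))
```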
