如何将嵌套的JSON解析为CSV

2024-10-03 17:21:28 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个新项目,我从RESTAPI获取JSON数据-我正在尝试将这些数据解析为csv管道分隔,以便导入到我们的遗留软件 我似乎无法正确解析所有的值对—这是我第一次接触JSON,我尝试了很多方法,但每次只做了一点正确的事情

我用过Python,可以得到一些我需要的东西,但不能得到整个JSON树——它是一个列表,里面还有一些字典和列表 我知道我的代码是不完整的,只是想找人给我指出一个正确的方向,在python中什么工具可以完成这项工作

import json
import csv

with open('tenants.json') as access_json:
    read_content = json.load(access_json)


for rm_access in read_content:
    rm_data = rm_access

print(rm_data)
contacts_data = rm_data['Contacts']
leases_data = rm_data['Leases']
udfs_data = rm_data['UserDefinedValues']

for contacts_access in contacts_data:
    rm_contacts = contacts_access

更新时间:

import pandas as pd

with open('tenants.json') as access_json:
    read_content = json.load(access_json)

for rm_access in read_content:
    rm_data = rm_access

pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 150)

TenantID = []
TenantDisplayID = []
Name = []
FirstName = []
LastName = []
WebMessage = []
Comment = []
RentDueDay = []
RentPeriod = []
FirstContact = []
PropertyID = []
PostingStartDate = []
CreateDate = []
CreateUserID = []
UpdateDate = []
UpdateUserID = []
Contacts = []
for rm_access in read_content:
    rm_data = rm_access

    TenantID.append(rm_data["TenantID"])
    TenantDisplayID.append(rm_data["TenantDisplayID"])
    Name.append(rm_data["Name"])
    FirstName.append(rm_data["FirstName"])
    LastName.append(rm_data["LastName"])
    WebMessage.append(rm_data["WebMessage"])
    Comment.append(rm_data["Comment"])
    RentDueDay.append(rm_data["RentDueDay"])
    RentPeriod.append(rm_data["RentPeriod"])
#    FirstContact.append(rm_data["FirstContact"])
    PropertyID.append(rm_data["PropertyID"])
    PostingStartDate.append(rm_data["PostingStartDate"])
    CreateDate.append(rm_data["CreateDate"])
    CreateUserID.append(rm_data["CreateUserID"])
    UpdateUserID.append(rm_data["UpdateUserID"])
    Contacts.append(rm_data["Contacts"])


df = pd.DataFrame({"TenantID":TenantID,"TenantDisplayID":TenantDisplayID, "Name"
: Name,"FirstName":FirstName, "LastName": LastName,"WebMessage": WebMessage,"Com
ment": Comment, "RentDueDay": RentDueDay, "RentPeriod": RentPeriod, "PropertyID"
: PropertyID, "PostingStartDate": PostingStartDate,"CreateDate": CreateDate, "Cr
eateUserID": CreateUserID,"UpdateUserID": UpdateUserID,"Contacts": Contacts})

print(df)

这是文件的样本

[
  {
    "TenantID": 115,
    "TenantDisplayID": 115,
    "Name": "Jane Doe",
    "FirstName": "Jane",
    "LastName": "Doe",
    "WebMessage": "",
    "Comment": "",
    "RentDueDay": 1,
    "RentPeriod": "Monthly",
    "FirstContact": "2015-11-01T15:30:00",
    "PropertyID": 17,
    "PostingStartDate": "2010-10-01T00:00:00",
    "CreateDate": "2014-04-16T13:35:37",
    "CreateUserID": 1,
    "UpdateDate": "2017-03-22T11:31:48",
    "UpdateUserID": 1,
    "Contacts": [
      {
        "ContactID": 128,
        "FirstName": "Jane",
        "LastName": "Doe",
        "MiddleName": "",
        "IsPrimary": true,
        "DateOfBirth": "1975-02-27T00:00:00",
        "FederalTaxID": "111-11-1111",
        "Comment": "",
        "Email": "jane.doe@mail.com",
        "License": "ZZT4532",
        "Vehicle": "BMW 3 Series",
        "IsShowOnBill": true,
        "Employer": "REW",
        "ApplicantType": "Applicant",
        "CreateDate": "2014-04-16T13:35:37",
        "CreateUserID": 1,
        "UpdateDate": "2017-03-22T11:31:48",
        "AnnualIncome": 0.0,
        "UpdateUserID": 1,
        "ParentID": 115,
        "ParentType": "Tenant",
        "PhoneNumbers": [
          {
            "PhoneNumberID": 286,
            "PhoneNumberTypeID": 2,
            "PhoneNumber": "703-555-5610",
            "Extension": "",
            "StrippedPhoneNumber": "7035555610",
            "IsPrimary": true,
            "ParentID": 128,
            "ParentType": "Contact"
          }
        ]
      }
    ],
    "UserDefinedValues": [
      {
        "UserDefinedValueID": 1,
        "UserDefinedFieldID": 4,
        "ParentID": 115,
        "Name": "Emerg Contact Name",
        "Value": "Terry Harper",
        "UpdateDate": "2016-01-22T15:41:53",
        "FieldType": "Text",
        "UpdateUserID": 2,
        "CreateUserID": 2
      },
      {
        "UserDefinedValueID": 174,
        "UserDefinedFieldID": 5,
        "ParentID": 115,
        "Name": "Emerg Contact Phone",
        "Value": "703-555-3568",
        "UpdateDate": "2016-01-22T15:42:03",
        "FieldType": "Text",
        "UpdateUserID": 2,
        "CreateUserID": 2
      }
    ],
    "Leases": [
      {
        "LeaseID": 115,
        "TenantID": 115,
        "UnitID": 181,
        "PropertyID": 17,
        "MoveInDate": "2010-10-01T00:00:00",
        "SortOrder": 1,
        "CreateDate": "2014-04-16T13:35:37",
        "UpdateDate": "2017-03-22T11:31:48",
        "CreateUserID": 1,
        "UpdateUserID": 1
      }
    ],
    "Addresses": [
      {
        "AddressID": 286,
        "AddressTypeID": 1,
        "Address": "14393 Montgomery Road Lot #102\r\nCincinnati, OH 45122",
        "Street": "14393 Montgomery Road Lot #102",
        "City": "Cincinnati",
        "State": "OH",
        "PostalCode": "45122",
        "IsPrimary": true,
        "ParentID": 115,
        "ParentType": "Tenant"
      }
    ],
    "OpenReceivables": [],
    "Status": "Current"
  },

不是所有的租户都会有所有的元素,这也是很棘手的

我需要顶部的数据,那里有TenantID,TenantDisplayID,等等 我还需要联系人、电话号码、租约等数据 每一行应该是静态的,所以如果它没有特定的标记,那么我希望它看起来像Null或None tentandi | TenantDisplayID | FirstName….等等,所以每行有相同数量的字段


Tags: rmnamejsondataaccessfirstnamecontactsappend
2条回答

一般问题

这个任务(以及其他类似任务)的问题不仅仅是如何创建一个算法——我相信,从理论上讲,您将能够通过(不是那么)大量的嵌套for循环来解决这个问题。问题是如何以一种你不会感到头痛的方式组织代码——即,你可以很容易地修复bug,你可以编写单元测试,你可以很容易地从阅读中理解代码(从现在起的六个月内),并且你可以很容易地在需要时更改代码。 我不知道有谁不犯错误时,他们的头围绕着一个深嵌套的结构。在一个代码中寻找错误,因为它反映了数据的嵌套结构,所以代码嵌套得很重,这可能会让人非常沮丧。你知道吗

快速(最有可能:最佳)解决方案

依赖为您的具体用例而设计的包,例如

https://github.com/cwacek/python-jsonschema-objects

如果您有API模式的正式定义,您可以使用包来实现。例如,如果您的API有一个Swagger模式定义,那么您可以使用swagger-pyhttps://github.com/digium/swagger-py)将JSON响应转换为Python对象。你知道吗

主要解决方案:面向对象编程和递归

即使您的具体用例可能有一些库,我还是想解释一下如何处理“那种”任务的原则:

组织此类问题代码的一个好方法是使用面向对象编程。利用递归原理,可以更清楚地解决嵌套问题。这也使得在API响应的JSON模式由于任何原因(例如API的更新)发生变化时,更容易对代码进行检查。在您的情况下,我建议您创建如下内容:

class JsonObject:
    """Parent Class for any Object that will be retrieved from the JSON
    and potentially has nested JsonObjects inside.

    This class takes care of parsing the json into python Objects and deals
    with the recursion into the nested structures."""

    primitives = []
    json_objects = {
        # For each class, this dict defines all the "embedded" classes which
        # live directly "under" that class in the nested JSON. It will have the
        # following structure:

        # attribute_name : class

        # In your case the JSON schema does not have any "single" objects
        # in the nesting strcuture, but only lists of nested objects. I
        # still , to demonstrate how you would do it in case, there would be
        # single "embedded"
    }
    json_object_lists = {
        # For each class, this dict defines all the "embedded" subclasses which
        # are provided in a list "under" that class in the nested JSON.
        # It will have the following structure:

        # attribute_name : class
    }

    @classmethod
    def from_dict(cls, d: dict) -> "JsonObject":
        instance = cls()

        for attribute in cls.primitives:
            # Here we just parse all the primitives
            instance.attribute = getattr(d, attribute, None)

        for attribute, klass in cls.json_object_lists.items():
            # Here we parse all lists of embedded JSON Objects
            nested_objects = []
            l = getattr(d, attribute, [])
            for nested_dict in l:
                nested_objects += klass.from_dict(nested_dict)

            setattr(instance, attribute, nested_objects)

        for attribute, klass in cls.json_objects.items():
            # Here we parse all "single" embedded JSON Objects
            setattr(
                instance,
                attribute,
                klass.from_dict(getattr(d, attribute, None)
            )

    def to_csv(self) -> str:
        pass

由于您没有解释如何准确地从JSON创建csv,所以我没有实现该方法,而是将此留给您。也没有必要解释整个方法。你知道吗

现在我们有了通用的父类,所有特定的父类都将从中继承,这样我们就可以将递归应用到我们的问题中。现在我们只需要根据要解析的JSON模式定义这些具体的结构。我从您的示例中得到以下内容,但您可以轻松地更改所需的内容:

class Address(JsonObject):
    primitives = [
        "AddressID",
        "AddressTypeID",
        "Address",
        "Street",
        "City",
        "State",
        "PostalCode",
        "IsPrimary",
        "ParentID",
        "ParentType",
    ]

    json_objects = {}
    json_object_lists = {}


class Lease(JsonObject):
    primitives = [
        "LeaseID",
        "TenantID",
        "UnitID",
        "PropertyID",
        "MoveInDate",
        "SortOrder",
        "CreateDate",
        "UpdateDate",
        "CreateUserID",
        "UpdateUserID",
    ]

    json_objects = {}
    json_object_lists = {}


class UserDefinedValue(JsonObject):
    primitives = [
        "UserDefinedValueID",
        "UserDefinedFieldID",
        "ParentID",
        "Name",
        "Value",
        "UpdateDate",
        "FieldType",
        "UpdateUserID",
        "CreateUserID",
    ]

    json_objects = {}
    json_object_lists = {}


class PhoneNumber(JsonObject):
    primitives = [
        "PhoneNumberID",
        "PhoneNumberTypeID",
        "PhoneNumber",
        "Extension",
        "StrippedPhoneNumber",
        "IsPrimary",
        "ParentID",
        "ParentType",
    ]

    json_objects = {}
    json_object_lists = {}

class Contact(JsonObject):
    primitives = [
        "ContactID",
        "FirstName",
        "LastName",
        "MiddleName",
        "IsPrimary",
        "DateOfBirth",
        "FederalTaxID",
        "Comment",
        "Email",
        "License",
        "Vehicle",
        "IsShowOnBill",
        "Employer",
        "ApplicantType",
        "CreateDate",
        "CreateUserID",
        "UpdateDate",
        "AnnualIncome",
        "UpdateUserID",
        "ParentID",
        "ParentType",
    ]

    json_objects = {}
    json_object_lists = {
        "PhoneNumbers": PhoneNumber,
    }


class Tenant(JsonObject):
    primitives = [
        "TenantID",
        "TenantDisplayID",
        "Name",
        "FirstName",
        "LastName",
        "WebMessage",
        "Comment",
        "RentDueDay",
        "RentPeriod",
        "FirstContact",
        "PropertyID",
        "PostingStartDate",
        "CreateDate",
        "CreateUserID",
        "UpdateDate",
        "UpdateUserID",
        "OpenReceivables",  # Maybe this is also a nested Object? Not clear from your sample.
        "Status",
    ]

    json_object_lists = {
        "Contacts": Contact,
        "UserDefinedValues": UserDefinedValue,
        "Leases": Lease,
        "Addresses": Address,
    }

    json_objects = {}

您可以想象这种方法的“美”(至少:顺序),它在于以下几点:通过这种结构,我们可以处理API的JSON响应中的任何嵌套级别,而不会带来额外的麻烦—我们的代码不会加深缩进级别,因为我们已经将讨厌的嵌套分离到了JsonObjectsfrom_json方法的递归定义中。这就是为什么现在识别错误或对代码应用更改要容易得多的原因。你知道吗

要最终将JSON解析到我们的对象中,您可以执行以下操作:

import typing
import json


def tenants_from_json(json_string: str) -> typing.Iterable["Tenant"]:
    tenants = [
        Tenant.from_dict(tenant_dict)
        for tenant_dict in json.loads(json_string)
    ]
    return tenants

重要的最后一点:这只是基本原则

我的代码示例只是对使用对象和递归来处理结构的大量(讨厌的)嵌套的想法的一个非常简短的介绍。代码有一些缺陷。例如,应该避免定义可变的类变量。当然,整个代码应该验证它从API获得的数据。您还可能希望添加每个属性的类型,并在Python对象中正确地表示它们(例如,您的示例包含整数、datetimes和字符串)。你知道吗

我真的只想在这里向大家展示面向对象编程的原理。你知道吗

我没有花时间测试我的代码。所以可能还有虫子。再说一遍,我只是想证明一下这个原则。你知道吗

这样的方法应该有用:

import pandas as pd 
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 100000)
TenantID = []
TenantDisplayID = []
Name = []
FirstName = []
LastName = []
WebMessage = []
Comment = []
RentDueDay = []
RentPeriod = []
FirstContact = []
PropertyID = []
PostingStartDate = []
CreateDate = []
CreateUserID = []
UpdateDate = []
UpdateUserID = []
Contacts = []
for rm_access in read_content:
    rm_data = rm_access

    print(rm_data)
    TenantID.append(rm_data["TenantID"])
    TenantDisplayID.append(rm_data["TenantDisplayID"])
    Name.append(rm_data["Name"])
    FirstName.append(rm_data["FirstName"])
    LastName.append(rm_data["LastName"])
    WebMessage.append(rm_data["WebMessage"])
    Comment.append(rm_data["Comment"])
    RentDueDay.append(rm_data["RentDueDay"])
    RentPeriod.append(rm_data["RentPeriod"])
    FirstContact.append(rm_data["FirstContact"])
    PropertyID.append(rm_data["PropertyID"])
    PostingStartDate.append(rm_data["PostingStartDate"])
    CreateDate.append(rm_data["CreateDate"])
    CreateUserID.append(rm_data["CreateUserID"])
    UpdateUserID.append(rm_data["UpdateUserID"])
    Contacts.append(rm_data["Contacts"])


df = pd.DataFrame({"TenantID":TenantID,"TenantDisplayID":TenantDisplayID, "Name": Name,
                   "FirstName":FirstName, "LastName": LastName,"WebMessage": WebMessage,
                   "Comment": Comment, "RentDueDay": RentDueDay, "RentPeriod": RentPeriod,
                   "FirstContact": FirstContact, "PropertyID": PropertyID, "PostingStartDate": PostingStartDate,
                   "CreateDate": CreateDate, "CreateUserID": CreateUserID,"UpdateUserID": UpdateUserID,
                   "Contacts": Contacts})

print(df)

相关问题 更多 >