Flattening nested JSON into a pandas.DataFrame: ordering and naming columns based on dictionary values

Posted 2024-09-29 17:13:49


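This question came up while I was applying a helpful answer by Trenton McKinney to process several nested JSON files in pandas.
Following his advice, I used the flatten_json function described there to flatten a batch of nested JSON files. However, I ran into problems because my JSON files are not consistent with each other.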

A single JSON file looks roughly like this:

{
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500"
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250"
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750"
        }
    ]
}

Using the referenced code, I end up with data like this:

[table: flattened output with position-based supplement columns; not preserved]

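There are several problems with this.
First, my data contains a finite list of "supplements", but they are not always present, and when they are, they do not always appear in the same order. In the example table, you can see that two supplements have swapped positions in the second row. I would prefer a fixed order for the supplement columns.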
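
To make the ordering problem concrete, here is a minimal sketch. It uses a condensed variant of the quoted flatten_json (the helper `flatten` and both records are made up for illustration) and shows that index-based flattening names columns by list position, so the same supplement type ends up in different columns when its position changes.

```python
import pandas as pd


def flatten(x, name="", sep="_", out=None):
    """Condensed, illustrative variant of flatten_json: list items are keyed by position."""
    out = {} if out is None else out
    if isinstance(x, dict):
        for key, value in x.items():
            flatten(value, f"{name}{key}{sep}", sep, out)
    elif isinstance(x, list):
        for i, item in enumerate(x):
            flatten(item, f"{name}{i}{sep}", sep, out)
    else:
        # Strip the trailing separator from the accumulated key.
        out[name[:-len(sep)]] = x
    return out


# Two hypothetical records with the same supplements in a different order.
rec_a = {"product_id": "A", "supplement": [
    {"supplementtype": "RTZ", "price": 300000},
    {"supplementtype": "CVB", "price": 500000},
]}
rec_b = {"product_id": "B", "supplement": [
    {"supplementtype": "CVB", "price": 500000},
    {"supplementtype": "RTZ", "price": 300000},
]}

df_positional = pd.DataFrame([flatten(rec_a), flatten(rec_b)])

# The column supplement_0_* holds RTZ for record A but CVB for record B.
print(df_positional[["product_id", "supplement_0_supplementtype", "supplement_0_price"]])
```

Because the column name carries only the list index, nothing in the flattened frame guarantees that one column always describes the same supplement type.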

Second, the ideal result would be a table like this:

[table: desired layout; not preserved]

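I have tried editing the referenced flatten_json function, but I could not work out how to make it do this.
The solution turned out to be simply editing the dictionary (thanks to Andrej Kesely). I only added exception handling in case some keys are missing: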

import pandas as pd

d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}

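# Spread each supplement into its own named columns; skip fields that are missing.
for s in d["supplement"]:
    try:
        d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
    except KeyError:
        pass
    try:
        d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]
    except KeyError:
        pass

# The nested list is no longer needed once its values are top-level keys.
del d["supplement"]

df = pd.DataFrame([d])
print(df)

Prints:
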
               product         product_id         product_type          producer          currency         client_id  supplementtype_RTZ_price supplementtype_RTZ_rebate  supplementtype_CVB_price supplementtype_CVB_rebate  supplementtype_JKL_price supplementtype_JKL_rebate
0  example_productname  example_productid  example_producttype  example_producer  example_currency  example_clientid                    300000                       500                    500000                       250                    100000                       750
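
For processing many files, the same idea could be wrapped in a function that leaves the input untouched and checks for keys instead of catching exceptions. This is only a sketch: the name `widen_supplements`, its `fields` parameter, and the sample record are hypothetical and not part of the original code.

```python
import pandas as pd


def widen_supplements(record, fields=("price", "rebate")):
    """Return a new dict with each supplement spread into columns named by its type."""
    # Copy everything except the nested list, so the caller's dict is not modified.
    flat = {k: v for k, v in record.items() if k != "supplement"}
    for s in record.get("supplement", []):
        kind = s.get("supplementtype")
        if kind is None:
            continue  # without a type there is no way to name the columns
        for field in fields:
            if field in s:
                flat[f"supplementtype_{kind}_{field}"] = s[field]
    return flat


# Hypothetical record in which the first supplement has no rebate.
record = {"product_id": "example_productid", "supplement": [
    {"supplementtype": "RTZ", "price": 300000},
    {"supplementtype": "CVB", "price": 500000, "rebate": "250"},
]}

df_one = pd.DataFrame([widen_supplements(record)])
print(df_one.columns.tolist())
```

A record without a "supplement" key at all simply passes through with its top-level fields.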

Code used/referenced:

import json

import pandas as pd


def flatten_json(nested_json: dict, exclude: list=[''], sep: str='_') -> dict:
    """
    Flatten a list of nested dicts.
    """
    out = dict()
    def flatten(x: (list, dict, str), name: str='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude:
                    flatten(x[a], f'{name}{a}{sep}')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, f'{name}{i}{sep}')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out

# list of files
files = ['test1.json', 'test2.json']

# list to add dataframe from each file
df_list = list()

# iterate through files
for file in files:
    with open(file, 'r') as f:

        # parse the file contents with json
        data = json.load(f)

        # flatten_json into a dataframe and add to the dataframe list
        df_list.append(pd.DataFrame.from_dict(flatten_json(data), orient='index').T)
        
# concat all dataframes together
df = pd.concat(df_list).reset_index(drop=True)
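
As an alternative to flattening by hand, pandas' own `json_normalize` and `pivot` can produce columns named after the supplement type. This is a different technique from the quoted flatten_json, shown only as a sketch: the two records are made up, and it assumes `product_id` is unique per record and that no record repeats a supplement type (otherwise `pivot` raises an error).

```python
import pandas as pd

# Two hypothetical records; the second lists its supplements in a different order.
records = [
    {"product_id": "A", "supplement": [
        {"supplementtype": "RTZ", "price": 300000, "rebate": "500"},
        {"supplementtype": "CVB", "price": 500000, "rebate": "250"},
    ]},
    {"product_id": "B", "supplement": [
        {"supplementtype": "CVB", "price": 500000, "rebate": "250"},
        {"supplementtype": "RTZ", "price": 300000, "rebate": "500"},
    ]},
]

# One row per supplement, with product_id repeated alongside it.
long_df = pd.json_normalize(records, record_path="supplement", meta=["product_id"])

# One row per product, one column per (field, supplementtype) pair.
wide_df = long_df.pivot(index="product_id", columns="supplementtype", values=["price", "rebate"])

# Collapse the two-level column index into the naming scheme used in this thread.
wide_df.columns = [f"supplementtype_{kind}_{field}" for field, kind in wide_df.columns]
wide_df = wide_df.reset_index()
print(wide_df)
```

Here the column names depend on the supplement type, not on the position in the list, so both records line up in the same columns. The remaining top-level fields (product, producer, and so on) would need to be added to `meta` or merged back in separately.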

1 Answer

Answer 1 · posted 2024-09-29 17:13:49

You can modify the dictionary before creating the DataFrame:

import pandas as pd

d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}

for s in d["supplement"]:
    d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
    d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]

del d["supplement"]

df = pd.DataFrame([d])
print(df)

Prints:

               product         product_id         product_type          producer          currency         client_id  supplementtype_RTZ_price supplementtype_RTZ_rebate  supplementtype_CVB_price supplementtype_CVB_rebate  supplementtype_JKL_price supplementtype_JKL_rebate
0  example_productname  example_productid  example_producttype  example_producer  example_currency  example_clientid                    300000                       500                    500000                       250                    100000                       750
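
The question also asks for a fixed column order when supplements are missing or arrive in a different order. One possible way, sketched here under the assumption that the finite set of supplement types is known in advance, is to build the full column list once and pass it to `DataFrame.reindex`. The rows, `SUPPLEMENT_TYPES`, and `FIELDS` below are illustrative names and data, not part of the original post.

```python
import pandas as pd

# Hypothetical, already-reshaped records: the second lists CVB first and has no JKL.
rows = [
    {"product_id": "A", "supplementtype_RTZ_price": 300000,
     "supplementtype_CVB_price": 500000, "supplementtype_JKL_price": 100000},
    {"product_id": "B", "supplementtype_CVB_price": 500000,
     "supplementtype_RTZ_price": 300000},
]

# Assumed to be known up front: every supplement type and field that can occur.
SUPPLEMENT_TYPES = ["RTZ", "CVB", "JKL"]
FIELDS = ["price", "rebate"]

fixed_columns = ["product_id"] + [
    f"supplementtype_{kind}_{field}" for kind in SUPPLEMENT_TYPES for field in FIELDS
]

# reindex puts the columns in the given order and fills absent ones with NaN.
df_fixed = pd.DataFrame(rows).reindex(columns=fixed_columns)
print(df_fixed)
```

Columns that never occur in the data (here all the rebate columns) still appear, filled with NaN, so every batch of files yields the same layout.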
