How do I do a multi-key groupby JSON transformation?

Posted 2024-10-01 05:07:02


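I need to convert a flat JSON schema (the result of a MySQL query) into a hierarchical JSON structure grouped by two keys. I have a working solution that uses itertools groupby, but I have more transformations to write (some more complex than this one), and I am looking for a better way to do this in Python (I am on 3.7). Perhaps I am overlooking some basic Python operators that would cut down my line count, or perhaps there is a better library. I have read about pandas, which has a groupby operation, but its focus is data analysis rather than data transformation like this. In Node.js I used JSONata, so I am wondering whether Python has a comparable library for JSON transformation.

To clarify: I want to improve my development efficiency; I am not concerned about runtime efficiency, because my data sets are small.

Sample input is shown in the code example below, and the output needs to look like this (two levels of groupby, with the elements renamed):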

{'researchSubTypeToResolutionCodes': [
  {'researchSubTypeCode': None, 'resolutionTypes': [
    {'resolutionCode': 999991, 'resolutionSubTypeCodes': [99992, 99993]},
    {'resolutionCode': 999995, 'resolutionSubTypeCodes': [99996]}
    ]
  },
  {'researchSubTypeCode': 33533, 'resolutionTypes': [
    {'resolutionCode': 33726, 'resolutionSubTypeCodes': [33730, 33731, 33732, 33774]},
    {'resolutionCode': 33727, 'resolutionSubTypeCodes': [33730, 33731]}
    ]
  },
  {'researchSubTypeCode': 33534, 'resolutionTypes': [
    {'resolutionCode': 33726, 'resolutionSubTypeCodes': [33730]}
    ]
  }
]}

Here is the working code, using itertools, that produces the desired output:

from itertools import groupby
from operator import itemgetter

def mapResearchSubTypeToResolutionCodesToSchema(qryResult):
    groupByRschSubTypeDict = {}
    grouper = itemgetter("rsch_sub_typ_cd","resl_cd")
    # groupby only merges adjacent rows, so qryResult must already be ordered by both keys
    for key, grp in groupby(qryResult, grouper):
        key_dict = dict(zip(["rsch_sub_typ_cd","resl_cd"], key))
        rschSubTyp = key_dict["rsch_sub_typ_cd"]
        reslSubTypCds = []
        for itm in grp:
            reslSubTypCds.append(itm["sub_resl_cd"])
        resolutionType = {
            "resolutionCode": key_dict["resl_cd"],
            "resolutionSubTypeCodes": reslSubTypCds
        }
        # Add to resolutionTypes list if already there, or create new one
        researchSubTypeCode_resolutionTypes = groupByRschSubTypeDict.get(rschSubTyp)
        if not researchSubTypeCode_resolutionTypes:
            researchSubTypeCode_resolutionTypes = []
            groupByRschSubTypeDict[rschSubTyp] = researchSubTypeCode_resolutionTypes
        researchSubTypeCode_resolutionTypes.append(resolutionType)

    finalResult = _transformToFinalSchema(groupByRschSubTypeDict)
    return finalResult

def _transformToFinalSchema(groupByRschSubTypeDict):
    researchSubTypeToResolutionCodesList = []
    for k,v in groupByRschSubTypeDict.items():
        newItem = {
            "researchSubTypeCode": k,
            "resolutionTypes": v
        }
        researchSubTypeToResolutionCodesList.append(newItem)

    finalResult = {
        "researchSubTypeToResolutionCodes": researchSubTypeToResolutionCodesList
    }
    return finalResult

if __name__ == '__main__':
    TEST_QRY_DATA = [
        {"rsch_sub_typ_cd": None, "resl_cd": 999991, "sub_resl_cd": 99992},
        {"rsch_sub_typ_cd": None, "resl_cd": 999991, "sub_resl_cd": 99993},
        {"rsch_sub_typ_cd": None, "resl_cd": 999995, "sub_resl_cd": 99996},
        {"rsch_sub_typ_cd": 33533, "resl_cd": 33726, "sub_resl_cd": 33730},
        {"rsch_sub_typ_cd": 33533, "resl_cd": 33726, "sub_resl_cd": 33731},
        {"rsch_sub_typ_cd": 33533, "resl_cd": 33726, "sub_resl_cd": 33732},
        {"rsch_sub_typ_cd": 33533, "resl_cd": 33726, "sub_resl_cd": 33774},
        {"rsch_sub_typ_cd": 33533, "resl_cd": 33727, "sub_resl_cd": 33730},
        {"rsch_sub_typ_cd": 33533, "resl_cd": 33727, "sub_resl_cd": 33731},
        {"rsch_sub_typ_cd": 33534, "resl_cd": 33726, "sub_resl_cd": 33730}
    ]
    result = mapResearchSubTypeToResolutionCodesToSchema(TEST_QRY_DATA)
    print(result)

1 answer
User
#1 · Posted 2024-10-01 05:07:02

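I do it in two steps, but with far fewer lines, and it should be conceptually easier to read through.

First, let's collect the values we want. This is essentially a groupby function. To better understand how it works, add a print statement such as print(temp_dic) at the end of the for loop body.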

temp_dic = dict()
for entry in TEST_QRY_DATA:
    if entry["rsch_sub_typ_cd"] not in temp_dic:
        temp_dic[entry["rsch_sub_typ_cd"]] = dict()
    if entry["resl_cd"] in temp_dic[entry["rsch_sub_typ_cd"]]:
        temp_dic[entry["rsch_sub_typ_cd"]][entry["resl_cd"]].append(entry["sub_resl_cd"])
    else:
        temp_dic[entry["rsch_sub_typ_cd"]][entry["resl_cd"]] = [entry["sub_resl_cd"]]
print(temp_dic)

Output:

{
  None: {999991: [99992, 99993], 999995: [99996]}, 
  33533: {33726: [33730, 33731, 33732, 33774], 33727: [33730, 33731]}, 
  33534: {33726: [33730]}
}
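
As a variation on the step above, the membership checks can be dropped by letting collections.defaultdict create the missing levels on first access. This is only a sketch: the names rows, grouped and plain are illustrative, and rows is a shortened stand-in for TEST_QRY_DATA.

```python
from collections import defaultdict

# Shortened sample rows in the same shape as TEST_QRY_DATA (illustrative only).
rows = [
    {"rsch_sub_typ_cd": None, "resl_cd": 999991, "sub_resl_cd": 99992},
    {"rsch_sub_typ_cd": None, "resl_cd": 999991, "sub_resl_cd": 99993},
    {"rsch_sub_typ_cd": 33533, "resl_cd": 33726, "sub_resl_cd": 33730},
]

# Outer key -> inner key -> list of sub codes.
# Missing outer and inner entries are created automatically on first access.
grouped = defaultdict(lambda: defaultdict(list))
for row in rows:
    grouped[row["rsch_sub_typ_cd"]][row["resl_cd"]].append(row["sub_resl_cd"])

# Convert back to plain dicts so the result prints and compares like temp_dic.
plain = {outer: dict(inner) for outer, inner in grouped.items()}
print(plain)
```

Unlike itertools groupby, this does not depend on the rows being ordered by key, since each row is filed under its keys regardless of position.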

Now we can add the required labels:

from pprint import pprint

final_dict = {'researchSubTypeToResolutionCodes': []}
for researchSubTypeCode, dic in temp_dic.items():
    temp_list = [{'resolutionCode': key, 'resolutionSubTypeCodes': val} for key, val in dic.items()]
    # Use a separate name here so temp_dic is not rebound while we iterate over it
    new_item = {'researchSubTypeCode': researchSubTypeCode, 'resolutionTypes': temp_list}
    final_dict['researchSubTypeToResolutionCodes'].append(new_item)

pprint(final_dict)

Output:

{'researchSubTypeToResolutionCodes': [
    {'researchSubTypeCode': None, 'resolutionTypes': [{'resolutionCode': 999991, 'resolutionSubTypeCodes': [99992, 99993]}, {'resolutionCode': 999995, 'resolutionSubTypeCodes': [99996]}]}, 
    {'researchSubTypeCode': 33533, 'resolutionTypes': [{'resolutionCode': 33726, 'resolutionSubTypeCodes': [33730, 33731, 33732, 33774]}, {'resolutionCode': 33727, 'resolutionSubTypeCodes': [33730, 33731]}]}, 
    {'researchSubTypeCode': 33534, 'resolutionTypes': [{'resolutionCode': 33726, 'resolutionSubTypeCodes': [33730]}]}
]}

A more dynamic, recursive solution could be built with OrderedDict and defaultdict, but that would take some time to work out.
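
One possible shape for such a recursive version is sketched below. It is not the approach from the answer above: it uses a plain dict with setdefault rather than OrderedDict or defaultdict, relying on dicts preserving insertion order in Python 3.7+. The function name nest, the LEVELS tuple layout and the shortened rows list are all hypothetical, introduced only for illustration.

```python
def nest(rows, levels, leaf_field):
    """Group rows by each (field, key_label, children_label) level in turn.

    When no levels remain, return the list of leaf_field values.
    """
    if not levels:
        return [row[leaf_field] for row in rows]

    (field, key_label, children_label), rest = levels[0], levels[1:]

    # Bucket rows by the current field; insertion order is kept (Python 3.7+).
    buckets = {}
    for row in rows:
        buckets.setdefault(row[field], []).append(row)

    # Recurse into each bucket to build the next level down.
    return [
        {key_label: key, children_label: nest(bucket, rest, leaf_field)}
        for key, bucket in buckets.items()
    ]


# Hypothetical level description: source field, output key name, output children name.
LEVELS = [
    ("rsch_sub_typ_cd", "researchSubTypeCode", "resolutionTypes"),
    ("resl_cd", "resolutionCode", "resolutionSubTypeCodes"),
]

# Shortened sample rows in the same shape as TEST_QRY_DATA (illustrative only).
rows = [
    {"rsch_sub_typ_cd": None, "resl_cd": 999991, "sub_resl_cd": 99992},
    {"rsch_sub_typ_cd": None, "resl_cd": 999991, "sub_resl_cd": 99993},
    {"rsch_sub_typ_cd": 33533, "resl_cd": 33726, "sub_resl_cd": 33730},
    {"rsch_sub_typ_cd": 33533, "resl_cd": 33727, "sub_resl_cd": 33730},
]

result = {"researchSubTypeToResolutionCodes": nest(rows, LEVELS, "sub_resl_cd")}
print(result)
```

With this layout, a further grouping level would only need another tuple in LEVELS; the function itself would stay unchanged.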
