Converting a hierarchical DataFrame into a nested list of dictionaries

Posted 2024-05-10 11:23:01


My data can be downloaded here. The columns I use are state_name, county_name, city, and population.

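My goal is to build a network whose nodes are states, cities, and counties, sized by population. This will be part of an application, so the choice of node levels will be dynamic and can be any combination of state, city, and county. Here is the visualization I want to achieve. The data needs to look like this: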

[{name: "state1",
  children:[{name: "county1",
             children:[{name: "city1",
                        population: "13000"
                       },
                       {name: "city2",
                        population: "10000"
                       }]
            },
            {name: "county2",
             children:[{name: "city1",
                        population: "1000"
                       },
                       {name: "city2",
                        population: "100000"
                       }]
            }]
},{name: "state2",
  children:[{name: "county1",
             children:[{name: "city1",
                        population: "13000"
                       },
                       {name: "city2",
                        population: "10000"
                       }]
            },
            {name: "county2",
             children:[{name: "city1",
                        population: "1000"
                       },
                       {name: "city2",
                        population: "100000"
                       }]
            }]
}]

This is what I have tried so far:

import pandas as pd
from benedict import benedict

# read in the data
df = pd.read_csv("C:\\Users\\m316375\\Downloads\\uscities.csv")

# Using Benedict to create a nested dictionary
# .copy() avoids a SettingWithCopyWarning when the new column is added
df_benedict = df[["state_name", "city", "county_name", "population"]].copy()
node_id = ["state_name", "county_name", "city"]
df_benedict["dict_path"] = df_benedict[node_id].astype(str).apply("_".join, axis=1)

d = benedict()
d.keypath_separator = "_"

# Each key path has the form state_county_city and stores that city's population
for _, row in df_benedict.iterrows():
    d[row["dict_path"]] = row["population"]

##### First Attempt ########
# Loop through the nested dictionary; the key path order is state -> county -> city
dict_list = []
for state, counties in d.items():
    county_children = []
    for county, cities in counties.items():
        # Leaf nodes: one entry per city with its population
        city_children = [{"name": city, "value": population}
                         for city, population in cities.items()]
        county_children.append({"name": county,
                                "children": city_children})
    dict_list.append({"name": state,
                      "children": county_children})

Problem: my approach is not dynamic. If I only want to show states and cities, I have to remove one of the for loops, which is not ideal.


1 answer
User
#1 · posted 2024-05-10 11:23:01

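I think I have what you need, although it is a bit clunky. Assuming the data from the link you provided is loaded into a DataFrame named df, the code is as follows.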

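First, groupby moves state, county, and city into a MultiIndex and leaves population as the only values. Selecting the population column before calling sum() avoids summing the other columns of the file:

df_gr = df.groupby(['state_name', 'county_name', 'city'])['population'].sum()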
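To make the result of this step concrete, here is a small sketch on made-up rows. The toy values and the names toy and toy_gr are invented for illustration and do not come from the linked file.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real data
toy = pd.DataFrame({
    "state_name": ["state1", "state1", "state1", "state2"],
    "county_name": ["county1", "county1", "county2", "county1"],
    "city": ["city1", "city2", "city1", "city1"],
    "population": [13000, 10000, 1000, 500],
})

# Same grouping as above, applied to the toy rows
toy_gr = toy.groupby(["state_name", "county_name", "city"])["population"].sum()

# The three grouping columns become the levels of a MultiIndex
print(toy_gr.index.names)

# A full (state, county, city) key looks up a single population value
print(toy_gr.loc[("state1", "county1", "city2")])
```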

Then we can build the required dictionary with a dictionary comprehension:

resulting_dict = {
    level0: {
        level1: {
            level2: df_gr.xs((level0, level1, level2))
            for level2 in df_gr.xs((level0, level1)).reset_index().groupby(['city']).sum().index
        }
        for level1 in df_gr.xs(level0).reset_index().groupby(['county_name', 'city']).sum().index.levels[0]
    }
    for level0 in df_gr.index.levels[0]
}

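Basically, we use .xs() to return a cross-section of the data at the desired level. We also make sure we do not loop over level combinations that do not exist. .reset_index() followed by .groupby() is used to get the index of the cross-section rather than of the whole DataFrame (using .index.levels right after .xs() returns the levels of the whole DataFrame, and I do not know of a simpler way to make it return only the index of the cross-section).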
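One possibly simpler route, offered as a sketch rather than as part of the original answer, is to read the labels that are actually present with get_level_values(), or to prune the levels with remove_unused_levels(). The toy rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; state2 has only one county
toy = pd.DataFrame({
    "state_name": ["state1", "state1", "state1", "state2"],
    "county_name": ["county1", "county1", "county2", "county1"],
    "city": ["city1", "city2", "city1", "city1"],
    "population": [13000, 10000, 1000, 500],
})
toy_gr = toy.groupby(["state_name", "county_name", "city"])["population"].sum()

# Cross-section for one state; its index is (county_name, city)
section = toy_gr.xs("state2")

# get_level_values() returns only the labels that occur in the cross-section
counties_present = section.index.get_level_values("county_name").unique().tolist()

# remove_unused_levels() drops level labels that no longer occur
pruned_levels = section.index.remove_unused_levels().levels[0].tolist()

print(counties_present)
print(pruned_levels)
```

Either call can stand in for the .reset_index().groupby(...) detour when only the labels of a cross-section are needed.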

You can adapt the dictionary comprehension to the output format you need.
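As one way to get the name/children format from the question with a dynamic choice of levels, the following is a minimal sketch of a recursive helper. The function build_tree and the toy rows are hypothetical and are not part of the original answer. It assumes the population column is numeric.

```python
import json

import pandas as pd


def build_tree(frame, levels, value_col="population"):
    """Nest `frame` by the columns in `levels`, in that order (hypothetical helper)."""
    level, *rest = levels
    nodes = []
    # Grouping by a single column name yields scalar group keys
    for name, group in frame.groupby(level, sort=False):
        if rest:
            # Inner node: recurse into the remaining levels
            nodes.append({"name": name,
                          "children": build_tree(group, rest, value_col)})
        else:
            # Leaf node: total of the value column for this group
            nodes.append({"name": name, value_col: int(group[value_col].sum())})
    return nodes


# Hypothetical toy rows standing in for the real data
toy = pd.DataFrame({
    "state_name": ["state1", "state1", "state1", "state2"],
    "county_name": ["county1", "county1", "county2", "county1"],
    "city": ["city1", "city2", "city1", "city1"],
    "population": [13000, 10000, 1000, 500],
})

tree = build_tree(toy, ["state_name", "county_name", "city"])
flat = build_tree(toy, ["state_name", "city"])
print(json.dumps(tree, indent=2))
```

Changing the list of levels changes the nesting without touching any loops. Note that when a level is skipped, as in the state and city case, cities with the same name in different counties of one state are merged and their populations summed.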
