How to recursively subdivide OSM keys into a tree of dictionaries in Python (XML to JSON)

Posted 2024-06-26 14:27:19


I am using Python in a Jupyter notebook to convert an OSM document for loading into MongoDB. I parse the XML file with xml.etree.ElementTree and write the result to a JSON file.

Many of the tag keys are compound keys, written as colon-separated parts:

<node id='1234'>
    <tag k='service:bicycle:diy' v='yes'/>
    <tag k='service:bicycle:second_hand' v='yes'/>
    <tag k='service:vehicle:brakes' v='yes'/>
</node>

While parsing the XML, I want to build a tree of dictionaries from these tags:

{ 'id': '1234',
  'service': {'bicycle': {'diy': 'yes',
                          'second_hand': 'yes'},
              'vehicle': {'brakes': 'yes'}}}

I also want to do this recursively, so that keys with any number of colons are handled, for example: <tag k='addr:street' v='Main Street'/>

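I have tried several approaches, but each one overwrites the dictionary, so only one entry survives at each level. (For example, the {'diy': 'yes'} entry is lost.)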
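The overwrite can be reproduced without any XML. The following is a minimal sketch with illustrative variable names (they do not appear in the question's code): dict.update replaces the value stored under a key wholesale, so a freshly built nested dict for "service" discards whatever was stored there before.

```python
# Each tag is turned into its own brand-new nested dict...
first = {"service": {"bicycle": {"diy": "yes"}}}
second = {"service": {"bicycle": {"second_hand": "yes"}}}

doc = {}
doc.update(first)
doc.update(second)  # replaces doc["service"] entirely; nothing is merged

print(doc)  # {'service': {'bicycle': {'second_hand': 'yes'}}}
```

To keep both entries, the existing sub-dict has to be reused rather than replaced.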

Here is the most stripped-down version I could produce that still includes the important parts:

### bicycle_node.osm ###
# <?xml version="1.0" encoding="UTF-8"?>
# <osm version="0.6" generator="Overpass API 0.7.56.7 b85c4387">
# <note>Data included in this document is from www.openstreetmap.org. The data is made available under ODbL.</note>
# <meta osm_base="2020-11-05T23:56:03Z"/>
#   <bounds minlat="48.6458000" minlon="-122.5844000" maxlat="48.8595000" maxlon="-122.3455000"/>
#   <node id="255801452">
#     <tag k="name" v="The Hub"/>
#     <tag k="service:bicycle:diy" v="yes"/>
#     <tag k="service:bicycle:second_hand" v="yes"/>
#     <tag k="service:vehicle:painting" v="no"/>
#     <tag k="payment:coin" v="yes"/>
#     <tag k="payment:cash" v="yes"/>
#   </node>
#   <way id="4176487913">
#     <tag k="name" v="Some Place"/>
#     <tag k="service" v="driveway"/>
#   </way>
# </osm>

### Expected JSON ###
# {"_id": "255801452",
#  "name": "The Hub",
#  "service": {"bicycle": {"diy": "yes",
#                          "second_hand": "yes"},
#              "vehicle": {"painting": "no"}},
#  "payment": {"coin": "yes",
#              "cash": "yes"}}
# {"_id": "4176487913",
#  "name": "Some Place",
#  "service": "driveway"}

import xml.etree.ElementTree as ET
import codecs
import json

def get_subdiv_dict():
    return {"service": dict(), "payment": dict(), "wiki": dict()}

def subdiv_key(k, v, subdoc_dict):
    k_split = k.split(":")
    if len(k_split) == 1:
        subdoc_dict.update({ k_split[0]: v })
    else:
        subdoc_dict.update({ k_split[0]: subdiv_key(k=":".join(k_split[1:]),
                                                    v=v,
                                                    subdoc_dict=dict()) })
        
    return subdoc_dict

def shape_element(element):
    doc = dict()
    
    if element.tag in ["node", "way"]:
        # Get attributes.
        for att_k, att_v in element.attrib.items():
            if att_k == "id":
                doc["_id"] = att_v
        # Handle subelements.
        # Subdocs for subdivided keys.
        subdiv_dict = get_subdiv_dict()
        for sub_el in element.iter():
            if sub_el.tag == "tag":
                k = sub_el.attrib["k"]
                v = sub_el.attrib["v"]
                # Subdivide where appropriate.
                k_split = k.split(":")
                if k_split[0] in subdiv_dict.keys() and len(k_split) > 1:
                    subdiv_dict = subdiv_key(k=k, v=v, subdoc_dict=subdiv_dict)
                else:    
                    doc[k] = v
        # Add subdocs to element            
        for subdoc_k in subdiv_dict.keys():
            if subdiv_dict[subdoc_k]:
                doc[subdoc_k] = subdiv_dict[subdoc_k]
                
    return doc

def process_map(file_in, file_out):
    file_out = file_out.format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                fo.write(json.dumps(el) + "\n")
    return data 

process_map('bicycle_node.osm', 'bicycle_node.json')

# Out[1]:
# [{'_id': '255801452',
#   'name': 'The Hub',
#   'service': {'vehicle': {'painting': 'no'}},
#   'payment': {'cash': 'yes'}},
#  {'_id': '4176487913', 'name': 'Some Place', 'service': 'driveway'}]

1 answer
Answer 1 · posted 2024-06-26 14:27:19

Ah, never mind.

In shape_element(), replace subdiv_dict = get_subdiv_dict() with subdiv_dict = dict().

Then rewrite subdiv_key():

def subdiv_key(k, v, subdoc_dict):
    k_split = k.split(":")
    # Base case.
    if len(k_split) == 1:
        subdoc_dict.update({k_split[0]: v})
    # Recursive case.
    else:
        if k_split[0] not in subdoc_dict:
            subdoc_dict[k_split[0]] = dict()
        new_k = ":".join(k_split[1:])
        # Recurse into the existing sub-dict so earlier entries are kept.
        subdiv_key(new_k, v, subdoc_dict[k_split[0]])
    
    return subdoc_dict
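One thing to check when applying this: as far as the code shown here goes, once subdiv_dict starts out empty, the test k_split[0] in subdiv_dict.keys() in shape_element() can never become true, so that condition probably needs to be relaxed to len(k_split) > 1 as well. The sketch below is one way to put the pieces together. It is not the answer's exact code: it uses an iterative variant built on dict.setdefault, and insert_nested and shape_tags are hypothetical names. It also assumes every colon-separated key should be subdivided, not only service, payment and wiki.

```python
import xml.etree.ElementTree as ET

def insert_nested(doc, key, value):
    """Store value in doc under a colon-separated key, creating sub-dicts as needed."""
    parts = key.split(":")
    node = doc
    for part in parts[:-1]:
        # Reuse the existing sub-dict if there is one; otherwise create it.
        node = node.setdefault(part, {})
    node[parts[-1]] = value
    return doc

def shape_tags(element):
    """Build one document from a <node> or <way> element (hypothetical helper)."""
    doc = {"_id": element.attrib["id"]}
    for tag in element.iter("tag"):
        insert_nested(doc, tag.attrib["k"], tag.attrib["v"])
    return doc

xml_text = """<osm>
  <node id="255801452">
    <tag k="name" v="The Hub"/>
    <tag k="service:bicycle:diy" v="yes"/>
    <tag k="service:bicycle:second_hand" v="yes"/>
    <tag k="service:vehicle:painting" v="no"/>
    <tag k="payment:coin" v="yes"/>
    <tag k="payment:cash" v="yes"/>
  </node>
</osm>"""

root = ET.fromstring(xml_text)
docs = [shape_tags(el) for el in root if el.tag in ("node", "way")]
print(docs)
```

A key that appears both as a plain value and as a prefix on the same element (for example service=driveway together with service:bicycle:diy=yes) is not handled here; depending on tag order the sketch would either raise an error or overwrite the earlier value, so real data may need an explicit rule for that case.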
