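I'm using Python in a Jupyter notebook to convert an OSM document for loading into MongoDB. I parse the XML file with xml.etree.ElementTree and write the result to a JSON file.
Many tag keys are compound keys whose parts are separated by colons: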
<node id='1234'>
  <tag k='service:bicycle:diy' v='yes'/>
  <tag k='service:bicycle:second_hand' v='yes'/>
  <tag k='service:vehicle:brakes' v='yes'/>
</node>
While parsing the XML, I want to build a tree of dictionaries from these tags:
{'id': '1234',
 'service': {'bicycle': {'diy': 'yes',
                         'second_hand': 'yes'},
             'vehicle': {'brakes': 'yes'}}}
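I also want to do this recursively, so that keys with any number of colons are handled, for example: <tag k='addr:street' v='Main Street'/>
I have tried several approaches, but each one overwrites the dictionary, so every level ends up with only one entry. (For example, the {'diy': 'yes'} entry is lost.)
Here is the most stripped-down version I could produce that still includes the important parts: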
### bicycle_node.osm ###
# <?xml version="1.0" encoding="UTF-8"?>
# <osm version="0.6" generator="Overpass API 0.7.56.7 b85c4387">
#   <note>Data included in this document is from www.openstreetmap.org. The data is made available under ODbL.</note>
#   <meta osm_base="2020-11-05T23:56:03Z"/>
#   <bounds minlat="48.6458000" minlon="-122.5844000" maxlat="48.8595000" maxlon="-122.3455000"/>
#   <node id="255801452">
#     <tag k="name" v="The Hub"/>
#     <tag k="service:bicycle:diy" v="yes"/>
#     <tag k="service:bicycle:second_hand" v="yes"/>
#     <tag k="service:vehicle:painting" v="no"/>
#     <tag k="payment:coin" v="yes"/>
#     <tag k="payment:cash" v="yes"/>
#   </node>
#   <way id="4176487913">
#     <tag k="name" v="Some Place"/>
#     <tag k="service" v="driveway"/>
#   </way>
# </osm>

### Expected JSON ###
# {"_id": "255801452",
#  "name": "The Hub",
#  "service": {"bicycle": {"diy": "yes",
#                          "second_hand": "yes"},
#              "vehicle": {"painting": "no"}},
#  "payment": {"coin": "yes",
#              "cash": "yes"}}
# {"_id": "4176487913",
#  "name": "Some Place",
#  "service": "driveway"}

import xml.etree.ElementTree as ET
import codecs
import json

def get_subdiv_dict():
    return {"service": dict(), "payment": dict(), "wiki": dict()}


def subdiv_key(k, v, subdoc_dict):
    k_split = k.split(":")
    if len(k_split) == 1:
        subdoc_dict.update({k_split[0]: v})
    else:
        subdoc_dict.update({k_split[0]: subdiv_key(k=":".join(k_split[1:]),
                                                   v=v,
                                                   subdoc_dict=dict())})
    return subdoc_dict


def shape_element(element):
    doc = dict()
    if element.tag in ["node", "way"]:
        # Get attributes.
        for att_k, att_v in element.attrib.items():
            if att_k == "id":
                doc["_id"] = att_v
        # Handle subelements.
        # Subdocs for subdivided keys.
        subdiv_dict = get_subdiv_dict()
        for sub_el in element.iter():
            if sub_el.tag == "tag":
                k = sub_el.attrib["k"]
                v = sub_el.attrib["v"]
                # Subdivide where appropriate.
                k_split = k.split(":")
                if k_split[0] in subdiv_dict.keys() and len(k_split) > 1:
                    subdiv_dict = subdiv_key(k=k, v=v, subdoc_dict=subdiv_dict)
                else:
                    doc[k] = v
        # Add subdocs to element
        for subdoc_k in subdiv_dict.keys():
            if subdiv_dict[subdoc_k]:
                doc[subdoc_k] = subdiv_dict[subdoc_k]
        return doc


def process_map(file_in, file_out):
    file_out = file_out.format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                fo.write(json.dumps(el) + "\n")
    return data


process_map('bicycle_node.osm', 'bicycle_node.json')
# Out[1]:
# [{'_id': '255801452',
#   'name': 'The Hub',
#   'service': {'vehicle': {'painting': 'no'}},
#   'payment': {'cash': 'yes'}},
#  {'_id': '4176487913', 'name': 'Some Place', 'service': 'driveway'}]

Ah, never mind.
In shape_element(), replace subdiv_dict = get_subdiv_dict() with subdiv_dict = dict(), then rewrite subdiv_key() so that it no longer overwrites the nested dictionaries that already exist.
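The rewritten function itself is not reproduced here, so what follows is only a minimal sketch of one way to do it, not the original answer's code. The overwriting in the question comes from calling update() with a brand-new dict() at every level, which replaces whatever was already stored under that key. The sketch instead reuses the sub-dictionary that is already there and recurses into it. The names insert_nested and shape_tags are hypothetical, and two assumptions are made: every colon-separated key is nested (not only service, payment and wiki), and a plain value found where a sub-dictionary is needed is replaced by a dictionary.

```python
import xml.etree.ElementTree as ET


def insert_nested(target, key, value):
    """Store value in target under a colon-separated key.

    Hypothetical helper: one nested dict per key segment, reusing
    dicts that already exist instead of replacing them.
    """
    head, _, rest = key.partition(":")
    if not rest:
        # Last segment: store the value itself.
        target[head] = value
        return target
    child = target.get(head)
    if not isinstance(child, dict):
        # Assumption: a plain value (or nothing) at this level is
        # replaced by a fresh sub-dictionary.
        child = target[head] = {}
    insert_nested(child, rest, value)
    return target


def shape_tags(element):
    """Build one document from an element's id and its <tag> children."""
    doc = {"_id": element.attrib["id"]}
    for tag in element.iter("tag"):
        insert_nested(doc, tag.attrib["k"], tag.attrib["v"])
    return doc


sample = """<node id="255801452">
  <tag k="name" v="The Hub"/>
  <tag k="service:bicycle:diy" v="yes"/>
  <tag k="service:bicycle:second_hand" v="yes"/>
  <tag k="service:vehicle:painting" v="no"/>
  <tag k="payment:coin" v="yes"/>
  <tag k="payment:cash" v="yes"/>
</node>"""

doc = shape_tags(ET.fromstring(sample))
print(doc)
```

For the sample node from the question this produces the nested service and payment sub-documents listed under "Expected JSON". To use it with the question's code, shape_element() could call insert_nested(doc, k, v) for every tag instead of going through subdiv_dict.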