Answers to: Parsing XML into a pandas DataFrame in Python

Parsing XML into a pandas DataFrame in Python


1 answer

Anonymous, 1 day ago

<p>The problem with the solution in the question is that the element data extraction is not done correctly. The XML in your question is nested several levels deep, which is why the data has to be read and extracted recursively. The solution below should meet your needs in this case. I would still encourage you to read <a href="https://medium.com/@robertopreste/from-xml-to-pandas-dataframes-9292980b1c1c" rel="nofollow noreferrer">this article</a> for a clearer understanding.</p>
<h2>Method 1</h2>
<pre class="lang-py prettyprint-override"><code>import pandas as pd
import xml.etree.ElementTree as ET


def xml2df(xml_source, df_cols, source_is_file=False, show_progress=True):
    """Parse the input XML source and store the result in a pandas
    DataFrame with the given columns.

    For xml_source = xml_file, set: source_is_file = True
    For xml_source = xml_string, set: source_is_file = False

    &lt;element attribute_key1=attribute_value1, attribute_key2=attribute_value2&gt;
        &lt;child1&gt;Child 1 Text&lt;/child1&gt;
        &lt;child2&gt;Child 2 Text&lt;/child2&gt;
        &lt;child3&gt;Child 3 Text&lt;/child3&gt;
    &lt;/element&gt;

    Note that for an xml structure as shown above, the attributes of the
    element tag can be accessed with element.items() (or element.attrib),
    and its child elements with list(element). Any text associated with
    the &lt;element&gt; tag can be accessed as element.text and the name of
    the tag itself can be accessed with element.tag.
    """
    if source_is_file:
        xtree = ET.parse(xml_source)  # xml_source = xml_file
        xroot = xtree.getroot()
    else:
        xroot = ET.fromstring(xml_source)  # xml_source = xml_string
    consolidator_dict = dict()
    default_instance_dict = {label: None for label in df_cols}

    def get_children_info(children, instance_dict):
        # element.getchildren() was deprecated and removed in Python 3.9.
        # Use list(element) to get a list of child elements instead.
        for child in children:
            # Recurse into nested children first
            if len(list(child)) &gt; 0:
                instance_dict = get_children_info(list(child), instance_dict)
            # Copy the attributes of this child, if any
            if len(list(child.keys())) &gt; 0:
                items = child.items()
                instance_dict.update({key: value for (key, value) in items})
            # Store the text of this child under its tag name
            instance_dict.update({child.tag: child.text})
        return instance_dict

    # Loop over all instances
    for instance in list(xroot):
        instance_dict = default_instance_dict.copy()
        # The first attribute is "ID"; list() makes the result indexable
        ikey, ivalue = list(instance.items())[0]
        instance_dict.update({ikey: ivalue})
        if show_progress:
            print('{}: {}={}'.format(instance.tag, ikey, ivalue))
        # Loop inside every instance
        instance_dict = get_children_info(list(instance), instance_dict)
        consolidator_dict[ivalue] = instance_dict.copy()
    df = pd.DataFrame(consolidator_dict).T
    df = df[df_cols]
    return df
</code></pre>
<p>To generate the desired output, call <code>xml2df()</code> with your XML source and the list of column names you want (<code>df_cols</code>).</p>
<h2>Method 2</h2>
<p>You can also convert the XML to a dictionary first and then build the DataFrame from it. Run the following to get the desired output.</p>
<p><strong>Note</strong>: You need to install <a href="https://github.com/martinblech/xmltodict" rel="nofollow noreferrer"><code>xmltodict</code></a> to use Method 2. This method is inspired by the solution proposed by @martin-blech in <a href="https://stackoverflow.com/questions/471946/how-to-convert-xml-to-json-in-python">How to convert XML to JSON in Python? [duplicate]</a>. Kudos to <a href="https://stackoverflow.com/users/113643/martin-blech">@martin-blech</a> for making it.</p>
<pre><code>pip install -U xmltodict
</code></pre>
<blockquote> <p>Solution</p> </blockquote>
<pre class="lang-py prettyprint-override"><code>def read_recursively(x, instance_dict):
    # Note: df_cols is read from the enclosing (module-level) scope.
    txt = ''
    for key in x.keys():
        # xmltodict prefixes attribute names with "@"
        k = key.replace("@", "")
        if k in df_cols:
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            instance_dict.update({k: x.get(key)})
        else:
            # dig deeper if value is another dict
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            # add simple text associated with element
            if k == '#text':
                txt = x.get(key)
        # update text to corresponding parent element
        if (k != '#text') and (txt != ''):
            instance_dict.update({k: txt})
    return (instance_dict, txt)
</code></pre>
<p>You need the function <code>read_recursively()</code> given above. It reads the column list from a module-level <code>df_cols</code>, so define that first. Now run the following.</p>
<pre class="lang-py prettyprint-override"><code>import json

import pandas as pd
import xmltodict

o = xmltodict.parse(xml_string)  # INPUT: XML_STRING
#print(json.dumps(o))  # uncomment to see xml to json converted string
consolidated_dict = dict()
oi = o['Instances']['Instance']
for x in oi:
    instance_dict = dict()
    instance_dict, _ = read_recursively(x, instance_dict)
    consolidated_dict.update({x.get("@ID"): instance_dict.copy()})
df = pd.DataFrame(consolidated_dict).T
df = df[df_cols]
df
</code></pre>
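To make the recursive idea behind Method 1 concrete, here is a minimal sketch on a small, made-up XML string. The sample document, its tag and attribute names, and the helper names `flatten`, `xml_to_rows`, and `rows_to_dataframe` are all assumptions for illustration; this is not the answer's exact function, and unlike `xml2df()` it keeps text only for leaf elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the tag and attribute names are invented for illustration.
SAMPLE_XML = """
<Instances>
  <Instance ID="1">
    <Name>alpha</Name>
    <Details>
      <Size unit="kb">10</Size>
    </Details>
  </Instance>
  <Instance ID="2">
    <Name>beta</Name>
    <Details>
      <Size unit="mb">3</Size>
    </Details>
  </Instance>
</Instances>
"""


def flatten(element, row):
    """Recursively copy attributes and leaf text of `element` into `row`."""
    row.update(element.attrib)      # attributes, e.g. ID="1"
    children = list(element)        # direct child elements
    if not children:
        # Leaf element: store its text under the tag name
        row[element.tag] = (element.text or "").strip()
    for child in children:
        flatten(child, row)         # descend into nested levels
    return row


def xml_to_rows(xml_string):
    """Return one flat dict per direct child of the root element."""
    root = ET.fromstring(xml_string)
    return [flatten(instance, {}) for instance in root]


def rows_to_dataframe(rows, columns):
    """Build a DataFrame restricted to `columns` (missing keys become NaN)."""
    # Imported here so the parsing part runs even without pandas installed.
    import pandas as pd
    return pd.DataFrame(rows, columns=columns)


rows = xml_to_rows(SAMPLE_XML)
print(rows)
```

Calling `rows_to_dataframe(rows, ["ID", "Name", "Size", "unit"])` then gives one DataFrame row per `Instance`. Because every level writes into the same flat dict, an attribute or tag name that repeats at different depths overwrites the earlier value; if your XML has such collisions, prefix the keys with the parent tag.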
