Answers to the question: Extracting nested XML elements of different sizes into


Suppose we have an arbitrary XML document like the one below:
<pre><code><?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://something.org/schema/s/program">
  <program xmlns:xsd="http://www.w3.org/2001/XMLSchema"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://something.org/schema/s/program http://something.org/schema/s/program.xsd">
    <orgUnitId>Organization 1</orgUnitId>
    <requiredLevel>academic bachelor</requiredLevel>
    <requiredLevel>academic master</requiredLevel>
    <programDescriptionText xml:lang="nl">Here is some text; blablabla</programDescriptionText>
    <searchword xml:lang="nl">Scrum master</searchword>
  </program>
  <program xmlns:xsd="http://www.w3.org/2001/XMLSchema"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://something.org/schema/s/program http://something.org/schema/s/program.xsd">
    <requiredLevel>bachelor</requiredLevel>
    <requiredLevel>academic master</requiredLevel>
    <requiredLevel>academic bachelor</requiredLevel>
    <orgUnitId>Organization 2</orgUnitId>
    <programDescriptionText xml:lang="nl">Text from another organization about some stuff.</programDescriptionText>
    <searchword xml:lang="nl">Executives</searchword>
  </program>
  <program xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <orgUnitId>Organization 3</orgUnitId>
    <programDescriptionText xml:lang="nl">Also another huge text description from another organization.</programDescriptionText>
    <searchword xml:lang="nl">Negotiating</searchword>
    <searchword xml:lang="nl">Effective leadership</searchword>
    <searchword xml:lang="nl">negotiating techniques</searchword>
    <searchword xml:lang="nl">leadership</searchword>
    <searchword xml:lang="nl">strategic planning</searchword>
  </program>
</programs>
</code></pre>
At the moment I am <code>looping</code> through the required elements using their namespace-qualified paths, because I could not get any of ElementTree's <code>get</code> or <code>find</code> methods to work. My code therefore looks like this:
<pre><code>import pandas as pd
import xml.etree.ElementTree as ET
import numpy as np
import itertools

tree = ET.parse('data.xml')
root = tree.getroot()
root.tag

dfcols = ['organization', 'description', 'level', 'keyword']
organization = []
description = []
level = []
keyword = []

# Every match is appended to one flat list per column, so the link
# between a value and the program it came from is lost.
for node in root:
    for child in node.findall('.//{http://something.org/schema/s/program}orgUnitId'):
        organization.append(child.text)
    for child in node.findall('.//{http://something.org/schema/s/program}programDescriptionText'):
        description.append(child.text)
    for child in node.findall('.//{http://something.org/schema/s/program}requiredLevel'):
        level.append(child.text)
    for child in node.findall('.//{http://something.org/schema/s/program}searchword'):
        keyword.append(child.text)
</code></pre>
The goal, of course, is to create a DataFrame. However, each <code>program</code> node in the XML file can contain one or more of certain elements, such as <code>requiredLevel</code> or <code>searchword</code>, so I currently lose data when I cast the lists into a DataFrame, either with:
<pre><code>df = pd.DataFrame(
    list(itertools.zip_longest(organization, description, level, keyword,
                               fillvalue=np.nan)),
    columns=dfcols)
</code></pre>
or with <code>pd.Series</code> as given <a href="https://stackoverflow.com/questions/49891200/generate-a-dataframe-from-list-with-different-length">here</a>, or with another solution that I cannot seem to adapt from <a href="https://stackoverflow.com/questions/53427905/extracting-data-from-xml-tree-into-pandas-csv-with-python">here</a>.

My best bet is probably not to use lists at all, since they do not keep the data indexed correctly: I lose the data from the 2nd through the Xth child node. But right now I am stuck and do not see any other options.

My end result should look like this:
<pre><code>organization    description  level               keyword
Organization 1  ....         academic bachelor,  Scrum master
                             academic master
Organization 2  ....         bachelor,           Executives
                             academic master,
                             academic bachelor
Organization 3  ....                             Negotiating,
                                                 Effective leadership,
                                                 negotiating techniques,
                                                 ....
</code></pre>
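To make the data loss concrete, here is a minimal sketch of what happens when flat per-column lists of different lengths are zipped together. The two short lists are hypothetical stand-ins for <code>organization</code> and <code>level</code>, not the full data from the question.

```python
import itertools

# Hypothetical stand-ins: two organizations, but three level values,
# because the first organization has two requiredLevel elements.
organization = ["Organization 1", "Organization 2"]
level = ["academic bachelor", "academic master", "bachelor"]

# zip_longest pairs items purely by position, not by parent element.
rows = list(itertools.zip_longest(organization, level, fillvalue=None))

for row in rows:
    print(row)
```

The second row pairs "Organization 2" with "academic master", a value that belongs to the first organization, and the third row has no organization at all. Collecting values per <code>program</code> element avoids this.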


1 answer


A lightweight <code>xml_to_dict</code> converter can be found <a href="https://stackoverflow.com/a/10077069/1498199">here</a>. It can be improved to handle namespaces using <a href="https://stackoverflow.com/a/25920989/1498199" rel="nofollow noreferrer">this</a>.
<pre><code>import io
from collections import defaultdict
from xml.etree import ElementTree

import pandas as pd


def xml_to_dict(xml='', remove_namespace=True):
    """Converts an XML string into a dict

    Args:
        xml: The XML as string
        remove_namespace: True (default) if namespaces are to be removed

    Returns:
        The XML string as dict

    Examples:
        >>> xml_to_dict('<text><para>hello world</para></text>')
        {'text': {'para': 'hello world'}}
    """
    def _xml_remove_namespace(buf):
        # Reference: https://stackoverflow.com/a/25920989/1498199
        # Strips the '{namespace}' prefix from every element tag
        # (attribute names are left untouched).
        it = ElementTree.iterparse(buf)
        for _, el in it:
            if '}' in el.tag:
                el.tag = el.tag.split('}', 1)[1]
        return it.root

    def _xml_to_dict(t):
        # Reference: https://stackoverflow.com/a/10077069/1498199
        d = {t.tag: {} if t.attrib else None}
        children = list(t)
        if children:
            # Group children by tag; repeated tags become lists.
            dd = defaultdict(list)
            for dc in map(_xml_to_dict, children):
                for k, v in dc.items():
                    dd[k].append(v)
            d = {t.tag: {k: v[0] if len(v) == 1 else v for k, v in dd.items()}}
        if t.attrib:
            d[t.tag].update(('@' + k, v) for k, v in t.attrib.items())
        if t.text:
            text = t.text.strip()
            if children or t.attrib:
                if text:
                    d[t.tag]['#text'] = text
            else:
                d[t.tag] = text
        return d

    buffer = io.StringIO(xml.strip())
    if remove_namespace:
        root = _xml_remove_namespace(buffer)
    else:
        root = ElementTree.parse(buffer).getroot()
    return _xml_to_dict(root)
</code></pre>
So let <code>s</code> be the string that holds the XML. We can convert it to a dict with <code>d = xml_to_dict(s, remove_namespace=True)</code>. The solution is then straightforward:
<pre><code>rows = []
for program in d['programs']['program']:
    cols = []
    cols.append(program['orgUnitId'])
    cols.append(program['programDescriptionText']['#text'])

    # requiredLevel may be missing, a single string, or a list of strings.
    levels = program.get('requiredLevel', [])
    if isinstance(levels, str):
        levels = [levels]
    cols.append(','.join(levels))

    # searchword is a dict when it occurs once and a list of dicts otherwise;
    # indexing a list with '#text' raises TypeError.
    try:
        searchwords = program['searchword']['#text']
    except TypeError:
        searchwords = ','.join(sw['#text'] for sw in program['searchword'])
    cols.append(searchwords)

    rows.append(cols)

df = pd.DataFrame(rows, columns=['organization', 'description', 'level', 'keyword'])
</code></pre>
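The same per-program idea can also be sketched with ElementTree alone, without the intermediate dict. This is only an illustration under stated assumptions: the namespace URI is the one from the question, every <code>program</code> is a direct child of the root, and the names <code>NS</code> and <code>program_rows</code> are hypothetical helpers introduced here, not part of the answer above.

```python
import xml.etree.ElementTree as ET

# Assumption: the default namespace URI used in the question's document.
NS = {"p": "http://something.org/schema/s/program"}


def program_rows(xml_text):
    """Return one dict per <program>, joining repeated child elements."""
    root = ET.fromstring(xml_text)
    rows = []
    for program in root.findall("p:program", NS):
        rows.append({
            # findtext returns the default when the element is absent.
            "organization": program.findtext("p:orgUnitId", default="", namespaces=NS),
            "description": program.findtext("p:programDescriptionText", default="", namespaces=NS),
            # Repeated elements are collected per program, then joined.
            "level": ", ".join(el.text or "" for el in program.findall("p:requiredLevel", NS)),
            "keyword": ", ".join(el.text or "" for el in program.findall("p:searchword", NS)),
        })
    return rows
```

Because each row is built from a single <code>program</code> element, values can no longer drift between organizations. The result can be passed to <code>pd.DataFrame(program_rows(s), columns=['organization', 'description', 'level', 'keyword'])</code>, where <code>s</code> is the XML string.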
