从嵌套XML数据的子集创建数据帧?

2024-06-29 01:13:28 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一大堆大型XML文件,其中包含数千条如下所示的记录:

XML示例:

<Report:Report xmlns:Report ="http://someplace.com">
 <Id root="1234567890"/>
 <Records value="10"/>
 <ReportDate>2020-06-20</ReportDate>
 <Record>
  <Id root="001"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="12345"/>
     <Status code="1"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="A"/>
     <Secondary code="B"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="002"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="789AB"/>
     <Status code="2"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Secondary code="D"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="003"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="CDEFG"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="E"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
</Report:Report>

原始文件在每个记录元素下有数百个不同深度的子元素,因此我在这里对其进行了一些简化,同时仍然保留了核心问题。我的目标是将XML读入pandas数据框架,这样我就可以处理如下内容:

Record Id | Number | Status | Primary | Secondary
-------------------------------------------------
001       | 12345  | 1      | A       | B
-------------------------------------------------
002       | 789AB  | 2      |         | D
-------------------------------------------------
003       | CDEFG  |        | E       | 

正如您所看到的,大多数数据都有五个级别,并且不是每个元素都存在于每个记录中,但是我需要能够处理缺少的元素,如上表所示

我已经开始玩lxml,但我真的不知道我在做什么!我知道我可以(非常笨拙地)通过遍历树来提取属性或文本,如下所示:

from lxml import etree as et
xtree = et.parse('file1.xml')
xroot = xtree.getroot()

for n in xroot.iter('Primary'):
    print(n.attrib['code'])

但是。。。在这之后,我已经失去了动力。我不确定如何继续并构造代码,以便确保任何转换的数据实际上都与它的原始记录对应

任何善良的灵魂都能指引我走出黑暗的XML山谷,走向阳光普照的熊猫山吗

任何帮助都将不胜感激


Tags: namereportid元素status记录sitecode
2条回答

我通常的方法是使用xmlplain,然后使用json_normalize

so.xml只是保存到文件中的示例xml

import pandas as pd
import xmlplain
from collections import OrderedDict 

with open("so.xml") as f: js = xmlplain.xml_to_obj(f, strip_space=True, fold_dict=True)
df = pd.json_normalize(js['Report:Report'])
# work out columns that are info that do not form records
rootcols = [k for r in js['Report:Report'] for k in r.keys() for v in [r[k]] if not isinstance(v, OrderedDict)]
rootcols = [c for c in df.columns if c.split(".")[0] in rootcols]
# fill the columns that are "info" columns"
df.loc[:,rootcols] = df.loc[:,rootcols].fillna(method="ffill").fillna(method="bfill")
# drop rows that don't hold records
df = (df.dropna(how="all", subset=[c for c in df.columns if c not in rootcols])
 .reset_index(drop=True)
 # cleanup column names
 .rename(columns={c:c.replace("Record.Site.SiteData.","") for c in df.columns})
)

print(df.to_string(index=False))

输出

        @xmlns:Report    Id.@root Records.@value  ReportDate Record.Id.@root SiteDataInfo1.Name.@code SiteDataInfo1.Status.@code SiteDataInfo2.Primary.@code SiteDataInfo2.Secondary.@code
 http://someplace.com  1234567890             10  2020-06-20             001                    12345                          1                           A                             B
 http://someplace.com  1234567890             10  2020-06-20             002                    789AB                          2                         NaN                             D
 http://someplace.com  1234567890             10  2020-06-20             003                    CDEFG                        NaN                           E                           NaN

见下文

import xml.etree.ElementTree as ET

xml = '''<Report:Report xmlns:Report ="http://someplace.com">
 <Id root="1234567890"/>
 <Records value="10"/>
 <ReportDate>2020-06-20</ReportDate>
 <Record>
  <Id root="001"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="12345"/>
     <Status code="1"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="A"/>
     <Secondary code="B"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="002"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="789AB"/>
     <Status code="2"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Secondary code="D"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
 <Record>
  <Id root="003"/>
  <Site>
   <SiteData>
    <SiteDataInfo1>
     <Name code="CDEFG"/>
    </SiteDataInfo1>
    <SiteDataInfo2>
     <Primary code="E"/>
    </SiteDataInfo2>
   </SiteData>
  </Site>
 </Record>
</Report:Report>'''

data = []
root = ET.fromstring(xml)
records = root.findall('.//Record')
for record in records:
  entry = {'id': record.find('./Id').attrib['root']}
  entry['Number'] = record.find('./Site/SiteData/SiteDataInfo1/Name').attrib['code']
  status = record.find('./Site/SiteData/SiteDataInfo1/Status')
  entry['Status'] = status.attrib['code'] if status is not None else ''
  primary = record.find('.//Primary')
  entry['Primary'] = primary.attrib['code'] if primary is not None else ''
  secondary = record.find('.//Secondary')
  entry['Secondary'] = secondary.attrib['code'] if secondary is not None else ''
  data.append(entry)

for entry in data:
  print(entry)

相关问题 更多 >