Converting a DATEX II XML file to a DataFrame with Python

Posted 2024-09-28 03:17:50


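For the past few days I have been trying to open and read an XML file in DATEX II format, so far without success. It contains traffic data from the NDW Open Data website (the Dutch database of road and traffic data), which is the source of the XML file. The head of the tree and the way it continues were shown in linked pictures; see also the snippet below, which is only a small part of the data.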

<?xml version="1.0"?>
<soapenv:Envelope xmlns:_0="http://datex2.eu/schema/2/2_0" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Header/>
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <exchange xmlns="http://datex2.eu/schema/2/2_0">
        <supplierIdentification>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </supplierIdentification>
      </exchange>
      <payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <publicationCreator>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </publicationCreator>
        <measurementSiteTableReference targetClass="MeasurementSiteTable" version="955" id="NDW01_MT" />
        <headerInformation>
          <confidentiality>noRestriction</confidentiality>
          <informationStatus>real</informationStatus>
        </headerInformation>
        <siteMeasurements>
          <measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00" />
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="2">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="3">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="4">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="5">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="1">
                  <speed>38</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="6">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="0">
                  <speed>-1</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="7">

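Ideally, I would like to load the information into a Jupyter notebook as a DataFrame with Python, so that I can run some predictive analysis if the data allow it. Drawing on many other threads, I tried ElementTree and lxml like this: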

[code block not preserved in the source page]

However, this only returns a single entry holding the first row: column name d2LogicalModel, row 0, entry None.

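In Microsoft Edge I can barely view the tree structure, and doing so takes a lot of CPU (Notepad++ with the XML Tools plugin also works, but crashes on "larger" files, i.e. above 20 MB). Even then, the structure is still hard for me to understand. There are so many layers that I do not know how to define xml2df() with the right children, grandchildren, and so on.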

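So my question comes down to two things. First, how can I identify the variables/columns that hold data? Below is an overview of the relevant data I would like to import. Second, how do I import it into a DataFrame?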

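Note: since DATEX II is the standard format for European traffic data, I hoped its guides would help (see the documents), but they do not make sense to me yet. Maybe they will to some of you :)

Thank you very much for your help!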


Tags: data, http, index, type, country, xmlns, xsi, soapenv
1 answer
User
#1 · Posted 2024-09-28 03:17:50

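Consider using XSLT, a language designed to transform XML files into other XML, HTML, or even text (CSV/TAB), to flatten the nested XML input into a simpler structure. The XSLT below converts the original XML into comma-separated values in tabular form, which can then be imported into pandas with read_csv():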

XSLT (save as an .xsl file, which is itself a special kind of XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                              xmlns:pub="http://datex2.eu/schema/2/2_0"
                              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/soapenv:Envelope">
    <xsl:text>publicationTime,country,nationalIdentifier,msmtSiteTableRef_targetClass,msmtSiteTableRef_version,msmtSiteTableRef_id,</xsl:text>
    <xsl:text>msmtSiteRef_targetClass,msmtSiteRef_version,msmtSiteRef_id,measurementTimeDefault,</xsl:text>
    <xsl:text>measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_numberOfInputValues,averageVehicleSpeed_value</xsl:text>
    <xsl:text>&#xa;</xsl:text>
    <xsl:apply-templates select="soapenv:Body"/>
  </xsl:template>

  <xsl:template match="soapenv:Body">
    <xsl:apply-templates select="d2LogicalModel"/>
  </xsl:template>

  <xsl:template match="d2LogicalModel">
    <xsl:apply-templates select="pub:payloadPublication"/>
  </xsl:template>

  <xsl:template match="pub:payloadPublication">
    <xsl:apply-templates select="pub:siteMeasurements"/>
  </xsl:template>

  <xsl:template match="pub:siteMeasurements">
    <xsl:apply-templates select="pub:measuredValue"/>
  </xsl:template>

  <xsl:template match="pub:measuredValue">
    <xsl:value-of select="concat(ancestor::pub:payloadPublication/pub:publicationTime,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:country,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:nationalIdentifier,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@targetClass,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@version,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@targetClass,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@version,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementTimeDefault,',',
                                 @index,',',
                                 pub:measuredValue/pub:basicData/@xsi:type,',',
                                 descendant::pub:vehicleFlowRate,',',
                                 descendant::pub:averageVehicleSpeed/@numberOfInputValuesUsed,',',
                                 descendant::pub:speed)"/><xsl:text>&#xa;</xsl:text>    
  </xsl:template>

</xsl:stylesheet>

Python

[code block not preserved in the source page]
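A minimal sketch of what this Python step could look like, assuming lxml and pandas are installed. The helper name datex_to_dataframe is hypothetical and not taken from the answer; it only uses lxml's XSLT class and pandas' read_csv(), and it reads the CSV text from memory rather than from an intermediate file.

```python
import io

import lxml.etree as et
import pandas as pd


def datex_to_dataframe(xml_source, xslt_source):
    """Flatten a DATEX II file with an XSLT stylesheet and load the result.

    Hypothetical helper: both arguments may be file paths or open
    file objects, since lxml's parse() accepts either.
    """
    doc = et.parse(xml_source)                   # nested DATEX II document
    transform = et.XSLT(et.parse(xslt_source))   # compile the stylesheet
    result = transform(doc)                      # plain text, because of method="text"
    # str(result) is the comma-separated text produced by the stylesheet.
    return pd.read_csv(io.StringIO(str(result)))
```

With the XML saved as Input.xml and the stylesheet as datex_to_csv.xsl (both file names are assumptions), the call would be df = datex_to_dataframe("Input.xml", "datex_to_csv.xsl").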

Output (parent-node values become repeated indicator columns alongside the varying numeric data)

print(df)

#             publicationTime country nationalIdentifier msmtSiteTableRef_targetClass  msmtSiteTableRef_version msmtSiteTableRef_id msmtSiteRef_targetClass  msmtSiteRef_version     msmtSiteRef_id measurementTimeDefault  measuredValue_index basicData_type  vehicleFlowRate  averageVehicleSpeed_numberOfInputValues  averageVehicleSpeed_value
# 0  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    1    TrafficFlow             60.0                                      NaN                        NaN
# 1  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    2    TrafficFlow              0.0                                      NaN                        NaN
# 2  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    3    TrafficFlow              0.0                                      NaN                        NaN
# 3  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    4    TrafficFlow             60.0                                      NaN                        NaN
# 4  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    5   TrafficSpeed              NaN                                      1.0                       38.0
# 5  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    6   TrafficSpeed              NaN                                      0.0                       -1.0
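If XSLT is not an option, the same flattening can be approximated with the standard library alone. This is a swapped-in technique, not part of the answer above: a rough sketch that walks the tree with xml.etree.ElementTree, using the namespace URIs from the snippet in the question. The helper names datex_rows and _text are hypothetical, and only a subset of the columns is collected.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the snippet in the question.
NS = {
    "d2": "http://datex2.eu/schema/2/2_0",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
XSI_TYPE = "{%s}type" % NS["xsi"]


def _text(parent, path):
    """Return the text of the first match for path, or None when absent."""
    node = parent.find(path, NS)
    return node.text if node is not None else None


def datex_rows(root):
    """Yield one dict per indexed measuredValue (hypothetical helper)."""
    for site in root.iter("{%s}siteMeasurements" % NS["d2"]):
        ref = site.find("d2:measurementSiteReference", NS)
        site_id = ref.get("id") if ref is not None else None
        time = _text(site, "d2:measurementTimeDefault")
        # Only the direct children of siteMeasurements carry the index attribute.
        for outer in site.findall("d2:measuredValue", NS):
            basic = outer.find("d2:measuredValue/d2:basicData", NS)
            if basic is None:
                continue
            yield {
                "msmtSiteRef_id": site_id,
                "measurementTimeDefault": time,
                "measuredValue_index": int(outer.get("index")),
                "basicData_type": basic.get(XSI_TYPE),
                # Values stay as strings (or None) and can be converted later.
                "vehicleFlowRate": _text(basic, "d2:vehicleFlow/d2:vehicleFlowRate"),
                "averageVehicleSpeed_value": _text(basic, "d2:averageVehicleSpeed/d2:speed"),
            }
```

The rows can then be handed to pandas, for example pd.DataFrame(datex_rows(ET.parse("Input.xml").getroot())), where Input.xml is an assumed file name. Unlike the XSLT route, the numeric columns arrive as strings here, so a pd.to_numeric() pass would be needed before analysis.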
