Converting complex XML to CSV with Python or XSLT

Posted 2024-06-26 14:04:32


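Using Python or XSLT, I would like to know how to convert a highly complex, hierarchically nested XML file to CSV, including all sub-elements, while hard-coding as few element nodes as possible (ideally none), and whether this is reasonable or efficient at all.

Please see the simplified XML example and the output CSV below to get a better idea of what I am trying to achieve.

The actual XML file contains many more elements, but the data hierarchy and nesting are like those in the example. The <InvoiceRow> element and its sub-elements are the only repeating elements in the XML file. All other elements are static and should be repeated in the output CSV as many times as there are <InvoiceRow> elements in the XML file.

It is the repeating <InvoiceRow> elements that are causing me trouble. The non-repeating elements are easy to convert to CSV without hard-coding any element.

This is a complex XML scenario: a hierarchical data structure with multiple one-to-many relationships, all stored in a single XML file (a structured text file).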

Example XML input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Invoice>
    <SellerDetails>
        <Identifier>1234-1</Identifier>
        <SellerAddress>
            <SellerStreet>Street1</SellerStreet>
            <SellerTown>Town1</SellerTown>
        </SellerAddress>
    </SellerDetails>
    <BuyerDetails>
        <BuyerIdentifier>1234-2</BuyerIdentifier>
        <BuyerAddress>
            <BuyerStreet>Street2</BuyerStreet>
            <BuyerTown>Town2</BuyerTown>
        </BuyerAddress>
    </BuyerDetails>
    <BuyerNumber>001234</BuyerNumber>
    <InvoiceDetails>
        <InvoiceNumber>0001</InvoiceNumber>
    </InvoiceDetails>
    <InvoiceRow>
        <ArticleName>Article1</ArticleName>
        <RowText>Product Text1</RowText>
        <RowText>Product Text2</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
    </InvoiceRow>
    <InvoiceRow>
        <ArticleName>Article2</ArticleName>
        <RowText>Product Text11</RowText>
        <RowText>Product Text22</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
    </InvoiceRow>
    <InvoiceRow>
        <ArticleName>Article3</ArticleName>
        <RowText>Product Text111</RowText>
        <RowText>Product Text222</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
    </InvoiceRow>
    <EpiDetails>
        <EpiPartyDetails>
            <EpiBfiPartyDetails>
                <EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
            </EpiBfiPartyDetails>
        </EpiPartyDetails>
    </EpiDetails>
    <InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>


Example CSV output:

Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article1,Product Text1,Product Text2,10,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article2,Product Text11,Product Text22,20,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article3,Product Text111,Product Text222,30,XXXXX,Some text

2 Answers

I have handled a case similar to yours. I built a package on top of untangle that parses XML into plain Python objects, for example:

<?xml version="1.0"?>
<root>
    <child name="child1"/>
</root>

obj.root.child['name'] # u'child1'

You can then easily write some code that traverses the object to get what you need. For example, you could do something like get_items_by_tag(InvoiceRow). Hope this helps.
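The traversal idea above can be sketched with the standard library alone. This is only a minimal sketch: it uses xml.etree.ElementTree instead of the untangle-based package (whose API is not shown in the answer), get_items_by_tag is a hypothetical helper named after the call mentioned above, and SAMPLE is a shortened version of the invoice from the question.

```python
import xml.etree.ElementTree as ET

# Shortened version of the question's invoice (assumption: same nesting).
SAMPLE = """<Invoice>
    <BuyerNumber>001234</BuyerNumber>
    <InvoiceRow>
        <ArticleName>Article1</ArticleName>
        <RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
    </InvoiceRow>
    <InvoiceRow>
        <ArticleName>Article2</ArticleName>
        <RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
    </InvoiceRow>
</Invoice>"""


def get_items_by_tag(root, tag):
    """Hypothetical helper: return every element named `tag`, at any depth."""
    return list(root.iter(tag))


root = ET.fromstring(SAMPLE)
rows = get_items_by_tag(root, "InvoiceRow")

# Each row becomes a list of (tag, text) pairs for its direct children,
# so no child element name has to be hard-coded.
row_fields = [[(child.tag, child.text) for child in row] for row in rows]
```

Only the name of the repeating element is passed in explicitly. The child names are read from the document itself.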

Consider the following example:

XML

<Invoice>
    <SellerDetails>
        <Identifier>1234-1</Identifier>
        <SellerAddress>
            <SellerStreet>Street1</SellerStreet>
            <SellerTown>Town1</SellerTown>
        </SellerAddress>
    </SellerDetails>
    <BuyerDetails>
        <BuyerIdentifier>1234-2</BuyerIdentifier>
        <BuyerAddress>
            <BuyerStreet>Street2</BuyerStreet>
            <BuyerTown>Town2</BuyerTown>
        </BuyerAddress>
    </BuyerDetails>
    <BuyerNumber>001234</BuyerNumber>
    <InvoiceDetails>
        <InvoiceNumber>0001</InvoiceNumber>
    </InvoiceDetails>
    <InvoiceRow>
        <ArticleName>Article1</ArticleName>
        <RowText>Product Text1</RowText>
        <RowText>Product Text2</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
    </InvoiceRow>
    <InvoiceRow>
        <ArticleName>Article2</ArticleName>
        <RowText>Product Text11</RowText>
        <RowText>Product Text22</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
    </InvoiceRow>
    <InvoiceRow>
        <ArticleName>Article3</ArticleName>
        <RowText>Product Text111</RowText>
        <RowText>Product Text222</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
    </InvoiceRow>
    <EpiDetails>
        <EpiPartyDetails>
            <EpiBfiPartyDetails>
                <EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
            </EpiBfiPartyDetails>
        </EpiPartyDetails>
    </EpiDetails>
    <InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>

XSLT 1.0

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="Invoice">
    <xsl:variable name="common-head">
        <xsl:value-of select="SellerDetails/Identifier"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="BuyerDetails/BuyerIdentifier"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="InvoiceDetails/InvoiceNumber"/>
        <xsl:text>,</xsl:text>
        <!-- add more here -->
    </xsl:variable>
    <xsl:variable name="common-tail">
        <xsl:value-of select="EpiDetails/EpiPartyDetails/EpiBfiPartyDetails/EpiBfiIdentifier"/>
        <xsl:text>,</xsl:text>
        <!-- add more here -->
        <xsl:value-of select="InvoiceUrlText"/>
    </xsl:variable>
    <!-- header -->
    <xsl:text>SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText&#10;</xsl:text>
    <!-- data -->
    <xsl:for-each select="InvoiceRow">
        <xsl:copy-of select="$common-head"/>
        <xsl:value-of select="ArticleName"/>
        <xsl:text>,</xsl:text>  
        <xsl:value-of select="RowAmount"/>
        <xsl:text>,</xsl:text>  
        <!-- add more here -->
        <xsl:copy-of select="$common-tail"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

Result

SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,1234-2,0001,Article1,10.00,XXXXX,Some text
1234-1,1234-2,0001,Article2,20.00,XXXXX,Some text
1234-1,1234-2,0001,Article3,30.00,XXXXX,Some text

Note that the header already names the two RowText columns, while the data rows do not contain them: the stylesheet outputs RowText only once the matching xsl:value-of instructions are added at the "add more here" marker inside xsl:for-each.

Added in response to the following:

Is there a way in XSLT to get the same results using a loop? For example, loop through and output all the elements and sub-elements except the InvoiceRow elements, and then vice versa?

If you prefer, you can try it this way:

XSLT 1.0

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="Invoice">
    <xsl:variable name="invoice-fields" select="//*[not(*) and not(ancestor::InvoiceRow)]" />
    <xsl:variable name="common-data">
        <xsl:for-each select="$invoice-fields">
            <xsl:value-of select="."/>
            <xsl:text>,</xsl:text>  
        </xsl:for-each> 
    </xsl:variable>
    <!-- header -->
    <xsl:for-each select="$invoice-fields">
        <xsl:value-of select="name()"/>
        <xsl:text>,</xsl:text>  
    </xsl:for-each>
    <xsl:for-each select="InvoiceRow[1]/*">
        <xsl:value-of select="name()"/>
        <xsl:if test="position()!=last()">,</xsl:if>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
    <!-- data -->
    <xsl:for-each select="InvoiceRow">
        <xsl:copy-of select="$common-data"/>
        <xsl:for-each select="*">
            <xsl:value-of select="."/>
            <xsl:if test="position()!=last()">,</xsl:if>
        </xsl:for-each> 
        <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

The result is:

Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,EpiBfiIdentifier,InvoiceUrlText,ArticleName,RowText,RowText,RowAmount
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article1,Product Text1,Product Text2,10.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article2,Product Text11,Product Text22,20.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article3,Product Text111,Product Text222,30.00

That is, all invoice-level fields are listed before the row fields.
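
Since the question also asks about Python, the same generic idea can be sketched there: collect every leaf element outside InvoiceRow as a common field, then emit one line per InvoiceRow. This is a minimal sketch rather than a port of the stylesheet. The function names leaf_elements and invoice_to_csv and the shortened SAMPLE document are assumptions made for illustration, and the csv module is used so that values containing commas are quoted, which the plain-text XSLT output does not do.

```python
import csv
import io
import xml.etree.ElementTree as ET


def leaf_elements(element, skip_tag):
    """Yield childless descendants in document order, skipping `skip_tag` subtrees."""
    for child in element:
        if child.tag == skip_tag:
            continue
        if len(child) == 0:
            yield child
        else:
            yield from leaf_elements(child, skip_tag)


def invoice_to_csv(xml_text, row_tag="InvoiceRow"):
    """Return CSV text with one line per `row_tag` element, common fields first."""
    root = ET.fromstring(xml_text)
    common = list(leaf_elements(root, row_tag))  # non-repeating fields
    rows = root.findall(row_tag)                 # repeating elements under the root

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Header: common field names, then the child names of the first row.
    first_row = rows[0] if rows else []
    writer.writerow([el.tag for el in common] + [el.tag for el in first_row])
    for row in rows:
        writer.writerow(
            ["".join(el.itertext()) for el in common]
            + ["".join(el.itertext()) for el in row]
        )
    return out.getvalue()


# Shortened sample (assumption); one value contains a comma to show quoting.
SAMPLE = """<Invoice>
    <SellerDetails>
        <Identifier>1234-1</Identifier>
    </SellerDetails>
    <InvoiceRow>
        <ArticleName>Article1</ArticleName>
        <RowText>Text, with comma</RowText>
    </InvoiceRow>
    <InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>"""

print(invoice_to_csv(SAMPLE), end="")
```

As with the stylesheet, the header is taken from the first InvoiceRow, so this assumes every row has the same children in the same order.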
