Parsing an XML file into a pandas DataFrame

Published 2024-09-30 12:11:30


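I have the XML structure below and am trying to convert the XML data into a structured DataFrame. I have read many Stack Overflow posts about XML conversion with both xml.etree.ElementTree and BeautifulSoup, but none of them handles a case like this one, where I want not just the tags, the attributes, or the text, but all of them.

For example, I would like to get columns such as the following from the XML below:

abr_record_last_updated_date, abr_replaced, abn_status, abn_status_from_date, abn

Note that abn above is the element's actual text; I just don't know how to collect all of these together.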

<?xml version="1.0"?><Transfer error="none" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BulkExtract.xsd"><TransferInfo><FileSequenceNumber>1</FileSequenceNumber><RecordCount>714100</RecordCount><ExtractTime>2019-06-19T12:22:15</ExtractTime></TransferInfo>


<ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN><EntityType><EntityTypeInd>PUB</EntityTypeInd><EntityTypeText>Australian Public Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2000</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000000948</ASICNumber><GST status="ACT" GSTStatusFromDate="20000701" /><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LIMITED</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>



<ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>TOOHEYS PTY LIMITED</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2141</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000002568</ASICNumber></ABR>



</Transfer>

I started with root.iter, but I don't know how to use this logic to get all the columns I want.

import xml.etree.ElementTree as et
root = et.parse('sample.xml').getroot()

dict_new = {}

for each in root.iter('ABN'):

    #abr_last_updated_date = 
    print(each.tag)
    print(each.attrib)
    print(each.items())
    print(each.text)

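Ultimately, if someone could show how to iterate over each XML "block" (I'm not sure of the correct term) and get the first few columns, I'm confident I can work out the rest.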


2 answers

With BeautifulSoup you can get all of the items:

  • Tag
  • Tag text
  • Attribute name
  • Attribute value
    from bs4 import BeautifulSoup

    data='''<?xml version="1.0"?><Transfer error="none" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BulkExtract.xsd"><TransferInfo><FileSequenceNumber>1</FileSequenceNumber><RecordCount>714100</RecordCount><ExtractTime>2019-06-19T12:22:15</ExtractTime></TransferInfo>


    <ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN><EntityType><EntityTypeInd>PUB</EntityTypeInd><EntityTypeText>Australian Public Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2000</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000000948</ASICNumber><GST status="ACT" GSTStatusFromDate="20000701" /><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LIMITED</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>



    <ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>TOOHEYS PTY LIMITED</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2141</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000002568</ASICNumber></ABR>



    </Transfer>'''

    soup=BeautifulSoup(data,'lxml')
    for tag in soup.select('ABN'):
        print("Tag:" + str(tag))
        print("Tag Text " + tag.text)
        for attr in tag.attrs:
            print("Attribute name : "+ attr)
            print("Attribute value : " + tag[attr]) 

Output printed to the console:

Tag:<abn abnstatusfromdate="19991101" status="ACT">11000000948</abn>
Tag Text 11000000948
Attribute name : abnstatusfromdate
Attribute value : 19991101
Attribute name : status
Attribute value : ACT
Tag:<abn abnstatusfromdate="20190501" status="CAN">11000002568</abn>
Tag Text 11000002568
Attribute name : abnstatusfromdate
Attribute value : 20190501
Attribute name : status
Attribute value : CAN
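
The same idea (tag, text, attribute names, and attribute values) can be applied to a whole ABR record at once. What follows is only a minimal sketch using the standard-library xml.etree.ElementTree that the question started with, not the method of the answer above. The helper name `flatten`, the underscore-joined column naming, and the abbreviated `SAMPLE` string are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the question's sample (one ABR record, fewer children).
SAMPLE = """<?xml version="1.0"?>
<Transfer error="none">
<ABR recordLastUpdatedDate="20180216" replaced="N">
<ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN>
<EntityType><EntityTypeInd>PUB</EntityTypeInd></EntityType>
<GST status="ACT" GSTStatusFromDate="20000701" />
</ABR>
</Transfer>"""


def flatten(element, prefix=""):
    """Hypothetical helper: map an element and its descendants to {column: value}."""
    path = f"{prefix}_{element.tag}" if prefix else element.tag
    row = {}
    # Each attribute becomes a "<path>_<attribute name>" column.
    for name, value in element.attrib.items():
        row[f"{path}_{name}"] = value
    # Non-blank element text becomes a column named after the path itself.
    text = (element.text or "").strip()
    if text:
        row[path] = text
    # Recurse into children; if a path repeats, the later value overwrites the earlier one.
    for child in element:
        row.update(flatten(child, path))
    return row


root = ET.fromstring(SAMPLE)
rows = [flatten(abr) for abr in root.iter("ABR")]  # one dict per ABR record
print(rows[0])
```

Each dict in `rows` is one record, so, assuming pandas is installed, `pandas.DataFrame(rows)` would turn the list into a DataFrame with one column per distinct path. Records that lack an optional element (such as GST) would simply get missing values in those columns.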

Even though this is an XML file, you can use BeautifulSoup with CSS selectors and the text attribute:

data = '''<?xml version="1.0"?><Transfer error="none" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BulkExtract.xsd"><TransferInfo><FileSequenceNumber>1</FileSequenceNumber><RecordCount>714100</RecordCount><ExtractTime>2019-06-19T12:22:15</ExtractTime></TransferInfo>


<ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN><EntityType><EntityTypeInd>PUB</EntityTypeInd><EntityTypeText>Australian Public Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2000</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000000948</ASICNumber><GST status="ACT" GSTStatusFromDate="20000701" /><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LIMITED</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>



<ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>TOOHEYS PTY LIMITED</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2141</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000002568</ASICNumber></ABR>



</Transfer>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'xml')

z = zip(soup.select('ABR[recordLastUpdatedDate]'),
    soup.select('ABR[replaced]'),
    soup.select('ABN[status]'),
    soup.select('ABN[ABNStatusFromDate]'),
    soup.select('ABN'))

for (c1, c2, c3, c4, c5) in z:
    print(c1['recordLastUpdatedDate'], c2['replaced'], c3['status'], c4['ABNStatusFromDate'], c5.text.strip())

Prints:

20180216 N ACT 19991101 11000000948
20190531 N CAN 20190501 11000002568
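
One caveat with zipping separate selections: the lists are paired purely by position, so the rows only line up if every ABR contains exactly one ABN. A per-record loop avoids that assumption. The sketch below uses the standard-library xml.etree.ElementTree rather than BeautifulSoup. The function name `abr_to_row`, the snake_case column names, and the abbreviated `SAMPLE` are illustrative assumptions, not part of the original answers.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the question's two ABR records (only the ABN child kept).
SAMPLE = """<?xml version="1.0"?>
<Transfer error="none">
<ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN></ABR>
<ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN></ABR>
</Transfer>"""


def abr_to_row(abr):
    """Hypothetical helper: build one row (dict) from a single ABR element."""
    row = {
        "abr_record_last_updated_date": abr.get("recordLastUpdatedDate"),
        "abr_replaced": abr.get("replaced"),
        "abn_status": None,
        "abn_status_from_date": None,
        "abn": None,
    }
    abn = abr.find("ABN")  # direct child lookup; None if the record has no ABN
    if abn is not None:
        row["abn_status"] = abn.get("status")
        row["abn_status_from_date"] = abn.get("ABNStatusFromDate")
        row["abn"] = (abn.text or "").strip()
    return row


root = ET.fromstring(SAMPLE)
rows = [abr_to_row(abr) for abr in root.iter("ABR")]
for row in rows:
    print(row)

# Assuming pandas is installed, the rows can then be loaded with:
# df = pandas.DataFrame(rows)
```

Because each row is built from a single ABR element, a record with a missing ABN yields `None` values in its own row instead of shifting the values of later records.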
