Python XML parser issue

Posted 2024-05-17 03:21:11


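I am new to Python, so apologies if this is a silly question. I am trying to read an XML file into a Python object (preferably a pandas DataFrame). For now I just want to print the variables to check that they are read correctly in tabular form.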

I used xml.etree.ElementTree, but I may not be using it as intended.

Code:

import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
ODM = tree.getroot()

ns = {'xmlns': 'http://www.cdisc.org/ns/odm/v1.3',
      'mdsol': 'http://www.mdsol.com/ns/odm/metadata'}

for ClinicalData in ODM:
    LocationOID=None
    #print(ClinicalData.tag, ClinicalData.attrib)
    for SubjectData in ClinicalData:
        for SiteRef in SubjectData:
            LocationOID=SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(ClinicalData.attrib.get('MetaDataVersionOID'),
                     ClinicalData.attrib.get('AuditSubCategoryName'),       # null output due to namespace issue
                     SubjectData.attrib.get('SubjectKey'),
                     SubjectData.attrib.get('SubjectName'),                 # null output due to namespace issue
                     LocationOID,                                           #not sure what is the issue
                     StudyEventData.attrib.get('StudyEventRepeatKey'),
                     AuditRecord.find('DateTimeStamp')                      #not sure what is the issue
                    )

Input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" 
        xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" 
        CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">

    <ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
        <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
                <AuditRecord>
                    <UserRef UserOID="systemuser"/>
                    <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                    <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
                    <ReasonForChange>Update</ReasonForChange>
                    <SourceID>394263772</SourceID>
                </AuditRecord>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>

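I expect every printed variable to carry the value it has in the XML file. Please also let me know if there is a more appropriate way to do this than nesting so many loops.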


3 answers

I think you can parse the XML with BeautifulSoup:

from bs4 import BeautifulSoup

temp = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" 
        xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" 
        CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">

    <ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
        <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
                <AuditRecord>
                    <UserRef UserOID="systemuser"/>
                    <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                    <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
                    <ReasonForChange>Update</ReasonForChange>
                    <SourceID>394263772</SourceID>
                </AuditRecord>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>"""



temp=BeautifulSoup(temp,"lxml")
ClinicalData = temp.find('ClinicalData'.lower())
SubjectData = ClinicalData.find_all('SubjectData'.lower())
LocationOID=None
for i in SubjectData:
    SiteRef = i.find('SiteRef'.lower())
    LocationOID = SiteRef.attrs['locationoid']


print('LocationOID',LocationOID)

Output:

LocationOID 0ACCSP3MAPPING1SITE1
[Finished in 1.2s]

@Justin I applied your suggestion and it worked, until I broke it.

Input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
    <ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
        <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
                <FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
                    <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
                        <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
                            <AuditRecord>
                                <UserRef UserOID="alscrave2"/>
                                <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                                <ReasonForChange/>
                                <SourceID>122841525</SourceID>
                            </AuditRecord>
                            <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
                        </ItemData>
                    </ItemGroupData>
                </FormData>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
    <ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
        <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
                <FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
                    <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928809">
                        <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="46">
                            <AuditRecord>
                                <UserRef UserOID="alscrave2"/>
                                <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                                <ReasonForChange/>
                                <SourceID>122841525</SourceID>
                            </AuditRecord>
                            <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
                        </ItemData>
                    </ItemGroupData>
                </FormData>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>

Code:

import xml.etree.ElementTree as ET
import pandas as pd

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

tree = ET.parse("data.xml")
ODM = tree.getroot()

xmlns = "{http://www.cdisc.org/ns/odm/v1.3}"
mdsol = "{http://www.mdsol.com/ns/odm/metadata}"

def data_reader():
    dfcols = ['CreationDateTime','StudyOID','MetaDataVersionOID','SubjectName','SUBJECTUUID','LocationOID','StudyEventOID',
             'StudyEventRepeatKey','FormOID','FormRepeatKey','DataPageId','ItemgroupOID','RecordId','var_name','Value',
             'DateTimeStamp','ASC_Name','Measurement_Unit','SourceID','UserOID','InstanceId']
    df_xml = pd.DataFrame(columns=dfcols)

    CreationDateTime = ODM.attrib.get('CreationDateTime')

    for ClinicalData in ODM:
        StudyOID = ClinicalData.attrib.get('StudyOID')
        MetaDataVersionOID = ClinicalData.attrib.get('MetaDataVersionOID')
        ASC_Name = ClinicalData.attrib.get('{0}AuditSubCategoryName'.format(mdsol))
        for SubjectData in ClinicalData:
            SubjectName = SubjectData.attrib.get('{0}SubjectName'.format(mdsol))
            SUBJECTUUID = SubjectData.attrib.get('SubjectKey')
            LocationOID = SubjectData.find('{0}SiteRef'.format(xmlns)).attrib.get('LocationOID')
            for StudyEventData in SubjectData:
                StudyEventOID = StudyEventData.attrib.get('StudyEventOID')
                StudyEventRepeatKey = StudyEventData.attrib.get('StudyEventRepeatKey')
                InstanceId = StudyEventData.attrib.get('{0}InstanceId'.format(mdsol))
                for FormData in StudyEventData:
                    FormOID = FormData.attrib.get('FormOID')
                    FormRepeatKey = FormData.attrib.get('FormRepeatKey')
                    DataPageId = FormData.attrib.get('{0}DataPageId'.format(mdsol))
                    for ItemGroupData in FormData:
                        ItemgroupOID = ItemGroupData.attrib.get('ItemgroupOID')
                        RecordId = ItemGroupData.attrib.get('{0}RecordId'.format(mdsol))
                        for ItemData in ItemGroupData:
                            var_name = ItemData.attrib.get('ItemOID')
                            Value = ItemData.attrib.get('Value')
                            Measurement_Unit = ItemData.find('MeasurementUnitRef'.format(xmlns)).attrib.get('MeasurementUnitOID')
                            for AuditRecord in ItemData:
                                DateTimeStamp = AuditRecord.find('{0}DateTimeStamp'.format(xmlns)).text;
                                SourceID = AuditRecord.find('{0}SourceID'.format(xmlns)).text; 
                                UserOID = ItemData.find('{0}UserRef'.format(xmlns)).attrib.get('UserOID')
                                df_xml = df_xml.append(
                                pd.Series([CreationDateTime,StudyOID,MetaDataVersionOID,SubjectName,
                                           SUBJECTUUID,LocationOID,StudyEventOID,
                                           StudyEventRepeatKey,FormOID,FormRepeatKey,DataPageId,ItemgroupOID,
                                           RecordId,var_name,Value,DateTimeStamp,ASC_Name,Measurement_Unit,
                                           SourceID,UserOID,InstanceId], index=dfcols),
                                        ignore_index=True)

    print(df_xml)
data_reader()

Problem: I am getting duplicate records. The variables DateTimeStamp, SourceID, UserOID and Measurement_Unit raise runtime errors when they are assigned.
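A possible cause of the runtime errors, sketched below with the standard library only: `for AuditRecord in ItemData` also visits the `MeasurementUnitRef` child, the `MeasurementUnitRef` lookup has no `{0}` placeholder so the namespace is never applied, and `UserRef` is looked up on `ItemData` although it sits inside `AuditRecord`. The names `NS`, `MDSOL`, `SAMPLE`, `attr_of` and `read_rows` are illustrative assumptions, `SAMPLE` is a trimmed copy of the input above, and only a subset of the columns is collected.

```python
import xml.etree.ElementTree as ET

# Prefix map for find()/findall(); the "odm" prefix is an arbitrary local choice.
NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}
# Attribute lookups do not take a prefix map, so they use the '{uri}name' form.
MDSOL = "{http://www.mdsol.com/ns/odm/metadata}"

# Trimmed copy of the input above (one ClinicalData block, fewer attributes).
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
    <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectName="ACC-SUBJ-1">
      <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
      <StudyEventData StudyEventOID="FV">
        <FormData FormOID="VS">
          <ItemGroupData ItemGroupOID="VS">
            <ItemData ItemOID="VS.WT" Value="45">
              <AuditRecord>
                <UserRef UserOID="alscrave2"/>
                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                <SourceID>122841525</SourceID>
              </AuditRecord>
              <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
            </ItemData>
          </ItemGroupData>
        </FormData>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""


def attr_of(parent, path, name):
    """Return attribute `name` of the first element matching `path`, or None."""
    node = parent.find(path, NS)
    return node.get(name) if node is not None else None


def read_rows(root):
    """Collect one dict per AuditRecord, selecting children by tag name."""
    rows = []
    for cd in root.findall("odm:ClinicalData", NS):
        for sd in cd.findall("odm:SubjectData", NS):
            location = attr_of(sd, "odm:SiteRef", "LocationOID")
            for sed in sd.findall("odm:StudyEventData", NS):
                items = sed.findall("odm:FormData/odm:ItemGroupData/odm:ItemData", NS)
                for item in items:
                    unit = attr_of(item, "odm:MeasurementUnitRef", "MeasurementUnitOID")
                    # findall() by tag skips the MeasurementUnitRef sibling
                    for audit in item.findall("odm:AuditRecord", NS):
                        rows.append({
                            "StudyOID": cd.get("StudyOID"),
                            "ASC_Name": cd.get(MDSOL + "AuditSubCategoryName"),
                            "SubjectName": sd.get(MDSOL + "SubjectName"),
                            "LocationOID": location,
                            "StudyEventOID": sed.get("StudyEventOID"),
                            "var_name": item.get("ItemOID"),
                            "Value": item.get("Value"),
                            "Measurement_Unit": unit,
                            "DateTimeStamp": audit.findtext("odm:DateTimeStamp", namespaces=NS),
                            "SourceID": audit.findtext("odm:SourceID", namespaces=NS),
                            # UserRef is a child of AuditRecord, not of ItemData
                            "UserOID": attr_of(audit, "odm:UserRef", "UserOID"),
                        })
    return rows


rows = read_rows(ET.fromstring(SAMPLE))
print(len(rows), rows[0]["UserOID"], rows[0]["Measurement_Unit"])
```

If pandas is available, `pd.DataFrame(rows)` should build the table in one step, which also avoids `DataFrame.append` (removed in pandas 2.0). As for the duplicates, the two `ClinicalData` blocks in the input differ only in `RecordId` and `Value`, so the rows may look identical unless those columns are compared.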

Namespaces are the tricky part of using ElementTree. See this discussion.
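To see why the plain names return None, here is a minimal sketch using a cut-down, hypothetical one-element version of the input (`xml_text`, `root` and `clinical` are names chosen for this illustration). ElementTree stores every namespaced tag and attribute as `{namespace-uri}localname`, so a lookup has to use that form or pass a prefix map:

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the real file: one ClinicalData element only.
xml_text = (
    '<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" '
    'xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">'
    '<ClinicalData MetaDataVersionOID="1772" '
    'mdsol:AuditSubCategoryName="Activated"/>'
    '</ODM>'
)
root = ET.fromstring(xml_text)
clinical = root[0]

# The default namespace is folded into the tag name.
print(clinical.tag)  # {http://www.cdisc.org/ns/odm/v1.3}ClinicalData

# Unprefixed attributes keep their plain name; prefixed ones get the URI.
print(sorted(clinical.attrib))

# A bare tag name does not match; a prefix map (any prefix will do) does.
print(root.find("ClinicalData"))  # None
ns = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}
print(root.find("odm:ClinicalData", ns) is clinical)  # True
```

The same applies to `attrib.get`: `'AuditSubCategoryName'` alone is not a key, while the key with the `mdsol` URI in braces is.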

Short answer:

import xml.etree.ElementTree as ET

ODM = ET.parse("data.xml").getroot()

# ElementTree stores namespaced names as '{namespace-uri}localname'
ODM_NS = '{http://www.cdisc.org/ns/odm/v1.3}'
MDSOL_NS = '{http://www.mdsol.com/ns/odm/metadata}'

for ClinicalData in ODM:
    for SubjectData in ClinicalData:
        # find() needs the qualified tag, even for the default namespace
        SiteRef = SubjectData.find(ODM_NS + 'SiteRef')
        LocationOID = SiteRef.attrib.get('LocationOID')
        # select children by tag so that SiteRef is not visited again
        for StudyEventData in SubjectData.findall(ODM_NS + 'StudyEventData'):
            for AuditRecord in StudyEventData.findall(ODM_NS + 'AuditRecord'):
                print(
                    ClinicalData.attrib.get('MetaDataVersionOID'),
                    # prefixed attributes need the mdsol namespace URI
                    ClinicalData.attrib.get(MDSOL_NS + 'AuditSubCategoryName'),
                    SubjectData.attrib.get('SubjectKey'),
                    SubjectData.attrib.get(MDSOL_NS + 'SubjectName'),
                    LocationOID,
                    StudyEventData.attrib.get('StudyEventRepeatKey'),
                    # find() returns an Element; .text holds its content
                    AuditRecord.find(ODM_NS + 'DateTimeStamp').text,
                )
