Parsing XML: find the target starting <Row> tag and ignore all <Row> tags above it

Posted 2024-10-03 21:32:34

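I am parsing an XML file into a pandas DataFrame. With the code below I can extract everything successfully, but only from an edited version of the full XML. The full XML has a block of summary data above the main data table (see the full XML here). The row I need to start extracting from is at line 641 of the XML.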

Sample XML:

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Styles> ### approx. 300 lines of styling ### </Styles>
<Worksheet ss:Name="MetasoftStudio">
  <Table>
    <Row/>
    <Row>
      <Cell ss:StyleID="HeadTableTitle" ss:MergeAcross="1"><Data ss:Type="String">CPET Results</Data></Cell>
    </Row>
    <Row/>
    <Row>
      <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Operator</Data></Cell>
      <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String"></Data></Cell>
    </Row>
    <Row/>
    <Row/>
    <Row>
      <Cell ss:StyleID="HeadTableTitle" ss:MergeAcross="1"><Data ss:Type="String">Patient data</Data></Cell>
    </Row>
    <Row/>
    <Row>
      <Cell ss:StyleID="HeadTableTitle" ss:MergeAcross="1"><Data ss:Type="String">Administrative Data</Data></Cell>
    </Row>
    <Row>
      <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">ID</Data></Cell>
      <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">B013</Data></Cell>
    </Row>
    <Row>
      <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Title</Data></Cell>
      <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String"></Data></Cell>
    </Row>
    <Row>
      <Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Last Name</Data></Cell>
      <Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Data</Data></Cell>
    </Row>

### Skipping few hundred lines ###

    <Row>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">Variable</Data></Cell>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">Unit</Data></Cell>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">Rest</Data></Cell>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">Warm Up</Data></Cell>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">AT</Data></Cell>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">AT % Max</Data></Cell>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">RCP</Data></Cell>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">RCP % Max</Data></Cell>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">V'O2peak</Data></Cell>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">Absolute Maximum Values</Data></Cell>
    </Row>
    <Row>
      <Cell ss:StyleID="SummaryTableParameters"><Data ss:Type="String">V'O2</Data></Cell>
      <Cell ss:StyleID="SummaryTableUnits"><Data ss:Type="String">L/min</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">0.34</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">-</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">1.83</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">76</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">2.28</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">94</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">2.42</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">2.59</Data></Cell>
    </Row>
### Skipping some more lines ###
    <Row>
      <Cell ss:StyleID="SummaryTableParameters"><Data ss:Type="String">Borg</Data></Cell>
      <Cell ss:StyleID="SummaryTableUnits"><Data ss:Type="String"></Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">-</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">-</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">-</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">-</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">-</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">-</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">-</Data></Cell>
      <Cell ss:StyleID="SummaryTableValues"><Data ss:Type="String">-</Data></Cell>
    </Row>
    <Row/>
    <Row/>
###### NEED TO START EXTRACTING FROM THIS ROW ######
    <Row>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">t</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">Phase</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">Marker</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">V'O2</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">V'CO2</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">V'E</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">V'E/V'O2</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">V'E/V'CO2</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">HR</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">RER</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">V'O2/kg</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">PetO2</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">PetCO2</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">ExCO2</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">BF</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">WR</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">Borg</Data></Cell>
    </Row>
### XML File continues the same from here with the same structure ###

My current code:

^{pr2}$

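How do I need to edit my current code so that it extracts only the data from the <Row> tags that belong to the main data table? The amount of summary data above it can vary, so hard-coding the starting point is not an option.

The <Cell> elements in the first <Row> of the main data carry the attribute ss:StyleID="MeasurementDataTableHead". Is it possible to search for that and start extracting from there? I am completely stumped on this.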


1 answer
User
#1 · Posted 2024-10-03 21:32:34

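If the table you want to parse is always the last one, you can create a helper boolean, set it to True once the value t (the first value of the table to parse) is found, and append rows only while it is set: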

from lxml import etree
import pandas as pd

# Pass the path directly so lxml reads the bytes and detects the encoding itself.
root = etree.parse('cortex_full.xml')

namespaces = {'o': 'urn:schemas-microsoft-com:office:office',
              'x': 'urn:schemas-microsoft-com:office:excel',
              'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

data = []
parse = False
ws = root.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
if len(ws) > 0:
    tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    if len(tables) > 0:
        rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
        for row in rows:
            temp = []
            cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
            for cell in cells:
                # 't' is the first header cell of the main data table;
                # from here on every cell is collected.
                if cell.text == 't':
                    parse = True
                if parse:
                    temp.append(cell.text)
            if parse:
                data.append(temp)

^{pr2}$
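The question also asks whether the search can key on the StyleID attribute instead of the cell text. The following is a minimal sketch of that idea and is not part of the answer above. It assumes the header row of the main table is the first row containing a cell with ss:StyleID="MeasurementDataTableHead". To stay self-contained it uses the standard-library xml.etree.ElementTree rather than lxml, and the SAMPLE string, the data values in it, and the function name extract_main_table are hypothetical stand-ins for illustration.

```python
import xml.etree.ElementTree as ET

SS = 'urn:schemas-microsoft-com:office:spreadsheet'
NS = {'ss': SS}
# Namespaced attributes are stored under their Clark-notation name.
STYLE_ATTR = f'{{{SS}}}StyleID'

# Hypothetical, heavily shortened stand-in for the real file.
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet ss:Name="MetasoftStudio">
  <Table>
    <Row/>
    <Row>
      <Cell ss:StyleID="SummaryTableHead"><Data ss:Type="String">Variable</Data></Cell>
    </Row>
    <Row/>
    <Row>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">t</Data></Cell>
      <Cell ss:StyleID="MeasurementDataTableHead"><Data ss:Type="String">HR</Data></Cell>
    </Row>
    <Row>
      <Cell><Data ss:Type="String">0:00:05</Data></Cell>
      <Cell><Data ss:Type="Number">71</Data></Cell>
    </Row>
  </Table>
</Worksheet>
</Workbook>"""


def extract_main_table(root, head_style='MeasurementDataTableHead'):
    """Return the rows (lists of cell text) from the header row onward."""
    data = []
    started = False
    for row in root.findall('./ss:Worksheet/ss:Table/ss:Row', NS):
        cells = row.findall('ss:Cell', NS)
        if not started:
            # Start at the first row that has a cell with the header style.
            started = any(c.get(STYLE_ATTR) == head_style for c in cells)
        # Assumption: empty <Row/> elements inside the table can be skipped.
        if started and cells:
            data.append([c.findtext('ss:Data', default='', namespaces=NS)
                         for c in cells])
    return data


table = extract_main_table(ET.fromstring(SAMPLE))
print(table)
```

For a real file, ET.parse('cortex_full.xml').getroot() would replace ET.fromstring(SAMPLE). The first list in the result holds the column names, so pd.DataFrame(table[1:], columns=table[0]) should give the DataFrame. With lxml, the header row itself can be selected with an XPath predicate such as ./ss:Row[ss:Cell/@ss:StyleID="MeasurementDataTableHead"].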
