Python SAX returns only partial data

Posted 2024-09-29 02:27:45

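Please help. I am trying to parse a large XML file and transfer the data into a CSV file. I keep losing a lot of the data between the tags, and I don't know why.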

Here is some of the XML:

<testcase internalid="1256092" name="hls_vtt_single_default_diable_vtt">
    <node_order><![CDATA[7]]></node_order>
    <externalid><![CDATA[6121]]></externalid>
    <version><![CDATA[2]]></version>
    <summary><![CDATA[<p>condition: single subtitle track is available in stream and it is default  &nbsp;set the vtt track to diable status before playing stream.</p>
<p>&nbsp;</p>
<div>play stream  no subtitle is rendered along with A/V<span class="Apple-tab-span" style="white-space:pre">   </span></div>
<div>&nbsp;</div>]]></summary>
    <preconditions><![CDATA[]]></preconditions>
    <execution_type><![CDATA[1]]></execution_type>
    <importance><![CDATA[2]]></importance>
</testcase>

Here is my Python code:

^{pr2}$

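The text inside the summary tags is what does not come back correctly. The externalid and version data behave well, but what I get back from the summary element is the div markup instead of the text.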

I need it to return:

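"condition: single subtitle track is available in stream and it is default set the vtt track to diable status before playing stream. play stream no subtitle is rendered along with A/V"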


1 answer
User
Answer 1 · posted 2024-09-29 02:27:45

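As a related answer shows, a SAX parser may deliver the text of one element through several calls to characters(), so you should append each piece to the stored value with += content rather than overwrite it. To then remove the HTML markup inside the parsed CDATA (including line breaks and &nbsp; entities), consider regex substitutions: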

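import re
import xml.sax

# NOTE: outfile is assumed to be the already-open output file object
# from the question's script; it is not defined in this snippet.

class CaseHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.externalid = ""
        self.version = ""
        self.summary = ""

    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "testcase":
            # Reset the buffers so values do not leak between test cases.
            self.externalid = self.version = self.summary = ""
            name = attributes["name"]
            outfile.write("\r" + name + " ,")

    def endElement(self, tag):
        # Compare against the closing tag itself, not the last opened one.
        if tag == "externalid":
            outfile.write("OTV52-" + self.externalid + ",")
        elif tag == "version":
            outfile.write("Version:  " + self.version + ",")
        elif tag == "summary":
            # Strip HTML tags first, then newlines and &nbsp; entities.
            # (The original pattern also had "/\s\s/", which in Python
            # only matches literal slashes, so it is dropped here.)
            self.summary = re.sub(r"<[^>]+>", "", self.summary)
            self.summary = re.sub(r"\n|&nbsp;", "", self.summary).strip()
            outfile.write("Summary:  " + self.summary + ",")
        # Ignore whitespace between elements until the next tag opens.
        self.CurrentData = ""

    def characters(self, content):
        # May be called several times per element, so always append.
        if self.CurrentData == "externalid":
            self.externalid += content
        elif self.CurrentData == "version":
            self.version += content
        elif self.CurrentData == "summary":
            self.summary += content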
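
To see why appending matters, here is a minimal, self-contained sketch that records every chunk characters() receives for a summary element. The names ChunkCollector and SAMPLE are made up for this illustration and are not part of the question's code. How many chunks arrive depends on the parser, so the sketch prints the count rather than assuming it.

```python
import xml.sax

# Hypothetical, shortened sample in the same shape as the question's XML.
SAMPLE = b"""<testcase internalid="1" name="demo">
    <summary><![CDATA[<p>line one</p>
<p>line two</p>]]></summary>
</testcase>"""


class ChunkCollector(xml.sax.ContentHandler):
    """Collects each piece of text delivered inside <summary>."""

    def __init__(self):
        super().__init__()
        self.in_summary = False
        self.chunks = []

    def startElement(self, tag, attributes):
        if tag == "summary":
            self.in_summary = True

    def endElement(self, tag):
        if tag == "summary":
            self.in_summary = False

    def characters(self, content):
        # Keep every chunk; overwriting here would lose earlier pieces.
        if self.in_summary:
            self.chunks.append(content)


collector = ChunkCollector()
xml.sax.parseString(SAMPLE, collector)
summary = "".join(collector.chunks)
print(len(collector.chunks), repr(summary))
```

If the handler assigned content instead of appending it, only the last chunk would survive, which would match the symptom in the question where just a fragment of the summary came back.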

Output of CaseHandler (all on one line):

^{pr2}$
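
One caveat with hand-built rows: the cleaned summary can itself contain commas, which would split it across CSV columns. A possible variation, sketched below under the assumption that the fields of one test case are gathered first and then written with the standard csv module, lets the writer handle quoting. The helper clean_summary and the sample values are hypothetical; it also replaces tags with a space so that words on either side of a tag do not run together.

```python
import csv
import html
import io
import re


def clean_summary(raw):
    """Strip HTML tags, decode entities, and collapse whitespace runs."""
    text = re.sub(r"<[^>]+>", " ", raw)  # a space keeps neighbouring words apart
    text = html.unescape(text)           # e.g. &nbsp; becomes U+00A0
    # \s also matches the non-breaking space, so one pass collapses everything.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw summary containing a comma, an entity, and a newline.
raw = "<p>condition: a, b &nbsp;set the track.</p>\n<div>play stream</div>"

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer)
writer.writerow(["demo", "OTV52-6121", "Version: 2", "Summary: " + clean_summary(raw)])
print(buffer.getvalue())
```

With the default csv.writer settings, any field containing a comma is wrapped in double quotes, so the summary stays in a single column when the file is opened in a spreadsheet.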
