How can I get the correct start/end position of an XML tag with SAX in Java?

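SAX provides a Locator that keeps track of the current position. However, when I call it in startElement(), it always returns the end position of the XML tag.

How can I get the start position of the tag? Is there an elegant way to solve this?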
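The behaviour described in the question can be reproduced with a small test program. The following is only a minimal sketch: the class name `LocatorDemo` and the helper `startPositions` are made up for illustration, and the exact column values depend on the SAX parser in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question.
class LocatorDemo {
    /** Parses xml and returns "qName@line:column" for every startElement event. */
    static List<String> startPositions(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Whatever the locator reports here is what the question is about.
                out.add(qName + "@" + locator.getLineNumber()
                        + ":" + locator.getColumnNumber());
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // In this input "<root>" spans columns 1-6 of line 1 and
        // "<child/>" spans columns 3-10 of line 2.
        System.out.println(startPositions("<root>\n  <child/>\n</root>"));
    }
}
```

If the printed columns lie just past the closing `>` of each tag rather than at its `<`, the parser reports end positions, as described above.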

3 answers

# Answer 1

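I finally figured out a solution. (Sorry, I was too lazy to post it earlier.) The characters(), endElement() and ignorableWhitespace() methods are crucial here: in each of them the locator points to a possible starting point of the next tag. The locator in characters() points to the nearest end of the non-tag content. The locator in endElement() points to the end of the previous tag, which is the starting point of the current tag if the two are directly adjacent. The locator in ignorableWhitespace() points to the end of a run of spaces and tabs. As long as we keep track of the end positions reported in these three methods, we can find the starting point of a tag, and the locator in endElement() already gives us its end position. In this way both the start and the end of a tag can be found.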

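import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class Example extends DefaultHandler{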
    private Locator locator;
    private SourcePosition startElePoint = new SourcePosition();
    
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }
    /**
    * <a> <- the locator points to here
    *   <b>
    * </a>
    */
    public void startElement(String uri, String localName, 
        String qName, Attributes attributes) {
        
    }
    /**
    * <a>
    *   <b>
    * </a> <- the locator points to here
    */
    public void endElement(String uri, String localName, String qName)  {
        /* here we can get our source position */
        SourcePosition tag_source_starting_position = this.startElePoint;
        SourcePosition tag_source_ending_position = 
            new SourcePosition(this.locator.getLineNumber(),
                this.locator.getColumnNumber());
                
        // do your things here
        
        //update the starting point for the next tag
        this.updateElePoint(this.locator);
    }
    
    /**
    * some other words <- the locator points to here
    * <a>
    *   <b>
    * </a>
    */
    public void characters(char[] ch, int start, int length) {
        this.updateElePoint(this.locator);//update the starting point
    }
    /**
    *the locator points to here-> <a>
    *                               <b>
    *                             </a>
    */
    public void ignorableWhitespace(char[] ch, int start, int length) {
        this.updateElePoint(this.locator);//update the starting point
    }
    private void updateElePoint(Locator lo){
        SourcePosition item = new SourcePosition(lo.getLineNumber(), lo.getColumnNumber());
        if(this.startElePoint.compareTo(item)<0){
            this.startElePoint = item;
        }
    }
    
    // No type parameter here: a parameter named SourcePosition would shadow the class itself.
    static class SourcePosition implements Comparable<SourcePosition>{
        private int line;
        private int column;
        public SourcePosition(){
            this.line = 1;
            this.column = 1;
        }
        public SourcePosition(int line, int col){
            this.line = line;
            this.column = col;
        }
        public int getLine(){
            return this.line;
        }
        public int getColumn(){
            return this.column;
        }
        public void setLine(int line){
            this.line = line;
        }
        public void setColumn(int col){
            this.column = col;
        }
        public int compareTo(SourcePosition o) {
            if(o.getLine() > this.getLine() || 
                (o.getLine() == this.getLine() 
                    && o.getColumn() > this.getColumn()) ){
                return -1;
            }else if(o.getLine() == this.getLine() && 
                o.getColumn() == this.getColumn()){
                return 0;
            }else{
                return 1;
            }
        }
    }
}
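The handler above reads the tracked position in endElement(). The same bookkeeping can also be consulted in startElement() to estimate where a start tag begins. The following is a rough, self-contained sketch under that assumption. `TagSpanHandler`, `Span` and `parse` are hypothetical names, comments and the XML declaration are not tracked, and the columns reported in characters() vary between parsers, so treat the start positions as estimates.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records an estimated start and the reported end of every start tag.
class TagSpanHandler extends DefaultHandler {
    /** Name of a start tag with its estimated start and reported end (line/column). */
    static final class Span {
        final String name;
        final int startLine, startCol, endLine, endCol;

        Span(String name, int startLine, int startCol, int endLine, int endCol) {
            this.name = name;
            this.startLine = startLine;
            this.startCol = startCol;
            this.endLine = endLine;
            this.endCol = endCol;
        }

        @Override
        public String toString() {
            return name + " [" + startLine + ":" + startCol
                    + " .. " + endLine + ":" + endCol + "]";
        }
    }

    final List<Span> spans = new ArrayList<>();
    private Locator locator;
    // Where the previous event ended; the next tag cannot start before this point.
    private int lastLine = 1;
    private int lastCol = 1;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    private void remember() {
        lastLine = locator.getLineNumber();
        lastCol = locator.getColumnNumber();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        spans.add(new Span(qName, lastLine, lastCol,
                locator.getLineNumber(), locator.getColumnNumber()));
        remember(); // a directly nested child tag starts where this tag ended
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        remember();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        remember();
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        remember();
    }

    /** Convenience driver: parses the given XML text and returns the recorded spans. */
    static List<Span> parse(String xml) throws Exception {
        TagSpanHandler handler = new TagSpanHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.spans;
    }
}
```

Calling `TagSpanHandler.parse("<a>\n  <b/>\n</a>")` and printing the returned list is a quick way to see how close the estimates are for a particular parser.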

# Answer 2

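Unfortunately, the Locator interface that the Java system library provides in the org.xml.sax package does not, by definition, offer more detailed information about the location in the document. Quoting the documentation of the getColumnNumber method (emphasis mine):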

The return value from the method is intended only as an approximation for the sake of diagnostics; it is not intended to provide sufficient information to edit the character content of the original XML document. For example, when lines contain combining character sequences, wide characters, surrogate pairs, or bi-directional text, the value may not correspond to the column in a text editor's display.

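According to that specification, you will always get the position "of the first character after the text associated with the document event", on a best-effort basis by the SAX driver. So the short answer to the first part of the question is: no, the Locator does not provide information about the start position of a tag. In addition, if you are dealing with multi-byte characters in your documents, such as Chinese or Japanese text, the position you get from the SAX driver is probably not what you want.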
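To illustrate why a "column" is ambiguous for such text, the sketch below counts the same prefix in three different units. `ColumnUnits` and `widths` are hypothetical names, and which unit a given SAX driver actually counts is not specified here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, only meant to show that "column" depends on the unit counted.
class ColumnUnits {
    /** Returns {UTF-16 code units, Unicode code points, UTF-8 bytes} of the text before a tag. */
    static int[] widths(String prefix) {
        return new int[] {
            prefix.length(),
            prefix.codePointCount(0, prefix.length()),
            prefix.getBytes(StandardCharsets.UTF_8).length
        };
    }

    public static void main(String[] args) {
        // Two CJK characters: 2 code units, 2 code points, 6 UTF-8 bytes.
        int[] cjk = widths("\u4E2D\u6587");
        // One emoji stored as a surrogate pair: 2 code units, 1 code point, 4 UTF-8 bytes.
        int[] emoji = widths("\uD83D\uDE00");
        System.out.println(cjk[0] + " " + cjk[1] + " " + cjk[2]);
        System.out.println(emoji[0] + " " + emoji[1] + " " + emoji[2]);
    }
}
```

A tag that follows such a prefix sits at a different "column" depending on whether bytes, chars or code points are counted, and a text editor may show yet another value for wide characters.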

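If you want exact positions for your tags, or even more fine-grained information about attributes, attribute content and so on, you have to implement your own location provider.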

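With all the potential encoding issues, Unicode characters and so on involved, I guess this is too big a project to post here, and the implementation will also depend on your specific requirements.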

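Just a quick warning from personal experience: writing a wrapper around the InputStream you pass to the SAX parser is dangerous, because you do not know when the SAX parser will report its events relative to what it has already read from the stream.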
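The read-ahead problem can be observed with a counting stream. This is a minimal sketch with made-up names (`ReadAheadDemo`, `CountingInputStream`, `bytesReadAtFirstElement`); how far a parser reads ahead depends on its internal buffering, so no particular number is claimed.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo of why byte counts from a stream wrapper do not match event positions.
class ReadAheadDemo {
    /** Counts every byte the parser pulls from the underlying stream. */
    static final class CountingInputStream extends FilterInputStream {
        long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /** Returns how many bytes had been consumed when the first startElement event fired. */
    static long bytesReadAtFirstElement(String xml) throws Exception {
        CountingInputStream in = new CountingInputStream(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        long[] seen = {-1};
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (seen[0] < 0) {
                    seen[0] = in.count; // bytes consumed so far, not the tag position
                }
            }
        });
        return seen[0];
    }

    public static void main(String[] args) throws Exception {
        // The root start tag ends after 6 bytes; a buffering parser may have read much more.
        System.out.println(bytesReadAtFirstElement("<root><a/><b/><c/></root>"));
    }
}
```

If the printed number is larger than 6, the parser had already buffered input beyond the tag whose event it was reporting, which is why a wrapper's byte count cannot serve as a tag position.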

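In addition to using the Locator information, you can start doing your own counting in the characters(char[], int, int) method of your ContentHandler by checking for line breaks, tabs and so on, which gives you a better picture of where in the document you actually are. By remembering the position of the last event, you can calculate the start position of the current one. Bear in mind, though, that you may not see all line breaks, because some can occur inside tags, which you do not see in characters; you can, however, deduce those from the Locator information.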
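The counting idea might look like the sketch below. `PositionCounter` and its methods are hypothetical names. The sketch assumes the parser delivers line breaks as a single `\n` (XML processors normalize line endings), counts a tab as one column, and leaves it to the caller to resynchronize from the Locator after markup that characters() never sees.

```java
// Hypothetical helper: keeps a running line/column while character chunks arrive.
class PositionCounter {
    private int line = 1;
    private int column = 1;

    /** Advances over a chunk exactly as it is passed to characters(char[], int, int). */
    void advance(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            if (ch[i] == '\n') {
                line++;
                column = 1;
            } else {
                column++; // tabs and wide characters count as one column here
            }
        }
    }

    /** Resynchronizes after markup, for example from Locator values in startElement(). */
    void sync(int line, int column) {
        this.line = line;
        this.column = column;
    }

    int getLine() {
        return line;
    }

    int getColumn() {
        return column;
    }

    public static void main(String[] args) {
        PositionCounter counter = new PositionCounter();
        char[] text = "ab\ncd".toCharArray();
        counter.advance(text, 0, text.length);
        System.out.println(counter.getLine() + ":" + counter.getColumn());
    }
}
```

In a ContentHandler, characters() would call `advance` with its own arguments, and startElement() and endElement() would call `sync` with the values from the Locator.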

# Answer 3

Which SAX parser are you using? I have been told that some do not provide a Locator.

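The output of the simple Python program below gives you the starting row and column number of every element in an XML file, for example, if you indent your XML by two spaces: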

Element: MyRootElem
starts at row 2 and column 0

Element: my_first_elem
starts at row 3 and column 2

Element: my_second_elem
starts at row 4 and column 4

Run it like this: python sax_parser_filename.py my_xml_file.xml

#!/usr/bin/env python3
import sys
from xml.sax import ContentHandler, make_parser
from xml.sax.xmlreader import Locator


class MySaxDocumentHandler(ContentHandler):
    """
    The document handler class serves to instantiate an event handler
    which acts on the various events coming from the parser.
    """

    def __init__(self):
        # Placeholder locator; the parser installs its own before parsing starts.
        self.setDocumentLocator(Locator())

    def startElement(self, name, attrs):
        print("Element: %s" % name)
        print("starts at row %s" % self._locator.getLineNumber(),
              "and column %s\n" % self._locator.getColumnNumber())

    def endElement(self, name):
        pass


def mysaxparser(inFileName):
    # create a handler
    handler = MySaxDocumentHandler()
    # create a parser
    parser = make_parser()
    # associate our content handler with the parser
    parser.setContentHandler(handler)
    # open in binary mode so the parser detects the encoding itself
    with open(inFileName, 'rb') as inFile:
        # start the parser
        parser.parse(inFile)


def main():
    mysaxparser(sys.argv[1])


if __name__ == '__main__':
    main()

Python中文网
