Finding duplicate entity declarations in XML using Java

Some entities are declared in an XML file.

For example:

<?xml version="1.0" encoding="utf-8"?>
<!--Arbortext, Inc., 1988-2004, v.4002-->
<!DOCTYPE test PUBLIC "-//Atul//DTD ATM - TEST//EN//-"
 "test.dtd" [
<!ENTITY ent1 SYSTEM "Graphic/test1.txt" NDATA ccitt4>
<!ENTITY ent1 SYSTEM "Graphic/test1.txt" NDATA ccitt4>
<!ENTITY ent2 SYSTEM "Graphic/test2.txt" NDATA ccitt4>
<!ENTITY ent3 SYSTEM "Graphic/test4.txt" NDATA ccitt4>
]>
<test  id="01" >
</test>

I need to detect that ent1 is declared more than once.

Currently we are using the getEntities method:

  NamedNodeMap entities = lJDocumentXML.getDoctype().getEntities();

http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/DocumentType.html#getEntities()

This does not return duplicate entities (only ent1, ent2, and ent3 come back) or external entities (if the referenced DTD has any).
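
The behaviour described above can be reproduced with a small sketch. It assumes the JDK's bundled DOM parser and uses a made-up inline document with no external DTD; the class name `GetEntitiesDemo` and the method `countEntitiesSeenByDom` are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

final class GetEntitiesDemo {

    // Parses the XML text and returns how many entities DocumentType.getEntities() exposes.
    static int countEntitiesSeenByDom(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NamedNodeMap entities = doc.getDoctype().getEntities();
            return entities.getLength();
        } catch (Exception e) {
            // Wrapped so callers do not have to deal with checked parser exceptions.
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: "bar" is declared twice in the internal subset.
        String xml = "<!DOCTYPE ex [\n"
                + "<!ENTITY foo \"foo\">\n"
                + "<!ENTITY bar \"bar\">\n"
                + "<!ENTITY bar \"bar2\">\n"
                + "]>\n"
                + "<ex/>";
        // Three declarations are written, but the DOM is expected to keep
        // only the first declaration of each name.
        System.out.println("entities visible through DOM: " + countEntitiesSeenByDom(xml));
    }
}
```

Because the map is keyed by entity name, the second declaration has nowhere to go, so the DOM alone cannot be used to spot the duplicate.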

Is there any way to get all four entity declarations?

Thanks, Atul


2 answers

  1. # Answer 1

    As @Ariel mentioned, "DocumentType" discards duplicates in its "entities" and "notations" attributes by default.

    So you can write a custom function like this:

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.commons.io.FileUtils; // Apache Commons IO

    // Read the whole XML file as text (UTF-8, matching the XML declaration)
    String fileStr = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
    // Capture the entity name between "<!ENTITY" and "SYSTEM"
    Pattern pattern = Pattern.compile("<!ENTITY\\s+(\\S+)\\s+SYSTEM");
    Matcher matcher = pattern.matcher(fileStr);
    List<String> entityNames = new ArrayList<>();
    while (matcher.find()) {
        String name = matcher.group(1);
        if (entityNames.contains(name)) {
            // actions to be taken for duplicates
        }
        entityNames.add(name);
    }
    
  2. # Answer 2

    "DocumentType" discards duplicates in its "entities" and "notations" attributes by interface definition (see the W3C DOM specification, REC-DOM-Level-3-Core):

    A NamedNodeMap containing the general entities, both external and internal, declared in the DTD. Parameter entities are not contained. Duplicates are discarded. For example in:

    <!DOCTYPE ex SYSTEM "ex.dtd" [
      <!ENTITY foo "foo">
      <!ENTITY bar "bar">
      <!ENTITY bar "bar2">
      <!ENTITY % baz "baz">
    ]>
    <ex/>
    

    the interface provides access to foo and the first declaration of bar but not the second declaration of bar or baz. Every node in this map also implements the Entity interface. The DOM Level 2 does not support editing entities, therefore entities cannot be altered in any way.

    I think you need a different approach to parse/check this information, e.g. you could use regular expressions.
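
The regular-expression approach suggested in both answers could be sketched as follows. This is a minimal illustration rather than a full DTD parser: the class `DuplicateEntityFinder` and its methods are hypothetical names, the pattern is an assumption about how the declarations are written, and it scans only the text it is given, so declarations inside comments would also be counted and an external DTD would have to be read and scanned separately.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DuplicateEntityFinder {

    // Matches "<!ENTITY name" and "<!ENTITY % name".
    // Group 1 is present for parameter entities, group 2 is the entity name.
    private static final Pattern ENTITY_DECL =
            Pattern.compile("<!ENTITY\\s+(%\\s+)?([^\\s%]+)\\s");

    // Counts declarations per entity name, in order of first appearance.
    // Parameter entities live in a separate namespace, so they are keyed with a "%" prefix.
    static Map<String, Integer> countDeclarations(String xml) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = ENTITY_DECL.matcher(xml);
        while (m.find()) {
            String key = (m.group(1) != null ? "%" : "") + m.group(2);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the names that are declared more than once.
    static List<String> findDuplicates(String xml) {
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countDeclarations(xml).entrySet()) {
            if (e.getValue() > 1) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // The internal subset from the question, inlined as a string.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
                + "<!DOCTYPE test PUBLIC \"-//Atul//DTD ATM - TEST//EN//-\"\n"
                + " \"test.dtd\" [\n"
                + "<!ENTITY ent1 SYSTEM \"Graphic/test1.txt\" NDATA ccitt4>\n"
                + "<!ENTITY ent1 SYSTEM \"Graphic/test1.txt\" NDATA ccitt4>\n"
                + "<!ENTITY ent2 SYSTEM \"Graphic/test2.txt\" NDATA ccitt4>\n"
                + "<!ENTITY ent3 SYSTEM \"Graphic/test4.txt\" NDATA ccitt4>\n"
                + "]>\n"
                + "<test id=\"01\">\n"
                + "</test>";
        System.out.println(findDuplicates(xml)); // prints [ent1]
    }
}
```

Counting in a map instead of checking a list on every match keeps all four declarations visible (ent1 with a count of 2) and avoids a linear `contains` scan per declaration.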