Java: javax DocumentBuilder produces "double UTF-8" charset encoding


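I have a Java DOM Document that has been rewritten by MyFilter. From the log output I know that the Document's contents are still correct. I use the following lines to convert theDocument into a List<String> so that it can be passed back through the interface: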

    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
    return Arrays.asList(new String(buffer.toByteArray()).split("\r?\n"));

The filter is invoked from this file-copy method, which uses org.apache.commons.io.FileUtils:

    List<String> lines = FileUtils.readLines(source, "UTF-8");
    if (filters != null) {
        for (final MyFilter filter : filters) {
            lines = filter.filter(lines);
        }
    }
    FileUtils.writeLines(destination, "UTF-8", lines);

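This works perfectly on my machine (where I can debug it), but on other machines that only run the code, every non-ASCII character gets double UTF-8 encoded, repeatedly (for example, Größe becomes GrÃ¶ÃŸe). The code is executed in a web application running in Tomcat. I am sure the configurations differ, but what I want is an uncorrupted result on any configuration.

Any idea what I might be missing?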

2 answers

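# Answer 1

I finally figured it out: the problem is the String(byte[]) constructor, which interprets the byte[] according to the platform's default charset. It should at least be marked as deprecated. The Transformer, by contrast, apparently produces UTF-8 output regardless of the platform. Changing the method as follows passes the same charset to both: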
```
final String ENCODING = "UTF-8";
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, ENCODING);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray(), ENCODING).split("\r?\n"));
```
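For this to work it does not matter which encoding you choose, only that both sides use the same one. It is still best to pick a Unicode charset, because otherwise unmappable characters can be lost. Note, however, that the charset is reflected in the XML declaration, so when the List<String> is saved later, it is important to save it in that same encoding.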
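To make the mechanism concrete, here is a minimal sketch of what a mismatched decode does to the example word from the question. The class and method names are hypothetical, and ISO-8859-1 only stands in for whatever non-UTF-8 platform default the failing machines use (the question does not say which one it is).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
class DoubleEncodingDemo {

    /** One faulty pass: text is written as UTF-8 bytes, then decoded with another charset. */
    static String misdecode(String text, Charset decodeCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeCharset);
    }

    public static void main(String[] args) {
        // "Größe", written with escapes so the source file's own encoding does not matter.
        String original = "Gr\u00F6\u00DFe";

        // Assumption: ISO-8859-1 plays the role of the platform default charset.
        String once = misdecode(original, StandardCharsets.ISO_8859_1);
        String twice = misdecode(once, StandardCharsets.ISO_8859_1);

        // Every pass turns each non-ASCII character into two characters.
        System.out.println(original.length() + " -> " + once.length() + " -> " + twice.length());
        // prints "5 -> 7 -> 11"

        // Decoding with the charset the bytes were written in is lossless.
        System.out.println(original.equals(misdecode(original, StandardCharsets.UTF_8)));
        // prints "true"
    }
}
```

The growth on every pass matches the "repeatedly" in the question: each time the lines travel through a filter with the mismatched decode, the damage compounds.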

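# Answer 2

Once you have created the Document object, you need to read its contents.

After that, write it to a file using the LSSerializer interface, which the DOM standard provides for this purpose.

By default, LSSerializer produces an XML document without spaces or line breaks. As a result the output looks less pretty, but it is actually better suited for parsing by another program, because it contains no unnecessary whitespace.
If you want whitespace, use one more magic incantation after creating the serializer:

    ser.getDomConfig().setParameter("format-pretty-print", true);

The code snippet looks like this: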

    private String getContentFromDocument(Document doc) {
        String content;

        DOMImplementation impl = doc.getImplementation();
        DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");

        LSSerializer ser = implLS.createLSSerializer();
        ser.getDomConfig().setParameter("format-pretty-print", true);
        content = ser.writeToString(doc);

        return content;
    }

Once you have the string content, you can write it to a file, for example:

    public void writeToXmlFile(String xmlContent) {
        File theDir = new File("./output");
        if (!theDir.exists())
            theDir.mkdir();

        String fileName = "./output/" + this.getClass().getSimpleName() + "_"
                + Calendar.getInstance().getTimeInMillis() + ".xml";

        try (OutputStream stream = new FileOutputStream(new File(fileName))) {
            try (OutputStreamWriter out = new OutputStreamWriter(stream, StandardCharsets.UTF_8)) {
                out.write(xmlContent);
                out.write("\n");
            }
        } catch (IOException ex) {
            System.err.println("Cannot write to file!" + ex.getMessage());
        }
    }
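One caveat with going through writeToString: it returns a Java String, and in the JDK's built-in implementation, as far as I can tell, the XML declaration it emits names UTF-16, which no longer matches a file written as UTF-8. A possible alternative, sketched below under that assumption, is to hand the serializer an LSOutput with an explicit encoding so that it writes the bytes itself. The class name and helper methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical sketch class, not part of the original answer.
class LsOutputSketch {

    /** Serializes the document to the stream as UTF-8 bytes, with no intermediate String. */
    static void writeUtf8(Document doc, OutputStream target) {
        DOMImplementationLS implLS =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer ser = implLS.createLSSerializer();

        LSOutput out = implLS.createLSOutput();
        out.setEncoding("UTF-8");    // encoding used for the bytes written to the stream
        out.setByteStream(target);   // the serializer writes bytes directly to this stream

        ser.write(doc, out);
    }

    /** Test helper: parses UTF-8 XML text; checked exceptions are wrapped to keep the sketch short. */
    static Document parse(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception ex) {
            throw new IllegalStateException("Cannot parse sample XML", ex);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<item name=\"Gr\u00F6\u00DFe\"/>");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeUtf8(doc, buffer);
        // Decode with the same charset that was used for writing.
        System.out.println(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

In writeToXmlFile, the same helper could take the FileOutputStream directly, which would make the OutputStreamWriter unnecessary.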

By the way:

Have you tried obtaining the Document object in a slightly simpler way, for example:

    DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = documentFactory.newDocumentBuilder();
    Document doc = builder.parse(new File(fileName));

You could try this as well. It should be enough to parse the XML file.
