XML parsing: cannot read an XML document with Java

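I am trying to parse an XML file: a sitemap on the web. I have tried many combinations, but none of them worked. I am sure I am close, but I have not found anything that works.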

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
org.w3c.dom.Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
System.out.println("XML = " + doc);

Output:

XML = [#document: null]

Why is the output [#document: null]?

The file

https://www.lavisducagou.nc/page-sitemap.xml

is definitely online.

Thanks in advance for your help.


3 answers

  1. # Answer 1

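    Actually, your XML document was parsed and loaded correctly. You are just being thrown off by the rather unhelpful output of doc.toString() (which is called behind the scenes when "XML = " + doc is evaluated).

    You already know the XML tag names you need (urlset, url, loc, lastmod) and how they are nested within each other.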

    XML structure

    To go further with the XML, you just walk the tree and extract what you want. For example:

    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;
    
    public class Main {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
    
            // Get the <urlset> root element
            Element urlsetElement = doc.getDocumentElement();
    
            // Get the list of <url> elements within the <urlset> element
            NodeList urlNodeList = urlsetElement.getElementsByTagName("url");
    
            for (int i = 0; i < urlNodeList.getLength(); i++) {
                // Get the <url> element
                Element urlElement = (Element) urlNodeList.item(i);
    
                // Get the <loc> element within the <url> element
                Element locElement = (Element) urlElement.getElementsByTagName("loc").item(0);
                // Print the text content of the <loc> element
                System.out.println("loc = " + locElement.getTextContent());
    
                // Get the <lastmod> element within the <url> element
                Element lastmodElement = (Element) urlElement.getElementsByTagName("lastmod").item(0);
                // Print the text content of the <lastmod> element
                System.out.println("lastmod = " + lastmodElement.getTextContent());
            }
        }
    }
    

    You will get output like this:

    loc = https://www.lavisducagou.nc/
    lastmod = 2018-07-14T11:30:25+11:00
    loc = https://www.lavisducagou.nc/sinscrire/
    lastmod = 2018-05-03T16:58:35+11:00
    loc = https://www.lavisducagou.nc/se-connecter/
    lastmod = 2018-05-03T18:02:07+11:00
    loc = https://www.lavisducagou.nc/mot-de-passe-oublie/
    lastmod = 2018-05-03T20:33:08+11:00
    loc = https://www.lavisducagou.nc/compte/
    lastmod = 2018-05-03T20:36:32+11:00
    ...
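
The loop above assumes every url element contains both a loc and a lastmod child. In the sitemap protocol lastmod is optional, and NodeList.item(0) returns null when nothing matches, so a missing child would end in a NullPointerException. Below is a minimal sketch of a null-safe variant. The class name SitemapNullSafe, the helper textOrDefault, the "(none)" fallback and the inline sample XML are all assumptions made for illustration; they are not part of the original answer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class SitemapNullSafe {
    // Inline sample (an assumption for illustration): the second <url> has no <lastmod>.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
        + "<url><loc>https://www.lavisducagou.nc/</loc>"
        + "<lastmod>2018-07-14T11:30:25+11:00</lastmod></url>"
        + "<url><loc>https://www.lavisducagou.nc/sinscrire/</loc></url>"
        + "</urlset>";

    // Hypothetical helper: text of the first descendant with this tag name,
    // or the fallback when no such element exists.
    static String textOrDefault(Element parent, String tagName, String fallback) {
        Node node = parent.getElementsByTagName(tagName).item(0);
        return node == null ? fallback : node.getTextContent().trim();
    }

    // Returns one "loc | lastmod" row per <url> element.
    static List<String> extract(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList urls = doc.getDocumentElement().getElementsByTagName("url");
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < urls.getLength(); i++) {
                Element url = (Element) urls.item(i);
                rows.add(textOrDefault(url, "loc", "(none)") + " | "
                    + textOrDefault(url, "lastmod", "(none)"));
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sitemap XML", e);
        }
    }

    public static void main(String[] args) {
        extract(SAMPLE).forEach(System.out::println);
    }
}
```

To run this against the live sitemap instead of the sample, the same loop should work on a Document parsed from the URL stream as in the answer above.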
    
  2. # Answer 2

    You need to iterate over the XML elements and look them up. Here is a solution that fetches the loc and lastmod nodes inside each url node:

    package com.yourPackage;
    
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.SAXException;
    
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import java.io.IOException;
    import java.net.URL;
    
    public class Main {
        public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
    
            Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
            doc.getDocumentElement().normalize();
    
            NodeList urlList = doc.getElementsByTagName("url");
    
            for (int i = 0; i < urlList.getLength(); i++) {
                Element url = (Element)urlList.item(i);
    
                Node loc = url.getElementsByTagName("loc").item(0);
                Node lastmod = url.getElementsByTagName("lastmod").item(0);
    
                System.out.println(loc.getTextContent());
                System.out.println(lastmod.getTextContent());
            }
    
        }
    }
    

    The output is:

    https://www.lavisducagou.nc/
    2018-07-14T11:30:25+11:00
    https://www.lavisducagou.nc/sinscrire/
    2018-05-03T16:58:35+11:00
    https://www.lavisducagou.nc/se-connecter/
    2018-05-03T18:02:07+11:00
    https://www.lavisducagou.nc/mot-de-passe-oublie/
    2018-05-03T20:33:08+11:00
    https://www.lavisducagou.nc/compte/
    2018-05-03T20:36:32+11:00
    https://www.lavisducagou.nc/mon-profil/
    2018-05-05T15:18:36+11:00
    https://www.lavisducagou.nc/processus-de-paiement/
    2018-05-07T15:23:39+11:00
    https://www.lavisducagou.nc/paiement/
    2018-05-07T23:44:51+11:00
    https://www.lavisducagou.nc/historique-des-transactions/
    2018-05-12T16:58:30+11:00
    https://www.lavisducagou.nc/entreprise-standard/
    2018-05-16T23:22:26+11:00
    https://www.lavisducagou.nc/entreprise-premium/
    2018-05-16T23:25:31+11:00
    https://www.lavisducagou.nc/ajouter-une-entreprise/
    2018-05-16T23:30:08+11:00
    https://www.lavisducagou.nc/a-propos-de-nous/
    2018-06-05T18:52:10+11:00
    https://www.lavisducagou.nc/se-referencer/
    2018-06-07T16:15:39+11:00
    https://www.lavisducagou.nc/politique-de-confidentialite/
    2018-06-15T09:15:11+11:00
    https://www.lavisducagou.nc/donner-un-avis/
    2018-06-16T10:55:24+11:00
    https://www.lavisducagou.nc/conditions-dutilisation/
    2018-06-19T16:39:44+11:00
    https://www.lavisducagou.nc/annuaire-des-entreprises/
    2018-06-19T20:51:22+11:00
    https://www.lavisducagou.nc/pdf-generer/
    2018-06-21T00:40:48+11:00
    https://www.lavisducagou.nc/generer-pdf/
    2018-06-21T00:51:22+11:00
    https://www.lavisducagou.nc/contactez-nous/
    2018-06-23T15:44:20+11:00
    https://www.lavisducagou.nc/modifier-standard/
    2018-06-23T20:04:01+11:00
    https://www.lavisducagou.nc/pdf-generer-admin/
    2018-06-30T19:19:01+11:00
    https://www.lavisducagou.nc/conditions-generales-de-vente/
    2018-07-02T15:19:51+11:00
    https://www.lavisducagou.nc/modifier-standard-lentreprise/
    2018-07-04T22:25:30+11:00
    https://www.lavisducagou.nc/modifier-lentreprise/
    2018-07-04T22:26:25+11:00
    https://www.lavisducagou.nc/mentions-legales/
    2018-07-27T16:08:01+11:00
    https://www.lavisducagou.nc/jeu-concours/
    2018-08-22T14:40:53+11:00
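
The code above switches on setNamespaceAware(true) but then looks elements up by plain tag name. That works here because the sitemap elements carry no prefix. If you want the lookup to be tied to the sitemap namespace itself, getElementsByTagNameNS is one option. The sketch below is only an illustration under assumptions: the class name SitemapNamespaceLookup, the helper texts and the inline sample (including the made-up image URL on example.com) are not from the original answer. The two namespace URIs are the ones declared on the urlset element of the sitemap.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SitemapNamespaceLookup {
    // Default namespace declared on the <urlset> element of the sitemap.
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Inline sample; the image URL is made up for illustration.
    static final String SAMPLE =
        "<urlset xmlns=\"" + SITEMAP_NS + "\""
        + " xmlns:image=\"http://www.google.com/schemas/sitemap-image/1.1\">"
        + "<url><loc>https://www.lavisducagou.nc/</loc>"
        + "<image:image><image:loc>https://example.com/logo.png</image:loc></image:image>"
        + "</url></urlset>";

    // Hypothetical helper: text of every element with the given local name
    // in the given namespace. Passing "*" as the namespace matches any namespace.
    static List<String> texts(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for namespace-based lookups
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagNameNS(namespaceUri, localName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sitemap XML", e);
        }
    }

    public static void main(String[] args) {
        // Only the sitemap's own <loc> elements
        System.out.println(texts(SAMPLE, SITEMAP_NS, "loc"));
        // Every element whose local name is "loc", including <image:loc>
        System.out.println(texts(SAMPLE, "*", "loc"));
    }
}
```

The namespace-specific call skips the image:loc entry, which can matter for sitemaps that mix page URLs and image URLs.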
    
  3. # Answer 3

    What you are seeing is just the toString implementation of com.sun.org.apache.xerces.internal.dom.DocumentImpl:

    public String toString() {
        return "["+getNodeName()+": "+getNodeValue()+"]";
    }
    

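    For a document node, getNodeName() returns "#document" and getNodeValue() returns null, which is why you see [#document: null]. What you need to do is get the child nodes and iterate over them to get the details you need.

    Because of firewall issues I cannot access the URL from Java, but here is a small excerpt from that same file: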

    <?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl"  href="//www.lavisducagou.nc/wp-content/plugins/wordpress-seo/css/main-sitemap.xsl"?>
    <urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd http://www.google.com/schemas/sitemap-image/1.1 http://www.google.com/schemas/sitemap-image/1.1/sitemap-image.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.lavisducagou.nc/</loc>
        <lastmod>2018-07-14T11:30:25+11:00</lastmod>
      </url>
      <url>
        <loc>https://www.lavisducagou.nc/sinscrire/</loc>
        <lastmod>2018-05-03T16:58:35+11:00</lastmod>
      </url>
    </urlset>
    

    I just updated the code with the following steps:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    org.w3c.dom.Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
    System.out.println("XML = " + doc);
    NodeList nodeList = doc.getChildNodes();
    for (int i=0; i<nodeList.getLength();i++) {
       System.out.println(nodeList.item(i).getNodeName());
    }
    

    Sample output:

    XML = [#document: null]
    xml-stylesheet
    urlset
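
If the goal of the original println was to see the XML text itself, one option is to serialize the parsed Document back to a string with an identity Transformer. This is a minimal sketch, not something the answers above propose: the class name PrintDocument, the methods parse and toXml, and the inline sample are assumptions for illustration, and the exact serialized text can vary between JDK versions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class PrintDocument {
    // Inline sample (an assumption for illustration) so the sketch needs no network access.
    static final String SAMPLE =
        "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
        + "<url><loc>https://www.lavisducagou.nc/</loc></url></urlset>";

    // Parses an XML string into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Serializes the DOM tree back to text using an identity transformation.
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        // The two values that the toString() shown above concatenates
        System.out.println("node name  = " + doc.getNodeName());
        System.out.println("node value = " + doc.getNodeValue());
        // The actual XML content of the document
        System.out.println(toXml(doc));
    }
}
```

For the live sitemap, the Document parsed from the URL stream could be passed to toXml in the same way.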