Java: extracting HTML tags between heading tags using jsoup or regex


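Hi, I have a scenario that involves parsing an HTML file. I am using jsoup to parse the file, and after parsing I want to extract the heading tags (h1, h3, h4). I have used doc.select(), but it returns only the heading tag values; my requirement is to extract the tags between an h1 and the next h3 or h4, and vice versa.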

<h4>SECTION 2</h4>
<p>some thing h4.....</p>
<p>some thing h4.....</p>
<p>some thing h4.....</p>
<h3>lawsuit</h3>
<p>some thing h3.....</p>
<p>some thing h3.....</p>
<p>some thing h3.....</p>
<h1>header one </h1>

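So here, first check whether the HTML string contains h1, h3, or h4. Here we have an h4, so starting from and including the h4, it should search for the next h1 or h3. Everything up to that h3 is extracted as a string and put in a separate HTML file.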

The first HTML file contains

<h4>SECTION 2</h4>
<p>some thing h4.....</p>
<p>some thing h4.....</p>
<p>some thing h4.....</p>

The second HTML file contains

<h3>lawsuit</h3>
<p>some thing h3.....</p>
<p>some thing h3.....</p>
<p>some thing h3.....</p>

The third HTML file contains

<h1>header one </h1>
....
....
....

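The HTML string here is dynamic, so I want to write a regular expression to handle this case; since I am new to Java, I do not know how to implement it. For now I have used substring, but I need a generic approach, either a regular expression or jsoup itself.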

The code I have tried is

try {
    File sourceFile = new File("E://data1.html");
    org.jsoup.nodes.Document doc = Jsoup.parse(sourceFile, "UTF-8");
    org.jsoup.nodes.Element elements = doc.body();
    String elementString = StringUtils.substringBetween(elements.toString(),"<h4>", "<h3>");
    System.out.println("elementString::"+elementString);
    File destinationFile = new File("E://sample.html");
    BufferedWriter htmlWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(destinationFile), "UTF-8"));
    htmlWriter.write(elementString);
    htmlWriter.close();
    System.out.println("Completed!!!");
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

Please help me achieve this.

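Elements elements = doc.select("h1,h2,h3,h4,h5");
for (Element element : elements) {
    // Each section starts with the heading element itself
    StringBuilder sb = new StringBuilder(element.toString());
    Element next = element.nextElementSibling();
    // Collect following siblings until the next heading; matching h1-h5 exactly
    // (rather than startsWith("h")) keeps tags such as <hr> inside the section
    while (next != null && !next.tagName().matches("h[1-5]")) {
        sb.append(next.toString()).append("\n");
        next = next.nextElementSibling();
    }
    System.out.println(sb);
}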
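
Since the question also asks about a regular expression, here is a minimal sketch of that route using only the JDK. It assumes the headings and paragraphs are flat siblings, as in the sample; with nested markup a regex split can produce fragments with unbalanced tags, so jsoup is probably the safer choice there. The class name HeadingSplitter, the method names, and the section1.html, section2.html, ... file names are all hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or any other library.
class HeadingSplitter {

    // Zero-width lookahead: split just before every opening <h1>..<h6> tag,
    // so the heading tag stays at the start of its own section.
    private static final Pattern HEADING_START =
            Pattern.compile("(?i)(?=<h[1-6][\\s>])");

    // Returns one string per section. Any text before the first heading
    // becomes a section of its own; blank pieces are skipped.
    static List<String> splitAtHeadings(String html) {
        List<String> sections = new ArrayList<>();
        for (String part : HEADING_START.split(html)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                sections.add(trimmed);
            }
        }
        return sections;
    }

    // Writes each section to its own file (section1.html, section2.html, ...)
    // inside the given directory, creating the directory if needed.
    static void writeSections(List<String> sections, Path dir) throws IOException {
        Files.createDirectories(dir);
        for (int i = 0; i < sections.size(); i++) {
            Path out = dir.resolve("section" + (i + 1) + ".html");
            Files.write(out, sections.get(i).getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        String html = "<h4>SECTION 2</h4>\n<p>some thing h4.....</p>\n"
                + "<h3>lawsuit</h3>\n<p>some thing h3.....</p>\n"
                + "<h1>header one </h1>";
        List<String> sections = splitAtHeadings(html);
        for (String section : sections) {
            System.out.println(section);
            System.out.println("-----");
        }
        // Assumption: an "output" directory under the working directory is acceptable.
        writeSections(sections, Paths.get("output"));
    }
}
```

To combine this with the code from the question, the string passed to splitAtHeadings could be doc.body().html() from the parsed jsoup Document, which avoids hard-coding "<h4>" and "<h3>" as in the substringBetween attempt.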
