Java: extracting the text between HTML tags in content parsed from XML


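Can anyone help me extract the text inside HTML tags as plain text?

I have parsed an XML document and got some output as the body, which contains HTML tags. Now I want to remove the tags and use just the text.

Thanks in advance.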

# Answer 1

Try HTML Parser.

If the HTML is escaped, i.e. `&lt;` instead of `<`, you may have to decode it first.
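
To illustrate the decoding step mentioned above, here is a minimal sketch using only the standard library. It assumes the body contains nothing beyond the five predefined XML entities; `EntityDecoder` and `decodeBasicEntities` are hypothetical names made up for this example, not part of any parser library. A real HTML parser or utility library would also handle numeric and named entities such as `&nbsp;`.

```java
// Sketch only: handles just &lt; &gt; &quot; &apos; &amp;.
// EntityDecoder / decodeBasicEntities are illustrative names, not a library API.
final class EntityDecoder {

    static String decodeBasicEntities(String escaped) {
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                // Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<"
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // prints: <b>Tom & Jerry</b>
        System.out.println(decodeBasicEntities("&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;"));
    }
}
```

Note that if the body was read through a standard XML parser (for example via `getTextContent()` on a DOM node), the entities are normally already decoded and this step is not needed.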
# Answer 2

If you only want to remove the HTML tags from a string, you can do the following:
```
String output = input.replaceAll("(?s)\\<.*?\\>", " ");
```
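
Because every tag is replaced by a space, the result usually contains doubled spaces and leading or trailing whitespace. The sketch below wraps the same regex in a helper and tidies the whitespace afterwards. `TagStripper` and `stripTags` are hypothetical names chosen for this example. It is a rough approach, not a real parser: the contents of `<script>` or `<style>` elements stay in the output, and a `>` inside an attribute value will end the match early.

```java
// Sketch only: regex-based tag removal plus whitespace cleanup.
// TagStripper / stripTags are illustrative names.
final class TagStripper {

    static String stripTags(String input) {
        return input
                .replaceAll("(?s)<.*?>", " ") // (?s) lets a tag span several lines
                .replaceAll("\\s+", " ")      // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        // prints: Hello world
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```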
# Answer 3

Given your requirements, you could try Jericho HTML Parser.

Take a look at the TextExtractor class:

Using the default settings, the source segment: "<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>" produces the text "One Two Three"

You can use an HTML parser such as JSoup.

For example, if the HTML is

```
<div style="height:240px;"><br>test: example<br>test1:example1</div>
```

you can use

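```
Document document = Jsoup.parse(html);
// Select the div by its exact style attribute value
Element div = document.select("div[style=height:240px;]").first();
// html() returns the inner HTML, so the <br> tags remain; div.text() gives plain text
String innerHtml = div.html();
```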
