1. Overview
Sometimes, we’d like to remove all HTML tags from an HTML document string and extract just the text.
The problem looks pretty straightforward. However, depending on the requirements, it can have different variants.
In this tutorial, we’ll discuss how to do that using Java.
2. Using Regex
Since we already have the HTML in a String variable, extracting the text is a text manipulation task.
When facing text manipulation problems, regular expressions (Regex) are often the first idea that comes to mind.
Matching HTML tags isn’t hard for Regex: whether it’s a start tag or an end tag, it follows the pattern “< … >”.
Translated into Regex, that’s “<[^>]*>” or “<.*?>”.
We should note that Regex quantifiers are greedy by default. That is, the Regex “<.*>” won’t work for our problem since we want to match from ‘<’ until the next ‘>’ instead of the last ‘>’ in a line.
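To make the difference between greedy and non-greedy matching concrete, here’s a minimal sketch that applies the three patterns to one line of markup. The class name GreedyVsLazyDemo, the strip() helper, and the sample string are our own illustrations, not part of the original examples:

```java
class GreedyVsLazyDemo {

    // Removes every match of the given tag pattern from the input.
    static String strip(String input, String tagRegex) {
        return input.replaceAll(tagRegex, "");
    }

    public static void main(String[] args) {
        // Hypothetical sample line, chosen only for illustration
        String line = "<p>Hello <b>World</b></p>";

        // Greedy: matches from the first '<' to the LAST '>', so the whole line disappears
        System.out.println("[" + strip(line, "<.*>") + "]");

        // Reluctant: stops at the next '>', so only the tags are removed
        System.out.println("[" + strip(line, "<.*?>") + "]");

        // Negated character class: the same result on this input
        System.out.println("[" + strip(line, "<[^>]*>") + "]");
    }
}
```

One more difference is worth keeping in mind: by default, ‘.’ doesn’t match line terminators, while “[^>]” does. So, unless we enable the DOTALL flag, only “<[^>]*>” can match a tag that spans several lines.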
Now, let’s test whether it can remove tags from an HTML source.
2.1. Removing Tags From example1.html
Before we test removing HTML tags, let’s first create an HTML example, say example1.html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>This is the page title</title>
</head>
<body>
<p>
If the application X doesn't start, the possible causes could be:<br/>
1. <a href="maven.com">Maven</a> is not installed.<br/>
2. Not enough disk space.<br/>
3. Not enough memory.
</p>
</body>
</html>
Now, let’s write a test and use String.replaceAll() to remove HTML tags:
String html = ... // load example1.html
String result = html.replaceAll("<[^>]*>", "");
System.out.println(result);
If we run the test method, we see the result:
This is the page title
If the application X doesn't start, the possible causes could be:
1. Maven is not installed.
2. Not enough disk space.
3. Not enough memory.
The output looks pretty good: all HTML tags have been removed.
The approach preserves whitespace from the stripped HTML. But we can easily remove or skip those empty lines or whitespace when we process the extracted text. So far, so good.
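As a rough sketch of that clean-up step, we could trim each line of the stripped text and drop the empty ones. The class name BlankLineCleaner, the extractText() helper, and the shortened sample markup are assumptions made for this illustration:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class BlankLineCleaner {

    // Strips tags with the same Regex as before, then trims each line and skips empty ones.
    static String extractText(String html) {
        String stripped = html.replaceAll("<[^>]*>", "");
        return Arrays.stream(stripped.split("\\R"))   // \R matches any line break
            .map(String::trim)
            .filter(line -> !line.isEmpty())
            .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Shortened, hypothetical excerpt in the spirit of example1.html
        String html = "<html>\n<head>\n  <title>This is the page title</title>\n</head>\n"
            + "<body>\n<p>\n  1. <a href=\"maven.com\">Maven</a> is not installed.<br/>\n</p>\n"
            + "</body>\n</html>";

        System.out.println(extractText(html));
    }
}
```

This only tidies the whitespace. It doesn’t make the tag-stripping Regex itself any more robust.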
2.2. Removing Tags From example2.html
As we’ve just seen, using Regex to remove HTML tags is pretty straightforward. However, this approach may have problems since we can’t predict what HTML source we’ll get.
For example, an HTML document may have <script> or <style> tags, and we may not want their content in the result.
Further, the text in the <script>, <style>, or even the <body> tags could contain “<” or “>” characters. If this is the case, our Regex approach may fail.
Now, let’s look at another HTML example, say example2.html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>This is the page title</title>
</head>
<script>
// some interesting script functions
</script>
<body>
<p>
If the application X doesn't start, the possible causes could be:<br/>
1. <a
id="link"
href="http://maven.apache.org/">
Maven
</a> is not installed.<br/>
2. Not enough (<1G) disk space.<br/>
3. Not enough (<64MB) memory.<br/>
</p>
</body>
</html>
This time, we have a <script> tag and “<” characters in the text of the <body> tag.
If we use the same method on example2.html, we’ll get (empty lines have been removed):
This is the page title
// some interesting script functions
If the application X doesn't start, the possible causes could be:
1.
Maven
is not installed.
2. Not enough (
3. Not enough (
Clearly, we’ve lost some text due to the “<” characters. Also, the content of the <script> element has leaked into the result.
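To see why the text goes missing, we can isolate one of the affected lines. In this small sketch (the class name RegexPitfallDemo and the stripTags() helper are ours), the literal “<” in “(<1G)” is treated as the start of a tag, and the match runs all the way to the ‘>’ that closes <br/>:

```java
class RegexPitfallDemo {

    // The same tag-stripping Regex we used on example1.html
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // One line taken from example2.html
        String line = "2. Not enough (<1G) disk space.<br/>";

        // "[^>]*" also consumes the '<' of "<br/>", so everything from
        // "<1G" through "<br/>" is removed as if it were a single tag
        System.out.println(stripTags(line));   // prints "2. Not enough ("
    }
}
```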
Therefore, using Regex to process XML or HTML is fragile. Instead, we can choose an HTML parser to do the job.
Next, we’ll look at a few easy-to-use HTML libraries to extract text.
3. Using Jsoup
Jsoup is a popular HTML parser. To extract text from an HTML document, we can simply call Jsoup.parse(htmlString).text().
First, we need to add the Jsoup library to the classpath. For example, let’s say we’re using Maven to manage project dependencies:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.14.3</version>
</dependency>
Now, let’s test it with our example2.html:
String html = ... // load example2.html
System.out.println(Jsoup.parse(html).text());
If we run the method, it prints:
This is the page title If the application X doesn't start, the possible causes could be: 1. Maven is not installed. 2. Not enough (<1G) disk space. 3. Not enough (<64MB) memory.
As the output shows, Jsoup has successfully extracted the text from the HTML document. Also, the text in the <script> element has been ignored.
Additionally, by default, Jsoup removes text formatting and collapses whitespace such as line breaks.
However, if required, we can also ask Jsoup to preserve the line breaks.
4. Using HTMLCleaner
HTMLCleaner is another HTML parser. Its goal is to make “ill-formed and dirty” HTML from the Web suitable for further processing.
First, let’s add the HTMLCleaner dependency in our pom.xml:
<dependency>
<groupId>net.sourceforge.htmlcleaner</groupId>
<artifactId>htmlcleaner</artifactId>
<version>2.25</version>
</dependency>
We can set various options to control HTMLCleaner’s parsing behavior.
Here, as an example, let’s tell HTMLCleaner to skip the <script> element when parsing example2.html:
String html = ... // load example2.html
CleanerProperties props = new CleanerProperties();
props.setPruneTags("script");
String result = new HtmlCleaner(props).clean(html).getText().toString();
System.out.println(result);
If we run the test, HTMLCleaner produces this output:
This is the page title
If the application X doesn't start, the possible causes could be:
1.
Maven
is not installed.
2. Not enough (<1G) disk space.
3. Not enough (<64MB) memory.
As we can see, the content in the <script> element has been ignored.
Also, it converts <br/> tags into line breaks in the extracted text. This can be helpful if the format is significant.
On the other hand, HTMLCleaner preserves whitespace from the stripped HTML source. So, for example, the text “1. Maven is not installed” is broken into three lines.
5. Using Jericho
Finally, let’s look at another HTML parser, Jericho. It has a nice feature: rendering HTML markup with simple text formatting. We’ll see it in action shortly.
As usual, let’s first add the Jericho dependency in the pom.xml:
<dependency>
<groupId>net.htmlparser.jericho</groupId>
<artifactId>jericho-html</artifactId>
<version>3.4</version>
</dependency>
In our example2.html, we have a hyperlink with the text “Maven” that points to http://maven.apache.org/. Now, let’s say we’d like to have both the link URL and the link text in the result.
To do that, we can create a Renderer object and use the includeHyperlinkURLs option:
String html = ... // load example2.html
Source htmlSource = new Source(html);
Segment segment = new Segment(htmlSource, 0, htmlSource.length());
Renderer htmlRender = new Renderer(segment).setIncludeHyperlinkURLs(true);
System.out.println(htmlRender);
Next, let’s execute the test and check the output:
If the application X doesn't start, the possible causes could be:
1. Maven <http://maven.apache.org/> is not installed.
2. Not enough (<1G) disk space.
3. Not enough (<64MB) memory.
As we can see in the result above, the text has been nicely formatted. Also, the text in the <title> element is ignored by default.
The link URL is included as well. Apart from rendering links (<a>), Jericho supports rendering other HTML tags, for example, <hr/>, <br/>, bullet lists (<ul> and <li>), and so on.
6. Conclusion
In this article, we’ve explored different ways to remove HTML tags and extract text from HTML.
We should note that it’s not a good practice to use Regex to process XML/HTML.
As always, the complete source code for this article can be found over on GitHub.