1. Overview
In this tutorial, we’ll look briefly at different ways to preserve line breaks when using Jsoup to convert HTML to plain text. We’ll cover line breaks that come from newline (\n) characters, as well as those implied by <br> and <p> tags.
2. Preserving \n While Parsing HTML Text
By default, Jsoup normalizes whitespace in the HTML text, so each newline character (\n) is replaced with a space character.
To keep the newline characters, we can change Jsoup’s Document.OutputSettings and disable pretty-print. With pretty-print disabled, the HTML output methods don’t reformat the output, so the output looks like the input:
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
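Furthermore, we can use Jsoup#clean with Whitelist#none to remove all the HTML tags from the string. (In jsoup 1.14.2 and later, the Whitelist class is deprecated in favor of Safelist, so newer projects would likely use Safelist#none instead.)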
String strHTML = "<html><body>Hello\nworld</body></html>";
String strWithNewLines = Jsoup.clean(strHTML, "", Whitelist.none(), outputSettings);
Let’s see what our output string strWithNewLines looks like:
assertEquals("Hello\nworld", strWithNewLines);
So, by calling Jsoup#clean with Whitelist#none and disabling the pretty-print output setting, we preserve the line breaks that come from newline characters.
Let’s see what else we can do!
3. Preserving Line Breaks Associated with <br> and <p> Tags
When cleaning HTML text, the Jsoup#clean method also drops the line breaks implied by HTML tags such as <br> and <p>.
To preserve the line breaks associated with these tags, we first need to create a Jsoup Document from our HTML string:
String strHTML = "<html><body>Hello<br>World<p>Paragraph</p></body></html>";
Document jsoupDoc = Jsoup.parse(strHTML);
Next, we insert a literal \n placeholder (a backslash followed by the letter n, written "\\n" in Java) before each <br> and <p> tag. Once again, we disable the pretty-print output setting as well:
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
jsoupDoc.outputSettings(outputSettings);
jsoupDoc.select("br").before("\\n");
jsoupDoc.select("p").before("\\n");
Here, we used the select method of the Jsoup Document to find the matching elements, and the before method to insert the \n placeholder ahead of each one.
After that, we get the HTML string from jsoupDoc and replace each \n placeholder with a real newline character, while any newlines already in the input are retained:
String str = jsoupDoc.html().replaceAll("\\\\n", "\n");
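The escaping in these two steps is easy to misread, so here is a minimal sketch that uses only plain Java strings and makes no Jsoup calls. The class name PlaceholderDemo and the helper restoreNewlines are hypothetical names introduced for illustration, and the sample HTML is hand-written to resemble what the document might contain after the placeholders are inserted:

```java
class PlaceholderDemo {

    // Hypothetical helper: replaces the two-character placeholder
    // (a backslash followed by 'n') with a real line feed.
    static String restoreNewlines(String html) {
        // As a regex, "\\\\n" matches a literal backslash followed by 'n'.
        // The replacement "\n" is a single line-feed character.
        return html.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        // The Java literal "\\n" is two characters, not a newline.
        String placeholder = "\\n";
        System.out.println(placeholder.length()); // prints 2

        // Assumed shape of the HTML after the placeholders were inserted.
        String html = "Hello" + placeholder + "<br>World" + placeholder + "<p>Paragraph</p>";
        System.out.println(restoreNewlines(html));
    }
}
```

The point of the sketch is that "\\n" and "\n" are different strings in Java source: the first is the visible placeholder that survives inside the HTML, and the second is the actual line break that replaces it.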
Finally, we call Jsoup#clean with Whitelist#none and the pretty-print output setting disabled:
String strWithNewLines = Jsoup.clean(str, "", Whitelist.none(), outputSettings);
And our output string strWithNewLines looks like:
assertEquals("Hello\nWorld\nParagraph", strWithNewLines);
Thus, by inserting a newline before each <br> and <p> tag and disabling the pretty-print output setting of Jsoup, we can preserve the line breaks associated with those tags.
4. Conclusion
In this short article, we learned how to preserve line breaks associated with newline (\n) characters and the <br> and <p> tags when parsing HTML into plain text with Jsoup.
As always, all these code samples are available over on GitHub.