1. Overview
Jsoup is an open-source Java library used to scrape HTML pages. It provides an API for parsing, extracting, and manipulating data using DOM API methods.
In this article, we will see how to parse an HTML table using Jsoup. We will retrieve and update data in the HTML table, and also add and delete rows in it.
2. Dependencies
To use the Jsoup library, add the following dependency to the project:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
We can find the latest version of the Jsoup library in the Maven central repository.
3. Table Structure
To illustrate parsing HTML tables with jsoup, we will use a sample HTML structure. The complete HTML structure is available in the code base provided in the GitHub repository mentioned at the end of the article. Here, we show an abbreviated version of the table, with a header row and a single data row, for representational purposes:
<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Maths</th>
            <th>English</th>
            <th>Science</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Student 1</td>
            <td>90</td>
            <td>85</td>
            <td>92</td>
        </tr>
    </tbody>
</table>
As we can see, the table has a header row in the thead tag followed by data rows in the tbody tag. We assume that the table in the HTML document follows this format.
4. Parsing Table
Firstly, to select an HTML table from the parsed document, we can use the code snippet below:
Element table = doc.select("table").first();
Elements rows = table.select("tr");
Elements first = rows.get(0).select("th,td");
As we can see, we first select the table element from the document and then select the tr elements from it to get the rows. Since the table contains multiple rows, we take the first one and select its th or td cells. Using these calls, we can write the below function to parse table data.
Here, we assume that the table uses no colspan or rowspan attributes and that its first row is a header row made up of th tags.
Following is the code for parsing the table:
public List<Map<String, String>> parseTable(Document doc, int tableOrder) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements dataRows = tbody.select("tr");
    // The first row of the table supplies the column names
    Elements headerRow = table.select("tr")
      .get(0)
      .select("th,td");

    List<String> headers = new ArrayList<>();
    for (Element header : headerRow) {
        headers.add(header.text());
    }

    List<Map<String, String>> parsedDataRows = new ArrayList<>();
    for (int row = 0; row < dataRows.size(); row++) {
        Elements colVals = dataRows.get(row).select("th,td");

        int colCount = 0;
        Map<String, String> dataRow = new HashMap<>();
        for (Element colVal : colVals) {
            dataRow.put(headers.get(colCount++), colVal.text());
        }
        parsedDataRows.add(dataRow);
    }
    return parsedDataRows;
}
In this function, the doc parameter is the HTML document loaded from the file, and tableOrder is the zero-based index of the table element in the document. We use a List<Map<String, String>> to store the data rows found under the table's tbody element. Each element of the list is a Map representing one data row, with the column name as the key and the cell text for that column as the value. Using a list of Maps makes it easy to access the retrieved data.
The list index represents row numbers, and we can get specific cell data by its map key.
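The tests in this article call a loadFromFile helper whose body is not shown here. A minimal sketch of how a document might be loaded and its header cells selected follows. The class name TableLoading, the method signatures, and the UTF-8 assumption are all illustrative and not taken from the article's repository:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; names are illustrative only.
class TableLoading {

    // Parses an HTML file from disk, assuming it is UTF-8 encoded.
    static Document loadFromFile(File htmlFile) throws IOException {
        return Jsoup.parse(htmlFile, "UTF-8");
    }

    // Parses HTML held in memory, which is handy for small experiments.
    static Document loadFromString(String html) {
        return Jsoup.parse(html);
    }

    // Returns the cell texts of the first row of the first table,
    // or an empty list when the document contains no table or no row.
    static List<String> headerTexts(Document doc) {
        Element table = doc.selectFirst("table");
        if (table == null) {
            return List.of();
        }
        Element firstRow = table.selectFirst("tr");
        return firstRow == null ? List.of() : firstRow.select("th,td").eachText();
    }
}
```

Using selectFirst and checking for null avoids the exception that get(0) would throw on a document without a table.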
We can verify if table data is retrieved correctly using the test case below:
@Test
public void whenDocumentTableParsed_thenTableDataReturned() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    assertEquals("90", tableData.get(0).get("Maths"));
}
The JUnit test case confirms that the text of the table cells is parsed and stored in an ArrayList of HashMap objects, where each element of the list represents a data row in the table. Each row is a HashMap with the column header as the key and the cell text as the value. Using this structure, we can easily access table data.
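As an illustration of working with the parsed structure, the sketch below averages one numeric column of a List<Map<String, String>>. The ParsedTableStats class and columnAverage method are hypothetical names introduced for this example, and skipping non-numeric cells is a design assumption rather than something the article prescribes:

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper; the class and method names are illustrative.
class ParsedTableStats {

    // Averages one column of the parsed rows, skipping rows where the
    // column is missing or its text is not a whole number.
    // Returns 0.0 when no numeric value is found.
    static double columnAverage(List<Map<String, String>> rows, String column) {
        int sum = 0;
        int count = 0;
        for (Map<String, String> row : rows) {
            String value = row.get(column);
            if (value == null) {
                continue;
            }
            try {
                sum += Integer.parseInt(value.trim());
                count++;
            } catch (NumberFormatException e) {
                // Non-numeric cell text (for example a name): ignore it
            }
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

Because cell values are stored as text, any numeric use needs an explicit conversion like the one above.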
5. Update Elements of the Parsed Table
To insert or update elements while parsing, we can use the below code on the td element retrieved from the row:
colVals.get(colCount++).text(updateValue);
or
colVals.get(colCount++).html(updateValue);
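The two calls differ in how they treat markup in the new value: text() stores the value as literal text, while html() parses it as HTML. A small sketch of the difference, using a hypothetical CellUpdateModes class and a one-cell table made up for the example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class; names and the sample table are illustrative.
class CellUpdateModes {

    // Builds a fresh one-cell table and returns its <td> element.
    static Element sampleCell() {
        return Jsoup.parse("<table><tbody><tr><td>90</td></tr></tbody></table>")
          .selectFirst("td");
    }

    // Counts <b> elements inside the cell after setting the value as plain text.
    static int boldCountAfterText(String value) {
        Element cell = sampleCell();
        cell.text(value); // markup in value is kept as literal characters
        return cell.select("b").size();
    }

    // Counts <b> elements inside the cell after setting the value as HTML.
    static int boldCountAfterHtml(String value) {
        Element cell = sampleCell();
        cell.html(value); // markup in value is parsed into child elements
        return cell.select("b").size();
    }
}
```

For values that come from untrusted input, text() is usually the safer choice, since it cannot introduce new elements into the document.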
The function to update values in the parsed table would look like below:
public void updateTableData(Document doc, int tableOrder, String updateValue) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements dataRows = tbody.select("tr");

    for (int row = 0; row < dataRows.size(); row++) {
        Elements colVals = dataRows.get(row).select("th,td");

        for (int colCount = 0; colCount < colVals.size(); colCount++) {
            colVals.get(colCount).text(updateValue);
        }
    }
}
In the above function, we get the data rows from the tbody element of the table. The function traverses each cell of those rows and sets its text to the updateValue parameter. It updates all cells to the same value only to demonstrate that cell values can be updated using Jsoup. We can update an individual cell instead by specifying its row and column index.
The test below verifies the update function:
@Test
public void whenTableUpdated_thenUpdatedDataReturned() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    jsoParser.updateTableData(doc, 0, "50");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    assertEquals("50", tableData.get(2).get("Maths"));
}
The JUnit test case confirms that the update operation sets all table cell values to 50. Here, we verify the value in the Maths column of the third data row.
Similarly, we can set desired values for specific cells of the table.
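Updating a single cell by position could look like the sketch below. The SingleCellUpdate class, the updateCell name, and the choice to return false for out-of-range indexes are assumptions made for illustration:

```java
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical helper class; names are illustrative only.
class SingleCellUpdate {

    // Sets the text of one cell in the table body, addressed by zero-based
    // table, row, and column indexes. Returns false if any index is out of range.
    static boolean updateCell(Document doc, int tableOrder, int row, int col, String value) {
        Elements tables = doc.select("table");
        if (tableOrder < 0 || tableOrder >= tables.size()) {
            return false;
        }
        Elements dataRows = tables.get(tableOrder).select("tbody tr");
        if (row < 0 || row >= dataRows.size()) {
            return false;
        }
        Elements cells = dataRows.get(row).select("th,td");
        if (col < 0 || col >= cells.size()) {
            return false;
        }
        cells.get(col).text(value);
        return true;
    }
}
```

Checking each index first turns a potential IndexOutOfBoundsException into a simple boolean result that the caller can handle.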
6. Adding Row to the Table
We can add a row to the table using the following function:
public void addRowToTable(Document doc, int tableOrder) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);

    // The header row determines how many cells the new row needs
    Elements rows = table.select("tr");
    Elements headerCols = rows.get(0).select("th,td");
    int numCols = headerCols.size();

    Elements colVals = new Elements(numCols);
    for (int colCount = 0; colCount < numCols; colCount++) {
        Element colVal = new Element("td");
        colVal.text("11");
        colVals.add(colVal);
    }

    Elements dataRows = tbody.select("tr");
    Element newDataRow = new Element("tr");
    newDataRow.appendChildren(colVals);
    dataRows.add(newDataRow);
    // Replace the tbody content with the serialized rows, including the new one
    tbody.html(dataRows.toString());
}
In the above function, we get the number of columns from the header row and the data rows from the tbody element of the table. After adding a new row to the dataRows list, we replace the tbody HTML content with the serialized dataRows.
We can verify row addition using the following test case:
@Test
public void whenTableRowAdded_thenRowCountIncreased() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    int countBeforeAdd = tableData.size();
    jsoParser.addRowToTable(doc, 0);
    tableData = jsoParser.parseTable(doc, 0);
    assertEquals(countBeforeAdd + 1, tableData.size());
}
We can confirm from the JUnit test case that the addRowToTable operation increases the number of rows in the table by 1. This operation adds the new row at the end of the table body.
Similarly, we can add a row at any position by specifying the index while adding it to the row elements collection.
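One way to insert a row at a given position is to place it directly in the DOM rather than re-serializing the tbody. The sketch below assumes a hypothetical RowInsertion class and insertRow method, and it appends the row when the index is out of range, which is a choice made for this example only:

```java
import java.util.List;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class; names are illustrative only.
class RowInsertion {

    // Inserts a new row built from cellValues before the data row at the
    // zero-based index; appends it when the index is out of range.
    static void insertRow(Document doc, int tableOrder, int index, List<String> cellValues) {
        Element table = doc.select("table").get(tableOrder);
        Element tbody = table.selectFirst("tbody");
        if (tbody == null) {
            // Table without a body: create one to hold the new row
            tbody = table.appendElement("tbody");
        }

        Element newRow = new Element("tr");
        for (String value : cellValues) {
            newRow.appendElement("td").text(value);
        }

        // children() lists only element children, so whitespace between rows is ignored
        Elements rows = tbody.children();
        if (index < 0 || index >= rows.size()) {
            tbody.appendChild(newRow);
        } else {
            rows.get(index).before(newRow);
        }
    }
}
```

Inserting relative to an existing row element keeps the position in terms of rows, independent of any text nodes between them.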
7. Delete the Row From the Table
We can delete a row from the table using the following function:
public void deleteRowFromTable(Document doc, int tableOrder, int rowNumber) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements dataRows = tbody.select("tr");
    // Ignore row numbers outside the range of existing data rows
    if (rowNumber >= 0 && rowNumber < dataRows.size()) {
        dataRows.remove(rowNumber);
    }
}
In the above function, we get the tbody element of the table and, from it, the list of dataRows. We then delete the row at the rowNumber position from that list. We can verify row deletion using the following test case:
@Test
public void whenTableRowDeleted_thenRowCountDecreased() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    int countBeforeDel = tableData.size();
    jsoParser.deleteRowFromTable(doc, 0, 2);
    tableData = jsoParser.parseTable(doc, 0);
    assertEquals(countBeforeDel - 1, tableData.size());
}
The JUnit test case confirms that the deleteRowFromTable operation on the table reduces the number of rows in the table by 1.
Similarly, we can delete a row at any position by specifying the index while removing it from the row elements collection.
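An alternative that does not rely on how the Elements list propagates removals is to call remove() on the row element itself, which detaches it from the document. The RowDeletion class and deleteRow method below are hypothetical names for this sketch:

```java
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical helper class; names are illustrative only.
class RowDeletion {

    // Detaches the data row at the zero-based rowNumber from the document.
    // Returns false when the row number is out of range.
    static boolean deleteRow(Document doc, int tableOrder, int rowNumber) {
        Elements dataRows = doc.select("table").get(tableOrder).select("tbody tr");
        if (rowNumber < 0 || rowNumber >= dataRows.size()) {
            return false;
        }
        // Node.remove() removes this element from its parent in the DOM
        dataRows.get(rowNumber).remove();
        return true;
    }
}
```

After the call, a fresh select on the document no longer returns the removed row.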
8. Conclusion
In this article, we have seen how to use jsoup to parse HTML tables from HTML documents. We have also seen how to update the table structure as well as the table cell data. As always, the source for these examples is available over on GitHub.