1. Introduction
Large table reads can cause our application to run out of memory. They also add extra load to the database and require more bandwidth to execute. The recommended approach while reading a large table is to use paginated queries. Essentially, we read a subset (page) of data, process the data, and then move to the next page.
In this article, we’ll discuss and implement different strategies for pagination with JDBC.
2. Setup
First, we need to add the appropriate JDBC dependency based on our database in the pom.xml file so that we can connect to our database. For example, if our database is PostgreSQL, we need to add the PostgreSQL dependency:
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.6.0</version>
</dependency>
Second, we’ll need a large dataset to make a paginated query. Let’s create an employees table and insert one million records into it:
CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    salary DECIMAL(10, 2)
);

INSERT INTO employees (first_name, last_name, salary)
SELECT
    'FirstName' || series_number,
    'LastName' || series_number,
    (random() * 100000)::DECIMAL(10, 2) -- Adjust the range as needed
FROM generate_series(1, 1000000) AS series_number;
Lastly, we’ll create a connection object inside our sample app and configure it with our database connection:
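Connection connect(String url, String user, String password) throws SQLException {
    // getConnection() either returns a usable connection or throws SQLException, so no null check is needed
    Connection connection = DriverManager.getConnection(url, user, password);
    System.out.println("Connected to database");
    return connection;
}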
3. Pagination With JDBC
Our dataset contains about 1M records. Querying them all at once puts pressure not only on the database but also on the network, since more data has to be transferred at a time. It also puts pressure on our application's memory, since the whole result has to fit in RAM. When reading large datasets, it's recommended to read and process the data in pages or batches.
JDBC doesn’t provide out-of-the-box methods to read in pages, but there are approaches that we can implement by ourselves. We’ll be discussing and implementing two such approaches.
3.1. Using LIMIT and OFFSET
We can add LIMIT and OFFSET to our select query to return a result of a defined size. The LIMIT clause sets the maximum number of rows to return, while the OFFSET clause skips the given number of rows from the start of the query result. We can then paginate our query by advancing the OFFSET position.
In the logic below, we pass pageSize as the LIMIT and offset as the start position for reading the records:
ResultSet readPageWithLimitAndOffset(Connection connection, int offset, int pageSize) throws SQLException {
    // ORDER BY makes the row order deterministic, so consecutive pages don't overlap or skip rows
    String sql = """
        SELECT * FROM employees
        ORDER BY id
        LIMIT ? OFFSET ?
    """;
    PreparedStatement preparedStatement = connection.prepareStatement(sql);
    preparedStatement.setInt(1, pageSize);
    preparedStatement.setInt(2, offset);
    return preparedStatement.executeQuery();
}
The query result is a single page of data. To read the entire table page by page, we fetch each page, process its records, and then move on to the next page.
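The page-by-page iteration described above could be sketched roughly as follows. This is only an illustration: the OffsetPaging class and its offsetOf(), pageCount(), and forEachPage() methods are hypothetical names that don't come from the article's code, and the ORDER BY id clause is our own assumption to keep the page order stable:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of the article's sample project
final class OffsetPaging {

    // Offset of a zero-based page: page 0 starts at row 0, page 1 at pageSize, and so on
    static long offsetOf(long pageIndex, int pageSize) {
        return pageIndex * pageSize;
    }

    // Number of pages needed to cover totalRows, counting a final partial page
    static long pageCount(long totalRows, int pageSize) {
        return (totalRows + pageSize - 1) / pageSize;
    }

    // Reads the employees table page by page and hands each page of first names to the handler
    static void forEachPage(Connection connection, int pageSize, Consumer<List<String>> handler)
            throws SQLException {
        // Assumption: ordering by id gives every page a stable position
        String sql = "SELECT first_name FROM employees ORDER BY id LIMIT ? OFFSET ?";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            for (long pageIndex = 0; ; pageIndex++) {
                statement.setInt(1, pageSize);
                statement.setLong(2, offsetOf(pageIndex, pageSize));
                List<String> page = new ArrayList<>();
                try (ResultSet resultSet = statement.executeQuery()) {
                    while (resultSet.next()) {
                        page.add(resultSet.getString("first_name"));
                    }
                }
                if (page.isEmpty()) {
                    return; // we've moved past the last row
                }
                handler.accept(page);
            }
        }
    }
}
```

Copying each page into a list before closing the ResultSet keeps only one page in memory at a time and releases the database resources as soon as the page has been read.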
3.2. Using a Sorted Key With LIMIT
We can also take advantage of the sorted key with LIMIT to read results in batches. For example, in our employees table, we have an ID column that is an auto-increment column and has an index on it. We’ll use this ID column to set a lower bound for our page, and LIMIT will help us to set an upper bound for the page:
ResultSet readPageWithSortedKeys(Connection connection, int lastFetchedId, int pageSize) throws SQLException {
    // ORDER BY id guarantees that the page holds the next pageSize ids after lastFetchedId
    String sql = """
        SELECT * FROM employees
        WHERE id > ?
        ORDER BY id
        LIMIT ?
    """;
    PreparedStatement preparedStatement = connection.prepareStatement(sql);
    preparedStatement.setInt(1, lastFetchedId);
    preparedStatement.setInt(2, pageSize);
    return preparedStatement.executeQuery();
}
As we can see in the above logic, we pass lastFetchedId as the lower bound for the page, while pageSize, applied through LIMIT, caps the number of rows and so sets the upper bound.
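To see how lastFetchedId advances from one page to the next, here is a minimal sketch that replaces the database with an in-memory list of sorted ids. The KeysetPaging class and both of its methods are hypothetical and only illustrate the idea. We also assume that ids start at 1, as they do with a SERIAL column:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the article's sample project
final class KeysetPaging {

    // Walks all pages; fetchPage.apply(lastFetchedId, pageSize) must return
    // the next ids greater than lastFetchedId, in ascending order
    static List<List<Integer>> readAllPages(
            BiFunction<Integer, Integer, List<Integer>> fetchPage, int pageSize) {
        List<List<Integer>> pages = new ArrayList<>();
        int lastFetchedId = 0; // assumption: ids start at 1
        while (true) {
            List<Integer> page = fetchPage.apply(lastFetchedId, pageSize);
            if (page.isEmpty()) {
                break; // no ids left above lastFetchedId
            }
            pages.add(page);
            // the last id of this page becomes the lower bound of the next one
            lastFetchedId = page.get(page.size() - 1);
        }
        return pages;
    }

    // In-memory stand-in for "WHERE id > ? ORDER BY id LIMIT ?"; sortedIds must be ascending
    static List<Integer> fetchFromMemory(List<Integer> sortedIds, int lastFetchedId, int pageSize) {
        return sortedIds.stream()
          .filter(id -> id > lastFetchedId)
          .limit(pageSize)
          .collect(Collectors.toList());
    }
}
```

Because each page starts from the last id we actually read, gaps in the id sequence (for example, after deleted rows) don't produce short or empty pages in the middle of the table.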
4. Testing
Let’s test our logic by writing simple unit tests. For testing, we’ll set up a database and insert 1M records into the table. We’re running setup() and tearDown() methods once per test class for setting up test data and tearing it down:
@BeforeAll
public static void setup() throws Exception {
    connection = connect(JDBC_URL, USERNAME, PASSWORD);
    populateDB();
}

@AfterAll
public static void tearDown() throws SQLException {
    destroyDB();
}
The populateDB() method first creates an employees table and inserts sample records for 1M employees:
private static void populateDB() throws SQLException {
    String createTable = """
        CREATE TABLE EMPLOYEES (
            id SERIAL PRIMARY KEY,
            first_name VARCHAR(50),
            last_name VARCHAR(50),
            salary DECIMAL(10, 2)
        );
    """;
    try (PreparedStatement createStatement = connection.prepareStatement(createTable)) {
        createStatement.execute();
    }

    String load = """
        INSERT INTO EMPLOYEES (first_name, last_name, salary)
        VALUES(?,?,?)
    """;
    // Reuse one prepared statement and send the rows in batches instead of one round trip per row
    try (PreparedStatement insertStatement = connection.prepareStatement(load)) {
        for (int i = 1; i <= 1_000_000; i++) {
            insertStatement.setString(1, "firstname" + i);
            insertStatement.setString(2, "lastname" + i);
            // random salary between 100,000 and 1,000,000
            insertStatement.setDouble(3, 100_000 + (1_000_000 - 100_000) * Math.random());
            insertStatement.addBatch();
            if (i % 10_000 == 0) {
                insertStatement.executeBatch();
            }
        }
        insertStatement.executeBatch(); // flush any rows left in the last batch
    }
}
Our tearDown() method calls destroyDB(), which drops the employees table:
private static void destroyDB() throws SQLException {
    String destroy = """
        DROP table EMPLOYEES;
    """;
    connection
      .prepareStatement(destroy)
      .execute();
}
Once we’ve set up the test data, we can write a simple unit test for the LIMIT and OFFSET approach to verify the page size:
@Test
void givenDBPopulated_WhenReadPageWithLimitAndOffset_ThenReturnsPaginatedResult() throws SQLException {
    int offset = 0;
    int pageSize = 100_000;
    int totalPages = 0;
    while (true) {
        ResultSet resultSet = PaginationLogic.readPageWithLimitAndOffset(connection, offset, pageSize);
        if (!resultSet.next()) {
            break;
        }

        List<String> resultPage = new ArrayList<>();
        do {
            resultPage.add(resultSet.getString("first_name"));
        } while (resultSet.next());

        assertEquals("firstname" + (resultPage.size() * (totalPages + 1)), resultPage.get(resultPage.size() - 1));
        offset += pageSize;
        totalPages++;
    }
    assertEquals(10, totalPages);
}
As we can see above, we loop until we've read all the database records in pages, and for each page, we verify the last read record.
Similarly, we can write another test for pagination with sorted keys using the ID column:
@Test
void givenDBPopulated_WhenReadPageWithSortedKeys_ThenReturnsPaginatedResult() throws SQLException {
    PreparedStatement preparedStatement = connection.prepareStatement("SELECT min(id) as min_id, max(id) as max_id FROM employees");
    ResultSet resultSet = preparedStatement.executeQuery();
    resultSet.next();

    int minId = resultSet.getInt("min_id");
    int maxId = resultSet.getInt("max_id");
    int lastFetchedId = minId - 1; // start just below the smallest id so the first page includes minId

    int pageSize = 100_000;
    int totalPages = 0;

    // keep reading until the last fetched id reaches the largest id in the table
    while (lastFetchedId < maxId) {
        resultSet = PaginationLogic.readPageWithSortedKeys(connection, lastFetchedId, pageSize);
        if (!resultSet.next()) {
            break;
        }

        List<String> resultPage = new ArrayList<>();
        do {
            resultPage.add(resultSet.getString("first_name"));
            lastFetchedId = resultSet.getInt("id");
        } while (resultSet.next());

        assertEquals("firstname" + (resultPage.size() * (totalPages + 1)), resultPage.get(resultPage.size() - 1));
        totalPages++;
    }
    assertEquals(10, totalPages);
}
As we can see above, we loop over the entire table to read all the data, one page at a time. We look up minId and maxId to define the iteration window for the loop. Then, we assert the last read record of each page and the total number of pages.
5. Conclusion
In this article, we discussed reading large datasets in batches instead of reading them all in one query. We discussed and implemented two approaches, along with unit tests verifying that they work.
The LIMIT and OFFSET approach may become inefficient for large datasets, since the database still has to read all the rows up to the OFFSET position before skipping them, while the sorted key approach is efficient since it only queries the relevant rows using a sorted key that is also indexed.
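To get a rough feeling for the difference, we can count the rows each approach has to walk through under a deliberately simplified cost model. We assume that an OFFSET query reads every skipped row plus the page itself, and that a sorted key query reads only the page. Real numbers depend on the database and its query planner, and the PagingCost class below is a hypothetical illustration rather than a benchmark:

```java
// Hypothetical back-of-the-envelope cost model, not a benchmark
final class PagingCost {

    // Rows walked when every page query reads the skipped rows and then the page itself
    static long rowsTouchedWithOffset(long totalRows, int pageSize) {
        long touched = 0;
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            long pageRows = Math.min(pageSize, totalRows - offset);
            touched += offset + pageRows;
        }
        return touched;
    }

    // Rows walked when every page query jumps to its lower bound through the index
    static long rowsTouchedWithSortedKey(long totalRows) {
        return totalRows;
    }
}
```

Under these assumptions, reading our 1M rows in pages of 100,000 walks 5.5M rows with LIMIT and OFFSET, compared to 1M rows with the sorted key, and the gap grows as the pages get smaller.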
As always, the example code is available over on GitHub.