1. Overview
In this tutorial, we’ll explore various ways of iterating through large data sets retrieved with Spring Data JPA.
First, we’ll use paginated queries and look at the difference between a Slice and a Page. After that, we’ll learn how to stream and process the data from the database without collecting it in memory.
2. Paginated Queries
A common approach for this situation is to use paginated queries. To do this, we need to define a batch size and execute multiple queries. As a result, we’ll be able to process all the entities in smaller batches and avoid loading large amounts of data in memory.
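To make the batching idea concrete before we bring in Spring Data types, here is a minimal plain-Java sketch. It assumes an in-memory list standing in for the table; BatchingSketch, fetchBatch and readInBatches are hypothetical names used only for illustration, and fetchBatch plays the role of one limit/offset query:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of batch processing; no Spring types are involved.
class BatchingSketch {

    // Hypothetical stand-in for one paginated query:
    // returns at most 'limit' rows, starting at position 'offset'.
    static <T> List<T> fetchBatch(List<T> table, int offset, int limit) {
        int from = Math.min(offset, table.size());
        int to = Math.min(from + limit, table.size());
        return new ArrayList<>(table.subList(from, to));
    }

    // Walks the whole "table" batch by batch and returns the batches in order.
    // The loop stops on the first empty batch, so a table whose size is an exact
    // multiple of batchSize costs one extra (empty) fetch.
    static <T> List<List<T>> readInBatches(List<T> table, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<T> batch = fetchBatch(table, offset, batchSize);
            if (batch.isEmpty()) {
                break;
            }
            batches.add(batch);
            offset += batchSize;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Ann", "Bob", "Cid", "Dee", "Eve");
        // prints [[Ann, Bob], [Cid, Dee], [Eve]]
        System.out.println(readInBatches(names, 2));
    }
}
```

Only one batch is held in memory at a time if each batch is processed and discarded instead of being collected, which is what the repository-based examples in the following sections do.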
2.1. Pagination Using Slices
For the code examples in this article, we’ll use the Student entity as the data model:
@Entity
public class Student {

    @Id
    @GeneratedValue
    private Long id;

    private String firstName;
    private String lastName;

    // constructor, getters and setters
}
Let’s add a method that queries all the students by firstName. With Spring Data JPA, we simply need to add a method to the JpaRepository that receives a Pageable as a parameter and returns a Slice:
@Repository
public interface StudentRepository extends JpaRepository<Student, Long> {
    Slice<Student> findAllByFirstName(String firstName, Pageable page);
}
Notice that the return type is Slice<Student>. A Slice holds a single batch of Student entities, so the first call gives us the first batch. It also exposes a hasNext() method that tells us whether another batch follows, that is, whether the batch we’re processing is the last one of the result set.
Moreover, we can move from one slice to the next one with the help of the method nextPageable(). This method returns the Pageable object needed for requesting the next slice. Therefore, we can retrieve all the data, slice by slice, with a combination of the two methods inside a while loop:
void processStudentsByFirstName(String firstName) {
    // Request the first slice, then follow nextPageable() until no slice is left
    Slice<Student> slice = repository.findAllByFirstName(firstName, PageRequest.of(0, BATCH_SIZE));
    List<Student> studentsInBatch = slice.getContent();
    studentsInBatch.forEach(emailService::sendEmailToStudent);

    while (slice.hasNext()) {
        slice = repository.findAllByFirstName(firstName, slice.nextPageable());
        slice.getContent().forEach(emailService::sendEmailToStudent);
    }
}
Let’s run a short test using a small batch size and follow the SQL statements. We expect multiple queries to be executed:
[main] DEBUG org.hibernate.SQL - select student0_.id as id1_0_, student0_.first_name as first_na2_0_, student0_.last_name as last_nam3_0_ from student student0_ where student0_.first_name=? limit ?
[main] DEBUG org.hibernate.SQL - select student0_.id as id1_0_, student0_.first_name as first_na2_0_, student0_.last_name as last_nam3_0_ from student student0_ where student0_.first_name=? limit ? offset ?
[main] DEBUG org.hibernate.SQL - select student0_.id as id1_0_, student0_.first_name as first_na2_0_, student0_.last_name as last_nam3_0_ from student student0_ where student0_.first_name=? limit ? offset ?
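Notice that none of these statements is a count query, yet hasNext() still works. One way to achieve this is to request one row more than the batch size and check whether that extra row came back. To our understanding, Spring Data JPA does something along these lines for Slice results, but the following is only a plain-Java sketch over an in-memory list, not the library’s implementation; SliceSketch and fetchSlice are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a tiny stand-in for a slice, not Spring Data's Slice type.
class SliceSketch<T> {

    final List<T> content;
    final boolean hasNext;

    SliceSketch(List<T> content, boolean hasNext) {
        this.content = content;
        this.hasNext = hasNext;
    }

    // Reads pageSize + 1 rows; the extra row is never returned,
    // it only signals that another slice exists.
    static <T> SliceSketch<T> fetchSlice(List<T> table, int pageNumber, int pageSize) {
        int from = Math.min(pageNumber * pageSize, table.size());
        int to = Math.min(from + pageSize + 1, table.size());
        List<T> rows = table.subList(from, to);

        boolean hasNext = rows.size() > pageSize;
        List<T> content = hasNext ? rows.subList(0, pageSize) : rows;
        return new SliceSketch<>(new ArrayList<>(content), hasNext);
    }

    public static void main(String[] args) {
        List<Integer> table = Arrays.asList(1, 2, 3, 4);
        SliceSketch<Integer> last = fetchSlice(table, 1, 2);
        // The second slice of four rows is the last one, and we know it
        // without issuing a further, empty query.
        System.out.println(last.content + " hasNext=" + last.hasNext);
    }
}
```

The trade-off of this look-ahead approach is that each query reads one extra row, which is usually far cheaper than counting all matching rows.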
2.2. Pagination Using Pages
As an alternative to Slice<>, we can also use Page<> as the return type of the query:
@Repository
public interface StudentRepository extends JpaRepository<Student, Long> {
    Slice<Student> findAllByFirstName(String firstName, Pageable page);
    Page<Student> findAllByLastName(String lastName, Pageable page);
}
The Page interface extends Slice, adding two other methods to it: getTotalPages() and getTotalElements().
Pages are often used as the return type when the paginated data is requested over the network. This way, the caller will know exactly how many rows are left and how many additional requests will be needed.
On the other hand, using Pages results in additional queries that count the rows meeting the criteria:
[main] DEBUG org.hibernate.SQL - select student0_.id as id1_0_, student0_.first_name as first_na2_0_, student0_.last_name as last_nam3_0_ from student student0_ where student0_.last_name=? limit ?
[main] DEBUG org.hibernate.SQL - select count(student0_.id) as col_0_0_ from student student0_ where student0_.last_name=?
[main] DEBUG org.hibernate.SQL - select student0_.id as id1_0_, student0_.first_name as first_na2_0_, student0_.last_name as last_nam3_0_ from student student0_ where student0_.last_name=? limit ? offset ?
[main] DEBUG org.hibernate.SQL - select count(student0_.id) as col_0_0_ from student student0_ where student0_.last_name=?
[main] DEBUG org.hibernate.SQL - select student0_.id as id1_0_, student0_.first_name as first_na2_0_, student0_.last_name as last_nam3_0_ from student student0_ where student0_.last_name=? limit ? offset ?
Consequently, we should only use Page<> as the return type if we need to know the total number of entities.
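To illustrate what a caller gains from the total count, here is a small sketch of the arithmetic behind "how many requests are needed" and "how many rows are left". It is plain Java with hypothetical names (PageMathSketch, totalPages, remainingElements) and is meant as an illustration of the calculation, not as Spring Data’s own code:

```java
// Illustrative arithmetic only; the inputs correspond to what a caller would
// read from getTotalElements() plus the page number and page size it requested.
class PageMathSketch {

    // Number of pages needed to cover totalElements rows (ceiling division).
    static int totalPages(long totalElements, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (int) ((totalElements + pageSize - 1) / pageSize);
    }

    // Rows still to be fetched after the zero-based page 'pageNumber' was received.
    static long remainingElements(long totalElements, int pageNumber, int pageSize) {
        long consumed = Math.min(totalElements, (long) (pageNumber + 1) * pageSize);
        return totalElements - consumed;
    }

    public static void main(String[] args) {
        // 5 matching rows, batches of 2
        System.out.println(totalPages(5, 2));           // prints 3
        System.out.println(remainingElements(5, 0, 2)); // prints 3
    }
}
```

A loop that only needs to know whether to continue gets nothing from these numbers, which is why the extra count queries are wasted work in that case.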
3. Streaming From the Database
Spring Data JPA also allows us to stream the data from the result set:
Stream<Student> findAllByFirstName(String firstName);
As a result, we’ll process the entities one by one, without loading them all in memory at the same time. However, we’ll need to manually close the stream created by Spring Data JPA, using a try-with-resources block. Furthermore, we’ll have to wrap the query in a read-only transaction.
Lastly, even if we process the rows one by one, we have to make sure the persistence context isn’t keeping a reference to all the entities. We can achieve this by manually detaching each entity before it’s processed:
private final EntityManager entityManager;

@Transactional(readOnly = true)
public void processStudentsByFirstNameUsingStreams(String firstName) {
    try (Stream<Student> students = repository.findAllByFirstName(firstName)) {
        // Detach each entity as it passes through, so the persistence context doesn't accumulate them
        students.peek(entityManager::detach)
          .forEach(emailService::sendEmailToStudent);
    }
}
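The reason for the try-with-resources block is that a terminal operation such as forEach() doesn’t close a java.util.stream.Stream by itself. The following plain-Java sketch shows this with an onClose() handler standing in for the release of the underlying database resources; StreamCloseSketch, openStream and consume are hypothetical names, and no database is involved:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

class StreamCloseSketch {

    // Hypothetical stand-in for a repository method whose Stream holds an open resource.
    // The flag is flipped only when close() is called on the stream.
    static Stream<String> openStream(List<String> rows, AtomicBoolean resourceReleased) {
        return rows.stream().onClose(() -> resourceReleased.set(true));
    }

    // Consumes the stream inside try-with-resources, so close() runs
    // even if processing throws.
    static List<String> consume(List<String> rows, AtomicBoolean resourceReleased) {
        List<String> processed = new ArrayList<>();
        try (Stream<String> stream = openStream(rows, resourceReleased)) {
            stream.forEach(processed::add);
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("Ann", "Bob");

        AtomicBoolean leaked = new AtomicBoolean(false);
        openStream(rows, leaked).forEach(row -> { });
        System.out.println(leaked.get()); // prints false: forEach() alone didn't close the stream

        AtomicBoolean released = new AtomicBoolean(false);
        consume(rows, released);
        System.out.println(released.get()); // prints true: try-with-resources closed it
    }
}
```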
4. Conclusion
In this article, we explored various ways of processing large data sets. Initially, we achieved this through multiple paginated queries. We learned that we should use Page<> as the return type when the caller needs to know the total number of elements, and Slice<> otherwise. After that, we learned how to stream the rows from the database and process them individually.
As always, the code samples can be found over on GitHub.