1. Overview
Sorting is a fundamental operation in computer science, essential for data organization and manipulation across various applications.
In this tutorial, we'll compare Java's commonly used sorting methods: Arrays.sort() and Collections.sort(). While both serve the same primary function of sorting data, each method has its own features, caveats, and optimal use cases.
2. Basic Overview
Let's begin by examining the fundamental differences between these two methods.
2.1. Arrays.sort()
The Arrays.sort() method is a utility function for sorting arrays in Java. It allows us to sort arrays of both primitive data types and objects. Whether we're working with numerical data or alphabetical strings, Arrays.sort() can arrange the elements in ascending order. Additionally, we can customize the behavior with comparators for object arrays. This method is part of the java.util.Arrays class, which provides a suite of utilities for array manipulation.
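For example, here's a minimal illustration of these overloads (the sample values are arbitrary):

```java
import java.util.Arrays;
import java.util.Comparator;

public class ArraysSortExample {
    public static void main(String[] args) {
        // Primitives: sorted in ascending order
        int[] numbers = {5, 2, 9, 1};
        Arrays.sort(numbers);
        System.out.println(Arrays.toString(numbers)); // [1, 2, 5, 9]

        // Objects: natural ordering by default...
        String[] names = {"Charlie", "Alice", "Bob"};
        Arrays.sort(names);
        System.out.println(Arrays.toString(names)); // [Alice, Bob, Charlie]

        // ...or a custom Comparator, e.g. descending order
        Arrays.sort(names, Comparator.reverseOrder());
        System.out.println(Arrays.toString(names)); // [Charlie, Bob, Alice]
    }
}
```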
2.2. Collections.sort()
On the other hand, Collections.sort() is designed for sorting instances of the List interface in Java's Collections Framework. Unlike Arrays.sort(), which is limited to arrays, Collections.sort() can sort more dynamic data structures like ArrayList, LinkedList, and other classes that implement the List interface. Collections.sort() is a member of the java.util.Collections class, a utility class filled with static methods for operating on collections.
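A quick illustration (the sample values are arbitrary):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CollectionsSortExample {
    public static void main(String[] args) {
        // Works on any mutable List implementation
        List<String> names = new ArrayList<>(List.of("Charlie", "Alice", "Bob"));
        Collections.sort(names);
        System.out.println(names); // [Alice, Bob, Charlie]

        // A Comparator overload is available as well
        Collections.sort(names, Collections.reverseOrder());
        System.out.println(names); // [Charlie, Bob, Alice]
    }
}
```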
3. Stability
Let's imagine that we have a collection of tasks:
List<Task> tasks = new ArrayList<>();
tasks.add(new Task(1, 1, "2023-09-01"));
tasks.add(new Task(2, 2, "2023-08-30"));
tasks.add(new Task(3, 1, "2023-08-29"));
tasks.add(new Task(4, 2, "2023-09-02"));
tasks.add(new Task(5, 3, "2023-09-05"));
tasks.add(new Task(6, 1, "2023-09-03"));
tasks.add(new Task(7, 3, "2023-08-28"));
tasks.add(new Task(8, 2, "2023-09-01"));
tasks.add(new Task(9, 1, "2023-08-28"));
tasks.add(new Task(10, 2, "2023-09-04"));
tasks.add(new Task(11, 3, "2023-08-31"));
tasks.add(new Task(12, 1, "2023-08-30"));
tasks.add(new Task(13, 3, "2023-09-02"));
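The Task class itself isn't shown above; a minimal version consistent with these constructor calls might look like the following sketch. The field names, accessors, and toString() format are inferred from the code and output in this section, not taken from the article's repository:

```java
import java.time.LocalDate;

public class Task {
    private final int id;
    private final int priority;
    private final LocalDate dueDate;

    public Task(int id, int priority, String dueDate) {
        this.id = id;
        this.priority = priority;
        this.dueDate = LocalDate.parse(dueDate);
    }

    public int getPriority() { return priority; }
    public LocalDate getDueDate() { return dueDate; }

    @Override
    public String toString() {
        return "Task: #" + id + " | Priority: " + priority + " | Due Date: " + dueDate;
    }
}
```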
We would like to sort them first by priority and then by due date. We'll do this with two different approaches. In the first case, we'll use the stable algorithm provided by Collections:
final List<Task> tasks = Tasks.supplier.get();
Collections.sort(tasks, Comparator.comparingInt(Task::getPriority));
System.out.println("After sorting by priority:");
for (Task task : tasks) {
    System.out.println(task);
}
Collections.sort(tasks, Comparator.comparing(Task::getDueDate));
System.out.println("\nAfter sorting by due date:");
for (Task task : tasks) {
    System.out.println(task);
}
Also, we'll sort the tasks using a non-stable algorithm. Because Java doesn't offer the option to sort a List of reference types with a non-stable algorithm, we use a simple implementation of quicksort:
List<Task> tasks = Tasks.supplier.get();
quickSort(tasks, Comparator.comparingInt(Task::getPriority));
System.out.println("After sorting by priority:");
for (Task task : tasks) {
    System.out.println(task);
}
quickSort(tasks, Comparator.comparing(Task::getDueDate));
System.out.println("\nAfter sorting by due date:");
for (Task task : tasks) {
    System.out.println(task);
}
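The quickSort helper used above isn't part of the standard library. A minimal comparator-based version might look like this sketch, using a Lomuto-style partition (this is an assumption for illustration, not necessarily the article's exact implementation):

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class QuickSortUtil {
    // Sorts the list in place; like most naive quicksorts, this is non-stable
    public static <T> void quickSort(List<T> list, Comparator<? super T> comparator) {
        quickSort(list, 0, list.size() - 1, comparator);
    }

    private static <T> void quickSort(List<T> list, int low, int high, Comparator<? super T> c) {
        if (low < high) {
            int p = partition(list, low, high, c);
            quickSort(list, low, p - 1, c);
            quickSort(list, p + 1, high, c);
        }
    }

    // Lomuto partition: the last element is the pivot
    private static <T> int partition(List<T> list, int low, int high, Comparator<? super T> c) {
        T pivot = list.get(high);
        int i = low - 1;
        for (int j = low; j < high; j++) {
            if (c.compare(list.get(j), pivot) <= 0) {
                i++;
                Collections.swap(list, i, j);
            }
        }
        Collections.swap(list, i + 1, high);
        return i + 1;
    }
}
```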
The code is otherwise the same; the only difference is the algorithm used. The sorting happens in two steps: the first sorts the tasks by priority, and the second by due date.
The difference in the results is subtle but might significantly affect the code's functionality and introduce hard-to-debug bugs. The stable version produces the following output:
After sorting by due date:
Task: #9 | Priority: 1 | Due Date: 2023-08-28
Task: #7 | Priority: 3 | Due Date: 2023-08-28
Task: #3 | Priority: 1 | Due Date: 2023-08-29
Task: #12 | Priority: 1 | Due Date: 2023-08-30
Task: #2 | Priority: 2 | Due Date: 2023-08-30
Task: #11 | Priority: 3 | Due Date: 2023-08-31
Task: #1 | Priority: 1 | Due Date: 2023-09-01
Task: #8 | Priority: 2 | Due Date: 2023-09-01
Task: #4 | Priority: 2 | Due Date: 2023-09-02
Task: #13 | Priority: 3 | Due Date: 2023-09-02
Task: #6 | Priority: 1 | Due Date: 2023-09-03
Task: #10 | Priority: 2 | Due Date: 2023-09-04
Task: #5 | Priority: 3 | Due Date: 2023-09-05
The tasks are sorted by date, and where the dates are the same, the previous ordering by priority is preserved. The non-stable version, on the other hand, gives us this:
After sorting by due date:
Task: #9 | Priority: 1 | Due Date: 2023-08-28
Task: #7 | Priority: 3 | Due Date: 2023-08-28
Task: #3 | Priority: 1 | Due Date: 2023-08-29
Task: #2 | Priority: 2 | Due Date: 2023-08-30
Task: #12 | Priority: 1 | Due Date: 2023-08-30
Task: #11 | Priority: 3 | Due Date: 2023-08-31
Task: #1 | Priority: 1 | Due Date: 2023-09-01
Task: #8 | Priority: 2 | Due Date: 2023-09-01
Task: #4 | Priority: 2 | Due Date: 2023-09-02
Task: #13 | Priority: 3 | Due Date: 2023-09-02
Task: #6 | Priority: 1 | Due Date: 2023-09-03
Task: #10 | Priority: 2 | Due Date: 2023-09-04
Task: #5 | Priority: 3 | Due Date: 2023-09-05
Tasks #2 and #12 have the same due date, but their relative order by priority is reversed. This is because a non-stable algorithm makes no guarantee about the relative order of equal but distinguishable items.
Because primitives have no identity or additional attributes, we can sort them with a non-stable algorithm. A primitive carries only its value, which is why we don't care about the stability of the algorithms we use for primitives. As the example above shows, however, stability is crucial when sorting objects.
That's why Arrays.sort() uses a non-stable algorithm for primitives, such as Quicksort or Dual-Pivot Quicksort. When dealing with reference types, both Arrays.sort() and Collections.sort() use the same stable implementation, usually Merge Sort or TimSort.
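We can observe the stability guarantee for reference types directly. In this small demo, sorting strings by length preserves the original relative order of equal-length elements:

```java
import java.util.Arrays;
import java.util.Comparator;

public class StableSortDemo {
    public static void main(String[] args) {
        // Two equal-length strings: "bb" appears before "aa" initially
        String[] words = {"bb", "ccc", "aa", "d"};
        // The Object[] overload of Arrays.sort() is stable (TimSort),
        // so "bb" stays before "aa" after sorting by length
        Arrays.sort(words, Comparator.comparingInt(String::length));
        System.out.println(Arrays.toString(words)); // [d, bb, aa, ccc]
    }
}
```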
4. Complexity
Let's build a simple example comparing Merge Sort and Quicksort to show the difference between these algorithms and, by extension, between Collections.sort() and Arrays.sort(). We'll use simple implementations of both. We do this partly because Java doesn't let us pick the underlying sorting algorithm directly, and partly because the library implementations contain so many tweaks and improvements that it's hard to develop comparable test cases for them.
We will run the following tests to compare the throughput:
@Measurement(iterations = 2, time = 10, timeUnit = TimeUnit.MINUTES)
@Warmup(iterations = 5, time = 10)
public class PerformanceBenchmark {

    private static final Random RANDOM = new Random();
    private static final int ARRAY_SIZE = 10000;
    private static final int[] randomNumbers = RANDOM.ints(ARRAY_SIZE).toArray();
    private static final int[] sameNumbers = IntStream.generate(() -> 42).limit(ARRAY_SIZE).toArray();

    public static final Supplier<int[]> randomNumbersSupplier = randomNumbers::clone;
    public static final Supplier<int[]> sameNumbersSupplier = sameNumbers::clone;

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @Fork(value = 1, jvmArgs = {"-Xlog:gc:file=gc-logs-quick-sort-same-number-%t.txt,filesize=900m", "-Xmx6g", "-Xms6g"})
    public void quickSortSameNumber() {
        Quicksort.sort(sameNumbersSupplier.get());
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @Fork(value = 1, jvmArgs = {"-Xlog:gc:file=gc-logs-quick-sort-random-number-%t.txt,filesize=900m", "-Xmx6g", "-Xms6g"})
    public void quickSortRandomNumber() {
        Quicksort.sort(randomNumbersSupplier.get());
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @Fork(value = 1, jvmArgs = {"-Xlog:gc:file=gc-logs-merge-sort-same-number-%t.txt,filesize=900m", "-Xmx6g", "-Xms6g"})
    public void mergeSortSameNumber() {
        MergeSort.sort(sameNumbersSupplier.get());
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @Fork(value = 1, jvmArgs = {"-Xlog:gc:file=gc-logs-merge-sort-random-number-%t.txt,filesize=900m", "-Xmx6g", "-Xms6g"})
    public void mergeSortRandomNumber() {
        MergeSort.sort(randomNumbersSupplier.get());
    }
}
After running these tests, we get two artifacts: the performance metrics and the garbage collection logs.
4.1. Time Complexity
Let's review the performance metrics for the tests above:
Benchmark Mode Cnt Score Error Units
PerformanceBenchmark.mergeSortRandomNumber thrpt 4 1489.983 ± 401.330 ops/s
PerformanceBenchmark.quickSortRandomNumber thrpt 4 2757.273 ± 29.606 ops/s
Firstly, sorting random numbers with Quicksort is, in general, almost twice as fast as with Merge Sort. The fact that Quicksort works in place reduces its space complexity, which in turn affects performance, as we discuss in the next section.
Also, we can see that Quicksort may degrade quite rapidly in some cases:
Benchmark Mode Cnt Score Error Units
PerformanceBenchmark.mergeSortSameNumber thrpt 4 5295.502 ± 98.624 ops/s
PerformanceBenchmark.quickSortSameNumber thrpt 4 118.211 ± 0.117 ops/s
For example, when all the numbers are the same, Merge Sort performs much better and is blazingly fast. Although we're using simple implementations of Quicksort and Merge Sort, their behavior is generally similar to that of their more optimized and complex counterparts.
Please keep in mind that an algorithm's performance and its time complexity might not correlate, as there are many additional factors to consider: space complexity, hidden constant factors, optimizations, adaptivity, and so on.
The worst-case upper bound of Quicksort and Dual-Pivot Quicksort, O(n^2), is higher than the O(n log n) of Merge Sort or TimSort. However, thanks to a series of improvements and checks, these degenerate cases become rare enough that, overall, they can be ignored. Thus, we can assume that Merge Sort, TimSort, Quicksort, and Dual-Pivot Quicksort have effectively the same time complexity.
The DualPivotQuicksort.sort() method, for example, considers many parameters, such as parallelization, array size, presorted runs, and even recursion depth. Depending on the primitive type and the size of the array, Java can switch to different algorithms, like Insertion Sort or Counting Sort. That's why it's hard to compare highly optimized algorithms; they generally produce similar results.
4.2. Space Complexity
As mentioned previously, while non-stable, the Quicksort and Dual-Pivot Quicksort algorithms come with a trade-off: they use less space. Depending on the implementation, they have at most O(log n) space complexity. This is a nice property that can have a significant performance impact. In our case, let's concentrate on sorting random numbers.
While the time complexity of these algorithms is considered roughly the same, we have dramatic differences in the performance:
Benchmark Mode Cnt Score Error Units
PerformanceBenchmark.mergeSortRandomNumber thrpt 4 1489.983 ± 401.330 ops/s
PerformanceBenchmark.quickSortRandomNumber thrpt 4 2757.273 ± 29.606 ops/s
To investigate this difference, we can look at the garbage collection logs. We'll use the IBM Garbage Collection and Memory Visualizer:
| Variant | mergeSort | quickSort |
|---|---|---|
| Forced collection count | 0 | 0 |
| Full collections | 0 | 0 |
| GC Mode | G1 | G1 |
| Mean garbage collection pause (ms) | 0.33 | 0.47 |
| Number of collections triggered by allocation failure | 26848 | 588 |
| Proportion of time spent in garbage collection pauses (%) | 0.72 | 0.02 |
| Proportion of time spent in stop-the-world garbage collection pauses (%) | 0.72 | 0.02 |
| Proportion of time spent unpaused (%) | 99.28 | 99.98 |
| Young collections – Mean garbage collection pause (ms) | 0.33 | 0.47 |
| Young collections – Mean interval between collections (ms) | 46.6 | 2124 |
As we can see, the number of garbage collection events is significantly higher for Merge Sort (26848 vs. 588), which is understandable, as this algorithm uses more space.
4.3. Optimization
Because Merge Sort and TimSort require more space, using a non-stable algorithm for primitives is an optimization, assuming that the degradation of Quicksort and Dual-Pivot Quicksort to O(n^2) is negligible. Technically, it's also possible to use a non-stable sorting algorithm for a collection of reference types and get a performance boost. This works when stability isn't important or equal objects are indistinguishable.
One of the improvements we can use for wrapper classes is to convert them into primitives, do the sorting, and convert them back. Let's consider the following test:
@Measurement(iterations = 2, time = 10, timeUnit = TimeUnit.MINUTES)
@Warmup(iterations = 5, time = 10)
@Fork(value = 2)
public class ObjectOverheadBenchmark {

    @State(Scope.Benchmark)
    public static class Input {
        // ThreadLocalRandom.current() is invoked lazily inside the supplier,
        // so each benchmark thread gets its own instance
        public Supplier<List<Integer>> randomNumbers =
          () -> ThreadLocalRandom.current().ints().limit(10000).boxed().collect(Collectors.toList());
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void sortingPrimitiveArray(Input input, Blackhole blackhole) {
        final int[] array = input.randomNumbers.get().stream().mapToInt(Integer::intValue).toArray();
        Arrays.sort(array);
        final List<Integer> result = Arrays.stream(array).boxed().collect(Collectors.toList());
        blackhole.consume(result);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void sortingObjectArray(Input input, Blackhole blackhole) {
        final Integer[] array = input.randomNumbers.get().toArray(new Integer[0]);
        Arrays.sort(array);
        blackhole.consume(array);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void sortingObjects(Input input, Blackhole blackhole) {
        final List<Integer> list = input.randomNumbers.get();
        Collections.sort(list);
        blackhole.consume(list);
    }
}
The fact that we're using Arrays.sort() on a primitive array here gives us significantly better performance than Collections.sort():
Benchmark Mode Cnt Score Error Units
ObjectOverheadBenchmark.sortingObjectArray thrpt 4 982.849 ± 19.201 ops/s
ObjectOverheadBenchmark.sortingObjects thrpt 4 976.778 ± 10.580 ops/s
ObjectOverheadBenchmark.sortingPrimitiveArray thrpt 4 1998.818 ± 373.654 ops/s
Sorting an int[] produces more than a 100% increase in performance.
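Outside a benchmark harness, the convert-sort-convert trick looks like this sketch (the class and method names are illustrative; whether the boxing round-trip actually pays off should be measured case by case):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class UnboxingSortExample {
    // Sorts a List<Integer> by unboxing to int[], sorting with the
    // primitive (non-stable) Arrays.sort(), and boxing the result back
    public static List<Integer> sortViaPrimitives(List<Integer> input) {
        int[] array = input.stream().mapToInt(Integer::intValue).toArray();
        Arrays.sort(array);
        return Arrays.stream(array).boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sortViaPrimitives(List.of(5, 2, 9, 1))); // [1, 2, 5, 9]
    }
}
```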
5. Implementation Details of Arrays.sort() and Collections.sort()
Please note that the algorithms and tests used in the previous sections don't reflect the performance of the library implementations, which include more complex logic that lets them optimize specific cases. The tests are presented only to give more visual insight into the inner workings of simple implementations of these two kinds of algorithms.
It's virtually impossible to compare Collections.sort() and Arrays.sort() without considering their underlying algorithms.
The underlying algorithms are the crucial part contributing to these two methods' complexity and performance. Because Collections.sort() is implemented with stability in mind, it uses Merge Sort or TimSort. Sorting primitives, on the other hand, doesn't require this property, so Arrays.sort() can use Quicksort or Dual-Pivot Quicksort.
To better understand these methods, we looked at the sorting algorithms they use directly.
6. Conclusion
Sorting is one of the cornerstone operations in computer science, enabling efficient data manipulation across many applications. By understanding the differences between algorithms, we can make more informed decisions when coding and optimize for performance and functionality according to our specific requirements.
Although this article aims to highlight the difference between Collections.sort() and Arrays.sort(), it's also a useful guide to the differences between sorting algorithms in general. As always, the code for these examples can be found over on GitHub.