1. Introduction
1.绪论
In this article, we’ll see how to find the kth-smallest element in the union of two sorted arrays.
在这篇文章中,我们将看到如何在两个排序数组的联合中找到k第1个最小的元素。
First, we’ll define the exact problem. Second, we’ll see two inefficient but straightforward solutions. Third, we’ll look at an efficient solution based on a binary search on the two arrays. Finally, we’ll look at some tests to verify that our algorithm works.
首先,我们将定义确切的问题。第二,我们将看到两个低效但直接的解决方案。第三,我们将看到一个基于两个数组的二进制搜索的有效解决方案。最后,我们将看一些测试来验证我们的算法是否有效。
We’ll also see Java code snippets for all parts of the algorithm. For simplicity, our implementation will only operate on integers. However, the described algorithm works with all data types that are comparable and could even be implemented using Generics.
我们还将看到该算法所有部分的Java代码片段。为了简单起见,我们的实现将只对整数进行操作。然而,所描述的算法适用于所有可比较的数据类型,甚至可以使用泛型实现。
2. What Is the Kth Smallest Element in the Union of Two Sorted Arrays?
2.什么是两个排序数组的联合体中的K个最小的元素?
2.1. The Kth Smallest Element
2.1.Kth最小的元素
To find the kth-smallest element, also called the kth-order statistic, in an array, we typically use a selection algorithm. However, these algorithms operate on a single, unsorted array, whereas in this article, we want to find the kth smallest element in two sorted arrays.
为了找到一个数组中的kth-smallest元素,也称为kth-order统计,我们通常使用选择算法。然而,这些算法是针对单个未排序的数组进行操作的,而在本文中,我们要在两个排序的数组中找到k第1个最小的元素。
Before we see several solutions to the problem, let’s exactly define what we want to achieve. For that, let’s jump right into an example.
在我们看到这个问题的几种解决方案之前,让我们确切地定义我们想要实现的目标。为此,让我们直接跳入一个例子。
We are given two sorted arrays (a and b), which do not necessarily need to have an equal number of elements:
我们给了两个排序的数组(a和b),它们的元素数量不一定相等。
In these two arrays, we want to find the kth smallest element. More specifically, we want to find the kth smallest element in the combined and sorted array:
在这两个数组中,我们想找到k第最小的元素。更确切地说,我们想在合并和排序的数组中找到第k个最小的元素。
The combined and sorted array for our example is shown in (c). The 1st smallest element is 3, and the 4th smallest element is 20.
我们的例子的组合和排序的数组显示在(c)。第1个最小的元素是3,第4个最小元素是20。
2.2. Duplicate Values
2.2.重复值
We’ll also need to define how to handle duplicate values. An element could occur more than once in one of the arrays (element 3 in array a) and also occur again in the second array (b).
我们还需要定义如何处理重复的值。一个元素可能在一个数组中出现不止一次(元素3在数组a中),也可能在第二个数组中再次出现(b)。
If we count duplicates only once, we’ll count as shown in (c). If we count all occurrences of an element, we’ll count as shown in (d).
如果我们只计算重复的一次,我们将计算如(c)所示。如果我们计算一个元素的所有出现次数,我们将计算如(d)所示。
In the remaining part of this article, we’ll count duplicates as shown in (d), thus counting them as if they were distinct elements.
在本文的剩余部分,我们将按照(d)所示计算重复的内容,从而把它们当作不同的元素来计算。
3. Two Simple but Less Efficient Approaches
3.两种简单但效率较低的方法
3.1. Join and Then Sort the Two Arrays
3.1.将两个数组连接起来,然后进行排序
The simplest way to find the kth smallest element is to join the arrays, sort them, and return the kth element of the resulting array:
找到k第1个最小元素的最简单方法是连接数组,对其进行排序,并返回所得数组的k第1个元素。
int getKthElementSorted(int[] list1, int[] list2, int k) {
int length1 = list1.length, length2 = list2.length;
int[] combinedArray = new int[length1 + length2];
System.arraycopy(list1, 0, combinedArray, 0, list1.length);
System.arraycopy(list2, 0, combinedArray, list1.length, list2.length);
Arrays.sort(combinedArray);
return combinedArray[k-1];
}
With n being the length of the first array and m the length of the second array, we get the combined length c = n + m.
n是第一个数组的长度,m是第二个数组的长度,我们得到合并长度c=n+m。
Since the complexity for the sort is O(c log c), the overall complexity of this approach is O(n log n).
由于排序的复杂度为O(c log c),这种方法的总体复杂度为O(n log n)。
A disadvantage of this approach is that we need to create a copy of the array, which results in more space needed.
这种方法的缺点是,我们需要创建一个数组的副本,这导致需要更多的空间。
3.2. Merge the Two Arrays
3.2.合并两个数组
Similar to one single step of the Merge Sort sorting algorithm, we can merge the two arrays and then directly retrieve the kth element.
类似于合并排序排序算法的一个单一步骤,我们可以合并这两个数组,然后直接检索k第1个元素。
The basic idea of the merge algorithm is to start with two pointers, which point to the first elements of the first and second arrays (a).
合并算法的基本思想是以两个指针开始,这两个指针指向第一和第二数组的第一个元素(a)。
We then compare the two elements (3 and 4) at the pointers, add the smaller one (3) to the result, and move that pointer one position forward (b). Again, we compare the elements at the pointers and add the smaller one (4) to the result.
然后我们比较指针上的两个元素(3和4),将较小的元素(3)加入结果中,并将指针向前移动一个位置(b)。再一次,我们比较指针上的元素,并将较小的一个(4)加入到结果中。
We continue in the same way until all elements are added to the resulting array. If one of the input arrays does not have more elements, we simply copy all of the remaining elements of the other input array to the result array.
我们以同样的方式继续,直到所有的元素都加入到结果数组中。如果其中一个输入数组没有更多的元素,我们只需将另一个输入数组的所有剩余元素复制到结果数组中。
We can improve the performance if we don’t copy the full arrays, but stop when the resulting array has k elements. We don’t even need to create an additional array for the combined array but can operate on the original arrays only.
如果我们不复制完整的数组,而是在产生的数组有k个元素时停止,就可以提高性能。我们甚至不需要为合并后的数组创建一个额外的数组,而只需对原始数组进行操作即可。。
Here’s an implementation in Java:
这里有一个用Java实现的方法。
public static int getKthElementMerge(int[] list1, int[] list2, int k) {
int i1 = 0, i2 = 0;
while(i1 < list1.length && i2 < list2.length && (i1 + i2) < k) {
if(list1[i1] < list2[i2]) {
i1++;
} else {
i2++;
}
}
if((i1 + i2) < k) {
return i1 < list1.length ? list1[k - i2 - 1] : list2[k - i1 - 1];
} else if(i1 > 0 && i2 > 0) {
return Math.max(list1[i1-1], list2[i2-1]);
} else {
return i1 == 0 ? list2[i2-1] : list1[i1-1];
}
}
It’s straightforward to understand the time-complexity of this algorithm is O(k). An advantage of this algorithm is that it can be easily adapted to consider duplicate elements only once.
我们可以直接理解这个算法的时间复杂度是O(k)。这个算法的一个优点是,它可以很容易地调整为只考虑重复的元素。
4. A Binary Search Over Both Arrays
4.在两个数组上的二进制搜索
Can we do better than O(k)? The answer is that we can. The basic idea is to do a binary search algorithm over the two arrays.
我们能做到比O(k)更好吗?答案是,我们可以。基本思路是对两个数组进行二进制搜索算法。
For this to work, we need a data structure that provides constant-time read access to all its elements. In Java, that could be an array or an ArrayList.
要做到这一点,我们需要一个数据结构,对其所有元素提供持续的读取访问。在Java中,这可能是一个数组或一个ArrayList。
Let’s define the skeleton for the method we are going to implement:
让我们为我们要实现的方法定义一个骨架。
int findKthElement(int k, int[] list1, int[] list2)
throws NoSuchElementException, IllegalArgumentException {
// check input (see below)
// handle special cases (see below)
// binary search (see below)
}
Here, we pass k and the two arrays as arguments. First, we’ll validate the input; second, we handle some special cases and then do the binary search. In the next three sections, we’ll look at these three steps in reverse order, so first, we’ll see the binary search, second, the special cases, and finally, the parameter validation.
在这里,我们把k和两个数组作为参数传递。首先,我们将验证输入;其次,我们处理一些特殊情况,然后进行二进制搜索。在接下来的三节中,我们将按照相反的顺序来看这三个步骤,所以首先,我们将看到二进制搜索,其次是特殊情况,最后是参数验证。
4.1. The Binary Search
4.1.二进制搜索
The standard binary search, where we are looking for a specific element, has two possible outcomes: either we find the element we’re looking for and the search is successful, or we don’t find it and the search is unsuccessful. This is different in our case, where we want to find the kth smallest element. Here, we always have a result.
标准的二进制搜索,即我们要寻找一个特定的元素,有两种可能的结果:要么我们找到我们要找的元素,搜索成功;要么我们没有找到,搜索不成功。这在我们的案例中是不同的,我们想找到k第最小的元素。在这里,我们总是有一个结果。。
Let’s look at how to implement that.
让我们看看如何实现这一点。
4.1.1. Finding the Correct Number of Elements From Both Arrays
We start our search with a certain number of elements from the first array. Let’s call that number nElementsList1. As we need k elements in total, the number nElementsList1 is:
我们从第一个数组中的一定数量的元素开始我们的搜索。让我们把这个数字称为nElementsList1。由于我们总共需要k个元素,nElementsList1的数量是:。
int nElementsList2 = k - nElementsList1;
As an example, let’s say k = 8. We start with four elements from the first array and four elements from the second array (a).
作为一个例子,我们说k = 8。我们从第一个数组的四个元素和第二个数组的四个元素(a)开始。
If the 4th element in the first array is bigger than the 4th element in the second array, we know that we took too many elements from the first array and can decrease nElementsList1 (b). Otherwise, we know that we took too few elements and can increase nElementsList1 (b’).
如果第一个数组中的第4个元素大于第二个数组中的第4个元素,我们知道我们从第一个数组中抽取了太多的元素,可以减少nElementsList1(b)。否则,我们知道我们取的元素太少,可以增加nElementsList1 (b’)。
We continue until we have reached the stopping criteria. Before we look at what that is, let’s look at the code for what we’ve described so far:
我们继续下去,直到我们达到停止的标准。在我们看这是什么之前,让我们看一下到目前为止我们所描述的代码。
int right = k;
int left = = 0;
do {
nElementsList1 = ((left + right) / 2) + 1;
nElementsList2 = k - nElementsList1;
if(nElementsList2 > 0) {
if (list1[nElementsList1 - 1] > list2[nElementsList2 - 1]) {
right = nElementsList1 - 2;
} else {
left = nElementsList1;
}
}
} while(!kthSmallesElementFound(list1, list2, nElementsList1, nElementsList2));
4.1.2. Stopping Criteria
We can stop in two cases. First, we can stop if the maximum element we take from the first array is equal to the maximum element we take from the second (c). In this case, we can simply return that element.
我们可以在两种情况下停止。首先,如果我们从第一个数组中获取的最大元素等于我们从第二个数组中获取的最大元素(c),我们可以停止。在这种情况下,我们可以简单地返回该元素。
Second, we can stop if the following two conditions are met (d):
第二,如果满足以下两个条件,我们可以停下来(d)。
- The largest element to take from the first array is smaller than the smallest element we do not take from the second array (11 < 100).
- The largest element to take from the second array is smaller than the smallest element we do not take from the first array (21 < 27).
It’s easy to visualize (d’) why that condition works: all elements we take from the two arrays are surely smaller than any other element in the two arrays.
很容易理解(d’)为什么这个条件能起作用:我们从两个数组中取出的所有元素肯定比两个数组中的任何其他元素都小。
Here’s the code for the stopping criteria:
下面是停止标准的代码。
private static boolean foundCorrectNumberOfElementsInBothLists(int[] list1, int[] list2, int nElementsList1, int nElementsList2) {
// we do not take any element from the second list
if(nElementsList2 < 1) {
return true;
}
if(list1[nElementsList1-1] == list2[nElementsList2-1]) {
return true;
}
if(nElementsList1 == list1.length) {
return list1[nElementsList1-1] <= list2[nElementsList2];
}
if(nElementsList2 == list2.length) {
return list2[nElementsList2-1] <= list1[nElementsList1];
}
return list1[nElementsList1-1] <= list2[nElementsList2] && list2[nElementsList2-1] <= list1[nElementsList1];
}
4.1.3. The Return Value
Finally, we need to return the correct value. Here, we have three possible cases:
最后,我们需要返回正确的值。这里,我们有三种可能的情况。
- We take no elements from the second array, thus the target value is in the first array (e)
- The target value is in the first array (e’)
- The target value is in the second array (e”)
Let’s see this in code:
让我们在代码中看看这个。
return nElementsList2 == 0 ? list1[nElementsList1-1] : max(list1[nElementsList1-1], list2[nElementsList2-1]);
Note that we do not need to handle the case where we don’t take any element from the first array — we’ll exclude that case in the handling of special cases later.
注意,我们不需要处理不从第一个数组中提取任何元素的情况–我们将在后面的特殊情况处理中排除这种情况。
4.2. Initial Values for the Left and Right Borders
4.2.左边和右边边界的初始值
Until now, we initialized the right and left border for the first array with k and 0:
到目前为止,我们用k和0初始化了第一个数组的左右边界。
int right = k;
int left = 0;
However, depending on the value of k, we need to adapt these borders.
然而,根据k的值,我们需要调整这些边界。
First, if k exceeds the length of the first array, we need to take the last element as the right border. The reason for this is quite straightforward, as we cannot take more elements from the array than there are.
首先,如果k超过了第一个数组的长度,我们需要将最后一个元素作为右边界。这样做的原因很简单,因为我们不能从数组中取出比其数量更多的元素。
Second, if k is bigger than the number of elements in the second array, we know for sure that we need to take at least (k – length(list2)) from the first array. As an example, let’s say k = 7. As the second array only has four elements, we know that we need to take at least 3 elements from the first array, so we can set L to 2:
其次,如果k大于第二个数组中的元素数,我们可以肯定的是,我们至少需要从第一个数组中抽取(k – length(list2))。作为一个例子,我们假设k = 7。由于第二个数组只有四个元素,我们知道我们至少需要从第一个数组中取出3个元素,所以我们可以将L设为2:。
Here’s the code for the adapted left and right borders:
下面是改编后的左右边框的代码。
// correct left boundary if k is bigger than the size of list2
int left = k < list2.length ? 0 : k - list2.length - 1;
// the inital right boundary cannot exceed the list1
int right = min(k-1, list1.length - 1);
4.3. Handling of Special Cases
4.3.特殊案例的处理
Before we do the actual binary search, we can handle a few special cases to make the algorithm slightly less complicated and avoid exceptions. Here’s the code with explanations in the comments:
在我们进行实际的二进制搜索之前,我们可以处理一些特殊情况,使算法稍微不那么复杂,避免出现异常。下面是代码,注释里有解释。
// we are looking for the minimum value
if(k == 1) {
return min(list1[0], list2[0]);
}
// we are looking for the maximum value
if(list1.length + list2.length == k) {
return max(list1[list1.length-1], list2[list2.length-1]);
}
// swap lists if needed to make sure we take at least one element from list1
if(k <= list2.length && list2[k-1] < list1[0]) {
int[] list1_ = list1;
list1 = list2;
list2 = list1_;
}
4.4. Input Validation
4.4.输入验证
Let’s look at the input validation first. To prevent the algorithm from failing and throwing, for example, a NullPointerException or ArrayIndexOutOfBoundsException, we want to make sure that the three parameters meet the following conditions:
让我们先来看看输入验证的情况。为了防止算法失败并抛出例如NullPointerException或ArrayIndexOutOfBoundsException,我们要确保三个参数满足以下条件。
- Both arrays must not be null and have at least one element
- k must be >= 0 and cannot be bigger than the length of the two arrays together
Here’s our validation in code:
下面是我们的验证代码。
void checkInput(int k, int[] list1, int[] list2) throws NoSuchElementException, IllegalArgumentException {
if(list1 == null || list2 == null || k < 1) {
throw new IllegalArgumentException();
}
if(list1.length == 0 || list2.length == 0) {
throw new IllegalArgumentException();
}
if(k > list1.length + list2.length) {
throw new NoSuchElementException();
}
}
4.5. Full Code
4.5.完整的代码
Here’s the full code of the algorithm we’ve just described:
下面是我们刚刚描述的算法的完整代码。
public static int findKthElement(int k, int[] list1, int[] list2) throws NoSuchElementException, IllegalArgumentException {
checkInput(k, list1, list2);
// we are looking for the minimum value
if(k == 1) {
return min(list1[0], list2[0]);
}
// we are looking for the maximum value
if(list1.length + list2.length == k) {
return max(list1[list1.length-1], list2[list2.length-1]);
}
// swap lists if needed to make sure we take at least one element from list1
if(k <= list2.length && list2[k-1] < list1[0]) {
int[] list1_ = list1;
list1 = list2;
list2 = list1_;
}
// correct left boundary if k is bigger than the size of list2
int left = k < list2.length ? 0 : k - list2.length - 1;
// the inital right boundary cannot exceed the list1
int right = min(k-1, list1.length - 1);
int nElementsList1, nElementsList2;
// binary search
do {
nElementsList1 = ((left + right) / 2) + 1;
nElementsList2 = k - nElementsList1;
if(nElementsList2 > 0) {
if (list1[nElementsList1 - 1] > list2[nElementsList2 - 1]) {
right = nElementsList1 - 2;
} else {
left = nElementsList1;
}
}
} while(!kthSmallesElementFound(list1, list2, nElementsList1, nElementsList2));
return nElementsList2 == 0 ? list1[nElementsList1-1] : max(list1[nElementsList1-1], list2[nElementsList2-1]);
}
private static boolean foundCorrectNumberOfElementsInBothLists(int[] list1, int[] list2, int nElementsList1, int nElementsList2) {
// we do not take any element from the second list
if(nElementsList2 < 1) {
return true;
}
if(list1[nElementsList1-1] == list2[nElementsList2-1]) {
return true;
}
if(nElementsList1 == list1.length) {
return list1[nElementsList1-1] <= list2[nElementsList2];
}
if(nElementsList2 == list2.length) {
return list2[nElementsList2-1] <= list1[nElementsList1];
}
return list1[nElementsList1-1] <= list2[nElementsList2] && list2[nElementsList2-1] <= list1[nElementsList1];
}
5. Testing the Algorithm
5.测试算法
In our GitHub repository, there are many test cases that cover a lot of possible input arrays and also many corner cases.
在我们的GitHub资源库中,有许多测试案例,涵盖了许多可能的输入阵列,也有许多角落案例。
Here, we only point out one of the tests, which tests not against static input arrays but compares the result of our double binary search algorithm to the result of the simple join-and-sort algorithm. The input consists of two randomized arrays:
在这里,我们只指出其中一个测试,它不是针对静态输入数组进行测试,而是将我们的双二进制搜索算法的结果与简单的连接和排序算法的结果进行比较。输入由两个随机数组组成。
int[] sortedRandomIntArrayOfLength(int length) {
int[] intArray = new Random().ints(length).toArray();
Arrays.sort(intArray);
return intArray;
}
The following method performs one single test:
下面的方法执行一个单一的测试。
private void random() {
Random random = new Random();
int length1 = (Math.abs(random.nextInt())) % 1000 + 1;
int length2 = (Math.abs(random.nextInt())) % 1000 + 1;
int[] list1 = sortedRandomIntArrayOfLength(length1);
int[] list2 = sortedRandomIntArrayOfLength(length2);
int k = (Math.abs(random.nextInt()) + 1) % (length1 + length2);
int result = findKthElement(k, list1, list2);
int result2 = getKthElementSorted(list1, list2, k);
int result3 = getKthElementMerge(list1, list2, k);
assertEquals(result2, result);
assertEquals(result2, result3);
}
And we can call the above method to run a large number of tests like that:
而我们可以调用上述方法来运行大量这样的测试。
@Test
void randomTests() {
IntStream.range(1, 100000).forEach(i -> random());
}
6. Conclusion
6.结语
In this article, we saw several ways of how to find the kth smallest element in the union of two sorted arrays. First, we saw a simple and straightforward O(n log n) algorithm, then a version with complexity O(n), and last, an algorithm that runs in O(log n).
在这篇文章中,我们看到了如何在两个排序数组的联合体中找到k第最小的元素的几种方法。首先,我们看到了一个简单明了的O(n log n)算法,然后是一个复杂度为O(n)的版本,最后是一个能以O(log n)运行的算法。
The last algorithm we saw is a nice theoretical exercise; however, for most practical purposes, we should consider using one of the first two algorithms, which are much simpler than the binary search over two arrays. Of course, if performance is an issue, a binary search could be a solution.
我们看到的最后一种算法是一个很好的理论练习;然而,对于大多数实际的目的,我们应该考虑使用前两种算法中的一种,这比在两个数组上的二进制搜索要简单得多。当然,如果性能是一个问题,二进制搜索可能是一个解决方案。
All the code in this article is available over on GitHub.
本文的所有代码都可以在GitHub上找到。。