Performance of removeAll() in a HashSet – 在HashSet中removeAll()的性能

最后修改: 2020年 10月 21日

中文/混合/英文(键盘快捷键:t)

1. Overview

1.概述

HashSet is a collection for storing unique elements.

HashSet是一个用于存储独特元素的集合。

In this tutorial, we’ll discuss the performance of the removeAll() method in the java.util.HashSet class.

在本教程中,我们将讨论removeAll()方法在java.util.HashSet类中的性能。

2. HashSet.removeAll()

2.HashSet.removeAll()

The removeAll method removes all the elements, that are contained in the collection:

removeAll方法删除所有元素,这些元素包含在collection中。

Set<Integer> set = new HashSet<Integer>();
set.add(1);
set.add(2);
set.add(3);
set.add(4);

Collection<Integer> collection = new ArrayList<Integer>();
collection.add(1);
collection.add(3);

set.removeAll(collection);

Integer[] actualElements = new Integer[set.size()];
Integer[] expectedElements = new Integer[] { 2, 4 };
assertArrayEquals(expectedElements, set.toArray(actualElements));

As a result, elements 1 and 3 will be removed from the set.

因此,元素1和3将被从集合中删除。

3. Internal Implementation and Time Complexity

3.内部执行和时间的复杂性

The removeAll() method determines which one is smaller – the set or the collection. This is done by invoking the size() method on the set and the collection.

removeAll()方法决定了哪个更小–集合和数据集。这是通过在集合上调用size()方法来实现的。

If the collection has fewer elements than the set, then it iterates over the specified collection with the time complexity O(n). It also checks if the element is present in the set with the time complexity O(1). And if the element is present, it’s being removed from the set using the remove() method of the set, which again has a time complexity of O(1). So the overall time complexity is O(n).

如果集合中的元素少于设定的元素,那么它将以时间复杂度 O(n)在指定的集合中进行迭代。它还检查元素是否存在于集合中,其时间复杂度为O(1)。如果该元素存在,则使用集合的remove()方法将其从集合中移除,其时间复杂度同样为O(1)。所以整体时间复杂度是O(n)

If the set has fewer elements than the collection, then it iterates over this set using O(n). Then it checks if each element is present in the collection by invoking its contains() method. And if such an element is present, then the element is removed from the set. So this depends on the time complexity of the contains() method.

如果集合的元素少于集合的元素,那么它就用O(n)对这个集合进行迭代。然后,它通过调用contains()方法检查每个元素是否存在于集合中。如果有这样的元素存在,那么这个元素就会从集合中移除。所以这取决于contains()方法的时间复杂性。

Now in this case, if the collection is an ArrayList, the time complexity of the contains() method is O(m). So overall time complexity to remove all elements present in the ArrayList from the set is O(n * m).

现在在这种情况下,如果集合是ArrayListcontains()方法的时间复杂度是O(m)。所以从集合中移除ArrayList中的所有元素的总体时间复杂性是O(n * m)

If the collection is again HashSet, the time complexity of the contains() method is O(1). So overall time complexity to remove all elements present in the HashSet from the set is O(n).

如果集合又是HashSetcontains()方法的时间复杂度为O(1)。所以从集合中移除HashSet中存在的所有元素的总体时间复杂性是O(n)

4. Performance

4.业绩

To see the performance difference between the above 3 cases, let’s write a simple JMH benchmark test.

为了了解上述3种情况的性能差异,让我们编写一个简单的JMH基准测试。

For the first case, we’ll initialize the set and collection, where we have more elements in the set than the collection. In the second case, we’ll initialize the set and the collection, where we have more elements in the collection than the set. And in the third case, we’ll initialize 2 sets, where we’ll have 2nd set having more number of elements than the 1st one:

对于第一种情况,我们将初始化集合和收藏,在集合中的元素比收藏中的元素多。在第二种情况下,我们将初始化集合和收藏,其中收藏中的元素比集合中的元素多。在第三种情况下,我们将初始化2个集合,其中第2个集合的元素数量比第1个多。

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)
public class HashSetBenchmark {

    @State(Scope.Thread)
    public static class MyState {
        private Set employeeSet1 = new HashSet<>();
        private List employeeList1 = new ArrayList<>();
        private Set employeeSet2 = new HashSet<>();
        private List employeeList2 = new ArrayList<>();
        private Set<Employee> employeeSet3 = new HashSet<>();
        private Set<Employee> employeeSet4 = new HashSet<>();

        private long set1Size = 60000;
        private long list1Size = 50000;
        private long set2Size = 50000;
        private long list2Size = 60000;
        private long set3Size = 50000;
        private long set4Size = 60000;

        @Setup(Level.Trial)
        public void setUp() {
            // populating sets
        }
    }
}

After, we add our benchmark tests:

之后,我们加入我们的基准测试。

@Benchmark
public boolean given_SizeOfHashsetGreaterThanSizeOfCollection_whenRemoveAllFromHashSet_thenGoodPerformance(MyState state) {
    return state.employeeSet1.removeAll(state.employeeList1);
}

@Benchmark
public boolean given_SizeOfHashsetSmallerThanSizeOfCollection_whenRemoveAllFromHashSet_thenBadPerformance(MyState state) {
    return state.employeeSet2.removeAll(state.employeeList2);
}

@Benchmark
public boolean given_SizeOfHashsetSmallerThanSizeOfAnotherHashSet_whenRemoveAllFromHashSet_thenGoodPerformance(MyState state) {
    return state.employeeSet3.removeAll(state.employeeSet4);
}

And here are the results:

下面是结果。

Benchmark                                              Mode  Cnt            Score            Error  Units
HashSetBenchmark.testHashSetSizeGreaterThanCollection  avgt   20      2700457.099 ±     475673.379  ns/op
HashSetBenchmark.testHashSetSmallerThanCollection      avgt   20  31522676649.950 ± 3556834894.168  ns/op
HashSetBenchmark.testHashSetSmallerThanOtherHashset    avgt   20      2672757.784 ±     224505.866  ns/op

We can see the HashSet.removeAll() performs pretty bad when the HashSet has fewer elements than the Collection, which is passed as an argument to the removeAll() method. But when the other collection is again HashSet, then the performance is good.

我们可以看到,当HashSet的元素少于Collection时,HashSet.removeAll()的表现相当糟糕,该集合被作为参数传递给removeAll方法。但是当另一个集合又是HashSet时,那么性能就很好。

5. Conclusion

5.总结

In this article, we saw the performance of removeAll() in HashSet. When the set has fewer elements than the collection, then the performance of removeAll() depends on the time complexity of the contains() method of the collection.

在这篇文章中,我们看到了removeAll()在HashSet中的性能。当集合的元素少于集合时,那么removeAll()的性能就取决于集合的contains()方法的时间复杂性。

As usual, the complete code for this article is available over on GitHub.

像往常一样,本文的完整代码可在GitHub上获得