A Guide to HashSet in Java – Java中的HashSet指南

最后修改: 2017年 12月 11日

中文/混合/英文(键盘快捷键:t)

1. Overview

1.概述

In this article, we’ll dive into HashSet. It’s one of the most popular Set implementations as well as an integral part of the Java Collections Framework.

在这篇文章中,我们将深入探讨HashSet。它是最受欢迎的Set实现之一,也是Java集合框架的一个组成部分。

2. Intro to HashSet

2、HashSet的介绍

HashSet is one of the fundamental data structures in the Java Collections API.

HashSet是Java集合API的基本数据结构之一.

Let’s recall the most important aspects of this implementation:

让我们回顾一下这一实施的最重要方面。

  • It stores unique elements and permits nulls
  • It’s backed by a HashMap
  • It doesn’t maintain insertion order
  • It’s not thread-safe

Note that this internal HashMap gets initialized when an instance of the HashSet is created:

请注意,这个内部的HashMapHashSet的实例被创建时被初始化。

public HashSet() {
    map = new HashMap<>();
}

If you want to go deeper into how the HashMap works, you can read the article focused on it here.

如果你想深入了解HashMap的工作原理,你可以阅读这里重点介绍的文章

3. The API

3.API

In this section, we’re going to review most commonly used methods and have a look at some simple examples.

在本节中,我们将回顾最常用的方法,并看一下一些简单的例子。

3.1. add()

3.1.add()

The add() method can be used for adding elements to a set. The method contract states that an element will be added only when it isn’t already present in a set. If an element was added, the method returns true, otherwise – false.

add()方法可用于向一个集合添加元素。该方法的契约规定,只有当一个元素还没有出现在集合中时才会被添加。如果一个元素被添加,该方法会返回true,否则–false。

We can add an element to a HashSet like:

我们可以向HashSet添加一个元素,比如。

@Test
public void whenAddingElement_shouldAddElement() {
    Set<String> hashset = new HashSet<>();
 
    assertTrue(hashset.add("String Added"));
}

From an implementation perspective, the add method is an extremely important one. Implementation details illustrate how the HashSet works internally and leverages the HashMap’s put method:

从实现的角度来看,add方法是一个极其重要的方法。实现细节说明了HashSet如何在内部工作,并利用HashMap的put方法。

public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}

The map variable is a reference to the internal, backing HashMap:

map变量是对内部的、支持的HashMap:的引用。

private transient HashMap<E, Object> map;

It’d be a good idea to get familiar with the hashcode first to get a detailed understanding of how the elements are organized in hash-based data structures.

最好先熟悉一下hashcode,以详细了解基于散列的数据结构中的元素是如何组织的。

Summarizing:

归纳总结。

  • A HashMap is an array of buckets with a default capacity of 16 elements – each bucket corresponds to a different hashcode value
  • If various objects have the same hashcode value, they get stored in a single bucket
  • If the load factor is reached, a new array gets created twice the size of the previous one and all elements get rehashed and redistributed among new corresponding buckets
  • To retrieve a value, we hash a key, mod it, and then go to a corresponding bucket and search through the potential linked list in case of there’s more than a one object

3.2. contains()

3.2.contains()

The purpose of the contains method is to check if an element is present in a given HashSet. It returns true if the element is found, otherwise false.

contains方法的目的是检查一个元素是否存在于一个给定的HashSet中。如果找到该元素,它将返回true,否则返回false。

We can check for an element in the HashSet:

我们可以在HashSet中检查一个元素。

@Test
public void whenCheckingForElement_shouldSearchForElement() {
    Set<String> hashsetContains = new HashSet<>();
    hashsetContains.add("String Added");
 
    assertTrue(hashsetContains.contains("String Added"));
}

Whenever an object is passed to this method, the hash value gets calculated. Then, the corresponding bucket location gets resolved and traversed.

每当一个对象被传递到这个方法,哈希值就会被计算出来。然后,相应的桶的位置被解析和遍历。

3.3. remove()

3.3. remove()

The method removes the specified element from the set if it’s present. This method returns true if a set contained the specified element.

该方法从集合中删除指定的元素,如果它存在的话。如果一个集合包含指定的元素,该方法返回true

Let’s see a working example:

让我们看看一个工作实例。

@Test
public void whenRemovingElement_shouldRemoveElement() {
    Set<String> removeFromHashSet = new HashSet<>();
    removeFromHashSet.add("String Added");
 
    assertTrue(removeFromHashSet.remove("String Added"));
}

3.4. clear()

3.4.clear()

We use this method when we intend to remove all the items from a set. The underlying implementation simply clears all elements from the underlying HashMap.

当我们打算从一个集合中移除所有的项目时,我们使用这个方法。底层实现只是简单地清除了底层HashMap中的所有元素。

Let’s see that in action:

让我们看看这一点的行动。

@Test
public void whenClearingHashSet_shouldClearHashSet() {
    Set<String> clearHashSet = new HashSet<>();
    clearHashSet.add("String Added");
    clearHashSet.clear();
    
    assertTrue(clearHashSet.isEmpty());
}

3.5. size()

3.5.size()

This is one of the fundamental methods in the API. It’s used heavily as it helps in identifying the number of elements present in the HashSet. The underlying implementation simply delegates the calculation to the HashMap’s size() method.

这是API中的一个基本方法。它被大量使用,因为它有助于识别HashSet中的元素数量。底层实现只是将计算委托给HashMap的size()方法。

Let’s see that in action:

让我们看看这一点的行动。

@Test
public void whenCheckingTheSizeOfHashSet_shouldReturnThesize() {
    Set<String> hashSetSize = new HashSet<>();
    hashSetSize.add("String Added");
    
    assertEquals(1, hashSetSize.size());
}

3.6. isEmpty()

3.6.isEmpty()

We can use this method to figure if a given instance of a HashSet is empty or not. This method returns true if the set contains no elements:

我们可以使用这个方法来计算HashSet的一个给定实例是否为空。如果该集合不包含任何元素,该方法返回true

@Test
public void whenCheckingForEmptyHashSet_shouldCheckForEmpty() {
    Set<String> emptyHashSet = new HashSet<>();
    
    assertTrue(emptyHashSet.isEmpty());
}

3.7. iterator()

3.7.iterator()

The method returns an iterator over the elements in the Set. The elements are visited in no particular order and iterators are fail-fast.

该方法返回Set中元素的一个迭代器。这些元素不按特定的顺序被访问,而且迭代器是快速失败的

We can observe the random iteration order here:

我们可以观察到这里的随机迭代顺序。

@Test
public void whenIteratingHashSet_shouldIterateHashSet() {
    Set<String> hashset = new HashSet<>();
    hashset.add("First");
    hashset.add("Second");
    hashset.add("Third");
    Iterator<String> itr = hashset.iterator();
    while(itr.hasNext()){
        System.out.println(itr.next());
    }
}

If the set is modified at any time after the iterator is created in any way except through the iterator’s own remove method, the Iterator throws a ConcurrentModificationException.

如果在迭代器创建后的任何时候,除了通过迭代器自身的移除方法外,该集合被修改,迭代器会抛出一个ConcurrentModificationException

Let’s see that in action:

让我们看看这一点的行动。

@Test(expected = ConcurrentModificationException.class)
public void whenModifyingHashSetWhileIterating_shouldThrowException() {
 
    Set<String> hashset = new HashSet<>();
    hashset.add("First");
    hashset.add("Second");
    hashset.add("Third");
    Iterator<String> itr = hashset.iterator();
    while (itr.hasNext()) {
        itr.next();
        hashset.remove("Second");
    }
}

Alternatively, had we used the iterator’s remove method, then we wouldn’t have encountered the exception:

另外,如果我们使用迭代器的移除方法,那么我们就不会遇到这个异常。

@Test
public void whenRemovingElementUsingIterator_shouldRemoveElement() {
 
    Set<String> hashset = new HashSet<>();
    hashset.add("First");
    hashset.add("Second");
    hashset.add("Third");
    Iterator<String> itr = hashset.iterator();
    while (itr.hasNext()) {
        String element = itr.next();
        if (element.equals("Second"))
            itr.remove();
    }
 
    assertEquals(2, hashset.size());
}

The fail-fast behavior of an iterator cannot be guaranteed as it’s impossible to make any hard guarantees in the presence of unsynchronized concurrent modification.

迭代器的故障快速行为是无法保证的,因为在存在非同步并发修改的情况下,不可能做出任何硬性保证。

Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it’d be wrong to write a program that depended on this exception for its correctness.

失败的迭代器在尽力而为的基础上抛出ConcurrentModificationException。因此,编写一个依靠这个异常来保证其正确性的程序是错误的。

4. How HashSet Maintains Uniqueness?

4、HashSet如何保持唯一性?

When we put an object into a HashSet, it uses the object’s hashcode value to determine if an element is not in the set already.

当我们将一个对象放入HashSet时,它使用该对象的hashcode值来确定一个元素是否已经在集合中。

Each hash code value corresponds to a certain bucket location which can contain various elements, for which the calculated hash value is the same. But two objects with the same hashCode might not be equal.

每个哈希码值都对应于某个桶的位置,其中可以包含各种元素,对于这些元素,计算出来的哈希值是一样的。但两个具有相同hashCode的对象可能不相等

So, objects within the same bucket will be compared using the equals() method.

因此,同一桶内的对象将使用equals()方法进行比较。

5. Performance of HashSet

5.HashSet的性能

The performance of a HashSet is affected mainly by two parameters – its Initial Capacity and the Load Factor.

HashSet的性能主要受两个参数影响–其初始容量负载因子

The expected time complexity of adding an element to a set is O(1) which can drop to O(n) in the worst case scenario (only one bucket present) – therefore, it’s essential to maintain the right HashSet’s capacity.

向一个集合添加一个元素的预期时间复杂度是O(1),在最坏的情况下(只有一个桶存在)可以下降到O(n)–因此,保持正确的HashSet的容量至关重要。

An important note: since JDK 8, the worst case time complexity is O(log*n).

一个重要的说明:从JDK 8开始,最坏情况下的时间复杂度是O(log*n)

The load factor describes what is the maximum fill level, above which, a set will need to be resized.

负载系数描述了什么是最大的填充水平,超过这个水平,就需要调整一组的尺寸。

We can also create a HashSet with custom values for initial capacity and load factor:

我们还可以创建一个HashSet,为初始容量负载系数自定义值。

Set<String> hashset = new HashSet<>();
Set<String> hashset = new HashSet<>(20);
Set<String> hashset = new HashSet<>(20, 0.5f);

In the first case, the default values are used – the initial capacity of 16 and the load factor of 0.75. In the second, we override the default capacity and in the third one, we override both.

在第一种情况下,使用默认值–初始容量为16,负载系数为0.75。在第二种情况下,我们覆盖了默认容量,在第三种情况下,我们覆盖了两者。

A low initial capacity reduces space complexity but increases the frequency of rehashing which is an expensive process.

低的初始容量减少了空间的复杂性,但增加了重新洗牌的频率,这是一个昂贵的过程。

On the other hand, a high initial capacity increases the cost of iteration and the initial memory consumption.

另一方面,高的初始容量增加了迭代的成本和初始内存的消耗。

As a rule of thumb:

作为一个经验法则。

  • A high initial capacity is good for a large number of entries coupled with little to no iteration
  • A low initial capacity is good for few entries with a lot of iteration

It’s, therefore, very important to strike the correct balance between the two. Usually, the default implementation is optimized and works just fine, should we feel the need to tune these parameters to suit the requirements, we need to do judiciously.

因此,在这两者之间取得正确的平衡是非常重要的。通常情况下,默认的实现是经过优化的,而且工作得很好,如果我们觉得有必要调整这些参数以适应要求,我们需要审慎地进行调整。

6. Conclusion

6.结论

In this article, we outlined the utility of a HashSet, its purpose as well as its underlying working. We saw how efficient it is in terms of usability given its constant time performance and ability to avoid duplicates.

在这篇文章中,我们概述了HashSet的效用,它的目的以及它的基础工作。我们看到,鉴于其恒定的时间性能和避免重复的能力,它在可用性方面是多么有效。

We studied some of the important methods from the API, how they can help us as a developer to use a HashSet to its potential.

我们研究了API中的一些重要方法,它们是如何帮助我们作为一个开发者使用HashSet发挥其潜力的。

As always, code snippets can be found over on GitHub.

像往常一样,代码片段可以在GitHub上找到over