Optimizing HashMap’s Performance – 优化HashMap’的性能

最后修改: 2021年 2月 25日

中文/混合/英文(键盘快捷键:t)

1. Introduction

1.绪论

HashMap is a powerful data structure that has a broad application, especially when fast lookup time is needed. Yet, if we don’t pay attention to details, it can get suboptimal.

HashMap是一种强大的数据结构,具有广泛的应用,尤其是在需要快速查找的时候。然而,如果我们不注意细节,它就会变得次优。

In this tutorial, we’ll take a look at how to make HashMap as fast as possible.

在本教程中,我们将看看如何使HashMap尽可能快。

2. HashMap‘s Bottleneck

2.HashMap的瓶颈

HashMap‘s optimistic constant time of element retrieval (O(1)) comes from the power of hashing. For each element, HashMap computes the hash code and puts the element in the bucket associated with that hash code. Because non-equal objects can have the same hash codes (a phenomenon called hash code collision), buckets can grow in size.

HashMap的元素检索的乐观恒定时间(O(1))来自散列的力量。对于每个元素,HashMap计算散列代码并将该元素放入与该散列代码相关的桶中。由于非相等的对象可能具有相同的哈希代码(这种现象被称为哈希代码碰撞),所以桶的大小会越来越大。

The bucket is actually a simple linked list. Finding elements in the linked list isn’t very fast (O(n)) but that’s not a problem if the list is very small. Problems start when we have a lot of hash code collisions, so instead of a big number of small buckets, we have a small number of big buckets.

这个桶实际上是一个简单的链表。在链表中寻找元素的速度并不快(O(n)),但如果列表非常小,这并不是一个问题。当我们有大量的哈希代码碰撞时,问题就开始了,所以我们不是有大量的小桶,而是有少量的大桶。

In the worst-case scenario, in which we put everything inside one bucket, our HashMap is downgraded to a linked list. Consequently, instead of O(1) lookup time, we get a very unsatisfactory O(n).

在最坏的情况下,我们把所有东西都放在一个桶里,我们的HashMap被降级为一个链接列表。因此,我们得到的不是O(1)的查找时间,而是非常不理想的O(n)

3. Tree Instead of LinkedList

3.用树代替LinkedList

Starting from Java 8, one optimization is built-in in HashMapWhen buckets are getting too large, they’re transformed into trees, instead of linked lists. That brings the pessimistic time of O(n) to O(log(n)), which is much better. For that to work, the keys of HashMap need to implement the Comparable interface.

从 Java 8 开始,HashMap 中内置了一项优化当桶变得过大时,它们将被转换为树,而不是链接列表。这使得悲观的时间O(n)变为O(log(n)),这要好很多。为了实现这一目标,HashMap的键需要实现Comparable接口。

That’s a nice and automatic solution, but it’s not perfect. O(log(n)) is still worse than desired constant time, and transforming and storing trees takes additional power and memory.

这是一个不错的自动解决方案,但它并不完美。O(log(n)) 仍然比理想的恒定时间差,而且转换和存储树需要额外的功率和内存。

4. Best hashCode Implementation

4.最佳hashCode实现

There are two factors we need to take into consideration when choosing a hashing function: quality of produced hash codes and speed.

在选择散列函数时,我们需要考虑两个因素:产生的散列代码的质量和速度。

4.1. Measuring hashCode Quality

4.1.测量hashCode质量

Hash codes are stored inside int variables, so the number of possible hashes is limited to the capacity of the int type. It must be so because hashes are used to compute indexes of an array with buckets. That means there’s also a limited number of keys that we can store in a HashMap without hash collision.

散列代码存储在int变量内,所以可能的散列数量被限制在int类型的容量内。必须如此,因为哈希码是用来计算带桶数组的索引的。这意味着我们可以存储在HashMap中的键的数量也是有限的,不会发生哈希碰撞。

To avoid collisions as long as we can, we want to spread hashes as evenly as possible. In other words, we want to achieve uniform distribution. That means that each hash code value has the same chance of occurring as any other.

为了尽可能地避免碰撞,我们希望尽可能均匀地分布哈希值。换句话说,我们希望实现均匀分布。这意味着每个哈希码值的出现机会与其他任何值相同。

Similarly, a bad hashCode method would have a very unbalanced distribution. In the very worst-case scenario, it would always return the same number.

同样地,一个糟糕的hashCode方法会有一个非常不平衡的分布。在最坏的情况下,它将总是返回相同的数字。

4.2. Default Object‘s hashCode

4.2.默认ObjecthashCode

In general, we shouldn’t use the default Object’s hashCode method because we don’t want to use object identity in the equals method. However, in that very unlikely scenario in which we really want to use object identity for keys in a HashMap, the default hashCode function will work fine. Otherwise, we’ll want a custom implementation.

一般来说,我们不应该使用默认的Object的hashCode方法,因为我们不希望在equals方法中使用对象身份。然而,在那种不太可能的情况下,我们真的想在HashMap中的键上使用对象身份,默认的hashCode函数就能正常工作。否则,我们将需要一个自定义的实现。

4.3. Custom hashCode

4.3.自定义hashCode

Usually, we want to override the equals method, and then we also need to override hashCode. Sometimes, we can take advantage of the specific identity of the class and easily make a very fast hashCode method.

通常,我们要覆盖equals方法,然后我们还需要覆盖hashCode。有时,我们可以利用类的特殊身份,轻松地做出一个非常快速的hashCode方法

Let’s say our object’s identity is purely based on its integer id. Then, we can just use this id as a hash function:

假设我们的对象的身份纯粹是基于其整数id。那么,我们就可以把这个id作为一个哈希函数。

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;

    MemberWithId that = (MemberWithId) o;

    return id.equals(that.id);
}

@Override
public int hashCode() {
    return id;
}

It will be extremely fast and won’t produce any collisions. Our HashMap will behave like it has an integer key instead of a complex object.

它将是非常快的,不会产生任何碰撞。我们的HashMap将表现得像它有一个整数键而不是一个复杂的对象。

The situation will get more complicated if we have more fields that we need to take into account. Let’s say we want to base equality on both id and name:

如果我们有更多的字段需要考虑,情况会变得更加复杂。比方说,我们想在idname上建立平等。

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;

    MemberWithIdAndName that = (MemberWithIdAndName) o;

    if (!id.equals(that.id)) return false;
    return name != null ? name.equals(that.name) : that.name == null;
}

Now, we need to somehow combine hashes of id and name.

现在,我们需要以某种方式结合idname的哈希值。

First, we’ll get id‘s hash the same as before. Then, we’ll multiple it by some carefully chosen number and add the name‘s hash:

首先,我们会像以前一样得到id的哈希值。然后,我们将用一些精心选择的数字进行乘法,并加上name的哈希值。

@Override
public int hashCode() {
    int result = id.hashCode();
    result = PRIME * result + (name != null ? name.hashCode() : 0);
    return result;
}

How to choose that number isn’t an easy question to answer sufficiently. Historically, the most popular number was 31. It’s prime, it results in a good distribution, it’s small, and multiplying by it can be optimized using a bit-shift operation:

如何选择这个数字并不是一个容易充分回答的问题。历史上,最受欢迎的数字是31。它是素数,它的结果是一个很好的分布,它很小,而且乘以它可以用位移操作来优化。

31 * i == (i << 5) - i

However, now that we don’t need to fight for every CPU cycle, some bigger primes can be used. For example, 524287 also can be optimized:

然而,现在我们不需要争夺每一个CPU周期,一些更大的素数可以被使用。例如,524287也可以被优化。

524287 * i == i << 19 - i

And, it may provide a hash of better quality resulting in a lesser chance of collision. Mind that these bit-shift optimizations are done automatically by the JVM, so we don’t need to obfuscate our code with them.

而且,它可以提供一个质量更好的散列,从而减少碰撞的机会。注意,这些位移优化是由JVM自动完成的,所以我们不需要用它们来混淆我们的代码。

4.4. Objects Utility Class

4.4.Objects实用类

The algorithm we just implemented is well established, and we don’t usually need to recreate it by hand every time. Instead, we can use the helper method provided by the Objects class:

我们刚刚实现的算法已经很成熟了,我们通常不需要每次都手工重新创建它。相反,我们可以使用Objects类提供的辅助方法。

@Override
public int hashCode() {
    return Objects.hash(id, name);
}

Under-the-hood, it uses exactly the algorithm described earlier with the number 31 as a multiplier.

在内部,它完全使用前面描述的算法,以数字31作为乘法器。

4.5. Other Hash Functions

4.5.其他哈希函数

There are many hash functions that provide a lesser collision chance than the one described earlier. The problem is that they’re computationally heavier and thus don’t provide the speed gain we seek.

有许多哈希函数提供了比前面描述的更小的碰撞机会。问题是,它们的计算量更大,因此不能提供我们所寻求的速度提升。

If for some reason we really need quality and don’t care much for speed, we can take a look at the Hashing class from the Guava library:

如果出于某种原因,我们真的需要质量而不太在乎速度,我们可以看看Guava库中的Hashing类。

@Override
public int hashCode() {
    HashFunction hashFunction = Hashing.murmur3_32();
    return hashFunction.newHasher()
      .putInt(id)
      .putString(name, Charsets.UTF_8)
      .hash().hashCode();
}

It’s important to choose a 32-bit function because we can’t store longer hashes anyway.

选择一个32位的函数很重要,因为无论如何我们都不能存储更长的哈希值。

5. Conclusion

5.总结

Modern Java’s HashMap is a powerful and well-optimized data structure. Its performance can, however, be worsened by a badly designed hashCode method. In this tutorial, we looked at possible ways to make hashing fast and effective.

现代Java的HashMap是一个强大而优化的数据结构。然而,它的性能可能会因设计不当的hashCode方法而变差。在本教程中,我们研究了使散列变得快速有效的可能方法。

As always, the code examples for this article are available over on GitHub.

像往常一样,本文的代码示例可在GitHub上获得