Guide to the HyperLogLog Algorithm in Java

Last modified: July 18, 2017

1. Overview

The HyperLogLog (HLL) data structure is a probabilistic data structure used to estimate the cardinality of a data set.

Suppose that we have millions of users and we want to calculate the number of distinct visits to our web page. A naive implementation would be to store each unique user id in a set, and then the size of the set would be our cardinality.

When we are dealing with very large volumes of data, counting cardinality this way will be very inefficient because the data set will take up a lot of memory.

But if we are fine with an estimation within a few percent and don’t need the exact number of unique visits, then we can use the HLL, as it was designed for exactly such a use case – estimating the count of millions or even billions of distinct values.

2. Maven Dependency

To get started we’ll need to add the Maven dependency for the hll library:

<dependency>
    <groupId>net.agkn</groupId>
    <artifactId>hll</artifactId>
    <version>1.6.0</version>
</dependency>

3. Estimating Cardinality Using HLL

Jumping right in – the HLL constructor has two arguments that we can tweak according to our needs:

  • log2m – the log base 2 of the number of registers used internally by HLL (so the HLL uses m = 2^log2m registers)
  • regwidth – this is the number of bits used per register

If we want higher accuracy, we need to set these to higher values. Such a configuration will have additional overhead because our HLL will occupy more memory. If we're fine with lower accuracy, we can lower those parameters, and our HLL will occupy less memory.

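To get a feel for the accuracy side of this trade-off, here is a minimal sketch. It relies only on the standard HyperLogLog error estimate of roughly 1.04 / sqrt(m), where m = 2^log2m is the number of registers, rather than on any API from the hll library, and the log2m values shown are purely illustrative:

// m = 2^log2m registers; the standard HLL relative error estimate is ~1.04 / sqrt(m)
double errorForLog2m11 = 1.04 / Math.sqrt(1 << 11); // ~0.023, i.e. about 2.3% with 2,048 registers
double errorForLog2m14 = 1.04 / Math.sqrt(1 << 14); // ~0.008, i.e. about 0.8% with 16,384 registers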

Let’s create an HLL to count distinct values for a data set with 100 million entries. We will set the log2m parameter equal to 14 and regwidth equal to 5 – reasonable values for a data set of this size.

When each new element is inserted into the HLL, it needs to be hashed beforehand. We will use Hashing.murmur3_128() from the Guava library (included with the hll dependency) because it is both accurate and fast.

HashFunction hashFunction = Hashing.murmur3_128();
long numberOfElements = 100_000_000;
long toleratedDifference = 1_000_000;
HLL hll = new HLL(14, 5);

Choosing those parameters should give us an error rate below one percent, i.e. a tolerated difference of 1,000,000 elements out of 100 million. We will test this in a moment.


Next, let’s insert the 100 million elements:

LongStream.range(0, numberOfElements).forEach(element -> {
    // hash each element with MurmurHash3 before adding it to the HLL
    long hashedValue = hashFunction.newHasher().putLong(element).hash().asLong();
    hll.addRaw(hashedValue);
});

Finally, we can test that the cardinality returned by the HLL is within our desired error threshold:

long cardinality = hll.cardinality();
assertThat(cardinality)
  .isCloseTo(numberOfElements, Offset.offset(toleratedDifference));

4. Memory Size of HLL

We can calculate how much memory our HLL from the previous section will take by using the following formula: numberOfBits = 2 ^ log2m * regwidth.

In our example that will be 2 ^ 14 * 5 = 81,920 bits, which is 10,240 bytes (roughly 10 KB). So estimating the cardinality of a 100-million member set using HLL occupies only about 10 KB of memory.

Let’s compare this with a naive set implementation. In such an implementation, we need to have a Set of 100 million Long values, which would occupy 100,000,000 * 8 bytes = 800,000,000 bytes.

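To put concrete numbers on this comparison, here is a small sketch using plain arithmetic only (it does not use the hll library); the naive figure deliberately ignores the extra per-entry overhead of a real HashSet, matching the rough estimate above:

int log2m = 14;
int regwidth = 5;

// HLL memory: 2^log2m registers, regwidth bits each
long hllBits = (1L << log2m) * regwidth; // 81,920 bits
long hllBytes = hllBits / 8;             // 10,240 bytes, roughly 10 KB

// naive approach: 100 million Long values at 8 bytes each
long naiveBytes = 100_000_000L * 8;      // 800,000,000 bytes, roughly 800 MB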

We can see the difference is astonishingly high. Using HLL, we need only about 10 KB, whereas using the naive Set implementation we would need roughly 800 megabytes.

When we consider bigger data sets, the difference between HLL and the naive Set implementation becomes even higher.

5. Union of Two HLLs

HLL has one beneficial property when performing unions. When we take the union of two HLLs created from distinct data sets and measure its cardinality, we will get the same error threshold for the union that we would get if we had used a single HLL and calculated the hash values for all elements of both data sets from the beginning.

Note that when we union two HLLs, both should have the same log2m and regwidth parameters to yield proper results.

Let’s test that property by creating two HLLs – one is populated with values from 0 to 100 million, and the second is populated with values from 100 million to 200 million:

HashFunction hashFunction = Hashing.murmur3_128();
long numberOfElements = 100_000_000;
long toleratedDifference = 1_000_000;
HLL firstHll = new HLL(15, 5);
HLL secondHLL = new HLL(15, 5);

LongStream.range(0, numberOfElements).forEach(element -> {
    long hashedValue = hashFunction.newHasher()
      .putLong(element)
      .hash()
      .asLong();
    firstHll.addRaw(hashedValue);
});

LongStream.range(numberOfElements, numberOfElements * 2).forEach(element -> {
    long hashedValue = hashFunction.newHasher()
      .putLong(element)
      .hash()
      .asLong();
    secondHLL.addRaw(hashedValue);
});

Please note that we tuned the configuration parameters of the HLLs, increasing the log2m parameter from 14, as seen in the previous section, to 15 for this example, since the resulting HLL union will contain twice as many elements.

Next, let’s union the firstHll and secondHll using the union() method. As you can see, the estimated cardinality is within an error threshold as if we had taken the cardinality from one HLL with 200 million elements:

firstHll.union(secondHLL);
long cardinality = firstHll.cardinality();
assertThat(cardinality)
  .isCloseTo(numberOfElements * 2, Offset.offset(toleratedDifference * 2));

6. Conclusion

In this tutorial, we had a look at the HyperLogLog algorithm.

We saw how to use the HLL to estimate the cardinality of a set. We also saw that HLL is very space-efficient compared to the naive solution. And we performed the union operation on two HLLs and verified that the union behaves in the same way as a single HLL.

The implementation of all these examples and code snippets can be found in the GitHub project; this is a Maven project, so it should be easy to import and run as it is.
