1. Overview
In this article, we’ll be looking at the Bloom filter construct from the Guava library. A Bloom filter is a memory-efficient, probabilistic data structure that we can use to answer the question of whether or not a given element is in a set.
There are no false negatives with a Bloom filter, so when it returns false, we can be 100% certain that the element is not in the set.
However, a Bloom filter can return false positives, so when it returns true, there is a high probability that the element is in the set, but we cannot be 100% sure.
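To make these two guarantees concrete, here is a minimal sketch of the idea behind a Bloom filter, written with only the JDK. It is not Guava's implementation: the class name TinyBloomFilter, the bit count, and the simplistic hash mixing are all assumptions chosen for brevity.

```java
import java.util.BitSet;

// Illustrative sketch only: this is NOT Guava's implementation.
final class TinyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    TinyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // i-th bit position for a value, derived from two mixed hashes
    // (an assumed, simplistic mixing scheme, not a production-quality hash).
    private int position(int value, int i) {
        int h1 = value * 0x9E3779B9;
        int h2 = (Integer.rotateLeft(h1, 15) * 0x85EBCA6B) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void put(int value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(value, i));
        }
    }

    boolean mightContain(int value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(value, i))) {
                return false; // a required bit is unset, so the value was never inserted
            }
        }
        return true; // all bits are set: inserted, or set by other elements
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(4800, 7);
        filter.put(1);
        filter.put(2);
        System.out.println(filter.mightContain(1)); // true: inserted elements always match
        System.out.println(filter.mightContain(2)); // true
    }
}
```

Because put() only ever sets bits, an inserted element can never produce a false result later. A false positive happens when other elements happen to have set every bit that a never-inserted value maps to.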
For a more in-depth analysis of how a Bloom filter works, you can go through this tutorial.
2. Maven Dependency
We will be using Guava’s implementation of the Bloom filter, so let’s add the guava dependency:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>31.0.1-jre</version>
</dependency>
The latest version can be found on Maven Central.
3. Why Use a Bloom Filter?
The Bloom filter is designed to be space-efficient and fast. When creating one, we specify the false positive probability that we can accept, and the filter is sized to occupy as little memory as that target allows.
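As a rough illustration of how little memory this takes, the sketch below applies the textbook Bloom filter sizing formulas: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The class and method names are hypothetical, and the memory a real Guava filter allocates may differ slightly, for example through rounding.

```java
// Sketch of the textbook sizing formulas; names here are illustrative assumptions.
final class BloomSizing {

    // m = -n * ln(p) / (ln 2)^2, truncated to a whole number of bits
    static long optimalNumOfBits(long expectedInsertions, double fpp) {
        return (long) (-expectedInsertions * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2, rounded, and never fewer than one hash function
    static int optimalNumOfHashFunctions(long expectedInsertions, long numBits) {
        return Math.max(1, (int) Math.round((double) numBits / expectedInsertions * Math.log(2)));
    }

    public static void main(String[] args) {
        long bits = optimalNumOfBits(500, 0.01);
        int hashes = optimalNumOfHashFunctions(500, bits);
        System.out.println(bits);   // 4792 bits, roughly 600 bytes
        System.out.println(hashes); // 7
    }
}
```

By the same formula, a filter for one million elements at a one-percent false positive probability needs about 9.6 million bits, which is roughly 1.2 MB, far less than storing the elements themselves.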
Due to this space-efficiency, the Bloom filter will easily fit in memory even for huge numbers of elements. Some databases, including Cassandra and Oracle, use this filter as the first check before going to disk or cache, for example, when a request for a specific ID comes in.
If the filter returns that the ID is not present, the database can stop further processing of the request and return to the client. Otherwise, it goes to the disk and returns the element if it is found on disk.
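The lookup flow described above can be sketched as follows. Everything in this example is hypothetical: GuardedStore is not a real database API, the Map stands in for slow disk storage, and the IntPredicate stands in for a filter's membership check (with Guava, a method reference such as filter::mightContain could be passed instead).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.IntPredicate;

// Hypothetical sketch of "check the filter first, then go to disk".
final class GuardedStore {
    private final IntPredicate mightContain;  // stand-in for a Bloom filter check
    private final Map<Integer, String> disk;  // stand-in for slow storage
    int diskReads = 0;                        // counts how often we touch "disk"

    GuardedStore(IntPredicate mightContain, Map<Integer, String> disk) {
        this.mightContain = mightContain;
        this.disk = disk;
    }

    Optional<String> find(int id) {
        if (!mightContain.test(id)) {
            return Optional.empty(); // definitely absent: skip the disk entirely
        }
        diskReads++;                 // possibly present: the disk must confirm
        return Optional.ofNullable(disk.get(id));
    }

    public static void main(String[] args) {
        Map<Integer, String> disk = new HashMap<>();
        disk.put(1, "one");
        // The set contains 99 to simulate a false positive for an ID that is not on disk.
        Set<Integer> filterAnswers = Set.of(1, 99);
        GuardedStore store = new GuardedStore(filterAnswers::contains, disk);

        System.out.println(store.find(1).isPresent());  // true, after one disk read
        System.out.println(store.find(50).isPresent()); // false, no disk read needed
        System.out.println(store.find(99).isPresent()); // false, but a disk read was wasted
        System.out.println(store.diskReads);            // 2
    }
}
```

A false positive costs only an unnecessary disk read, never a wrong answer, which is why a small false positive probability is acceptable in this role.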
4. Creating a Bloom Filter
Suppose we want to create a Bloom filter for up to 500 Integers and that we can tolerate a one-percent (0.01) probability of false positives.
We can use the BloomFilter class from the Guava library to achieve this. We need to pass the number of elements that we expect to be inserted into the filter and the desired false positive probability:
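import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

BloomFilter<Integer> filter = BloomFilter.create(
  Funnels.integerFunnel(),
  500,   // expected number of insertions
  0.01); // desired false positive probability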
Now let’s add some numbers to the filter:
filter.put(1);
filter.put(2);
filter.put(3);
We added only three elements, and we declared that we expect up to 500 insertions, so our Bloom filter should yield very accurate results. Let’s test it using the mightContain() method:
assertThat(filter.mightContain(1)).isTrue();
assertThat(filter.mightContain(2)).isTrue();
assertThat(filter.mightContain(3)).isTrue();
assertThat(filter.mightContain(100)).isFalse();
As the method name suggests, we cannot be 100% sure that a given element is actually in the filter when mightContain() returns true.
In our example, the one-percent figure is the probability that mightContain() returns true for an element that was never inserted, provided the filter holds no more than the expected 500 elements. It is not the chance that any particular true answer is wrong, since that also depends on how often we query for absent elements. When the filter returns false, we can be 100% certain that the element is not present.
5. Over-Saturated Bloom Filter
When we design our Bloom filter, it is important that we provide a reasonably accurate value for the expected number of elements. Otherwise, our filter will return false positives at a much higher rate than desired. Let’s see an example.
Suppose that we created a filter with a desired false-positive probability of one percent and an expected number of elements equal to five, but then we inserted 100,000 elements:
BloomFilter<Integer> filter = BloomFilter.create(
  Funnels.integerFunnel(),
  5,
  0.01);

IntStream.range(0, 100_000).forEach(filter::put);
Because the expected number of elements is so small, the filter will occupy very little memory.
However, as we add more items than expected, the filter becomes over-saturated and has a much higher probability of returning false positive results than the desired one percent:
assertThat(filter.mightContain(1)).isTrue();
assertThat(filter.mightContain(2)).isTrue();
assertThat(filter.mightContain(3)).isTrue();
assertThat(filter.mightContain(1_000_000)).isTrue();
Note that the mightContain() method returned true even for a value that we didn’t insert into the filter previously.
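To see why over-saturation is so damaging, we can estimate the false positive rate with the textbook approximation (1 - e^(-k n / m))^k, where m is the number of bits, k the number of hash functions, and n the number of inserted elements. The values m = 47 and k = 7 below are assumptions derived from the standard sizing formulas for five expected elements at one percent; the filter Guava actually builds may use slightly different numbers.

```java
// Sketch of the textbook false positive estimate; the class name is illustrative.
final class SaturationEstimate {

    // Approximate probability that a never-inserted element reports true:
    // (1 - e^(-k * n / m))^k
    static double falsePositiveRate(long numBits, int numHashes, long inserted) {
        double fractionOfBitsSet = 1.0 - Math.exp(-(double) numHashes * inserted / numBits);
        return Math.pow(fractionOfBitsSet, numHashes);
    }

    public static void main(String[] args) {
        long numBits = 47;  // assumed size for 5 expected elements at 1%
        int numHashes = 7;  // assumed number of hash functions

        // Near the expected load the estimate stays close to one percent.
        System.out.println(falsePositiveRate(numBits, numHashes, 5));
        // Far beyond it, practically every bit is set and the estimate approaches 1.
        System.out.println(falsePositiveRate(numBits, numHashes, 100_000));
    }
}
```

Once nearly every bit is set, the filter answers true for almost any value, so it no longer filters anything.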
6. Conclusion
In this quick tutorial, we looked at the probabilistic nature of the Bloom filter data structure, using the Guava implementation.
You can find the implementation of all these examples and code snippets in the GitHub project.
This is a Maven project, so it should be easy to import and run as it is.