A Guide to the Folding Technique in Java – Java中的折叠技术指南

最后修改: 2019年 6月 19日

中文/混合/英文(键盘快捷键:t)

1. Introduction

1.绪论

In this tutorial, we consider hashing techniques used in various data structures that provide constant time access to their elements.

在本教程中,我们考虑在各种数据结构中使用散列技术,提供对其元素的持续时间访问。

We discuss in more detail the so-called folding technique and give a short introduction to mid-square and binning techniques.

我们更详细地讨论了所谓的折叠技术,并对中方和分选技术作了简短介绍。

2. Overview

2.概述

When we choose data structures for storing objects, one of the considerations is whether we need to access them quickly.

当我们选择数据结构来存储对象时,其中一个考虑因素是我们是否需要快速访问它们。

The Java utility package offers us quite a lot of data structures for storing our objects. For more information on data structures, refer to our Java Collections compilation page that contains guides on several of them.

Java实用程序包为我们提供了相当多的数据结构来存储我们的对象。关于数据结构的更多信息,请参考我们的Java集合编译页面,其中包含了关于其中几个结构的指南。

As we know, some of these data structures allow us to retrieve their elements in constant time, independent of the number of elements that they contain.

正如我们所知,这些数据结构中的一些允许我们在恒定的时间内检索其元素,与它们所包含的元素数量无关。

Probably, the simplest one is the array. In fact, we access elements in the array by their index. The access time, naturally, does not depend on the size of the array. In fact, behind the scene, many data structures heavily use arrays.

可能,最简单的是数组。事实上,我们通过索引访问数组中的元素。访问时间,自然不取决于数组的大小。事实上,在幕后,许多数据结构都大量使用数组。

The problem is that the array indexes must be numeric, while we often prefer to manipulate these data structures with objects.

问题是,数组的索引必须是数字,而我们常常喜欢用对象来操作这些数据结构。

In order to address this problem, many data structures try to assign a numeric value that can serve as an array index to objects. We call this value a hash value or simply a hash.

为了解决这个问题,许多数据结构试图分配一个数字值,作为对象的数组索引。我们把这个值称为哈希值,或者简单地称为哈希

3. Hashing

3.洗练

Hashing is a transformation of an object into a numerical value. Functions that perform these transformations are called hash functions.

哈希是将一个对象转换为一个数值的过程执行这些转换的函数被称为哈希函数

For the sake of simplicity, let’s consider hash functions that transform strings into array indexes, that is, into integers from the range [0, N] with a finite N.

为了简单起见,让我们考虑将字符串转化为数组索引的哈希函数,也就是转化为范围[0, N]的整数,并有一个有限的N

Naturally, a hash function is applied to a wide variety of strings. Therefore its “global” properties become important.

自然,散列函数被应用于各种各样的字符串。因此,它的 “全局 “属性变得很重要。

Mapping of strings into array indexes
Unfortunately, it’s not possible that a hash function always transforms different strings into different numbers.

将字符串映射为数组索引

不幸的是,哈希函数不可能总是将不同的字符串转化为不同的数字

We may convince ourselves quite easily that the number of strings is much bigger than the number of integers in any range [0, N]. Therefore, it’s inevitable that there is a pair of non-equal strings for which a hash function produces equal values. This phenomenon is called collision.

我们可以很容易地说服自己,在任何范围内[0, N],字符串的数量都比整数的数量大得多。因此,不可避免的是,有一对不相等的字符串,哈希函数产生了相等的值。这种现象被称为碰撞

We are not going to dive into the engineering details behind hash functions, but it’s clear that a good hash function should try to map uniformly the strings on which it’s defined into numbers.

我们不打算深入研究哈希函数背后的工程细节,但很明显,一个好的哈希函数应该尝试将它所定义的字符串统一映射为数字。

Another obvious requirement is that a good hash function should be fast. If it takes too long to calculate a hash value, then we cannot access elements quickly.

另一个明显的要求是,一个好的哈希函数应该是快速的。如果计算一个哈希值需要太长的时间,那么我们就不能快速访问元素。

In this tutorial, we consider one of the techniques that try to make the mapping uniform while maintaining it fast.

在本教程中,我们考虑其中一个技术,试图使映射统一,同时保持其快速。

4. Folding Technique

4.折叠技术

Our goal is to find a function that transforms strings into array indexes. Just to illustrate the idea, suppose that we want this array to have the capacity for 105 elements and let’s use string Java language as an example.

我们的目标是找到一个能将字符串转化为数组索引的函数。只是为了说明这个想法,假设我们希望这个数组有105个元素的容量,让我们用字符串Java语言作为一个例子。

4.1. Description

4.1.描述

Let’s start by converting the string’s characters into numbers. ASCII is a good candidate for this operation:

让我们先把字符串的字符转换为数字。ASCII是这种操作的一个很好的候选者。

Convert the string into ascii

Now, we arrange the numbers we just obtained into groups of some size. Generally, we choose the group size value based on the size of our array which is 105. Since the numbers, in which we transformed the characters into, contain from two to three digits, without loss of generality, we may set the group size to two:

现在,我们将刚刚得到的数字排列成一定大小的组。一般来说,我们根据我们数组的大小来选择组的大小,即105。由于我们将字符转化成的数字包含两到三位数字,在不影响一般情况下,我们可以将组的大小设置为2。

Arrange string's ascii codes

The next step is to concatenate the numbers in each group as if they were strings and find their sum:

下一步是将每组的数字连接起来,就像它们是字符串一样,并找到它们的总和。

Concatenate and sum up the numbers

Now we must make the final step. Let’s check whether the number 348933 may serve as an index of our array of size 105. Naturally, it exceeds the maximum allowed value 99999. We may easily overcome this problem by applying the modulo operator in order to find the final result:

现在我们必须做最后一步了。让我们检查一下数字348933是否可以作为大小为105的数组的索引。自然,它超过了允许的最大值99999.我们可以通过应用模运算符来轻松克服这个问题,以找到最终结果。

348933 % 10000 = 48933

4.2. Final Remarks

4.2.最后备注

We see that the algorithm does not include any time-consuming operations and hence it is quite fast. Every character of the input string contributes to the final result. This fact definitely helps to reduce collisions, but not to avoid them completely.

我们看到,该算法不包括任何耗时的操作,因此它是相当快的。输入字符串的每个字符都对最终结果有贡献。这一事实无疑有助于减少碰撞,但不能完全避免。

For example, if we wanted to skip the folding and applied the modulo operator directly to the ASCII-transformed input string (ignoring the overflow issue)

例如,如果我们想跳过折叠,直接对ASCII转换后的输入字符串应用模运算(忽略溢出的问题)

749711897321089711010311797103101 % 100000 = 3101

then such hash function would produce the same value for all strings that have the same last two characters as our input string: age, page, large, and so on.

那么这样的哈希函数将为所有与我们的输入字符串最后两个字符相同的字符串产生相同的值:age, page, large, 等等。

From the description of the algorithm, we can easily see that it is not free from the collisions. For instance, the algorithm produces the same hash value for Java language and vaJa language strings.

从该算法的描述中,我们可以很容易地看到,它不是没有碰撞。例如,该算法对Java语言vaJa语言字符串产生相同的哈希值

5. Other Techniques

5.其他技术

The Folding Technique is quite common, but not the only one. Sometimes, the binning or mid-square techniques may be useful too.

折叠技术是很常见的,但不是唯一的。有时,宾语中位数技术也可能有用。

We illustrate their idea by not using strings, but numbers (suppose that we have already somehow transformed the strings into numbers). We won’t discuss their advantages and weaknesses, but you may form an opinion after seeing the algorithms.

我们不使用字符串,而是使用数字来说明他们的想法(假设我们已经以某种方式将字符串转换为数字)。我们不会讨论他们的优点和缺点,但你可以在看到这些算法后形成自己的看法。

5.1. Binning Technique

5.1.分级技术

Suppose that we have 100 integer numbers and we want our hash function to map them into an array of 10 elements. Then we may just arrange those 100 integers into ten groups in such a way that the first ten integers end up in the first bin, the second ten integers end up in the second bin, etc.:

假设我们有100个整数,我们希望我们的哈希函数能将它们映射成一个10个元素的数组。那么我们就可以把这100个整数排列成10个组,使前10个整数最终进入第一个bin,后10个整数最终进入第二个bin,等等。

Binning technique

5.2. Mid-Square Technique

5.2.中方技术

This algorithm was proposed by John von Neumann and it allows us to generate pseudo-random numbers starting from a given number.

这种算法是由约翰-冯-诺伊曼提出的,它允许我们从一个给定的数字开始生成伪随机数。

Mid-square hashing
Let’s illustrate it on a concrete example. Suppose, we have a four-digit number 1111. According to the algorithm, we square it, thus obtaining 1234321‬. Now, we extract four digits from the middle, for example, 2343. The algorithm allows us to repeat this process until we are satisfied with the result.

mid-square hashing
>
让我们用一个具体的例子来说明它。假设,我们有一个四位数的数字1111。根据该算法,我们将其平方,从而得到1234321。现在,我们从中间提取四个数字,例如,2343。该算法允许我们重复这一过程,直到我们对结果感到满意。

6. Conclusion

6.结语

In this tutorial, we considered several hashing techniques. We described in detail the folding technique and gave a flash description of how binning and mid-square can be achieved.

在本教程中,我们考虑了几种散列技术。我们详细描述了折叠技术,并对如何实现分选和中位数给出了一个闪光的描述。

As always, we may find the corresponding code snippets on our GitHub repository.

一如既往,我们可以在我们的GitHub资源库中找到相应的代码片段。