Locality-Sensitive Hashing in Java Using Java-LSH – 在Java中使用Java-LSH进行位置敏感的哈希运算

最后修改: 2017年 6月 20日

中文/混合/英文(键盘快捷键:t)

1. Overview

1.概述

The Locality-Sensitive Hashing (LSH) algorithm hashes input items so that similar items have a high probability of being mapped to the same buckets.

Locality-Sensitive Hashing (LSH)算法对输入的项目进行散列,使类似的项目有很大的概率被映射到相同的桶中。

In this quick article, we will use the java-lsh library to demonstrate a simple use case of this algorithm.

在这篇快速文章中,我们将使用java-lsh库来演示这种算法的一个简单用例。

2. Maven Dependency

2.Maven的依赖性

To get started we’ll need to add Maven dependency to the java-lsh library:

为了开始工作,我们需要向java-lsh库添加Maven依赖性。

<dependency>
    <groupId>info.debatty</groupId>
    <artifactId>java-lsh</artifactId>
    <version>0.10</version>
</dependency>

3. Locality-Sensitive Hashing Use Case

3.位置敏感的哈希使用案例

LSH has many possible applications, but we will consider one particular example.

LSH有许多可能的应用,但我们将考虑一个特殊的例子。

Suppose we have a database of documents and want to implement a search engine that will be able to identify similar documents.

假设我们有一个文档数据库,想要实现一个能够识别类似文档的搜索引擎。

We can use LSH as part of this solution:

我们可以使用LSH作为这个解决方案的一部分。

  • Every document can be transformed to a vector of numbers or booleans – for example, we could use the word2vect algorithm to transform words and documents into vectors of numbers
  • Once we have a vector representing each document, we can use the LSH algorithm to calculate a hash for each vector, and due to the characteristics of LSH, documents that are presented as similar vectors will have a similar or same hash
  • As a result, given a particular document’s vector, we can find N numbers of vectors that have a similar hash and return the corresponding documents to the end user

4. Example

4.实例

We will be using the java-lsh library to calculate hashes for our input vectors. We won’t be covering the transformation itself, as this is a huge topic beyond the scope of this article.

我们将使用java-lsh库来为我们的输入向量计算哈希值。我们将不涉及转换本身,因为这是一个超出本文范围的巨大话题。

However, suppose we have three input vectors that are transformed from a set of three documents, presented in a form that can be used as the input for the LSH algorithm:

然而,假设我们有三个输入向量,它们是由一组三个文件转化而来,以一种可以作为LSH算法输入的形式呈现。

boolean[] vector1 = new boolean[] {true, true, true, true, true};
boolean[] vector2 = new boolean[] {false, false, false, true, false};
boolean[] vector3 = new boolean[] {false, false, true, true, false};

Note that in a production application, the number of input vectors should be a lot higher to leverage the LSH algorithm, but for the sake of this demonstration, we will stick to three vectors only.

请注意,在生产应用中,为了利用LSH算法,输入向量的数量应该多得多,但为了这个演示,我们将坚持只使用三个向量。

It is important to note that first vector is vastly different from the second and third, whereas the second and third vectors are quite similar to each other.

值得注意的是,第一个向量与第二个和第三个向量有很大的不同,而第二个和第三个向量之间则相当相似。

Let’s create an instance of the LSHMinHash class. We need to pass the size of the input vectors to it – all input vectors should have equal size. We also need to specify how many hash buckets we want and how many stages of computation (iterations) LSH should perform:

让我们创建一个LSHMinHash类的实例。我们需要将输入向量的大小传递给它–所有的输入向量应该有相同的大小。我们还需要指定我们想要多少个哈希桶以及LSH应该执行多少个计算阶段(迭代)。

int sizeOfVectors = 5;
int numberOfBuckets = 10;
int stages = 4;

LSHMinHash lsh = new LSHMinHash(stages, numberOfBuckets, sizeOfVectors);

We specify that all vectors that will be hashed by the algorithms should be hashed among ten buckets. We also want to have four iterations of LSH for calculating hashes.

我们指定所有将被算法散列的向量应该在十个桶中散列。我们还希望有四次LSH的迭代来计算散列值。

To calculate the hash for each vector, we pass the vector to the hash() method:

为了计算每个向量的哈希值,我们将向量传递给hash()方法。

int[] firstHash = lsh.hash(vector1);
int[] secondHash = lsh.hash(vector2);
int[] thirdHash = lsh.hash(vector3);

System.out.println(Arrays.toString(firstHash));
System.out.println(Arrays.toString(secondHash));
System.out.println(Arrays.toString(thirdHash));

Running that code will result in output similar to:

运行该代码将导致类似的输出。

[0, 0, 1, 0]
[9, 3, 9, 8]
[1, 7, 8, 8]

Looking at each output array, we can see the hash values calculated at each of the four iterations for the corresponding input vector. The first line shows the hash results for the first vector, the second line for the second vector, and third line for the third vector.

观察每个输出数组,我们可以看到在四个迭代中的每个迭代为相应的输入向量计算的哈希值。第一行显示第一个向量的哈希结果,第二行显示第二个向量,第三行显示第三个向量。

After four iterations, the LSH yielded results as we expected – LSH calculated the same hash value (8) for the second and third vectors, which were similar to each other, and a different hash value (0) for the first vector, which was different from the second and third vectors.

经过四次迭代,LSH得出的结果与我们预期的一样–LSH为第二和第三向量计算了相同的哈希值(8),这两个向量彼此相似,而为第一个向量计算了不同的哈希值(0),这个向量与第二和第三向量不同。

LSH is an algorithm that is based on probability, so we cannot be sure that two similar vectors will land in the same hash bucket. Nevertheless, when we have a large enough number of input vectors, the algorithm yields results that will have a high probability for assigning similar vectors to the same buckets.

LSH是一种基于概率的算法,所以我们不能确定两个相似的向量会落在同一个哈希桶里。然而,当我们有足够多的输入向量时,该算法产生的结果将有很高的概率将相似的向量分配到相同的桶中

When we are dealing with massive data sets, LSH can be a handy algorithm.

当我们处理海量数据集时,LSH可以成为一种方便的算法。

5. Conclusion

5.结论

In this quick article, we looked at an application of the Locality-Sensitive Hashing algorithm and showed how to use it with the help of the java-lsh library.

在这篇快速文章中,我们看了一个对位置敏感的哈希算法的应用,并展示了如何在java-lsh库的帮助下使用它。

The implementation of all these examples and code snippets can be found in the GitHub project – this is a Maven project, so it should be easy to import and run as it is.

所有这些例子和代码片段的实现都可以在GitHub项目中找到–这是一个Maven项目,所以应该很容易导入并按原样运行。