A Simple File Search with Lucene

Last modified: December 24, 2017

1. Overview

Apache Lucene is a full-text search engine library that can be used from various programming languages. To get started with Lucene, please refer to our introductory article.

In this quick article, we’ll index a text file and search sample Strings and text snippets within that file.

2. Maven Setup

Let’s add the necessary dependencies first:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.1.0</version>
</dependency>

The latest version can be found on Maven Central.

Also, for parsing our search queries, we’ll need:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>

Again, remember to check Maven Central for the latest version.

3. File System Directory

In order to index files, we’ll first need to create a file-system index.

Lucene provides the FSDirectory class to create a file system index:

Directory directory = FSDirectory.open(Paths.get(indexPath));

Here indexPath is the location of the directory. If the directory doesn’t exist, Lucene will create it.
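
To see the directory creation in isolation, here is a minimal sketch, assuming the Lucene 7.x API used in this article. The class name DirectoryCreationDemo and the helper openCreatesDirectory are hypothetical names introduced only for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

final class DirectoryCreationDemo {

    // Opens an index at indexPath and reports whether the directory exists afterwards.
    static boolean openCreatesDirectory(Path indexPath) throws IOException {
        try (Directory directory = FSDirectory.open(indexPath)) {
            return Files.isDirectory(indexPath);
        }
    }

    public static void main(String[] args) throws IOException {
        // A path that does not exist yet, below a fresh temporary directory
        Path indexPath = Files.createTempDirectory("lucene-demo").resolve("index");
        System.out.println("exists before open: " + Files.exists(indexPath));
        System.out.println("exists after open:  " + openCreatesDirectory(indexPath));
    }
}
```

The sketch uses a temporary directory so that it does not touch any real index location.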

Lucene provides three concrete implementations of the abstract FSDirectory class: SimpleFSDirectory, NIOFSDirectory, and MMapDirectory. Each of them might have special issues with a given environment.

For example, SimpleFSDirectory has poor concurrent performance as it blocks when multiple threads read from the same file.

Similarly, the NIOFSDirectory and MMapDirectory implementations face file-channel issues on Windows and memory-release problems, respectively.

To overcome such environment peculiarities, Lucene provides the FSDirectory.open() method. When invoked, it tries to choose the best implementation depending on the environment.
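
If we're curious which implementation was selected on a given machine, a small sketch like the following can print it. It assumes the Lucene 7.x API; the class name DirectoryChoiceDemo and the helper chosenImplementation are hypothetical, and the result depends on the JVM and operating system:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.store.FSDirectory;

final class DirectoryChoiceDemo {

    // Returns the simple class name of the implementation that FSDirectory.open() selected.
    static String chosenImplementation(Path indexPath) throws IOException {
        try (FSDirectory directory = FSDirectory.open(indexPath)) {
            return directory.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws IOException {
        Path indexPath = Files.createTempDirectory("lucene-choice");
        // Prints the name of the concrete FSDirectory subclass picked for this environment
        System.out.println(chosenImplementation(indexPath));
    }
}
```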

4. Index Text File

Once we’ve created the index directory, let’s go ahead and add a file to the index:

public void addFileToIndex(String filepath) throws IOException {

    Path path = Paths.get(filepath);
    File file = path.toFile();
    IndexWriterConfig indexWriterConfig
     = new IndexWriterConfig(analyzer);
    Directory indexDirectory = FSDirectory
      .open(Paths.get(indexPath));
    IndexWriter indexWriter = new IndexWriter(
      indexDirectory, indexWriterConfig);

    // try-with-resources closes the file reader once the document is indexed
    try (FileReader fileReader = new FileReader(file)) {
        Document document = new Document();

        // Analyzed full text; a Reader-based TextField is indexed but not stored
        document.add(
          new TextField("contents", fileReader));
        // Indexed as single, unanalyzed tokens and stored for retrieval
        document.add(
          new StringField("path", file.getPath(), Field.Store.YES));
        document.add(
          new StringField("filename", file.getName(), Field.Store.YES));

        indexWriter.addDocument(document);
    } finally {
        // Always release the index lock, even if indexing fails
        indexWriter.close();
    }
}

Here, we create a document with two StringFields named “path” and “filename” and a TextField called “contents”.

Note that we pass the fileReader instance as the second parameter to the TextField. The document is added to the index using the IndexWriter.

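The third argument of the StringField constructor indicates whether the value of the field will also be stored. The same applies to the String-based TextField constructor, whereas a TextField built from a Reader, as above, is never stored.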
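
The difference between stored and non-stored fields shows up when a document is read back from the index. The following is a minimal sketch, assuming the Lucene 7.x API; the class StoredFieldDemo, the helper indexAndReadBack, and the sample text are hypothetical and only serve as illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

final class StoredFieldDemo {

    // Indexes one document in a temporary index and loads it back by its document id.
    static Document indexAndReadBack() throws IOException {
        try (Directory directory = FSDirectory.open(Files.createTempDirectory("stored-demo"))) {
            try (IndexWriter writer = new IndexWriter(
                    directory, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document document = new Document();
                // Reader-based TextField: searchable, but its value is not stored
                document.add(new TextField("contents", new StringReader("lorem ipsum dolor")));
                // Field.Store.YES: the original value can be retrieved later
                document.add(new StringField("filename", "file1.txt", Field.Store.YES));
                writer.addDocument(document);
            }
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                return reader.document(0);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Document document = indexAndReadBack();
        System.out.println("filename: " + document.get("filename"));
        System.out.println("contents: " + document.get("contents"));
    }
}
```

Only the stored field comes back with a value; the "contents" field can still be searched, but document.get("contents") returns null because its text was never stored.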

Finally, we invoke close() on the IndexWriter to close it gracefully and release the lock on the index files.
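
To illustrate why releasing the lock matters, here is a hedged sketch, assuming the Lucene 7.x API and its default lock factory. The class WriterLockDemo and its helpers are hypothetical names. It tries to open a second IndexWriter on a directory while the first one is still open, and then opens one again after the first has been closed:

```java
import java.io.IOException;
import java.nio.file.Files;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

final class WriterLockDemo {

    private static IndexWriterConfig newConfig() {
        // A config instance cannot be shared between writers, so build a fresh one each time
        return new IndexWriterConfig(new StandardAnalyzer());
    }

    // Returns true if a second writer is refused while the first one is still open.
    static boolean secondWriterIsRejected(Directory directory) throws IOException {
        try (IndexWriter first = new IndexWriter(directory, newConfig())) {
            try (IndexWriter second = new IndexWriter(directory, newConfig())) {
                return false;
            } catch (LockObtainFailedException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (Directory directory = FSDirectory.open(Files.createTempDirectory("lock-demo"))) {
            System.out.println("second writer rejected: " + secondWriterIsRejected(directory));
            // The first writer has been closed by now, so the lock is free again
            try (IndexWriter writer = new IndexWriter(directory, newConfig())) {
                System.out.println("reopened after close: " + writer.isOpen());
            }
        }
    }
}
```

Since IndexWriter is Closeable, try-with-resources is one way to make sure close() runs even when indexing throws an exception.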

5. Search Indexed Files

Now let’s search the files we have indexed:

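public List<Document> searchFiles(String inField, String queryString)
  throws IOException, ParseException {
    Query query = new QueryParser(inField, analyzer)
      .parse(queryString);
    Directory indexDirectory = FSDirectory
      .open(Paths.get(indexPath));

    // Close the reader once the matching documents have been loaded
    try (IndexReader indexReader = DirectoryReader
      .open(indexDirectory)) {
        IndexSearcher searcher = new IndexSearcher(indexReader);
        // Fetch at most the 10 best-scoring hits
        TopDocs topDocs = searcher.search(query, 10);

        // searcher.doc() throws a checked IOException,
        // so we use a plain loop instead of a stream lambda
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            documents.add(searcher.doc(scoreDoc.doc));
        }
        return documents;
    }
}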

Let’s now test the functionality:

@Test
public void givenSearchQueryWhenFetchedFileNameThenCorrect() throws Exception {
    String indexPath = "/tmp/index";
    String dataPath = "/tmp/data/file1.txt";
    
    Directory directory = FSDirectory
      .open(Paths.get(indexPath));
    LuceneFileSearch luceneFileSearch 
      = new LuceneFileSearch(directory, new StandardAnalyzer());
    
    luceneFileSearch.addFileToIndex(dataPath);
    
    List<Document> docs = luceneFileSearch
      .searchFiles("contents", "consectetur");
    
    assertEquals("file1.txt", docs.get(0).get("filename"));
}

Notice how we’re creating a file-system index in the location indexPath and indexing file1.txt.

Then, we simply search for the String “consectetur” in the “contents” field.
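
Besides single terms, the classic QueryParser also understands quoted phrases and boolean operators, which is useful for searching text snippets. The following is a minimal sketch, assuming the Lucene 7.x API; the class QuerySyntaxDemo, the helper countHits, and the sample text are hypothetical:

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

final class QuerySyntaxDemo {

    // Indexes the text as a single document and returns how many documents match the query.
    static int countHits(String text, String queryString) throws IOException, ParseException {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (Directory directory = FSDirectory.open(Files.createTempDirectory("query-demo"))) {
            try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
                Document document = new Document();
                document.add(new TextField("contents", new StringReader(text)));
                writer.addDocument(document);
            }
            // Use the same analyzer for parsing the query as for indexing
            Query query = new QueryParser("contents", analyzer).parse(queryString);
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                return new IndexSearcher(reader).search(query, 10).scoreDocs.length;
            }
        }
    }

    public static void main(String[] args) throws IOException, ParseException {
        String text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit";
        System.out.println(countHits(text, "consectetur"));             // single term
        System.out.println(countHits(text, "\"ipsum dolor\""));         // exact phrase
        System.out.println(countHits(text, "\"dolor ipsum\""));         // phrase in the wrong order
        System.out.println(countHits(text, "consectetur AND missing")); // both terms required
    }
}
```

The term and the exact phrase each match the single document, while the reversed phrase and the AND query with an absent term match nothing.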

6. Conclusion

This article was a quick demonstration of indexing and searching text with Apache Lucene. To learn more about indexing, searching, and queries in Lucene, please refer to our introduction to Lucene article.

As always, the code for the examples can be found over on GitHub.
