Content Analysis with Apache Tika

Last modified: March 13, 2018

1. Overview

Apache Tika is a toolkit for extracting content and metadata from various types of documents, such as Word, Excel, and PDF, or even multimedia files like JPEG and MP4.

All text-based and multimedia files can be parsed using a common interface, making Tika a powerful and versatile library for content analysis.

In this article, we’ll give an introduction to Apache Tika, including its parsing API and how it automatically detects the content type of a document. Working examples will also be provided to illustrate operations of this library.

2. Getting Started

In order to parse documents using Apache Tika, we need only one Maven dependency:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>

The latest version of this artifact can be found on Maven Central.

3. The Parser API

The Parser API is the heart of Apache Tika, abstracting away the complexity of the parsing operations. This API relies on a single method:

void parse(
  InputStream stream, 
  ContentHandler handler, 
  Metadata metadata, 
  ParseContext context) 
  throws IOException, SAXException, TikaException

The meanings of this method’s parameters are:

  • stream: an InputStream instance created from the document to be parsed
  • handler: a ContentHandler object receiving a sequence of XHTML SAX events parsed from the input document; this handler then processes the events and exports the result in a particular form
  • metadata: a Metadata object conveying metadata properties in and out of the parser
  • context: a ParseContext instance carrying context-specific information, used to customize the parsing process

The parse method throws an IOException if it fails to read from the input stream, a TikaException if the document taken from the stream cannot be parsed and a SAXException if the handler is unable to process an event.
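
To make the role of the handler parameter more concrete, here is a minimal sketch that uses only the JDK's SAX classes rather than Tika itself. The class name TextCollectingHandler and the helper textOf are hypothetical. The JDK SAX parser stands in for a Tika parser as the source of events, and the handler simply accumulates character events, which is roughly what a text-collecting handler does with the XHTML events it receives:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text carried by SAX character events.
final class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called once per chunk of character data found between tags
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Hypothetical helper: the JDK SAX parser plays the role of the event
    // source here; with Tika, a Parser implementation would emit the events.
    static String textOf(String xhtml)
      throws ParserConfigurationException, SAXException, IOException {
        TextCollectingHandler handler = new TextCollectingHandler();
        SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }
}
```

If the markup passed to textOf is not well-formed, the SAX parser reports the problem as a SAXException, the same exception type that appears in the signature of the parse method.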

When parsing a document, Tika attempts to reuse existing parser libraries such as Apache POI or PDFBox as much as possible. As a result, most of the Parser implementation classes are just adapters to such external libraries.

In section 5, we’ll see how the handler and metadata parameters can be used to extract content and metadata of a document.

For convenience, we can use the facade class Tika to access the functionality of the Parser API.

4. Auto-Detection

Apache Tika can automatically detect the type of a document and its language based on the document itself rather than on additional information.

4.1. Document Type Detection

The detection of document types can be done using an implementation class of the Detector interface, which has a single method:

MediaType detect(java.io.InputStream input, Metadata metadata) 
  throws IOException

This method takes a document and its associated metadata, then returns a MediaType object describing the best guess regarding the type of the document.

Metadata isn’t the only source of information on which a detector relies. The detector can also make use of magic bytes, a special pattern near the beginning of a file, or delegate the detection process to a more suitable detector.

In fact, the algorithm used by the detector is implementation-dependent.

For instance, the default detector works with magic bytes first, then metadata properties. If the content type hasn’t been found at this point, it will use the service loader to discover all available detectors and try them in turn.
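
The ordering described above can be sketched with plain JDK code. This is not Tika's implementation: the class DetectionChainSketch, its Step interface, and the two tiny rules are assumptions made up for illustration. The sketch only shows the idea of trying the leading bytes first, falling back to metadata (here, just a file name), and returning a generic type when nothing matches:

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch of a detection chain; not how Tika is implemented.
final class DetectionChainSketch {
    static final String UNKNOWN = "application/octet-stream";

    // One step of the chain: returns a guess, or empty if it has no opinion
    private interface Step {
        Optional<String> guess(byte[] head, String fileName);
    }

    // Steps are tried in this order: magic bytes first, then metadata
    private static final Step[] CHAIN = {
        DetectionChainSketch::byMagicBytes,
        DetectionChainSketch::byFileName
    };

    static String detect(byte[] head, String fileName) {
        for (Step step : CHAIN) {
            Optional<String> guess = step.guess(head, fileName);
            if (guess.isPresent()) {
                return guess.get();
            }
        }
        return UNKNOWN;
    }

    private static Optional<String> byMagicBytes(byte[] head, String fileName) {
        if (startsWith(head, "%PDF")) {
            return Optional.of("application/pdf");
        }
        return Optional.empty();
    }

    private static Optional<String> byFileName(byte[] head, String fileName) {
        if (fileName != null
          && fileName.toLowerCase(Locale.ROOT).endsWith(".txt")) {
            return Optional.of("text/plain");
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, String prefix) {
        byte[] expected = prefix.getBytes(StandardCharsets.US_ASCII);
        if (data.length < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[i] != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the magic-bytes step runs first, a PDF whose name ends in .txt is still reported as application/pdf; the file name is consulted only when the content gives no answer.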

4.2. Language Detection

In addition to the type of a document, Tika can also identify its language even without help from metadata information.

In previous releases of Tika, the language of the document was detected using a LanguageIdentifier instance.

However, LanguageIdentifier has been deprecated in favor of web services, a change that the Getting Started docs do not make clear.

Language detection services are now provided via subtypes of the abstract class LanguageDetector. Using web services, we can also access fully-fledged online translation services, such as Google Translate or Microsoft Translator.

For the sake of brevity, we won’t go over those services in detail.

5. Tika in Action

This section illustrates Apache Tika features using working examples.

The example methods are wrapped in a class:

public class TikaAnalysis {
    // illustration methods
}

5.1. Detecting Document Types

Here’s the code we can use to detect the type of a document read from an InputStream:

public static String detectDocTypeUsingDetector(InputStream stream) 
  throws IOException {
    Detector detector = new DefaultDetector();
    Metadata metadata = new Metadata();

    MediaType mediaType = detector.detect(stream, metadata);
    return mediaType.toString();
}

Assume we have a PDF file named tika.txt on the classpath. The file's extension has been changed to try to trick our analysis tool. A test confirms that the real type of the document is still found:

@Test
public void whenUsingDetector_thenDocumentTypeIsReturned() 
  throws IOException {
    // try-with-resources closes the stream even if the assertion fails
    try (InputStream stream = this.getClass().getClassLoader()
      .getResourceAsStream("tika.txt")) {
        String mediaType = TikaAnalysis.detectDocTypeUsingDetector(stream);

        assertEquals("application/pdf", mediaType);
    }
}

It’s clear that a wrong file extension can’t keep Tika from finding the correct media type, thanks to the magic bytes %PDF at the start of the file.
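
As an illustration of why the file name does not matter here, the following sketch checks for the %PDF marker by hand using only JDK classes. The class and method names are hypothetical, and a real detector consults far more signatures than this single one. The sketch also uses mark and reset so that the caller can still read the stream from the beginning afterwards, which is why a detector expects a stream that supports marks:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: peeks at the first four bytes without consuming them.
final class MagicBytesPeek {
    private static final int MAGIC_LENGTH = 4;

    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Stream must support mark/reset");
        }
        in.mark(MAGIC_LENGTH);
        try {
            byte[] head = new byte[MAGIC_LENGTH];
            int read = 0;
            // read() may return fewer bytes than requested, so loop
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            return read == MAGIC_LENGTH
              && "%PDF".equals(new String(head, StandardCharsets.US_ASCII));
        } finally {
            // Rewind so later consumers see the stream from the start
            in.reset();
        }
    }
}
```

A stream that does not support marks, such as a raw FileInputStream, can be wrapped in a BufferedInputStream before being passed to a method like this.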

For convenience, we can rewrite the detection code using the Tika facade class with the same result:

public static String detectDocTypeUsingFacade(InputStream stream) 
  throws IOException {
 
    Tika tika = new Tika();
    String mediaType = tika.detect(stream);
    return mediaType;
}

5.2. Extracting Content

Let’s now extract the content of a file and return the result as a String using the Parser API:

public static String extractContentUsingParser(InputStream stream) 
  throws IOException, TikaException, SAXException {
 
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    parser.parse(stream, handler, metadata, context);
    return handler.toString();
}
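
One detail is worth knowing before applying this method to large documents: according to its documentation, the no-argument BodyContentHandler constructor buffers at most 100,000 characters and signals a SAXException once that limit is exceeded. The variant below is a sketch, with a hypothetical class and method name, that passes -1 as the write limit to disable the bound. It assumes the tika-parsers dependency from section 2 is on the classpath:

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical variant of the extraction method without a write limit.
final class UnlimitedContentExtraction {

    static String extractAll(InputStream stream)
      throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        // -1 disables the default cap on the number of buffered characters
        BodyContentHandler handler = new BodyContentHandler(-1);

        parser.parse(stream, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }
}
```

Removing the limit means the whole text is held in memory, so this trade-off is only reasonable when the input size is known to be manageable.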

Given a Microsoft Word file in the classpath with this content:

Apache Tika - a content analysis toolkit
The Apache Tika™ toolkit detects and extracts metadata and text ...

The content can be extracted and verified:

@Test
public void whenUsingParser_thenContentIsReturned() 
  throws IOException, TikaException, SAXException {
    // try-with-resources closes the stream even if an assertion fails
    try (InputStream stream = this.getClass().getClassLoader()
      .getResourceAsStream("tika.docx")) {
        String content = TikaAnalysis.extractContentUsingParser(stream);

        assertThat(content, 
          containsString("Apache Tika - a content analysis toolkit"));
        assertThat(content, 
          containsString("detects and extracts metadata and text"));
    }
}

Again, the Tika class can be used to write the code more conveniently:

public static String extractContentUsingFacade(InputStream stream) 
  throws IOException, TikaException {
 
    Tika tika = new Tika();
    String content = tika.parseToString(stream);
    return content;
}

5.3. Extracting Metadata

In addition to the content of a document, the Parser API can also extract metadata:

public static Metadata extractMetadatatUsingParser(InputStream stream) 
  throws IOException, SAXException, TikaException {
 
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    parser.parse(stream, handler, metadata, context);
    return metadata;
}

When a Microsoft Excel file exists in the classpath, this test case confirms that the extracted metadata is correct:

@Test
public void whenUsingParser_thenMetadataIsReturned() 
  throws IOException, TikaException, SAXException {
    // try-with-resources closes the stream even if an assertion fails
    try (InputStream stream = this.getClass().getClassLoader()
      .getResourceAsStream("tika.xlsx")) {
        Metadata metadata = TikaAnalysis.extractMetadatatUsingParser(stream);

        assertEquals("org.apache.tika.parser.DefaultParser", 
          metadata.get("X-Parsed-By"));
        assertEquals("Microsoft Office User", metadata.get("Author"));
    }
}

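Finally, here’s another version of the extraction method using the Tika facade class. The facade's parse method returns a Reader over the extracted text, which this method ignores because only the metadata is of interest: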

public static Metadata extractMetadatatUsingFacade(InputStream stream) 
  throws IOException, TikaException {
    Tika tika = new Tika();
    Metadata metadata = new Metadata();

    tika.parse(stream, metadata);
    return metadata;
}

6. Conclusion

This tutorial focused on content analysis with Apache Tika. Using the Parser and Detector APIs, we can automatically detect the type of a document, as well as extract its content and metadata.

For advanced use cases, we can create custom Parser and Detector classes to have more control over the parsing process.
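
As a rough idea of what a custom detector might look like, the sketch below implements the Detector interface from section 4.1 for an invented file format. Everything specific to that format is an assumption made for illustration: the class name AcmeDetector, the ACME signature, and the application/x-acme media type do not come from Tika. The code assumes the tika-parsers dependency from section 2 is on the classpath:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

// Hypothetical detector for an invented format whose files start with "ACME".
final class AcmeDetector implements Detector {
    private static final long serialVersionUID = 1L;

    private static final byte[] MAGIC =
      "ACME".getBytes(StandardCharsets.US_ASCII);
    static final MediaType ACME_TYPE = MediaType.application("x-acme");

    @Override
    public MediaType detect(InputStream input, Metadata metadata)
      throws IOException {
        if (input == null) {
            return MediaType.OCTET_STREAM;
        }
        // The stream is expected to support marks, so we can peek and rewind
        input.mark(MAGIC.length);
        try {
            byte[] head = new byte[MAGIC.length];
            int read = 0;
            while (read < head.length) {
                int n = input.read(head, read, head.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            boolean matches = read == MAGIC.length && Arrays.equals(head, MAGIC);
            // Fall back to the generic type when the signature is absent
            return matches ? ACME_TYPE : MediaType.OCTET_STREAM;
        } finally {
            input.reset();
        }
    }
}
```

Returning MediaType.OCTET_STREAM when the signature is missing leaves room for other detectors to supply a more specific answer.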

The complete source code for this tutorial can be found over on GitHub.
