Content Analysis with Apache Tika

Last modified: March 13, 2018


1. Overview


Apache Tika is a toolkit for extracting content and metadata from various types of documents, such as Word, Excel, and PDF, or even multimedia files like JPEG and MP4.

All text-based and multimedia files can be parsed using a common interface, making Tika a powerful and versatile library for content analysis.


In this article, we’ll give an introduction to Apache Tika, including its parsing API and how it automatically detects the content type of a document. Working examples will also be provided to illustrate operations of this library.

2. Getting Started


In order to parse documents using Apache Tika, we need only one Maven dependency:


The latest version of this artifact can be found on Maven Central.
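As a sketch, assuming the Tika 1.x line that was current when this article was last modified, the dependency would be declared roughly as follows. The tika-parsers artifact ID and the 1.17 version are assumptions here; newer Tika releases package the parsers under different artifact names, so check the release in use:

```xml
<!-- Assumed coordinates for a Tika 1.x setup; verify against the release you use -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>
```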


3. The Parser API

The Parser API is the heart of Apache Tika, abstracting away the complexity of the parsing operations. This API relies on a single method:

void parse(
  InputStream stream, 
  ContentHandler handler, 
  Metadata metadata, 
  ParseContext context) 
  throws IOException, SAXException, TikaException

The meanings of this method’s parameters are:


  • stream: an InputStream instance created from the document to be parsed
  • handler: a ContentHandler object receiving a sequence of XHTML SAX events parsed from the input document; this handler will then process events and export the result in a particular form
  • metadata: a Metadata object conveying metadata properties in and out of the parser
  • context: a ParseContext instance carrying context-specific information, used to customize the parsing process

The parse method throws an IOException if it fails to read from the input stream, a TikaException if the document taken from the stream cannot be parsed, and a SAXException if the handler is unable to process an event.
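To make the handler parameter more concrete, the following sketch uses only the JDK’s SAX classes, not Tika itself. BodyTextHandler and bodyTextOf are hypothetical names introduced for illustration; the class collects the character events that occur inside the XHTML body element, which is roughly the idea behind a handler that exports plain text:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only (hypothetical class, not part of Tika): a SAX handler
// that keeps the text found inside <body> and ignores everything else.
final class BodyTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean insideBody;

    @Override
    public void startElement(String uri, String localName,
      String qName, Attributes attributes) {
        if ("body".equals(qName)) {
            insideBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            insideBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character events may arrive in several chunks, so we append
        if (insideBody) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Feeds a well-formed XHTML string through the JDK SAX parser
    static String bodyTextOf(String xhtml) {
        try {
            BodyTextHandler handler = new BodyTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
              .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

In a real Tika call, the SAX events would come from the parser chosen for the document rather than from an XHTML string, but the handler side of the contract is the same: it only sees events and decides what to keep.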


When parsing a document, Tika attempts to reuse existing parser libraries such as Apache POI or PDFBox as much as possible. As a result, most of the Parser implementation classes are just adapters to such external libraries.

In section 5, we’ll see how the handler and metadata parameters can be used to extract content and metadata of a document.


For convenience, we can use the facade class Tika to access the functionality of the Parser API.

4. Auto-Detection


Apache Tika can automatically detect the type of a document and its language based on the document itself rather than on additional information.

4.1. Document Type Detection


The detection of document types can be done using an implementation class of the Detector interface, which has a single method:


MediaType detect(InputStream input, Metadata metadata)
  throws IOException

This method takes a document and its associated metadata, then returns a MediaType object describing its best guess at the document’s type.


Metadata isn’t the only source of information on which a detector relies. The detector can also make use of magic bytes, which are a special pattern near the beginning of a file, or delegate the detection process to a more suitable detector.


In fact, the algorithm used by the detector is implementation dependent.


For instance, the default detector works with magic bytes first, then metadata properties. If the content type hasn’t been found at this point, it will use the service loader to discover all available detectors and try them in turn.
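The idea of magic bytes can be sketched with the JDK alone. The following is a deliberately simplified illustration, not Tika’s actual detection logic: MagicByteSniffer and its methods are hypothetical names, and only two well-known signatures are checked. The stream variant marks and resets the stream so that the bytes it peeks at remain available to a later parsing step:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only (hypothetical class, not part of Tika)
final class MagicByteSniffer {
    private static final int PREFIX_LENGTH = 8;
    private static final byte[] PDF_MAGIC =
      "%PDF".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    // Maps the leading bytes of a document to a media type string
    static String sniff(byte[] prefix) {
        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(prefix, ZIP_MAGIC)) {
            return "application/zip";
        }
        // Generic fallback when no known signature matches
        return "application/octet-stream";
    }

    // Peeks at the start of the stream without consuming it
    static String sniffStream(InputStream stream) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("Stream must support mark/reset");
        }
        stream.mark(PREFIX_LENGTH);
        try {
            byte[] prefix = new byte[PREFIX_LENGTH];
            int total = 0;
            int read;
            while (total < PREFIX_LENGTH
              && (read = stream.read(prefix, total, PREFIX_LENGTH - total)) != -1) {
                total += read;
            }
            return sniff(Arrays.copyOf(prefix, total));
        } finally {
            stream.reset();
        }
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the decision depends only on the leading bytes, renaming a file does not change the result, which is why content-based detection is more robust than relying on the extension.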


4.2. Language Detection


In addition to the type of a document, Tika can also identify its language even without help from metadata information.


In previous releases of Tika, the language of a document was detected using a LanguageIdentifier instance.


However, LanguageIdentifier has been deprecated in favor of web services, which is not made clear in the Getting Started docs.

Language detection services are now provided via subtypes of the abstract class LanguageDetector. Using web services, you can also access fully-fledged online translation services, such as Google Translate or Microsoft Translator.


For the sake of brevity, we won’t go over those services in detail.


5. Tika in Action

This section illustrates Apache Tika features using working examples.

The illustration methods will be wrapped in a class:


public class TikaAnalysis {
    // illustration methods
}

5.1. Detecting Document Types


Here’s the code we can use to detect the type of a document read from an InputStream:


public static String detectDocTypeUsingDetector(InputStream stream)
  throws IOException {
    Detector detector = new DefaultDetector();
    Metadata metadata = new Metadata();

    // detect() expects a stream that supports mark/reset
    MediaType mediaType = detector.detect(stream, metadata);
    return mediaType.toString();
}

Assume we have a PDF file named tika.txt in the classpath. The extension of this file has been changed to try to trick our analysis tool. The real type of the document can still be found and confirmed by a test:


@Test
public void whenUsingDetector_thenDocumentTypeIsReturned()
  throws IOException {
    InputStream stream = this.getClass().getClassLoader()
      .getResourceAsStream("tika.txt");
    String mediaType = TikaAnalysis.detectDocTypeUsingDetector(stream);

    assertEquals("application/pdf", mediaType);
}


It’s clear that a wrong file extension can’t keep Tika from finding the correct media type, thanks to the magic bytes %PDF at the start of the file.


For convenience, we can re-write the detection code using the Tika facade class with the same result:

public static String detectDocTypeUsingFacade(InputStream stream)
  throws IOException {
    Tika tika = new Tika();
    String mediaType = tika.detect(stream);
    return mediaType;
}

5.2. Extracting Content


Let’s now use the Parser API to extract the content of a file and return it as a String:

public static String extractContentUsingParser(InputStream stream)
  throws IOException, TikaException, SAXException {
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    // The handler accumulates the body text as the parser emits SAX events
    parser.parse(stream, handler, metadata, context);
    return handler.toString();
}

Given a Microsoft Word file in the classpath with this content:

Apache Tika - a content analysis toolkit
The Apache Tika™ toolkit detects and extracts metadata and text ...

The content can be extracted and verified:


@Test
public void whenUsingParser_thenContentIsReturned()
  throws IOException, TikaException, SAXException {
    // "tika.docx" is an assumed resource name; the original name is not shown
    InputStream stream = this.getClass().getClassLoader()
      .getResourceAsStream("tika.docx");
    String content = TikaAnalysis.extractContentUsingParser(stream);

    assertThat(content,
      containsString("Apache Tika - a content analysis toolkit"));
    assertThat(content,
      containsString("detects and extracts metadata and text"));
}


Again, the Tika class can be used to write the code more conveniently:


public static String extractContentUsingFacade(InputStream stream)
  throws IOException, TikaException {
    Tika tika = new Tika();
    String content = tika.parseToString(stream);
    return content;
}

5.3. Extracting Metadata


In addition to the content of a document, the Parser API can also extract metadata:

public static Metadata extractMetadatatUsingParser(InputStream stream)
  throws IOException, SAXException, TikaException {
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    // The parser fills the Metadata object as a side effect of parsing
    parser.parse(stream, handler, metadata, context);
    return metadata;
}

When a Microsoft Excel file exists in the classpath, this test case confirms that the extracted metadata is correct:

@Test
public void whenUsingParser_thenMetadataIsReturned()
  throws IOException, TikaException, SAXException {
    // "tika.xlsx" is an assumed resource name; the original name is not shown
    InputStream stream = this.getClass().getClassLoader()
      .getResourceAsStream("tika.xlsx");
    Metadata metadata = TikaAnalysis.extractMetadatatUsingParser(stream);

    assertEquals("Microsoft Office User", metadata.get("Author"));
}


Finally, here’s another version of the extraction method using the Tika facade class:

public static Metadata extractMetadatatUsingFacade(InputStream stream)
  throws IOException, TikaException {
    Tika tika = new Tika();
    Metadata metadata = new Metadata();

    tika.parse(stream, metadata);
    return metadata;
}

6. Conclusion


This tutorial focused on content analysis with Apache Tika. Using the Parser and Detector APIs, we can automatically detect the type of a document, as well as extract its content and metadata.

For advanced use cases, we can create custom Parser and Detector classes to have more control over the parsing process.


The complete source code for this tutorial can be found over on GitHub.
