Optical Character Recognition with Tesseract

Last modified: March 15, 2020

1. Overview

With the advancement of technology in AI and machine learning, we require tools to recognize text within images.

In this tutorial, we’ll explore Tesseract, an optical character recognition (OCR) engine, with a few examples of image-to-text processing.

2. Tesseract

Tesseract is an open-source OCR engine, originally developed by HP, that recognizes more than 100 languages, including ideographic and right-to-left languages. We can also train Tesseract to recognize other languages.

It contains two OCR engines for image processing: an LSTM (Long Short-Term Memory) OCR engine and a legacy OCR engine that works by recognizing character patterns.

The OCR engine uses the Leptonica library to open the images and supports various output formats like plain text, hOCR (HTML for OCR), PDF, and TSV.

3. Setup

Tesseract is available for download/install on all major operating systems.

For example, if we’re using macOS, we can install the OCR engine using Homebrew:

brew install tesseract

We’ll observe that the package contains a set of language data files, like English, and orientation and script detection (OSD), by default:

==> Installing tesseract 
==> Downloading https://homebrew.bintray.com/bottles/tesseract-4.1.1.high_sierra.bottle.tar.gz
==> Pouring tesseract-4.1.1.high_sierra.bottle.tar.gz
==> Caveats
This formula contains only the "eng", "osd", and "snum" language data files.
If you need any other supported languages, run `brew install tesseract-lang`.
==> Summary
/usr/local/Cellar/tesseract/4.1.1: 65 files, 29.9MB

However, we can install the tesseract-lang module for support of other languages:

brew install tesseract-lang

On RPM-based Linux distributions, we can install Tesseract using the yum command:

yum install tesseract

Likewise, let’s add language support:

yum install tesseract-langpack-eng
yum install tesseract-langpack-spa

Here, we’ve added the language-trained data for English and Spanish.

For Windows, we can get the installers from Tesseract at UB Mannheim.

4. Tesseract Command-Line

4.1. Run

We can use the Tesseract command-line tool to extract text from images.

For instance, let’s take a snapshot of our website:

Then, we’ll run the tesseract command to read the baeldung.png snapshot and write the text in the output.txt file:

tesseract baeldung.png output

The output.txt file will look like:

a REST with Spring Learn Spring (new!)
The canonical reference for building a production
grade API with Spring.
From no experience to actually building stuff.
y
Java Weekly Reviews

We can observe that Tesseract hasn’t processed the entire content of the image. That’s because the accuracy of the output depends on various parameters like the image quality, language, page segmentation, trained data, and engine used for image processing.

4.2. Language Support

By default, the OCR engine uses English when processing the images. However, we can declare the language by using the -l argument.

Let’s take a look at another example with multi-language text:

First, let’s process the image with the default English language:

tesseract multiLanguageText.png output

The output will look like:

Der ,.schnelle” braune Fuchs springt
iiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marron rapido salta sobre el perro
perezoso. A raposa marrom rapida
salta sobre 0 cao preguicoso.

Then, let’s process the image with the Portuguese language:

tesseract multiLanguageText.png output -l por

This time, the OCR engine also recognizes the accented letters:

Der ,.schnelle” braune Fuchs springt
iber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrón rápido salta sobre el perro
perezoso. A raposa marrom rápida
salta sobre o cão preguiçoso.

Similarly, we can declare a combination of languages:

tesseract multiLanguageText.png output -l spa+por

Here, the OCR engine will primarily use Spanish and then Portuguese for image processing. Note that the output can differ based on the order of the languages we specify.
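
When we build the -l value programmatically, the main thing to preserve is the order of the language codes. The following is a minimal sketch using only the Java standard library; LanguageArg is a hypothetical helper name, not part of Tesseract or Tess4J:

```java
import java.util.List;

// Hypothetical helper (not part of Tesseract or Tess4J): builds the value passed to -l.
class LanguageArg {

    // Joins the language codes with '+', keeping the caller's order,
    // since the order can influence the OCR output.
    static String of(List<String> languageCodes) {
        if (languageCodes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        return String.join("+", languageCodes);
    }

    public static void main(String[] args) {
        // Prints: spa+por
        System.out.println(LanguageArg.of(List.of("spa", "por")));
    }
}
```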

4.3. Page Segmentation Mode

Tesseract supports various page segmentation modes like OSD, automatic page segmentation, and sparse text.

We can declare the page segmentation mode by using the --psm argument with a value from 0 to 13:

tesseract multiLanguageText.png output --psm 1

Here, by setting the value to 1, we’ve selected automatic page segmentation with OSD for image processing.

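Let’s take a look at all the supported page segmentation modes, as listed by the tesseract --help-extra command in Tesseract 4.x:

0 - Orientation and script detection (OSD) only
1 - Automatic page segmentation with OSD
2 - Automatic page segmentation, but no OSD or OCR (not implemented)
3 - Fully automatic page segmentation, but no OSD (default)
4 - Assume a single column of text of variable sizes
5 - Assume a single uniform block of vertically aligned text
6 - Assume a single uniform block of text
7 - Treat the image as a single text line
8 - Treat the image as a single word
9 - Treat the image as a single word in a circle
10 - Treat the image as a single character
11 - Sparse text: find as much text as possible in no particular order
12 - Sparse text with OSD
13 - Raw line: treat the image as a single text line, bypassing Tesseract-specific hacks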
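
If we assemble the command from Java, we can validate the mode before invoking the tool. The sketch below only builds the argument list and assumes the 0 to 13 range stated above; PsmCommand is a hypothetical helper name, not an API of Tesseract or Tess4J:

```java
import java.util.List;

// Hypothetical helper (not part of Tesseract or Tess4J).
class PsmCommand {
    static final int MIN_PSM = 0;
    static final int MAX_PSM = 13;

    // Builds: tesseract <image> <outputBase> --psm <psm>
    static List<String> build(String image, String outputBase, int psm) {
        if (psm < MIN_PSM || psm > MAX_PSM) {
            throw new IllegalArgumentException("psm must be between 0 and 13, got " + psm);
        }
        return List.of("tesseract", image, outputBase, "--psm", Integer.toString(psm));
    }

    public static void main(String[] args) {
        // Prints: tesseract multiLanguageText.png output --psm 1
        System.out.println(String.join(" ", build("multiLanguageText.png", "output", 1)));
    }
}
```

Rejecting an out-of-range value early gives a clearer error than letting the command-line tool fail later.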

4.4. OCR Engine Mode

Similarly, we can use various engine modes like legacy and LSTM engine while processing the images.

For this, we can use the --oem argument with a value from 0 to 3:

tesseract multiLanguageText.png output --oem 1

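The OCR engine modes, as listed by the tesseract --help-extra command in Tesseract 4.x, are:

0 - Legacy engine only
1 - Neural nets LSTM engine only
2 - Legacy + LSTM engines
3 - Default, based on what is available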

4.5. Tessdata

Tesseract provides two sets of trained data for the LSTM OCR engine: the best trained LSTM models and fast integer versions of the trained LSTM models.

The former provides better accuracy, and the latter offers better speed in image processing.

Also, Tesseract provides combined trained data that supports both the legacy and LSTM OCR engines.

If we use the legacy OCR engine without providing the supporting trained data, Tesseract will throw an error:

Error: Tesseract (legacy) engine requested, but components are not present in /usr/local/share/tessdata/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!

So, we should download the required .traineddata files and either keep them in the default tessdata location or declare the location using the --tessdata-dir argument:

tesseract multiLanguageText.png output --tessdata-dir /image-processing/tessdata
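
The same invocation can be prepared from Java with the standard ProcessBuilder class. This is only a sketch: TessdataCommand is a hypothetical helper, the paths are the illustrative ones used above, and the process is prepared but not started, so it doesn't need Tesseract installed:

```java
// Hypothetical helper (not part of Tesseract or Tess4J).
class TessdataCommand {

    // Prepares: tesseract <image> <outputBase> --tessdata-dir <dir> --oem <oem>
    static ProcessBuilder prepare(String image, String outputBase, String tessdataDir, int oem) {
        if (oem < 0 || oem > 3) {
            throw new IllegalArgumentException("oem must be between 0 and 3, got " + oem);
        }
        return new ProcessBuilder(
            "tesseract", image, outputBase,
            "--tessdata-dir", tessdataDir,
            "--oem", Integer.toString(oem));
    }

    public static void main(String[] args) {
        ProcessBuilder pb = prepare("multiLanguageText.png", "output", "/image-processing/tessdata", 1);
        // Prints the command that would be executed.
        System.out.println(String.join(" ", pb.command()));
        // To actually run it (requires tesseract on the PATH):
        // int exitCode = pb.inheritIO().start().waitFor();
    }
}
```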

4.6. Output

We can declare an argument to get the required output format.

For instance, to get searchable PDF output:

tesseract multiLanguageText.png output pdf

This will create the output.pdf file, which overlays a searchable text layer (the recognized text) on the provided image.

Similarly, for hOCR output:

tesseract multiLanguageText.png output hocr
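
In each case, Tesseract appends the extension to the output base name we pass in. When a Java program needs to locate the generated file, the mapping can be sketched as follows; OutputFiles is a hypothetical helper that covers only the formats mentioned in this article:

```java
import java.util.Map;

// Hypothetical helper (not part of Tesseract or Tess4J).
class OutputFiles {

    // Format argument given on the command line -> extension of the file that is written.
    private static final Map<String, String> EXTENSIONS =
        Map.of("pdf", ".pdf", "hocr", ".hocr", "tsv", ".tsv");

    // A null format means no format argument was given, so plain text is produced.
    static String fileName(String outputBase, String format) {
        if (format == null) {
            return outputBase + ".txt";
        }
        String extension = EXTENSIONS.get(format);
        if (extension == null) {
            throw new IllegalArgumentException("format not covered by this sketch: " + format);
        }
        return outputBase + extension;
    }

    public static void main(String[] args) {
        System.out.println(fileName("output", null));   // output.txt
        System.out.println(fileName("output", "pdf"));  // output.pdf
        System.out.println(fileName("output", "hocr")); // output.hocr
    }
}
```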

Also, we can use the tesseract --help and tesseract --help-extra commands for more information on the command-line usage.

5. Tess4J

Tess4J is a Java wrapper for the Tesseract APIs that provides OCR support for various image formats like JPEG, GIF, PNG, and BMP.

First, let’s add the latest tess4j Maven dependency to our pom.xml:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.1</version>
</dependency>

Then, we can use the Tesseract class provided by tess4j to process the image:

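import java.io.File;
import net.sourceforge.tess4j.Tesseract;

File image = new File("src/main/resources/images/multiLanguageText.png");
Tesseract tesseract = new Tesseract();
tesseract.setDatapath("src/main/resources/tessdata"); // directory holding the .traineddata files
tesseract.setLanguage("eng");
tesseract.setPageSegMode(1);   // automatic page segmentation with OSD
tesseract.setOcrEngineMode(1); // LSTM engine only
String result = tesseract.doOCR(image); // declared to throw TesseractException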

Here, we’ve set the value of the datapath to the directory location that contains osd.traineddata and eng.traineddata files.

Finally, we can verify the String output of the image processed:

Assert.assertTrue(result.contains("Der ,.schnelle” braune Fuchs springt"));
Assert.assertTrue(result.contains("salta sopra il cane pigro. El zorro"));

Additionally, we can use the setHocr method to get the HTML output:

tesseract.setHocr(true);

By default, the library processes the entire image. However, we can process a particular section of the image by using the java.awt.Rectangle object while calling the doOCR method:

// region with its top-left corner at (0, 0), 1200 wide and 200 high
result = tesseract.doOCR(image, new Rectangle(1200, 200));
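
Note that the two-argument Rectangle constructor takes a width and a height and places the region at the origin, while the four-argument one also takes the x and y offsets. A region that extends past the image edge is probably not what we intend, so one option is to clamp it first. Here's a minimal sketch using only java.awt.Rectangle; RegionUtil and the image dimensions are hypothetical:

```java
import java.awt.Rectangle;

// Hypothetical helper (not part of Tess4J).
class RegionUtil {

    // Returns the part of the requested region that lies inside an image
    // of the given size, whose bounds start at (0, 0).
    static Rectangle clampToImage(Rectangle region, int imageWidth, int imageHeight) {
        return region.intersection(new Rectangle(imageWidth, imageHeight));
    }

    public static void main(String[] args) {
        // Rectangle(width, height): top-left corner at (0, 0)
        Rectangle topStrip = new Rectangle(1200, 200);
        System.out.println(topStrip.x + "," + topStrip.y + " " + topStrip.width + "x" + topStrip.height);

        // Rectangle(x, y, width, height): partly outside a 1200x200 image
        Rectangle requested = new Rectangle(1000, 100, 400, 300);
        Rectangle clamped = clampToImage(requested, 1200, 200);
        System.out.println(clamped.x + "," + clamped.y + " " + clamped.width + "x" + clamped.height);
    }
}
```

The clamped rectangle can then be passed to doOCR in place of the original one.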

Similar to Tess4J, we can use Tesseract Platform to integrate Tesseract in Java applications. This is a JNI wrapper of the Tesseract APIs based on the JavaCPP Presets library.

6. Conclusion

In this article, we’ve explored the Tesseract OCR engine with a few examples of image processing.

First, we examined the tesseract command-line tool for processing images, along with a set of arguments like -l, --psm, and --oem.

Then, we explored tess4j, a Java wrapper to integrate Tesseract in Java applications.

As usual, all the code implementations are available over on GitHub.
