1. Introduction
In this quick article, we’ll focus on converting programmatically between PDF files and other formats in Java.
More specifically, we’ll describe how to save PDFs as image files such as PNG or JPEG, convert PDFs to Microsoft Word documents, export them as HTML, and extract their text, using several open-source Java libraries.
2. Maven Dependencies
The first library we’ll look at is Pdf2Dom. Let’s start with the Maven dependencies we need to add to our project:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.25</version>
</dependency>
<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>2.0.1</version>
</dependency>
We’ll use the first dependency to load the selected PDF file; the second one handles the conversion itself. The latest versions of pdfbox-tools and pdf2dom can be found on Maven Central.
We’ll also use iText to extract the text from a PDF file and POI to create the .docx document.
Let’s take a look at the Maven dependencies that we need to include in our project:
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>com.itextpdf.tool</groupId>
    <artifactId>xmlworker</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>
The latest versions of iText and Apache POI can be found on Maven Central.
3. PDF and HTML Conversions
To work with HTML files, we’ll use Pdf2Dom, a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree can then be serialized to an HTML file or processed further.
To convert HTML to PDF, we’ll use XMLWorker, a library provided by iText.
3.1. PDF to HTML
Let’s have a look at a simple conversion from PDF to HTML:
private void generateHTMLFromPDF(String filename)
  throws IOException, ParserConfigurationException {
    // try-with-resources closes both the PDF document and the writer
    try (PDDocument pdf = PDDocument.load(new File(filename));
      Writer output = new PrintWriter("src/output/pdf.html", "utf-8")) {
        new PDFDomTree().writeText(pdf, output);
    }
}
In the code snippet above, we load the PDF file using the load API from PDFBox. With the PDF loaded, we use the parser to parse the file and write the result to the output specified by a java.io.Writer.
Note that converting PDF to HTML never gives a 100%, pixel-perfect result. The outcome depends on the complexity and structure of the particular PDF file.
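If we’d rather post-process the result than write it straight to a file, one option is to work on a W3C DOM tree and serialize it ourselves. The sketch below uses only the JDK’s XML APIs; the DomSerializer class and the small sampleDom() stand-in document are hypothetical names introduced for illustration, and a DOM obtained from a PDF parser would simply take the place of the sample:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: serializes any W3C DOM document to an HTML string.
final class DomSerializer {

    static String toHtml(Document dom) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Ask the serializer for HTML output rather than XML
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize DOM", e);
        }
    }

    // Stand-in for a DOM tree produced elsewhere (assumption for this sketch).
    static Document sampleDom() {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
              .newDocumentBuilder().newDocument();
            Element html = dom.createElement("html");
            Element body = dom.createElement("body");
            Element p = dom.createElement("p");
            p.setTextContent("Hello PDF");
            body.appendChild(p);
            html.appendChild(body);
            dom.appendChild(html);
            return dom;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
    }
}
```

Before calling toHtml(...), the tree can be modified with the usual DOM methods, for example to strip or rewrite elements.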
3.2. HTML to PDF
Now, let’s have a look at conversion from HTML to PDF:
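private static void generatePDFFromHTML(String filename)
  throws DocumentException, IOException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document,
      new FileOutputStream("src/output/html.pdf"));
    document.open();
    // close the HTML input stream even if parsing fails
    try (InputStream html = new FileInputStream(filename)) {
        XMLWorkerHelper.getInstance().parseXHtml(writer, document, html);
    }
    // closing the document also closes the writer and its output stream
    document.close();
}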
Note that when converting HTML to PDF, we need to ensure that the HTML has all tags properly opened and closed; otherwise the PDF will not be created. The upside of this approach is that the PDF closely follows the content and structure of the HTML file.
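Since unbalanced tags make the conversion fail, it can help to check the input before handing it over. The following is a minimal sketch that uses the JDK’s own XML parser; the XhtmlCheck class is a hypothetical helper, not part of iText, and the check only covers XML well-formedness (a document that relies on HTML-only entities such as &nbsp; without declaring them would be reported as invalid):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical pre-flight check for XHTML input.
final class XhtmlCheck {

    // Returns true when every tag is properly opened and closed.
    static boolean isWellFormed(String xhtml) {
        try {
            DocumentBuilder builder =
              DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
              xhtml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }
}
```

Calling XhtmlCheck.isWellFormed(...) first lets us report a clear error message instead of discovering the problem halfway through PDF generation.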
4. PDF to Image Conversions
There are many ways of converting PDF files to images. One of the most popular solutions is Apache PDFBox, an open-source Java library for working with PDF documents. For image to PDF conversion, we’ll use iText again.
4.1. PDF to Image
To start converting PDFs to images, we need the pdfbox-tools dependency mentioned in the Maven Dependencies section.
Let’s take a look at the code example:
private void generateImageFromPDF(String filename, String extension)
  throws IOException {
    // try-with-resources closes the document even if rendering fails
    try (PDDocument document = PDDocument.load(new File(filename))) {
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        // page indexes are zero-based; each page is rendered on its own
        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            BufferedImage bim = pdfRenderer.renderImageWithDPI(
              page, 300, ImageType.RGB);
            ImageIOUtil.writeImage(
              bim, String.format("src/output/pdf-%d.%s", page + 1, extension), 300);
        }
    }
}
There are a few important parts in the code above. We need to use PDFRenderer to render the PDF as a BufferedImage. Also, each page of the PDF file needs to be rendered separately.
Finally, we use ImageIOUtil, from Apache PDFBox Tools, to write an image with the extension that we specify. Possible file formats are jpeg, jpg, gif, tiff or png.
Note that Apache PDFBox is an advanced tool: we can create our own PDF files from scratch, fill in forms inside a PDF file, and sign and/or encrypt the PDF file.
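Because the extension is passed in as a plain string, a typo only surfaces when the image is written. As a rough sketch, the output name could be built by a small helper that rejects anything outside the formats listed above; the PageImageNames class is a hypothetical name, and the output pattern is the one used in the example:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that builds per-page output names.
final class PageImageNames {

    // The formats listed in the text above
    private static final List<String> SUPPORTED =
      Arrays.asList("jpeg", "jpg", "gif", "tiff", "png");

    // Takes the zero-based page index used by the renderer
    // and returns a one-based file name.
    static String forPage(int zeroBasedPage, String extension) {
        String ext = extension.toLowerCase(Locale.ROOT);
        if (!SUPPORTED.contains(ext)) {
            throw new IllegalArgumentException(
              "Unsupported image format: " + extension);
        }
        return String.format("src/output/pdf-%d.%s", zeroBasedPage + 1, ext);
    }
}
```

With this in place, PageImageNames.forPage(page, extension) could replace the inline String.format call in the loop.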
4.2. Image to PDF
Let’s take a look at the code example:
private static void generatePDFFromImage(String filename, String extension)
  throws IOException, DocumentException {
    Document document = new Document();
    String input = filename + "." + extension;
    String output = "src/output/" + extension + ".pdf";
    FileOutputStream fos = new FileOutputStream(output);
    PdfWriter writer = PdfWriter.getInstance(document, fos);
    writer.open();
    document.open();
    // input must be a complete URL, including the protocol (file:, https:, ...)
    document.add(Image.getInstance(new URL(input)));
    document.close();
    writer.close();
}
Please note that we can provide the image as a file or load it from a URL, as shown in the example above. Moreover, the extensions of the input image that we can use are jpeg, jpg, gif, tiff or png.
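One practical detail: java.net.URL needs a protocol, so a bare path such as images/logo.png cannot be passed to the URL constructor directly. A minimal sketch of a helper that accepts both forms is shown below; ImageLocations is a hypothetical name, and the rule "two or more scheme characters followed by a colon" is an assumption chosen so that Windows drive letters are still treated as paths:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.nio.file.Paths;

// Hypothetical helper: turns a URL string or a local path into a URL.
final class ImageLocations {

    static URL toUrl(String location) {
        try {
            // Something like "https:" or "file:" at the start means it is already a URL
            if (location.matches("^[a-zA-Z][a-zA-Z0-9+.-]+:.*")) {
                return URI.create(location).toURL();
            }
            // Otherwise treat it as a path on the local file system
            return Paths.get(location).toAbsolutePath().toUri().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable location: " + location, e);
        }
    }
}
```

The returned URL can then be handed to Image.getInstance(...) as in the example above.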
5. PDF to Text Conversions
To extract the raw text out of a PDF file, we’ll use Apache PDFBox again. For text to PDF conversion, we’re going to use iText.
5.1. PDF to Text
We created a method named generateTxtFromPDF(…) and divided it into three main parts: loading the PDF file, extracting the text, and creating the final file.
Let’s start with the loading part:
File f = new File(filename);
String parsedText;
PDFParser parser = new PDFParser(new RandomAccessFile(f, "r"));
parser.parse();
To read a PDF file, we pass PDFParser a RandomAccessFile opened with the “r” (read) option. We then call the parser.parse() method, which parses the PDF as a stream and populates the COSDocument object.
Let’s take a look at the text extraction part:
COSDocument cosDoc = parser.getDocument();
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = new PDDocument(cosDoc);
parsedText = pdfStripper.getText(pdDoc);
In the first line, we save the COSDocument in the cosDoc variable. It is then used to construct a PDDocument, which is the in-memory representation of the PDF document. Finally, we use PDFTextStripper to return the raw text of the document. After all of these operations, we need to call the close() method to close all the streams we used.
In the last part, we’ll save the text into a newly created file using a simple Java PrintWriter:
PrintWriter pw = new PrintWriter("src/output/pdf.txt");
pw.print(parsedText);
pw.close();
Please note that you cannot preserve formatting in a plain text file because it contains text only.
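The PrintWriter constructor used above writes with the default charset and fails if the src/output folder doesn’t exist yet. As an alternative sketch, the text could be saved with java.nio and an explicit UTF-8 encoding; the TextFiles class and its save method are hypothetical names introduced for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for writing extracted text to disk.
final class TextFiles {

    static void save(String text, Path target) throws IOException {
        // Create the output folder first if it is missing
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // Encode explicitly instead of relying on the default charset
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

For the example above, the call would look like TextFiles.save(parsedText, Paths.get("src/output/pdf.txt")).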
5.2. Text to PDF
Converting text files to PDF is a bit tricky. To maintain the file formatting, we’d need to apply additional rules.
In the following example, we don’t take the formatting of the file into consideration.
First, we need to define the page size of the PDF file, its version, and the output file. Let’s have a look at the code example:
Document pdfDoc = new Document(PageSize.A4);
PdfWriter.getInstance(pdfDoc, new FileOutputStream("src/output/txt.pdf"))
.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
pdfDoc.open();
In the next step, we’ll define the font and also the command that is used to generate a new paragraph:
Font myfont = new Font();
myfont.setStyle(Font.NORMAL);
myfont.setSize(11);
pdfDoc.add(new Paragraph("\n"));
Finally, we’re going to add paragraphs to the newly created PDF file:
BufferedReader br = new BufferedReader(new FileReader(filename));
String strLine;
while ((strLine = br.readLine()) != null) {
    Paragraph para = new Paragraph(strLine + "\n", myfont);
    para.setAlignment(Element.ALIGN_JUSTIFIED);
    pdfDoc.add(para);
}
pdfDoc.close();
br.close();
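The loop above turns every line of the text file into its own paragraph. As an illustration of the kind of additional rule mentioned at the start of this section, the sketch below merges consecutive lines and starts a new paragraph only at a blank line. This rule and the ParagraphGrouper class are assumptions made for the example, not something iText provides:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-processing step for the text file's lines.
final class ParagraphGrouper {

    // Joins wrapped lines with a space; a blank line ends the paragraph.
    static List<String> toParagraphs(List<String> lines) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        // Flush the last paragraph if the file doesn't end with a blank line
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

Each returned string could then be wrapped in a Paragraph and added to the document in place of the individual lines.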
6. PDF to Docx Conversions
Creating a PDF file from a Word document is not easy, and we won’t cover this topic here. We recommend third-party libraries for this, such as jWordConvert.
To create a Microsoft Word file from a PDF, we’ll need two libraries, both of which are open source. The first one is iText, which we use to extract the text from a PDF file. The second one is POI, which we use to create the .docx document.
Let’s take a look at the code snippet for the PDF loading part:
XWPFDocument doc = new XWPFDocument();
String pdf = filename;
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
After loading the PDF, we need to read and render each page separately in a loop, and then write to the output file:
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    TextExtractionStrategy strategy =
      parser.processContent(i, new SimpleTextExtractionStrategy());
    String text = strategy.getResultantText();
    XWPFParagraph p = doc.createParagraph();
    XWPFRun run = p.createRun();
    run.setText(text);
    run.addBreak(BreakType.PAGE);
}
FileOutputStream out = new FileOutputStream("src/output/pdf.docx");
doc.write(out);
// Close all open files
out.close();
reader.close();
doc.close();
Please note that with the SimpleTextExtractionStrategy extraction strategy, we’ll lose all formatting rules. To improve on this, experiment with the other extraction strategies that iText offers to build a more complete solution.
7. PDF to X Commercial Libraries
In previous sections, we described open-source libraries. There are a few more libraries worth noting, but they are paid:
- jPDFImages – jPDFImages can create images from pages in a PDF document and export them as JPEG, TIFF, or PNG images.
- JPedal – JPedal is an actively developed and very capable native Java PDF library SDK used for printing, viewing and converting files.
- pdfcrowd – another Web/HTML to PDF and PDF to Web/HTML conversion library, with an advanced GUI.
8. Conclusion
In this article, we discussed ways to convert PDF files into various formats.
The full implementation of this tutorial can be found in the GitHub project, which is Maven-based. To test it, simply run the examples and check the results in the output folder.