Intro to XPath with Java

Last modified: June 8, 2016

1. Overview

In this article we’re going to go over the basics of XPath using the support built into the standard Java JDK.

We are going to use a simple XML document, process it and see how to go over the document to extract the information we need from it.

XPath is a standard syntax recommended by the W3C: a set of expressions for navigating XML documents. You can find a full XPath reference here.

2. A Simple XPath Parser

import java.io.File;
import java.io.FileInputStream;

import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DefaultParser {

    private File file;

    public DefaultParser(File file) {
        this.file = file;
    }

    // Accessor used by the snippets below through this.getFile()
    public File getFile() {
        return file;
    }
}

Now let’s take a closer look at the elements you will find in the DefaultParser:

FileInputStream fileIS = new FileInputStream(this.getFile());
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/Tutorials/Tutorial";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

Let’s break that down:

DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();

We will use this object to produce a DOM object tree from our XML document.

DocumentBuilder builder = builderFactory.newDocumentBuilder();

Having an instance of this class, we can parse XML documents from many different input sources, such as an InputStream, a File, a URI string or a SAX InputSource.

Document xmlDocument = builder.parse(fileIS);

A Document (org.w3c.dom.Document) represents the entire XML document. It is the root of the document tree and provides our first access to the data.

XPath xPath = XPathFactory.newInstance().newXPath();

The XPath object lets us compile expressions and execute them against our document to extract what we need from it.

xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

We can compile an XPath expression passed as a string and define what kind of data we expect to receive, such as a NODESET, a NODE or a STRING.
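
To make the effect of the return type concrete, here is a minimal sketch that evaluates one compiled expression three times, once per return type. The class name XPathReturnTypesDemo, the summarize method and the inline XML string are assumptions made for illustration; they are not part of the article's DefaultParser.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathReturnTypesDemo {

    // Hypothetical inline sample, so the sketch does not depend on a file.
    static final String XML =
        "<Tutorials>"
        + "<Tutorial tutId=\"01\"><title>Guava</title></Tutorial>"
        + "<Tutorial tutId=\"02\"><title>XML</title></Tutorial>"
        + "</Tutorials>";

    // Evaluates the same compiled expression with three different return types.
    static String summarize(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            XPathExpression compiled = xPath.compile(expression);

            // NODESET: every matching node, returned as a NodeList
            NodeList all = (NodeList) compiled.evaluate(doc, XPathConstants.NODESET);
            // NODE: only the first matching node in document order
            Node first = (Node) compiled.evaluate(doc, XPathConstants.NODE);
            // STRING: the string value of the first matching node
            String text = (String) compiled.evaluate(doc, XPathConstants.STRING);

            return all.getLength() + " nodes, first=" + first.getNodeName() + ", text=" + text;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(summarize("/Tutorials/Tutorial/title"));
    }
}
```

The cast applied to the result has to match the requested constant: NodeList for NODESET, Node for NODE and String for STRING.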

3. Let’s Start

Now that we have looked at the base components we will use, let’s start with some code using some simple XML for testing purposes:

<?xml version="1.0"?>
<Tutorials>
    <Tutorial tutId="01" type="java">
        <title>Guava</title>
        <description>Introduction to Guava</description>
        <date>04/04/2016</date>
        <author>GuavaAuthor</author>
    </Tutorial>
    <Tutorial tutId="02" type="java">
        <title>XML</title>
        <description>Introduction to XPath</description>
        <date>04/05/2016</date>
        <author>XMLAuthor</author>
    </Tutorial>
</Tutorials>

3.1. Retrieve a Basic List of Elements

The first method is a simple use of an XPath expression to retrieve a list of nodes from the XML:

FileInputStream fileIS = new FileInputStream(this.getFile());
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/Tutorials/Tutorial";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

We can retrieve the tutorial list contained in the root node by using the expression above, or by using the expression “//Tutorial”. The latter retrieves every <Tutorial> element in the document, no matter where it is located, that is, at any level of the tree.

The NodeList returned when we pass NODESET to evaluate() as the return type is an ordered collection of nodes that can be accessed by passing an index as a parameter.
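
As a short sketch of that index-based access, the following walks a NodeList with getLength() and item(int) and collects the tutId attribute of each element. The class name NodeListIterationDemo, the tutorialIds method and the inline SAMPLE string are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeListIterationDemo {

    // Hypothetical inline copy of the article's sample, reduced to the titles.
    static final String SAMPLE =
        "<Tutorials>"
        + "<Tutorial tutId=\"01\" type=\"java\"><title>Guava</title></Tutorial>"
        + "<Tutorial tutId=\"02\" type=\"java\"><title>XML</title></Tutorial>"
        + "</Tutorials>";

    // Returns the tutId of every <Tutorial>, in document order.
    static List<String> tutorialIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodeList = (NodeList) xPath.compile("/Tutorials/Tutorial")
                .evaluate(doc, XPathConstants.NODESET);

            List<String> ids = new ArrayList<>();
            // NodeList is not Iterable, so we loop by index.
            for (int i = 0; i < nodeList.getLength(); i++) {
                // The expression selects elements only, so the cast is safe here.
                Element tutorial = (Element) nodeList.item(i);
                ids.add(tutorial.getAttribute("tutId"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read tutorial ids", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tutorialIds(SAMPLE));
    }
}
```

Because NodeList does not implement Iterable, a classic for loop over getLength() is the usual way to traverse it.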

3.2. Retrieving a Specific Node by Its ID

We can look for an element based on any given id just by filtering:

DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(this.getFile());
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/Tutorials/Tutorial[@tutId=" + "'" + id + "'" + "]";
Node node = (Node) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODE);

With this kind of expression we can filter for whatever element we need just by using the correct syntax. Such filter expressions are called predicates, and they are an easy way to locate specific data in a document, for example:

/Tutorials/Tutorial[1]

/Tutorials/Tutorial[last()]

/Tutorials/Tutorial[position()<4]

You can find a complete reference of predicates here.
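
The positional predicates can be tried out with a small sketch like the one below. It uses last(), which is a core XPath 1.0 function. The class name PredicateDemo, the titles method and the inline XML are assumptions for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PredicateDemo {

    // Hypothetical inline sample with the two tutorials from the article.
    static final String XML =
        "<Tutorials>"
        + "<Tutorial tutId=\"01\" type=\"java\"><title>Guava</title></Tutorial>"
        + "<Tutorial tutId=\"02\" type=\"java\"><title>XML</title></Tutorial>"
        + "</Tutorials>";

    // Returns the text of every node selected by the expression.
    static List<String> titles(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.compile(expression)
                .evaluate(doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // XPath positions start at 1, not 0.
        System.out.println(titles("/Tutorials/Tutorial[1]/title"));
        System.out.println(titles("/Tutorials/Tutorial[last()]/title"));
        System.out.println(titles("/Tutorials/Tutorial[position()<4]/title"));
        System.out.println(titles("/Tutorials/Tutorial[@tutId='02']/title"));
    }
}
```

Note that positions in XPath are 1-based, so Tutorial[1] is the first element, unlike the 0-based index used by NodeList.item().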

3.3. Retrieving Nodes by a Specific Tag Name

Now we’re going further by introducing axes. Let’s see how they work by using one in an XPath expression:

Document xmlDocument = builder.parse(this.getFile());
this.clean(xmlDocument);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//Tutorial[descendant::title[text()=" + "'" + name + "'" + "]]";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

With the expression used above, we are looking for every <Tutorial> element that has a descendant <title> whose text matches the value passed as a parameter in the “name” variable.

Following the sample xml provided for this article, we could look for a <title> containing the text “Guava” or “XML” and we will retrieve the whole <Tutorial> element with all its data.

Axes provide a very flexible way to navigate an XML document, and you can find the full documentation on the official site.
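
To illustrate a few more axes beyond descendant, here is a minimal sketch that also uses parent and following-sibling. The class name AxesDemo, the select method and the inline XML are hypothetical and exist only for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AxesDemo {

    // Hypothetical inline sample based on the article's XML.
    static final String XML =
        "<Tutorials>"
        + "<Tutorial tutId=\"01\"><title>Guava</title><author>GuavaAuthor</author></Tutorial>"
        + "<Tutorial tutId=\"02\"><title>XML</title><author>XMLAuthor</author></Tutorial>"
        + "</Tutorials>";

    // Returns the text content of every node (element or attribute) selected.
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.compile(expression)
                .evaluate(doc, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // descendant: tutorials that contain a matching <title> at any depth
        System.out.println(select("//Tutorial[descendant::title[text()='Guava']]/@tutId"));
        // parent: step from a <title> back up to its <Tutorial>
        System.out.println(select("//title[text()='XML']/parent::Tutorial/@tutId"));
        // following-sibling: the <author> that comes after a given <title>
        System.out.println(select("//title[text()='Guava']/following-sibling::author"));
    }
}
```

Each axis name is followed by a double colon and a node test, so an expression can move down, up or sideways from the node it has reached.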

3.4. Manipulating Data in Expressions

XPath also allows us to manipulate data within expressions if needed.

XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//Tutorial[number(translate(date, '/', '')) > " + date + "]";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

In this expression we pass our method a simple string as a date in the form “ddmmyyyy”, while the XML stores dates in the format “dd/mm/yyyy”. To compare the two, the expression uses translate(), one of the functions provided by XPath, to strip the slashes from each stored date, and number() to convert the result to a number. Keep in mind that this is a plain numeric comparison of the digits, not a calendar-aware one.
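
The following sketch runs that expression against an inline copy of the sample data. The class name DateFilterDemo, the titlesAfter method and the eight-digit input check are assumptions added for illustration; the expression itself is the one from the article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DateFilterDemo {

    // Hypothetical inline sample with the dates used in the article.
    static final String XML =
        "<Tutorials>"
        + "<Tutorial><title>Guava</title><date>04/04/2016</date></Tutorial>"
        + "<Tutorial><title>XML</title><date>04/05/2016</date></Tutorial>"
        + "</Tutorials>";

    // Returns the titles whose slash-free date is numerically greater than the input.
    static List<String> titlesAfter(String date) {
        // Assumption for this sketch: reject anything that is not eight digits,
        // since the value is concatenated into the expression.
        if (!date.matches("\\d{8}")) {
            throw new IllegalArgumentException("Expected eight digits: " + date);
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            // translate() removes the slashes; number() turns "04052016" into 4052016
            String expression =
                "//Tutorial[number(translate(date, '/', '')) > " + date + "]/title";
            NodeList nodes = (NodeList) xPath.compile(expression)
                .evaluate(doc, XPathConstants.NODESET);

            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                titles.add(nodes.item(i).getTextContent());
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titlesAfter("04042016"));
    }
}
```

Validating the input before concatenating it keeps arbitrary text out of the expression, which matters whenever the value comes from outside the program.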

3.5. Retrieving Elements from a Document With Namespace Defined

If our XML document has a namespace defined, as in the example_namespace.xml used here, the rules for retrieving the data we need change, since our XML starts like this:

<?xml version="1.0"?>
<Tutorials xmlns="/full_archive">

</Tutorials>

Now when we use an expression similar to “//Tutorial”, we are not going to get any result. That XPath expression returns all <Tutorial> elements that aren’t in any namespace, and in our new example_namespace.xml, all <Tutorial> elements are defined in the namespace /full_archive. (This assumes the document is parsed by a namespace-aware parser, that is, with setNamespaceAware(true) called on the DocumentBuilderFactory.)

Let’s see how to handle namespaces.

First of all, we need to set the namespace context so XPath knows where we are looking for our data:

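xPath.setNamespaceContext(new NamespaceContext() {
    // Reverse lookups are not needed for evaluating our expressions
    @Override
    public Iterator getPrefixes(String namespaceURI) {
        return null;
    }
    @Override
    public String getPrefix(String namespaceURI) {
        return null;
    }
    // Maps the prefix used in our expressions to the document's namespace
    @Override
    public String getNamespaceURI(String prefix) {
        if ("bdn".equals(prefix)) {
            return "/full_archive";
        }
        return null;
    }
});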

In the method above, we define “bdn” as the prefix for our namespace “/full_archive”, and from now on we need to add “bdn” to the XPath expressions used to locate elements:

String expression = "/bdn:Tutorials/bdn:Tutorial";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

Using the expression above we are able to retrieve all <Tutorial> elements in the namespace bound to the “bdn” prefix.
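
Putting the pieces together, a self-contained sketch of namespace-aware evaluation could look like this. The class name NamespaceDemo, the count method and the inline XML are assumptions for illustration. The sketch also assumes the factory is made namespace-aware with setNamespaceAware(true), and it returns XMLConstants.NULL_NS_URI for unknown prefixes instead of null.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;

import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceDemo {

    // Hypothetical inline document using the article's default namespace.
    static final String XML =
        "<Tutorials xmlns=\"/full_archive\">"
        + "<Tutorial tutId=\"01\"><title>Guava</title></Tutorial>"
        + "<Tutorial tutId=\"02\"><title>XML</title></Tutorial>"
        + "</Tutorials>";

    // Counts the nodes selected by the expression, with "bdn" bound to /full_archive.
    static int count(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without this, the parser does not record namespace information.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));

            XPath xPath = XPathFactory.newInstance().newXPath();
            xPath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "bdn".equals(prefix) ? "/full_archive" : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String namespaceURI) {
                    return "/full_archive".equals(namespaceURI) ? "bdn" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String namespaceURI) {
                    String prefix = getPrefix(namespaceURI);
                    return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singletonList(prefix).iterator();
                }
            });

            NodeList nodes = (NodeList) xPath.compile(expression)
                .evaluate(doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//Tutorial"));
        System.out.println(count("/bdn:Tutorials/bdn:Tutorial"));
    }
}
```

The prefix in the expression does not have to match anything written in the document; only the namespace URI it resolves to is compared.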

3.6. Avoiding Trouble with Empty Text Nodes

As you may have noticed, the code in section 3.3 of this article calls a new function right after parsing our XML into a Document object: this.clean(xmlDocument);

Sometimes when we iterate through elements, child nodes and so on, empty text nodes in our document can cause unexpected behavior in the results we get.

We called node.getFirstChild() while iterating over all <Tutorial> elements looking for the <title> information, but instead of what we were looking for we just got “#text”, an empty text node.

To fix the problem we can navigate through our document and remove those empty nodes, like this:

private void clean(Node node) {
    NodeList childs = node.getChildNodes();
    // Iterate backwards so that removing a child does not shift unvisited indexes
    for (int n = childs.getLength() - 1; n >= 0; n--) {
        Node child = childs.item(n);
        short nodeType = child.getNodeType();
        if (nodeType == Node.ELEMENT_NODE) {
            // Recurse into nested elements
            clean(child);
        } else if (nodeType == Node.TEXT_NODE) {
            String trimmedNodeVal = child.getNodeValue().trim();
            if (trimmedNodeVal.length() == 0) {
                // Whitespace-only text node: drop it
                node.removeChild(child);
            } else {
                child.setNodeValue(trimmedNodeVal);
            }
        } else if (nodeType == Node.COMMENT_NODE) {
            node.removeChild(child);
        }
    }
}

By doing this we can check each type of node we find and remove the ones we don’t need.
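
The effect of whitespace-only text nodes can be reproduced with a short sketch. Instead of the recursive clean() method, this version swaps in a different technique: it selects the blank text nodes with an XPath expression and removes them. The class name WhitespaceNodesDemo, the firstChildName method and the inline XML are assumptions for illustration; unlike clean(), this variant neither trims the remaining text nor removes comments.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WhitespaceNodesDemo {

    // Hypothetical indented sample: the line breaks and spaces become text nodes.
    static final String INDENTED_XML =
        "<Tutorials>\n"
        + "    <Tutorial tutId=\"01\">\n"
        + "        <title>Guava</title>\n"
        + "    </Tutorial>\n"
        + "</Tutorials>";

    // Returns the node name of the root element's first child.
    static String firstChildName(boolean removeBlankText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(INDENTED_XML)));

            if (removeBlankText) {
                XPath xPath = XPathFactory.newInstance().newXPath();
                // Select every text node that contains nothing but whitespace
                NodeList blanks = (NodeList) xPath.evaluate(
                    "//text()[normalize-space(.) = '']", doc, XPathConstants.NODESET);
                for (int i = 0; i < blanks.getLength(); i++) {
                    Node blank = blanks.item(i);
                    blank.getParentNode().removeChild(blank);
                }
            }
            return doc.getDocumentElement().getFirstChild().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstChildName(false));
        System.out.println(firstChildName(true));
    }
}
```

Before the cleanup the first child of the root is the text node holding the indentation; afterwards it is the <Tutorial> element.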

4. Conclusions

Here we just introduced the XPath support provided by default in the JDK, but there are many popular libraries and tools, such as JDOM, Saxon, XQuery, JAXP, Jaxen or even Jackson. There are also libraries for HTML-specific parsing, like JSoup.

XPath is not limited to Java: XPath expressions can also be used by the XSLT language to navigate XML documents.

As you can see, there is a wide range of possibilities for handling this kind of file.

There is great standard support by default for parsing, reading and processing XML/HTML documents. You can find the full working sample here.