Parsing an XML File Using SAX Parser

Last modified: September 29, 2019

1. Overview

SAX, also known as the Simple API for XML, is used for parsing XML documents.

In this tutorial, we’ll learn what SAX is and why, when and how it should be used.

2. SAX: the Simple API for XML

SAX is an event-based API for parsing XML documents. The parser generates events as it reads through the document, and callback methods defined in a custom handler receive those events.

The API is efficient because it discards each event as soon as the callback has received it. As a result, SAX keeps memory usage low, unlike DOM, for example.
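
To make the push model concrete, here is a minimal sketch that records the events a SAX parser hands to a handler. The class name SaxEventTrace and the trace() helper are hypothetical names chosen for illustration; only the javax.xml.parsers and org.xml.sax types come from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventTrace {

    // Hypothetical helper: returns the events pushed by the parser, in order.
    public static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                events.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Only the slice [start, start + length) belongs to this event.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text:" + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        trace("<article><title>SAX</title></article>").forEach(System.out::println);
    }
}
```

The parser itself keeps no record of past events. Anything we want to retain, such as the list above, has to be stored by the handler.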

3. SAX vs DOM

DOM stands for Document Object Model. The DOM parser does not rely on events. Moreover, it loads the whole XML document into memory to parse it. SAX is more memory-efficient than DOM.

DOM has its benefits, too. For example, DOM supports XPath. It also makes it easy to operate on the whole document tree at once, since the document is loaded into memory.
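
As a rough illustration of that benefit, the following sketch loads a small document into a DOM tree and queries it with XPath. The class name DomXPathSketch, the firstTitle() helper and the sample XML are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class DomXPathSketch {

    // Hypothetical helper: returns the title of the first article in the document.
    public static String firstTitle(String xml) throws Exception {
        // The whole document is parsed into an in-memory tree first...
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // ...which is what lets XPath address any node directly.
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/articles/article[1]/title", document);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<articles>"
            + "<article><title>SAX</title></article>"
            + "<article><title>DOM</title></article>"
            + "</articles>";
        System.out.println(firstTitle(xml));
    }
}
```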

4. SAX vs StAX

StAX is more recent than SAX and DOM. It stands for Streaming API for XML.

The main difference from SAX is that StAX uses a pull mechanism instead of SAX’s push mechanism (using callbacks). This means the client controls when events are pulled. Therefore, there’s no need to pull the whole document if only a part of it is needed.

StAX offers a simple API for working with XML while still parsing in a memory-efficient way.

Unlike SAX, StAX doesn’t offer schema validation as one of its features.
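
The pull mechanism described above can be sketched as follows: the client asks for the next event in a loop and simply stops once it has what it needs. StaxPullSketch and firstTitle() are hypothetical names used for illustration only.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxPullSketch {

    // Hypothetical helper: returns the text of the first <title> element, or null if there is none.
    public static String firstTitle(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        try {
            // The client drives the loop and pulls one event at a time.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // Returning here leaves the rest of the document unread.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<articles>"
            + "<article><title>SAX</title></article>"
            + "<article><title>StAX</title></article>"
            + "</articles>";
        System.out.println(firstTitle(xml));
    }
}
```

With SAX, the equivalent handler would keep receiving callbacks until the end of the document unless it aborted the parse by throwing an exception.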

5. Parsing the XML File Using a Custom Handler

Let’s now use the following XML representing the Baeldung website and its articles:

<baeldung>
    <articles>
        <article>
            <title>Parsing an XML File Using SAX Parser</title>
            <content>SAX Parser's Lorem ipsum...</content>
        </article>
        <article>
            <title>Parsing an XML File Using DOM Parser</title>
            <content>DOM Parser's Lorem ipsum...</content>
        </article>
        <article>
            <title>Parsing an XML File Using StAX Parser</title>
            <content>StAX's Lorem ipsum...</content>
        </article>
    </articles>
</baeldung>

We’ll begin by creating POJOs for our Baeldung root element and its children:

public class Baeldung {
    private List<BaeldungArticle> articleList;
    // usual getters and setters
}
public class BaeldungArticle {
    private String title;
    private String content;
    // usual getters and setters
}

We’ll continue by creating the BaeldungHandler. This class will implement the callback methods necessary to capture the events.

We’ll override four methods from the superclass DefaultHandler, each characterizing an event:

    • characters(char[], int, int) receives characters with boundaries. We’ll convert them to a String and store it in a variable of BaeldungHandler
    • startDocument() is invoked when the parsing begins – we’ll use it to construct our Baeldung instance
    • startElement() is invoked when the parsing begins for an element – we’ll use it to construct either List<BaeldungArticle> or BaeldungArticle instances – qName helps us make the distinction between both types
    • endElement() is invoked when the parsing ends for an element – this is when we’ll assign the content of the tags to their respective variables

With all the callbacks defined, we can now write the BaeldungHandler class:

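import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BaeldungHandler extends DefaultHandler {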
    private static final String ARTICLES = "articles";
    private static final String ARTICLE = "article";
    private static final String TITLE = "title";
    private static final String CONTENT = "content";

    private Baeldung website;
    private StringBuilder elementValue;

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // SAX may deliver an element's text in several chunks, so always append
        if (elementValue == null) {
            elementValue = new StringBuilder();
        }
        elementValue.append(ch, start, length);
    }

    @Override
    public void startDocument() throws SAXException {
        website = new Baeldung();
    }

    @Override
    public void startElement(String uri, String lName, String qName, Attributes attr) throws SAXException {
        switch (qName) {
            case ARTICLES:
                website.setArticleList(new ArrayList<>());
                break;
            case ARTICLE:
                website.getArticleList().add(new BaeldungArticle());
                break;
            case TITLE:
                elementValue = new StringBuilder();
                break;
            case CONTENT:
                elementValue = new StringBuilder();
                break;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        switch (qName) {
            case TITLE:
                latestArticle().setTitle(elementValue.toString());
                break;
            case CONTENT:
                latestArticle().setContent(elementValue.toString());
                break;
        }
    }

    private BaeldungArticle latestArticle() {
        List<BaeldungArticle> articleList = website.getArticleList();
        int latestArticleIndex = articleList.size() - 1;
        return articleList.get(latestArticleIndex);
    }

    public Baeldung getWebsite() {
        return website;
    }
}

String constants have also been added to increase readability. A method to retrieve the latest encountered article is also convenient. Finally, we need a getter for the Baeldung object.

Note that this handler isn’t thread-safe, since it holds state between method calls.
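
One common way to live with a stateful handler is to create a fresh instance for every parse instead of sharing one across threads. The sketch below shows the idea with a deliberately tiny handler. ElementCounter and countElements() are hypothetical names, and creating a new parser per call reflects the assumption that parser instances shouldn’t be shared between threads either.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerPerParse {

    // Hypothetical stateful handler: counts the elements it sees.
    static class ElementCounter extends DefaultHandler {
        private int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            count++;
        }

        int getCount() {
            return count;
        }
    }

    // Each call builds its own handler and parser, so no state is shared between callers.
    public static int countElements(String xml) throws Exception {
        ElementCounter handler = new ElementCounter();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getCount();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><c/></a>"));
    }
}
```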

6. Testing the Parser

In order to test the parser, we’ll instantiate the SAXParserFactory, the SAXParser and also the BaeldungHandler:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
SaxParserMain.BaeldungHandler baeldungHandler = new SaxParserMain.BaeldungHandler();

After that, we’ll parse the XML file and assert that the object contains all expected elements parsed:

saxParser.parse("src/test/resources/sax/baeldung.xml", baeldungHandler);

SaxParserMain.Baeldung result = baeldungHandler.getWebsite();

assertNotNull(result);
List<SaxParserMain.BaeldungArticle> articles = result.getArticleList();

assertNotNull(articles);
assertEquals(3, articles.size());

SaxParserMain.BaeldungArticle articleOne = articles.get(0);
assertEquals("Parsing an XML File Using SAX Parser", articleOne.getTitle());
assertEquals("SAX Parser's Lorem ipsum...", articleOne.getContent());

SaxParserMain.BaeldungArticle articleTwo = articles.get(1);
assertEquals("Parsing an XML File Using DOM Parser", articleTwo.getTitle());
assertEquals("DOM Parser's Lorem ipsum...", articleTwo.getContent());

SaxParserMain.BaeldungArticle articleThree = articles.get(2);
assertEquals("Parsing an XML File Using StAX Parser", articleThree.getTitle());
assertEquals("StAX's Lorem ipsum...", articleThree.getContent());

As expected, the Baeldung object has been populated correctly and contains all three articles.

7. Conclusion

We’ve just seen how to use SAX to parse XML files. It’s a powerful API that keeps the memory footprint of our applications light.

As usual, the code for this article is available over on GitHub.
