Convert an XML File to a CSV File

Last modified: December 5, 2023

1. Overview

In this article, we will explore various methods to turn XML files into CSV format using Java.

XML (Extensible Markup Language) and CSV (Comma-Separated Values) are both popular choices for data exchange. While XML is a powerful option that allows for a structured, layered approach to complicated data sets, CSV is more straightforward and designed primarily for tabular data. 

Sometimes, there might be situations where we need to convert an XML to a CSV to make data import or analysis easier.

2. Introduction to XML Data Layout

Imagine we run several bookstores, and we’ve stored our inventory data in an XML format similar to the example below:

<?xml version="1.0"?>
<Bookstores>
    <Bookstore id="S001">
        <Books>
            <Book id="B001" category="Fiction">
                <Title>Death and the Penguin</Title>
                <Author id="A001">Andrey Kurkov</Author>
                <Price>10.99</Price>
            </Book>
            <Book id="B002" category="Poetry">
                <Title>Kobzar</Title>
                <Author id="A002">Taras Shevchenko</Author>
                <Price>8.50</Price>
            </Book>
        </Books>
    </Bookstore>
    <Bookstore id="S002">
        <Books>
            <Book id="B003" category="Novel">
                <Title>Voroshilovgrad</Title>
                <Author id="A003">Serhiy Zhadan</Author>
                <Price>12.99</Price>
            </Book>
        </Books>
    </Bookstore>
</Bookstores>

This XML organizes the attributes ‘id’ and ‘category’ and the text elements ‘Title’, ‘Author’, and ‘Price’ in a clear hierarchy. A well-structured XML document makes the conversion simpler and less error-prone.

The goal is to convert this data into a CSV format for easier handling in tabular form. To illustrate, let’s take a look at how the bookstores from our XML data would be represented in the CSV format:

bookstore_id,book_id,category,title,author_id,author_name,price
S001,B001,Fiction,Death and the Penguin,A001,Andrey Kurkov,10.99
S001,B002,Poetry,Kobzar,A002,Taras Shevchenko,8.50
S002,B003,Novel,Voroshilovgrad,A003,Serhiy Zhadan,12.99

Moving forward, we’ll discuss the methods to achieve this conversion.

3. Converting using XSLT

3.1. Introduction to XSLT

XSLT (Extensible Stylesheet Language Transformations) is a language for transforming XML documents into other formats such as HTML, plain text, or CSV.

It operates by following rules set in a special stylesheet, usually an XSL file. This becomes especially useful when we aim to convert XML to CSV for easier use.

3.2. XSLT Conversion Process

To get started, we’ll need to create an XSLT stylesheet that uses XPath to navigate the XML tree structure and specifies how to convert the XML elements into CSV rows and columns.

Below is an example of such an XSLT file:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" omit-xml-declaration="yes" indent="no"/>
    <xsl:template match="/">
        <xsl:text>bookstore_id,book_id,category,title,author_id,author_name,price</xsl:text>
        <xsl:text>&#xA;</xsl:text>
        <xsl:for-each select="//Bookstore">
            <xsl:variable name="bookstore_id" select="@id"/>
            <xsl:for-each select="./Books/Book">
                <xsl:variable name="book_id" select="@id"/>
                <xsl:variable name="category" select="@category"/>
                <xsl:variable name="title" select="Title"/>
                <xsl:variable name="author_id" select="Author/@id"/>
                <xsl:variable name="author_name" select="Author"/>
                <xsl:variable name="price" select="Price"/>
                <xsl:value-of select="concat($bookstore_id, ',', $book_id, ',', $category, ',', $title, ',', $author_id, ',', $author_name, ',', $price)"/>
                <xsl:text>&#xA;</xsl:text>
            </xsl:for-each>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

This stylesheet first matches the root element and then examines each ‘Bookstore’ node, storing its attributes and child elements, such as the book’s id and category, in variables. These variables are then used to build each row of the CSV file, which has columns for bookstore ID, book ID, category, title, author ID, author name, and price.

The <xsl:template> element sets the transformation rules. It targets the XML root with <xsl:template match="/"> and then defines the CSV header.

The instruction <xsl:for-each select="//Bookstore"> processes each ‘Bookstore’ node and captures its attributes. Another inner instruction, <xsl:for-each select="./Books/Book">, processes each ‘Book’ within the current ‘Bookstore’.

The concat() function combines these values into a CSV row.

The <xsl:text>&#xA;</xsl:text> instruction adds a line feed (LF) character, which has the ASCII value 0xA in hexadecimal notation.

Here’s how we can use the Java-based XSLT processor:

void convertXml2CsvXslt(String xslPath, String xmlPath, String csvPath) throws TransformerException {
    StreamSource styleSource = new StreamSource(new File(xslPath));
    Transformer transformer = TransformerFactory.newInstance()
      .newTransformer(styleSource);
    Source source = new StreamSource(new File(xmlPath));
    Result outputTarget = new StreamResult(new File(csvPath));
    transformer.transform(source, outputTarget);
}

We use TransformerFactory to compile our XSLT stylesheet. Then, we create a Transformer object, which takes care of applying this stylesheet to our XML data, turning it into a CSV file. Once the code runs successfully, a new file will appear in the specified directory.
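
To experiment with the transformation without creating files, the same JAXP calls can be pointed at in-memory strings. The following is a minimal sketch, not the article’s implementation: the class name XsltInMemoryExample, the transform() helper, and the trimmed-down stylesheet (which outputs only the book id and title) are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltInMemoryExample {

    // Hypothetical, simplified stylesheet: one "id,title" line per Book
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:for-each select=\"//Book\">"
      + "<xsl:value-of select=\"concat(@id, ',', Title)\"/>"
      + "<xsl:text>&#xA;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical sample input with a single book
    static final String SAMPLE_XML =
        "<Books><Book id=\"B001\"><Title>Kobzar</Title></Book></Books>";

    // Applies the stylesheet to the XML and returns the result as a string.
    // Checked exceptions are wrapped to keep the sketch easy to call.
    static String transform(String xsl, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
              .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Prints the line B001,Kobzar
        System.out.print(transform(STYLESHEET, SAMPLE_XML));
    }
}
```

Working with strings like this is mainly useful in unit tests; for real files, the File-based sources shown above remain the simpler choice.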

Using XSLT for XML to CSV conversion is convenient and flexible, and it offers a standardized approach for most use cases. However, it requires loading the whole XML file into memory, which can be a drawback for large files. It works well for medium-sized data sets, but for larger ones we might want to consider StAX, which we’ll look at next.

4. Using StAX

4.1. Introduction to StAX

StAX (Streaming API for XML) is designed to read and write XML files in a more memory-efficient way. It allows us to process XML documents on the fly, making it ideal for handling large files.

Converting using StAX involves three main steps:

  • Initializing the StAX parser
  • Reading XML elements
  • Writing to CSV

4.2. StAX Conversion Process

Here’s a full example, encapsulated in a method named convertXml2CsvStax():

void convertXml2CsvStax(String xmlFilePath, String csvFilePath) throws IOException, XMLStreamException {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    // Deliver the text of each element as a single CHARACTERS event
    inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

    try (InputStream in = Files.newInputStream(Paths.get(xmlFilePath));
      BufferedWriter writer = new BufferedWriter(new FileWriter(csvFilePath))) {
        writer.write("bookstore_id,book_id,category,title,author_id,author_name,price\n");

        XMLStreamReader reader = inputFactory.createXMLStreamReader(in);

        String currentElement;
        StringBuilder csvRow = new StringBuilder();
        StringBuilder bookstoreInfo = new StringBuilder();

        while (reader.hasNext()) {
            int eventType = reader.next();

            switch (eventType) {
                case XMLStreamConstants.START_ELEMENT:
                    currentElement = reader.getLocalName();
                    if ("Bookstore".equals(currentElement)) {
                        // Remember the bookstore id so every book row can start with it
                        bookstoreInfo.setLength(0);
                        bookstoreInfo.append(reader.getAttributeValue(null, "id"))
                          .append(",");
                    }
                    if ("Book".equals(currentElement)) {
                        csvRow.append(bookstoreInfo)
                          .append(reader.getAttributeValue(null, "id"))
                          .append(",")
                          .append(reader.getAttributeValue(null, "category"))
                          .append(",");
                    }
                    if ("Author".equals(currentElement)) {
                        csvRow.append(reader.getAttributeValue(null, "id"))
                          .append(",");
                    }
                    break;

                case XMLStreamConstants.CHARACTERS:
                    // Skip the whitespace between elements; keep Title, Author and Price text
                    if (!reader.isWhiteSpace()) {
                        csvRow.append(reader.getText()
                          .trim())
                          .append(",");
                    }
                    break;

                case XMLStreamConstants.END_ELEMENT:
                    if ("Book".equals(reader.getLocalName())) {
                        // Drop the trailing comma, finish the row, and reset the buffer
                        csvRow.setLength(csvRow.length() - 1);
                        csvRow.append("\n");
                        writer.write(csvRow.toString());
                        csvRow.setLength(0);
                    }
                    break;
            }
        }
        reader.close();
    }
}

To begin, we initialize the StAX parser by creating an instance of XMLInputFactory. We then use this factory object to generate an XMLStreamReader:

XMLInputFactory inputFactory = XMLInputFactory.newInstance();
InputStream in = new FileInputStream(xmlFilePath);
XMLStreamReader reader = inputFactory.createXMLStreamReader(in);

We use the XMLStreamReader to iterate through the XML file, and based on the event type, such as START_ELEMENT, CHARACTERS, and END_ELEMENT, we build our CSV rows.
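
To see which events the reader actually produces, it can help to log them for a tiny document. The sketch below is only an illustration: the class StaxEventsExample and its describeEvents() method are hypothetical names, and the input is parsed from a string rather than a file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEventsExample {

    // Returns a readable description of the element and text events in the XML
    static List<String> describeEvents(String xml) {
        List<String> events = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask for adjacent character data to be reported as one event
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        events.add("START " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            events.add("TEXT " + reader.getText().trim());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        events.add("END " + reader.getLocalName());
                        break;
                    default:
                        // Other events (comments, end of document, ...) are ignored here
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        // Prints [START Book, START Title, TEXT Kobzar, END Title, END Book]
        System.out.println(describeEvents("<Book id=\"B001\"><Title>Kobzar</Title></Book>"));
    }
}
```

Attributes do not appear as separate events; they are read from the reader while it is positioned on the START_ELEMENT event, which is why the conversion method calls getAttributeValue() in that branch.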

As we read the XML data, we build up CSV rows and write them to the output file using a BufferedWriter.

In a nutshell, StAX offers a memory-efficient solution that’s well-suited for processing large XML files or XML that arrives as a stream. While it requires more manual effort and lacks some of the transformation features of XSLT, it excels in scenarios where resource utilization is a concern.

5. Additional Methods

We’ve primarily focused on XSLT and StAX as XML to CSV conversion methods. However, other options like DOM (Document Object Model) parsers, SAX (Simple API for XML) parsers, and Apache Commons CSV also exist.

Yet, there are some factors to consider. DOM parsers are great for loading the whole XML file into memory, giving you the flexibility to traverse and manipulate the XML tree freely. On the other hand, they do make you work a bit harder when you need to transform that XML data into CSV format.
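
For comparison, a DOM-based conversion might look roughly like the sketch below. It is an assumption-laden illustration rather than a reference implementation: the class DomCsvExample and its toCsv() method are hypothetical, only three of the columns are produced, and the code assumes every Book sits inside Books inside a Bookstore, as in our sample file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class DomCsvExample {

    // Parses the whole document into memory, then walks the tree to build CSV rows
    static String toCsv(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
              .newDocumentBuilder()
              .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            StringBuilder csv = new StringBuilder("bookstore_id,book_id,title\n");
            NodeList books = doc.getElementsByTagName("Book");
            for (int i = 0; i < books.getLength(); i++) {
                Element book = (Element) books.item(i);
                // Assumed layout: Book -> Books -> Bookstore
                Element bookstore = (Element) book.getParentNode().getParentNode();
                String title = book.getElementsByTagName("Title").item(0).getTextContent();
                csv.append(bookstore.getAttribute("id")).append(',')
                  .append(book.getAttribute("id")).append(',')
                  .append(title).append('\n');
            }
            return csv.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not convert XML to CSV", e);
        }
    }
}
```

Because the tree is fully loaded, the code can move from a Book up to its Bookstore freely, which is exactly the flexibility (and the memory cost) described above.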

When it comes to SAX parsers, they are more memory-efficient but can present challenges for complex manipulations. Their event-driven nature requires you to manage the state manually, and they offer no option for looking ahead or behind in the XML document, making certain transformations cumbersome.
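
The manual state management becomes visible in even a small handler. The following sketch is illustrative only: SaxCsvExample, BookHandler, and toCsv() are hypothetical names, and just three columns are written to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxCsvExample {

    static class BookHandler extends DefaultHandler {
        final StringBuilder csv = new StringBuilder("bookstore_id,book_id,title\n");
        // State we have to track ourselves between callbacks
        private final StringBuilder text = new StringBuilder();
        private String bookstoreId;
        private String bookId;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            text.setLength(0);
            if ("Bookstore".equals(qName)) {
                bookstoreId = attrs.getValue("id");
            } else if ("Book".equals(qName)) {
                bookId = attrs.getValue("id");
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so it is accumulated
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("Title".equals(qName)) {
                csv.append(bookstoreId).append(',')
                  .append(bookId).append(',')
                  .append(text.toString().trim()).append('\n');
            }
        }
    }

    static String toCsv(String xml) {
        BookHandler handler = new BookHandler();
        try {
            SAXParserFactory.newInstance()
              .newSAXParser()
              .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not convert XML to CSV", e);
        }
        return handler.csv.toString();
    }
}
```

The bookstoreId and bookId fields exist only because the parser never lets us look back at an enclosing element once its start event has passed.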

Apache Commons CSV shines when writing CSV files but expects you to handle the XML parsing part yourself.

In summary, while each alternative has its own advantages, XSLT and StAX provide a more balanced solution for most XML to CSV conversion tasks, including this example.

6. Best Practices

To convert XML to CSV, several factors, such as data integrity, performance, and error handling, need to be considered. Validating the XML against its schema is crucial for confirming the data structure. In addition, proper mapping of XML elements to CSV columns is a fundamental step.
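
As one possible way to perform such a validation, the JDK’s javax.xml.validation package can check a document against an XSD before conversion. This is a sketch under assumptions: the article does not provide a schema, so the tiny PRICE_XSD below, as well as the class XsdValidationExample and its isValid() method, are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class XsdValidationExample {

    // Hypothetical schema: a single Price element that must be a decimal number
    static final String PRICE_XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"Price\" type=\"xs:decimal\"/>"
      + "</xs:schema>";

    // Returns true if the XML conforms to the schema, false if validation reports an error
    static boolean isValid(String xsd, String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
              .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("The schema itself is not valid", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // With the default error handler, validation errors surface as SAXException
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this schema, isValid(PRICE_XSD, "<Price>10.99</Price>") returns true, while a non-numeric price such as "<Price>cheap</Price>" is rejected before any CSV is written.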

For large files, using streaming techniques like StAX can be advantageous for memory efficiency. Also, consider breaking down large files into smaller batches for easier processing.

It’s important to mention that the code examples provided may not handle special characters found in XML data, including but not limited to commas, newlines, and double quotes. For example, a comma within a field value can conflict with the comma used to delimit fields in the CSV. Similarly, a newline character could disrupt the logical structure of the file.

Addressing such issues can be complex and varies depending on specific project requirements. To work around commas, you can enclose fields in double quotes in the resulting CSV file. That said, to keep the code examples in this article easy to follow, these special cases have not been addressed. Therefore, this aspect should be taken into account for a more accurate conversion.
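
One common convention, described in RFC 4180, is to wrap a field in double quotes when it contains a comma, a double quote, or a line break, and to double any quotes inside it. The helper below is a minimal sketch of that idea; the class CsvEscaper and its methods are hypothetical and are not part of the article’s code.

```java
class CsvEscaper {

    // Quotes a single field if it contains characters that would break the CSV layout
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
          || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Inner double quotes are doubled, then the whole field is wrapped in quotes
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins already-extracted values into one CSV row (without a line terminator)
    static String toRow(String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                row.append(',');
            }
            row.append(escape(fields[i]));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // Prints B001,"Death, and the Penguin",10.99
        System.out.println(toRow("B001", "Death, and the Penguin", "10.99"));
    }
}
```

In the StAX example, such a helper could be applied to each value before it is appended to the row; in XSLT 1.0 the equivalent logic is more verbose, which is one reason a dedicated CSV library is often preferred for untrusted data.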

7. Conclusion

In this article, we explored various methods for converting XML to CSV, specifically diving into the XSLT and StAX methods. Regardless of the method chosen, having a well-suited XML structure for CSV, implementing data validation, and knowing which special characters to handle are essential for a smooth and successful conversion. The code for these examples is available on GitHub.
