Introduction to Apache Calcite – 阿帕奇方解石简介

最后修改: 2024年 1月 20日

中文/混合/英文(键盘快捷键:t)

1. Overview

1.概述

In this tutorial, we’ll learn about Apache Calcite. It’s a powerful data management framework that can be used in various use cases concerning data access. Calcite focuses on retrieving data from any source, not on storing it. Additionally, Its query optimization capability enables faster and more efficient data retrieval.

在本教程中,我们将了解 Apache Calcite这是一个功能强大的数据管理框架,可用于与数据访问相关的各种用例。Calcite 专注于从任何来源检索数据,而不是存储数据。此外,其查询优化功能可实现更快、更高效的数据检索。

Let’s explore more, starting with the use cases where Apache Calcite is relevant.

让我们从与 Apache Calcite 相关的使用案例开始,探索更多内容。

2. Apache Calcite Use Cases

2.Apache Calcite 使用案例

Due to its capabilities, Apache Calcite can be leveraged in several use cases:

Apache Calcite 功能强大,可用于多种用途:

Apache Calcite Use Cases

It takes years to build query engines for new databases. However, Calcite helps get us started immediately with an out-of-the-box extendible SQL parser, validator, and optimizer. Calcite has been used in building databases such as HerdDB, Apache Druid, MapD, and many more.

为新数据库构建查询引擎需要数年时间。然而,Calcite 通过开箱即用的可扩展SQL解析器、验证器和优化器帮助我们立即开始工作。Calcite 已被用于构建 HerdDBApache DruidMapD 等数据库。

Because of Calcite’s capability to integrate with multiple databases, it’s widely used in building data warehouses and business intelligence tools such as Apache Kyline, Apache Wayang, Alibaba MaxCompute, and many more.

由于 Calcite 能够与多个数据库集成,因此被广泛用于构建数据仓库和商业智能工具,如 Apache KylineApache WayangAlibaba MaxCompute 等。

Calcite is an integral component of streaming platforms such as Apache Kafka, Apache Apex, and Flink, which help build tools that can present and analyze live feeds.

Calcite 是 Apache KafkaApache ApexFlink 等流媒体平台不可或缺的组成部分,这些平台有助于构建可呈现和分析实时源的工具。

3. Any Data, Anywhere

3.任何数据,任何地点

Apache Calcite provides ready-made adapters to integrate with third-party data sources like Cassandra, Elasticsearch, MongoDB, and many more.

Apache Calcite 提供了现成的适配器,以集成第三方数据源,如CassandraElasticsearchMongoDB

Let’s explore this in detail.

让我们来详细探讨一下。

3.1. High-Level Important Classes

3.1.高级重要类别

Apache Calcite provides a robust framework for retrieving data. This framework is extendable. Hence, custom new adapters can also be created. Let’s take a look at the important Java classes:

Apache Calcite 提供了一个强大的数据检索框架。该框架是可扩展的。因此,也可以创建自定义的新适配器。让我们来看看重要的 Java 类:

 

calcide adaptor cld

The Apache Calcite adapters provide classes such as ElasticsearchSchemaFactory, MongoSchemaFactory, FileSchemaFactory, and more, implementing the interface SchemaFactory. The SchemaFactory helps connect the underlying data sources in a unified manner by creating a virtual Schema defined in a JSON/YAML model file.

Apache Calcite 适配器提供了实现接口 ElasticsearchSchemaFactoryMongoSchemaFactoryFileSchemaFactory 等的 SchemaFactory类。SchemaFactory通过创建定义在 JSON/YAML 模型文件中的虚拟 Schema 来帮助以统一的方式连接底层数据源。

3.2. CSV Adapter

3.2 CSV 适配器

Furthermore, let’s see an example where we’ll read the data from a CSV file using SQL query. Let’s begin by importing the necessary Maven dependencies required for using the file adapter in the pom.xml file:

此外,让我们来看一个使用 SQL 查询从 CSV 文件读取数据的示例。首先,让我们在 pom.xml 文件中导入使用文件适配器所需的 Maven 依赖项

<dependency>
    <groupId>org.apache.calcite</groupId>
    <artifactId>calcite-core</artifactId>
    <version>1.36</version>
</dependency>
<dependency>
    <groupId>org.apache.calcite</groupId>
    <artifactId>calcite-file</artifactId>
    <version>1.36</version>
</dependency>

Next, let’s define the model in the model.json:

接下来,让我们在 model.json 中定义模型:

{
  "version": "1.0",
  "defaultSchema": "TRADES",
  "schemas": [
    {
      "name": "TRADES",
      "type": "custom",
      "factory": "org.apache.calcite.adapter.file.FileSchemaFactory",
      "operand": {
        "directory": "trades"
      }
    }
  ]
}

The FileSchemaFactory specified in the model.json looks into the trades directory for the CSV files and creates a virtual TRADES schema. Subsequently, the CSV files under the trades directory are treated like tables.

model.json中指定的 FileSchemaFactory 会在 trades 目录中查找 CSV 文件,并创建虚拟 TRADES 模式。随后,trades 目录下的 CSV 文件将被当作表格处理。

Before we move on to see the file adapter in action, let’s look at the trade.csv file, which we’ll query using the calcite adapter:

在继续查看文件适配器的运行情况之前,让我们先看看 trade.csv 文件,我们将使用钙钛矿适配器查询该文件:

tradeid:int,product:string,qty:int
232312123,"RFTXC",100
232312124,"RFUXC",200
232312125,"RFSXC",1000

The CSV file has three columns: tradeid, product, and qty. Additionally, the column headers also specify the data types. In total, there are three trade records in the CSV file.

CSV 文件有三列:tradeidproductqty。此外,列标题还指定了数据类型。CSV 文件中共有三条贸易记录。

Finally, let’s take a look at how to fetch the records using the Calcite adapter:

最后,让我们看看如何使用 Calcite 适配器获取记录:

@Test
void whenCsvSchema_thenQuerySuccess() throws SQLException {
    Properties info = new Properties();
    info.put("model", getPath("model.json"));
    try (Connection connection = DriverManager.getConnection("jdbc:calcite:", info);) {
        Statement statement = connection.createStatement();
        ResultSet resultSet = statement.executeQuery("select * from trades.trade");

        assertEquals(3, resultSet.getMetaData().getColumnCount());

        List<Integer> tradeIds = new ArrayList<>();
        while (resultSet.next()) {
            tradeIds.add(resultSet.getInt("tradeid"));
        }

        assertEquals(3, tradeIds.size());
    }
}

The Calcite adapter takes the model property to create a virtual schema mimicking the file system. Then, using the usual JDBC semantics, it fetches the records from the trade.csv file.

Calcite 适配器使用 model 属性创建一个模仿文件系统的虚拟模式。然后,它使用通常的 JDBC 语义,从 trade.csv 文件中获取记录。

The file adapter can read not only CSV files but also HTML and JSON files. Moreover, for handling CSV files, Apache Calcite also provides a special CSV adapter for handling advanced use cases that use CSVSchemaFactory.

文件适配器不仅可以读取 CSV 文件,还可以读取 HTML 和 JSON 文件。此外,为处理 CSV 文件,Apache Calcite 还提供了一个特殊的 CSV 适配器,用于处理使用 CSVSchemaFactory 的高级用例。

3.3. In-Memory SQL Operation on Java Objects

3.3.对 Java 对象进行内存 SQL 操作

Similar to the CSV adapter example, let’s look at another example where, with the help of Apache Calcite, we’ll run SQL queries on Java objects.

与 CSV 适配器示例类似,让我们来看另一个示例,在 Apache Calcite 的帮助下,我们将在 Java 对象上运行 SQL 查询。

Assume two arrays of Employee and Department classes in the CompanySchema class:

假设 CompanySchema 类中有两个 EmployeeDepartment 类数组:

public class CompanySchema {
    public Employee[] employees;
    public Department[] departments;
}

Now, let’s take a look at the Employee class:

现在,让我们来看看 Employee 类:

public class Employee {
    public String name;
    public String id;

    public String deptId;

    public Employee(String name, String id, String deptId) {
        this.name = name;
        this.id = id;
        this.deptId = deptId;
    }
}

Similar to the Employee class, let’s define the Department class:

Employee 类类似,让我们定义 Department 类:

public class Department {
    public String deptId;
    public String deptName;

    public Department(String deptId, String deptName) {
        this.deptId = deptId;
        this.deptName = deptName;
    }
}

Assume there are three departments: Finance, Marketing, and Human Resources. We’ll run a query on the CompanySchema object to find the number of employees in each of the departments:

假设有三个部门:财务部、市场部和人力资源部。我们将在 CompanySchema 对象上运行一个查询,以查找每个 部门中的 员工人数

@Test
void whenQueryEmployeesObject_thenQuerySuccess() throws SQLException {
    Properties info = new Properties();
    info.setProperty("lex", "JAVA");
    Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
    CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);
    SchemaPlus rootSchema = calciteConnection.getRootSchema();
    Schema schema = new ReflectiveSchema(companySchema);
    rootSchema.add("company", schema);
    Statement statement = calciteConnection.createStatement();
    String query = "select dept.deptName, count(emp.id) "
      + "from company.employees as emp "
      + "join company.departments as dept "
      + "on (emp.deptId = dept.deptId) "
      + "group by dept.deptName";

    assertDoesNotThrow(() -> {
        ResultSet resultSet = statement.executeQuery(query);
        while (resultSet.next()) {
            logger.info("Dept Name:" + resultSet.getString(1)
              + " No. of employees:" + resultSet.getInt(2));
        }
    });
}

Interestingly, the method runs fine and fetches the results as well. In the method, the Apache Calcite class ReflectiveSchema helps create the Schema of the CompanySchema object. Then, it runs the SQL query and fetches the records using the standard JDBC semantics.

有趣的是,该方法运行正常,也能获取结果。在该方法中,Apache Calcite 类 ReflectiveSchema 帮助创建 CompanySchema 对象的模式。然后,运行 SQL 查询并使用标准 JDBC 语义获取记录。

Moreover, this example proves that irrespective of the source, Calcite can fetch data from anywhere using SQL statements.

此外,此示例还证明,无论数据来源如何,Calcite 都能使用 SQL 语句从任何地方获取数据。

4. Query Processing

4.查询处理

Query processing is Apache calcite’s core functionality.

查询处理是 Apache calcite 的核心功能。

Standard JDBC drivers or SQL clients execute queries on databases. Whereas Apache Calcite, after parsing and validating the query, intelligently optimizes them for efficient execution, saving resources and boosting performance.

标准 JDBC 驱动程序或 SQL 客户端执行数据库查询。而 Apache Calcite 在解析和验证查询后,会对其进行智能优化,以便高效执行,从而节省资源并提高性能。

4.1. Decoding Query Processing Steps

4.1.解码查询处理步骤

Calcite offers pretty standard components that help in query processing:

Calcite 提供有助于查询处理的标准组件:

 

calcite process

Interestingly, we can also extend these components to meet the specific requirements of any database. Let’s understand more about these steps in detail.

有趣的是,我们还可以扩展这些组件,以满足任何数据库的特定要求。让我们详细了解一下这些步骤。

4.2. SQL Parser and Validator

4.2.SQL 解析器和验证器

As part of the parsing process, the parser converts the SQL query into a tree-like structure called AST (Abstract Syntax Tree).

作为解析过程的一部分,解析器会将 SQL 查询转换成树状结构,称为 AST(抽象语法树)

Assume a SQL query on two tables, Teacher and Department:

假设对两个表 TeacherDepartment 进行 SQL 查询:

Select Teacher.name, Department.name 
From Teacher join 
Department On (Department.deptid = Teacher.deptid)
Where Department.name = 'Science'

First, the query parser converts the query into an AST and then performs a basic syntactic validation:

首先,查询解析器会将查询转换成 AST,然后执行基本的语法验证:

 

query processing AST

Further, the validator validates the nodes semantically, for example:

此外,验证器还对节点进行语义验证,例如

  • validating the functions and operators
  • validating the database objects like tables and columns against the database’s catalog

4.3. Relational Expression Builder

4.3.关系表达式生成器

Subsequently, after the validation step, the relational expression builder converts the syntax tree using some of the common relational operators:

随后,在验证步骤之后,关系表达式生成器会使用一些常用的关系运算符转换语法树:

  • LogicalTableScan: Reads data from a table
  • LogicalFilter: Selects rows based on a condition
  • LogicalProject: Choose specific columns to include
  • LogicalJoin: Combines rows from two tables based on matching values

Considering the AST shown earlier, the corresponding logical relational expression derived from it would be:

考虑到前面所示的 AST,由此导出的相应逻辑关系表达式为

LogicalProject(
    projects=[
        $0.name AS name0,
        $1.name AS name1
    ],
    input=LogicalFilter(
        condition=[
            ($1.name = 'Science')
        ],
        input=LogicalJoin(
            condition=[
                ($0.deptid = $1.deptid)
            ],
            left=LogicalTableScan(table=[[Teacher]]),
            right=LogicalTableScan(table=[[Department]])
        )
    )
)

In the relational expression, $0 and $1 represent the tables Teacher and Department. Essentially, it’s a mathematical expression that helps understand what operations will be carried out to get the results. However, it has no information related to execution.

在关系表达式中,$0$1 表示表 TeacherDepartment 。从本质上讲,这是一个数学表达式,有助于理解将进行哪些运算以获得结果。但是,它没有与执行相关的信息。

4.4. Query Optimizer

4.4.查询优化器

Then, the Calcite optimizer applies optimization on the relational expression. Some common optimization include:

然后,Calcite 优化器对关系表达式进行优化。常见的优化包括

  • Predicate Pushdown: Pushing filters as close to the data source as possible to reduce the amount of data fetched
  • Join Reordering: Rearranging join order to minimize intermediate results and improve efficiency
  • Projection Pushdown: Pushing down projections to avoid unnecessary columns from being processed
  • Index Usage: Identifying and utilizing indexes to speed up data retrieval

4.5. Query Planner, Generator, and Executor

4.5.查询规划器、生成器和执行器

Following optimization, the Calcite Query Planner creates an execution plan for executing the optimized query. The execution plan specifies the exact steps to be taken by the query engine to fetch and process data. This is also called a physical plan specific to the back-end query engine.

优化后,Calcite 查询规划器将创建执行优化查询的执行计划。执行计划指定了查询引擎获取和处理数据的具体步骤。这也被称为后端查询引擎特定的物理计划。

Then, the Calcite Query Generator generates code in a language specific to the chosen execution engine.

然后,Calcite 查询生成器会使用所选执行引擎的特定语言生成代码

Finally, the Executor connects to the database to execute the final query.

最后,执行器连接到数据库,执行最终查询。

5. Conclusion

5.结论

In this article, we explored the capabilities of Apache Calcite, which rapidly equips databases with standardized SQL parsers, validators, and optimizers. Hence, freeing vendors from developing years-long query engines, Calcite empowers them to prioritize back-end storage. Additionally, Calcite’s ready-made adapters simplify connection to diverse databases, helping to develop a unified integration interface.

在本文中,我们探讨了 Apache Calcite 的功能,它可快速为数据库配备标准化的 SQL 解析器、验证器和优化器。因此,Calcite 将供应商从开发长达数年的查询引擎中解放出来,使他们能够优先考虑后端存储。此外,Calcite 的现成适配器简化了与各种数据库的连接,有助于开发统一的集成界面。

Moreover, by leveraging Calcite, database developers can accelerate time-to-market and deliver robust, versatile SQL capabilities.

此外,通过利用 Calcite,数据库开发人员可以加快产品上市时间,并提供强大的多功能 SQL 功能。

The code used in this article is available over on GitHub.

本文中使用的代码可在 GitHub 上获取。