Secondary Indexes in Cassandra – 卡桑德拉的二级索引

最后修改: 2022年 3月 10日

中文/混合/英文(键盘快捷键:t)

1. Overview

1.概述

In this tutorial, we’ll discuss how to use secondary indexes in Apache Cassandra.

在本教程中,我们将讨论如何在Apache Cassandra中使用二级索引。

We’ll see how data is distributed in the database and explore all the index types. Finally, we’ll discuss some best practices and recommendations for using secondary indexes.

我们将看到数据是如何在数据库中分布的,并探索所有的索引类型。最后,我们将讨论一些使用二级索引的最佳实践和建议。

2. Cassandra Architecture

2.卡桑德拉架构

Cassandra is a NoSQL distributed database with a completely decentralized communication model.

Cassandra是一个NoSQL分布式数据库,具有完全去中心化的通信模型。

It comprises multiple nodes with equal duties, offering high availability. It can run on any cloud provider and on-premise, making it cloud-agnostic.

它包括具有同等职责的多个节点,提供高可用性。它可以运行在任何云供应商和企业内部,使其具有云计算能力。

We can also deploy a single Cassandra cluster simultaneously across multiple cloud platforms. It’s most suited for OLTP (Online Transaction Processing) queries, where response speed is crucial, with simple queries that rarely change.

我们还可以在多个云平台上同时部署一个Cassandra集群。它最适合于OLTP(在线交易处理)查询,在这种情况下,响应速度至关重要,查询内容简单,很少变化。

2.1. Primary Key

2.1.主键

The primary key is the most important data modeling choice that uniquely identifies a data record. It consists of at least one partition key and zero or more clustering columns.

主键是最重要的数据建模选择,它能唯一地识别一条数据记录它至少由一个分区键和零个或多个聚类列组成。

The partition key defines how we split data across the cluster. The clustering column orders data on disk to enable fast read operations.

分区键定义了我们如何在集群中分割数据。集群列对磁盘上的数据进行排序,以实现快速读取操作。

Let’s look at an example:

我们来看看一个例子。

CREATE TABLE company (
    company_name text,
    employee_name text,
    employee_email text,
    employee_age int,
    PRIMARY KEY ((company_name), employee_email)
);

Here, we’ve defined company_name as the partition key used to distribute the table data evenly across the nodes. Next, since we’ve specified employee_email as a clustering column, Cassandra uses it to keep the data in ascending order on each node for efficient retrieval of rows.

在这里,我们将company_name定义为分区键,用于将表的数据平均分配到各节点上。接下来,由于我们指定employee_email为聚类列,Cassandra使用它来保持每个节点上数据的升序,以便有效地检索行。

2.2. Cluster Topology

2.2.集群拓扑结构

Cassandra offers linear scalability and performance directly proportional to the number of nodes available.

Cassandra提供线性可扩展性,性能与可用的节点数量成正比

The nodes are placed in a ring, forming a data center, and by connecting multiple geographically distributed data centers, we create a cluster.

节点被放置在一个环中,形成一个数据中心,通过连接多个地理上分布的数据中心,我们创建一个集群。

Cassandra automatically partitions the data without manual intervention, thus making it big data ready.

Cassandra自动对数据进行分区,无需人工干预,从而使其为大数据做好准备。

Next, let’s see how Cassandra partitions our table by company_name:

接下来,让我们看看Cassandra如何按company_name对我们的表进行分区。

Cassandra Cluster Topology

As we can see, the company table is split into partitions using the partition key company_name and distributed across the nodes. We can notice that Cassandra groups the rows with the same company_name value and stores them on the same physical partition on the disk. As a result, we can read all the data for a given company with minimal I/O cost.

我们可以看到,company表使用分区键company_name被分割成分区,并分布在各节点上。我们可以注意到,Cassandra将具有相同company_name值的行分组,并将它们存储在磁盘的同一个物理分区中。因此,我们可以用最小的I/O成本来读取一个特定公司的所有数据。

Additionally, we can replicate the data across the data center by defining the replication factor. A replication factor of N will store each data row on N different nodes in the cluster.

此外,我们可以通过定义复制因子在整个数据中心复制数据。复制因子为N,将在集群中的N个不同的节点上存储每个数据行。

We can specify the number of replicas at the data center level and not at the cluster level. As a result, we can have a cluster of multiple data centers, with each data center having a different replication factor.

我们可以在数据中心层面而不是集群层面上指定复制的数量。因此,我们可以有一个由多个数据中心组成的集群,每个数据中心有不同的复制系数。

3. Querying on Non-Primary Key

3.对非主键的查询

Let’s take the company table that we defined earlier and try to search by employee_age:

让我们来看看我们之前定义的company表,并尝试通过employee_age来搜索。

SELECT * FROM company WHERE employee_age = 30;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

We get this error message because we cannot query a column that’s not part of the primary key unless we use the ALLOW FILTERING clause.

我们得到这个错误信息是因为我们不能查询一个不属于主键的列,除非我们使用ALLOW FILTERING子。

However, even if we technically can, we should not use it in production because ALLOW FILTERING is expensive and time-consuming. This is because, in the background, it starts full-table scans across all nodes in the cluster to fetch the results, which has a negative impact on performance.

然而,即使我们在技术上可以,我们也不应该在生产中使用它,因为ALLOW FILTERING是昂贵和耗时的。这是因为,在后台,它开始在集群的所有节点上进行全表扫描以获取结果,这对性能有负面影响。

However, one acceptable use case where we could use it is when we need to do a lot of filtering on a single partition. In this case, Cassandra still performs a table scan, but we can limit it to a single node:

然而,我们可以使用它的一个可接受的用例是当我们需要在一个分区上做大量的过滤。在这种情况下,Cassandra仍然执行表扫描,但我们可以把它限制在一个节点上。

SELECT * FROM company WHERE company_name = 'company_a' AND employee_age = 30 ALLOW FILTERING;

Because we added the company_name clustering column as a condition, Cassandra uses it to identify the node that holds all the company data. Consequently, it performs a table scan just on the table data on that specific node.

因为我们把company_name聚类列作为一个条件,Cassandra用它来识别持有所有公司数据的节点。因此,它只对那个特定节点上的表数据进行了表扫描。

4. Secondary Indexes

4.二级索引

Secondary Indexes in Cassandra solve the need for querying columns that are not part of the primary key.

Cassandra中的二级索引解决了查询不属于主键的列的需要。

When we insert data, Cassandra uses an append-only file called commitlog for storing the changes, so writes are quick. At the same time, the data is written to an in-memory cache of key/column values called a Memtable. Periodically, Cassandra flushes the Memtable to disk in the form of an immutable SSTable.

当我们插入数据时,Cassandra使用一个名为commitlog的纯附录文件来存储变化,所以写入速度很快。同时,数据被写入一个叫做Memtable的键/列值的内存缓存中。Cassandra会定期将Memtable以不可变的SSTable的形式刷新到磁盘。

Next, let’s look at three different indexing methods in Cassandra and discuss the advantages and disadvantages.

接下来,让我们看看Cassandra中三种不同的索引方法,并讨论其优势和劣势。

4.1. Regular Secondary Index (2i)

4.1.常规二级指数 (2i)

The regular secondary index is the most basic index we can define for executing queries on non-primary key columns.

常规二级索引是我们可以定义的最基本的索引,用于执行对非主键列的查询。

Let’s define a secondary index on the employee_age column:

让我们为employee_age列定义一个二级索引。

CREATE INDEX IF NOT EXISTS ON company (employee_age);

With that in place, we can now run the query by employee_age without any errors:

有了这些,我们现在可以按employee_age运行查询而不会有任何错误。

SELECT * FROM company WHERE employee_age = 30; 

company_name  | employee_email    | employee_age | employee_name 
--------------+-------------------+--------------+---------------
    company_A | emp1@companyA.com |           30 |     employee_1

When we set up the index, Cassandra creates a hidden table for storing the index data in the background:

当我们设置索引时,Cassandra会在后台创建一个用于存储索引数据的隐藏表。

CREATE TABLE company_by_employee_age_idx ( 
    employee_age int,
    company_name text,
    employee_email text,
    PRIMARY KEY ((employee_age), company_name, employee_email) 
);

Unlike regular tables, Cassandra doesn’t distribute the hidden index table using the cluster-wide partitioner. The index data is co-located with the source data on the same nodes.

与普通表不同,Cassandra不使用集群范围内的分区器来分配隐藏的索引表。索引数据与源数据共同存放在同一节点上。

Therefore, when executing a search query using the secondary index, Cassandra reads the indexed data from every node and collects all the results. If our cluster has many nodes, this can lead to increased data transfer and high latency.

因此,当使用二级索引执行搜索查询时,Cassandra从每个节点读取索引数据并收集所有结果。如果我们的集群有很多节点,这可能会导致数据传输的增加和高延时。

We might ask ourselves why Cassandra doesn’t partition the index table across nodes based on the primary key. The answer is that storing the index data alongside the source data reduces the latency. Also, because the index update is executed locally and not over the network, there is no risk to lose the update operation due to connectivity issues. Additionally, Cassandra avoids creating wide partitions if the index column data is not evenly distributed.

我们可能会问自己,为什么Cassandra不根据主键在各节点间划分索引表。答案是,将索引数据与源数据一起存储可以减少延迟。另外,由于索引更新是在本地执行的,而不是通过网络,所以没有因连接问题而失去更新操作的风险。此外,如果索引列数据的分布不均匀,Cassandra会避免创建宽的分区。

When we insert data to a table with a secondary index attached, Cassandra writes to both the index and the base Memtable. Additionally, both are flushed to the SSTables simultaneously. Consequently, the index data will have a separate lifecycle than the source data.

当我们向一个附有二级索引的表插入数据时,Cassandra同时向索引和基础Memtable写入数据。此外,两者都会同时刷新到SSTables。因此,索引数据将比源数据有一个单独的生命周期。

When we read data based on the secondary index, Cassandra first retrieves the primary keys for all matching rows in the index, and after that, it uses them to fetch all the data from the source table.

当我们基于二级索引读取数据时,Cassandra首先检索索引中所有匹配行的主键,之后,它使用这些主键从源表中获取所有数据。

4.2. SSTable-Attached Secondary Index (SASI)

4.2.SSTable-Attached Secondary Index (SASI)

SASI introduces the new idea of binding the SSTable lifecycle to the index. Performing in-memory indexing followed by flushing the index with the SSTable to disk reduces disk usage and saves CPU cycles.

SASI引入了将SSTable生命周期与索引绑定的新理念。执行内存中的索引,然后将索引与SSTable一起冲到磁盘上,可以减少磁盘的使用并节省CPU周期。

Let’s look at how we define a SASI index:

让我们看看我们如何定义SASI指数。

CREATE CUSTOM INDEX IF NOT EXISTS company_by_employee_age ON company (employee_age) USING 'org.apache.cassandra.index.sasi.SASIIndex';

The advantages of SASI are the tokenized text search, fast range scans, and in-memory indexing. On the other hand, a disadvantage is that it generates big index files, especially when enabling text tokenization.

SASI的优点是标记化文本搜索、快速范围扫描和内存索引。另一方面,缺点是它会产生大的索引文件,特别是在启用文本标记化时。

Finally, we should note that SASI indexes in DataStax Enterprise (DSE) are experimental. DataStax does not support SASI indexes for production.

最后,我们应该注意,DataStax Enterprise(DSE)中的SASI索引是实验性的。DataStax不支持用于生产的SASI索引。

4.3. Storage-Attached Indexing (SAI)

4.3.存储器附加索引(SAI)

Storage-Attached Indexing is a highly-scalable data indexing mechanism available for DataStax Astra and DataStax Enterprise databases. We can define one or more SAI indexes on any column and then use range queries (numeric only), CONTAINs semantics, and filter queries.

存储附加索引是一种高度可扩展的数据索引机制,可用于 DataStax Astra 和 DataStax Enterprise 数据库。我们可以在任何列上定义一个或多个SAI索引,然后使用范围查询(仅限数字)、CONTAINs语义和过滤查询。

SAI stores individual index files for each column and contains a pointer to the offset of the source data in the SSTable. Once we insert data into an indexed column, it will be written first to memory. Whenever Cassandra flushes data from memory to disk, it writes the index along with the data table.

SAI为每一列存储单独的索引文件,并包含一个指向SSTable中源数据的偏移量的指针。一旦我们在有索引的列中插入数据,它将首先被写入内存。每当Cassandra将数据从内存刷到磁盘时,它就会将索引和数据表一起写入。

This approach improves throughput by 43% and latency by 230% over 2i by reducing the overhead for writing. Compared to SASI and 2i, it uses significantly less disk space for indexing, has fewer failure points, and comes with a more simplified architecture.

这种方法通过减少写入的开销,使吞吐量比2i提高了43%,延迟提高了230%。与SASI和2i相比,它用于索引的磁盘空间大大减少,故障点也更少,并且具有更简化的架构。

Let’s define our index using SAI:

让我们用SAI来定义我们的指数。

CREATE CUSTOM INDEX ON company (employee_age) USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': false, 'normalize': false};

The normalize option transforms special characters to their base character. For example, we can normalize the German character ö to a regular o, enabling query matching without typing the special characters. So we can, for instance, search for the term “schön” by simply using “schon” as a condition.

归一化选项将特殊字符转换为其基本字符。例如,我们可以将德语字符ö规范化为正则o,使查询匹配无需输入特殊字符。因此,例如,我们可以通过简单地使用 “schon “作为条件来搜索 “schön “这个词。

4.4. Best Practices

4.4.最佳做法

Firstly, when we use secondary indexes in our queries, a recommendation is to add the partition key as a condition. As a result, we can reduce the read operation to a single node (plus replicas depending on the consistency level):

首先,当我们在查询中使用二级索引时,一个建议是将分区键作为一个条件加入。因此,我们可以将读取操作减少到一个节点上(根据一致性级别,加上复制)。

SELECT * FROM company WHERE employee_age = 30 AND company_name = "company_A";

Secondly, we can restrict the query to a list of partition keys and bound the number of nodes involved in  fetching the results:

其次,我们可以将查询限制在一个分区键的列表中,并约束获取结果时涉及的节点数量。

SELECT * FROM company WHERE employee_age = 30 AND company_name IN ("company_A", "company_B", "company_C");

Thirdly, if we need just a subset of the results, we can add a limit to the query. This also reduces the number of nodes involved in the read path:

第三,如果我们只需要结果的一个子集,我们可以给查询添加一个限制。这也减少了读取路径中涉及的节点数量。

SELECT * FROM company WHERE employee_age = 30 LIMIT 10;

Additionally, we must avoid defining secondary indexes on columns with very low cardinality (gender, true/false columns, etc.) because they produce very wide partitions that impact performance.

此外,我们必须避免在cardinality非常低的列上定义二级索引(性别、真/假列等),因为它们会产生非常宽的分区,影响性能。

Similarly, columns with high cardinality (social security number, email, etc.) will result in indexes with very granular partitions, which in the worst case will force the cluster coordinator to hit all the primary replicas.

同样地,具有高cardinality的(社会安全号码、电子邮件等)将导致索引具有非常细化的分区,在最坏的情况下,这将迫使集群协调者击中所有的主副本。

Lastly, we must avoid using secondary indexes on frequently updated columns. The rationale behind this is that Cassandra uses immutable data structures, and frequent updates increases the number of write operations on disk.

最后,我们必须避免在频繁更新的列上使用二级索引。这背后的原理是,Cassandra使用不可变的数据结构,而频繁的更新会增加磁盘上的写操作数量。

5. Conclusion

5.总结

In this article, we have explored how Cassandra partitions the data across the data center and explored three types of secondary indexes.

在这篇文章中,我们已经探讨了Cassandra如何在整个数据中心划分数据,并探讨了三种类型的二级索引。

Before considering a secondary index, we should consider denormalizing our data into a second table and keeping it up to date with the main table if we plan to access it frequently.

在考虑二级索引之前,如果我们计划频繁地访问数据,我们应该考虑将数据反规范化到第二个表中,并使其与主表保持同步。

On the other hand, if the data access is sporadic, adding a separate table adds unjustified complexity. Therefore, introducing a secondary index is a better option. Undoubtedly, storage-attached indexing is the best choice out of the three indexing options we have, offering the best performance and simplified architecture.

另一方面,如果数据访问是零星的,添加一个单独的表会增加不合理的复杂性。因此,引入一个二级索引是一个更好的选择。毋庸置疑,在我们的三种索引选择中,存储连接索引是最好的选择,它提供了最好的性能和简化的架构。