Data Modeling in Cassandra

Last modified: July 22, 2017

1. Overview

Cassandra is a NoSQL database that provides high availability and horizontal scalability without compromising performance.

To get the best performance out of Cassandra, we need to carefully design the schema around query patterns specific to the business problem at hand.

In this article, we will review some of the key concepts around how to approach data modeling in Cassandra.

Before proceeding, you can go through our Cassandra with Java article to understand the basics and how to connect to Cassandra using Java.

2. Partition Key

Cassandra is a distributed database in which data is partitioned and stored across multiple nodes within a cluster.

The partition key is made up of one or more data fields and is used by the partitioner to generate a token via hashing to distribute the data uniformly across a cluster.
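
The idea can be illustrated with a small simulation. This is a minimal sketch, not Cassandra's implementation: it uses MD5 as a stand-in for the Murmur3 hash that Cassandra's default partitioner uses, and the `Ring` class, node names, and virtual-node count are all hypothetical.

```python
import hashlib
from bisect import bisect_left


def token(partition_key: tuple) -> int:
    """Hash the partition key fields into a 64-bit token (MD5 stand-in, an assumption)."""
    raw = "\x00".join(str(field) for field in partition_key).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big")


class Ring:
    """Toy token ring: each node owns several points (virtual nodes) on the ring."""

    def __init__(self, nodes, vnodes=64):
        self._points = sorted(
            (token((node, i)), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._points]

    def owner(self, partition_key: tuple) -> str:
        # First ring point at or after the key's token, wrapping around at the end
        idx = bisect_left(self._tokens, token(partition_key)) % len(self._points)
        return self._points[idx][1]


ring = Ring(["node-a", "node-b", "node-c"])
placement = {}
for user in range(3000):
    node = ring.owner((f"user-{user}",))
    placement[node] = placement.get(node, 0) + 1

# Rows with the same partition key always land on the same node,
# while different keys spread roughly evenly across the cluster.
print(placement)
```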

3. Clustering Key

A clustering key is made up of one or more fields. It groups together rows that share the same partition key and stores them in sorted order.

Let’s say that we are storing time-series data in Cassandra and we want to retrieve the data in chronological order. A clustering key that includes time-series data fields will be very helpful for efficient retrieval of data for this use case.

Note: The combination of partition key and clustering key makes up the primary key, which uniquely identifies a row within a table.
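
A toy model may help make this concrete. The sketch below is only an in-memory illustration under assumed names (`Table`, `upsert`, `read`, and the sensor data are hypothetical): rows are grouped by partition key, kept sorted by clustering key, and a write with an existing primary key overwrites the earlier row.

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime


class Table:
    """Toy table: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def upsert(self, partition_key, clustering_key, value):
        rows = self._partitions[partition_key]
        # Same primary key (partition key + clustering key): overwrite the row
        for i, (existing_key, _) in enumerate(rows):
            if existing_key == clustering_key:
                rows[i] = (clustering_key, value)
                return
        # New clustering key: insert at its sorted position
        insort(rows, (clustering_key, value))

    def read(self, partition_key, limit=None):
        # Rows come back already in clustering order, so no sort is needed
        return self._partitions.get(partition_key, [])[:limit]


readings = Table()
readings.upsert("sensor-1", datetime(2017, 7, 22, 10, 5), 21.5)
readings.upsert("sensor-1", datetime(2017, 7, 22, 10, 0), 21.0)
readings.upsert("sensor-2", datetime(2017, 7, 22, 10, 0), 18.2)
readings.upsert("sensor-1", datetime(2017, 7, 22, 10, 5), 22.0)  # overwrites 21.5

for ts, value in readings.read("sensor-1"):
    print(ts.isoformat(), value)
```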

4. Guidelines Around Query Patterns

Before starting with data modeling in Cassandra, we should identify the query patterns and ensure that they adhere to the following guidelines:

  1. Each query should fetch data from a single partition
  2. We should keep track of how much data is getting stored in a partition, as Cassandra has limits around the number of columns that can be stored in a single partition
  3. It is OK to denormalize and duplicate the data to support different kinds of query patterns over the same data
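
For the second guideline, a rough cell count per partition can be estimated before any data is written. The sketch below uses a commonly cited approximation (rows times the non-key, non-static columns, plus the static columns). The formula, the 2-billion-cell ceiling, and the figure of 50,000 posts per user are assumptions for illustration, not values taken from this article.

```python
# Commonly cited upper bound on cells per partition (an assumption here);
# practical targets are usually far lower.
MAX_CELLS_PER_PARTITION = 2_000_000_000


def cells_in_partition(rows, columns, primary_key_columns, static_columns=0):
    """Approximate the number of stored cells (values) in one partition."""
    regular_columns = columns - primary_key_columns - static_columns
    return rows * regular_columns + static_columns


# A posts table with 3 columns, 2 of which form the primary key,
# for a hypothetical user with 50,000 posts.
cells = cells_in_partition(rows=50_000, columns=3, primary_key_columns=2)
print(cells, cells < MAX_CELLS_PER_PARTITION)
```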

Based on the above guidelines, let’s look at some real-world use cases and how we would design Cassandra data models for them.

5. Real World Data Modeling Examples

5.1. Facebook Posts

Suppose that we are storing Facebook posts of different users in Cassandra. One of the common query patterns will be fetching the top 'N' posts made by a given user.

Thus, we need to store all data for a particular user on a single partition as per the above guidelines.

Also, using the post timestamp as the clustering key (here, a timeuuid post_id, which embeds the creation time) will be helpful for retrieving the top 'N' posts more efficiently.

Let’s define the Cassandra table schema for this use case:

CREATE TABLE posts_facebook (
  user_id uuid,
  post_id timeuuid, 
  content text,
  PRIMARY KEY (user_id, post_id) )
WITH CLUSTERING ORDER BY (post_id DESC);

Now, let’s write a query to find the top 20 posts for the user Anna:

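-- The uuid below is a placeholder for Anna's user_id; uuid literals are written unquoted
SELECT content FROM posts_facebook WHERE user_id = 123e4567-e89b-12d3-a456-426655440000 LIMIT 20;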

5.2. Gyms Across the Country

Suppose that we are storing the details of different partner gyms across the different cities and states of many countries and we would like to fetch the gyms for a given city.

Also, let’s say we need to return the results having gyms sorted by their opening date.

Based on the above guidelines, we should store the gyms located in a given city of a specific state and country on a single partition and use the opening date and gym name as a clustering key.

Let’s define the Cassandra table schema for this example:

CREATE TABLE gyms_by_city (
  country_code text,
  state text,
  city text,
  gym_name text,
  opening_date timestamp,
  PRIMARY KEY (
    (country_code, state, city),
    opening_date, gym_name))
  WITH CLUSTERING ORDER BY (opening_date ASC, gym_name ASC);

Now, let’s look at a query that fetches the first ten gyms by their opening date for the city of Phoenix within the U.S. state of Arizona:

SELECT * FROM gyms_by_city
  WHERE country_code = 'us' AND state = 'Arizona' AND city = 'Phoenix'
  LIMIT 10;

Next, let’s see a query that fetches the ten most recently-opened gyms in the city of Phoenix within the U.S. state of Arizona:

SELECT * FROM gyms_by_city
  WHERE country_code = 'us' AND state = 'Arizona' AND city = 'Phoenix'
  ORDER BY opening_date DESC
  LIMIT 10;

Note: Because the last query’s sort order is the reverse of the clustering order defined at table creation, it may run slower, as Cassandra has to read the partition’s rows in reverse order.

5.3. E-commerce Customers and Products

Let’s say we are running an e-commerce store and that we are storing the Customer and Product information within Cassandra. Let’s look at some of the common query patterns around this use case:

  1. Get Customer info
  2. Get Product info
  3. Get all Customers who like a given Product
  4. Get all Products a given Customer likes

We will start by using separate tables for storing the Customer and Product information. However, we need to introduce a fair amount of denormalization to support the 3rd and 4th queries shown above.

We will create two more tables to achieve this – "Customer_By_Liked_Product" and "Product_Liked_By_Customer".

Let’s look at the Cassandra table schema for this example:

CREATE TABLE Customer (
  cust_id text,
  first_name text, 
  last_name text,
  registered_on timestamp, 
  PRIMARY KEY (cust_id));

CREATE TABLE Product (
  prdt_id text,
  title text,
  PRIMARY KEY (prdt_id));

CREATE TABLE Customer_By_Liked_Product (
  liked_prdt_id text,
  liked_on timestamp,
  title text,
  cust_id text,
  first_name text,
  last_name text,
  PRIMARY KEY (liked_prdt_id, liked_on));

CREATE TABLE Product_Liked_By_Customer (
  cust_id text, 
  first_name text,
  last_name text,
  liked_prdt_id text, 
  liked_on timestamp,
  title text,
  PRIMARY KEY (cust_id, liked_on));

Note: To support both the queries, recently-liked products by a given customer and customers who recently liked a given product, we have used the “liked_on” column as a clustering key.
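
The cost of this denormalization is that every "like" has to be written to both lookup tables. The following in-memory sketch illustrates that dual write under assumed names (`record_like`, `recent`, and the sample customers and products are hypothetical); it mimics the two tables with dictionaries rather than talking to a real cluster.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source-of-truth data, standing in for the Customer and Product tables
customers = {
    "c1": {"first_name": "Anna", "last_name": "Lee"},
    "c2": {"first_name": "Ben", "last_name": "Ortiz"},
}
products = {"p1": {"title": "Pepsi"}}

# The two denormalized tables, keyed by their partition keys
customer_by_liked_product = defaultdict(list)  # liked_prdt_id -> rows
product_liked_by_customer = defaultdict(list)  # cust_id -> rows


def record_like(cust_id, prdt_id, liked_on):
    """Write the same 'like' into both tables, copying customer and product fields."""
    row = {
        "cust_id": cust_id,
        "liked_prdt_id": prdt_id,
        "liked_on": liked_on,
        "title": products[prdt_id]["title"],
        **customers[cust_id],
    }
    customer_by_liked_product[prdt_id].append(dict(row))
    product_liked_by_customer[cust_id].append(dict(row))


def recent(partition, limit=10):
    """Newest likes first, mimicking a read in descending liked_on order."""
    return sorted(partition, key=lambda r: r["liked_on"], reverse=True)[:limit]


record_like("c1", "p1", datetime(2017, 7, 1))
record_like("c2", "p1", datetime(2017, 7, 2))

# Each query pattern is answered from a single partition of its own table
print([r["first_name"] for r in recent(customer_by_liked_product["p1"])])
print([r["title"] for r in recent(product_liked_by_customer["c1"])])
```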

Let’s look at the query to find the ten Customers who most recently liked the product "Pepsi":

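-- 'pepsi_id' is a placeholder for the ID of the product titled "Pepsi"; we filter on the partition key
SELECT * FROM Customer_By_Liked_Product WHERE liked_prdt_id = 'pepsi_id' ORDER BY liked_on DESC LIMIT 10;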

And let’s see the query that finds the recently-liked products (up to ten) by a customer named "Anna":

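-- 'anna_id' is a placeholder for Anna's cust_id; we filter on the partition key
SELECT * FROM Product_Liked_By_Customer
  WHERE cust_id = 'anna_id' ORDER BY liked_on DESC LIMIT 10;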

6. Inefficient Query Patterns

Due to the way that Cassandra stores data, some query patterns are not at all efficient, including the following:

  • Fetching data from multiple partitions – this will require a coordinator to fetch the data from multiple nodes, store it temporarily in heap, and then aggregate the data before returning results to the user
  • Join-based queries – due to its distributed nature, Cassandra does not support table joins in queries the way a relational database does; joins have to be emulated in the application with multiple queries, which is slower and can also lead to inconsistency and availability issues
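
The first point can be illustrated by counting how many partitions each kind of read has to touch. This is a minimal in-memory sketch with hypothetical data and function names, not a measurement of a real cluster.

```python
# Toy store: partition key (user_id) -> rows in that partition
posts_by_user = {
    "anna": [
        {"post_id": 2, "content": "hello cassandra"},
        {"post_id": 1, "content": "first post"},
    ],
    "bob": [{"post_id": 3, "content": "cassandra tips"}],
    "carol": [{"post_id": 4, "content": "gym day"}],
}


def posts_for_user(user_id):
    """Filter on the partition key: exactly one partition is read."""
    return posts_by_user.get(user_id, []), 1


def posts_containing(word):
    """No partition key in the filter: every partition is visited and results merged."""
    matches, partitions_read = [], 0
    for rows in posts_by_user.values():
        partitions_read += 1
        matches.extend(row for row in rows if word in row["content"])
    return matches, partitions_read


rows, touched = posts_for_user("anna")
print(len(rows), "rows from", touched, "partition")

rows, touched = posts_containing("cassandra")
print(len(rows), "rows from", touched, "partitions")
```

In a real cluster the partitions visited by the second function may live on different nodes, which is what makes the coordinator's work expensive.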

7. Conclusion

In this tutorial, we have covered several best practices around how to approach data modeling in Cassandra.

Understanding the core concepts and identifying the query patterns in advance is necessary for designing a correct data model that gets the best performance from a Cassandra cluster.
