Cluster, Datacenters, Racks and Nodes in Cassandra

Last modified: June 27, 2021

1. Introduction

In this tutorial, we’ll take a close look at Cassandra’s architecture. We’ll learn how data is stored in a distributed architecture, and we’ll discuss the basic architectural components.

2. Cassandra Overview

Apache Cassandra is a distributed NoSQL database management system. Its main advantage is that it can handle a high volume of structured data across commodity servers. Moreover, it provides high availability and has no single point of failure. Cassandra achieves this with a ring-type architecture, where the smallest logical unit is a node. It partitions data to distribute it across the nodes and to optimize queries.

Every row has a partition key, and Cassandra hashes that key to produce a token. Each node is assigned a range of tokens. Consequently, rows whose partition keys hash to the same token are stored on the same node. The ring architecture of nodes is shown below:
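The token lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four node names, the 32-bit ring size, and the use of MD5 are hypothetical stand-ins (Cassandra's default partitioner is Murmur3, and its token range is much larger).

```python
import bisect
import hashlib

# Hypothetical ring: each node owns the token range that ends at its own token.
# MD5 and the 32-bit ring are stand-ins chosen for readability, not Cassandra's real values.
RING_SIZE = 2 ** 32

NODE_TOKENS = {
    "node-a": 0,
    "node-b": RING_SIZE // 4,
    "node-c": RING_SIZE // 2,
    "node-d": 3 * RING_SIZE // 4,
}


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token in the range [0, RING_SIZE)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def owner_of(token: int) -> str:
    """Return the node holding the first token at or after `token`, wrapping around the ring."""
    ordered = sorted((t, name) for name, t in NODE_TOKENS.items())
    tokens = [t for t, _ in ordered]
    index = bisect.bisect_left(tokens, token) % len(ordered)
    return ordered[index][1]


if __name__ == "__main__":
    for key in ("user-1", "user-2", "user-3"):
        print(key, "->", owner_of(token_for(key)))
```

Because the token depends only on the partition key, two rows with the same key always land on the same node, whichever node receives the request.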

3. Cassandra Components

3.1. Node

A node is the basic infrastructure component of Cassandra. It’s a fully functional machine that connects with the other nodes in the cluster over a fast internal network.

Nodes exchange information about themselves and each other over this network using the gossip protocol. The machine can be a physical server, an EC2 instance, or a virtual machine. All nodes are organized in a ring topology. Importantly, every node is independent and has the same role in the ring: Cassandra arranges nodes in a peer-to-peer structure. The nodes hold the actual data.

Each node in a cluster can accept read and write requests. Therefore, it doesn’t matter where the data is actually located in the cluster: the node that receives a request forwards it to the nodes that hold the data. With a suitable consistency level, we get the newest version of the data.

3.2. Virtual Node

Newer versions of Cassandra use virtual nodes, or vnodes for short. A virtual node is the data storage layer within a server.

There are 256 virtual nodes per server by default (the num_tokens setting; Cassandra 4.0 lowered this default to 16). As we discussed in the previous section, each node has a range of tokens assigned. Every virtual node uses a sub-range of tokens from the node it belongs to.

These virtual nodes provide greater flexibility in the system. Consequently, it’s easier for Cassandra to add new nodes to the cluster when we need them. When data is unevenly distributed between nodes, we can adjust the number of virtual nodes per machine, for example by giving more virtual nodes to servers with more capacity.
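To see why virtual nodes make it easier to add servers, here is a minimal sketch that gives every server several tokens on the ring. The server names, the 32-bit ring, and the random token assignment are assumptions made for illustration; Cassandra has its own token allocation logic.

```python
import bisect
import random

RING_SIZE = 2 ** 32  # hypothetical ring size, kept small for readability


def assign_vnodes(servers, vnodes_per_server, seed=0):
    """Give every server `vnodes_per_server` random tokens; return sorted (token, server) pairs."""
    rng = random.Random(seed)
    ring = []
    for server in servers:
        for _ in range(vnodes_per_server):
            ring.append((rng.randrange(RING_SIZE), server))
    return sorted(ring)


def add_server(ring, server, vnodes_per_server, seed=1):
    """Add a server by inserting its vnode tokens; existing tokens stay where they are."""
    rng = random.Random(seed)
    new_tokens = [(rng.randrange(RING_SIZE), server) for _ in range(vnodes_per_server)]
    return sorted(ring + new_tokens)


def owner_of(ring, token):
    """The owner is the server holding the first vnode token at or after `token` (wrapping)."""
    tokens = [t for t, _ in ring]
    return ring[bisect.bisect_left(tokens, token) % len(ring)][1]


def ownership(ring):
    """Count how many tokens of the ring each server owns across all of its vnodes."""
    owned = {}
    previous = ring[-1][0] - RING_SIZE  # the first range wraps around from the last token
    for token, server in ring:
        owned[server] = owned.get(server, 0) + (token - previous)
        previous = token
    return owned


if __name__ == "__main__":
    ring = assign_vnodes(["s1", "s2", "s3"], vnodes_per_server=8)
    bigger = add_server(ring, "s4", vnodes_per_server=8)
    for name, size in sorted(ownership(bigger).items()):
        print(name, f"{size / RING_SIZE:.1%}")
```

In this model, a new server takes over small sub-ranges from many existing servers, and a token never moves between two of the old servers: its owner either stays the same or becomes the new server.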

3.4. Server

When we use the term server, we mean a machine with the Cassandra software installed. Every node runs a single instance of Cassandra, which is technically a server. As we said earlier, each Cassandra instance contains 256 virtual nodes by default. The Cassandra server runs the core processes, such as spreading replicas across nodes and routing requests.

3.5. Rack

A Cassandra rack is a logical grouping of nodes within the ring. In other words, a rack is a collection of servers. The database uses racks to ensure that replicas are distributed among different logical groupings. As a result, operations are not sent to just one node: multiple nodes, each on a separate rack, provide greater fault tolerance and availability.
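The idea of spreading replicas over racks can be sketched as a walk around the ring that prefers racks not used yet. This is a simplified illustration, not Cassandra's actual placement code; the function name, node names, and rack names are hypothetical, and each node is assumed to appear once in the ring order.

```python
def place_replicas(ring_order, rack_of, start_index, replication_factor):
    """Pick replicas clockwise from `start_index`, preferring racks that hold no replica yet."""
    chosen, skipped, used_racks = [], [], set()
    for step in range(len(ring_order)):
        node = ring_order[(start_index + step) % len(ring_order)]
        if rack_of[node] in used_racks:
            skipped.append(node)  # same rack as an earlier replica: keep only as a fallback
        else:
            chosen.append(node)
            used_racks.add(rack_of[node])
        if len(chosen) == replication_factor:
            return chosen
    # Fewer racks than replicas: fill the rest with nodes on racks that are already used.
    return (chosen + skipped)[:replication_factor]


if __name__ == "__main__":
    ring_order = ["n1", "n2", "n3", "n4", "n5", "n6"]
    rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3", "n6": "r3"}
    print(place_replicas(ring_order, rack_of, start_index=0, replication_factor=3))
```

With three racks and three replicas, every replica ends up on a different rack, so losing one rack still leaves two copies available.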

3.6. Datacenters

A datacenter is a logical set of racks, and it should contain at least one rack. We can say that a Cassandra datacenter is a group of related nodes configured within a cluster for replication purposes. It helps to reduce latency and to isolate transactions from the impact of other workloads. What’s more, replication can be set up to write to multiple datacenters. As a result, Cassandra provides additional flexibility in architectural design and organization.

3.7. Cluster

A cluster is a component that contains one or more datacenters. It’s the outermost storage container in the database. One database contains one or more clusters. The hierarchy of elements in the Cassandra cluster is:

First, we have clusters that consist of datacenters. Inside datacenters, we have racks of nodes, and each node contains 256 virtual nodes by default.
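One way to picture this hierarchy is as nested data structures. The classes below are a hypothetical model for illustration only, not part of any Cassandra API; the default of 256 virtual nodes is the figure quoted in this article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    num_vnodes: int = 256  # default quoted in the text


@dataclass
class Rack:
    name: str
    nodes: List[Node] = field(default_factory=list)


@dataclass
class Datacenter:
    name: str
    racks: List[Rack] = field(default_factory=list)


@dataclass
class Cluster:
    name: str
    datacenters: List[Datacenter] = field(default_factory=list)

    def all_nodes(self) -> List[Node]:
        """Flatten the hierarchy: cluster -> datacenters -> racks -> nodes."""
        return [
            node
            for datacenter in self.datacenters
            for rack in datacenter.racks
            for node in rack.nodes
        ]

    def total_vnodes(self) -> int:
        """Sum the virtual nodes of every node in the cluster."""
        return sum(node.num_vnodes for node in self.all_nodes())


if __name__ == "__main__":
    cluster = Cluster("demo", [
        Datacenter("dc1", [Rack("r1", [Node("n1"), Node("n2")]), Rack("r2", [Node("n3")])]),
        Datacenter("dc2", [Rack("r1", [Node("n4")])]),
    ])
    print(len(cluster.all_nodes()), "nodes,", cluster.total_vnodes(), "vnodes")
```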

4. Data Replication

Now that we know the basic components of Cassandra, let’s talk about how Cassandra manages data across this structure. Some systems cannot tolerate data loss or interruptions in data delivery. The solution is to keep a backup that is available when a problem occurs. For example, hardware can fail, or links can go down at any time during data processing. Cassandra stores data replicas on multiple nodes to ensure reliability and fault tolerance.

4.1. Replication Factor

We determine the number of replicas and their location through the replication factor and the replication strategy. The replication factor is the total number of replicas across the cluster. A replication factor of one means that only one copy of each row exists in the cluster, a factor of two means two copies, and so on. We set this factor per keyspace, either for the whole cluster or separately for each datacenter.
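As a rough sketch of what the replication factor means in practice, assume the simplest placement: the node that owns the token stores the first copy, and the next nodes clockwise on the ring store the others. The function and node names below are hypothetical.

```python
def simple_replicas(ring_order, owner_index, replication_factor):
    """Return the owning node plus the next nodes clockwise, one node per replica."""
    count = min(replication_factor, len(ring_order))  # cannot have more replicas than nodes
    return [ring_order[(owner_index + i) % len(ring_order)] for i in range(count)]


if __name__ == "__main__":
    ring_order = ["a", "b", "c", "d"]
    for rf in (1, 2, 3):
        print("RF", rf, "->", simple_replicas(ring_order, owner_index=3, replication_factor=rf))
```

With a replication factor of one, losing the owning node makes the row unavailable; with three, two other nodes still hold a copy.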

4.2. Replication Strategy

The replication strategy controls how the replicas are chosen. All replicas are equally important; there is no primary replica. Cassandra has two strategies for determining which nodes hold replicated data. The first one, SimpleStrategy, is unaware of the logical division of nodes into datacenters and racks. The second one, NetworkTopologyStrategy, is more complicated and is both rack-aware and datacenter-aware. With NetworkTopologyStrategy, we can define how many replicas are placed in each datacenter. Additionally, it tries to avoid placing two replicas on the same rack.
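In practice, the strategy and the replication factor are chosen when a keyspace is created. The small helper below only builds the CQL statement as a string and does not talk to a cluster; the helper itself, the keyspace name shop, and the datacenter names dc1 and dc2 are hypothetical.

```python
def create_keyspace_cql(keyspace, strategy, options):
    """Build a CREATE KEYSPACE statement for the given replication strategy and options."""
    pairs = ["'class': '" + strategy + "'"]
    pairs += ["'" + name + "': " + str(value) for name, value in options.items()]
    return "CREATE KEYSPACE " + keyspace + " WITH replication = {" + ", ".join(pairs) + "};"


if __name__ == "__main__":
    # SimpleStrategy takes a single replication factor for the whole cluster.
    print(create_keyspace_cql("shop", "SimpleStrategy", {"replication_factor": 3}))
    # NetworkTopologyStrategy takes one replication factor per datacenter.
    print(create_keyspace_cql("shop", "NetworkTopologyStrategy", {"dc1": 3, "dc2": 2}))
```

The second statement asks for three replicas in dc1 and two in dc2, which is how the per-datacenter replica counts mentioned above are expressed.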

5. Conclusion

This tutorial introduced the basic components of Cassandra’s architecture. We covered the key concepts that ensure this database’s high availability and partition tolerance. We also talked about data partitioning and data replication.