1. Introduction
Apache Cassandra is an open-source distributed NoSQL database built to handle large amounts of data across multiple data centers. Cassandra’s data model is discussed in many documents and papers, and the descriptions are often confusing or contradictory. The confusion arises because Cassandra can store and access column families separately, which leads people to classify it mistakenly as column-oriented rather than column-family.
In this tutorial, we’ll look at the differences between data models and establish the nature of Cassandra’s partitioned row store data model.
2. Database Data Models
The README in the Apache Cassandra Git repository states:
Cassandra is a partitioned row store. Rows are organized into tables with a required primary key.
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster.
Row store means that like relational databases, Cassandra organizes data by rows and columns.
From this, we can conclude that Cassandra is a partitioned row store. However, column-family and wide-column are also suitable names, as we’ll see below.
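The partitioning described in the README excerpt can be illustrated with a small token ring. This is only a sketch under simplifying assumptions: it hashes keys with MD5 and gives each node a single token, whereas Cassandra's default partitioner is based on Murmur3 and nodes usually own many virtual tokens. The `Ring` class, the `token` helper, and the node names are hypothetical and are not part of any Cassandra API.

```python
import hashlib
from bisect import bisect_left


def token(key: str) -> int:
    """Map a string to a 64-bit integer token (MD5 is used only for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """A toy token ring: each node owns the range ending at its own token."""

    def __init__(self, nodes):
        # Sort nodes by their token so ownership can be found by binary search.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        # The first node whose token is >= the key's token owns the key;
        # the modulo wraps around to the first node at the end of the ring.
        i = bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return self._ring[i][1]


nodes = ["node-a", "node-b", "node-c"]
keys = [f"employee-{i}" for i in range(1, 21)]

before = Ring(nodes)
after = Ring(nodes + ["node-d"])  # a machine joins the cluster

# Keys that change owner after the join should all land on the new node.
moved = [k for k in keys if before.owner(k) != after.owner(k)]
print({k: after.owner(k) for k in moved})
```

The application only ever supplies a partition key. Which machine holds the data is decided by the ring, and adding a node reassigns only the keys that fall into the new node's range.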
A column-family data model is not the same as a column-oriented model. A column-family database stores a row with all its column families together, whereas a column-oriented database simply stores data tables by column rather than by row.
2.1. Row-Oriented and Column-Oriented Data Stores
Let’s take an Employees table as an example:
ID Last First Age
1 Cooper James 32
2 Bell Lisa 57
3 Young Joseph 45
A row-oriented database stores the above data as:
1,Cooper,James,32;2,Bell,Lisa,57;3,Young,Joseph,45;
While a column-oriented database stores the data as:
1,2,3;Cooper,Bell,Young;James,Lisa,Joseph;32,57,45;
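The two layouts above can be reproduced with a few lines of Python. This is a minimal sketch of the serialization order only; real storage engines add encoding, compression, and indexing. The function names `row_oriented` and `column_oriented` are chosen here for illustration.

```python
# Employees table from the example, as a list of row tuples.
rows = [
    (1, "Cooper", "James", 32),
    (2, "Bell", "Lisa", 57),
    (3, "Young", "Joseph", 45),
]


def row_oriented(rows):
    """Serialize record by record: all fields of one row sit together."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def column_oriented(rows):
    """Serialize column by column: all values of one field sit together."""
    columns = zip(*rows)  # transpose rows into columns
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)


print(row_oriented(rows))
print(column_oriented(rows))
```

Reading one whole employee touches a single contiguous run in the row-oriented string, while computing something over every Age touches a single contiguous run in the column-oriented one.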
Cassandra does not store its data like either a row-oriented or a column-oriented database.
2.2. Partitioned Row Store
Cassandra uses a partitioned row store, which means rows contain columns. A column-family database stores data with keys mapped to values and the values grouped into multiple column families.
In a partitioned row store, the Employees data looks like this:
"Employees" : {
row1 : { "ID":1, "Last":"Cooper", "First":"James", "Age":32},
row2 : { "ID":2, "Last":"Bell", "First":"Lisa", "Age":57},
row3 : { "ID":3, "Last":"Young", "First":"Joseph", "Age":45},
...
}
A partitioned row store has rows that contain columns, but the rows do not all need to have the same number of columns (as in Bigtable). Some rows may have thousands of columns, while others may have just one.
We can think of a partitioned row store as a two-dimensional key-value store, where a row key and a column key are used to access data. To access the smallest unit of data (a column), we must first specify the row name (key) and then the column name.
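The two-dimensional key-value view can be sketched with nested dictionaries. This models only the access pattern (row key first, then column key), not how Cassandra lays data out on disk. The `get_column` helper and the sparse `row4` entry are hypothetical additions for illustration.

```python
# Outer dict: row key -> row. Inner dict: column key -> value.
employees = {
    "row1": {"ID": 1, "Last": "Cooper", "First": "James", "Age": 32},
    "row2": {"ID": 2, "Last": "Bell", "First": "Lisa", "Age": 57},
    "row3": {"ID": 3, "Last": "Young", "First": "Joseph", "Age": 45},
    # Hypothetical sparse row: rows are not required to share columns.
    "row4": {"ID": 4},
}


def get_column(table, row_key, column_key):
    """Look up one column value: first by row key, then by column key."""
    row = table.get(row_key, {})
    return row.get(column_key)  # None if the row has no such column


print(get_column(employees, "row2", "Last"))
print(get_column(employees, "row4", "Age"))
```

Because each row carries its own column names, a missing column is simply absent from that row rather than stored as an empty cell.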
3. Conclusion
In this article, we learned that Cassandra’s partitioned row store makes it a column-family database rather than a column-oriented one. The defining characteristic of the column-family model is that column information is part of the data. This is the main difference between a column-family model and both row-oriented and column-oriented models. The term wide-column comes from the idea that tables holding an unlimited number of columns are wide by nature.
We’ve also seen that rows in a column-family datastore don’t need to share the same column names or the same number of columns. This enables schema-free or semi-structured tables.