Cassandra Data Modeling: Primary, Clustering, Partition, and Compound Keys
Today, we dive into how Cassandra models data: with an assortment of keys used for grouping and organizing data into columns and rows in the database.
Join the DZone community and get the full member experience.
Join For FreeIn this post, we are going to discuss the different keys available in Cassandra. The primary key concept in Cassandra is different from relational databases. Therefore, it is worth spending some time to understand it.
Let's take an example and create a student table which has student_id
as a primary key column.
1) Primary Key
create table person (student_id int primary key, fname text, lname text,
dateofbirth timestamp, email text, phone text );
In Cassandra, a table can have a number of rows. Each row is referenced by a primary key, also called the row key. There are a number of columns in a row but the number of columns can vary in different rows.
For example, one row in a table can have three columns whereas another row in the same table can have 10 columns. It is also important to note that in Cassandra, both column names and values have binary types. That means column names can have binary values, such as strings, timestamps, or an integer, etc. This is different from SQL databases, where each row in a SQL table has a fixed number of columns, and column names can only be text.
We saw that student_id
was used as a row key to refer to person
data.
2) Compound Primary Key
As the name suggests, a compound primary key is comprised of one or more columns that are referenced in the primary key. One component of the compound primary key is called partition key, whereas the other component is called the clustering key. The following are different variations of primary keys. Please note that C1, C2, C3,… and so on represent columns in the table.
C1: Primary key has only one partition key and no cluster key.
(C1, C2): Column C1 is a partition key and column C2 is a cluster key.
(C1,C2,C3,…): Column C1 is a partition key and columns C2, C3, and so on make the cluster key.
(C1, (C2, C3,…)): It is same as 3, i.e., column C1 is a partition key and columns C2,C3,… make the cluster key.
((C1, C2,…), (C3,C4,…)): columns C1, C2 make partition key and columns C3,C4,… make the cluster key.
It is important to note that when the compound key is C1, C2, C3, then the first key, C1, becomes the partition key, and the rest of the keys become part of the cluster key. In order to make composite partition keys, we have to specify keys in parenthesis such as: ( ( C1,C2) , C3, C4). In this case, C1 and C2 are part of the partition keys, and C3 and C4 are part of the cluster key.
3) Partition Key
The purpose of a partition key is to identify the partition or node in the cluster that stores that row. When data is read or written from the cluster, a function called Partitioner is used to compute the hash value of the partition key. This hash value is used to determine the node/partition which contains that row. For example, rows whose partition key values range from 1000 to 1234 may reside in node A, and rows with partition key values range from 1235 to 2000 may reside in node B, as shown in figure 1. If a row contains partition key whose hash value is 1233 then it will be stored in node A.
4) Clustering Key
The purpose of the clustering key is to store row data in a sorted order. The sorting of data is based on columns, which are included in the clustering key. This arrangement makes it efficient to retrieve data using the clustering key.
Example
To make these concepts clear, we will consider the example of a school system.
Create a keyspace with replication strategy ‘SimpleStrategy’ and replication_factor 1.
create keyspace Students_Details with replication = {‘class’ : ‘SimpleStrategy’, ‘replication_factor’:1};
Now switch to the students_details keyspace:
use students_details;
Check the number of tables present in the keyspace:
students_details> desc TABLES;
We will create a table, student
, that contains general information about any student.
create table student (stuid int, avg_marks float, description text,
primary key (stuid));
Type the following insert statements to enter some data into this table.
insert into student (stuid, avg_marks, description) values (1,25.5,’student 1′);
insert into student (stuid, avg_marks, description) values (2,35.5,’student 2′);
To view the details just inserted...
students_details> select * from student;
stuid | avg_marks | description
——-+———–+————-
1 | 25.5 | student 1
2 | 35.5 | student 2
We can see how Cassandra has stored this data under the hood by using the cassandra-cli
tool. Run cassandra-cli
in a separate terminal windo.
Important: The CLI utility is deprecated and will be removed in Cassandra 3.0. For ease of use and performance, switch from Thrift and CLI to CQL and cqlsh.)
So if you're using a Cassandra verison above 3.0, then use the below commands.
Using the EXPAND Command in cqlsh , we can view the details info for the queries . EXPAND with no arguments shows the current value of the expanded setting.
cqlsh:students_details> EXPAND
If expanded output is disabled. Use EXPAND ON
to enable it.
Now view the details inserted above (the studid will be present in a red color in cqlsh, representing the primary key/row key)
cqlsh:students_details> select * from student;
@ Row 1
————-+———–
stuid | 1
avg_marks | 25.5
description | student 1
@ Row 2
————-+———–
stuid | 2
avg_marks | 35.5
description | student 2
(2 rows)
We can see from the above output that the stuid has become the row key, and it identifies individual rows.
cqlsh:students_details> select token(stuid) from student;
@ Row 1
———————+———————-
system.token(stuid) | -4069959284402364209
@ Row 2
———————+———————-
system.token(stuid) | -3248873570005575792
Also, you can see that there are two tokens.
We can use columns in the primary key to filter data in the select statement. Type the following command in the cqlsh window:
select * from student where stuid = 1;
Now we will create another table called marks, which records marks of each student every day (say every day, new exams and marks are recorded). Type the following command on cqlsh:
create table marks(stuid int,exam_date timestamp,marks float, exam_name text,
primary key (stuid,exam_date));
This statement creates the marks table with a primary key (stuid , exam_date ). As the primary key has two components, the first component is considered a partition key, and the second component becomes the cluster key. Add some data into the table:
insert into marks(stuid ,exam_date ,marks ,exam_name) values (1,’2016-11-10′,76 ,’examA’);
insert into marks(stuid ,exam_date ,marks ,exam_name) values (1,’2016-11-11′,90 ,’examB’);
insert into marks(stuid ,exam_date ,marks ,exam_name) values (1,’2016-11-12′,68 ,’examC’);
cqlsh:students_details> select * from marks;
@ Row 1
———–+————————–
stuid | 1
exam_date | 2016-11-09 18:30:00+0000
exam_name | examA
marks | 76
@ Row 2
———–+————————–
stuid | 1
exam_date | 2016-11-10 18:30:00+0000
exam_name | examB
marks | 90
@ Row 3
———–+————————–
stuid | 1
exam_date | 2016-11-11 18:30:00+0000
exam_name | examC
marks | 68
Now, let's see how the partition concept has been applied:
cqlsh:students_details> select token(stuid) from marks;
@ Row 1
———————+———————-
system.token(stuid) | -4069959284402364209
@ Row 2
———————+———————-
system.token(stuid) | -4069959284402364209
@ Row 3
———————+———————-
system.token(stuid) | -4069959284402364209
We can see all the three rows have the same partition token, hence Cassandra stores only one row for each partition key. All the data associated with that partition key is stored as columns in the datastore. The data that we have stored through three different insert statements have the same stuid value, i.e. 1, therefore, all the data is saved in that row as columns, i.e under one partition.
If you remember, we discussed before that the second component of a primary key is called the clustering key. The role of the clustering key is to group related items together. All the data that is inserted against same clustering key is grouped together.
In this case, all the columns, such as exam_name and marks, will be grouped by value in exam_date, i.e 2016-11-11 18:30:00+0000, by default in ascending order .
I hope these examples have helped you to clarify some of the concepts of data modeling in Cassandra. Please feel free to leave any comments.
Published at DZone with permission of Piyush Rana, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments