Getting started with Cassandra

The Growth of Big Data - Big Data is one of the key forces driving the growth and popularity of NoSQL in business. A nearly limitless array of data collection technologies, from simple online actions to point-of-sale systems, GPS tools, smartphones, tablets, and sophisticated sensors, acts as a force multiplier for data growth.

In fact, one of the first reasons to adopt NoSQL is that you have a Big Data project to tackle. A Big Data project is normally typified by:
  1. High data velocity – lots of data coming in very quickly, possibly from different locations.
  2. Data variety – storage of data that is structured, semi-structured and unstructured.
  3. Data volume – data sets that run to many terabytes or petabytes.
  4. Data complexity – data that is stored and managed in different locations or data centers.

Data model       Performance  Scalability  Flexibility  Complexity  Functionality
Key-value store  High         High         High         None        Variable (None)
Column store     High         High         Moderate     Low         Minimal
Document store   High         Variable     High         Low         Variable
Graph database   Variable     Variable     High         High        Graph theory

Cassandra is well suited to managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. It delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a dynamic data model designed for flexibility and fast response times. Its built-for-scale architecture means it can handle petabytes of information and thousands of concurrent users or operations per second.

An Apache Software Foundation project, Cassandra is an open source, distributed, column-oriented database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra does not support joins or subqueries. Instead, it emphasizes denormalization through features like collections.

Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster. When a node goes down, read/write requests can be served from other nodes in the network.
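
To make the idea concrete, here is a minimal Python sketch of a toy token ring. It is only an illustration: the class name ToyRing, the MD5-based tokens, and the "next nodes clockwise" replica rule are simplifying assumptions, not Cassandra's actual partitioner or replication strategy. It shows that any live node can coordinate a request and that a surviving replica answers when another replica is down.

```python
import bisect
import hashlib


class ToyRing:
    """Toy token ring (hypothetical); not Cassandra's real placement logic."""

    def __init__(self, node_names, replication_factor=2):
        self.rf = replication_factor
        # Assumption: one token per node, derived from the node name.
        self.ring = sorted((self._token(name), name) for name in node_names)
        self.tokens = [token for token, _ in self.ring]
        self.down = set()  # names of nodes currently marked as failed

    @staticmethod
    def _token(text):
        return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key):
        """Walk clockwise from the key's token and take the next rf nodes."""
        start = bisect.bisect_left(self.tokens, self._token(key)) % len(self.ring)
        count = min(self.rf, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def read(self, coordinator, key):
        """Any live node may coordinate; it forwards to a live replica."""
        if coordinator in self.down:
            raise RuntimeError(f"{coordinator} is down; pick another coordinator")
        live = [name for name in self.replicas(key) if name not in self.down]
        if not live:
            raise RuntimeError("no live replica holds this key")
        return live[0]


ring = ToyRing(["node-a", "node-b", "node-c"], replication_factor=2)
owners = ring.replicas("emp:3")
ring.down.add(owners[0])  # simulate a failed replica
coordinator = next(name for _, name in ring.ring if name not in ring.down)
served_by = ring.read(coordinator, "emp:3")  # answered by the surviving replica
```

The coordinator does not need to own the key; it only needs to know which nodes do.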

The key components of Cassandra are as follows:

1. Node − It is the place where data is stored.
2. Data center − It is a collection of related nodes.
3. Cluster − A cluster is a component that contains one or more data centers.
4. Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
5. Mem-table − A mem-table is a memory-resident data structure. After the commit log, data is written to the mem-table. A single column family can sometimes have multiple mem-tables.
6. SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
7. Bloom filter − A Bloom filter is a fast, probabilistic data structure for testing whether an element is a member of a set. It can report false positives but never false negatives. It acts as a special kind of cache. Bloom filters are consulted on every read query.
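
The way these components fit together can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the names ToyNode and ToyBloomFilter, the flush threshold counted in rows, and the SHA-256-based hashing are all hypothetical choices made for illustration.

```python
import hashlib


class ToyBloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class ToyNode:
    """Toy write path: commit log, then mem-table, then flush to an SSTable."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []  # append-only record used for crash recovery
        self.memtable = {}    # memory-resident writes
        self.sstables = []    # (bloom filter, immutable dict) pairs, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write first
        self.memtable[key] = value            # 2. then update the mem-table
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                     # 3. flush once the threshold is hit

    def _flush(self):
        bloom = ToyBloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(self.memtable)))
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need replaying

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, table in reversed(self.sstables):  # newest SSTable first
            # The Bloom filter lets us skip SSTables that cannot hold the key.
            if bloom.might_contain(key) and key in table:
                return table[key]
        return None


node = ToyNode(flush_threshold=2)
node.write("emp:1", "foo")
node.write("emp:3", "Stag1")  # second row reaches the threshold and flushes
node.write("emp:1", "bar")    # newer value stays in the mem-table
```

Reads check the mem-table first, so the newer value for emp:1 wins over the copy already flushed to the SSTable.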

nodetool cfstats : displays statistics for each table and keyspace.
nodetool cfhistograms : provides statistics about a table, including read/write latency, row size, column count, and number of SSTables.
nodetool netstats : provides statistics about network operations and connections.
nodetool tpstats : provides statistics about the number of active, pending, and completed tasks for each stage of Cassandra operations by thread pool.
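nodetool status : displays the state, load, and ownership of each node in the cluster.
cqlsh machine_ip : opens the CQL shell (cqlsh) connected to the node at machine_ip.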
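
If you want to collect these statistics from a script, a small Python wrapper is one option. The sketch below is an assumption-laden helper, not an official tool: the function names nodetool_argv and run_nodetool are hypothetical, and it assumes nodetool is on the PATH and accepts -h for the target host.

```python
import shutil
import subprocess


def nodetool_argv(subcommand, *args, host=None):
    """Build the argument list for a nodetool call (hypothetical helper)."""
    argv = ["nodetool"]
    if host is not None:
        argv += ["-h", host]  # target a specific node instead of localhost
    return argv + [subcommand, *args]


def run_nodetool(subcommand, *args, host=None, timeout=60):
    """Run nodetool and return its stdout, or None if nodetool is not installed."""
    if shutil.which("nodetool") is None:
        return None
    result = subprocess.run(
        nodetool_argv(subcommand, *args, host=host),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Subcommands from the list above that need no extra arguments.
    for name in ("status", "netstats", "tpstats"):
        output = run_nodetool(name)
        print(f"== nodetool {name} ==")
        print(output if output is not None else "nodetool not found on PATH")
```

cfhistograms takes a keyspace and a table, so it would be called as run_nodetool("cfhistograms", "key_space", "emp").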

cqlsh command list:
HELP - Displays help topics for all cqlsh commands.
CAPTURE - Captures the output of a command and adds it to a file.
CONSISTENCY - Shows the current consistency level, or sets a new consistency level.
COPY - Copies data to and from Cassandra.
DESCRIBE - Describes the current Cassandra cluster and its objects.
EXPAND - Expands the output of a query vertically.
EXIT - Terminates the cqlsh session.
PAGING - Enables or disables query paging.
SHOW - Displays details of the current cqlsh session, such as the Cassandra version, host, or data type assumptions.
SOURCE - Executes a file that contains CQL statements.
TRACING - Enables or disables request tracing.


To upgrade an existing Cassandra installation, follow these steps:

  1. mkdir ~/cassandra_backup
  2. sudo cp -r /etc/cassandra/* ~/cassandra_backup
  3. sudo vi /etc/cassandra/cassandra.yaml, then set num_tokens to 1, uncomment initial_token, and set it to 1
  4. nodetool upgradesstables
  5. nodetool drain
  6. sudo service cassandra stop
  7. mkdir ~/cassandra_backup_new && sudo cp -r /etc/cassandra/* ~/cassandra_backup_new
  8. sudo apt-get install cassandra=2.1.12
  9. Open the old and new cassandra.yaml files and diff them.
  10. Merge the diffs by hand, including the partitioner setting, from the old file into the new one.
  11. Do not use the default partitioner setting in the new cassandra.yaml, because the default has changed to the Murmur3Partitioner, which can only be used for new clusters. Once data has been added to a cluster, you cannot change the partitioner without reworking tables, which is not practical. Keep your old partitioner setting in the new cassandra.yaml file.
  12. Save the file as cassandra.yaml.
    Prompt shown by the package manager for the configuration file:

    Configuration file '/etc/cassandra/cassandra.yaml'
     ==> Modified (by you or by a script) since installation.
     ==> Package distributor has shipped an updated version.
       What would you like to do about it ?  Your options are:
        Y or I  : install the package maintainer's version
        N or O  : keep your currently-installed version
          D     : show the differences between the versions
          Z     : start a shell to examine the situation
     The default action is to keep your current version.
    *** cassandra.yaml (Y/I/N/O/D/Z) [default=N] ?
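
For steps 9 and 10, a short Python script can help spot what changed between the two configuration files before you merge by hand. This is a rough sketch: the function names are hypothetical, the sample settings are made up for illustration, and the top-level parser is a deliberately naive line scan rather than a real YAML parser.

```python
import difflib


def diff_config(old_text, new_text, old_name="cassandra.yaml.old",
                new_name="cassandra.yaml.new"):
    """Return a unified diff of the two files as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm=""))


def top_level_settings(text):
    """Naive scan for unindented `key: value` lines (not a YAML parser)."""
    settings = {}
    for line in text.splitlines():
        # Skip blanks, comments, list items, and indented (nested) lines.
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def changed_settings(old_text, new_text):
    """Map each differing top-level key to its (old, new) values."""
    old, new = top_level_settings(old_text), top_level_settings(new_text)
    return {key: (old.get(key), new.get(key))
            for key in sorted(set(old) | set(new))
            if old.get(key) != new.get(key)}


# Made-up sample contents; in practice read the backed-up and the new file.
old_yaml = (
    "cluster_name: 'Test Cluster'\n"
    "num_tokens: 1\n"
    "partitioner: org.apache.cassandra.dht.RandomPartitioner\n"
)
new_yaml = (
    "cluster_name: 'Test Cluster'\n"
    "num_tokens: 256\n"
    "partitioner: org.apache.cassandra.dht.Murmur3Partitioner\n"
)

changes = changed_settings(old_yaml, new_yaml)
diff_lines = diff_config(old_yaml, new_yaml)
```

In practice you would read ~/cassandra_backup/cassandra.yaml and /etc/cassandra/cassandra.yaml with open(...).read() and pay particular attention to the partitioner entry.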

Inserting values into tables:
-- 'cdr_record' is the data center name; '2' is its replication factor.
CREATE KEYSPACE key_space WITH replication = { 'class': 'NetworkTopologyStrategy', 'cdr_record': '2' };
-- Create the table before inserting into it.
CREATE TABLE key_space.emp ( emp_id int PRIMARY KEY, emp_name text, emp_city text, emp_sal varint, emp_phone varint );
INSERT INTO key_space.emp (emp_id, emp_name, emp_city, emp_sal, emp_phone) VALUES (1, 'foo', 'Bangalore', 24, 1234567);
INSERT INTO key_space.emp (emp_id, emp_city, emp_name, emp_phone, emp_sal) VALUES (3, 'Kolkata', 'Stag1', 4412, 60);

Nodetool Command Set:
1. nodetool status
2. nodetool info
3. nodetool -host machine_ip ring