
 = Overview
 :exper: experimental

 Apache Cassandra is an open source, distributed, NoSQL database. It
 presents a partitioned wide column storage model with eventually
 consistent semantics.

 Apache Cassandra was initially designed at
 https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf[Facebook]
 using a staged event-driven architecture
 (http://www.sosp.org/2001/papers/welsh.pdf[SEDA]) to implement a
 combination of Amazon’s
 http://courses.cse.tamu.edu/caverlee/csce438/readings/dynamo-paper.pdf[Dynamo]
 distributed storage and replication techniques and Google's
 https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf[Bigtable]
 data and storage engine model. Dynamo and Bigtable were both developed
 to meet emerging requirements for scalable, reliable and highly
 available storage systems, but each had areas that could be improved.

 Cassandra was designed as a best-in-class combination of both systems to
 meet emerging large-scale storage requirements, in both data footprint
 and query volume. As applications began to require full global
 replication and always-available, low-latency reads and writes, a new
 kind of database model became necessary because the relational database
 systems of the time struggled to meet the requirements of global-scale
 applications.

 Systems like Cassandra are designed for these challenges and seek the
 following design objectives:

 * Full multi-master database replication
 * Global availability at low latency
 * Scaling out on commodity hardware
 * Linear throughput increase with each additional processor
 * Online load balancing and cluster growth
 * Partitioned key-oriented queries
 * Flexible schema

 == Features

 Cassandra provides the Cassandra Query Language (xref:cql/ddl.adoc[CQL]), an SQL-like
 language, to create and update database schema and access data. CQL
 allows users to organize data within a cluster of Cassandra nodes using:

 * *Keyspace*: Defines how a dataset is replicated, per datacenter.
 Replication is the number of copies saved per cluster.
 Keyspaces contain tables.
 * *Table*: Defines the typed schema for a collection of partitions.
 Tables contain partitions, which contain rows, which contain columns.
 New columns can be added to a table with zero downtime.
 * *Partition*: Defines the mandatory part of the primary key that every
 row must have. Cassandra uses it to identify the node in the cluster
 where the row is stored.
 All performant queries supply the partition key.
 * *Row*: Contains a collection of columns identified by a unique primary
 key made up of the partition key and optionally additional clustering
 keys.
 * *Column*: A single datum with a type which belongs to a row.
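
How a partition key identifies the nodes that store a row can be sketched as a lookup on a token ring. The sketch below is a simplified model, not Cassandra's implementation: `token_for`, `TokenRing`, and the node names are hypothetical, MD5 stands in for the default Murmur3 partitioner, and virtual nodes and datacenter-aware placement are left out.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 stand-in, not Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Hypothetical ring: each node owns the token range ending at its token."""

    def __init__(self, node_tokens):
        # Sort (token, node) pairs so the ring can be binary-searched.
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def replicas(self, partition_key, replication_factor):
        """Return the nodes holding this partition, walking the ring clockwise."""
        start = bisect.bisect_left(self._tokens, token_for(partition_key))
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


# Hypothetical three-node cluster; the tokens are chosen only for illustration.
ring = TokenRing({"node-a": 2**62, "node-b": 2**63, "node-c": 2**64 - 1})
owners = ring.replicas("user:42", replication_factor=2)
```

Because the replica set depends only on the partition key and the ring, a query that supplies the partition key can be routed straight to the nodes that hold the data. A query without it would have to consult every node.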

 CQL supports numerous advanced features over a partitioned dataset such
 as:

 * Single-partition lightweight transactions with atomic compare-and-set
 semantics
 * User-defined types, functions and aggregates
 * Collection types including sets, maps, and lists
 * Local secondary indices
 * (Experimental) materialized views
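
The compare-and-set semantics of lightweight transactions can be illustrated with a small in-memory model. This is a sketch of the semantics only: the `Partition` class and its methods are hypothetical, and a lock in one process stands in for the coordination Cassandra performs across replicas. In CQL the same conditions are expressed with `IF NOT EXISTS` or an `IF` clause on the statement.

```python
import threading


class Partition:
    """In-memory stand-in for one partition; not how Cassandra stores data."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # models atomicity within the partition

    def insert_if_not_exists(self, key, value):
        """Write only when no row exists; return whether the write applied."""
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = value
            return True

    def update_if_equals(self, key, expected, new_value):
        """Write only when the current value matches `expected`."""
        with self._lock:
            if key not in self._rows or self._rows[key] != expected:
                return False
            self._rows[key] = new_value
            return True

    def read(self, key):
        with self._lock:
            return self._rows.get(key)


accounts = Partition()
first = accounts.insert_if_not_exists("alice", "active")   # applies
second = accounts.insert_if_not_exists("alice", "locked")  # rejected: row exists
swapped = accounts.update_if_equals("alice", "active", "locked")  # applies
```

The check and the write happen as one step, so two concurrent writers cannot both succeed against the same expected value. The condition is limited to a single partition, which is why no cross-partition coordination is needed.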

 Cassandra explicitly chooses not to implement operations that require
 cross-partition coordination, as they are typically slow and make it
 hard to provide highly available global semantics. For example,
 Cassandra does not support:

 * Cross-partition transactions
 * Distributed joins
 * Foreign keys or referential integrity

 == Operating

 Apache Cassandra is configured through the `cassandra.yaml` file, which
 can be edited by hand or with the aid of configuration management
 tools. Some settings can be changed live through an online interface,
 but others require a restart of the database to take effect.

 Cassandra provides tools for managing a cluster. The `nodetool` command
 interacts with Cassandra's live control interface, allowing runtime
 manipulation of many settings from `cassandra.yaml`. The
 `auditlogviewer` is used to view the audit logs. The `fqltool` is used
 to view, replay and compare full query logs. The `auditlogviewer` and
 `fqltool` are new tools in Apache Cassandra {40_version}.

 In addition, Cassandra supports atomic snapshots out of the box. A
 snapshot presents a point-in-time view of Cassandra's data for easy
 integration with many backup tools. Cassandra also supports incremental
 backups, where data can be backed up as it is written.
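
A point-in-time snapshot is cheap when data files are immutable, because hard links can preserve them without copying. The sketch below illustrates that idea under stated assumptions: the `snapshot` function and the file name `data-1.db` are hypothetical, and the code only mimics the `snapshots/<tag>` directory layout rather than reproducing what `nodetool` does.

```python
import os
import tempfile
from pathlib import Path


def snapshot(table_dir: Path, tag: str) -> Path:
    """Hard-link every data file in table_dir into snapshots/<tag>."""
    target = table_dir / "snapshots" / tag
    target.mkdir(parents=True)
    for entry in table_dir.iterdir():
        if entry.is_file():  # skips the snapshots directory itself
            os.link(entry, target / entry.name)
    return target


with tempfile.TemporaryDirectory() as tmp:
    table_dir = Path(tmp)
    data_file = table_dir / "data-1.db"  # hypothetical data file name
    data_file.write_text("row-v1")

    snap = snapshot(table_dir, "before-upgrade")

    # Removing the live file does not affect the snapshot's hard link.
    data_file.unlink()
    preserved = (snap / "data-1.db").read_text()
```

A backup tool can then copy the contents of the snapshot directory at its own pace, since files in it no longer change.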

 Apache Cassandra {40_version} has added several new features including virtual
 tables, transient replication ({exper}), audit logging, full query logging, and
 support for Java 11 ({exper}).