blob: 605e347830afc712da3cff374dacfb9c7d62e1c3 [file] [log] [blame]
= Overview
:exper: experimental
Apache Cassandra is an open source, distributed, NoSQL database. It
presents a partitioned wide column storage model with eventually
consistent semantics.
Apache Cassandra was initially designed at
https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf[Facebook]
using a staged event-driven architecture
(http://www.sosp.org/2001/papers/welsh.pdf[SEDA]) to implement a
combination of Amazon’s
http://courses.cse.tamu.edu/caverlee/csce438/readings/dynamo-paper.pdf[Dynamo]
distributed storage and replication techniques and Google's
https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf[Bigtable]
data and storage engine model. Dynamo and Bigtable were both developed
to meet emerging requirements for scalable, reliable and highly
available storage systems, but each had areas that could be improved.
Cassandra was designed as a best-in-class combination of both systems to
meet emerging largescale, both in data footprint and query volume,
storage requirements. As applications began to require full global
replication and always available low-latency reads and writes, it became
imperative to design a new kind of database model as the relational
database systems of the time struggled to meet the new requirements of
global scale applications.
Systems like Cassandra are designed for these challenges and seek the
following design objectives:
* Full multi-master database replication
* Global availability at low latency
* Scaling out on commodity hardware
* Linear throughput increase with each additional processor
* Online load balancing and cluster growth
* Partitioned key-oriented queries
* Flexible schema
== Features
Cassandra provides the Cassandra Query Language (xref:cql/ddl.adoc[CQL]), an SQL-like
language, to create and update database schema and access data. CQL
allows users to organize data within a cluster of Cassandra nodes using:
* *Keyspace*: Defines how a dataset is replicated, per datacenter.
Replication is the number of copies saved per cluster.
Keyspaces contain tables.
* *Table*: Defines the typed schema for a collection of partitions.
Tables contain partitions, which contain rows, which contain columns.
Cassandra tables can flexibly add new columns to tables with zero downtime.
* *Partition*: Defines the mandatory part of the primary key all rows in
Cassandra must have to identify the node in a cluster where the row is stored.
All performant queries supply the partition key in the query.
* *Row*: Contains a collection of columns identified by a unique primary
key made up of the partition key and optionally additional clustering
keys.
* *Column*: A single datum with a type which belongs to a row.
CQL supports numerous advanced features over a partitioned dataset such
as:
* Single partition lightweight transactions with atomic compare and set
semantics.
* User-defined types, functions and aggregates
* Collection types including sets, maps, and lists.
* Local secondary indices
* (Experimental) materialized views
Cassandra explicitly chooses not to implement operations that require
cross partition coordination as they are typically slow and hard to
provide highly available global semantics. For example Cassandra does
not support:
* Cross partition transactions
* Distributed joins
* Foreign keys or referential integrity.
== Operating
Apache Cassandra configuration settings are configured in the
`cassandra.yaml` file that can be edited by hand or with the aid of
configuration management tools. Some settings can be manipulated live
using an online interface, but others require a restart of the database
to take effect.
Cassandra provides tools for managing a cluster. The `nodetool` command
interacts with Cassandra's live control interface, allowing runtime
manipulation of many settings from `cassandra.yaml`. The
`auditlogviewer` is used to view the audit logs. The `fqltool` is used
to view, replay and compare full query logs. The `auditlogviewer` and
`fqltool` are new tools in Apache Cassandra {40_version}.
In addition, Cassandra supports out of the box atomic snapshot
functionality, which presents a point in time snapshot of Cassandra's
data for easy integration with many backup tools. Cassandra also
supports incremental backups where data can be backed up as it is
written.
Apache Cassandra {40_version} has added several new features including virtual
tables, transient replication ({exper}), audit logging, full query logging, and
support for Java 11 ({exper}).