| = Overview |
| :exper: experimental |
| |
| Apache Cassandra is an open source, distributed, NoSQL database. It |
| presents a partitioned wide column storage model with eventually |
| consistent semantics. |
| |
| Apache Cassandra was initially designed at |
| https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf[Facebook] |
| using a staged event-driven architecture |
| (http://www.sosp.org/2001/papers/welsh.pdf[SEDA]) to implement a |
| combination of Amazon’s |
| http://courses.cse.tamu.edu/caverlee/csce438/readings/dynamo-paper.pdf[Dynamo] |
| distributed storage and replication techniques and Google's |
| https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf[Bigtable] |
| data and storage engine model. Dynamo and Bigtable were both developed |
| to meet emerging requirements for scalable, reliable and highly |
| available storage systems, but each had areas that could be improved. |
| |
| Cassandra was designed as a best-in-class combination of both systems to |
| meet emerging largescale, both in data footprint and query volume, |
| storage requirements. As applications began to require full global |
| replication and always available low-latency reads and writes, it became |
| imperative to design a new kind of database model as the relational |
| database systems of the time struggled to meet the new requirements of |
| global scale applications. |
| |
| Systems like Cassandra are designed for these challenges and seek the |
| following design objectives: |
| |
| * Full multi-master database replication |
| * Global availability at low latency |
| * Scaling out on commodity hardware |
| * Linear throughput increase with each additional processor |
| * Online load balancing and cluster growth |
| * Partitioned key-oriented queries |
| * Flexible schema |
| |
| == Features |
| |
| Cassandra provides the Cassandra Query Language (xref:cql/ddl.adoc[CQL]), an SQL-like |
| language, to create and update database schema and access data. CQL |
| allows users to organize data within a cluster of Cassandra nodes using: |
| |
| * *Keyspace*: Defines how a dataset is replicated, per datacenter. |
| Replication is the number of copies saved per cluster. |
| Keyspaces contain tables. |
| * *Table*: Defines the typed schema for a collection of partitions. |
| Tables contain partitions, which contain rows, which contain columns. |
| Cassandra tables can flexibly add new columns to tables with zero downtime. |
| * *Partition*: Defines the mandatory part of the primary key all rows in |
| Cassandra must have to identify the node in a cluster where the row is stored. |
| All performant queries supply the partition key in the query. |
| * *Row*: Contains a collection of columns identified by a unique primary |
| key made up of the partition key and optionally additional clustering |
| keys. |
| * *Column*: A single datum with a type which belongs to a row. |
| |
| CQL supports numerous advanced features over a partitioned dataset such |
| as: |
| |
| * Single partition lightweight transactions with atomic compare and set |
| semantics. |
| * User-defined types, functions and aggregates |
| * Collection types including sets, maps, and lists. |
| * Local secondary indices |
| * (Experimental) materialized views |
| |
| Cassandra explicitly chooses not to implement operations that require |
| cross partition coordination as they are typically slow and hard to |
| provide highly available global semantics. For example Cassandra does |
| not support: |
| |
| * Cross partition transactions |
| * Distributed joins |
| * Foreign keys or referential integrity. |
| |
| == Operating |
| |
| Apache Cassandra configuration settings are configured in the |
| `cassandra.yaml` file that can be edited by hand or with the aid of |
| configuration management tools. Some settings can be manipulated live |
| using an online interface, but others require a restart of the database |
| to take effect. |
| |
| Cassandra provides tools for managing a cluster. The `nodetool` command |
| interacts with Cassandra's live control interface, allowing runtime |
| manipulation of many settings from `cassandra.yaml`. The |
| `auditlogviewer` is used to view the audit logs. The `fqltool` is used |
| to view, replay and compare full query logs. The `auditlogviewer` and |
| `fqltool` are new tools in Apache Cassandra {40_version}. |
| |
| In addition, Cassandra supports out of the box atomic snapshot |
| functionality, which presents a point in time snapshot of Cassandra's |
| data for easy integration with many backup tools. Cassandra also |
| supports incremental backups where data can be backed up as it is |
| written. |
| |
| Apache Cassandra {40_version} has added several new features including virtual |
| tables, transient replication ({exper}), audit logging, full query logging, and |
| support for Java 11 ({exper}). |