| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| .. |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| .. |
| .. Unless required by applicable law or agreed to in writing, software |
| .. distributed under the License is distributed on an "AS IS" BASIS, |
| .. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| .. See the License for the specific language governing permissions and |
| .. limitations under the License. |
| |
| .. highlight:: cql |
| |
| .. _data-definition: |
| |
| Data Definition |
| --------------- |
| |
| CQL stores data in *tables*, whose schema defines the layout of said data in the table, and those tables are grouped in |
| *keyspaces*. A keyspace defines a number of options that applies to all the tables it contains, most prominently of |
| which is the :ref:`replication strategy <replication-strategy>` used by the keyspace. It is generally encouraged to use |
| one keyspace by *application*, and thus many cluster may define only one keyspace. |
| |
| This section describes the statements used to create, modify, and remove those keyspace and tables. |
| |
| Common definitions |
| ^^^^^^^^^^^^^^^^^^ |
| |
| The names of the keyspaces and tables are defined by the following grammar: |
| |
| .. productionlist:: |
| keyspace_name: `name` |
| table_name: [ `keyspace_name` '.' ] `name` |
| name: `unquoted_name` | `quoted_name` |
| unquoted_name: re('[a-zA-Z_0-9]{1, 48}') |
| quoted_name: '"' `unquoted_name` '"' |
| |
| Both keyspace and table name should be comprised of only alphanumeric characters, cannot be empty and are limited in |
| size to 48 characters (that limit exists mostly to avoid filenames (which may include the keyspace and table name) to go |
| over the limits of certain file systems). By default, keyspace and table names are case insensitive (``myTable`` is |
| equivalent to ``mytable``) but case sensitivity can be forced by using double-quotes (``"myTable"`` is different from |
| ``mytable``). |
| |
| Further, a table is always part of a keyspace and a table name can be provided fully-qualified by the keyspace it is |
| part of. If is is not fully-qualified, the table is assumed to be in the *current* keyspace (see :ref:`USE statement |
| <use-statement>`). |
| |
| Further, the valid names for columns is simply defined as: |
| |
| .. productionlist:: |
| column_name: `identifier` |
| |
| We also define the notion of statement options for use in the following section: |
| |
| .. productionlist:: |
| options: `option` ( AND `option` )* |
| option: `identifier` '=' ( `identifier` | `constant` | `map_literal` ) |
| |
| .. _create-keyspace-statement: |
| |
| CREATE KEYSPACE |
| ^^^^^^^^^^^^^^^ |
| |
| A keyspace is created using a ``CREATE KEYSPACE`` statement: |
| |
| .. productionlist:: |
| create_keyspace_statement: CREATE KEYSPACE [ IF NOT EXISTS ] `keyspace_name` WITH `options` |
| |
| For instance:: |
| |
| CREATE KEYSPACE excelsior |
| WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3}; |
| |
| CREATE KEYSPACE excalibur |
| WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 3} |
| AND durable_writes = false; |
| |
| Attempting to create a keyspace that already exists will return an error unless the ``IF NOT EXISTS`` option is used. If |
| it is used, the statement will be a no-op if the keyspace already exists. |
| |
| The supported ``options`` are: |
| |
| =================== ========== =========== ========= =================================================================== |
| name kind mandatory default description |
| =================== ========== =========== ========= =================================================================== |
| ``replication`` *map* yes The replication strategy and options to use for the keyspace (see |
| details below). |
| ``durable_writes`` *simple* no true Whether to use the commit log for updates on this keyspace |
| (disable this option at your own risk!). |
| =================== ========== =========== ========= =================================================================== |
| |
| The ``replication`` property is mandatory and must at least contains the ``'class'`` sub-option which defines the |
| :ref:`replication strategy <replication-strategy>` class to use. The rest of the sub-options depends on what replication |
| strategy is used. By default, Cassandra support the following ``'class'``: |
| |
| ``SimpleStrategy`` |
| """""""""""""""""" |
| |
| A simple strategy that defines a replication factor for data to be spread |
| across the entire cluster. This is generally not a wise choice for production |
| because it does not respect datacenter layouts and can lead to wildly varying |
| query latency. For a production ready strategy, see |
| ``NetworkTopologyStrategy``. ``SimpleStrategy`` supports a single mandatory argument: |
| |
| ========================= ====== ======= ============================================= |
| sub-option type since description |
| ========================= ====== ======= ============================================= |
| ``'replication_factor'`` int all The number of replicas to store per range |
| ========================= ====== ======= ============================================= |
| |
| ``NetworkTopologyStrategy`` |
| """"""""""""""""""""""""""" |
| |
| A production ready replication strategy that allows to set the replication |
| factor independently for each data-center. The rest of the sub-options are |
| key-value pairs where a key is a data-center name and its value is the |
| associated replication factor. Options: |
| |
| ===================================== ====== ====== ============================================= |
| sub-option type since description |
| ===================================== ====== ====== ============================================= |
| ``'<datacenter>'`` int all The number of replicas to store per range in |
| the provided datacenter. |
| ``'replication_factor'`` int 4.0 The number of replicas to use as a default |
| per datacenter if not specifically provided. |
| Note that this always defers to existing |
| definitions or explicit datacenter settings. |
| For example, to have three replicas per |
| datacenter, supply this with a value of 3. |
| ===================================== ====== ====== ============================================= |
| |
| Note that when ``ALTER`` ing keyspaces and supplying ``replication_factor``, |
| auto-expansion will only *add* new datacenters for safety, it will not alter |
| existing datacenters or remove any even if they are no longer in the cluster. |
| If you want to remove datacenters while still supplying ``replication_factor``, |
| explicitly zero out the datacenter you want to have zero replicas. |
| |
| An example of auto-expanding datacenters with two datacenters: ``DC1`` and ``DC2``:: |
| |
| CREATE KEYSPACE excalibur |
| WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor' : 3} |
| |
| DESCRIBE KEYSPACE excalibur |
| CREATE KEYSPACE excalibur WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true; |
| |
| |
| An example of auto-expanding and overriding a datacenter:: |
| |
| CREATE KEYSPACE excalibur |
| WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor' : 3, 'DC2': 2} |
| |
| DESCRIBE KEYSPACE excalibur |
| CREATE KEYSPACE excalibur WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '2'} AND durable_writes = true; |
| |
| An example that excludes a datacenter while using ``replication_factor``:: |
| |
| CREATE KEYSPACE excalibur |
| WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor' : 3, 'DC2': 0} ; |
| |
| DESCRIBE KEYSPACE excalibur |
| CREATE KEYSPACE excalibur WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true; |
| |
| If :ref:`transient replication <transient-replication>` has been enabled, transient replicas can be configured for both |
| SimpleStrategy and NetworkTopologyStrategy by defining replication factors in the format ``'<total_replicas>/<transient_replicas>'`` |
| |
| For instance, this keyspace will have 3 replicas in DC1, 1 of which is transient, and 5 replicas in DC2, 2 of which are transient:: |
| |
| CREATE KEYSPACE some_keysopace |
| WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : '3/1'', 'DC2' : '5/2'}; |
| |
| .. _use-statement: |
| |
| USE |
| ^^^ |
| |
| The ``USE`` statement allows to change the *current* keyspace (for the *connection* on which it is executed). A number |
| of objects in CQL are bound to a keyspace (tables, user-defined types, functions, ...) and the current keyspace is the |
| default keyspace used when those objects are referred without a fully-qualified name (that is, without being prefixed a |
| keyspace name). A ``USE`` statement simply takes the keyspace to use as current as argument: |
| |
| .. productionlist:: |
| use_statement: USE `keyspace_name` |
| |
| .. _alter-keyspace-statement: |
| |
| ALTER KEYSPACE |
| ^^^^^^^^^^^^^^ |
| |
| An ``ALTER KEYSPACE`` statement allows to modify the options of a keyspace: |
| |
| .. productionlist:: |
| alter_keyspace_statement: ALTER KEYSPACE `keyspace_name` WITH `options` |
| |
| For instance:: |
| |
| ALTER KEYSPACE Excelsior |
| WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 4}; |
| |
| The supported options are the same than for :ref:`creating a keyspace <create-keyspace-statement>`. |
| |
| .. _drop-keyspace-statement: |
| |
| DROP KEYSPACE |
| ^^^^^^^^^^^^^ |
| |
| Dropping a keyspace can be done using the ``DROP KEYSPACE`` statement: |
| |
| .. productionlist:: |
| drop_keyspace_statement: DROP KEYSPACE [ IF EXISTS ] `keyspace_name` |
| |
| For instance:: |
| |
| DROP KEYSPACE Excelsior; |
| |
| Dropping a keyspace results in the immediate, irreversible removal of that keyspace, including all the tables, UTD and |
| functions in it, and all the data contained in those tables. |
| |
| If the keyspace does not exists, the statement will return an error, unless ``IF EXISTS`` is used in which case the |
| operation is a no-op. |
| |
| .. _create-table-statement: |
| |
| CREATE TABLE |
| ^^^^^^^^^^^^ |
| |
| Creating a new table uses the ``CREATE TABLE`` statement: |
| |
| .. productionlist:: |
| create_table_statement: CREATE TABLE [ IF NOT EXISTS ] `table_name` |
| : '(' |
| : `column_definition` |
| : ( ',' `column_definition` )* |
| : [ ',' PRIMARY KEY '(' `primary_key` ')' ] |
| : ')' [ WITH `table_options` ] |
| column_definition: `column_name` `cql_type` [ STATIC ] [ PRIMARY KEY] |
| primary_key: `partition_key` [ ',' `clustering_columns` ] |
| partition_key: `column_name` |
| : | '(' `column_name` ( ',' `column_name` )* ')' |
| clustering_columns: `column_name` ( ',' `column_name` )* |
| table_options: COMPACT STORAGE [ AND `table_options` ] |
| : | CLUSTERING ORDER BY '(' `clustering_order` ')' [ AND `table_options` ] |
| : | `options` |
| clustering_order: `column_name` (ASC | DESC) ( ',' `column_name` (ASC | DESC) )* |
| |
| For instance:: |
| |
| CREATE TABLE monkeySpecies ( |
| species text PRIMARY KEY, |
| common_name text, |
| population varint, |
| average_size int |
| ) WITH comment='Important biological records'; |
| |
| CREATE TABLE timeline ( |
| userid uuid, |
| posted_month int, |
| posted_time uuid, |
| body text, |
| posted_by text, |
| PRIMARY KEY (userid, posted_month, posted_time) |
| ) WITH compaction = { 'class' : 'LeveledCompactionStrategy' }; |
| |
| CREATE TABLE loads ( |
| machine inet, |
| cpu int, |
| mtime timeuuid, |
| load float, |
| PRIMARY KEY ((machine, cpu), mtime) |
| ) WITH CLUSTERING ORDER BY (mtime DESC); |
| |
| A CQL table has a name and is composed of a set of *rows*. Creating a table amounts to defining which :ref:`columns |
| <column-definition>` the rows will be composed, which of those columns compose the :ref:`primary key <primary-key>`, as |
| well as optional :ref:`options <create-table-options>` for the table. |
| |
| Attempting to create an already existing table will return an error unless the ``IF NOT EXISTS`` directive is used. If |
| it is used, the statement will be a no-op if the table already exists. |
| |
| |
| .. _column-definition: |
| |
| Column definitions |
| ~~~~~~~~~~~~~~~~~~ |
| |
| Every rows in a CQL table has a set of predefined columns defined at the time of the table creation (or added later |
| using an :ref:`alter statement<alter-table-statement>`). |
| |
| A :token:`column_definition` is primarily comprised of the name of the column defined and it's :ref:`type <data-types>`, |
| which restrict which values are accepted for that column. Additionally, a column definition can have the following |
| modifiers: |
| |
| ``STATIC`` |
| it declares the column as being a :ref:`static column <static-columns>`. |
| |
| ``PRIMARY KEY`` |
| it declares the column as being the sole component of the :ref:`primary key <primary-key>` of the table. |
| |
| .. _static-columns: |
| |
| Static columns |
| `````````````` |
| Some columns can be declared as ``STATIC`` in a table definition. A column that is static will be “shared” by all the |
| rows belonging to the same partition (having the same :ref:`partition key <partition-key>`). For instance:: |
| |
| CREATE TABLE t ( |
| pk int, |
| t int, |
| v text, |
| s text static, |
| PRIMARY KEY (pk, t) |
| ); |
| |
| INSERT INTO t (pk, t, v, s) VALUES (0, 0, 'val0', 'static0'); |
| INSERT INTO t (pk, t, v, s) VALUES (0, 1, 'val1', 'static1'); |
| |
| SELECT * FROM t; |
| pk | t | v | s |
| ----+---+--------+----------- |
| 0 | 0 | 'val0' | 'static1' |
| 0 | 1 | 'val1' | 'static1' |
| |
| As can be seen, the ``s`` value is the same (``static1``) for both of the row in the partition (the partition key in |
| that example being ``pk``, both rows are in that same partition): the 2nd insertion has overridden the value for ``s``. |
| |
| The use of static columns as the following restrictions: |
| |
| - tables with the ``COMPACT STORAGE`` option (see below) cannot use them. |
| - a table without clustering columns cannot have static columns (in a table without clustering columns, every partition |
| has only one row, and so every column is inherently static). |
| - only non ``PRIMARY KEY`` columns can be static. |
| |
| .. _primary-key: |
| |
| The Primary key |
| ~~~~~~~~~~~~~~~ |
| |
| Within a table, a row is uniquely identified by its ``PRIMARY KEY``, and hence all table **must** define a PRIMARY KEY |
| (and only one). A ``PRIMARY KEY`` definition is composed of one or more of the columns defined in the table. |
| Syntactically, the primary key is defined the keywords ``PRIMARY KEY`` followed by comma-separated list of the column |
| names composing it within parenthesis, but if the primary key has only one column, one can alternatively follow that |
| column definition by the ``PRIMARY KEY`` keywords. The order of the columns in the primary key definition matter. |
| |
| A CQL primary key is composed of 2 parts: |
| |
| - the :ref:`partition key <partition-key>` part. It is the first component of the primary key definition. It can be a |
| single column or, using additional parenthesis, can be multiple columns. A table always have at least a partition key, |
| the smallest possible table definition is:: |
| |
| CREATE TABLE t (k text PRIMARY KEY); |
| |
| - the :ref:`clustering columns <clustering-columns>`. Those are the columns after the first component of the primary key |
| definition, and the order of those columns define the *clustering order*. |
| |
| Some example of primary key definition are: |
| |
| - ``PRIMARY KEY (a)``: ``a`` is the partition key and there is no clustering columns. |
| - ``PRIMARY KEY (a, b, c)`` : ``a`` is the partition key and ``b`` and ``c`` are the clustering columns. |
| - ``PRIMARY KEY ((a, b), c)`` : ``a`` and ``b`` compose the partition key (this is often called a *composite* partition |
| key) and ``c`` is the clustering column. |
| |
| |
| .. _partition-key: |
| |
| The partition key |
| ````````````````` |
| |
| Within a table, CQL defines the notion of a *partition*. A partition is simply the set of rows that share the same value |
| for their partition key. Note that if the partition key is composed of multiple columns, then rows belong to the same |
| partition only they have the same values for all those partition key column. So for instance, given the following table |
| definition and content:: |
| |
| CREATE TABLE t ( |
| a int, |
| b int, |
| c int, |
| d int, |
| PRIMARY KEY ((a, b), c, d) |
| ); |
| |
| SELECT * FROM t; |
| a | b | c | d |
| ---+---+---+--- |
| 0 | 0 | 0 | 0 // row 1 |
| 0 | 0 | 1 | 1 // row 2 |
| 0 | 1 | 2 | 2 // row 3 |
| 0 | 1 | 3 | 3 // row 4 |
| 1 | 1 | 4 | 4 // row 5 |
| |
| ``row 1`` and ``row 2`` are in the same partition, ``row 3`` and ``row 4`` are also in the same partition (but a |
| different one) and ``row 5`` is in yet another partition. |
| |
| Note that a table always has a partition key, and that if the table has no :ref:`clustering columns |
| <clustering-columns>`, then every partition of that table is only comprised of a single row (since the primary key |
| uniquely identifies rows and the primary key is equal to the partition key if there is no clustering columns). |
| |
| The most important property of partition is that all the rows belonging to the same partition are guarantee to be stored |
| on the same set of replica nodes. In other words, the partition key of a table defines which of the rows will be |
| localized together in the Cluster, and it is thus important to choose your partition key wisely so that rows that needs |
| to be fetch together are in the same partition (so that querying those rows together require contacting a minimum of |
| nodes). |
| |
| Please note however that there is a flip-side to this guarantee: as all rows sharing a partition key are guaranteed to |
| be stored on the same set of replica node, a partition key that groups too much data can create a hotspot. |
| |
| Another useful property of a partition is that when writing data, all the updates belonging to a single partition are |
| done *atomically* and in *isolation*, which is not the case across partitions. |
| |
| The proper choice of the partition key and clustering columns for a table is probably one of the most important aspect |
| of data modeling in Cassandra, and it largely impact which queries can be performed, and how efficiently they are. |
| |
| |
| .. _clustering-columns: |
| |
| The clustering columns |
| `````````````````````` |
| |
| The clustering columns of a table defines the clustering order for the partition of that table. For a given |
| :ref:`partition <partition-key>`, all the rows are physically ordered inside Cassandra by that clustering order. For |
| instance, given:: |
| |
| CREATE TABLE t ( |
| a int, |
| b int, |
| c int, |
| PRIMARY KEY (a, b, c) |
| ); |
| |
| SELECT * FROM t; |
| a | b | c |
| ---+---+--- |
| 0 | 0 | 4 // row 1 |
| 0 | 1 | 9 // row 2 |
| 0 | 2 | 2 // row 3 |
| 0 | 3 | 3 // row 4 |
| |
| then the rows (which all belong to the same partition) are all stored internally in the order of the values of their |
| ``b`` column (the order they are displayed above). So where the partition key of the table allows to group rows on the |
| same replica set, the clustering columns controls how those rows are stored on the replica. That sorting allows the |
| retrieval of a range of rows within a partition (for instance, in the example above, ``SELECT * FROM t WHERE a = 0 AND b |
| > 1 and b <= 3``) to be very efficient. |
| |
| |
| .. _create-table-options: |
| |
| Table options |
| ~~~~~~~~~~~~~ |
| |
| A CQL table has a number of options that can be set at creation (and, for most of them, :ref:`altered |
| <alter-table-statement>` later). These options are specified after the ``WITH`` keyword. |
| |
| Amongst those options, two important ones cannot be changed after creation and influence which queries can be done |
| against the table: the ``COMPACT STORAGE`` option and the ``CLUSTERING ORDER`` option. Those, as well as the other |
| options of a table are described in the following sections. |
| |
| .. _compact-tables: |
| |
| Compact tables |
| `````````````` |
| |
| .. warning:: Since Cassandra 3.0, compact tables have the exact same layout internally than non compact ones (for the |
| same schema obviously), and declaring a table compact **only** creates artificial limitations on the table definition |
| and usage. It only exists for historical reason and is preserved for backward compatibility And as ``COMPACT |
| STORAGE`` cannot, as of Cassandra |version|, be removed, it is strongly discouraged to create new table with the |
| ``COMPACT STORAGE`` option. |
| |
| A *compact* table is one defined with the ``COMPACT STORAGE`` option. This option is only maintained for backward |
| compatibility for definitions created before CQL version 3 and shouldn't be used for new tables. Declaring a |
| table with this option creates limitations for the table which are largely arbitrary (and exists for historical |
| reasons). Amongst those limitation: |
| |
| - a compact table cannot use collections nor static columns. |
| - if a compact table has at least one clustering column, then it must have *exactly* one column outside of the primary |
| key ones. This imply you cannot add or remove columns after creation in particular. |
| - a compact table is limited in the indexes it can create, and no materialized view can be created on it. |
| |
| .. _clustering-order: |
| |
| Reversing the clustering order |
| `````````````````````````````` |
| |
| The clustering order of a table is defined by the :ref:`clustering columns <clustering-columns>` of that table. By |
| default, that ordering is based on natural order of those clustering order, but the ``CLUSTERING ORDER`` allows to |
| change that clustering order to use the *reverse* natural order for some (potentially all) of the columns. |
| |
| The ``CLUSTERING ORDER`` option takes the comma-separated list of the clustering column, each with a ``ASC`` (for |
| *ascendant*, e.g. the natural order) or ``DESC`` (for *descendant*, e.g. the reverse natural order). Note in particular |
| that the default (if the ``CLUSTERING ORDER`` option is not used) is strictly equivalent to using the option with all |
| clustering columns using the ``ASC`` modifier. |
| |
| Note that this option is basically a hint for the storage engine to change the order in which it stores the row but it |
| has 3 visible consequences: |
| |
| # it limits which ``ORDER BY`` clause are allowed for :ref:`selects <select-statement>` on that table. You can only |
| order results by the clustering order or the reverse clustering order. Meaning that if a table has 2 clustering column |
| ``a`` and ``b`` and you defined ``WITH CLUSTERING ORDER (a DESC, b ASC)``, then in queries you will be allowed to use |
| ``ORDER BY (a DESC, b ASC)`` and (reverse clustering order) ``ORDER BY (a ASC, b DESC)`` but **not** ``ORDER BY (a |
| ASC, b ASC)`` (nor ``ORDER BY (a DESC, b DESC)``). |
| # it also change the default order of results when queried (if no ``ORDER BY`` is provided). Results are always returned |
| in clustering order (within a partition). |
| # it has a small performance impact on some queries as queries in reverse clustering order are slower than the one in |
| forward clustering order. In practice, this means that if you plan on querying mostly in the reverse natural order of |
| your columns (which is common with time series for instance where you often want data from the newest to the oldest), |
| it is an optimization to declare a descending clustering order. |
| |
| .. _create-table-general-options: |
| |
| Other table options |
| ``````````````````` |
| |
| .. todo:: review (misses cdc if nothing else) and link to proper categories when appropriate (compaction for instance) |
| |
| A table supports the following options: |
| |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | option | kind | default | description | |
| +================================+==========+=============+===========================================================+ |
| | ``comment`` | *simple* | none | A free-form, human-readable comment. | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``speculative_retry`` | *simple* | 99PERCENTILE| :ref:`Speculative retry options | |
| | | | | <speculative-retry-options>`. | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``additional_write_policy`` | *simple* | 99PERCENTILE| :ref:`Speculative retry options | |
| | | | | <speculative-retry-options>`. | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``gc_grace_seconds`` | *simple* | 864000 | Time to wait before garbage collecting tombstones | |
| | | | | (deletion markers). | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``bloom_filter_fp_chance`` | *simple* | 0.00075 | The target probability of false positive of the sstable | |
| | | | | bloom filters. Said bloom filters will be sized to provide| |
| | | | | the provided probability (thus lowering this value impact | |
| | | | | the size of bloom filters in-memory and on-disk) | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``default_time_to_live`` | *simple* | 0 | The default expiration time (“TTL”) in seconds for a | |
| | | | | table. | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``compaction`` | *map* | *see below* | :ref:`Compaction options <cql-compaction-options>`. | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``compression`` | *map* | *see below* | :ref:`Compression options <cql-compression-options>`. | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``caching`` | *map* | *see below* | :ref:`Caching options <cql-caching-options>`. | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``memtable_flush_period_in_ms``| *simple* | 0 | Time (in ms) before Cassandra flushes memtables to disk. | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| | ``read_repair`` | *simple* | BLOCKING | Sets read repair behavior (see below) | |
| +--------------------------------+----------+-------------+-----------------------------------------------------------+ |
| |
| .. _speculative-retry-options: |
| |
| Speculative retry options |
| ######################### |
| |
| By default, Cassandra read coordinators only query as many replicas as necessary to satisfy |
| consistency levels: one for consistency level ``ONE``, a quorum for ``QUORUM``, and so on. |
| ``speculative_retry`` determines when coordinators may query additional replicas, which is useful |
| when replicas are slow or unresponsive. ``additional_write_policy`` specifies the threshold at which |
| a cheap quorum write will be upgraded to include transient replicas. The following are legal values (case-insensitive): |
| |
| ============================ ======================== ============================================================================= |
| Format Example Description |
| ============================ ======================== ============================================================================= |
| ``XPERCENTILE`` 90.5PERCENTILE Coordinators record average per-table response times for all replicas. |
| If a replica takes longer than ``X`` percent of this table's average |
| response time, the coordinator queries an additional replica. |
| ``X`` must be between 0 and 100. |
| ``XP`` 90.5P Synonym for ``XPERCENTILE`` |
| ``Yms`` 25ms If a replica takes more than ``Y`` milliseconds to respond, |
| the coordinator queries an additional replica. |
| ``MIN(XPERCENTILE,YMS)`` MIN(99PERCENTILE,35MS) A hybrid policy that will use either the specified percentile or fixed |
| milliseconds depending on which value is lower at the time of calculation. |
| Parameters are ``XPERCENTILE``, ``XP``, or ``Yms``. |
| This is helpful to help protect against a single slow instance; in the |
| happy case the 99th percentile is normally lower than the specified |
| fixed value however, a slow host may skew the percentile very high |
| meaning the slower the cluster gets, the higher the value of the percentile, |
| and the higher the calculated time used to determine if we should |
| speculate or not. This allows us to set an upper limit that we want to |
| speculate at, but avoid skewing the tail latencies by speculating at the |
| lower value when the percentile is less than the specified fixed upper bound. |
| ``MAX(XPERCENTILE,YMS)`` MAX(90.5P,25ms) A hybrid policy that will use either the specified percentile or fixed |
| milliseconds depending on which value is higher at the time of calculation. |
| ``ALWAYS`` Coordinators always query all replicas. |
| ``NEVER`` Coordinators never query additional replicas. |
| ============================ =================== ============================================================================= |
| |
| This setting does not affect reads with consistency level ``ALL`` because they already query all replicas. |
| |
| Note that frequently reading from additional replicas can hurt cluster performance. |
| When in doubt, keep the default ``99PERCENTILE``. |
| |
| .. _cql-compaction-options: |
| |
| Compaction options |
| ################## |
| |
| The ``compaction`` options must at least define the ``'class'`` sub-option, that defines the compaction strategy class |
| to use. The default supported class are ``'SizeTieredCompactionStrategy'`` (:ref:`STCS <STCS>`), |
| ``'LeveledCompactionStrategy'`` (:ref:`LCS <LCS>`) and ``'TimeWindowCompactionStrategy'`` (:ref:`TWCS <TWCS>`) (the |
| ``'DateTieredCompactionStrategy'`` is also supported but is deprecated and ``'TimeWindowCompactionStrategy'`` should be |
| preferred instead). Custom strategy can be provided by specifying the full class name as a :ref:`string constant |
| <constants>`. |
| |
| All default strategies support a number of :ref:`common options <compaction-options>`, as well as options specific to |
| the strategy chosen (see the section corresponding to your strategy for details: :ref:`STCS <stcs-options>`, :ref:`LCS |
| <lcs-options>` and :ref:`TWCS <TWCS>`). |
| |
| .. _cql-compression-options: |
| |
| Compression options |
| ################### |
| |
| The ``compression`` options define if and how the sstables of the table are compressed. The following sub-options are |
| available: |
| |
| ========================= =============== ============================================================================= |
| Option Default Description |
| ========================= =============== ============================================================================= |
| ``class`` LZ4Compressor The compression algorithm to use. Default compressor are: LZ4Compressor, |
| SnappyCompressor and DeflateCompressor. Use ``'enabled' : false`` to disable |
| compression. Custom compressor can be provided by specifying the full class |
| name as a “string constant”:#constants. |
| ``enabled`` true Enable/disable sstable compression. |
| ``chunk_length_in_kb`` 64 On disk SSTables are compressed by block (to allow random reads). This |
| defines the size (in KB) of said block. Bigger values may improve the |
| compression rate, but increases the minimum size of data to be read from disk |
| for a read |
| ``crc_check_chance`` 1.0 When compression is enabled, each compressed block includes a checksum of |
| that block for the purpose of detecting disk bitrot and avoiding the |
| propagation of corruption to other replica. This option defines the |
| probability with which those checksums are checked during read. By default |
| they are always checked. Set to 0 to disable checksum checking and to 0.5 for |
| instance to check them every other read | |
| ========================= =============== ============================================================================= |
| |
| |
| For instance, to create a table with LZ4Compressor and a chunk_lenth_in_kb of 4KB:: |
| |
| CREATE TABLE simple ( |
| id int, |
| key text, |
| value text, |
| PRIMARY KEY (key, value) |
| ) with compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4}; |
| |
| |
| .. _cql-caching-options: |
| |
| Caching options |
| ############### |
| |
| The ``caching`` options allows to configure both the *key cache* and the *row cache* for the table. The following |
| sub-options are available: |
| |
| ======================== ========= ==================================================================================== |
| Option Default Description |
| ======================== ========= ==================================================================================== |
| ``keys`` ALL Whether to cache keys (“key cache”) for this table. Valid values are: ``ALL`` and |
| ``NONE``. |
| ``rows_per_partition`` NONE The amount of rows to cache per partition (“row cache”). If an integer ``n`` is |
| specified, the first ``n`` queried rows of a partition will be cached. Other |
| possible options are ``ALL``, to cache all rows of a queried partition, or ``NONE`` |
| to disable row caching. |
| ======================== ========= ==================================================================================== |
| |
| |
| For instance, to create a table with both a key cache and 10 rows per partition:: |
| |
| CREATE TABLE simple ( |
| id int, |
| key text, |
| value text, |
| PRIMARY KEY (key, value) |
| ) WITH caching = {'keys': 'ALL', 'rows_per_partition': 10}; |
| |
| |
| Read Repair options |
| ################### |
| |
| The ``read_repair`` options configures the read repair behavior to allow tuning for various performance and |
| consistency behaviors. Two consistency properties are affected by read repair behavior. |
| |
| - Monotonic Quorum Reads: Provided by ``BLOCKING``. Monotonic quorum reads prevents reads from appearing to go back |
| in time in some circumstances. When monotonic quorum reads are not provided and a write fails to reach a quorum of |
| replicas, it may be visible in one read, and then disappear in a subsequent read. |
| - Write Atomicity: Provided by ``NONE``. Write atomicity prevents reads from returning partially applied writes. |
| Cassandra attempts to provide partition level write atomicity, but since only the data covered by a SELECT statement |
| is repaired by a read repair, read repair can break write atomicity when data is read at a more granular level than it |
| is written. For example read repair can break write atomicity if you write multiple rows to a clustered partition in a |
| batch, but then select a single row by specifying the clustering column in a SELECT statement. |
| |
| The available read repair settings are: |
| |
| Blocking |
| ```````` |
| The default setting. When ``read_repair`` is set to ``BLOCKING``, and a read repair is triggered, the read will block |
| on writes sent to other replicas until the CL is reached by the writes. Provides monotonic quorum reads, but not partition |
| level write atomicity |
| |
| None |
| ```` |
| |
| When ``read_repair`` is set to ``NONE``, the coordinator will reconcile any differences between replicas, but will not |
| attempt to repair them. Provides partition level write atomicity, but not monotonic quorum reads. |
| |
| |
| Other considerations: |
| ##################### |
| |
| - Adding new columns (see ``ALTER TABLE`` below) is a constant time operation. There is thus no need to try to |
| anticipate future usage when creating a table. |
| |
| .. _alter-table-statement: |
| |
| ALTER TABLE |
| ^^^^^^^^^^^ |
| |
| Altering an existing table uses the ``ALTER TABLE`` statement: |
| |
| .. productionlist:: |
| alter_table_statement: ALTER TABLE `table_name` `alter_table_instruction` |
| alter_table_instruction: ADD `column_name` `cql_type` ( ',' `column_name` `cql_type` )* |
| : | DROP `column_name` ( `column_name` )* |
| : | WITH `options` |
| |
| For instance:: |
| |
| ALTER TABLE addamsFamily ADD gravesite varchar; |
| |
| ALTER TABLE addamsFamily |
| WITH comment = 'A most excellent and useful table'; |
| |
| The ``ALTER TABLE`` statement can: |
| |
| - Add new column(s) to the table (through the ``ADD`` instruction). Note that the primary key of a table cannot be |
| changed and thus newly added column will, by extension, never be part of the primary key. Also note that :ref:`compact |
| tables <compact-tables>` have restrictions regarding column addition. Note that this is constant (in the amount of |
| data the cluster contains) time operation. |
| - Remove column(s) from the table. This drops both the column and all its content, but note that while the column |
| becomes immediately unavailable, its content is only removed lazily during compaction. Please also see the warnings |
| below. Due to lazy removal, the altering itself is a constant (in the amount of data removed or contained in the |
| cluster) time operation. |
| - Change some of the table options (through the ``WITH`` instruction). The :ref:`supported options |
| <create-table-options>` are the same that when creating a table (outside of ``COMPACT STORAGE`` and ``CLUSTERING |
| ORDER`` that cannot be changed after creation). Note that setting any ``compaction`` sub-options has the effect of |
| erasing all previous ``compaction`` options, so you need to re-specify all the sub-options if you want to keep them. |
| The same note applies to the set of ``compression`` sub-options. |
| |
| .. warning:: Dropping a column assumes that the timestamps used for the value of this column are "real" timestamp in |
| microseconds. Using "real" timestamps in microseconds is the default is and is **strongly** recommended but as |
| Cassandra allows the client to provide any timestamp on any table it is theoretically possible to use another |
| convention. Please be aware that if you do so, dropping a column will not work correctly. |
| |
| .. warning:: Once a column is dropped, it is allowed to re-add a column with the same name than the dropped one |
| **unless** the type of the dropped column was a (non-frozen) column (due to an internal technical limitation). |
| |
| |
| .. _drop-table-statement: |
| |
| DROP TABLE |
| ^^^^^^^^^^ |
| |
| Dropping a table uses the ``DROP TABLE`` statement: |
| |
| .. productionlist:: |
| drop_table_statement: DROP TABLE [ IF EXISTS ] `table_name` |
| |
| Dropping a table results in the immediate, irreversible removal of the table, including all data it contains. |
| |
| If the table does not exist, the statement will return an error, unless ``IF EXISTS`` is used in which case the |
| operation is a no-op. |
| |
| .. _truncate-statement: |
| |
| TRUNCATE |
| ^^^^^^^^ |
| |
| A table can be truncated using the ``TRUNCATE`` statement: |
| |
| .. productionlist:: |
| truncate_statement: TRUNCATE [ TABLE ] `table_name` |
| |
| Note that ``TRUNCATE TABLE foo`` is allowed for consistency with other DDL statements but tables are the only object |
| that can be truncated currently and so the ``TABLE`` keyword can be omitted. |
| |
| Truncating a table permanently removes all existing data from the table, but without removing the table itself. |