blob: b0d5c191daf0b381f9115312cff782e6b29c7dbb [file] [log] [blame]
= Change Data Capture
== Overview
Change data capture (CDC) provides a mechanism to flag specific tables
for archival as well as rejecting writes to those tables once a
configurable size-on-disk for the CDC log is reached. An operator can
enable CDC on a table by setting the table property `cdc=true` (either
when xref:cql/ddl.adoc#create-table[`creating the table`] or
xref:cql/ddl.adoc#alter-table[`altering it`]). Upon CommitLogSegment creation,
a hard-link to the segment is created in the directory specified in
`cassandra.yaml`. On segment fsync to disk, if CDC data is present
anywhere in the segment a <segment_name>_cdc.idx file is also created
with the integer offset of how much data in the original segment is
persisted to disk. Upon final segment flush, a second line with the
human-readable word "COMPLETED" will be added to the _cdc.idx file
indicating that Cassandra has completed all processing on the file.
We we use an index file rather than just encouraging clients to parse
the log realtime off a memory mapped handle as data can be reflected in
a kernel buffer that is not yet persisted to disk. Parsing only up to
the listed offset in the _cdc.idx file will ensure that you only parse
CDC data for data that is durable.
A threshold of total disk space allowed is specified in the yaml at
which time newly allocated CommitLogSegments will not allow CDC data
until a consumer parses and removes files from the specified cdc_raw
directory.
== Configuration
=== Enabling or disabling CDC on a table
CDC is enable or disable through the [.title-ref]#cdc# table property,
for instance:
[source,cql]
----
CREATE TABLE foo (a int, b text, PRIMARY KEY(a)) WITH cdc=true;
ALTER TABLE foo WITH cdc=true;
ALTER TABLE foo WITH cdc=false;
----
=== cassandra.yaml parameters
The following cassandra.yaml options are available for CDC:
`cdc_enabled` (default: false)::
Enable or disable CDC operations node-wide.
`cdc_raw_directory` (default: `$CASSANDRA_HOME/data/cdc_raw`)::
Destination for CommitLogSegments to be moved after all corresponding
memtables are flushed.
`cdc_free_space_in_mb`: (default: min of 4096 and 1/8th volume space)::
Calculated as sum of all active CommitLogSegments that permit CDC +
all flushed CDC segments in `cdc_raw_directory`.
`cdc_free_space_check_interval_ms` (default: 250)::
When at capacity, we limit the frequency with which we re-calculate
the space taken up by `cdc_raw_directory` to prevent burning CPU
cycles unnecessarily. Default is to check 4 times per second.
== Reading CommitLogSegments
Use a
https://github.com/apache/cassandra/blob/e31e216234c6b57a531cae607e0355666007deb2/src/java/org/apache/cassandra/db/commitlog/CommitLogReader.java[CommitLogReader.java].
Usage is
https://github.com/apache/cassandra/blob/e31e216234c6b57a531cae607e0355666007deb2/src/java/org/apache/cassandra/db/commitlog/CommitLogReplayer.java#L132-L140[fairly
straightforward] with a
https://github.com/apache/cassandra/blob/e31e216234c6b57a531cae607e0355666007deb2/src/java/org/apache/cassandra/db/commitlog/CommitLogReader.java#L71-L103[variety
of signatures] available for use. In order to handle mutations read from
disk, implement
https://github.com/apache/cassandra/blob/e31e216234c6b57a531cae607e0355666007deb2/src/java/org/apache/cassandra/db/commitlog/CommitLogReadHandler.java[CommitLogReadHandler].
== Warnings
*Do not enable CDC without some kind of consumption process in-place.*
If CDC is enabled on a node and then on a table, the
`cdc_free_space_in_mb` will fill up and then writes to CDC-enabled
tables will be rejected unless some consumption process is in place.
== Further Reading
* https://issues.apache.org/jira/browse/CASSANDRA-8844[JIRA ticket]
* https://issues.apache.org/jira/browse/CASSANDRA-12148[JIRA ticket]