| = Compaction |
| |
| == Strategies |
| |
Picking the right compaction strategy for your workload will ensure the
best performance, both for querying and for compaction itself.
| |
| xref:operating/compaction/stcs.adoc[`Size Tiered Compaction Strategy (STCS)`]:: |
| The default compaction strategy. Useful as a fallback when other |
strategies don't fit the workload. Most useful for workloads that are
not pure time series, on spinning disks, or when the I/O from `LCS`
is too high.
| xref:operating/compaction/lcs.adoc[`Leveled Compaction Strategy (LCS)`]:: |
Leveled Compaction Strategy (LCS) is optimized for read-heavy
| workloads, or workloads with lots of updates and deletes. It is not a |
| good choice for immutable time series data. |
| xref:operating/compaction/twcs.adoc[`Time Window Compaction Strategy (TWCS)`]:: |
| Time Window Compaction Strategy is designed for TTL'ed, mostly |
| immutable time series data. |
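
The strategy is set per table through the `compaction` option. As an
illustration, a hypothetical table could be switched to `TWCS` like
this (the keyspace, table and window settings are made up for the
example):

[source,cql]
----
ALTER TABLE keyspace1.standard1
  WITH compaction = { 'class' : 'TimeWindowCompactionStrategy',
                      'compaction_window_unit' : 'DAYS',
                      'compaction_window_size' : 1 };
----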
| |
| == Types of compaction |
| |
The concept of compaction is used for several different kinds of
operations in Cassandra; what these operations have in common is that
they take one or more SSTables and output new SSTables. Examples of the
corresponding `nodetool` commands are shown after the list. The types
of compaction are:
| |
| Minor compaction:: |
Triggered automatically in Cassandra.
| Major compaction:: |
A user executes a compaction over all SSTables on the node.
| User defined compaction:: |
A user triggers a compaction on a given set of SSTables.
| Scrub:: |
Try to fix any broken SSTables. This can actually remove valid data if
that data is corrupted; if that happens you will need to run a full
repair on the node.
| UpgradeSSTables:: |
Upgrade SSTables to the latest version. Run this after upgrading to a
| new major version. |
| Cleanup:: |
Remove any ranges this node no longer owns. Typically triggered on
neighbouring nodes after a node has been bootstrapped, since the new
node takes ownership of some ranges from those nodes.
| Secondary index rebuild:: |
Rebuild the secondary indexes on the node.
| Anticompaction:: |
After repair, the ranges that were actually repaired are split out of
the SSTables that existed when repair started.
| Sub range compaction:: |
It is possible to compact only a given sub range; this can be useful
if you know a token that has been misbehaving, either gathering many
updates or many deletes. `nodetool compact -st x -et y` will pick all
SSTables containing the range between x and y and issue a compaction
for those SSTables. For STCS this will most likely include all
SSTables, but with LCS it can issue the compaction for a subset of the
SSTables. With LCS the resulting sstable will end up in L0.
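
As a reference for the user-triggered types above, a few hedged
`nodetool` examples (the keyspace, table and token values are
hypothetical):

[source,none]
----
# Major compaction of one table on this node
nodetool compact keyspace1 standard1

# Scrub, upgradesstables and cleanup follow the same pattern
nodetool scrub keyspace1 standard1
nodetool upgradesstables keyspace1 standard1
nodetool cleanup keyspace1

# Sub range compaction between two tokens
nodetool compact -st -9223372036854775808 -et 0 keyspace1 standard1
----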
| |
| == When is a minor compaction triggered? |
| |
| * When an sstable is added to the node through flushing/streaming |
| * When autocompaction is enabled after being disabled (`nodetool enableautocompaction`) |
| * When compaction adds new SSTables |
* When the periodic check for new minor compactions runs (every 5 minutes)
| |
| == Merging SSTables |
| |
Compaction is about merging SSTables. Since partitions in SSTables are
sorted based on the hash of the partition key, it is possible to
efficiently merge separate SSTables. The content of each partition is
also sorted, so each partition can be merged efficiently.
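
As a simplified illustration, assume tokens t1 < t2 < t3 and that the
newest value for a cell wins; merging two SSTables is then a single
sequential pass over both:

[source,none]
----
SSTable 1: (t1: A=1), (t3: C=1)
SSTable 2: (t1: A=2, newer), (t2: B=1)

Merged:    (t1: A=2), (t2: B=1), (t3: C=1)
----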
| |
| == Tombstones and Garbage Collection (GC) Grace |
| |
| === Why Tombstones |
| |
When a delete request is received by Cassandra, it does not actually
remove the data from the underlying store. Instead it writes a special
piece of data known as a tombstone. The tombstone represents the delete
and causes all values which occurred before the tombstone not to appear
in queries to the database. This approach is used instead of removing
values because of the distributed nature of Cassandra.
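
For example, a CQL delete against a hypothetical table writes a
tombstone rather than removing the row in place:

[source,cql]
----
DELETE FROM keyspace1.standard1 WHERE key = 'k1';
----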
| |
| === Deletes without tombstones |
| |
Imagine a three node cluster which has the value [A] replicated to
every node:
| |
| [source,none] |
| ---- |
| [A], [A], [A] |
| ---- |
| |
If one of the nodes fails and our delete operation only removes
existing values, we can end up with a cluster that looks like:
| |
| [source,none] |
| ---- |
| [], [], [A] |
| ---- |
| |
Then a repair operation would replace the value of [A] back onto the
two nodes which are missing the value:
| |
| [source,none] |
| ---- |
| [A], [A], [A] |
| ---- |
| |
| This would cause our data to be resurrected even though it had been |
| deleted. |
| |
| === Deletes with Tombstones |
| |
Starting again with a three node cluster which has the value [A]
replicated to every node:
| |
| [source,none] |
| ---- |
| [A], [A], [A] |
| ---- |
| |
If instead of removing data we add a tombstone record, our single node
failure situation will look like this:
| |
| [source,none] |
| ---- |
| [A, Tombstone[A]], [A, Tombstone[A]], [A] |
| ---- |
| |
Now when we issue a repair, the tombstone will be copied to the
replica, rather than the deleted data being resurrected:
| |
| [source,none] |
| ---- |
| [A, Tombstone[A]], [A, Tombstone[A]], [A, Tombstone[A]] |
| ---- |
| |
Our repair operation will correctly set the state of the system to what
we expect, with the record [A] marked as deleted on all nodes. This
does mean we will end up accruing tombstones, which would permanently
consume disk space. To avoid keeping tombstones forever we have a
parameter known as `gc_grace_seconds` for every table in Cassandra.
| |
| === The gc_grace_seconds parameter and Tombstone Removal |
| |
The table level `gc_grace_seconds` parameter controls how long Cassandra
will retain tombstones through compaction events before finally removing
them. This duration should directly reflect the amount of time a user
expects to allow before recovering a failed node. After
`gc_grace_seconds` has expired the tombstone may be removed (meaning
there will no longer be any record that a certain piece of data was
deleted), but as a tombstone can live in one sstable and the data it
covers in another, a compaction must also include both SSTables for a
tombstone to be removed. More precisely, to be able to drop an actual
tombstone the following needs to be true:
| |
| * The tombstone must be older than `gc_grace_seconds` |
* If partition X contains the tombstone, the sstable containing the
tombstone plus all SSTables containing data for X that is older than
the tombstone must be included in the same compaction. We don't need to
care about an sstable if we can guarantee that all of its data is newer
than the tombstone: if the tombstone is older than the data it cannot
shadow that data.
| * If the option `only_purge_repaired_tombstones` is enabled, tombstones |
| are only removed if the data has also been repaired. |
| |
If a node remains down or disconnected for longer than
`gc_grace_seconds`, its deleted data will be repaired back to the other
nodes and re-appear in the cluster. This is basically the same as in the
"Deletes without Tombstones" section. Note that tombstones will not be
removed until a compaction event even if `gc_grace_seconds` has elapsed.
| |
The default value for `gc_grace_seconds` is 864000, which is equivalent
| to 10 days. This can be set when creating or altering a table using |
| `WITH gc_grace_seconds`. |
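
For example, setting a one-day grace period on a hypothetical table
(repairs then need to complete more often than this):

[source,cql]
----
ALTER TABLE keyspace1.standard1 WITH gc_grace_seconds = 86400;
----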
| |
| == TTL |
| |
Data in Cassandra can have an additional property called time to live
(TTL); this is used to automatically drop data once its time has
expired. Once the TTL has expired the data is converted to a tombstone
which stays around for at least `gc_grace_seconds`. Note that if you mix
data with TTL and data without TTL (or just different lengths of TTL),
Cassandra will have a hard time dropping the tombstones created, since
the partition might span many SSTables and not all of them are
compacted at once.
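
TTL can be set per write or as a table default; both statements below
use hypothetical names and arbitrary values:

[source,cql]
----
-- This row expires one hour after the write
INSERT INTO keyspace1.standard1 (key, value)
VALUES ('k1', 'v1') USING TTL 3600;

-- Default TTL (in seconds) applied to writes to the table
ALTER TABLE keyspace1.standard1 WITH default_time_to_live = 86400;
----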
| |
| == Fully expired SSTables |
| |
If an sstable contains only tombstones and it is guaranteed that this
sstable is not shadowing data in any other sstable, compaction can drop
the entire sstable. If you see SSTables with only tombstones (note that
TTL'ed data is considered tombstones once the time to live has expired)
that are not being dropped by compaction, it is likely that other
SSTables contain older data. There is a tool called
`sstableexpiredblockers` that will list which SSTables are droppable and
which ones are blocking them from being dropped. This is especially
useful for time series compaction with `TimeWindowCompactionStrategy`
(and the deprecated `DateTieredCompactionStrategy`). With
`TimeWindowCompactionStrategy` it is possible to remove the guarantee
(not check for shadowing data) by enabling
`unsafe_aggressive_sstable_expiration`.
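
The tool takes a keyspace and a table; for example, with hypothetical
names:

[source,none]
----
sstableexpiredblockers keyspace1 standard1
----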
| |
| == Repaired/unrepaired data |
| |
With incremental repairs Cassandra must keep track of what data is
repaired and what data is unrepaired. With anticompaction, data is
split out into repaired and unrepaired SSTables. To avoid mixing up the
data again, separate compaction strategy instances are run on the two
sets of data, each instance only knowing about either the repaired or
the unrepaired SSTables. This means that if you only run incremental
repair once and then never again, you might have very old data in the
repaired SSTables that blocks compaction from dropping tombstones in
the unrepaired (probably newer) SSTables.
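
To check whether a given sstable is marked repaired, the
`sstablemetadata` tool prints a `Repaired at` timestamp (0 means
unrepaired); the data file path here is hypothetical:

[source,none]
----
sstablemetadata /var/lib/cassandra/data/keyspace1/standard1-*/mc-1-big-Data.db | grep Repaired
----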
| |
| == Data directories |
| |
Since tombstones and data can live in different SSTables, it is
important to realize that losing an sstable might lead to data becoming
live again - the most common way of losing SSTables is to have a hard
drive break down. To avoid making data live, tombstones and actual data
are always kept in the same data directory. This way, if a disk is lost,
all versions of a partition are lost and no data can get undeleted. To
achieve this, a compaction strategy instance per data directory is run
in addition to the compaction strategy instances containing
repaired/unrepaired data. This means that if you have 4 data directories
there will be 8 compaction strategy instances running. This has a few
more benefits than just avoiding data getting undeleted:
| |
| * It is possible to run more compactions in parallel - leveled |
| compaction will have several totally separate levelings and each one can |
| run compactions independently from the others. |
| * Users can backup and restore a single data directory. |
* Note though that currently all data directories are considered equal,
so if you have a tiny disk and a big disk backing two data directories,
the big one will be limited by the small one. One workaround is to
create more data directories backed by the big disk.
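
Data directories are configured with `data_file_directories` in
`cassandra.yaml`; the paths below are of course only examples:

[source,yaml]
----
data_file_directories:
    - /mnt/disk1/cassandra/data
    - /mnt/disk2/cassandra/data
----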
| |
| == Single sstable tombstone compaction |
| |
When an sstable is written, a histogram of the tombstone expiry times is
created. This is used to find SSTables with very many tombstones and run
a single sstable compaction on them, in the hope of being able to drop
those tombstones. Before starting, it is also checked how likely it is
that any tombstones will actually be able to be dropped, based on how
much this sstable overlaps with other SSTables. To avoid most of these
checks the compaction option `unchecked_tombstone_compaction` can be
enabled.
| |
| [[compaction-options]] |
| == Common options |
| |
There are a number of common options for all the compaction strategies:
| |
| `enabled` (default: true):: |
Whether minor compactions should run. Note that this table option is
separate from the node-level toggle: you can have `'enabled': true` as
a compaction option and then use `nodetool enableautocompaction` to
start running compactions.
| `tombstone_threshold` (default: 0.2):: |
| How much of the sstable should be tombstones for us to consider doing |
| a single sstable compaction of that sstable. |
| `tombstone_compaction_interval` (default: 86400s (1 day)):: |
| Since it might not be possible to drop any tombstones when doing a |
| single sstable compaction we need to make sure that one sstable is not |
| constantly getting recompacted - this option states how often we |
| should try for a given sstable. |
| `log_all` (default: false):: |
New detailed compaction logging, see
<<detailed-compaction-logging,below>>.
| `unchecked_tombstone_compaction` (default: false):: |
| The single sstable compaction has quite strict checks for whether it |
| should be started, this option disables those checks and for some |
| usecases this might be needed. Note that this does not change anything |
| for the actual compaction, tombstones are only dropped if it is safe |
| to do so - it might just rewrite an sstable without being able to drop |
| any tombstones. |
`only_purge_repaired_tombstones` (default: false)::
| Option to enable the extra safety of making sure that tombstones are |
| only dropped if the data has been repaired. |
| `min_threshold` (default: 4):: |
| Lower limit of number of SSTables before a compaction is triggered. |
| Not used for `LeveledCompactionStrategy`. |
| `max_threshold` (default: 32):: |
| Upper limit of number of SSTables before a compaction is triggered. |
| Not used for `LeveledCompactionStrategy`. |
| |
| Further, see the section on each strategy for specific additional |
| options. |
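
The common options go in the same `compaction` map as the strategy
itself; a sketch with a hypothetical table and arbitrary values:

[source,cql]
----
ALTER TABLE keyspace1.standard1
  WITH compaction = { 'class' : 'SizeTieredCompactionStrategy',
                      'min_threshold' : 8,
                      'max_threshold' : 32,
                      'tombstone_threshold' : 0.2,
                      'unchecked_tombstone_compaction' : 'true' };
----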
| |
| == Compaction nodetool commands |
| |
The `nodetool` utility provides a number of commands related
| to compaction: |
| |
| `enableautocompaction`:: |
| Enable compaction. |
| `disableautocompaction`:: |
| Disable compaction. |
| `setcompactionthroughput`:: |
How fast compaction should run at most - defaults to 16 MB/s, but note
that it is likely not possible to reach this throughput.
| `compactionstats`:: |
| Statistics about current and pending compactions. |
| `compactionhistory`:: |
| List details about the last compactions. |
| `setcompactionthreshold`:: |
| Set the min/max sstable count for when to trigger compaction, defaults |
| to 4/32. |
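
For example (the throughput value, thresholds and table name are
arbitrary):

[source,none]
----
# Cap compaction throughput at 64 MB/s
nodetool setcompactionthroughput 64

# Set min/max sstable count for a hypothetical table
nodetool setcompactionthreshold keyspace1 standard1 4 32

# Inspect current and past compactions
nodetool compactionstats
nodetool compactionhistory
----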
| |
| == Switching the compaction strategy and options using JMX |
| |
It is possible to switch compaction strategy and its options on just a
single node using JMX; this is a great way to experiment with settings
without affecting the whole cluster. The mbean is:
| |
| [source,none] |
| ---- |
| org.apache.cassandra.db:type=ColumnFamilies,keyspace=<keyspace_name>,columnfamily=<table_name> |
| ---- |
| |
and the attribute to change is `CompactionParameters` or
`CompactionParametersJson` if you use jconsole or jmc. The syntax for
the json version is the same as you would use in an `ALTER TABLE`
statement, for example:
| |
| [source,none] |
| ---- |
| { 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 123, 'fanout_size': 10} |
| ---- |
| |
The setting is kept until someone executes an `ALTER TABLE` statement
that touches the compaction settings, or until the node is restarted.
| |
| [[detailed-compaction-logging]] |
| == More detailed compaction logging |
| |
| Enable with the compaction option `log_all` and a more detailed |
| compaction log file will be produced in your log directory. |
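
For example, to enable it on a hypothetical table (note that setting
the `compaction` map replaces the whole map, so the strategy class must
be included):

[source,cql]
----
ALTER TABLE keyspace1.standard1
  WITH compaction = { 'class' : 'SizeTieredCompactionStrategy',
                      'log_all' : 'true' };
----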