| = Repair |
| |
Cassandra is designed to remain available if one of its nodes is down
or unreachable. However, when a node is down or unreachable, it needs to
eventually discover the writes it missed. Hints attempt to inform a node
of missed writes, but they are a best effort and aren't guaranteed to
inform a node of every write it missed. These inconsistencies can
eventually result in data loss as nodes are replaced or tombstones
expire.
| |
These inconsistencies are fixed with the repair process. Repair
synchronizes the data between nodes by comparing their respective
datasets for their common token ranges, and streaming the differences
for any out-of-sync sections between the nodes. It compares the data
with Merkle trees, which are a hierarchy of hashes.
| |
| == Incremental and Full Repairs |
| |
There are two types of repair: full repairs and incremental repairs.
Full repairs operate over all of the data in the token range being
repaired. Incremental repairs only repair data that's been written since
the previous incremental repair.
| |
Incremental repairs are the default repair type, and if run regularly,
they can significantly reduce the time and I/O cost of performing a
repair. However, it's important to understand that once an incremental
repair marks data as repaired, it won't try to repair it again. This is
fine for syncing up missed writes, but it doesn't protect against things
like disk corruption, data loss by operator error, or bugs in Cassandra.
For this reason, full repairs should still be run occasionally.
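
If you want to see whether the SSTables on a node have been marked
repaired, one option is the `sstablemetadata` tool. The sketch below is
an illustration only: the data path and SSTable file name are
placeholders, and the exact output format varies by Cassandra version.

[source,none]
----
# Inspect the repair status of an SSTable (path and file name are examples).
# Unrepaired SSTables report "Repaired at: 0"; repaired ones show a timestamp.
sstablemetadata /var/lib/cassandra/data/<keyspace>/<table>-*/nb-1-big-Data.db | grep "Repaired at"
----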
| |
| == Usage and Best Practices |
| |
Since repair can result in a lot of disk and network I/O, it's not run
automatically by Cassandra. It is run manually by the operator via
`nodetool`.
| |
| Incremental repair is the default and is run with the following command: |
| |
| [source,none] |
| ---- |
| nodetool repair |
| ---- |
| |
| A full repair can be run with the following command: |
| |
| [source,none] |
| ---- |
| nodetool repair --full |
| ---- |
| |
| Additionally, repair can be run on a single keyspace: |
| |
| [source,none] |
| ---- |
| nodetool repair [options] <keyspace_name> |
| ---- |
| |
| Or even on specific tables: |
| |
| [source,none] |
| ---- |
| nodetool repair [options] <keyspace_name> <table1> <table2> |
| ---- |
| |
The repair command only repairs token ranges on the node being
repaired; it doesn't repair the whole cluster. By default, repair
operates on all token ranges replicated by the node you're running
repair on, which causes duplicate work if you run it on every node. The
`-pr` flag restricts repair to the "primary" ranges of a node, so you
can repair your entire cluster by running `nodetool repair -pr` on each
node in a single datacenter.
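
For example, to repair the whole cluster this way, run the command on
every node in one datacenter. The host names in the sketch below are
placeholders for illustration:

[source,none]
----
# Run a primary-range repair on each node in one datacenter (host names are examples).
for host in node1.example.com node2.example.com node3.example.com; do
    ssh "$host" nodetool repair -pr
done
----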
| |
The repair frequency that's right for your cluster depends on several
factors. However, if you're just starting out and looking for a
baseline, running an incremental repair every 1-3 days and a full repair
every 1-3 weeks is probably reasonable. If you don't want to run
incremental repairs, a full repair every 5 days is a good place to
start.
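
One simple way to run repairs on such a schedule is cron. The entries
below are only a sketch under assumed times and paths; adapt them to
your environment or use a dedicated repair scheduling tool instead
(`nodetool` may also need a full path when invoked from cron):

[source,none]
----
# Incremental repair of this node's primary ranges, roughly every 2 days at 02:00.
0 2 */2 * * nodetool repair -pr >> /var/log/cassandra/repair.log 2>&1
# Full repair of this node's primary ranges on the 1st and 15th of each month at 03:00.
0 3 1,15 * * nodetool repair -pr --full >> /var/log/cassandra/repair.log 2>&1
----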
| |
| At a minimum, repair should be run often enough that the gc grace period |
| never expires on unrepaired data. Otherwise, deleted data could |
| reappear. With a default gc grace period of 10 days, repairing every |
| node in your cluster at least once every 7 days will prevent this, while |
| providing enough slack to allow for delays. |
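
To check the gc grace period in effect for a table, you can query the
schema tables; the keyspace and table names below are placeholders:

[source,none]
----
cqlsh> SELECT gc_grace_seconds FROM system_schema.tables
   ... WHERE keyspace_name = 'cqlkeyspace' AND table_name = 't';
----

The default value is 864000 seconds (10 days).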
| |
| == Other Options |
| |
| `-pr, --partitioner-range`:: |
| Restricts repair to the 'primary' token ranges of the node being |
| repaired. A primary range is just a token range for which a node is |
| the first replica in the ring. |
`-prv, --preview`::
Estimates the amount of streaming that would occur for the given
repair command. This builds the Merkle trees and prints the expected
streaming activity, but does not actually do any streaming. By
default, incremental repairs are estimated; add the `--full` flag to
estimate a full repair (see the examples after this list).
`-vd, --validate`::
Verifies that the repaired data is the same across all nodes. Similar
to `--preview`, this builds and compares Merkle trees of repaired
data, but doesn't do any streaming. This is useful for
troubleshooting. If this shows that the repaired data is out of sync,
a full repair should be run.
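
For example, both options accept the same keyspace and table arguments
as a normal repair; the keyspace name below is a placeholder:

[source,none]
----
# Estimate the streaming a full repair of the keyspace would perform.
nodetool repair -prv --full cqlkeyspace

# Check that already-repaired data is consistent across replicas.
nodetool repair -vd cqlkeyspace
----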
| |
See the `nodetool repair` documentation for the full list of options.
| |
| == Full Repair Example |
| |
Full repair is typically needed to redistribute data after increasing
the replication factor of a keyspace or after adding a node to the
cluster. Full repair involves streaming SSTables. To demonstrate full
repair, start with a three-node cluster.
| |
| [source,none] |
| ---- |
| [ec2-user@ip-10-0-2-238 ~]$ nodetool status |
| Datacenter: us-east-1 |
| ===================== |
| Status=Up/Down |
| |/ State=Normal/Leaving/Joining/Moving |
| -- Address Load Tokens Owns Host ID Rack |
| UN 10.0.1.115 547 KiB 256 ? b64cb32a-b32a-46b4-9eeb-e123fa8fc287 us-east-1b |
| UN 10.0.3.206 617.91 KiB 256 ? 74863177-684b-45f4-99f7-d1006625dc9e us-east-1d |
| UN 10.0.2.238 670.26 KiB 256 ? 4dcdadd2-41f9-4f34-9892-1f20868b27c7 us-east-1c |
| ---- |
| |
| Create a keyspace with replication factor 3: |
| |
| [source,none] |
| ---- |
| cqlsh> DROP KEYSPACE cqlkeyspace; |
| cqlsh> CREATE KEYSPACE CQLKeyspace |
| ... WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3}; |
| ---- |
| |
| Add a table to the keyspace: |
| |
| [source,none] |
| ---- |
| cqlsh> use cqlkeyspace; |
| cqlsh:cqlkeyspace> CREATE TABLE t ( |
| ... id int, |
| ... k int, |
| ... v text, |
| ... PRIMARY KEY (id) |
| ... ); |
| ---- |
| |
| Add table data: |
| |
| [source,none] |
| ---- |
| cqlsh:cqlkeyspace> INSERT INTO t (id, k, v) VALUES (0, 0, 'val0'); |
| cqlsh:cqlkeyspace> INSERT INTO t (id, k, v) VALUES (1, 1, 'val1'); |
| cqlsh:cqlkeyspace> INSERT INTO t (id, k, v) VALUES (2, 2, 'val2'); |
| ---- |
| |
A query lists the data that was added:
| |
| [source,none] |
| ---- |
| cqlsh:cqlkeyspace> SELECT * FROM t; |
| |
| id | k | v |
| ----+---+------ |
| 1 | 1 | val1 |
| 0 | 0 | val0 |
| 2 | 2 | val2 |
| (3 rows) |
| ---- |
| |
Make the following changes to the three-node cluster:

[arabic]
. Increase the replication factor from 3 to 4.
. Add a 4th node to the cluster.
| |
When the replication factor is increased, the following warning is
output, indicating that a full repair is needed, as per
https://issues.apache.org/jira/browse/CASSANDRA-13079[CASSANDRA-13079]:
| |
| [source,none] |
| ---- |
| cqlsh:cqlkeyspace> ALTER KEYSPACE CQLKeyspace |
| ... WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 4}; |
| Warnings : |
| When increasing replication factor you need to run a full (-full) repair to distribute the |
| data. |
| ---- |
| |
Perform a full repair on the keyspace `cqlkeyspace` and table `t` with
the following command:
| |
| [source,none] |
| ---- |
| nodetool repair -full cqlkeyspace t |
| ---- |
| |
The full repair completes in about a second, as indicated by the output:
| |
| [source,none] |
| ---- |
| [ec2-user@ip-10-0-2-238 ~]$ nodetool repair -full cqlkeyspace t |
| [2019-08-17 03:06:21,445] Starting repair command #1 (fd576da0-c09b-11e9-b00c-1520e8c38f00), repairing keyspace cqlkeyspace with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [t], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 1024, pull repair: false, force repair: false, optimise streams: false) |
| [2019-08-17 03:06:23,059] Repair session fd8e5c20-c09b-11e9-b00c-1520e8c38f00 for range [(-8792657144775336505,-8786320730900698730], (-5454146041421260303,-5439402053041523135], (4288357893651763201,4324309707046452322], ... , (4350676211955643098,4351706629422088296]] finished (progress: 0%) |
| [2019-08-17 03:06:23,077] Repair completed successfully |
| [2019-08-17 03:06:23,077] Repair command #1 finished in 1 second |
| [ec2-user@ip-10-0-2-238 ~]$ |
| ---- |
| |
The `nodetool tpstats` command should show the completed repair: the
`Completed` column for the `Repair-Task` pool has a value of 1:
| |
| [source,none] |
| ---- |
| [ec2-user@ip-10-0-2-238 ~]$ nodetool tpstats |
| Pool Name Active Pending Completed Blocked All time blocked |
| ReadStage 0 0 99 0 0 |
| … |
| Repair-Task 0 0 1 0 0 |
| RequestResponseStage 0 0 2078 0 0 |
| ---- |