| = Adding, replacing, moving and removing nodes |
| |
| == Bootstrap |
| |
Adding new nodes is called "bootstrapping". The `num_tokens` parameter
defines the number of virtual nodes (tokens) the joining node will be
assigned during bootstrap. The tokens define the sections of the ring
(token ranges) the node will become responsible for.
| |
| === Token allocation |
| |
With the default token allocation algorithm the new node will pick
`num_tokens` random tokens to become responsible for. Since tokens are
distributed randomly, load distribution improves with a higher number of
virtual nodes, but a higher number also increases token management
overhead. The default of 256 virtual nodes should provide a reasonable
load balance with acceptable overhead.
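
For example, the default as shipped in `cassandra.yaml` (the path below
assumes a tarball install with the configuration under `conf/`):

[source,bash]
----
# inspect the configured number of virtual nodes for this node
$ grep '^num_tokens' conf/cassandra.yaml
num_tokens: 256
----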
| |
In 3.0 a new token allocation algorithm was introduced that allocates
tokens based on the load of existing virtual nodes for a given keyspace,
and thus yields an improved load distribution with a lower number of
tokens. To use this approach, the new node must be started with the JVM
option `-Dcassandra.allocate_tokens_for_keyspace=<keyspace>`, where
`<keyspace>` is the keyspace from which the algorithm reads the load
information used to optimize token assignment.
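
One way to set the flag, assuming a tarball install where extra JVM
options are added in `conf/cassandra-env.sh` (`my_keyspace` is a
placeholder; the keyspace must already exist with its final replication
settings):

[source,bash]
----
# allocate tokens balanced for my_keyspace's replica placement;
# typically combined with a lower num_tokens in cassandra.yaml
JVM_OPTS="$JVM_OPTS -Dcassandra.allocate_tokens_for_keyspace=my_keyspace"
----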
| |
| ==== Manual token assignment |
| |
You may specify a comma-separated list of tokens manually with the
`initial_token` parameter in `cassandra.yaml`; when it is specified,
Cassandra will skip the token allocation process. This may be useful
when assigning tokens with an external tool or when restoring a node
with its previous tokens.
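
A minimal sketch of such an assignment (the three token values are
arbitrary `Murmur3Partitioner` examples, not a recommendation):

[source,bash]
----
# append explicit tokens before the node's first start; num_tokens,
# if set, should match the number of tokens listed here (3)
$ echo 'initial_token: -9223372036854775808,-3074457345618258603,3074457345618258602' >> conf/cassandra.yaml
----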
| |
| === Range streaming |
| |
After the tokens are allocated, the joining node picks current replicas
of the token ranges it will become responsible for to stream data from.
By default it streams from the primary replica of each token range, in
order to guarantee that the data on the new node is consistent with the
current state.
| |
If any replica is unavailable, the consistent bootstrap process will
fail. To override this behavior and potentially miss data from an
unavailable replica, set the JVM flag
`-Dcassandra.consistent.rangemovement=false`.
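
As with the other JVM flags in this document, one way to set it is via
`conf/cassandra-env.sh` (assuming a tarball install):

[source,bash]
----
# accept that data from unavailable replicas may be missed during bootstrap
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"
----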
| |
| === Resuming failed/hanged bootstrap |
| |
On 2.2+, if the bootstrap process fails, it's possible to resume
bootstrap from the previously saved state by calling
`nodetool bootstrap resume`. If for some reason the bootstrap hangs or
stalls, it may also be resumed by simply restarting the node. In order
to clean up the bootstrap state and start fresh, you may set the JVM
startup flag `-Dcassandra.reset_bootstrap_progress=true`.
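
For example, on the joining node:

[source,bash]
----
# resume a failed bootstrap from its last saved state (2.2+)
$ nodetool bootstrap resume

# or, to discard the saved state and bootstrap from scratch, add the
# reset flag (e.g. in conf/cassandra-env.sh) before restarting the node:
# JVM_OPTS="$JVM_OPTS -Dcassandra.reset_bootstrap_progress=true"
----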
| |
On older versions, when the bootstrap process fails it is recommended
to wipe the node (remove all of its data) and restart the bootstrap
process.
| |
| === Manual bootstrapping |
| |
It's possible to skip the bootstrapping process entirely and join the
ring straight away by setting the hidden parameter
`auto_bootstrap: false`. This may be useful when restoring a node from a
backup or creating a new datacenter.
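
Since the parameter is not listed in the shipped `cassandra.yaml`, one
way to set it is to append it (a sketch, assuming a tarball install):

[source,bash]
----
# skip bootstrap streaming entirely; the node joins the ring immediately
$ echo 'auto_bootstrap: false' >> conf/cassandra.yaml
----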
| |
| == Removing nodes |
| |
You can take a node out of the cluster with `nodetool decommission`
(run on the live node itself) or remove a dead one with
`nodetool removenode` (run from any other machine). Either operation
assigns the ranges the old node was responsible for to other nodes and
replicates the appropriate data there. If decommission is used, the data
is streamed from the decommissioned node. If removenode is used, the
data is streamed from the remaining replicas.
| |
No data is removed automatically from the node being decommissioned, so
if you want to put the node back into service at a different token on
the ring, its data should be removed manually.
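
For example (the host ID below is a made-up placeholder; look up the
real one with `nodetool status`):

[source,bash]
----
# on a live node that is leaving the cluster, run locally:
$ nodetool decommission

# for a dead node, run from any live node, passing the dead node's Host ID:
$ nodetool removenode 192d3b47-75b5-4e0c-9f4f-9a7a7d3b1a2e
----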
| |
| == Moving nodes |
| |
When `num_tokens: 1` is used, it's possible to move the node's position
in the ring with `nodetool move`. Moving is both a convenience over, and
more efficient than, decommission + bootstrap. After moving a node,
`nodetool cleanup` should be run to remove any unnecessary data.
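
A sketch of the sequence (the token value is an arbitrary
`Murmur3Partitioner` example):

[source,bash]
----
# on the single-token node being moved:
$ nodetool move 3074457345618258602

# once the move completes, on the nodes whose ranges shrank:
$ nodetool cleanup
----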
| |
| == Replacing a dead node |
| |
In order to replace a dead node, start Cassandra with the JVM startup
flag `-Dcassandra.replace_address_first_boot=<dead_node_ip>`. Once this
property is enabled, the node starts in a hibernate state, during which
all the other nodes will see this node as DOWN (DN); the node, however,
will see itself as UP (UN). The accurate replacement state can be found
in `nodetool netstats`.
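
One way to set the flag, assuming a tarball install and
`conf/cassandra-env.sh` (`10.0.0.1` is a placeholder for the dead node's
IP address):

[source,bash]
----
# on the brand-new replacement node, before its first start:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.1"
----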
| |
The replacing node will now start to bootstrap the data from the rest
of the nodes in the cluster. A replacing node will only receive writes
during the bootstrapping phase if it has a different IP address from the
node that is being replaced (see CASSANDRA-8523 and CASSANDRA-12344).
| |
| Once the bootstrapping is complete the node will be marked "UP". |
| |
[NOTE]
| ==== |
If any of the following cases apply, you *MUST* run repair to make the
replaced node consistent again, since it missed ongoing writes
during/prior to bootstrapping (a sample repair invocation follows this
note). The _replacement_ timeframe refers to the period from when the
node initially dies to when a new node completes the replacement
process.
| |
| [arabic] |
| . The node is down for longer than `max_hint_window_in_ms` before being |
| replaced. |
| . You are replacing using the same IP address as the dead node *and* |
| replacement takes longer than `max_hint_window_in_ms`. |
| ==== |
| |
| == Monitoring progress |
| |
Bootstrap, replace, move and remove progress can be monitored using
`nodetool netstats`, which will show the progress of the streaming
operations.
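
For example, on the node performing the range movement (`watch` is a
common Linux utility; the interval is arbitrary):

[source,bash]
----
# refresh the streaming progress view every ten seconds
$ watch -n 10 nodetool netstats
----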
| |
| == Cleanup data after range movements |
| |
| As a safety measure, Cassandra does not automatically remove data from |
| nodes that "lose" part of their token range due to a range movement |
| operation (bootstrap, move, replace). Run `nodetool cleanup` on the |
| nodes that lost ranges to the joining node when you are satisfied the |
new node is up and working. If you do not do this, the old data will
still be counted against the load on that node.
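
For example, once `nodetool status` shows the new node as UN (up/normal):

[source,bash]
----
# on each node that lost ranges; running one node at a time keeps the
# extra compaction I/O from hitting the whole cluster at once
$ nodetool cleanup
----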