= Find The Misbehaving Nodes

The first step in troubleshooting a Cassandra issue is to use error
messages, metrics, and monitoring information to identify whether the
issue lies with the clients or the server, and if it lies with the
server, to find the problematic nodes in the Cassandra cluster. The goal
is to determine whether this is a systemic issue (e.g. a query pattern
that affects the entire cluster) or isolated to a subset of nodes (e.g.
neighbors holding a shared token range, or even a single node with bad
hardware).

There are many sources of information that help determine where the
problem lies. Some of the most common are mentioned below.

== Client Logs and Errors

Clients of the cluster often leave the best breadcrumbs to follow.
Perhaps client latencies or error rates have increased in a particular
datacenter (likely ruling out other datacenters' nodes), or clients are
receiving a particular kind of error code that points to a particular
kind of problem. Troubleshooters can often rule out many failure modes
just by reading the error messages. In fact, many Cassandra error
messages include the last coordinator contacted to help operators find
nodes to start with.

Some common errors (likely culprit in parentheses), assuming the client
uses error names similar to the DataStax `drivers <client-drivers>`, are
listed below; a short example of catching them follows the list:

* `SyntaxError` (*client*): This and other `QueryValidationException`
errors indicate that the client sent a malformed request. These are
rarely server issues and usually indicate bad queries.
* `UnavailableException` (*server*): This means that the Cassandra
coordinator node has rejected the query because it believes that
insufficient replica nodes are available. If many coordinators are
throwing this error it likely means that multiple nodes really are down
in the cluster, and you can identify them using `nodetool status
<nodetool-status>`. If only a single coordinator is throwing this error
it may mean that node has been partitioned from the rest.
* `OperationTimedOutException` (*server*): This is the most frequent
timeout message raised when clients set their own timeouts. It is a
_client side_ timeout, meaning that the query took longer than the
timeout the client supplied. The error message will include the
coordinator node that was last tried, which is usually a good starting
point. This error usually indicates either aggressive client timeout
values or slow server coordinators/replicas.
* `ReadTimeoutException` or `WriteTimeoutException` (*server*): These
are raised when clients do not specify lower timeouts and there is a
_coordinator_ timeout based on the values supplied in the
`cassandra.yaml` configuration file. They usually indicate a serious
server side problem as the default values are typically multiple
seconds.
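
As a concrete illustration, the sketch below (assuming the DataStax Java
driver 3.x, a reachable contact point at `127.0.0.1`, and a hypothetical
`ks.tbl` table) shows one way a client might catch these exceptions and
log enough context to pick a node to start troubleshooting from:

[source,java]
----
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.OperationTimedOutException;
import com.datastax.driver.core.exceptions.QueryValidationException;
import com.datastax.driver.core.exceptions.ReadTimeoutException;
import com.datastax.driver.core.exceptions.UnavailableException;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class ErrorTriage {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // hypothetical contact point
                .build();
             Session session = cluster.connect()) {

            // Hypothetical query against a hypothetical ks.tbl table.
            session.execute("SELECT * FROM ks.tbl WHERE id = 42");

        } catch (QueryValidationException e) {
            // SyntaxError and friends: almost always a client-side bug in the query.
            System.err.println("Client problem, fix the query: " + e.getMessage());
        } catch (UnavailableException e) {
            // The coordinator believes too few replicas are alive; check nodetool status.
            System.err.println("Unavailable (check for down nodes): " + e.getMessage());
        } catch (ReadTimeoutException | WriteTimeoutException e) {
            // Coordinator-side timeout governed by cassandra.yaml; usually serious.
            System.err.println("Coordinator timeout: " + e.getMessage());
        } catch (OperationTimedOutException e) {
            // Client-side timeout; the message names the coordinator that was last tried.
            System.err.println("Client timeout, start at this coordinator: " + e.getMessage());
        }
    }
}
----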

== Metrics

If you have Cassandra xref:operating/metrics.adoc[`metrics`] reporting to a
centralized location such as https://graphiteapp.org/[Graphite] or
https://grafana.com/[Grafana], you can typically use those to narrow down
the problem. At this stage narrowing down the issue to a particular
datacenter, rack, or even group of nodes is the main goal. Some helpful
metrics to look at are:

=== Errors

Cassandra refers to internode messaging errors as "drops", and provides
a number of xref:operating/metrics.adoc#droppedmessage-metrics[`Dropped Message Metrics`]
to help narrow down errors. If particular nodes are actively dropping
messages, they are likely related to the issue.
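
If the dropped-message counters are not already flowing into a
dashboard, they can also be read directly from each node over JMX. The
following is a minimal sketch (assuming the default JMX port 7199 and no
JMX authentication) that prints the dropped-message count per verb for
one node; running it against each suspect node shows which ones are
actively dropping messages:

[source,java]
----
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DroppedMessageCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1";  // suspect node
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // One dropped-message meter per verb (READ, MUTATION, HINT, ...).
            ObjectName pattern = new ObjectName(
                "org.apache.cassandra.metrics:type=DroppedMessage,scope=*,name=Dropped");
            for (ObjectName name : mbs.queryNames(pattern, null)) {
                Object count = mbs.getAttribute(name, "Count");
                System.out.println(host + " " + name.getKeyProperty("scope")
                    + " dropped=" + count);
            }
        }
    }
}
----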

=== Latency

For timeouts or latency-related issues you can start with
xref:operating/metrics.adoc#table-metrics[`table metrics`] by comparing
coordinator-level metrics, e.g. `CoordinatorReadLatency` or
`CoordinatorWriteLatency`, with their associated replica metrics, e.g.
`ReadLatency` or `WriteLatency`. Issues usually show up on the `99th`
percentile before they show up on the `50th` percentile or the `mean`.
While `maximum` coordinator latencies are not typically very helpful due
to the exponentially decaying reservoir used internally to produce
metrics, `maximum` replica latencies that correlate with increased
`99th` percentiles on coordinators can help narrow down the problem.
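
The same comparison can be made per node over JMX if the metrics are not
centralized. The sketch below (assuming the default JMX port 7199, no
JMX authentication, and a hypothetical `my_keyspace.my_table` table)
reads the `99thPercentile` of `CoordinatorReadLatency` and `ReadLatency`
for one table on one node:

[source,java]
----
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LatencyComparison {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1";  // node to inspect
        String keyspace = "my_keyspace";                        // hypothetical keyspace
        String table = "my_table";                              // hypothetical table
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            String base = "org.apache.cassandra.metrics:type=Table,keyspace=" + keyspace
                + ",scope=" + table + ",name=";
            // Coordinator-level vs replica-level read latency at the 99th percentile.
            Object coordinatorP99 = mbs.getAttribute(
                new ObjectName(base + "CoordinatorReadLatency"), "99thPercentile");
            Object replicaP99 = mbs.getAttribute(
                new ObjectName(base + "ReadLatency"), "99thPercentile");
            System.out.println(host + " CoordinatorReadLatency p99=" + coordinatorP99
                + " ReadLatency p99=" + replicaP99);
        }
    }
}
----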

There are usually three main possibilities:

[arabic]
. Coordinator latencies are high on all nodes, but only a few nodes'
local read latencies are high. This points to slow replica nodes, and
the coordinator latencies are just a side effect. This usually happens
when clients are not token aware.
. Coordinator latencies and replica latencies increase at the same time
on a few nodes. If clients are token aware this is almost always what
happens, and it points to slow replicas of a subset of token ranges
(only part of the ring).
. Coordinator and local latencies are high on many nodes. This usually
indicates either a tipping point in the cluster capacity (too many
writes or reads per second) or a new query pattern.

It's important to remember that, depending on the client's load
balancing behavior and consistency levels, coordinator and replica
metrics may or may not correlate. In particular, if you use `TokenAware`
policies the same node's coordinator and replica latencies will often
increase together, but if you just use normal `DCAwareRoundRobin`,
coordinator latencies can increase along with unrelated replica nodes'
latencies. For example:

* `TokenAware` + `LOCAL_ONE`: coordinator and replica latencies on the
same node should always rise together.
* `TokenAware` + `LOCAL_QUORUM`: coordinator and multiple replica
latencies in the same datacenter should always rise together.
* `TokenAware` + `QUORUM`: replica latencies in other datacenters can
affect coordinator latencies.
* `DCAwareRoundRobin` + `LOCAL_ONE`: coordinator latencies and unrelated
replica nodes' latencies will rise together.
* `DCAwareRoundRobin` + `LOCAL_QUORUM`: different coordinator and
replica latencies will rise together with little correlation.
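
Which of these cases applies is a direct consequence of how the client
is configured. As a sketch (assuming the DataStax Java driver 3.x, a
contact point at `127.0.0.1`, and a hypothetical local datacenter named
`dc1`), this is what the `TokenAware` + `LOCAL_QUORUM` combination from
the list above looks like:

[source,java]
----
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareClient {
    public static void main(String[] args) {
        // TokenAware wraps DCAwareRoundRobin: requests are routed to a replica in the
        // local datacenter, so a node's coordinator and replica latencies tend to rise
        // together (the TokenAware + LOCAL_QUORUM case in the list above).
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")            // hypothetical contact point
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                    DCAwareRoundRobinPolicy.builder()
                        .withLocalDc("dc1")              // hypothetical datacenter name
                        .build()))
                .withQueryOptions(new QueryOptions()
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                .build();
             Session session = cluster.connect()) {
            session.execute("SELECT release_version FROM system.local");
        }
    }
}
----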

=== Query Rates

Sometimes the xref:operating/metrics.adoc#table-metrics[`table metric`]
query rate metrics can help narrow down load issues, as a "small"
increase in coordinator queries per second (QPS) may correlate with a
very large increase in replica-level QPS. This most often happens with
`BATCH` writes, where a client may send a single `BATCH` query
containing 50 statements, which with 9 copies (RF=3, three datacenters)
means that every coordinator `BATCH` write turns into 450 replica
writes! This is why keeping `BATCH`es to the same partition is so
critical, otherwise you can exhaust significant CPU capacity with a
"single" query.
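
To make the write amplification concrete, the sketch below (assuming the
DataStax Java driver 3.x and a hypothetical `ks.events` table
partitioned by `sensor_id`) contrasts a single-partition `BATCH` with
the multi-partition anti-pattern described above:

[source,java]
----
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BatchShapes {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Hypothetical table:
            //   CREATE TABLE ks.events (sensor_id int, ts timeuuid, value double,
            //                           PRIMARY KEY (sensor_id, ts));
            PreparedStatement insert = session.prepare(
                "INSERT INTO ks.events (sensor_id, ts, value) VALUES (?, now(), ?)");

            // Single-partition batch: all 50 statements target sensor_id = 1, so the
            // write lands on the same RF replicas that a single row would.
            BatchStatement samePartition = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (int i = 0; i < 50; i++) {
                samePartition.add(insert.bind(1, (double) i));
            }
            session.execute(samePartition);

            // Anti-pattern: 50 different partitions. With RF=3 in three datacenters the
            // coordinator fans this out into roughly 50 * 9 = 450 replica writes.
            BatchStatement manyPartitions = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (int i = 0; i < 50; i++) {
                manyPartitions.add(insert.bind(i, (double) i));
            }
            session.execute(manyPartitions);
        }
    }
}
----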

== Next Step: Investigate the Node(s)

Once you have narrowed down the problem as much as possible (datacenter,
rack, node), log in to one of the nodes using SSH and proceed to debug
using xref:reading_logs.adoc[`logs`], xref:use_nodetool.adoc[`nodetool`], and
xref:use_tools.adoc[`os tools`].
If you are not able to log in you may still have access to `logs` and `nodetool` remotely.