= Find The Misbehaving Nodes
The first step in troubleshooting a Cassandra issue is to use error
messages, metrics, and monitoring information to determine whether the
issue lies with the clients or with the server, and, if it lies with the
server, to find the problematic nodes in the Cassandra cluster. The goal
is to determine whether this is a systemic issue (e.g. a query pattern
that affects the entire cluster) or one isolated to a subset of nodes
(e.g. neighbors holding a shared token range, or even a single node with
bad hardware).
There are many sources of information that help determine where the
problem lies. Some of the most common are mentioned below.
== Client Logs and Errors
Clients of the cluster often leave the best breadcrumbs to follow.
Perhaps client latencies or error rates have increased in a particular
datacenter (likely eliminating other datacenters' nodes as suspects), or
clients are receiving a particular kind of error code indicating a
particular kind of problem.
of problem. Troubleshooters can often rule out many failure modes just
by reading the error messages. In fact, many Cassandra error messages
include the last coordinator contacted to help operators find nodes to
start with.
Some common errors (likely culprit in parentheses), assuming the client
uses error names similar to those of the Datastax drivers, are listed
below; a sketch of handling them in client code follows the list:
* `SyntaxError` (*client*): This and other `QueryValidationException`
errors indicate that the client sent a malformed request. These are
rarely server issues and usually indicate bad queries.
* `UnavailableException` (*server*): This means that the Cassandra
coordinator node has rejected the query because it believes that
insufficient replica nodes are available. If many coordinators are
throwing this error it likely means that multiple nodes really are down
in the cluster, and you can identify them using `nodetool status`. If
only a single coordinator is throwing this error it may mean that node
has been partitioned from the rest of the cluster.
* `OperationTimedOutException` (*server*): This is the timeout most
frequently raised when clients set timeouts, and it means that the query
took longer than the _client-side_ timeout the client supplied. The
error message will include the coordinator node that was last tried,
which is usually a good starting point. This error usually indicates
either overly aggressive client timeout values or slow (high-latency)
server coordinators/replicas.
* `ReadTimeoutException` or `WriteTimeoutException` (*server*): These
are raised when clients do not specify lower timeouts and a
_coordinator_ timeout occurs based on the values supplied in the
`cassandra.yaml` configuration file. They usually indicate a serious
server-side problem as the default values are typically multiple
seconds.
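
The sketch below shows what handling these errors separately might look
like with the DataStax Python driver (`cassandra-driver`); the contact
points, keyspace, query, and timeout value are placeholder assumptions,
and exception attribute names may vary by driver version.

[source,python]
----
from cassandra import (ConsistencyLevel, OperationTimedOut, ReadTimeout,
                       Unavailable, WriteTimeout)
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Placeholder contact points and schema; substitute your own cluster.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("my_keyspace")

stmt = SimpleStatement(
    "SELECT * FROM my_table WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

try:
    # A deliberately low client-side timeout (in seconds): the driver raises
    # OperationTimedOut before the server-side timeouts in cassandra.yaml fire.
    rows = session.execute(stmt, ("some-id",), timeout=2.0)
except Unavailable as exc:
    # The coordinator believed too few replicas were alive; check nodetool status.
    print(f"Unavailable: required {exc.required_replicas}, alive {exc.alive_replicas}")
except (ReadTimeout, WriteTimeout) as exc:
    # Coordinator-side timeout from cassandra.yaml values; usually a serious server issue.
    print(f"Coordinator timeout at consistency {exc.consistency}: {exc}")
except OperationTimedOut as exc:
    # Client-side timeout; the last coordinator tried is a good node to start with.
    print(f"Client timeout, last coordinator tried: {exc.last_host}")
----
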
== Metrics
If you have Cassandra xref:operating/metrics.adoc[`metrics`] reporting to a
centralized location such as https://graphiteapp.org/[Graphite] or
https://grafana.com/[Grafana] you can typically use those to narrow down
the problem. At this stage narrowing down the issue to a particular
datacenter, rack, or even group of nodes is the main goal. Some helpful
metrics to look at are:
=== Errors
Cassandra refers to internode messaging errors as "drops", and provides
a number of xref:operating/metrics.adoc#droppedmessage-metrics[`Dropped Message Metrics`] to help narrow
down errors. If particular nodes are actively dropping messages, they
are likely related to the issue.
=== Latency
For timeouts or latency-related issues you can start with xref:operating/metrics.adoc#table-metrics[`table metrics`]
by comparing Coordinator level metrics e.g.
`CoordinatorReadLatency` or `CoordinatorWriteLatency` with their
associated replica metrics e.g. `ReadLatency` or `WriteLatency`. Issues
usually show up on the `99th` percentile before they show up on the
`50th` percentile or the `mean`. While `maximum` coordinator latencies
are not typically very helpful due to the exponentially decaying
reservoir used internally to produce metrics, `maximum` replica
latencies that correlate with increased `99th` percentiles on
coordinators can help narrow down the problem.
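
If the metrics are reported to Graphite, a short script can pull the
coordinator and replica `99thPercentile` values per node for comparison.
This is only a sketch: the Graphite endpoint, node list, and metric path
layout (which depend on your reporter configuration, keyspace, and
table) are all assumptions.

[source,python]
----
import requests

GRAPHITE = "http://graphite.example.com/render"  # assumed Graphite endpoint

def latest_p99(metric, window="-15min"):
    """Return the most recent non-null datapoint for a Graphite metric path."""
    resp = requests.get(GRAPHITE, params={"target": metric,
                                          "from": window,
                                          "format": "json"})
    series = resp.json()
    if not series:
        return None
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    return values[-1] if values else None

# Illustrative metric paths; the prefix and node naming depend on how your
# metrics reporter is configured and which keyspace/table you care about.
for node in ["node1", "node2", "node3"]:
    coord = latest_p99(f"cassandra.{node}.ks.tbl.CoordinatorReadLatency.99thPercentile")
    replica = latest_p99(f"cassandra.{node}.ks.tbl.ReadLatency.99thPercentile")
    print(f"{node}: coordinator p99={coord} replica p99={replica}")
----
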
There are usually three main possibilities (a rough triage sketch
follows this list):
[arabic]
. Coordinator latencies are high on all nodes, but only a few nodes'
local read latencies are high. This points to slow replica nodes, and
the coordinators are just side effects. This usually happens when
clients are not token aware.
. Coordinator latencies and replica latencies increase at the same time
on a few nodes. If clients are token aware this is almost always what
happens, and it points to slow replicas of a subset of token ranges
(only part of the ring).
. Coordinator and local latencies are high on many nodes. This usually
indicates either a tipping point in the cluster capacity (too many
writes or reads per second), or a new query pattern.
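
As a rough illustration of that triage, the sketch below takes per-node
coordinator and replica p99 latencies (for example, collected as in the
previous snippet) and bins the cluster into one of the three cases; the
threshold and "majority" cutoff are arbitrary examples, not
recommendations.

[source,python]
----
def triage(coordinator_p99, replica_p99, high_ms=100.0):
    """Rough triage of per-node p99 latencies (in ms) into the three cases above.

    coordinator_p99 / replica_p99: dicts mapping node name -> latest p99 value.
    high_ms is an arbitrary example threshold, not a recommended value.
    """
    slow_coords = {n for n, v in coordinator_p99.items() if v and v > high_ms}
    slow_replicas = {n for n, v in replica_p99.items() if v and v > high_ms}
    majority = len(coordinator_p99) // 2

    if len(slow_coords) > majority and len(slow_replicas) > majority:
        return "cluster-wide: capacity tipping point or a new query pattern"
    if len(slow_coords) > majority and slow_replicas:
        return f"slow replicas {sorted(slow_replicas)}; coordinators are side-effects"
    if slow_coords and slow_coords == slow_replicas:
        return f"slow token ranges on {sorted(slow_coords)} (token-aware clients)"
    return "no obvious latency outliers at this threshold"
----
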
It's important to remember that depending on the client's load balancing
behavior and consistency levels coordinator and replica metrics may or
may not correlate. In particular, if you use `TokenAware` policies the
same node's coordinator and replica latencies will often increase
together, but if you just use normal `DCAwareRoundRobin`, coordinator
latencies can increase along with unrelated replica nodes' latencies.
For example (a driver configuration sketch follows this list):
* `TokenAware` + `LOCAL_ONE`: should always have coordinator and replica
latencies on the same node rise together
* `TokenAware` + `LOCAL_QUORUM`: should always have coordinator and
multiple replica latencies rise together in the same datacenter.
* `TokenAware` + `QUORUM`: replica latencies in other datacenters can
affect coordinator latencies.
* `DCAwareRoundRobin` + `LOCAL_ONE`: coordinator latencies and unrelated
replica nodes' latencies will rise together.
* `DCAwareRoundRobin` + `LOCAL_QUORUM`: different coordinator and
replica latencies will rise together with little correlation.
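
For reference, this is roughly how token-aware, datacenter-local routing
is configured with the DataStax Python driver using execution profiles;
the contact point and the `dc1` datacenter name are placeholders.

[source,python]
----
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# "dc1" and the contact point are placeholders for your own topology.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)
cluster = Cluster(["10.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# With token-aware routing and LOCAL_ONE (the first combination above), a query
# for a given partition is routed to a local replica of that partition, so that
# node's coordinator and replica latency metrics tend to rise and fall together.
----
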
=== Query Rates
Sometimes the xref:operating/metrics.adoc#table-metrics[`table metric`] query rate metrics can help narrow
down load issues, as a "small" increase in coordinator queries per
second (QPS) may correlate with a very large increase in replica-level
QPS. This most often happens with `BATCH` writes, where a client may
send a single `BATCH` query containing 50 statements, which, if you have
9 copies (RF=3, three datacenters), means that every coordinator `BATCH`
write turns into 450 replica writes! This is why keeping `BATCH` writes
to the same partition is so critical; otherwise you can exhaust
significant CPU capacity with a "single" query.
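
As a hedged sketch of the single-partition alternative with the DataStax
Python driver (the keyspace, table, and columns are made up), the batch
below targets one partition key, so each replica applies it as a single
mutation rather than the cross-partition fan-out described above.

[source,python]
----
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(["10.0.0.1"])           # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

insert = session.prepare(
    "INSERT INTO events (partition_id, seq, payload) VALUES (?, ?, ?)")

# Every statement targets the same partition key, so the batch is applied as
# one mutation per replica instead of fanning out across many token ranges.
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
for seq in range(50):
    batch.add(insert, ("partition-42", seq, "payload"))
session.execute(batch)
----
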
== Next Step: Investigate the Node(s)
Once you have narrowed down the problem as much as possible (datacenter,
rack, node), log in to one of the nodes using SSH and proceed to debug
using xref:reading_logs.adoc[`logs`], xref:use_nodetool.adoc[`nodetool`], and
xref:use_tools.adoc[`os tools`].
If you are not able to log in, you may still have access to `logs` and
`nodetool` remotely.