| |
| |
| <!DOCTYPE html> |
| <html class="writer-html5" lang="en" > |
| <head> |
| <meta charset="utf-8" /> |
| |
| <meta name="viewport" content="width=device-width, initial-scale=1.0" /> |
| |
| <title>Find The Misbehaving Nodes — Apache Cassandra Documentation v4.0-rc2</title> |
| |
| |
| |
| <link rel="stylesheet" href="../_static/css/theme.css" type="text/css" /> |
| <link rel="stylesheet" href="../_static/pygments.css" type="text/css" /> |
| <link rel="stylesheet" href="../_static/extra.css" type="text/css" /> |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| <!--[if lt IE 9]> |
| <script src="../_static/js/html5shiv.min.js"></script> |
| <![endif]--> |
| |
| |
| <script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script> |
| <script src="../_static/jquery.js"></script> |
| <script src="../_static/underscore.js"></script> |
| <script src="../_static/doctools.js"></script> |
| |
| <script type="text/javascript" src="../_static/js/theme.js"></script> |
| |
| |
| <link rel="index" title="Index" href="../genindex.html" /> |
| <link rel="search" title="Search" href="../search.html" /> |
| <link rel="next" title="Cassandra Logs" href="reading_logs.html" /> |
| <link rel="prev" title="Troubleshooting" href="index.html" /> |
| </head> |
| |
| <body class="wy-body-for-nav"> |
| |
| |
| <div class="wy-grid-for-nav"> |
| |
| <nav data-toggle="wy-nav-shift" class="wy-nav-side"> |
| <div class="wy-side-scroll"> |
| <div class="wy-side-nav-search" > |
| |
| |
| |
| <a href="../index.html" class="icon icon-home"> Apache Cassandra |
| |
| |
| |
| </a> |
| |
| |
| |
| |
| <div class="version"> |
| 4.0-rc2 |
| </div> |
| |
| |
| |
| |
| <div role="search"> |
| <form id="rtd-search-form" class="wy-form" action="../search.html" method="get"> |
| <input type="text" name="q" placeholder="Search docs" /> |
| <input type="hidden" name="check_keywords" value="yes" /> |
| <input type="hidden" name="area" value="default" /> |
| </form> |
| </div> |
| |
| |
| </div> |
| |
| |
| <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> |
| |
| |
| |
| |
| |
| |
| <ul class="current"> |
| <li class="toctree-l1"><a class="reference internal" href="../getting_started/index.html">Getting Started</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../new/index.html">New Features in Apache Cassandra 4.0</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../architecture/index.html">Architecture</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../cql/index.html">The Cassandra Query Language (CQL)</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../data_modeling/index.html">Data Modeling</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../configuration/index.html">Configuring Cassandra</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../operating/index.html">Operating Cassandra</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../tools/index.html">Cassandra Tools</a></li> |
| <li class="toctree-l1 current"><a class="reference internal" href="index.html">Troubleshooting</a><ul class="current"> |
| <li class="toctree-l2 current"><a class="current reference internal" href="#">Find The Misbehaving Nodes</a><ul> |
| <li class="toctree-l3"><a class="reference internal" href="#client-logs-and-errors">Client Logs and Errors</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#metrics">Metrics</a><ul> |
| <li class="toctree-l4"><a class="reference internal" href="#errors">Errors</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#latency">Latency</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#query-rates">Query Rates</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l3"><a class="reference internal" href="#next-step-investigate-the-node-s">Next Step: Investigate the Node(s)</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l2"><a class="reference internal" href="reading_logs.html">Cassandra Logs</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="use_nodetool.html">Use Nodetool</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="use_tools.html">Diving Deep, Use External Tools</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l1"><a class="reference internal" href="../development/index.html">Contributing to Cassandra</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../faq/index.html">Frequently Asked Questions</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../plugins/index.html">Third-Party Plugins</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../bugs.html">Reporting Bugs</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../contactus.html">Contact us</a></li> |
| </ul> |
| |
| |
| |
| </div> |
| |
| </div> |
| </nav> |
| |
| <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> |
| |
| |
| <nav class="wy-nav-top" aria-label="top navigation"> |
| |
| <i data-toggle="wy-nav-top" class="fa fa-bars"></i> |
| <a href="../index.html">Apache Cassandra</a> |
| |
| </nav> |
| |
| |
| <div class="wy-nav-content"> |
| |
| <div class="rst-content"> |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| <div role="navigation" aria-label="breadcrumbs navigation"> |
| |
| <ul class="wy-breadcrumbs"> |
| |
| <li><a href="../index.html" class="icon icon-home"></a> »</li> |
| |
| <li><a href="index.html">Troubleshooting</a> »</li> |
| |
| <li>Find The Misbehaving Nodes</li> |
| |
| |
| <li class="wy-breadcrumbs-aside"> |
| |
| |
| <a href="../_sources/troubleshooting/finding_nodes.rst.txt" rel="nofollow"> View page source</a> |
| |
| |
| </li> |
| |
| </ul> |
| |
| |
| <hr/> |
| </div> |
| <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> |
| <div itemprop="articleBody"> |
| |
| <div class="section" id="find-the-misbehaving-nodes"> |
| <h1>Find The Misbehaving Nodes<a class="headerlink" href="#find-the-misbehaving-nodes" title="Permalink to this headline">¶</a></h1> |
| <p>The first step to troubleshooting a Cassandra issue is to use error messages, |
| metrics and monitoring information to identify if the issue lies with the |
| clients or the server and if it does lie with the server find the problematic |
| nodes in the Cassandra cluster. The goal is to determine if this is a systemic |
| issue (e.g. a query pattern that affects the entire cluster) or isolated to a |
| subset of nodes (e.g. neighbors holding a shared token range or even a single |
| node with bad hardware).</p> |
| <p>There are many sources of information that help determine where the problem |
| lies. Some of the most common are mentioned below.</p> |
| <div class="section" id="client-logs-and-errors"> |
| <h2>Client Logs and Errors<a class="headerlink" href="#client-logs-and-errors" title="Permalink to this headline">¶</a></h2> |
| <p>Clients of the cluster often leave the best breadcrumbs to follow. Perhaps |
| client latencies or error rates have increased in a particular datacenter |
| (likely eliminating other datacenter’s nodes), or clients are receiving a |
| particular kind of error code indicating a particular kind of problem. |
| Troubleshooters can often rule out many failure modes just by reading the error |
| messages. In fact, many Cassandra error messages include the last coordinator |
| contacted to help operators find nodes to start with.</p> |
| <p>Some common errors (likely culprit in parenthesis) assuming the client has |
| similar error names as the Datastax <a class="reference internal" href="../getting_started/drivers.html#client-drivers"><span class="std std-ref">drivers</span></a>:</p> |
| <ul class="simple"> |
| <li><p><code class="docutils literal notranslate"><span class="pre">SyntaxError</span></code> (<strong>client</strong>). This and other <code class="docutils literal notranslate"><span class="pre">QueryValidationException</span></code> |
| indicate that the client sent a malformed request. These are rarely server |
| issues and usually indicate bad queries.</p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">UnavailableException</span></code> (<strong>server</strong>): This means that the Cassandra |
| coordinator node has rejected the query as it believes that insufficent |
| replica nodes are available. If many coordinators are throwing this error it |
| likely means that there really are (typically) multiple nodes down in the |
| cluster and you can identify them using <a class="reference internal" href="use_nodetool.html#nodetool-status"><span class="std std-ref">nodetool status</span></a> If only a single coordinator is throwing this error it may |
| mean that node has been partitioned from the rest.</p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">OperationTimedOutException</span></code> (<strong>server</strong>): This is the most frequent |
| timeout message raised when clients set timeouts and means that the query |
| took longer than the supplied timeout. This is a <em>client side</em> timeout |
| meaning that it took longer than the client specified timeout. The error |
| message will include the coordinator node that was last tried which is |
| usually a good starting point. This error usually indicates either |
| aggressive client timeout values or latent server coordinators/replicas.</p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">ReadTimeoutException</span></code> or <code class="docutils literal notranslate"><span class="pre">WriteTimeoutException</span></code> (<strong>server</strong>): These |
| are raised when clients do not specify lower timeouts and there is a |
| <em>coordinator</em> timeouts based on the values supplied in the <code class="docutils literal notranslate"><span class="pre">cassandra.yaml</span></code> |
| configuration file. They usually indicate a serious server side problem as |
| the default values are usually multiple seconds.</p></li> |
| </ul> |
| </div> |
| <div class="section" id="metrics"> |
| <h2>Metrics<a class="headerlink" href="#metrics" title="Permalink to this headline">¶</a></h2> |
| <p>If you have Cassandra <a class="reference internal" href="../operating/metrics.html#monitoring-metrics"><span class="std std-ref">metrics</span></a> reporting to a |
| centralized location such as <a class="reference external" href="https://graphiteapp.org/">Graphite</a> or |
| <a class="reference external" href="https://grafana.com/">Grafana</a> you can typically use those to narrow down |
| the problem. At this stage narrowing down the issue to a particular |
| datacenter, rack, or even group of nodes is the main goal. Some helpful metrics |
| to look at are:</p> |
| <div class="section" id="errors"> |
| <h3>Errors<a class="headerlink" href="#errors" title="Permalink to this headline">¶</a></h3> |
| <p>Cassandra refers to internode messaging errors as “drops”, and provided a |
| number of <a class="reference internal" href="../operating/metrics.html#dropped-metrics"><span class="std std-ref">Dropped Message Metrics</span></a> to help narrow |
| down errors. If particular nodes are dropping messages actively, they are |
| likely related to the issue.</p> |
| </div> |
| <div class="section" id="latency"> |
| <h3>Latency<a class="headerlink" href="#latency" title="Permalink to this headline">¶</a></h3> |
| <p>For timeouts or latency related issues you can start with <a class="reference internal" href="../operating/metrics.html#table-metrics"><span class="std std-ref">Table |
| Metrics</span></a> by comparing Coordinator level metrics e.g. |
| <code class="docutils literal notranslate"><span class="pre">CoordinatorReadLatency</span></code> or <code class="docutils literal notranslate"><span class="pre">CoordinatorWriteLatency</span></code> with their associated |
| replica metrics e.g. <code class="docutils literal notranslate"><span class="pre">ReadLatency</span></code> or <code class="docutils literal notranslate"><span class="pre">WriteLatency</span></code>. Issues usually show |
| up on the <code class="docutils literal notranslate"><span class="pre">99th</span></code> percentile before they show up on the <code class="docutils literal notranslate"><span class="pre">50th</span></code> percentile or |
| the <code class="docutils literal notranslate"><span class="pre">mean</span></code>. While <code class="docutils literal notranslate"><span class="pre">maximum</span></code> coordinator latencies are not typically very |
| helpful due to the exponentially decaying reservoir used internally to produce |
| metrics, <code class="docutils literal notranslate"><span class="pre">maximum</span></code> replica latencies that correlate with increased <code class="docutils literal notranslate"><span class="pre">99th</span></code> |
| percentiles on coordinators can help narrow down the problem.</p> |
| <p>There are usually three main possibilities:</p> |
| <ol class="arabic simple"> |
| <li><p>Coordinator latencies are high on all nodes, but only a few node’s local |
| read latencies are high. This points to slow replica nodes and the |
| coordinator’s are just side-effects. This usually happens when clients are |
| not token aware.</p></li> |
| <li><p>Coordinator latencies and replica latencies increase at the |
| same time on the a few nodes. If clients are token aware this is almost |
| always what happens and points to slow replicas of a subset of token |
| ranges (only part of the ring).</p></li> |
| <li><p>Coordinator and local latencies are high on many nodes. This usually |
| indicates either a tipping point in the cluster capacity (too many writes or |
| reads per second), or a new query pattern.</p></li> |
| </ol> |
| <p>It’s important to remember that depending on the client’s load balancing |
| behavior and consistency levels coordinator and replica metrics may or may |
| not correlate. In particular if you use <code class="docutils literal notranslate"><span class="pre">TokenAware</span></code> policies the same |
| node’s coordinator and replica latencies will often increase together, but if |
| you just use normal <code class="docutils literal notranslate"><span class="pre">DCAwareRoundRobin</span></code> coordinator latencies can increase |
| with unrelated replica node’s latencies. For example:</p> |
| <ul class="simple"> |
| <li><p><code class="docutils literal notranslate"><span class="pre">TokenAware</span></code> + <code class="docutils literal notranslate"><span class="pre">LOCAL_ONE</span></code>: should always have coordinator and replica |
| latencies on the same node rise together</p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">TokenAware</span></code> + <code class="docutils literal notranslate"><span class="pre">LOCAL_QUORUM</span></code>: should always have coordinator and |
| multiple replica latencies rise together in the same datacenter.</p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">TokenAware</span></code> + <code class="docutils literal notranslate"><span class="pre">QUORUM</span></code>: replica latencies in other datacenters can |
| affect coordinator latencies.</p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">DCAwareRoundRobin</span></code> + <code class="docutils literal notranslate"><span class="pre">LOCAL_ONE</span></code>: coordinator latencies and unrelated |
| replica node’s latencies will rise together.</p></li> |
| <li><p><code class="docutils literal notranslate"><span class="pre">DCAwareRoundRobin</span></code> + <code class="docutils literal notranslate"><span class="pre">LOCAL_QUORUM</span></code>: different coordinator and replica |
| latencies will rise together with little correlation.</p></li> |
| </ul> |
| </div> |
| <div class="section" id="query-rates"> |
| <h3>Query Rates<a class="headerlink" href="#query-rates" title="Permalink to this headline">¶</a></h3> |
| <p>Sometimes the <a class="reference internal" href="../operating/metrics.html#table-metrics"><span class="std std-ref">Table</span></a> query rate metrics can help |
| narrow down load issues as “small” increase in coordinator queries per second |
| (QPS) may correlate with a very large increase in replica level QPS. This most |
| often happens with <code class="docutils literal notranslate"><span class="pre">BATCH</span></code> writes, where a client may send a single <code class="docutils literal notranslate"><span class="pre">BATCH</span></code> |
| query that might contain 50 statements in it, which if you have 9 copies (RF=3, |
| three datacenters) means that every coordinator <code class="docutils literal notranslate"><span class="pre">BATCH</span></code> write turns into 450 |
| replica writes! This is why keeping <code class="docutils literal notranslate"><span class="pre">BATCH</span></code>’s to the same partition is so |
| critical, otherwise you can exhaust significant CPU capacitity with a “single” |
| query.</p> |
| </div> |
| </div> |
| <div class="section" id="next-step-investigate-the-node-s"> |
| <h2>Next Step: Investigate the Node(s)<a class="headerlink" href="#next-step-investigate-the-node-s" title="Permalink to this headline">¶</a></h2> |
| <p>Once you have narrowed down the problem as much as possible (datacenter, rack |
| , node), login to one of the nodes using SSH and proceed to debug using |
| <a class="reference internal" href="reading_logs.html#reading-logs"><span class="std std-ref">logs</span></a>, <a class="reference internal" href="use_nodetool.html#use-nodetool"><span class="std std-ref">nodetool</span></a>, and |
| <a class="reference internal" href="use_tools.html#use-os-tools"><span class="std std-ref">os tools</span></a>. If you are not able to login you may still |
| have access to <a class="reference internal" href="reading_logs.html#reading-logs"><span class="std std-ref">logs</span></a> and <a class="reference internal" href="use_nodetool.html#use-nodetool"><span class="std std-ref">nodetool</span></a> |
| remotely.</p> |
| </div> |
| </div> |
| |
| |
| </div> |
| |
| </div> |
| <footer> |
| <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation"> |
| <a href="reading_logs.html" class="btn btn-neutral float-right" title="Cassandra Logs" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a> |
| <a href="index.html" class="btn btn-neutral float-left" title="Troubleshooting" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a> |
| </div> |
| |
| <hr/> |
| |
| <div role="contentinfo"> |
| <p> |
| © Copyright 2020, The Apache Cassandra team. |
| |
| </p> |
| </div> |
| |
| |
| |
| Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a |
| |
| <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a> |
| |
| provided by <a href="https://readthedocs.org">Read the Docs</a>. |
| |
| </footer> |
| </div> |
| </div> |
| |
| </section> |
| |
| </div> |
| |
| |
| <script type="text/javascript"> |
| jQuery(function () { |
| SphinxRtdTheme.Navigation.enable(true); |
| }); |
| </script> |
| |
| |
| |
| |
| |
| |
| </body> |
| </html> |