| |
| |
| <!DOCTYPE html> |
| <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]--> |
| <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]--> |
| <head> |
| <meta charset="utf-8"> |
| |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| |
| <title>Compaction — Apache Cassandra Documentation v3.11.11</title> |
| |
| |
| |
| |
| |
| |
| |
| |
| <script type="text/javascript" src="../_static/js/modernizr.min.js"></script> |
| |
| |
| <script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script> |
| <script type="text/javascript" src="../_static/jquery.js"></script> |
| <script type="text/javascript" src="../_static/underscore.js"></script> |
| <script type="text/javascript" src="../_static/doctools.js"></script> |
| <script type="text/javascript" src="../_static/language_data.js"></script> |
| |
| <script type="text/javascript" src="../_static/js/theme.js"></script> |
| |
| |
| |
| |
| <link rel="stylesheet" href="../_static/css/theme.css" type="text/css" /> |
| <link rel="stylesheet" href="../_static/pygments.css" type="text/css" /> |
| <link rel="stylesheet" href="../_static/extra.css" type="text/css" /> |
| <link rel="index" title="Index" href="../genindex.html" /> |
| <link rel="search" title="Search" href="../search.html" /> |
| <link rel="next" title="Bloom Filters" href="bloom_filters.html" /> |
| <link rel="prev" title="Hints" href="hints.html" /> |
| </head> |
| |
| <body class="wy-body-for-nav"> |
| |
| |
| <div class="wy-grid-for-nav"> |
| |
| <nav data-toggle="wy-nav-shift" class="wy-nav-side"> |
| <div class="wy-side-scroll"> |
| <div class="wy-side-nav-search" > |
| |
| |
| |
| <a href="../index.html" class="icon icon-home"> Apache Cassandra |
| |
| |
| |
| </a> |
| |
| |
| |
| |
| <div class="version"> |
| 3.11.11 |
| </div> |
| |
| |
| |
| |
| <div role="search"> |
| <form id="rtd-search-form" class="wy-form" action="../search.html" method="get"> |
| <input type="text" name="q" placeholder="Search docs" /> |
| <input type="hidden" name="check_keywords" value="yes" /> |
| <input type="hidden" name="area" value="default" /> |
| </form> |
| </div> |
| |
| |
| </div> |
| |
| <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> |
| |
| |
| |
| |
| |
| |
| <ul class="current"> |
| <li class="toctree-l1"><a class="reference internal" href="../getting_started/index.html">Getting Started</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../architecture/index.html">Architecture</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../data_modeling/index.html">Data Modeling</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../cql/index.html">The Cassandra Query Language (CQL)</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../configuration/index.html">Configuring Cassandra</a></li> |
| <li class="toctree-l1 current"><a class="reference internal" href="index.html">Operating Cassandra</a><ul class="current"> |
| <li class="toctree-l2"><a class="reference internal" href="snitch.html">Snitch</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="topo_changes.html">Adding, replacing, moving and removing nodes</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="repair.html">Repair</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="read_repair.html">Read repair</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="hints.html">Hints</a></li> |
| <li class="toctree-l2 current"><a class="current reference internal" href="#">Compaction</a><ul> |
| <li class="toctree-l3"><a class="reference internal" href="#types-of-compaction">Types of compaction</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#when-is-a-minor-compaction-triggered">When is a minor compaction triggered?</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#merging-sstables">Merging sstables</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#tombstones-and-garbage-collection-gc-grace">Tombstones and Garbage Collection (GC) Grace</a><ul> |
| <li class="toctree-l4"><a class="reference internal" href="#why-tombstones">Why Tombstones</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#deletes-without-tombstones">Deletes without tombstones</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#deletes-with-tombstones">Deletes with Tombstones</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#the-gc-grace-seconds-parameter-and-tombstone-removal">The gc_grace_seconds parameter and Tombstone Removal</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l3"><a class="reference internal" href="#ttl">TTL</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#fully-expired-sstables">Fully expired sstables</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#repaired-unrepaired-data">Repaired/unrepaired data</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#data-directories">Data directories</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#single-sstable-tombstone-compaction">Single sstable tombstone compaction</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#common-options">Common options</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#compaction-nodetool-commands">Compaction nodetool commands</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#switching-the-compaction-strategy-and-options-using-jmx">Switching the compaction strategy and options using JMX</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#more-detailed-compaction-logging">More detailed compaction logging</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#size-tiered-compaction-strategy">Size Tiered Compaction Strategy</a><ul> |
| <li class="toctree-l4"><a class="reference internal" href="#major-compaction">Major compaction</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#stcs-options">STCS options</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#defragmentation">Defragmentation</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l3"><a class="reference internal" href="#leveled-compaction-strategy">Leveled Compaction Strategy</a><ul> |
| <li class="toctree-l4"><a class="reference internal" href="#id3">Major compaction</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#bootstrapping">Bootstrapping</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#stcs-in-l0">STCS in L0</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#starved-sstables">Starved sstables</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#lcs-options">LCS options</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l3"><a class="reference internal" href="#time-window-compactionstrategy">Time Window CompactionStrategy</a><ul> |
| <li class="toctree-l4"><a class="reference internal" href="#timewindowcompactionstrategy-operational-concerns">TimeWindowCompactionStrategy Operational Concerns</a></li> |
| <li class="toctree-l4"><a class="reference internal" href="#changing-timewindowcompactionstrategy-options">Changing TimeWindowCompactionStrategy Options</a></li> |
| </ul> |
| </li> |
| </ul> |
| </li> |
| <li class="toctree-l2"><a class="reference internal" href="bloom_filters.html">Bloom Filters</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="compression.html">Compression</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="cdc.html">Change Data Capture</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="backups.html">Backups</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="bulk_loading.html">Bulk Loading</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="metrics.html">Monitoring</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="security.html">Security</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="hardware.html">Hardware Choices</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l1"><a class="reference internal" href="../tools/index.html">Cassandra Tools</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../troubleshooting/index.html">Troubleshooting</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../development/index.html">Cassandra Development</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../faq/index.html">Frequently Asked Questions</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../bugs.html">Reporting Bugs and Contributing</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../contactus.html">Contact us</a></li> |
| </ul> |
| |
| |
| |
| </div> |
| </div> |
| </nav> |
| |
| <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> |
| |
| |
| <nav class="wy-nav-top" aria-label="top navigation"> |
| |
| <i data-toggle="wy-nav-top" class="fa fa-bars"></i> |
| <a href="../index.html">Apache Cassandra</a> |
| |
| </nav> |
| |
| |
| <div class="wy-nav-content"> |
| |
| <div class="rst-content"> |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> |
| <div itemprop="articleBody"> |
| |
| <div class="section" id="compaction"> |
| <span id="id1"></span><h1>Compaction<a class="headerlink" href="#compaction" title="Permalink to this headline">¶</a></h1> |
| <div class="section" id="types-of-compaction"> |
| <h2>Types of compaction<a class="headerlink" href="#types-of-compaction" title="Permalink to this headline">¶</a></h2> |
| <p>In Cassandra, the term compaction covers several kinds of operations. What they have in common is that each takes
one or more sstables and outputs new sstables. The types of compaction are:</p> |
| <dl class="simple"> |
| <dt>Minor compaction</dt><dd><p>triggered automatically in Cassandra.</p> |
| </dd> |
| <dt>Major compaction</dt><dd><p>a user executes a compaction over all sstables on the node.</p> |
| </dd> |
| <dt>User defined compaction</dt><dd><p>a user triggers a compaction on a given set of sstables.</p> |
| </dd> |
| <dt>Scrub</dt><dd><p>try to fix any broken sstables. This can actually remove valid data if that data is corrupted. If that happens, you
will need to run a full repair on the node.</p> |
| </dd> |
| <dt>Upgradesstables</dt><dd><p>upgrade sstables to the latest version. Run this after upgrading to a new major version.</p> |
| </dd> |
| <dt>Cleanup</dt><dd><p>remove any ranges this node does not own anymore, typically triggered on neighbouring nodes after a node has been |
| bootstrapped since that node will take ownership of some ranges from those nodes.</p> |
| </dd> |
| <dt>Secondary index rebuild</dt><dd><p>rebuild the secondary indexes on the node.</p> |
| </dd> |
| <dt>Anticompaction</dt><dd><p>after repair the ranges that were actually repaired are split out of the sstables that existed when repair started.</p> |
| </dd> |
| <dt>Sub range compaction</dt><dd><p>It is possible to compact only a given sub range. This can be useful if you know a token that has been
misbehaving, either gathering many updates or many deletes. <code class="docutils literal notranslate"><span class="pre">nodetool</span> <span class="pre">compact</span> <span class="pre">-st</span> <span class="pre">x</span> <span class="pre">-et</span> <span class="pre">y</span></code> will pick
all sstables containing the range between x and y and issue a compaction for those sstables. For STCS this will
most likely include all sstables, but with LCS it can issue the compaction for a subset of the sstables. With LCS
the resulting sstable will end up in L0.</p> |
| </dd> |
| </dl> |
| </div> |
| <div class="section" id="when-is-a-minor-compaction-triggered"> |
| <h2>When is a minor compaction triggered?<a class="headerlink" href="#when-is-a-minor-compaction-triggered" title="Permalink to this headline">¶</a></h2> |
| <ol class="arabic simple"> |
| <li><p>When an sstable is added to the node through flushing, streaming, etc.</p></li> |
| <li><p>When autocompaction is enabled after being disabled (<code class="docutils literal notranslate"><span class="pre">nodetool</span> <span class="pre">enableautocompaction</span></code>).</p></li> |
| <li><p>When compaction adds new sstables.</p></li> |
| <li><p>A check for new minor compactions runs every 5 minutes.</p></li> |
| </ol> |
| </div> |
| <div class="section" id="merging-sstables"> |
| <h2>Merging sstables<a class="headerlink" href="#merging-sstables" title="Permalink to this headline">¶</a></h2> |
| <p>Compaction is about merging sstables. Since partitions in sstables are sorted based on the hash of the partition key, it
is possible to efficiently merge separate sstables. The content of each partition is also sorted, so each partition can be
merged efficiently.</p> |
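| <p>The merge described above can be sketched in a few lines. This is only an illustration, not Cassandra's implementation: each sstable is modelled as a list of hypothetical <code class="docutils literal notranslate"><span class="pre">(token,</span> <span class="pre">timestamp,</span> <span class="pre">value)</span></code> tuples already sorted by token, and the newest timestamp is assumed to win.</p> |

```python
import heapq
from itertools import groupby


def merge_sstables(*sstables):
    """Merge sorted runs of (token, timestamp, value) rows.

    Each input list stands in for one sstable and must already be sorted
    by token. For rows sharing a token, the newest timestamp wins.
    """
    # heapq.merge streams the inputs in token order without re-sorting them.
    merged = heapq.merge(*sstables, key=lambda row: row[0])
    result = []
    for _token, rows in groupby(merged, key=lambda row: row[0]):
        result.append(max(rows, key=lambda row: row[1]))
    return result


sstable_a = [(10, 1, "a-old"), (42, 5, "b-new")]
sstable_b = [(10, 3, "a-new"), (37, 2, "c")]
compacted = merge_sstables(sstable_a, sstable_b)
print(compacted)
```

| <p>Because every input is already sorted, the merge only ever needs to look at the current head of each input, which is what makes merging separate sstables cheap.</p> |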
| </div> |
| <div class="section" id="tombstones-and-garbage-collection-gc-grace"> |
| <h2>Tombstones and Garbage Collection (GC) Grace<a class="headerlink" href="#tombstones-and-garbage-collection-gc-grace" title="Permalink to this headline">¶</a></h2> |
| <div class="section" id="why-tombstones"> |
| <h3>Why Tombstones<a class="headerlink" href="#why-tombstones" title="Permalink to this headline">¶</a></h3> |
| <p>When a delete request is received by Cassandra it does not actually remove the data from the underlying store. Instead |
| it writes a special piece of data known as a tombstone. The Tombstone represents the delete and causes all values which |
| occurred before the tombstone to not appear in queries to the database. This approach is used instead of removing values |
| because of the distributed nature of Cassandra.</p> |
| </div> |
| <div class="section" id="deletes-without-tombstones"> |
| <h3>Deletes without tombstones<a class="headerlink" href="#deletes-without-tombstones" title="Permalink to this headline">¶</a></h3> |
| <p>Imagine a three node cluster which has the value [A] replicated to every node:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A], [A], [A] |
| </pre></div> |
| </div> |
| <p>If one of the nodes fails and our delete operation only removes existing values, we can end up with a cluster that
looks like:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[], [], [A] |
| </pre></div> |
| </div> |
| <p>Then a repair operation would copy the value [A] back onto the two
nodes which are missing the value:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A], [A], [A] |
| </pre></div> |
| </div> |
| <p>This would cause our data to be resurrected even though it had been |
| deleted.</p> |
| </div> |
| <div class="section" id="deletes-with-tombstones"> |
| <h3>Deletes with Tombstones<a class="headerlink" href="#deletes-with-tombstones" title="Permalink to this headline">¶</a></h3> |
| <p>Starting again with a three node cluster which has the value [A] replicated to every node:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A], [A], [A] |
| </pre></div> |
| </div> |
| <p>If instead of removing data we add a tombstone record, our single node failure situation will look like this:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A, Tombstone[A]], [A, Tombstone[A]], [A] |
| </pre></div> |
| </div> |
| <p>Now when we issue a repair, the tombstone will be copied to the replica, rather than the deleted data being
resurrected:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A, Tombstone[A]], [A, Tombstone[A]], [A, Tombstone[A]] |
| </pre></div> |
| </div> |
| <p>Our repair operation will correctly bring the system to the state we expect, with the record [A] marked as deleted
on all nodes. This does mean we will end up accruing tombstones, which would consume disk space permanently. To avoid
keeping tombstones forever we have a parameter known as <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> for every table in Cassandra.</p> |
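| <p>The two scenarios can be replayed with a toy model. The sketch below is an assumption-laden simplification: each replica is a dictionary that keeps only the newest <code class="docutils literal notranslate"><span class="pre">(timestamp,</span> <span class="pre">value)</span></code> entry per key, and the hypothetical <code class="docutils literal notranslate"><span class="pre">repair</span></code> function simply copies the newest entry to every replica.</p> |

```python
TOMBSTONE = object()  # marker standing in for a tombstone


def repair(replicas):
    """Toy repair: copy the newest (timestamp, value) entry for each key to every replica."""
    keys = set().union(*(replica.keys() for replica in replicas))
    for key in keys:
        newest = max(
            (replica[key] for replica in replicas if key in replica),
            key=lambda entry: entry[0],
        )
        for replica in replicas:
            replica[key] = newest


def visible(replica, key):
    """A key is visible if it has an entry and that entry is not a tombstone."""
    entry = replica.get(key)
    return entry is not None and entry[1] is not TOMBSTONE


# Delete by removal: two nodes dropped A, the failed node still holds it.
without_tombstones = [{}, {}, {"A": (1, "value")}]
repair(without_tombstones)

# Delete by tombstone: two nodes hold a newer tombstone for A.
with_tombstones = [{"A": (2, TOMBSTONE)}, {"A": (2, TOMBSTONE)}, {"A": (1, "value")}]
repair(with_tombstones)

resurrected = all(visible(replica, "A") for replica in without_tombstones)
still_deleted = not any(visible(replica, "A") for replica in with_tombstones)
print(resurrected, still_deleted)
```

| <p>In the first case the repair resurrects A on every node; in the second the newer tombstone wins and A stays deleted everywhere.</p> |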
| </div> |
| <div class="section" id="the-gc-grace-seconds-parameter-and-tombstone-removal"> |
| <h3>The gc_grace_seconds parameter and Tombstone Removal<a class="headerlink" href="#the-gc-grace-seconds-parameter-and-tombstone-removal" title="Permalink to this headline">¶</a></h3> |
| <p>The table level <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> parameter controls how long Cassandra will retain tombstones through compaction
events before finally removing them. This duration should directly reflect the amount of time a user expects to allow
before recovering a failed node. After <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> has expired the tombstone may be removed (meaning there will
no longer be any record that a certain piece of data was deleted). However, because a tombstone can live in one sstable and the
data it covers in another, a compaction must also include both sstables for the tombstone to be removed. More precisely, to
be able to drop an actual tombstone the following needs to be true:</p> |
| <ul class="simple"> |
| <li><p>The tombstone must be older than <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code></p></li> |
| <li><p>If partition X contains the tombstone, the sstable containing the tombstone plus all sstables that contain data
for X older than the tombstone must be included in the same compaction. We don’t need to care if the partition is in
an sstable if we can guarantee that all data in that sstable is newer than the tombstone. If the tombstone is older
than the data it cannot shadow that data.</p></li> |
| <li><p>If the option <code class="docutils literal notranslate"><span class="pre">only_purge_repaired_tombstones</span></code> is enabled, tombstones are only removed if the data has also been |
| repaired.</p></li> |
| </ul> |
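| <p>The conditions listed above can be expressed as a small predicate. This is a hedged sketch rather than Cassandra's actual purge logic: the <code class="docutils literal notranslate"><span class="pre">SSTable</span></code> class, its fields and the function name are hypothetical, and all times are plain seconds.</p> |

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    name: str
    min_timestamp: int      # oldest write in this sstable, in seconds
    has_partition: bool     # whether the sstable contains partition X
    repaired: bool = False


def can_drop_tombstone(tombstone_time, holder, others, compacting, now,
                       gc_grace_seconds=864000,
                       only_purge_repaired_tombstones=False):
    """Return True if all the listed conditions hold for a tombstone in `holder`."""
    # 1. The tombstone must be older than gc_grace_seconds.
    if now - tombstone_time <= gc_grace_seconds:
        return False
    # 2. An sstable left out of the compaction blocks the drop if it holds
    #    partition X and may contain data older than the tombstone.
    for sstable in others:
        if sstable.name in compacting:
            continue
        if sstable.has_partition and sstable.min_timestamp < tombstone_time:
            return False
    # 3. Optionally require that the tombstone's sstable has been repaired.
    if only_purge_repaired_tombstones and not holder.repaired:
        return False
    return True


holder = SSTable("s1", min_timestamp=1_000_000, has_partition=True)
older = SSTable("s2", min_timestamp=500_000, has_partition=True)
newer = SSTable("s3", min_timestamp=1_500_000, has_partition=True)

args = dict(tombstone_time=1_000_000, holder=holder, others=[older, newer], now=2_000_000)
blocked = can_drop_tombstone(compacting={"s1"}, **args)
allowed = can_drop_tombstone(compacting={"s1", "s2"}, **args)
unrepaired = can_drop_tombstone(compacting={"s1", "s2"},
                                only_purge_repaired_tombstones=True, **args)
print(blocked, allowed, unrepaired)
```

| <p>Here the first call is blocked because the older sstable is left out of the compaction, the second succeeds because the only excluded sstable holds strictly newer data, and the third fails because the tombstone's sstable is unrepaired.</p> |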
| <p>If a node remains down or disconnected for longer than <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code>, its deleted data will be repaired back to
the other nodes and re-appear in the cluster. This is basically the same as in the “Deletes without tombstones” section.
Note that tombstones will not be removed until a compaction event even if <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> has elapsed.</p> |
| <p>The default value for <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> is 864000 which is equivalent to 10 days. This can be set when creating or |
| altering a table using <code class="docutils literal notranslate"><span class="pre">WITH</span> <span class="pre">gc_grace_seconds</span></code>.</p> |
| </div> |
| </div> |
| <div class="section" id="ttl"> |
| <h2>TTL<a class="headerlink" href="#ttl" title="Permalink to this headline">¶</a></h2> |
| <p>Data in Cassandra can have an additional property called time to live (TTL), which is used to automatically drop data
once it has expired. Once the TTL has expired the data is converted to a tombstone which stays around for
at least <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code>. Note that if you mix data with TTL and data without TTL (or just different lengths of
TTL), Cassandra will have a hard time dropping the tombstones created, since the partition might span many sstables and
not all are compacted at once.</p> |
| </div> |
| <div class="section" id="fully-expired-sstables"> |
| <h2>Fully expired sstables<a class="headerlink" href="#fully-expired-sstables" title="Permalink to this headline">¶</a></h2> |
| <p>If an sstable contains only tombstones and it is guaranteed that the sstable is not shadowing data in any other sstable,
compaction can drop that sstable. If you see sstables with only tombstones (note that data with a TTL is considered
a tombstone once the time to live has expired) that are not being dropped by compaction, it is likely that other
sstables contain older data. There is a tool called <code class="docutils literal notranslate"><span class="pre">sstableexpiredblockers</span></code> that will list which sstables are
droppable and which are blocking them from being dropped. This is especially useful for time series compaction with
<code class="docutils literal notranslate"><span class="pre">TimeWindowCompactionStrategy</span></code> (and the deprecated <code class="docutils literal notranslate"><span class="pre">DateTieredCompactionStrategy</span></code>).</p> |
| </div> |
| <div class="section" id="repaired-unrepaired-data"> |
| <h2>Repaired/unrepaired data<a class="headerlink" href="#repaired-unrepaired-data" title="Permalink to this headline">¶</a></h2> |
| <p>With incremental repairs Cassandra must keep track of what data is repaired and what data is unrepaired. With
anticompaction, data is split out into repaired and unrepaired sstables. To avoid mixing up the data again,
separate compaction strategy instances are run on the two sets of data, each instance only knowing about either the
repaired or the unrepaired sstables. This means that if you only run incremental repair once and then never again, you
might have very old data in the repaired sstables that blocks compaction from dropping tombstones in the unrepaired
(probably newer) sstables.</p> |
| </div> |
| <div class="section" id="data-directories"> |
| <h2>Data directories<a class="headerlink" href="#data-directories" title="Permalink to this headline">¶</a></h2> |
| <p>Since tombstones and data can live in different sstables, it is important to realize that losing an sstable might lead to
data becoming live again. The most common way of losing sstables is to have a hard drive break down. To avoid making
data live, tombstones and actual data are always in the same data directory. This way, if a disk is lost, all versions of
a partition are lost and no data can get undeleted. To achieve this, a compaction strategy instance per data directory is
run in addition to the compaction strategy instances containing repaired/unrepaired data. This means that if you have 4
data directories there will be 8 compaction strategy instances running. This has a few more benefits than just avoiding
data getting undeleted:</p> |
| <ul class="simple"> |
| <li><p>It is possible to run more compactions in parallel - leveled compaction will have several totally separate levelings |
| and each one can run compactions independently from the others.</p></li> |
| <li><p>Users can backup and restore a single data directory.</p></li> |
| <li><p>Note though that currently all data directories are considered equal, so if you have a tiny disk and a big disk
backing two data directories, the big one will be limited by the small one. One workaround is to create
more data directories backed by the big disk.</p></li> |
| </ul> |
| </div> |
| <div class="section" id="single-sstable-tombstone-compaction"> |
| <h2>Single sstable tombstone compaction<a class="headerlink" href="#single-sstable-tombstone-compaction" title="Permalink to this headline">¶</a></h2> |
| <p>When an sstable is written, a histogram of the tombstone expiry times is created. This histogram is used to find
sstables with very many tombstones and to run a single sstable compaction on such an sstable in the hope of being able to drop
tombstones in it. Before starting, Cassandra also checks how likely it is that any tombstones can actually
be dropped, based on how much this sstable overlaps with other sstables. To avoid most of these checks the
compaction option <code class="docutils literal notranslate"><span class="pre">unchecked_tombstone_compaction</span></code> can be enabled.</p> |
| </div> |
| <div class="section" id="common-options"> |
| <span id="compaction-options"></span><h2>Common options<a class="headerlink" href="#common-options" title="Permalink to this headline">¶</a></h2> |
| <p>There are a number of options common to all the compaction strategies:</p> |
| <dl class="simple"> |
| <dt><code class="docutils literal notranslate"><span class="pre">enabled</span></code> (default: true)</dt><dd><p>Whether minor compactions should run. Note that you can have ‘enabled’: true as a compaction option and then do |
| ‘nodetool enableautocompaction’ to start running compactions.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">tombstone_threshold</span></code> (default: 0.2)</dt><dd><p>How much of the sstable should be tombstones for us to consider doing a single sstable compaction of that sstable.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">tombstone_compaction_interval</span></code> (default: 86400s (1 day))</dt><dd><p>Since it might not be possible to drop any tombstones when doing a single sstable compaction we need to make sure |
| that one sstable is not constantly getting recompacted - this option states how often we should try for a given |
| sstable.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">log_all</span></code> (default: false)</dt><dd><p>New detailed compaction logging, see <a class="reference internal" href="#detailed-compaction-logging"><span class="std std-ref">below</span></a>.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">unchecked_tombstone_compaction</span></code> (default: false)</dt><dd><p>The single sstable compaction has quite strict checks for whether it should be started. This option disables those
checks, which might be needed for some use cases. Note that this does not change anything for the actual
compaction: tombstones are only dropped if it is safe to do so, so it might just rewrite an sstable without being able
to drop any tombstones.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">only_purge_repaired_tombstones</span></code> (default: false)</dt><dd><p>Option to enable the extra safety of making sure that tombstones are only dropped if the data has been repaired.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">min_threshold</span></code> (default: 4)</dt><dd><p>Lower limit of number of sstables before a compaction is triggered. Not used for <code class="docutils literal notranslate"><span class="pre">LeveledCompactionStrategy</span></code>.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">max_threshold</span></code> (default: 32)</dt><dd><p>Upper limit of number of sstables before a compaction is triggered. Not used for <code class="docutils literal notranslate"><span class="pre">LeveledCompactionStrategy</span></code>.</p> |
| </dd> |
| </dl> |
| <p>Further, see the section on each strategy for specific additional options.</p> |
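| <p>How the tombstone-related options interact can be sketched as follows. This is a deliberately simplified model, not the check Cassandra performs: the function name and its first three parameters are hypothetical, and the overlap check is reduced to a single boolean.</p> |

```python
def worth_tombstone_compaction(droppable_ratio, sstable_age_seconds,
                               overlaps_other_sstables,
                               tombstone_threshold=0.2,
                               tombstone_compaction_interval=86400,
                               unchecked_tombstone_compaction=False):
    """Decide whether a single sstable tombstone compaction is worth trying."""
    # Do not retry the same sstable more often than the configured interval.
    if sstable_age_seconds < tombstone_compaction_interval:
        return False
    # Only consider sstables where enough of the content is droppable tombstones.
    if droppable_ratio <= tombstone_threshold:
        return False
    # With the unchecked option the remaining checks are skipped.
    if unchecked_tombstone_compaction:
        return True
    # Simplification: treat any overlap as making a drop unlikely.
    return not overlaps_other_sstables


two_days = 2 * 86400
print(worth_tombstone_compaction(0.5, two_days, overlaps_other_sstables=False))
print(worth_tombstone_compaction(0.5, two_days, overlaps_other_sstables=True))
print(worth_tombstone_compaction(0.5, two_days, overlaps_other_sstables=True,
                                 unchecked_tombstone_compaction=True))
```

| <p>Even when the sketch returns true, the compaction itself still only drops tombstones that are safe to drop, so the sstable may simply be rewritten unchanged.</p> |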
| </div> |
| <div class="section" id="compaction-nodetool-commands"> |
| <h2>Compaction nodetool commands<a class="headerlink" href="#compaction-nodetool-commands" title="Permalink to this headline">¶</a></h2> |
| <p>The <a class="reference internal" href="../tools/nodetool.html#nodetool"><span class="std std-ref">nodetool</span></a> utility provides a number of commands related to compaction:</p> |
| <dl class="simple"> |
| <dt><code class="docutils literal notranslate"><span class="pre">enableautocompaction</span></code></dt><dd><p>Enable compaction.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">disableautocompaction</span></code></dt><dd><p>Disable compaction.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">setcompactionthroughput</span></code></dt><dd><p>How fast compaction should run at most - defaults to 16MB/s, but note that it is likely not possible to reach this |
| throughput.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">compactionstats</span></code></dt><dd><p>Statistics about current and pending compactions.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">compactionhistory</span></code></dt><dd><p>List details about the last compactions.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">setcompactionthreshold</span></code></dt><dd><p>Set the min/max sstable count for when to trigger compaction, defaults to 4/32.</p> |
| </dd> |
| </dl> |
| </div> |
| <div class="section" id="switching-the-compaction-strategy-and-options-using-jmx"> |
| <h2>Switching the compaction strategy and options using JMX<a class="headerlink" href="#switching-the-compaction-strategy-and-options-using-jmx" title="Permalink to this headline">¶</a></h2> |
| <p>It is possible to switch the compaction strategy and its options on just a single node using JMX. This is a great way to
experiment with settings without affecting the whole cluster. The mbean is:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>org.apache.cassandra.db:type=ColumnFamilies,keyspace=&lt;keyspace_name&gt;,columnfamily=&lt;table_name&gt;
| </pre></div> |
| </div> |
| <p>and the attribute to change is <code class="docutils literal notranslate"><span class="pre">CompactionParameters</span></code> or <code class="docutils literal notranslate"><span class="pre">CompactionParametersJson</span></code> if you use jconsole or jmc. The |
| syntax for the json version is the same as you would use in an <a class="reference internal" href="../cql/ddl.html#alter-table-statement"><span class="std std-ref">ALTER TABLE</span></a> statement - |
| for example:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>{ 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 123, 'fanout_size': 10} |
| </pre></div> |
| </div> |
| <p>The setting is kept until someone executes an <a class="reference internal" href="../cql/ddl.html#alter-table-statement"><span class="std std-ref">ALTER TABLE</span></a> that touches the compaction |
| settings or restarts the node.</p> |
| </div> |
| <div class="section" id="more-detailed-compaction-logging"> |
| <span id="detailed-compaction-logging"></span><h2>More detailed compaction logging<a class="headerlink" href="#more-detailed-compaction-logging" title="Permalink to this headline">¶</a></h2> |
| <p>Enable with the compaction option <code class="docutils literal notranslate"><span class="pre">log_all</span></code> and a more detailed compaction log file will be produced in your log |
| directory.</p> |
| </div> |
| <div class="section" id="size-tiered-compaction-strategy"> |
| <span id="stcs"></span><h2>Size Tiered Compaction Strategy<a class="headerlink" href="#size-tiered-compaction-strategy" title="Permalink to this headline">¶</a></h2> |
| <p>The basic idea of <code class="docutils literal notranslate"><span class="pre">SizeTieredCompactionStrategy</span></code> (STCS) is to merge sstables of approximately the same size. All |
| sstables are put in different buckets depending on their size. An sstable is added to a bucket if its size |
| is between <code class="docutils literal notranslate"><span class="pre">bucket_low</span></code> and <code class="docutils literal notranslate"><span class="pre">bucket_high</span></code> times the current average size of the sstables already in the bucket. This |
| will create several buckets, and the most interesting of those buckets will be compacted. The most interesting one is |
| the bucket whose sstables take the most reads.</p> |
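| <p>The bucketing rule can be illustrated with a minimal Python sketch. It is not the Cassandra implementation: the function name <code class="docutils literal notranslate"><span class="pre">bucket_sstables</span></code> is hypothetical, sizes are plain byte counts, and the choice of the most-read bucket is not modeled.</p> |

```python
def bucket_sstables(sizes, bucket_low=0.5, bucket_high=1.5,
                    min_sstable_size=50 * 1024 * 1024):
    """Group sstable sizes (in bytes) into buckets of similar size.

    Hypothetical helper for illustration only.
    """
    buckets = []  # each bucket is a list of sstable sizes
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # bucket_low and bucket_high conditions from the STCS options
            similar = bucket_low * avg < size < bucket_high * avg
            # sstables smaller than min_sstable_size share one bucket
            both_small = size < min_sstable_size and avg < min_sstable_size
            if similar or both_small:
                bucket.append(size)
                break
        else:
            # no existing bucket fits, so start a new one
            buckets.append([size])
    return buckets


MB = 1024 * 1024
example = bucket_sstables([10 * MB, 20 * MB, 100 * MB, 120 * MB, 400 * MB])
```

| <p>With the default options, the two small sstables share a bucket, the 100MB and 120MB sstables share another, and the 400MB sstable is left alone.</p> |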
| <div class="section" id="major-compaction"> |
| <h3>Major compaction<a class="headerlink" href="#major-compaction" title="Permalink to this headline">¶</a></h3> |
| <p>When running a major compaction with STCS you will end up with two sstables per data directory (one for repaired data |
| and one for unrepaired data). There is also an option (-s) to do a major compaction that splits the output into several |
| sstables. The sizes of the sstables are approximately 50%, 25%, 12.5%… of the total size.</p> |
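| <p>As a rough illustration of the 50%, 25%, 12.5%… pattern, the following Python sketch halves the remaining size at each step. The function name and the <code class="docutils literal notranslate"><span class="pre">count</span></code> parameter are assumptions made for this example; how many sstables are actually produced is not specified here.</p> |

```python
def split_sizes(total_size, count):
    """Return `count` output sizes, each half of the previous remainder.

    Hypothetical helper: only illustrates the halving pattern.
    """
    sizes = []
    remaining = total_size
    for _ in range(count):
        remaining /= 2
        sizes.append(remaining)
    return sizes


sizes = split_sizes(1000, 3)  # 50%, 25% and 12.5% of 1000
```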
| </div> |
| <div class="section" id="stcs-options"> |
| <span id="id2"></span><h3>STCS options<a class="headerlink" href="#stcs-options" title="Permalink to this headline">¶</a></h3> |
| <dl class="simple"> |
| <dt><code class="docutils literal notranslate"><span class="pre">min_sstable_size</span></code> (default: 50MB)</dt><dd><p>Sstables smaller than this are put in the same bucket.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">bucket_low</span></code> (default: 0.5)</dt><dd><p>How much smaller than the average size of a bucket an sstable should be before not being included in the bucket. That |
| is, if <code class="docutils literal notranslate"><span class="pre">bucket_low</span> <span class="pre">*</span> <span class="pre">avg_bucket_size</span> <span class="pre"><</span> <span class="pre">sstable_size</span></code> (and the <code class="docutils literal notranslate"><span class="pre">bucket_high</span></code> condition holds, see below), then |
| the sstable is added to the bucket.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">bucket_high</span></code> (default: 1.5)</dt><dd><p>How much bigger than the average size of a bucket an sstable should be before not being included in the bucket. That |
| is, if <code class="docutils literal notranslate"><span class="pre">sstable_size</span> <span class="pre"><</span> <span class="pre">bucket_high</span> <span class="pre">*</span> <span class="pre">avg_bucket_size</span></code> (and the <code class="docutils literal notranslate"><span class="pre">bucket_low</span></code> condition holds, see above), then |
| the sstable is added to the bucket.</p> |
| </dd> |
| </dl> |
| </div> |
| <div class="section" id="defragmentation"> |
| <h3>Defragmentation<a class="headerlink" href="#defragmentation" title="Permalink to this headline">¶</a></h3> |
| <p>Defragmentation is done when many sstables are touched during a read. The result of the read is put into the memtable |
| so that the next read will not have to touch as many sstables. This can cause writes on a read-only cluster.</p> |
| </div> |
| </div> |
| <div class="section" id="leveled-compaction-strategy"> |
| <span id="lcs"></span><h2>Leveled Compaction Strategy<a class="headerlink" href="#leveled-compaction-strategy" title="Permalink to this headline">¶</a></h2> |
| <p>The idea of <code class="docutils literal notranslate"><span class="pre">LeveledCompactionStrategy</span></code> (LCS) is that all sstables are put into different levels where we guarantee |
| that no overlapping sstables are in the same level. Here, non-overlapping means that the range from the first to the last token of an sstable |
| never overlaps with the range of another sstable in that level. This means that for a SELECT we will only have to look for the partition key |
| in a single sstable per level. Each level is 10x the size of the previous one and each sstable is 160MB by default. L0 |
| is where sstables are streamed/flushed - no overlap guarantees are given here.</p> |
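| <p>The growth of level sizes can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the assumption that L1 targets <code class="docutils literal notranslate"><span class="pre">fanout_size</span></code> sstables is not stated above.</p> |

```python
def level_target_mb(level, sstable_size_in_mb=160, fanout_size=10):
    """Target size in MB of level `level` (L1 and above).

    Assumption for illustration: L1 targets fanout_size sstables, and
    each further level is fanout_size times larger than the previous one.
    """
    if level < 1:
        raise ValueError("L0 has no size target in this sketch")
    return sstable_size_in_mb * fanout_size ** level


targets = [level_target_mb(n) for n in (1, 2, 3)]
```

| <p>Under these assumptions, each target is ten times the previous one when the defaults are used.</p> |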
| <p>When picking compaction candidates we have to make sure that the compaction does not create overlap in the target level. |
| This is done by always including all overlapping sstables in the next level. For example if we select an sstable in L3, |
| we need to guarantee that we pick all overlapping sstables in L4 and make sure that no currently ongoing compactions |
| will create overlap if we start that compaction. We can start many parallel compactions in a level if we guarantee that |
| we won’t create overlap. For L0 -> L1 compactions we almost always need to include all L1 sstables since most L0 sstables |
| cover the full range. We also can’t compact all L0 sstables with all L1 sstables in a single compaction since that can |
| use too much memory.</p> |
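| <p>Selecting the overlapping sstables in the next level can be sketched in Python as below. The <code class="docutils literal notranslate"><span class="pre">SSTable</span></code> type and function names are hypothetical, tokens are simplified to integers without ring wrap-around, and ongoing compactions are not considered.</p> |

```python
from typing import List, NamedTuple


class SSTable(NamedTuple):
    """Hypothetical stand-in for an sstable's token range."""
    name: str
    first_token: int
    last_token: int


def overlaps(a: SSTable, b: SSTable) -> bool:
    # Two closed ranges overlap when each starts before the other ends.
    return a.first_token <= b.last_token and b.first_token <= a.last_token


def candidates_in_next_level(selected: SSTable,
                             next_level: List[SSTable]) -> List[SSTable]:
    """Return every sstable in the next level that overlaps `selected`."""
    return [s for s in next_level if overlaps(selected, s)]


l3 = SSTable("l3-a", 100, 200)
l4 = [
    SSTable("l4-a", 0, 99),
    SSTable("l4-b", 100, 150),
    SSTable("l4-c", 151, 250),
    SSTable("l4-d", 251, 400),
]
picked = [s.name for s in candidates_in_next_level(l3, l4)]
```

| <p>Here only the two L4 sstables whose ranges intersect the selected L3 sstable would join the compaction.</p> |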
| <p>When deciding which level to compact, LCS checks the higher levels first (with LCS, a “higher” level is one with a higher |
| number: L0 is the lowest one, L8 is the highest one), and if a level is behind, a compaction will be started |
| in that level.</p> |
| <div class="section" id="id3"> |
| <h3>Major compaction<a class="headerlink" href="#id3" title="Permalink to this headline">¶</a></h3> |
| <p>It is possible to do a major compaction with LCS - it will currently start by filling out L1 and then once L1 is full, |
| it continues with L2 etc. This is suboptimal and will change to create all the sstables in a high level instead |
| (CASSANDRA-11817).</p> |
| </div> |
| <div class="section" id="bootstrapping"> |
| <h3>Bootstrapping<a class="headerlink" href="#bootstrapping" title="Permalink to this headline">¶</a></h3> |
| <p>During bootstrap sstables are streamed from other nodes. The level of the remote sstable is kept to avoid many |
| compactions after the bootstrap is done. During bootstrap the new node also takes writes while it is streaming the data |
| from a remote node - these writes are flushed to L0 like all other writes and to avoid those sstables blocking the |
| remote sstables from going to the correct level, we only do STCS in L0 until the bootstrap is done.</p> |
| </div> |
| <div class="section" id="stcs-in-l0"> |
| <h3>STCS in L0<a class="headerlink" href="#stcs-in-l0" title="Permalink to this headline">¶</a></h3> |
| <p>If LCS gets very many L0 sstables, reads are going to hit all (or most) of the L0 sstables since they are likely to be |
| overlapping. To remedy this more quickly, LCS does STCS compactions in L0 if there are more than 32 sstables there. This |
| should improve read performance more quickly compared to letting LCS do its L0 -> L1 compactions. If you keep getting |
| too many sstables in L0 it is likely that LCS is not the best fit for your workload and STCS could work out better.</p> |
| </div> |
| <div class="section" id="starved-sstables"> |
| <h3>Starved sstables<a class="headerlink" href="#starved-sstables" title="Permalink to this headline">¶</a></h3> |
| <p>If a node ends up with a leveling where a few very high level sstables are not getting compacted, they |
| might make it impossible for lower levels to drop tombstones etc. For example, if there are sstables in L6 but there is |
| only enough data to actually fill up to L4 on the node, the leftover sstables in L6 will get starved and not compacted. This |
| can happen if a user changes sstable_size_in_mb from 5MB to 160MB, for example. To avoid this, LCS tries to include |
| those starved high level sstables in other compactions if there have been 25 compaction rounds where the highest level |
| has not been involved.</p> |
| </div> |
| <div class="section" id="lcs-options"> |
| <span id="id4"></span><h3>LCS options<a class="headerlink" href="#lcs-options" title="Permalink to this headline">¶</a></h3> |
| <dl class="simple"> |
| <dt><code class="docutils literal notranslate"><span class="pre">sstable_size_in_mb</span></code> (default: 160MB)</dt><dd><p>The target compressed (if using compression) sstable size - the sstables can end up being larger if there are very |
| large partitions on the node.</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">fanout_size</span></code> (default: 10)</dt><dd><p>The target size of levels increases by this fanout_size multiplier. You can reduce the space amplification by tuning |
| this option.</p> |
| </dd> |
| </dl> |
| <p>LCS also supports the <code class="docutils literal notranslate"><span class="pre">cassandra.disable_stcs_in_l0</span></code> startup option (<code class="docutils literal notranslate"><span class="pre">-Dcassandra.disable_stcs_in_l0=true</span></code>) to avoid |
| doing STCS in L0.</p> |
| </div> |
| </div> |
| <div class="section" id="time-window-compactionstrategy"> |
| <span id="twcs"></span><h2>Time Window CompactionStrategy<a class="headerlink" href="#time-window-compactionstrategy" title="Permalink to this headline">¶</a></h2> |
| <p><code class="docutils literal notranslate"><span class="pre">TimeWindowCompactionStrategy</span></code> (TWCS) is designed specifically for workloads where it’s beneficial to have data on |
| disk grouped by the timestamp of the data, a common goal when the workload is time-series in nature or when all data is |
| written with a TTL. In an expiring/TTL workload, the contents of an entire SSTable likely expire at approximately the |
| same time, allowing them to be dropped completely, and space reclaimed much more reliably than when using |
| <code class="docutils literal notranslate"><span class="pre">SizeTieredCompactionStrategy</span></code> or <code class="docutils literal notranslate"><span class="pre">LeveledCompactionStrategy</span></code>. The basic concept is that |
| <code class="docutils literal notranslate"><span class="pre">TimeWindowCompactionStrategy</span></code> will create a single sstable for a given window, where a window is simply calculated |
| as the combination of two primary options:</p> |
| <dl class="simple"> |
| <dt><code class="docutils literal notranslate"><span class="pre">compaction_window_unit</span></code> (default: DAYS)</dt><dd><p>A Java TimeUnit (MINUTES, HOURS, or DAYS).</p> |
| </dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">compaction_window_size</span></code> (default: 1)</dt><dd><p>The number of units that make up a window.</p> |
| </dd> |
| </dl> |
| <p>Taken together, the operator can specify windows of virtually any size, and <cite>TimeWindowCompactionStrategy</cite> will work to |
| create a single sstable for writes within that window. For efficiency during writing, the newest window will be |
| compacted using <cite>SizeTieredCompactionStrategy</cite>.</p> |
| <p>Ideally, operators should select a <code class="docutils literal notranslate"><span class="pre">compaction_window_unit</span></code> and <code class="docutils literal notranslate"><span class="pre">compaction_window_size</span></code> pair that produces |
| approximately 20-30 windows - if writing with a 90 day TTL, for example, a 3 Day window would be a reasonable choice |
| (<code class="docutils literal notranslate"><span class="pre">'compaction_window_unit':'DAYS','compaction_window_size':3</span></code>).</p> |
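| <p>The window arithmetic can be illustrated with a short Python sketch. The function names are hypothetical, and aligning windows to multiples of the window length since the epoch is an assumption made for this example.</p> |

```python
# Seconds per supported compaction_window_unit
UNIT_SECONDS = {"MINUTES": 60, "HOURS": 3600, "DAYS": 86400}


def window_start(timestamp_seconds, unit="DAYS", size=1):
    """Lower bound (in seconds) of the window containing the timestamp.

    Assumption for illustration: windows are aligned to multiples of the
    window length counted from the epoch.
    """
    window = UNIT_SECONDS[unit] * size
    return timestamp_seconds - (timestamp_seconds % window)


def window_count(ttl_days, unit, size):
    """Number of windows that data with the given TTL spans."""
    return ttl_days * UNIT_SECONDS["DAYS"] / (UNIT_SECONDS[unit] * size)


windows = window_count(90, "DAYS", 3)  # 90 day TTL with 3 day windows
```

| <p>Two writes whose timestamps map to the same <code class="docutils literal notranslate"><span class="pre">window_start</span></code> value would belong to the same window in this sketch.</p> |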
| <div class="section" id="timewindowcompactionstrategy-operational-concerns"> |
| <h3>TimeWindowCompactionStrategy Operational Concerns<a class="headerlink" href="#timewindowcompactionstrategy-operational-concerns" title="Permalink to this headline">¶</a></h3> |
| <p>The primary motivation for TWCS is to separate data on disk by timestamp and to allow fully expired SSTables to drop |
| more efficiently. One potential way this optimal behavior can be subverted is if data is written to SSTables out of |
| order, with new data and old data in the same SSTable. Out of order data can appear in two ways:</p> |
| <ul class="simple"> |
| <li><p>If the user mixes old data and new data in the traditional write path, the data will be comingled in the memtables |
| and flushed into the same SSTable, where it will remain comingled.</p></li> |
| <li><p>If the user’s read requests for old data cause read repairs that pull old data into the current memtable, that data |
| will be comingled and flushed into the same SSTable.</p></li> |
| </ul> |
| <p>While TWCS tries to minimize the impact of comingled data, users should attempt to avoid this behavior. Specifically, |
| users should avoid queries that explicitly set the timestamp via CQL <code class="docutils literal notranslate"><span class="pre">USING</span> <span class="pre">TIMESTAMP</span></code>. Additionally, users should run |
| frequent repairs (which stream data in such a way that it does not become comingled), and disable background read |
| repair by setting the table’s <code class="docutils literal notranslate"><span class="pre">read_repair_chance</span></code> and <code class="docutils literal notranslate"><span class="pre">dclocal_read_repair_chance</span></code> to 0.</p> |
| </div> |
| <div class="section" id="changing-timewindowcompactionstrategy-options"> |
| <h3>Changing TimeWindowCompactionStrategy Options<a class="headerlink" href="#changing-timewindowcompactionstrategy-options" title="Permalink to this headline">¶</a></h3> |
| <p>Operators wishing to enable <code class="docutils literal notranslate"><span class="pre">TimeWindowCompactionStrategy</span></code> on existing data should consider running a major compaction |
| first, placing all existing data into a single (old) window. Subsequent newer writes will then create typical SSTables |
| as expected.</p> |
| <p>Operators wishing to change <code class="docutils literal notranslate"><span class="pre">compaction_window_unit</span></code> or <code class="docutils literal notranslate"><span class="pre">compaction_window_size</span></code> can do so, but may trigger |
| additional compactions as adjacent windows are joined together. If the window size is decreased (for example, from 24 |
| hours to 12 hours), then the existing SSTables will not be modified - TWCS cannot split existing SSTables into multiple |
| windows.</p> |
| </div> |
| </div> |
| </div> |
| |
| |
| </div> |
| |
| </div> |
| <footer> |
| |
| |
| |
| <hr/> |
| |
| <div role="contentinfo"> |
| <p> |
| © Copyright 2016, The Apache Cassandra team |
| |
| </p> |
| </div> |
| |
| </footer> |
| |
| </div> |
| </div> |
| |
| </section> |
| |
| </div> |
| |
| |
| |
| <script type="text/javascript"> |
| jQuery(function () { |
| SphinxRtdTheme.Navigation.enable(true); |
| }); |
| </script> |
| |
| |
| |
| |
| |
| |
| </body> |
| </html> |