| --- |
| layout: docpage |
| |
| title: "Documentation" |
| |
| is_homepage: false |
| is_sphinx_doc: true |
| |
| doc-parent: "Operating Cassandra" |
| |
| doc-title: "Compaction" |
| doc-header-links: ' |
| <link rel="top" title="Apache Cassandra Documentation v4.0-alpha2" href="../index.html"/> |
| <link rel="up" title="Operating Cassandra" href="index.html"/> |
| <link rel="next" title="Bloom Filters" href="bloom_filters.html"/> |
| <link rel="prev" title="Hints" href="hints.html"/> |
| ' |
| doc-search-path: "../search.html" |
| |
| extra-footer: ' |
| <script type="text/javascript"> |
| var DOCUMENTATION_OPTIONS = { |
| URL_ROOT: "", |
| VERSION: "", |
| COLLAPSE_INDEX: false, |
| FILE_SUFFIX: ".html", |
| HAS_SOURCE: false, |
| SOURCELINK_SUFFIX: ".txt" |
| }; |
| </script> |
| ' |
| |
| --- |
| <div class="container-fluid"> |
| <div class="row"> |
| <div class="col-md-3"> |
| <div class="doc-navigation"> |
| <div class="doc-menu" role="navigation"> |
| <div class="navbar-header"> |
| <button type="button" class="pull-left navbar-toggle" data-toggle="collapse" data-target=".sidebar-navbar-collapse"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| </div> |
| <div class="navbar-collapse collapse sidebar-navbar-collapse"> |
| <form id="doc-search-form" class="navbar-form" action="../search.html" method="get" role="search"> |
| <div class="form-group"> |
| <input type="text" size="30" class="form-control input-sm" name="q" placeholder="Search docs"> |
| <input type="hidden" name="check_keywords" value="yes" /> |
| <input type="hidden" name="area" value="default" /> |
| </div> |
| </form> |
| |
| |
| |
| <ul class="current"> |
| <li class="toctree-l1"><a class="reference internal" href="../getting_started/index.html">Getting Started</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../architecture/index.html">Architecture</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../data_modeling/index.html">Data Modeling</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../cql/index.html">The Cassandra Query Language (CQL)</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../configuration/index.html">Configuring Cassandra</a></li> |
| <li class="toctree-l1 current"><a class="reference internal" href="index.html">Operating Cassandra</a><ul class="current"> |
| <li class="toctree-l2"><a class="reference internal" href="snitch.html">Snitch</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="topo_changes.html">Adding, replacing, moving and removing nodes</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="repair.html">Repair</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="read_repair.html">Read repair</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="hints.html">Hints</a></li> |
| <li class="toctree-l2 current"><a class="current reference internal" href="#">Compaction</a><ul> |
| <li class="toctree-l3"><a class="reference internal" href="#types-of-compaction">Types of compaction</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#when-is-a-minor-compaction-triggered">When is a minor compaction triggered?</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#merging-sstables">Merging sstables</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#tombstones-and-garbage-collection-gc-grace">Tombstones and Garbage Collection (GC) Grace</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#ttl">TTL</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#fully-expired-sstables">Fully expired sstables</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#repaired-unrepaired-data">Repaired/unrepaired data</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#data-directories">Data directories</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#single-sstable-tombstone-compaction">Single sstable tombstone compaction</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#common-options">Common options</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#compaction-nodetool-commands">Compaction nodetool commands</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#switching-the-compaction-strategy-and-options-using-jmx">Switching the compaction strategy and options using JMX</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#more-detailed-compaction-logging">More detailed compaction logging</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#size-tiered-compaction-strategy">Size Tiered Compaction Strategy</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#leveled-compaction-strategy">Leveled Compaction Strategy</a></li> |
| <li class="toctree-l3"><a class="reference internal" href="#time-window-compactionstrategy">Time Window CompactionStrategy</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l2"><a class="reference internal" href="bloom_filters.html">Bloom Filters</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="compression.html">Compression</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="cdc.html">Change Data Capture</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="backups.html">Backups</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="bulk_loading.html">Bulk Loading</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="metrics.html">Monitoring</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="security.html">Security</a></li> |
| <li class="toctree-l2"><a class="reference internal" href="hardware.html">Hardware Choices</a></li> |
| </ul> |
| </li> |
| <li class="toctree-l1"><a class="reference internal" href="../tools/index.html">Cassandra Tools</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../troubleshooting/index.html">Troubleshooting</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../development/index.html">Contributing to Cassandra</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../faq/index.html">Frequently Asked Questions</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../plugins/index.html">Third-Party Plugins</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../bugs.html">Reporting Bugs</a></li> |
| <li class="toctree-l1"><a class="reference internal" href="../contactus.html">Contact us</a></li> |
| </ul> |
| |
| |
| |
| </div><!--/.nav-collapse --> |
| </div> |
| </div> |
| </div> |
| <div class="col-md-8"> |
| <div class="content doc-content"> |
| <div class="content-container"> |
| |
| <div class="section" id="compaction"> |
| <span id="id1"></span><h1>Compaction<a class="headerlink" href="#compaction" title="Permalink to this headline">¶</a></h1> |
| <div class="section" id="types-of-compaction"> |
| <h2>Types of compaction<a class="headerlink" href="#types-of-compaction" title="Permalink to this headline">¶</a></h2> |
| <p>The term compaction covers several kinds of operations in Cassandra. What they have in common is that each takes |
| one or more sstables and outputs new sstables. The types of compaction are:</p> |
| <dl class="docutils"> |
| <dt>Minor compaction</dt> |
| <dd>triggered automatically in Cassandra.</dd> |
| <dt>Major compaction</dt> |
| <dd>a user executes a compaction over all sstables on the node.</dd> |
| <dt>User defined compaction</dt> |
| <dd>a user triggers a compaction on a given set of sstables.</dd> |
| <dt>Scrub</dt> |
| <dd>try to fix any broken sstables. This can actually remove valid data if that data is corrupted. If that happens, you |
| will need to run a full repair on the node.</dd> |
| <dt>Upgradesstables</dt> |
| <dd>upgrade sstables to the latest version. Run this after upgrading to a new major version.</dd> |
| <dt>Cleanup</dt> |
| <dd>remove any ranges this node does not own anymore, typically triggered on neighbouring nodes after a node has been |
| bootstrapped since that node will take ownership of some ranges from those nodes.</dd> |
| <dt>Secondary index rebuild</dt> |
| <dd>rebuild the secondary indexes on the node.</dd> |
| <dt>Anticompaction</dt> |
| <dd>after repair the ranges that were actually repaired are split out of the sstables that existed when repair started.</dd> |
| <dt>Sub range compaction</dt> |
| <dd>It is possible to compact only a given sub range. This can be useful if you know a token that has been |
| misbehaving, either gathering many updates or many deletes. <code class="docutils literal notranslate"><span class="pre">nodetool</span> <span class="pre">compact</span> <span class="pre">-st</span> <span class="pre">x</span> <span class="pre">-et</span> <span class="pre">y</span></code> will pick |
| all sstables containing the range between x and y and issue a compaction for those sstables. For STCS this will |
| most likely include all sstables, but with LCS it can issue the compaction for a subset of the sstables. With LCS |
| the resulting sstable will end up in L0.</dd> |
| </dl> |
| </div> |
| <div class="section" id="when-is-a-minor-compaction-triggered"> |
| <h2>When is a minor compaction triggered?<a class="headerlink" href="#when-is-a-minor-compaction-triggered" title="Permalink to this headline">¶</a></h2> |
| <ul class="simple"> |
| <li>When an sstable is added to the node through flushing/streaming etc.</li> |
| <li>When autocompaction is enabled after being disabled (<code class="docutils literal notranslate"><span class="pre">nodetool</span> <span class="pre">enableautocompaction</span></code>)</li> |
| <li>When compaction adds new sstables.</li> |
| <li>A check for new minor compactions every 5 minutes.</li> |
| </ul> |
| </div> |
| <div class="section" id="merging-sstables"> |
| <h2>Merging sstables<a class="headerlink" href="#merging-sstables" title="Permalink to this headline">¶</a></h2> |
| <p>Compaction is about merging sstables. Since partitions in sstables are sorted based on the hash of the partition key, it |
| is possible to efficiently merge separate sstables. The content of each partition is also sorted, so each partition can be |
| merged efficiently.</p> |
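| <p>As a rough illustration of why sorted input makes merging cheap, the following sketch models each sstable as a list of |
| (token, write timestamp, value) tuples sorted by token and merges them in a single pass. The tuple layout, the function name |
| <code class="docutils literal notranslate"><span class="pre">merge_sstables</span></code> and the "newest timestamp wins" rule are assumptions made for illustration, not Cassandra's on-disk format:</p> |

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sstables(*sstables):
    """Merge token-sorted sstables in a single pass.

    Each sstable is a list of (token, write_timestamp, value) tuples
    sorted by token (hypothetical layout, for illustration only).
    """
    # heapq.merge streams the inputs in token order without re-sorting them.
    merged = heapq.merge(*sstables, key=itemgetter(0))
    result = []
    for _token, versions in groupby(merged, key=itemgetter(0)):
        # Keep the most recently written version of each partition.
        result.append(max(versions, key=itemgetter(1)))
    return result


older = [(10, 1, "a"), (30, 1, "c")]
newer = [(10, 2, "a2"), (20, 2, "b")]
print(merge_sstables(older, newer))
```

| <p>Because every input is already sorted, the merge only ever compares the current head of each input.</p> |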
| </div> |
| <div class="section" id="tombstones-and-garbage-collection-gc-grace"> |
| <h2>Tombstones and Garbage Collection (GC) Grace<a class="headerlink" href="#tombstones-and-garbage-collection-gc-grace" title="Permalink to this headline">¶</a></h2> |
| <div class="section" id="why-tombstones"> |
| <h3>Why Tombstones<a class="headerlink" href="#why-tombstones" title="Permalink to this headline">¶</a></h3> |
| <p>When a delete request is received by Cassandra it does not actually remove the data from the underlying store. Instead |
| it writes a special piece of data known as a tombstone. The Tombstone represents the delete and causes all values which |
| occurred before the tombstone to not appear in queries to the database. This approach is used instead of removing values |
| because of the distributed nature of Cassandra.</p> |
| </div> |
| <div class="section" id="deletes-without-tombstones"> |
| <h3>Deletes without tombstones<a class="headerlink" href="#deletes-without-tombstones" title="Permalink to this headline">¶</a></h3> |
| <p>Imagine a three node cluster which has the value [A] replicated to every node:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A], [A], [A] |
| </pre></div> |
| </div> |
| <p>If one of the nodes fails and our delete operation only removes existing values, we can end up with a cluster that |
| looks like:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[], [], [A] |
| </pre></div> |
| </div> |
| <p>Then a repair operation would copy the value [A] back onto the two |
| nodes which are missing the value:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A], [A], [A] |
| </pre></div> |
| </div> |
| <p>This would cause our data to be resurrected even though it had been |
| deleted.</p> |
| </div> |
| <div class="section" id="deletes-with-tombstones"> |
| <h3>Deletes with Tombstones<a class="headerlink" href="#deletes-with-tombstones" title="Permalink to this headline">¶</a></h3> |
| <p>Starting again with a three node cluster which has the value [A] replicated to every node:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A], [A], [A] |
| </pre></div> |
| </div> |
| <p>If instead of removing data we add a tombstone record, our single node failure situation will look like this:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A, Tombstone[A]], [A, Tombstone[A]], [A] |
| </pre></div> |
| </div> |
| <p>Now when we issue a repair the tombstone will be copied to the replica, rather than the deleted data being |
| resurrected:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>[A, Tombstone[A]], [A, Tombstone[A]], [A, Tombstone[A]] |
| </pre></div> |
| </div> |
| <p>Our repair operation will correctly bring the system to the state we expect, with the record [A] marked as deleted |
| on all nodes. This does mean we will end up accruing tombstones, which would consume disk space permanently. To avoid |
| keeping tombstones forever we have a parameter known as <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> for every table in Cassandra.</p> |
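| <p>The two scenarios can be replayed with a toy model. In this sketch each replica is a dict of key to (timestamp, value), and |
| repair simply copies the newest version of every key to all replicas. The names <code class="docutils literal notranslate"><span class="pre">repair</span></code>, <code class="docutils literal notranslate"><span class="pre">visible</span></code> and <code class="docutils literal notranslate"><span class="pre">TOMBSTONE</span></code> are |
| hypothetical, and real repair is considerably more involved:</p> |

```python
TOMBSTONE = object()  # marker standing in for a delete


def repair(replicas):
    """Copy the newest version of every key to every replica.

    Each replica is a dict of key -> (timestamp, value). This is a toy
    model of repair, not Cassandra's actual repair process.
    """
    keys = set().union(*replicas)
    for key in keys:
        newest = max((r[key] for r in replicas if key in r),
                     key=lambda version: version[0])
        for r in replicas:
            r[key] = newest


def visible(replica, key):
    """Return True if a read of `key` on this replica would return data."""
    return key in replica and replica[key][1] is not TOMBSTONE


# Delete without tombstones: two replicas simply forget the value.
no_tombstones = [{}, {}, {"A": (1, "value")}]
repair(no_tombstones)
print([visible(r, "A") for r in no_tombstones])    # data is resurrected

# Delete with tombstones: two replicas record a newer delete marker.
with_tombstones = [{"A": (2, TOMBSTONE)}, {"A": (2, TOMBSTONE)}, {"A": (1, "value")}]
repair(with_tombstones)
print([visible(r, "A") for r in with_tombstones])  # the delete wins everywhere
```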
| </div> |
| <div class="section" id="the-gc-grace-seconds-parameter-and-tombstone-removal"> |
| <h3>The gc_grace_seconds parameter and Tombstone Removal<a class="headerlink" href="#the-gc-grace-seconds-parameter-and-tombstone-removal" title="Permalink to this headline">¶</a></h3> |
| <p>The table level <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> parameter controls how long Cassandra will retain tombstones through compaction |
| events before finally removing them. This duration should directly reflect the amount of time a user expects to allow |
| before recovering a failed node. After <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> has expired the tombstone may be removed (meaning there will |
| no longer be any record that a certain piece of data was deleted), but as a tombstone can live in one sstable and the |
| data it covers in another, a compaction must also include both sstables for a tombstone to be removed. More precisely, to |
| be able to drop an actual tombstone the following needs to be true:</p> |
| <ul class="simple"> |
| <li>The tombstone must be older than <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code></li> |
| <li>If partition X contains the tombstone, the sstable containing the partition plus all sstables containing data older |
| than the tombstone containing X must be included in the same compaction. We don’t need to care if the partition is in |
| an sstable if we can guarantee that all data in that sstable is newer than the tombstone. If the tombstone is older |
| than the data it cannot shadow that data.</li> |
| <li>If the option <code class="docutils literal notranslate"><span class="pre">only_purge_repaired_tombstones</span></code> is enabled, tombstones are only removed if the data has also been |
| repaired.</li> |
| </ul> |
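| <p>The conditions listed above can be sketched as a single predicate. This is a simplified model under stated assumptions: the |
| <code class="docutils literal notranslate"><span class="pre">SSTable</span></code> class, its fields and the function <code class="docutils literal notranslate"><span class="pre">can_drop_tombstone</span></code> are hypothetical names, and <code class="docutils literal notranslate"><span class="pre">others</span></code> stands for |
| the sstables that are not part of the compaction:</p> |

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    """Hypothetical, minimal view of an sstable for this sketch."""
    name: str
    partitions: frozenset   # partition keys stored in this sstable
    min_timestamp: int      # oldest write timestamp in this sstable


def can_drop_tombstone(partition, tombstone_timestamp, tombstone_age_seconds,
                       gc_grace_seconds, others,
                       repaired=True, only_purge_repaired_tombstones=False):
    """Return True if all conditions for dropping the tombstone hold."""
    # 1. The tombstone must be older than gc_grace_seconds.
    if tombstone_age_seconds <= gc_grace_seconds:
        return False
    # 2. No sstable outside the compaction may hold data for the partition
    #    that is older than the tombstone (it could still be shadowed).
    for sstable in others:
        if (partition in sstable.partitions
                and sstable.min_timestamp < tombstone_timestamp):
            return False
    # 3. Optionally require that the data has been repaired.
    if only_purge_repaired_tombstones and not repaired:
        return False
    return True


old = SSTable("old", frozenset({"X"}), min_timestamp=5)
new = SSTable("new", frozenset({"X"}), min_timestamp=50)
print(can_drop_tombstone("X", 20, 900000, 864000, others=[new]))
print(can_drop_tombstone("X", 20, 900000, 864000, others=[old]))
```

| <p>In the usage above, the sstable holding only newer data does not block the drop, while the one that may hold older data does.</p> |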
| <p>If a node remains down or disconnected for longer than <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code>, its deleted data will be repaired back to |
| the other nodes and re-appear in the cluster. This is basically the same as in the “Deletes without Tombstones” section. |
| Note that tombstones will not be removed until a compaction event even if <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> has elapsed.</p> |
| <p>The default value for <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code> is 864000 which is equivalent to 10 days. This can be set when creating or |
| altering a table using <code class="docutils literal notranslate"><span class="pre">WITH</span> <span class="pre">gc_grace_seconds</span></code>.</p> |
| </div> |
| </div> |
| <div class="section" id="ttl"> |
| <h2>TTL<a class="headerlink" href="#ttl" title="Permalink to this headline">¶</a></h2> |
| <p>Data in Cassandra can have an additional property called time to live (TTL), which is used to automatically drop data once |
| it has expired. Once the TTL has expired the data is converted to a tombstone which stays around for |
| at least <code class="docutils literal notranslate"><span class="pre">gc_grace_seconds</span></code>. Note that if you mix data with TTL and data without TTL (or just TTLs of different |
| lengths), Cassandra will have a hard time dropping the tombstones created, since the partition might span many sstables and |
| not all are compacted at once.</p> |
| </div> |
| <div class="section" id="fully-expired-sstables"> |
| <h2>Fully expired sstables<a class="headerlink" href="#fully-expired-sstables" title="Permalink to this headline">¶</a></h2> |
| <p>If an sstable contains only tombstones and it is guaranteed that the sstable is not shadowing data in any other sstable, |
| compaction can drop that sstable. If you see sstables with only tombstones (note that data with a TTL is considered a |
| tombstone once the time to live has expired) that are not being dropped by compaction, it is likely that other |
| sstables contain older data. There is a tool called <code class="docutils literal notranslate"><span class="pre">sstableexpiredblockers</span></code> that will list which sstables are |
| droppable and which are blocking them from being dropped. This is especially useful for time series compaction with |
| <code class="docutils literal notranslate"><span class="pre">TimeWindowCompactionStrategy</span></code> (and the deprecated <code class="docutils literal notranslate"><span class="pre">DateTieredCompactionStrategy</span></code>). With <code class="docutils literal notranslate"><span class="pre">TimeWindowCompactionStrategy</span></code> |
| it is possible to remove the guarantee (not check for shadowing data) by enabling <code class="docutils literal notranslate"><span class="pre">unsafe_aggressive_sstable_expiration</span></code>.</p> |
| </div> |
| <div class="section" id="repaired-unrepaired-data"> |
| <h2>Repaired/unrepaired data<a class="headerlink" href="#repaired-unrepaired-data" title="Permalink to this headline">¶</a></h2> |
| <p>With incremental repairs Cassandra must keep track of what data is repaired and what data is unrepaired. |
| Anticompaction splits sstables into repaired and unrepaired sstables. To avoid mixing up the data again, |
| separate compaction strategy instances are run on the two sets of data, each instance only knowing about either the |
| repaired or the unrepaired sstables. This means that if you only run incremental repair once and then never again, you |
| might have very old data in the repaired sstables that blocks compaction from dropping tombstones in the unrepaired |
| (probably newer) sstables.</p> |
| </div> |
| <div class="section" id="data-directories"> |
| <h2>Data directories<a class="headerlink" href="#data-directories" title="Permalink to this headline">¶</a></h2> |
| <p>Since tombstones and data can live in different sstables, it is important to realize that losing an sstable might lead to |
| data becoming live again. The most common way of losing sstables is to have a hard drive break down. To avoid making |
| data live, tombstones and actual data are always in the same data directory. This way, if a disk is lost, all versions of |
| a partition are lost and no data can get undeleted. To achieve this, a compaction strategy instance per data directory is |
| run in addition to the compaction strategy instances containing repaired/unrepaired data. This means that if you have 4 |
| data directories there will be 8 compaction strategy instances running. This has a few more benefits than just avoiding |
| data getting undeleted:</p> |
| <ul class="simple"> |
| <li>It is possible to run more compactions in parallel - leveled compaction will have several totally separate levelings |
| and each one can run compactions independently from the others.</li> |
| <li>Users can backup and restore a single data directory.</li> |
| <li>Note though that currently all data directories are considered equal, so if you have a tiny disk and a big disk |
| backing two data directories, the big one will be limited by the small one. One workaround is to create |
| more data directories backed by the big disk.</li> |
| </ul> |
| </div> |
| <div class="section" id="single-sstable-tombstone-compaction"> |
| <h2>Single sstable tombstone compaction<a class="headerlink" href="#single-sstable-tombstone-compaction" title="Permalink to this headline">¶</a></h2> |
| <p>When an sstable is written, a histogram of the tombstone expiry times is created. This histogram is used to find |
| sstables with very many tombstones and to run a single sstable compaction on such an sstable in the hope of being able to drop |
| tombstones in it. Before starting this compaction, Cassandra also checks how likely it is that any tombstones can actually |
| be dropped, based on how much this sstable overlaps with other sstables. To avoid most of these checks the |
| compaction option <code class="docutils literal notranslate"><span class="pre">unchecked_tombstone_compaction</span></code> can be enabled.</p> |
| </div> |
| <div class="section" id="common-options"> |
| <span id="compaction-options"></span><h2>Common options<a class="headerlink" href="#common-options" title="Permalink to this headline">¶</a></h2> |
| <p>There are a number of common options for all the compaction strategies:</p> |
| <dl class="docutils"> |
| <dt><code class="docutils literal notranslate"><span class="pre">enabled</span></code> (default: true)</dt> |
| <dd>Whether minor compactions should run. Note that you can have ‘enabled’: true as a compaction option and then do |
| ‘nodetool enableautocompaction’ to start running compactions.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">tombstone_threshold</span></code> (default: 0.2)</dt> |
| <dd>How much of the sstable should be tombstones for us to consider doing a single sstable compaction of that sstable.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">tombstone_compaction_interval</span></code> (default: 86400s (1 day))</dt> |
| <dd>Since it might not be possible to drop any tombstones when doing a single sstable compaction we need to make sure |
| that one sstable is not constantly getting recompacted - this option states how often we should try for a given |
| sstable.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">log_all</span></code> (default: false)</dt> |
| <dd>New detailed compaction logging, see <a class="reference internal" href="#detailed-compaction-logging"><span class="std std-ref">below</span></a>.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">unchecked_tombstone_compaction</span></code> (default: false)</dt> |
| <dd>The single sstable compaction has quite strict checks for whether it should be started, this option disables those |
| checks and for some usecases this might be needed. Note that this does not change anything for the actual |
| compaction, tombstones are only dropped if it is safe to do so - it might just rewrite an sstable without being able |
| to drop any tombstones.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">only_purge_repaired_tombstones</span></code> (default: false)</dt> |
| <dd>Option to enable the extra safety of making sure that tombstones are only dropped if the data has been repaired.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">min_threshold</span></code> (default: 4)</dt> |
| <dd>Lower limit of number of sstables before a compaction is triggered. Not used for <code class="docutils literal notranslate"><span class="pre">LeveledCompactionStrategy</span></code>.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">max_threshold</span></code> (default: 32)</dt> |
| <dd>Upper limit of number of sstables before a compaction is triggered. Not used for <code class="docutils literal notranslate"><span class="pre">LeveledCompactionStrategy</span></code>.</dd> |
| </dl> |
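| <p>To make the interplay of <code class="docutils literal notranslate"><span class="pre">tombstone_threshold</span></code> and <code class="docutils literal notranslate"><span class="pre">tombstone_compaction_interval</span></code> concrete, here is a minimal sketch of |
| the first-pass decision for a single sstable tombstone compaction. The function name and its two input parameters are |
| assumptions for illustration, and the additional overlap checks that Cassandra performs are deliberately left out:</p> |

```python
def worth_tombstone_compaction(droppable_ratio, sstable_age_seconds,
                               tombstone_threshold=0.2,
                               tombstone_compaction_interval=86400):
    """First-pass check for a single sstable tombstone compaction.

    droppable_ratio: estimated fraction of the sstable that is droppable
        tombstones (hypothetical input, derived from the histogram).
    sstable_age_seconds: time since the sstable was written
        (hypothetical input).
    """
    # Do not retry the same sstable more often than the interval allows.
    if sstable_age_seconds < tombstone_compaction_interval:
        return False
    # Only consider sstables with enough tombstones to be worth rewriting.
    return droppable_ratio > tombstone_threshold


print(worth_tombstone_compaction(0.35, sstable_age_seconds=2 * 86400))
print(worth_tombstone_compaction(0.35, sstable_age_seconds=3600))
print(worth_tombstone_compaction(0.05, sstable_age_seconds=2 * 86400))
```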
| <p>Further, see the section on each strategy for specific additional options.</p> |
| </div> |
| <div class="section" id="compaction-nodetool-commands"> |
| <h2>Compaction nodetool commands<a class="headerlink" href="#compaction-nodetool-commands" title="Permalink to this headline">¶</a></h2> |
| <p>The <span class="xref std std-ref">nodetool</span> utility provides a number of commands related to compaction:</p> |
| <dl class="docutils"> |
| <dt><code class="docutils literal notranslate"><span class="pre">enableautocompaction</span></code></dt> |
| <dd>Enable compaction.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">disableautocompaction</span></code></dt> |
| <dd>Disable compaction.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">setcompactionthroughput</span></code></dt> |
| <dd>How fast compaction should run at most - defaults to 16MB/s, but note that it is likely not possible to reach this |
| throughput.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">compactionstats</span></code></dt> |
| <dd>Statistics about current and pending compactions.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">compactionhistory</span></code></dt> |
| <dd>List details about the last compactions.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">setcompactionthreshold</span></code></dt> |
| <dd>Set the min/max sstable count for when to trigger compaction, defaults to 4/32.</dd> |
| </dl> |
| </div> |
| <div class="section" id="switching-the-compaction-strategy-and-options-using-jmx"> |
| <h2>Switching the compaction strategy and options using JMX<a class="headerlink" href="#switching-the-compaction-strategy-and-options-using-jmx" title="Permalink to this headline">¶</a></h2> |
| <p>It is possible to switch the compaction strategy and its options on just a single node using JMX. This is a great way to |
| experiment with settings without affecting the whole cluster. The mbean is:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>org.apache.cassandra.db:type=ColumnFamilies,keyspace=<keyspace_name>,columnfamily=<table_name> |
| </pre></div> |
| </div> |
| <p>and the attribute to change is <code class="docutils literal notranslate"><span class="pre">CompactionParameters</span></code> or <code class="docutils literal notranslate"><span class="pre">CompactionParametersJson</span></code> if you use jconsole or jmc. The |
| syntax for the json version is the same as you would use in an <a class="reference internal" href="../cql/ddl.html#alter-table-statement"><span class="std std-ref">ALTER TABLE</span></a> statement - |
| for example:</p> |
| <div class="highlight-none notranslate"><div class="highlight"><pre><span></span>{ 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 123, 'fanout_size': 10} |
| </pre></div> |
| </div> |
| <p>The setting is kept until someone executes an <a class="reference internal" href="../cql/ddl.html#alter-table-statement"><span class="std std-ref">ALTER TABLE</span></a> that touches the compaction |
| settings or restarts the node.</p> |
| </div> |
| <div class="section" id="more-detailed-compaction-logging"> |
| <span id="detailed-compaction-logging"></span><h2>More detailed compaction logging<a class="headerlink" href="#more-detailed-compaction-logging" title="Permalink to this headline">¶</a></h2> |
| <p>Enable with the compaction option <code class="docutils literal notranslate"><span class="pre">log_all</span></code> and a more detailed compaction log file will be produced in your log |
| directory.</p> |
| </div> |
| <div class="section" id="size-tiered-compaction-strategy"> |
| <span id="stcs"></span><h2>Size Tiered Compaction Strategy<a class="headerlink" href="#size-tiered-compaction-strategy" title="Permalink to this headline">¶</a></h2> |
| <p>The basic idea of <code class="docutils literal notranslate"><span class="pre">SizeTieredCompactionStrategy</span></code> (STCS) is to merge sstables of approximately the same size. All |
| sstables are put in different buckets depending on their size. An sstable is added to a bucket if its size |
| is between <code class="docutils literal notranslate"><span class="pre">bucket_low</span></code> and <code class="docutils literal notranslate"><span class="pre">bucket_high</span></code> times the current average size of the sstables already in the bucket. This |
| will create several buckets and the most interesting of those buckets will be compacted. The most interesting one is |
| decided by figuring out which bucket’s sstables take the most reads.</p> |
| <div class="section" id="major-compaction"> |
| <h3>Major compaction<a class="headerlink" href="#major-compaction" title="Permalink to this headline">¶</a></h3> |
| <p>When running a major compaction with STCS you will end up with two sstables per data directory (one for repaired data |
| and one for unrepaired data). There is also an option (-s) to do a major compaction that splits the output into several |
| sstables. The sizes of the sstables are approximately 50%, 25%, 12.5%… of the total size.</p> |
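| <p>The halving series of output sizes can be written out as follows. This is only arithmetic on the percentages given above; the |
| function name and the number of output sstables passed in are assumptions, since the number of sstables actually produced |
| is not stated here:</p> |

```python
def split_output_sizes(total_size, parts):
    """Approximate output sizes: 50%, 25%, 12.5%, ... of the total size."""
    return [total_size / 2 ** (i + 1) for i in range(parts)]


# Example: a 1000 MB data set split into four outputs (sizes in MB).
print(split_output_sizes(1000, 4))
```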
| </div> |
| <div class="section" id="stcs-options"> |
| <span id="id2"></span><h3>STCS options<a class="headerlink" href="#stcs-options" title="Permalink to this headline">¶</a></h3> |
| <dl class="docutils"> |
| <dt><code class="docutils literal notranslate"><span class="pre">min_sstable_size</span></code> (default: 50MB)</dt> |
| <dd>Sstables smaller than this are put in the same bucket.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">bucket_low</span></code> (default: 0.5)</dt> |
| <dd>How much smaller than the average size of a bucket an sstable should be before not being included in the bucket. That |
| is, if <code class="docutils literal notranslate"><span class="pre">bucket_low</span> <span class="pre">*</span> <span class="pre">avg_bucket_size</span> <span class="pre"><</span> <span class="pre">sstable_size</span></code> (and the <code class="docutils literal notranslate"><span class="pre">bucket_high</span></code> condition holds, see below), then |
| the sstable is added to the bucket.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">bucket_high</span></code> (default: 1.5)</dt> |
| <dd>How much bigger than the average size of a bucket an sstable should be before not being included in the bucket. That |
| is, if <code class="docutils literal notranslate"><span class="pre">sstable_size</span> <span class="pre"><</span> <span class="pre">bucket_high</span> <span class="pre">*</span> <span class="pre">avg_bucket_size</span></code> (and the <code class="docutils literal notranslate"><span class="pre">bucket_low</span></code> condition holds, see above), then |
| the sstable is added to the bucket.</dd> |
| </dl> |
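| <p>The bucketing rule described by these options can be sketched as below. This is a simplified model, assuming sstables are |
| visited in order of increasing size and represented only by their size in bytes; the function name <code class="docutils literal notranslate"><span class="pre">bucket_sstables</span></code> is |
| hypothetical and the real strategy tracks more state:</p> |

```python
MB = 1024 * 1024


def bucket_sstables(sizes, bucket_low=0.5, bucket_high=1.5,
                    min_sstable_size=50 * MB):
    """Group sstable sizes (in bytes) into STCS-style buckets (simplified)."""
    buckets = []  # each bucket is a list of sstable sizes
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            in_range = bucket_low * avg < size < bucket_high * avg
            # Sstables below min_sstable_size all share one bucket.
            both_small = size < min_sstable_size and avg < min_sstable_size
            if in_range or both_small:
                bucket.append(size)
                break
        else:
            # No existing bucket fits, so start a new one.
            buckets.append([size])
    return buckets


sizes_mb = [1, 10, 40, 100, 120, 400]
buckets = bucket_sstables([s * MB for s in sizes_mb])
print([[size // MB for size in bucket] for bucket in buckets])
```

| <p>With the defaults, the three sstables under 50MB share a bucket, the 100MB and 120MB sstables form a second one, and the |
| 400MB sstable is left on its own.</p> |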
| </div> |
| <div class="section" id="defragmentation"> |
| <h3>Defragmentation<a class="headerlink" href="#defragmentation" title="Permalink to this headline">¶</a></h3> |
| <p>Defragmentation is done when many sstables are touched during a read. The result of the read is put into the memtable |
| so that the next read will not have to touch as many sstables. This can cause writes on a read-only cluster.</p> |
| </div> |
| </div> |
| <div class="section" id="leveled-compaction-strategy"> |
| <span id="lcs"></span><h2>Leveled Compaction Strategy<a class="headerlink" href="#leveled-compaction-strategy" title="Permalink to this headline">¶</a></h2> |
| <p>The idea of <code class="docutils literal notranslate"><span class="pre">LeveledCompactionStrategy</span></code> (LCS) is that all sstables are put into different levels where we guarantee |
| that no overlapping sstables are in the same level. By this we mean that the token range of an sstable (from its first |
| to its last token) never overlaps with that of another sstable in the same level. This means that for a SELECT we will only have to look for the partition key |
| in a single sstable per level. Each level is 10x the size of the previous one and each sstable is 160MB by default. L0 |
| is where sstables are streamed/flushed - no overlap guarantees are given here.</p> |
| <p>When picking compaction candidates we have to make sure that the compaction does not create overlap in the target level. |
| This is done by always including all overlapping sstables in the next level. For example if we select an sstable in L3, |
| we need to guarantee that we pick all overlapping sstables in L4 and make sure that no currently ongoing compactions |
| will create overlap if we start that compaction. We can start many parallel compactions in a level if we guarantee that |
| we won’t create overlap. For L0 -> L1 compactions we almost always need to include all L1 sstables since most L0 sstables |
| cover the full range. We also can’t compact all L0 sstables with all L1 sstables in a single compaction since that can |
| use too much memory.</p> |
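| <p>Candidate selection as described above can be sketched as an interval-overlap test. This is a simplified illustration and not Cassandra's code: sstables are modelled as hypothetical <code class="docutils literal notranslate"><span class="pre">(first_token,</span> <span class="pre">last_token)</span></code> tuples with inclusive integer bounds, ring wrap-around is ignored, and the check for ongoing compactions is omitted.</p> |

```python
def overlaps(a, b):
    """True when two inclusive (first_token, last_token) ranges share a token."""
    return b[1] >= a[0] and a[1] >= b[0]


def overlapping_in_next_level(selected, next_level):
    """Return every sstable of the next level whose range overlaps `selected`.

    All of these must join the compaction so that the result does not
    introduce overlap in the target level.
    """
    return [sstable for sstable in next_level if overlaps(selected, sstable)]


if __name__ == "__main__":
    l3_sstable = (10, 20)
    l4 = [(0, 9), (5, 12), (15, 30), (21, 40)]
    print(overlapping_in_next_level(l3_sstable, l4))
```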
| <p>When deciding which level to compact, LCS checks the higher levels first (with LCS, a “higher” level is one with a higher |
| number, L0 being the lowest one), and if a level is behind, a compaction will be started in that level.</p> |
| <div class="section" id="id3"> |
| <h3>Major compaction<a class="headerlink" href="#id3" title="Permalink to this headline">¶</a></h3> |
| <p>It is possible to do a major compaction with LCS - it will currently start by filling out L1 and then once L1 is full, |
| it continues with L2 etc. This is suboptimal and will change to create all the sstables in a high level instead |
| (see CASSANDRA-11817).</p> |
| </div> |
| <div class="section" id="bootstrapping"> |
| <h3>Bootstrapping<a class="headerlink" href="#bootstrapping" title="Permalink to this headline">¶</a></h3> |
| <p>During bootstrap, sstables are streamed from other nodes. The level of the remote sstable is kept to avoid many |
| compactions after the bootstrap is done. During bootstrap the new node also takes writes while it is streaming the data |
| from a remote node. These writes are flushed to L0 like all other writes. To avoid those sstables blocking the |
| remote sstables from going to the correct level, we only do STCS in L0 until the bootstrap is done.</p> |
| </div> |
| <div class="section" id="stcs-in-l0"> |
| <h3>STCS in L0<a class="headerlink" href="#stcs-in-l0" title="Permalink to this headline">¶</a></h3> |
| <p>If LCS gets very many L0 sstables, reads are going to hit all (or most) of the L0 sstables since they are likely to be |
| overlapping. To remedy this more quickly, LCS does STCS compactions in L0 if there are more than 32 sstables there. This |
| should improve read performance more quickly compared to letting LCS do its L0 -> L1 compactions. If you keep getting |
| too many sstables in L0, it is likely that LCS is not the best fit for your workload and STCS could work out better.</p> |
| </div> |
| <div class="section" id="starved-sstables"> |
| <h3>Starved sstables<a class="headerlink" href="#starved-sstables" title="Permalink to this headline">¶</a></h3> |
| <p>If a node ends up with a leveling where there are a few very high level sstables that are not getting compacted, they |
| might make it impossible for lower levels to drop tombstones etc. For example, if there are sstables in L6 but there is |
| only enough data to actually get an L4 on the node, the leftover sstables in L6 will get starved and not compacted. This |
| can happen if a user changes <code class="docutils literal notranslate"><span class="pre">sstable_size_in_mb</span></code> from 5MB to 160MB, for example. To avoid this, LCS tries to include |
| those starved high level sstables in other compactions if there have been 25 compaction rounds where the highest level |
| has not been involved.</p> |
| </div> |
| <div class="section" id="lcs-options"> |
| <span id="id4"></span><h3>LCS options<a class="headerlink" href="#lcs-options" title="Permalink to this headline">¶</a></h3> |
| <dl class="docutils"> |
| <dt><code class="docutils literal notranslate"><span class="pre">sstable_size_in_mb</span></code> (default: 160MB)</dt> |
| <dd>The target compressed (if using compression) sstable size - the sstables can end up being larger if there are very |
| large partitions on the node.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">fanout_size</span></code> (default: 10)</dt> |
| <dd>The target size of levels increases by this fanout_size multiplier. You can reduce the space amplification by tuning |
| this option.</dd> |
| </dl> |
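| <p>To see how the two options interact, the sketch below computes a target size per level. It assumes that L1 targets <code class="docutils literal notranslate"><span class="pre">fanout_size</span></code> sstables of <code class="docutils literal notranslate"><span class="pre">sstable_size_in_mb</span></code> each and that every further level grows by the same multiplier; the function name <code class="docutils literal notranslate"><span class="pre">level_target_size_mb</span></code> is hypothetical, and L0 is excluded because it has no such target in this model.</p> |

```python
def level_target_size_mb(level, sstable_size_in_mb=160, fanout_size=10):
    """Approximate target size of an LCS level in MB.

    Assumption for illustration: L1 holds fanout_size sstables and each
    higher level is fanout_size times larger than the one before it.
    """
    if 1 > level:
        raise ValueError("level must be 1 or higher")
    return sstable_size_in_mb * fanout_size ** level


if __name__ == "__main__":
    # Print the assumed target size of the first four levels.
    for level in range(1, 5):
        print("L%d" % level, level_target_size_mb(level), "MB")
```

| <p>Under these assumptions, a smaller <code class="docutils literal notranslate"><span class="pre">fanout_size</span></code> makes the level targets grow more slowly, so the same amount of data spreads over more levels.</p> |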
| <p>LCS also supports the <code class="docutils literal notranslate"><span class="pre">cassandra.disable_stcs_in_l0</span></code> startup option (<code class="docutils literal notranslate"><span class="pre">-Dcassandra.disable_stcs_in_l0=true</span></code>) to avoid |
| doing STCS in L0.</p> |
| </div> |
| </div> |
| <div class="section" id="time-window-compactionstrategy"> |
| <span id="twcs"></span><h2>Time Window CompactionStrategy<a class="headerlink" href="#time-window-compactionstrategy" title="Permalink to this headline">¶</a></h2> |
| <p><code class="docutils literal notranslate"><span class="pre">TimeWindowCompactionStrategy</span></code> (TWCS) is designed specifically for workloads where it’s beneficial to have data on |
| disk grouped by the timestamp of the data, a common goal when the workload is time-series in nature or when all data is |
| written with a TTL. In an expiring/TTL workload, the contents of an entire SSTable likely expire at approximately the |
| same time, allowing them to be dropped completely, and space reclaimed much more reliably than when using |
| <code class="docutils literal notranslate"><span class="pre">SizeTieredCompactionStrategy</span></code> or <code class="docutils literal notranslate"><span class="pre">LeveledCompactionStrategy</span></code>. The basic concept is that |
| <code class="docutils literal notranslate"><span class="pre">TimeWindowCompactionStrategy</span></code> will create one sstable for a given window, where a window is simply calculated |
| as the combination of two primary options:</p> |
| <dl class="docutils"> |
| <dt><code class="docutils literal notranslate"><span class="pre">compaction_window_unit</span></code> (default: DAYS)</dt> |
| <dd>A Java TimeUnit (MINUTES, HOURS, or DAYS).</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">compaction_window_size</span></code> (default: 1)</dt> |
| <dd>The number of units that make up a window.</dd> |
| <dt><code class="docutils literal notranslate"><span class="pre">unsafe_aggressive_sstable_expiration</span></code> (default: false)</dt> |
| <dd>Expired sstables will be dropped without checking whether their data is shadowing other sstables. This is a potentially |
| risky option that can lead to data loss or deleted data re-appearing, going beyond what |
| <cite>unchecked_tombstone_compaction</cite> does for single sstable compaction. Due to the risk, the JVM must also be |
| started with <cite>-Dcassandra.unsafe_aggressive_sstable_expiration=true</cite>.</dd> |
| </dl> |
| <p>Taken together, the operator can specify windows of virtually any size, and <cite>TimeWindowCompactionStrategy</cite> will work to |
| create a single sstable for writes within that window. For efficiency during writing, the newest window will be |
| compacted using <cite>SizeTieredCompactionStrategy</cite>.</p> |
| <p>Ideally, operators should select a <code class="docutils literal notranslate"><span class="pre">compaction_window_unit</span></code> and <code class="docutils literal notranslate"><span class="pre">compaction_window_size</span></code> pair that produces |
| approximately 20-30 windows - if writing with a 90-day TTL, for example, a 3-day window would be a reasonable choice |
| (<code class="docutils literal notranslate"><span class="pre">'compaction_window_unit':'DAYS','compaction_window_size':3</span></code>).</p> |
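| <p>The sizing rule of thumb above can be checked with a little arithmetic. The sketch below is an illustration only: the helper names <code class="docutils literal notranslate"><span class="pre">window_count</span></code> and <code class="docutils literal notranslate"><span class="pre">window_start</span></code> are hypothetical, and aligning windows to multiples of the window length since the Unix epoch is an assumption of this example rather than a statement about TWCS internals.</p> |

```python
UNIT_SECONDS = {"MINUTES": 60, "HOURS": 3600, "DAYS": 86400}


def window_seconds(unit="DAYS", size=1):
    """Length of one window: compaction_window_size units of compaction_window_unit."""
    return UNIT_SECONDS[unit] * size


def window_count(ttl_seconds, unit="DAYS", size=1):
    """Number of windows needed to cover the TTL, rounded up."""
    return -(-ttl_seconds // window_seconds(unit, size))


def window_start(timestamp_seconds, unit="DAYS", size=1):
    """Start of the window holding a timestamp (epoch-aligned, by assumption)."""
    return timestamp_seconds - timestamp_seconds % window_seconds(unit, size)


if __name__ == "__main__":
    ninety_days = 90 * 86400
    print(window_count(ninety_days, "DAYS", 3))
```

| <p>For the 90-day TTL example, a 3-day window yields 30 windows, which sits at the upper end of the suggested 20-30 range.</p> |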
| <div class="section" id="timewindowcompactionstrategy-operational-concerns"> |
| <h3>TimeWindowCompactionStrategy Operational Concerns<a class="headerlink" href="#timewindowcompactionstrategy-operational-concerns" title="Permalink to this headline">¶</a></h3> |
| <p>The primary motivation for TWCS is to separate data on disk by timestamp and to allow fully expired SSTables to drop |
| more efficiently. One potential way this optimal behavior can be subverted is if data is written to SSTables out of |
| order, with new data and old data in the same SSTable. Out of order data can appear in two ways:</p> |
| <ul class="simple"> |
| <li>If the user mixes old data and new data in the traditional write path, the data will be comingled in the memtables |
| and flushed into the same SSTable, where it will remain comingled.</li> |
| <li>If the user’s read requests for old data cause read repairs that pull old data into the current memtable, that data |
| will be comingled and flushed into the same SSTable.</li> |
| </ul> |
| <p>While TWCS tries to minimize the impact of comingled data, users should attempt to avoid this behavior. Specifically, |
| users should avoid queries that explicitly set the timestamp via CQL <code class="docutils literal notranslate"><span class="pre">USING</span> <span class="pre">TIMESTAMP</span></code>. Additionally, users should run |
| frequent repairs (which stream data in such a way that it does not become comingled).</p> |
| </div> |
| <div class="section" id="changing-timewindowcompactionstrategy-options"> |
| <h3>Changing TimeWindowCompactionStrategy Options<a class="headerlink" href="#changing-timewindowcompactionstrategy-options" title="Permalink to this headline">¶</a></h3> |
| <p>Operators wishing to enable <code class="docutils literal notranslate"><span class="pre">TimeWindowCompactionStrategy</span></code> on existing data should consider running a major compaction |
| first, placing all existing data into a single (old) window. Subsequent newer writes will then create typical SSTables |
| as expected.</p> |
| <p>Operators wishing to change <code class="docutils literal notranslate"><span class="pre">compaction_window_unit</span></code> or <code class="docutils literal notranslate"><span class="pre">compaction_window_size</span></code> can do so, but may trigger |
| additional compactions as adjacent windows are joined together. If the window size is decreased (for example, from 24 |
| hours to 12 hours), then the existing SSTables will not be modified - TWCS cannot split existing SSTables into multiple |
| windows.</p> |
| </div> |
| </div> |
| </div> |
| |
| |
| |
| |
| <div class="doc-prev-next-links" role="navigation" aria-label="footer navigation"> |
| |
| <a href="bloom_filters.html" class="btn btn-default pull-right " role="button" title="Bloom Filters" accesskey="n">Next <span class="glyphicon glyphicon-circle-arrow-right" aria-hidden="true"></span></a> |
| |
| |
| <a href="hints.html" class="btn btn-default" role="button" title="Hints" accesskey="p"><span class="glyphicon glyphicon-circle-arrow-left" aria-hidden="true"></span> Previous</a> |
| |
| </div> |
| |
| </div> |
| </div> |
| </div> |
| </div> |
| </div> |