blob: c9c87f5b248b41a0611efa26891435c3d74bf5ba [file] [log] [blame]
<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Evaluating and Refining Data Models &mdash; Apache Cassandra Documentation v4.0-rc2</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/extra.css" type="text/css" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script src="../_static/jquery.js"></script>
<script src="../_static/underscore.js"></script>
<script src="../_static/doctools.js"></script>
<script async="async" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Defining Database Schema" href="data_modeling_schema.html" />
<link rel="prev" title="Physical Data Modeling" href="data_modeling_physical.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home"> Apache Cassandra
</a>
<div class="version">
4.0-rc2
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../getting_started/index.html">Getting Started</a></li>
<li class="toctree-l1"><a class="reference internal" href="../new/index.html">New Features in Apache Cassandra 4.0</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/index.html">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="../cql/index.html">The Cassandra Query Language (CQL)</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Data Modeling</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="intro.html">Introduction</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_conceptual.html">Conceptual Data Modeling</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_rdbms.html">RDBMS Design</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_queries.html">Defining Application Queries</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_logical.html">Logical Data Modeling</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_physical.html">Physical Data Modeling</a></li>
<li class="toctree-l2 current"><a class="current reference internal" href="#">Evaluating and Refining Data Models</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#calculating-partition-size">Calculating Partition Size</a></li>
<li class="toctree-l3"><a class="reference internal" href="#calculating-size-on-disk">Calculating Size on Disk</a></li>
<li class="toctree-l3"><a class="reference internal" href="#breaking-up-large-partitions">Breaking Up Large Partitions</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_schema.html">Defining Database Schema</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_tools.html">Cassandra Data Modeling Tools</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../configuration/index.html">Configuring Cassandra</a></li>
<li class="toctree-l1"><a class="reference internal" href="../operating/index.html">Operating Cassandra</a></li>
<li class="toctree-l1"><a class="reference internal" href="../tools/index.html">Cassandra Tools</a></li>
<li class="toctree-l1"><a class="reference internal" href="../troubleshooting/index.html">Troubleshooting</a></li>
<li class="toctree-l1"><a class="reference internal" href="../development/index.html">Contributing to Cassandra</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faq/index.html">Frequently Asked Questions</a></li>
<li class="toctree-l1"><a class="reference internal" href="../plugins/index.html">Third-Party Plugins</a></li>
<li class="toctree-l1"><a class="reference internal" href="../bugs.html">Reporting Bugs</a></li>
<li class="toctree-l1"><a class="reference internal" href="../contactus.html">Contact us</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">Apache Cassandra</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home"></a> &raquo;</li>
<li><a href="index.html">Data Modeling</a> &raquo;</li>
<li>Evaluating and Refining Data Models</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/data_modeling/data_modeling_refining.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="evaluating-and-refining-data-models">
<h1>Evaluating and Refining Data Models<a class="headerlink" href="#evaluating-and-refining-data-models" title="Permalink to this headline"></a></h1>
<p>Once you’ve created a physical model, there are some steps you’ll want
to take to evaluate and refine table designs to help ensure optimal
performance.</p>
<div class="section" id="calculating-partition-size">
<h2>Calculating Partition Size<a class="headerlink" href="#calculating-partition-size" title="Permalink to this headline"></a></h2>
<p>The first thing that you want to look for is whether your tables will have
partitions that will be overly large, or to put it another way, too
wide. Partition size is measured by the number of cells (values) that
are stored in the partition. Cassandra’s hard limit is 2 billion cells
per partition, but you’ll likely run into performance issues before
reaching that limit.</p>
<p>In order to calculate the size of partitions, use the following
formula:</p>
<div class="math notranslate nohighlight">
\[N_v = N_r (N_c - N_{pk} - N_s) + N_s\]</div>
<p>The number of values (or cells) in the partition (N<sub>v</sub>) is equal to
the number of static columns (N<sub>s</sub>) plus the product of the number
of rows (N<sub>r</sub>) and the number of of values per row. The number of
values per row is defined as the number of columns (N<sub>c</sub>) minus the
number of primary key columns (N<sub>pk</sub>) and static columns
(N<sub>s</sub>).</p>
<p>The number of columns tends to be relatively static, although it
is possible to alter tables at runtime. For this reason, a
primary driver of partition size is the number of rows in the partition.
This is a key factor that you must consider in determining whether a
partition has the potential to get too large. Two billion values sounds
like a lot, but in a sensor system where tens or hundreds of values are
measured every millisecond, the number of values starts to add up pretty
fast.</p>
<p>Let’s take a look at one of the tables to analyze the partition size.
Because it has a wide partition design with one partition per hotel,
look at the <code class="docutils literal notranslate"><span class="pre">available_rooms_by_hotel_date</span></code> table. The table has
four columns total (N<sub>c</sub> = 4), including three primary key columns
(N<sub>pk</sub> = 3) and no static columns (N<sub>s</sub> = 0). Plugging these
values into the formula, the result is:</p>
<div class="math notranslate nohighlight">
\[N_v = N_r (4 - 3 - 0) + 0 = 1N_r\]</div>
<p>Therefore the number of values for this table is equal to the number of
rows. You still need to determine a number of rows. To do this, make
estimates based on the application design. The table is
storing a record for each room, in each of hotel, for every night.
Let’s assume the system will be used to store two years of
inventory at a time, and there are 5,000 hotels in the system, with an
average of 100 rooms in each hotel.</p>
<p>Since there is a partition for each hotel, the estimated number of rows
per partition is as follows:</p>
<div class="math notranslate nohighlight">
\[N_r = 100 rooms/hotel \times 730 days = 73,000 rows\]</div>
<p>This relatively small number of rows per partition is not going to get
you in too much trouble, but if you start storing more dates of inventory,
or don’t manage the size of the inventory well using TTL, you could start
having issues. You still might want to look at breaking up this large
partition, which you’ll see how to do shortly.</p>
<p>When performing sizing calculations, it is tempting to assume the
nominal or average case for variables such as the number of rows.
Consider calculating the worst case as well, as these sorts of
predictions have a way of coming true in successful systems.</p>
</div>
<div class="section" id="calculating-size-on-disk">
<h2>Calculating Size on Disk<a class="headerlink" href="#calculating-size-on-disk" title="Permalink to this headline"></a></h2>
<p>In addition to calculating the size of a partition, it is also an
excellent idea to estimate the amount of disk space that will be
required for each table you plan to store in the cluster. In order to
determine the size, use the following formula to determine the size
S<sub>t</sub> of a partition:</p>
<div class="math notranslate nohighlight">
\[S_t = \displaystyle\sum_i sizeOf\big (c_{k_i}\big) + \displaystyle\sum_j sizeOf\big(c_{s_j}\big) + N_r\times \bigg(\displaystyle\sum_k sizeOf\big(c_{r_k}\big) + \displaystyle\sum_l sizeOf\big(c_{c_l}\big)\bigg) +\]</div>
<div class="math notranslate nohighlight">
\[N_v\times sizeOf\big(t_{avg}\big)\]</div>
<p>This is a bit more complex than the previous formula, but let’s break it
down a bit at a time. Let’s take a look at the notation first:</p>
<ul class="simple">
<li><p>In this formula, c<sub>k</sub> refers to partition key columns,
c<sub>s</sub> to static columns, c<sub>r</sub> to regular columns, and
c<sub>c</sub> to clustering columns.</p></li>
<li><p>The term t<sub>avg</sub> refers to the average number of bytes of
metadata stored per cell, such as timestamps. It is typical to use an
estimate of 8 bytes for this value.</p></li>
<li><p>You’ll recognize the number of rows N<sub>r</sub> and number of values
N<sub>v</sub> from previous calculations.</p></li>
<li><p>The <strong>sizeOf()</strong> function refers to the size in bytes of the CQL data
type of each referenced column.</p></li>
</ul>
<p>The first term asks you to sum the size of the partition key columns. For
this example, the <code class="docutils literal notranslate"><span class="pre">available_rooms_by_hotel_date</span></code> table has a single
partition key column, the <code class="docutils literal notranslate"><span class="pre">hotel_id</span></code>, which is of type
<code class="docutils literal notranslate"><span class="pre">text</span></code>. Assuming that hotel identifiers are simple 5-character codes,
you have a 5-byte value, so the sum of the partition key column sizes is
5 bytes.</p>
<p>The second term asks you to sum the size of the static columns. This table
has no static columns, so the size is 0 bytes.</p>
<p>The third term is the most involved, and for good reason—it is
calculating the size of the cells in the partition. Sum the size of
the clustering columns and regular columns. The two clustering columns
are the <code class="docutils literal notranslate"><span class="pre">date</span></code>, which is 4 bytes, and the <code class="docutils literal notranslate"><span class="pre">room_number</span></code>,
which is a 2-byte short integer, giving a sum of 6 bytes.
There is only a single regular column, the boolean <code class="docutils literal notranslate"><span class="pre">is_available</span></code>,
which is 1 byte in size. Summing the regular column size
(1 byte) plus the clustering column size (6 bytes) gives a total of 7
bytes. To finish up the term, multiply this value by the number of
rows (73,000), giving a result of 511,000 bytes (0.51 MB).</p>
<p>The fourth term is simply counting the metadata that that Cassandra
stores for each cell. In the storage format used by Cassandra 3.0 and
later, the amount of metadata for a given cell varies based on the type
of data being stored, and whether or not custom timestamp or TTL values
are specified for individual cells. For this table, reuse the number
of values from the previous calculation (73,000) and multiply by 8,
which gives 0.58 MB.</p>
<p>Adding these terms together, you get a final estimate:</p>
<div class="math notranslate nohighlight">
\[Partition size = 16 bytes + 0 bytes + 0.51 MB + 0.58 MB = 1.1 MB\]</div>
<p>This formula is an approximation of the actual size of a partition on
disk, but is accurate enough to be quite useful. Remembering that the
partition must be able to fit on a single node, it looks like the table
design will not put a lot of strain on disk storage.</p>
<p>Cassandra’s storage engine was re-implemented for the 3.0 release,
including a new format for SSTable files. The previous format stored a
separate copy of the clustering columns as part of the record for each
cell. The newer format eliminates this duplication, which reduces the
size of stored data and simplifies the formula for computing that size.</p>
<p>Keep in mind also that this estimate only counts a single replica of
data. You will need to multiply the value obtained here by the number of
partitions and the number of replicas specified by the keyspace’s
replication strategy in order to determine the total required total
capacity for each table. This will come in handy when you
plan your cluster.</p>
</div>
<div class="section" id="breaking-up-large-partitions">
<h2>Breaking Up Large Partitions<a class="headerlink" href="#breaking-up-large-partitions" title="Permalink to this headline"></a></h2>
<p>As discussed previously, the goal is to design tables that can provide
the data you need with queries that touch a single partition, or failing
that, the minimum possible number of partitions. However, as shown in
the examples, it is quite possible to design wide
partition-style tables that approach Cassandra’s built-in limits.
Performing sizing analysis on tables may reveal partitions that are
potentially too large, either in number of values, size on disk, or
both.</p>
<p>The technique for splitting a large partition is straightforward: add an
additional column to the partition key. In most cases, moving one of the
existing columns into the partition key will be sufficient. Another
option is to introduce an additional column to the table to act as a
sharding key, but this requires additional application logic.</p>
<p>Continuing to examine the available rooms example, if you add the <code class="docutils literal notranslate"><span class="pre">date</span></code>
column to the partition key for the <code class="docutils literal notranslate"><span class="pre">available_rooms_by_hotel_date</span></code>
table, each partition would then represent the availability of rooms
at a specific hotel on a specific date. This will certainly yield
partitions that are significantly smaller, perhaps too small, as the
data for consecutive days will likely be on separate nodes.</p>
<p>Another technique known as <strong>bucketing</strong> is often used to break the data
into moderate-size partitions. For example, you could bucketize the
<code class="docutils literal notranslate"><span class="pre">available_rooms_by_hotel_date</span></code> table by adding a <code class="docutils literal notranslate"><span class="pre">month</span></code> column to
the partition key, perhaps represented as an integer. The comparision
with the original design is shown in the figure below. While the
<code class="docutils literal notranslate"><span class="pre">month</span></code> column is partially duplicative of the <code class="docutils literal notranslate"><span class="pre">date</span></code>, it provides
a nice way of grouping related data in a partition that will not get
too large.</p>
<img alt="../_images/data_modeling_hotel_bucketing.png" src="../_images/data_modeling_hotel_bucketing.png" />
<p>If you really felt strongly about preserving a wide partition design, you
could instead add the <code class="docutils literal notranslate"><span class="pre">room_id</span></code> to the partition key, so that each
partition would represent the availability of the room across all
dates. Because there was no query identified that involves searching
availability of a specific room, the first or second design approach
is most suitable to the application needs.</p>
<p><em>Material adapted from Cassandra, The Definitive Guide. Published by
O’Reilly Media, Inc. Copyright © 2020 Jeff Carpenter, Eben Hewitt.
All rights reserved. Used with permission.</em></p>
</div>
</div>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="data_modeling_schema.html" class="btn btn-neutral float-right" title="Defining Database Schema" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
<a href="data_modeling_physical.html" class="btn btn-neutral float-left" title="Physical Data Modeling" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<p>
&#169; Copyright 2020, The Apache Cassandra team.
</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>