blob: 8d39989c188ff55e132d9dd6e594c5b23cc8918d [file] [log] [blame]
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Logical Data Modeling &mdash; Apache Cassandra Documentation v4.0-rc2</title>
<script type="text/javascript" src="../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/extra.css" type="text/css" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Physical Data Modeling" href="data_modeling_physical.html" />
<link rel="prev" title="Defining Application Queries" href="data_modeling_queries.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home"> Apache Cassandra
</a>
<div class="version">
4.0-rc2
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../getting_started/index.html">Getting Started</a></li>
<li class="toctree-l1"><a class="reference internal" href="../new/index.html">New Features in Apache Cassandra 4.0</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/index.html">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="../cql/index.html">The Cassandra Query Language (CQL)</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Data Modeling</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="intro.html">Introduction</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_conceptual.html">Conceptual Data Modeling</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_rdbms.html">RDBMS Design</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_queries.html">Defining Application Queries</a></li>
<li class="toctree-l2 current"><a class="current reference internal" href="#">Logical Data Modeling</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#hotel-logical-data-model">Hotel Logical Data Model</a></li>
<li class="toctree-l3"><a class="reference internal" href="#reservation-logical-data-model">Reservation Logical Data Model</a></li>
<li class="toctree-l3"><a class="reference internal" href="#patterns-and-anti-patterns">Patterns and Anti-Patterns</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_physical.html">Physical Data Modeling</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_refining.html">Evaluating and Refining Data Models</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_schema.html">Defining Database Schema</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_tools.html">Cassandra Data Modeling Tools</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../configuration/index.html">Configuring Cassandra</a></li>
<li class="toctree-l1"><a class="reference internal" href="../operating/index.html">Operating Cassandra</a></li>
<li class="toctree-l1"><a class="reference internal" href="../tools/index.html">Cassandra Tools</a></li>
<li class="toctree-l1"><a class="reference internal" href="../troubleshooting/index.html">Troubleshooting</a></li>
<li class="toctree-l1"><a class="reference internal" href="../development/index.html">Contributing to Cassandra</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faq/index.html">Frequently Asked Questions</a></li>
<li class="toctree-l1"><a class="reference internal" href="../plugins/index.html">Third-Party Plugins</a></li>
<li class="toctree-l1"><a class="reference internal" href="../bugs.html">Reporting Bugs</a></li>
<li class="toctree-l1"><a class="reference internal" href="../contactus.html">Contact us</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">Apache Cassandra</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html">Docs</a> &raquo;</li>
<li><a href="index.html">Data Modeling</a> &raquo;</li>
<li>Logical Data Modeling</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/data_modeling/data_modeling_logical.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="logical-data-modeling">
<h1>Logical Data Modeling<a class="headerlink" href="#logical-data-modeling" title="Permalink to this headline"></a></h1>
<p>Now that you have defined your queries, you’re ready to begin designing
Cassandra tables. First, create a logical model containing a table
for each query, capturing entities and relationships from the conceptual
model.</p>
<p>To name each table, you’ll identify the primary entity type for which you
are querying and use that to start the entity name. If you are querying
by attributes of other related entities, append those to the table
name, separated with <code class="docutils literal notranslate"><span class="pre">_by_</span></code>. For example, <code class="docutils literal notranslate"><span class="pre">hotels_by_poi</span></code>.</p>
<p>Next, you identify the primary key for the table, adding partition key
columns based on the required query attributes, and clustering columns
in order to guarantee uniqueness and support desired sort ordering.</p>
<p>The design of the primary key is extremely important, as it will
determine how much data will be stored in each partition and how that
data is organized on disk, which in turn will affect how quickly
Cassandra processes reads.</p>
<p>Complete each table by adding any additional attributes identified by
the query. If any of these additional attributes are the same for every
instance of the partition key, mark the column as static.</p>
<p>Now that was a pretty quick description of a fairly involved process, so
it will be worthwhile to work through a detailed example. First,
let’s introduce a notation that you can use to represent logical
models.</p>
<p>Several individuals within the Cassandra community have proposed
notations for capturing data models in diagrammatic form. This document
uses a notation popularized by Artem Chebotko which provides a simple,
informative way to visualize the relationships between queries and
tables in your designs. This figure shows the Chebotko notation for a
logical data model.</p>
<img alt="../_images/data_modeling_chebotko_logical.png" src="../_images/data_modeling_chebotko_logical.png" />
<p>Each table is shown with its title and a list of columns. Primary key
columns are identified via symbols such as <strong>K</strong> for partition key
columns and <strong>C</strong>↑ or <strong>C</strong>↓ to represent clustering columns. Lines
are shown entering tables or between tables to indicate the queries that
each table is designed to support.</p>
<div class="section" id="hotel-logical-data-model">
<h2>Hotel Logical Data Model<a class="headerlink" href="#hotel-logical-data-model" title="Permalink to this headline"></a></h2>
<p>The figure below shows a Chebotko logical data model for the queries
involving hotels, points of interest, rooms, and amenities. One thing you’ll
notice immediately is that the Cassandra design doesn’t include dedicated
tables for rooms or amenities, as you had in the relational design. This
is because the workflow didn’t identify any queries requiring this
direct access.</p>
<img alt="../_images/data_modeling_hotel_logical.png" src="../_images/data_modeling_hotel_logical.png" />
<p>Let’s explore the details of each of these tables.</p>
<p>The first query Q1 is to find hotels near a point of interest, so you’ll
call this table <code class="docutils literal notranslate"><span class="pre">hotels_by_poi</span></code>. Searching by a named point of
interest is a clue that the point of interest should be a part
of the primary key. Let’s reference the point of interest by name,
because according to the workflow that is how users will start their
search.</p>
<p>You’ll note that you certainly could have more than one hotel near a
given point of interest, so you’ll need another component in the primary
key in order to make sure you have a unique partition for each hotel. So
you add the hotel key as a clustering column.</p>
<p>An important consideration in designing your table’s primary key is
making sure that it defines a unique data element. Otherwise you run the
risk of accidentally overwriting data.</p>
<p>Now for the second query (Q2), you’ll need a table to get information
about a specific hotel. One approach would have been to put all of the
attributes of a hotel in the <code class="docutils literal notranslate"><span class="pre">hotels_by_poi</span></code> table, but you added
only those attributes that were required by the application workflow.</p>
<p>From the workflow diagram, you know that the <code class="docutils literal notranslate"><span class="pre">hotels_by_poi</span></code> table is
used to display a list of hotels with basic information on each hotel,
and the application knows the unique identifiers of the hotels returned.
When the user selects a hotel to view details, you can then use Q2, which
is used to obtain details about the hotel. Because you already have the
<code class="docutils literal notranslate"><span class="pre">hotel_id</span></code> from Q1, you use that as a reference to the hotel you’re
looking for. Therefore the second table is just called <code class="docutils literal notranslate"><span class="pre">hotels</span></code>.</p>
<p>Another option would have been to store a set of <code class="docutils literal notranslate"><span class="pre">poi_names</span></code> in the
hotels table. This is an equally valid approach. You’ll learn through
experience which approach is best for your application.</p>
<p>Q3 is just a reverse of Q1—looking for points of interest near a hotel,
rather than hotels near a point of interest. This time, however, you need
to access the details of each point of interest, as represented by the
<code class="docutils literal notranslate"><span class="pre">pois_by_hotel</span></code> table. As previously, you add the point of
interest name as a clustering key to guarantee uniqueness.</p>
<p>At this point, let’s now consider how to support query Q4 to help the
user find available rooms at a selected hotel for the nights they are
interested in staying. Note that this query involves both a start date
and an end date. Because you’re querying over a range instead of a single
date, you know that you’ll need to use the date as a clustering key.
Use the <code class="docutils literal notranslate"><span class="pre">hotel_id</span></code> as a primary key to group room data for each hotel
on a single partition, which should help searches be super fast. Let’s
call this the <code class="docutils literal notranslate"><span class="pre">available_rooms_by_hotel_date</span></code> table.</p>
<p>To support searching over a range, use <a class="reference internal" href="../cql/ddl.html#clustering-columns"><span class="std std-ref">clustering columns</span></a> to store
attributes that you need to access in a range query. Remember that the
order of the clustering columns is important.</p>
<p>The design of the <code class="docutils literal notranslate"><span class="pre">available_rooms_by_hotel_date</span></code> table is an instance
of the <strong>wide partition</strong> pattern. This
pattern is sometimes called the <strong>wide row</strong> pattern when discussing
databases that support similar models, but wide partition is a more
accurate description from a Cassandra perspective. The essence of the
pattern is to group multiple related rows in a partition in order to
support fast access to multiple rows within the partition in a single
query.</p>
<p>In order to round out the shopping portion of the data model, add the
<code class="docutils literal notranslate"><span class="pre">amenities_by_room</span></code> table to support Q5. This will allow users to
view the amenities of one of the rooms that is available for the desired
stay dates.</p>
</div>
<div class="section" id="reservation-logical-data-model">
<h2>Reservation Logical Data Model<a class="headerlink" href="#reservation-logical-data-model" title="Permalink to this headline"></a></h2>
<p>Now let’s switch gears to look at the reservation queries. The figure
shows a logical data model for reservations. You’ll notice that these
tables represent a denormalized design; the same data appears in
multiple tables, with differing keys.</p>
<img alt="../_images/data_modeling_reservation_logical.png" src="../_images/data_modeling_reservation_logical.png" />
<p>In order to satisfy Q6, the <code class="docutils literal notranslate"><span class="pre">reservations_by_guest</span></code> table can be used
to look up the reservation by guest name. You could envision query Q7
being used on behalf of a guest on a self-serve website or a call center
agent trying to assist the guest. Because the guest name might not be
unique, you include the guest ID here as a clustering column as well.</p>
<p>Q8 and Q9 in particular help to remind you to create queries
that support various stakeholders of the application, not just customers
but staff as well, and perhaps even the analytics team, suppliers, and so
on.</p>
<p>The hotel staff might wish to see a record of upcoming reservations by
date in order to get insight into how the hotel is performing, such as
what dates the hotel is sold out or undersold. Q8 supports the retrieval
of reservations for a given hotel by date.</p>
<p>Finally, you create a <code class="docutils literal notranslate"><span class="pre">guests</span></code> table. This provides a single
location that used to store guest information. In this case, you specify a
separate unique identifier for guest records, as it is not uncommon
for guests to have the same name. In many organizations, a customer
database such as the <code class="docutils literal notranslate"><span class="pre">guests</span></code> table would be part of a separate
customer management application, which is why other guest
access patterns were omitted from the example.</p>
</div>
<div class="section" id="patterns-and-anti-patterns">
<h2>Patterns and Anti-Patterns<a class="headerlink" href="#patterns-and-anti-patterns" title="Permalink to this headline"></a></h2>
<p>As with other types of software design, there are some well-known
patterns and anti-patterns for data modeling in Cassandra. You’ve already
used one of the most common patterns in this hotel model—the wide
partition pattern.</p>
<p>The <strong>time series</strong> pattern is an extension of the wide partition
pattern. In this pattern, a series of measurements at specific time
intervals are stored in a wide partition, where the measurement time is
used as part of the partition key. This pattern is frequently used in
domains including business analysis, sensor data management, and
scientific experiments.</p>
<p>The time series pattern is also useful for data other than measurements.
Consider the example of a banking application. You could store each
customer’s balance in a row, but that might lead to a lot of read and
write contention as various customers check their balance or make
transactions. You’d probably be tempted to wrap a transaction around
writes just to protect the balance from being updated in error. In
contrast, a time series–style design would store each transaction as a
timestamped row and leave the work of calculating the current balance to
the application.</p>
<p>One design trap that many new users fall into is attempting to use
Cassandra as a queue. Each item in the queue is stored with a timestamp
in a wide partition. Items are appended to the end of the queue and read
from the front, being deleted after they are read. This is a design that
seems attractive, especially given its apparent similarity to the time
series pattern. The problem with this approach is that the deleted items
are now <a class="reference internal" href="../faq/index.html#asynch-deletes"><span class="std std-ref">tombstones</span></a> that Cassandra must scan past
in order to read from the front of the queue. Over time, a growing number
of tombstones begins to degrade read performance.</p>
<p>The queue anti-pattern serves as a reminder that any design that relies
on the deletion of data is potentially a poorly performing design.</p>
<p><em>Material adapted from Cassandra, The Definitive Guide. Published by
O’Reilly Media, Inc. Copyright © 2020 Jeff Carpenter, Eben Hewitt.
All rights reserved. Used with permission.</em></p>
</div>
</div>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="data_modeling_physical.html" class="btn btn-neutral float-right" title="Physical Data Modeling" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="data_modeling_queries.html" class="btn btn-neutral float-left" title="Defining Application Queries" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2020, The Apache Cassandra team
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/rtfd/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>