<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Introduction &mdash; Apache Cassandra Documentation v4.0-rc2</title>
<script type="text/javascript" src="../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/extra.css" type="text/css" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Conceptual Data Modeling" href="data_modeling_conceptual.html" />
<link rel="prev" title="Data Modeling" href="index.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home"> Apache Cassandra
</a>
<div class="version">
4.0-rc2
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../getting_started/index.html">Getting Started</a></li>
<li class="toctree-l1"><a class="reference internal" href="../new/index.html">New Features in Apache Cassandra 4.0</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/index.html">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="../cql/index.html">The Cassandra Query Language (CQL)</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Data Modeling</a><ul class="current">
<li class="toctree-l2 current"><a class="current reference internal" href="#">Introduction</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#what-is-data-modeling">What is Data Modeling?</a></li>
<li class="toctree-l3"><a class="reference internal" href="#query-driven-modeling">Query-driven modeling</a></li>
<li class="toctree-l3"><a class="reference internal" href="#goals">Goals</a></li>
<li class="toctree-l3"><a class="reference internal" href="#partitions">Partitions</a></li>
<li class="toctree-l3"><a class="reference internal" href="#comparing-with-relational-data-model">Comparing with Relational Data Model</a></li>
<li class="toctree-l3"><a class="reference internal" href="#examples-of-data-modeling">Examples of Data Modeling</a></li>
<li class="toctree-l3"><a class="reference internal" href="#designing-schema">Designing Schema</a></li>
<li class="toctree-l3"><a class="reference internal" href="#data-model-analysis">Data Model Analysis</a></li>
<li class="toctree-l3"><a class="reference internal" href="#using-materialized-views">Using Materialized Views</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_conceptual.html">Conceptual Data Modeling</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_rdbms.html">RDBMS Design</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_queries.html">Defining Application Queries</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_logical.html">Logical Data Modeling</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_physical.html">Physical Data Modeling</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_refining.html">Evaluating and Refining Data Models</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_schema.html">Defining Database Schema</a></li>
<li class="toctree-l2"><a class="reference internal" href="data_modeling_tools.html">Cassandra Data Modeling Tools</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../configuration/index.html">Configuring Cassandra</a></li>
<li class="toctree-l1"><a class="reference internal" href="../operating/index.html">Operating Cassandra</a></li>
<li class="toctree-l1"><a class="reference internal" href="../tools/index.html">Cassandra Tools</a></li>
<li class="toctree-l1"><a class="reference internal" href="../troubleshooting/index.html">Troubleshooting</a></li>
<li class="toctree-l1"><a class="reference internal" href="../development/index.html">Contributing to Cassandra</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faq/index.html">Frequently Asked Questions</a></li>
<li class="toctree-l1"><a class="reference internal" href="../plugins/index.html">Third-Party Plugins</a></li>
<li class="toctree-l1"><a class="reference internal" href="../bugs.html">Reporting Bugs</a></li>
<li class="toctree-l1"><a class="reference internal" href="../contactus.html">Contact us</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">Apache Cassandra</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html">Docs</a> &raquo;</li>
<li><a href="index.html">Data Modeling</a> &raquo;</li>
<li>Introduction</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/data_modeling/intro.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="introduction">
<h1>Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline"></a></h1>
<p>Apache Cassandra stores data in tables, with each table consisting of rows and columns. CQL (Cassandra Query Language) is used to query the data stored in tables. The Apache Cassandra data model is built around and optimized for querying. Cassandra does not support the relational data modeling used for relational databases.</p>
<div class="section" id="what-is-data-modeling">
<h2>What is Data Modeling?<a class="headerlink" href="#what-is-data-modeling" title="Permalink to this headline"></a></h2>
<p>Data modeling is the process of identifying entities and their relationships. In relational databases, data is placed in normalized tables with foreign keys used to reference related data in other tables. Queries that the application will make are driven by the structure of the tables, and related data is queried using table joins.</p>
<p>In Cassandra, data modeling is query-driven. The data access patterns and application queries determine the structure and organization of data, which are then used to design the database tables.</p>
<p>Data is modeled around specific queries. Queries are best designed to access a single table, which implies that all entities involved in a query must be in the same table to make data access (reads) very fast. Data is modeled to best suit a query or a set of queries. A table could have one or more entities as best suits a query. Because entities typically have relationships among them, and queries can involve related entities, a single entity may be included in multiple tables.</p>
</div>
<div class="section" id="query-driven-modeling">
<h2>Query-driven modeling<a class="headerlink" href="#query-driven-modeling" title="Permalink to this headline"></a></h2>
<p>Unlike a relational database model in which queries make use of table joins to get data from multiple tables, joins are not supported in Cassandra so all required fields (columns) must be grouped together in a single table. Since each query is backed by a table, data is duplicated across multiple tables in a process known as denormalization. Data duplication and a high write throughput are used to achieve a high read performance.</p>
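<p>The idea of one table per query can be sketched in a few lines of Python. This is only an illustration, not Cassandra code: the two dictionaries stand in for tables, and the names <code class="docutils literal notranslate"><span class="pre">magazines_by_id</span></code>, <code class="docutils literal notranslate"><span class="pre">magazines_by_publisher</span></code>, <code class="docutils literal notranslate"><span class="pre">add_magazine</span></code> and the sample data are hypothetical.</p>

```python
# Hypothetical in-memory "tables", one per query.
magazines_by_id = {}          # query: look up a magazine by id
magazines_by_publisher = {}   # query: list magazines for a publisher


def add_magazine(magazine_id, name, publisher):
    """Write the same fact to every table that serves a query."""
    magazines_by_id[magazine_id] = {"name": name, "publisher": publisher}
    magazines_by_publisher.setdefault(publisher, {})[magazine_id] = name


add_magazine(1, "Alpha Weekly", "Acme Press")
add_magazine(2, "Beta Monthly", "Acme Press")

# Each read touches exactly one table; no join is needed.
by_id = magazines_by_id[2]["name"]
by_publisher = sorted(magazines_by_publisher["Acme Press"].values())
```

<p>The cost is visible in <code class="docutils literal notranslate"><span class="pre">add_magazine</span></code>: every write goes to both structures, which is the trade of extra writes and storage for fast reads described above.</p>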
</div>
<div class="section" id="goals">
<h2>Goals<a class="headerlink" href="#goals" title="Permalink to this headline"></a></h2>
<p>The choice of the primary key and partition key is important to distribute data evenly across the cluster. Keeping the number of partitions read for a query to a minimum is also important because different partitions could be located on different nodes, and the coordinator would need to send a request to each node, which adds to the request overhead and latency. Even if the different partitions involved in a query are on the same node, fewer partitions make for a more efficient query.</p>
</div>
<div class="section" id="partitions">
<h2>Partitions<a class="headerlink" href="#partitions" title="Permalink to this headline"></a></h2>
<p>Apache Cassandra is a distributed database that stores data across a cluster of nodes. A partition key is used to partition data among the nodes. Cassandra partitions data over the storage nodes using a variant of consistent hashing for data distribution. Hashing is a technique in which, given a key, a hash function generates a hash value (or simply a hash) that is stored in a hash table. A partition key is generated from the first field of a primary key. Data partitioned into hash tables using partition keys provides for rapid lookup. The fewer the partitions used for a query, the faster the response time for the query.</p>
<p>As an example of partitioning, consider table <code class="docutils literal notranslate"><span class="pre">t</span></code> in which <code class="docutils literal notranslate"><span class="pre">id</span></code> is the only field in the primary key.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">CREATE</span> <span class="n">TABLE</span> <span class="n">t</span> <span class="p">(</span>
<span class="nb">id</span> <span class="nb">int</span><span class="p">,</span>
<span class="n">k</span> <span class="nb">int</span><span class="p">,</span>
<span class="n">v</span> <span class="n">text</span><span class="p">,</span>
<span class="n">PRIMARY</span> <span class="n">KEY</span> <span class="p">(</span><span class="nb">id</span><span class="p">)</span>
<span class="p">);</span>
</pre></div>
</div>
<p>The partition key is generated from the primary key <code class="docutils literal notranslate"><span class="pre">id</span></code> for data distribution across the nodes in a cluster.</p>
<p>Consider a variation of table <code class="docutils literal notranslate"><span class="pre">t</span></code> that has two fields constituting the primary key to make a composite or compound primary key.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">CREATE</span> <span class="n">TABLE</span> <span class="n">t</span> <span class="p">(</span>
<span class="nb">id</span> <span class="nb">int</span><span class="p">,</span>
<span class="n">c</span> <span class="n">text</span><span class="p">,</span>
<span class="n">k</span> <span class="nb">int</span><span class="p">,</span>
<span class="n">v</span> <span class="n">text</span><span class="p">,</span>
<span class="n">PRIMARY</span> <span class="n">KEY</span> <span class="p">(</span><span class="nb">id</span><span class="p">,</span><span class="n">c</span><span class="p">)</span>
<span class="p">);</span>
</pre></div>
</div>
<p>For the table <code class="docutils literal notranslate"><span class="pre">t</span></code> with a composite primary key, the first field <code class="docutils literal notranslate"><span class="pre">id</span></code> is used to generate the partition key and the second field <code class="docutils literal notranslate"><span class="pre">c</span></code> is the clustering key used for sorting within a partition. Using clustering keys to sort data makes retrieval of adjacent data more efficient.</p>
<p>In general, the first field or component of a primary key is hashed to generate the partition key and the remaining fields or components are the clustering keys that are used to sort data within a partition. Partitioning data improves the efficiency of reads and writes. The other fields that are not primary key fields may be indexed separately to further improve query performance.</p>
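<p>Why sorting by clustering key helps with adjacent data can be shown with a small in-memory model. This is a sketch under simplifying assumptions, not how Cassandra stores data on disk: <code class="docutils literal notranslate"><span class="pre">ToyTable</span></code> and its methods are hypothetical names, and the table mimics <code class="docutils literal notranslate"><span class="pre">PRIMARY</span> <span class="pre">KEY</span> <span class="pre">(id,</span> <span class="pre">c)</span></code> from the example above.</p>

```python
import bisect
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table with PRIMARY KEY (id, c)."""

    def __init__(self):
        # partition key -> list of (clustering key, columns), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, partition_id, c, **columns):
        rows = self._partitions[partition_id]
        keys = [key for key, _ in rows]
        pos = bisect.bisect_left(keys, c)
        if pos < len(rows) and rows[pos][0] == c:
            rows[pos] = (c, columns)        # same primary key: overwrite
        else:
            rows.insert(pos, (c, columns))  # keep clustering order

    def slice(self, partition_id, start, end):
        """Rows of one partition with start <= c <= end, in clustering order."""
        rows = self._partitions.get(partition_id, [])
        keys = [key for key, _ in rows]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return rows[lo:hi]


table = ToyTable()
table.insert(1, "b", k=20, v="second")
table.insert(1, "a", k=10, v="first")
table.insert(1, "c", k=30, v="third")
table.insert(2, "a", k=99, v="other partition")

adjacent = table.slice(1, "a", "b")
```

<p>Because rows inside a partition are already ordered by <code class="docutils literal notranslate"><span class="pre">c</span></code>, a range of adjacent rows is one contiguous slice of a single partition rather than a scan over unrelated data.</p>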
<p>The partition key could be generated from multiple fields if they are grouped as the first component of a primary key. As another variation of the table <code class="docutils literal notranslate"><span class="pre">t</span></code>, consider a table with the first component of the primary key made of two fields grouped using parentheses.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">CREATE</span> <span class="n">TABLE</span> <span class="n">t</span> <span class="p">(</span>
<span class="n">id1</span> <span class="nb">int</span><span class="p">,</span>
<span class="n">id2</span> <span class="nb">int</span><span class="p">,</span>
<span class="n">c1</span> <span class="n">text</span><span class="p">,</span>
<span class="n">c2</span> <span class="n">text</span><span class="p">,</span>
<span class="n">k</span> <span class="nb">int</span><span class="p">,</span>
<span class="n">v</span> <span class="n">text</span><span class="p">,</span>
<span class="n">PRIMARY</span> <span class="n">KEY</span> <span class="p">((</span><span class="n">id1</span><span class="p">,</span><span class="n">id2</span><span class="p">),</span><span class="n">c1</span><span class="p">,</span><span class="n">c2</span><span class="p">)</span>
<span class="p">);</span>
</pre></div>
</div>
<p>For the preceding table <code class="docutils literal notranslate"><span class="pre">t</span></code>, the first component of the primary key, consisting of fields <code class="docutils literal notranslate"><span class="pre">id1</span></code> and <code class="docutils literal notranslate"><span class="pre">id2</span></code>, is used to generate the partition key, and the remaining fields <code class="docutils literal notranslate"><span class="pre">c1</span></code> and <code class="docutils literal notranslate"><span class="pre">c2</span></code> are the clustering keys used for sorting within a partition.</p>
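<p>The mapping from partition key fields to a node can be sketched as a toy hash ring. Several things here are assumptions made for illustration: MD5 from Python's <code class="docutils literal notranslate"><span class="pre">hashlib</span></code> stands in for Cassandra's partitioner, each node gets a single position on the ring, replication is ignored, and the names <code class="docutils literal notranslate"><span class="pre">token</span></code>, <code class="docutils literal notranslate"><span class="pre">Ring</span></code> and the node labels are hypothetical.</p>

```python
import bisect
import hashlib


def token(partition_key: tuple) -> int:
    """Hash the partition key fields to a 64-bit integer (MD5 stand-in)."""
    raw = ":".join(str(part) for part in partition_key).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big")


class Ring:
    """Toy ring: a key belongs to the first node at or after its token."""

    def __init__(self, nodes):
        self._entries = sorted((token((name,)), name) for name in nodes)
        self._tokens = [t for t, _ in self._entries]

    def node_for(self, partition_key: tuple) -> str:
        idx = bisect.bisect_left(self._tokens, token(partition_key))
        # Wrap around to the first node when the token is past the last one.
        return self._entries[idx % len(self._entries)][1]


ring = Ring(["node-a", "node-b", "node-c"])

# Only (id1, id2) is hashed, so rows that share those values land on the
# same node regardless of their clustering keys c1 and c2.
owner = ring.node_for((1, 2))
```

<p>The clustering keys never enter <code class="docutils literal notranslate"><span class="pre">token</span></code>, which mirrors the split described above: the first component of the primary key decides placement, and the remaining fields only order rows within the partition.</p>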
</div>
<div class="section" id="comparing-with-relational-data-model">
<h2>Comparing with Relational Data Model<a class="headerlink" href="#comparing-with-relational-data-model" title="Permalink to this headline"></a></h2>
<p>Relational databases store data in tables that have relations with other tables using foreign keys. A relational database’s approach to data modeling is table-centric. Queries must use table joins to get data from multiple tables that have a relation between them. Apache Cassandra does not have the concept of foreign keys or referential integrity. Apache Cassandra’s data model is based around designing efficient queries, that is, queries that don’t involve multiple tables. Relational databases normalize data to avoid duplication. Apache Cassandra, in contrast, denormalizes data by duplicating it in multiple tables for a query-centric data model. If a Cassandra data model cannot fully integrate the complexity of relationships between the different entities for a particular query, client-side joins in application code may be used.</p>
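<p>A client-side join simply means reading from more than one table and combining the results in the application. The following is a minimal sketch with dictionaries standing in for two query tables; the table names, the <code class="docutils literal notranslate"><span class="pre">magazines_for</span></code> helper and the sample rows are hypothetical.</p>

```python
# Two hypothetical query tables, modeled as dicts.
magazine_ids_by_publisher = {"Acme Press": [1, 2]}
magazine_by_id = {
    1: {"name": "Alpha Weekly", "frequency": "weekly"},
    2: {"name": "Beta Monthly", "frequency": "monthly"},
}


def magazines_for(publisher):
    """Client-side join: read each table, then combine in application code."""
    ids = magazine_ids_by_publisher.get(publisher, [])
    return [{"id": i, **magazine_by_id[i]} for i in ids if i in magazine_by_id]


joined = magazines_for("Acme Press")
```

<p>Against a real cluster each lookup would be a separate query, so this pattern costs extra round trips and is best kept for cases where a single denormalized table is not practical.</p>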
</div>
<div class="section" id="examples-of-data-modeling">
<h2>Examples of Data Modeling<a class="headerlink" href="#examples-of-data-modeling" title="Permalink to this headline"></a></h2>
<p>As an example, a <code class="docutils literal notranslate"><span class="pre">magazine</span></code> data set consists of data for magazines with attributes such as magazine id, magazine name, publication frequency, publication date, and publisher. A basic query (Q1) for magazine data is to list all the magazine names including their publication frequency. As not all data attributes are needed for Q1, the data model would only consist of <code class="docutils literal notranslate"><span class="pre">id</span></code> (for the partition key), magazine name and publication frequency, as shown in Figure 1.</p>
<div class="figure align-default">
<img alt="../_images/Figure_1_data_model.jpg" src="../_images/Figure_1_data_model.jpg" />
</div>
<p>Figure 1. Data Model for Q1</p>
<p>Another query (Q2) is to list all the magazine names by publisher. For Q2 the data model would consist of an additional attribute <code class="docutils literal notranslate"><span class="pre">publisher</span></code> for the partition key. The <code class="docutils literal notranslate"><span class="pre">id</span></code> would become the clustering key for sorting within a partition. The data model for Q2 is illustrated in Figure 2.</p>
<div class="figure align-default">
<img alt="../_images/Figure_2_data_model.jpg" src="../_images/Figure_2_data_model.jpg" />
</div>
<p>Figure 2. Data Model for Q2</p>
</div>
<div class="section" id="designing-schema">
<h2>Designing Schema<a class="headerlink" href="#designing-schema" title="Permalink to this headline"></a></h2>
<p>After the conceptual data model has been created, a schema may be designed for a query. For Q1 the following schema may be used.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">CREATE</span> <span class="n">TABLE</span> <span class="n">magazine_name</span> <span class="p">(</span><span class="nb">id</span> <span class="nb">int</span> <span class="n">PRIMARY</span> <span class="n">KEY</span><span class="p">,</span> <span class="n">name</span> <span class="n">text</span><span class="p">,</span> <span class="n">publicationFrequency</span> <span class="n">text</span><span class="p">);</span>
</pre></div>
</div>
<p>For Q2 the schema definition would include a clustering key for sorting.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">CREATE</span> <span class="n">TABLE</span> <span class="n">magazine_publisher</span> <span class="p">(</span><span class="n">publisher</span> <span class="n">text</span><span class="p">,</span><span class="nb">id</span> <span class="nb">int</span><span class="p">,</span><span class="n">name</span> <span class="n">text</span><span class="p">,</span> <span class="n">publicationFrequency</span> <span class="n">text</span><span class="p">,</span>
<span class="n">PRIMARY</span> <span class="n">KEY</span> <span class="p">(</span><span class="n">publisher</span><span class="p">,</span> <span class="nb">id</span><span class="p">))</span> <span class="n">WITH</span> <span class="n">CLUSTERING</span> <span class="n">ORDER</span> <span class="n">BY</span> <span class="p">(</span><span class="nb">id</span> <span class="n">DESC</span><span class="p">);</span>
</pre></div>
</div>
</div>
<div class="section" id="data-model-analysis">
<h2>Data Model Analysis<a class="headerlink" href="#data-model-analysis" title="Permalink to this headline"></a></h2>
<p>The data model is a conceptual model that must be analyzed and optimized based on storage, capacity, redundancy and consistency. A data model may need to be modified as a result of the analysis. Considerations or limitations that are used in data model analysis include:</p>
<ul class="simple">
<li><p>Partition Size</p></li>
<li><p>Data Redundancy</p></li>
<li><p>Disk space</p></li>
<li><p>Lightweight Transactions (LWT)</p></li>
</ul>
<p>The two measures of partition size are the number of values in a partition and the partition size on disk. Though requirements for these measures may vary based on the application, a general guideline is to keep the number of values per partition below 100,000 and the disk space per partition below 100MB.</p>
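<p>The number of values in a partition can be estimated before any data is loaded. A commonly used approximation multiplies the number of rows by the number of columns that are neither primary key nor static columns, then adds the static columns once. The sketch below applies it to the <code class="docutils literal notranslate"><span class="pre">magazine_publisher</span></code> table; the function name and the figure of 5,000 magazines per publisher are assumptions made for illustration.</p>

```python
def values_per_partition(rows, columns, primary_key_columns, static_columns=0):
    """Estimate N_v = N_r * (N_c - N_pk - N_s) + N_s for one partition."""
    return rows * (columns - primary_key_columns - static_columns) + static_columns


MAX_VALUES = 100_000  # guideline quoted in the text above

# magazine_publisher has 4 columns, 2 of them in the primary key.
# 5,000 rows per publisher is a made-up figure for illustration.
estimate = values_per_partition(rows=5_000, columns=4, primary_key_columns=2)
within_guideline = estimate < MAX_VALUES
```

<p>Here the estimate is 10,000 values, well under the guideline. If a single publisher could have far more rows, adding another field to the partition key would be one way to split the partition.</p>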
<p>Data redundancy, in the form of duplicate data in tables and multiple partition replicas, is to be expected in the design of a data model, but it should nevertheless be kept to a minimum. Lightweight transactions (compare-and-set, conditional update) can affect performance, so queries using LWT should also be kept to a minimum.</p>
</div>
<div class="section" id="using-materialized-views">
<h2>Using Materialized Views<a class="headerlink" href="#using-materialized-views" title="Permalink to this headline"></a></h2>
<div class="admonition warning">
<p class="admonition-title">Warning</p>
<p>Materialized views (MVs) are experimental in the latest (4.0) release.</p>
</div>
<p>Materialized views (MVs) can be used to implement multiple queries for a single table. A materialized view is a table built from data in another table, the base table, with a new primary key and new properties. Changes to the base table data automatically add and update data in an MV. Different queries may be implemented using a materialized view because an MV’s primary key differs from that of the base table. Queries are optimized by the primary key definition.</p>
</div>
</div>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="data_modeling_conceptual.html" class="btn btn-neutral float-right" title="Conceptual Data Modeling" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="index.html" class="btn btn-neutral float-left" title="Data Modeling" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2020, The Apache Cassandra team
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/rtfd/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>