blob: 9c588475a014893fe6ae4e3fbf2a30794a0d3373 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu - Apache Kudu Schema Design</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"
integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7"
crossorigin="anonymous">
<!-- Custom styles for this template -->
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.1/css/font-awesome.min.css" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="logo" href="/"><img
src="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png"
srcset="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png 1x, //d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_160px.png 2x"
alt="Apache Kudu"/></a>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li >
<a href="/">Home</a>
</li>
<li >
<a href="/overview.html">Overview</a>
</li>
<li class="active">
<a href="/docs/">Documentation</a>
</li>
<li >
<a href="/releases/">Releases</a>
</li>
<li >
<a href="/blog/">Blog</a>
</li>
<!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here
that doesn't also appear elsewhere on the site. -->
<li class="dropdown">
<a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li class="dropdown-header">GET IN TOUCH</li>
<li><a class="icon email" href="/community.html">Mailing Lists</a></li>
<li><a class="icon slack" href="https://getkudu-slack.herokuapp.com/">Slack Channel</a></li>
<li role="separator" class="divider"></li>
<li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li>
<li><a href="/committers.html">Project Committers</a></li>
<li><a href="/ecosystem.html">Ecosystem</a></li>
<!--<li><a href="/roadmap.html">Roadmap</a></li>-->
<li><a href="/community.html#contributions">How to Contribute</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">DEVELOPER RESOURCES</li>
<li><a class="icon github" href="https://github.com/apache/incubator-kudu">GitHub</a></li>
<li><a class="icon gerrit" href="http://gerrit.cloudera.org:8080/#/q/status:open+project:kudu">Gerrit Code Review</a></li>
<li><a class="icon jira" href="https://issues.apache.org/jira/browse/KUDU">JIRA Issue Tracker</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">SOCIAL MEDIA</li>
<li><a class="icon twitter" href="https://twitter.com/ApacheKudu">Twitter</a></li>
<li><a href="https://www.reddit.com/r/kudu/">Reddit</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li>
<li><a href="https://www.apache.org/security/" target="_blank">Security</a></li>
<li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li>
<li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
<li><a href="https://www.apache.org/licenses/" target="_blank">License</a></li>
</ul>
</li>
<li >
<a href="/faq.html">FAQ</a>
</li>
</ul><!-- /.nav -->
</div><!-- /#navbar -->
</div><!-- /.container-fluid -->
</nav>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<div class="container">
<div class="row">
<div class="col-md-9">
<h1>Apache Kudu Schema Design</h1>
<div id="preamble">
<div class="sectionbody">
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
This document applies to Apache Kudu version 1.15.0. Please consult the
<a href="http://kudu.apache.org/releases/">documentation of the appropriate release</a> that&#8217;s applicable
to the version of the Kudu cluster.
</td>
</tr>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="schema_design"><a class="link" href="#schema_design">Overview</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>Kudu tables have a structured data model similar to tables in a traditional
RDBMS. Schema design is critical for achieving the best performance and
operational stability from Kudu. Every workload is unique, and there is no
single schema design that is best for every table. This document outlines
effective schema design philosophies for Kudu, paying particular attention to
where they differ from approaches used for traditional RDBMS schemas.</p>
</div>
<div class="paragraph">
<p>At a high level, there are three concerns when creating Kudu tables:
<a href="#column-design">column design</a>, <a href="#primary-keys">primary key design</a>, and
<a href="#partitioning">partitioning design</a>. Of these, only partitioning will be a new
concept for those familiar with traditional non-distributed relational
databases. The final sections discuss <a href="#alter-schema">altering the schema</a> of an
existing table, and <a href="#known-limitations">known limitations</a> with regard to
schema design.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_the_perfect_schema"><a class="link" href="#_the_perfect_schema">The Perfect Schema</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>The perfect schema would accomplish the following:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Data would be distributed in such a way that reads and writes are spread
evenly across tablet servers. This is impacted by partitioning.</p>
</li>
<li>
<p>Tablets would grow at an even, predictable rate and load across tablets would
remain steady over time. This is most impacted by partitioning.</p>
</li>
<li>
<p>Scans would read the minimum amount of data necessary to fulfill a query. This
is impacted mostly by primary key design, but partitioning also plays a role
via partition pruning.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>The perfect schema depends on the characteristics of your data, what you need to do
with it, and the topology of your cluster. Schema design is the single most important
thing within your control to maximize the performance of your Kudu cluster.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="column-design"><a class="link" href="#column-design">Column Design</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>A Kudu Table consists of one or more columns, each with a defined type. Columns
that are not part of the primary key may be nullable. Supported
column types include:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>boolean</p>
</li>
<li>
<p>8-bit signed integer</p>
</li>
<li>
<p>16-bit signed integer</p>
</li>
<li>
<p>32-bit signed integer</p>
</li>
<li>
<p>64-bit signed integer</p>
</li>
<li>
<p>date (32-bit days since the Unix epoch)</p>
</li>
<li>
<p>unixtime_micros (64-bit microseconds since the Unix epoch)</p>
</li>
<li>
<p>single-precision (32-bit) IEEE-754 floating-point number</p>
</li>
<li>
<p>double-precision (64-bit) IEEE-754 floating-point number</p>
</li>
<li>
<p>decimal (see <a href="#decimal">Decimal Type</a> for details)</p>
</li>
<li>
<p>varchar (see <a href="#varchar">Varchar Type</a> for details)</p>
</li>
<li>
<p>UTF-8 encoded string (up to 64KB uncompressed)</p>
</li>
<li>
<p>binary (up to 64KB uncompressed)</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Kudu takes advantage of strongly-typed columns and a columnar on-disk storage
format to provide efficient encoding and serialization. To make the most of
these features, columns should be specified as the appropriate type, rather than
simulating a 'schemaless' table using string or binary columns for data which
may otherwise be structured. In addition to encoding, Kudu allows compression to
be specified on a per-column basis.</p>
</div>
<div id="no_version_column" class="admonitionblock important">
<table>
<tr>
<td class="icon">
<i class="fa icon-important" title="Important"></i>
</td>
<td class="content">
<div class="title">No Version or Timestamp Column</div>
Kudu does not provide a version or timestamp column to track changes to a row.
If version or timestamp information is needed, the schema should include an
explicit version or timestamp column.
</td>
</tr>
</table>
</div>
<div class="sect2">
<h3 id="decimal"><a class="link" href="#decimal">Decimal Type</a></h3>
<div class="paragraph">
<p>The <code>decimal</code> type is a numeric data type with fixed scale and precision suitable for
financial and other arithmetic calculations where the imprecise representation and
rounding behavior of <code>float</code> and <code>double</code> make those types impractical. The <code>decimal</code>
type is also useful for integers larger than int64 and cases with fractional values
in a primary key.</p>
</div>
<div class="paragraph">
<p>The <code>decimal</code> type is a parameterized type that takes precision and scale type
attributes.</p>
</div>
<div class="paragraph">
<p><strong>Precision</strong> represents the total number of digits that can be represented by the
column, regardless of the location of the decimal point. This value must be between
1 and 38 and has no default. For example, a precision of 4 is required to represent
integer values up to 9999, or to represent values up to 99.99 with two fractional
digits. You can also represent corresponding negative values, without any
change in the precision. For example, the range -9999 to 9999 still only requires
a precision of 4.</p>
</div>
<div class="paragraph">
<p><strong>Scale</strong> represents the number of fractional digits. This value must be between 0
and the precision. A scale of 0 produces integral values, with no fractional part.
If precision and scale are equal, all of the digits come after the decimal point.
For example, a decimal with precision and scale equal to 3 can represent values
between -0.999 and 0.999.</p>
</div>
<div class="paragraph">
<p><strong>Performance considerations:</strong></p>
</div>
<div class="paragraph">
<p>Kudu stores each value in as few bytes as possible depending on the precision
specified for the decimal column. For that reason it is not advised to just use
the highest precision possible for convenience. Doing so could negatively impact
performance, memory and storage.</p>
</div>
<div class="paragraph">
<p>Before encoding and compression:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Decimal values with precision of 9 or less are stored in 4 bytes.</p>
</li>
<li>
<p>Decimal values with precision of 10 through 18 are stored in 8 bytes.</p>
</li>
<li>
<p>Decimal values with precision greater than 18 are stored in 16 bytes.</p>
</li>
</ul>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
The precision and scale of <code>decimal</code> columns cannot be changed by altering
the table.
</td>
</tr>
</table>
</div>
</div>
<div class="sect2">
<h3 id="varchar"><a class="link" href="#varchar">Varchar Type</a></h3>
<div class="paragraph">
<p>The <code>varchar</code> type is a UTF-8 encoded string (up to 64KB uncompressed) with a
fixed maximum character length. This type is especially useful when migrating
from or integrating with legacy systems that support the <code>varchar</code> type.
If a maximum character length is not required the <code>string</code> type should be
used instead.</p>
</div>
<div class="paragraph">
<p>The <code>varchar</code> type is a parameterized type that takes a length attribute.</p>
</div>
<div class="paragraph">
<p><strong>Length</strong> represents the maximum number of UTF-8 characters allowed. Values
with characters greater than the limit will be truncated. This value must
be between 1 and 65535 and has no default. Note that some other systems
may represent the length limit in bytes instead of characters. That means
that Kudu may be able to represent longer values in the case of multi-byte
UTF-8 characters.</p>
</div>
</div>
<div class="sect2">
<h3 id="encoding"><a class="link" href="#encoding">Column Encoding</a></h3>
<div class="paragraph">
<p>Each column in a Kudu table can be created with an encoding, based on the type
of the column.</p>
</div>
<table class="tableblock frame-all grid-all stretch">
<caption class="title">Table 1. Encoding Types</caption>
<colgroup>
<col style="width: 33.3333%;">
<col style="width: 33.3333%;">
<col style="width: 33.3334%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">Column Type</th>
<th class="tableblock halign-left valign-top">Encoding</th>
<th class="tableblock halign-left valign-top">Default</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">int8, int16, int32, int64</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">plain, bitshuffle, run length</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bitshuffle</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">date, unixtime_micros</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">plain, bitshuffle, run length</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bitshuffle</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">float, double, decimal</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">plain, bitshuffle</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">bitshuffle</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">bool</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">plain, run length</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">run length</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">string, varchar, binary</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">plain, prefix, dictionary</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">dictionary</p></td>
</tr>
</tbody>
</table>
<div id="plain" class="dlist">
<dl>
<dt class="hdlist1">Plain Encoding</dt>
<dd>
<p>Data is stored in its natural format. For example, <code>int32</code>
values are stored as fixed-size 32-bit little-endian integers.</p>
</dd>
</dl>
</div>
<div id="bitshuffle" class="dlist">
<dl>
<dt class="hdlist1">Bitshuffle Encoding</dt>
<dd>
<p>A block of values is rearranged to store the most
significant bit of every value, followed by the second most significant bit of
every value, and so on. Finally, the result is LZ4 compressed. Bitshuffle
encoding is a good choice for columns that have many repeated values, or values
that change by small amounts when sorted by primary key. The
<a href="https://github.com/kiyo-masui/bitshuffle">bitshuffle</a> project has a good overview
of performance and use cases.</p>
</dd>
</dl>
</div>
<div id="run-length" class="dlist">
<dl>
<dt class="hdlist1">Run Length Encoding</dt>
<dd>
<p><em>Runs</em> (consecutive repeated values) are compressed in a
column by storing only the value and the count. Run length encoding is effective
for columns with many consecutive repeated values when sorted by primary key.</p>
</dd>
</dl>
</div>
<div id="dictionary" class="dlist">
<dl>
<dt class="hdlist1">Dictionary Encoding</dt>
<dd>
<p>A dictionary of unique values is built, and each column
value is encoded as its corresponding index in the dictionary. Dictionary
encoding is effective for columns with low cardinality. If the column values of
a given row set are unable to be compressed because the number of unique values
is too high, Kudu will transparently fall back to plain encoding for that row
set. This is evaluated during flush.</p>
</dd>
</dl>
</div>
<div id="prefix" class="dlist">
<dl>
<dt class="hdlist1">Prefix Encoding</dt>
<dd>
<p>Common prefixes are compressed in consecutive column values.
Prefix encoding can be effective for values that share common prefixes, or the
first column of the primary key, since rows are sorted by primary key within
tablets.</p>
</dd>
</dl>
</div>
</div>
<div class="sect2">
<h3 id="compression"><a class="link" href="#compression">Column Compression</a></h3>
<div class="paragraph">
<p>Kudu allows per-column compression using the <code>LZ4</code>, <code>Snappy</code>, or <code>zlib</code>
compression codecs. By default, columns that are Bitshuffle-encoded are
inherently compressed with LZ4 compression. Otherwise, columns are stored
uncompressed. Consider using compression if reducing storage space is more
important than raw scan performance.</p>
</div>
<div class="paragraph">
<p>Every data set will compress differently, but in general LZ4 is the most
performant codec, while <code>zlib</code> will compress to the smallest data sizes.
Bitshuffle-encoded columns are automatically compressed using LZ4, so it is not
recommended to apply additional compression on top of this encoding.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="primary-keys"><a class="link" href="#primary-keys">Primary Key Design</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>Every Kudu table must declare a primary key comprised of one or more columns.
Like an RDBMS primary key, the Kudu primary key enforces a uniqueness constraint.
Attempting to insert a row with the same primary key values as an existing row
will result in a duplicate key error.</p>
</div>
<div class="paragraph">
<p>Primary key columns must be non-nullable, and may not be a boolean, float
or double type.</p>
</div>
<div class="paragraph">
<p>Once set during table creation, the set of columns in the primary key may not
be altered.</p>
</div>
<div class="paragraph">
<p>Unlike an RDBMS, Kudu does not provide an auto-incrementing column feature,
so the application must always provide the full primary key during insert.</p>
</div>
<div class="paragraph">
<p>Row delete and update operations must also specify the full primary key of the
row to be changed. Kudu does not natively support range deletes or updates.</p>
</div>
<div class="paragraph">
<p>The primary key values of a column may not be updated after the row is inserted.
However, the row may be deleted and re-inserted with the updated value.</p>
</div>
<div class="sect2">
<h3 id="indexing"><a class="link" href="#indexing">Primary Key Index</a></h3>
<div class="paragraph">
<p>As with many traditional relational databases, Kudu’s primary key is in a
clustered index. All rows within a tablet are sorted by its primary key.</p>
</div>
<div class="paragraph">
<p>When scanning Kudu rows, use equality or range predicates on primary key
columns to efficiently find the rows.</p>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
Primary key indexing optimizations apply to scans on individual tablets.
See the <a href="#partition-pruning">Partition Pruning</a> section for details on how scans can use
predicates to skip entire tablets.
</td>
</tr>
</table>
</div>
</div>
<div class="sect2">
<h3 id="Backfilling"><a class="link" href="#Backfilling">Considerations for Backfill Inserts</a></h3>
<div class="paragraph">
<p>This section discuss a primary key design consideration for timeseries use
cases where the primary key is a timestamp, or the first column of the primary
key is a timestamp.</p>
</div>
<div class="paragraph">
<p>Each time a row is inserted into a Kudu table, Kudu looks up the primary key in
the primary key index storage to check whether that primary key is already
present in the table. If the primary key exists in the table, a "duplicate key"
error is returned. In the typical case where data is being inserted at
the current time as it arrives from the data source, only a small range of
primary keys are "hot". So, each of these "check for presence" operations is
very fast. It hits the cached primary key storage in memory and doesn&#8217;t require
going to disk.</p>
</div>
<div class="paragraph">
<p>In the case when you load historical data, which is called "backfilling", from
an offline data source, each row that is inserted is likely to hit a cold area
of the primary key index which is not resident in memory and will cause one or
more HDD disk seeks. For example, in a normal ingestion case where Kudu sustains
a few million inserts per second, the "backfill" use case might sustain only
a few thousand inserts per second.</p>
</div>
<div class="paragraph">
<p>To alleviate the performance issue during backfilling, consider the following
options:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Make the primary keys more compressible.</p>
<div class="paragraph">
<p>For example, with the first column of a primary key being a random ID of 32-bytes,
caching one billion primary keys would require at least 32 GB of RAM to stay in
cache. If caching backfill primary keys from several days ago, you need to have
several times 32 GB of memory. By changing the primary key to be more compressible,
you increase the likelihood that the primary keys can fit in cache and thus
reducing the amount of random disk I/Os.</p>
</div>
</li>
<li>
<p>Use SSDs for storage as random seeks are orders of magnitude faster than spinning disks.</p>
</li>
<li>
<p>Change the primary key structure such that the backfill writes hit a continuous range of primary keys.</p>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="partitioning"><a class="link" href="#partitioning">Partitioning</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>In order to provide scalability, Kudu tables are partitioned into units called
tablets, and distributed across many tablet servers. A row always belongs to a
single tablet. The method of assigning rows to tablets is determined by the
partitioning of the table, which is set during table creation.</p>
</div>
<div class="paragraph">
<p>Choosing a partitioning strategy requires understanding the data model and the
expected workload of a table. For write-heavy workloads, it is important to
design the partitioning such that writes are spread across tablets in order to
avoid overloading a single tablet. For workloads involving many short scans,
where the overhead of contacting remote servers dominates, performance can be
improved if all of the data for the scan is located in the same tablet.
Understanding these fundamental trade-offs is central to designing an effective
partition schema.</p>
</div>
<div id="no_default_partitioning" class="admonitionblock important">
<table>
<tr>
<td class="icon">
<i class="fa icon-important" title="Important"></i>
</td>
<td class="content">
<div class="title">No Default Partitioning</div>
Kudu does not provide a default partitioning strategy when creating tables. It
is recommended that new tables which are expected to have heavy read and write
workloads have at least as many tablets as tablet servers.
</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Kudu provides two types of partitioning: <a href="#range-partitioning">range
partitioning</a> and <a href="#hash-partitioning">hash partitioning</a>. Tables may also have
<a href="#multilevel-partitioning">multilevel partitioning</a>, which combines range and hash
partitioning, or multiple instances of hash partitioning.</p>
</div>
<div class="sect2">
<h3 id="range-partitioning"><a class="link" href="#range-partitioning">Range Partitioning</a></h3>
<div class="paragraph">
<p>Range partitioning distributes rows using a totally-ordered range partition key.
Each partition is assigned a contiguous segment of the range partition keyspace.
The key must be comprised of a subset of the primary key columns. If the range
partition columns match the primary key columns, then the range partition key of
a row will equal its primary key. In range partitioned tables without hash
partitioning, each range partition will correspond to exactly one tablet.</p>
</div>
<div class="paragraph">
<p>The initial set of range partitions is specified during table creation as a set
of partition bounds and split rows. For each bound, a range partition will be
created in the table. Each split will divide a range partition in two. If no
partition bounds are specified, then the table will default to a single
partition covering the entire key space (unbounded below and above). Range
partitions must always be non-overlapping, and split rows must fall within a
range partition.</p>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
see the <a href="#range-partitioning-example">Range Partitioning Example</a> for further discussion of range
partitioning.
</td>
</tr>
</table>
</div>
<div class="sect3">
<h4 id="range-partition-management"><a class="link" href="#range-partition-management">Range Partition Management</a></h4>
<div class="paragraph">
<p>Kudu allows range partitions to be dynamically added and removed from a table at
runtime, without affecting the availability of other partitions. Removing a
partition will delete the tablets belonging to the partition, as well as the
data contained in them. Subsequent inserts into the dropped partition will fail.
New partitions can be added, but they must not overlap with any existing range
partitions. Kudu allows dropping and adding any number of range partitions in a
single transactional alter table operation.</p>
</div>
<div class="paragraph">
<p>Dynamically adding and dropping range partitions is particularly useful for time
series use cases. As time goes on, range partitions can be added to cover
upcoming time ranges. For example, a table storing an event log could add a
month-wide partition just before the start of each month in order to hold the
upcoming events. Old range partitions can be dropped in order to efficiently
remove historical data, as necessary.</p>
</div>
</div>
</div>
<div class="sect2">
<h3 id="hash-partitioning"><a class="link" href="#hash-partitioning">Hash Partitioning</a></h3>
<div class="paragraph">
<p>Hash partitioning distributes rows by hash value into one of many buckets. In
single-level hash partitioned tables, each bucket will correspond to exactly
one tablet. The number of buckets is set during table creation. Typically the
primary key columns are used as the columns to hash, but as with range
partitioning, any subset of the primary key columns can be used.</p>
</div>
<div class="paragraph">
<p>Hash partitioning is an effective strategy when ordered access to the table is
not needed. Hash partitioning is effective for spreading writes randomly among
tablets, which helps mitigate hot-spotting and uneven tablet sizes.</p>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
see the <a href="#hash-partitioning-example">Hash Partitioning Example</a> for further discussion of hash
partitioning.
</td>
</tr>
</table>
</div>
</div>
<div class="sect2">
<h3 id="multilevel-partitioning"><a class="link" href="#multilevel-partitioning">Multilevel Partitioning</a></h3>
<div class="paragraph">
<p>Kudu allows a table to combine multiple levels of partitioning on a single
table. Zero or more hash partition levels can be combined with an optional range
partition level. The only additional constraint on multilevel partitioning
beyond the constraints of the individual partition types, is that multiple levels
of hash partitions must not hash the same columns.</p>
</div>
<div class="paragraph">
<p>When used correctly, multilevel partitioning can retain the benefits of the
individual partitioning types, while reducing the downsides of each. The total
number of tablets in a multilevel partitioned table is the product of the
number of partitions in each level.</p>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
see the <a href="#hash-range-partitioning-example">Hash and Range Partitioning Example</a> and the
<a href="#hash-hash-partitioning-example">Hash and Hash Partitioning Example</a> for further discussion of multilevel
partitioning.
</td>
</tr>
</table>
</div>
</div>
<div class="sect2">
<h3 id="partition-pruning"><a class="link" href="#partition-pruning">Partition Pruning</a></h3>
<div class="paragraph">
<p>Kudu scans will automatically skip scanning entire partitions when it can be
determined that the partition can be entirely filtered by the scan predicates.
To prune hash partitions, the scan must include equality predicates on every
hashed column. To prune range partitions, the scan must include equality or
range predicates on the range partitioned columns. Scans on multilevel
partitioned tables can take advantage of partition pruning on any of the levels
independently.</p>
</div>
</div>
<div class="sect2">
<h3 id="partitioning-examples"><a class="link" href="#partitioning-examples">Partitioning Examples</a></h3>
<div class="paragraph">
<p>To illustrate the factors and trade-offs associated with designing a partitioning
strategy for a table, we will walk through some different partitioning
scenarios. Consider the following table schema for storing machine metrics data
(using SQL syntax and date-formatted timestamps for clarity):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">CREATE TABLE metrics (
host STRING NOT NULL,
metric STRING NOT NULL,
time INT64 NOT NULL,
value DOUBLE NOT NULL,
PRIMARY KEY (host, metric, time)
);</code></pre>
</div>
</div>
<div class="sect3">
<h4 id="range-partitioning-example"><a class="link" href="#range-partitioning-example">Range Partitioning Example</a></h4>
<div class="paragraph">
<p>A natural way to partition the <code>metrics</code> table is to range partition on the
<code>time</code> column. Let&#8217;s assume that we want to have a partition per year, and the
table will hold data for 2014, 2015, and 2016. There are at least two ways that
the table could be partitioned: with unbounded range partitions, or with bounded
range partitions.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="./images/range-partitioning-example.png" alt="Range Partitioning by `time`">
</div>
</div>
<div class="paragraph">
<p>The image above shows the two ways the <code>metrics</code> table can be range partitioned
on the <code>time</code> column. In the first example (in blue), the default range
partition bounds are used, with splits at <code>2015-01-01</code> and <code>2016-01-01</code>. This
results in three tablets: the first containing values before 2015, the second
containing values in the year 2015, and the third containing values after 2016.
The second example (in green) uses a range partition bound of <code>[(2014-01-01),
(2017-01-01)]</code>, and splits at <code>2015-01-01</code> and <code>2016-01-01</code>. The second example
could have equivalently been expressed through range partition bounds of
<code>[(2014-01-01), (2015-01-01)]</code>, <code>[(2015-01-01), (2016-01-01)]</code>, and
<code>[(2016-01-01), (2017-01-01)]</code>, with no splits. The first example has unbounded
lower and upper range partitions, while the second example includes bounds.</p>
</div>
<div class="paragraph">
<p>Each of the range partition examples above allows time-bounded scans to prune
partitions falling outside of the scan&#8217;s time bound. This can greatly improve
performance when there are many partitions. When writing, both examples suffer
from potential hot-spotting issues. Because metrics tend to always be written
at the current time, most writes will go into a single range partition.</p>
</div>
<div class="paragraph">
<p>The second example is more flexible than the first, because it allows range
partitions for future years to be added to the table. In the first example, all
writes for times after <code>2016-01-01</code> will fall into the last partition, so the
partition may eventually become too large for a single tablet server to handle.</p>
</div>
</div>
<div class="sect3">
<h4 id="hash-partitioning-example"><a class="link" href="#hash-partitioning-example">Hash Partitioning Example</a></h4>
<div class="paragraph">
<p>Another way of partitioning the <code>metrics</code> table is to hash partition on the
<code>host</code> and <code>metric</code> columns.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="./images/hash-partitioning-example.png" alt="Hash Partitioning by `host` and `metric`">
</div>
</div>
<div class="paragraph">
<p>In the example above, the <code>metrics</code> table is hash partitioned on the <code>host</code> and
<code>metric</code> columns into four buckets. Unlike the range partitioning example
earlier, this partitioning strategy will spread writes over all tablets in the
table evenly, which helps overall write throughput. Scans over a specific host
and metric can take advantage of partition pruning by specifying equality
predicates, reducing the number of scanned tablets to one. One issue to be
careful of with a pure hash partitioning strategy, is that tablets could grow
indefinitely as more and more data is inserted into the table. Eventually
tablets will become too big for an individual tablet server to hold.</p>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
Although these examples number the tablets, in reality tablets are only
given UUID identifiers. There is no natural ordering among the tablets in a hash
partitioned table.
</td>
</tr>
</table>
</div>
</div>
<div class="sect3">
<h4 id="hash-range-partitioning-example"><a class="link" href="#hash-range-partitioning-example">Hash and Range Partitioning Example</a></h4>
<div class="paragraph">
<p>The previous examples showed how the <code>metrics</code> table could be range partitioned
on the <code>time</code> column, or hash partitioned on the <code>host</code> and <code>metric</code> columns.
These strategies have associated strength and weaknesses:</p>
</div>
<table class="tableblock frame-all grid-all stretch">
<caption class="title">Table 2. Partitioning Strategies</caption>
<colgroup>
<col style="width: 25%;">
<col style="width: 25%;">
<col style="width: 25%;">
<col style="width: 25%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">Strategy</th>
<th class="tableblock halign-left valign-top">Writes</th>
<th class="tableblock halign-left valign-top">Reads</th>
<th class="tableblock halign-left valign-top">Tablet Growth</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>range(time)</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">✗ - all writes go to latest partition</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">✓ - time-bounded scans can be pruned</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">✓ - new tablets can be added for future time periods</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>hash(host, metric)</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">✓ - writes are spread evenly among tablets</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">✓ - scans on specific hosts and metrics can be pruned</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">✗ - tablets could grow too large</p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>Hash partitioning is good at maximizing write throughput, while range
partitioning avoids issues of unbounded tablet growth. Both strategies can take
advantage of partition pruning to optimize scans in different scenarios. Using
multilevel partitioning, it is possible to combine the two strategies in order
to gain the benefits of both, while minimizing the drawbacks of each.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="./images/hash-range-partitioning-example.png" alt="Hash and Range Partitioning">
</div>
</div>
<div class="paragraph">
<p>In the example above, range partitioning on the <code>time</code> column is combined with
hash partitioning on the <code>host</code> and <code>metric</code> columns. This strategy can be
thought of as having two dimensions of partitioning: one for the hash level and
one for the range level. Writes into this table at the current time will be
parallelized up to the number of hash buckets, in this case 4. Reads can take
advantage of time bound <strong>and</strong> specific host and metric predicates to prune
partitions. New range partitions can be added, which results in creating 4
additional tablets (as if a new column were added to the diagram).</p>
</div>
</div>
<div class="sect3">
<h4 id="hash-hash-partitioning-example"><a class="link" href="#hash-hash-partitioning-example">Hash and Hash Partitioning Example</a></h4>
<div class="paragraph">
<p>Kudu can support any number of hash partitioning levels in the same table, as
long as the levels have no hashed columns in common.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="./images/hash-hash-partitioning-example.png" alt="Hash and Hash Partitioning">
</div>
</div>
<div class="paragraph">
<p>In the example above, the table is hash partitioned on <code>host</code> into 4 buckets,
and hash partitioned on <code>metric</code> into 3 buckets, resulting in 12 tablets.
Although writes will tend to be spread among all tablets when using this
strategy, it is slightly more prone to hot-spotting than when hash partitioning
over multiple independent columns, since all values for an individual host or
metric will always belong to a single tablet. Scans can take advantage of
equality predicates on the <code>host</code> and <code>metric</code> columns separately to prune
partitions.</p>
</div>
<div class="paragraph">
<p>Multiple levels of hash partitioning can also be combined with range
partitioning, which logically adds another dimension of partitioning.</p>
</div>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="alter-schema"><a class="link" href="#alter-schema">Schema Alterations</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>You can alter a table&#8217;s schema in the following ways:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Rename the table</p>
</li>
<li>
<p>Rename primary key columns</p>
</li>
<li>
<p>Rename, add, or drop non-primary key columns</p>
</li>
<li>
<p>Add and drop range partitions</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Multiple alteration steps can be combined in a single transactional operation.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="known-limitations"><a class="link" href="#known-limitations">Known Limitations</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>Kudu currently has some known limitations that may factor into schema design.</p>
</div>
<div class="dlist">
<dl>
<dt class="hdlist1">Number of Columns</dt>
<dd>
<p>By default, Kudu will not permit the creation of tables with
more than 300 columns. We recommend schema designs that use fewer columns for best
performance.</p>
</dd>
<dt class="hdlist1">Size of Cells</dt>
<dd>
<p>No individual cell may be larger than 64KB before encoding or
compression. The cells making up a composite key are limited to a total of 16KB
after the internal composite-key encoding done by Kudu. Inserting rows not
conforming to these limitations will result in errors being returned to the
client.</p>
</dd>
<dt class="hdlist1">Size of Rows</dt>
<dd>
<p>Although individual cells may be up to 64KB, and Kudu supports up to
300 columns, it is recommended that no single row be larger than a few hundred KB.</p>
</dd>
<dt class="hdlist1">Valid Identifiers</dt>
<dd>
<p>Identifiers such as table and column names must be valid UTF-8
sequences and no longer than 256 bytes.</p>
</dd>
<dt class="hdlist1">Immutable Primary Keys</dt>
<dd>
<p>Kudu does not allow you to update the primary key
columns of a row.</p>
</dd>
<dt class="hdlist1">Non-alterable Primary Key</dt>
<dd>
<p>Kudu does not allow you to alter the primary key
columns after table creation.</p>
</dd>
<dt class="hdlist1">Non-alterable Partitioning</dt>
<dd>
<p>Kudu does not allow you to change how a table is
partitioned after creation, with the exception of adding or dropping range
partitions.</p>
</dd>
<dt class="hdlist1">Non-alterable Column Types</dt>
<dd>
<p>Kudu does not allow the type of a column to be
altered.</p>
</dd>
<dt class="hdlist1">Partition Splitting</dt>
<dd>
<p>Partitions cannot be split or merged after table creation.</p>
</dd>
<dt class="hdlist1">Deleted row disk space is not reclaimed</dt>
<dd>
<p>The disk space occupied by a deleted
row is only reclaimable via compaction, and only when the deletion&#8217;s age
exceeds the "tablet history maximum age" (controlled by the
<code>--tablet_history_max_age_sec</code> flag). Furthermore, Kudu currently only schedules
compactions in order to improve read/write performance; a tablet will never be
compacted purely to reclaim disk space. As such, range partitioning should be
used when it is expected that large swaths of rows will be discarded. With range
partitioning, individual partitions may be dropped to discard data and reclaim
disk space. See <a href="https://issues.apache.org/jira/browse/KUDU-1625">KUDU-1625</a>
for details.</p>
</dd>
</dl>
</div>
</div>
</div>
</div>
<div class="col-md-3">
<div id="toc" data-spy="affix" data-offset-top="70">
<ul>
<li>
<a href="index.html">Introducing Kudu</a>
</li>
<li>
<a href="release_notes.html">Kudu Release Notes</a>
</li>
<li>
<a href="quickstart.html">Quickstart Guide</a>
</li>
<li>
<a href="installation.html">Installation Guide</a>
</li>
<li>
<a href="configuration.html">Configuring Kudu</a>
</li>
<li>
<a href="hive_metastore.html">Using the Hive Metastore with Kudu</a>
</li>
<li>
<a href="kudu_impala_integration.html">Using Impala with Kudu</a>
</li>
<li>
<a href="administration.html">Administering Kudu</a>
</li>
<li>
<a href="troubleshooting.html">Troubleshooting Kudu</a>
</li>
<li>
<a href="developing.html">Developing Applications with Kudu</a>
</li>
<li>
<span class="active-toc">Kudu Schema Design</span>
<ul class="sectlevel1">
<li><a href="#schema_design">Overview</a></li>
<li><a href="#_the_perfect_schema">The Perfect Schema</a></li>
<li><a href="#column-design">Column Design</a>
<ul class="sectlevel2">
<li><a href="#decimal">Decimal Type</a></li>
<li><a href="#varchar">Varchar Type</a></li>
<li><a href="#encoding">Column Encoding</a></li>
<li><a href="#compression">Column Compression</a></li>
</ul>
</li>
<li><a href="#primary-keys">Primary Key Design</a>
<ul class="sectlevel2">
<li><a href="#indexing">Primary Key Index</a></li>
<li><a href="#Backfilling">Considerations for Backfill Inserts</a></li>
</ul>
</li>
<li><a href="#partitioning">Partitioning</a>
<ul class="sectlevel2">
<li><a href="#range-partitioning">Range Partitioning</a>
<ul class="sectlevel3">
<li><a href="#range-partition-management">Range Partition Management</a></li>
</ul>
</li>
<li><a href="#hash-partitioning">Hash Partitioning</a></li>
<li><a href="#multilevel-partitioning">Multilevel Partitioning</a></li>
<li><a href="#partition-pruning">Partition Pruning</a></li>
<li><a href="#partitioning-examples">Partitioning Examples</a>
<ul class="sectlevel3">
<li><a href="#range-partitioning-example">Range Partitioning Example</a></li>
<li><a href="#hash-partitioning-example">Hash Partitioning Example</a></li>
<li><a href="#hash-range-partitioning-example">Hash and Range Partitioning Example</a></li>
<li><a href="#hash-hash-partitioning-example">Hash and Hash Partitioning Example</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#alter-schema">Schema Alterations</a></li>
<li><a href="#known-limitations">Known Limitations</a></li>
</ul>
</li>
<li>
<a href="scaling_guide.html">Kudu Scaling Guide</a>
</li>
<li>
<a href="security.html">Kudu Security</a>
</li>
<li>
<a href="transaction_semantics.html">Kudu Transaction Semantics</a>
</li>
<li>
<a href="background_tasks.html">Background Maintenance Tasks</a>
</li>
<li>
<a href="configuration_reference.html">Kudu Configuration Reference</a>
</li>
<li>
<a href="command_line_tools_reference.html">Kudu Command Line Tools Reference</a>
</li>
<li>
<a href="metrics_reference.html">Kudu Metrics Reference</a>
</li>
<li>
<a href="known_issues.html">Known Issues and Limitations</a>
</li>
<li>
<a href="contributing.html">Contributing to Kudu</a>
</li>
<li>
<a href="export_control.html">Export Control Notice</a>
</li>
</ul>
</div>
</div>
</div>
</div>
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p class="small">
Copyright &copy; 2020 The Apache Software Foundation. Last updated 2021-05-20 10:46:53 -0700
</p>
<p class="small">
Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu
project logo are either registered trademarks or trademarks of The
Apache Software Foundation in the United States and other countries.
</p>
</div>
<div class="col-md-3">
<a class="pull-right" href="https://www.apache.org/events/current-event.html">
<img src="https://www.apache.org/events/current-event-234x60.png"/>
</a>
</div>
</div>
</footer>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script>
// Try to detect touch-screen devices. Note: Many laptops have touch screens.
$(document).ready(function() {
if ("ontouchstart" in document.documentElement) {
$(document.documentElement).addClass("touch");
} else {
$(document.documentElement).addClass("no-touch");
}
});
</script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"
integrity="sha384-0mSbJDEHialfmuBBQP6A4Qrprq5OVfW37PRR3j5ELqxss1yVqOtnepnHVP9aJ7xS"
crossorigin="anonymous"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68448017-1', 'auto');
ga('send', 'pageview');
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.js"></script>
<script>
anchors.options = {
placement: 'right',
visible: 'touch',
};
anchors.add();
</script>
</body>
</html>