// Copyright 2015 Cloudera, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

[[schema_design]]
= Kudu Schema Design
:author: Kudu Team
:imagesdir: ./images
:icons: font
:toc: left
:toclevels: 3
:doctype: book
:backend: html5
:sectlinks:
:experimental:

Kudu tables have a structured data model similar to tables in a traditional
RDBMS. Like relational databases, schema design is critical for achieving the
best performance and operational stability from Kudu. Every workload is unique,
and there is no single schema design that is best for every table. This document
outlines effective schema design philosophies for Kudu, paying particular
attention to where they differ from approaches used for traditional RDBMS
schemas. At a high level, there are three concerns in Kudu schema design:
<<column-design,column design>>, <<primary-keys,primary keys>>, and
<<data-distribution,data distribution>>. Of these, only data distribution will
be a new concept for those familiar with traditional relational databases. The
next sections discuss <<alter-schema,altering the schema>> of an existing table,
and <<known-limitations,known limitations>> with regard to schema design.

[[column-design]]

A Kudu Table consists of one or more columns, each with a predefined type.
Columns that are not part of the primary key may optionally be nullable.
Supported column types include:

* boolean
* 8 bit signed integer
* 16 bit signed integer
* 32 bit signed integer
* 64 bit signed integer
* timestamp
* single-precision (32 bit) IEEE-754 floating-point number
* double-precision (64 bit) IEEE-754 floating-point number
* UTF-8 encoded string
* binary

Kudu takes advantage of strongly-typed columns and a columnar on-disk storage
format to provide efficient type-specific compression, encoding, and
serialization. Dictionary encoding, integer packing, delta encoding, and other
similar techniques are used to compress values more efficiently than
general-purpose compression algorithms. In order to take advantage of these
features. However, the data must be specified as the appropriate type (as
opposed to simulating a 'schemaless' table using string or binary columns for
data which may otherwise be structured).

// TODO: discuss how to choose / when to choose a specific encoding and block
// size for a column.

[[primary-key]]

Each Kudu table must declare a primary key comprised of one or more columns.
Primary key columns must be non-nullable, and may not be a boolean or
floating-point type. Every row in a table must have a unique set of values for
its primary key columns. As with traditional relational databases, primary key
selection is critical to ensuring performant database operations.

Unlike an RDBMS, Kudu does not provide an auto-incrementing column feature, so
the application must always provide the full primary key during insert or
ingestion. Aditionally, Kudu does not allow the primary key values of a row to
be updated.

Within a tablet, rows are stored sorted in primary key order. Advanced schema
designs can take advantage of this ordering to achieve good distribution of
data among tablets (through the techniques discussed in <<data-distribution>>),
while retaining a consistent ordering in intra-tablet scans.

[[data-distribution]]
== Data Distribution

Kudu tables, unlike traditional relational tables, are partitioned into tablets
and distributed across many tablet servers. A row always belongs to a single
tablet (and its replicas). The method of assigning rows to tablets is specified
in a configurable partition schema for each table.

Choosing a data distribution strategy requires understanding the data model and
expected workload of a table. For write-heavy workloads, it is important to
design the distribution such that writes are spread across tablets in order to
avoid overloading a single tablet. For workloads involving many short scans, it
can improve performance if all of the data for the scan is located on the same
tablet. These fundamental tradeoffs are central to designing an effective
partition schema.

Kudu provides two types of partition schema which can be <<hash-and-range, used
together>>, or independently: <<range-partitioning, range partitioning>> and
<<hash-bucketing,hash bucketing>>. Kudu does not yet allow tablets to be split
after creation, so you must design your partition schema ahead of time to ensure
a sufficient number of tablets are created.

[[range-partitioning]
=== Range Partitioning

Range partitioning distributes rows into tablets using a totally-ordered
distribution key. Every tablet is assigned a contiguous segment of the table's
distribution keyspace. By default, the distribution key uses the columns of the
primary key, but it may be configured to be any subset of the primary key
columns.

During table creation, tablets boundaries are specified as a sequence of split
rows. For example, in the following table schema (using SQL syntax for clarity):

[source,sql]
----
CREATE TABLE customers (
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  order_count INT32,
  PRIMARY KEY (last_name, first_name),
)
----

Specifying the split rows as `\(("b", ""), ("c", ""), ("d", ""), .., ("z", ""))`
(25 split rows total) will result in the creation of 26 tablets, with each
tablet responsible for a range of customer surnames all beggining with the same
first character. This is an effective partition schema for a workload where
customers are inserted and updated uniformly by last name, and scans are
typically over a range of surnames.

It may make sense to partition a table by range using only a subset of the
primary key columns, or with a different ordering than the primary key. For
instance, you can change the above example to specify that the range partition
should only include the `last_name` column, then Kudu would guarantee that all
customers with the same last name would fall into the same tablet, regardless of
the provided split rows.

[[hash-bucketing]
=== Hash Bucketing

Hash bucketing distributes rows by hash value into one of many buckets. Each
tablet is responsible for the rows falling into a single bucket. The number of
buckets (and therefore tablets), is specified during table creation. Typically,
the primary key columns are used as the columns to hash, but as with range
partitioning, any subset of the primary key columns can be used.

Hash partitioning is an effective strategy to increase the amount of parallelism
for workloads that would otherwise skew writes into a small number of tablets.
For example, consider the following table schema:

[source,sql]
----
CREATE TABLE metrics (
  host STRING NOT NULL,
  metric STRING,
  time TIMESTAMP NOT NULL,
  measurement DOUBLE,
  PRIMARY KEY (time, metric, host),
)
----

If the default range partitioning over the primary key columns is used, then
inserts tend to only go to the tablet covering the current time, which limits
maximum write throughput to the throughput of a single tablet. If you use hash
partitioning, you can guarantee a number of parallel writes equal to the number
of buckets specified when defining the partition schema. The tradeoff is that a
scan over a single time range now must touch each of these tablets, instead of
(possibly) a single tablet. Hash bucketing can be an effective tool for other
types of write skew as well, such as monotonically increasing values.

As an advanced optimization, Kudu allows tables to be created with more than one
hash bucket component, as long as the column sets included in each are disjoint,
and all hashed columns are part of the primary key. The total number of tablets
created will be the product of the hash bucket counts. As an example, the above
`metrics` table could be created with two hash bucket components, one over the
`time` column with 4 buckets, and one over the `metric` and `host` columns with
8 buckets. The total number of tablets will be 32. The advantage of using two
seperate hash bucket components is that scans which specify equality constraints
on the `metric` and `host` columns will be able to skip 7/8 of the total
tablets, leaving a total of just 4 tablets to scan (this optimization is not yet
implemented, see <<known-limitations,known limitations>> for details).

[[hash-and-range]]
=== Hash Bucketing and Range Partitioning

Hash bucketing can be combined with range partitioning. Adding hash bucketing to
a range partitioned table has the effect of parallelizing operations that would
otherwise operate sequentially over the range. The total number of tablets is
the product of the number of hash buckets and the number of split rows plus one.

[[alter-schema]]
== Schema Alterations

You can rename Kudu tables, as well as rename, add, or drop columns in an
existing table. You can rename primary key columns, but you cannot drop them,
and new columns cannot be added to the primary key after table creation. You
cannot modify the partition schema after table creation.

[[known-limitations]]
== Known Limitations

Kudu currently has some known limitations that may factor into schema design:

* *Immutable Primary Keys* Kudu does allow you to update the primary key of a
  row after insertion.

* *Non-alterable Primary Key* Kudu does not allow you to alter the primary key
  columns after table creation.

* *Non-alterable Partition Schema* Kudu does not allow you to alter the
  partition schema after table creation.

* *Partition Pruning* The Kudu Java and C++ clients do not yet use scan
  predicates to prune tablets for scans over tables with hash buckets. In the
  future, specifying an equality predicate on all columns in the hash bucket
  component will limit the scan to only the tablets corresponding to the hash
  bucket.

* *Tablet Splitting* You currently cannot split or merge tablets after table
  creation. Instead, you must create the appropriate number of tablets in the
  partition schema at table creation.
