blob: e6a7f44040bcc76c160b92b1224eb981499f151d [file] [log] [blame]
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
[[schema_design]]
= Apache Kudu (incubating) Schema Design
:author: Kudu Team
:imagesdir: ./images
:icons: font
:toc: left
:toclevels: 3
:doctype: book
:backend: html5
:sectlinks:
:experimental:
Kudu tables have a structured data model similar to tables in a traditional
RDBMS. Schema design is critical for achieving the best performance and operational
stability from Kudu. Every workload is unique, and there is no single schema design
that is best for every table. This document outlines effective schema design
philosophies for Kudu, paying particular attention to where they differ from
approaches used for traditional RDBMS schemas.
At a high level, there are three concerns in Kudu schema design:
<<column-design,column design>>, <<primary-keys,primary keys>>, and
<<data-distribution,data distribution>>. Of these, only data distribution will
be a new concept for those familiar with traditional relational databases. The
next sections discuss <<alter-schema,altering the schema>> of an existing table,
and <<known-limitations,known limitations>> with regard to schema design.
[[column-design]]
== Column Design
A Kudu Table consists of one or more columns, each with a predefined type.
Columns that are not part of the primary key may optionally be nullable.
Supported column types include:
* boolean
* 8-bit signed integer
* 16-bit signed integer
* 32-bit signed integer
* 64-bit signed integer
* timestamp
* single-precision (32-bit) IEEE-754 floating-point number
* double-precision (64-bit) IEEE-754 floating-point number
* UTF-8 encoded string
* binary
Kudu takes advantage of strongly-typed columns and a columnar on-disk storage
format to provide efficient encoding and serialization. To make the most of these
features, columns must be specified as the appropriate type, rather than
simulating a 'schemaless' table using string or binary columns for data which
may otherwise be structured. In addition to encoding, Kudu optionally allows
compression to be specified on a per-column basis.
[[encoding]]
=== Column Encoding
Each column in a Kudu table can be created with an encoding, based on the type
of the column. Columns use plain encoding by default.
.Encoding Types
[options="header"]
|===
| Column Type | Encoding
| integer, timestamp | plain, bitshuffle, run length
| float | plain, bitshuffle
| bool | plain, dictionary, run length
| string, binary | plain, prefix, dictionary
|===
[[plain]]
Plain Encoding:: Data is stored in its natural format. For example, `int32` values
are stored as fixed-size 32-bit little-endian integers.
[[bitshuffle]]
Bitshuffle Encoding:: Data is rearranged to store the most significant bit of
every value, followed by the second most significant bit of every value, and so
on. Finally, the result is LZ4 compressed. Bitshuffle encoding is a good choice for
columns that have many repeated values, or values that change by small amounts
when sorted by primary key. The
https://github.com/kiyo-masui/bitshuffle[bitshuffle] project has a good
overview of performance and use cases.
[[run-length]]
Run Length Encoding:: _Runs_ (consecutive repeated values) are compressed in a
column by storing only the value and the count. Run length encoding is effective
for columns with many consecutive repeated values when sorted by primary key.
[[dictionary]]
Dictionary Encoding:: A dictionary of unique values is built, and each column value
is encoded as its corresponding index in the dictionary. Dictionary encoding
is effective for columns with low cardinality. If the column values of a given row set
are unable to be compressed because the number of unique values is too high, Kudu will
transparently fall back to plain encoding for that row set. This is evaluated during
flush.
[[prefix]]
Prefix Encoding:: Common prefixes are compressed in consecutive column values. Prefix
encoding can be effective for values that share common prefixes, or the first
column of the primary key, since rows are sorted by primary key within tablets.
[[compression]]
=== Column Compression
Kudu allows per-column compression using LZ4, `snappy`, or `zlib` compression
codecs. By default, columns are stored uncompressed. Consider using compression
if reducing storage space is more important than raw scan performance.
Every data set will compress differently, but in general LZ4 has the least effect on
performance, while `zlib` will compress to the smallest data sizes.
Bitshuffle-encoded columns are inherently compressed using LZ4, so it is not
typically beneficial to apply additional compression on top of this encoding.
[[primary-keys]]
== Primary Keys
Each Kudu table must declare a primary key comprised of one or more columns.
Primary key columns must be non-nullable, and may not be a boolean or
floating-point type. Every row in a table must have a unique set of values for
its primary key columns. As with a traditional RDBMS, primary key
selection is critical to ensuring performant database operations.
Unlike an RDBMS, Kudu does not provide an auto-incrementing column feature, so
the application must always provide the full primary key during insert or
ingestion. In addition, Kudu does not allow the primary key values of a row to
be updated.
Within a tablet, rows are stored sorted lexicographically by primary key. Advanced
schema designs can take advantage of this ordering to achieve good distribution of
data among tablets, while retaining consistent ordering in intra-tablet scans. See
<<data-distribution>> for more information.
[[data-distribution]]
== Data Distribution
Kudu tables, unlike traditional relational tables, are partitioned into tablets
and distributed across many tablet servers. A row always belongs to a single
tablet (and its replicas). The method of assigning rows to tablets is specified
in a configurable _partition schema_ for each table, during table creation.
Choosing a data distribution strategy requires you to understand the data model and
expected workload of a table. For write-heavy workloads, it is important to
design the distribution such that writes are spread across tablets in order to
avoid overloading a single tablet. For workloads involving many short scans, performance
can be improved if all of the data for the scan is located in the same
tablet. Understanding these fundamental trade-offs is central to designing an effective
partition schema.
[[no_default_partitioning]]
[IMPORTANT]
.No Default Partitioning
===
Kudu does not provide a default partitioning strategy when creating tables. It
is strongly recommended to ensure that new tables have at least as many tablets
as tablet servers (but Kudu can support many tablets per tablet server).
===
Kudu provides two types of partition schema: <<range-partitioning, range partitioning>> and
<<hash-bucketing,hash bucketing>>. These schema types can be <<hash-and-range, used
together>> or independently. Kudu does not yet allow tablets to be split after
creation, so you must design your partition schema ahead of time to ensure that
a sufficient number of tablets are created.
[[range-partitioning]]
=== Range Partitioning
With range partitioning, rows are distributed into tablets using a totally-ordered
distribution key. Each tablet is assigned a contiguous segment of the table's
distribution keyspace. Tables may be range partitioned on any subset of the
primary key columns.
During table creation, tablet boundaries are specified as a sequence of _split
rows_. Consider the following table schema (using SQL syntax for clarity):
[source,sql]
----
CREATE TABLE customers (last_name STRING NOT NULL,
first_name STRING NOT NULL,
order_count INT32)
PRIMARY KEY (last_name, first_name)
DISTRIBUTE BY RANGE (last_name, first_name);
----
Specifying the split rows as `\(("b", ""), ("c", ""), ("d", ""), .., ("z", ""))`
(25 split rows total) will result in the creation of 26 tablets, with each
tablet containing a range of customer surnames all beginning with a given letter.
This is an effective partition schema for a workload where customers are inserted
and updated uniformly by last name, and scans are typically performed over a range
of surnames.
It may make sense to partition a table by range using only a subset of the
primary key columns, or with a different ordering than the primary key. For
instance, you can change the above example to specify that the range partition
should only include the `last_name` column. In that case, Kudu would guarantee that all
customers with the same last name would fall into the same tablet, regardless of
the provided split rows.
[[hash-bucketing]]
=== Hash Bucketing
Hash bucketing distributes rows by hash value into one of many buckets. Each
tablet is responsible for the rows falling into a single bucket. The number of
buckets (and therefore tablets), is specified during table creation. Typically,
all of the primary key columns are used as the columns to hash, but as with range
partitioning, any subset of the primary key columns can be used.
Hash partitioning is an effective strategy to increase the amount of parallelism
for workloads that would otherwise skew writes into a small number of tablets.
Consider the following table schema.
[source,sql]
----
CREATE TABLE metrics (
host STRING NOT NULL,
metric STRING,
time TIMESTAMP NOT NULL,
measurement DOUBLE,
PRIMARY KEY (time, metric, host),
)
----
If you use range partitioning over the primary key columns, inserts will
tend to only go to the tablet covering the current time, which limits the
maximum write throughput to the throughput of a single tablet. If you use hash
partitioning, you can guarantee a number of parallel writes equal to the number
of buckets specified when defining the partition schema. The trade-off is that a
scan over a single time range now must touch each of these tablets, instead of
(possibly) a single tablet. Hash bucketing can be an effective tool for mitigating
other types of write skew as well, such as monotonically increasing values.
As an advanced optimization, you can create a table with more than one
hash bucket component, as long as the column sets included in each are disjoint,
and all hashed columns are part of the primary key. The total number of tablets
created will be the product of the hash bucket counts. For example, the above
`metrics` table could be created with two hash bucket components, one over the
`time` column with 4 buckets, and one over the `metric` and `host` columns with
8 buckets. The total number of tablets will be 32. The advantage of using two
separate hash bucket components is that scans which specify equality constraints
on the `metric` and `host` columns will be able to skip 7/8 of the total
tablets, leaving a total of just 4 tablets to scan.
[[hash-and-range]]
=== Hash Bucketing and Range Partitioning
Hash bucketing can be combined with range partitioning. Adding hash bucketing to
a range partitioned table has the effect of parallelizing operations that would
otherwise operate sequentially over the range. The total number of tablets is
the product of the number of hash buckets and the number of split rows plus one.
[[alter-schema]]
== Schema Alterations
You can alter a table's schema in the following ways:
- Rename the table
- Rename, add, or drop columns
- Rename (but not drop) primary key columns
You cannot modify the partition schema after table creation.
[[known-limitations]]
== Known Limitations
Kudu currently has some known limitations that may factor into schema design:
Immutable Primary Keys:: Kudu does not allow you to update the primary key of a
row after insertion.
Non-alterable Primary Key:: Kudu does not allow you to alter the primary key
columns after table creation.
Non-alterable Partition Schema:: Kudu does not allow you to alter the
partition schema after table creation.
Partition Pruning:: When tables use hash buckets, the Java client does not yet
use scan predicates to prune tablets for scans over these tables. In the future,
specifying an equality predicate on all columns in the hash bucket component
will limit the scan to only the tablets corresponding to the hash bucket.
Tablet Splitting:: You currently cannot split or merge tablets after table
creation. You must create the appropriate number of tablets in the
partition schema at table creation. As a workaround, you can copy the contents
of one table to another by using a `CREATE TABLE AS SELECT` statement or creating
an empty table and using an `INSERT` query with `SELECT` in the predicate to
populate the new table.