| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| [[schema_design]] |
| = Apache Kudu (incubating) Schema Design |
| :author: Kudu Team |
| :imagesdir: ./images |
| :icons: font |
| :toc: left |
| :toclevels: 3 |
| :doctype: book |
| :backend: html5 |
| :sectlinks: |
| :experimental: |
| |
| Kudu tables have a structured data model similar to tables in a traditional |
| RDBMS. Schema design is critical for achieving the best performance and operational |
| stability from Kudu. Every workload is unique, and there is no single schema design |
| that is best for every table. This document outlines effective schema design |
| philosophies for Kudu, paying particular attention to where they differ from |
| approaches used for traditional RDBMS schemas. |
| |
| At a high level, there are three concerns in Kudu schema design: |
| <<column-design,column design>>, <<primary-keys,primary keys>>, and |
| <<data-distribution,data distribution>>. Of these, only data distribution will |
| be a new concept for those familiar with traditional relational databases. The |
| next sections discuss <<alter-schema,altering the schema>> of an existing table, |
| and <<known-limitations,known limitations>> with regard to schema design. |
| |
| [[column-design]] |
| == Column Design |
| |
| A Kudu Table consists of one or more columns, each with a predefined type. |
| Columns that are not part of the primary key may optionally be nullable. |
| Supported column types include: |
| |
| * boolean |
| * 8-bit signed integer |
| * 16-bit signed integer |
| * 32-bit signed integer |
| * 64-bit signed integer |
| * timestamp |
| * single-precision (32-bit) IEEE-754 floating-point number |
| * double-precision (64-bit) IEEE-754 floating-point number |
| * UTF-8 encoded string |
| * binary |
| |
| Kudu takes advantage of strongly-typed columns and a columnar on-disk storage |
| format to provide efficient encoding and serialization. To make the most of these |
| features, columns must be specified as the appropriate type, rather than |
| simulating a 'schemaless' table using string or binary columns for data which |
| may otherwise be structured. In addition to encoding, Kudu optionally allows |
| compression to be specified on a per-column basis. |
| |
| [[encoding]] |
| === Column Encoding |
| |
| Each column in a Kudu table can be created with an encoding, based on the type |
| of the column. Columns use plain encoding by default. |
| |
| .Encoding Types |
| [options="header"] |
| |=== |
| | Column Type | Encoding |
| | integer, timestamp | plain, bitshuffle, run length |
| | float | plain, bitshuffle |
| | bool | plain, dictionary, run length |
| | string, binary | plain, prefix, dictionary |
| |=== |
| |
| [[plain]] |
| Plain Encoding:: Data is stored in its natural format. For example, `int32` values |
| are stored as fixed-size 32-bit little-endian integers. |
| |
| [[bitshuffle]] |
| Bitshuffle Encoding:: Data is rearranged to store the most significant bit of |
| every value, followed by the second most significant bit of every value, and so |
| on. Finally, the result is LZ4 compressed. Bitshuffle encoding is a good choice for |
| columns that have many repeated values, or values that change by small amounts |
| when sorted by primary key. The |
| https://github.com/kiyo-masui/bitshuffle[bitshuffle] project has a good |
| overview of performance and use cases. |
| |
| [[run-length]] |
| Run Length Encoding:: _Runs_ (consecutive repeated values) are compressed in a |
| column by storing only the value and the count. Run length encoding is effective |
| for columns with many consecutive repeated values when sorted by primary key. |
| |
| [[dictionary]] |
| Dictionary Encoding:: A dictionary of unique values is built, and each column value |
| is encoded as its corresponding index in the dictionary. Dictionary encoding |
| is effective for columns with low cardinality. If the column values of a given row set |
| are unable to be compressed because the number of unique values is too high, Kudu will |
| transparently fall back to plain encoding for that row set. This is evaluated during |
| flush. |
| |
| [[prefix]] |
| Prefix Encoding:: Common prefixes are compressed in consecutive column values. Prefix |
| encoding can be effective for values that share common prefixes, or the first |
| column of the primary key, since rows are sorted by primary key within tablets. |
| |
| [[compression]] |
| === Column Compression |
| |
| Kudu allows per-column compression using LZ4, `snappy`, or `zlib` compression |
| codecs. By default, columns are stored uncompressed. Consider using compression |
| if reducing storage space is more important than raw scan performance. |
| |
| Every data set will compress differently, but in general LZ4 has the least effect on |
| performance, while `zlib` will compress to the smallest data sizes. |
| Bitshuffle-encoded columns are inherently compressed using LZ4, so it is not |
| typically beneficial to apply additional compression on top of this encoding. |
| |
| [[primary-keys]] |
| == Primary Keys |
| |
| Each Kudu table must declare a primary key comprised of one or more columns. |
| Primary key columns must be non-nullable, and may not be a boolean or |
| floating-point type. Every row in a table must have a unique set of values for |
| its primary key columns. As with a traditional RDBMS, primary key |
| selection is critical to ensuring performant database operations. |
| |
| Unlike an RDBMS, Kudu does not provide an auto-incrementing column feature, so |
| the application must always provide the full primary key during insert or |
| ingestion. In addition, Kudu does not allow the primary key values of a row to |
| be updated. |
| |
| Within a tablet, rows are stored sorted lexicographically by primary key. Advanced |
| schema designs can take advantage of this ordering to achieve good distribution of |
| data among tablets, while retaining consistent ordering in intra-tablet scans. See |
| <<data-distribution>> for more information. |
| |
| [[data-distribution]] |
| == Data Distribution |
| |
| Kudu tables, unlike traditional relational tables, are partitioned into tablets |
| and distributed across many tablet servers. A row always belongs to a single |
| tablet (and its replicas). The method of assigning rows to tablets is specified |
| in a configurable _partition schema_ for each table, during table creation. |
| |
| Choosing a data distribution strategy requires you to understand the data model and |
| expected workload of a table. For write-heavy workloads, it is important to |
| design the distribution such that writes are spread across tablets in order to |
| avoid overloading a single tablet. For workloads involving many short scans, performance |
| can be improved if all of the data for the scan is located in the same |
| tablet. Understanding these fundamental trade-offs is central to designing an effective |
| partition schema. |
| |
| [[no_default_partitioning]] |
| [IMPORTANT] |
| .No Default Partitioning |
| === |
| Kudu does not provide a default partitioning strategy when creating tables. It |
| is strongly recommended to ensure that new tables have at least as many tablets |
| as tablet servers (but Kudu can support many tablets per tablet server). |
| === |
| |
| Kudu provides two types of partition schema: <<range-partitioning, range partitioning>> and |
| <<hash-bucketing,hash bucketing>>. These schema types can be <<hash-and-range, used |
| together>> or independently. Kudu does not yet allow tablets to be split after |
| creation, so you must design your partition schema ahead of time to ensure that |
| a sufficient number of tablets are created. |
| |
| [[range-partitioning]] |
| === Range Partitioning |
| |
| With range partitioning, rows are distributed into tablets using a totally-ordered |
| distribution key. Each tablet is assigned a contiguous segment of the table's |
| distribution keyspace. Tables may be range partitioned on any subset of the |
| primary key columns. |
| |
| During table creation, tablet boundaries are specified as a sequence of _split |
| rows_. Consider the following table schema (using SQL syntax for clarity): |
| |
| [source,sql] |
| ---- |
| CREATE TABLE customers (last_name STRING NOT NULL, |
| first_name STRING NOT NULL, |
| order_count INT32) |
| PRIMARY KEY (last_name, first_name) |
| DISTRIBUTE BY RANGE (last_name, first_name); |
| ---- |
| |
| Specifying the split rows as `\(("b", ""), ("c", ""), ("d", ""), .., ("z", ""))` |
| (25 split rows total) will result in the creation of 26 tablets, with each |
| tablet containing a range of customer surnames all beginning with a given letter. |
| This is an effective partition schema for a workload where customers are inserted |
| and updated uniformly by last name, and scans are typically performed over a range |
| of surnames. |
| |
| It may make sense to partition a table by range using only a subset of the |
| primary key columns, or with a different ordering than the primary key. For |
| instance, you can change the above example to specify that the range partition |
| should only include the `last_name` column. In that case, Kudu would guarantee that all |
| customers with the same last name would fall into the same tablet, regardless of |
| the provided split rows. |
| |
| [[hash-bucketing]] |
| === Hash Bucketing |
| |
| Hash bucketing distributes rows by hash value into one of many buckets. Each |
| tablet is responsible for the rows falling into a single bucket. The number of |
| buckets (and therefore tablets), is specified during table creation. Typically, |
| all of the primary key columns are used as the columns to hash, but as with range |
| partitioning, any subset of the primary key columns can be used. |
| |
| Hash partitioning is an effective strategy to increase the amount of parallelism |
| for workloads that would otherwise skew writes into a small number of tablets. |
| Consider the following table schema. |
| |
| [source,sql] |
| ---- |
| CREATE TABLE metrics ( |
| host STRING NOT NULL, |
| metric STRING, |
| time TIMESTAMP NOT NULL, |
| measurement DOUBLE, |
| PRIMARY KEY (time, metric, host), |
| ) |
| ---- |
| |
| If you use range partitioning over the primary key columns, inserts will |
| tend to only go to the tablet covering the current time, which limits the |
| maximum write throughput to the throughput of a single tablet. If you use hash |
| partitioning, you can guarantee a number of parallel writes equal to the number |
| of buckets specified when defining the partition schema. The trade-off is that a |
| scan over a single time range now must touch each of these tablets, instead of |
| (possibly) a single tablet. Hash bucketing can be an effective tool for mitigating |
| other types of write skew as well, such as monotonically increasing values. |
| |
| As an advanced optimization, you can create a table with more than one |
| hash bucket component, as long as the column sets included in each are disjoint, |
| and all hashed columns are part of the primary key. The total number of tablets |
| created will be the product of the hash bucket counts. For example, the above |
| `metrics` table could be created with two hash bucket components, one over the |
| `time` column with 4 buckets, and one over the `metric` and `host` columns with |
| 8 buckets. The total number of tablets will be 32. The advantage of using two |
| separate hash bucket components is that scans which specify equality constraints |
| on the `metric` and `host` columns will be able to skip 7/8 of the total |
| tablets, leaving a total of just 4 tablets to scan. |
| |
| [[hash-and-range]] |
| === Hash Bucketing and Range Partitioning |
| |
| Hash bucketing can be combined with range partitioning. Adding hash bucketing to |
| a range partitioned table has the effect of parallelizing operations that would |
| otherwise operate sequentially over the range. The total number of tablets is |
| the product of the number of hash buckets and the number of split rows plus one. |
| |
| [[alter-schema]] |
| == Schema Alterations |
| |
| You can alter a table's schema in the following ways: |
| |
| - Rename the table |
| - Rename, add, or drop columns |
| - Rename (but not drop) primary key columns |
| |
| You cannot modify the partition schema after table creation. |
| |
| [[known-limitations]] |
| == Known Limitations |
| |
| Kudu currently has some known limitations that may factor into schema design: |
| |
| Immutable Primary Keys:: Kudu does not allow you to update the primary key of a |
| row after insertion. |
| |
| Non-alterable Primary Key:: Kudu does not allow you to alter the primary key |
| columns after table creation. |
| |
| Non-alterable Partition Schema:: Kudu does not allow you to alter the |
| partition schema after table creation. |
| |
| Partition Pruning:: When tables use hash buckets, the Java client does not yet |
| use scan predicates to prune tablets for scans over these tables. In the future, |
| specifying an equality predicate on all columns in the hash bucket component |
| will limit the scan to only the tablets corresponding to the hash bucket. |
| |
| Tablet Splitting:: You currently cannot split or merge tablets after table |
| creation. You must create the appropriate number of tablets in the |
| partition schema at table creation. As a workaround, you can copy the contents |
| of one table to another by using a `CREATE TABLE AS SELECT` statement or creating |
| an empty table and using an `INSERT` query with `SELECT` in the predicate to |
| populate the new table. |