| // Copyright 2015 Cloudera, Inc. |
| // |
| // Licensed under the Apache License, Version 2.0 (the "License"); |
| // you may not use this file except in compliance with the License. |
| // You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, software |
| // distributed under the License is distributed on an "AS IS" BASIS, |
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| // See the License for the specific language governing permissions and |
| // limitations under the License. |
| |
| [[schema_design]] |
| = Kudu Schema Design |
| :author: Kudu Team |
| :imagesdir: ./images |
| :icons: font |
| :toc: left |
| :toclevels: 3 |
| :doctype: book |
| :backend: html5 |
| :sectlinks: |
| :experimental: |
| |
| Kudu tables have a structured data model similar to tables in a traditional |
| RDBMS. Like relational databases, schema design is critical for achieving the |
| best performance and operational stability from Kudu. Every workload is unique, |
| and there is no single schema design that is best for every table. This document |
| outlines effective schema design philosophies for Kudu, paying particular |
| attention to where they differ from approaches used for traditional RDBMS |
| schemas. At a high level, there are three concerns in Kudu schema design: |
| <<column-design,column design>>, <<primary-keys,primary keys>>, and |
| <<data-distribution,data distribution>>. Of these, only data distribution will |
| be a new concept for those familiar with traditional relational databases. The |
| next sections discuss <<alter-schema,altering the schema>> of an existing table, |
| and <<known-limitations,known limitations>> with regard to schema design. |
| |
| [[column-design]] |
| |
| A Kudu Table consists of one or more columns, each with a predefined type. |
| Columns that are not part of the primary key may optionally be nullable. |
| Supported column types include: |
| |
| * boolean |
| * 8 bit signed integer |
| * 16 bit signed integer |
| * 32 bit signed integer |
| * 64 bit signed integer |
| * timestamp |
| * single-precision (32 bit) IEEE-754 floating-point number |
| * double-precision (64 bit) IEEE-754 floating-point number |
| * UTF-8 encoded string |
| * binary |
| |
| Kudu takes advantage of strongly-typed columns and a columnar on-disk storage |
| format to provide efficient type-specific compression, encoding, and |
| serialization. Dictionary encoding, integer packing, delta encoding, and other |
| similar techniques are used to compress values more efficiently than |
| general-purpose compression algorithms. In order to take advantage of these |
| features. However, the data must be specified as the appropriate type (as |
| opposed to simulating a 'schemaless' table using string or binary columns for |
| data which may otherwise be structured). |
| |
| // TODO: discuss how to choose / when to choose a specific encoding and block |
| // size for a column. |
| |
| [[primary-key]] |
| |
| Each Kudu table must declare a primary key comprised of one or more columns. |
| Primary key columns must be non-nullable, and may not be a boolean or |
| floating-point type. Every row in a table must have a unique set of values for |
| its primary key columns. As with traditional relational databases, primary key |
| selection is critical to ensuring performant database operations. |
| |
| Unlike an RDBMS, Kudu does not provide an auto-incrementing column feature, so |
| the application must always provide the full primary key during insert or |
| ingestion. Aditionally, Kudu does not allow the primary key values of a row to |
| be updated. |
| |
| Within a tablet, rows are stored sorted in primary key order. Advanced schema |
| designs can take advantage of this ordering to achieve good distribution of |
| data among tablets (through the techniques discussed in <<data-distribution>>), |
| while retaining a consistent ordering in intra-tablet scans. |
| |
| [[data-distribution]] |
| == Data Distribution |
| |
| Kudu tables, unlike traditional relational tables, are partitioned into tablets |
| and distributed across many tablet servers. A row always belongs to a single |
| tablet (and its replicas). The method of assigning rows to tablets is specified |
| in a configurable partition schema for each table. |
| |
| Choosing a data distribution strategy requires understanding the data model and |
| expected workload of a table. For write-heavy workloads, it is important to |
| design the distribution such that writes are spread across tablets in order to |
| avoid overloading a single tablet. For workloads involving many short scans, it |
| can improve performance if all of the data for the scan is located on the same |
| tablet. These fundamental tradeoffs are central to designing an effective |
| partition schema. |
| |
| Kudu provides two types of partition schema which can be <<hash-and-range, used |
| together>>, or independently: <<range-partitioning, range partitioning>> and |
| <<hash-bucketing,hash bucketing>>. Kudu does not yet allow tablets to be split |
| after creation, so you must design your partition schema ahead of time to ensure |
| a sufficient number of tablets are created. |
| |
| [[range-partitioning] |
| === Range Partitioning |
| |
| Range partitioning distributes rows into tablets using a totally-ordered |
| distribution key. Every tablet is assigned a contiguous segment of the table's |
| distribution keyspace. By default, the distribution key uses the columns of the |
| primary key, but it may be configured to be any subset of the primary key |
| columns. |
| |
| During table creation, tablets boundaries are specified as a sequence of split |
| rows. For example, in the following table schema (using SQL syntax for clarity): |
| |
| [source,sql] |
| ---- |
| CREATE TABLE customers ( |
| first_name STRING NOT NULL, |
| last_name STRING NOT NULL, |
| order_count INT32, |
| PRIMARY KEY (last_name, first_name), |
| ) |
| ---- |
| |
| Specifying the split rows as `\(("b", ""), ("c", ""), ("d", ""), .., ("z", ""))` |
| (25 split rows total) will result in the creation of 26 tablets, with each |
| tablet responsible for a range of customer surnames all beggining with the same |
| first character. This is an effective partition schema for a workload where |
| customers are inserted and updated uniformly by last name, and scans are |
| typically over a range of surnames. |
| |
| It may make sense to partition a table by range using only a subset of the |
| primary key columns, or with a different ordering than the primary key. For |
| instance, you can change the above example to specify that the range partition |
| should only include the `last_name` column, then Kudu would guarantee that all |
| customers with the same last name would fall into the same tablet, regardless of |
| the provided split rows. |
| |
| [[hash-bucketing] |
| === Hash Bucketing |
| |
| Hash bucketing distributes rows by hash value into one of many buckets. Each |
| tablet is responsible for the rows falling into a single bucket. The number of |
| buckets (and therefore tablets), is specified during table creation. Typically, |
| the primary key columns are used as the columns to hash, but as with range |
| partitioning, any subset of the primary key columns can be used. |
| |
| Hash partitioning is an effective strategy to increase the amount of parallelism |
| for workloads that would otherwise skew writes into a small number of tablets. |
| For example, consider the following table schema: |
| |
| [source,sql] |
| ---- |
| CREATE TABLE metrics ( |
| host STRING NOT NULL, |
| metric STRING, |
| time TIMESTAMP NOT NULL, |
| measurement DOUBLE, |
| PRIMARY KEY (time, metric, host), |
| ) |
| ---- |
| |
| If the default range partitioning over the primary key columns is used, then |
| inserts tend to only go to the tablet covering the current time, which limits |
| maximum write throughput to the throughput of a single tablet. If you use hash |
| partitioning, you can guarantee a number of parallel writes equal to the number |
| of buckets specified when defining the partition schema. The tradeoff is that a |
| scan over a single time range now must touch each of these tablets, instead of |
| (possibly) a single tablet. Hash bucketing can be an effective tool for other |
| types of write skew as well, such as monotonically increasing values. |
| |
| As an advanced optimization, Kudu allows tables to be created with more than one |
| hash bucket component, as long as the column sets included in each are disjoint, |
| and all hashed columns are part of the primary key. The total number of tablets |
| created will be the product of the hash bucket counts. As an example, the above |
| `metrics` table could be created with two hash bucket components, one over the |
| `time` column with 4 buckets, and one over the `metric` and `host` columns with |
| 8 buckets. The total number of tablets will be 32. The advantage of using two |
| seperate hash bucket components is that scans which specify equality constraints |
| on the `metric` and `host` columns will be able to skip 7/8 of the total |
| tablets, leaving a total of just 4 tablets to scan (this optimization is not yet |
| implemented, see <<known-limitations,known limitations>> for details). |
| |
| [[hash-and-range]] |
| === Hash Bucketing and Range Partitioning |
| |
| Hash bucketing can be combined with range partitioning. Adding hash bucketing to |
| a range partitioned table has the effect of parallelizing operations that would |
| otherwise operate sequentially over the range. The total number of tablets is |
| the product of the number of hash buckets and the number of split rows plus one. |
| |
| [[alter-schema]] |
| == Schema Alterations |
| |
| You can rename Kudu tables, as well as rename, add, or drop columns in an |
| existing table. You can rename primary key columns, but you cannot drop them, |
| and new columns cannot be added to the primary key after table creation. You |
| cannot modify the partition schema after table creation. |
| |
| [[known-limitations]] |
| == Known Limitations |
| |
| Kudu currently has some known limitations that may factor into schema design: |
| |
| * *Immutable Primary Keys* Kudu does allow you to update the primary key of a |
| row after insertion. |
| |
| * *Non-alterable Primary Key* Kudu does not allow you to alter the primary key |
| columns after table creation. |
| |
| * *Non-alterable Partition Schema* Kudu does not allow you to alter the |
| partition schema after table creation. |
| |
| * *Partition Pruning* The Kudu Java and C++ clients do not yet use scan |
| predicates to prune tablets for scans over tables with hash buckets. In the |
| future, specifying an equality predicate on all columns in the hash bucket |
| component will limit the scan to only the tablets corresponding to the hash |
| bucket. |
| |
| * *Tablet Splitting* You currently cannot split or merge tablets after table |
| creation. Instead, you must create the appropriate number of tablets in the |
| partition schema at table creation. |