blob: d1a54bcd3b0e454d4d5bdc039108e731abf41a2f [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu (incubating) completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu (incubating) - Default Partitioning Changes Coming in Kudu 0.9</title>
<!-- Bootstrap core CSS -->
<link href="/css/bootstrap.min.css" rel="stylesheet" />
<!-- Custom styles for this template -->
<link href="/css/justified-nav.css" rel="stylesheet" />
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.1/css/font-awesome.min.css" />
<link rel="alternate" type="application/atom+xml"
title="RSS Feed for Apache Kudu blog"
href="/feed.xml" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<!-- Fork me on GitHub -->
<a class="fork-me-on-github" href="https://github.com/apache/incubator-kudu"><img src="//aral.github.io/fork-me-on-github-retina-ribbons/right-cerulean@2x.png" alt="Fork me on GitHub" /></a>
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="container-fluid navbar-default">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="logo" href="/"><img src="/img/logo_small.png" width="80" /></a>
</div>
<div id="navbar" class="navbar-collapse collapse navbar-right">
<ul class="nav navbar-nav">
<li >
<a href="/">Home</a>
</li>
<li >
<a href="/overview.html">Overview</a>
</li>
<li >
<a href="/docs/">Documentation</a>
</li>
<li >
<a href="/releases/">Download</a>
</li>
<li class="active">
<a href="/blog/">Blog</a>
</li>
<li >
<a href="/community.html">Community</a>
</li>
<li >
<a href="/faq.html">FAQ</a>
</li>
</ul>
</div><!--/.nav-collapse -->
</nav>
<div class="row header">
<div class="col-lg-12">
<h2><a href="/blog">Apache Kudu (incubating) Blog</a></h2>
</div>
</div>
<div class="row-fluid">
<div class="col-lg-9">
<article>
<header>
<h1 class="entry-title">Default Partitioning Changes Coming in Kudu 0.9</h1>
<p class="meta">Posted 02 Jun 2016 by Dan Burkert</p>
</header>
<div class="entry-content">
<p>The upcoming Apache Kudu (incubating) 0.9 release is changing the default
partitioning configuration for new tables. This post will introduce the change,
explain the motivations, and show examples of how code can be updated to work
with the new release.</p>
<!--more-->
<p>The most common source of frustration with new Kudu users is the default
partitioning behavior when creating new tables. If partitioning is not
specified, the Kudu client prior to 0.9 creates tables with a <em>single tablet</em>.
Single tablet tables are a Kudu anti-pattern, since they are unable to get the
scalability benefit of distributing data across the cluster, and instead keep
all data on a single machine.</p>
<p>Unfortunately, automatically choosing a better default partitioning
configuration for new tables is not simple. In most cases, hash partitioning on
the primary key is a better default, but this approach can have its own
drawbacks. In particular, it is not clear how many buckets should be used for
the new table.</p>
<p>Since there is no bullet-proof default and changing the partitioning
configuration after table creation is impossible, <a href="https://lists.apache.org/thread.html/ca8972620839109334493424a1022fc08c77c315d9d623f5caaa815f@1463699013@%3Cuser.kudu.apache.org%3E">we
decided</a>
to remove the default altogether. Removing the default is a backwards
incompatible change, so it must be done before the 1.0 release. If we later find
a better way to create a default partitioning configuration, it should be
possible to adopt it in a backwards compatible way. The result of removing the
default is that new tables created with the 0.9 client must specify a
partitioning configuration, or table creation will fail. You can still create a
table with a single tablet, but it must be configured explicitly. These changes
only affect new table creation; existing tables, including tables created with
default partitioning before the 0.9 release, will continue to work.</p>
<p>In most cases updating existing code to explicitly set a partitioning
configuration should be simple. The examples below add hash partitioning, but
you can also specify range partitioning or a combination of range and hash
partitioning. See the <a href="http://getkudu.io/docs/schema_design.html#data-distribution">schema design
guide</a> for more
advanced configurations.</p>
<h1 id="c-client">C++ Client</h1>
<p>With the C++ client, creating a new table with hash partitions is as simple as
calling <code>KuduTableCreator:add_hash_partitions</code> with the columns to hash and the
number of buckets to use:</p>
<div class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">KuduTableCreator</span><span class="o">&gt;</span> <span class="n">table_creator</span><span class="p">(</span><span class="n">my_client</span><span class="o">-&gt;</span><span class="n">NewTableCreator</span><span class="p">());</span>
<span class="n">Status</span> <span class="n">create_status</span> <span class="o">=</span> <span class="n">table_creator</span><span class="o">-&gt;</span><span class="n">table_name</span><span class="p">(</span><span class="s">&quot;my-table&quot;</span><span class="p">)</span>
<span class="p">.</span><span class="n">schema</span><span class="p">(</span><span class="n">my_schema</span><span class="p">)</span>
<span class="p">.</span><span class="n">add_hash_partitions</span><span class="p">({</span> <span class="s">&quot;key_column_a&quot;</span><span class="p">,</span> <span class="s">&quot;key_column_b&quot;</span> <span class="p">},</span> <span class="mi">16</span><span class="p">)</span>
<span class="p">.</span><span class="n">Create</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">create_status</span><span class="p">.</span><span class="n">ok</span><span class="p">()</span> <span class="p">{</span> <span class="cm">/* handle error */</span> <span class="p">}</span></code></pre></div>
<h1 id="java-client">Java Client</h1>
<p>And similarly, in Java:</p>
<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">List</span><span class="o">&lt;</span><span class="n">String</span><span class="o">&gt;</span> <span class="n">hashColumns</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o">&lt;&gt;();</span>
<span class="n">hashColumns</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="s">&quot;key_column_a&quot;</span><span class="o">);</span>
<span class="n">hashColumn</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="s">&quot;key_column_b&quot;</span><span class="o">);</span>
<span class="n">CreateTableOptions</span> <span class="n">options</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">CreateTableOptions</span><span class="o">().</span><span class="na">addHashPartitions</span><span class="o">(</span><span class="n">hashColumns</span><span class="o">,</span> <span class="mi">16</span><span class="o">);</span>
<span class="n">myClient</span><span class="o">.</span><span class="na">createTable</span><span class="o">(</span><span class="s">&quot;my-table&quot;</span><span class="o">,</span> <span class="n">my_schema</span><span class="o">,</span> <span class="n">options</span><span class="o">);</span></code></pre></div>
<p>In the examples above, if the hash partition configuration is omitted the create
table operation will fail with the error <code>Table partitioning must be specified
using setRangePartitionColumns or addHashPartitions</code>. In the Java client this
manifests as a thrown <code>IllegalArgumentException</code>, while in the C++ client it is
returned as a <code>Status::InvalidArgument</code>.</p>
<h1 id="impala">Impala</h1>
<p>When creating Kudu tables with Impala, the formerly optional <code>DISTRIBUTE BY</code>
clause is now required:</p>
<div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table</span> <span class="p">(</span><span class="n">key_column_a</span> <span class="n">STRING</span><span class="p">,</span> <span class="n">key_column_b</span> <span class="n">STRING</span><span class="p">,</span> <span class="n">other_column</span> <span class="n">STRING</span><span class="p">)</span>
<span class="n">DISTRIBUTE</span> <span class="k">BY</span> <span class="n">HASH</span> <span class="p">(</span><span class="n">key_column_a</span><span class="p">,</span> <span class="n">key_column_b</span><span class="p">)</span> <span class="k">INTO</span> <span class="mi">16</span> <span class="n">BUCKETS</span>
<span class="n">TBLPROPERTIES</span><span class="p">(</span>
<span class="s1">&#39;storage_handler&#39;</span> <span class="o">=</span> <span class="s1">&#39;com.cloudera.kudu.hive.KuduStorageHandler&#39;</span><span class="p">,</span>
<span class="s1">&#39;kudu.table_name&#39;</span> <span class="o">=</span> <span class="s1">&#39;my_table&#39;</span><span class="p">,</span>
<span class="s1">&#39;kudu.master_addresses&#39;</span> <span class="o">=</span> <span class="s1">&#39;kudu-master.example.com:7051&#39;</span><span class="p">,</span>
<span class="s1">&#39;kudu.key_columns&#39;</span> <span class="o">=</span> <span class="s1">&#39;key_column_a,key_column_b&#39;</span>
<span class="p">);</span></code></pre></div>
</div>
</article>
</div>
<div class="col-lg-3 recent-posts">
<h3>Recent posts</h3>
<ul>
<li> <a href="/2016/06/24/multi-master-1-0-0.html">Master fault tolerance in Kudu 1.0</a> </li>
<li> <a href="/2016/06/21/weekly-update.html">Apache Kudu (incubating) Weekly Update June 21, 2016</a> </li>
<li> <a href="/2016/06/17/raft-consensus-single-node.html">Using Raft Consensus on a Single Node</a> </li>
<li> <a href="/2016/06/13/weekly-update.html">Apache Kudu (incubating) Weekly Update June 13, 2016</a> </li>
<li> <a href="/2016/06/10/apache-kudu-0-9-0-released.html">Apache Kudu (incubating) 0.9.0 released</a> </li>
<li> <a href="/2016/06/06/weekly-update.html">Apache Kudu (incubating) Weekly Update June 6, 2016</a> </li>
<li> <a href="/2016/06/02/no-default-partitioning.html">Default Partitioning Changes Coming in Kudu 0.9</a> </li>
<li> <a href="/2016/06/01/weekly-update.html">Apache Kudu (incubating) Weekly Update June 1, 2016</a> </li>
<li> <a href="/2016/05/23/weekly-update.html">Apache Kudu (incubating) Weekly Update May 23, 2016</a> </li>
<li> <a href="/2016/05/16/weekly-update.html">Apache Kudu (incubating) Weekly Update May 16, 2016</a> </li>
<li> <a href="/2016/05/09/weekly-update.html">Apache Kudu (incubating) Weekly Update May 9, 2016</a> </li>
<li> <a href="/2016/05/03/weekly-update.html">Apache Kudu (incubating) Weekly Update May 3, 2016</a> </li>
<li> <a href="/2016/04/26/ycsb.html">Benchmarking and Improving Kudu Insert Performance with YCSB</a> </li>
<li> <a href="/2016/04/25/weekly-update.html">Apache Kudu (incubating) Weekly Update April 25, 2016</a> </li>
<li> <a href="/2016/04/19/kudu-0-8-0-predicate-improvements.html">Predicate Improvements in Kudu 0.8</a> </li>
</ul>
</div>
</div>
<footer class="footer">
<p class="pull-left">
<a href="http://incubator.apache.org"><img src="/img/apache-incubator.png" width="225" height="53" align="right"/></a>
</p>
<p class="small">
Apache Kudu (incubating) is an effort undergoing incubation at the Apache Software
Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
required of all newly accepted projects until a further review
indicates that the infrastructure, communications, and decision making
process have stabilized in a manner consistent with other successful
ASF projects. While incubation status is not necessarily a reflection
of the completeness or stability of the code, it does indicate that the
project has yet to be fully endorsed by the ASF.
Copyright &copy; 2016 The Apache Software Foundation.
</p>
</footer>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script src="/js/bootstrap.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68448017-1', 'auto');
ga('send', 'pageview');
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.js"></script>
<script>
anchors.options = {
placement: 'right',
visible: 'touch',
};
anchors.add();
</script>
</body>
</html>