| <?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2017-03-20T17:16:38-07:00</updated><id>/</id><entry><title>Apache Kudu 1.2.0 released</title><link href="/2017/01/20/apache-kudu-1-2-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.2.0 released" /><published>2017-01-20T00:00:00-08:00</published><updated>2017-01-20T00:00:00-08:00</updated><id>/2017/01/20/apache-kudu-1-2-0-released</id><content type="html" xml:base="/2017/01/20/apache-kudu-1-2-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.2.0!</p> |
| |
| <p>The new release adds several new features and improvements, including:</p> |
| |
| <!--more--> |
| |
| <ul> |
| <li>User data such as row contents is now redacted from logging statements.</li> |
| <li>Kudu’s ability to provide strong consistency guarantees has been substantially improved.</li> |
| <li>Various performance improvements in metadata management as well as optimizations for BITSHUFFLE encoding on AVX2-capable hosts.</li> |
| </ul> |
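For a sense of why bitshuffle responds so well to SIMD and compresses well: the transform transposes a block of fixed-width integers so that plane i collects bit i of every value, and for slowly-varying columnar data most planes come out all zeros, which downstream compression collapses to almost nothing. A toy Python model of the transform (not Kudu's vectorized implementation; names are illustrative):

```python
def bitshuffle(values, width):
    # Transpose `values` (each `width` bits wide) into bit planes:
    # bit j of plane i is bit i of values[j].
    planes = []
    for bit in range(width):
        plane = 0
        for idx, v in enumerate(values):
            plane |= ((v >> bit) & 1) << idx
        planes.append(plane)
    return planes

# Four identical small values: only the low two planes are non-zero,
# so a run-length or LZ pass sees long zero runs.
planes = bitshuffle([3, 3, 3, 3], width=8)
print(planes)  # [15, 15, 0, 0, 0, 0, 0, 0]
```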
| |
| <p>Additionally, 1.2.0 fixes a number of important bugs, including:</p> |
| |
| <ul> |
| <li>Kudu now automatically limits its usage of file descriptors, preventing crashes due to ulimit exhaustion.</li> |
| <li>Fixed a long-standing issue which could cause ext4 file system corruption on RHEL 6.</li> |
| <li>Fixed a disk space leak.</li> |
| <li>Several fixes for correctness in various edge cases.</li> |
| </ul> |
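The file-descriptor fix above amounts to sizing internal file caches against the process ulimit instead of letting them grow without bound. A rough sketch of the idea using Python's stdlib resource module (the reserved fraction is illustrative, not Kudu's actual policy):

```python
import resource

def fd_cache_capacity(fraction=0.4):
    # Read the soft RLIMIT_NOFILE for this process and budget only a
    # fraction of it for cached file handles, leaving headroom for
    # sockets, log files, and other descriptors.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return max(1, int(soft * fraction))

print(fd_cache_capacity())
```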
| |
| <p>The above list of changes is non-exhaustive. Please refer to the |
| <a href="/releases/1.2.0/docs/release_notes.html">release notes</a> |
| for an expanded list of important improvements, bug fixes, and |
| incompatible changes before upgrading.</p> |
| |
| <ul> |
| <li>Download the <a href="/releases/1.2.0/">Kudu 1.2.0 source release</a></li> |
| <li>Convenience binary artifacts for the Java client and various Java |
integrations (e.g. Spark, Flume) are also now available via the ASF Maven
| repository.</li> |
| </ul></content><author><name>Todd Lipcon</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.2.0! |
| |
The new release adds several new features and improvements, including:</summary></entry><entry><title>Apache Kudu Weekly Update November 15th, 2016</title><link href="/2016/11/15/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update November 15th, 2016" /><published>2016-11-15T00:00:00-08:00</published><updated>2016-11-15T00:00:00-08:00</updated><id>/2016/11/15/weekly-update</id><content type="html" xml:base="/2016/11/15/weekly-update.html"><p>Welcome to the twenty-fourth edition of the Kudu Weekly Update. This weekly blog post
| covers ongoing development and news in the Apache Kudu project.</p> |
| |
| <!--more--> |
| |
| <h2 id="project-news">Project news</h2> |
| |
| <ul> |
| <li> |
| <p>The first release candidate for Kudu 1.1.0 is <a href="http://mail-archives.apache.org/mod_mbox/kudu-dev/201611.mbox/%3CCADY20s7ZKZkPmUEcTexW%3D%2B_%2BLnDY2hABZg0-UZD3jvWAs9-pog%40mail.gmail.com%3E">now available</a>.</p> |
| |
| <dl> |
| <dt>Noteworthy new features/improvements:</dt> |
| <dd> |
| <ul> |
| <li>The Python client has been brought to feature parity with the C++ and Java clients.</li> |
<li>Support for IN LIST predicates.</li>
| <li>Java client now features client-side tracing.</li> |
| <li>Kudu now publishes jar files for Spark 2.0 compiled with Scala 2.11.</li> |
| <li>Kudu’s Raft implementation now features pre-elections. In our tests this has greatly improved stability.</li> |
| </ul> |
| </dd> |
| </dl> |
| |
| <p>Community developers and users are encouraged to download the source |
| tarball and vote on the release.</p> |
| |
| <p>For more information on what’s new, check out the |
| <a href="https://github.com/apache/kudu/blob/branch-1.1.x/docs/release_notes.adoc">release notes</a>. |
| <em>Note:</em> some links from these in-progress release notes will not be live until the |
| release itself is published.</p> |
| </li> |
| <li> |
| <p>On November 7th, the Kudu PMC announced that Jordan Birdsell, from State Farm, had been voted |
| in as a new committer and PMC member.</p> |
| |
<p>Jordan’s contributions include extensive work on the Python client, giving it some much-needed
love and bringing it to feature parity with the other clients.</p>
| |
<p>Besides his extensive code contributions, Jordan has also been active in reviewing other
developers’ patches and helping the community in general, on Slack and other channels.</p>
| |
| <p>Jordan has been doing great work and the Kudu PMC was pleased to recognize his contributions |
| with committership.</p> |
| </li> |
| <li> |
<p>Mike Percy will be presenting Kudu on Wednesday, November 16th at <a href="https://apachebigdataeu2016.sched.org/">Apache Big Data Europe, in Seville</a>.</p>
| </li> |
| <li> |
<p>Congratulations to Haijie Hong for his <a href="https://gerrit.cloudera.org/#/c/4822/">first contribution to Kudu</a>!
Haijie fixed some edge cases in BitWriter that were blocking RLE usage for 64-bit types.</p>
| </li> |
| <li> |
<p>Congratulations to Maxim Smyatkin for his <a href="https://gerrit.cloudera.org/#/q/Maxim">first contributions to Kudu</a>!
Maxim has contributed several patches helping with debugging and cleanup.</p>
| </li> |
| </ul> |
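For readers unfamiliar with the encoding behind that BitWriter fix: run-length encoding stores each run of repeated values once with a count, and the fix is what lets Kudu apply it to full 64-bit values. A toy round-trip in Python (whose unbounded ints sidestep the width edge cases the fix addressed):

```python
def rle_encode(values):
    # Collapse consecutive duplicates into (value, run_length) pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

data = [7, 7, 7, 2**40, 2**40]      # values wider than 32 bits
assert rle_encode(data) == [(7, 3), (2**40, 2)]
assert rle_decode(rle_encode(data)) == data
```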
| |
| <h2 id="development-discussions-and-code-in-progress">Development discussions and code in progress</h2> |
| |
| <ul> |
| <li> |
<p>A lot of progress has been made towards the goals that were set in the scope docs introduced in
| the last couple of posts. Specifically:</p> |
| |
| <ul> |
| <li> |
| <p>Dan Burkert, Todd Lipcon and Alexey Serbin have doubled down on the security effort. They have |
been working on enabling Kerberos authentication and RPC encryption. The <a href="https://docs.google.com/document/d/1cPNDTpVkIUo676RlszpTF1gHZ8l0TdbB7zFBAuOuYUw/edit#heading=h.gsibhnd5dyem">security scope doc</a>
| has been updated with the latest plans for security and many patches have been merged already.</p> |
| </li> |
| <li> |
| <p>David Alves has continued the work on <a href="https://s.apache.org/7VCo">consistency</a>. Up for review |
| and partially pushed is a patch series to address row history loss if a row is deleted and then |
| re-inserted. Also in progress is work to make sure that scans at a snapshot from followers |
always return the same data as if they were executed on the leader. This helps with Read-Your-Writes
| when reading from lagging replicas.</p> |
| </li> |
| <li> |
| <p>Adar Dembo has been making good progress <a href="https://s.apache.org/uOOt">addressing issues seen with the LogBlockManager</a>. |
| A series of patches have been merged with various fixes to block managers in general and to the |
| log block manager in particular.</p> |
| </li> |
| <li> |
| <p>Dinesh Bhat has been working on improving the manual recovery tools for Kudu. Namely, he has |
| added a tool to force a remote replica copy to a destination server, and a tool to delete a |
| local replica of a tablet. The latter is useful when a tablet cannot come up due to bad state.</p> |
| </li> |
| <li> |
<p>Jean-Daniel Cryans has implemented RPC tracing for the Java client, greatly improving
debuggability. JD has also added ReplicaSelection to the Java client, allowing clients to perform
scans on replicas other than the leader, which should be of great help for load balancing.</p>
| </li> |
| <li> |
<p>Besides his feature-parity contributions, Jordan Birdsell has laid out a
| <a href="http://mail-archives.apache.org/mod_mbox/kudu-dev/201611.mbox/%3CCAGaaj_VKfB4mhu6eExHCWo0%3D6Qd0HFWy7bg9e39JgOaFPGJ1nQ%40mail.gmail.com%3E">roadmap for Python client work</a> |
| for the 1.2 release. Feedback from other Python client users is certainly appreciated.</p> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| |
| <p>Want to learn more about a specific topic from this blog post? Shoot an email to the |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#117;&#115;&#101;&#114;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">kudu-user mailing list</a> or |
| tweet at <a href="https://twitter.com/ApacheKudu">@ApacheKudu</a>. Similarly, if you’re |
| aware of some Kudu news we missed, let us know so we can cover it in |
a future post.</p></content><author><name>David Alves</name></author><summary>Welcome to the twenty-fourth edition of the Kudu Weekly Update. This weekly blog post
| covers ongoing development and news in the Apache Kudu project.</summary></entry><entry><title>Apache Kudu Weekly Update November 1st, 2016</title><link href="/2016/11/01/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update November 1st, 2016" /><published>2016-11-01T00:00:00-07:00</published><updated>2016-11-01T00:00:00-07:00</updated><id>/2016/11/01/weekly-update</id><content type="html" xml:base="/2016/11/01/weekly-update.html"><p>Welcome to the twenty-third edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</p> |
| |
| <!--more--> |
| |
| <h2 id="development-discussions-and-code-in-progress">Development discussions and code in progress</h2> |
| |
| <ul> |
| <li> |
| <p>Dan Burkert committed a piece of test infrastructure |
| called “MiniKDC” for both Java and C++. The MiniKDC sets up a short-lived |
| Kerberos environment in the context of a single test case, making it |
| easy to build tests of security features without requiring any special |
| infrastructure on the part of the developer.</p> |
| </li> |
| <li> |
| <p>Todd Lipcon added support for Kerberos (GSSAPI) support to Kudu’s |
| RPC system, allowing servers to authenticate the user principal of |
| any inbound RPC connection. He also integrated Kudu’s C++ “MiniCluster” |
| test infrastructure to allow starting a Kerberized cluster in the |
| context of a test.</p> |
| </li> |
| <li> |
| <p>Dan, Todd, and Alexey Serbin have been iterating on a more detailed |
| <a href="https://docs.google.com/document/d/1Yu4iuIhaERwug1vS95yWDd_WzrNRIKvvVGUb31y-_mY/edit#">design doc</a> |
| for authentication in Kudu. This doc outlines the various non-Kerberos |
| methods that Kudu will use for authentication as well as how TLS will |
| be used to encrypt and authenticate some types of connections.</p> |
| </li> |
| <li> |
| <p>Part of the above design document involves Kudu servers generating and |
| signing X509 certificates on the fly to use for authenticated TLS. |
| Alexey has been working on a large <a href="https://gerrit.cloudera.org/#/c/4799/">patch</a> |
| which uses OpenSSL to provide key generation and signing functionality.</p> |
| </li> |
| <li> |
| <p>Sailesh Mukil has been working on adding support for |
| <a href="https://gerrit.cloudera.org/#/c/4789/">TLS in Kudu’s RPC system</a>. The TLS |
| support is a critical part of the overall design for security. This patch |
has gone through several rounds of review and is nearing completion.</p>
| </li> |
| <li> |
| <p>JD Cryans has been continuing to improve the Java client, including adding |
| the ability to specify that the client would like to read the “closest” |
| replica (e.g. reading from a local copy if possible). Additionally, |
| JD has been working on some basic <a href="https://gerrit.cloudera.org/#/c/4781/">tracing support</a> |
| within the Java client. This tracing aims to make timeouts easier to understand |
| and diagnose.</p> |
| </li> |
| <li> |
| <p>Jordan Birdsell committed 9 more patches to the Python client, bringing it |
| very close to feature parity with C++. Jordan has a few more patches in flight |
| which should complete this long-running effort.</p> |
| </li> |
| <li> |
| <p>Congrats to new contributor Haijie Hong who committed his first patch this week. |
| Haijie added support for <a href="https://gerrit.cloudera.org/#/c/4822/">run-length encoding 64-bit integers</a>.</p> |
| </li> |
| <li> |
| <p>Will Berkeley picked back up work on <a href="https://gerrit.cloudera.org/#/c/4310/">improving the capability of ALTER |
| TABLE</a>. His in-flight patch adds support |
| for changing the default value of a column as well as changing storage attributes |
| such as desired block size, encoding, and compression.</p> |
| </li> |
| <li> |
| <p>Adar Dembo has been working on a series of patches for the Block Manager, the |
| component of Kudu which is responsible for laying out blocks on the local |
| file system. His patch series consists of a number of refactors to clean up |
| and improve the code structure, followed by an <a href="https://gerrit.cloudera.org/#/c/4848/">improvement to reduce file system |
| fragmentation</a>.</p> |
| </li> |
| <li> |
| <p>David Alves has been working on a <a href="https://gerrit.cloudera.org/#/c/4819/">patch series</a> |
| which adds support for storing ‘REINSERT’ deltas on disk. These records are |
| generated if a user inserts a row, deletes it, and inserts a new row with the |
| same primary key. Current versions of Kudu lose track of the history of the |
| prior version of the row in this scenario, which prevents correct snapshot reads. |
| David’s patch series fixes this.</p> |
| </li> |
| </ul> |
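A tiny model makes the REINSERT problem concrete: snapshot reads replay a row's mutation history up to a timestamp, so if the store forgets the ops that preceded a re-insert, older snapshots can no longer be reconstructed. This hypothetical sketch shows the intended behavior (names are illustrative, not Kudu's internals):

```python
class MvccStore:
    """Toy per-row mutation history supporting snapshot reads."""

    def __init__(self):
        self.history = {}  # key -> list of (timestamp, op, value)

    def apply(self, ts, op, key, value=None):
        self.history.setdefault(key, []).append((ts, op, value))

    def read(self, key, snapshot_ts):
        # Replay ops up to snapshot_ts: a DELETE hides the row, and a
        # later REINSERT makes it visible again with the new value.
        visible = None
        for ts, op, value in sorted(self.history.get(key, [])):
            if ts > snapshot_ts:
                break
            visible = value if op in ("INSERT", "REINSERT") else None
        return visible

s = MvccStore()
s.apply(1, "INSERT", "k", "v1")
s.apply(2, "DELETE", "k")
s.apply(3, "REINSERT", "k", "v2")
print(s.read("k", 1), s.read("k", 2), s.read("k", 3))  # v1 None v2
```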
| |
| <p>Want to learn more about a specific topic from this blog post? Shoot an email to the |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#117;&#115;&#101;&#114;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">kudu-user mailing list</a> or |
| tweet at <a href="https://twitter.com/ApacheKudu">@ApacheKudu</a>. Similarly, if you’re |
| aware of some Kudu news we missed, let us know so we can cover it in |
| a future post.</p></content><author><name>Todd Lipcon</name></author><summary>Welcome to the twenty-third edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</summary></entry><entry><title>Apache Kudu Weekly Update October 20th, 2016</title><link href="/2016/10/20/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update October 20th, 2016" /><published>2016-10-20T00:00:00-07:00</published><updated>2016-10-20T00:00:00-07:00</updated><id>/2016/10/20/weekly-update</id><content type="html" xml:base="/2016/10/20/weekly-update.html"><p>Welcome to the twenty-second edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</p> |
| |
| <!--more--> |
| |
| <h2 id="project-news">Project news</h2> |
| |
| <ul> |
| <li> |
| <p>Kudu 1.0.1 was <a href="http://mail-archives.apache.org/mod_mbox/kudu-user/201610.mbox/%3CCALo2W-UgTa%2BX15_q_9FQpRUPWN53eyqFS10C5MXK1KpsFgqcyQ%40mail.gmail.com%3E">released</a> |
| on October 11th. This is a bug fix release which fixes several bugs found |
| in 1.0.0. See the <a href="http://kudu.apache.org/releases/1.0.1/docs/release_notes.html">Kudu 1.0.1 release notes</a> |
| for more details.</p> |
| </li> |
| <li> |
| <p>Todd Lipcon has proposed a <a href="https://lists.apache.org/thread.html/4c94d313e28381bb107682ffaf43adfd38bd7fb3b03c98e3c86c52e2@%3Cdev.kudu.apache.org%3E">release plan</a> |
| for the next few months. The proposal is to have a 1.1 release in mid-November and |
| a 1.2 release in mid-January. These would be time-based releases rather than |
| gated on any particular feature scope; however, it’s anticipated that several |
| new features and improvements will be ready in time for these releases.</p> |
| </li> |
| <li> |
| <p>Happy fourth birthday to the Kudu project! The initial commit was made |
| on October 11th, 2012! Since then we’ve had 4888 more commits by 60 |
| authors!</p> |
| </li> |
| </ul> |
| |
| <h2 id="development-discussions-and-code-in-progress">Development discussions and code in progress</h2> |
| |
| <ul> |
| <li>As mentioned last week, a lot of contributors have been collaborating on |
| design documents for upcoming work. Here’s the complete list of in-flight |
| documents, along with the primary authors of these docs: |
| <ul> |
| <li><a href="https://docs.google.com/document/d/1cPNDTpVkIUo676RlszpTF1gHZ8l0TdbB7zFBAuOuYUw/edit#heading=h.gsibhnd5dyem">Security features</a> (Todd Lipcon)</li> |
| <li><a href="https://goo.gl/wP5BJb">Improved disk-failure handling</a> (Dinesh Bhat)</li> |
| <li><a href="https://s.apache.org/7K48">Tools for manual recovery from corruption</a> (Mike Percy and Dinesh Bhat)</li> |
| <li><a href="https://s.apache.org/uOOt">Addressing issues seen with the LogBlockManager</a> (Adar Dembo)</li> |
| <li><a href="https://s.apache.org/7VCo">Providing proper snapshot/serializable consistency</a> (David Alves)</li> |
| <li><a href="https://s.apache.org/ARUP">Improving re-replication of under-replicated tablets</a> (Mike Percy)</li> |
| <li><a href="https://docs.google.com/document/d/1066W63e2YUTNnecmfRwgAHghBPnL1Pte_gJYAaZ_Bjo/edit">Avoiding Raft election storms</a> (Todd Lipcon)</li> |
| <li><a href="https://s.apache.org/kudu-backup-scope">Backup and bulk load</a> (Dan Burkert)</li> |
| <li><a href="https://s.apache.org/SM6V">Improving diagnosability of client errors</a> (Alexey Serbin)</li> |
| </ul> |
| |
| <p>In many cases, work is now progressing on implementation of these ideas, |
| but these are considered living documents. It’s not too late to add your |
| comments or volunteer to help out.</p> |
| </li> |
| <li> |
| <p>JD Cryans has been working on cleaning up the Java client. Several complex pieces |
| of code were completely removed, and other parts were refactored into new |
| standalone classes for better modularity. Along the way, JD also |
| <a href="http://gerrit.cloudera.org:8080/4706">reduced lock contention</a> on a frequently-accessed |
| data structure.</p> |
| </li> |
| <li> |
| <p>Todd Lipcon implemented and committed Raft “pre-elections” as described in the |
<a href="https://docs.google.com/document/d/1066W63e2YUTNnecmfRwgAHghBPnL1Pte_gJYAaZ_Bjo/edit">election storm mitigation design document</a>.
| Initial experiments, detailed in the document, indicate that this will substantially |
| improve leader stability on clusters with overloaded disks and lots of tablets.</p> |
| |
| <p>Following this patch, Todd worked on some cleanup and refactor of the Consensus |
| implementation, removing a bunch of dead code and splitting some classes up |
| into smaller pieces. This is preparing for some improvements in locking |
| granularity also described in the same document.</p> |
| </li> |
| <li> |
| <p>Dan Burkert and Todd Lipcon have started submitting patches to integrate Kerberos |
| authentication with Kudu’s RPC system. Dan posted a |
| <a href="https://gerrit.cloudera.org/#/c/4752/">patch</a> which adds “MiniKDC”, some test |
| infrastructure for starting and stopping a standalone Kerberos service in |
| the context of a test. Todd worked on adding |
| <a href="https://gerrit.cloudera.org/#/c/4763/">support for Kerberos authentication</a> |
| during RPC negotiation.</p> |
| |
| <p>These patches are just the beginning of the security work, but form an important |
| base to build on top of. The design uses Kerberos both as a mechanism to authenticate |
clients and as a way to mutually authenticate tablet servers with the master.</p>
| </li> |
| </ul> |
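The pre-election mechanism Todd committed follows the "pre-vote" idea from the Raft literature: before disrupting the cluster with a real election (which bumps terms), a candidate first polls its peers with a request that changes no durable state, and proceeds only if a majority would grant the vote. A simplified model, not Kudu's implementation:

```python
def would_grant_prevote(voter_term, voter_last_index, cand_term, cand_last_index):
    # A peer grants a pre-vote, without updating any state, only if the
    # candidate's term and log are at least as up to date as its own.
    return cand_term >= voter_term and cand_last_index >= voter_last_index

def should_start_election(peers, cand_term, cand_last_index):
    # peers: list of (term, last_log_index) for the other replicas.
    grants = 1  # the candidate votes for itself
    for term, last_index in peers:
        if would_grant_prevote(term, last_index, cand_term, cand_last_index):
            grants += 1
    return grants > (len(peers) + 1) // 2  # strict majority of the cluster

# A partitioned replica with a stale log no longer triggers an election:
print(should_start_election([(5, 10), (5, 12)], cand_term=5, cand_last_index=8))  # False
```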
| |
| <p>Want to learn more about a specific topic from this blog post? Shoot an email to the |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#117;&#115;&#101;&#114;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">kudu-user mailing list</a> or |
| tweet at <a href="https://twitter.com/ApacheKudu">@ApacheKudu</a>. Similarly, if you’re |
| aware of some Kudu news we missed, let us know so we can cover it in |
| a future post.</p></content><author><name>Todd Lipcon</name></author><summary>Welcome to the twenty-second edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</summary></entry><entry><title>Apache Kudu Weekly Update October 11th, 2016</title><link href="/2016/10/11/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update October 11th, 2016" /><published>2016-10-11T00:00:00-07:00</published><updated>2016-10-11T00:00:00-07:00</updated><id>/2016/10/11/weekly-update</id><content type="html" xml:base="/2016/10/11/weekly-update.html"><p>Welcome to the twenty-first edition of the Kudu Weekly Update. Astute |
| readers will notice that the weekly blog posts have been not-so-weekly |
| of late – in fact, it has been nearly two months since the previous post |
| as I and others have focused on releases, conferences, etc.</p> |
| |
| <p>So, rather than covering just this past week, this post will cover highlights |
| of the progress since the 1.0 release in mid-September. If you’re interested |
| in learning about progress prior to that release, check the |
| <a href="http://kudu.apache.org/releases/1.0.0/docs/release_notes.html">release notes</a>.</p> |
| |
| <!--more--> |
| |
| <h2 id="project-news">Project news</h2> |
| |
| <ul> |
| <li> |
| <p>On September 12th, the Kudu PMC announced that Alexey Serbin and Will |
Berkeley had been voted in as new committers and PMC members.</p>
| |
| <p>Alexey’s contributions prior to committership included |
| <a href="https://gerrit.cloudera.org/#/c/3952/">AUTO_FLUSH_BACKGROUND</a> support |
| in C++ as well as <a href="http://kudu.apache.org/apidocs/">API documentation</a> |
| for the C++ client API.</p> |
| |
| <p>Will’s contributions include several fixes to the web UIs, large |
improvements to the Flume integration, and a lot of good work
| burning down long-standing bugs.</p> |
| |
| <p>Both contributors were “acting the part” and the PMC was pleased to |
| recognize their contributions with committership.</p> |
| </li> |
| <li> |
| <p>Kudu 1.0.0 was <a href="https://kudu.apache.org/2016/09/20/apache-kudu-1-0-0-released.html">released</a> |
| on September 19th. Most community members have upgraded by this point |
| and have been reporting improved stability and performance.</p> |
| </li> |
| <li> |
| <p>Dan Burkert has been managing a Kudu 1.0.1 release to address a few |
| important bugs discovered since 1.0.0. The vote passed on Monday |
| afternoon, so the release should be made officially available |
| later this week.</p> |
| </li> |
| </ul> |
| |
| <h2 id="development-discussions-and-code-in-progress">Development discussions and code in progress</h2> |
| |
| <ul> |
| <li>After the 1.0 release, many contributors have gone into a design phase |
| for upcoming work. Over the last couple of weeks, developers have posted |
| scoping and design documents for topics including: |
| <ul> |
| <li><a href="https://docs.google.com/document/d/1cPNDTpVkIUo676RlszpTF1gHZ8l0TdbB7zFBAuOuYUw/edit#heading=h.gsibhnd5dyem">Security features</a> (Todd Lipcon)</li> |
| <li><a href="https://goo.gl/wP5BJb">Improved disk-failure handling</a> (Dinesh Bhat)</li> |
| <li><a href="https://s.apache.org/7K48">Tools for manual recovery from corruption</a> (Mike Percy and Dinesh Bhat)</li> |
| <li><a href="https://s.apache.org/uOOt">Addressing issues seen with the LogBlockManager</a> (Adar Dembo)</li> |
| <li><a href="https://s.apache.org/7VCo">Providing proper snapshot/serializable consistency</a> (David Alves)</li> |
| <li><a href="https://s.apache.org/ARUP">Improving re-replication of under-replicated tablets</a> (Mike Percy)</li> |
| <li><a href="https://docs.google.com/document/d/1066W63e2YUTNnecmfRwgAHghBPnL1Pte_gJYAaZ_Bjo/edit">Avoiding Raft election storms</a> (Todd Lipcon)</li> |
| </ul> |
| |
| <p>The development community has no particular rule that all work must be |
| accompanied by such a document, but in the past they have proven useful |
| for fleshing out ideas around a design before beginning implementation. |
| As Kudu matures, we can probably expect to see more of this kind of planning |
| and design discussion.</p> |
| |
| <p>If any of the above work areas sounds interesting to you, please take a |
| look and leave your comments! Similarly, if you are interested in contributing |
| in any of these areas, please feel free to volunteer on the mailing list. |
| Help of all kinds (coding, documentation, testing, etc) is welcomed.</p> |
| </li> |
| <li>Adar Dembo spent a chunk of time re-working the <code>thirdparty</code> directory |
| that contains most of Kudu’s native dependencies. The major resulting |
| changes are: |
| <ul> |
| <li>Build directories are now cleanly isolated from source directories, |
| improving cleanliness of re-builds.</li> |
<li>ThreadSanitizer (TSAN) builds now use <code>libc++</code> instead of <code>libstdc++</code>
| for C++ library support. The <code>libc++</code> library has better support for |
| sanitizers, is easier to build in isolation, and solves some compatibility |
| issues that Adar was facing with GCC 5 on Ubuntu Xenial.</li> |
| <li>All of the thirdparty dependencies now build with TSAN instrumentation, |
| which improves our coverage of this very effective tooling.</li> |
| </ul> |
| |
| <p>The impact to most developers is that, if you have an old source checkout, |
| it’s highly likely you will need to clean and re-build the thirdparty |
| directory.</p> |
| </li> |
| <li>Many contributors spent time in recent weeks trying to address the |
| flakiness of various test cases. The Kudu project uses a |
| <a href="http://dist-test.cloudera.org:8080/">dashboard</a> to track the flakiness |
| of each test case, and <a href="http://dist-test.cloudera.org/">distributed test infrastructure</a> |
| to facilitate reproducing test flakes. <!-- spaces cause line break --> |
| As might be expected, some of the flaky tests were due to bugs or |
| timing assumptions in the tests themselves. However, this effort |
| also identified several real bugs: |
| <ul> |
<li>A <a href="http://gerrit.cloudera.org:8080/4570">tight retry loop</a> in the
| Java client.</li> |
| <li>A <a href="http://gerrit.cloudera.org:8080/4395">memory leak</a> due to circular |
| references in the C++ client.</li> |
| <li>A <a href="http://gerrit.cloudera.org:8080/4551">crash</a> which could affect |
| tools used for problem diagnosis.</li> |
| <li>A <a href="http://gerrit.cloudera.org:8080/4409">divergence bug</a> in Raft consensus |
| under particularly torturous scenarios.</li> |
| <li>A potential <a href="http://gerrit.cloudera.org:8080/4394">crash during tablet server startup</a>.</li> |
| <li>A case in which <a href="http://gerrit.cloudera.org:8080/4626">thread startup could be delayed</a> |
| by built-in monitoring code.</li> |
| </ul> |
| |
| <p>As a result of these efforts, the failure rate of these flaky tests has |
| decreased significantly and the stability of Kudu releases continues |
| to increase.</p> |
| </li> |
| <li> |
| <p>Dan Burkert picked up work originally started by Sameer Abhyankar on |
| <a href="https://issues.apache.org/jira/browse/KUDU-1363">KUDU-1363</a>, which adds |
| support for adding <code>IN (...)</code> predicates to scanners. Dan committed the |
| <a href="http://gerrit.cloudera.org:8080/2986">main patch</a> as well as corresponding |
| <a href="http://gerrit.cloudera.org:8080/4530">support in the Java client</a>. |
| Jordan Birdsell quickly added corresponding support in <a href="http://gerrit.cloudera.org:8080/4548">Python</a>. |
| This new feature will be available in an upcoming release.</p> |
| </li> |
| <li> |
| <p>Work continues on the <code>kudu</code> command line tool. Dinesh Bhat added |
| the ability to ask a tablet’s leader to <a href="http://gerrit.cloudera.org:8080/4533">step down</a> |
| and Alexey Serbin added a <a href="http://gerrit.cloudera.org:8080/4412">tool to insert random data into a |
| table</a>.</p> |
| </li> |
| <li> |
| <p>Jordan Birdsell continues to be on a tear improving the Python client. |
| The patches are too numerous to mention, but highlights include Python 3 |
| support as well as near feature parity with the C++ client.</p> |
| </li> |
| <li> |
| <p>Todd Lipcon has been doing some refactoring and cleanup in the Raft |
| consensus implementation. In addition to simplifying and removing code, |
| he committed <a href="https://issues.apache.org/jira/browse/KUDU-1567">KUDU-1567</a>, |
| which improves write performance in many cases by a factor of three |
| or more while also improving stability.</p> |
| </li> |
| <li> |
| <p>Brock Noland is working on support for <a href="https://gerrit.cloudera.org/#/c/4491/">INSERT IGNORE</a> |
as a first-class part of the Kudu API. Of course, this functionality
can already be achieved by performing normal inserts and ignoring any
resulting errors, but pushing it to the server prevents such operations
from being counted as errors.</p>
| </li> |
| <li>Congratulations to Ninad Shringarpure for contributing his first patches |
| to Kudu. Ninad contributed two documentation fixes and improved |
| formatting on the Kudu web UI.</li> |
| </ul> |
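The INSERT IGNORE contract Brock describes can be modeled in a few lines: an insert that finds the key already present succeeds silently instead of surfacing a duplicate-key error, so it is never counted as one. A sketch with a dict standing in for a table (hypothetical names, not the actual Kudu API):

```python
class DuplicateKeyError(Exception):
    pass

def insert(table, key, row, ignore_duplicates=False):
    # With ignore_duplicates, hitting an existing key is a silent no-op
    # (INSERT IGNORE) rather than an error the client must swallow.
    if key in table:
        if ignore_duplicates:
            return False
        raise DuplicateKeyError(key)
    table[key] = row
    return True

t = {}
assert insert(t, 1, "a") is True
assert insert(t, 1, "b", ignore_duplicates=True) is False  # kept "a"
assert t[1] == "a"
```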
| |
| <p>Want to learn more about a specific topic from this blog post? Shoot an email to the |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#117;&#115;&#101;&#114;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">kudu-user mailing list</a> or |
| tweet at <a href="https://twitter.com/ApacheKudu">@ApacheKudu</a>. Similarly, if you’re |
| aware of some Kudu news we missed, let us know so we can cover it in |
| a future post.</p></content><author><name>Todd Lipcon</name></author><summary>Welcome to the twenty-first edition of the Kudu Weekly Update. Astute |
| readers will notice that the weekly blog posts have been not-so-weekly |
| of late &#8211; in fact, it has been nearly two months since the previous post |
| as I and others have focused on releases, conferences, etc. |
| |
| So, rather than covering just this past week, this post will cover highlights |
| of the progress since the 1.0 release in mid-September. If you&#8217;re interested |
| in learning about progress prior to that release, check the |
| release notes.</summary></entry><entry><title>Apache Kudu at Strata+Hadoop World NYC 2016</title><link href="/2016/09/26/strata-nyc-kudu-talks.html" rel="alternate" type="text/html" title="Apache Kudu at Strata+Hadoop World NYC 2016" /><published>2016-09-26T00:00:00-07:00</published><updated>2016-09-26T00:00:00-07:00</updated><id>/2016/09/26/strata-nyc-kudu-talks</id><content type="html" xml:base="/2016/09/26/strata-nyc-kudu-talks.html"><p>This week in New York, O’Reilly and Cloudera will be hosting Strata+Hadoop World |
| 2016. If you’re interested in Kudu, there will be several opportunities to |
| learn more, both from the open source development team as well as some companies |
| who are already adopting Kudu for their use cases. |
| <!--more--> |
| Here are some of the sessions to check out:</p> |
| |
| <ul> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52146">Powering real-time analytics on Xfinity using Kudu</a> (Wednesday, 11:20am)</p> |
| |
| <p>Sridhar Alla and Kiran Muglurmath from Comcast will talk about how they’re using |
| Kudu to store hundreds of billions of Set-Top Box (STB) events, performing |
| analytics concurrently with real-time streaming ingest of thousands of events |
| per second.</p> |
| </li> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52248">Creating real-time, data-centric applications with Impala and Kudu</a> (Wednesday, 2:05pm)</p> |
| |
| <p>Marcel Kornacker and Todd Lipcon will introduce how Impala and Kudu together |
| allow users to build real-time applications that support streaming ingest, |
| random access updates and deletes, and high performance analytic SQL in |
| a single system.</p> |
| </li> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52168">Streaming cybersecurity into Graph: Accelerating data into Datastax Graph and Blazegraph</a> (Thursday, 1:15pm)</p> |
| |
| <p>Joshua Patterson, Michael Wendt, and Keith Kraus from Accenture Labs will discuss |
| how they have built cybersecurity solutions using graph analytics on top of open |
| source technology like Apache Kafka, Spark, and Flink. They will also touch on |
| why Kudu is becoming an integral part of Accenture’s technology stack.</p> |
| </li> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52050">How GE analyzes billions of mission-critical events in real time using Apache Apex, Spark, and Kudu</a> (Thursday, 2:05pm)</p> |
| |
| <p>Venkatesh Sivasubramanian and Luis Ramos from GE Digital will discuss how they |
| collect and process real-time IoT data using Apache Apex and Apache Spark, and |
| how they’ve been experimenting with Apache Kudu for time series data storage.</p> |
| </li> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/51887">Apache Kudu: 1.0 and Beyond</a> (Thursday, 4:35pm)</p> |
| |
| <p>Todd Lipcon from Cloudera will review the new features that were developed between Kudu 0.5 |
| (the first public release one year ago) and Kudu 1.0, released just last week. Additionally, |
| this talk will provide some insight into the upcoming project roadmap for the coming year.</p> |
| </li> |
| </ul> |
| |
| <p>Aside from these organized sessions, word has it that there will be various demos |
| featuring Apache Kudu at the Cloudera and ZoomData vendor booths.</p> |
| |
| <p>If you’re not attending the conference, but still based in NYC, all hope is |
| not lost. Michael Crutcher from Cloudera will be presenting an introduction |
| to Apache Kudu at the <a href="http://www.meetup.com/mysqlnyc/events/233599664/">SQL NYC Meetup</a>. |
| Be sure to RSVP as spots are filling up fast.</p></content><author><name>Todd Lipcon</name></author><summary>This week in New York, O&#8217;Reilly and Cloudera will be hosting Strata+Hadoop World |
| 2016. If you&#8217;re interested in Kudu, there will be several opportunities to |
| learn more, both from the open source development team as well as some companies |
| who are already adopting Kudu for their use cases.</summary></entry><entry><title>Apache Kudu 1.0.0 released</title><link href="/2016/09/20/apache-kudu-1-0-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.0.0 released" /><published>2016-09-20T00:00:00-07:00</published><updated>2016-09-20T00:00:00-07:00</updated><id>/2016/09/20/apache-kudu-1-0-0-released</id><content type="html" xml:base="/2016/09/20/apache-kudu-1-0-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.0.0!</p> |
| |
| <p>This latest version adds several new features, including:</p> |
| |
| <!--more--> |
| |
| <ul> |
| <li> |
| <p>Removal of multiversion concurrency control (MVCC) history is now supported. |
| This allows Kudu to reclaim disk space, where previously Kudu would keep a full |
| history of all changes made to a given table since the beginning of time.</p> |
| </li> |
| <li> |
| <p>Most of Kudu’s command line tools have been consolidated under a new |
| top-level <code>kudu</code> tool. This reduces the number of large binaries distributed |
| with Kudu and also includes much-improved help output.</p> |
| </li> |
| <li> |
| <p>Administrative tools including <code>kudu cluster ksck</code> now support running |
| against multi-master Kudu clusters.</p> |
| </li> |
| <li> |
| <p>The C++ client API now supports writing data in <code>AUTO_FLUSH_BACKGROUND</code> mode. |
| This can provide higher throughput for ingest workloads.</p> |
| </li> |
| </ul> |
| |
| <p>This release also includes many bug fixes, optimizations, and other |
| improvements, detailed in the <a href="/releases/1.0.0/docs/release_notes.html">release notes</a>.</p> |
| |
| <ul> |
| <li>Download the <a href="/releases/1.0.0/">Kudu 1.0.0 source release</a></li> |
| <li>Convenience binary artifacts for the Java client and various Java |
| integrations (e.g. Spark, Flume) are also now available via the ASF Maven |
| repository.</li> |
| </ul></content><author><name>Todd Lipcon</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.0.0! |
| |
| This latest version adds several new features, including:</summary></entry><entry><title>Pushing Down Predicate Evaluation in Apache Kudu</title><link href="/2016/09/16/predicate-pushdown.html" rel="alternate" type="text/html" title="Pushing Down Predicate Evaluation in Apache Kudu" /><published>2016-09-16T00:00:00-07:00</published><updated>2016-09-16T00:00:00-07:00</updated><id>/2016/09/16/predicate-pushdown</id><content type="html" xml:base="/2016/09/16/predicate-pushdown.html"><p>I had the pleasure of interning with the Apache Kudu team at Cloudera this |
| summer. This project was my summer contribution to Kudu: a restructuring of the |
| scan path to speed up queries.</p> |
| |
| <!--more--> |
| |
| <h2 id="introduction">Introduction</h2> |
| |
| <p>In Kudu, <em>predicate pushdown</em> refers to the way in which predicates are |
| handled. When a scan is requested, its predicates are passed through the |
| different layers of Kudu’s storage hierarchy, allowing for pruning and other |
| optimizations to happen at each level before reaching the underlying data.</p> |
| |
| <p>While predicates are pushed down, predicate evaluation itself occurs at a fairly |
| high level, preventing the evaluation process from taking advantage of certain |
| data-specific optimizations. These optimizations can make tablet scans an order |
| of magnitude faster, if not more.</p> |
| |
| <h2 id="a-day-in-the-life-of-a-query">A Day in the Life of a Query</h2> |
| |
| <p>Because Kudu is a columnar storage engine, its scan path has a number of |
| optimizations to avoid extraneous reads, copies, and computation. When a query |
| is sent to a tablet server, the server prunes tablets based on the |
| primary key, directing the request to only the tablets that contain the key |
| range of interest. Once at a tablet, only the columns relevant to the query are |
| scanned. Further pruning is done over the primary key, and if the query is |
| predicated on non-key columns, the entire column is scanned. The columns in a |
| tablet are stored as <em>cfiles</em>, which are split into encoded <em>blocks</em>. Once the |
| relevant cfiles are determined, the data are materialized by the block |
| decoders, i.e. their underlying data are decoded and copied into a buffer, |
| which is passed back to the tablet layer. The tablet can then evaluate the |
| predicate on the batch of data and mark which rows should be returned to the |
| client.</p> |
| |
| <p>One of the encoding types I worked very closely with is <em>dictionary encoding</em>, |
| an encoding type for strings that performs particularly well for cfiles that |
| have repeating values. Rather than storing every row’s string, each unique |
| string is assigned a numeric codeword, and the rows are stored numerically on |
| disk. When materializing a dictionary block, all of the numeric data are scanned |
| and all of the corresponding strings are copied and buffered for evaluation. |
| When the vocabulary of a dictionary-encoded cfile grows too large, subsequent |
| blocks switch to <em>plain encoding mode</em> and behave like <em>plain-encoded</em> blocks.</p> |
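| <p>To make the dictionary-encoding idea above concrete, here is a toy sketch (an illustration only, not Kudu’s actual cfile format): each unique string is assigned a numeric codeword, and rows are stored as integers.</p> |

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of dictionary encoding: each unique string gets a numeric
// codeword and rows are stored as a list of integers. This illustrates the
// idea only; Kudu's real cfile/block format differs.
class DictEncodingSketch {
  final Map<String, Integer> dict = new HashMap<>();  // string -> codeword
  final List<String> vocab = new ArrayList<>();       // codeword -> string
  final List<Integer> codes = new ArrayList<>();      // one codeword per row

  void add(String value) {
    Integer code = dict.get(value);
    if (code == null) {
      code = vocab.size();
      dict.put(value, code);
      vocab.add(value);
    }
    codes.add(code);
  }

  // Materializing decodes every row's codeword back into its string.
  List<String> materialize() {
    List<String> out = new ArrayList<>();
    for (int c : codes) {
      out.add(vocab.get(c));
    }
    return out;
  }
}
```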
| |
| <p>In a plain-encoded block, strings are stored contiguously and the character |
| offsets to the start of each string are stored as a list of integers. When |
| materializing, all of the strings are copied to a buffer for evaluation.</p> |
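| <p>The plain-encoded layout can be sketched in the same toy style, assuming a simplified character-offset scheme (Kudu’s actual block layout differs): strings are concatenated, and a parallel list records where each one starts.</p> |

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of plain encoding for strings: characters are stored
// contiguously, with a parallel list of start offsets. Kudu's actual
// block layout differs; this only illustrates the idea.
class PlainEncodingSketch {
  final StringBuilder data = new StringBuilder();   // contiguous characters
  final List<Integer> offsets = new ArrayList<>();  // start offset of each row

  void add(String value) {
    offsets.add(data.length());
    data.append(value);
  }

  // Row i is the slice between its offset and the next one (or the end).
  String get(int i) {
    int end = (i + 1 < offsets.size()) ? offsets.get(i + 1) : data.length();
    return data.substring(offsets.get(i), end);
  }
}
```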
| |
| <p>Therein lies room for improvement: this predicate evaluation path is the same |
| for all data types and encoding types. Within the tablet, the correct cfiles |
| are determined, the cfiles’ decoders are opened, all of the data are copied to |
| a buffer, and the predicates are evaluated on this buffered data via |
| type-specific comparators. This path is extremely flexible, but because it was |
| designed to be encoding-independent, it misses encoding-specific optimizations.</p> |
| |
| <h2 id="trimming-the-fat">Trimming the Fat</h2> |
| |
| <p>The first step is to allow the decoders access to the predicate. In doing so, |
| each encoding type can specialize its evaluation. Additionally, this puts the |
| decoder in a position where it can determine whether a given row satisfies the |
| query, which in turn allows it to decide which data gets copied, rather than |
| eagerly copying all of the data for evaluation.</p> |
| |
| <p>Take the case of dictionary-encoded strings as an example. With the existing |
| scan path, not only are all of the strings in a column copied into a buffer, but |
| string comparisons are done on every row. By taking advantage of the fact that |
| the data can be represented as integers, the cost of determining the query |
| results can be greatly reduced. The string comparisons can be swapped out with |
| evaluation based on the codewords, in which case the room for improvement boils |
| down to how to most quickly determine whether or not a given codeword |
| corresponds to a string that satisfies the predicate. Dictionary columns now |
| use a bitset to store the codewords that match the predicate. The decoder then |
| scans through the integer-valued data and checks the bitset to determine |
| whether it should copy the corresponding string over.</p> |
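| <p>The codeword-bitset evaluation described above can be sketched as follows for an equality predicate (a simplified illustration, not Kudu’s actual decoder code): the predicate is evaluated once per vocabulary entry, after which each row costs only an integer lookup.</p> |

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of pushed-down evaluation on a dictionary block: evaluate the
// predicate once per vocabulary entry to build a bitset of matching
// codewords, then copy only the rows whose codes are set. Simplified
// illustration; not Kudu's actual decoder code.
class CodewordFilterSketch {
  static List<String> scanEquals(List<String> vocab, int[] codes, String target) {
    BitSet matches = new BitSet(vocab.size());
    for (int cw = 0; cw < vocab.size(); cw++) {
      if (vocab.get(cw).equals(target)) {
        matches.set(cw);  // one string comparison per unique value
      }
    }
    List<String> out = new ArrayList<>();
    for (int code : codes) {
      if (matches.get(code)) {
        out.add(vocab.get(code));  // cheap integer check per row
      }
    }
    return out;
  }
}
```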
| |
| <p>This is great in the best case scenario where a cfile’s vocabulary is small, |
| but when the vocabulary gets too large and the dictionary blocks switch to plain |
| encoding mode, performance is hampered. In this mode, the blocks don’t utilize |
| any dictionary metadata and end up wasting the codeword bitset. That isn’t to |
| say all is lost: the decoders can still evaluate a predicate via string |
| comparison, and the fact that evaluation can still occur at the decoder-level |
| means the eager buffering can still be avoided.</p> |
| |
| <p>Dictionary encoding is the ideal case in that the decoders can completely |
| evaluate the predicates. This is not the case for most other encoding types, |
| but having decoders support evaluation leaves the door open for other encoding |
| types to extend this idea.</p> |
| |
| <h2 id="performance">Performance</h2> |
| <p>Depending on the dataset and query, predicate pushdown can lead to significant |
| improvements. Tablet scans were timed with datasets consisting of repeated |
| string patterns of tunable length and tunable cardinality.</p> |
| |
| <p><img src="/img/predicate-pushdown/pushdown-10.png" alt="png" class="img-responsive" /> |
| <img src="/img/predicate-pushdown/pushdown-10M.png" alt="png" class="img-responsive" /></p> |
| |
| <p>The above plots show the time taken to completely scan a single tablet, recorded |
| using a dataset of ten million rows of strings with length ten. Predicates were |
| designed to select values out of bounds (Empty), select a single value (Equal, |
| i.e. for cardinality <em>k</em>, this would select 1/<em>k</em> of the dataset), select half |
| of the full range (Half), and select the full range of values (All).</p> |
| |
| <p>With the original evaluation implementation, the tablet must copy the entire |
| column and scan through it to determine whether any values match. This means that even |
| when the result set is small, the full column is still copied. This is avoided |
| by pushing down predicates, which only copies as needed, and can be seen in the |
| above queries: those with near-empty result sets (Empty and Equal) have shorter |
| scan times than those with larger result sets (Half and All).</p> |
| |
| <p>Note that for dictionary encoding, given a low cardinality, Kudu can completely |
| rely on the dictionary codewords to evaluate, making the query significantly |
| faster. At higher cardinalities, the dictionaries completely fill up and the |
| blocks fall back on plain encoding. The slower, albeit still improved, |
| performance on the dataset containing 10M unique values reflects this.</p> |
| |
| <p><img src="/img/predicate-pushdown/pushdown-tpch.png" alt="png" class="img-responsive" /></p> |
| |
| <p>Similar predicates were run with the TPC-H dataset, querying on the shipdate |
| column. The full path of a query includes not only the tablet scanning itself, |
| but also RPCs and batched data transfer to the caller as the scan progresses. |
| As such, the times plotted above refer to the average end-to-end time required |
| to scan and return a batch of rows. Regardless of this additional overhead, |
| significant improvements on the scan path still yield substantial improvements |
| to the query performance as a whole.</p> |
| |
| <h2 id="conclusion">Conclusion</h2> |
| |
| <p>Pushing down predicate evaluation in Kudu yielded substantial improvements to |
| the scan path. For dictionary encoding, pushdown can be particularly powerful, |
| and other encoding types are either unaffected or also improved. This change has |
| been pushed to the main branch of Kudu, and relevant commits can be found |
| <a href="https://github.com/cloudera/kudu/commit/c0f37278cb09a7781d9073279ea54b08db6e2010">here</a> |
| and |
| <a href="https://github.com/cloudera/kudu/commit/ec80fdb37be44d380046a823b5e6d8e2241ec3da">here</a>.</p> |
| |
| <p>This summer has been a phenomenal learning experience for me, in terms of the |
| tools, the workflow, the datasets, and the thought processes that go into building |
| something at Kudu’s scale. I am extremely thankful for all of the mentoring and |
| support I received, and that I got to be a part of Kudu’s journey from |
| incubating to a Top Level Apache project. I can’t express enough how grateful I |
| am for the amount of support I got from the Kudu team, from the intern |
| coordinators, and from the Cloudera community as a whole.</p></content><author><name>Andrew Wong</name></author><summary>I had the pleasure of interning with the Apache Kudu team at Cloudera this |
| summer. This project was my summer contribution to Kudu: a restructuring of the |
| scan path to speed up queries.</summary></entry><entry><title>An Introduction to the Flume Kudu Sink</title><link href="/2016/08/31/intro-flume-kudu-sink.html" rel="alternate" type="text/html" title="An Introduction to the Flume Kudu Sink" /><published>2016-08-31T00:00:00-07:00</published><updated>2016-08-31T00:00:00-07:00</updated><id>/2016/08/31/intro-flume-kudu-sink</id><content type="html" xml:base="/2016/08/31/intro-flume-kudu-sink.html"><p>This post discusses the Kudu Flume Sink. First, I’ll give some background on why we considered |
| using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.</p> |
| |
| <h2 id="why-kudu">Why Kudu</h2> |
| |
| <p>Traditionally in the Hadoop ecosystem we’ve dealt with various <em>batch processing</em> technologies such |
| as MapReduce and the many libraries and tools built on top of it in various languages (Apache Pig, |
| Apache Hive, Apache Oozie and many others). The main problem with this approach is that it needs to |
| process the whole data set in batches, again and again, as soon as new data gets added. Things get |
| really complicated when a few such tasks need to get chained together, or when the same data set |
| needs to be processed in various ways by different jobs, while all compete for the shared cluster |
| resources.</p> |
| |
| <p>The opposite of this approach is <em>stream processing</em>: process the data as soon as it arrives, not |
| in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make |
| this possible. But writing streaming services is not trivial. The streaming systems are becoming |
| more and more capable and support more complex constructs, but they are not yet easy to use. All |
| queries and processes need to be carefully planned and implemented.</p> |
| |
| <p>To summarize, <em>batch processing</em> is:</p> |
| |
| <ul> |
| <li>file-based</li> |
| <li>a paradigm that processes large chunks of data as a group</li> |
| <li>high latency and high throughput, both for ingest and query</li> |
| <li>typically easy to program, but hard to orchestrate</li> |
| <li>well suited for writing ad-hoc queries, although they are typically high latency</li> |
| </ul> |
| |
| <p>While <em>stream processing</em> is:</p> |
| |
| <ul> |
| <li>a totally different paradigm, which involves single events and time windows instead of large groups of events</li> |
| <li>still file-based and not a long-term database</li> |
| <li>not batch-oriented, but incremental</li> |
| <li>ultra-fast ingest and ultra-fast query (query results basically pre-calculated)</li> |
| <li>not so easy to program, relatively easy to orchestrate</li> |
| <li>impossible to write ad-hoc queries</li> |
| </ul> |
| |
| <p>And a Kudu-based <em>near real-time</em> approach is:</p> |
| |
| <ul> |
| <li>flexible and expressive, thanks to SQL support via Apache Impala (incubating)</li> |
| <li>a table-oriented, mutable data store that feels like a traditional relational database</li> |
| <li>very easy to program, you can even pretend it’s good old MySQL</li> |
| <li>low-latency and relatively high throughput, both for ingest and query</li> |
| </ul> |
| |
| <p>At Argyle Data, we’re dealing with complex fraud detection scenarios. We need to ingest massive |
| amounts of data, run machine learning algorithms and generate reports. When we created our current |
| architecture two years ago we decided to opt for a database as the backbone of our system. That |
| database is Apache Accumulo. It’s a key-value based database which runs on top of Hadoop HDFS, |
| quite similar to HBase but with some important improvements such as cell level security and ease |
| of deployment and management. To enable querying of this data for quite complex reporting and |
| analytics, we used Presto, a distributed query engine with a pluggable architecture open-sourced |
| by Facebook. We wrote a connector for it to let it run queries against the Accumulo database. This |
| architecture has served us well, but there were a few problems:</p> |
| |
| <ul> |
| <li>we need to ingest even more massive volumes of data in real-time</li> |
| <li>we need to perform complex machine-learning calculations on even larger data-sets</li> |
| <li>we need to support ad-hoc queries, plus long-term data warehouse functionality</li> |
| </ul> |
| |
| <p>So, we’ve started gradually moving the core machine-learning pipeline to a streaming-based |
| solution. This way we can ingest and process larger data sets faster and in real time. But then how |
| would we take care of ad-hoc queries and long-term persistence? This is where Kudu comes in. While |
| the machine learning pipeline ingests and processes real-time data, we store a copy of the same |
| ingested data in Kudu for long-term access and ad-hoc queries. Kudu is our <em>data warehouse</em>. By |
| using Kudu and Impala, we can retire our in-house Presto connector and rely on Impala’s |
| super-fast query engine.</p> |
| |
| <p>But how would we make sure data is reliably ingested into the streaming pipeline <em>and</em> the |
| Kudu-based data warehouse? This is where Apache Flume comes in.</p> |
| |
| <h2 id="why-flume">Why Flume</h2> |
| |
| <p>According to their <a href="http://flume.apache.org/">website</a> “Flume is a distributed, reliable, and |
| available service for efficiently collecting, aggregating, and moving large amounts of log data. |
| It has a simple and flexible architecture based on streaming data flows. It is robust and fault |
| tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.” As you |
| can see, nowhere is Hadoop mentioned, yet Flume is typically used for ingesting data into Hadoop |
| clusters.</p> |
| |
| <p><img src="https://blogs.apache.org/flume/mediaresource/ab0d50f6-a960-42cc-971e-3da38ba3adad" alt="png" /></p> |
| |
| <p>Flume has an extensible architecture. An instance of Flume, called an <em>agent</em>, can have multiple |
| <em>channels</em>, with each having multiple <em>sources</em> and <em>sinks</em> of various types. Sources queue data |
| in channels, which in turn write out data to sinks. Such <em>pipelines</em> can be chained together to |
| create even more complex ones. There may be more than one agent and agents can be configured to |
| support failover and recovery.</p> |
| |
| <p>Flume comes with a bunch of built-in types of channels, sources and sinks. Memory channel is the |
| default (an in-memory queue with no persistence to disk), but other options such as Kafka- and |
| File-based channels are also provided. As for sources, Avro, JMS, Thrift, and spooling-directory |
| sources are some of the built-in ones. Flume also ships with many sinks, including sinks for writing |
| data to HDFS, HBase, Hive, Kafka, as well as to other Flume agents.</p> |
| |
| <p>In the rest of this post I’ll go over the Kudu Flume sink and show you how to configure Flume to |
| write ingested data to a Kudu table. The sink has been part of the Kudu distribution since the 0.8 |
| release and the source code can be found <a href="https://github.com/apache/kudu/tree/master/java/kudu-flume-sink">here</a>.</p> |
| |
| <h2 id="configuring-the-kudu-flume-sink">Configuring the Kudu Flume Sink</h2> |
| |
| <p>Here is a sample flume configuration file:</p> |
| |
| <pre><code>agent1.sources = source1 |
| agent1.channels = channel1 |
| agent1.sinks = sink1 |
| |
| agent1.sources.source1.type = exec |
| agent1.sources.source1.command = /usr/bin/vmstat 1 |
| agent1.sources.source1.channels = channel1 |
| |
| agent1.channels.channel1.type = memory |
| agent1.channels.channel1.capacity = 10000 |
| agent1.channels.channel1.transactionCapacity = 1000 |
| |
| agent1.sinks.sink1.type = org.apache.flume.sink.kudu.KuduSink |
| agent1.sinks.sink1.masterAddresses = localhost |
| agent1.sinks.sink1.tableName = stats |
| agent1.sinks.sink1.channel = channel1 |
| agent1.sinks.sink1.batchSize = 50 |
| agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer |
| </code></pre> |
| |
| <p>We define a source called <code>source1</code> which simply executes a <code>vmstat</code> command to continuously generate |
| virtual memory statistics for the machine and queue events into an in-memory <code>channel1</code> channel, |
| which in turn is used for writing these events to a Kudu table called <code>stats</code>. We are using |
| <code>org.apache.kudu.flume.sink.SimpleKuduEventProducer</code> as the producer. <code>SimpleKuduEventProducer</code> is |
| the built-in and default producer, but it’s implemented as a showcase for how to write Flume |
| events into Kudu tables. For any serious functionality we’d have to write a custom producer. We |
| need to make this producer and the <code>KuduSink</code> class available to Flume. We can do that by simply |
| copying the <code>kudu-flume-sink-&lt;VERSION&gt;.jar</code> jar file from the Kudu distribution to the |
| <code>$FLUME_HOME/plugins.d/kudu-sink/lib</code> directory in the Flume installation. The jar file contains |
| <code>KuduSink</code> and all of its dependencies (including Kudu java client classes).</p> |
| |
| <p>At a minimum, the Kudu Flume Sink needs to know where the Kudu masters are |
| (<code>agent1.sinks.sink1.masterAddresses = localhost</code>) and which Kudu table should be used for writing |
| Flume events to (<code>agent1.sinks.sink1.tableName = stats</code>). The Kudu Flume Sink doesn’t create this |
| table; it has to be created before the sink is started.</p> |
| |
| <p>You may also notice the <code>batchSize</code> parameter. The sink takes up to that many Flume events |
| from the channel and flushes them to Kudu in a single batch. Tuning <code>batchSize</code> properly can have a huge |
| impact on the ingest performance of the Kudu cluster.</p> |
| |
| <p>Here is a complete list of KuduSink parameters:</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th>Parameter Name</th> |
| <th>Default</th> |
| <th>Description</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>masterAddresses</td> |
| <td>N/A</td> |
| <td>Comma-separated list of “host:port” pairs of the masters (port optional)</td> |
| </tr> |
| <tr> |
| <td>tableName</td> |
| <td>N/A</td> |
| <td>The name of the table in Kudu to write to</td> |
| </tr> |
| <tr> |
| <td>producer</td> |
| <td>org.apache.kudu.flume.sink.SimpleKuduEventProducer</td> |
| <td>The fully qualified class name of the Kudu event producer the sink should use</td> |
| </tr> |
| <tr> |
| <td>batchSize</td> |
| <td>100</td> |
| <td>Maximum number of events the sink should take from the channel per transaction, if available</td> |
| </tr> |
| <tr> |
| <td>timeoutMillis</td> |
| <td>30000</td> |
| <td>Timeout period for Kudu operations, in milliseconds</td> |
| </tr> |
| <tr> |
| <td>ignoreDuplicateRows</td> |
| <td>true</td> |
| <td>Whether to ignore errors indicating that we attempted to insert duplicate rows into Kudu</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <p>Let’s take a look at the source code for the built-in producer class:</p> |
| |
| <pre><code class="language-java">public class SimpleKuduEventProducer implements KuduEventProducer { |
| private byte[] payload; |
| private KuduTable table; |
| private String payloadColumn; |
| |
| public SimpleKuduEventProducer(){ |
| } |
| |
| @Override |
| public void configure(Context context) { |
| payloadColumn = context.getString("payloadColumn","payload"); |
| } |
| |
| @Override |
| public void configure(ComponentConfiguration conf) { |
| } |
| |
| @Override |
| public void initialize(Event event, KuduTable table) { |
| this.payload = event.getBody(); |
| this.table = table; |
| } |
| |
| @Override |
| public List&lt;Operation&gt; getOperations() throws FlumeException { |
| try { |
| Insert insert = table.newInsert(); |
| PartialRow row = insert.getRow(); |
| row.addBinary(payloadColumn, payload); |
| |
| return Collections.singletonList((Operation) insert); |
| } catch (Exception e){ |
| throw new FlumeException("Failed to create Kudu Insert object!", e); |
| } |
| } |
| |
| @Override |
| public void close() { |
| } |
| } |
| </code></pre> |
| |
| <p><code>SimpleKuduEventProducer</code> implements the <code>org.apache.kudu.flume.sink.KuduEventProducer</code> interface, |
| which itself looks like this:</p> |
| |
| <pre><code class="language-java">public interface KuduEventProducer extends Configurable, ConfigurableComponent { |
| /** |
| * Initialize the event producer. |
| * @param event to be written to Kudu |
| * @param table the KuduTable object used for creating Kudu Operation objects |
| */ |
| void initialize(Event event, KuduTable table); |
| |
| /** |
| * Get the operations that should be written out to Kudu as a result of this |
| * event. This list is written to Kudu using the Kudu client API. |
|  * @return List of {@link org.apache.kudu.client.Operation} which |
| * are written as such to Kudu |
| */ |
| List&lt;Operation&gt; getOperations(); |
| |
| /* |
| * Clean up any state. This will be called when the sink is being stopped. |
| */ |
| void close(); |
| } |
| </code></pre> |
| |
| <p><code>public void configure(Context context)</code> is called when an instance of our producer is instantiated |
| by the KuduSink. SimpleKuduEventProducer’s implementation looks for a producer parameter named |
| <code>payloadColumn</code> and uses its value (“payload” if not overridden in Flume configuration file) as the |
| column which will hold the value of the Flume event payload. If you recall from above, we had |
| configured the KuduSink to listen for events generated from the <code>vmstat</code> command. Each output row |
| from that command will be stored as a new row containing a <code>payload</code> column in the <code>stats</code> table. |
| <code>SimpleKuduEventProducer</code> does not have any other configuration parameters, but if it had any we would |
| define them by prefixing them with <code>producer.</code> (<code>agent1.sinks.sink1.producer.parameter1</code> for |
| example).</p> |
| |
| <p>The main producer logic resides in the <code>public List&lt;Operation&gt; getOperations()</code> method. In |
| SimpleKuduEventProducer’s implementation we simply insert the binary body of the Flume event into |
| the Kudu table. Here we call Kudu’s <code>newInsert()</code> to initiate an insert, but could have used |
| <code>Upsert</code> if updating an existing row was also an option; in fact, there’s another producer |
| implementation available for doing just that: <code>SimpleKeyedKuduEventProducer</code>. In the real world |
| you will most likely need to write your own custom producer, but you can base your implementation |
| on the built-in ones.</p> |
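| <p>A custom producer typically parses the event body into typed columns before building operations. As a self-contained sketch of just the parsing half (the Flume and Kudu plumbing such as <code>newInsert()</code> and <code>PartialRow</code> is omitted, and the helper name here is hypothetical), this is one way to split <code>vmstat</code> output into numeric fields:</p> |

```java
// Hypothetical parsing helper for a custom Kudu event producer: split one
// line of `vmstat` output into numeric fields. The Flume/Kudu plumbing
// (newInsert(), PartialRow, addLong(), etc.) is omitted so the sketch
// stays self-contained.
class VmstatLineParser {
  // Returns null for header lines, or the parsed numeric fields for data rows.
  static long[] parse(String line) {
    String[] fields = line.trim().split("\\s+");
    long[] values = new long[fields.length];
    for (int i = 0; i < fields.length; i++) {
      try {
        values[i] = Long.parseLong(fields[i]);
      } catch (NumberFormatException e) {
        return null;  // non-numeric token: this is a header row
      }
    }
    return values;
  }
}
```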
| |
| <p>In the future, we plan to add more flexible event producer implementations so that creation of a |
| custom event producer is not required to write data to Kudu. See |
| <a href="https://gerrit.cloudera.org/#/c/4034/">here</a> for a work-in-progress generic event producer for |
| Avro-encoded Events.</p> |
| |
| <h2 id="conclusion">Conclusion</h2> |
| |
| <p>Kudu is a scalable data store which lets us ingest insane amounts of data per second. Apache Flume |
| helps us aggregate data from various sources, and the Kudu Flume Sink lets us easily store |
| the aggregated Flume events into Kudu. Together they enable us to create a data warehouse out of |
| disparate sources.</p> |
| |
| <p><em>Ara Abrahamian is a software engineer at Argyle Data building fraud detection systems using |
| sophisticated machine learning methods. Ara is the original author of the Flume Kudu Sink that |
| is included in the Kudu distribution. You can follow him on Twitter at |
| <a href="https://twitter.com/ara_e">@ara_e</a>.</em></p></content><author><name>Ara Abrahamian</name></author><summary>This post discusses the Kudu Flume Sink. First, I&#8217;ll give some background on why we considered |
| using Kudu, what Flume does for us, and how Flume fits with Kudu in our project. |
| |
| Why Kudu |
| |
| Traditionally in the Hadoop ecosystem we&#8217;ve dealt with various batch processing technologies such |
| as MapReduce and the many libraries and tools built on top of it in various languages (Apache Pig, |
| Apache Hive, Apache Oozie and many others). The main problem with this approach is that it needs to |
| process the whole data set in batches, again and again, as soon as new data gets added. Things get |
| really complicated when a few such tasks need to get chained together, or when the same data set |
| needs to be processed in various ways by different jobs, while all compete for the shared cluster |
| resources. |
| |
| The opposite of this approach is stream processing: process the data as soon as it arrives, not |
| in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make |
| this possible. But writing streaming services is not trivial. The streaming systems are becoming |
| more and more capable and support more complex constructs, but they are not yet easy to use. All |
| queries and processes need to be carefully planned and implemented. |
| |
| To summarize, batch processing is: |
| |
| |
| file-based |
| a paradigm that processes large chunks of data as a group |
| high latency and high throughput, both for ingest and query |
| typically easy to program, but hard to orchestrate |
| well suited for writing ad-hoc queries, although they are typically high latency |
| |
| |
| While stream processing is: |
| |
| |
| - a totally different paradigm, which involves single events and time windows instead of large groups of events
| - still file-based rather than a long-term database
| - not batch-oriented, but incremental
| - ultra-fast ingest and ultra-fast query (query results basically pre-calculated)
| - not so easy to program, but relatively easy to orchestrate
| - ad-hoc queries are impossible to write
| |
| |
| And a Kudu-based near real-time approach is: |
| |
| |
| - flexible and expressive, thanks to SQL support via Apache Impala (incubating)
| - a table-oriented, mutable data store that feels like a traditional relational database
| - very easy to program; you can even pretend it&#8217;s good old MySQL
| - low-latency and relatively high-throughput, both for ingest and query
| |
| |
| At Argyle Data, we&#8217;re dealing with complex fraud detection scenarios. We need to ingest massive |
| amounts of data, run machine learning algorithms and generate reports. When we created our current |
| architecture two years ago we decided to opt for a database as the backbone of our system. That |
| database is Apache Accumulo. It&#8217;s a key-value based database which runs on top of Hadoop HDFS, |
| quite similar to HBase but with some important improvements such as cell level security and ease |
| of deployment and management. To enable querying of this data for quite complex reporting and |
| analytics, we used Presto, a distributed query engine with a pluggable architecture open-sourced |
| by Facebook. We wrote a connector for it to let it run queries against the Accumulo database. This |
| architecture has served us well, but there were a few problems: |
| |
| |
| - we need to ingest even more massive volumes of data in real-time
| - we need to perform complex machine-learning calculations on even larger data-sets
| - we need to support ad-hoc queries, plus long-term data warehouse functionality
| |
| |
| So, we&#8217;ve started gradually moving the core machine-learning pipeline to a streaming-based
| solution. This way we can ingest and process larger data-sets faster, in real time. But then how
| would we take care of ad-hoc queries and long-term persistence? This is where Kudu comes in. While |
| the machine learning pipeline ingests and processes real-time data, we store a copy of the same |
| ingested data in Kudu for long-term access and ad-hoc queries. Kudu is our data warehouse. By |
| using Kudu and Impala, we can retire our in-house Presto connector and rely on Impala&#8217;s |
| super-fast query engine. |
| |
| But how would we make sure data is reliably ingested into the streaming pipeline and the |
| Kudu-based data warehouse? This is where Apache Flume comes in. |
| |
| Why Flume |
| |
| According to its website, &#8220;Flume is a distributed, reliable, and
| available service for efficiently collecting, aggregating, and moving large amounts of log data. |
| It has a simple and flexible architecture based on streaming data flows. It is robust and fault |
| tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.&#8221; As you
| can see, Hadoop is mentioned nowhere, yet Flume is typically used for ingesting data into Hadoop
| clusters.
| |
| |
| |
| Flume has an extensible architecture. An instance of Flume, called an agent, can have multiple |
| channels, with each having multiple sources and sinks of various types. Sources queue data |
| in channels, which in turn write out data to sinks. Such pipelines can be chained together to |
| create even more complex ones. There may be more than one agent and agents can be configured to |
| support failover and recovery. |
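| As a rough mental model of that pipeline (plain Java, not Flume&#8217;s actual API; all names here are
| invented), a channel can be thought of as a queue that sources feed and a sink drains in batches:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model (not Flume's real API) of one agent: a source enqueues events
// into a channel, and a sink drains the channel one batch at a time.
public class AgentModel {
    private final Queue<String> channel = new ArrayDeque<>();

    // Source side: queue an event into the channel.
    public void enqueue(String event) {
        channel.add(event);
    }

    // Sink side: take up to batchSize events from the channel per transaction.
    public List<String> drain(int batchSize) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !channel.isEmpty()) {
            batch.add(channel.remove());
        }
        return batch;
    }

    public static void main(String[] args) {
        AgentModel agent = new AgentModel();
        for (int i = 0; i < 5; i++) {
            agent.enqueue("event-" + i);
        }
        System.out.println(agent.drain(3)); // prints [event-0, event-1, event-2]
        System.out.println(agent.drain(3)); // prints [event-3, event-4]
    }
}
```

| Real Flume channels add transactions, durability options, and failover on top of this basic
| queueing behavior.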
| |
| Flume comes with a bunch of built-in types of channels, sources and sinks. The memory channel is
| the default (an in-memory queue with no persistence to disk), but other options such as Kafka- and
| file-based channels are also provided. As for sources, Avro, JMS, Thrift, and spooling-directory
| sources are among the built-in ones. Flume also ships with many sinks, including sinks for writing
| data to HDFS, HBase, Hive, Kafka, as well as to other Flume agents.
| |
| In the rest of this post I&#8217;ll go over the Kudu Flume sink and show you how to configure Flume to |
| write ingested data to a Kudu table. The sink has been part of the Kudu distribution since the 0.8 |
| release and the source code can be found here. |
| |
| Configuring the Kudu Flume Sink |
| |
| Here is a sample flume configuration file: |
| |
| agent1.sources = source1 |
| agent1.channels = channel1 |
| agent1.sinks = sink1 |
| |
| agent1.sources.source1.type = exec |
| agent1.sources.source1.command = /usr/bin/vmstat 1 |
| agent1.sources.source1.channels = channel1 |
| |
| agent1.channels.channel1.type = memory |
| agent1.channels.channel1.capacity = 10000 |
| agent1.channels.channel1.transactionCapacity = 1000 |
| |
| agent1.sinks.sink1.type = org.apache.flume.sink.kudu.KuduSink |
| agent1.sinks.sink1.masterAddresses = localhost |
| agent1.sinks.sink1.tableName = stats |
| agent1.sinks.sink1.channel = channel1 |
| agent1.sinks.sink1.batchSize = 50 |
| agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer |
| |
| |
| We define a source called source1 which simply executes a vmstat command to continuously generate
| virtual memory statistics for the machine, queuing the resulting events into the in-memory channel1
| channel, which in turn is used for writing these events to a Kudu table called stats. We are using
| org.apache.kudu.flume.sink.SimpleKuduEventProducer as the producer. SimpleKuduEventProducer is |
| the built-in and default producer, but it&#8217;s implemented as a showcase for how to write Flume |
| events into Kudu tables. For any serious functionality we&#8217;d have to write a custom producer. We |
| need to make this producer and the KuduSink class available to Flume. We can do that by simply
| copying the kudu-flume-sink-&lt;VERSION&gt;.jar file from the Kudu distribution to the
| $FLUME_HOME/plugins.d/kudu-sink/lib directory in the Flume installation. The jar file contains
| KuduSink and all of its dependencies (including the Kudu java client classes).
| |
| At a minimum, the Kudu Flume Sink needs to know where the Kudu masters are
| (agent1.sinks.sink1.masterAddresses = localhost) and which Kudu table Flume events should be
| written to (agent1.sinks.sink1.tableName = stats). The Kudu Flume Sink doesn&#8217;t create this
| table; it has to be created before the Kudu Flume Sink is started.
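| As a sketch of that prerequisite step (the schema and partitioning below are assumptions, not taken
| from this post; here we pretend the payload column doubles as the primary key), the stats table
| could be created with the Kudu Java client along these lines:

```java
import java.util.Arrays;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

// Hedged sketch: create the stats table before starting the Kudu Flume Sink.
// The schema is an assumption; the payload column serves as the primary key here.
public class CreateStatsTable {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("localhost").build();
    try {
      Schema schema = new Schema(Arrays.asList(
          new ColumnSchema.ColumnSchemaBuilder("payload", Type.BINARY).key(true).build()));
      client.createTable("stats", schema,
          new CreateTableOptions().addHashPartitions(Arrays.asList("payload"), 4));
    } finally {
      client.shutdown();
    }
  }
}
```

| The sink only requires that the table exist and that its columns match what the producer sets, so
| creating the table through Impala would work just as well.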
| |
| You may also notice the batchSize parameter. The sink accumulates up to that many Flume events
| and flushes them to Kudu in one shot. Tuning batchSize properly can have a huge impact on the
| ingest performance of the Kudu cluster.
| |
| Here is a complete list of KuduSink parameters: |
| |
| 
| - masterAddresses (no default): comma-separated list of &#8220;host:port&#8221; pairs of the masters (port optional)
| - tableName (no default): the name of the table in Kudu to write to
| - producer (default: org.apache.kudu.flume.sink.SimpleKuduEventProducer): the fully qualified class name of the Kudu event producer the sink should use
| - batchSize (default: 100): maximum number of events the sink should take from the channel per transaction, if available
| - timeoutMillis (default: 30000): timeout period for Kudu operations, in milliseconds
| - ignoreDuplicateRows (default: true): whether to ignore errors indicating that we attempted to insert duplicate rows into Kudu
| 
| Let&#8217;s take a look at the source code for the built-in producer class: |
| |
| public class SimpleKuduEventProducer implements KuduEventProducer { |
| private byte[] payload; |
| private KuduTable table; |
| private String payloadColumn; |
| |
| public SimpleKuduEventProducer(){ |
| } |
| |
| @Override |
| public void configure(Context context) { |
| payloadColumn = context.getString("payloadColumn","payload"); |
| } |
| |
| @Override |
| public void configure(ComponentConfiguration conf) { |
| } |
| |
| @Override |
| public void initialize(Event event, KuduTable table) { |
| this.payload = event.getBody(); |
| this.table = table; |
| } |
| |
| @Override |
| public List&lt;Operation&gt; getOperations() throws FlumeException { |
| try { |
| Insert insert = table.newInsert(); |
| PartialRow row = insert.getRow(); |
| row.addBinary(payloadColumn, payload); |
| |
| return Collections.singletonList((Operation) insert); |
| } catch (Exception e){ |
| throw new FlumeException("Failed to create Kudu Insert object!", e); |
| } |
| } |
| |
| @Override |
| public void close() { |
| } |
| } |
| |
| |
| SimpleKuduEventProducer implements the org.apache.kudu.flume.sink.KuduEventProducer interface, |
| which itself looks like this: |
| |
| public interface KuduEventProducer extends Configurable, ConfigurableComponent { |
| /** |
| * Initialize the event producer. |
| * @param event to be written to Kudu |
| * @param table the KuduTable object used for creating Kudu Operation objects |
| */ |
| void initialize(Event event, KuduTable table); |
| |
| /** |
| * Get the operations that should be written out to Kudu as a result of this |
| * event. This list is written to Kudu using the Kudu client API. |
| * @return List of {@link org.apache.kudu.client.Operation} which
| * are written as such to Kudu |
| */ |
| List&lt;Operation&gt; getOperations(); |
| |
| /* |
| * Clean up any state. This will be called when the sink is being stopped. |
| */ |
| void close(); |
| } |
| |
| |
| public void configure(Context context) is called when an instance of our producer is instantiated
| by the KuduSink. SimpleKuduEventProducer&#8217;s implementation looks for a producer parameter named
| payloadColumn and uses its value (&#8220;payload&#8221; if not overridden in the Flume configuration file) as
| the name of the column which will hold the value of the Flume event payload. If you recall from
| above, we had configured the KuduSink to listen for events generated from the vmstat command. Each
| output row from that command will be stored as a new row containing a payload column in the stats
| table. payloadColumn is SimpleKuduEventProducer&#8217;s only configuration parameter; any such
| producer parameter is defined by prefixing it with producer. (agent1.sinks.sink1.producer.parameter1
| for example).
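| For instance, to store the event payload in a column named msg instead of the default payload
| column (msg is a hypothetical column that would have to exist in the target table), we would set:

```
agent1.sinks.sink1.producer.payloadColumn = msg
```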
| |
| The main producer logic resides in the public List&lt;Operation&gt; getOperations() method. In
| SimpleKuduEventProducer&#8217;s implementation we simply insert the binary body of the Flume event into
| the Kudu table. Here we call Kudu&#8217;s newInsert() to initiate an insert, but we could have used
| newUpsert() if updating an existing row were also an option; in fact, there&#8217;s another producer
| implementation available for doing just that: SimpleKeyedKuduEventProducer. In the real world you
| will most probably need to write your own custom producer, but you can base your implementation on
| the built-in ones.
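| As a rough sketch of that upsert-based approach (hypothetical: the key handling below is invented
| and may differ from SimpleKeyedKuduEventProducer; here the row key is taken from a Flume event
| header named key and written to a column named key), a keyed producer could look like this:

```java
import java.util.Collections;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.Operation;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.Upsert;
import org.apache.kudu.flume.sink.KuduEventProducer;

// Hypothetical keyed producer: upserts each event under a key taken from a
// Flume event header named "key" (header and column names are assumptions).
public class KeyedUpsertEventProducer implements KuduEventProducer {
  private Event event;
  private KuduTable table;
  private String payloadColumn;

  @Override
  public void configure(Context context) {
    payloadColumn = context.getString("payloadColumn", "payload");
  }

  @Override
  public void configure(ComponentConfiguration conf) {
  }

  @Override
  public void initialize(Event event, KuduTable table) {
    // Keep the whole event so getOperations() can read its headers.
    this.event = event;
    this.table = table;
  }

  @Override
  public List<Operation> getOperations() throws FlumeException {
    try {
      // newUpsert() updates the row if one with the same key already exists.
      Upsert upsert = table.newUpsert();
      PartialRow row = upsert.getRow();
      row.addString("key", event.getHeaders().get("key"));
      row.addBinary(payloadColumn, event.getBody());
      return Collections.singletonList((Operation) upsert);
    } catch (Exception e) {
      throw new FlumeException("Failed to create Kudu Upsert object!", e);
    }
  }

  @Override
  public void close() {
  }
}
```

| The structure mirrors SimpleKuduEventProducer; only the operation type and the key column handling
| change.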
| |
| In the future, we plan to add more flexible event producer implementations so that creating a
| custom event producer is not required to write data to Kudu. See
| here for a work-in-progress generic event producer for
| Avro-encoded events.
| |
| Conclusion |
| |
| Kudu is a scalable data store which lets us ingest insane amounts of data per second. Apache Flume |
| helps us aggregate data from various sources, and the Kudu Flume Sink lets us easily store |
| the aggregated Flume events into Kudu. Together they enable us to create a data warehouse out of |
| disparate sources. |
| |
| Ara Abrahamian is a software engineer at Argyle Data building fraud detection systems using |
| sophisticated machine learning methods. Ara is the original author of the Flume Kudu Sink that |
| is included in the Kudu distribution. You can follow him on Twitter at |
| @ara_e.</summary></entry><entry><title>New Range Partitioning Features in Kudu 0.10</title><link href="/2016/08/23/new-range-partitioning-features.html" rel="alternate" type="text/html" title="New Range Partitioning Features in Kudu 0.10" /><published>2016-08-23T00:00:00-07:00</published><updated>2016-08-23T00:00:00-07:00</updated><id>/2016/08/23/new-range-partitioning-features</id><content type="html" xml:base="/2016/08/23/new-range-partitioning-features.html"><p>Kudu 0.10 is shipping with a few important new features for range partitioning. |
| These features are designed to make Kudu easier to scale for certain workloads, |
| like time series. This post will introduce these features, and discuss how to use |
| them to effectively design tables for scalability and performance.</p> |
| |
| <!--more--> |
| |
| <p>Since Kudu’s initial release, tables have had the constraint that once created, |
| the set of partitions is static. This forces users to plan ahead and create |
| enough partitions for the expected size of the table, because once the table is |
| created no further partitions can be added. When using hash partitioning, |
| creating more partitions is as straightforward as specifying more buckets. For |
| range partitioning, however, knowing where to put the extra partitions ahead of |
| time can be difficult or impossible.</p> |
| |
| <p>The common solution to this problem in other distributed databases is to allow |
| range partitions to split into smaller child range partitions. Unfortunately, |
| range splitting typically has a large performance impact on running tables, |
| since child partitions need to eventually be recompacted and rebalanced to a |
| remote server. Range splitting is particularly thorny with Kudu, because rows |
| are stored in tablets in primary key sorted order, which does not necessarily |
| match the range partitioning order. If the range partition key is different than |
| the primary key, then splitting requires inspecting and shuffling each |
| individual row, instead of splitting the tablet in half.</p> |
| |
| <h2 id="adding-and-dropping-range-partitions">Adding and Dropping Range Partitions</h2> |
| |
| <p>As an alternative to range partition splitting, Kudu now allows range partitions |
| to be added and dropped on the fly, without locking the table or otherwise |
| affecting concurrent operations on other partitions. This solution is not |
| strictly as powerful as full range partition splitting, but it strikes a good |
| balance between flexibility, performance, and operational overhead. |
| Additionally, this feature does not preclude range splitting in the future if |
| there is a push to implement it. To support adding and dropping range |
| partitions, Kudu had to remove an even more fundamental restriction when using |
| range partitions.</p> |
| |
| <p>Previously, range partitions could only be created by specifying split points. |
| Split points divide an implicit partition covering the entire range into |
| contiguous and disjoint partitions. When using split points, the first and last |
| partitions are always unbounded below and above, respectively. A consequence of |
| the final partition being unbounded is that datasets which are range-partitioned |
| on a column that increases in value over time will eventually have far more rows |
| in the last partition than in any other. Unbalanced partitions are commonly |
| referred to as hotspots, and until Kudu 0.10 they have been difficult to avoid |
| when storing time series data in Kudu.</p> |
| |
| <p><img src="/img/2016-08-23-new-range-partitioning-features/range-partitioning-on-time.png" alt="png" class="img-responsive" /></p> |
| |
| <p>The figure above shows the tablets created by two different attempts to |
| partition a table by range on a timestamp column. The first, above in blue, uses |
| split points. The second, below in green, uses bounded range partitions |
| specified during table creation. With bounded range partitions, there is no |
| longer a guarantee that every possible row has a corresponding range partition. |
| As a result, Kudu will now reject writes which fall in a ‘non-covered’ range.</p> |
| |
| <p>Now that tables are no longer required to have range partitions covering all |
| possible rows, Kudu can support adding range partitions to cover the otherwise |
| unoccupied space. Dropping a range partition will result in unoccupied space |
| where the range partition was previously. In the example above, we may want to |
| add a range partition covering 2017 at the end of the year, so that we can |
| continue collecting data in the future. By lazily adding range partitions we |
| avoid hotspotting, avoid the need to specify range partitions up front for time |
| periods far in the future, and avoid the downsides of splitting. Additionally, |
| historical data which is no longer useful can be efficiently deleted by dropping |
| the entire range partition.</p> |
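| <p>As a rough illustration of this covering behavior (plain Java, not the Kudu client API; the
| years and names are invented), range partitions can be modeled as half-open intervals, with writes
| to non-covered ranges rejected and partitions added or dropped on the fly:</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative model (not Kudu's API) of range partition covering:
// each partition covers a half-open interval [lower, upper) of years.
public class RangeCoverage {
    private final TreeMap<Integer, Integer> partitions = new TreeMap<>();

    public void addRangePartition(int lower, int upper) {
        partitions.put(lower, upper);
    }

    public void dropRangePartition(int lower) {
        partitions.remove(lower);
    }

    // A write is accepted only when some partition covers its key;
    // writes to a non-covered range are rejected.
    public boolean accepts(int key) {
        Map.Entry<Integer, Integer> p = partitions.floorEntry(key);
        return p != null && key < p.getValue();
    }

    public static void main(String[] args) {
        RangeCoverage table = new RangeCoverage();
        table.addRangePartition(2015, 2016);
        table.addRangePartition(2016, 2017);
        System.out.println(table.accepts(2017)); // prints false: 2017 not covered yet
        table.addRangePartition(2017, 2018);     // lazily add next year's partition
        System.out.println(table.accepts(2017)); // prints true
        table.dropRangePartition(2015);          // efficiently retire old data
        System.out.println(table.accepts(2015)); // prints false
    }
}
```

| <p>Kudu range partition bounds follow the same half-open convention by default: inclusive lower
| bounds and exclusive upper bounds.</p>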
| |
| <h2 id="what-about-hash-partitioning">What About Hash Partitioning?</h2> |
| |
| <p>Since Kudu’s hash partitioning feature originally shipped in version 0.6, it has |
| been possible to create tables which combine hash partitioning with range |
| partitioning. The new range partitioning features continue to work seamlessly |
| when combined with hash partitioning. Just as before, the number of tablets |
| which comprise a table will be the product of the number of range partitions and |
| the number of hash partition buckets. Adding or dropping a range partition will |
| result in the creation or deletion of one tablet per hash bucket.</p> |
| |
| <p><img src="/img/2016-08-23-new-range-partitioning-features/range-and-hash-partitioning.png" alt="png" class="img-responsive" /></p> |
| |
| <p>The diagram above shows a time series table range-partitioned on the timestamp |
| and hash-partitioned with two buckets. The hash partitioning could be on the |
| timestamp column, or it could be on any other column or columns in the primary |
| key. In this example only two years of historical data is needed, so at the end |
| of 2016 a new range partition is added for 2017 and the historical 2014 range |
| partition is dropped. This causes two new tablets to be created for 2017, and |
| the two existing tablets for 2014 to be deleted.</p> |
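| <p>The scenario above can be sketched with the Java client. The code below is a hedged sketch
| rather than something from this post: the table and column names are invented, and the range
| bounds assume a ts column storing epoch seconds:</p>

```java
import org.apache.kudu.Schema;
import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;

// Hedged sketch: add a 2017 range partition and drop the 2014 one on a
// hypothetical "metrics" table range-partitioned on a "ts" (INT64) column.
public class AddDropRangePartitions {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("localhost").build();
    try {
      Schema schema = client.openTable("metrics").getSchema();

      // Add a range partition covering 2017: [lower, upper).
      PartialRow lower = schema.newPartialRow();
      lower.addLong("ts", 1483228800L); // 2017-01-01 UTC, epoch seconds
      PartialRow upper = schema.newPartialRow();
      upper.addLong("ts", 1514764800L); // 2018-01-01 UTC
      client.alterTable("metrics", new AlterTableOptions().addRangePartition(lower, upper));

      // Drop the historical 2014 partition.
      PartialRow oldLower = schema.newPartialRow();
      oldLower.addLong("ts", 1388534400L); // 2014-01-01 UTC
      PartialRow oldUpper = schema.newPartialRow();
      oldUpper.addLong("ts", 1420070400L); // 2015-01-01 UTC
      client.alterTable("metrics", new AlterTableOptions().dropRangePartition(oldLower, oldUpper));
    } finally {
      client.shutdown();
    }
  }
}
```

| <p>Dropping a range partition requires specifying the same bounds the partition was created
| with.</p>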
| |
| <h2 id="getting-started">Getting Started</h2> |
| |
| <p>Beginning with the Kudu 0.10 release, users can add and drop range partitions |
| through the Java and C++ client APIs. Range partitions on existing tables can be |
| dropped and replacements added, but it requires the servers and all clients to |
| be updated to 0.10.</p></content><author><name>Dan Burkert</name></author><summary>Kudu 0.10 is shipping with a few important new features for range partitioning. |
| These features are designed to make Kudu easier to scale for certain workloads, |
| like time series. This post will introduce these features, and discuss how to use |
| them to effectively design tables for scalability and performance.</summary></entry></feed> |