| <?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2016-10-11T15:00:02-07:00</updated><id>/</id><entry><title>Apache Kudu Weekly Update October 11th, 2016</title><link href="/2016/10/11/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update October 11th, 2016" /><published>2016-10-11T00:00:00-07:00</published><updated>2016-10-11T00:00:00-07:00</updated><id>/2016/10/11/weekly-update</id><content type="html" xml:base="/2016/10/11/weekly-update.html"><p>Welcome to the twenty-first edition of the Kudu Weekly Update. Astute |
| readers will notice that the weekly blog posts have been not-so-weekly |
| of late – in fact, it has been nearly two months since the previous post |
| as I and others have focused on releases, conferences, etc.</p> |
| |
| <p>So, rather than covering just this past week, this post will cover highlights |
| of the progress since the 1.0 release in mid-September. If you’re interested |
| in learning about progress prior to that release, check the |
| <a href="http://kudu.apache.org/releases/1.0.0/docs/release_notes.html">release notes</a>.</p> |
| |
| <!--more--> |
| |
| <h2 id="project-news">Project news</h2> |
| |
| <ul> |
| <li> |
| <p>On September 12th, the Kudu PMC announced that Alexey Serbin and Will |
| Berkeley had been voted as new committers and PMC members.</p> |
| |
| <p>Alexey’s contributions prior to committership included |
| <a href="https://gerrit.cloudera.org/#/c/3952/">AUTO_FLUSH_BACKGROUND</a> support |
| in C++ as well as <a href="http://kudu.apache.org/apidocs/">API documentation</a> |
| for the C++ client API.</p> |
| |
| <p>Will’s contributions include several fixes to the web UIs, large |
| improvements to the Flume integration, and a lot of good work
| burning down long-standing bugs.</p> |
| |
| <p>Both contributors were “acting the part” and the PMC was pleased to |
| recognize their contributions with committership.</p> |
| </li> |
| <li> |
| <p>Kudu 1.0.0 was <a href="https://kudu.apache.org/2016/09/20/apache-kudu-1-0-0-released.html">released</a> |
| on September 19th. Most community members have upgraded by this point |
| and have been reporting improved stability and performance.</p> |
| </li> |
| <li> |
| <p>Dan Burkert has been managing a Kudu 1.0.1 release to address a few |
| important bugs discovered since 1.0.0. The vote passed on Monday |
| afternoon, so the release should be made officially available |
| later this week.</p> |
| </li> |
| </ul> |
| |
| <h2 id="development-discussions-and-code-in-progress">Development discussions and code in progress</h2> |
| |
| <ul> |
| <li>After the 1.0 release, many contributors have gone into a design phase |
| for upcoming work. Over the last couple of weeks, developers have posted |
| scoping and design documents for topics including: |
| <ul> |
| <li><a href="https://docs.google.com/document/d/1cPNDTpVkIUo676RlszpTF1gHZ8l0TdbB7zFBAuOuYUw/edit#heading=h.gsibhnd5dyem">Security features</a> (Todd Lipcon)</li> |
| <li><a href="https://goo.gl/wP5BJb">Improved disk-failure handling</a> (Dinesh Bhat)</li> |
| <li><a href="https://s.apache.org/7K48">Tools for manual recovery from corruption</a> (Mike Percy and Dinesh Bhat)</li> |
| <li><a href="https://s.apache.org/uOOt">Addressing issues seen with the LogBlockManager</a> (Adar Dembo)</li> |
| <li><a href="https://s.apache.org/7VCo">Providing proper snapshot/serializable consistency</a> (David Alves)</li> |
| <li><a href="https://s.apache.org/ARUP">Improving re-replication of under-replicated tablets</a> (Mike Percy)</li> |
| <li><a href="https://docs.google.com/document/d/1066W63e2YUTNnecmfRwgAHghBPnL1Pte_gJYAaZ_Bjo/edit">Avoiding Raft election storms</a> (Todd Lipcon)</li> |
| </ul> |
| |
| <p>The development community has no particular rule that all work must be |
| accompanied by such a document, but in the past they have proven useful |
| for fleshing out ideas around a design before beginning implementation. |
| As Kudu matures, we can probably expect to see more of this kind of planning |
| and design discussion.</p> |
| |
| <p>If any of the above work areas sounds interesting to you, please take a |
| look and leave your comments! Similarly, if you are interested in contributing |
| in any of these areas, please feel free to volunteer on the mailing list. |
| Help of all kinds (coding, documentation, testing, etc) is welcomed.</p> |
| </li> |
| <li>Adar Dembo spent a chunk of time re-working the <code>thirdparty</code> directory |
| that contains most of Kudu’s native dependencies. The major resulting |
| changes are: |
| <ul> |
| <li>Build directories are now cleanly isolated from source directories, |
| improving cleanliness of re-builds.</li> |
| <li>ThreadSanitizer (TSAN) builds now use <code>libc++</code> instead of <code>libstdc++</code>
| for C++ library support. The <code>libc++</code> library has better support for |
| sanitizers, is easier to build in isolation, and solves some compatibility |
| issues that Adar was facing with GCC 5 on Ubuntu Xenial.</li> |
| <li>All of the thirdparty dependencies now build with TSAN instrumentation, |
| which improves our coverage of this very effective tooling.</li> |
| </ul> |
| |
| <p>The impact to most developers is that, if you have an old source checkout, |
| it’s highly likely you will need to clean and re-build the thirdparty |
| directory.</p> |
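| 
| <p>A hedged example of the clean rebuild (the script name comes from the Kudu
| source tree; adjust paths to your checkout):</p>
| 
| <pre><code>$ rm -rf thirdparty
| $ git checkout -- thirdparty
| $ thirdparty/build-if-necessary.sh
| </code></pre>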
| </li> |
| <li>Many contributors spent time in recent weeks trying to address the |
| flakiness of various test cases. The Kudu project uses a |
| <a href="http://dist-test.cloudera.org:8080/">dashboard</a> to track the flakiness |
| of each test case, and <a href="http://dist-test.cloudera.org/">distributed test infrastructure</a> |
| to facilitate reproducing test flakes. <!-- spaces cause line break --> |
| As might be expected, some of the flaky tests were due to bugs or |
| timing assumptions in the tests themselves. However, this effort |
| also identified several real bugs: |
| <ul> |
| <li>A <a href="http://gerrit.cloudera.org:8080/4570">tight retry loop</a> in the
| Java client.</li> |
| <li>A <a href="http://gerrit.cloudera.org:8080/4395">memory leak</a> due to circular |
| references in the C++ client.</li> |
| <li>A <a href="http://gerrit.cloudera.org:8080/4551">crash</a> which could affect |
| tools used for problem diagnosis.</li> |
| <li>A <a href="http://gerrit.cloudera.org:8080/4409">divergence bug</a> in Raft consensus |
| under particularly torturous scenarios.</li> |
| <li>A potential <a href="http://gerrit.cloudera.org:8080/4394">crash during tablet server startup</a>.</li> |
| <li>A case in which <a href="http://gerrit.cloudera.org:8080/4626">thread startup could be delayed</a> |
| by built-in monitoring code.</li> |
| </ul> |
| |
| <p>As a result of these efforts, the failure rate of these flaky tests has |
| decreased significantly and the stability of Kudu releases continues |
| to increase.</p> |
| </li> |
| <li> |
| <p>Dan Burkert picked up work originally started by Sameer Abhyankar on |
| <a href="https://issues.apache.org/jira/browse/KUDU-1363">KUDU-1363</a>, which adds |
| support for adding <code>IN (...)</code> predicates to scanners. Dan committed the |
| <a href="http://gerrit.cloudera.org:8080/2986">main patch</a> as well as corresponding |
| <a href="http://gerrit.cloudera.org:8080/4530">support in the Java client</a>. |
| Jordan Birdsell quickly added corresponding support in <a href="http://gerrit.cloudera.org:8080/4548">Python</a>. |
| This new feature will be available in an upcoming release.</p> |
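| 
| <p>As a rough sketch, using the new predicate from the Java client looks like
| the following (the table and column names here are hypothetical):</p>
| 
| <pre><code class="language-java">import java.util.Arrays;
| import org.apache.kudu.ColumnSchema;
| import org.apache.kudu.client.*;
| 
| public class InListExample {
|   public static void main(String[] args) throws KuduException {
|     KuduClient client = new KuduClient.KuduClientBuilder("master-1:7051").build();
|     KuduTable table = client.openTable("metrics");
|     ColumnSchema hostCol = table.getSchema().getColumn("host");
|     // Matches rows where host IN ('host-a', 'host-b', 'host-c').
|     KuduPredicate pred = KuduPredicate.newInListPredicate(
|         hostCol, Arrays.asList("host-a", "host-b", "host-c"));
|     KuduScanner scanner = client.newScannerBuilder(table)
|         .addPredicate(pred)
|         .build();
|     while (scanner.hasMoreRows()) {
|       for (RowResult row : scanner.nextRows()) {
|         System.out.println(row.rowToString());
|       }
|     }
|     client.close();
|   }
| }
| </code></pre>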
| </li> |
| <li> |
| <p>Work continues on the <code>kudu</code> command line tool. Dinesh Bhat added |
| the ability to ask a tablet’s leader to <a href="http://gerrit.cloudera.org:8080/4533">step down</a> |
| and Alexey Serbin added a <a href="http://gerrit.cloudera.org:8080/4412">tool to insert random data into a |
| table</a>.</p> |
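| 
| <p>For example, asking a tablet’s leader to step down looks roughly like the
| following (the exact subcommand layout may differ between releases, and the
| tablet ID is hypothetical):</p>
| 
| <pre><code>$ kudu tablet leader_step_down master-1:7051,master-2:7051 2fdd43cb97f34402a5c3ab2322bf8aa2
| </code></pre>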
| </li> |
| <li> |
| <p>Jordan Birdsell continues to be on a tear improving the Python client. |
| The patches are too numerous to mention, but highlights include Python 3 |
| support as well as near feature parity with the C++ client.</p> |
| </li> |
| <li> |
| <p>Todd Lipcon has been doing some refactoring and cleanup in the Raft |
| consensus implementation. In addition to simplifying and removing code, |
| he committed <a href="https://issues.apache.org/jira/browse/KUDU-1567">KUDU-1567</a>, |
| which improves write performance in many cases by a factor of three |
| or more while also improving stability.</p> |
| </li> |
| <li> |
| <p>Brock Noland is working on support for <a href="https://gerrit.cloudera.org/#/c/4491/">INSERT IGNORE</a> |
| as a first-class part of the Kudu API. Of course this functionality |
| can already be achieved by simply performing normal inserts and ignoring any
| resulting errors, but pushing it to the server prevents the server |
| from counting such operations as errors.</p> |
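| 
| <p>For reference, the client-side workaround described above can be sketched
| like this with the Java client; note that the server still treats the
| rejected rows as errors:</p>
| 
| <pre><code class="language-java">import org.apache.kudu.client.*;
| 
| class InsertIgnoreWorkaround {
|   static void insertIgnoringDuplicates(KuduClient client, Insert insert) throws KuduException {
|     KuduSession session = client.newSession();
|     // Client-side only: duplicate-row errors are silently dropped from the
|     // response, but the server still counts the rejected inserts as errors.
|     session.setIgnoreAllDuplicateRows(true);
|     session.apply(insert);
|     session.flush();
|     session.close();
|   }
| }
| </code></pre>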
| </li> |
| <li>Congratulations to Ninad Shringarpure for contributing his first patches |
| to Kudu. Ninad contributed two documentation fixes and improved |
| formatting on the Kudu web UI.</li> |
| </ul> |
| |
| <p>Want to learn more about a specific topic from this blog post? Shoot an email to the |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#117;&#115;&#101;&#114;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">kudu-user mailing list</a> or |
| tweet at <a href="https://twitter.com/ApacheKudu">@ApacheKudu</a>. Similarly, if you’re |
| aware of some Kudu news we missed, let us know so we can cover it in |
| a future post.</p></content><author><name>Todd Lipcon</name></author><summary>Welcome to the twenty-first edition of the Kudu Weekly Update. Astute |
| readers will notice that the weekly blog posts have been not-so-weekly |
| of late – in fact, it has been nearly two months since the previous post |
| as I and others have focused on releases, conferences, etc. |
| |
| So, rather than covering just this past week, this post will cover highlights |
| of the progress since the 1.0 release in mid-September. If you’re interested |
| in learning about progress prior to that release, check the |
| release notes.</summary></entry><entry><title>Apache Kudu at Strata+Hadoop World NYC 2016</title><link href="/2016/09/26/strata-nyc-kudu-talks.html" rel="alternate" type="text/html" title="Apache Kudu at Strata+Hadoop World NYC 2016" /><published>2016-09-26T00:00:00-07:00</published><updated>2016-09-26T00:00:00-07:00</updated><id>/2016/09/26/strata-nyc-kudu-talks</id><content type="html" xml:base="/2016/09/26/strata-nyc-kudu-talks.html"><p>This week in New York, O’Reilly and Cloudera will be hosting Strata+Hadoop World |
| 2016. If you’re interested in Kudu, there will be several opportunities to |
| learn more, both from the open source development team as well as some companies |
| who are already adopting Kudu for their use cases. |
| <!--more--> |
| Here are some of the sessions to check out:</p> |
| |
| <ul> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52146">Powering real-time analytics on Xfinity using Kudu</a> (Wednesday, 11:20am)</p> |
| |
| <p>Sridhar Alla and Kiran Muglurmath from Comcast will talk about how they’re using |
| Kudu to store hundreds of billions of Set-Top Box (STB) events, performing |
| analytics concurrently with real-time streaming ingest of thousands of events |
| per second.</p> |
| </li> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52248">Creating real-time, data-centric applications with Impala and Kudu</a> (Wednesday, 2:05pm)</p> |
| |
| <p>Marcel Kornacker and Todd Lipcon will introduce how Impala and Kudu together |
| allow users to build real-time applications that support streaming ingest, |
| random access updates and deletes, and high performance analytic SQL in |
| a single system.</p> |
| </li> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52168">Streaming cybersecurity into Graph: Accelerating data into Datastax Graph and Blazegraph</a> (Thursday, 1:15pm)</p> |
| |
| <p>Joshua Patterson, Michael Wendt, and Keith Kraus from Accenture Labs will discuss |
| how they have built cybersecurity solutions using graph analytics on top of open |
| source technology like Apache Kafka, Spark, and Flink. They will also touch on |
| why Kudu is becoming an integral part of Accenture’s technology stack.</p> |
| </li> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52050">How GE analyzes billions of mission-critical events in real time using Apache Apex, Spark, and Kudu</a> (Thursday, 2:05pm)</p> |
| |
| <p>Venkatesh Sivasubramanian and Luis Ramos from GE Digital will discuss how they |
| collect and process real-time IoT data using Apache Apex and Apache Spark, and |
| how they’ve been experimenting with Apache Kudu for time series data storage.</p> |
| </li> |
| <li> |
| <p><a href="http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/51887">Apache Kudu: 1.0 and Beyond</a> (Thursday, 4:35pm)</p> |
| |
| <p>Todd Lipcon from Cloudera will review the new features that were developed between Kudu 0.5 |
| (the first public release one year ago) and Kudu 1.0, released just last week. Additionally, |
| this talk will provide some insight into the upcoming project roadmap for the coming year.</p> |
| </li> |
| </ul> |
| |
| <p>Aside from these organized sessions, word has it that there will be various demos |
| featuring Apache Kudu at the Cloudera and ZoomData vendor booths.</p> |
| |
| <p>If you’re not attending the conference, but still based in NYC, all hope is |
| not lost. Michael Crutcher from Cloudera will be presenting an introduction |
| to Apache Kudu at the <a href="http://www.meetup.com/mysqlnyc/events/233599664/">SQL NYC Meetup</a>. |
| Be sure to RSVP as spots are filling up fast.</p></content><author><name>Todd Lipcon</name></author><summary>This week in New York, O’Reilly and Cloudera will be hosting Strata+Hadoop World |
| 2016. If you’re interested in Kudu, there will be several opportunities to |
| learn more, both from the open source development team as well as some companies |
| who are already adopting Kudu for their use cases.</summary></entry><entry><title>Apache Kudu 1.0.0 released</title><link href="/2016/09/20/apache-kudu-1-0-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.0.0 released" /><published>2016-09-20T00:00:00-07:00</published><updated>2016-09-20T00:00:00-07:00</updated><id>/2016/09/20/apache-kudu-1-0-0-released</id><content type="html" xml:base="/2016/09/20/apache-kudu-1-0-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.0.0!</p> |
| |
| <p>This latest version adds several new features, including:</p> |
| |
| <!--more--> |
| |
| <ul> |
| <li> |
| <p>Removal of multiversion concurrency control (MVCC) history is now supported. |
| This allows Kudu to reclaim disk space, where previously Kudu would keep a full |
| history of all changes made to a given table since the beginning of time.</p> |
| </li> |
| <li> |
| <p>Most of Kudu’s command line tools have been consolidated under a new |
| top-level <code>kudu</code> tool. This reduces the number of large binaries distributed |
| with Kudu and also includes much-improved help output.</p> |
| </li> |
| <li> |
| <p>Administrative tools including <code>kudu cluster ksck</code> now support running |
| against multi-master Kudu clusters.</p> |
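| 
| <p>For example, checking the health of a cluster with three masters:</p>
| 
| <pre><code>$ kudu cluster ksck master-1:7051,master-2:7051,master-3:7051
| </code></pre>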
| </li> |
| <li> |
| <p>The C++ client API now supports writing data in <code>AUTO_FLUSH_BACKGROUND</code> mode. |
| This can provide higher throughput for ingest workloads.</p> |
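| 
| <p>The Java client already exposes the same flush mode; a minimal sketch of
| using it looks like this:</p>
| 
| <pre><code class="language-java">import org.apache.kudu.client.*;
| 
| class BackgroundFlushExample {
|   static void writeBuffered(KuduClient client, Insert insert) throws KuduException {
|     KuduSession session = client.newSession();
|     // Operations are buffered and flushed by a background thread, so apply()
|     // returns without waiting for a server round trip.
|     session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
|     session.apply(insert);
|     session.flush();  // block until all buffered operations have completed
|     session.close();
|   }
| }
| </code></pre>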
| </li> |
| </ul> |
| |
| <p>This release also includes many bug fixes, optimizations, and other |
| improvements, detailed in the <a href="/releases/1.0.0/docs/release_notes.html">release notes</a>.</p> |
| |
| <ul> |
| <li>Download the <a href="/releases/1.0.0/">Kudu 1.0.0 source release</a></li> |
| <li>Convenience binary artifacts for the Java client and various Java |
| integrations (e.g. Spark, Flume) are also now available via the ASF Maven
| repository.</li> |
| </ul></content><author><name>Todd Lipcon</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.0.0! |
| |
| This latest version adds several new features, including:</summary></entry><entry><title>Pushing Down Predicate Evaluation in Apache Kudu</title><link href="/2016/09/16/predicate-pushdown.html" rel="alternate" type="text/html" title="Pushing Down Predicate Evaluation in Apache Kudu" /><published>2016-09-16T00:00:00-07:00</published><updated>2016-09-16T00:00:00-07:00</updated><id>/2016/09/16/predicate-pushdown</id><content type="html" xml:base="/2016/09/16/predicate-pushdown.html"><p>I had the pleasure of interning with the Apache Kudu team at Cloudera this |
| summer. This project was my summer contribution to Kudu: a restructuring of the |
| scan path to speed up queries.</p> |
| |
| <!--more--> |
| |
| <h2 id="introduction">Introduction</h2> |
| |
| <p>In Kudu, <em>predicate pushdown</em> refers to the way in which predicates are |
| handled. When a scan is requested, its predicates are passed through the |
| different layers of Kudu’s storage hierarchy, allowing for pruning and other |
| optimizations to happen at each level before reaching the underlying data.</p> |
| |
| <p>While predicates are pushed down, predicate evaluation itself occurs at a fairly |
| high level, precluding the evaluation process from certain data-specific |
| optimizations. These optimizations can make tablet scans an order of magnitude |
| faster, if not more.</p> |
| |
| <h2 id="a-day-in-the-life-of-a-query">A Day in the Life of a Query</h2> |
| |
| <p>Because Kudu is a columnar storage engine, its scan path has a number of |
| optimizations to avoid extraneous reads, copies, and computation. When a query |
| is sent to a tablet server, the server prunes tablets based on the |
| primary key, directing the request to only the tablets that contain the key |
| range of interest. Once at a tablet, only the columns relevant to the query are |
| scanned. Further pruning is done over the primary key, and if the query is |
| predicated on non-key columns, the entire column is scanned. The columns in a |
| tablet are stored as <em>cfiles</em>, which are split into encoded <em>blocks</em>. Once the |
| relevant cfiles are determined, the data are materialized by the block |
| decoders, i.e. their underlying data are decoded and copied into a buffer, |
| which is passed back to the tablet layer. The tablet can then evaluate the |
| predicate on the batch of data and mark which rows should be returned to the |
| client.</p> |
| |
| <p>One of the encoding types I worked very closely with is <em>dictionary encoding</em>, |
| an encoding type for strings that performs particularly well for cfiles that |
| have repeating values. Rather than storing every row’s string, each unique |
| string is assigned a numeric codeword, and the rows are stored numerically on |
| disk. When materializing a dictionary block, all of the numeric data are scanned |
| and all of the corresponding strings are copied and buffered for evaluation. |
| When the vocabulary of a dictionary-encoded cfile gets too large, the blocks |
| begin switching to <em>plain encoding mode</em> to act like <em>plain-encoded</em> blocks.</p> |
| |
| <p>In a plain-encoded block, strings are stored contiguously and the character |
| offsets to the start of each string are stored as a list of integers. When |
| materializing, all of the strings are copied to a buffer for evaluation.</p> |
| |
| <p>Therein lies room for improvement: this predicate evaluation path is the same |
| for all data types and encoding types. Within the tablet, the correct cfiles |
| are determined, the cfiles’ decoders are opened, all of the data are copied to |
| a buffer, and the predicates are evaluated on this buffered data via |
| type-specific comparators. This path is extremely flexible, but because it was |
| designed to be encoding-independent, it misses encoding-specific optimizations.</p>
| |
| <h2 id="trimming-the-fat">Trimming the Fat</h2> |
| |
| <p>The first step is to allow the decoders access to the predicate. In doing so, |
| each encoding type can specialize its evaluation. Additionally, this puts the |
| decoder in a position where it can determine whether a given row satisfies the |
| query, which in turn, allows the decoders to determine what data gets copied |
| instead of eagerly copying all of its data to get evaluated.</p> |
| |
| <p>Take the case of dictionary-encoded strings as an example. With the existing |
| scan path, not only are all of the strings in a column copied into a buffer, but |
| string comparisons are done on every row. By taking advantage of the fact that |
| the data can be represented as integers, the cost of determining the query |
| results can be greatly reduced. The string comparisons can be swapped out with |
| evaluation based on the codewords, in which case the room for improvement boils |
| down to how to most quickly determine whether or not a given codeword |
| corresponds to a string that satisfies the predicate. Dictionary columns now
| use a bitset to store the codewords that match the predicates. The decoder then
| scans through the integer-valued data and checks the bitset to determine whether
| it should copy the corresponding string over.</p>
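| 
| <p>The approach can be sketched in a few lines. The following is an
| illustration of the idea, not Kudu’s actual C++ implementation:</p>
| 
| <pre><code class="language-java">import java.util.ArrayList;
| import java.util.BitSet;
| import java.util.List;
| import java.util.function.Predicate;
| 
| class DictionaryEvalSketch {
|   static List&lt;String&gt; evaluate(List&lt;String&gt; dictionary, int[] codewords,
|                                Predicate&lt;String&gt; pred) {
|     // One string comparison per unique dictionary entry, not per row.
|     BitSet matching = new BitSet(dictionary.size());
|     for (int code = 0; code &lt; dictionary.size(); code++) {
|       if (pred.test(dictionary.get(code))) {
|         matching.set(code);
|       }
|     }
|     // Cheap integer scan over the rows; copy only the strings that match.
|     List&lt;String&gt; out = new ArrayList&lt;&gt;();
|     for (int code : codewords) {
|       if (matching.get(code)) {
|         out.add(dictionary.get(code));
|       }
|     }
|     return out;
|   }
| }
| </code></pre>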
| |
| <p>This is great in the best case scenario where a cfile’s vocabulary is small, |
| but when the vocabulary gets too large and the dictionary blocks switch to plain |
| encoding mode, performance is hampered. In this mode, the blocks don’t utilize |
| any dictionary metadata and end up wasting the codeword bitset. That isn’t to |
| say all is lost: the decoders can still evaluate a predicate via string |
| comparison, and the fact that evaluation can still occur at the decoder-level |
| means the eager buffering can still be avoided.</p> |
| |
| <p>Dictionary encoding is a perfect storm in that the decoders can completely |
| evaluate the predicates. This is not the case for most other encoding types, |
| but having decoders support evaluation leaves the door open for other encoding |
| types to extend this idea.</p> |
| |
| <h2 id="performance">Performance</h2> |
| <p>Depending on the dataset and query, predicate pushdown can lead to significant |
| improvements. Tablet scans were timed with datasets consisting of repeated |
| string patterns of tunable length and tunable cardinality.</p> |
| |
| <p><img src="/img/predicate-pushdown/pushdown-10.png" alt="png" class="img-responsive" /> |
| <img src="/img/predicate-pushdown/pushdown-10M.png" alt="png" class="img-responsive" /></p> |
| |
| <p>The above plots show the time taken to completely scan a single tablet, recorded |
| using a dataset of ten million rows of strings with length ten. Predicates were |
| designed to select values out of bounds (Empty), select a single value (Equal, |
| i.e. for cardinality <em>k</em>, this would select 1/<em>k</em> of the dataset), select half |
| of the full range (Half), and select the full range of values (All).</p> |
| |
| <p>With the original evaluation implementation, the tablet must copy and scan |
| through the entire column to determine whether any values match. This means that even
| when the result set is small, the full column is still copied. This is avoided |
| by pushing down predicates, which only copies as needed, and can be seen in the |
| above queries: those with near-empty result sets (Empty and Equal) have shorter |
| scan times than those with larger result sets (Half and All).</p> |
| |
| <p>Note that for dictionary encoding, given a low cardinality, Kudu can completely |
| rely on the dictionary codewords to evaluate, making the query significantly |
| faster. At higher cardinalities, the dictionaries completely fill up and the |
| blocks fall back on plain encoding. The slower, albeit still improved, |
| performance on the dataset containing 10M unique values reflects this.</p> |
| |
| <p><img src="/img/predicate-pushdown/pushdown-tpch.png" alt="png" class="img-responsive" /></p> |
| |
| <p>Similar predicates were run with the TPC-H dataset, querying on the shipdate |
| column. The full path of a query includes not only the tablet scanning itself, |
| but also RPCs and batched data transfer to the caller as the scan progresses. |
| As such, the times plotted above refer to the average end-to-end time required |
| to scan and return a batch of rows. Regardless of this additional overhead, |
| significant improvements on the scan path still yield substantial improvements |
| to the query performance as a whole.</p> |
| |
| <h2 id="conclusion">Conclusion</h2> |
| |
| <p>Pushing down predicate evaluation in Kudu yielded substantial improvements to |
| the scan path. For dictionary encoding, pushdown can be particularly powerful, |
| and other encoding types are either unaffected or also improved. This change has |
| been pushed to the main branch of Kudu, and relevant commits can be found |
| <a href="https://github.com/cloudera/kudu/commit/c0f37278cb09a7781d9073279ea54b08db6e2010">here</a> |
| and |
| <a href="https://github.com/cloudera/kudu/commit/ec80fdb37be44d380046a823b5e6d8e2241ec3da">here</a>.</p> |
| |
| <p>This summer has been a phenomenal learning experience for me, in terms of the |
| tools, the workflow, the datasets, and the thought processes that go into building
| something at Kudu’s scale. I am extremely thankful for all of the mentoring and |
| support I received, and that I got to be a part of Kudu’s journey from |
| incubating to a Top Level Apache project. I can’t express enough how grateful I |
| am for the amount of support I got from the Kudu team, from the intern |
| coordinators, and from the Cloudera community as a whole.</p></content><author><name>Andrew Wong</name></author><summary>I had the pleasure of interning with the Apache Kudu team at Cloudera this |
| summer. This project was my summer contribution to Kudu: a restructuring of the |
| scan path to speed up queries.</summary></entry><entry><title>An Introduction to the Flume Kudu Sink</title><link href="/2016/08/31/intro-flume-kudu-sink.html" rel="alternate" type="text/html" title="An Introduction to the Flume Kudu Sink" /><published>2016-08-31T00:00:00-07:00</published><updated>2016-08-31T00:00:00-07:00</updated><id>/2016/08/31/intro-flume-kudu-sink</id><content type="html" xml:base="/2016/08/31/intro-flume-kudu-sink.html"><p>This post discusses the Kudu Flume Sink. First, I’ll give some background on why we considered |
| using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.</p> |
| |
| <h2 id="why-kudu">Why Kudu</h2> |
| |
| <p>Traditionally in the Hadoop ecosystem we’ve dealt with various <em>batch processing</em> technologies such |
| as MapReduce and the many libraries and tools built on top of it in various languages (Apache Pig, |
| Apache Hive, Apache Oozie and many others). The main problem with this approach is that it needs to |
| process the whole data set in batches, again and again, as soon as new data gets added. Things get |
| really complicated when a few such tasks need to get chained together, or when the same data set |
| needs to be processed in various ways by different jobs, while all compete for the shared cluster |
| resources.</p> |
| |
| <p>The opposite of this approach is <em>stream processing</em>: process the data as soon as it arrives, not |
| in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make |
| this possible. But writing streaming services is not trivial. The streaming systems are becoming |
| more and more capable and support more complex constructs, but they are not yet easy to use. All |
| queries and processes need to be carefully planned and implemented.</p> |
| |
| <p>To summarize, <em>batch processing</em> is:</p> |
| |
| <ul> |
| <li>file-based</li> |
| <li>a paradigm that processes large chunks of data as a group</li> |
| <li>high latency and high throughput, both for ingest and query</li> |
| <li>typically easy to program, but hard to orchestrate</li> |
| <li>well suited for writing ad-hoc queries, although they are typically high latency</li> |
| </ul> |
| |
| <p>While <em>stream processing</em> is:</p> |
| |
| <ul> |
| <li>a totally different paradigm, which involves single events and time windows instead of large groups of events</li> |
| <li>still file-based and not a long-term database</li> |
| <li>not batch-oriented, but incremental</li> |
| <li>ultra-fast ingest and ultra-fast query (query results basically pre-calculated)</li> |
| <li>not so easy to program, relatively easy to orchestrate</li> |
| <li>impossible to write ad-hoc queries</li> |
| </ul> |
| |
| <p>And a Kudu-based <em>near real-time</em> approach is:</p> |
| |
| <ul> |
| <li>flexible and expressive, thanks to SQL support via Apache Impala (incubating)</li> |
| <li>a table-oriented, mutable data store that feels like a traditional relational database</li> |
| <li>very easy to program, you can even pretend it’s good old MySQL</li> |
| <li>low-latency and relatively high throughput, both for ingest and query</li> |
| </ul> |
| |
| <p>At Argyle Data, we’re dealing with complex fraud detection scenarios. We need to ingest massive |
| amounts of data, run machine learning algorithms and generate reports. When we created our current |
| architecture two years ago we decided to opt for a database as the backbone of our system. That |
| database is Apache Accumulo. It’s a key-value based database which runs on top of Hadoop HDFS, |
| quite similar to HBase but with some important improvements such as cell level security and ease |
| of deployment and management. To enable querying of this data for quite complex reporting and |
| analytics, we used Presto, a distributed query engine with a pluggable architecture open-sourced |
| by Facebook. We wrote a connector for it to let it run queries against the Accumulo database. This |
| architecture has served us well, but there were a few problems:</p> |
| |
| <ul> |
| <li>we need to ingest even more massive volumes of data in real-time</li> |
| <li>we need to perform complex machine-learning calculations on even larger data-sets</li> |
| <li>we need to support ad-hoc queries, plus long-term data warehouse functionality</li> |
| </ul> |
| |
| <p>So, we’ve started gradually moving the core machine-learning pipeline to a streaming based |
| solution. This way we can ingest and process larger data-sets faster, in real time. But then how
| would we take care of ad-hoc queries and long-term persistence? This is where Kudu comes in. While |
| the machine learning pipeline ingests and processes real-time data, we store a copy of the same |
| ingested data in Kudu for long-term access and ad-hoc queries. Kudu is our <em>data warehouse</em>. By |
| using Kudu and Impala, we can retire our in-house Presto connector and rely on Impala’s |
| super-fast query engine.</p> |
| |
| <p>But how would we make sure data is reliably ingested into the streaming pipeline <em>and</em> the |
| Kudu-based data warehouse? This is where Apache Flume comes in.</p> |
| |
| <h2 id="why-flume">Why Flume</h2> |
| |
| <p>According to their <a href="http://flume.apache.org/">website</a> “Flume is a distributed, reliable, and |
| available service for efficiently collecting, aggregating, and moving large amounts of log data. |
| It has a simple and flexible architecture based on streaming data flows. It is robust and fault |
| tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.” As you |
| can see, nowhere is Hadoop mentioned but Flume is typically used for ingesting data to Hadoop |
| clusters.</p> |
| |
| <p><img src="https://blogs.apache.org/flume/mediaresource/ab0d50f6-a960-42cc-971e-3da38ba3adad" alt="png" /></p> |
| |
| <p>Flume has an extensible architecture. An instance of Flume, called an <em>agent</em>, can have multiple |
| <em>channels</em>, with each having multiple <em>sources</em> and <em>sinks</em> of various types. Sources queue data |
| in channels, which in turn write out data to sinks. Such <em>pipelines</em> can be chained together to |
| create even more complex ones. There may be more than one agent and agents can be configured to |
| support failover and recovery.</p> |
| |
| <p>Flume comes with a bunch of built-in types of channels, sources and sinks. Memory channel is the |
| default (an in-memory queue with no persistence to disk), but other options such as Kafka- and |
| File-based channels are also provided. As for sources, Avro, JMS, Thrift, and spooling directory
| sources are some of the built-in ones. Flume also ships with many sinks, including sinks for writing
| data to HDFS, HBase, Hive, Kafka, as well as to other Flume agents.</p> |
| |
| <p>In the rest of this post I’ll go over the Kudu Flume sink and show you how to configure Flume to |
| write ingested data to a Kudu table. The sink has been part of the Kudu distribution since the 0.8 |
| release and the source code can be found <a href="https://github.com/apache/kudu/tree/master/java/kudu-flume-sink">here</a>.</p> |
| |
| <h2 id="configuring-the-kudu-flume-sink">Configuring the Kudu Flume Sink</h2> |
| |
| <p>Here is a sample flume configuration file:</p> |
| |
| <pre><code>agent1.sources = source1 |
| agent1.channels = channel1 |
| agent1.sinks = sink1 |
| |
| agent1.sources.source1.type = exec |
| agent1.sources.source1.command = /usr/bin/vmstat 1 |
| agent1.sources.source1.channels = channel1 |
| |
| agent1.channels.channel1.type = memory |
| agent1.channels.channel1.capacity = 10000 |
| agent1.channels.channel1.transactionCapacity = 1000 |
| |
| agent1.sinks.sink1.type = org.apache.flume.sink.kudu.KuduSink |
| agent1.sinks.sink1.masterAddresses = localhost |
| agent1.sinks.sink1.tableName = stats |
| agent1.sinks.sink1.channel = channel1 |
| agent1.sinks.sink1.batchSize = 50 |
| agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer |
| </code></pre> |
| |
| <p>We define a source called <code>source1</code> which simply executes a <code>vmstat</code> command to continuously generate |
| virtual memory statistics for the machine and queue events into an in-memory <code>channel1</code> channel, |
| which in turn is used for writing these events to a Kudu table called <code>stats</code>. We are using |
| <code>org.apache.kudu.flume.sink.SimpleKuduEventProducer</code> as the producer. <code>SimpleKuduEventProducer</code> is |
| the built-in and default producer, but it’s implemented as a showcase for how to write Flume |
| events into Kudu tables. For any serious functionality we’d have to write a custom producer. We |
| need to make this producer and the <code>KuduSink</code> class available to Flume. We can do that by simply |
| copying the <code>kudu-flume-sink-&lt;VERSION&gt;.jar</code> jar file from the Kudu distribution to the |
| <code>$FLUME_HOME/plugins.d/kudu-sink/lib</code> directory in the Flume installation. The jar file contains |
| <code>KuduSink</code> and all of its dependencies (including Kudu java client classes).</p> |
| |
| <p>At a minimum, the Kudu Flume Sink needs to know where the Kudu masters are |
| (<code>agent1.sinks.sink1.masterAddresses = localhost</code>) and which Kudu table should be used for writing |
| Flume events to (<code>agent1.sinks.sink1.tableName = stats</code>). The Kudu Flume Sink doesn’t create this |
| table; it has to be created before the Kudu Flume Sink is started.</p>
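| 
| <p>For instance, the <code>stats</code> table could be created ahead of time with the Java
| client. The single-column schema below is an assumption chosen to match what
| <code>SimpleKuduEventProducer</code> writes, not a schema mandated by the sink:</p>
| 
| <pre><code class="language-java">import java.util.Collections;
| import org.apache.kudu.ColumnSchema;
| import org.apache.kudu.Schema;
| import org.apache.kudu.Type;
| import org.apache.kudu.client.*;
| 
| class CreateStatsTable {
|   public static void main(String[] args) throws KuduException {
|     KuduClient client = new KuduClient.KuduClientBuilder("localhost").build();
|     // Assumed schema: one binary column that doubles as the primary key.
|     Schema schema = new Schema(Collections.singletonList(
|         new ColumnSchema.ColumnSchemaBuilder("payload", Type.BINARY).key(true).build()));
|     CreateTableOptions options = new CreateTableOptions()
|         .setRangePartitionColumns(Collections.singletonList("payload"));
|     client.createTable("stats", schema, options);
|     client.close();
|   }
| }
| </code></pre>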
| |
| <p>You may also notice the <code>batchSize</code> parameter. Batch size is used for batching up to that many |
| Flume events and flushing the entire batch in one shot. Tuning batchSize properly can have a huge |
| impact on ingest performance of the Kudu cluster.</p> |
| |
| <p>Here is a complete list of KuduSink parameters:</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th>Parameter Name</th> |
| <th>Default</th> |
| <th>Description</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>masterAddresses</td> |
| <td>N/A</td> |
| <td>Comma-separated list of “host:port” pairs of the masters (port optional)</td> |
| </tr> |
| <tr> |
| <td>tableName</td> |
| <td>N/A</td> |
| <td>The name of the table in Kudu to write to</td> |
| </tr> |
| <tr> |
| <td>producer</td> |
| <td>org.apache.kudu.flume.sink.SimpleKuduEventProducer</td> |
| <td>The fully qualified class name of the Kudu event producer the sink should use</td> |
| </tr> |
| <tr> |
| <td>batchSize</td> |
| <td>100</td> |
| <td>Maximum number of events the sink should take from the channel per transaction, if available</td> |
| </tr> |
| <tr> |
| <td>timeoutMillis</td> |
| <td>30000</td> |
| <td>Timeout period for Kudu operations, in milliseconds</td> |
| </tr> |
| <tr> |
| <td>ignoreDuplicateRows</td> |
| <td>true</td> |
| <td>Whether to ignore errors indicating that we attempted to insert duplicate rows into Kudu</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <p>Let’s take a look at the source code for the built-in producer class:</p> |
| |
| <pre><code class="language-java">public class SimpleKuduEventProducer implements KuduEventProducer { |
| private byte[] payload; |
| private KuduTable table; |
| private String payloadColumn; |
| |
| public SimpleKuduEventProducer(){ |
| } |
| |
| @Override |
| public void configure(Context context) { |
| payloadColumn = context.getString("payloadColumn","payload"); |
| } |
| |
| @Override |
| public void configure(ComponentConfiguration conf) { |
| } |
| |
| @Override |
| public void initialize(Event event, KuduTable table) { |
| this.payload = event.getBody(); |
| this.table = table; |
| } |
| |
| @Override |
| public List&lt;Operation&gt; getOperations() throws FlumeException { |
| try { |
| Insert insert = table.newInsert(); |
| PartialRow row = insert.getRow(); |
| row.addBinary(payloadColumn, payload); |
| |
| return Collections.singletonList((Operation) insert); |
| } catch (Exception e){ |
| throw new FlumeException("Failed to create Kudu Insert object!", e); |
| } |
| } |
| |
| @Override |
| public void close() { |
| } |
| } |
| </code></pre> |
| |
| <p><code>SimpleKuduEventProducer</code> implements the <code>org.apache.kudu.flume.sink.KuduEventProducer</code> interface, |
| which itself looks like this:</p> |
| |
| <pre><code class="language-java">public interface KuduEventProducer extends Configurable, ConfigurableComponent { |
| /** |
| * Initialize the event producer. |
| * @param event to be written to Kudu |
| * @param table the KuduTable object used for creating Kudu Operation objects |
| */ |
| void initialize(Event event, KuduTable table); |
| |
| /** |
| * Get the operations that should be written out to Kudu as a result of this |
| * event. This list is written to Kudu using the Kudu client API. |
| * @return List of {@link org.kududb.client.Operation} which |
| * are written as such to Kudu |
| */ |
| List&lt;Operation&gt; getOperations(); |
| |
| /* |
| * Clean up any state. This will be called when the sink is being stopped. |
| */ |
| void close(); |
| } |
| </code></pre> |
| |
| <p><code>public void configure(Context context)</code> is called when an instance of our producer is instantiated |
| by the KuduSink. SimpleKuduEventProducer’s implementation looks for a producer parameter named |
| <code>payloadColumn</code> and uses its value (“payload” if not overridden in Flume configuration file) as the |
| column which will hold the value of the Flume event payload. If you recall from above, we had |
| configured the KuduSink to listen for events generated from the <code>vmstat</code> command. Each output row |
| from that command will be stored as a new row containing a <code>payload</code> column in the <code>stats</code> table. |
| Producer-specific parameters such as <code>payloadColumn</code> are defined by
| prefixing them with <code>producer.</code> (<code>agent1.sinks.sink1.producer.payloadColumn</code> for
| example).</p>
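| 
| <p>For example, overriding the payload column would look like this in the Flume
| configuration file (the column name here is hypothetical):</p>
| 
| <pre><code>agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer
| agent1.sinks.sink1.producer.payloadColumn = event_body
| </code></pre>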
| |
| <p>The main producer logic resides in the <code>public List&lt;Operation&gt; getOperations()</code> method. In |
| SimpleKuduEventProducer’s implementation we simply insert the binary body of the Flume event into |
| the Kudu table. Here we call Kudu’s <code>newInsert()</code> to initiate an insert, but could have used |
| <code>Upsert</code> if updating an existing row was also an option; in fact, there’s another producer
| implementation available for doing just that: <code>SimpleKeyedKuduEventProducer</code>. In the real world you
| will most probably need to write your own custom producer, but you can base your implementation
| on the built-in ones.</p> |
| |
| <p>In the future, we plan to add more flexible event producer implementations so that creation of a |
| custom event producer is not required to write data to Kudu. See |
| <a href="https://gerrit.cloudera.org/#/c/4034/">here</a> for a work-in-progress generic event producer for |
| Avro-encoded Events.</p> |
| |
| <h2 id="conclusion">Conclusion</h2> |
| |
| <p>Kudu is a scalable data store which lets us ingest insane amounts of data per second. Apache Flume |
| helps us aggregate data from various sources, and the Kudu Flume Sink lets us easily store |
| the aggregated Flume events into Kudu. Together they enable us to create a data warehouse out of |
| disparate sources.</p> |
| |
| <p><em>Ara Abrahamian is a software engineer at Argyle Data building fraud detection systems using |
| sophisticated machine learning methods. Ara is the original author of the Flume Kudu Sink that |
| is included in the Kudu distribution. You can follow him on Twitter at |
| <a href="https://twitter.com/ara_e">@ara_e</a>.</em></p></content><author><name>Ara Abrahamian</name></author><summary>This post discusses the Kudu Flume Sink. First, I’ll give some background on why we considered |
| using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.</summary></entry><entry><title>New Range Partitioning Features in Kudu 0.10</title><link href="/2016/08/23/new-range-partitioning-features.html" rel="alternate" type="text/html" title="New Range Partitioning Features in Kudu 0.10" /><published>2016-08-23T00:00:00-07:00</published><updated>2016-08-23T00:00:00-07:00</updated><id>/2016/08/23/new-range-partitioning-features</id><content type="html" xml:base="/2016/08/23/new-range-partitioning-features.html"><p>Kudu 0.10 is shipping with a few important new features for range partitioning.
| These features are designed to make Kudu easier to scale for certain workloads, |
| like time series. This post will introduce these features, and discuss how to use |
| them to effectively design tables for scalability and performance.</p> |
| |
| <!--more--> |
| |
| <p>Since Kudu’s initial release, tables have had the constraint that once created, |
| the set of partitions is static. This forces users to plan ahead and create |
| enough partitions for the expected size of the table, because once the table is |
| created no further partitions can be added. When using hash partitioning, |
| creating more partitions is as straightforward as specifying more buckets. For |
| range partitioning, however, knowing where to put the extra partitions ahead of |
| time can be difficult or impossible.</p> |
| |
| <p>The common solution to this problem in other distributed databases is to allow |
| range partitions to split into smaller child range partitions. Unfortunately, |
| range splitting typically has a large performance impact on running tables, |
| since child partitions need to eventually be recompacted and rebalanced to a |
| remote server. Range splitting is particularly thorny with Kudu, because rows |
| are stored in tablets in primary key sorted order, which does not necessarily |
| match the range partitioning order. If the range partition key is different from
| the primary key, then splitting requires inspecting and shuffling each |
| individual row, instead of splitting the tablet in half.</p> |
| |
| <h2 id="adding-and-dropping-range-partitions">Adding and Dropping Range Partitions</h2> |
| |
| <p>As an alternative to range partition splitting, Kudu now allows range partitions |
| to be added and dropped on the fly, without locking the table or otherwise |
| affecting concurrent operations on other partitions. This solution is not |
| strictly as powerful as full range partition splitting, but it strikes a good |
| balance between flexibility, performance, and operational overhead. |
| Additionally, this feature does not preclude range splitting in the future if |
| there is a push to implement it. To support adding and dropping range |
| partitions, Kudu had to remove an even more fundamental restriction when using |
| range partitions.</p> |
| |
| <p>Previously, range partitions could only be created by specifying split points. |
| Split points divide an implicit partition covering the entire range into |
| contiguous and disjoint partitions. When using split points, the first and last |
| partitions are always unbounded below and above, respectively. A consequence of |
| the final partition being unbounded is that datasets which are range-partitioned |
| on a column that increases in value over time will eventually have far more rows |
| in the last partition than in any other. Unbalanced partitions are commonly |
| referred to as hotspots, and until Kudu 0.10 they were difficult to avoid
| when storing time series data in Kudu.</p> |
| |
| <p><img src="/img/2016-08-23-new-range-partitioning-features/range-partitioning-on-time.png" alt="png" class="img-responsive" /></p> |
| |
| <p>The figure above shows the tablets created by two different attempts to |
| partition a table by range on a timestamp column. The first, above in blue, uses |
| split points. The second, below in green, uses bounded range partitions |
| specified during table creation. With bounded range partitions, there is no |
| longer a guarantee that every possible row has a corresponding range partition. |
| As a result, Kudu will now reject writes which fall in a ‘non-covered’ range.</p> |
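|
| <p>As a non-authoritative sketch, creating a table with bounded range partitions through the
| Java client might look like the following; the table name, column names, and master address are
| illustrative, not taken from the post:</p>
|
| <pre><code class="language-java">import java.util.Arrays;
| import java.util.Collections;
|
| import org.apache.kudu.ColumnSchema;
| import org.apache.kudu.Schema;
| import org.apache.kudu.Type;
| import org.apache.kudu.client.CreateTableOptions;
| import org.apache.kudu.client.KuduClient;
| import org.apache.kudu.client.PartialRow;
|
| public class CreateBoundedRangeTable {
|   public static void main(String[] args) throws Exception {
|     KuduClient client =
|         new KuduClient.KuduClientBuilder("master-host:7051").build();
|
|     Schema schema = new Schema(Arrays.asList(
|         new ColumnSchema.ColumnSchemaBuilder("time", Type.INT64).key(true).build(),
|         new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build()));
|
|     CreateTableOptions opts = new CreateTableOptions()
|         .setRangePartitionColumns(Collections.singletonList("time"));
|
|     // One bounded partition per year, 2014 through 2016, with bounds
|     // expressed as microseconds since the Unix epoch. Writes outside
|     // these ranges are rejected as falling in non-covered space.
|     long[] yearStarts = {1388534400L, 1420070400L, 1451606400L, 1483228800L};
|     for (int i = 0; i &lt; yearStarts.length - 1; i++) {
|       PartialRow lower = schema.newPartialRow();
|       lower.addLong("time", yearStarts[i] * 1_000_000);
|       PartialRow upper = schema.newPartialRow();
|       upper.addLong("time", yearStarts[i + 1] * 1_000_000);
|       opts.addRangePartition(lower, upper);
|     }
|
|     client.createTable("metrics", schema, opts);
|     client.close();
|   }
| }</code></pre>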
| |
| <p>Now that tables are no longer required to have range partitions covering all |
| possible rows, Kudu can support adding range partitions to cover the otherwise |
| unoccupied space. Dropping a range partition will result in unoccupied space |
| where the range partition was previously. In the example above, we may want to |
| add a range partition covering 2017 at the end of the year, so that we can |
| continue collecting data in the future. By lazily adding range partitions we |
| avoid hotspotting, avoid the need to specify range partitions up front for time |
| periods far in the future, and avoid the downsides of splitting. Additionally, |
| historical data which is no longer useful can be efficiently deleted by dropping |
| the entire range partition.</p> |
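|
| <p>Under the same illustrative assumptions as the sketch above, adding a partition for 2017 and
| dropping the no-longer-useful 2014 partition might look like this:</p>
|
| <pre><code class="language-java">import org.apache.kudu.Schema;
| import org.apache.kudu.client.AlterTableOptions;
| import org.apache.kudu.client.KuduClient;
| import org.apache.kudu.client.PartialRow;
|
| public class AddDropRangePartitions {
|   public static void main(String[] args) throws Exception {
|     KuduClient client =
|         new KuduClient.KuduClientBuilder("master-host:7051").build();
|     Schema schema = client.openTable("metrics").getSchema();
|
|     // New partition covering 2017: [2017-01-01, 2018-01-01) in epoch micros.
|     PartialRow addLower = schema.newPartialRow();
|     addLower.addLong("time", 1483228800L * 1_000_000);
|     PartialRow addUpper = schema.newPartialRow();
|     addUpper.addLong("time", 1514764800L * 1_000_000);
|
|     // Retire historical data by dropping 2014: [2014-01-01, 2015-01-01).
|     PartialRow dropLower = schema.newPartialRow();
|     dropLower.addLong("time", 1388534400L * 1_000_000);
|     PartialRow dropUpper = schema.newPartialRow();
|     dropUpper.addLong("time", 1420070400L * 1_000_000);
|
|     client.alterTable("metrics", new AlterTableOptions()
|         .addRangePartition(addLower, addUpper)
|         .dropRangePartition(dropLower, dropUpper));
|     client.close();
|   }
| }</code></pre>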
| |
| <h2 id="what-about-hash-partitioning">What About Hash Partitioning?</h2> |
| |
| <p>Since Kudu’s hash partitioning feature originally shipped in version 0.6, it has |
| been possible to create tables which combine hash partitioning with range |
| partitioning. The new range partitioning features continue to work seamlessly |
| when combined with hash partitioning. Just as before, the number of tablets |
| which comprise a table will be the product of the number of range partitions and |
| the number of hash partition buckets. Adding or dropping a range partition will |
| result in the creation or deletion of one tablet per hash bucket.</p> |
| |
| <p><img src="/img/2016-08-23-new-range-partitioning-features/range-and-hash-partitioning.png" alt="png" class="img-responsive" /></p> |
| |
| <p>The diagram above shows a time series table range-partitioned on the timestamp |
| and hash-partitioned with two buckets. The hash partitioning could be on the |
| timestamp column, or it could be on any other column or columns in the primary |
| key. In this example only two years of historical data is needed, so at the end |
| of 2016 a new range partition is added for 2017 and the historical 2014 range |
| partition is dropped. This causes two new tablets to be created for 2017, and |
| the two existing tablets for 2014 to be deleted.</p> |
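|
| <p>As a rough sketch reusing the illustrative names from above, the hash-plus-range layout in
| the diagram could be declared by adding two hash buckets on a hypothetical <code>host</code>
| key column to the table’s <code>CreateTableOptions</code>; every added range partition then
| creates two tablets, and every dropped one deletes two:</p>
|
| <pre><code class="language-java">CreateTableOptions opts = new CreateTableOptions()
|     .addHashPartitions(Collections.singletonList("host"), 2)
|     .setRangePartitionColumns(Collections.singletonList("time"));</code></pre>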
| |
| <h2 id="getting-started">Getting Started</h2> |
| |
| <p>Beginning with the Kudu 0.10 release, users can add and drop range partitions |
| through the Java and C++ client APIs. Range partitions on existing tables can be |
| dropped and replacements added, but it requires the servers and all clients to |
| be updated to 0.10.</p></content><author><name>Dan Burkert</name></author><summary>Kudu 0.10 is shipping with a few important new features for range partitioning. |
| These features are designed to make Kudu easier to scale for certain workloads, |
| like time series. This post will introduce these features, and discuss how to use |
| them to effectively design tables for scalability and performance.</summary></entry><entry><title>Apache Kudu 0.10.0 released</title><link href="/2016/08/23/apache-kudu-0-10-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 0.10.0 released" /><published>2016-08-23T00:00:00-07:00</published><updated>2016-08-23T00:00:00-07:00</updated><id>/2016/08/23/apache-kudu-0-10-0-released</id><content type="html" xml:base="/2016/08/23/apache-kudu-0-10-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 0.10.0!</p> |
| |
| <p>This latest version adds several new features, including: |
| <!--more--></p> |
| |
| <ul> |
| <li> |
| <p>Users may now manually manage the partitioning of a range-partitioned table |
| by adding or dropping range partitions after a table has been created. This |
| can be particularly helpful for time-series workloads. Dan Burkert posted |
| an <a href="/2016/08/23/new-range-partitioning-features.html">in-depth blog post</a> today
| detailing the new feature.</p> |
| </li> |
| <li> |
| <p>Multi-master (HA) Kudu clusters are now significantly more stable.</p> |
| </li> |
| <li> |
| <p>Administrators may now reserve a certain amount of disk space in each of a server’s
| configured data directories.</p>
| </li> |
| <li> |
| <p>Kudu’s integration with Spark has been substantially improved and is much |
| more flexible.</p> |
| </li> |
| </ul> |
| |
| <p>This release also includes many bug fixes and other improvements, detailed in |
| the release notes below.</p> |
| |
| <ul> |
| <li>Read the detailed <a href="http://kudu.apache.org/releases/0.10.0/docs/release_notes.html">Kudu 0.10.0 release notes</a></li> |
| <li>Download the <a href="http://kudu.apache.org/releases/0.10.0/">Kudu 0.10.0 source release</a></li>
| </ul></content><author><name>Todd Lipcon</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 0.10.0! |
| |
| This latest version adds several new features, including:</summary></entry><entry><title>Apache Kudu Weekly Update August 16th, 2016</title><link href="/2016/08/16/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update August 16th, 2016" /><published>2016-08-16T00:00:00-07:00</published><updated>2016-08-16T00:00:00-07:00</updated><id>/2016/08/16/weekly-update</id><content type="html" xml:base="/2016/08/16/weekly-update.html"><p>Welcome to the twentieth edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</p> |
| |
| <!--more--> |
| |
| <h2 id="project-news">Project news</h2> |
| |
| <ul> |
| <li> |
| <p>The first release candidate for the 0.10.0 release is <a href="http://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201608.mbox/%3CCADY20s7U5jVpozFg3L%3DDz2%2B4AenGineJvH96A_HAM12biDjPJA%40mail.gmail.com%3E">now available</a>.</p>
| |
| <p>Community developers and users are encouraged to download the source |
| tarball and vote on the release.</p> |
| |
| <p>For information on what’s new, check out the |
| <a href="https://github.com/apache/kudu/blob/master/docs/release_notes.adoc#rn_0.10.0">release notes</a>. |
| <em>Note:</em> some links from these in-progress release notes will not be live until the |
| release itself is published.</p> |
| </li> |
| </ul> |
| |
| <h2 id="development-discussions-and-code-in-progress">Development discussions and code in progress</h2> |
| |
| <ul> |
| <li> |
| <p>Will Berkeley spent some time working on the Spark integration this week |
| to add support for UPSERT as well as other operations. |
| Dan Burkert pitched in a bit with some <a href="http://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201608.mbox/%3CCALo2W-XBoSz9cbhXi81ipubrAYgqyDiEeHz-ys8sPAshfcik6w%40mail.gmail.com%3E">suggestions</a> |
| which were then integrated in a <a href="https://gerrit.cloudera.org/#/c/3871/">patch</a> |
| provided by Will.</p> |
| |
| <p>After some reviews by Dan, Chris George, and Ram Mettu, the patch was committed |
| in time for the upcoming 0.10.0 release.</p> |
| </li> |
| <li> |
| <p>Dan Burkert also completed work for the new <a href="https://gerrit.cloudera.org/#/c/3854/">manual partitioning APIs</a> |
| in the Java client. After finishing up the basic implementation, Dan also made some |
| cleanups to the related APIs in both the <a href="https://gerrit.cloudera.org/#/c/3958/">Java</a> |
| and <a href="https://gerrit.cloudera.org/#/c/3882/">C++</a> clients.</p> |
| |
| <p>Dan and Misty Stanley-Jones also collaborated to finish the |
| <a href="https://gerrit.cloudera.org/#/c/3796/">documentation</a> |
| for this new feature.</p> |
| </li> |
| <li> |
| <p>Adar Dembo worked on some tooling to allow users to migrate their Kudu clusters |
| from a single-master configuration to a multi-master one. Along the way, he |
| started building some common infrastructure for command-line tooling.</p> |
| |
| <p>Since Kudu’s initial release, it has included separate binaries for different |
| administrative or operational tools (e.g. <code>kudu-ts-cli</code>, <code>kudu-ksck</code>, <code>kudu-fs_dump</code>, |
| <code>log-dump</code>, etc.). Despite having similar usage, these tools don’t share much code,
| and the separate statically linked binaries make the Kudu packages take more disk |
| space than strictly necessary.</p> |
| |
| <p>Adar’s work has introduced a new top-level <code>kudu</code> binary which exposes a set of subcommands, |
| much like the <code>git</code> and <code>docker</code> binaries with which readers may be familiar. |
| For example, a new tool he has built for dumping peer identifiers from a tablet’s |
| consensus metadata is triggered using <code>kudu tablet cmeta print_replica_uuids</code>.</p> |
| |
| <p>This new tool will be available in the upcoming 0.10.0 release; however, migration |
| of the existing tools to the new infrastructure has not yet been completed. We |
| expect that by Kudu 1.0, the old tools will be removed in favor of more subcommands |
| of the <code>kudu</code> tool.</p> |
| </li> |
| <li> |
| <p>Todd Lipcon picked up the work started by David Alves in July to provide |
| <a href="https://gerrit.cloudera.org/#/c/2642/">“exactly-once” semantics</a> for write operations. |
| Todd carried the patch series through review and also completed integration of the |
| feature into the Kudu server processes.</p> |
| |
| <p>After testing the feature for several days on a large cluster under load, |
| the team decided to enable this new feature by default in Kudu 0.10.0.</p> |
| </li> |
| <li> |
| <p>Mike Percy resumed working on garbage collection of <a href="https://gerrit.cloudera.org/#/c/2853/">past versions of |
| updated and deleted rows</a>. His <a href="https://gerrit.cloudera.org/#/c/3076/">main |
| patch for the feature</a> went through |
| several rounds of review and testing, but unfortunately missed the cut-off |
| for 0.10.0.</p> |
| </li> |
| <li> |
| <p>Alexey Serbin’s work to add Doxygen-based documentation for the C++ client API
| was <a href="https://gerrit.cloudera.org/#/c/3840/">committed</a> this week. These |
| docs will be published as part of the 0.10.0 release.</p> |
| </li> |
| <li> |
| <p>Alexey also continued work on implementing the <code>AUTO_FLUSH_BACKGROUND</code> write |
| mode for the C++ client. This feature makes it easier to implement high-throughput |
| ingest using the C++ API by automatically handling the batching and flushing of writes |
| based on a configurable buffer size.</p> |
| |
| <p>Alexey’s <a href="https://gerrit.cloudera.org/#/c/3952/">patch</a> has received several |
| rounds of review and looks likely to be committed soon. Detailed performance testing |
| will follow.</p> |
| </li> |
| <li> |
| <p>Congratulations to Ram Mettu for committing his first patch to Kudu this week! |
| Ram fixed a <a href="https://issues.apache.org/jira/browse/KUDU-1522">bug in handling Alter Table with TIMESTAMP columns</a>.</p> |
| </li> |
| </ul> |
| |
| <h2 id="upcoming-talks">Upcoming talks</h2> |
| |
| <ul> |
| <li>Mike Percy will be speaking about Kudu this Wednesday at the |
| <a href="http://www.meetup.com/Denver-Cloudera-User-Group/events/232782782/">Denver Cloudera User Group</a> |
| and on Thursday at the |
| <a href="http://www.meetup.com/Boulder-Denver-Big-Data/events/232056701/">Boulder/Denver Big Data Meetup</a>. |
| If you’re based in the Boulder/Denver area, be sure not to miss these talks!</li> |
| </ul> |
| |
| <p>Want to learn more about a specific topic from this blog post? Shoot an email to the |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#117;&#115;&#101;&#114;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">kudu-user mailing list</a> or |
| tweet at <a href="https://twitter.com/ApacheKudu">@ApacheKudu</a>. Similarly, if you’re |
| aware of some Kudu news we missed, let us know so we can cover it in |
| a future post.</p></content><author><name>Todd Lipcon</name></author><summary>Welcome to the twentieth edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</summary></entry><entry><title>Apache Kudu Weekly Update August 8th, 2016</title><link href="/2016/08/08/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update August 8th, 2016" /><published>2016-08-08T00:00:00-07:00</published><updated>2016-08-08T00:00:00-07:00</updated><id>/2016/08/08/weekly-update</id><content type="html" xml:base="/2016/08/08/weekly-update.html"><p>Welcome to the nineteenth edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</p> |
| |
| <!--more--> |
| |
| <h2 id="development-discussions-and-code-in-progress">Development discussions and code in progress</h2> |
| |
| <ul> |
| <li> |
| <p>After a couple months of work, Dan Burkert finished |
| <a href="https://gerrit.cloudera.org/#/c/3648/">adding add/remove range partition support</a> |
| in the C++ client and in the master.</p> |
| |
| <p>Dan also posted a patch for review which <a href="https://gerrit.cloudera.org/#/c/3854/">adds support for this |
| feature</a> to the Java client. Dan is |
| expecting that this will be finished in time for the upcoming Kudu 0.10.0 |
| release.</p> |
| |
| <p>Misty Stanley-Jones started working on <a href="https://gerrit.cloudera.org/#/c/3796/">documentation for this |
| feature</a>. Readers of this |
| blog are encouraged to check out the docs and provide feedback!</p> |
| </li> |
| <li> |
| <p>Adar Dembo also completed fixing most of the issues related to high availability |
| using multiple Kudu master processes. The upcoming Kudu 0.10.0 release will support |
| running multiple masters and transparently handling a transient failure of any |
| master process.</p> |
| |
| <p>Although multi-master should now be stable, some work remains in this area. Namely, |
| Adar is working on a <a href="https://gerrit.cloudera.org/#/c/3393/">design for handling permanent failure of a machine hosting |
| a master</a>. In this case, the administrator |
| will need to use some new tools to create a new master replica by copying data from |
| an existing one.</p> |
| </li> |
| <li> |
| <p>Todd Lipcon started a |
| <a href="https://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201607.mbox/%3CCADY20s5WdR7KmB%3DEAHJwvzELhe9PXfnnGMLV%2B4t%3D%3Defw%3Dix8uw%40mail.gmail.com%3E">discussion</a> |
| on the dev mailing list about renaming the Kudu feature which creates new |
| replicas of tablets after they become under-replicated. Since its initial |
| introduction, this feature was called “remote bootstrap”, but Todd pointed out |
| that this naming caused some confusion with the other “bootstrap” term used to |
| describe the process by which a tablet loads itself at startup.</p> |
| |
| <p>The discussion concluded with an agreement to rename the process to “Tablet Copy”. |
| Todd provided patches to perform this rename, which were committed at the end of the |
| week last week.</p> |
| </li> |
| <li> |
| <p>Congratulations to Attila Bukor for his first commit to Kudu! Attila |
| <a href="https://gerrit.cloudera.org/#/c/3820/">fixed an error in the quick-start documentation</a>.</p> |
| </li> |
| </ul> |
| |
| <h2 id="news-and-articles-from-around-the-web">News and articles from around the web</h2> |
| |
| <ul> |
| <li>The New Stack published an <a href="http://thenewstack.io/apache-kudu-fast-columnar-data-store-hadoop/">introductory article about Kudu</a>. |
| The article was based on a recent interview with Todd Lipcon |
| and covers topics such as the origin of the name “Kudu”, where Kudu fits into the |
| Apache Hadoop ecosystem, and goals for the upcoming 1.0 release.</li> |
| </ul> |
| |
| <p>Want to learn more about a specific topic from this blog post? Shoot an email to the |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#117;&#115;&#101;&#114;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">kudu-user mailing list</a> or |
| tweet at <a href="https://twitter.com/ApacheKudu">@ApacheKudu</a>. Similarly, if you’re |
| aware of some Kudu news we missed, let us know so we can cover it in |
| a future post.</p></content><author><name>Todd Lipcon</name></author><summary>Welcome to the nineteenth edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</summary></entry><entry><title>Apache Kudu Weekly Update July 26, 2016</title><link href="/2016/07/26/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update July 26, 2016" /><published>2016-07-26T00:00:00-07:00</published><updated>2016-07-26T00:00:00-07:00</updated><id>/2016/07/26/weekly-update</id><content type="html" xml:base="/2016/07/26/weekly-update.html"><p>Welcome to the eighteenth edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</p> |
| |
| <!--more--> |
| |
| <h2 id="project-news">Project news</h2> |
| |
| <ul> |
| <li>Kudu has graduated from the Apache Incubator and is now a Top-Level Project! All the details |
| are in this <a href="http://kudu.apache.org/2016/07/25/asf-graduation.html">blog post</a>. |
| Mike Percy and Todd Lipcon made a few updates to the website to reflect the project’s |
| new name and status.</li> |
| </ul> |
| |
| <h2 id="development-discussions-and-code-in-progress">Development discussions and code in progress</h2> |
| |
| <ul> |
| <li> |
| <p>Dan Burkert contributed a few patches that repackage the Java client under <code>org.apache.kudu</code> |
| in place of <code>org.kududb</code>. This was done in a <strong>backward-incompatible</strong> way, meaning that import |
| statements will have to be modified in existing Java code to compile against a newer Kudu JAR |
| version (from 0.10.0 onward). This stems from <a href="http://mail-archives.apache.org/mod_mbox/kudu-dev/201605.mbox/%3CCAGpTDNcJohQBgjzXafXJQdqmBB4sL495p5V_BJRXk_nAGWbzhA@mail.gmail.com%3E">a discussion</a> |
| initiated in May. It has no impact on C++ or Python users, and it does not affect wire
| compatibility.</p>
| </li> |
| <li> |
| <p>Still on the Java-side, J-D Cryans pushed <a href="https://gerrit.cloudera.org/#/c/3055/">a patch</a> |
| that completely changes how Exceptions are managed. Before this change, users had to introspect |
| generic Exception objects, making it a guessing game and discouraging good error handling. |
| Now, the synchronous client’s methods throw <code>KuduException</code> which packages a <code>Status</code> object |
| that can be interrogated. This is very similar to how the C++ API works.</p> |
| |
| <p>Existing code that uses the new Kudu JAR should still compile since this change replaces generic |
| <code>Exception</code> with a more specific <code>KuduException</code>. Error handling done by string-matching the |
| exception messages should now interrogate the provided <code>Status</code> object instead
| (a short sketch follows this list).</p>
| </li> |
| <li> |
| <p>Alexey Serbin’s <a href="https://gerrit.cloudera.org/#/c/3619/">patch</a> that adds Doxygen-based |
| documentation was pushed and the new API documentation for C++ developers will be available |
| with the next release.</p> |
| </li> |
| <li> |
| <p>Todd has made many improvements to the <code>ksck</code> tool over the last week. Building upon Will |
| Berkeley’s <a href="https://gerrit.cloudera.org/#/c/3632/">WIP patch for KUDU-1516</a>, <code>ksck</code> can |
| now detect more problematic situations, such as a tablet lacking a majority of replicas on
| live tablet servers, or replicas that aren’t in a good state.
| <code>ksck</code> is also <a href="https://gerrit.cloudera.org/#/c/3705/">now faster</a> when run against a large |
| cluster with a lot of tablets, among other improvements.</p> |
| </li> |
| <li> |
| <p>As mentioned last week, Dan has been working on <a href="https://gerrit.cloudera.org/#/c/3648/">adding add/remove range partition support</a> |
| in the C++ client and in the master. The patch has been through many rounds of review and |
| testing and it’s getting close to completion. Meanwhile, J-D started looking at adding support |
| for this functionality in the <a href="https://gerrit.cloudera.org/#/c/3731/">Java client</a>.</p> |
| </li> |
| <li> |
| <p>Adar Dembo is also hard at work on the master. The <a href="https://gerrit.cloudera.org/#/c/3609/">series</a> |
| <a href="https://gerrit.cloudera.org/#/c/3610/">of</a> <a href="https://gerrit.cloudera.org/#/c/3611/">patches</a> to |
| have the tablet servers heartbeat to all the masters that he published earlier this month is |
| getting near the finish line.</p> |
| </li> |
| </ul> |
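|
| <p>As a hedged illustration of the new error handling mentioned above (using the post-0.10
| <code>org.apache.kudu</code> package names and an illustrative master address and table name),
| a caller can now branch on the packaged <code>Status</code> instead of string-matching
| exception messages:</p>
|
| <pre><code class="language-java">import org.apache.kudu.client.KuduClient;
| import org.apache.kudu.client.KuduException;
|
| public class StatusHandlingExample {
|   public static void main(String[] args) throws Exception {
|     KuduClient client =
|         new KuduClient.KuduClientBuilder("master-host:7051").build();
|     try {
|       client.openTable("no_such_table");
|     } catch (KuduException e) {
|       // Interrogate the packaged Status rather than parsing the message.
|       if (e.getStatus().isNotFound()) {
|         System.err.println("Table does not exist: " + e.getStatus());
|       } else {
|         throw e;
|       }
|     } finally {
|       client.close();
|     }
|   }
| }</code></pre>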
| |
| <p>Want to learn more about a specific topic from this blog post? Shoot an email to the |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#117;&#115;&#101;&#114;&#064;&#107;&#117;&#100;&#117;&#046;&#105;&#110;&#099;&#117;&#098;&#097;&#116;&#111;&#114;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">kudu-user mailing list</a> or |
| tweet at <a href="https://twitter.com/ApacheKudu">@ApacheKudu</a>. Similarly, if you’re |
| aware of some Kudu news we missed, let us know so we can cover it in |
| a future post.</p></content><author><name>Jean-Daniel Cryans</name></author><summary>Welcome to the eighteenth edition of the Kudu Weekly Update. This weekly blog post |
| covers ongoing development and news in the Apache Kudu project.</summary></entry></feed> |