<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2016-09-19T20:48:47-07:00</updated><id>/</id><entry><title>Pushing Down Predicate Evaluation in Apache Kudu</title><link href="/2016/09/16/predicate-pushdown.html" rel="alternate" type="text/html" title="Pushing Down Predicate Evaluation in Apache Kudu" /><published>2016-09-16T00:00:00-07:00</published><updated>2016-09-16T00:00:00-07:00</updated><id>/2016/09/16/predicate-pushdown</id><content type="html" xml:base="/2016/09/16/predicate-pushdown.html">&lt;p&gt;I had the pleasure of interning with the Apache Kudu team at Cloudera this
summer. This project was my summer contribution to Kudu: a restructuring of the
scan path to speed up queries.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In Kudu, &lt;em&gt;predicate pushdown&lt;/em&gt; refers to the way in which predicates are
handled. When a scan is requested, its predicates are passed through the
different layers of Kudu’s storage hierarchy, allowing for pruning and other
optimizations to happen at each level before reaching the underlying data.&lt;/p&gt;
&lt;p&gt;While predicates are pushed down, predicate evaluation itself occurs at a fairly
high level, which precludes certain data-specific optimizations. These
optimizations can make tablet scans an order of magnitude
faster, if not more.&lt;/p&gt;
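&lt;p&gt;For context, a predicate is simply a filter that the client attaches to a
scan. Below is a minimal sketch using the Java client API; the
&lt;code&gt;metrics&lt;/code&gt; table and &lt;code&gt;host&lt;/code&gt; column are hypothetical names
used only for illustration. The predicate object is what gets pushed down to
the tablet servers.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;
public class PushdownScanExample {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder(&quot;master-host:7051&quot;).build();
    KuduTable table = client.openTable(&quot;metrics&quot;);
    // The predicate travels with the scan request, so each layer of the
    // storage engine can use it to prune or skip data before copying it.
    KuduPredicate pred = KuduPredicate.newComparisonPredicate(
        table.getSchema().getColumn(&quot;host&quot;),
        KuduPredicate.ComparisonOp.EQUAL,
        &quot;host-01.example.com&quot;);
    KuduScanner scanner = client.newScannerBuilder(table)
        .addPredicate(pred)
        .build();
    while (scanner.hasMoreRows()) {
      RowResultIterator rows = scanner.nextRows();
      for (RowResult row : rows) {
        System.out.println(row.getString(&quot;host&quot;));
      }
    }
    client.shutdown();
  }
}
&lt;/code&gt;&lt;/pre&gt;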
&lt;h2 id=&quot;a-day-in-the-life-of-a-query&quot;&gt;A Day in the Life of a Query&lt;/h2&gt;
&lt;p&gt;Because Kudu is a columnar storage engine, its scan path has a number of
optimizations to avoid extraneous reads, copies, and computation. When a query
is sent to a tablet server, the server prunes tablets based on the
primary key, directing the request to only the tablets that contain the key
range of interest. Once at a tablet, only the columns relevant to the query are
scanned. Further pruning is done using the primary key, but if the query is
predicated on non-key columns, the entire column must be scanned. The columns in a
tablet are stored as &lt;em&gt;cfiles&lt;/em&gt;, which are split into encoded &lt;em&gt;blocks&lt;/em&gt;. Once the
relevant cfiles are determined, the data are materialized by the block
decoders, i.e. their underlying data are decoded and copied into a buffer,
which is passed back to the tablet layer. The tablet can then evaluate the
predicate on the batch of data and mark which rows should be returned to the
client.&lt;/p&gt;
&lt;p&gt;One of the encoding types I worked very closely with is &lt;em&gt;dictionary encoding&lt;/em&gt;,
an encoding type for strings that performs particularly well for cfiles that
have repeating values. Rather than storing every row’s string, each unique
string is assigned a numeric codeword, and the rows are stored numerically on
disk. When materializing a dictionary block, all of the numeric data are scanned
and all of the corresponding strings are copied and buffered for evaluation.
When the vocabulary of a dictionary-encoded cfile gets too large, the blocks
switch to &lt;em&gt;plain encoding mode&lt;/em&gt; and behave like &lt;em&gt;plain-encoded&lt;/em&gt; blocks.&lt;/p&gt;
&lt;p&gt;In a plain-encoded block, strings are stored contiguously and the character
offsets to the start of each string are stored as a list of integers. When
materializing, all of the strings are copied to a buffer for evaluation.&lt;/p&gt;
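&lt;p&gt;As a toy illustration of the difference (a sketch of the idea, not Kudu’s
actual on-disk format), dictionary encoding amounts to storing a small
vocabulary plus one codeword per row:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
// Toy sketch of dictionary encoding: each unique string is assigned a
// codeword, and the column is stored as codewords instead of strings.
public class DictionaryEncodingSketch {
  public static void main(String[] args) {
    String[] column = {&quot;apple&quot;, &quot;banana&quot;, &quot;apple&quot;, &quot;apple&quot;, &quot;cherry&quot;};
    Map&amp;lt;String, Integer&amp;gt; dictionary = new HashMap&amp;lt;&amp;gt;();
    List&amp;lt;Integer&amp;gt; codewords = new ArrayList&amp;lt;&amp;gt;();
    for (String value : column) {
      Integer code = dictionary.get(value);
      if (code == null) {
        code = dictionary.size();
        dictionary.put(value, code);
      }
      codewords.add(code);
    }
    System.out.println(dictionary); // e.g. {apple=0, banana=1, cherry=2}
    System.out.println(codewords);  // [0, 1, 0, 0, 2]
  }
}
&lt;/code&gt;&lt;/pre&gt;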
&lt;p&gt;Therein lies room for improvement: this predicate evaluation path is the same
for all data types and encoding types. Within the tablet, the correct cfiles
are determined, the cfiles’ decoders are opened, all of the data are copied to
a buffer, and the predicates are evaluated on this buffered data via
type-specific comparators. This path is extremely flexible, but because it was
designed to be encoding-independent, it misses optimizations that specific encodings make possible.&lt;/p&gt;
&lt;h2 id=&quot;trimming-the-fat&quot;&gt;Trimming the Fat&lt;/h2&gt;
&lt;p&gt;The first step is to allow the decoders access to the predicate. In doing so,
each encoding type can specialize its evaluation. Additionally, this puts the
decoder in a position where it can determine whether a given row satisfies the
query, which in turn allows the decoder to decide what data gets copied
instead of eagerly copying everything for evaluation.&lt;/p&gt;
&lt;p&gt;Take the case of dictionary-encoded strings as an example. With the existing
scan path, not only are all of the strings in a column copied into a buffer, but
string comparisons are done on every row. By taking advantage of the fact that
the data can be represented as integers, the cost of determining the query
results can be greatly reduced. The string comparisons can be swapped out with
evaluation based on the codewords, in which case the room for improvement boils
down to how to most quickly determine whether or not a given codeword
corresponds to a string that satisfies the predicate. Dictionary columns now
use a bitset to store the codewords that match the predicate. The decoder then
scans through the integer-valued data and checks the bitset to determine whether
it should copy the corresponding string over.&lt;/p&gt;
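&lt;p&gt;Continuing the toy dictionary example above, the idea looks roughly like the
following sketch (an illustration only; Kudu’s actual implementation lives in
the C++ block decoders): the predicate is evaluated once per dictionary entry
to build the bitset, and each row then costs a single integer lookup.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import java.util.BitSet;
// Toy sketch of evaluating a string predicate over dictionary codewords.
public class CodewordPredicateSketch {
  public static void main(String[] args) {
    // codeword -&amp;gt; string mapping and per-row codewords from the sketch above
    String[] dictionary = {&quot;apple&quot;, &quot;banana&quot;, &quot;cherry&quot;};
    int[] codewords = {0, 1, 0, 0, 2};
    // Evaluate the predicate (value &amp;gt;= &quot;banana&quot;) once per unique string.
    BitSet matching = new BitSet(dictionary.length);
    for (int code = 0; code &amp;lt; dictionary.length; code++) {
      if (dictionary[code].compareTo(&quot;banana&quot;) &amp;gt;= 0) {
        matching.set(code);
      }
    }
    // Scanning is now a cheap bitset lookup per row; only matching rows
    // would have their strings copied out for the client.
    for (int row = 0; row &amp;lt; codewords.length; row++) {
      if (matching.get(codewords[row])) {
        System.out.println(&quot;row &quot; + row + &quot; matches: &quot; + dictionary[codewords[row]]);
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;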
&lt;p&gt;This is great in the best-case scenario, where a cfile’s vocabulary is small,
but when the vocabulary gets too large and the dictionary blocks switch to plain
encoding mode, performance is hampered. In this mode, the blocks don’t utilize
any dictionary metadata and end up wasting the codeword bitset. That isn’t to
say all is lost: the decoders can still evaluate a predicate via string
comparison, and the fact that evaluation can still occur at the decoder-level
means the eager buffering can still be avoided.&lt;/p&gt;
&lt;p&gt;Dictionary encoding is the ideal case in that the decoders can completely
evaluate the predicates. This is not the case for most other encoding types,
but having decoders support evaluation leaves the door open for other encoding
types to extend this idea.&lt;/p&gt;
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;Depending on the dataset and query, predicate pushdown can lead to significant
improvements. Tablet scans were timed with datasets consisting of repeated
string patterns of tunable length and tunable cardinality.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/predicate-pushdown/pushdown-10.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;
&lt;img src=&quot;/img/predicate-pushdown/pushdown-10M.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The above plots show the time taken to completely scan a single tablet, recorded
using a dataset of ten million rows of strings with length ten. Predicates were
designed to select values out of bounds (Empty), select a single value (Equal,
i.e. for cardinality &lt;em&gt;k&lt;/em&gt;, this would select 1/&lt;em&gt;k&lt;/em&gt; of the dataset), select half
of the full range (Half), and select the full range of values (All).&lt;/p&gt;
&lt;p&gt;With the original evaluation implementation, the tablet must copy and scan
through the entire column to determine whether any values match. This means that even
when the result set is small, the full column is still copied. Pushing down
predicates avoids this by copying data only as needed, as can be seen in the
above queries: those with near-empty result sets (Empty and Equal) have shorter
scan times than those with larger result sets (Half and All).&lt;/p&gt;
&lt;p&gt;Note that for dictionary encoding, given a low cardinality, Kudu can completely
rely on the dictionary codewords to evaluate, making the query significantly
faster. At higher cardinalities, the dictionaries completely fill up and the
blocks fall back on plain encoding. The slower, albeit still improved,
performance on the dataset containing 10M unique values reflects this.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/predicate-pushdown/pushdown-tpch.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Similar predicates were run with the TPC-H dataset, querying on the shipdate
column. The full path of a query includes not only the tablet scanning itself,
but also RPCs and batched data transfer to the caller as the scan progresses.
As such, the times plotted above refer to the average end-to-end time required
to scan and return a batch of rows. Even with this additional overhead,
significant improvements to the scan path still yield substantial improvements
to the query performance as a whole.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Pushing down predicate evaluation in Kudu yielded substantial improvements to
the scan path. For dictionary encoding, pushdown can be particularly powerful,
and other encoding types are either unaffected or also improved. This change has
been pushed to the main branch of Kudu, and relevant commits can be found
&lt;a href=&quot;https://github.com/cloudera/kudu/commit/c0f37278cb09a7781d9073279ea54b08db6e2010&quot;&gt;here&lt;/a&gt;
and
&lt;a href=&quot;https://github.com/cloudera/kudu/commit/ec80fdb37be44d380046a823b5e6d8e2241ec3da&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This summer has been a phenomenal learning experience for me, in terms of the
tools, the workflow, the datasets, and the thought processes that go into building
something at Kudu’s scale. I am extremely thankful for all of the mentoring and
support I received, and that I got to be a part of Kudu’s journey from
incubation to a Top-Level Apache project. I can’t express enough how grateful I
am for the amount of support I got from the Kudu team, from the intern
coordinators, and from the Cloudera community as a whole.&lt;/p&gt;</content><author><name>Andrew Wong</name></author><summary>I had the pleasure of interning with the Apache Kudu team at Cloudera this
summer. This project was my summer contribution to Kudu: a restructuring of the
scan path to speed up queries.</summary></entry><entry><title>An Introduction to the Flume Kudu Sink</title><link href="/2016/08/31/intro-flume-kudu-sink.html" rel="alternate" type="text/html" title="An Introduction to the Flume Kudu Sink" /><published>2016-08-31T00:00:00-07:00</published><updated>2016-08-31T00:00:00-07:00</updated><id>/2016/08/31/intro-flume-kudu-sink</id><content type="html" xml:base="/2016/08/31/intro-flume-kudu-sink.html">&lt;p&gt;This post discusses the Kudu Flume Sink. First, I’ll give some background on why we considered
using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.&lt;/p&gt;
&lt;h2 id=&quot;why-kudu&quot;&gt;Why Kudu&lt;/h2&gt;
&lt;p&gt;Traditionally in the Hadoop ecosystem we’ve dealt with various &lt;em&gt;batch processing&lt;/em&gt; technologies such
as MapReduce and the many libraries and tools built on top of it in various languages (Apache Pig,
Apache Hive, Apache Oozie and many others). The main problem with this approach is that it needs to
process the whole data set in batches, again and again, as soon as new data gets added. Things get
really complicated when a few such tasks need to get chained together, or when the same data set
needs to be processed in various ways by different jobs, while all compete for the shared cluster
resources.&lt;/p&gt;
&lt;p&gt;The opposite of this approach is &lt;em&gt;stream processing&lt;/em&gt;: process the data as soon as it arrives, not
in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make
this possible. But writing streaming services is not trivial. The streaming systems are becoming
more and more capable and support more complex constructs, but they are not yet easy to use. All
queries and processes need to be carefully planned and implemented.&lt;/p&gt;
&lt;p&gt;To summarize, &lt;em&gt;batch processing&lt;/em&gt; is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;file-based&lt;/li&gt;
&lt;li&gt;a paradigm that processes large chunks of data as a group&lt;/li&gt;
&lt;li&gt;high latency and high throughput, both for ingest and query&lt;/li&gt;
&lt;li&gt;typically easy to program, but hard to orchestrate&lt;/li&gt;
&lt;li&gt;well suited for writing ad-hoc queries, although they are typically high latency&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While &lt;em&gt;stream processing&lt;/em&gt; is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a totally different paradigm, which involves single events and time windows instead of large groups of events&lt;/li&gt;
&lt;li&gt;still file-based and not a long-term database&lt;/li&gt;
&lt;li&gt;not batch-oriented, but incremental&lt;/li&gt;
&lt;li&gt;ultra-fast ingest and ultra-fast query (query results basically pre-calculated)&lt;/li&gt;
&lt;li&gt;not so easy to program, relatively easy to orchestrate&lt;/li&gt;
&lt;li&gt;impossible to write ad-hoc queries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And a Kudu-based &lt;em&gt;near real-time&lt;/em&gt; approach is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;flexible and expressive, thanks to SQL support via Apache Impala (incubating)&lt;/li&gt;
&lt;li&gt;a table-oriented, mutable data store that feels like a traditional relational database&lt;/li&gt;
&lt;li&gt;very easy to program; you can even pretend it’s good old MySQL&lt;/li&gt;
&lt;li&gt;low-latency and relatively high throughput, both for ingest and query&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At Argyle Data, we’re dealing with complex fraud detection scenarios. We need to ingest massive
amounts of data, run machine learning algorithms and generate reports. When we created our current
architecture two years ago we decided to opt for a database as the backbone of our system. That
database is Apache Accumulo. It’s a key-value based database which runs on top of Hadoop HDFS,
quite similar to HBase but with some important improvements such as cell level security and ease
of deployment and management. To enable querying of this data for quite complex reporting and
analytics, we used Presto, a distributed query engine with a pluggable architecture open-sourced
by Facebook. We wrote a connector for it to let it run queries against the Accumulo database. This
architecture has served us well, but there were a few problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we need to ingest even more massive volumes of data in real-time&lt;/li&gt;
&lt;li&gt;we need to perform complex machine-learning calculations on even larger data-sets&lt;/li&gt;
&lt;li&gt;we need to support ad-hoc queries, plus long-term data warehouse functionality&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So, we’ve started gradually moving the core machine-learning pipeline to a streaming-based
solution. This way we can ingest and process larger data-sets faster, in real time. But then how
would we take care of ad-hoc queries and long-term persistence? This is where Kudu comes in. While
the machine learning pipeline ingests and processes real-time data, we store a copy of the same
ingested data in Kudu for long-term access and ad-hoc queries. Kudu is our &lt;em&gt;data warehouse&lt;/em&gt;. By
using Kudu and Impala, we can retire our in-house Presto connector and rely on Impala’s
super-fast query engine.&lt;/p&gt;
&lt;p&gt;But how would we make sure data is reliably ingested into the streaming pipeline &lt;em&gt;and&lt;/em&gt; the
Kudu-based data warehouse? This is where Apache Flume comes in.&lt;/p&gt;
&lt;h2 id=&quot;why-flume&quot;&gt;Why Flume&lt;/h2&gt;
&lt;p&gt;According to their &lt;a href=&quot;http://flume.apache.org/&quot;&gt;website&lt;/a&gt; “Flume is a distributed, reliable, and
available service for efficiently collecting, aggregating, and moving large amounts of log data.
It has a simple and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.” As you
can see, Hadoop isn’t mentioned anywhere, yet Flume is typically used for ingesting data into Hadoop
clusters.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://blogs.apache.org/flume/mediaresource/ab0d50f6-a960-42cc-971e-3da38ba3adad&quot; alt=&quot;png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Flume has an extensible architecture. An instance of Flume, called an &lt;em&gt;agent&lt;/em&gt;, can have multiple
&lt;em&gt;channels&lt;/em&gt;, with each having multiple &lt;em&gt;sources&lt;/em&gt; and &lt;em&gt;sinks&lt;/em&gt; of various types. Sources queue data
in channels, which in turn write out data to sinks. Such &lt;em&gt;pipelines&lt;/em&gt; can be chained together to
create even more complex ones. There may be more than one agent and agents can be configured to
support failover and recovery.&lt;/p&gt;
&lt;p&gt;Flume comes with a bunch of built-in types of channels, sources and sinks. Memory channel is the
default (an in-memory queue with no persistence to disk), but other options such as Kafka- and
File-based channels are also provided. As for sources, Avro, JMS, Thrift, and spooling directory
are some of the built-in ones. Flume also ships with many sinks, including sinks for writing
data to HDFS, HBase, Hive, Kafka, as well as to other Flume agents.&lt;/p&gt;
&lt;p&gt;In the rest of this post I’ll go over the Kudu Flume sink and show you how to configure Flume to
write ingested data to a Kudu table. The sink has been part of the Kudu distribution since the 0.8
release and the source code can be found &lt;a href=&quot;https://github.com/apache/kudu/tree/master/java/kudu-flume-sink&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;configuring-the-kudu-flume-sink&quot;&gt;Configuring the Kudu Flume Sink&lt;/h2&gt;
&lt;p&gt;Here is a sample flume configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1
agent1.sources.source1.type = exec
agent1.sources.source1.command = /usr/bin/vmstat 1
agent1.sources.source1.channels = channel1
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 1000
agent1.sinks.sink1.type = org.apache.flume.sink.kudu.KuduSink
agent1.sinks.sink1.masterAddresses = localhost
agent1.sinks.sink1.tableName = stats
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.batchSize = 50
agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We define a source called &lt;code&gt;source1&lt;/code&gt; which simply executes a &lt;code&gt;vmstat&lt;/code&gt; command to continuously generate
virtual memory statistics for the machine and queue events into an in-memory &lt;code&gt;channel1&lt;/code&gt; channel,
which in turn is used for writing these events to a Kudu table called &lt;code&gt;stats&lt;/code&gt;. We are using
&lt;code&gt;org.apache.kudu.flume.sink.SimpleKuduEventProducer&lt;/code&gt; as the producer. &lt;code&gt;SimpleKuduEventProducer&lt;/code&gt; is
the built-in and default producer, but it’s implemented as a showcase for how to write Flume
events into Kudu tables. For any serious functionality we’d have to write a custom producer. We
need to make this producer and the &lt;code&gt;KuduSink&lt;/code&gt; class available to Flume. We can do that by simply
copying the &lt;code&gt;kudu-flume-sink-&amp;lt;VERSION&amp;gt;.jar&lt;/code&gt; jar file from the Kudu distribution to the
&lt;code&gt;$FLUME_HOME/plugins.d/kudu-sink/lib&lt;/code&gt; directory in the Flume installation. The jar file contains
&lt;code&gt;KuduSink&lt;/code&gt; and all of its dependencies (including Kudu java client classes).&lt;/p&gt;
&lt;p&gt;At a minimum, the Kudu Flume Sink needs to know where the Kudu masters are
(&lt;code&gt;agent1.sinks.sink1.masterAddresses = localhost&lt;/code&gt;) and which Kudu table should be used for writing
Flume events to (&lt;code&gt;agent1.sinks.sink1.tableName = stats&lt;/code&gt;). The Kudu Flume Sink doesn’t create this
table; it has to be created before the Kudu Flume Sink is started.&lt;/p&gt;
&lt;p&gt;You may also notice the &lt;code&gt;batchSize&lt;/code&gt; parameter. The sink batches up to that many
Flume events and flushes the entire batch in one shot. Tuning &lt;code&gt;batchSize&lt;/code&gt; properly can have a huge
impact on ingest performance of the Kudu cluster.&lt;/p&gt;
&lt;p&gt;Here is a complete list of KuduSink parameters:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter Name&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;masterAddresses&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Comma-separated list of “host:port” pairs of the masters (port optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tableName&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;The name of the table in Kudu to write to&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;producer&lt;/td&gt;
&lt;td&gt;org.apache.kudu.flume.sink.SimpleKuduEventProducer&lt;/td&gt;
&lt;td&gt;The fully qualified class name of the Kudu event producer the sink should use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;batchSize&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Maximum number of events the sink should take from the channel per transaction, if available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;timeoutMillis&lt;/td&gt;
&lt;td&gt;30000&lt;/td&gt;
&lt;td&gt;Timeout period for Kudu operations, in milliseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ignoreDuplicateRows&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;Whether to ignore errors indicating that we attempted to insert duplicate rows into Kudu&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Let’s take a look at the source code for the built-in producer class:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;public class SimpleKuduEventProducer implements KuduEventProducer {
private byte[] payload;
private KuduTable table;
private String payloadColumn;
public SimpleKuduEventProducer(){
}
@Override
public void configure(Context context) {
payloadColumn = context.getString(&quot;payloadColumn&quot;,&quot;payload&quot;);
}
@Override
public void configure(ComponentConfiguration conf) {
}
@Override
public void initialize(Event event, KuduTable table) {
this.payload = event.getBody();
this.table = table;
}
@Override
public List&amp;lt;Operation&amp;gt; getOperations() throws FlumeException {
try {
Insert insert = table.newInsert();
PartialRow row = insert.getRow();
row.addBinary(payloadColumn, payload);
return Collections.singletonList((Operation) insert);
} catch (Exception e){
throw new FlumeException(&quot;Failed to create Kudu Insert object!&quot;, e);
}
}
@Override
public void close() {
}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;SimpleKuduEventProducer&lt;/code&gt; implements the &lt;code&gt;org.apache.kudu.flume.sink.KuduEventProducer&lt;/code&gt; interface,
which itself looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;public interface KuduEventProducer extends Configurable, ConfigurableComponent {
/**
* Initialize the event producer.
* @param event to be written to Kudu
* @param table the KuduTable object used for creating Kudu Operation objects
*/
void initialize(Event event, KuduTable table);
/**
* Get the operations that should be written out to Kudu as a result of this
* event. This list is written to Kudu using the Kudu client API.
* @return List of {@link org.kududb.client.Operation} which
* are written as such to Kudu
*/
List&amp;lt;Operation&amp;gt; getOperations();
/*
* Clean up any state. This will be called when the sink is being stopped.
*/
void close();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;public void configure(Context context)&lt;/code&gt; is called when an instance of our producer is instantiated
by the KuduSink. SimpleKuduEventProducer’s implementation looks for a producer parameter named
&lt;code&gt;payloadColumn&lt;/code&gt; and uses its value (“payload” if not overridden in Flume configuration file) as the
column which will hold the value of the Flume event payload. If you recall from above, we had
configured the KuduSink to listen for events generated from the &lt;code&gt;vmstat&lt;/code&gt; command. Each output row
from that command will be stored as a new row containing a &lt;code&gt;payload&lt;/code&gt; column in the &lt;code&gt;stats&lt;/code&gt; table.
&lt;code&gt;SimpleKuduEventProducer&lt;/code&gt; does not have any other configuration parameters, but if it had any we would
define them by prefixing them with &lt;code&gt;producer.&lt;/code&gt; (&lt;code&gt;agent1.sinks.sink1.producer.parameter1&lt;/code&gt; for
example).&lt;/p&gt;
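&lt;p&gt;For example, if we wanted the event body to land in a hypothetical column named
&lt;code&gt;message&lt;/code&gt; instead of the default &lt;code&gt;payload&lt;/code&gt; column, the sink configuration shown
earlier could be extended like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer
agent1.sinks.sink1.producer.payloadColumn = message
&lt;/code&gt;&lt;/pre&gt;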
&lt;p&gt;The main producer logic resides in the &lt;code&gt;public List&amp;lt;Operation&amp;gt; getOperations()&lt;/code&gt; method. In
SimpleKuduEventProducer’s implementation we simply insert the binary body of the Flume event into
the Kudu table. Here we call Kudu’s &lt;code&gt;newInsert()&lt;/code&gt; to initiate an insert, but we could have used
an &lt;code&gt;Upsert&lt;/code&gt; if updating an existing row was also an option; in fact, there’s another producer
implementation available for doing just that: &lt;code&gt;SimpleKeyedKuduEventProducer&lt;/code&gt;. Most probably you
will need to write your own custom producer in the real world, but you can base your implementation
on the built-in ones.&lt;/p&gt;
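&lt;p&gt;As a rough illustration (not part of the Kudu distribution), a custom producer
might look like the sketch below. It assumes a hypothetical table whose primary key
is made up of &lt;code&gt;host&lt;/code&gt; and &lt;code&gt;line&lt;/code&gt; string columns, and it uses an upsert so that a
re-delivered event simply overwrites the row it already wrote:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.Operation;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.Upsert;
import org.apache.kudu.flume.sink.KuduEventProducer;
public class HostLineKuduEventProducer implements KuduEventProducer {
  private Event event;
  private KuduTable table;
  private String hostname;
  @Override
  public void configure(Context context) {
    // Hypothetical producer parameter, e.g. agent1.sinks.sink1.producer.hostname = web01
    hostname = context.getString(&quot;hostname&quot;, &quot;unknown&quot;);
  }
  @Override
  public void configure(ComponentConfiguration conf) {
  }
  @Override
  public void initialize(Event event, KuduTable table) {
    this.event = event;
    this.table = table;
  }
  @Override
  public List&amp;lt;Operation&amp;gt; getOperations() throws FlumeException {
    try {
      // Use an upsert so that a retried event overwrites rather than fails.
      Upsert upsert = table.newUpsert();
      PartialRow row = upsert.getRow();
      row.addString(&quot;host&quot;, hostname);
      row.addString(&quot;line&quot;, new String(event.getBody(), StandardCharsets.UTF_8));
      return Collections.singletonList((Operation) upsert);
    } catch (Exception e) {
      throw new FlumeException(&quot;Failed to create Kudu Upsert object!&quot;, e);
    }
  }
  @Override
  public void close() {
  }
}
&lt;/code&gt;&lt;/pre&gt;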
&lt;p&gt;In the future, we plan to add more flexible event producer implementations so that creation of a
custom event producer is not required to write data to Kudu. See
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/4034/&quot;&gt;here&lt;/a&gt; for a work-in-progress generic event producer for
Avro-encoded Events.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Kudu is a scalable data store which lets us ingest insane amounts of data per second. Apache Flume
helps us aggregate data from various sources, and the Kudu Flume Sink lets us easily store
the aggregated Flume events into Kudu. Together they enable us to create a data warehouse out of
disparate sources.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ara Abrahamian is a software engineer at Argyle Data building fraud detection systems using
sophisticated machine learning methods. Ara is the original author of the Flume Kudu Sink that
is included in the Kudu distribution. You can follow him on Twitter at
&lt;a href=&quot;https://twitter.com/ara_e&quot;&gt;@ara_e&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content><author><name>Ara Abrahamian</name></author><summary>This post discusses the Kudu Flume Sink. First, I&amp;#8217;ll give some background on why we considered
using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.
Why Kudu
Traditionally in the Hadoop ecosystem we&amp;#8217;ve dealt with various batch processing technologies such
as MapReduce and the many libraries and tools built on top of it in various languages (Apache Pig,
Apache Hive, Apache Oozie and many others). The main problem with this approach is that it needs to
process the whole data set in batches, again and again, as soon as new data gets added. Things get
really complicated when a few such tasks need to get chained together, or when the same data set
needs to be processed in various ways by different jobs, while all compete for the shared cluster
resources.
The opposite of this approach is stream processing: process the data as soon as it arrives, not
in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make
this possible. But writing streaming services is not trivial. The streaming systems are becoming
more and more capable and support more complex constructs, but they are not yet easy to use. All
queries and processes need to be carefully planned and implemented.
To summarize, batch processing is:
file-based
a paradigm that processes large chunks of data as a group
high latency and high throughput, both for ingest and query
typically easy to program, but hard to orchestrate
well suited for writing ad-hoc queries, although they are typically high latency
While stream processing is:
a totally different paradigm, which involves single events and time windows instead of large groups of events
still file-based and not a long-term database
not batch-oriented, but incremental
ultra-fast ingest and ultra-fast query (query results basically pre-calculated)
not so easy to program, relatively easy to orchestrate
impossible to write ad-hoc queries
And a Kudu-based near real-time approach is:
flexible and expressive, thanks to SQL support via Apache Impala (incubating)
a table-oriented, mutable data store that feels like a traditional relational database
very easy to program; you can even pretend it&amp;#8217;s good old MySQL
low-latency and relatively high throughput, both for ingest and query
At Argyle Data, we&amp;#8217;re dealing with complex fraud detection scenarios. We need to ingest massive
amounts of data, run machine learning algorithms and generate reports. When we created our current
architecture two years ago we decided to opt for a database as the backbone of our system. That
database is Apache Accumulo. It&amp;#8217;s a key-value based database which runs on top of Hadoop HDFS,
quite similar to HBase but with some important improvements such as cell level security and ease
of deployment and management. To enable querying of this data for quite complex reporting and
analytics, we used Presto, a distributed query engine with a pluggable architecture open-sourced
by Facebook. We wrote a connector for it to let it run queries against the Accumulo database. This
architecture has served us well, but there were a few problems:
we need to ingest even more massive volumes of data in real-time
we need to perform complex machine-learning calculations on even larger data-sets
we need to support ad-hoc queries, plus long-term data warehouse functionality
So, we&amp;#8217;ve started gradually moving the core machine-learning pipeline to a streaming-based
solution. This way we can ingest and process larger data-sets faster, in real time. But then how
would we take care of ad-hoc queries and long-term persistence? This is where Kudu comes in. While
the machine learning pipeline ingests and processes real-time data, we store a copy of the same
ingested data in Kudu for long-term access and ad-hoc queries. Kudu is our data warehouse. By
using Kudu and Impala, we can retire our in-house Presto connector and rely on Impala&amp;#8217;s
super-fast query engine.
But how would we make sure data is reliably ingested into the streaming pipeline and the
Kudu-based data warehouse? This is where Apache Flume comes in.
Why Flume
According to their website &amp;#8220;Flume is a distributed, reliable, and
available service for efficiently collecting, aggregating, and moving large amounts of log data.
It has a simple and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.&amp;#8221; As you
can see, Hadoop is not mentioned anywhere, yet Flume is typically used for ingesting data into Hadoop
clusters.
Flume has an extensible architecture. An instance of Flume, called an agent, can have multiple
channels, with each having multiple sources and sinks of various types. Sources queue data
in channels, which in turn write out data to sinks. Such pipelines can be chained together to
create even more complex ones. There may be more than one agent and agents can be configured to
support failover and recovery.
Flume comes with a bunch of built-in types of channels, sources and sinks. Memory channel is the
default (an in-memory queue with no persistence to disk), but other options such as Kafka- and
File-based channels are also provided. As for sources, Avro, JMS, Thrift, and spooling directory
are some of the built-in ones. Flume also ships with many sinks, including sinks for writing
data to HDFS, HBase, Hive, Kafka, as well as to other Flume agents.
In the rest of this post I&amp;#8217;ll go over the Kudu Flume sink and show you how to configure Flume to
write ingested data to a Kudu table. The sink has been part of the Kudu distribution since the 0.8
release and the source code can be found here.
Configuring the Kudu Flume Sink
Here is a sample flume configuration file:
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1
agent1.sources.source1.type = exec
agent1.sources.source1.command = /usr/bin/vmstat 1
agent1.sources.source1.channels = channel1
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 1000
agent1.sinks.sink1.type = org.apache.flume.sink.kudu.KuduSink
agent1.sinks.sink1.masterAddresses = localhost
agent1.sinks.sink1.tableName = stats
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.batchSize = 50
agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer
We define a source called source1 which simply executes a vmstat command to continuously generate
virtual memory statistics for the machine and queue events into an in-memory channel1 channel,
which in turn is used for writing these events to a Kudu table called stats. We are using
org.apache.kudu.flume.sink.SimpleKuduEventProducer as the producer. SimpleKuduEventProducer is
the built-in and default producer, but it&amp;#8217;s implemented as a showcase for how to write Flume
events into Kudu tables. For any serious functionality we&amp;#8217;d have to write a custom producer. We
need to make this producer and the KuduSink class available to Flume. We can do that by simply
copying the kudu-flume-sink-&amp;lt;VERSION&amp;gt;.jar jar file from the Kudu distribution to the
$FLUME_HOME/plugins.d/kudu-sink/lib directory in the Flume installation. The jar file contains
KuduSink and all of its dependencies (including Kudu java client classes).
At a minimum, the Kudu Flume Sink needs to know where the Kudu masters are
(agent1.sinks.sink1.masterAddresses = localhost) and which Kudu table should be used for writing
Flume events to (agent1.sinks.sink1.tableName = stats). The Kudu Flume Sink doesn&amp;#8217;t create this
table; it has to be created before the Kudu Flume Sink is started.
You may also notice the batchSize parameter. The sink batches up to that many
Flume events and flushes the entire batch in one shot. Tuning batchSize properly can have a huge
impact on ingest performance of the Kudu cluster.
Here is a complete list of KuduSink parameters:
Parameter Name
Default
Description
masterAddresses
N/A
Comma-separated list of &amp;#8220;host:port&amp;#8221; pairs of the masters (port optional)
tableName
N/A
The name of the table in Kudu to write to
producer
org.apache.kudu.flume.sink.SimpleKuduEventProducer
The fully qualified class name of the Kudu event producer the sink should use
batchSize
100
Maximum number of events the sink should take from the channel per transaction, if available
timeoutMillis
30000
Timeout period for Kudu operations, in milliseconds
ignoreDuplicateRows
true
Whether to ignore errors indicating that we attempted to insert duplicate rows into Kudu
Let&amp;#8217;s take a look at the source code for the built-in producer class:
public class SimpleKuduEventProducer implements KuduEventProducer {
private byte[] payload;
private KuduTable table;
private String payloadColumn;
public SimpleKuduEventProducer(){
}
@Override
public void configure(Context context) {
payloadColumn = context.getString(&quot;payloadColumn&quot;,&quot;payload&quot;);
}
@Override
public void configure(ComponentConfiguration conf) {
}
@Override
public void initialize(Event event, KuduTable table) {
this.payload = event.getBody();
this.table = table;
}
@Override
public List&amp;lt;Operation&amp;gt; getOperations() throws FlumeException {
try {
Insert insert = table.newInsert();
PartialRow row = insert.getRow();
row.addBinary(payloadColumn, payload);
return Collections.singletonList((Operation) insert);
} catch (Exception e){
throw new FlumeException(&quot;Failed to create Kudu Insert object!&quot;, e);
}
}
@Override
public void close() {
}
}
SimpleKuduEventProducer implements the org.apache.kudu.flume.sink.KuduEventProducer interface,
which itself looks like this:
public interface KuduEventProducer extends Configurable, ConfigurableComponent {
/**
* Initialize the event producer.
* @param event to be written to Kudu
* @param table the KuduTable object used for creating Kudu Operation objects
*/
void initialize(Event event, KuduTable table);
/**
* Get the operations that should be written out to Kudu as a result of this
* event. This list is written to Kudu using the Kudu client API.
* @return List of {@link org.kududb.client.Operation} which
* are written as such to Kudu
*/
List&amp;lt;Operation&amp;gt; getOperations();
/*
* Clean up any state. This will be called when the sink is being stopped.
*/
void close();
}
public void configure(Context context) is called when an instance of our producer is instantiated
by the KuduSink. SimpleKuduEventProducer&amp;#8217;s implementation looks for a producer parameter named
payloadColumn and uses its value (&amp;#8220;payload&amp;#8221; if not overridden in Flume configuration file) as the
column which will hold the value of the Flume event payload. If you recall from above, we had
configured the KuduSink to listen for events generated from the vmstat command. Each output row
from that command will be stored as a new row containing a payload column in the stats table.
SimpleKuduEventProducer does not have any other configuration parameters, but if it had any we would
define them by prefixing them with producer. (agent1.sinks.sink1.producer.parameter1 for
example).
The main producer logic resides in the public List&amp;lt;Operation&amp;gt; getOperations() method. In
SimpleKuduEventProducer&amp;#8217;s implementation we simply insert the binary body of the Flume event into
the Kudu table. Here we call Kudu&amp;#8217;s newInsert() to initiate an insert, but we could have used
an Upsert if updating an existing row was also an option; in fact, there&amp;#8217;s another producer
implementation available for doing just that: SimpleKeyedKuduEventProducer. Most probably you
will need to write your own custom producer in the real world, but you can base your implementation
on the built-in ones.
In the future, we plan to add more flexible event producer implementations so that creation of a
custom event producer is not required to write data to Kudu. See
here for a work-in-progress generic event producer for
Avro-encoded Events.
Conclusion
Kudu is a scalable data store which lets us ingest insane amounts of data per second. Apache Flume
helps us aggregate data from various sources, and the Kudu Flume Sink lets us easily store
the aggregated Flume events into Kudu. Together they enable us to create a data warehouse out of
disparate sources.
Ara Abrahamian is a software engineer at Argyle Data building fraud detection systems using
sophisticated machine learning methods. Ara is the original author of the Flume Kudu Sink that
is included in the Kudu distribution. You can follow him on Twitter at
@ara_e.</summary></entry><entry><title>New Range Partitioning Features in Kudu 0.10</title><link href="/2016/08/23/new-range-partitioning-features.html" rel="alternate" type="text/html" title="New Range Partitioning Features in Kudu 0.10" /><published>2016-08-23T00:00:00-07:00</published><updated>2016-08-23T00:00:00-07:00</updated><id>/2016/08/23/new-range-partitioning-features</id><content type="html" xml:base="/2016/08/23/new-range-partitioning-features.html">&lt;p&gt;Kudu 0.10 is shipping with a few important new features for range partitioning.
These features are designed to make Kudu easier to scale for certain workloads,
like time series. This post will introduce these features, and discuss how to use
them to effectively design tables for scalability and performance.&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;Since Kudu’s initial release, tables have had the constraint that once created,
the set of partitions is static. This forces users to plan ahead and create
enough partitions for the expected size of the table, because once the table is
created no further partitions can be added. When using hash partitioning,
creating more partitions is as straightforward as specifying more buckets. For
range partitioning, however, knowing where to put the extra partitions ahead of
time can be difficult or impossible.&lt;/p&gt;
&lt;p&gt;The common solution to this problem in other distributed databases is to allow
range partitions to split into smaller child range partitions. Unfortunately,
range splitting typically has a large performance impact on running tables,
since child partitions need to eventually be recompacted and rebalanced to a
remote server. Range splitting is particularly thorny with Kudu, because rows
are stored in tablets in primary key sorted order, which does not necessarily
match the range partitioning order. If the range partition key is different from
the primary key, then splitting requires inspecting and shuffling each
individual row, instead of splitting the tablet in half.&lt;/p&gt;
&lt;h2 id=&quot;adding-and-dropping-range-partitions&quot;&gt;Adding and Dropping Range Partitions&lt;/h2&gt;
&lt;p&gt;As an alternative to range partition splitting, Kudu now allows range partitions
to be added and dropped on the fly, without locking the table or otherwise
affecting concurrent operations on other partitions. This solution is not
strictly as powerful as full range partition splitting, but it strikes a good
balance between flexibility, performance, and operational overhead.
Additionally, this feature does not preclude range splitting in the future if
there is a push to implement it. To support adding and dropping range
partitions, Kudu had to remove an even more fundamental restriction when using
range partitions.&lt;/p&gt;
&lt;p&gt;Previously, range partitions could only be created by specifying split points.
Split points divide an implicit partition covering the entire range into
contiguous and disjoint partitions. When using split points, the first and last
partitions are always unbounded below and above, respectively. A consequence of
the final partition being unbounded is that datasets which are range-partitioned
on a column that increases in value over time will eventually have far more rows
in the last partition than in any other. Unbalanced partitions are commonly
referred to as hotspots, and until Kudu 0.10 they have been difficult to avoid
when storing time series data in Kudu.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/2016-08-23-new-range-partitioning-features/range-partitioning-on-time.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The figure above shows the tablets created by two different attempts to
partition a table by range on a timestamp column. The first, above in blue, uses
split points. The second, below in green, uses bounded range partitions
specified during table creation. With bounded range partitions, there is no
longer a guarantee that every possible row has a corresponding range partition.
As a result, Kudu will now reject writes which fall in a ‘non-covered’ range.&lt;/p&gt;
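&lt;p&gt;As a rough sketch of what this looks like from the Java client in 0.10 (the
&lt;code&gt;metrics&lt;/code&gt; table and &lt;code&gt;time&lt;/code&gt; column are hypothetical names used only for
illustration), a table covering just 2015 and 2016 could be created with explicit,
bounded range partitions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;
public class CreateMetricsTable {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder(&quot;master-host:7051&quot;).build();
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder(&quot;time&quot;, Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder(&quot;value&quot;, Type.DOUBLE).build()));
    CreateTableOptions options = new CreateTableOptions()
        .setRangePartitionColumns(Arrays.asList(&quot;time&quot;));
    // Unix timestamps (seconds) for the partition bounds.
    long start2015 = 1420070400L; // 2015-01-01T00:00:00Z
    long start2016 = 1451606400L; // 2016-01-01T00:00:00Z
    long start2017 = 1483228800L; // 2017-01-01T00:00:00Z
    // One bounded range partition per year; writes outside these ranges are rejected.
    PartialRow lower2015 = schema.newPartialRow();
    lower2015.addLong(&quot;time&quot;, start2015);
    PartialRow upper2015 = schema.newPartialRow();
    upper2015.addLong(&quot;time&quot;, start2016);
    options.addRangePartition(lower2015, upper2015);
    PartialRow lower2016 = schema.newPartialRow();
    lower2016.addLong(&quot;time&quot;, start2016);
    PartialRow upper2016 = schema.newPartialRow();
    upper2016.addLong(&quot;time&quot;, start2017);
    options.addRangePartition(lower2016, upper2016);
    client.createTable(&quot;metrics&quot;, schema, options);
    client.shutdown();
  }
}
&lt;/code&gt;&lt;/pre&gt;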
&lt;p&gt;Now that tables are no longer required to have range partitions covering all
possible rows, Kudu can support adding range partitions to cover the otherwise
unoccupied space. Dropping a range partition will result in unoccupied space
where the range partition was previously. In the example above, we may want to
add a range partition covering 2017 at the end of the year, so that we can
continue collecting data in the future. By lazily adding range partitions we
avoid hotspotting, avoid the need to specify range partitions up front for time
periods far in the future, and avoid the downsides of splitting. Additionally,
historical data which is no longer useful can be efficiently deleted by dropping
the entire range partition.&lt;/p&gt;
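&lt;p&gt;Continuing the hypothetical &lt;code&gt;metrics&lt;/code&gt; table sketched above, rolling the
partitions forward at the end of 2016 might look roughly like this with the new
alter-table APIs (again, a sketch rather than production code):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.kudu.Schema;
import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;
public class RollMetricsPartitions {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder(&quot;master-host:7051&quot;).build();
    Schema schema = client.openTable(&quot;metrics&quot;).getSchema();
    long start2015 = 1420070400L; // 2015-01-01T00:00:00Z
    long start2016 = 1451606400L; // 2016-01-01T00:00:00Z
    long start2017 = 1483228800L; // 2017-01-01T00:00:00Z
    long start2018 = 1514764800L; // 2018-01-01T00:00:00Z
    // Add a new range partition covering 2017 ...
    PartialRow addLower = schema.newPartialRow();
    addLower.addLong(&quot;time&quot;, start2017);
    PartialRow addUpper = schema.newPartialRow();
    addUpper.addLong(&quot;time&quot;, start2018);
    // ... and drop the partition covering 2015, deleting its data efficiently.
    PartialRow dropLower = schema.newPartialRow();
    dropLower.addLong(&quot;time&quot;, start2015);
    PartialRow dropUpper = schema.newPartialRow();
    dropUpper.addLong(&quot;time&quot;, start2016);
    client.alterTable(&quot;metrics&quot;, new AlterTableOptions()
        .addRangePartition(addLower, addUpper)
        .dropRangePartition(dropLower, dropUpper));
    client.shutdown();
  }
}
&lt;/code&gt;&lt;/pre&gt;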
&lt;h2 id=&quot;what-about-hash-partitioning&quot;&gt;What About Hash Partitioning?&lt;/h2&gt;
&lt;p&gt;Since Kudu’s hash partitioning feature originally shipped in version 0.6, it has
been possible to create tables which combine hash partitioning with range
partitioning. The new range partitioning features continue to work seamlessly
when combined with hash partitioning. Just as before, the number of tablets
which comprise a table will be the product of the number of range partitions and
the number of hash partition buckets. Adding or dropping a range partition will
result in the creation or deletion of one tablet per hash bucket.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/2016-08-23-new-range-partitioning-features/range-and-hash-partitioning.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The diagram above shows a time series table range-partitioned on the timestamp
and hash-partitioned with two buckets. The hash partitioning could be on the
timestamp column, or it could be on any other column or columns in the primary
key. In this example only two years of historical data is needed, so at the end
of 2016 a new range partition is added for 2017 and the historical 2014 range
partition is dropped. This causes two new tablets to be created for 2017, and
the two existing tablets for 2014 to be deleted.&lt;/p&gt;
&lt;h2 id=&quot;getting-started&quot;&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;Beginning with the Kudu 0.10 release, users can add and drop range partitions
through the Java and C++ client APIs. Range partitions on existing tables can be
dropped and replacements added, but this requires the servers and all clients to
be updated to 0.10.&lt;/p&gt;</content><author><name>Dan Burkert</name></author><summary>Kudu 0.10 is shipping with a few important new features for range partitioning.
These features are designed to make Kudu easier to scale for certain workloads,
like time series. This post will introduce these features, and discuss how to use
them to effectively design tables for scalability and performance.</summary></entry><entry><title>Apache Kudu 0.10.0 released</title><link href="/2016/08/23/apache-kudu-0-10-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 0.10.0 released" /><published>2016-08-23T00:00:00-07:00</published><updated>2016-08-23T00:00:00-07:00</updated><id>/2016/08/23/apache-kudu-0-10-0-released</id><content type="html" xml:base="/2016/08/23/apache-kudu-0-10-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 0.10.0!&lt;/p&gt;
&lt;p&gt;This latest version adds several new features, including:
&lt;!--more--&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Users may now manually manage the partitioning of a range-partitioned table
by adding or dropping range partitions after a table has been created. This
can be particularly helpful for time-series workloads. Dan Burkert posted
an &lt;a href=&quot;/2016/08/23/new-range-partitioning-features.html&quot;&gt;in-depth blog&lt;/a&gt; today
detailing the new feature.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Multi-master (HA) Kudu clusters are now significantly more stable.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Administrators may now reserve a certain amount of disk space on each of the
configured data directories.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Kudu’s integration with Spark has been substantially improved and is much
more flexible.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This release also includes many bug fixes and other improvements, detailed in
the release notes below.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read the detailed &lt;a href=&quot;http://kudu.apache.org/releases/0.10.0/docs/release_notes.html&quot;&gt;Kudu 0.10.0 release notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Download the &lt;a href=&quot;http://kudu.apache.org/releases/0.10.0/&quot;&gt;Kudu 0.10.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name>Todd Lipcon</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 0.10.0!
This latest version adds several new features, including:</summary></entry><entry><title>Apache Kudu Weekly Update August 16th, 2016</title><link href="/2016/08/16/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update August 16th, 2016" /><published>2016-08-16T00:00:00-07:00</published><updated>2016-08-16T00:00:00-07:00</updated><id>/2016/08/16/weekly-update</id><content type="html" xml:base="/2016/08/16/weekly-update.html">&lt;p&gt;Welcome to the twentieth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu project.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;project-news&quot;&gt;Project news&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The first release candidate for the 0.10.0 release is &lt;a href=&quot;http://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201608.mbox/%3CCADY20s7U5jVpozFg3L%3DDz2%2B4AenGineJvH96A_HAM12biDjPJA%40mail.gmail.com%3E&quot;&gt;now available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Community developers and users are encouraged to download the source
tarball and vote on the release.&lt;/p&gt;
&lt;p&gt;For information on what’s new, check out the
&lt;a href=&quot;https://github.com/apache/kudu/blob/master/docs/release_notes.adoc#rn_0.10.0&quot;&gt;release notes&lt;/a&gt;.
&lt;em&gt;Note:&lt;/em&gt; some links from these in-progress release notes will not be live until the
release itself is published.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;development-discussions-and-code-in-progress&quot;&gt;Development discussions and code in progress&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Will Berkeley spent some time working on the Spark integration this week
to add support for UPSERT as well as other operations.
Dan Burkert pitched in a bit with some &lt;a href=&quot;http://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201608.mbox/%3CCALo2W-XBoSz9cbhXi81ipubrAYgqyDiEeHz-ys8sPAshfcik6w%40mail.gmail.com%3E&quot;&gt;suggestions&lt;/a&gt;
which were then integrated in a &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3871/&quot;&gt;patch&lt;/a&gt;
provided by Will.&lt;/p&gt;
&lt;p&gt;After some reviews by Dan, Chris George, and Ram Mettu, the patch was committed
in time for the upcoming 0.10.0 release.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dan Burkert also completed work for the new &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3854/&quot;&gt;manual partitioning APIs&lt;/a&gt;
in the Java client. After finishing up the basic implementation, Dan also made some
cleanups to the related APIs in both the &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3958/&quot;&gt;Java&lt;/a&gt;
and &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3882/&quot;&gt;C++&lt;/a&gt; clients.&lt;/p&gt;
&lt;p&gt;Dan and Misty Stanley-Jones also collaborated to finish the
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3796/&quot;&gt;documentation&lt;/a&gt;
for this new feature.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Adar Dembo worked on some tooling to allow users to migrate their Kudu clusters
from a single-master configuration to a multi-master one. Along the way, he
started building some common infrastructure for command-line tooling.&lt;/p&gt;
&lt;p&gt;Since Kudu’s initial release, it has included separate binaries for different
administrative or operational tools (e.g. &lt;code&gt;kudu-ts-cli&lt;/code&gt;, &lt;code&gt;kudu-ksck&lt;/code&gt;, &lt;code&gt;kudu-fs_dump&lt;/code&gt;,
&lt;code&gt;log-dump&lt;/code&gt;, etc). Despite having similar usage, these tools don’t share much code,
and the separate statically linked binaries make the Kudu packages take more disk
space than strictly necessary.&lt;/p&gt;
&lt;p&gt;Adar’s work has introduced a new top-level &lt;code&gt;kudu&lt;/code&gt; binary which exposes a set of subcommands,
much like the &lt;code&gt;git&lt;/code&gt; and &lt;code&gt;docker&lt;/code&gt; binaries with which readers may be familiar.
For example, a new tool he has built for dumping peer identifiers from a tablet’s
consensus metadata is triggered using &lt;code&gt;kudu tablet cmeta print_replica_uuids&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This new tool will be available in the upcoming 0.10.0 release; however, migration
of the existing tools to the new infrastructure has not yet been completed. We
expect that by Kudu 1.0, the old tools will be removed in favor of more subcommands
of the &lt;code&gt;kudu&lt;/code&gt; tool.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Todd Lipcon picked up the work started by David Alves in July to provide
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/2642/&quot;&gt;“exactly-once” semantics&lt;/a&gt; for write operations.
Todd carried the patch series through review and also completed integration of the
feature into the Kudu server processes.&lt;/p&gt;
&lt;p&gt;After testing the feature for several days on a large cluster under load,
the team decided to enable this new feature by default in Kudu 0.10.0.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Mike Percy resumed working on garbage collection of &lt;a href=&quot;https://gerrit.cloudera.org/#/c/2853/&quot;&gt;past versions of
updated and deleted rows&lt;/a&gt;. His &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3076/&quot;&gt;main
patch for the feature&lt;/a&gt; went through
several rounds of review and testing, but unfortunately missed the cut-off
for 0.10.0.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Alexey Serbin’s work to add doxygen-based documentation for the C++ Client API
was &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3840/&quot;&gt;committed&lt;/a&gt; this week. These
docs will be published as part of the 0.10.0 release.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Alexey also continued work on implementing the &lt;code&gt;AUTO_FLUSH_BACKGROUND&lt;/code&gt; write
mode for the C++ client. This feature makes it easier to implement high-throughput
ingest using the C++ API by automatically handling the batching and flushing of writes
based on a configurable buffer size.&lt;/p&gt;
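&lt;p&gt;For readers more familiar with the Java client, the idea is roughly the
following sketch, which uses the Java API’s analogous flush mode (the table and
column names are hypothetical, and the C++ API under review differs in its
details):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.SessionConfiguration;
public class BackgroundFlushSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder(&quot;master-host:7051&quot;).build();
    KuduTable table = client.openTable(&quot;metrics&quot;);
    KuduSession session = client.newSession();
    // Let the client batch writes and flush them in the background once the
    // mutation buffer fills up, instead of flushing on every apply() call.
    session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
    session.setMutationBufferSpace(10000);
    for (long i = 0; i &amp;lt; 1000000; i++) {
      Insert insert = table.newInsert();
      insert.getRow().addLong(&quot;time&quot;, i);
      insert.getRow().addDouble(&quot;value&quot;, Math.random());
      session.apply(insert);
    }
    // Make sure any buffered operations reach the server before shutting down.
    session.flush();
    session.close();
    client.shutdown();
  }
}
&lt;/code&gt;&lt;/pre&gt;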
&lt;p&gt;Alexey’s &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3952/&quot;&gt;patch&lt;/a&gt; has received several
rounds of review and looks likely to be committed soon. Detailed performance testing
will follow.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Congratulations to Ram Mettu for committing his first patch to Kudu this week!
Ram fixed a &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-1522&quot;&gt;bug in handling Alter Table with TIMESTAMP columns&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;upcoming-talks&quot;&gt;Upcoming talks&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Mike Percy will be speaking about Kudu this Wednesday at the
&lt;a href=&quot;http://www.meetup.com/Denver-Cloudera-User-Group/events/232782782/&quot;&gt;Denver Cloudera User Group&lt;/a&gt;
and on Thursday at the
&lt;a href=&quot;http://www.meetup.com/Boulder-Denver-Big-Data/events/232056701/&quot;&gt;Boulder/Denver Big Data Meetup&lt;/a&gt;.
If you’re based in the Boulder/Denver area, be sure not to miss these talks!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Want to learn more about a specific topic from this blog post? Shoot an email to the
&lt;a href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#117;&amp;#115;&amp;#101;&amp;#114;&amp;#064;&amp;#107;&amp;#117;&amp;#100;&amp;#117;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;kudu-user mailing list&lt;/a&gt; or
tweet at &lt;a href=&quot;https://twitter.com/ApacheKudu&quot;&gt;@ApacheKudu&lt;/a&gt;. Similarly, if you’re
aware of some Kudu news we missed, let us know so we can cover it in
a future post.&lt;/p&gt;</content><author><name>Todd Lipcon</name></author><summary>Welcome to the twentieth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu project.</summary></entry><entry><title>Apache Kudu Weekly Update August 8th, 2016</title><link href="/2016/08/08/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update August 8th, 2016" /><published>2016-08-08T00:00:00-07:00</published><updated>2016-08-08T00:00:00-07:00</updated><id>/2016/08/08/weekly-update</id><content type="html" xml:base="/2016/08/08/weekly-update.html">&lt;p&gt;Welcome to the nineteenth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu project.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;development-discussions-and-code-in-progress&quot;&gt;Development discussions and code in progress&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;After a couple months of work, Dan Burkert finished
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3648/&quot;&gt;adding add/remove range partition support&lt;/a&gt;
in the C++ client and in the master.&lt;/p&gt;
&lt;p&gt;Dan also posted a patch for review which &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3854/&quot;&gt;adds support for this
feature&lt;/a&gt; to the Java client. Dan is
expecting that this will be finished in time for the upcoming Kudu 0.10.0
release. A rough sketch of what the client-side calls might look like appears after this list.&lt;/p&gt;
&lt;p&gt;Misty Stanley-Jones started working on &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3796/&quot;&gt;documentation for this
feature&lt;/a&gt;. Readers of this
blog are encouraged to check out the docs and provide feedback!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Adar Dembo also fixed most of the remaining issues related to high availability
with multiple Kudu master processes. The upcoming Kudu 0.10.0 release will support
running multiple masters and transparently handling a transient failure of any
master process.&lt;/p&gt;
&lt;p&gt;Although multi-master should now be stable, some work remains in this area. Namely,
Adar is working on a &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3393/&quot;&gt;design for handling permanent failure of a machine hosting
a master&lt;/a&gt;. In this case, the administrator
will need to use some new tools to create a new master replica by copying data from
an existing one.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Todd Lipcon started a
&lt;a href=&quot;https://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201607.mbox/%3CCADY20s5WdR7KmB%3DEAHJwvzELhe9PXfnnGMLV%2B4t%3D%3Defw%3Dix8uw%40mail.gmail.com%3E&quot;&gt;discussion&lt;/a&gt;
on the dev mailing list about renaming the Kudu feature which creates new
replicas of tablets after they become under-replicated. Since its initial
introduction, this feature has been called “remote bootstrap”, but Todd pointed out
that this naming caused some confusion with the other “bootstrap” term used to
describe the process by which a tablet loads itself at startup.&lt;/p&gt;
&lt;p&gt;The discussion concluded with an agreement to rename the process to “Tablet Copy”.
Todd provided patches to perform this rename, which were committed at the end of
last week.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Congratulations to Attila Bukor for his first commit to Kudu! Attila
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3820/&quot;&gt;fixed an error in the quick-start documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
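&lt;p&gt;As a rough illustration of the feature Dan is working on, here is a hedged sketch of adding and later
dropping a range partition through the Java client’s &lt;code&gt;AlterTableOptions&lt;/code&gt;. It assumes the Java
support lands in roughly this shape; the master address, table name, column name, and bounds are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hedged sketch: add and drop a range partition on an existing table.
// Assumptions (not from the post): a table "events" range-partitioned on an
// INT64 column "ts"; lower bounds are inclusive and upper bounds exclusive.
import org.apache.kudu.Schema;
import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

public class RangePartitionSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("master-host:7051").build();
    try {
      KuduTable table = client.openTable("events");
      Schema schema = table.getSchema();

      // Add a new range partition covering ts in [1000, 2000).
      PartialRow lower = schema.newPartialRow();
      lower.addLong("ts", 1000);
      PartialRow upper = schema.newPartialRow();
      upper.addLong("ts", 2000);
      client.alterTable("events", new AlterTableOptions().addRangePartition(lower, upper));

      // Dropping the same partition later is symmetric.
      client.alterTable("events", new AlterTableOptions().dropRangePartition(lower, upper));
    } finally {
      client.shutdown();
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;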
&lt;h2 id=&quot;news-and-articles-from-around-the-web&quot;&gt;News and articles from around the web&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The New Stack published an &lt;a href=&quot;http://thenewstack.io/apache-kudu-fast-columnar-data-store-hadoop/&quot;&gt;introductory article about Kudu&lt;/a&gt;.
The article was based on a recent interview with Todd Lipcon
and covers topics such as the origin of the name “Kudu”, where Kudu fits into the
Apache Hadoop ecosystem, and goals for the upcoming 1.0 release.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Want to learn more about a specific topic from this blog post? Shoot an email to the
&lt;a href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#117;&amp;#115;&amp;#101;&amp;#114;&amp;#064;&amp;#107;&amp;#117;&amp;#100;&amp;#117;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;kudu-user mailing list&lt;/a&gt; or
tweet at &lt;a href=&quot;https://twitter.com/ApacheKudu&quot;&gt;@ApacheKudu&lt;/a&gt;. Similarly, if you’re
aware of some Kudu news we missed, let us know so we can cover it in
a future post.&lt;/p&gt;</content><author><name>Todd Lipcon</name></author><summary>Welcome to the nineteenth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu project.</summary></entry><entry><title>Apache Kudu Weekly Update July 26, 2016</title><link href="/2016/07/26/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu Weekly Update July 26, 2016" /><published>2016-07-26T00:00:00-07:00</published><updated>2016-07-26T00:00:00-07:00</updated><id>/2016/07/26/weekly-update</id><content type="html" xml:base="/2016/07/26/weekly-update.html">&lt;p&gt;Welcome to the eighteenth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu project.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;project-news&quot;&gt;Project news&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Kudu has graduated from the Apache Incubator and is now a Top-Level Project! All the details
are in this &lt;a href=&quot;http://kudu.apache.org/2016/07/25/asf-graduation.html&quot;&gt;blog post&lt;/a&gt;.
Mike Percy and Todd Lipcon made a few updates to the website to reflect the project’s
new name and status.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;development-discussions-and-code-in-progress&quot;&gt;Development discussions and code in progress&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Dan Burkert contributed a few patches that repackage the Java client under &lt;code&gt;org.apache.kudu&lt;/code&gt;
in place of &lt;code&gt;org.kududb&lt;/code&gt;. This was done in a &lt;strong&gt;backward-incompatible&lt;/strong&gt; way, meaning that import
statements will have to be modified in existing Java code to compile against a newer Kudu JAR
version (from 0.10.0 onward). This stems from &lt;a href=&quot;http://mail-archives.apache.org/mod_mbox/kudu-dev/201605.mbox/%3CCAGpTDNcJohQBgjzXafXJQdqmBB4sL495p5V_BJRXk_nAGWbzhA@mail.gmail.com%3E&quot;&gt;a discussion&lt;/a&gt;
initiated in May. It won’t have an impact on C++ or Python users, and it doesn’t affect wire
compatibility.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Still on the Java-side, J-D Cryans pushed &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3055/&quot;&gt;a patch&lt;/a&gt;
that completely changes how Exceptions are managed. Before this change, users had to introspect
generic Exception objects, making it a guessing game and discouraging good error handling.
Now, the synchronous client’s methods throw &lt;code&gt;KuduException&lt;/code&gt;, which packages a &lt;code&gt;Status&lt;/code&gt; object
that can be interrogated. This is very similar to how the C++ API works.&lt;/p&gt;
&lt;p&gt;Existing code that uses the new Kudu JAR should still compile, since this change replaces the generic
&lt;code&gt;Exception&lt;/code&gt; with the more specific &lt;code&gt;KuduException&lt;/code&gt;. Error handling that string-matches
exception messages should now interrogate the provided &lt;code&gt;Status&lt;/code&gt; object instead; a short sketch of
the pattern follows this list.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Alexey Serbin’s &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3619/&quot;&gt;patch&lt;/a&gt; that adds Doxygen-based
documentation was pushed and the new API documentation for C++ developers will be available
with the next release.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Todd has made many improvements to the &lt;code&gt;ksck&lt;/code&gt; tool over the last week. Building upon Will
Berkeley’s &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3632/&quot;&gt;WIP patch for KUDU-1516&lt;/a&gt;, &lt;code&gt;ksck&lt;/code&gt; can
now detect more problematic situations, such as a tablet lacking a majority of its replicas on
live tablet servers, or replicas that aren’t in a good state.
&lt;code&gt;ksck&lt;/code&gt; is also &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3705/&quot;&gt;now faster&lt;/a&gt; when run against a large
cluster with a lot of tablets, among other improvements.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;As mentioned last week, Dan has been working on &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3648/&quot;&gt;adding add/remove range partition support&lt;/a&gt;
in the C++ client and in the master. The patch has been through many rounds of review and
testing and it’s getting close to completion. Meanwhile, J-D started looking at adding support
for this functionality in the &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3731/&quot;&gt;Java client&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Adar Dembo is also hard at work on the master. The &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3609/&quot;&gt;series&lt;/a&gt;
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3610/&quot;&gt;of&lt;/a&gt; &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3611/&quot;&gt;patches&lt;/a&gt;
he published earlier this month to have the tablet servers heartbeat to all the masters is
nearing the finish line.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
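&lt;p&gt;Here is a brief, hedged sketch of the error-handling pattern J-D’s change enables: catch
&lt;code&gt;KuduException&lt;/code&gt; and inspect its &lt;code&gt;Status&lt;/code&gt; rather than string-matching messages. The master
address, table name, and column name are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hedged sketch of the KuduException/Status pattern described above.
// Assumptions (not from the post): a table "metrics" keyed on a STRING column "host".
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;

public class StatusHandlingSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("master-host:7051").build();
    try {
      KuduTable table = client.openTable("metrics");
      KuduSession session = client.newSession();
      Insert insert = table.newInsert();
      insert.getRow().addString("host", "example.com");
      session.apply(insert);
      session.flush();
    } catch (KuduException e) {
      // Interrogate the packaged Status instead of string-matching e.getMessage().
      if (e.getStatus().isNotFound()) {
        System.err.println("No such table: " + e.getStatus().getMessage());
      } else if (e.getStatus().isTimedOut()) {
        System.err.println("Operation timed out: " + e.getStatus().getMessage());
      } else {
        throw e;
      }
    } finally {
      client.shutdown();
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;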
&lt;p&gt;Want to learn more about a specific topic from this blog post? Shoot an email to the
&lt;a href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#117;&amp;#115;&amp;#101;&amp;#114;&amp;#064;&amp;#107;&amp;#117;&amp;#100;&amp;#117;&amp;#046;&amp;#105;&amp;#110;&amp;#099;&amp;#117;&amp;#098;&amp;#097;&amp;#116;&amp;#111;&amp;#114;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;kudu-user mailing list&lt;/a&gt; or
tweet at &lt;a href=&quot;https://twitter.com/ApacheKudu&quot;&gt;@ApacheKudu&lt;/a&gt;. Similarly, if you’re
aware of some Kudu news we missed, let us know so we can cover it in
a future post.&lt;/p&gt;</content><author><name>Jean-Daniel Cryans</name></author><summary>Welcome to the eighteenth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu project.</summary></entry><entry><title>The Apache Software Foundation Announces Apache® Kudu™ as a Top-Level Project</title><link href="/2016/07/25/asf-graduation.html" rel="alternate" type="text/html" title="The Apache Software Foundation Announces Apache&amp;reg; Kudu&amp;trade; as a Top-Level Project" /><published>2016-07-25T00:00:00-07:00</published><updated>2016-07-25T00:00:00-07:00</updated><id>/2016/07/25/asf-graduation</id><content type="html" xml:base="/2016/07/25/asf-graduation.html">&lt;p&gt;The following post was originally published on &lt;a href=&quot;https://blogs.apache.org/foundation/entry/apache_software_foundation_announces_apache&quot;&gt;The Apache Software Foundation Blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Open Source columnar storage engine enables fast analytics across the Internet of Things, time series, cybersecurity, and other Big Data applications in the Apache Hadoop ecosystem&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Forest Hill, MD –25 July 2016–&lt;/strong&gt; The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Kudu™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;Apache Kudu is an Open Source columnar storage engine built for the Apache Hadoop ecosystem designed to enable flexible, high-performance analytic pipelines.&lt;/p&gt;
&lt;p&gt;“Under the Apache Incubator, the Kudu community has grown to more than 45 developers and hundreds of users,” said Todd Lipcon, Vice President of Apache Kudu and Software Engineer at Cloudera. “We are excited to be recognized for our strong Open Source community and are looking forward to our upcoming 1.0 release.”&lt;/p&gt;
&lt;p&gt;Optimized for lightning-fast scans, Kudu is particularly well suited to hosting time-series data and various types of operational data. In addition to its impressive scan speed, Kudu supports many operations available in traditional databases, including real-time insert, update, and delete operations. Kudu enables a “bring your own SQL” philosophy, and supports being accessed by multiple different query engines including such other Apache projects as Drill, Spark, and Impala (incubating).&lt;/p&gt;
&lt;p&gt;Apache Kudu is in use at diverse companies and organizations across many industries, including retail, online service delivery, risk management, and digital advertising.&lt;/p&gt;
&lt;p&gt;“Using Apache Kudu alongside interactive SQL tools like Apache Impala (incubating) has allowed us to deploy a next-generation platform for real-time analytics and online reporting,” said Baoqiu Cui, Chief Architect at Xiaomi. “Apache Kudu has been deployed in production at Xiaomi for more than six months and has enabled us to improve key reliability and performance metrics for our customers. We are excited to see Kudu graduate to a top-level project and look forward to continuing to contribute to its success.”&lt;/p&gt;
&lt;p&gt;“We are already seeing the many benefits of Apache Kudu. In fact we’re using its combination of fast scans and fast updates for upcoming releases of our risk solutions,” said Cory Isaacson, CTO at Risk Management Solutions, Inc. “Kudu is performing well, and RMS is proud to have contributed to the project’s integration with Apache Spark.”&lt;/p&gt;
&lt;p&gt;“The Internet of Things, cybersecurity and other fast data drivers highlight the demands that real-time analytics place on Big Data platforms,” said Arvind Prabhakar, Apache Software Foundation member and CTO of StreamSets. “Apache Kudu fills a key architectural gap by providing an elegant solution spanning both traditional analytics and fast data access. StreamSets provides native support for Apache Kudu to help build real-time ingestion and analytics for our users.”&lt;/p&gt;
&lt;p&gt;“Graduation to a Top-Level Project marks an important milestone in the Apache Kudu community, but we are really just beginning to achieve our vision of a hybrid storage engine for analytics and real-time processing,” added Lipcon. “As our community continues to grow, we welcome feedback, use cases, bug reports, patch submissions, documentation, new integrations, and all other contributions.”&lt;/p&gt;
&lt;p&gt;The Apache Kudu project welcomes contributions and community participation through mailing lists, a Slack channel, face-to-face MeetUps, and other events. Catch Apache Kudu in action at Strata + Hadoop World, 26-29 September 2016 in New York.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Availability and Oversight&lt;/strong&gt;&lt;br /&gt;
Apache Kudu software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. For project updates, downloads, documentation, and ways to become involved with Apache Kudu, visit &lt;a href=&quot;http://kudu.apache.org/&quot;&gt;http://kudu.apache.org/&lt;/a&gt; , &lt;a href=&quot;https://twitter.com/ApacheKudu&quot;&gt;@ApacheKudu&lt;/a&gt;, and &lt;a href=&quot;http://kudu.apache.org/blog/&quot;&gt;http://kudu.apache.org/blog/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;About the Apache Incubator&lt;/strong&gt;&lt;br /&gt;
The Apache Incubator is the entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects wishing to join the ASF enter through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. For more information, visit &lt;a href=&quot;http://incubator.apache.org/&quot;&gt;http://incubator.apache.org/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;About The Apache Software Foundation (ASF)&lt;/strong&gt;&lt;br /&gt;
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server –the world’s most popular Web server software. Through the ASF’s meritocratic process known as “The Apache Way,” more than 550 individual Members and 5,300 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation’s official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSigma, LeaseWeb, Microsoft, OPDi, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, WANdisco, and Yahoo. For more information, visit &lt;a href=&quot;http://www.apache.org/&quot;&gt;http://www.apache.org/&lt;/a&gt; and &lt;a href=&quot;https://twitter.com/TheASF&quot;&gt;https://twitter.com/TheASF&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;© The Apache Software Foundation. “Apache”, “Kudu”, “Apache Kudu”, “Drill”, “Apache Drill”, “Hadoop”, “Apache Hadoop”, “Apache Impala (incubating)”, “Spark”, “Apache Spark”, and “ApacheCon” are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.&lt;/p&gt;</content><author><name>Jean-Daniel Cryans</name></author><summary>The following post was originally published on The Apache Software Foundation Blog.
Open Source columnar storage engine enables fast analytics across the Internet of Things, time series, cybersecurity, and other Big Data applications in the Apache Hadoop ecosystem
Forest Hill, MD –25 July 2016– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Kudu™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.</summary></entry><entry><title>Apache Kudu (incubating) Weekly Update July 18, 2016</title><link href="/2016/07/18/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu (incubating) Weekly Update July 18, 2016" /><published>2016-07-18T00:00:00-07:00</published><updated>2016-07-18T00:00:00-07:00</updated><id>/2016/07/18/weekly-update</id><content type="html" xml:base="/2016/07/18/weekly-update.html">&lt;p&gt;Welcome to the seventeenth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu (incubating) project.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;development-discussions-and-code-in-progress&quot;&gt;Development discussions and code in progress&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Dan Burkert has continued making progress on support for non-covering range partitioned
tables. This past week, he posted a code review for
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3648/&quot;&gt;adding and dropping range partitions to the master&lt;/a&gt;
and another for &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3581/&quot;&gt;handling non-covering ranges in the C++ client&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Adar Dembo continued working on addressing multi-master issues, as he explained in this
&lt;a href=&quot;http://kudu.apache.org/2016/06/24/multi-master-1-0-0.html&quot;&gt;blog post&lt;/a&gt;. This past week
he worked on &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3550/&quot;&gt;tackling various race conditions&lt;/a&gt;
that were possible when master operations were submitted concurrently with a master leader election.&lt;/p&gt;
&lt;p&gt;Adar also posted patches for most of the remaining known server-side issues, including
posting a comprehensive &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3611/&quot;&gt;stress test&lt;/a&gt; which issues
client DDL operations while triggering master crashes and associated failovers.&lt;/p&gt;
&lt;p&gt;As always, Adar’s commit messages are instructive and fun reads for those interested in
following along.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;As mentioned last week, David Alves has been making a lot of progress on the implementation
of the replay cache. Many patches landed in master this week, including:
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3192/&quot;&gt;RPC system integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3449/&quot;&gt;Integration with replicated writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3519/&quot;&gt;Correctness/stress tests&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Currently, this new feature is disabled by default, as the support for evicting elements
from the cache is not yet complete. This last missing feature is now
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3628/&quot;&gt;up for review&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;Alexey Serbin has been working on adding Doxygen-based documentation for the public
C++ API. This was originally &lt;a href=&quot;https://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201606.mbox/%3CCANbMB4wtMz=JKwgKMNPvkjWX3t9NxCeGt04NmL=SyESyzUMWJg@mail.gmail.com%3E&quot;&gt;proposed on the mailing list&lt;/a&gt;
a couple of weeks ago, and last week, Alexey posted the
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3619/&quot;&gt;initial draft of the implementation&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;project-news&quot;&gt;Project news&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&quot;http://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201607.mbox/%3CCAGpTDNesxH43C-Yt5fNwpEpAxfb2P62Xpdi8AqT8jfvjeqnu0w%40mail.gmail.com%3E&quot;&gt;discussion&lt;/a&gt;
on the dev mailing list about having an intermediate release, called 0.10.0, before 1.0.0,
has wound down. The consensus seems to be that the development team is in favor of this
release. Accordingly, the version number in the master branch has been changed back to
0.10.0-SNAPSHOT.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Want to learn more about a specific topic from this blog post? Shoot an email to the
&lt;a href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#117;&amp;#115;&amp;#101;&amp;#114;&amp;#064;&amp;#107;&amp;#117;&amp;#100;&amp;#117;&amp;#046;&amp;#105;&amp;#110;&amp;#099;&amp;#117;&amp;#098;&amp;#097;&amp;#116;&amp;#111;&amp;#114;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;kudu-user mailing list&lt;/a&gt; or
tweet at &lt;a href=&quot;https://twitter.com/ApacheKudu&quot;&gt;@ApacheKudu&lt;/a&gt;. Similarly, if you’re
aware of some Kudu news we missed, let us know so we can cover it in
a future post.&lt;/p&gt;</content><author><name>Todd Lipcon</name></author><summary>Welcome to the seventeenth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu (incubating) project.</summary></entry><entry><title>Apache Kudu (incubating) Weekly Update July 11, 2016</title><link href="/2016/07/11/weekly-update.html" rel="alternate" type="text/html" title="Apache Kudu (incubating) Weekly Update July 11, 2016" /><published>2016-07-11T00:00:00-07:00</published><updated>2016-07-11T00:00:00-07:00</updated><id>/2016/07/11/weekly-update</id><content type="html" xml:base="/2016/07/11/weekly-update.html">&lt;p&gt;Welcome to the sixteenth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu (incubating) project.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;development-discussions-and-code-in-progress&quot;&gt;Development discussions and code in progress&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Todd Lipcon &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3517/&quot;&gt;changed the default&lt;/a&gt;
bloom filter false positive (FP) ratio from 1% to 0.01%. The original value
wasn’t scientifically chosen, but testing with billions of rows on a 5-node
cluster showed a 2x insert throughput improvement at the cost of
some more disk space.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;J-D Cryans has been fixing some recently introduced bugs in the Java client.
For example, see &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3541/&quot;&gt;this patch&lt;/a&gt; and
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3586/&quot;&gt;that one&lt;/a&gt;. Testability is a major
concern right now in the Java client since triggering those issues requires
a lot of time and data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dan Burkert has been making progress on support for non-covering range partitioned
tables. The Java client &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3388/&quot;&gt;now supports&lt;/a&gt;
such tables and Dan is currently implementing new functionality to add and remove
tablets via simple APIs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;David Alves is also making a lot of progress on the replay cache, a new server-side
component that makes it possible for tablets to recognize retried client write
operations and provide exactly-once semantics. The main patch is up for review
&lt;a href=&quot;https://gerrit.cloudera.org/#/c/3449/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Adar Dembo is working on addressing multi-master issues, as he explained in this
&lt;a href=&quot;http://kudu.apache.org/2016/06/24/multi-master-1-0-0.html&quot;&gt;blog post&lt;/a&gt;. He just put
up for review patches that enable tablet servers to heartbeat to all masters. Part
one is &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3609/&quot;&gt;available here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Misty prepared a document with J-D that contains instructions on how to release a
new Kudu version. It is &lt;a href=&quot;https://gerrit.cloudera.org/#/c/3614/&quot;&gt;up for review here&lt;/a&gt;
if you are curious or want to learn more about this process.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;project-news&quot;&gt;Project news&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&quot;https://s.apache.org/l6Tw&quot;&gt;vote&lt;/a&gt; to graduate Kudu from the ASF’s Incubator passed!
The next step is for the ASF Board to vote on the resolution at their next meeting.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There’s &lt;a href=&quot;http://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201607.mbox/%3CCAGpTDNesxH43C-Yt5fNwpEpAxfb2P62Xpdi8AqT8jfvjeqnu0w%40mail.gmail.com%3E&quot;&gt;a discussion&lt;/a&gt;
on the dev mailing list about having an intermediate release, called 0.10.0, before 1.0.0.
The immediate issue is that the version in the code is currently “1.0.0-SNAPSHOT”, which doesn’t
leave room for another release before 1.0.0, but the bigger issue is that the code is still churning a
lot, which doesn’t make for a stable 1.0 release.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Want to learn more about a specific topic from this blog post? Shoot an email to the
&lt;a href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#117;&amp;#115;&amp;#101;&amp;#114;&amp;#064;&amp;#107;&amp;#117;&amp;#100;&amp;#117;&amp;#046;&amp;#105;&amp;#110;&amp;#099;&amp;#117;&amp;#098;&amp;#097;&amp;#116;&amp;#111;&amp;#114;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;kudu-user mailing list&lt;/a&gt; or
tweet at &lt;a href=&quot;https://twitter.com/ApacheKudu&quot;&gt;@ApacheKudu&lt;/a&gt;. Similarly, if you’re
aware of some Kudu news we missed, let us know so we can cover it in
a future post.&lt;/p&gt;</content><author><name>Jean-Daniel Cryans</name></author><summary>Welcome to the sixteenth edition of the Kudu Weekly Update. This weekly blog post
covers ongoing development and news in the Apache Kudu (incubating) project.</summary></entry></feed>