blob: c014d56cb3c56407b8dd8b858de2e09afa938913 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu - Consistency in Apache Kudu, Part 1</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href=""
<!-- Custom styles for this template -->
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="" />
<link rel="alternate" type="application/atom+xml"
title="RSS Feed for Apache Kudu blog"
href="/feed.xml" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src=""></script>
<script src=""></script>
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<a class="logo" href="/"><img
srcset="// 1x, // 2x"
alt="Apache Kudu"/></a>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li >
<a href="/">Home</a>
<li >
<a href="/overview.html">Overview</a>
<li >
<a href="/docs/">Documentation</a>
<li >
<a href="/releases/">Releases</a>
<li class="active">
<a href="/blog/">Blog</a>
<!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here
that doesn't also appear elsewhere on the site. -->
<li class="dropdown">
<a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li class="dropdown-header">GET IN TOUCH</li>
<li><a class="icon email" href="/community.html">Mailing Lists</a></li>
<li><a class="icon slack" href="">Slack Channel</a></li>
<li role="separator" class="divider"></li>
<li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li>
<li><a href="/committers.html">Project Committers</a></li>
<!--<li><a href="/roadmap.html">Roadmap</a></li>-->
<li><a href="/community.html#contributions">How to Contribute</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">DEVELOPER RESOURCES</li>
<li><a class="icon github" href="">GitHub</a></li>
<li><a class="icon gerrit" href="">Gerrit Code Review</a></li>
<li><a class="icon jira" href="">JIRA Issue Tracker</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">SOCIAL MEDIA</li>
<li><a class="icon twitter" href="">Twitter</a></li>
<li><a href="">Reddit</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li>
<li><a href="" target="_blank">Security</a></li>
<li><a href="" target="_blank">Sponsorship</a></li>
<li><a href="" target="_blank">Thanks</a></li>
<li><a href="" target="_blank">License</a></li>
<li >
<a href="/faq.html">FAQ</a>
</ul><!-- /.nav -->
</div><!-- /#navbar -->
</div><!-- /.container-fluid -->
<div class="row header">
<div class="col-lg-12">
<h2><a href="/blog">Apache Kudu Blog</a></h2>
<div class="row-fluid">
<div class="col-lg-9">
<h1 class="entry-title">Consistency in Apache Kudu, Part 1</h1>
<p class="meta">Posted 18 Sep 2017 by David Alves</p>
<div class="entry-content">
<p>In this series of short blog posts we will introduce Kudu’s consistency model,
its design and ultimate goals, current features, and next steps.
On the way, we’ll shed some light on the more relevant components and how they
fit together.</p>
<p>In Part 1 of the series (this one), we’ll cover motivation and design trade-offs, the end goals and
the current status.</p>
<h2 id="what-is-consistency-and-why-is-it-relevant">What is “consistency” and why is it relevant?</h2>
<p>In order to cope with ever increasing data volumes, modern storage systems like Kudu have to support
many concurrent users while coordinating requests across many machines, each with many threads executing
work at the same time. However, application developers shouldn’t have to understand the internal
details of how these systems implement this parallel, distributed, execution in order to write
correct applications. <em>Consistency in the context of parallel, distributed systems roughly
refers to how the system behaves in comparison to a single-machine, single-thread system</em>. In a
single-threaded, single-machine storage system operations happen one-at-a-time, in a clearly
defined order, making correct applications easy to code and reason about. A developer writing an
application against such a system doesn’t have to care about how simultaneous operations interact
or about ordering anomalies, so the code is simpler, but more importantly, cognitive load is greatly
reduced, freeing focus for the application logic itself.</p>
<p>While such a simple system is definitely possible to build, it wouldn’t be able to cope with very
large amounts of data. In order to deal with big data volumes and write throughputs modern storage
systems like Kudu are designed to be distributed, storing and processing data across many machines
and cores. This means that many things happen simultaneously in the same and different machines,
that there are more moving parts and thus more oportunity for mis-orderings and for components
to fail. How far systems like Kudu go (or don’t go) in emulating the simple single-threaded, single-machine
system a distributed, parallel setting where failures are common is roughly what is referred to
as how <em>“consistent”</em> the system is.</p>
<p><em>Consistency</em> as a term is somewhat overloaded in the distributed systems and database communities,
there are many different models, properties, different names for the same concept, and often
different concepts under the same name. This post is not meant to introduce these concepts
as there are excellent references already available elsewhere (we recommend Kyle Kinsbury’s excellent
series of blog posts on the matter, like <a href="">this one</a>).
Throughout this and follow-up posts we’ll refer to consistency loosely as the <strong>C</strong> in <strong>CAP</strong>[1]
in some cases and as the <strong>I</strong> in <strong>ACID</strong>[2] in others; we’ll try to be specific when relevant.</p>
<h2 id="design-decisions-trade-offs-and-motivation">Design decisions, trade-offs and motivation</h2>
<p>Consistency is essentially about ordering and ordering usually has a cost. Distributed storage
system design must choose to prioritize some properties over others according to the target use
cases. That is, trade-offs must be made or, borrowing a term from economics, there is
“no free lunch”. Different systems choose different trade-off points; for instance, systems inspired by <em>Dynamo</em>[3], usually favor availability in the consistency/availability
trade-off: by allowing a write to a data item to succeed even when a majority (or even all) of the
replicas serving that data item are unreachable, Dynamo’s design is minimizing insertion errors and
insert latency (related to availability) at the cost having to perform extra work for value
reconciliation on reads and possibly returning stale or disordered values (related to consistency).
On the other end of the spectrum, traditional DBMS design is often driven by the need to support
transactions of arbitrary complexity while providing the users stronger, predictable, semantics,
favoring consistency at the cost of scalability and availability.</p>
<p>Kudu’s overarching goal is to enable <em>fast analytic workloads over large amounts of mutable</em> data,
meaning it was designed to perform fast scans over large volumes of data stored in many servers.
In practical terms this means that, when given a choice, more often than not, we opted for the
design that would enable Kudu to have faster scan performance (i.e. favoring reads even if it meant pushing
a bit more work to the path that mutates data, i.e. writes). This does not mean that the write path
was not a concern altogether. In fact, modern storage systems like <em>Google’s Spanner</em>[4]
global-scale database demonstrate that, with the right set of trade-offs, it is possible to have strong
consistency semantics with write latencies and overall availability that are adequate for most use
cases (e.g. Spanner achieves 5 9’s of availability). For the write path, we often made similar choices in Kudu.</p>
<p>Another important aspect that directed our design decisions is the type of <em>write workload</em> we targeted.
Traditionally, analytical storage systems target periodic bulk write workloads and a continuous
stream of analytical scans. This design is often problematic in that it forces users to have to
build complex pipelines where data is accumulated in one place for later loading into the storage
system. Moreover, beyond the architectural complexity, this kind of design usually also
means that the data that is available for analytics is not the most recent. In Kudu we aimed for
enabling continuous ingest, i.e. having a continuous stream of small writes, obviating the need to
assemble a pipeline for data accumulation/loading and allowing analytical scans to have access to
the most recent data. Another important aspect of the write workloads that we targeted in Kudu is
that they are append-mostly, i.e. most insert new values into the table, with a smaller percentage
updating currently existing values. Both the average write size and the data distribution influence
the design of the write path, as we’ll see in the following sections.</p>
<p>One last concern we had in mind is that different users have different needs when it comes to
consistency semantics, particularly as it applies to an analytical storage system like Kudu. For
some users consistency isn’t a primary concern, they just want fast scans, and the ability to
update/insert/delete values without needing to build a complex pipeline. For example, many machine
learning models are mostly insensitive to data recency or ordering so, when using Kudu to store data that
will be used to train such a model, consistency is often not as primary a concern as read/write performance is.
In other cases consistency is a much higher priority. For example, when using Kudu to
store transaction data for fraud analysis it might be important to capture if events are causally
related. Fraudulent transactions might be characterized by a specific sequence of events and when
retrieving that data it might be important for the scan result to reflect that sequence. Kudu’s
design allows users to make a trade-off between consistency and performance at scan time. That is,
users can choose to have stronger consistency semantics for scans at the penalty of latency and
throughput or they can choose to weaken the consistency semantics for an extra performance boost.</p>
<h3 id="note">Note</h3>
<p>Kudu currently lacks support for atomic multi-row mutation operations (i.e. mutation
operations to more than one row in the same or different tablets, planned as a future feature).
So, when discussing writes, we’ll be talking about the consistency semantics of single row mutations.
In this context we’ll discuss Kudu’s properties more from a key/value store standpoint. On the
other hand Kudu is an analytical storage engine so, for the read path, we’ll also discuss the
semantics of large (multi-row) scans. This moves the discussion more into the field of traditional
DBMSs. These ingredients make for a non-traditional discussion that is not exactly apples-to-apples
with what the reader might be familiar with, but our hope is that it still provides valuable, or
at least interesting, insight.</p>
<h2 id="consistency-options-in-kudu">Consistency options in Kudu</h2>
<p>Consistency, as well as other properties, are underpinned in Kudu by the concept of a <em>timestamp</em>.
In follow-up posts we’ll look into detail how these are assigned and how they are assembled. For now
it’s sufficient to know that a timestamp is a single, usually large, number that has some mapping
to wall time. Each mutation of a Kudu row is tagged with one such timestamp. Globally, these timestamps
form a partial order over all the rows with the particularity that causally related mutations (e.g.
a write mutation that is the result of the value obtained from a previous read) may be required to
have increasing timestamps, depending on the user’s choices.</p>
<p>Row mutations performed by a single client <em>instance</em> are guaranteed to have increasing timestamps
thus reflecting their potential causal relationship. This property is always enforced. However
there are two major <em>“knobs”</em> that are available to the user to make performance trade-offs, the
<code class="language-plaintext highlighter-rouge">Read</code> mode, and the <code class="language-plaintext highlighter-rouge">External Consistency</code> mode (see <a href="">here</a>
for more information on how to use the relevant APIs).</p>
<p>The first and most important knob, the <code class="language-plaintext highlighter-rouge">Read</code> mode, pertains to what is the guaranteed recency of
data resulting from scans. Since Kudu uses replication for availability and fault-tolerance, there
are always multiple replicas of any data item.
Not all replicas must be up-to-date so if the user cares about recency, e.g. if the user requires
that any data read includes all previously written data <em>from a single client instance</em> then it must
choose the <code class="language-plaintext highlighter-rouge">READ_AT_SNAPSHOT</code> read mode. With this mode enabled the client is guaranteed to observe
<strong>“READ YOUR OWN WRITES”</strong> semantics, i.e. scans from a client will always include all previous mutations
performed by that client. Note that this property is local to a single client instance, not a global
<p>The second “knob”, the <code class="language-plaintext highlighter-rouge">External Consistency</code> mode, defines the semantics of how reads and writes
are performed across multiple client instances. By default, <code class="language-plaintext highlighter-rouge">External Consistency</code> is set to
<code class="language-plaintext highlighter-rouge">CLIENT_PROPAGATED</code>, meaning it’s up to the user to coordinate a set of <em>timestamp tokens</em> with clients (even
across different machines) if they are performing writes/reads that are somehow causally linked.
If done correctly this enables <strong>STRICT SERIALIZABILITY</strong>[5], i.e. <strong>LINEARIZABILITY</strong>[6] and
<strong>SERIALIZABILITY</strong>[7] at the same time, at the cost of having the user coordinate the timestamp
tokens across clients (a survey of the meaning of these, and other definitions can be found
<a href="">here</a>).
The alternative setting for <code class="language-plaintext highlighter-rouge">External Consistency</code> is to have it set to
<code class="language-plaintext highlighter-rouge">COMMIT_WAIT</code> (experimental), which guarantees the same properties through a different means, by
implementing Google Spanner’s <em>TrueTime</em>. This comes at the cost of higher latency (depending on how
tightly synchronized the system clocks of the various tablet servers are), but doesn’t require users
to propagate timestamps programmatically.</p>
<h2 id="next-up">Next up</h2>
<p>In following posts we’ll look into the several components of Kudu’s architecture that come together
to enable the consistency semantics introduced in the previous section, including:</p>
<li>Transactions and the Transaction Driver</li>
<li>Concurrent execution with Multi-Version Concurrency Control</li>
<li>Exactly-Once semantics with Replay Cache</li>
<li>Replication, Crash Recovery with Consensus and the Write-Ahead-Log</li>
<li>Time keeping and timestamp assignment</li>
<h2 id="references">References</h2>
<p><a href=";rep=rep1&amp;type=pdf">[1]</a>: Armando Fox and Eric A. Brewer. 1999. Harvest, Yield, and Scalable Tolerant Systems. In Proceedings of the The Seventh Workshop on Hot Topics in Operating Systems (HOTOS ‘99). IEEE Computer Society, Washington, DC, USA.</p>
<p><a href="">[2]</a>: ACID - Wikipedia entry</p>
<p><a href="">[3]</a>: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: amazon’s highly available key-value store. In Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles (SOSP ‘07). ACM, New York, NY, USA.</p>
<p><a href="">[4]</a>: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s globally-distributed database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, USA.</p>
<p><a href="">[5]</a>: Gifford, David K. Information storage in a decentralized computer system. Diss. Stanford University, 1981.</p>
<p><a href="">[6]</a>: Herlihy, Maurice P., and Jeannette M. Wing. “Linearizability: A correctness condition for concurrent objects.” ACM Transactions on Programming Languages and Systems (TOPLAS) 12.3 (1990): 463-492.</p>
<p><a href="">[7]</a>: Papadimitriou, Christos H. “The serializability of concurrent database updates.” Journal of the ACM (JACM) 26.4 (1979): 631-653.</p>
<div class="col-lg-3 recent-posts">
<h3>Recent posts</h3>
<li> <a href="/2020/07/30/building-near-real-time-big-data-lake.html">Building Near Real-time Big Data Lake</a> </li>
<li> <a href="/2020/05/18/apache-kudu-1-12-0-release.html">Apache Kudu 1.12.0 released</a> </li>
<li> <a href="/2019/11/20/apache-kudu-1-11-1-release.html">Apache Kudu 1.11.1 released</a> </li>
<li> <a href="/2019/11/20/apache-kudu-1-10-1-release.html">Apache Kudu 1.10.1 released</a> </li>
<li> <a href="/2019/07/09/apache-kudu-1-10-0-release.html">Apache Kudu 1.10.0 Released</a> </li>
<li> <a href="/2019/04/30/location-awareness.html">Location Awareness in Kudu</a> </li>
<li> <a href="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html">Fine-Grained Authorization with Apache Kudu and Impala</a> </li>
<li> <a href="/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html">Testing Apache Kudu Applications on the JVM</a> </li>
<li> <a href="/2019/03/15/apache-kudu-1-9-0-release.html">Apache Kudu 1.9.0 Released</a> </li>
<li> <a href="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html">Transparent Hierarchical Storage Management with Apache Kudu and Impala</a> </li>
<li> <a href="/2018/12/11/call-for-posts.html">Call for Posts</a> </li>
<li> <a href="/2018/10/26/apache-kudu-1-8-0-released.html">Apache Kudu 1.8.0 Released</a> </li>
<li> <a href="/2018/09/26/index-skip-scan-optimization-in-kudu.html">Index Skip Scan Optimization in Kudu</a> </li>
<li> <a href="/2018/09/11/simplified-pipelines-with-kudu.html">Simplified Data Pipelines with Kudu</a> </li>
<li> <a href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">Getting Started with Kudu - an O'Reilly Title</a> </li>
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p class="small">
Copyright &copy; 2019 The Apache Software Foundation.
<p class="small">
Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu
project logo are either registered trademarks or trademarks of The
Apache Software Foundation in the United States and other countries.
<div class="col-md-3">
<a class="pull-right" href="">
<img src=""/>
<script src=""></script>
// Try to detect touch-screen devices. Note: Many laptops have touch screens.
$(document).ready(function() {
if ("ontouchstart" in document.documentElement) {
} else {
<script src=""
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
ga('create', 'UA-68448017-1', 'auto');
ga('send', 'pageview');
<script src=""></script>
anchors.options = {
placement: 'right',
visible: 'touch',