<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu - Transaction Semantics in Apache Kudu</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"
integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7"
crossorigin="anonymous">
<!-- Custom styles for this template -->
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.1/css/font-awesome.min.css" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="logo" href="/"><img
src="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png"
srcset="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png 1x, //d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_160px.png 2x"
alt="Apache Kudu"/></a>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li >
<a href="/">Home</a>
</li>
<li >
<a href="/overview.html">Overview</a>
</li>
<li class="active">
<a href="/docs/">Documentation</a>
</li>
<li >
<a href="/releases/">Releases</a>
</li>
<li >
<a href="/blog/">Blog</a>
</li>
<!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here
that doesn't also appear elsewhere on the site. -->
<li class="dropdown">
<a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li class="dropdown-header">GET IN TOUCH</li>
<li><a class="icon email" href="/community.html">Mailing Lists</a></li>
<li><a class="icon slack" href="https://getkudu-slack.herokuapp.com/">Slack Channel</a></li>
<li role="separator" class="divider"></li>
<li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li>
<li><a href="/committers.html">Project Committers</a></li>
<li><a href="/ecosystem.html">Ecosystem</a></li>
<!--<li><a href="/roadmap.html">Roadmap</a></li>-->
<li><a href="/community.html#contributions">How to Contribute</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">DEVELOPER RESOURCES</li>
<li><a class="icon github" href="https://github.com/apache/incubator-kudu">GitHub</a></li>
<li><a class="icon gerrit" href="http://gerrit.cloudera.org:8080/#/q/status:open+project:kudu">Gerrit Code Review</a></li>
<li><a class="icon jira" href="https://issues.apache.org/jira/browse/KUDU">JIRA Issue Tracker</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">SOCIAL MEDIA</li>
<li><a class="icon twitter" href="https://twitter.com/ApacheKudu">Twitter</a></li>
<li><a href="https://www.reddit.com/r/kudu/">Reddit</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li>
<li><a href="https://www.apache.org/security/" target="_blank">Security</a></li>
<li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li>
<li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
<li><a href="https://www.apache.org/licenses/" target="_blank">License</a></li>
</ul>
</li>
<li >
<a href="/faq.html">FAQ</a>
</li>
</ul><!-- /.nav -->
</div><!-- /#navbar -->
</div><!-- /.container-fluid -->
</nav>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<div class="container">
<div class="row">
<div class="col-md-9">
<h1>Transaction Semantics in Apache Kudu</h1>
<div id="preamble">
<div class="sectionbody">
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
This document applies to Apache Kudu version 1.16.0. Please consult the
<a href="http://kudu.apache.org/releases/">documentation of the release</a> that matches
the version of your Kudu cluster.
</td>
</tr>
</table>
</div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph">
<p>This is a brief introduction to Kudu&#8217;s transaction and consistency semantics. For an
in-depth technical exposition of most of what is mentioned here, and why it is correct,
see the technical report <a href="#ht">[1]</a>.</p>
</div>
</div>
</div>
<div class="paragraph">
<p>Kudu&#8217;s transactional semantics and architecture are inspired by state-of-the-art
systems such as Spanner <a href="#spanner">[2]</a> and Calvin <a href="#fdt">[3]</a>. Kudu builds upon decades of database
research. The core philosophy is to make the lives of developers easier by providing transactions
with simple, strong semantics, without sacrificing performance or the ability to tune to different
requirements.</p>
</div>
<div class="paragraph">
<p>Kudu currently allows the following operations:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><strong>Write operations</strong> are sets of rows to be inserted, updated, or deleted in the storage
engine, in a single tablet with multiple replicas. Write operations do not have separate
"read sets", i.e. they do not scan existing data before performing the write. Each write
is only concerned with the previous state of the rows that are about to change.
Writes are not "committed" explicitly by the user. Instead, they are committed automatically
by the system, after completion.</p>
</li>
<li>
<p><strong>Write transactions</strong> are groups of write operations across potentially multiple tablets
that are committed atomically upon the user&#8217;s request. Once each write operation within a
transaction is complete, the user sends an explicit "commit" request to make the contents of the
transaction visible to readers at a single timestamp.</p>
</li>
<li>
<p><strong>Scans</strong> are read operations that can traverse multiple tablets and read information
with different levels of consistency or correctness guarantees. Scans can perform
time-travel reads, i.e. the user is able to set a scan timestamp in the past and
get back results that reflect the state of the storage engine at that point in time.</p>
</li>
</ul>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
<div class="title">Before We Begin</div>
<div class="ulist">
<ul>
<li>
<p>The term <em>timestamp</em> is mentioned several times to illustrate the
functionality, but <em>timestamp</em> is an internal concept mostly invisible to users,
except when setting a timestamp on a <code>KuduScanner</code>.</p>
</li>
<li>
<p>We generally refer to methods and classes of the C&#43;&#43; client. While the Java
client mostly has analogous methods and classes, the exact names of the APIs
may differ.</p>
</li>
</ul>
</div>
</td>
</tr>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_single_tablet_write_operations"><a class="link" href="#_single_tablet_write_operations">Single tablet write operations</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>Kudu employs <em>Multiversion Concurrency Control (MVCC)</em> and the <em>Raft consensus</em>
algorithm <a href="#consensus">[4]</a>. Each write operation in Kudu must go through the
tablet&#8217;s leader.</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>The leader acquires all locks for the rows that it will change.</p>
</li>
<li>
<p>The leader assigns the write a timestamp before the write is submitted for
replication. This timestamp will be the write&#8217;s "tag" in MVCC.</p>
</li>
<li>
<p>After a majority of replicas acknowledges the change, the actual rows are changed.</p>
</li>
<li>
<p>After the changes are complete, they are made visible to concurrent writes
and reads, atomically.</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>All replicas of a tablet observe the same order of operations, and if a write
operation is assigned timestamp <em>n</em> and changes row <em>x</em>, a second write operation
at timestamp <em>m &gt; n</em> is guaranteed to see the new value of <em>x</em>.</p>
</div>
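<div class="paragraph">
<p>The ordering guarantee above can be illustrated with a toy model. The following is a minimal
sketch, not Kudu&#8217;s implementation: <code>ToyTablet</code>, its logical clock, and its in-memory version
lists are all hypothetical. It only shows that each write is tagged with a strictly increasing
timestamp and that a read at a given timestamp sees the newest version at or below it.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>
```python
import bisect


class ToyTablet:
    """Toy single-tablet MVCC store (hypothetical, not Kudu's implementation)."""

    def __init__(self):
        self.clock = 0        # hypothetical logical clock owned by the leader
        self.versions = {}    # row key -> list of (timestamp, value), ascending

    def write(self, key, value):
        # The leader assigns a fresh, strictly increasing timestamp to each write.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read(self, key, snapshot_ts):
        # Return the newest value whose timestamp does not exceed snapshot_ts.
        history = self.versions.get(key, [])
        timestamps = [ts for ts, _ in history]
        idx = bisect.bisect_right(timestamps, snapshot_ts)
        return history[idx - 1][1] if idx else None


tablet = ToyTablet()
n = tablet.write("x", "old")   # first write to row x, at timestamp n
m = tablet.write("x", "new")   # second write to row x, at timestamp m > n
```
</pre>
</div>
</div>
<div class="paragraph">
<p>In this sketch, <code>tablet.read("x", m)</code> returns the new value, while <code>tablet.read("x", n)</code> still
returns the old one, which is the same idea as a time-travel scan.</p>
</div>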
<div class="paragraph">
<p>This strict ordering of lock acquisition and timestamp assignment is enforced to be
consistent across all replicas of a tablet through consensus. Therefore, write operations
are totally ordered with regard to clock-assigned timestamps, relative to other writes
in the same tablet. In other words, writes have strict-serializable semantics,
though in an admittedly limited context. See this
<a href="http://www.bailis.org/blog/linearizability-versus-serializability">blog post</a>
for a little more context regarding what these semantics mean.</p>
</div>
<div class="paragraph">
<p>While Isolated and Durable in an ACID sense, multi-row write operations, even within a single
tablet, are not fully Atomic unless they are a part of a multi-tablet write transaction. The failure
of a single write in a batch operation does not roll back the operation, but produces per-row
errors.</p>
</div>
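<div class="paragraph">
<p>As a rough illustration of per-row errors, the sketch below applies a batch of inserts to a
plain dictionary. The function <code>apply_batch</code> and the duplicate-key failure are assumptions made
for the example; the point is only that one failing row is reported while the other rows in the
batch remain applied.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>
```python
def apply_batch(table, rows):
    """Insert (key, value) rows into `table`, a dict; collect per-row errors."""
    errors = []
    for key, value in rows:
        if key in table:
            # A duplicate key fails only this row; other rows stay applied.
            errors.append((key, "already present"))
        else:
            table[key] = value
    return errors


table = {"a": 1}
errors = apply_batch(table, [("b", 2), ("a", 99), ("c", 3)])
```
</pre>
</div>
</div>
<div class="paragraph">
<p>Here the rows for <code>b</code> and <code>c</code> are inserted even though the row for <code>a</code> fails, so an application
that needs all-or-nothing behavior has to inspect the returned errors itself.</p>
</div>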
</div>
</div>
<div class="sect1">
<h2 id="_multi_tablet_write_operations"><a class="link" href="#_multi_tablet_write_operations">Multi-tablet write operations</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>Regardless of whether they are a part of a transaction, writes from a Kudu client are optionally
buffered in memory until they are flushed and sent to the server. When a client&#8217;s session flushes, the
rows for each tablet are batched together, and sent to the tablet server that hosts the leader
replica of the tablet. Outside of a transaction, each of these batches represents a single,
independent write operation with its own timestamp. However, the client API provides the option to
impose some constraints on the assigned timestamps and on how writes to different tablets are
observed by clients.</p>
</div>
<div class="paragraph">
<p>Kudu, like Spanner, was designed to be externally consistent <a href="#consistency">[5]</a>, preserving consistency
when operations span multiple tablets and even multiple data centers. In practice this
means that, if a write operation changes item <em>x</em> at tablet <em>A</em>, and a following write
operation changes item <em>y</em> at tablet <em>B</em>, you might want to enforce that if
the change to <em>y</em> is observed, the change to <em>x</em> must also be observed. There
are many examples where this can be important. For example, if Kudu is
storing clickstreams for further analysis, and two clicks follow each other but
are stored in different tablets, subsequent clicks should be assigned subsequent
timestamps so that the causal relationship between them is captured.</p>
</div>
<div class="paragraph">
<div class="title"><code>CLIENT_PROPAGATED</code> Consistency</div>
<p>Kudu&#8217;s default external consistency mode is called <code>CLIENT_PROPAGATED</code>.
See <a href="#ht">[1]</a> for an extensive explanation on how it works. In brief, this mode causes writes
from <em>a single client</em> to be automatically externally consistent. In the clickstream scenario
above, if the two clicks are submitted by different client instances, the application must
manually propagate timestamps from one client to the other for the causal relationship
to be captured.</p>
</div>
<div class="paragraph">
<p>Timestamps between clients <em>a</em> and <em>b</em> can be propagated as follows:</p>
</div>
<div class="dlist">
<dl>
<dt class="hdlist1">Java Client</dt>
<dd>
<p>Call <code>AsyncKuduClient#getLastPropagatedTimestamp()</code> on client <em>a</em>,
propagate the timestamp to client <em>b</em>, and call
<code>AsyncKuduClient#setLastPropagatedTimestamp()</code> on client <em>b</em>.</p>
</dd>
<dt class="hdlist1">C&#43;&#43; Client</dt>
<dd>
<p>Call <code>KuduClient::GetLatestObservedTimestamp()</code> on client <em>a</em>,
propagate the timestamp to client <em>b</em>, and call
<code>KuduClient::SetLatestObservedTimestamp()</code> on client <em>b</em>.</p>
</dd>
</dl>
</div>
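<div class="paragraph">
<p>The effect of propagating a timestamp can be sketched with a toy model. <code>ToyServer</code> and
<code>ToyClient</code> below are hypothetical and only loosely follow the idea described in <a href="#ht">[1]</a>: a
server assigns a timestamp higher than both its own clock and the timestamp sent by the client, so a
write that follows a propagated timestamp is ordered after the write that produced it.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>
```python
class ToyServer:
    """Hypothetical tablet server with its own, possibly lagging, clock."""

    def __init__(self, clock):
        self.clock = clock

    def write(self, propagated_ts):
        # Advance past both the local clock and the timestamp the client sent.
        self.clock = max(self.clock, propagated_ts) + 1
        return self.clock


class ToyClient:
    """Hypothetical client that remembers the latest timestamp it observed."""

    def __init__(self):
        self.latest_observed = 0

    def write(self, server):
        ts = server.write(self.latest_observed)
        self.latest_observed = max(self.latest_observed, ts)
        return ts


tablet_a = ToyServer(clock=100)   # server hosting tablet A, clock ahead
tablet_b = ToyServer(clock=10)    # server hosting tablet B, clock behind

client_a, client_b = ToyClient(), ToyClient()
ts_x = client_a.write(tablet_a)                       # change to x
client_b.latest_observed = client_a.latest_observed   # manual propagation
ts_y = client_b.write(tablet_b)                       # change to y
```
</pre>
</div>
</div>
<div class="paragraph">
<p>With propagation, the change to <em>y</em> receives a higher timestamp than the change to <em>x</em>. Without
the propagation step, the lagging server would have assigned <em>y</em> a lower timestamp and the causal
order would be lost.</p>
</div>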
<div class="paragraph">
<div class="title"><code>COMMIT_WAIT</code> Consistency</div>
<p>Kudu also has an experimental implementation of an external consistency
model used in Google&#8217;s Spanner <a href="#spanner">[2]</a>, called <code>COMMIT_WAIT</code>. <code>COMMIT_WAIT</code> works
by tightly synchronizing the clocks on all machines in the cluster. Then, when a
write occurs, timestamps are assigned and the results of the write are not made
visible until enough time has passed so that no other machine in the cluster could
possibly assign a lower timestamp to a following write.</p>
</div>
<div class="paragraph">
<p>When using this mode, the latency of writes is tightly tied to the accuracy of clocks on
all the cluster hosts, and using this mode with loose clock synchronization causes writes
to take a long time to complete or even time out. See <a href="#known_issues">Known Issues and Limitations</a>.</p>
</div>
<div class="paragraph">
<p>The <code>COMMIT_WAIT</code> consistency mode may be selected as follows:</p>
</div>
<div class="dlist">
<dl>
<dt class="hdlist1">Java Client</dt>
<dd>
<p>Call <code>KuduSession#setExternalConsistencyMode(ExternalConsistencyMode.COMMIT_WAIT)</code></p>
</dd>
<dt class="hdlist1">C&#43;&#43; Client</dt>
<dd>
<p>Call <code>KuduSession::SetExternalConsistencyMode(COMMIT_WAIT)</code></p>
</dd>
</dl>
</div>
<div class="admonitionblock caution">
<table>
<tr>
<td class="icon">
<i class="fa icon-caution" title="Caution"></i>
</td>
<td class="content">
<code>COMMIT_WAIT</code> consistency is considered an experimental feature. It may return
incorrect results, exhibit performance issues, or negatively impact cluster stability.
Use in production environments is discouraged.
</td>
</tr>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_multi_tablet_write_transactions"><a class="link" href="#_multi_tablet_write_transactions">Multi-tablet write transactions</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>Kudu provides transactionality on top of the write operations, meaning all operations that occur
within a transaction abide by the same consistency behavior described above.</p>
</div>
<div class="paragraph">
<p>When a client begins a transaction, Kudu automatically assigns the transaction a unique identifier
(called a "transaction ID"). The identifier can be used to create sessions to which write operations
are applied, potentially across multiple clients per transaction. Write operations applied in the
context of a transaction are not visible until a client commits the transaction.</p>
</div>
<div class="paragraph">
<p>Kudu exposes the following APIs to pass a transaction identifier between potentially multiple
processes:</p>
</div>
<div class="dlist">
<dl>
<dt class="hdlist1">Java Client</dt>
<dd>
<p>Call <code>KuduTransaction#serialize(&#8230;&#8203;)</code> to get a bytes representation of the transaction
ID, and call <code>KuduTransaction#deserialize(&#8230;&#8203;)</code> to get a <code>KuduTransaction</code> object.</p>
</dd>
<dt class="hdlist1">C&#43;&#43; Client</dt>
<dd>
<p>Call <code>KuduTransaction::Serialize(&#8230;&#8203;)</code> to get a bytes representation of the
transaction ID, and call <code>KuduTransaction::Deserialize(&#8230;&#8203;)</code> to get a <code>KuduTransaction</code> object.</p>
</dd>
</dl>
</div>
<div class="paragraph">
<p>As writes are applied in the context of the transaction, each tablet that participates in the
transaction automatically registers itself as a participant, and is locked for further transactions
until the transaction is complete. Per-row locks are taken as per the normal flow of a write
operation, but per-row locks are released upon replicating the write operation, in favor of relying
on the tablet-wide lock.</p>
</div>
<div class="paragraph">
<p>If multiple transactions attempt to lock the same tablet, Kudu uses the wait-die scheme to avoid
deadlocks when locking the participant. Suppose transaction <em>b</em> attempts to lock a tablet that is
already locked by transaction <em>a</em>. If <em>a</em> &gt; <em>b</em> (<em>a</em> is newer than <em>b</em>), transaction <em>b</em>
keeps trying to acquire the lock until it succeeds (it "waits"). Otherwise, transaction <em>b</em> is
automatically aborted (it "dies"), and it is up to the application to retry the transaction.</p>
</div>
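<div class="paragraph">
<p>The wait-die decision can be written as a single comparison. The function below is a minimal
sketch under the assumption that transaction identifiers grow over time, so that a smaller identifier
means an older transaction; the name <code>wait_die</code> and the returned strings are hypothetical.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>
```python
def wait_die(requester_id, holder_id):
    """Decide what a transaction does when the tablet lock is already held.

    Assumes the requester and the holder are distinct transactions and that
    a smaller ID means an older transaction.
    """
    if holder_id > requester_id:
        return "wait"    # requester is older than the holder: keep retrying
    return "abort"       # requester is newer: it "dies" and must be retried
```
</pre>
</div>
</div>
<div class="paragraph">
<p>Because only an older transaction ever waits for a newer one, a cycle of transactions waiting on
each other cannot form, which is why the scheme avoids deadlocks.</p>
</div>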
<div class="paragraph">
<p>When the client commits a transaction, Kudu orchestrates a two-phase commit that assigns a "commit
timestamp" to all write operations that is higher than each of their individually assigned
timestamps. The mutations of the transaction are all visible to clients as of this commit timestamp.
Additionally, subsequent write operations on all participants are guaranteed to be assigned
timestamps higher than this timestamp. It is up to applications to ensure that all desired write
operations have succeeded (i.e. did not return row errors) before committing.</p>
</div>
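<div class="paragraph">
<p>The constraint on the commit timestamp can be sketched as follows. How Kudu actually picks the
value is not described here; the function <code>choose_commit_timestamp</code>, the tablet names, and the
choice of "highest timestamp plus one" are assumptions that merely satisfy the stated property.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>
```python
def choose_commit_timestamp(participant_write_timestamps):
    """Pick a commit timestamp above every write timestamp in the transaction.

    `participant_write_timestamps` maps a tablet name to the timestamps that
    tablet assigned to the transaction's write operations.
    """
    highest = max(
        ts
        for timestamps in participant_write_timestamps.values()
        for ts in timestamps
    )
    # Assumption: any value above `highest` would do; +1 is the simplest.
    return highest + 1


commit_ts = choose_commit_timestamp({"tablet-A": [5, 9], "tablet-B": [7]})
```
</pre>
</div>
</div>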
<div class="paragraph">
<p>As long as a transaction is expected to remain active, applications are expected to maintain at
least one reference to the given transaction&#8217;s handle, each of which can be configured to
automatically heartbeat to the Kudu cluster, indicating liveness of the transacting application. By
default, only the first created transaction handle for a transaction will heartbeat, with the
expectation that it is kept alive for the entire duration of the transaction. If only a single
transaction handle is expected to be kept alive at once across multiple clients, the heartbeating
can be enabled with the following calls when serializing the handle for use in other processes.</p>
</div>
<div class="dlist">
<dl>
<dt class="hdlist1">Java Client</dt>
<dd>
<p>Call <code>KuduTransaction.SerializationOptions#setEnableKeepalive(true)</code></p>
</dd>
<dt class="hdlist1">C&#43;&#43; Client</dt>
<dd>
<p>Call <code>KuduTransaction::SerializationOptions::enable_keepalive(true)</code></p>
</dd>
</dl>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_read_operations_scans"><a class="link" href="#_read_operations_scans">Read Operations (Scans)</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>Scans are read operations performed by clients that may span one or more rows across
one or more tablets. When a server receives a scan request, it takes a snapshot of the MVCC
state and then proceeds in one of two ways depending on the read mode selected by
the user. The mode may be selected as follows:</p>
</div>
<div class="dlist">
<dl>
<dt class="hdlist1">Java Client</dt>
<dd>
<p>Call <code>KuduScannerBuilder#readMode(&#8230;&#8203;)</code></p>
</dd>
<dt class="hdlist1">C&#43;&#43; Client</dt>
<dd>
<p>Call <code>KuduScanner::SetReadMode()</code></p>
</dd>
</dl>
</div>
<div class="paragraph">
<p>The following modes are available in both clients:</p>
</div>
<div class="dlist">
<dl>
<dt class="hdlist1"><code>READ_LATEST</code></dt>
<dd>
<p>This is the default read mode. The server takes a snapshot of
the MVCC state and proceeds with the read immediately. Reads in this mode only yield
'Read Committed' isolation.</p>
</dd>
<dt class="hdlist1"><code>READ_AT_SNAPSHOT</code></dt>
<dd>
<p>In this read mode, scans are consistent and repeatable. A
timestamp for the snapshot is selected either by the server, or set
explicitly by the user through <code>KuduScanner::SetSnapshotMicros()</code>. Explicitly setting
the timestamp is recommended; see <a href="#recommendations">Recommendations</a>. The server waits until this
timestamp is 'safe' (until all write operations that have a lower timestamp have
completed and are visible). This delay, coupled with an external consistency method,
will eventually allow Kudu to have full <code>strict-serializable</code> semantics for reads
and writes. This is still a work in progress and some anomalies are still possible
(see <a href="#known_issues">Known Issues and Limitations</a>). Only scans in this mode can be fault-tolerant.</p>
</dd>
<dt class="hdlist1"><code>READ_YOUR_WRITES</code></dt>
<dd>
<p>This read mode relies on the state of a Kudu client to
issue subsequent scan requests. When issuing a scan request in this read mode,
a Kudu client provides the latest timestamp it observed so far. The server
selects a timestamp higher than the timestamp provided by the client, that is
also guaranteed to have all prior write operations committed and applied to
the data. That translates into read-your-writes and read-your-reads behavior
which is useful in scenarios where subsequent scan requests should contain the
data the client has seen so far while reading and writing during its current
session. See <a href="https://issues.apache.org/jira/browse/KUDU-1704">KUDU-1704</a> for
more details and references. To summarize, this read mode:</p>
<div class="ulist">
<ul>
<li>
<p>ensures read-your-writes and read-your-reads session guarantees</p>
</li>
<li>
<p>minimizes the latency caused by waiting for outstanding write operations
at the server side to complete</p>
</li>
<li>
<p>doesn&#8217;t guarantee linearizability</p>
</li>
</ul>
</div>
</dd>
</dl>
</div>
<div class="paragraph">
<p>Selecting between read modes requires balancing the trade-offs and making a choice
that fits your workload. For instance, a reporting application that needs to
scan the entire database might need to perform careful accounting operations, so that
scan may need to be fault-tolerant, but probably doesn&#8217;t require a to-the-microsecond
up-to-date view of the database. In that case, you might choose <code>READ_AT_SNAPSHOT</code>
and select a timestamp that is a few seconds in the past when the scan starts. On
the other hand, a machine learning workload that is not ingesting the whole data
set and is already statistical in nature might not require the scan to be repeatable,
so you might choose <code>READ_LATEST</code> instead for better scan performance.</p>
</div>
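<div class="paragraph">
<p>A timestamp a few seconds in the past can be computed as shown below. This is a sketch that
assumes the snapshot timestamp is expressed in microseconds since the Unix epoch; the helper
<code>backdated_snapshot_micros</code> is hypothetical and does not call any Kudu API.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>
```python
import time


def backdated_snapshot_micros(seconds_in_past, now_seconds=None):
    """Return a timestamp in microseconds since the epoch, shifted into the past."""
    if now_seconds is None:
        now_seconds = time.time()
    return int((now_seconds - seconds_in_past) * 1_000_000)


# Snapshot timestamp five seconds before the current wall-clock time.
snapshot_micros = backdated_snapshot_micros(5)
```
</pre>
</div>
</div>
<div class="paragraph">
<p>The resulting value is what an application would pass when setting the snapshot timestamp of a
<code>READ_AT_SNAPSHOT</code> scan.</p>
</div>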
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
<div class="paragraph">
<p>Kudu also provides a replica selection API that lets users choose the replica at which the
scan is performed:</p>
</div>
<div class="dlist">
<dl>
<dt class="hdlist1">Java Client</dt>
<dd>
<p>Call <code>KuduScannerBuilder#replicaSelection(&#8230;&#8203;)</code></p>
</dd>
<dt class="hdlist1">C&#43;&#43; Client</dt>
<dd>
<p>Call <code>KuduScanner::SetSelection(&#8230;&#8203;)</code></p>
</dd>
</dl>
</div>
<div class="paragraph">
<p>This API is a means to control locality and, in some cases, latency. The replica
selection API has no effect on the consistency guarantees, which will hold no matter
which replica is selected.</p>
</div>
</td>
</tr>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="known_issues"><a class="link" href="#known_issues">Known Issues and Limitations</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>There are several gaps and corner cases that currently prevent Kudu from being strictly serializable
in some situations. The details are listed below, followed by some recommendations.</p>
</div>
<div class="sect2">
<h3 id="known_issues_scans"><a class="link" href="#known_issues_scans">Writes</a></h3>
<div class="ulist">
<ul>
<li>
<p>Support for <code>COMMIT_WAIT</code> is experimental and requires careful tuning of the
time-synchronization protocol, such as NTP (Network Time Protocol). Its use
is discouraged in production environments.</p>
</li>
<li>
<p>Multi-tablet transaction support currently only allows a tablet to participate in a single
transaction at a time.</p>
</li>
<li>
<p>Multi-tablet transaction support currently only guarantees
<a href="https://jepsen.io/consistency/models/read-committed">"read committed"</a> semantics.</p>
</li>
</ul>
</div>
</div>
<div class="sect2">
<h3 id="_reads_scans"><a class="link" href="#_reads_scans">Reads (Scans)</a></h3>
<div class="ulist">
<ul>
<li>
<p>On a leader change, <code>READ_AT_SNAPSHOT</code> scans at a snapshot whose timestamp is beyond the last
write may yield non-repeatable reads (see
<a href="https://issues.apache.org/jira/browse/KUDU-1188">KUDU-1188</a>).
See <a href="#recommendations">Recommendations</a> for a workaround.</p>
</li>
<li>
<p>Impala scans are currently performed as <code>READ_LATEST</code> and have no consistency
guarantees.</p>
</li>
<li>
<p>In <code>AUTO_BACKGROUND_FLUSH</code> mode, or when using "async" flushing mechanisms, writes applied to a
single client session may become reordered due to the concurrency of flushing the data to the
server. This may be particularly noticeable if a single row is quickly updated with different
values in succession. This phenomenon affects all client API implementations, including
transactional APIs. Workarounds are described in the API documentation for the respective
implementations in the docs for <code>FlushMode</code> or <code>AsyncKuduSession</code>. See
<a href="https://issues.apache.org/jira/browse/KUDU-1767">KUDU-1767</a>.</p>
</li>
<li>
<p>Dirty reads (i.e. reads within an uncommitted transaction) are not currently supported.</p>
</li>
</ul>
</div>
</div>
<div class="sect2">
<h3 id="recommendations"><a class="link" href="#recommendations">Recommendations</a></h3>
<div class="ulist">
<ul>
<li>
<p>If repeatable snapshot reads are a requirement, use <code>READ_AT_SNAPSHOT</code>
with a timestamp that is slightly in the past (ideally 2 to 5 seconds).
This will circumvent the anomaly described in <a href="#_reads_scans">Reads (Scans)</a>. Even when the
anomaly has been addressed, back-dating the timestamp will always make scans
faster, since they are unlikely to block.</p>
</li>
<li>
<p>If external consistency is a requirement and you decide to use <code>COMMIT_WAIT</code>, the
time-synchronization protocol needs to be tuned carefully. Each operation will wait 2x the maximum
clock error at the time of execution, which is usually in the 100 msec. to 1 sec. range with the
default settings, maybe more. Thus, write operations would take at least 200 msec. to 2 sec. to
complete when using the default settings and may even time out.</p>
<div class="ulist">
<ul>
<li>
<p>A local server should be used as a time server. We&#8217;ve performed experiments using the default
NTP time source available in a Google Compute Engine data center and were able to obtain
a reasonably tight maximum error bound, usually between 12 and 17 milliseconds.</p>
</li>
<li>
<p>The following parameters should be adjusted in <code>/etc/ntp.conf</code> to tighten the maximum error:</p>
<div class="ulist">
<ul>
<li>
<p><code>server my_server.org iburst minpoll 1 maxpoll 8</code></p>
</li>
<li>
<p><code>tinker dispersion 500</code></p>
</li>
<li>
<p><code>tinker allan 0</code></p>
</li>
</ul>
</div>
</li>
</ul>
</div>
</li>
</ul>
</div>
<div class="admonitionblock important">
<table>
<tr>
<td class="icon">
<i class="fa icon-important" title="Important"></i>
</td>
<td class="content">
The above parameters minimize <code>maximum error</code> at the expense of <code>estimated error</code>;
the latter might be orders of magnitude above its "normal" value. These parameters may also
place a greater load on the time server, since they make the servers poll much more
frequently.
</td>
</tr>
</table>
</div>
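<div class="paragraph">
<p>The <code>COMMIT_WAIT</code> latency figures quoted in the recommendations follow directly from the
"2x the maximum clock error" rule. The helper below is only a worked version of that arithmetic; the
name <code>commit_wait</code> is hypothetical and the result is in the same unit as the input.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>
```python
def commit_wait(max_clock_error):
    """Time a COMMIT_WAIT write waits: twice the maximum clock error."""
    return 2 * max_clock_error


# Default settings: maximum error of roughly 100 to 1000 milliseconds.
default_wait_ms = (commit_wait(100), commit_wait(1000))

# Tuned local time server: maximum error of roughly 12 to 17 milliseconds.
tuned_wait_ms = (commit_wait(12), commit_wait(17))
```
</pre>
</div>
</div>
<div class="paragraph">
<p>Under these assumptions, tightening the maximum error from the default range to the 12 to 17
millisecond range reduces the added write latency from between 200 milliseconds and 2 seconds to
between 24 and 34 milliseconds.</p>
</div>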
</div>
</div>
</div>
<div class="sect1">
<h2 id="_references"><a class="link" href="#_references">References</a></h2>
<div class="sectionbody">
<div class="ulist bibliography">
<ul class="bibliography">
<li>
<p><a id="ht"></a>[1] David Alves, Todd Lipcon and Vijay Garg. Technical Report: HybridTime - Accessible Global Consistency with High Clock Uncertainty. April, 2014. <a href="http://users.ece.utexas.edu/~garg/pdslab/david/hybrid-time-tech-report-01.pdf" class="bare">http://users.ece.utexas.edu/~garg/pdslab/david/hybrid-time-tech-report-01.pdf</a></p>
</li>
<li>
<p><a id="spanner"></a>[2] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google&#8217;s globally-distributed database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI'12). USENIX Association, Berkeley, CA, USA, 251-264.</p>
</li>
<li>
<p><a id="fdt"></a>[3] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 1-12. DOI=10.1145/2213836.2213838 <a href="http://doi.acm.org/10.1145/2213836.2213838" class="bare">http://doi.acm.org/10.1145/2213836.2213838</a></p>
</li>
<li>
<p><a id="consensus"></a>[4] Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC'14), Garth Gibson and Nickolai Zeldovich (Eds.). USENIX Association, Berkeley, CA, USA, 305-320.</p>
</li>
<li>
<p><a id="consistency"></a>[5] Kwei-Jay Lin, "Consistency issues in real-time database systems," in System Sciences, 1989. Vol.II: Software Track, Proceedings of the Twenty-Second Annual Hawaii International Conference on , vol.2, no., pp.654-661 vol.2, 3-6 Jan 1989 doi: 10.1109/HICSS.1989.48069</p>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="col-md-3">
<div id="toc" data-spy="affix" data-offset-top="70">
<ul>
<li>
<a href="index.html">Introducing Kudu</a>
</li>
<li>
<a href="release_notes.html">Kudu Release Notes</a>
</li>
<li>
<a href="quickstart.html">Quickstart Guide</a>
</li>
<li>
<a href="installation.html">Installation Guide</a>
</li>
<li>
<a href="configuration.html">Configuring Kudu</a>
</li>
<li>
<a href="hive_metastore.html">Using the Hive Metastore with Kudu</a>
</li>
<li>
<a href="kudu_impala_integration.html">Using Impala with Kudu</a>
</li>
<li>
<a href="administration.html">Administering Kudu</a>
</li>
<li>
<a href="troubleshooting.html">Troubleshooting Kudu</a>
</li>
<li>
<a href="developing.html">Developing Applications with Kudu</a>
</li>
<li>
<a href="schema_design.html">Kudu Schema Design</a>
</li>
<li>
<a href="scaling_guide.html">Kudu Scaling Guide</a>
</li>
<li>
<a href="security.html">Kudu Security</a>
</li>
<li>
<span class="active-toc">Kudu Transaction Semantics</span>
<ul class="sectlevel1">
<li><a href="#_single_tablet_write_operations">Single tablet write operations</a></li>
<li><a href="#_multi_tablet_write_operations">Multi-tablet write operations</a></li>
<li><a href="#_multi_tablet_write_transactions">Multi-tablet write transactions</a></li>
<li><a href="#_read_operations_scans">Read Operations (Scans)</a></li>
<li><a href="#known_issues">Known Issues and Limitations</a>
<ul class="sectlevel2">
<li><a href="#known_issues_scans">Writes</a></li>
<li><a href="#_reads_scans">Reads (Scans)</a></li>
<li><a href="#recommendations">Recommendations</a></li>
</ul>
</li>
<li><a href="#_references">References</a></li>
</ul>
</li>
<li>
<a href="background_tasks.html">Background Maintenance Tasks</a>
</li>
<li>
<a href="configuration_reference.html">Kudu Configuration Reference</a>
</li>
<li>
<a href="command_line_tools_reference.html">Kudu Command Line Tools Reference</a>
</li>
<li>
<a href="metrics_reference.html">Kudu Metrics Reference</a>
</li>
<li>
<a href="known_issues.html">Known Issues and Limitations</a>
</li>
<li>
<a href="contributing.html">Contributing to Kudu</a>
</li>
<li>
<a href="export_control.html">Export Control Notice</a>
</li>
</ul>
</div>
</div>
</div>
</div>
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p class="small">
Copyright &copy; 2020 The Apache Software Foundation. Last updated 2022-02-10 19:45:07 +0100
</p>
<p class="small">
Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu
project logo are either registered trademarks or trademarks of The
Apache Software Foundation in the United States and other countries.
</p>
</div>
<div class="col-md-3">
<a class="pull-right" href="https://www.apache.org/events/current-event.html">
<img src="https://www.apache.org/events/current-event-234x60.png"/>
</a>
</div>
</div>
</footer>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script>
// Try to detect touch-screen devices. Note: Many laptops have touch screens.
$(document).ready(function() {
if ("ontouchstart" in document.documentElement) {
$(document.documentElement).addClass("touch");
} else {
$(document.documentElement).addClass("no-touch");
}
});
</script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"
integrity="sha384-0mSbJDEHialfmuBBQP6A4Qrprq5OVfW37PRR3j5ELqxss1yVqOtnepnHVP9aJ7xS"
crossorigin="anonymous"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68448017-1', 'auto');
ga('send', 'pageview');
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.js"></script>
<script>
anchors.options = {
placement: 'right',
visible: 'touch',
};
anchors.add();
</script>
</body>
</html>