docs/transaction_semantics.adoc - kudu - Git at Google

 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 [[installation]]
 = Transaction Semantics in Apache Kudu

 :author: Kudu Team
 :imagesdir: ./images
 :icons: font
 :toc: left
 :toclevels: 3
 :doctype: book
 :backend: html5
 :sectlinks:
 :experimental:

 ****
 This is a brief introduction to Kudu's transaction and consistency semantics. For an
 in-depth technical exposition of most of what is mentioned here, and why it is correct,
 see the technical report <<ht>>.
 ****

 Kudu's transactional semantics and architecture are inspired by state-of-the-art
 systems such as Spanner <<spanner>> and Calvin <<fdt>>. Kudu builds upon decades of database
 research. The core philosophy is to make the lives of developers easier by providing transactions with
 simple, strong semantics, without sacrificing performance or the ability to tune to different
 requirements.

 Kudu is designed to eventually be fully ACID, however, multi-tablet transactions are not
 yet implemented. As such, this discussion focuses on single-tablet write operations, and only
 briefly touches multi-tablet reads. Eventually Kudu will support fully strict-serializable
 semantics. In fact it already does in a limited context, but not all corner cases are covered
 as this is still a work in progress.

 Kudu currently allows the following operations:

 * *Write operations* are sets of rows to be inserted, updated, or deleted in the storage
 engine, in a single tablet with multiple replicas. Write operations do not have separate
 "read sets" i.e. they do not scan existing data before performing the write. Each write
 is only concerned with previous state of the rows that are about to change.
 Writes are not  "committed" explicitly by the user. Instead, they are committed automatically
 by the system, after completion.

 * *Scans* are read operations that can traverse multiple tablets and read information
 with different levels of consistency or correctness guarantees. Scans can perform
 time-travel reads, i.e. the user is able to set a scan timestamp in the past and
 get back results that reflect the state of the storage engine at that point in time.

 [NOTE]
 .Before We Begin
 ====
 * The term _timestamp_ is mentioned several times to illustrate the
 functionality, but _timestamp_ is an internal concept mostly invisible to users,
 except when setting timestamp on a `KuduScanner`.

 * We generally refer to methods and classes of the {cpp} client. While the Java
 client mostly has analogous methods and classes, the exact names of the APIs
 may differ.
 ====

 == Single tablet write operations

 Kudu employs _Multiversion Concurrency Control (MVCC)_ and the _Raft consensus_
 algorithm <<consensus>>. Each write operation in Kudu must go through the
 tablet's leader.

 . The leader acquires all locks for the rows that it will change.
 . The leader assigns the write a timestamp before the write is submitted for
 replication. This timestamp will be the write's "tag" in MVCC.
 . After a majority of replicas acknowledges the change, the actual rows are changed.
 . After the changes are complete, they are made visible to concurrent writes
 and reads, atomically.

 All replicas of a tablet observe the same order of operations, and if a write
 operation is assigned timestamp _n_ and changes row _x_, a second write operation
 at timestamp _m > n_ is guaranteed to see the new value of _x_.

 This strict ordering of lock acquisition and timestamp assignment is enforced to be
 consistent across all replicas of a tablet through consensus. Therefore, write operations
 are totally ordered with regard to clock-assigned timestamps, relative to other writes
 in the same tablet. In other words, writes have strict-serializable semantics,
 though in an admittedly limited context. See this
 link:http://www.bailis.org/blog/linearizability-versus-serializability[blog post]
 for a little more context regarding what these semantics mean.

 While Isolated and Durable in an ACID sense, multi-row write operations are not yet fully Atomic.
 The failure of a single write in a batch operation does not roll back the operation,
 but produces per-row errors.

 == Writing to multiple tablets

 Kudu does not yet support transactions that span multiple tablets. However,
 consistent snapshot reads are possible (with caveats in the current implementation)
 as explained below.

 Writes from a Kudu client are optionally buffered in memory until they are flushed and sent
 to the server. When client's session flushes, the rows for each tablet are batched
 together, and sent to the tablet server which hosts the leader replica of the tablet.
 Since there are no inter-tablet transactions, each of these batches represents a single,
 independent write operation with its own timestamp. However, the client API provides
 the option to impose some constraints on the assigned timestamps and on how writes to
 different tablets can be observed by clients.

 Kudu, like Spanner, was designed to be externally consistent <<consistency>>, preserving consistency
 when operations span multiple tablets and even multiple data centers. In practice this
 means that, if a write operation changes item _x_ at tablet _A_, and a following write
 operation changes item _y_ at tablet _B_, you might want to enforce that if
 the change to _y_ is observed, the change to _x_ must also be observed. There
 are many examples where this can be important. For example,  if Kudu is
 storing clickstreams for further analysis, and two clicks follow each other but
 are stored in different tablets, subsequent clicks should be assigned subsequent
 timestamps so that the causal relationship between them is captured.

 .`CLIENT_PROPAGATED` Consistency
 Kudu's default external consistency mode is called `CLIENT_PROPAGATED`.
 See <<ht>> for an extensive explanation on how it works. In brief, this mode causes writes
 from _a single client_ to be automatically externally consistent. In the clickstream scenario
 above, if the two clicks are submitted by different client instances, the application must
 manually propagate timestamps from one client to the other for the causal relationship
 to be captured.

 Timestamps between clients _a_ and _b_ can be propagated as follows:

 Java Client:: Call `AsyncKuduClient#getLastPropagatedTimestamp()` on client _a_,
 propagate the timestamp to client _b_, and call
 `AsyncKuduClient#setLastPropagatedTimestamp()` on client _b_.

 {cpp} Client:: Call `KuduClient::GetLatestObservedTimestamp()` on client _a_,
 propagate the timestamp to client _b_, and call
 `KuduClient::SetLatestObservedTimestamp()` on client _b_.


 .`COMMIT_WAIT` Consistency
 Kudu also has an experimental implementation of an external consistency
 model used in Google's Spanner , called `COMMIT_WAIT`. `COMMIT_WAIT` works
 by tightly synchronizing the clocks on all machines in the cluster. Then, when a
 write occurs, timestamps are assigned and the results of the write are not made
 visible until enough time has passed so that no other machine in the cluster could
 possibly assign a lower timestamp to a following write.

 When using this mode, the latency of writes is tightly tied to the accuracy of clocks on
 all the cluster hosts, and using this mode with loose clock synchronization causes writes
 to take a long time to complete or even time out. See <<known_issues>>.

 The `COMMIT_WAIT` consistency mode may be selected as follows:

 Java Client:: Call `KuduSession#setExternalConsistencyMode(ExternalConsistencyMode.COMMIT_WAIT)`

 {cpp} Client:: Call `KuduSession::SetExternalConsistencyMode(COMMIT_WAIT)`

 CAUTION: `COMMIT_WAIT` consistency is considered an experimental feature. It may return
 incorrect results, exhibit performance issues, or negatively impact cluster stability.
 Use in production environments is discouraged.

 == Read Operations (Scans)

 Scans are read operations performed by clients that may span one or more rows across
 one or more tablets. When a server receives a scan request, it takes a snapshot of the MVCC
 state and then proceeds in one of two ways depending on the read mode selected by
 the user. The mode may be selected as follows:

 Java Client:: Call `KuduScannerBuilder#readMode(...)`
 {cpp} Client:: Call `KuduScanner::SetReadMode()`

 The following modes are available in both clients:

 `READ_LATEST`:: This is the default read mode. The server takes a snapshot of
 the MVCC state and proceeds with the read immediately. Reads in this mode only yield
 'Read Committed' isolation.

 `READ_AT_SNAPSHOT`:: In this read mode, scans are consistent and repeatable. A
 timestamp for the snapshot is selected either by the server, or set
 explicitly by the user through `KuduScanner::SetSnapshotMicros()`. Explicitly setting
 the timestamp is recommended; see <<recommendations>>. The server waits until this
 timestamp is 'safe' (until all write operations that have a lower timestamp have
 completed and are visible). This delay, coupled with an external consistency method,
 will eventually allow Kudu to have full `strict-serializable` semantics for reads
 and writes. This is still a work in progress and some anomalies are still possible
 (see <<known_issues>>). Only scans in this mode can be fault-tolerant.

 Selecting between read modes requires balancing the trade-offs and making a choice
 that fits your workload. For instance, a reporting application that needs to
 scan the entire database might need to perform careful accounting operations, so that
 scan may need to be fault-tolerant, but probably doesn't require a to-the-microsecond
 up-to-date view of the database. In that case, you might choose `READ_AT_SNAPSHOT`
 and select a timestamp that is a few seconds in the past when the scan starts. On
 the other hand, a machine learning workload that is not ingesting the whole data
 set and is already statistical in nature might not require the scan to be repeatable,
 so you might choose `READ_LATEST` instead for better scan performance.

 [NOTE]
 ====
 Kudu also provides replica selection API for users to choose at which replica the
 scan should be performed:

 Java Client:: Call `KuduScannerBuilder#replicaSelection(...)`
 {cpp} Client:: Call `KuduScanner::SetSelection(...)`

 This API is a means to control locality and, in some cases, latency. The replica
 selection API has no effect on the consistency guarantees, which will hold no matter
 which replica is selected.
 ====

 [[known_issues]]
 == Known Issues and Limitations

 There are several gaps and corner cases that prevent Kudu from being fully strictly-serializable
 in some situations, at the moment. Below are the details and next, some recommendations.

 [[known_issues_scans]]
 === Writes

 * Support for `COMMIT_WAIT` is experimental and requires careful tuning of the
   time-synchronization protocol, such as NTP (Network Time Protocol). Its use
   is discouraged in production environments.

 === Reads (Scans)

 * On a leader change, `READ_AT_SNAPSHOT` scans at a snapshot whose timestamp is beyond the last
   write may also yield non-repeatable reads (see
   link:https://issues.apache.org/jira/browse/KUDU-1188[KUDU-1188]).
   See <<recommendations>> for a workaround.
 * Impala scans are currently performed as `READ_LATEST` and have no consistency
   guarantees.
 * In `AUTO_BACKGROUND_FLUSH` mode, or when using "async" flushing mechanisms,
   writes applied to a single client session may become reordered due to the
   concurrency of flushing the data to the server. This may be particularly
   noticeable if a single row is quickly updated with different values in
   succession. This phenomenon affects all client API implementations.
   Workarounds are described in the API documentation for the respective
   implementations in the docs for `FlushMode` or `AsyncKuduSession`.
   See link:https://issues.apache.org/jira/browse/KUDU-1767[KUDU-1767].

 [[recommendations]]
 === Recommendations

 * If repeatable snapshot reads are a requirement, use `READ_AT_SNAPSHOT`
   with a timestamp that is slightly in the past (between 2-5 seconds, ideally).
   This will circumvent the anomaly described in <<known_issues_scans>>. Even when the
   anomaly has been addressed, back-dating the timestamp will always make scans
   faster, since they are unlikely to block.

 * If external consistency is a requirement and you decide to use `COMMIT_WAIT`, the
   time-synchronization protocol needs to be tuned carefully. Each transaction will wait
   2x the maximum clock error at the time of execution, which is usually in the 100 msec.
   to 1 sec. range with the default settings, maybe more. Thus, transactions would take at least
   200 msec. to 2 sec. to complete when using the default settings and may even time out.

   ** A local server should be used as a time server. We've performed experiments using the default
   NTP time source available in a Google Compute Engine data center and were able to obtain
   a reasonable tight max error bound, usually varying between 12-17 milliseconds.

   ** The following parameters should be adjusted in `/etc/ntp.conf` to tighten the maximum error:
   - `server my_server.org iburst minpoll 1 maxpoll 8`
   - `tinker dispersion 500`
   - `tinker allan 0`

 IMPORTANT: The above parameters minimize `maximum error` at the expense of `estimated error`,
 the latter might be orders of magnitude above it's "normal" value. These parameters also
 may place a greater load on the time server, since they make the servers poll much more
 frequently.

 [bibliography]
 == References
 - [[[ht,1]]] David Alves, Todd Lipcon and Vijay Garg. Technical Report: HybridTime - Accessible Global Consistency with High Clock Uncertainty. April, 2014. http://users.ece.utexas.edu/~garg/pdslab/david/hybrid-time-tech-report-01.pdf
 - [[[spanner,2]]] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI'12). USENIX Association, Berkeley, CA, USA, 251-264.
 - [[[fdt,3]]] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 1-12. DOI=10.1145/2213836.2213838 http://doi.acm.org/10.1145/2213836.2213838
 - [[[consensus,4]]] Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC'14), Garth Gibson and Nickolai Zeldovich (Eds.). USENIX Association, Berkeley, CA, USA, 305-320.
 - [[[consistency,5]]] Kwei-Jay Lin, "Consistency issues in real-time database systems," in System Sciences, 1989. Vol.II: Software Track, Proceedings of the Twenty-Second Annual Hawaii International Conference on , vol.2, no., pp.654-661 vol.2, 3-6 Jan 1989 doi: 10.1109/HICSS.1989.48069
	// Licensed to the Apache Software Foundation (ASF) under one
	// or more contributor license agreements. See the NOTICE file
	// distributed with this work for additional information
	// regarding copyright ownership. The ASF licenses this file
	// to you under the Apache License, Version 2.0 (the
	// "License"); you may not use this file except in compliance
	// with the License. You may obtain a copy of the License at
	//
	// http://www.apache.org/licenses/LICENSE-2.0
	//
	// Unless required by applicable law or agreed to in writing,
	// software distributed under the License is distributed on an
	// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	// KIND, either express or implied. See the License for the
	// specific language governing permissions and limitations
	// under the License.

	[[installation]]
	= Transaction Semantics in Apache Kudu

	:author: Kudu Team
	:imagesdir: ./images
	:icons: font
	:toc: left
	:toclevels: 3
	:doctype: book
	:backend: html5
	:sectlinks:
	:experimental:

	****
	This is a brief introduction to Kudu's transaction and consistency semantics. For an
	in-depth technical exposition of most of what is mentioned here, and why it is correct,
	see the technical report <<ht>>.
	****

	Kudu's transactional semantics and architecture are inspired by state-of-the-art
	systems such as Spanner <<spanner>> and Calvin <<fdt>>. Kudu builds upon decades of database
	research. The core philosophy is to make the lives of developers easier by providing transactions with
	simple, strong semantics, without sacrificing performance or the ability to tune to different
	requirements.

	Kudu is designed to eventually be fully ACID, however, multi-tablet transactions are not
	yet implemented. As such, this discussion focuses on single-tablet write operations, and only
	briefly touches multi-tablet reads. Eventually Kudu will support fully strict-serializable
	semantics. In fact it already does in a limited context, but not all corner cases are covered
	as this is still a work in progress.

	Kudu currently allows the following operations:

	* Write operations are sets of rows to be inserted, updated, or deleted in the storage
	engine, in a single tablet with multiple replicas. Write operations do not have separate
	"read sets" i.e. they do not scan existing data before performing the write. Each write
	is only concerned with previous state of the rows that are about to change.
	Writes are not "committed" explicitly by the user. Instead, they are committed automatically
	by the system, after completion.

	* Scans are read operations that can traverse multiple tablets and read information
	with different levels of consistency or correctness guarantees. Scans can perform
	time-travel reads, i.e. the user is able to set a scan timestamp in the past and
	get back results that reflect the state of the storage engine at that point in time.

	[NOTE]
	.Before We Begin
	====
	* The term _timestamp_ is mentioned several times to illustrate the
	functionality, but _timestamp_ is an internal concept mostly invisible to users,
	except when setting timestamp on a `KuduScanner`.

	* We generally refer to methods and classes of the {cpp} client. While the Java
	client mostly has analogous methods and classes, the exact names of the APIs
	may differ.
	====

	== Single tablet write operations

	Kudu employs _Multiversion Concurrency Control (MVCC)_ and the _Raft consensus_
	algorithm <<consensus>>. Each write operation in Kudu must go through the
	tablet's leader.

	. The leader acquires all locks for the rows that it will change.
	. The leader assigns the write a timestamp before the write is submitted for
	replication. This timestamp will be the write's "tag" in MVCC.
	. After a majority of replicas acknowledges the change, the actual rows are changed.
	. After the changes are complete, they are made visible to concurrent writes
	and reads, atomically.

	All replicas of a tablet observe the same order of operations, and if a write
	operation is assigned timestamp _n_ and changes row _x_, a second write operation
	at timestamp _m > n_ is guaranteed to see the new value of _x_.

	This strict ordering of lock acquisition and timestamp assignment is enforced to be
	consistent across all replicas of a tablet through consensus. Therefore, write operations
	are totally ordered with regard to clock-assigned timestamps, relative to other writes
	in the same tablet. In other words, writes have strict-serializable semantics,
	though in an admittedly limited context. See this
	link:http://www.bailis.org/blog/linearizability-versus-serializability[blog post]
	for a little more context regarding what these semantics mean.

	While Isolated and Durable in an ACID sense, multi-row write operations are not yet fully Atomic.
	The failure of a single write in a batch operation does not roll back the operation,
	but produces per-row errors.

	== Writing to multiple tablets

	Kudu does not yet support transactions that span multiple tablets. However,
	consistent snapshot reads are possible (with caveats in the current implementation)
	as explained below.

	Writes from a Kudu client are optionally buffered in memory until they are flushed and sent
	to the server. When client's session flushes, the rows for each tablet are batched
	together, and sent to the tablet server which hosts the leader replica of the tablet.
	Since there are no inter-tablet transactions, each of these batches represents a single,
	independent write operation with its own timestamp. However, the client API provides
	the option to impose some constraints on the assigned timestamps and on how writes to
	different tablets can be observed by clients.

	Kudu, like Spanner, was designed to be externally consistent <<consistency>>, preserving consistency
	when operations span multiple tablets and even multiple data centers. In practice this
	means that, if a write operation changes item _x_ at tablet _A_, and a following write
	operation changes item _y_ at tablet _B_, you might want to enforce that if
	the change to _y_ is observed, the change to _x_ must also be observed. There
	are many examples where this can be important. For example, if Kudu is
	storing clickstreams for further analysis, and two clicks follow each other but
	are stored in different tablets, subsequent clicks should be assigned subsequent
	timestamps so that the causal relationship between them is captured.

	.`CLIENT_PROPAGATED` Consistency
	Kudu's default external consistency mode is called `CLIENT_PROPAGATED`.
	See <<ht>> for an extensive explanation on how it works. In brief, this mode causes writes
	from _a single client_ to be automatically externally consistent. In the clickstream scenario
	above, if the two clicks are submitted by different client instances, the application must
	manually propagate timestamps from one client to the other for the causal relationship
	to be captured.

	Timestamps between clients _a_ and _b_ can be propagated as follows:

	Java Client:: Call `AsyncKuduClient#getLastPropagatedTimestamp()` on client _a_,
	propagate the timestamp to client _b_, and call
	`AsyncKuduClient#setLastPropagatedTimestamp()` on client _b_.

	{cpp} Client:: Call `KuduClient::GetLatestObservedTimestamp()` on client _a_,
	propagate the timestamp to client _b_, and call
	`KuduClient::SetLatestObservedTimestamp()` on client _b_.



	.`COMMIT_WAIT` Consistency
	Kudu also has an experimental implementation of an external consistency
	model used in Google's Spanner , called `COMMIT_WAIT`. `COMMIT_WAIT` works
	by tightly synchronizing the clocks on all machines in the cluster. Then, when a
	write occurs, timestamps are assigned and the results of the write are not made
	visible until enough time has passed so that no other machine in the cluster could
	possibly assign a lower timestamp to a following write.

	When using this mode, the latency of writes is tightly tied to the accuracy of clocks on
	all the cluster hosts, and using this mode with loose clock synchronization causes writes
	to take a long time to complete or even time out. See <<known_issues>>.

	The `COMMIT_WAIT` consistency mode may be selected as follows:

	Java Client:: Call `KuduSession#setExternalConsistencyMode(ExternalConsistencyMode.COMMIT_WAIT)`

	{cpp} Client:: Call `KuduSession::SetExternalConsistencyMode(COMMIT_WAIT)`

	CAUTION: `COMMIT_WAIT` consistency is considered an experimental feature. It may return
	incorrect results, exhibit performance issues, or negatively impact cluster stability.
	Use in production environments is discouraged.

	== Read Operations (Scans)

	Scans are read operations performed by clients that may span one or more rows across
	one or more tablets. When a server receives a scan request, it takes a snapshot of the MVCC
	state and then proceeds in one of two ways depending on the read mode selected by
	the user. The mode may be selected as follows:

	Java Client:: Call `KuduScannerBuilder#readMode(...)`
	{cpp} Client:: Call `KuduScanner::SetReadMode()`

	The following modes are available in both clients:

	`READ_LATEST`:: This is the default read mode. The server takes a snapshot of
	the MVCC state and proceeds with the read immediately. Reads in this mode only yield
	'Read Committed' isolation.

	`READ_AT_SNAPSHOT`:: In this read mode, scans are consistent and repeatable. A
	timestamp for the snapshot is selected either by the server, or set
	explicitly by the user through `KuduScanner::SetSnapshotMicros()`. Explicitly setting
	the timestamp is recommended; see <<recommendations>>. The server waits until this
	timestamp is 'safe' (until all write operations that have a lower timestamp have
	completed and are visible). This delay, coupled with an external consistency method,
	will eventually allow Kudu to have full `strict-serializable` semantics for reads
	and writes. This is still a work in progress and some anomalies are still possible
	(see <<known_issues>>). Only scans in this mode can be fault-tolerant.

	Selecting between read modes requires balancing the trade-offs and making a choice
	that fits your workload. For instance, a reporting application that needs to
	scan the entire database might need to perform careful accounting operations, so that
	scan may need to be fault-tolerant, but probably doesn't require a to-the-microsecond
	up-to-date view of the database. In that case, you might choose `READ_AT_SNAPSHOT`
	and select a timestamp that is a few seconds in the past when the scan starts. On
	the other hand, a machine learning workload that is not ingesting the whole data
	set and is already statistical in nature might not require the scan to be repeatable,
	so you might choose `READ_LATEST` instead for better scan performance.

	[NOTE]
	====
	Kudu also provides replica selection API for users to choose at which replica the
	scan should be performed:

	Java Client:: Call `KuduScannerBuilder#replicaSelection(...)`
	{cpp} Client:: Call `KuduScanner::SetSelection(...)`

	This API is a means to control locality and, in some cases, latency. The replica
	selection API has no effect on the consistency guarantees, which will hold no matter
	which replica is selected.
	====

	[[known_issues]]
	== Known Issues and Limitations

	There are several gaps and corner cases that prevent Kudu from being fully strictly-serializable
	in some situations, at the moment. Below are the details and next, some recommendations.

	[[known_issues_scans]]
	=== Writes

	* Support for `COMMIT_WAIT` is experimental and requires careful tuning of the
	time-synchronization protocol, such as NTP (Network Time Protocol). Its use
	is discouraged in production environments.

	=== Reads (Scans)

	* On a leader change, `READ_AT_SNAPSHOT` scans at a snapshot whose timestamp is beyond the last
	write may also yield non-repeatable reads (see
	link:https://issues.apache.org/jira/browse/KUDU-1188[KUDU-1188]).
	See <<recommendations>> for a workaround.
	* Impala scans are currently performed as `READ_LATEST` and have no consistency
	guarantees.
	* In `AUTO_BACKGROUND_FLUSH` mode, or when using "async" flushing mechanisms,
	writes applied to a single client session may become reordered due to the
	concurrency of flushing the data to the server. This may be particularly
	noticeable if a single row is quickly updated with different values in
	succession. This phenomenon affects all client API implementations.
	Workarounds are described in the API documentation for the respective
	implementations in the docs for `FlushMode` or `AsyncKuduSession`.
	See link:https://issues.apache.org/jira/browse/KUDU-1767[KUDU-1767].

	[[recommendations]]
	=== Recommendations

	* If repeatable snapshot reads are a requirement, use `READ_AT_SNAPSHOT`
	with a timestamp that is slightly in the past (between 2-5 seconds, ideally).
	This will circumvent the anomaly described in <<known_issues_scans>>. Even when the
	anomaly has been addressed, back-dating the timestamp will always make scans
	faster, since they are unlikely to block.

	* If external consistency is a requirement and you decide to use `COMMIT_WAIT`, the
	time-synchronization protocol needs to be tuned carefully. Each transaction will wait
	2x the maximum clock error at the time of execution, which is usually in the 100 msec.
	to 1 sec. range with the default settings, maybe more. Thus, transactions would take at least
	200 msec. to 2 sec. to complete when using the default settings and may even time out.

	** A local server should be used as a time server. We've performed experiments using the default
	NTP time source available in a Google Compute Engine data center and were able to obtain
	a reasonable tight max error bound, usually varying between 12-17 milliseconds.

	** The following parameters should be adjusted in `/etc/ntp.conf` to tighten the maximum error:
	- `server my_server.org iburst minpoll 1 maxpoll 8`
	- `tinker dispersion 500`
	- `tinker allan 0`

	IMPORTANT: The above parameters minimize `maximum error` at the expense of `estimated error`,
	the latter might be orders of magnitude above it's "normal" value. These parameters also
	may place a greater load on the time server, since they make the servers poll much more
	frequently.

	[bibliography]
	== References
	- [[[ht,1]]] David Alves, Todd Lipcon and Vijay Garg. Technical Report: HybridTime - Accessible Global Consistency with High Clock Uncertainty. April, 2014. http://users.ece.utexas.edu/~garg/pdslab/david/hybrid-time-tech-report-01.pdf
	- [[[spanner,2]]] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI'12). USENIX Association, Berkeley, CA, USA, 251-264.
	- [[[fdt,3]]] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 1-12. DOI=10.1145/2213836.2213838 http://doi.acm.org/10.1145/2213836.2213838
	- [[[consensus,4]]] Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC'14), Garth Gibson and Nickolai Zeldovich (Eds.). USENIX Association, Berkeley, CA, USA, 305-320.
	- [[[consistency,5]]] Kwei-Jay Lin, "Consistency issues in real-time database systems," in System Sciences, 1989. Vol.II: Software Track, Proceedings of the Twenty-Second Annual Hawaii International Conference on , vol.2, no., pp.654-661 vol.2, 3-6 Jan 1989 doi: 10.1109/HICSS.1989.48069