java/kudu-jepsen/README.adoc - kudu - Git at Google

 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 = jepsen.kudu

 :author: Kudu Team

 A link:http://clojure.org[Clojure] library designed to run
 link:http://kudu.apache.org[Apache Kudu] consistency tests using
 the link:https://aphyr.com/tags/Jepsen[Jepsen] framework. Currently, a simple
 linearizability test for read/write register is implemented and run
 for several fault injection scenarios.

 == Prerequisites and Requirements
 === Operating System Requirements
 Only Debian/Ubuntu Linux is supported as a platform for the master and tablet
 server nodes. Tested to work on Debian 8 Jessie.

 == Overview
 The Clojure code is integrated into the project using the
 link:https://github.com/nebula-plugins/nebula-clojure-plugin[nebula-clojure-plugin].
 The kudu-jepsen tests are invoked by executing the `runJepsen` task.
 The parameters are passed via the standard `-D<property>=<value>` notation.
 There is a dedicated Clojure wrapper script
 `kudu_test_runner.clj` in `$KUDU_HOME/java/kudu-jepsen/src/utils` which
 populates the test environment with appropriate properties and iteratively
 runs all the registered tests with different nemeses scenarios.

 == Usage
 === Building
 To build the library the following components are required:

 * JDK 8

 To build the project, run in the parent directory (i.e. `$KUDU_HOME/java`)
 [listing]
 ----
 $ ./gradlew clean assemble
 ----

 === Running
 The machines for Kudu master and tserver nodes should be created prior
 to running the test: the tests does not create those itself. The machines should
 be up and running when starting the test.

 To run the test, the following components are required at the control node:

 * JDK 8
 * SSH client (and optionally, SSH authentication agent)
 * gnuplot (to visualize test results)

 Jepsen uses SSH to perform operations at DB nodes. The kudu-jepsen assumes
 that SSH keys are installed accordingly:

 * The public part of the SSH key should be added into the `authorized_keys` file
   at all DB nodes for the `root` user
 * For the SSH private key the options are:
 ** Add the key to the SSH authentication agent running at the control node
 ** Specify the path to the file with the key in plain (non-encrypted) format
    via the `sshKeyPath` property.

 If using SSH authentication agent to hold the SSH key for DB nodes access,
 run in the parent directory:
 [listing]
 ----
 $ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0"
 ----

 If not using SSH authentication agent, specify the location of the file with
 SSH private key via the `sshKeyPath` property:
 [listing]
 ----
 $ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0" \
   -DsshKeyPath="/home/user/.ssh/vm_root_id_rsa"
 ----

 Note that commas (not spaces) are used to separate the names of the nodes. The
 DNS resolver should be properly configured to resolve the specified hostnames
 into IP addresses.

 The `tserverNodes` property is used to specify the set of nodes where to run
 Kudu tablet servers. The `masterNodes` property is used to specify the set of
 nodes to run Kudu master servers.

 In the Jepsen terminology, Kudu master and tserver nodes are playing
 *Jepsen DB node* roles. The machine where the above mentioned Gradle command
 is run plays *Jepsen control node* role.

 === A reference script to build Kudu and run Jepsen tests
 The following link:../../src/kudu/scripts/jepsen.sh[Bourne-again shell script]
 can be used as a reference to build Kudu from source and run Jepsen tests.

 === Troubleshooting
 When Jepsen's analysis doesn't find inconsistencies in the history of operations
 it outputs the following in the end of a test:
 [listing]
 ----
 Everything looks good! ヽ(‘ー`)ノ
 ----

 However, it might not be the case. If so, it's crucial to understand why the
 test failed.

 The majority of the kudu-jepsen test failures can be put into two classification
 buckets:

 * An error happened while setting up the testing environment, contacting
   machines at the Kudu cluster, starting up Kudu server-side components, or in
   any of the other third-party components the Jepsen uses (like clj-ssh), etc.
 * The Jepsen's analysis detected inconsistent history of operations.

 The former class of failures might be a manifestation of wrong configuration,
 a problem with the test environment, a bug in the test code itself or some
 other intermittent failure. Usually, encountering issues like that means the
 consistency analysis (which is the last step of a test scenario) cannot run.
 Such issues are reported as _errors_ in the summary message. E.g., the example
 summary message below reports on 10 errors in 10 tests ran:
 [listing]
 ----
 21:41:42 Ran 10  tests containing 10 assertions.
 21:41:42 0 failures, 10 errors.
 ----
 To get more details, take a closer look at the output of `./gradlew runJepsen`
 or at particular `jepsen.log` files in
 `$KUDU_HOME/java/kudu-jepsen/store/rw-register/<test_timestamp>` directory. A
 quick way to locate the corresponding section in the error log is to search for
 `^ERROR in \(` regex pattern. An example of error message from Jepsen's output:
 [listing]
 ----
 ERROR in (register-test-tserver-random-halves) (KuduException.java:110)
 expected: (:valid? (:results (jepsen/run! (tcasefun opts))))
   actual: org.apache.kudu.client.NonRecoverableException: can not complete before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=28, DeadlineTracker(timeout=30000, elapsed=28571), ...
 ----

 The latter class represents more serious issue: a manifestation of
 non-linearizable history of operations. This is reported as _failure_ in the
 summary message. E.g., the summary message below reports finding 2 instances
 of non-linearizable history among 10 tests ran:
 [listing]
 ----
 22:21:52 Ran 10  tests containing 10 assertions.
 22:21:52 2 failures, 0 errors.
 ----

 If Jepsen's analysis finds non-linearizable history of operations, it outputs
 the following in the end of a test:
 [listing]
 ----
 Analysis invalid! (ﾉಥ益ಥ）ﾉ ┻━┻
 ----
 To troubleshoot, first it's necessary to find where the failed test stores
 the results: it should be one of the timestamp-named sub-directories
 (e.g. `20170109T071938.000-0800`) under
 `$KUDU_HOME/java/kudu-jepsen/store/rw-register` in case of a linearizability
 failure in one of the `rw-register` test scenarios. One of the possible ways
 to find the directory:
 [listing]
 ----
 $ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
 $ find . -name jepsen.log | xargs grep 'Analysis invalid'
 ./20170109T071938.000-0800/jepsen.log:Analysis invalid! (ﾉಥ益ಥ）ﾉ ┻━┻
 $
 ----
 Another way is to find sub-directories where the `linear.svg` file is present:
 [listing]
 ----
 $ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
 $ find . -name linear.svg
 ./20170109T071938.000-0800/linear.svg
 $
 ----
 Along with `jepsen.log` and `history.txt` files the failed test generates
 `linear.svg` file (gnuplot is required for that). The diagram in `linear.svg`
 illustrates the part of the history which Jepsen found inconsistent:
 the diagram shows the time/client operation status/system state relationship
 and the sequences of legal/illegal operations paths. From this point, the next
 step is to locate the corresponding part of the history in the `history.txt`
 file. Usually the problem appears around an activation interval of the test
 nemesis scenario. Once found, it's possible to tie the vicinity of the
 inconsistent operation sequence with the timestamps in the `jepsen.log` file.
 Having the timestamps of the operations and their sequence, it's possible to
 find relative messages in `kudu-tserver.log` and `kudu-master.log` log files
 in sub-directories named as Kudu cluster nodes. Hopefully, that information
 is enough to create a reproducible scenario for further troubleshooting
 and debugging.
	// Licensed to the Apache Software Foundation (ASF) under one
	// or more contributor license agreements. See the NOTICE file
	// distributed with this work for additional information
	// regarding copyright ownership. The ASF licenses this file
	// to you under the Apache License, Version 2.0 (the
	// "License"); you may not use this file except in compliance
	// with the License. You may obtain a copy of the License at
	//
	// http://www.apache.org/licenses/LICENSE-2.0
	//
	// Unless required by applicable law or agreed to in writing,
	// software distributed under the License is distributed on an
	// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	// KIND, either express or implied. See the License for the
	// specific language governing permissions and limitations
	// under the License.

	= jepsen.kudu

	:author: Kudu Team

	A link:http://clojure.org[Clojure] library designed to run
	link:http://kudu.apache.org[Apache Kudu] consistency tests using
	the link:https://aphyr.com/tags/Jepsen[Jepsen] framework. Currently, a simple
	linearizability test for read/write register is implemented and run
	for several fault injection scenarios.

	== Prerequisites and Requirements
	=== Operating System Requirements
	Only Debian/Ubuntu Linux is supported as a platform for the master and tablet
	server nodes. Tested to work on Debian 8 Jessie.

	== Overview
	The Clojure code is integrated into the project using the
	link:https://github.com/nebula-plugins/nebula-clojure-plugin[nebula-clojure-plugin].
	The kudu-jepsen tests are invoked by executing the `runJepsen` task.
	The parameters are passed via the standard `-D<property>=<value>` notation.
	There is a dedicated Clojure wrapper script
	`kudu_test_runner.clj` in `$KUDU_HOME/java/kudu-jepsen/src/utils` which
	populates the test environment with appropriate properties and iteratively
	runs all the registered tests with different nemeses scenarios.

	== Usage
	=== Building
	To build the library the following components are required:

	* JDK 8

	To build the project, run in the parent directory (i.e. `$KUDU_HOME/java`)
	[listing]
	----
	$ ./gradlew clean assemble
	----

	=== Running
	The machines for Kudu master and tserver nodes should be created prior
	to running the test: the tests does not create those itself. The machines should
	be up and running when starting the test.

	To run the test, the following components are required at the control node:

	* JDK 8
	* SSH client (and optionally, SSH authentication agent)
	* gnuplot (to visualize test results)

	Jepsen uses SSH to perform operations at DB nodes. The kudu-jepsen assumes
	that SSH keys are installed accordingly:

	* The public part of the SSH key should be added into the `authorized_keys` file
	at all DB nodes for the `root` user
	* For the SSH private key the options are:
	** Add the key to the SSH authentication agent running at the control node
	** Specify the path to the file with the key in plain (non-encrypted) format
	via the `sshKeyPath` property.

	If using SSH authentication agent to hold the SSH key for DB nodes access,
	run in the parent directory:
	[listing]
	----
	$ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0"
	----

	If not using SSH authentication agent, specify the location of the file with
	SSH private key via the `sshKeyPath` property:
	[listing]
	----
	$ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0" \
	-DsshKeyPath="/home/user/.ssh/vm_root_id_rsa"
	----

	Note that commas (not spaces) are used to separate the names of the nodes. The
	DNS resolver should be properly configured to resolve the specified hostnames
	into IP addresses.

	The `tserverNodes` property is used to specify the set of nodes where to run
	Kudu tablet servers. The `masterNodes` property is used to specify the set of
	nodes to run Kudu master servers.

	In the Jepsen terminology, Kudu master and tserver nodes are playing
	Jepsen DB node roles. The machine where the above mentioned Gradle command
	is run plays Jepsen control node role.

	=== A reference script to build Kudu and run Jepsen tests
	The following link:../../src/kudu/scripts/jepsen.sh[Bourne-again shell script]
	can be used as a reference to build Kudu from source and run Jepsen tests.

	=== Troubleshooting
	When Jepsen's analysis doesn't find inconsistencies in the history of operations
	it outputs the following in the end of a test:
	[listing]
	----
	Everything looks good! ヽ(‘ー`)ノ
	----

	However, it might not be the case. If so, it's crucial to understand why the
	test failed.

	The majority of the kudu-jepsen test failures can be put into two classification
	buckets:

	* An error happened while setting up the testing environment, contacting
	machines at the Kudu cluster, starting up Kudu server-side components, or in
	any of the other third-party components the Jepsen uses (like clj-ssh), etc.
	* The Jepsen's analysis detected inconsistent history of operations.

	The former class of failures might be a manifestation of wrong configuration,
	a problem with the test environment, a bug in the test code itself or some
	other intermittent failure. Usually, encountering issues like that means the
	consistency analysis (which is the last step of a test scenario) cannot run.
	Such issues are reported as _errors_ in the summary message. E.g., the example
	summary message below reports on 10 errors in 10 tests ran:
	[listing]
	----
	21:41:42 Ran 10 tests containing 10 assertions.
	21:41:42 0 failures, 10 errors.
	----
	To get more details, take a closer look at the output of `./gradlew runJepsen`
	or at particular `jepsen.log` files in
	`$KUDU_HOME/java/kudu-jepsen/store/rw-register/<test_timestamp>` directory. A
	quick way to locate the corresponding section in the error log is to search for
	`^ERROR in \(` regex pattern. An example of error message from Jepsen's output:
	[listing]
	----
	ERROR in (register-test-tserver-random-halves) (KuduException.java:110)
	expected: (:valid? (:results (jepsen/run! (tcasefun opts))))
	actual: org.apache.kudu.client.NonRecoverableException: can not complete before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=28, DeadlineTracker(timeout=30000, elapsed=28571), ...
	----

	The latter class represents more serious issue: a manifestation of
	non-linearizable history of operations. This is reported as _failure_ in the
	summary message. E.g., the summary message below reports finding 2 instances
	of non-linearizable history among 10 tests ran:
	[listing]
	----
	22:21:52 Ran 10 tests containing 10 assertions.
	22:21:52 2 failures, 0 errors.
	----

	If Jepsen's analysis finds non-linearizable history of operations, it outputs
	the following in the end of a test:
	[listing]
	----
	Analysis invalid! (ﾉಥ益ಥ）ﾉ ┻━┻
	----
	To troubleshoot, first it's necessary to find where the failed test stores
	the results: it should be one of the timestamp-named sub-directories
	(e.g. `20170109T071938.000-0800`) under
	`$KUDU_HOME/java/kudu-jepsen/store/rw-register` in case of a linearizability
	failure in one of the `rw-register` test scenarios. One of the possible ways
	to find the directory:
	[listing]
	----
	$ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
	$ find . -name jepsen.log \| xargs grep 'Analysis invalid'
	./20170109T071938.000-0800/jepsen.log:Analysis invalid! (ﾉಥ益ಥ）ﾉ ┻━┻
	$
	----
	Another way is to find sub-directories where the `linear.svg` file is present:
	[listing]
	----
	$ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
	$ find . -name linear.svg
	./20170109T071938.000-0800/linear.svg
	$
	----
	Along with `jepsen.log` and `history.txt` files the failed test generates
	`linear.svg` file (gnuplot is required for that). The diagram in `linear.svg`
	illustrates the part of the history which Jepsen found inconsistent:
	the diagram shows the time/client operation status/system state relationship
	and the sequences of legal/illegal operations paths. From this point, the next
	step is to locate the corresponding part of the history in the `history.txt`
	file. Usually the problem appears around an activation interval of the test
	nemesis scenario. Once found, it's possible to tie the vicinity of the
	inconsistent operation sequence with the timestamps in the `jepsen.log` file.
	Having the timestamps of the operations and their sequence, it's possible to
	find relative messages in `kudu-tserver.log` and `kudu-master.log` log files
	in sub-directories named as Kudu cluster nodes. Hopefully, that information
	is enough to create a reproducible scenario for further troubleshooting
	and debugging.