blob: 827bd5e5a8b49f382912f6fde4c7644aefcfd6da [file] [log] [blame]
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
= jepsen.kudu
:author: Kudu Team
A link:http://clojure.org[Clojure] library designed to run
link:http://kudu.apache.org[Apache Kudu] consistency tests using
the link:https://aphyr.com/tags/Jepsen[Jepsen] framework. Currently, a simple
linearizability test for read/write register is implemented and run
for several fault injection scenarios.
== Prerequisites and Requirements
=== Operating System Requirements
Only Debian/Ubuntu Linux is supported as a platform for the master and tablet
server nodes. Tested to work on Debian 8 Jessie.
== Overview
The Clojure code is integrated into the project using the
link:https://github.com/nebula-plugins/nebula-clojure-plugin[nebula-clojure-plugin].
The kudu-jepsen tests are invoked by executing the `runJepsen` task.
The parameters are passed via the standard `-D<property>=<value>` notation.
There is a dedicated Clojure wrapper script
`kudu_test_runner.clj` in `$KUDU_HOME/java/kudu-jepsen/src/utils` which
populates the test environment with appropriate properties and iteratively
runs all the registered tests with different nemeses scenarios.
== Usage
=== Building
To build the library the following components are required:
* JDK 8
To build the project, run in the parent directory (i.e. `$KUDU_HOME/java`)
[listing]
----
$ ./gradlew clean assemble
----
=== Running
The machines for Kudu master and tserver nodes should be created prior
to running the test: the tests does not create those itself. The machines should
be up and running when starting the test.
To run the test, the following components are required at the control node:
* JDK 8
* SSH client (and optionally, SSH authentication agent)
* gnuplot (to visualize test results)
Jepsen uses SSH to perform operations at DB nodes. The kudu-jepsen assumes
that SSH keys are installed accordingly:
* The public part of the SSH key should be added into the `authorized_keys` file
at all DB nodes for the `root` user
* For the SSH private key the options are:
** Add the key to the SSH authentication agent running at the control node
** Specify the path to the file with the key in plain (non-encrypted) format
via the `sshKeyPath` property.
If using SSH authentication agent to hold the SSH key for DB nodes access,
run in the parent directory:
[listing]
----
$ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0"
----
If not using SSH authentication agent, specify the location of the file with
SSH private key via the `sshKeyPath` property:
[listing]
----
$ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0" \
-DsshKeyPath="/home/user/.ssh/vm_root_id_rsa"
----
Note that commas (not spaces) are used to separate the names of the nodes. The
DNS resolver should be properly configured to resolve the specified hostnames
into IP addresses.
The `tserverNodes` property is used to specify the set of nodes where to run
Kudu tablet servers. The `masterNodes` property is used to specify the set of
nodes to run Kudu master servers.
In the Jepsen terminology, Kudu master and tserver nodes are playing
*Jepsen DB node* roles. The machine where the above mentioned Gradle command
is run plays *Jepsen control node* role.
=== A reference script to build Kudu and run Jepsen tests
The following link:../../src/kudu/scripts/jepsen.sh[Bourne-again shell script]
can be used as a reference to build Kudu from source and run Jepsen tests.
=== Troubleshooting
When Jepsen's analysis doesn't find inconsistencies in the history of operations
it outputs the following in the end of a test:
[listing]
----
Everything looks good! ヽ(‘ー`)ノ
----
However, it might not be the case. If so, it's crucial to understand why the
test failed.
The majority of the kudu-jepsen test failures can be put into two classification
buckets:
* An error happened while setting up the testing environment, contacting
machines at the Kudu cluster, starting up Kudu server-side components, or in
any of the other third-party components the Jepsen uses (like clj-ssh), etc.
* The Jepsen's analysis detected inconsistent history of operations.
The former class of failures might be a manifestation of wrong configuration,
a problem with the test environment, a bug in the test code itself or some
other intermittent failure. Usually, encountering issues like that means the
consistency analysis (which is the last step of a test scenario) cannot run.
Such issues are reported as _errors_ in the summary message. E.g., the example
summary message below reports on 10 errors in 10 tests ran:
[listing]
----
21:41:42 Ran 10 tests containing 10 assertions.
21:41:42 0 failures, 10 errors.
----
To get more details, take a closer look at the output of `./gradlew runJepsen`
or at particular `jepsen.log` files in
`$KUDU_HOME/java/kudu-jepsen/store/rw-register/<test_timestamp>` directory. A
quick way to locate the corresponding section in the error log is to search for
`^ERROR in \(` regex pattern. An example of error message from Jepsen's output:
[listing]
----
ERROR in (register-test-tserver-random-halves) (KuduException.java:110)
expected: (:valid? (:results (jepsen/run! (tcasefun opts))))
actual: org.apache.kudu.client.NonRecoverableException: can not complete before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=28, DeadlineTracker(timeout=30000, elapsed=28571), ...
----
The latter class represents more serious issue: a manifestation of
non-linearizable history of operations. This is reported as _failure_ in the
summary message. E.g., the summary message below reports finding 2 instances
of non-linearizable history among 10 tests ran:
[listing]
----
22:21:52 Ran 10 tests containing 10 assertions.
22:21:52 2 failures, 0 errors.
----
If Jepsen's analysis finds non-linearizable history of operations, it outputs
the following in the end of a test:
[listing]
----
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
----
To troubleshoot, first it's necessary to find where the failed test stores
the results: it should be one of the timestamp-named sub-directories
(e.g. `20170109T071938.000-0800`) under
`$KUDU_HOME/java/kudu-jepsen/store/rw-register` in case of a linearizability
failure in one of the `rw-register` test scenarios. One of the possible ways
to find the directory:
[listing]
----
$ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
$ find . -name jepsen.log | xargs grep 'Analysis invalid'
./20170109T071938.000-0800/jepsen.log:Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
$
----
Another way is to find sub-directories where the `linear.svg` file is present:
[listing]
----
$ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
$ find . -name linear.svg
./20170109T071938.000-0800/linear.svg
$
----
Along with `jepsen.log` and `history.txt` files the failed test generates
`linear.svg` file (gnuplot is required for that). The diagram in `linear.svg`
illustrates the part of the history which Jepsen found inconsistent:
the diagram shows the time/client operation status/system state relationship
and the sequences of legal/illegal operations paths. From this point, the next
step is to locate the corresponding part of the history in the `history.txt`
file. Usually the problem appears around an activation interval of the test
nemesis scenario. Once found, it's possible to tie the vicinity of the
inconsistent operation sequence with the timestamps in the `jepsen.log` file.
Having the timestamps of the operations and their sequence, it's possible to
find relative messages in `kudu-tserver.log` and `kudu-master.log` log files
in sub-directories named as Kudu cluster nodes. Hopefully, that information
is enough to create a reproducible scenario for further troubleshooting
and debugging.