| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| = jepsen.kudu |
| |
| :author: Kudu Team |
| |
| A link:http://clojure.org[Clojure] library designed to run |
| link:http://kudu.apache.org[Apache Kudu] consistency tests using |
| the link:https://aphyr.com/tags/Jepsen[Jepsen] framework. Currently, a simple |
| linearizability test for a read/write register is implemented and run |
| under several fault-injection scenarios. |
| |
| == Prerequisites and Requirements |
| === Operating System Requirements |
| Only Debian/Ubuntu Linux is supported as a platform for the master and tablet |
| server nodes. Tested to work on Debian 8 Jessie. |
| |
| == Overview |
| The Clojure code is integrated into the project using the |
| link:https://github.com/nebula-plugins/nebula-clojure-plugin[nebula-clojure-plugin]. |
| The kudu-jepsen tests are invoked by executing the `runJepsen` task. |
| The parameters are passed via the standard `-D<property>=<value>` notation. |
| There is a dedicated Clojure wrapper script |
| `kudu_test_runner.clj` in `$KUDU_HOME/java/kudu-jepsen/src/utils` which |
| populates the test environment with appropriate properties and iteratively |
| runs all the registered tests with different nemesis scenarios. |
| |
| == Usage |
| === Building |
| To build the library, the following components are required: |
| |
| * JDK 8 |
| |
| To build the project, run the following in the parent directory |
| (i.e. `$KUDU_HOME/java`): |
| [listing] |
| ---- |
| $ ./gradlew clean assemble |
| ---- |
| |
| === Running |
| The machines for the Kudu master and tserver nodes should be created prior |
| to running the test: the test does not create them itself. The machines should |
| be up and running when the test starts. |
| |
| To run the test, the following components are required on the control node: |
| |
| * JDK 8 |
| * SSH client (and optionally, SSH authentication agent) |
| * gnuplot (to visualize test results) |
| |
| Jepsen uses SSH to perform operations on the DB nodes. kudu-jepsen assumes |
| that the SSH keys are installed accordingly: |
| |
| * The public part of the SSH key should be added to the `authorized_keys` file |
| of the `root` user on all DB nodes. |
| * For the SSH private key, the options are: |
| ** Add the key to the SSH authentication agent running on the control node. |
| ** Specify the path to the file with the key in plain (non-encrypted) format |
| via the `sshKeyPath` property. |
| |
| If an SSH authentication agent holds the SSH key for access to the DB nodes, |
| run the following in the parent directory: |
| [listing] |
| ---- |
| $ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0" |
| ---- |
| |
| If not using an SSH authentication agent, specify the location of the file with |
| the SSH private key via the `sshKeyPath` property: |
| [listing] |
| ---- |
| $ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0" \ |
| -DsshKeyPath="/home/user/.ssh/vm_root_id_rsa" |
| ---- |
| |
| Note that commas (not spaces) are used to separate the names of the nodes. The |
| DNS resolver should be properly configured to resolve the specified hostnames |
| into IP addresses. |
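
Both requirements (name resolution and password-less `root` access) can be checked from the control node before starting a run. The following is a minimal sketch, not part of kudu-jepsen: the `split_nodes` and `check_nodes` helpers are hypothetical, and the check assumes that `getent` is available on the control node and that the SSH key is held by the authentication agent or is a default identity.

```shell
# Hypothetical pre-flight helpers; not part of kudu-jepsen.

# Print each name of a comma-separated node list on its own line.
split_nodes() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Check that every node resolves and accepts a non-interactive root login.
# Returns non-zero if any node fails either check.
check_nodes() {
  rc=0
  for node in $(split_nodes "$1"); do
    if ! getent hosts "$node" >/dev/null; then
      echo "cannot resolve: $node" >&2
      rc=1
    elif ! ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$node" true; then
      echo "cannot ssh as root: $node" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `check_nodes "t0,t1,t2,t3,t4,m0"` would report every node that fails either check. `BatchMode=yes` makes `ssh` fail instead of prompting for a password, which is roughly the condition an unattended run depends on.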
| |
| The `tserverNodes` property specifies the set of nodes on which to run |
| Kudu tablet servers. The `masterNodes` property specifies the set of |
| nodes on which to run Kudu master servers. |
| |
| In Jepsen terminology, the Kudu master and tserver nodes play the |
| *Jepsen DB node* role. The machine where the above-mentioned Gradle command |
| is run plays the *Jepsen control node* role. |
| |
| === A reference script to build Kudu and run Jepsen tests |
| The following link:../../src/kudu/scripts/jepsen.sh[Bourne-again shell script] |
| can be used as a reference to build Kudu from source and run Jepsen tests. |
| |
| === Troubleshooting |
| When Jepsen's analysis doesn't find inconsistencies in the history of operations, |
| it outputs the following at the end of a test: |
| [listing] |
| ---- |
| Everything looks good! ヽ(‘ー`)ノ |
| ---- |
| |
| However, a test run does not always end this way. When a test fails, it's |
| crucial to understand why. |
| |
| The majority of kudu-jepsen test failures fall into two classes: |
| |
| * An error happened while setting up the testing environment, contacting |
| machines in the Kudu cluster, starting up Kudu server-side components, or in |
| one of the third-party components that Jepsen uses (like clj-ssh). |
| * Jepsen's analysis detected an inconsistent history of operations. |
| |
| The former class of failures might be a manifestation of a wrong configuration, |
| a problem with the test environment, a bug in the test code itself, or some |
| other intermittent failure. Usually, encountering such an issue means the |
| consistency analysis (which is the last step of a test scenario) cannot run. |
| Such issues are reported as _errors_ in the summary message. E.g., the |
| summary message below reports 10 errors in 10 tests run: |
| [listing] |
| ---- |
| 21:41:42 Ran 10 tests containing 10 assertions. |
| 21:41:42 0 failures, 10 errors. |
| ---- |
| To get more details, take a closer look at the output of `./gradlew runJepsen` |
| or at the particular `jepsen.log` files in the |
| `$KUDU_HOME/java/kudu-jepsen/store/rw-register/<test_timestamp>` directory. A |
| quick way to locate the corresponding section in the error log is to search for |
| the `^ERROR in \(` regex pattern. An example of an error message from Jepsen's output: |
| [listing] |
| ---- |
| ERROR in (register-test-tserver-random-halves) (KuduException.java:110) |
| expected: (:valid? (:results (jepsen/run! (tcasefun opts)))) |
| actual: org.apache.kudu.client.NonRecoverableException: can not complete before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=28, DeadlineTracker(timeout=30000, elapsed=28571), ... |
| ---- |
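
The search can be scripted across all test runs at once. The sketch below assumes that the `ERROR in (` lines end up in the `jepsen.log` files; if they appear only in the Gradle output, the same `grep` pattern can be applied to a saved copy of that output instead. The `find_jepsen_errors` helper name is hypothetical.

```shell
# Hypothetical helper: list "ERROR in (" lines in all jepsen.log files
# below the given directory, prefixed with file name and line number.
find_jepsen_errors() {
  find "$1" -name jepsen.log -exec grep -HnE '^ERROR in \(' {} +
}
```

A possible invocation is `find_jepsen_errors "$KUDU_HOME/java/kudu-jepsen/store/rw-register"`. The function prints nothing when no matching line is found.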
| |
| The latter class represents a more serious issue: a manifestation of a |
| non-linearizable history of operations. This is reported as a _failure_ in the |
| summary message. E.g., the summary message below reports 2 instances |
| of non-linearizable history among 10 tests run: |
| [listing] |
| ---- |
| 22:21:52 Ran 10 tests containing 10 assertions. |
| 22:21:52 2 failures, 0 errors. |
| ---- |
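
When the test output is saved to a file, the two classes can be told apart mechanically from the summary line. The following is a rough sketch that relies only on the `N failures, M errors.` format shown in the examples above; the `classify_summary` helper and its output labels are hypothetical.

```shell
# Hypothetical helper: classify a summary line such as
# "22:21:52 2 failures, 0 errors." by its failure and error counts.
# Prints one of: failure, error, ok, unknown.
classify_summary() {
  counts=$(printf '%s\n' "$1" |
    sed -nE 's/(^|.*[^0-9])([0-9]+) failures?, ([0-9]+) errors?.*/\2 \3/p')
  if [ -z "$counts" ]; then
    echo "unknown"        # not a summary line
    return 1
  fi
  failures=${counts% *}   # first number: non-linearizable histories
  errors=${counts#* }     # second number: environment or test errors
  if [ "$failures" -gt 0 ]; then
    echo "failure"
  elif [ "$errors" -gt 0 ]; then
    echo "error"
  else
    echo "ok"
  fi
}
```

Failures are reported first because a non-linearizable history is the more serious outcome, even when the same run also contains errors.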
| |
| If Jepsen's analysis finds a non-linearizable history of operations, it outputs |
| the following at the end of a test: |
| [listing] |
| ---- |
| Analysis invalid! (ノಥ益ಥ)ノ ┻━┻ |
| ---- |
| To troubleshoot, first find where the failed test stores its results. For a |
| linearizability failure in one of the `rw-register` test scenarios, that is |
| one of the timestamp-named sub-directories (e.g. `20170109T071938.000-0800`) |
| under `$KUDU_HOME/java/kudu-jepsen/store/rw-register`. One possible way |
| to find the directory: |
| [listing] |
| ---- |
| $ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register |
| $ find . -name jepsen.log | xargs grep 'Analysis invalid' |
| ./20170109T071938.000-0800/jepsen.log:Analysis invalid! (ノಥ益ಥ)ノ ┻━┻ |
| $ |
| ---- |
| Another way is to find sub-directories where the `linear.svg` file is present: |
| [listing] |
| ---- |
| $ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register |
| $ find . -name linear.svg |
| ./20170109T071938.000-0800/linear.svg |
| $ |
| ---- |
| Along with the `jepsen.log` and `history.txt` files, the failed test generates |
| a `linear.svg` file (gnuplot is required for that). The diagram in `linear.svg` |
| illustrates the part of the history that Jepsen found inconsistent: it shows |
| the relationship between time, client operation status, and system state, |
| along with the sequences of legal and illegal operation paths. From this point, |
| the next step is to locate the corresponding part of the history in the |
| `history.txt` file. Usually the problem appears around an activation interval |
| of the test nemesis scenario. Once found, it's possible to tie the vicinity of |
| the inconsistent operation sequence to the timestamps in the `jepsen.log` file. |
| With the timestamps of the operations and their sequence, it's possible to |
| find the related messages in the `kudu-tserver.log` and `kudu-master.log` files |
| in the sub-directories named after the Kudu cluster nodes. Hopefully, that |
| information is enough to create a reproducible scenario for further |
| troubleshooting and debugging. |