= Harry, an Open Source Fuzz Testing and Verification Tool for Apache Cassandra
:page-layout: single-post
:page-role: blog-post
:page-post-date: November 18, 2021
:page-post-author: Alex Petrov
:description: The Apache Cassandra Community
:keywords:

Over the years of working on Apache Cassandra, whether writing tests or trying to reproduce issues, I’ve always found myself repeating the same procedure over and over again: creating a schema, writing loops that generate data, and then either manually reconciling the data to check the results, or comparing the result set against some predetermined expected result. Not only is this approach tedious and time-consuming, but it also does not scale: if some set of operations works for one schema, there’s no way to know whether it will also work for an arbitrary schema, whether it will work if the operations are executed in a different order, or whether it will work if the operations themselves are slightly different.

While preparing Apache Cassandra for the 4.0 release, we made extensive progress in how we test. The new in-tree in-JVM distributed test framework enables us to easily write tests that exercise coordinated query execution code paths, while giving us flexibility and control that was previously offered only by CQLTester, a tool for exercising node-local query paths. Many subsystems were audited and covered with tests. Cassandra users tried the new version out in their clusters and reported their findings. All of these things are useful and important, but we still needed a tool that would give us the same or a higher degree of confidence for every commit, so that we could know that the database is working as expected not only for the exact set of operations exercised by unit and integration tests, but potentially for any use case and combination of operations under circumstances comparable to production.

This all led us to develop Harry, a tool that combines properties of stress- and integration-testing tools: it can generate data for an arbitrary schema, execute data modification queries against the cluster, track the progress of operation execution, and make sure that responses to read queries are correct.

After reading this post, you will understand:

* how Harry generates the data
* how Harry performs verification
* which properties of the values generated by Harry make verification not only possible but also efficient

The primary audience for this post is Cassandra contributors, so you will need to be familiar with Apache Cassandra and its tooling.

== Fuzz testing

Since most bugs are reproduced by taking a sequence of actions following some pattern, we need to specify which actions can be used to lead to a given state. However, there’s a lot of flexibility regarding which values exactly are written in specific columns.

For example, take https://issues.apache.org/jira/browse/CASSANDRA-16453[CASSANDRA-16453,window=_blank], which was reproduced using Harry. The code to reproduce the issue with in-JVM DTests looks something like this:

.Repro.java
[source,java]
----
try (Cluster cluster = init(builder().withNodes(2).start()))
{
    cluster.schemaChange(withKeyspace("CREATE TABLE distributed_test_keyspace.table_0 (pk0 bigint,ck0 bigint,regular0 bigint,regular1 bigint,regular2 bigint, PRIMARY KEY (pk0, ck0)) WITH CLUSTERING ORDER BY (ck0 ASC);"));
    // Range deletion, executed through the coordinator:
    cluster.coordinator(1).execute("DELETE FROM distributed_test_keyspace.table_0 USING TIMESTAMP 1 WHERE pk0=1 AND ck0>2;", ConsistencyLevel.ALL);
    // Partition deletion with the same timestamp, applied only to node 2:
    cluster.get(2).executeInternal("DELETE FROM distributed_test_keyspace.table_0 USING TIMESTAMP 1 WHERE pk0=1;");
    // Read query that surfaces the inconsistency:
    cluster.coordinator(1).execute("SELECT * FROM distributed_test_keyspace.table_0 WHERE pk0=1 AND ck0>=1 AND ck0<3;",
                                   ConsistencyLevel.ALL, 1L, 1L, 3L);
}
----

You can see that, at first glance, there are only three things that are important to reproduce the issue:

1. The table has to have at least one clustering column
2. Two actions are executed against the cluster: a range deletion and a partition deletion
3. Both operations have the same timestamp

The rest of the details do not matter: the size of the cluster, the number of replicas, the clustering order, the consistency level with which operations are executed, the types of clustering keys, the values written, and so on.

The simplest way to cover a case like this with a test is to hardcode the schema and then execute a partition deletion and a range deletion with hardcoded values, much as we did above. This might work, but there’s still a chance that the proposed fix won’t work for some other schema or some combination of values.

To improve the situation, we can express the test in more abstract terms and, instead of writing a repro using specific statements, use only the constraints we’ve specified above:

.HarryDsl.java
[source,java]
----
test(new SchemaGenerators.Builder("harry")
                         .partitionKeySpec(1, 5)
                         .clusteringKeySpec(1, 5)
                         .regularColumnSpec(1, 10)
                         .generator(),
     historyBuilder -> {
         historyBuilder.nextPartition()
                       .simultaneously()
                       .randomOrder()
                       .partitionDeletion()
                       .rangeDeletion()
                       .finish();
     });
----

This spec can be used to generate clusters of different sizes, configured with different schemas, executing the given sequence of actions both in isolation and combined with other randomly generated actions, with failure injection. Best of all, this test will _not only_ ensure that such a sequence of actions does not produce an exception, but also ensure that the cluster responds with correct results to _any_ allowed read query.

== Generating data

Generating random values and sequences of actions and reconciling them during verification is in itself not a difficult task. Making this process time- and memory-efficient is what makes it interesting.

For space efficiency, the log of actions generated by Harry is not kept in memory or saved anywhere on disk, since any generated operation can be reproduced from its sequence number. In Harry, a sequence number consists of two parts: the logical timestamp (LTS, which has a 1:1 mapping to a real-time timestamp) and the modification ID, which allows having multiple uniquely identifiable operations for each logical timestamp. For the sake of simplicity, we’ll just say that each operation is represented by its sequence number / LTS.

In the example above, the operation order is determined by the seed for the given run. Let’s say that the partition deletion is executed first. To produce a `DELETE` statement from it, we need to generate a partition key and get a timestamp. Similarly, to generate a range deletion, we need a partition key, two clustering keys that will serve as the lower and upper bounds for the range tombstone, and a timestamp.

Using the sequence number and knowing the operation type, we can now produce _descriptors_, which are used as the compact internal representation of data in Harry. No matter how many parts it consists of, any partition key is represented by a single `long`. The same is true for clustering keys: any clustering key, single-part or composite, is represented by a single `long` descriptor. If we were to generate an `INSERT` or `UPDATE` operation, each value for a regular or static column would have its own descriptor, since we want to distinguish between two writes made by two different operations.

To summarise, every operation has a sequence number, which determines everything that is required to fully reproduce the operation, including the descriptors that we will later use to generate the values themselves:

* a partition deletion has only a partition descriptor
* a range deletion has a partition descriptor and two clustering descriptors, specifying the tombstone bounds
* an insert or update operation has a partition descriptor, a clustering descriptor, and a set of value descriptors, one for each regular and static column

Using descriptors rather than specific values for verification is extremely useful for efficiency. Instead of comparing potentially large values, we can just compare the two longs that uniquely identify them. This means that we have to have a way not _only_ to generate a value from a descriptor, but _also_ to compute the descriptor a value was generated from.

In Harry, we call such a generator `Bijection<T>`: every bijection can _inflate_ a descriptor into a value of type `T` and _deflate_ a value of type `T` back into the descriptor it was originally generated from.
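
In code, this contract could be sketched as follows (a simplified view; the actual interface in Harry may differ in naming and detail):

[source,java]
----
// Sketch of the bijective generator contract (simplified; Harry's actual
// interface may differ):
public interface Bijection<T> {
    T inflate(long descriptor); // descriptor -> value
    long deflate(T value);      // value -> the descriptor it was generated from
}
----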

== Validating results

Applying a predetermined sequence of operations against a single partition produces some partition state. Knowing the execution status of each operation, we can deterministically compute the state of each node in the cluster and validate the results of any `SELECT` query.

Since we can represent any operation as a sequence of descriptors, we know the order of operations (the timestamp determines it). If we also know the status of each operation (whether or not it has been executed against some node), we can deterministically produce the partition state for any given point in time. Partition state is nothing but a sorted map, where the key is a clustering descriptor and the value is a row state. A row state, in this case, holds value descriptors for each column and the logical timestamps at which the operations were executed:

.PartitionState.java
[source,java]
----
import java.util.Iterator;
import java.util.NavigableMap;

public class PartitionState implements Iterable<PartitionState.RowState> {
    long partitionDescriptor;
    // clustering descriptor -> row state, sorted in clustering order
    NavigableMap<Long, RowState> rowStates;

    public Iterator<RowState> iterator() { return rowStates.values().iterator(); }

    public static class RowState {
        long[] valueDescriptors;  // one value descriptor per column
        long[] logicalTimestamps; // LTS at which each column was last written
    }
}
----

Similarly, since any value written to the database is generated using a bijection, we can produce the partition state from a result set by deflating every value returned by the database into the descriptor it was generated from.

== Generating descriptors

Reproducible operation sequences can be generated from a set of rules that determines what the sequence is going to look like. For example, we can specify probability distributions for each operation type, or give operations relative weights that are internally turned into a distribution. Configuration for an insert/update/delete workload, in which an insert operation (100/251) is twice as probable as a row deletion (50/251) and a hundred times as probable as a partition deletion (1/251), would look like:

----
INSERT: 100
UPDATE: 100
DELETE_ROW: 50
DELETE_PARTITION: 1
----

Since each operation is uniquely determined by its sequence number, we can deterministically compute its operation type from these probability distributions. One way to do this is by using a PCG random number generator, which has some useful properties we’re going to rely on when generating our pseudorandom values.
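
To illustrate, a deterministic weighted choice might look something like the following sketch; the `Rng` interface and method names here are assumptions made for the example, and the weights match the configuration above:

[source,java]
----
enum OperationKind { INSERT, UPDATE, DELETE_ROW, DELETE_PARTITION }

// Assumed shape: an addressable generator returning a uniform pseudorandom
// long for a given sequence number and stream.
interface Rng { long randomNumber(long sequenceNumber, long stream); }

final class OperationSelector {
    // Derive the operation type for a given LTS from the weights above
    // (100 + 100 + 50 + 1 = 251); the same LTS always yields the same type.
    static OperationKind operationType(Rng rng, long lts, long stream) {
        long roll = Math.floorMod(rng.randomNumber(lts, stream), 251);
        if (roll < 100) return OperationKind.INSERT;     // weight 100
        if (roll < 200) return OperationKind.UPDATE;     // weight 100
        if (roll < 250) return OperationKind.DELETE_ROW; // weight 50
        return OperationKind.DELETE_PARTITION;           // weight 1
    }
}
----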

If you’d like to learn more about the mathematical underpinnings of PCG, you should read the paper (https://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf[https://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf,window=_blank]). However, it is not necessary to know any of the internals to use PCG. We need a random number generator with the following properties:

* Long period: the sequence of numbers it produces does not repeat frequently; ideally, the period should be 2^64 when generating a random number from 64 bits of entropy.
* Stream selection: the ability to produce different random sequences from the same seed, identified by some stream ID.
* Addressability: any number produced by the generator can be reproduced from the seed and its sequence number. Ideally, we’d like to have methods such as `long randomNumber(long sequenceNumber, long stream)` and `long sequenceNumber(long randomNumber, long stream)`. In other words, we should be able to determine the sequence number of a random number in the given stream. Using these methods, we can also determine `distance(long x, long y)`: how many random numbers we should skip to get `y` after seeing `x`.
* Walkability: the ability to produce the number immediately following (`long next(long randomNumber, long stream)`) or preceding (`long prev(long randomNumber, long stream)`) the given random number in the random sequence.

You might have noticed that there are two ways to achieve the same thing: we can get a pseudorandom number from some number known to the system either by using `randomNumber(i, stream)` or by using `prev(i, stream)`. Both variants are valid, and both operations can be inverted. We have a slight preference toward using `prev`, since its inverse can be computed in constant time.
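
Putting these properties together, the generator contract could be sketched as follows (method names mirror the prose above; this is an assumed shape, not necessarily Harry’s exact API):

[source,java]
----
// Sketch of an addressable, walkable random number generator contract:
public interface RandomSequence {
    long randomNumber(long sequenceNumber, long stream); // addressability
    long sequenceNumber(long randomNumber, long stream); // inverse of the above
    long next(long randomNumber, long stream);           // walk forward
    long prev(long randomNumber, long stream);           // walk backward

    // How many random numbers separate x and y in the given stream:
    default long distance(long x, long y, long stream) {
        return sequenceNumber(y, stream) - sequenceNumber(x, stream);
    }
}
----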

These properties allow us to reproduce partition state from just the configuration (i.e., known distributions, schema, size of the partition, etc.) and a seed:

* The partition descriptor for the `N`-th operation can be picked as the `M`-th random number in the stream of partition descriptors, where the relation between `N` and `M` is determined by the chosen pattern for visiting partitions.
* The clustering descriptor for the `N`-th operation can be picked as the `M`-th random number in the stream of clustering descriptors **for the given partition**, where the maximum `M` is determined by the maximum partition size, so there can be no more than `max(M)` rows in any generated partition.

One of the simplest useful ways to represent a pattern for picking a descriptor from the sequence is a sliding window. The sliding window begins with a preset number of items in it and visits each item in the current window one or several times in a round-robin fashion. After that, it cycles one of the items out and adds a new one in its place.
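
For illustration, a simple sliding-window rule might map an operation’s sequence number to a position in the descriptor stream like this (an illustrative formula, not necessarily the exact one Harry uses):

[source,java]
----
// A window of `windowSize` positions, each visited in `passes` round-robin
// sweeps before the window slides forward by one position:
static long positionFor(long sequenceNumber, long windowSize, long passes) {
    long visitsPerWindow = windowSize * passes;          // visits before sliding
    long windowStart = sequenceNumber / visitsPerWindow; // how far we have slid
    long offset = sequenceNumber % windowSize;           // round-robin offset
    return windowStart + offset;
}
----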

Once the operation type, partition descriptor, and clustering descriptors are determined, all we have left to cover is how to generate value descriptors for `INSERT` and `UPDATE` operations. The value descriptor for a column is uniquely identified by its sequence number and is bound to the partition descriptor, clustering descriptor, and column.

To summarise, all operations in Harry are deterministic and are represented using their descriptors. Descriptors can be computed hierarchically using the following rules:

* The partition descriptor is picked from the _stream_ of partition descriptors. Its position in that stream is determined by some rule (for example, a sliding window):
+
[source,java]
----
long pd = rng.randomNumber(positionFor(sequenceNumber), PARTITION_DESCRIPTOR_STREAM_ID);
----

* The clustering descriptor is picked from the _stream_ of clustering descriptors **for the given partition**:
+
[source,java]
----
long cd = rng.prev(positionInPartition, pd);
----

* The value descriptor is picked from the _stream_ of descriptors identified by the partition, clustering, and column the value belongs to:
+
[source,java]
----
long vd = rng.randomNumber(sequenceNumber, pd ^ cd ^ col);
----

== Inflation and deflation

We’ve mentioned before that one reason Harry state is so compact and can be validated so efficiently is that every value read from the database can be traced back to the descriptor it was generated from. To achieve this, we generate all values using order-preserving bijections. In other words, for any value generated from a descriptor, it should be possible to quickly find the descriptor the value was generated from, and two values generated from two distinct descriptors should sort the same way as the descriptors themselves.

Implementing an order-preserving bijection for 64-bit longs is trivial and can be achieved by using the identity function. Essentially, any long descriptor _is_ the value it represents:

[source,java]
----
long inflate(long descriptor) {
    return descriptor;
}

long deflate(long value) {
    return value;
}
----

There are many ways to make a bijection for strings. One way is to keep a sorted array of 256 short strings of the same length. When inflating a 64-bit long descriptor into a string, we iterate over the 64 bits, taking 8 bits (one byte) at a time, and use the value of each byte as an index into the array of 256 strings:

[source,java]
----
// `nibbles` is a sorted array of 256 strings of equal length.
String inflate(long descriptor) {
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < Long.BYTES; i++) {
        int idx = getByte(descriptor, i); // i-th byte, most significant first
        builder.append(nibbles[idx]);
    }
    return builder.toString();
}

// Returns the i-th byte of the descriptor, starting from the most
// significant byte, as an index in [0, 255].
int getByte(long descriptor, int i) {
    return (int) ((descriptor >>> (Long.BYTES - i - 1) * 8) & 0xff);
}
----
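
The inverse direction can be sketched similarly: since the `nibbles` array is sorted and all entries have the same length, each chunk of the generated string can be located with a binary search (a sketch that assumes the string was produced by `inflate` above):

[source,java]
----
long deflate(String value) {
    int nibbleLength = nibbles[0].length();
    long descriptor = 0;
    for (int i = 0; i < Long.BYTES; i++) {
        // Recover the byte that produced this fixed-length chunk:
        String chunk = value.substring(i * nibbleLength, (i + 1) * nibbleLength);
        int idx = java.util.Arrays.binarySearch(nibbles, chunk);
        descriptor = (descriptor << Byte.SIZE) | idx;
    }
    return descriptor;
}
----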

One thing we should take into account here is that strings are compared byte-wise, while longs use signed comparison. To make sure generated strings have the same order as descriptors, we need to XOR the sign bit.
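
In code, this adjustment is a single operation (shown here as a standalone sketch):

[source,java]
----
// Flipping the sign bit maps signed long ordering onto the unsigned,
// byte-wise ordering that the generated strings are compared with:
long toByteOrder(long descriptor) {
    return descriptor ^ Long.MIN_VALUE; // XOR the sign bit
}
----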

Any two strings produced by this generator are unique, and we can produce at most 2^64 values, so to generate longer strings we do not even need larger nibbles: we can append random data of arbitrary length to the end of the string. This does not change the order, since the order is determined by the prefix generated from the nibbles, which is unique to each value.

Such simple bijections can represent the data types used for regular and static columns. We’ve previously mentioned that partition and clustering keys are also represented using 64-bit descriptors. Partition and clustering keys are composite: they consist of multiple distinct parts. One way to implement a bijection for a composite type is to “slice” the 64 bits of entropy into smaller chunks, each chunk providing the entropy to generate a different part of the key. Each slice is then inflated using a bijection that corresponds to the part of the key it represents. To convert a value back to its descriptor, we deflate each part of the key and then “stitch” the resulting slices back together into a 64-bit descriptor.

To summarise, key generators are just bijections that generate multiple values from a single 64-bit descriptor instead of one. A simplified and generalized version of such a bijection may look something like this:

[source,java]
----
// `children` holds one bijection per part of the composite key.
Object[] inflate(long descriptor) {
    long[] slices = slice(descriptor);
    Object[] key = new Object[slices.length];
    for (int i = 0; i < slices.length; i++) {
        key[i] = children[i].inflate(slices[i]);
    }
    return key;
}

long deflate(Object[] value) {
    long[] slices = new long[value.length];
    for (int i = 0; i < value.length; i++) {
        slices[i] = children[i].deflate(value[i]);
    }
    return stitch(slices);
}
----
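
For completeness, `slice` and `stitch` could look like this for the simplified case of equal-width slices (an assumption made for the sketch; in practice, slice widths would be chosen according to the entropy each key part requires):

[source,java]
----
// Split a 64-bit descriptor into `parts` equal-width slices, most
// significant slice first:
static long[] slice(long descriptor, int parts) {
    int width = Long.SIZE / parts;
    long mask = (width == Long.SIZE) ? -1L : (1L << width) - 1;
    long[] slices = new long[parts];
    for (int i = 0; i < parts; i++)
        slices[i] = (descriptor >>> (Long.SIZE - width * (i + 1))) & mask;
    return slices;
}

// Inverse of slice (exact when the slice widths sum to 64 bits):
static long stitch(long[] slices) {
    int width = Long.SIZE / slices.length;
    long descriptor = 0;
    for (long slice : slices)
        descriptor = (descriptor << width) | slice;
    return descriptor;
}
----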

Values generated by key generators preserve the order of the descriptors they were generated from, which allows efficiently checking the order of results, comparing clustering descriptors, and validating range deletions.

== Putting it all together

In this post, we’ve learned how the various parts of Harry work, from reproducing a sequence of operations to generating the values themselves. Using this information, we can create a quiescent model checker that can validate the state of the database in the absence of in-flight operations, assuming we know the state of all operations before this moment.

As we’ve discussed, Harry works with reproducible histories of operations, where each operation is identified by the following information:

[source,java]
----
class Operation {
    long lts;           // logical timestamp of the operation
    long pd;            // partition descriptor, derived from LTS
    long cd;            // clustering descriptor, derived from LTS and PD
    OperationKind kind; // operation type, derived from LTS and PD
}
----

Now, all we need to do is produce a sequence of operations. For example, each operation of `INSERT` kind is represented by:

[source,java]
----
class Insert {
    long lts;   // logical timestamp
    long pd;    // partition descriptor
    long cd;    // clustering descriptor
    long[] vds; // value descriptors, one for each column
}
----

We can compile this information into an `INSERT` statement by inflating the partition, clustering, and value descriptors and deriving the real-time timestamp from the logical timestamp. This statement can then be executed against the system under test (our Cassandra cluster).

We know from the history of operations which partitions were visited and which operations were executed against them. To validate the state of any given partition, we first query the database and then deflate every row returned in the result set: all clustering keys are deflated into clustering descriptors, and all values into their corresponding value descriptors, producing an internal `PartitionState`.

To verify that this partition state is correct, we replay the sequence of operations again. But instead of going all the way to generated `INSERT` statements, we operate on descriptors only, applying the operations sequentially to a `PartitionState` following the usual Cassandra reconciliation rules (i.e., last-write-wins; partition tombstone > row tombstone > write) and computing the logical timestamps and value descriptors for each row.
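
As a sketch, replaying a single write into the model state might look like this (a simplified view that ignores tombstones, uses the `PartitionState` shape shown earlier, and treats unwritten columns as written at LTS 0):

[source,java]
----
// Simplified last-write-wins replay of one write (tombstone handling omitted):
static void applyWrite(PartitionState state, long cd, long[] vds, long lts) {
    PartitionState.RowState row =
        state.rowStates.computeIfAbsent(cd, k -> newRowState(vds.length));
    for (int col = 0; col < vds.length; col++) {
        if (lts >= row.logicalTimestamps[col]) { // last write wins
            row.valueDescriptors[col] = vds[col];
            row.logicalTimestamps[col] = lts;
        }
    }
}

static PartitionState.RowState newRowState(int columns) {
    PartitionState.RowState row = new PartitionState.RowState();
    row.valueDescriptors = new long[columns];  // all zeros: nothing written yet
    row.logicalTimestamps = new long[columns];
    return row;
}
----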

Now we have two partition states: one “deflated” from the result set returned by the system under test, and one “inflated” by replaying the sequence of operations. We can directly compare them, and if there are any inconsistencies between the two states, we can conclude there has been an error.

Validating query results by reproducing the sequence of events is more efficient than working with full operation logs that hold every value for every operation executed against the database. This process can be made even more efficient by validating rows individually, keeping only enough state in memory to validate a single row.

Implementing other Cassandra features, such as partition deletions, writes without primary key liveness, static columns, column deletions, and so on, does not require any additional information and follows the same rules, so for the sake of brevity it is not covered in this post.

The quiescent checker is a very useful tool that can validate data sets generated by sequential or concurrent runners, as long as the state of each operation is known at the moment of validation. Since we simply replay the sequence of events to reproduce the state, we cannot have any events whose state is unknown to the system.

== Closing words

Harry is a tool that allows testing databases in ways we weren’t able to before. Instead of creating test cases expressed in specific schemas and statements, it enables us to describe sequences of events we’d like tested and to generate schemas and possible interleavings of statements corresponding to those specifications. Creating exhaustive test suites can take a lot of time, and this way contributor creativity can be poured into patches and features, not into test examples. This approach also allows testing different features in combination and checking old behaviors in the presence of new ones without explicitly creating new tests.

That said, integration testing is not always enough, and often we do not know in advance which areas of the codebase will be problematic. Harry can be helpful here, too, since it will generate data, execute the workload against the system, and validate the results. Contributors can focus on implementing test scenarios such as behavior in the presence of unexpected failures, cluster expansions, or node replacements.

Harry is a productivity tool that encapsulates the properties of a stress test and a model checker, allowing us to find issues and improve the quality of Cassandra in ways that were not possible before.