blob: f8921029bd78653f78cdd1b258bae32548dc5ac0 [file] [log] [blame]
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="As of September 1st, the Apache Cassandra community has shifted the focus of Cassandra 4.0 development from new feature work to testing, validation, and hard...">
<meta name="keywords" content="cassandra, apache, apache cassandra, distributed storage, key value store, scalability, bigtable, dynamo" />
<meta name="robots" content="index,follow" />
<meta name="language" content="en" />
<title>Finding Bugs in Cassandra&#39;s Internals with Property-based Testing</title>
<link rel="canonical" href="http://cassandra.apache.org/blog/2018/10/17/finding_bugs_with_property_based_testing.html">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css" integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7" crossorigin="anonymous">
<link rel="stylesheet" href="./../../../../css/style.css">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.2.0/css/all.css" integrity="sha384-hWVjflwFxL6sNzntih27bfxkr27PmbbK/iSvJ+a4+0owXq79v+lsFkW54bOGbiDQ" crossorigin="anonymous">
<link type="application/atom+xml" rel="alternate" href="http://cassandra.apache.org/feed.xml" title="Apache Cassandra Website" />
</head>
<body>
<!-- breadcrumbs -->
<div class="topnav">
<div class="container breadcrumb-container">
<ul class="breadcrumb">
<li>
<div class="dropdown">
<img class="asf-logo" src="./../../../../img/asf_feather.png" />
<a data-toggle="dropdown" href="#">Apache Software Foundation <span class="caret"></span></a>
<ul class="dropdown-menu" role="menu" aria-labelledby="dLabel">
<li><a href="http://www.apache.org">Apache Homepage</a></li>
<li><a href="http://www.apache.org/licenses/">License</a></li>
<li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
<li><a href="http://www.apache.org/security/">Security</a></li>
</ul>
</div>
</li>
<li><a href="./../../../../">Apache Cassandra</a></li>
<li>Finding Bugs in Cassandra's Internals with Property-based Testing</li>
</ul>
</div>
<!-- navbar -->
<nav class="navbar navbar-default navbar-static-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#cassandra-menu" aria-expanded="false">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="./../../../../"><img src="./../../../../img/cassandra_logo.png" alt="Apache Cassandra logo" /></a>
</div><!-- /.navbar-header -->
<div id="cassandra-menu" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li><a href="./../../../../">Home</a></li>
<li><a href="./../../../../download/">Download</a></li>
<li><a href="./../../../../doc/">Documentation</a></li>
<li><a href="./../../../../community/">Community</a></li>
<li>
<a href="./../../../../blog">Blog</a>
</li>
</ul>
</div><!-- /#cassandra-menu -->
</div>
</nav><!-- /.navbar -->
</div><!-- /.topnav -->
<div class="content">
<div class="container">
<h2>Finding Bugs in Cassandra's Internals with Property-based Testing</h2>
<p>Posted on October 17, 2018 by the Apache Cassandra Community</p>
<h5><a href="/blog">&laquo; Back to the Apache Cassandra Blog</a></h5>
<hr />
<p>As of September 1st, the Apache Cassandra community has shifted the focus of Cassandra 4.0 development from new feature work to testing, validation, and hardening, with the goal of releasing a stable 4.0 that every Cassandra user, from small deployments to large corporations, can deploy with confidence. There are several projects and methodologies that the community is undertaking to this end. One of these is the adoption of property-based testing, which was <a href="http://cassandra.apache.org/blog/2018/08/21/testing_apache_cassandra.html">previously introduced here</a>. This post will take a look at a specific use of this approach and how it found a bug in a new feature meant to ensure data integrity between the client and Cassandra.</p>
<h4 id="detecting-corruption-is-a-property">Detecting Corruption is a Property</h4>
<p>In this post, we demonstrate property-based testing in Cassandra through the integration of the <a href="https://github.com/ncredinburgh/QuickTheories">QuickTheories</a> library introduced as part of the work done for <a href="https://issues.apache.org/jira/browse/CASSANDRA-13304">CASSANDRA-13304</a>.</p>
<p>This ticket modifies the framing of Cassandra’s native client protocol to include checksums in addition to the existing, optional compression. Clients can opt-in to this new feature to retain data integrity across the many hops between themselves and Cassandra. This is meant to address cases where hardware and protocol level checksums fail (due to underlying hardware issues) — a case that has been seen in production. A description of the protocol changes can be found in the ticket but for the purposes of this discussion the salient part is that two checksums are added: one that covers the length(s) of the data (if compressed there are two lengths), and one for the data itself. Before merging this feature, property-based testing using QuickTheories was used to uncover a bug in the calculation of the checksum over the lengths. This bug could have led to silent corruption at worst or unexpected errors during deserialization at best.</p>
<p>The test used to find this bug is shown below. This example tests the property that when a frame is corrupted, that corruption should be caught by checksum comparison. The test is wrapped inside of a standard JUnit test case but, once called by JUnit, execution is handed over to QuickTheories to generate and execute hundreds of examples. These examples are dictated by the types of input that should be generated (the arguments to <code class="highlighter-rouge">forAll</code>). The execution of each individual example is done by <code class="highlighter-rouge">checkAssert</code> and its argument, the <code class="highlighter-rouge">roundTripWithCorruption</code> function.</p>
<div><div><pre>
@Test
public void corruptionCausesFailure()
{
qt().withExamples(500)
.forAll(inputWithCorruptablePosition(),
integers().between(0, Byte.MAX_VALUE).map(Integer::byteValue),
compressors(),
checksumTypes())
.checkAssert(this::roundTripWithCorruption);
}
</pre></div></div>
<p>The <code class="highlighter-rouge">roundTripWithCorruption</code> function is a generalization of a unit test that worked similarly but for a single case. It is given an input to transform and a position in the transformed output to insert corruption, as well as what byte to write to the corrupted position. The additional arguments (the compressor and checksum type) are used to ensure coverage of Cassandra’s various compression and checksumming implementations.</p>
<div><div><pre>
private void roundTripWithCorruption(Pair&lt;String, Integer&gt; inputAndCorruptablePosition,
byte corruptionValue,
Compressor compressor,
ChecksumType checksum) {
String input = inputAndCorruptablePosition.left;
ByteBuf expectedBuf = Unpooled.wrappedBuffer(input.getBytes());
int byteToCorrupt = inputAndCorruptablePosition.right;
ChecksummingTransformer transformer = new ChecksummingTransformer(checksum, DEFAULT_BLOCK_SIZE, compressor);
ByteBuf outbound = transformer.transformOutbound(expectedBuf);
// make sure we're actually expecting to produce some corruption
if (outbound.getByte(byteToCorrupt) == corruptionValue)
return;
if (byteToCorrupt &gt;= outbound.writerIndex())
return;
try {
int oldIndex = outbound.writerIndex();
outbound.writerIndex(byteToCorrupt);
outbound.writeByte(corruptionValue);
outbound.writerIndex(oldIndex);
ByteBuf inbound = transformer.transformInbound(outbound, FLAGS);
// verify that the content was actually corrupted
expectedBuf.readerIndex(0);
Assert.assertEquals(expectedBuf, inbound);
} catch(ProtocolException e) {
return;
}
}
</pre></div></div>
<p>The remaining piece is how those arguments are generated — the arguments to <code class="highlighter-rouge">forAll</code> mentioned above. Each argument is a function that returns an input generator. For each example, an input is pulled from each generator and passed to <code class="highlighter-rouge">roundTripWithCorruption</code>. The <code class="highlighter-rouge">compressors()</code> and <code class="highlighter-rouge">checksums()</code> generators aren’t copied here. They can be found in the <a href="https://github.com/apache/cassandra/blob/65fb17a88bd096b1e952ccca31ad709759644a1b/test/unit/org/apache/cassandra/transport/frame/checksum/ChecksummingTransformerTest.java#L209-L217">source</a> and are based on built-in generator methods, provided by QuickTheories, that select a value from a list of values. The second argument, <code class="highlighter-rouge">integers().between(0, Byte.MAX_VALUE).map(Integer::byteValue)</code>, generates non-negative numbers that fit into a single byte. These numbers will be passed as the <code class="highlighter-rouge">corruptionValue</code> argument.</p>
<p>The <code class="highlighter-rouge">inputWithCorruptiblePosition</code> generator, copied below, generates strings to use as input to the transformation function and a position within the output byte stream to corrupt. Because compression prevents knowledge of the output size of the frame, the generator tries to choose a somewhat reasonable position to corrupt by limiting the choice to the size of the generated string (it’s uncommon for compression to generate a larger string and the implementation discards the compressed value if it does). It also avoids corrupting the first two bytes of the stream which are not covered by a checksum and therefore can be corrupted without being caught. The function above ensures that corruption is actually introduced and that corrupting a position larger than the size of the output does not occur.</p>
<div><div><pre>
private Gen&lt;Pair&lt;String, Integer&gt;&gt; inputWithCorruptablePosition()
{
return inputs().flatMap(s -&gt; integers().between(2, s.length() + 2)
.map(i -&gt; Pair.create(s, i)));
}
</pre></div></div>
<p>With all those pieces in place, if the test were run before the bug were fixed, it would fail with the following output.</p>
<div><div><pre>
java.lang.AssertionError: Property falsified after 2 example(s)
Smallest found falsifying value(s) :-
{(c,3), 0, null, Adler32}
Cause was :-
java.lang.IndexOutOfBoundsException: readerIndex(10) + length(16711681) exceeds writerIndex(15): UnpooledHeapByteBuf(ridx: 10, widx: 15, cap: 54/54)
at io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1401)
at io.netty.buffer.AbstractByteBuf.checkReadableBytes(AbstractByteBuf.java:1388)
at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:870)
at org.apache.cassandra.transport.frame.checksum.ChecksummingTransformer.transformInbound(ChecksummingTransformer.java:289)
at org.apache.cassandra.transport.frame.checksum.ChecksummingTransformerTest.roundTripWithCorruption(ChecksummingTransformerTest.java:106)
...
Other found falsifying value(s) :-
{(c,3), 0, null, CRC32}
{(c,3), 1, null, CRC32}
{(c,3), 9, null, CRC32}
{(c,3), 11, null, CRC32}
{(c,3), 36, null, CRC32}
{(c,3), 50, null, CRC32}
{(c,3), 74, null, CRC32}
{(c,3), 99, null, CRC32}
Seed was 179207634899674
</pre></div></div>
<p>The output shows more than a single failing example. This is because QuickTheories, like most property-based testing libraries, comes with a shrinker, which performs the task of taking a failure and minimizing its inputs. This aids in debugging because there are multiple failing examples to look at often removing noise in the process. Additionally, a seed value is provided so the same series of tests and failures can be generated again — another useful feature when debugging. In this case, the library generated an example that contains a single byte of input, which will corrupt the fourth byte in the output stream by setting it to zero, using no compression, and using Adler32 for checksumming. It can be seen from the other failing examples that using CRC32 also fails. This is due to improper calculation of the checksum, regardless of the algorithm. In particular, the checksum was only calculated over the least significant byte of each length rather than all eight bytes. By corrupting the fourth byte of the output stream (the first length’s second-most significant byte not covered by the calculation), an invalid length is read and later used.</p>
<h4 id="where-to-find-more">Where to Find More</h4>
<p>Property-based testing is a broad topic, much of which is not covered by this post. In addition to Cassandra, it has been used successfully in several places including <a href="https://ieeexplore.ieee.org/document/7107466/">car</a> <a href="https://arxiv.org/pdf/1703.06574.pdf">operating
systems</a> and <a href="https://youtu.be/hXnS_Xjwk2Y?t=1023">suppliers’ products</a>, <a href="https://dl.acm.org/citation.cfm?id=2034662">GNOME Glib</a>, <a href="https://github.com/WesleyAC/raft/tree/master/src">distributed consensus</a>, and other <a href="https://www.youtube.com/watch?v=x9mW54GJpG0">distributed</a> <a href="https://youtu.be/hXnS_Xjwk2Y?t=1382">databases</a>. It can also be combined with other approaches such as fault-injection and memory leak detection. Stateful models can also be built to generate a series of commands instead of running each example on one generated set of inputs. Our goal is to evangelize this approach within the Cassandra developer community and encourage more testing of this kind as part of our work to deliver the most stable major release of Cassandra yet.</p>
</div>
</div>
<hr />
<footer>
<div class="container">
<div class="col-md-4 social-blk">
<span class="social">
<a href="https://twitter.com/cassandra"
class="twitter-follow-button"
data-show-count="false" data-size="large">Follow @cassandra</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<a href="https://twitter.com/intent/tweet?button_hashtag=cassandra"
class="twitter-hashtag-button"
data-size="large"
data-related="ApacheCassandra">Tweet #cassandra</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
</span>
<a class="subscribe-rss icon-link" href="/feed.xml" title="Subscribe to Blog via RSS">
<span><i class="fa fa-rss"></i></span>
</a>
</div>
<div class="col-md-8 trademark">
<p>&copy; 2016 <a href="http://apache.org">The Apache Software Foundation</a>.
Apache, the Apache feather logo, and Apache Cassandra are trademarks of The Apache Software Foundation.
<p>
</div>
</div><!-- /.container -->
</footer>
<!-- Javascript. Placed here so pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script src="./../../../../js/underscore-min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js" integrity="sha384-0mSbJDEHialfmuBBQP6A4Qrprq5OVfW37PRR3j5ELqxss1yVqOtnepnHVP9aJ7xS" crossorigin="anonymous"></script>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
try {
var pageTracker = _gat._getTracker("UA-11583863-1");
pageTracker._trackPageview();
} catch(err) {}
</script>
</body>
</html>