<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link href="http://cassandra.apache.org/feed.xml" rel="self" type="application/atom+xml" /><link href="http://cassandra.apache.org/" rel="alternate" type="text/html" /><updated>2019-12-04T19:20:21+00:00</updated><id>http://cassandra.apache.org/</id><title type="html">Apache Cassandra Website</title><subtitle>The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
</subtitle><entry><title type="html">Even Higher Availability with 5x Faster Streaming in Cassandra 4.0</title><link href="http://cassandra.apache.org/blog/2019/04/09/benchmarking_streaming.html" rel="alternate" type="text/html" title="Even Higher Availability with 5x Faster Streaming in Cassandra 4.0" /><published>2019-04-09T08:00:00+00:00</published><updated>2019-04-09T08:00:00+00:00</updated><id>http://cassandra.apache.org/blog/2019/04/09/benchmarking_streaming</id><content type="html" xml:base="http://cassandra.apache.org/blog/2019/04/09/benchmarking_streaming.html">&lt;p&gt;Streaming is a process where nodes of a cluster exchange data in the form of SSTables. Streaming can kick in during many situations such as bootstrap, repair, rebuild, range movement, cluster expansion, etc. In this post, we discuss the massive performance improvements made to the streaming process in Apache Cassandra 4.0.&lt;/p&gt;
&lt;h2 id=&quot;high-availability&quot;&gt;High Availability&lt;/h2&gt;
&lt;p&gt;As we know, Cassandra is a highly available, eventually consistent database. It maintains its legendary availability by storing redundant copies of data in nodes known as replicas, usually running on commodity hardware. During normal operations, these replicas may develop hardware issues that cause them to fail. As a result, we need to replace them with new nodes on fresh hardware.&lt;/p&gt;
&lt;p&gt;As part of this replacement operation, the new Cassandra node streams data from the neighboring nodes that hold copies of the data belonging to this new node’s token range. Depending on the amount of data stored, this process can require substantial network bandwidth, taking some time to complete. The longer these types of operations take, the more we are exposing ourselves to loss of availability. Depending on your replication factor and consistency requirements, if another node fails during this replacement operation, availability will be impacted.&lt;/p&gt;
&lt;h2 id=&quot;increasing-availability&quot;&gt;Increasing Availability&lt;/h2&gt;
&lt;p&gt;To minimize the failure window, we want to make these operations as fast as possible. The faster the new node completes streaming its data, the faster it can serve traffic, increasing the availability of the cluster. Towards this goal, Cassandra 4.0 saw the addition of &lt;a href=&quot;https://en.wikipedia.org/wiki/Zero-copy&quot;&gt;Zero Copy&lt;/a&gt; streaming. For more details on Cassandra’s zero copy implementation, see this &lt;a href=&quot;../../../2018/08/07/faster_streaming_in_cassandra.html&quot;&gt;blog post&lt;/a&gt; and &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-14556&quot;&gt;CASSANDRA-14556&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;talking-numbers&quot;&gt;Talking Numbers&lt;/h2&gt;
&lt;p&gt;To quantify the results of these improvements, we, at Netflix, measured the performance impact of streaming in 4.0 vs 3.0, using our open source &lt;a href=&quot;https://github.com/Netflix/ndbench&quot;&gt;NDBench&lt;/a&gt; benchmarking tool with the CassJavaDriverGeneric plugin. Though we knew there would be improvements, we were still amazed by the overall result: a &lt;strong&gt;fivefold increase&lt;/strong&gt; in streaming performance. The test setup and operations are all detailed below.&lt;/p&gt;
&lt;h3 id=&quot;test-setup&quot;&gt;Test Setup&lt;/h3&gt;
&lt;p&gt;In our test setup, we used the following configurations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6-node clusters on i3.xl, i3.2xl, i3.4xl and i3.8xl EC2 instances, each on 3.0 and trunk (sha dd7ec5a2d6736b26d3c5f137388f2d0028df7a03).&lt;/li&gt;
&lt;li&gt;Table schema&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;&lt;pre&gt;
CREATE TABLE testing.test (
key text,
column1 int,
value text,
PRIMARY KEY (key, column1)
) WITH CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'enabled': 'false'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Data size per node: 500GB&lt;/li&gt;
&lt;li&gt;No. of tokens per node: 1 (no vnodes)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To trigger the streaming process we used the following steps in each of the clusters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;terminate a node&lt;/li&gt;
&lt;li&gt;add a new node as a replacement&lt;/li&gt;
&lt;li&gt;measure the time the new node takes to finish streaming the data of the terminated node it replaces&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For each cluster and version, we repeated this exercise multiple times to collect several samples.&lt;/p&gt;
&lt;p&gt;Below is the distribution of streaming times we found across the clusters:
&lt;img src=&quot;/img/blog-post-benchmarking-streaming/cassandra_streaming.png&quot; alt=&quot;Benchmark results&quot; title=&quot;Benchmark results&quot; /&gt;&lt;/p&gt;
&lt;h3 id=&quot;interpreting-the-results&quot;&gt;Interpreting the Results&lt;/h3&gt;
&lt;p&gt;The graph above supports several conclusions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3.0 streaming times are inconsistent and show a high degree of variability (fat distributions across multiple samples)&lt;/li&gt;
&lt;li&gt;3.0 streaming is highly affected by the instance type and generally looks CPU bound&lt;/li&gt;
&lt;li&gt;Zero Copy streaming is approximately 5x faster&lt;/li&gt;
&lt;li&gt;Zero Copy streaming time shows little variability in its performance (thin distributions across multiple samples)&lt;/li&gt;
&lt;li&gt;Zero Copy streaming performance is not CPU bound and remains consistent across instance types&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is clear from the performance test results that Zero Copy Streaming has a huge performance benefit over the current streaming infrastructure in Cassandra. But what does it mean in the real world? The following key points are the main takeaways.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MTTR (Mean Time to Recovery):&lt;/strong&gt; MTTR is a KPI (Key Performance Indicator) that is used to measure how quickly a system recovers from a failure. Zero Copy Streaming has a very direct impact here with a &lt;strong&gt;fivefold improvement&lt;/strong&gt; in performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Costs:&lt;/strong&gt; Zero Copy Streaming is ~5x faster. For some organizations this translates directly into cost savings, primarily by reducing the need to maintain spare server or cloud capacity. In other situations where you’re migrating data to larger instance types or moving AZs or DCs, this means that instances that are sending data can be turned off sooner, saving costs. An added cost benefit is that you no longer have to over-provision the instance. You get similar streaming performance whether you use an i3.xl or an i3.8xl, provided the bandwidth is available to the instance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Risk Reduction:&lt;/strong&gt; Zero Copy Streaming also greatly reduces risk. Since a cluster’s recovery mainly depends on the streaming speed, Cassandra clusters with failed nodes will be able to recover much more quickly (5x faster). This means the window of vulnerability is reduced significantly, in some situations down to a few minutes.&lt;/p&gt;
&lt;p&gt;Finally, a benefit that we generally don’t talk about is the environmental benefit of this change. Zero Copy Streaming enables us to move data very quickly through the cluster. It objectively reduces the number and sizes of instances that are used to build a Cassandra cluster. As a result, not only does it reduce Cassandra’s TCO (Total Cost of Ownership), it also helps the environment by consuming fewer resources!&lt;/p&gt;</content><author><name>The Apache Cassandra Community</name></author><summary type="html">Streaming is a process where nodes of a cluster exchange data in the form of SSTables. Streaming can kick in during many situations such as bootstrap, repair, rebuild, range movement, cluster expansion, etc. In this post, we discuss the massive performance improvements made to the streaming process in Apache Cassandra 4.0.</summary></entry><entry><title type="html">Introducing Transient Replication</title><link href="http://cassandra.apache.org/blog/2018/12/03/introducing-transient-replication.html" rel="alternate" type="text/html" title="Introducing Transient Replication" /><published>2018-12-03T08:00:00+00:00</published><updated>2018-12-03T08:00:00+00:00</updated><id>http://cassandra.apache.org/blog/2018/12/03/introducing-transient-replication</id><content type="html" xml:base="http://cassandra.apache.org/blog/2018/12/03/introducing-transient-replication.html">&lt;p&gt;Transient Replication is a new experimental feature soon to be available in 4.0. When enabled, it allows for the creation of keyspaces where replication factor can be specified as a number of copies (full replicas) and temporary copies (transient replicas). Transient replicas retain the data they replicate only long enough for it to be propagated to full replicas, via incremental repair, at which point the data is deleted.
Writing to transient replicas can be avoided almost entirely if monotonic reads are not required because it is possible to achieve a quorum of acknowledged writes without them.&lt;/p&gt;
&lt;p&gt;This results in a savings in disk space, CPU, and IO. By deleting data as soon as it is no longer needed, transient replicas require only a fraction of the disk space of a full replica. By not having to store the data indefinitely, the CPU and IO required for compaction is reduced, and read queries are faster as they have less data to process.&lt;/p&gt;
&lt;p&gt;So what are the benefits of not actually keeping a full copy of the data? Well, for some installations and use cases, transient replicas can be almost free if &lt;a href=&quot;https://en.wikipedia.org/wiki/Consistency_model#Monotonic_Read_Consistency&quot;&gt;monotonic reads&lt;/a&gt; are disabled. In future releases where monotonic reads are supported with Transient Replication, enabling monotonic reads would reduce the savings in CPU and IO, but even then they should still be significant.&lt;/p&gt;
&lt;p&gt;Transient Replication is designed to be transparent to applications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Consistency levels continue to produce the same results for queries.&lt;/li&gt;
&lt;li&gt;The number of replicas that can be lost before data loss occurs is unchanged.&lt;/li&gt;
&lt;li&gt;The number of replicas that can be unavailable before some queries start to time out or return unavailable is unchanged (with the exception of ONE).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With Transient Replication, you can go from 3 replicas to 5 replicas, two of which are transient, without adding any hardware.&lt;/p&gt;
&lt;p&gt;If you are running an active-passive 2 DC setup with 3 replicas in each DC, you can make one replica in each DC transient and still have four full copies of the data in total.&lt;/p&gt;
&lt;h2 id=&quot;feature-support&quot;&gt;Feature support&lt;/h2&gt;
&lt;p&gt;Transient Replication is not intended to fully replace Cassandra’s existing approach to replication. There are features that currently don’t work with transiently replicated keyspaces and features that are unlikely ever to work with them.&lt;/p&gt;
&lt;p&gt;You can have keyspaces with and without Transient Replication enabled in the same cluster, so it is possible to use Transient Replication for just the use cases that are a good fit for the currently available functionality.&lt;/p&gt;
&lt;h3 id=&quot;currently-unsupported-but-coming&quot;&gt;Currently unsupported but coming:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Monotonic reads&lt;/li&gt;
&lt;li&gt;Batch log&lt;/li&gt;
&lt;li&gt;LWT&lt;/li&gt;
&lt;li&gt;Counters&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;will-never-be-supported&quot;&gt;Will never be supported:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Secondary indexes&lt;/li&gt;
&lt;li&gt;Materialized views&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;how-transient-replication-works&quot;&gt;How Transient Replication works&lt;/h2&gt;
&lt;h3 id=&quot;overview&quot;&gt;Overview&lt;/h3&gt;
&lt;p&gt;Transient replication extends Cassandra’s existing consistent hashing algorithm to designate some replicas of a point or range on the consistent hash ring as transient and some as full. The following image depicts a consistent hash ring with three replicas &lt;strong&gt;A&lt;/strong&gt;, &lt;strong&gt;B&lt;/strong&gt;, and &lt;strong&gt;C&lt;/strong&gt;. The replicas are located at tokens 5, 10, 15 respectively. A key &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt; hashes to token 3 on the ring.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-post-introducing-transient-replication/diagram-hash-ring.gif&quot; alt=&quot;A consistent hash ring without Transient Replication&quot; title=&quot;A consistent hash ring without Transient Replication&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Replicas are selected by walking the ring clockwise starting at the point on the ring the key hashes to. At RF=3, the replicas of key &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt; are &lt;strong&gt;A&lt;/strong&gt;, &lt;strong&gt;B&lt;/strong&gt;, &lt;strong&gt;C&lt;/strong&gt;.
With Transient Replication, the last N replicas (where N is the configured number of transient replicas) found while walking the ring are designated as transient.&lt;/p&gt;
&lt;p&gt;No node is designated as a transient replica or a full replica outright. Every node fully replicates some ranges on the ring and transiently replicates others.&lt;/p&gt;
&lt;p&gt;The following image depicts a consistent hash ring at RF=3/1 (three replicas, one of which is transient). The replicas of &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt; are still &lt;strong&gt;A&lt;/strong&gt;, &lt;strong&gt;B&lt;/strong&gt;, and &lt;strong&gt;C&lt;/strong&gt;, but &lt;strong&gt;C&lt;/strong&gt; is now transiently replicating &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-post-introducing-transient-replication/diagram-hash-ring-with-transient-replica.gif&quot; alt=&quot;A consistent hash ring with Transient Replication&quot; title=&quot;A consistent hash ring with Transient Replication&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Normally all replicas of a range receive all writes for that range, as depicted in the following image.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-post-introducing-transient-replication/diagram-regular-write.gif&quot; alt=&quot;Normal write behavior&quot; title=&quot;Normal write behavior&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Transient replicas do not receive writes in the normal write path.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-post-introducing-transient-replication/diagram-transient-write.gif&quot; alt=&quot;Transient write behavior&quot; title=&quot;Transient write behavior&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If sufficient full replicas are unavailable, transient replicas will receive writes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-post-introducing-transient-replication/diagram-transient-write-down-node.gif&quot; alt=&quot;Transient write with unavailable node&quot; title=&quot;Transient write with unavailable node&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This optimization, which is possible with Transient Replication, is called Cheap Quorums. This minimizes the amount of work that transient replicas have to do at write time, and reduces the amount of background compaction they will have to do.&lt;/p&gt;
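&lt;p&gt;As a rough worked example of why full replicas alone can form a quorum (assuming the usual quorum size of floor(RF / 2) + 1, which this post does not spell out):&lt;/p&gt;
&lt;div&gt;&lt;pre&gt;
RF = 3/1: full replicas = 3 - 1 = 2, quorum = floor(3 / 2) + 1 = 2
RF = 5/2: full replicas = 5 - 2 = 3, quorum = floor(5 / 2) + 1 = 3
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In both cases the full replicas are exactly enough to acknowledge a quorum write, so transient replicas only need to be written to when a full replica is unavailable or slow to respond.&lt;/p&gt;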
&lt;p&gt;&lt;strong&gt;Cheap Quorums and monotonic reads:&lt;/strong&gt; Cheap Quorums may end up being incompatible with an initial implementation of monotonic reads, and operators will be able to make a conscious trade off between performance and monotonic reads.&lt;/p&gt;
&lt;h3 id=&quot;rapid-write-protection&quot;&gt;Rapid write protection&lt;/h3&gt;
&lt;p&gt;In keyspaces utilizing Transient Replication, writes are sent to every full replica and enough transient replicas to meet the requested consistency level (to make up for unavailable full replicas). In addition, enough transient replicas are selected to reach a quorum in every datacenter, though unless the consistency level requires it, the write will be acknowledged without ensuring all have been delivered.&lt;/p&gt;
&lt;p&gt;Because not all replicas are sent the write, it’s possible that insufficient replicas will respond, causing timeouts. To prevent this, we implement rapid write protection, similar to rapid read protection, that sends writes to additional replicas if sufficient acknowledgements to meet the consistency level are not received promptly.&lt;/p&gt;
&lt;p&gt;The following animation shows rapid write protection in action.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-post-introducing-transient-replication/diagram-rapid-write-protection.gif&quot; alt=&quot;Animation of rapid write protection preventing a write timeout&quot; title=&quot;Rapid write protection preventing a write timeout&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Rapid write protection is configured similarly to rapid read protection using the table option &lt;code class=&quot;highlighter-rouge&quot;&gt;additional_write_policy&lt;/code&gt;. The policy determines how long to wait for acknowledgements before sending additional mutations. The default is to wait for P99 of the observed latency.&lt;/p&gt;
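&lt;p&gt;As a minimal sketch, the policy could be set per table as shown below. The table name &lt;code class=&quot;highlighter-rouge&quot;&gt;foo.bar&lt;/code&gt; is hypothetical, and the values are assumed to follow the same percentile and fixed-time forms accepted by &lt;code class=&quot;highlighter-rouge&quot;&gt;speculative_retry&lt;/code&gt;; check the documentation of your release for the exact syntax.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Hypothetical table: wait for the 99th percentile of observed latency (the default described above)
ALTER TABLE foo.bar WITH additional_write_policy = '99p';
-- Assumed alternative: wait a fixed 50 milliseconds before sending additional mutations
ALTER TABLE foo.bar WITH additional_write_policy = '50ms';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;A shorter wait sends more writes to transient replicas, trading some of the disk and compaction savings for fewer write timeouts.&lt;/p&gt;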
&lt;h3 id=&quot;incremental-repair&quot;&gt;Incremental repair&lt;/h3&gt;
&lt;p&gt;Incremental repair is used to clean up transient data at transient replicas and propagate it to full replicas.&lt;/p&gt;
&lt;p&gt;When incremental repair occurs, transient replicas stream out transient data but don’t receive any. Anti-compaction is used to separate transient and fully replicated data so that only fully replicated data is retained once incremental repair completes.&lt;/p&gt;
&lt;p&gt;The result of running an incremental repair is that all full replicas for a range are synchronized and can be used interchangeably to retrieve the repaired data set for a query.&lt;/p&gt;
&lt;h3 id=&quot;read-path&quot;&gt;Read path&lt;/h3&gt;
&lt;p&gt;Reads must always include at least one full replica and can include as many replicas (transient or full) as necessary to achieve the desired consistency level. At least one full replica is required in order to provide the data not available at transient replicas, but it doesn’t matter which full replica is picked because incremental repair synchronizes the repaired data set across full replicas.&lt;/p&gt;
&lt;p&gt;Reads at transient replicas are faster than reads at full replicas: if monotonic reads are disabled, transient replicas haven’t been receiving writes and are therefore unlikely to return any results.&lt;/p&gt;
&lt;h2 id=&quot;creating-keyspaces-with-transient-replication&quot;&gt;Creating keyspaces with Transient Replication&lt;/h2&gt;
&lt;p&gt;Transient Replication is supported by SimpleStrategy and NetworkTopologyStrategy. When specifying the replication factor, you can specify the number of transient replicas in addition to the total number of replicas (including transient replicas). The syntax for a replication factor of 3 replicas total with one of them being transient would be “3/1”.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ALTER KEYSPACE foo WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : '3/1'};
ALTER KEYSPACE foo WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : '3/1'};
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Monotonic reads are not supported with Transient Replication in 4.0, so any existing tables in the keyspace must have monotonic reads disabled by setting &lt;code class=&quot;highlighter-rouge&quot;&gt;read_repair = 'NONE'&lt;/code&gt;.&lt;/p&gt;
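&lt;p&gt;For an existing table, that setting might be applied as in the following sketch. The table name &lt;code class=&quot;highlighter-rouge&quot;&gt;foo.bar&lt;/code&gt; is hypothetical; repeat the statement for every table in the keyspace.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Hypothetical table in keyspace foo: disable monotonic reads before enabling Transient Replication
ALTER TABLE foo.bar WITH read_repair = 'NONE';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;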
&lt;p&gt;Once the keyspace has been altered, you will need to run incremental repair and then nodetool cleanup to ensure transient data is cleaned up.&lt;/p&gt;
&lt;h2 id=&quot;operational-matters&quot;&gt;Operational matters&lt;/h2&gt;
&lt;p&gt;Transient replication requires rolling incremental repair to be run regularly in order to move data from transient replicas to full replicas. By default transient replicas will receive 1% of writes for transiently replicated ranges due to rapid write protection. If a node is down for an extended period of time, its transient replicas will receive additional write load and that data should be cleaned up using incremental repair. Running incremental repair regularly will ensure that the size of each repair is small.&lt;/p&gt;
&lt;p&gt;It’s also a good idea to run a small number of vnodes with transient replication so that when a node goes down the load is spread out over several other nodes that transiently replicate that range. Large numbers of vnodes are known to be problematic, so it’s best to start with a cluster that is already close to or at its maximum size so that a small number of vnodes will be sufficient. If you intend to grow the cluster in the future, you will need to be cognizant of how this will interact with the number of vnodes you select.&lt;/p&gt;
&lt;p&gt;While the odds of any data loss should multiple nodes be permanently lost remain the same with transient replication, the magnitude of potential data loss does not. With 3/1 transient replication the permanent loss of two nodes could result in the loss of the entirety of the repaired data set. If you are running a multi-DC setup with a high level of replication such as 2 DCs, with 3/1 replicas in each, then you will have 4 full copies total and the added risk of transient replication is minimal.&lt;/p&gt;
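&lt;p&gt;The two-DC layout described above could be expressed as in the following sketch, which reuses the replication syntax from earlier in this post. The keyspace name &lt;code class=&quot;highlighter-rouge&quot;&gt;foo&lt;/code&gt; and the datacenter names &lt;code class=&quot;highlighter-rouge&quot;&gt;DC1&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;DC2&lt;/code&gt; are placeholders for your own.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Placeholder names: 3 replicas per DC, 1 of them transient, leaving 2 + 2 = 4 full copies in total
ALTER KEYSPACE foo WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : '3/1', 'DC2' : '3/1'};
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;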
&lt;h2 id=&quot;experimental-features&quot;&gt;Experimental features&lt;/h2&gt;
&lt;p&gt;Experimental features are a relatively new idea for Apache Cassandra. Although we recently voted to make materialized views an experimental feature retroactively, Transient Replication is the first experimental feature to be introduced as such.&lt;/p&gt;
&lt;p&gt;The goal of introducing experimental features is to allow for incremental development across multiple releases. In the case of Transient Replication, we can avoid a giant code drop that heavily modifies the code base, and the associated risks with incorporating a new feature that way.&lt;/p&gt;
&lt;p&gt;What it means for a feature to be experimental doesn’t have a set definition, but for Transient Replication it’s intended to set expectations. As of 4.0, Transient Replication’s intended audience is expert operators of Cassandra with the ability to write the book on how to safely deploy Transient Replication, debug any issues that result, and if necessary contribute code back to address problems as they are discovered.&lt;/p&gt;
&lt;p&gt;It’s expected that the feature set for Transient Replication will not change in minor updates to 4.0, but eventually it should be ready for use by a wider audience.&lt;/p&gt;
&lt;h2 id=&quot;next-steps-for-transient-replication&quot;&gt;Next steps for Transient Replication&lt;/h2&gt;
&lt;p&gt;If increasing availability or saving on capacity sounds good to you, then you can help make transient replication production-ready by testing it out or even deploying it. Experience and feedback from the community is one of the things that will drive transient replication bug fixing and development.&lt;/p&gt;</content><author><name>The Apache Cassandra Community</name></author><summary type="html">Transient Replication is a new experimental feature soon to be available in 4.0. When enabled, it allows for the creation of keyspaces where replication factor can be specified as a number of copies (full replicas) and temporary copies (transient replicas). Transient replicas retain the data they replicate only long enough for it to be propagated to full replicas, via incremental repair, at which point the data is deleted. Writing to transient replicas can be avoided almost entirely if monotonic reads are not required because it is possible to achieve a quorum of acknowledged writes without them.</summary></entry><entry><title type="html">Audit Logging in Apache Cassandra 4.0</title><link href="http://cassandra.apache.org/blog/2018/10/29/audit_logging_cassandra.html" rel="alternate" type="text/html" title="Audit Logging in Apache Cassandra 4.0" /><published>2018-10-29T07:00:00+00:00</published><updated>2018-10-29T07:00:00+00:00</updated><id>http://cassandra.apache.org/blog/2018/10/29/audit_logging_cassandra</id><content type="html" xml:base="http://cassandra.apache.org/blog/2018/10/29/audit_logging_cassandra.html">&lt;p&gt;Database audit logging is an industry standard tool for enterprises to
capture critical data change events including what data changed and who
triggered the event. These captured records can then be reviewed later
to ensure compliance with regulatory, security and operational policies.&lt;/p&gt;
&lt;p&gt;Prior to Apache Cassandra 4.0, the open source community did not have a
good way of tracking such critical database activity. With this goal in
mind, Netflix implemented
&lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-12151&quot;&gt;CASSANDRA-12151&lt;/a&gt;
so that users of Cassandra would have a simple yet powerful audit
logging tool built into their database out of the box.&lt;/p&gt;
&lt;h2 id=&quot;why-are-audit-logs-important&quot;&gt;Why are Audit Logs Important?&lt;/h2&gt;
&lt;p&gt;Audit logging database activity is one of the key components for making
a database truly ready for the enterprise. Audit logging is generally
useful but enterprises frequently use it for:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Regulatory compliance with laws such as &lt;a href=&quot;https://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act&quot;&gt;SOX&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Payment_Card_Industry_Data_Security_Standard&quot;&gt;PCI&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/General_Data_Protection_Regulation&quot;&gt;GDPR&lt;/a&gt; et al. These types of compliance are crucial for companies that are traded on public stock exchanges, hold payment information such as credit cards, or retain private user information.&lt;/li&gt;
&lt;li&gt;Security compliance. Companies often have strict rules for what data can be accessed by which employees, both to protect the privacy of users and to limit the probability of a data breach.&lt;/li&gt;
&lt;li&gt;Debugging complex data corruption bugs such as those found in massively distributed microservice architectures like Netflix’s.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;why-is-audit-logging-difficult&quot;&gt;Why is Audit Logging Difficult?&lt;/h2&gt;
&lt;p&gt;Implementing a simple logger in the request (inbound/outbound) path
sounds easy, but the devil is in the details. In particular, the “fast
path” of a database, where audit logging must operate, strives to do as
little as humanly possible so that users get the fastest and most
scalable database system possible. While implementing Cassandra audit
logging, we had to ensure that the audit log infrastructure does not
take up excessive CPU or IO resources from the actual database execution
itself. However, one cannot simply optimize only for performance because
that may compromise the guarantees of the audit logging.&lt;/p&gt;
&lt;p&gt;For example, if producing an audit record would block a thread, it
should be dropped to maintain maximum performance. However, most
compliance requirements prohibit dropping records. Therefore, the key to
implementing audit logging correctly lies in allowing users to achieve
both performance &lt;em&gt;and&lt;/em&gt; reliability or, where both cannot be achieved,
to make an explicit trade-off through configuration.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&quot;audit-logging-design-goals&quot;&gt;Audit Logging Design Goals&lt;/h2&gt;
&lt;p&gt;The design goals of the Audit Log are broadly categorized into three
different areas:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: Because the Audit Log injection points
live in the request path, performance is an important goal in every
design decision.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;: Accuracy is required by compliance and is thus a
critical goal. Audit Logging must be able to answer crucial auditor
questions like “Is every write request to the database being audited?”.
As such, accuracy cannot be compromised.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Usability &amp;amp; Extensibility&lt;/strong&gt;: The diverse Cassandra ecosystem
demands that any frequently used feature must be easily usable and
pluggable (e.g., Compaction, Compression, SeedProvider, etc.), so the
Audit Log interface was designed with this context in mind from the
start.&lt;/p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;/h2&gt;
&lt;p&gt;With these three design goals in mind, the
&lt;a href=&quot;https://github.com/OpenHFT&quot;&gt;OpenHFT&lt;/a&gt; libraries were an
obvious choice due to their reliability and high performance. Earlier in
&lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-13983&quot;&gt;CASSANDRA-13983&lt;/a&gt;
the &lt;a href=&quot;https://github.com/OpenHFT/Chronicle-Queue&quot;&gt;Chronicle Queue
library&lt;/a&gt; of
OpenHFT was introduced as a BinLog utility to the Apache Cassandra code
base. The performance of Full Query Logging (FQL) was excellent, but it only instrumented mutation and read query paths. It was missing a lot of critical data such as when queries failed, where they came from, and which user issued the query. The FQL was also single purpose: preferring to drop messages rather than delay the process (which makes sense for FQL but not for Audit Logging). Lastly, the FQL didn’t allow for pluggability, which would make it harder to adopt in the codebase for this feature.&lt;/p&gt;
&lt;p&gt;As shown in the architecture figure below, we were able to unify the FQL feature with the AuditLog functionality through the AuditLogManager and IAuditLogger abstractions. Using this architecture, we can support any output format: logs, files, databases, etc. By default, the BinAuditLogger implementation comes out of the box to maintain performance. Users can plug in a custom audit logger implementation by dropping its jar file on the Cassandra classpath and configuring it through the options in the
&lt;a href=&quot;https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L1216-L1234&quot;&gt;cassandra.yaml&lt;/a&gt;
file.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&quot;architecture&quot;&gt;Architecture&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;&quot; alt=&quot;Fig 1. AuditLog Architecture Figure.&quot; /&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&quot;what-does-it-log&quot;&gt;What does it log&lt;/h2&gt;
&lt;p&gt;Each audit log implementation has access to the following attributes. For the default text-based logger, these fields are concatenated with &lt;code class=&quot;highlighter-rouge&quot;&gt;|&lt;/code&gt; to yield the final message.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;user&lt;/code&gt;: User name (if available)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;host&lt;/code&gt;: IP address of the host where the command is being executed&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;source ip address&lt;/code&gt;: Source IP address from which the request originated&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;source port&lt;/code&gt;: Source port number from which the request originated&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;timestamp&lt;/code&gt;: Unix timestamp&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;type&lt;/code&gt;: Type of the request (SELECT, INSERT, etc.)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;category&lt;/code&gt;: Category of the request (DDL, DML, etc.)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;keyspace&lt;/code&gt;: Keyspace (if applicable) that the request targets&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;scope&lt;/code&gt;: Table, aggregate, function, or trigger name, as applicable&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;operation&lt;/code&gt;: CQL command being executed&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;example-of-audit-log-messages&quot;&gt;Example of Audit log messages&lt;/h3&gt;
&lt;div&gt;&lt;pre&gt;
Type: AuditLog
LogMessage: user:anonymous|host:127.0.0.1:7000|source:/127.0.0.1|port:53418|timestamp:1539978679457|type:SELECT|category:QUERY|ks:k1|scope:t1|operation:SELECT * from k1.t1 ;
Type: AuditLog
LogMessage: user:anonymous|host:127.0.0.1:7000|source:/127.0.0.1|port:53418|timestamp:1539978692456|type:SELECT|category:QUERY|ks:system|scope:peers|operation:SELECT * from system.peers limit 1;
Type: AuditLog
LogMessage: user:anonymous|host:127.0.0.1:7000|source:/127.0.0.1|port:53418|timestamp:1539980764310|type:SELECT|category:QUERY|ks:system_virtual_schema|scope:columns|operation:SELECT * from system_virtual_schema.columns ;
&lt;/pre&gt;&lt;/div&gt;
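&lt;p&gt;As a rough illustration of how a consumer might read such a message, the sketch below pulls a single field out of a &lt;code class=&quot;highlighter-rouge&quot;&gt;LogMessage&lt;/code&gt; line by splitting on &lt;code class=&quot;highlighter-rouge&quot;&gt;|&lt;/code&gt;. The class and method names are hypothetical and not part of Cassandra, and the limit of ten fields is an assumption taken from the sample messages above.&lt;/p&gt;

```java
// Hypothetical helper, not part of Cassandra: reads one field from a
// pipe-delimited audit log message such as the samples shown above.
final class AuditLogLineSketch {

    // Assumption: a message carries at most ten fields and the operation comes last.
    static final int FIELD_COUNT = 10;

    // Returns the value of the named field, or null when the field is absent.
    static String field(String message, String key) {
        String prefix = key + ":";
        // Limiting the split keeps any '|' inside the trailing operation text intact.
        for (String part : message.split("\\|", FIELD_COUNT)) {
            if (part.startsWith(prefix)) {
                // Only the first colon separates key and value, so "host" keeps its port.
                return part.substring(prefix.length());
            }
        }
        return null;
    }
}
```

&lt;p&gt;For the first sample message above, looking up &lt;code class=&quot;highlighter-rouge&quot;&gt;host&lt;/code&gt; with this helper would give &lt;code class=&quot;highlighter-rouge&quot;&gt;127.0.0.1:7000&lt;/code&gt;, because only the first colon in each field separates the key from the value.&lt;/p&gt;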
&lt;hr /&gt;
&lt;h2 id=&quot;how-to-configure&quot;&gt;How to configure&lt;/h2&gt;
&lt;p&gt;Auditlog can be configured using &lt;a href=&quot;https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L1216-L1234&quot;&gt;cassandra.yaml&lt;/a&gt;. If you want to try Auditlog on one node, it can also be enabled and configured using &lt;code class=&quot;highlighter-rouge&quot;&gt;nodetool&lt;/code&gt;.&lt;/p&gt;
&lt;h4 id=&quot;cassandrayaml-configurations-for-auditlog&quot;&gt;cassandra.yaml configurations for AuditLog&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;enabled&lt;/code&gt;: Enables or disables the audit log&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;logger&lt;/code&gt;: Class name of the logger (built-in or custom)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;audit_logs_dir&lt;/code&gt;: Audit log directory location; if not set, defaults to &lt;code class=&quot;highlighter-rouge&quot;&gt;cassandra.logdir.audit&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;cassandra.logdir&lt;/code&gt; + /audit/&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;included_keyspaces&lt;/code&gt;: Comma separated list of keyspaces to be included in audit log, default - includes all keyspaces&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;excluded_keyspaces&lt;/code&gt;: Comma separated list of keyspaces to be excluded from audit log, default - excludes no keyspace&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;included_categories&lt;/code&gt;: Comma separated list of Audit Log Categories to be included in audit log, default - includes all categories&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;excluded_categories&lt;/code&gt;: Comma separated list of Audit Log Categories to be excluded from audit log, default - excludes no category&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;included_users&lt;/code&gt;: Comma separated list of users to be included in audit log, default - includes all users&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;excluded_users&lt;/code&gt;: Comma separated list of users to be excluded from audit log, default - excludes no user&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note: BinAuditLogger configurations can be tuned using cassandra.yaml properties as well.&lt;/p&gt;
&lt;p&gt;The available categories are: QUERY, DML, DDL, DCL, OTHER, AUTH, ERROR, PREPARE&lt;/p&gt;
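&lt;p&gt;To make the include and exclude options more concrete, here is a minimal sketch of how such a pair of comma-separated lists could be evaluated. It assumes that an exclusion takes priority over an inclusion and that an empty include list means everything is included; the class and method names are hypothetical, and the Cassandra source remains the authority on the actual behavior.&lt;/p&gt;

```java
// Hypothetical sketch of include/exclude filtering, not Cassandra's own code.
final class AuditFilterSketch {

    // True when the comma-separated list contains the value (entries are trimmed).
    static boolean contains(String csv, String value) {
        if (csv.isEmpty()) {
            return false;
        }
        for (String item : csv.split(",")) {
            if (item.trim().equals(value)) {
                return true;
            }
        }
        return false;
    }

    // Assumption: exclusion wins, and an empty include list admits everything.
    static boolean isLogged(String included, String excluded, String value) {
        if (contains(excluded, value)) {
            return false;
        }
        return included.isEmpty() || contains(included, value);
    }
}
```

&lt;p&gt;Under these assumptions, a keyspace listed in both &lt;code class=&quot;highlighter-rouge&quot;&gt;included_keyspaces&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;excluded_keyspaces&lt;/code&gt; would not be logged.&lt;/p&gt;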
&lt;h4 id=&quot;nodetool-command-to-enable-auditlog&quot;&gt;NodeTool command to enable AuditLog&lt;/h4&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;enableauditlog&lt;/code&gt;: Enables AuditLog with the yaml defaults. The yaml configuration can be overridden with options passed to the nodetool command.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nodetool enableauditlog
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Options:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;--excluded-categories&lt;/code&gt;
Comma separated list of Audit Log Categories to be excluded for
audit log. If not set the value from cassandra.yaml will be used&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;--excluded-keyspaces&lt;/code&gt;
Comma separated list of keyspaces to be excluded for audit log. If
not set the value from cassandra.yaml will be used&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;--excluded-users&lt;/code&gt;
Comma separated list of users to be excluded for audit log. If not
set the value from cassandra.yaml will be used&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;--included-categories&lt;/code&gt;
Comma separated list of Audit Log Categories to be included for
audit log. If not set the value from cassandra.yaml will be used&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;--included-keyspaces&lt;/code&gt;
Comma separated list of keyspaces to be included for audit log. If
not set the value from cassandra.yaml will be used&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;--included-users&lt;/code&gt;
Comma separated list of users to be included for audit log. If not
set the value from cassandra.yaml will be used&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;--logger&lt;/code&gt;
Logger name to be used for AuditLogging. Default BinAuditLogger. If
not set the value from cassandra.yaml will be used&lt;/p&gt;
&lt;h4 id=&quot;nodetool-command-to-disable-auditlog&quot;&gt;NodeTool command to disable AuditLog&lt;/h4&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;disableauditlog&lt;/code&gt;: Disables AuditLog.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nodetool disableauditlog
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;h4 id=&quot;nodetool-command-to-reload-auditlog-filters&quot;&gt;NodeTool command to reload AuditLog filters&lt;/h4&gt;
&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;enableauditlog&lt;/code&gt;: The NodeTool enableauditlog command can also be used to reload the audit log filters when it is called with the default or previous &lt;code class=&quot;highlighter-rouge&quot;&gt;logger&lt;/code&gt; name and the updated filters.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nodetool enableauditlog --logger &amp;lt;Default/ existing logger name&amp;gt; --included-keyspaces &amp;lt;New Filter values&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;hr /&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Now that Apache Cassandra ships with audit logging out of the box, users
can easily capture data change events to a persistent record indicating
what happened, when it happened, and where the event originated. This
type of information remains critical to modern enterprises operating in
a diverse regulatory environment. While audit logging represents one of
many steps forward in the 4.0 release, we believe that it will uniquely
enable enterprises to use the database in ways they could not
previously.&lt;/p&gt;</content><author><name>the Apache Cassandra Community</name></author><summary type="html">Database audit logging is an industry standard tool for enterprises to capture critical data change events including what data changed and who triggered the event. These captured records can then be reviewed later to ensure compliance with regulatory, security and operational policies.</summary></entry><entry><title type="html">Finding Bugs in Cassandra’s Internals with Property-based Testing</title><link href="http://cassandra.apache.org/blog/2018/10/17/finding_bugs_with_property_based_testing.html" rel="alternate" type="text/html" title="Finding Bugs in Cassandra's Internals with Property-based Testing" /><published>2018-10-17T07:00:00+00:00</published><updated>2018-10-17T07:00:00+00:00</updated><id>http://cassandra.apache.org/blog/2018/10/17/finding_bugs_with_property_based_testing</id><content type="html" xml:base="http://cassandra.apache.org/blog/2018/10/17/finding_bugs_with_property_based_testing.html">&lt;p&gt;As of September 1st, the Apache Cassandra community has shifted the focus of Cassandra 4.0 development from new feature work to testing, validation, and hardening, with the goal of releasing a stable 4.0 that every Cassandra user, from small deployments to large corporations, can deploy with confidence. There are several projects and methodologies that the community is undertaking to this end. One of these is the adoption of property-based testing, which was &lt;a href=&quot;http://cassandra.apache.org/blog/2018/08/21/testing_apache_cassandra.html&quot;&gt;previously introduced here&lt;/a&gt;. This post will take a look at a specific use of this approach and how it found a bug in a new feature meant to ensure data integrity between the client and Cassandra.&lt;/p&gt;
&lt;h4 id=&quot;detecting-corruption-is-a-property&quot;&gt;Detecting Corruption is a Property&lt;/h4&gt;
&lt;p&gt;In this post, we demonstrate property-based testing in Cassandra through the integration of the &lt;a href=&quot;https://github.com/ncredinburgh/QuickTheories&quot;&gt;QuickTheories&lt;/a&gt; library introduced as part of the work done for &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-13304&quot;&gt;CASSANDRA-13304&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This ticket modifies the framing of Cassandra’s native client protocol to include checksums in addition to the existing, optional compression. Clients can opt-in to this new feature to retain data integrity across the many hops between themselves and Cassandra. This is meant to address cases where hardware and protocol level checksums fail (due to underlying hardware issues) — a case that has been seen in production. A description of the protocol changes can be found in the ticket but for the purposes of this discussion the salient part is that two checksums are added: one that covers the length(s) of the data (if compressed there are two lengths), and one for the data itself. Before merging this feature, property-based testing using QuickTheories was used to uncover a bug in the calculation of the checksum over the lengths. This bug could have led to silent corruption at worst or unexpected errors during deserialization at best.&lt;/p&gt;
&lt;p&gt;The test used to find this bug is shown below. This example tests the property that when a frame is corrupted, that corruption should be caught by checksum comparison. The test is wrapped inside of a standard JUnit test case but, once called by JUnit, execution is handed over to QuickTheories to generate and execute hundreds of examples. These examples are dictated by the types of input that should be generated (the arguments to &lt;code class=&quot;highlighter-rouge&quot;&gt;forAll&lt;/code&gt;). The execution of each individual example is done by &lt;code class=&quot;highlighter-rouge&quot;&gt;checkAssert&lt;/code&gt; and its argument, the &lt;code class=&quot;highlighter-rouge&quot;&gt;roundTripWithCorruption&lt;/code&gt; function.&lt;/p&gt;
&lt;div&gt;&lt;div&gt;&lt;pre&gt;
@Test
public void corruptionCausesFailure()
{
    qt().withExamples(500)
        .forAll(inputWithCorruptablePosition(),
                integers().between(0, Byte.MAX_VALUE).map(Integer::byteValue),
                compressors(),
                checksumTypes())
        .checkAssert(this::roundTripWithCorruption);
}
&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;roundTripWithCorruption&lt;/code&gt; function is a generalization of a unit test that worked similarly but for a single case. It is given an input to transform and a position in the transformed output to insert corruption, as well as what byte to write to the corrupted position. The additional arguments (the compressor and checksum type) are used to ensure coverage of Cassandra’s various compression and checksumming implementations.&lt;/p&gt;
&lt;div&gt;&lt;div&gt;&lt;pre&gt;
private void roundTripWithCorruption(Pair&amp;lt;String, Integer&amp;gt; inputAndCorruptablePosition,
                                     byte corruptionValue,
                                     Compressor compressor,
                                     ChecksumType checksum) {
    String input = inputAndCorruptablePosition.left;
    ByteBuf expectedBuf = Unpooled.wrappedBuffer(input.getBytes());
    int byteToCorrupt = inputAndCorruptablePosition.right;
    ChecksummingTransformer transformer = new ChecksummingTransformer(checksum, DEFAULT_BLOCK_SIZE, compressor);
    ByteBuf outbound = transformer.transformOutbound(expectedBuf);

    // make sure we're actually expecting to produce some corruption
    if (outbound.getByte(byteToCorrupt) == corruptionValue)
        return;

    if (byteToCorrupt &amp;gt;= outbound.writerIndex())
        return;

    try {
        int oldIndex = outbound.writerIndex();
        outbound.writerIndex(byteToCorrupt);
        outbound.writeByte(corruptionValue);
        outbound.writerIndex(oldIndex);
        ByteBuf inbound = transformer.transformInbound(outbound, FLAGS);

        // verify that the content was actually corrupted
        expectedBuf.readerIndex(0);
        Assert.assertEquals(expectedBuf, inbound);
    } catch(ProtocolException e) {
        return;
    }
}
&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The remaining piece is how those arguments are generated — the arguments to &lt;code class=&quot;highlighter-rouge&quot;&gt;forAll&lt;/code&gt; mentioned above. Each argument is a function that returns an input generator. For each example, an input is pulled from each generator and passed to &lt;code class=&quot;highlighter-rouge&quot;&gt;roundTripWithCorruption&lt;/code&gt;. The &lt;code class=&quot;highlighter-rouge&quot;&gt;compressors()&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;checksumTypes()&lt;/code&gt; generators aren’t copied here. They can be found in the &lt;a href=&quot;https://github.com/apache/cassandra/blob/65fb17a88bd096b1e952ccca31ad709759644a1b/test/unit/org/apache/cassandra/transport/frame/checksum/ChecksummingTransformerTest.java#L209-L217&quot;&gt;source&lt;/a&gt; and are based on built-in generator methods, provided by QuickTheories, that select a value from a list of values. The second argument, &lt;code class=&quot;highlighter-rouge&quot;&gt;integers().between(0, Byte.MAX_VALUE).map(Integer::byteValue)&lt;/code&gt;, generates non-negative numbers that fit into a single byte. These numbers will be passed as the &lt;code class=&quot;highlighter-rouge&quot;&gt;corruptionValue&lt;/code&gt; argument.&lt;/p&gt;
&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;inputWithCorruptablePosition&lt;/code&gt; generator, copied below, generates strings to use as input to the transformation function and a position within the output byte stream to corrupt. Because compression prevents knowledge of the output size of the frame, the generator tries to choose a somewhat reasonable position to corrupt by limiting the choice to the size of the generated string (it’s uncommon for compression to generate a larger string and the implementation discards the compressed value if it does). It also avoids corrupting the first two bytes of the stream, which are not covered by a checksum and therefore can be corrupted without being caught. The &lt;code class=&quot;highlighter-rouge&quot;&gt;roundTripWithCorruption&lt;/code&gt; function above ensures that corruption is actually introduced and that corrupting a position larger than the size of the output does not occur.&lt;/p&gt;
&lt;div&gt;&lt;div&gt;&lt;pre&gt;
private Gen&amp;lt;Pair&amp;lt;String, Integer&amp;gt;&amp;gt; inputWithCorruptablePosition()
{
    return inputs().flatMap(s -&amp;gt; integers().between(2, s.length() + 2)
                                         .map(i -&amp;gt; Pair.create(s, i)));
}
&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;With all those pieces in place, if the test were run before the bug were fixed, it would fail with the following output.&lt;/p&gt;
&lt;div&gt;&lt;div&gt;&lt;pre&gt;
java.lang.AssertionError: Property falsified after 2 example(s)
Smallest found falsifying value(s) :-
{(c,3), 0, null, Adler32}
Cause was :-
java.lang.IndexOutOfBoundsException: readerIndex(10) + length(16711681) exceeds writerIndex(15): UnpooledHeapByteBuf(ridx: 10, widx: 15, cap: 54/54)
at io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1401)
at io.netty.buffer.AbstractByteBuf.checkReadableBytes(AbstractByteBuf.java:1388)
at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:870)
at org.apache.cassandra.transport.frame.checksum.ChecksummingTransformer.transformInbound(ChecksummingTransformer.java:289)
at org.apache.cassandra.transport.frame.checksum.ChecksummingTransformerTest.roundTripWithCorruption(ChecksummingTransformerTest.java:106)
...
Other found falsifying value(s) :-
{(c,3), 0, null, CRC32}
{(c,3), 1, null, CRC32}
{(c,3), 9, null, CRC32}
{(c,3), 11, null, CRC32}
{(c,3), 36, null, CRC32}
{(c,3), 50, null, CRC32}
{(c,3), 74, null, CRC32}
{(c,3), 99, null, CRC32}
Seed was 179207634899674
&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The output shows more than a single failing example. This is because QuickTheories, like most property-based testing libraries, comes with a shrinker, which performs the task of taking a failure and minimizing its inputs. This aids in debugging because there are multiple failing examples to look at, often with the noise removed in the process. Additionally, a seed value is provided so the same series of tests and failures can be generated again — another useful feature when debugging. In this case, the library generated an example that contains a single byte of input, which will corrupt the fourth byte in the output stream by setting it to zero, using no compression, and using Adler32 for checksumming. It can be seen from the other failing examples that using CRC32 also fails. This is due to improper calculation of the checksum, regardless of the algorithm. In particular, the checksum was only calculated over the least significant byte of each length rather than all eight bytes. By corrupting the fourth byte of the output stream (the first length’s second-most significant byte not covered by the calculation), an invalid length is read and later used.&lt;/p&gt;
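&lt;p&gt;For readers who want to see the overall shape of such a test without the QuickTheories dependency, the following is a minimal, hand-rolled sketch of the same idea using only the JDK. It is not the Cassandra test: the class and method names are hypothetical, it checksums a raw byte array with &lt;code class=&quot;highlighter-rouge&quot;&gt;java.util.zip.CRC32&lt;/code&gt; rather than a protocol frame, and it has no shrinker. It does show the two ingredients discussed above: many generated examples checked against one property, and a seed that makes a run repeatable.&lt;/p&gt;

```java
import java.util.Random;
import java.util.stream.IntStream;
import java.util.zip.CRC32;

// Hypothetical, JDK-only sketch of a seeded property check; not the Cassandra test.
final class CorruptionPropertySketch {

    static long crc(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // One generated example: corrupting a single byte must change the checksum.
    static boolean corruptionDetected(Random random) {
        byte[] payload = new byte[1 + random.nextInt(64)];
        random.nextBytes(payload);
        long expected = crc(payload);

        int position = random.nextInt(payload.length);
        // XOR with a non-zero mask guarantees the byte really changes.
        payload[position] ^= (byte) (1 + random.nextInt(255));

        return crc(payload) != expected;
    }

    // Runs the property over many examples; the same seed replays the same examples.
    static boolean holdsForAll(long seed, int examples) {
        Random random = new Random(seed);
        return IntStream.range(0, examples).allMatch(i -> corruptionDetected(random));
    }
}
```

&lt;p&gt;Calling &lt;code class=&quot;highlighter-rouge&quot;&gt;holdsForAll&lt;/code&gt; with a fixed seed replays exactly the same examples, which is what makes a reported seed useful when reproducing a failure.&lt;/p&gt;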
&lt;h4 id=&quot;where-to-find-more&quot;&gt;Where to Find More&lt;/h4&gt;
&lt;p&gt;Property-based testing is a broad topic, much of which is not covered by this post. In addition to Cassandra, it has been used successfully in several places including &lt;a href=&quot;https://ieeexplore.ieee.org/document/7107466/&quot;&gt;car&lt;/a&gt; &lt;a href=&quot;https://arxiv.org/pdf/1703.06574.pdf&quot;&gt;operating
systems&lt;/a&gt; and &lt;a href=&quot;https://youtu.be/hXnS_Xjwk2Y?t=1023&quot;&gt;suppliers’ products&lt;/a&gt;, &lt;a href=&quot;https://dl.acm.org/citation.cfm?id=2034662&quot;&gt;GNOME Glib&lt;/a&gt;, &lt;a href=&quot;https://github.com/WesleyAC/raft/tree/master/src&quot;&gt;distributed consensus&lt;/a&gt;, and other &lt;a href=&quot;https://www.youtube.com/watch?v=x9mW54GJpG0&quot;&gt;distributed&lt;/a&gt; &lt;a href=&quot;https://youtu.be/hXnS_Xjwk2Y?t=1382&quot;&gt;databases&lt;/a&gt;. It can also be combined with other approaches such as fault-injection and memory leak detection. Stateful models can also be built to generate a series of commands instead of running each example on one generated set of inputs. Our goal is to evangelize this approach within the Cassandra developer community and encourage more testing of this kind as part of our work to deliver the most stable major release of Cassandra yet.&lt;/p&gt;</content><author><name>the Apache Cassandra Community</name></author><summary type="html">As of September 1st, the Apache Cassandra community has shifted the focus of Cassandra 4.0 development from new feature work to testing, validation, and hardening, with the goal of releasing a stable 4.0 that every Cassandra user, from small deployments to large corporations, can deploy with confidence. There are several projects and methodologies that the community is undertaking to this end. One of these is the adoption of property-based testing, which was previously introduced here. 
This post will take a look at a specific use of this approach and how it found a bug in a new feature meant to ensure data integrity between the client and Cassandra.</summary></entry><entry><title type="html">Testing Apache Cassandra 4.0</title><link href="http://cassandra.apache.org/blog/2018/08/21/testing_apache_cassandra.html" rel="alternate" type="text/html" title="Testing Apache Cassandra 4.0" /><published>2018-08-21T03:00:00+00:00</published><updated>2018-08-21T03:00:00+00:00</updated><id>http://cassandra.apache.org/blog/2018/08/21/testing_apache_cassandra</id><content type="html" xml:base="http://cassandra.apache.org/blog/2018/08/21/testing_apache_cassandra.html">&lt;p&gt;With the goal of ensuring reliability and stability in Apache Cassandra 4.0, the project’s committers have voted to freeze new features on September 1 to concentrate on testing and validation before cutting a stable beta. Towards that goal, the community is investing in methodologies that can be performed at scale to exercise edge cases in the largest Cassandra clusters. The result, we hope, is to make Apache Cassandra 4.0 the best-tested and most reliable major release right out of the gate.&lt;/p&gt;
&lt;p&gt;In the interests of communication (and hopefully more participation), here’s a look at some of the approaches being used to test Apache Cassandra 4.0:&lt;/p&gt;
&lt;hr /&gt;
&lt;h4 id=&quot;replay-testing&quot;&gt;Replay Testing&lt;/h4&gt;
&lt;h5 id=&quot;workload-recording-log-replay-and-comparison&quot;&gt;Workload Recording, Log Replay, and Comparison&lt;/h5&gt;
&lt;p&gt;Replay testing allows for side-by-side comparison of a workload using two versions of the same database. It is a black-box technique that answers the question, “did anything change that we didn’t expect?”&lt;/p&gt;
&lt;p&gt;Replay testing is simple in concept: record a workload, then re-issue it against two clusters – one running a stable release and the second running a candidate build. Replay testing a stateful distributed system is more challenging. For a subset of workloads, we can achieve determinism in testing by grouping writes by CQL partition and ordering them via client-supplied timestamps. This also allows us to achieve parallelism, as recorded workloads can be distributed by partition across an arbitrarily-large fleet of writers. Though linearizing updates within a partition and comparing differences does not allow for validation of all possible workloads (e.g., CAS queries), this subset is very useful.&lt;/p&gt;
&lt;p&gt;The suite of Full Query Logging (“FQL”) tools in Apache Cassandra enables workload recording. &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-14618&quot;&gt;CASSANDRA-14618&lt;/a&gt; and &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-14619&quot;&gt;CASSANDRA-14619&lt;/a&gt; will add fqltool replay and fqltool compare, enabling log replay and comparison. Standard tools in the Apache ecosystem such as &lt;a href=&quot;https://spark.apache.org&quot;&gt;Apache Spark&lt;/a&gt; and &lt;a href=&quot;https://mesos.apache.org&quot;&gt;Apache Mesos&lt;/a&gt; can also make parallelizing replay and comparison across large clusters of machines straightforward.&lt;/p&gt;
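&lt;p&gt;The idea of grouping writes by partition and ordering them by client-supplied timestamp can be sketched in a few lines. The example below is only an illustration: the &lt;code class=&quot;highlighter-rouge&quot;&gt;Write&lt;/code&gt; type and its fields are hypothetical and do not reflect the FQL record format or the behavior of fqltool.&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of deterministic replay ordering; not the fqltool implementation.
final class ReplayOrderSketch {

    // A recorded write, reduced to what the ordering needs (fields are assumptions).
    static final class Write {
        final String partition;
        final long timestamp;
        final String cql;

        Write(String partition, long timestamp, String cql) {
            this.partition = partition;
            this.timestamp = timestamp;
            this.cql = cql;
        }

        String partition() { return partition; }
        long timestamp() { return timestamp; }
    }

    // Groups writes by partition, then orders each group by client timestamp.
    // The input array is left untouched; a sorted copy is returned.
    static Write[] replayOrder(Write[] recorded) {
        Write[] ordered = recorded.clone();
        Arrays.sort(ordered,
                    Comparator.comparing(Write::partition)
                              .thenComparingLong(Write::timestamp));
        return ordered;
    }
}
```

&lt;p&gt;Because each partition’s writes end up contiguous and ordered, separate partitions could be handed to separate writers without changing the outcome within any one partition.&lt;/p&gt;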
&lt;hr /&gt;
&lt;h4 id=&quot;fuzz-testing-and-property-based-testing&quot;&gt;Fuzz Testing and Property-Based Testing&lt;/h4&gt;
&lt;h5 id=&quot;dynamic-test-generation-and-fuzzing&quot;&gt;Dynamic Test Generation and Fuzzing&lt;/h5&gt;
&lt;p&gt;Fuzz testing dynamically generates input to be passed through a function for validation. We can make fuzz testing smarter in stateful systems like Apache Cassandra to assert that persisted data conforms to the database’s contracts: acknowledged writes are not lost, deleted data is not resurrected, and consistency levels are respected. Fuzz testing of storage systems to validate these properties requires maintaining a record of responses received from the system; the development of a model representing valid legal states of data within the database; and a validation pass to assert that responses reflect valid states according to that model.&lt;/p&gt;
&lt;p&gt;Property-based testing combines fuzz testing and assertions to explore a state space using randomly-generated input. These tests provide dynamic input to the system and assert that its fundamental properties are not violated. These properties can range from generic (e.g., “I can write data and read it back”) to specific (“range tombstone bounds synthesized during short-read-protection reads are properly closed”); and from local to distributed (e.g., “replacing every single node in a cluster results in an identical database”). To simplify debugging, property-based testing libraries like &lt;a href=&quot;https://github.com/ncredinburgh/QuickTheories&quot;&gt;QuickTheories&lt;/a&gt; also provide a “shrinker,” which attempts to generate the simplest possible failing case after detecting input or a sequence of actions that triggers a failure.&lt;/p&gt;
&lt;p&gt;Unlike model checkers, property-based tests don’t exhaust the state space – but explore it until a threshold of examples is reached. This allows for the computation to be distributed across many machines to gain confidence in code and infrastructure that scales with the amount of computation applied to test it.&lt;/p&gt;
&lt;hr /&gt;
&lt;h4 id=&quot;distributed-tests-and-fault-injection-testing&quot;&gt;Distributed Tests and Fault-Injection Testing&lt;/h4&gt;
&lt;h5 id=&quot;validating-behavior-under-fault-scenarios&quot;&gt;Validating Behavior Under Fault Scenarios&lt;/h5&gt;
&lt;p&gt;All of the above techniques can be combined with fault injection testing to validate that the system maintains availability where expected in fault scenarios, that fundamental properties hold, and that reads and writes conform to the system’s contracts. By asserting series of invariants under fault scenarios using different techniques, we gain the ability to exercise edge cases in the system that may reveal unexpected failures in extreme scenarios. Injected faults can take many forms – network partitions, process pauses, disk failures, and more.&lt;/p&gt;
&lt;hr /&gt;
&lt;h4 id=&quot;upgrade-testing&quot;&gt;Upgrade Testing&lt;/h4&gt;
&lt;h5 id=&quot;ensuring-a-safe-upgrade-path&quot;&gt;Ensuring a Safe Upgrade Path&lt;/h5&gt;
&lt;p&gt;Finally, it’s not enough to test one version of the database. Upgrade testing allows us to validate the upgrade path between major versions, ensuring that a rolling upgrade can be completed successfully, and that the contents of the resulting upgraded database are identical to the original. To perform upgrade tests, we begin by snapshotting a cluster and cloning it twice, resulting in two identical clusters. One of the clusters is then upgraded. Finally, we perform a row-by-row scan and comparison of all data in each partition to assert that all rows read are identical, logging any deltas for investigation. Like fault injection tests, upgrade tests can also be thought of as an operational scenario against which all other types of tests can be parameterized.&lt;/p&gt;
&lt;hr /&gt;
&lt;h4 id=&quot;wrapping-up&quot;&gt;Wrapping Up&lt;/h4&gt;
&lt;p&gt;The Apache Cassandra developer community is working hard to deliver Cassandra 4.0 as the most stable major release to date, bringing a variety of methodologies to bear on the problem. We invite you to join us in the effort, deploying these techniques within your infrastructure and testing the release on your workloads. Learn more about how to get involved &lt;a href=&quot;http://cassandra.apache.org/community/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The more that join, the better the release we’ll ship together.&lt;/p&gt;</content><author><name>the Apache Cassandra Community</name></author><summary type="html">With the goal of ensuring reliability and stability in Apache Cassandra 4.0, the project’s committers have voted to freeze new features on September 1 to concentrate on testing and validation before cutting a stable beta. Towards that goal, the community is investing in methodologies that can be performed at scale to exercise edge cases in the largest Cassandra clusters. The result, we hope, is to make Apache Cassandra 4.0 the best-tested and most reliable major release right out of the gate.</summary></entry><entry><title type="html">Hardware-bound Zero Copy Streaming in Apache Cassandra 4.0</title><link href="http://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html" rel="alternate" type="text/html" title="Hardware-bound Zero Copy Streaming in Apache Cassandra 4.0" /><published>2018-08-07T19:00:00+00:00</published><updated>2018-08-07T19:00:00+00:00</updated><id>http://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra</id><content type="html" xml:base="http://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html">&lt;p&gt;Streaming in Apache Cassandra powers host replacement, range movements, and cluster expansions. Streaming plays a crucial role in the cluster and as such its performance is key not only to the speed of the operations it is used in but also to the cluster’s health generally. In Apache Cassandra 4.0, we have introduced an improved streaming implementation that reduces GC pressure and increases throughput severalfold, so that it is now limited, in some cases, only by disk and network IO (See: &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-14556&quot;&gt;CASSANDRA-14556&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;&quot; alt=&quot;Fig 1. Cassandra Streaming&quot; style=&quot;float: right;margin-right: 7px;margin-top: 7px;&quot; /&gt; To get an understanding of the impact of these changes, let’s first have a look at the current streaming code path. The diagram below illustrates the stream session setup when a node attempts to stream data from a peer. Let’s say we have a 3-node cluster (Nodes A, B, C). Node C is being rebuilt and has to stream all data that it is responsible for from A and B. C sets up a streaming session with each of its peers (See &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-4650&quot;&gt;CASSANDRA-4650&lt;/a&gt; for how Cassandra applies &lt;a href=&quot;https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm&quot;&gt;Ford Fulkerson&lt;/a&gt; to optimize streaming peers). It exchanges messages to request ranges and begins streaming data from the selected nodes.&lt;/p&gt;
&lt;p&gt;During the streaming phase, A collects all SSTables that have partitions in the requested ranges. It streams each SSTable by serializing individual partitions. Upon receiving a partition, node C reifies the data in memory and then writes it to disk. This is necessary to accurately transfer partitions from all possible SSTables for the requested ranges. However, this streaming path generates garbage, which can be avoided in scenarios where all partitions within an SSTable need to be transmitted. Such scenarios are common when, for example, you’re using LeveledCompactionStrategy or have enabled partitioning SSTables by token range (see &lt;a href=&quot;http://issues.apache.org/jira/browse/CASSANDRA-6696&quot;&gt;CASSANDRA-6696&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;To solve this problem, &lt;a href=&quot;http://issues.apache.org/jira/browse/CASSANDRA-14556&quot;&gt;CASSANDRA-14556&lt;/a&gt; adds a Zero Copy streaming path. This significantly speeds up the transfer of SSTables and reduces garbage and unnecessary object creation. It modifies the streaming path to add information to the streaming header and uses Zero Copy APIs to transfer bytes to and from the network and disk. An SSTable is now eligible for transfer using this strategy when Cassandra detects that the complete SSTable needs to be transferred.&lt;/p&gt;
&lt;h2 id=&quot;how-do-i-use-this-feature&quot;&gt;How do I use this feature?&lt;/h2&gt;
&lt;p&gt;It just works. This feature is controlled by &lt;code class=&quot;highlighter-rouge&quot;&gt;stream_entire_sstables&lt;/code&gt; in &lt;code class=&quot;highlighter-rouge&quot;&gt;cassandra.yaml&lt;/code&gt; and is enabled by default. Even when it is enabled, streaming still respects the throttling limit defined by &lt;code class=&quot;highlighter-rouge&quot;&gt;stream_throughput_outbound_megabits_per_sec&lt;/code&gt;.&lt;/p&gt;
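&lt;p&gt;As a minimal sketch, the two settings named above might appear in &lt;code class=&quot;highlighter-rouge&quot;&gt;cassandra.yaml&lt;/code&gt; as follows. The throughput value of 200 is an illustrative assumption, not a recommendation; check the file shipped with your version for the actual defaults.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;# Stream complete SSTables over the Zero Copy path when eligible.
stream_entire_sstables: true

# Outbound streaming throttle, in megabits per second (illustrative value).
stream_throughput_outbound_megabits_per_sec: 200
&lt;/code&gt;&lt;/pre&gt;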
&lt;h2 id=&quot;impact&quot;&gt;Impact&lt;/h2&gt;
&lt;p&gt;With Zero Copy streaming, Cassandra can stream SSTables at a rate bounded only by hardware limits (network and disk IO). With this optimization, we hope to make Cassandra more performant and reliable.&lt;/p&gt;
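&lt;p&gt;Microbenchmarking this feature shows a marked improvement (scores are throughput in operations per second; higher is better). The block stream reader and writer are the Zero Copy implementations, and the partial stream reader and writer are the existing ones. In the results below, the block stream writer achieves roughly 76 times the throughput of the partial stream writer, and the block stream reader roughly 34 times that of the partial stream reader.&lt;/p&gt;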
&lt;table class=&quot;table-condensed table-bordered table-hover&quot;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ZeroCopyStreamingBenchmark.blockStreamReader&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;20.119&lt;/td&gt;
&lt;td&gt;± 1.300&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZeroCopyStreamingBenchmark.blockStreamWriter&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1339.672&lt;/td&gt;
&lt;td&gt;± 352.242&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZeroCopyStreamingBenchmark.partialStreamReader&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0.590&lt;/td&gt;
&lt;td&gt;± 0.135&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZeroCopyStreamingBenchmark.partialStreamWriter&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;17.556&lt;/td&gt;
&lt;td&gt;± 0.323&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;If you’re a Cassandra user, we would love to hear from you. Please send us feedback via the user &lt;a href=&quot;http://cassandra.apache.org/community/&quot;&gt;mailing list&lt;/a&gt;, &lt;a href=&quot;https://issues.apache.org/jira/projects/CASSANDRA/summary&quot;&gt;Jira&lt;/a&gt;, or &lt;a href=&quot;http://cassandra.apache.org/community/&quot;&gt;IRC&lt;/a&gt; (or any combination of the three).&lt;/p&gt;</content><author><name>The Apache Cassandra Community</name></author><summary type="html">Streaming in Apache Cassandra powers host replacement, range movements, and cluster expansions. Because streaming plays such a crucial role in the cluster, its performance is key not only to the speed of the operations it is used in but also to the cluster’s health generally. In Apache Cassandra 4.0, we have introduced an improved streaming implementation that reduces GC pressure and increases throughput severalfold, so that streaming is now limited, in some cases, only by disk and network IO (see CASSANDRA-14556).</summary></entry></feed>