= Improved Streaming
Apache Cassandra 4.0 has made several improvements to streaming.
Streaming is the process used by nodes of a cluster to exchange data in
the form of SSTables. Streaming of SSTables is performed for several
operations, such as:
* SSTable Repair
* Host Replacement
* Range movements
* Bootstrapping
* Rebuild
* Cluster expansion
== Streaming based on Netty
Streaming in Cassandra 4.0 is based on Non-blocking Input/Output (NIO)
with Netty
(https://issues.apache.org/jira/browse/CASSANDRA-12229[CASSANDRA-12229]).
It replaces the single-threaded (sequential), synchronous, blocking
model for streaming messages and transferring files. Netty supports
non-blocking, asynchronous, multi-threaded streaming in which multiple
connections are opened simultaneously. Non-blocking means that threads
do not block waiting for a response to a sent request; the response may
be delivered on a different thread. Because the model is asynchronous,
connections and threads are decoupled and no longer have a 1:1
relationship, so many more connections than threads may be open.
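The following is a minimal sketch of the non-blocking connection model
that Netty provides, not Cassandra's actual internode streaming code;
the host, port, and thread count are illustrative assumptions.
....
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

public class NonBlockingClientSketch {
    public static void main(String[] args) throws Exception {
        // A small, fixed pool of event-loop threads serves many channels.
        EventLoopGroup group = new NioEventLoopGroup(2);
        try {
            Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // Streaming handlers would be added to ch.pipeline() here.
                    }
                });
            // connect() returns immediately; completion is signalled on an
            // event-loop thread, which may differ from the calling thread.
            ChannelFuture connect = bootstrap.connect("127.0.0.1", 7000);
            connect.addListener(f -> System.out.println("connected: " + f.isSuccess()));
            connect.channel().closeFuture().sync();
        } finally {
            group.shutdownGracefully();
        }
    }
}
....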
== Zero Copy Streaming
Prior to 4.0, Cassandra reified SSTables into objects during streaming.
This creates unnecessary garbage and slows down the whole streaming
process, as some SSTables could instead be transferred as whole files
rather than as individual partitions. Cassandra 4.0 adds support for
streaming entire SSTables when possible
(https://issues.apache.org/jira/browse/CASSANDRA-14556[CASSANDRA-14556])
using zero-copy APIs. If enabled, Cassandra uses zero-copy transfers for
eligible SSTables, significantly speeding up transfers and increasing
throughput. A zero-copy path avoids bringing data into user space on
both the sending and the receiving side, so any streaming-related
operation sees a corresponding improvement. Zero copy streaming is
hardware bound: it is limited only by network and disk I/O.
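To illustrate the underlying JDK primitive, here is a minimal sketch of
a zero-copy file transfer, not Cassandra's implementation; the file
name, host, and port are assumptions.
....
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySendSketch {
    // Transfers a file to a socket without copying it into user space.
    static void send(Path file, SocketChannel socket) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = 0;
            long size = channel.size();
            while (position < size) {
                // transferTo() can hand bytes to the socket inside the kernel
                // (e.g. via sendfile), so no user-space buffer is involved.
                position += channel.transferTo(position, size - position, socket);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (SocketChannel socket =
                 SocketChannel.open(new InetSocketAddress("127.0.0.1", 7000))) {
            send(Path.of("example-Data.db"), socket);
        }
    }
}
....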
=== High Availability
In benchmark tests, Zero Copy Streaming is 5x faster than
partition-based streaming. Faster streaming improves availability:
because a cluster's recovery depends mainly on streaming speed, clusters
with failed nodes recover much more quickly (5x faster). If a node
fails, SSTables need to be streamed to a replacement node. During the
replacement operation, the new Cassandra node streams SSTables from the
neighboring nodes that hold copies of the data belonging to the new
node's token range. Depending on the amount of data stored, this process
can require substantial network bandwidth and take considerable time to
complete. The longer these range movements take, the longer cluster
availability is reduced, and the failure of multiple nodes reduces
availability further. The faster the new node completes streaming its
data, the sooner it can serve traffic, increasing the availability of
the cluster.
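As a rough, illustrative calculation (assumed figures, not benchmark
results): a replacement node that must receive 1 TB of data at a
partition-based rate of 100 MB/s needs about 10,000 seconds, close to 3
hours, to rejoin; at five times that rate the same node is ready in
roughly 33 minutes.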
=== Enabling Zero Copy Streaming
Zero copy streaming is enabled with the following setting in
`cassandra.yaml`.
....
stream_entire_sstables: true
....
By default zero copy streaming is enabled.
=== SSTables Eligible for Zero Copy Streaming
Zero copy streaming is used if all partitions within the SSTable need to
be transmitted. This is common when using `LeveledCompactionStrategy` or
when partitioning SSTables by token range has been enabled. All
partition keys in an SSTable are iterated over to determine its
eligibility for zero copy streaming.
=== Benefits of Zero Copy Streaming
When enabled, Cassandra zero-copy streams entire eligible SSTables
between nodes, including every component file. This speeds up the
network transfer significantly, subject to the throttling specified by
`stream_throughput_outbound_megabits_per_sec`, and reduces GC pressure
on both the sending and the receiving node. While this feature tries to
keep the disks balanced, it cannot guarantee it. It is automatically
disabled if internode encryption is enabled. Currently it can be used
with Leveled Compaction.
=== Configuring for Zero Copy Streaming
Throttling reduces streaming speed. The
`stream_throughput_outbound_megabits_per_sec` setting throttles all
outbound streaming file transfers on a node to the given total
throughput in Mbps. When unset, the default is 200 Mbps, or 25 MB/s.
....
stream_throughput_outbound_megabits_per_sec: 200
....
To run any zero copy streaming benchmark,
`stream_throughput_outbound_megabits_per_sec` must be set to a very high
value; otherwise throttling will dominate and the benchmark results will
not be meaningful.
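For example (the value below is an arbitrary illustration of a throttle
high enough to stay out of the way, not a recommendation):
....
stream_throughput_outbound_megabits_per_sec: 10000
....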
The `inter_dc_stream_throughput_outbound_megabits_per_sec` setting
throttles all streaming file transfers between datacenters. It lets
users throttle inter-datacenter stream throughput in addition to the
node-wide throttle configured with
`stream_throughput_outbound_megabits_per_sec`. When unset, the default
is 200 Mbps, or 25 MB/s.
....
inter_dc_stream_throughput_outbound_megabits_per_sec: 200
....
=== SSTable Components Streamed with Zero Copy Streaming
Zero Copy Streaming streams entire SSTables. SSTables are made up of
multiple components in separate files. SSTable components streamed are
listed in Table 1.
Table 1. SSTable Components
[width="98%",cols="27%,73%",]
|===
|SSTable Component |Description
|Data.db |The base data for an SSTable: the remaining components can be
regenerated based on the data component.
|Index.db |Index of the row keys with pointers to their positions in the
data file.
|Filter.db |Serialized bloom filter for the row keys in the SSTable.
|CompressionInfo.db |File to hold information about uncompressed data
length, chunk offsets etc.
|Statistics.db |Statistical metadata about the content of the SSTable.
|Digest.crc32 |Holds the CRC32 checksum of the data file.
|CRC.db |Holds the CRC32 for chunks in an uncompressed file.
|Summary.db |Holds the SSTable index summary (a sampling of the Index
component).
|TOC.txt |Table of contents, stores the list of all components for the
SSTable.
|===
Custom components, used by e.g. a custom compaction strategy, may also
be included.
== Repair Streaming Preview
Repair with `nodetool repair` involves streaming of repaired SSTables,
and a repair preview has been added to provide an estimate of the amount
of repair streaming that would need to be performed. Repair preview
(https://issues.apache.org/jira/browse/CASSANDRA-13257[CASSANDRA-13257])
is invoked with `nodetool repair --preview`, using the option:
....
-prv, --preview
....
It determines ranges and amount of data to be streamed, but doesn't
actually perform repair.
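For example, to preview the streaming that a repair of a keyspace would
perform (the keyspace name `cycling` is a placeholder):
....
nodetool repair --preview cycling
....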
== Parallelizing of Streaming of Keyspaces
The streaming of the different keyspaces for bootstrap and rebuild has
been parallelized in Cassandra 4.0
(https://issues.apache.org/jira/browse/CASSANDRA-4663[CASSANDRA-4663]).
== Unique nodes for Streaming in Multi-DC deployment
The range streamer picks unique nodes to stream data from when the
number of replicas in each DC is three or more
(https://issues.apache.org/jira/browse/CASSANDRA-4650[CASSANDRA-4650]).
The optimization evens out the streaming load across the cluster:
without it, some nodes may be picked to stream more data than others.
The patch selects a dedicated node per range, so that each source node
streams only one range.
This increases the performance of bootstrapping a node and also puts
less pressure on the nodes serving the data. It has no effect when the
number of replicas in each DC is fewer than three, as data is then
streamed from only two nodes.
== Stream Operation Types
It is important to know the type, or purpose, of a given stream. Version
4.0
(https://issues.apache.org/jira/browse/CASSANDRA-13064[CASSANDRA-13064])
adds an `enum` to distinguish between the different types of streams.
Stream types are available both in a stream request and in a stream
task. The different stream types, sketched as an enum after this list,
are:
* Restore replica count
* Unbootstrap
* Relocation
* Bootstrap
* Rebuild
* Bulk Load
* Repair
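The following is a minimal sketch of such an enum, mirroring the list
above; the constant names and helper method are illustrative and may
differ from the actual enum in the Cassandra source tree.
....
// Illustrative only: names mirror the stream types listed above.
public enum StreamType {
    RESTORE_REPLICA_COUNT("Restore replica count"),
    UNBOOTSTRAP("Unbootstrap"),
    RELOCATION("Relocation"),
    BOOTSTRAP("Bootstrap"),
    REBUILD("Rebuild"),
    BULK_LOAD("Bulk Load"),
    REPAIR("Repair");

    private final String description;

    StreamType(String description) {
        this.description = description;
    }

    // Human-readable purpose of the stream, usable in both stream
    // requests and stream tasks.
    public String description() {
        return description;
    }
}
....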
== Disallow Decommission when number of Replicas will drop below configured RF
https://issues.apache.org/jira/browse/CASSANDRA-12510[CASSANDRA-12510]
guards against a decommission that would drop the number of replicas
below the configured replication factor (RF). It also adds a `--force`
option that allows the decommission to continue if it is intentional,
forcing the node to decommission even when doing so reduces the number
of replicas below the configured RF.
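For example, to force an intentional decommission despite the RF guard
(run on the node being decommissioned):
....
nodetool decommission --force
....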