commit	dc0e79b9c483562ec0920d69e886715eb329c426
author	Francisco Guerrero <frankgh@apache.org>	Wed Jan 31 13:44:23 2024 -0800
committer	Francisco Guerrero <frankgh@apache.org>	Tue Feb 13 17:40:17 2024 -0800
tree	5f5fd1aa24a39c9e2e782a68804edbe942ea1136
parent	c3e8803b3331bc7ef81797ac52a8417524f67edc

CASSANDRA-19369 Use XXHash32 for digest calculation of SSTables

This commit adds the ability to use the XXHash32 digest algorithm, which is newly supported in Cassandra Sidecar. MD5 checksumming remains available for backwards compatibility, but the default is now XXHash32.

A new Writer option is added:

```
.option(WriterOptions.DIGEST.name(), "XXHASH32")
// or
.option(WriterOptions.DIGEST.name(), "MD5")
```

When this option is not provided, it defaults to XXHash32; it can be configured to use the legacy MD5 algorithm.

Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19369

README.md

Cassandra Analytics

Cassandra Spark Bulk Reader

The open-source repository for the Cassandra Spark Bulk Reader. This library integrates Cassandra with Spark, letting users run arbitrary Spark jobs against a Cassandra cluster securely and consistently.

This project contains the necessary open-source implementations to connect to a Cassandra cluster and read the data into Spark.

For example usage, see the example repository; sample steps:

import org.apache.cassandra.spark.sparksql.CassandraDataSource
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.getOrCreate()
val df = sparkSession.read.format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
                          .option("sidecar_instances", "localhost,localhost2,localhost3")
                          .option("keyspace", "sbr_tests")
                          .option("table", "basic_test")
                          .option("DC", "datacenter1")
                          .option("createSnapshot", true)
                          .option("numCores", 4)
                          .load()

Cassandra Spark Bulk Writer

The Cassandra Spark Bulk Writer allows for high-speed data ingest to Cassandra clusters running Cassandra 3.0 and 4.0.
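The commit message above says the writer's digest option falls back to XXHash32 when it is not provided and can be set to the legacy MD5. That defaulting rule can be sketched as follows. This is a minimal, self-contained illustration and not the library's implementation: the class `DigestOptionSketch`, the enum `DigestAlgorithm`, the method `resolve`, and the case-insensitive parsing are all hypothetical, and the key `"DIGEST"` assumes `WriterOptions` is an enum whose `DIGEST.name()` returns that string.

```java
import java.util.Locale;
import java.util.Map;

public class DigestOptionSketch {
    /** The two digest algorithms named in the commit message (hypothetical enum). */
    enum DigestAlgorithm { XXHASH32, MD5 }

    /** Assumed option key; in a real job it would come from WriterOptions.DIGEST.name(). */
    static final String DIGEST_KEY = "DIGEST";

    /**
     * Picks the digest algorithm from the writer options.
     * Falls back to XXHASH32 when the option is missing or blank.
     * Case-insensitive matching is an assumption of this sketch.
     */
    static DigestAlgorithm resolve(Map<String, String> options) {
        String value = options.get(DIGEST_KEY);
        if (value == null || value.trim().isEmpty()) {
            return DigestAlgorithm.XXHASH32;
        }
        // Throws IllegalArgumentException for an unsupported algorithm name
        return DigestAlgorithm.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of()));                  // option not provided
        System.out.println(resolve(Map.of(DIGEST_KEY, "MD5"))); // legacy algorithm requested
    }
}
```

Rejecting unknown names with an exception, rather than silently falling back, is a design choice of this sketch; the source does not say how the library handles unsupported values.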

Developers interested in contributing to the Analytics library should see the DEV-README.

Getting Started

For example usage, see the example repository. This example covers setting up Cassandra 4.0 and Apache Sidecar, and running both a Spark Bulk Reader job and a Spark Bulk Writer job.