commit 550bdfa1c6082537e2cfb93449128a61dbe3a1fb
author: Francisco Guerrero <frankgh@apache.org> | Tue Dec 19 12:50:43 2023 -0800
committer: Francisco Guerrero <frankgh@apache.org> | Wed Jan 10 05:44:54 2024 -0800
tree: 1c1445fde788dd4bc61225c1d0e2d03770a189d2
parent: 0aaf5659028dd874c8d666c636f11eae63c429e6
CASSANDRA-19251 Speed up integration tests

This commit introduces an opinionated way to run integration tests in which a test class reuses the same in-jvm dtest cluster, and it enforces an ordering that helps tests run faster. The test setup does the following:

- Find the Cassandra version to run
- Provision a cluster for the test
- Initialize the schemas required for the tests
- Start the Sidecar service

The above approach guarantees that Sidecar is ready once the setup method completes, which means we no longer need to spend time waiting for schema propagation. This optimization also helps reduce test time.

The drawback of this approach is that if the cluster must be in a particular state for testing (for example, a node needs to be in the joining state while executing the bulk test), that cluster can only be used for tests requiring that state. Testing different states of the cluster therefore requires a new test class.

Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19251
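The ordered setup described in the commit message can be sketched as follows. This is a minimal, self-contained illustration of the idea (a single lazily initialized cluster shared by all tests in a class); the names `findCassandraVersion`, `provisionCluster`, `initializeSchemas`, and `startSidecar` are illustrative stand-ins, not the actual in-jvm dtest or Sidecar test API.

```scala
// Hypothetical sketch of the per-class shared-cluster test setup.
// All helper names here are assumptions, not the real test framework API.
object SharedClusterTestSetup {
  final case class Cluster(version: String, schemasReady: Boolean, sidecarStarted: Boolean)

  // Step 1: find the Cassandra version to run (falls back to a default here)
  def findCassandraVersion(): String = sys.props.getOrElse("cassandra.version", "4.0")

  // Step 2: provision a cluster for the test
  def provisionCluster(version: String): Cluster =
    Cluster(version, schemasReady = false, sidecarStarted = false)

  // Step 3: initialize the schemas required for the tests
  def initializeSchemas(c: Cluster): Cluster = c.copy(schemasReady = true)

  // Step 4: start the Sidecar service; by the time this returns,
  // Sidecar is ready and no schema-propagation wait is needed
  def startSidecar(c: Cluster): Cluster = c.copy(sidecarStarted = true)

  // Initialized once; every test in the class reuses the same cluster
  lazy val cluster: Cluster =
    startSidecar(initializeSchemas(provisionCluster(findCassandraVersion())))
}
```

Because the cluster is built once per test class, tests that require a different cluster state (such as a joining node) would live in their own class with their own setup, as the commit message notes.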
The open-source repository for the Cassandra Spark Bulk Reader. This library integrates Cassandra with Spark, allowing users to run arbitrary Spark jobs against a Cassandra cluster securely and consistently.
This project contains the necessary open-source implementations to connect to a Cassandra cluster and read the data into Spark.
For example usage, see the example repository; sample steps:
```scala
import org.apache.cassandra.spark.sparksql.CassandraDataSource
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.getOrCreate()
val df = sparkSession.read.format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
                     .option("sidecar_instances", "localhost,localhost2,localhost3")
                     .option("keyspace", "sbr_tests")
                     .option("table", "basic_test")
                     .option("DC", "datacenter1")
                     .option("createSnapshot", true)
                     .option("numCores", 4)
                     .load()
```
The Cassandra Spark Bulk Writer allows for high-speed data ingest to Cassandra clusters running Cassandra 3.0 and 4.0.
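A Bulk Writer invocation might look like the sketch below, mirroring the reader example above. The data-source class name (`CassandraDataSink`), the keyspace/table names, and the option set are assumptions modeled on the reader snippet, not confirmed API; consult the example repository for the actual writer usage.

```scala
// Hypothetical Bulk Writer usage; format class and options are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val df = spark.read.parquet("/path/to/source/data")  // data to ingest (illustrative path)

df.write.format("org.apache.cassandra.spark.sparksql.CassandraDataSink")  // assumed class name
  .option("sidecar_instances", "localhost,localhost2,localhost3")
  .option("keyspace", "sbw_tests")  // hypothetical keyspace
  .option("table", "basic_test")
  .mode("append")
  .save()
```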
Developers interested in contributing to the Analytics library should see the DEV-README.
For example usage, see the example repository. This example covers setting up Cassandra 4.0 and Apache Sidecar, and running both a Spark Bulk Reader and a Spark Bulk Writer job.