CASSANDRA-18759: Use in-jvm dtest framework from Sidecar for testing

This commit introduces the use of the in-jvm dtest framework for testing
Analytics workloads. It can spin up a Cassandra cluster, including the
necessary Sidecar process, to test writing to and reading from Cassandra
using the analytics library.

Additional changes made in this commit include

* Use concurrent collections in MockBulkWriterContext (Fixes flaky test StreamSessionConsistencyTest)

    The StreamSessionConsistencyTest uses MockBulkWriterContext, which was not originally used
    (before this test was added) in a multi-threaded environment. Because of this, it would occasionally
    throw ConcurrentModificationException, causing the stream test to fail in a
    non-deterministic way. This commit switches MockBulkWriterContext to concurrent/synchronized
    collections so it no longer throws these spurious errors.
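    The fix above can be sketched as follows. This is a hedged illustration of the pattern, not the actual MockBulkWriterContext code: the UploadTracker class, its fields, and its methods are hypothetical names invented for this example.

    ```java
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Hypothetical sketch: several stream threads record uploads concurrently.
    // Plain ArrayList/HashMap here would risk ConcurrentModificationException
    // when one thread iterates while another mutates; the concurrent
    // implementations make both operations safe without external locking.
    class UploadTracker
    {
        private final List<String> uploads = new CopyOnWriteArrayList<>();
        private final Map<String, Integer> perInstanceCounts = new ConcurrentHashMap<>();

        void recordUpload(String instance, String sstable)
        {
            uploads.add(sstable);
            // merge() is atomic, so concurrent increments are never lost
            perInstanceCounts.merge(instance, 1, Integer::sum);
        }

        int totalUploads()
        {
            return uploads.size(); // safe to call while other threads write
        }

        int countFor(String instance)
        {
            return perInstanceCounts.getOrDefault(instance, 0);
        }
    }
    ```

    CopyOnWriteArrayList trades write cost for iteration safety, which suits test bookkeeping where writes are few and reads/iterations frequent.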

* Make the StartupValidation system thread-safe by using ThreadLocals
  instead of static collections, and clearing them once validation is
  complete.
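The ThreadLocal pattern described above can be sketched like this. It is an assumption-laden illustration, not the project's actual StartupValidation API: the class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each thread accumulates its own validation failures
// in a ThreadLocal instead of a shared static collection, so concurrent
// validations never interfere with each other.
class StartupValidations
{
    private static final ThreadLocal<List<String>> FAILURES =
            ThreadLocal.withInitial(ArrayList::new);

    static void register(String failure)
    {
        FAILURES.get().add(failure);
    }

    // Returns this thread's failures and clears its state, mirroring the
    // "clear once validation is complete" behavior from the commit message.
    static List<String> complete()
    {
        List<String> result = new ArrayList<>(FAILURES.get());
        FAILURES.remove();
        return result;
    }
}
```

Calling remove() after validation matters in thread-pool environments (such as Spark executors), where a reused worker thread would otherwise see stale results from a previous run.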

Patch by Doug Rohrer; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18759
48 files changed
tree: 92604c6a119b7ee8e2a7a6abe3f10ac1f13f492b
  1. .circleci/
  2. cassandra-analytics-core/
  3. cassandra-analytics-core-example/
  4. cassandra-analytics-integration-framework/
  5. cassandra-analytics-integration-tests/
  6. cassandra-bridge/
  7. cassandra-four-zero/
  8. cassandra-three-zero/
  9. config/
  10. githooks/
  11. gradle/
  12. ide/
  13. profiles/
  14. scripts/
  15. .asf.yaml
  16. .gitignore
  17. build.gradle
  18. CHANGES.txt
  19. code_version.sh
  20. DEV-README.md
  21. gradle.properties
  22. gradlew
  23. LICENSE.txt
  24. NOTICE.txt
  25. README.md
  26. settings.gradle
README.md

Cassandra Analytics

Cassandra Spark Bulk Reader

The open-source repository for the Cassandra Spark Bulk Reader. This library enables integration between Cassandra and Spark jobs, allowing users to run arbitrary Spark jobs against a Cassandra cluster securely and consistently.

This project contains the necessary open-source implementations to connect to a Cassandra cluster and read the data into Spark.

For example usage, see the example repository; sample steps:

import org.apache.cassandra.spark.sparksql.CassandraDataSource
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.getOrCreate()
val df = sparkSession.read.format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
                          .option("sidecar_instances", "localhost,localhost2,localhost3")
                          .option("keyspace", "sbr_tests")
                          .option("table", "basic_test")
                          .option("DC", "datacenter1")
                          .option("createSnapshot", true)
                          .option("numCores", 4)
                          .load()

Cassandra Spark Bulk Writer

The Cassandra Spark Bulk Writer allows for high-speed data ingest to Cassandra clusters running Cassandra 3.0 and 4.0.

Developers interested in contributing to the Analytics library should see the DEV-README.

Getting Started

For example usage, see the example repository. The example covers setting up Cassandra 4.0 and Apache Sidecar, and running Spark Bulk Reader and Spark Bulk Writer jobs.