Sample Cassandra Data with Spark Bulk Analytics Job

This sub-project showcases the Cassandra Spark Bulk Analytics read and write functionality. The job writes data to a Cassandra table using the bulk writer functionality, and then it reads the data using the bulk reader functionality.


Requirements

  • Java 11
  • A running Cassandra 4.0 cluster
  • A running Cassandra Sidecar service


This example requires a running Cassandra 4.0 cluster; we recommend a 3-node cluster. It also requires the latest version of Cassandra Sidecar, which we will run from source.

Step 1: Cassandra using CCM

The easiest way to provision a Cassandra cluster is with Cassandra Cluster Manager (CCM). Follow the configuration instructions for CCM and provision a new test cluster.

For example,

ccm create test --version=4.1.1 --nodes=3

Note: If you are using macOS, you need to configure the loopback interface aliases explicitly, as follows:

sudo ifconfig lo0 alias
sudo ifconfig lo0 alias
sudo bash -c 'echo "  localhost2" >> /etc/hosts'
sudo bash -c 'echo "  localhost3" >> /etc/hosts'

After configuring the network interfaces, start the cluster:

ccm start

Verify that your Cassandra cluster is configured and running successfully.
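
One way to perform this check is to count the nodes that nodetool reports as Up/Normal (status `UN`). The following is a minimal sketch, assuming `ccm` is on your PATH and the `test` cluster created above is the current CCM cluster; `count_up_nodes` is a hypothetical helper name, not part of CCM.

```shell
# Count the nodes that `nodetool status` reports as Up/Normal,
# i.e. lines that start with "UN". Reads nodetool output on stdin.
# count_up_nodes is a hypothetical helper name used for illustration.
count_up_nodes() {
  # grep -c exits non-zero when nothing matches; keep the exit status clean.
  grep -c '^UN' || true
}

# Usage (assumes the cluster from this step is running):
#   ccm node1 nodetool status | count_up_nodes
```

For the 3-node cluster recommended here, the count should equal the number of nodes you created.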

Step 2: Configure and Run Cassandra Sidecar

In this step, we will clone and configure the Cassandra Sidecar project, and then run Sidecar so that it connects to our local 3-node Cassandra cluster.

git clone
cd cassandra-sidecar

Configure the main/dist/sidecar.yaml file for your local environment. You will most likely only need to configure the cassandra_instances section in your file pointing to your local Cassandra data directories. Here is what my configuration looks like for this tutorial:

  - id: 1
    host: localhost
    port: 9042
    data_dirs: <your_ccm_parent_path>/.ccm/test/node1/data0
    uploads_staging_dir: <your_ccm_parent_path>/.ccm/test/node1/sstable-staging
    jmx_port: 7100
    jmx_ssl_enabled: false
  - id: 2
    host: localhost2
    port: 9042
    data_dirs: <your_ccm_parent_path>/.ccm/test/node2/data0
    uploads_staging_dir: <your_ccm_parent_path>/.ccm/test/node2/sstable-staging
    jmx_port: 7200
    jmx_ssl_enabled: false
  - id: 3
    host: localhost3
    port: 9042
    data_dirs: <your_ccm_parent_path>/.ccm/test/node3/data0
    uploads_staging_dir: <your_ccm_parent_path>/.ccm/test/node3/sstable-staging
    jmx_port: 7300
    jmx_ssl_enabled: false

I have a 3-node setup, so I configure Sidecar for those 3 nodes. CCM creates the Cassandra cluster under ${HOME}/.ccm/test, so I update my data_dirs and uploads_staging_dir configuration to use my local path.
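
If your cluster has a different number of nodes, the entries can be generated rather than typed by hand. The sketch below prints entries that follow the same pattern as the configuration above (host `localhost` for node 1 and `localhostN` afterwards, JMX port `7N00`). `print_sidecar_instances` is a hypothetical helper, and the pattern is an assumption taken from this tutorial's setup; it only holds for up to 9 nodes.

```shell
# Print one cassandra_instances entry per CCM node.
# print_sidecar_instances is a hypothetical helper for illustration only.
# Arguments: <cluster_dir> <node_count>
print_sidecar_instances() {
  cluster_dir=$1   # e.g. "${HOME}/.ccm/test"
  node_count=$2    # e.g. 3 (the 7N00 JMX port pattern assumes at most 9)
  i=1
  while [ "$i" -le "$node_count" ]; do
    # Node 1 uses "localhost"; later nodes use "localhost2", "localhost3", ...
    if [ "$i" -eq 1 ]; then host="localhost"; else host="localhost${i}"; fi
    printf '  - id: %s\n' "$i"
    printf '    host: %s\n' "$host"
    printf '    port: 9042\n'
    printf '    data_dirs: %s/node%s/data0\n' "$cluster_dir" "$i"
    printf '    uploads_staging_dir: %s/node%s/sstable-staging\n' "$cluster_dir" "$i"
    printf '    jmx_port: 7%s00\n' "$i"
    printf '    jmx_ssl_enabled: false\n'
    i=$((i + 1))
  done
}

# Usage: print the entries, then paste them under cassandra_instances.
#   print_sidecar_instances "${HOME}/.ccm/test" 3
```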

Next, create the uploads_staging_dir directories, where Sidecar will stage SSTables coming from the Cassandra Spark bulk writer. In my case, I have decided to keep the sstable-staging directory inside each node's directory.

mkdir -p ${HOME}/.ccm/test/node1/sstable-staging
mkdir -p ${HOME}/.ccm/test/node2/sstable-staging
mkdir -p ${HOME}/.ccm/test/node3/sstable-staging
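
The same directories can be created with a loop, which is convenient when the node count differs from 3. This is a minimal sketch; `create_staging_dirs` is a hypothetical helper name, and the directory layout is the one chosen in this tutorial.

```shell
# Create <cluster_dir>/nodeN/sstable-staging for N = 1..node_count.
# create_staging_dirs is a hypothetical helper for illustration only.
create_staging_dirs() {
  cluster_dir=$1   # e.g. "${HOME}/.ccm/test"
  node_count=$2    # e.g. 3
  i=1
  while [ "$i" -le "$node_count" ]; do
    # -p makes the call safe to repeat if the directory already exists.
    mkdir -p "${cluster_dir}/node${i}/sstable-staging"
    i=$((i + 1))
  done
}

# Usage:
#   create_staging_dirs "${HOME}/.ccm/test" 3
```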

Finally, run Cassandra Sidecar. We skip the integration tests because they require Docker; you can opt to run them if Docker is running in your local environment.

./gradlew run -x integrationTest
> Task :run
There we have it: Cassandra Sidecar is now running and connected to all 3 Cassandra nodes on my local machine.
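
To confirm from another terminal that Sidecar is accepting requests, you can probe its health endpoint. The sketch below assumes Sidecar listens on port 9043 and exposes `/api/v1/__health`; both are assumptions about the Sidecar build in use, so check the port configured in your sidecar.yaml. The function names are hypothetical.

```shell
# Build the health-check URL.
# ASSUMPTION: the /api/v1/__health path and default port 9043 match your
# Sidecar build; adjust them to your sidecar.yaml if they differ.
sidecar_health_url() {
  printf 'http://%s:%s/api/v1/__health\n' "${1:-localhost}" "${2:-9043}"
}

# Succeed if the endpoint answers with a non-error HTTP status within 5 seconds.
sidecar_is_healthy() {
  curl --silent --fail --max-time 5 --output /dev/null "$(sidecar_health_url "$@")"
}

# Usage:
#   sidecar_is_healthy localhost 9043 && echo "Sidecar is up"
```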

Step 3: Run the Sample Job

Before you can run the sample job, you need to create the keyspace and table used for the test.

Connect to your local Cassandra cluster using CCM:

ccm node1 cqlsh

Then run the following command to create the keyspace:

CREATE KEYSPACE spark_test WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'}
                            AND durable_writes = true;

Then run the following command to create the table:

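CREATE TABLE spark_test.test
(
    id     BIGINT PRIMARY KEY, -- assumed key column: CQL requires a primary key, and none is listed here
    course BLOB,
    marks  BIGINT
);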

The Cassandra Spark Bulk Analytics project depends on the Sidecar client. Since the Sidecar dependencies for the Analytics project have not yet been published to Maven Central, we need to build them locally from the cassandra-sidecar directory:

./gradlew -Pversion=1.0.0-local :common:publishToMavenLocal :client:publishToMavenLocal :vertx-client:publishToMavenLocal :vertx-client-shaded:publishToMavenLocal

We use the suffix -local to denote that these artifacts have been built locally.

Finally, we are ready to run the example Spark job:

./gradlew :cassandra-analytics-core-example:run

Tear down

Stop the Cassandra Sidecar process and tear down the CCM cluster:

ccm remove test