This sub-project showcases the Cassandra Spark Bulk Analytics read and write functionality. The job writes data to a Cassandra table using the bulk writer functionality, and then it reads the data using the bulk reader functionality.
This example requires a Cassandra 4.0 cluster running, we recommend a 3-node cluster. Additionally, it requires the latest version of Cassandra Sidecar that we will run from source.
The easiest way to provision a Cassandra cluster is by using Cassandra Cluster Manager or CCM. Follow the configuration instructions for CCM and provision a new test cluster.
For example,
ccm create test --version=4.1.1 --nodes=3
Note If you are using macOS, explicit configurations of interface aliases are needed as follows:
sudo ifconfig lo0 alias 127.0.0.2 sudo ifconfig lo0 alias 127.0.0.3 sudo bash -c 'echo "127.0.0.2 localhost2" >> /etc/hosts' sudo bash -c 'echo "127.0.0.3 localhost3" >> /etc/hosts'
After configuring the network interfaces run:
ccm start
Verify that your Cassandra cluster is configured and running successfully.
In this step, we will clone and configure the Cassandra Sidecar project. Finally, we will run Sidecar which will be connecting to our local Cassandra 3-node cluster.
git clone https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis cd cassandra-sidecar
Configure the main/dist/sidecar.yaml
file for your local environment. You will most likely only need to configure the cassandra_instances
section in your file pointing to your local Cassandra data directories. Here is what my configuration looks like for this tutorial:
cassandra_instances: - id: 1 host: localhost port: 9042 data_dirs: <your_ccm_parent_path>/.ccm/test/node1/data0 uploads_staging_dir: <your_ccm_parent_path>/.ccm/test/node1/sstable-staging jmx_host: 127.0.0.1 jmx_port: 7100 jmx_ssl_enabled: false - id: 2 host: localhost2 port: 9042 data_dirs: <your_ccm_parent_path>/.ccm/test/node2/data0 uploads_staging_dir: <your_ccm_parent_path>/.ccm/test/node2/sstable-staging jmx_host: 127.0.0.1 jmx_port: 7200 jmx_ssl_enabled: false - id: 3 host: localhost3 port: 9042 data_dirs: <your_ccm_parent_path>/.ccm/test/node3/data0 uploads_staging_dir: <your_ccm_parent_path>/.ccm/test/node3/sstable-staging jmx_host: 127.0.0.1 jmx_port: 7300 jmx_ssl_enabled: false
I have a 3 node setup, so I configure Sidecar for those 3 nodes. CCM creates the Cassandra cluster under ${HOME}/.ccm/test
, so I update my data_dirs
and uploads_staging_dir
configuration to use my local path.
Next, create the uploads_staging_dir
where Sidecar will stage SSTables coming from Cassandra Spark bulk writer. In my case, I have decided to keep the sstable-staging
directory inside each of the node's directories.
mkdir -p ${HOME}/.ccm/test/node1/sstable-staging mkdir -p ${HOME}/.ccm/test/node2/sstable-staging mkdir -p ${HOME}/.ccm/test/node3/sstable-staging
Finally, run Cassandra Sidecar, we skip running integration tests because we need docker for integration tests. You can opt to run integration tests if you have docker running in your local environment.
user:~$ ./gradlew run -x integrationTest > Task :common:compileJava UP-TO-DATE > Task :cassandra40:compileJava UP-TO-DATE > Task :compileJava UP-TO-DATE > Task :processResources UP-TO-DATE > Task :classes UP-TO-DATE > Task :jar UP-TO-DATE > Task :startScripts UP-TO-DATE > Task :cassandra40:processResources NO-SOURCE > Task :cassandra40:classes UP-TO-DATE > Task :cassandra40:jar UP-TO-DATE > Task :common:processResources NO-SOURCE > Task :common:classes UP-TO-DATE > Task :common:jar UP-TO-DATE > Task :distTar > Task :distZip > Task :assemble > Task :common:compileTestFixturesJava UP-TO-DATE > Task :compileTestFixturesJava UP-TO-DATE > Task :compileTestJava UP-TO-DATE > Task :processTestResources UP-TO-DATE > Task :testClasses UP-TO-DATE > Task :compileIntegrationTestJava UP-TO-DATE > Task :processIntegrationTestResources NO-SOURCE > Task :integrationTestClasses UP-TO-DATE > Task :checkstyleIntegrationTest UP-TO-DATE > Task :checkstyleMain UP-TO-DATE > Task :checkstyleTest UP-TO-DATE > Task :processTestFixturesResources NO-SOURCE > Task :testFixturesClasses UP-TO-DATE > Task :checkstyleTestFixtures UP-TO-DATE > Task :testFixturesJar UP-TO-DATE > Task :common:processTestFixturesResources NO-SOURCE > Task :common:testFixturesClasses UP-TO-DATE > Task :common:testFixturesJar UP-TO-DATE > Task :test UP-TO-DATE > Task :jacocoTestReport UP-TO-DATE > Task :spotbugsIntegrationTest UP-TO-DATE > Task :spotbugsMain UP-TO-DATE > Task :spotbugsTest UP-TO-DATE > Task :spotbugsTestFixtures UP-TO-DATE > Task :check > Task :copyJolokia UP-TO-DATE > Task :installDist > Task :copyDist > Task :docs:asciidoctor UP-TO-DATE > Task :copyDocs UP-TO-DATE > Task :generateReDoc NO-SOURCE > Task :generateSwaggerUI NO-SOURCE > Task :build > Task :run Could not start Jolokia agent: java.net.BindException: Address already in use WARNING: An illegal reflective access operation has occurred WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1 WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release INFO [main] 2023-04-13 13:13:35,939 YAMLSidecarConfiguration.java:211 - Reading configuration from file://${HOME}/workspace/cassandra-sidecar-server/conf/sidecar.yaml INFO [main] 2023-04-13 13:13:35,997 CQLSessionProvider.java:63 - Connecting to localhost on port 9042 INFO [main] 2023-04-13 13:13:36,001 CQLSessionProvider.java:63 - Connecting to localhost2 on port 9042 INFO [main] 2023-04-13 13:13:36,014 CQLSessionProvider.java:63 - Connecting to localhost3 on port 9042 INFO [main] 2023-04-13 13:13:36,030 CacheFactory.java:83 - Building SSTable Import Cache with expireAfterAccess=PT2H, maxSize=10000 _____ _ _____ _ _ / __ \ | | / ___(_) | | | / \/ __ _ ___ ___ __ _ _ __ __| |_ __ __ _ \ `--. _ __| | ___ ___ __ _ _ __ | | / _` / __/ __|/ _` | '_ \ / _` | '__/ _` | `--. \ |/ _` |/ _ \/ __/ _` | '__| | \__/\ (_| \__ \__ \ (_| | | | | (_| | | | (_| | /\__/ / | (_| | __/ (_| (_| | | \____/\__,_|___/___/\__,_|_| |_|\__,_|_| \__,_| \____/|_|\__,_|\___|\___\__,_|_| INFO [main] 2023-04-13 13:13:36,229 CassandraSidecarDaemon.java:75 - Starting Cassandra Sidecar on 0.0.0.0:9043
There we have it, Cassandra Sidecar is now running and connected to all 3 Cassandra nodes on my local machine.
To be able to run the Sample Job, you need to create the keyspace and table used for the test.
Connect to your local Cassandra cluster using CCM:
ccm node1 cqlsh
Then run the following command to create the keyspace:
CREATE KEYSPACE spark_test WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true;
Then run the following command to create the table:
CREATE TABLE spark_test.test ( id BIGINT PRIMARY KEY, course BLOB, marks BIGINT );
The Cassandra Spark Bulk Analytics project depends on the Sidecar client. Since the Sidecar dependencies for the Analytics have not yet been published to maven central, we need to build them locally:
cd ${SIDECAR_REPOSITORY_HOME} ./gradlew -Pversion=1.0.0-local :common:publishToMavenLocal :client:publishToMavenLocal :vertx-client:publishToMavenLocal :vertx-client-shaded:publishToMavenLocal
We use the suffix -local
to denote that these artifacts have been built locally.
Finally, we are ready to run the example spark job:
cd ${ANALYTICS_REPOSITORY_HOME} ./gradlew :cassandra-analytics-core-example:run
Stop the Cassandra Sidecar project and tear down the ccm cluster
ccm remove test