In this tutorial, you'll learn how to set up a very simple Spark application for reading and writing data from/to Cassandra. Before you start, you need basic knowledge of Apache Cassandra and Apache Spark. Refer to the DataStax and Cassandra documentation and the Spark documentation.
Install and launch a Cassandra cluster and a Spark cluster.
Configure a new Scala project with Apache Spark and the Spark Cassandra Connector as dependencies.
The dependencies are easily retrieved via the spark-packages.org website. For example, if you're using sbt, your build.sbt should include something like this:
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"

libraryDependencies += "datastax" % "spark-cassandra-connector" % "2.3.0-s_2.11"
The spark-packages libraries can also be used with spark-submit and spark-shell. These commands place the connector and all of its dependencies on the classpath of the Spark driver and all Spark executors.
$SPARK_HOME/bin/spark-shell --packages datastax:spark-cassandra-connector:2.3.0-s_2.11
$SPARK_HOME/bin/spark-submit --packages datastax:spark-cassandra-connector:2.3.0-s_2.11
For the list of available versions, see:
This driver does not depend on the Cassandra server code.
Create a simple keyspace and table in Cassandra. Run the following statements in cqlsh:
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
CREATE TABLE test.kv(key text PRIMARY KEY, value int);
Then insert some example data:
INSERT INTO test.kv(key, value) VALUES ('key1', 1);
INSERT INTO test.kv(key, value) VALUES ('key2', 2);
Now you're ready to write your first Spark program using Cassandra.
Run the spark-shell with the packages line for your version. To configure the default Spark configuration, pass key-value pairs with --conf:
$SPARK_HOME/bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 \
--packages datastax:spark-cassandra-connector:2.3.0-s_2.11
This command sets the Spark Cassandra Connector parameter spark.cassandra.connection.host to 127.0.0.1. Change this to the address of one of the nodes in your Cassandra cluster.
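If you run a packaged application with spark-submit instead of the shell, the same parameter can be set in code. The following is a minimal sketch, not the only way to do it: the object name `QuickStart` and the application name are placeholders chosen for illustration, and the master URL is assumed to be supplied by spark-submit (for example with `--master`).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Hypothetical entry point; the object and app names are illustrative only.
object QuickStart {
  def main(args: Array[String]): Unit = {
    // Equivalent of passing --conf spark.cassandra.connection.host=127.0.0.1
    val conf = new SparkConf(true)
      .setAppName("cassandra-quick-start")
      .set("spark.cassandra.connection.host", "127.0.0.1")

    // Assumes the master is provided by spark-submit.
    val sc = new SparkContext(conf)
    try {
      // Count the rows of the test.kv table created earlier in this guide.
      val rows = sc.cassandraTable("test", "kv")
      println(s"Rows in test.kv: ${rows.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

Values set on the `SparkConf` in code take precedence over those passed with `--conf`, so leave the host out of the code if you want to choose it at submit time.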
Enable Cassandra-specific functions on the SparkContext, SparkSession, RDD, and DataFrame:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
Use the sc.cassandraTable method to view this table as a Spark RDD:
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)
println(rdd.map(_.getInt("value")).sum)
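The rows above are generic CassandraRow objects, read with getters such as getInt. As a sketch of an alternative, the connector can also map each row directly to a Scala tuple when you pass a type parameter to cassandraTable. This assumes the spark-shell session started above, where `sc` is already defined, and the test.kv table from the earlier step.

```scala
import com.datastax.spark.connector._

// Read test.kv as (key, value) tuples instead of CassandraRow objects.
// select fixes the column order so it matches the tuple's field order.
val pairs = sc.cassandraTable[(String, Int)]("test", "kv").select("key", "value")

// collect() is fine for this tiny example table; avoid it on large tables.
pairs.collect().foreach { case (key, value) =>
  println(s"$key -> $value")
}
```

Typed rows let you use ordinary Scala pattern matching and avoid looking up columns by name for every row.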
Add two more rows to the table:
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
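To check that the new rows were written, you can read the table back. The sketch below uses the DataFrame API enabled by the org.apache.spark.sql.cassandra._ import. It assumes a spark-shell session where the SparkSession is available as `spark`. Note that cassandraFormat takes the table name first and the keyspace second.

```scala
import org.apache.spark.sql.cassandra._

// Load test.kv as a DataFrame (arguments: table, then keyspace).
val df = spark.read.cassandraFormat("kv", "test").load()

// Inspect the schema inferred from the Cassandra table definition.
df.printSchema()

// Show only the rows added by saveToCassandra (values 3 and 4).
df.filter(df("value") > 2).show()
```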
Next - Connecting to Cassandra
Jump to - Accessing data with DataFrames