| commit | 457b36bcb3c8a865cca83ca6c402246798113ab4 | |
|---|---|---|
| author | Francisco Guerrero <frankgh@apache.org> | Mon Nov 13 16:16:36 2023 -0800 |
| committer | Francisco Guerrero <frankgh@apache.org> | Thu Dec 07 09:22:21 2023 -0800 |
| tree | 8032b8869938142a6213702109f5bd35c7dffe4a | |
| parent | 680cc9395c55a88217f2de975f62ad588e8c95d5 | |
CASSANDRA-19024 Fix bulk reading when using identifiers that need quotes

Cassandra treats all identifiers (i.e. keyspace names, table names, column names, etc.) as lower case unless explicitly quoted by the user. A case-sensitive identifier, or a reserved word used as an identifier, can be defined by quoting it during DDL creation. In the analytics library, bulk reads fail when we encounter these identifiers.

In this commit, we fix the issue by properly propagating information about whether identifiers need to be quoted, exposing a new data frame option (`quote_identifiers`). When set to `true`, the keyspace and table names are quoted, and the reader can properly read data when these situations are encountered.

Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19024
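A minimal sketch of how the new `quote_identifiers` option might be passed to the reader. Only the option key itself comes from this commit; the keyspace and table names below are hypothetical case-sensitive identifiers, and the other option values are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// "Sensitive_Keyspace" and "MixedCaseTable" stand in for identifiers that
// were quoted in the DDL; quote_identifiers tells the bulk reader to quote
// them again when reading, instead of lower-casing them.
val df = spark.read.format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
  .option("sidecar_instances", "localhost")
  .option("keyspace", "Sensitive_Keyspace")
  .option("table", "MixedCaseTable")
  .option("quote_identifiers", "true") // option introduced by this commit
  .load()
```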
The open-source repository for the Cassandra Spark Bulk Reader. This library allows integration between Cassandra and Spark jobs, allowing users to run arbitrary Spark jobs against a Cassandra cluster securely and consistently.
This project contains the necessary open-source implementations to connect to a Cassandra cluster and read the data into Spark.
For example usage, see the example repository; sample steps:
```scala
import org.apache.cassandra.spark.sparksql.CassandraDataSource
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.getOrCreate()

val df = sparkSession.read.format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
  .option("sidecar_instances", "localhost,localhost2,localhost3") // comma-separated Sidecar hosts
  .option("keyspace", "sbr_tests")
  .option("table", "basic_test")
  .option("DC", "datacenter1")
  .option("createSnapshot", true)
  .option("numCores", 4)
  .load()
```
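The result is an ordinary Spark DataFrame, so standard Spark actions apply from here; for example (plain Spark API, nothing specific to this library):

```scala
df.printSchema()    // schema inferred from the Cassandra table
println(df.count()) // number of rows read through the bulk reader
```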
The Cassandra Spark Bulk Writer enables high-speed data ingestion into Cassandra clusters running Cassandra 3.0 and 4.0.
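A sketch of what a Bulk Writer job might look like, assuming a sink class and option keys that mirror the reader example above. The `CassandraDataSink` format string, keyspace, and table names below are assumptions for illustration, not taken from this page; see the example repository for actual usage:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Hypothetical data to ingest; in practice this would be your own DataFrame
// whose columns match the target table's schema.
val data = spark.range(0, 1000).toDF("id")

// The format string and option keys here are assumed by analogy with the
// reader example; they are not confirmed by this page.
data.write.format("org.apache.cassandra.spark.sparksql.CassandraDataSink")
  .option("sidecar_instances", "localhost,localhost2,localhost3")
  .option("keyspace", "sbw_tests")
  .option("table", "basic_test")
  .mode("append")
  .save()
```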
Developers interested in contributing to the Analytics library should see the DEV-README.
For example usage, see the example repository. The example covers setting up Cassandra 4.0 and Apache Sidecar, and running Spark Bulk Reader and Spark Bulk Writer jobs.