commit c7c3bbca2c7cb415b39689e924fa2357c239f043
author: Francisco Guerrero <frankgh@apache.org> — Tue Nov 14 16:28:14 2023 -0800
committer: Francisco Guerrero <frankgh@apache.org> — Fri Dec 08 10:57:26 2023 -0800
tree: 43d79702ea4efcf4378553ba479d36ed438d73c1
parent: 457b36bcb3c8a865cca83ca6c402246798113ab4
CASSANDRA-19031: Fix bulk writing when using identifiers that need quotes

Cassandra treats all identifiers (i.e. keyspace names, table names, column names, etc.) as lower case unless they are explicitly quoted by the user. A case-sensitive identifier, or a reserved word used as an identifier, can be defined by quoting it during DDL creation. In the analytics library, bulk writing fails when it encounters these identifiers.

This commit fixes the issue by properly propagating the information about whether identifiers need to be quoted, exposing a new dataframe option (`quote_identifiers`). When set to `true`, the keyspace/table/column names are quoted as needed, and data is written correctly when identifiers use mixed case or reserved words.

Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19031
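The quote-if-needed rule described above can be sketched as follows. This is an illustrative sketch only, not the library's actual implementation, and the reserved-word set is abbreviated for brevity:

```scala
object IdentifierQuoting {
  // Abbreviated, illustrative subset of CQL reserved words
  private val ReservedWords = Set("table", "keyspace", "select", "order", "group")

  // An identifier can stay unquoted only if it is all lower case,
  // starts with a letter, contains only [a-z0-9_], and is not reserved
  def needsQuoting(identifier: String): Boolean =
    !identifier.matches("[a-z][a-z0-9_]*") ||
      ReservedWords.contains(identifier.toLowerCase)

  // Quote (escaping embedded double quotes) only when necessary
  def maybeQuote(identifier: String): String =
    if (needsQuoting(identifier)) "\"" + identifier.replace("\"", "\"\"") + "\""
    else identifier
}
```

Under this sketch, `maybeQuote("basic_test")` stays unquoted, while a mixed-case name such as `MixedCase` or a reserved word such as `table` would be wrapped in double quotes.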
The open-source repository for the Cassandra Spark Bulk Reader. This library enables integration between Cassandra and Spark, allowing users to run arbitrary Spark jobs against a Cassandra cluster securely and consistently.
This project contains the necessary open-source implementations to connect to a Cassandra cluster and read the data into Spark.
For example usage, see the example repository. Sample steps:

```scala
import org.apache.cassandra.spark.sparksql.CassandraDataSource
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.getOrCreate()
val df = sparkSession.read.format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
                     .option("sidecar_instances", "localhost,localhost2,localhost3")
                     .option("keyspace", "sbr_tests")
                     .option("table", "basic_test")
                     .option("DC", "datacenter1")
                     .option("createSnapshot", true)
                     .option("numCores", 4)
                     .load()
```
The Cassandra Spark Bulk Writer allows high-speed data ingestion into clusters running Cassandra 3.0 and 4.0.
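A writer invocation might look like the following sketch. The data-sink format class name and the `.mode` setting are assumptions for illustration, not confirmed by this README; only `sidecar_instances`, `keyspace`, `table`, and `quote_identifiers` are options named elsewhere in this document, so consult the example repository for the exact API:

```scala
// Hypothetical sketch of a Bulk Writer job; the format class name below
// is an assumption, not confirmed by this README
df.write.format("org.apache.cassandra.spark.sparksql.CassandraDataSink")
  .option("sidecar_instances", "localhost,localhost2,localhost3")
  .option("keyspace", "sbr_tests")
  .option("table", "basic_test")
  // Quote mixed-case or reserved-word identifiers (CASSANDRA-19031)
  .option("quote_identifiers", "true")
  .mode("append")
  .save()
```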
Developers interested in contributing to the Analytics library should see the DEV-README.
For example usage, see the example repository. The example covers setting up Cassandra 4.0 and Apache Sidecar, as well as running Spark Bulk Reader and Spark Bulk Writer jobs.