Records that fail to get loaded into CarbonData because of data type incompatibility, or that are empty or have an incompatible format, are classified as Bad Records.
The bad records are stored at the location set in `carbon.badRecords.location` in the `carbon.properties` file. By default `carbon.badRecords.location` specifies the following location: `/opt/Carbon/Spark/badrecords`.
While loading data we can specify the approach to handle Bad Records. In order to analyse the cause of the Bad Records the parameter `BAD_RECORDS_LOGGER_ENABLE` must be set to value `TRUE`. There are multiple approaches to handle Bad Records, which can be specified by the parameter `BAD_RECORDS_ACTION` (a combined load example follows the list below).
- To pad the incorrect values of the csv rows with NULL and still load the data into CarbonData, set the following in the query :

```
'BAD_RECORDS_ACTION'='FORCE'
```
- To write the Bad Records to the raw csv at the location set in the parameter `carbon.badRecords.location` instead of loading them, set the following in the query :

```
'BAD_RECORDS_ACTION'='REDIRECT'
```
- To ignore the Bad Records, so that they are neither loaded nor stored in the raw csv, set the following in the query :

```
'BAD_RECORDS_ACTION'='IGNORE'
```
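Putting these options together, a load command might look like the following sketch; the table name and csv path here are illustrative placeholders, not from the original text :

```
-- my_table and the csv path are hypothetical placeholders
LOAD DATA INPATH 'hdfs://localhost:9000/data/sample.csv'
INTO TABLE my_table
OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='TRUE',
        'BAD_RECORDS_ACTION'='REDIRECT')
```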
The store location specified while creating the carbon session is used by CarbonData to store metadata such as the schema, dictionary files, dictionary metadata and sort indexes.
Try creating `carbonsession` with `storepath` specified in the following manner :

```
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession(<store_path>)
```
Example:

```
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://localhost:9000/carbon/store")
```
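A slightly fuller spark-shell sketch, for context: `getOrCreateCarbonSession` is typically brought into scope by importing `CarbonSession._`, and `sc` below is the SparkContext that spark-shell already provides.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._  // adds getOrCreateCarbonSession to SparkSession.builder()

// sc is the SparkContext provided by spark-shell;
// the HDFS path is an example store location, substitute your own.
val carbon = SparkSession
  .builder()
  .config(sc.getConf)
  .getOrCreateCarbonSession("hdfs://localhost:9000/carbon/store")
```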
The Apache CarbonData acquires lock on the files to prevent concurrent operations from modifying the same files. The lock can be of the following types depending on the storage location; for HDFS we specify it to be of type HDFSLOCK. By default it is set to type LOCALLOCK. The property `carbon.lock.type` configuration specifies the type of lock to be acquired during concurrent operations on table. This property can be set with the following values :

- **LOCALLOCK** : This lock is created on the local file system as a file. It is useful when only one spark driver (thrift server) runs on the machine and no other CarbonData spark application is launched concurrently.
- **HDFSLOCK** : This lock is created on HDFS as a file. It is useful when multiple CarbonData spark applications are launched and no ZooKeeper is running on the cluster and HDFS supports file based locking.
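For example, to use HDFS file based locking, set the following in `carbon.properties` :

```
carbon.lock.type=HDFSLOCK
```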
In order to build the CarbonData project it is necessary to specify the spark profile. The spark profile sets the Spark version. You need to specify the spark version while using Maven to build the project.
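As an illustration, the build command has the following shape; the profile name and version below are examples, check the profiles actually defined in the project's pom.xml for your release :

```
mvn -DskipTests -Pspark-2.1 -Dspark.version=2.1.0 clean package
```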
Carbon supports the insert operation; you can refer to the syntax mentioned in DML Operations on CarbonData. First, create a source table in spark-sql and load data into this created table.
```
CREATE TABLE source_table(
id String,
name String,
city String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
```
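The sample rows shown below can be loaded with a plain Hive-style load, for instance (the csv path is a hypothetical placeholder) :

```
-- /tmp/sample.csv is a hypothetical placeholder path
LOAD DATA LOCAL INPATH '/tmp/sample.csv' INTO TABLE source_table;
```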
```
SELECT * FROM source_table;
id  name    city
1   jack    beijing
2   erlu    hangzhou
3   davi    shenzhen
```
Scenario 1 :
Suppose the column order in the carbon table differs from that of the source table. If you query it with "SELECT * FROM carbon_table", you will get the values in the source table's column order, rather than in the carbon table's column order as expected.
```
CREATE TABLE IF NOT EXISTS carbon_table(
id String,
city String,
name String)
STORED BY 'carbondata';
```

```
INSERT INTO TABLE carbon_table SELECT * FROM source_table;
```

```
SELECT * FROM carbon_table;
id  city    name
1   jack    beijing
2   erlu    hangzhou
3   davi    shenzhen
```
As the result shows, the second column of the carbon table is city, but what it actually contains are name values, such as jack. The same behaviour occurs when inserting data into a hive table.

If you want to insert data into the corresponding columns of the carbon table, you have to specify the same column order in the insert statement :

```
INSERT INTO TABLE carbon_table SELECT id, city, name FROM source_table;
```
Scenario 2 :
The insert operation will fail when the number of columns in the carbon table differs from the number of columns specified in the select statement. The following insert operation will fail :

```
INSERT INTO TABLE carbon_table SELECT id, city FROM source_table;
```
Scenario 3 :
When a column type in the carbon table differs from the type of the column specified in the select statement, the insert operation will still succeed, but you may get NULL in the result, because NULL is substituted whenever the type conversion fails. A sketch of this scenario follows.
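For illustration, assume a hypothetical `carbon_table2` that is identical to `carbon_table` except that its first column is an Int :

```
-- carbon_table2 is a hypothetical table used only for this illustration
CREATE TABLE IF NOT EXISTS carbon_table2(
id Int,
city String,
name String)
STORED BY 'carbondata';

-- name values such as 'jack' cannot be converted to Int, so the insert
-- succeeds but the id column of carbon_table2 is filled with NULL.
INSERT INTO TABLE carbon_table2 SELECT name, city, id FROM source_table;
```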