Configuring CarbonData

This tutorial guides you through the advanced configuration of CarbonData.

System Configuration

This section provides the details of all the configurations required for the Carbon system.

System Configuration in carbon.properties

| Parameter | Default Value | Description |
|---|---|---|
| carbon.storelocation | /user/hive/warehouse/carbon.store | Location where Carbon will create the store and write the data in its own format. NOTE: Store location should be in HDFS. |
| carbon.ddl.base.hdfs.url | hdfs://hacluster/opt/data | This property is used to configure the HDFS relative path from the HDFS base path, configured in fs.defaultFS. The path configured in carbon.ddl.base.hdfs.url will be appended to the HDFS path configured in fs.defaultFS. If this path is configured, the user need not pass the complete path while loading data. For example: if the absolute path of the CSV file is hdfs://10.18.101.155:54310/data/cnbc/2016/xyz.csv, the path "hdfs://10.18.101.155:54310" comes from the property fs.defaultFS and the user can configure /data/cnbc/ as carbon.ddl.base.hdfs.url. While loading data, the user can then specify the CSV path as /2016/xyz.csv. |
| carbon.badRecords.location | /opt/Carbon/Spark/badrecords | Path where the bad records are stored. |
| carbon.kettle.home | $SPARK_HOME/carbonlib/carbonplugins | Path used by Carbon internally to create the graph for loading the data. |
| carbon.data.file.version | 2 | If this parameter value is set to 1, Carbon supports loading data in the old format. If the value is set to 2, Carbon supports loading data in the new format only. NOTE: The file format created before DataSight Spark V100R002C30 is considered the old format. |
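
The path example given for carbon.ddl.base.hdfs.url can be sketched as plain string joining. This is only an illustration of how the three pieces combine, not CarbonData's own code; the function name resolve_csv_path is hypothetical.

```python
def resolve_csv_path(default_fs: str, ddl_base_url: str, csv_path: str) -> str:
    """Join fs.defaultFS, carbon.ddl.base.hdfs.url, and a relative CSV path."""
    # Trim slashes at the joins so each boundary gets exactly one separator.
    parts = [default_fs.rstrip("/"), ddl_base_url.strip("/"), csv_path.lstrip("/")]
    return "/".join(p for p in parts if p)


full_path = resolve_csv_path(
    "hdfs://10.18.101.155:54310",  # value of fs.defaultFS
    "/data/cnbc/",                 # value of carbon.ddl.base.hdfs.url
    "/2016/xyz.csv",               # path given while loading data
)
print(full_path)  # hdfs://10.18.101.155:54310/data/cnbc/2016/xyz.csv
```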

Performance Configuration

This section provides the details of all the configurations required for Carbon performance optimization.

Performance Configuration in carbon.properties

  1. Data Loading Configuration
| Parameter | Default Value | Description | Range |
|---|---|---|---|
| carbon.sort.file.buffer.size | 20 | File read buffer size used during sorting. The value is in MB. | Min=1 and Max=100 |
| carbon.graph.rowset.size | 100000 | Rowset size exchanged between data load graph steps. | Min=500 and Max=1000000 |
| carbon.number.of.cores.while.loading | 6 | Number of cores to be used while loading data. | |
| carbon.sort.size | 500000 | Record count to sort and write to temp intermediate files. | |
| carbon.enableXXHash | true | Algorithm for hashmap for hashkey calculation. | |
| carbon.number.of.cores.block.sort | 7 | Number of cores to be used for block sort while loading data. | |
| carbon.max.driver.lru.cache.size | -1 | Maximum LRU cache size up to which data will be loaded at the driver side. The value is in MB. The default value of -1 means there is no memory limit for caching. Only integer values greater than 0 are accepted. | |
| carbon.max.executor.lru.cache.size | -1 | Maximum LRU cache size up to which data will be loaded at the executor side. The value is in MB. The default value of -1 means there is no memory limit for caching. Only integer values greater than 0 are accepted. If this parameter is not configured, the carbon.max.driver.lru.cache.size value will be considered. | |
| carbon.merge.sort.prefetch | true | Enable prefetch of data during merge sort while reading data from sort temp files in data loading. | |
| carbon.update.persist.enable | true | Enabling this parameter considers persistent data. Enabling this will reduce the execution time of the UPDATE operation. | |
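
The documented ranges can be checked before a carbon.properties file is deployed. The sketch below is a hypothetical helper, not part of CarbonData; it assumes the Min and Max bounds are inclusive, and it treats -1 as the only accepted non-positive LRU cache size.

```python
# Ranges copied from the table above; tuples are (min, max), assumed inclusive.
RANGES = {
    "carbon.sort.file.buffer.size": (1, 100),      # MB
    "carbon.graph.rowset.size": (500, 1000000),
}


def out_of_range(props: dict) -> list:
    """Return the names of properties whose integer value is outside its range."""
    bad = []
    for name, (low, high) in RANGES.items():
        if name in props and not low <= int(props[name]) <= high:
            bad.append(name)
    return bad


def valid_lru_cache_size(value: int) -> bool:
    """-1 means no limit; otherwise only integers greater than 0 are accepted."""
    return value == -1 or value > 0


print(out_of_range({"carbon.sort.file.buffer.size": "20",
                    "carbon.graph.rowset.size": "100"}))
print(valid_lru_cache_size(-1), valid_lru_cache_size(0))
```
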
  2. Compaction Configuration
| Parameter | Default Value | Description | Range |
|---|---|---|---|
| carbon.number.of.cores.while.compacting | 2 | Number of cores used to write data during compaction. | |
| carbon.compaction.level.threshold | 4,3 | This property is for minor compaction and decides how many segments are merged. Example: if it is set to 2,3, minor compaction is triggered for every 2 segments. 3 is the number of level 1 compacted segments that are further compacted into a new segment. | Valid values are from 0-100. |
| carbon.major.compaction.size | 1024 | Major compaction size can be configured using this parameter. Segments whose combined size is below this threshold will be merged. The value is in MB. | |
| carbon.horizontal.compaction.enable | true | This property is used to turn horizontal compaction ON or OFF. After every DELETE and UPDATE statement, horizontal compaction may occur if the number of delta (DELETE/UPDATE) files exceeds the specified threshold. Horizontal compaction is turned ON by default and can be turned OFF by setting the value to false. | |
| carbon.horizontal.UPDATE.compaction.threshold | 1 | This property specifies the threshold limit on the number of UPDATE delta files within a segment. If the number of delta files goes beyond the threshold, the UPDATE delta files within the segment become eligible for horizontal compaction and are compacted into a single UPDATE delta file. By default the value is set to 1 and can be altered to values between 1 and 10000. | |
| carbon.horizontal.DELETE.compaction.threshold | 1 | This property specifies the threshold limit on the number of DELETE delta files within a block of a segment. If the number of delta files goes beyond the threshold, the DELETE delta files for that block of the segment become eligible for horizontal compaction and are compacted into a single DELETE delta file. By default the value is set to 1 and can be altered to values between 1 and 10000. | |
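
The two numbers in carbon.compaction.level.threshold can be read as a simple counting model: the first is how many loaded segments trigger a level 1 merge, the second is how many level 1 segments trigger a further merge. The sketch below only counts merges under that reading of the description; it is not CarbonData's scheduling logic, and minor_compaction_counts is a hypothetical name.

```python
def minor_compaction_counts(loaded_segments: int, threshold: str = "4,3"):
    """Count level 1 and level 2 merges for a number of loaded segments."""
    level1_trigger, level2_trigger = (int(x) for x in threshold.split(","))
    # Every `level1_trigger` loaded segments produce one level 1 segment.
    level1_merges = loaded_segments // level1_trigger
    # Every `level2_trigger` level 1 segments are compacted once more.
    level2_merges = level1_merges // level2_trigger
    return level1_merges, level2_merges


print(minor_compaction_counts(12, "2,3"))  # (6, 2)
print(minor_compaction_counts(12))         # (3, 1) with the default 4,3
```
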
  3. Query Configuration
| Parameter | Default Value | Description | Range |
|---|---|---|---|
| carbon.number.of.cores | 4 | Number of cores to be used while querying. | |
| carbon.inmemory.record.size | 120000 | Number of records to be kept in memory while querying. | Min=100000 and Max=240000 |
| carbon.enable.quick.filter | false | Improves the performance of filter queries. | |
| no.of.cores.to.load.blocks.in.driver | 10 | Number of cores used to load the blocks in the driver. | |

Miscellaneous Configurations

Extra Configuration in carbon.properties

  1. Time format for CarbonData
| Parameter | Default Format | Description |
|---|---|---|
| carbon.timestamp.format | yyyy-MM-dd HH:mm:ss | Timestamp format of input data used for timestamp data type. |
  2. Dataload Configuration
| Parameter | Default Value | Description |
|---|---|---|
| carbon.sort.file.write.buffer.size | 10485760 | File write buffer size used during sorting. |
| carbon.lock.type | LOCALLOCK | This configuration specifies the type of lock to be acquired during concurrent operations on a table. The following lock implementations are available. LOCALLOCK: the lock is created on the local file system as a file. This lock is useful when only one Spark driver (thrift server) runs on a machine and no other Carbon Spark application is launched concurrently. HDFSLOCK: the lock is created on the HDFS file system as a file. This lock is useful when multiple Carbon Spark applications are launched, no ZooKeeper is running on the cluster, and HDFS supports file-based locking. |
| carbon.sort.intermediate.files.limit | 20 | Minimum number of intermediate files after which the sort merge is started. |
| carbon.block.meta.size.reserved.percentage | 10 | Space reserved, in percent, for writing block metadata in the carbon data file. |
| carbon.csv.read.buffersize.byte | 1048576 | CSV reading buffer size. |
| high.cardinality.value | 100000 | To identify and apply compression for non-high cardinality columns. |
| carbon.merge.sort.reader.thread | 3 | Maximum number of threads used for reading intermediate files for final merging. |
| carbon.load.metadata.lock.retries | 3 | Maximum number of retries to get the metadata lock for loading data to a table. |
| carbon.load.metadata.lock.retry.timeout.sec | 5 | Interval between the retries to get the lock. |
| carbon.tempstore.location | /opt/Carbon/TempStoreLoc | Temporary store location. By default it takes System.getProperty("java.io.tmpdir"). |
| carbon.load.log.counter | 500000 | Data loading records count logger. |
  3. Compaction Configuration
| Parameter | Default Value | Description |
|---|---|---|
| carbon.numberof.preserve.segments | 0 | Set this property to preserve a number of segments from being compacted. Example: with carbon.numberof.preserve.segments=2, the 2 latest segments will always be excluded from compaction. No segments are preserved by default. |
| carbon.allowed.compaction.days | 0 | Compaction will merge the segments which are loaded within the configured number of days. Example: if the configuration is 2, only the segments loaded within a time frame of 2 days will be merged. Segments loaded 2 days apart will not be merged. This is disabled by default. |
| carbon.enable.auto.load.merge | false | To enable compaction while loading data. |
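
The effect of the two segment-selection properties can be sketched as a filter over the loaded segments. This is a hypothetical model, not CarbonData's implementation. In particular, measuring the carbon.allowed.compaction.days window from the oldest remaining segment is an assumption; the description above does not pin down the reference point.

```python
from datetime import datetime, timedelta


def compaction_candidates(segments, preserve=0, allowed_days=0):
    """segments: list of (segment_id, load_time) tuples, oldest first."""
    # carbon.numberof.preserve.segments: always skip the N most recent loads.
    candidates = segments[:-preserve] if preserve > 0 else list(segments)
    # carbon.allowed.compaction.days: 0 disables the check (documented default).
    if allowed_days > 0 and candidates:
        # Assumption: the window starts at the oldest remaining segment.
        window_end = candidates[0][1] + timedelta(days=allowed_days)
        candidates = [s for s in candidates if s[1] <= window_end]
    return [segment_id for segment_id, _ in candidates]


loads = [
    ("0", datetime(2016, 1, 1)),
    ("1", datetime(2016, 1, 2)),
    ("2", datetime(2016, 1, 5)),
    ("3", datetime(2016, 1, 6)),
]
print(compaction_candidates(loads, preserve=1, allowed_days=2))  # ['0', '1']
```
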
  4. Query Configuration
| Parameter | Default Value | Description |
|---|---|---|
| max.query.execution.time | 60 | Maximum time allowed for one query to be executed. The value is in minutes. |
| carbon.enableMinMax | true | Min max is a feature added to enhance query performance. To disable this feature, set it to false. |
  5. Global Dictionary Configurations
| Parameter | Default Value | Description |
|---|---|---|
| high.cardinality.identify.enable | true | If the parameter is true, the high cardinality columns among the dictionary-encoded columns are automatically recognized and are not encoded with the global dictionary. If the parameter is false, all dictionary encoding columns use dictionary encoding. A high cardinality column must meet the following requirements. (1) Value of cardinality > configured value of high.cardinality; that is, the cardinality is higher than the threshold. (2) Value of cardinality / row number x 100 > configured value of high.cardinality.row.count.percentage; that is, the ratio of the cardinality to the number of data rows is higher than the configured percentage. |
| high.cardinality.threshold | 1000000 | Threshold used to identify whether a column is a high cardinality column. Configuration value formula: value of cardinality > configured value of high.cardinality. The minimum value is 10000. |
| high.cardinality.row.count.percentage | 80 | Percentage used to identify whether the column cardinality is more than the configured percent of the total row count. Configuration value formula: value of cardinality / row number x 100 > configured value of high.cardinality.row.count.percentage. The value of the parameter must be larger than 0. |
| carbon.cutOffTimestamp | 1970-01-01 05:30:00 | Sets the start date for calculating the timestamp. Java counts the number of milliseconds from the start of "1970-01-01 00:00:00". This property is used to customize the start position, for example "2000-01-01 00:00:00". The date must be in the format configured in carbon.timestamp.format. NOTE: Carbon supports data store up to 68 years from the cut-off time defined. For example, if the cut-off time is 1970-01-01 05:30:00, then the data can be stored up to 2038-01-01 05:30:00. |
| carbon.timegranularity | SECOND | The property used to set the data granularity level: DAY, HOUR, MINUTE, or SECOND. |
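
The two requirements for a high cardinality column can be written out as a small check. This is a sketch of the formulas in the table, using the documented defaults; the function and argument names are hypothetical and this is not CarbonData's own code.

```python
def is_high_cardinality(cardinality: int, row_count: int,
                        threshold: int = 1000000,
                        row_count_percentage: float = 80) -> bool:
    """Apply both documented requirements; a column must meet both."""
    if row_count <= 0:
        return False  # no rows, so the ratio is undefined
    # Requirement 1: cardinality is higher than the threshold.
    above_threshold = cardinality > threshold
    # Requirement 2: cardinality / row number x 100 exceeds the percentage.
    above_percentage = cardinality / row_count * 100 > row_count_percentage
    return above_threshold and above_percentage


print(is_high_cardinality(1_500_000, 1_600_000))   # True: 93.75% of rows
print(is_high_cardinality(1_500_000, 10_000_000))  # False: only 15% of rows
```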

Spark Configuration

Spark Configuration Reference in spark-defaults.conf

| Parameter | Default Value | Description |
|---|---|---|
| spark.driver.memory | 1g | Amount of memory to use for the driver process, i.e. where SparkContext is initialized. NOTE: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file. |
| spark.executor.memory | 1g | Amount of memory to use per executor process. |
| spark.sql.bigdata.register.analyseRule | org.apache.spark.sql.hive.acl.CarbonAccessControlRules | CarbonAccessControlRules need to be set for enabling Access Control. |