Installation Guide

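This tutorial guides you through the installation and configuration of CarbonData in the following two modes:

  • Installing and Configuring CarbonData on Standalone Spark Cluster
  • Installing and Configuring CarbonData on “Spark on YARN” Cluster

followed by:

  • Query Execution Using CarbonData Thrift Server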

Installing and Configuring CarbonData on Standalone Spark Cluster


  • Hadoop HDFS and YARN should be installed and running.

  • Spark should be installed and running on all the cluster nodes.

  • CarbonData user should have permission to access HDFS.


  • Build the CarbonData project, get the assembly jar from "./assembly/target/scala-2.10/carbondata_xxx.jar", and place it in the "<SPARK_HOME>/carbonlib" folder.

    NOTE: Create the carbonlib folder if it does not exist inside the "<SPARK_HOME>" path.

  • Add the carbonlib folder path in the Spark classpath. (Edit "<SPARK_HOME>/conf/" file and modify the value of SPARK_CLASSPATH by appending "<SPARK_HOME>/carbonlib/*" to the existing value)

  • Copy the to "<SPARK_HOME>/conf/" folder from “./conf/” of CarbonData repository.

  • Copy the “carbonplugins” folder to "<SPARK_HOME>/carbonlib" folder from “./processing/” folder of CarbonData repository.

    NOTE: carbonplugins will contain .kettle folder.

  • On the Spark node, configure the properties listed in the following table in the "<SPARK_HOME>/conf/spark-defaults.conf" file.

| Property | Value | Description |
|---|---|---|
| carbon.kettle.home | $SPARK_HOME/carbonlib/carbonplugins | Path that will be used by CarbonData internally to create graph for loading the data. |
|  | $SPARK_HOME/conf/carbon.properties | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. |
|  | $SPARK_HOME/conf/carbon.properties | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by space. |

  • Add the following properties in "<SPARK_HOME>/conf/"

| Property | Required | Description | Example | Default Value |
|---|---|---|---|---|
| carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose to set HDFS directory |
| carbon.kettle.home | YES | Path that will be used by CarbonData internally to create graph for loading the data. | $SPARK_HOME/carbonlib/carbonplugins |  |

  • Verify the installation. For example:
   ./spark-shell --master spark://HOSTNAME:PORT --total-executor-cores 2 \
   --executor-memory 2G

NOTE: Make sure you have permissions for CarbonData JARs and files through which driver and executor will start.
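
The file-staging steps of the procedure above (create carbonlib, copy the assembly jar, copy carbonplugins) can be sketched as a small shell function. This is only a sketch under stated assumptions: the function name stage_carbondata and its two arguments are hypothetical, and the glob carbondata_*.jar stands in for the jar name that the guide gives only as carbondata_xxx.jar. Copying the properties file and editing the Spark configuration are deliberately left out.

```shell
#!/bin/sh
# Hypothetical helper: stage the CarbonData jar and plugins under <SPARK_HOME>/carbonlib.
# $1 = root of the CarbonData repository (already built), $2 = SPARK_HOME
stage_carbondata() {
    repo=$1
    spark_home=$2
    carbonlib="$spark_home/carbonlib"

    # Create the carbonlib folder if it does not exist.
    mkdir -p "$carbonlib" || return 1

    # Copy the assembly jar. The glob is an assumption standing in for carbondata_xxx.jar.
    found=0
    for jar in "$repo"/assembly/target/scala-2.10/carbondata_*.jar; do
        [ -f "$jar" ] || continue
        cp "$jar" "$carbonlib/" || return 1
        found=1
    done
    if [ "$found" -eq 0 ]; then
        echo "no assembly jar found; build the CarbonData project first" >&2
        return 1
    fi

    # Copy the carbonplugins folder, which contains the .kettle folder.
    cp -R "$repo/processing/carbonplugins" "$carbonlib/" || return 1
}

# Example call (paths are placeholders):
#   stage_carbondata /path/to/carbondata "$SPARK_HOME"
```

The function returns a non-zero status when no jar is found, so a calling script can stop before Spark is started with an incomplete carbonlib folder.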

To get started with CarbonData: Quick Start, DDL Operations on CarbonData

Installing and Configuring CarbonData on “Spark on YARN” Cluster

This section provides the procedure to install CarbonData on “Spark on YARN” cluster.


  • Hadoop HDFS and YARN should be installed and running.
  • Spark should be installed and running in all the clients.
  • CarbonData user should have permission to access HDFS.


The following steps apply only to driver nodes. (A driver node is the node that starts the Spark context.)

  • Build the CarbonData project, get the assembly jar from "./assembly/target/scala-2.10/carbondata_xxx.jar", and place it in the "<SPARK_HOME>/carbonlib" folder.

    NOTE: Create the carbonlib folder if it does not exist inside the "<SPARK_HOME>" path.

  • Copy the "carbonplugins" folder to the "<SPARK_HOME>/carbonlib" folder from the "./processing/" folder of the CarbonData repository.

    NOTE: carbonplugins will contain the .kettle folder.

  • Copy the “” to "<SPARK_HOME>/conf/" folder from conf folder of CarbonData repository.

  • Modify the parameters in "spark-defaults.conf", located in the "<SPARK_HOME>/conf" folder.

| Property | Description | Value |
|---|---|---|
| spark.master | Set this value to run Spark on the YARN cluster. | Set "yarn-client" to run Spark on the YARN cluster. |
| spark.yarn.dist.files | Comma-separated list of files to be placed in the working directory of each executor. | "<YOUR_SPARK_HOME_PATH>"/conf/ |
| spark.yarn.dist.archives | Comma-separated list of archives to be extracted into the working directory of each executor. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbondata_xxx.jar |
| spark.executor.extraJavaOptions | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by space. | "<YOUR_SPARK_HOME_PATH>"/conf/ |
| spark.executor.extraClassPath | Extra classpath entries to prepend to the classpath of executors. NOTE: If SPARK_CLASSPATH is defined, comment it out and append its value to spark.driver.extraClassPath. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbonlib/carbondata_xxx.jar |
| spark.driver.extraClassPath | Extra classpath entries to prepend to the classpath of the driver. NOTE: If SPARK_CLASSPATH is defined, comment it out and append its value to spark.driver.extraClassPath. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbonlib/carbondata_xxx.jar |
| spark.driver.extraJavaOptions | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. | "<YOUR_SPARK_HOME_PATH>"/conf/ |
| carbon.kettle.home | Path that will be used by CarbonData internally to create graph for loading the data. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbonplugins |

  • Add the following properties in <SPARK_HOME>/conf/

| Property | Required | Description | Example | Default Value |
|---|---|---|---|---|
| carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose to set HDFS directory |
| carbon.kettle.home | YES | Path that will be used by CarbonData internally to create graph for loading the data. | $SPARK_HOME/carbonlib/carbonplugins |  |

  • Verify the installation.
     ./bin/spark-shell --master yarn-client --driver-memory 1g \
     --executor-cores 2 --executor-memory 2G

NOTE: Make sure you have permissions for CarbonData JARs and files through which driver and executor will start.
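
The spark-defaults.conf edits in the procedure above can be made by hand or scripted. The following is a minimal sketch, assuming a hypothetical helper named set_spark_conf that writes whitespace-separated "key value" lines (the format spark-defaults.conf uses) and replaces any existing entry for the same key, so that it can be rerun safely. The helper name and its arguments are illustrative and not part of CarbonData or Spark.

```shell
#!/bin/sh
# Hypothetical helper: set "key value" in a spark-defaults.conf style file.
# $1 = configuration file, $2 = property name, $3 = property value
set_spark_conf() {
    conf=$1
    key=$2
    value=$3
    touch "$conf" || return 1
    tmp="$conf.tmp.$$"
    # Keep every line whose first field is not the key (comments and blanks survive).
    awk -v k="$key" '$1 != k' "$conf" > "$tmp" || return 1
    # Append the new entry, then replace the original file.
    printf '%s %s\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$conf"
}

# Example calls (values taken from the table in this section):
#   set_spark_conf "$SPARK_HOME/conf/spark-defaults.conf" spark.master yarn-client
#   set_spark_conf "$SPARK_HOME/conf/spark-defaults.conf" carbon.kettle.home "$SPARK_HOME/carbonlib/carbonplugins"
```

Rewriting the file through a temporary copy keeps unrelated properties in their original order and avoids duplicate entries when the same key is set twice.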

To get started with CarbonData: Quick Start, DDL Operations on CarbonData

Query Execution Using CarbonData Thrift Server

Starting CarbonData Thrift Server

a. cd <SPARK_HOME>

b. Run the following command to start the CarbonData thrift server.

./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>

| Parameter | Description | Example |
|---|---|---|
| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the "<SPARK_HOME>"/carbonlib/ folder. | carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar |
| carbon_store_path | This is a parameter to the CarbonThriftServer class. It is an HDFS path where CarbonData files will be kept. It is strongly recommended to use the same value as the carbon.storelocation parameter. | <host_name>:54310/user/hive/warehouse/ |


  • Start with default memory and executors
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer
  • Start with fixed executors and resources
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
--num-executors 3 --driver-memory 20g --executor-memory 250g \
--executor-cores 32
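
The two start-up variants above differ only in the resource options passed to spark-submit. As a sketch, a hypothetical helper named thrift_server_cmd can assemble the full command line from the jar name, the store path, and any optional resource options. It only prints the command, so it can be reviewed before it is run. The helper name and argument order are assumptions made for illustration. The option and class names are the ones used in this section.

```shell
#!/bin/sh
# Hypothetical helper: print the spark-submit command for the CarbonData Thrift server.
# $1 = assembly jar name inside $SPARK_HOME/carbonlib
# $2 = carbon store path
# remaining arguments = optional spark-submit resource options
thrift_server_cmd() {
    jar=$1
    store=$2
    shift 2
    printf '%s' "./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true"
    printf ' %s' "--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer"
    # spark-submit options must come before the application jar;
    # anything after the jar is passed to the CarbonThriftServer class.
    for opt in "$@"; do
        printf ' %s' "$opt"
    done
    # \$SPARK_HOME is printed literally so the shell that runs the command expands it.
    printf ' %s %s\n' "\$SPARK_HOME/carbonlib/$jar" "$store"
}

# Example calls (jar name and store path are placeholders):
#   thrift_server_cmd carbondata_xxx.jar hdfs://HOSTNAME:PORT/Opt/CarbonStore
#   thrift_server_cmd carbondata_xxx.jar hdfs://HOSTNAME:PORT/Opt/CarbonStore --num-executors 3 --executor-cores 32
```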

Connecting to CarbonData Thrift Server Using Beeline

     cd <SPARK_HOME>
     ./bin/beeline jdbc:hive2://<thriftserver_host>:<port>

     ./bin/beeline jdbc:hive2://
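
The JDBC URL passed to Beeline is built from the Thrift server host and port. A minimal sketch, assuming a hypothetical helper named thrift_jdbc_url and assuming that the server listens on port 10000 (the usual HiveServer2 default) when no port is given:

```shell
#!/bin/sh
# Hypothetical helper: build the JDBC URL that Beeline connects to.
# $1 = Thrift server host, $2 = port (assumed default: 10000)
thrift_jdbc_url() {
    host=${1:-}
    port=${2:-10000}
    if [ -z "$host" ]; then
        echo "usage: thrift_jdbc_url <host> [port]" >&2
        return 1
    fi
    printf 'jdbc:hive2://%s:%s\n' "$host" "$port"
}

# Example call (the host is a placeholder):
#   thrift_jdbc_url thriftserver.example.com 10000
```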