Installation Guide

This tutorial guides you through the installation and configuration of CarbonData in the following two modes:

  • Installing and Configuring CarbonData on Standalone Spark Cluster
  • Installing and Configuring CarbonData on "Spark on YARN" Cluster

followed by:

  • Query Execution Using CarbonData Thrift Server

Installing and Configuring CarbonData on Standalone Spark Cluster

Prerequisites

  • Hadoop HDFS and YARN should be installed and running.

  • Spark should be installed and running on all the cluster nodes.

  • CarbonData user should have permission to access HDFS.

Procedure

  • Build the CarbonData project and get the assembly jar from "./assembly/target/scala-2.10/carbondata_xxx.jar", then copy it to the "<SPARK_HOME>/carbonlib" folder. (A consolidated shell sketch of this procedure is given at the end of this list.)

    NOTE: Create the carbonlib folder if it does not exist inside the "<SPARK_HOME>" path.

  • Add the carbonlib folder path to the Spark classpath: edit the "<SPARK_HOME>/conf/spark-env.sh" file and append "<SPARK_HOME>/carbonlib/*" to the existing value of SPARK_CLASSPATH.

  • Copy the "carbon.properties.template" file from the "./conf/" folder of the CarbonData repository to "<SPARK_HOME>/conf/" and rename it to "carbon.properties".

  • Copy the "carbonplugins" folder from the "./processing/" folder of the CarbonData repository to the "<SPARK_HOME>/carbonlib" folder.

    NOTE: The carbonplugins folder contains the .kettle folder.

  • On each Spark node, configure the properties listed in the following table in the "<SPARK_HOME>/conf/spark-defaults.conf" file.

| Property | Value | Description |
|----------|-------|-------------|
| carbon.kettle.home | $SPARK_HOME/carbonlib/carbonplugins | Path that will be used by CarbonData internally to create graph for loading the data |
| spark.driver.extraJavaOptions | -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. |
| spark.executor.extraJavaOptions | -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by space. |
  • Add the following properties in the "<SPARK_HOME>/conf/carbon.properties" file (see the example append at the end of this procedure):

| Property | Required | Description | Example | Remark |
|----------|----------|-------------|---------|--------|
| carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | An HDFS directory is recommended. |
| carbon.kettle.home | YES | Path that will be used by CarbonData internally to create graph for loading the data. | $SPARK_HOME/carbonlib/carbonplugins | |

  • Verify the installation. For example:
   ./spark-shell --master spark://HOSTNAME:PORT --total-executor-cores 2
   --executor-memory 2G
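
For reference, the build and copy steps above can be approximated by the shell sketch below. It assumes a standard Maven build of the CarbonData checkout, that SPARK_HOME is set, and that "/path/to/carbondata" is a placeholder for your checkout (the actual jar name will differ per build):

# Build the assembly jar (assumes a standard Maven build of the CarbonData checkout).
cd /path/to/carbondata
mvn clean package -DskipTests

# Copy the jar, the carbonplugins folder, and the properties template into Spark.
mkdir -p "$SPARK_HOME/carbonlib"
cp assembly/target/scala-2.10/carbondata_*.jar "$SPARK_HOME/carbonlib/"
cp -r processing/carbonplugins "$SPARK_HOME/carbonlib/"        # contains the .kettle folder
cp conf/carbon.properties.template "$SPARK_HOME/conf/carbon.properties"

# Append the carbonlib folder to the Spark classpath.
echo "SPARK_CLASSPATH=\$SPARK_CLASSPATH:$SPARK_HOME/carbonlib/*" >> "$SPARK_HOME/conf/spark-env.sh"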
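
The property tables above translate into configuration entries along these lines (the HDFS store location is a placeholder and should match your deployment):

cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<EOF
carbon.kettle.home $SPARK_HOME/carbonlib/carbonplugins
spark.driver.extraJavaOptions -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
EOF

cat >> "$SPARK_HOME/conf/carbon.properties" <<EOF
carbon.storelocation=hdfs://HOSTNAME:PORT/Opt/CarbonStore
carbon.kettle.home=$SPARK_HOME/carbonlib/carbonplugins
EOF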

NOTE: Make sure you have read permission on the CarbonData JARs and files from which the driver and executors will start.

To get started with CarbonData: Quick Start, DDL Operations on CarbonData

Installing and Configuring CarbonData on “Spark on YARN” Cluster

This section provides the procedure to install CarbonData on a "Spark on YARN" cluster.

Prerequisites

  • Hadoop HDFS and YARN should be installed and running.
  • Spark should be installed and running on all the client nodes.
  • CarbonData user should have permission to access HDFS.

Procedure

The following steps are to be performed only on the driver nodes. (Driver nodes are the nodes on which the Spark context is started.)

  • Build the CarbonData project and get the assembly jar from "./assembly/target/scala-2.10/carbondata_xxx.jar", then copy it to the "<SPARK_HOME>/carbonlib" folder.

    NOTE: Create the carbonlib folder if it does not exist inside the "<SPARK_HOME>" path.
    
  • Copy the "carbonplugins" folder from the "./processing/" folder of the CarbonData repository to the "<SPARK_HOME>/carbonlib" folder. The carbonplugins folder contains the .kettle folder.

  • Copy the "carbon.properties.template" file from the "./conf/" folder of the CarbonData repository to "<SPARK_HOME>/conf/" and rename it to "carbon.properties".

  • Modify the following parameters in the "spark-defaults.conf" file located in "<SPARK_HOME>/conf" (an example snippet appears at the end of this procedure):

| Property | Description | Value |
|----------|-------------|-------|
| spark.master | Set this value to run Spark on YARN. | Set to "yarn-client" to run Spark in yarn-client mode. |
| spark.yarn.dist.files | Comma-separated list of files to be placed in the working directory of each executor. | "<YOUR_SPARK_HOME_PATH>"/conf/carbon.properties |
| spark.yarn.dist.archives | Comma-separated list of archives to be extracted into the working directory of each executor. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbondata_xxx.jar |
| spark.executor.extraJavaOptions | A string of extra JVM options to pass to executors. NOTE: You can enter multiple values separated by space. | -Dcarbon.properties.filepath="<YOUR_SPARK_HOME_PATH>"/conf/carbon.properties |
| spark.executor.extraClassPath | Extra classpath entries to prepend to the classpath of executors. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append the value to this parameter. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbondata_xxx.jar |
| spark.driver.extraClassPath | Extra classpath entries to prepend to the classpath of the driver. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append the value to this parameter. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbondata_xxx.jar |
| spark.driver.extraJavaOptions | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. | -Dcarbon.properties.filepath="<YOUR_SPARK_HOME_PATH>"/conf/carbon.properties |
| carbon.kettle.home | Path that will be used by CarbonData internally to create graph for loading the data. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbonplugins |
  • Add the following properties in the "<SPARK_HOME>/conf/carbon.properties" file:

| Property | Required | Description | Example | Remark |
|----------|----------|-------------|---------|--------|
| carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | An HDFS directory is recommended. |
| carbon.kettle.home | YES | Path that will be used by CarbonData internally to create graph for loading the data. | $SPARK_HOME/carbonlib/carbonplugins | |

  • Verify the installation.
     ./bin/spark-shell --master yarn-client --driver-memory 1g
     --executor-cores 2 --executor-memory 2G
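
For reference, the spark-defaults.conf additions from the table above might look like the snippet below. "<YOUR_SPARK_HOME_PATH>" and the exact jar name are placeholders that depend on your installation and build:

cat >> "<YOUR_SPARK_HOME_PATH>/conf/spark-defaults.conf" <<'EOF'
spark.master yarn-client
spark.yarn.dist.files <YOUR_SPARK_HOME_PATH>/conf/carbon.properties
spark.yarn.dist.archives <YOUR_SPARK_HOME_PATH>/carbonlib/carbondata_xxx.jar
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=<YOUR_SPARK_HOME_PATH>/conf/carbon.properties
spark.executor.extraClassPath <YOUR_SPARK_HOME_PATH>/carbonlib/carbondata_xxx.jar
spark.driver.extraClassPath <YOUR_SPARK_HOME_PATH>/carbonlib/carbondata_xxx.jar
spark.driver.extraJavaOptions -Dcarbon.properties.filepath=<YOUR_SPARK_HOME_PATH>/conf/carbon.properties
carbon.kettle.home <YOUR_SPARK_HOME_PATH>/carbonlib/carbonplugins
EOF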

NOTE: Make sure you have read permission on the CarbonData JARs and files from which the driver and executors will start.

Getting started with CarbonData: Quick Start, DDL Operations on CarbonData

Query Execution Using CarbonData Thrift Server

Starting CarbonData Thrift Server

a. cd <SPARK_HOME>

b. Run the following command to start the CarbonData thrift server.

./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer
$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>

| Parameter | Description | Example |
|-----------|-------------|---------|
| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the "<SPARK_HOME>/carbonlib/" folder. | carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar |
| carbon_store_path | This is a parameter to the CarbonThriftServer class. It is an HDFS path where CarbonData files will be kept. It is strongly recommended to use the same value as the carbon.storelocation parameter in carbon.properties. | hdfs://<host_name>:54310/user/hive/warehouse/carbon.store |

Examples

  • Start with default memory and executors
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true 
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer 
$SPARK_HOME/carbonlib/carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar 
hdfs://hacluster/user/hive/warehouse/carbon.store
  • Start with fixed executors and resources
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true 
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer 
--num-executors 3 --driver-memory 20g --executor-memory 250g 
--executor-cores 32 
/srv/OSCON/BigData/HACluster/install/spark/sparkJdbc/lib/carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar 
hdfs://hacluster/user/hive/warehouse/carbon.store

Connecting to CarbonData Thrift Server Using Beeline

     cd <SPARK_HOME>
     ./bin/beeline -u jdbc:hive2://<thriftserver_host>:<port>

     Example
     ./bin/beeline -u jdbc:hive2://10.10.10.10:10000
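
Once connected, regular SQL can be issued through Beeline. A minimal sketch, assuming the thrift server from the example above and a made-up table name (see the DDL Operations on CarbonData guide for the full syntax):

# Host, port, and the table name "sample" are placeholders.
./bin/beeline -u jdbc:hive2://10.10.10.10:10000 \
  -e "CREATE TABLE IF NOT EXISTS sample (id INT, name STRING) STORED BY 'carbondata';" \
  -e "SHOW TABLES;"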