This tutorial guides you through the installation and configuration of CarbonData in the following two modes: on a standalone Spark cluster and on a "Spark on YARN" cluster, followed by query execution using the CarbonData thrift server.
The following steps are only for driver nodes. (Driver nodes are the nodes that start the Spark context.)
Build the CarbonData project, take the assembly jar from "./assembly/target/scala-2.10/carbondata_xxx.jar", and copy it to the "<SPARK_HOME>/carbonlib" folder.
(Note: Create the carbonlib folder if it does not exist inside the "<SPARK_HOME>" path.)
The carbonlib folder path must be added to the Spark classpath. (Edit the "<SPARK_HOME>/conf/spark-env.sh" file and modify the value of SPARK_CLASSPATH by appending "<SPARK_HOME>/carbonlib/*" to the existing value; see the sketch below.)
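A minimal sketch of the resulting entry in "<SPARK_HOME>/conf/spark-env.sh", assuming SPARK_HOME is set in the environment, could look like this:

```
# Append the carbonlib folder to the Spark classpath (sketch; adapt to your existing value).
export SPARK_CLASSPATH="${SPARK_CLASSPATH}:${SPARK_HOME}/carbonlib/*"
```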
Copy the "./conf/carbon.properties.template" file from the CarbonData repository to the "<SPARK_HOME>/conf/" folder and rename it to "carbon.properties".
Copy the "carbonplugins" folder from the "./processing/" folder of the CarbonData repository to the "<SPARK_HOME>/carbonlib" folder.
(Note: The carbonplugins folder contains the .kettle folder.)
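For illustration only, the copy steps above might be performed with commands like the following, run from the root of the CarbonData repository (the jar name carbondata_xxx.jar is a placeholder for your actual build output):

```
# Sketch of the file placement steps above; adjust paths and the jar name to your build.
mkdir -p "$SPARK_HOME/carbonlib"
cp ./assembly/target/scala-2.10/carbondata_*.jar "$SPARK_HOME/carbonlib/"
cp -r ./processing/carbonplugins "$SPARK_HOME/carbonlib/"
cp ./conf/carbon.properties.template "$SPARK_HOME/conf/carbon.properties"
```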
On the Spark node, configure the properties listed in the table below in the "<SPARK_HOME>/conf/spark-defaults.conf" file (a sample snippet follows the table).
Property | Description | Value |
---|---|---|
carbon.kettle.home | Path that will be used by CarbonData internally to create the graph for loading the data. | $SPARK_HOME/carbonlib/carbonplugins |
spark.driver.extraJavaOptions | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. | -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties |
spark.executor.extraJavaOptions | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by space. | -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties |
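As a sketch, the three properties above could be appended to "<SPARK_HOME>/conf/spark-defaults.conf" as follows; $SPARK_HOME is expanded when the lines are written, so the file ends up with absolute paths:

```
# Append the CarbonData-related properties to spark-defaults.conf (sketch).
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<EOF
carbon.kettle.home $SPARK_HOME/carbonlib/carbonplugins
spark.driver.extraJavaOptions -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
EOF
```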
Add the below properties to the "<SPARK_HOME>/conf/carbon.properties" file (a sample snippet follows the table):
Property | Required | Description | Example | Remark |
---|---|---|---|---|
carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. | hdfs://IP:PORT/Opt/CarbonStore | Propose to set an HDFS directory |
carbon.kettle.home | YES | Path that will be used by CarbonData internally to create the graph for loading the data. | $SPARK_HOME/carbonlib/carbonplugins | |
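As a sketch, the resulting entries in "<SPARK_HOME>/conf/carbon.properties" might look like this (the HDFS location is a placeholder; replace it with your own):

```
# Example carbon.properties entries (sketch; hdfs://IP:PORT/Opt/CarbonStore is a placeholder).
cat >> "$SPARK_HOME/conf/carbon.properties" <<EOF
carbon.storelocation=hdfs://IP:PORT/Opt/CarbonStore
carbon.kettle.home=$SPARK_HOME/carbonlib/carbonplugins
EOF
```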
Start the Spark shell, for example:
./spark-shell --master spark://IP:PORT --total-executor-cores 2 --executor-memory 2G
Note: Make sure the user has the required permissions on the CarbonData jars and files through which the driver and executors will be started.
To get started with CarbonData: Quick Start, DDL Operations
This section provides the procedure to install CarbonData on a "Spark on YARN" cluster.
Build the CarbonData project, take the assembly jar from "./assembly/target/scala-2.10/carbondata_xxx.jar", and copy it to the "<SPARK_HOME>/carbonlib" folder.
(Note: Create the carbonlib folder if it does not exist inside the "<SPARK_HOME>" path.)
Copy the "carbonplugins" folder from the "./processing/" folder of the CarbonData repository to the "<SPARK_HOME>/carbonlib" folder. (Note: The carbonplugins folder contains the .kettle folder.)
Copy the "carbon.properties.template" file from the conf folder of the CarbonData repository to the "<SPARK_HOME>/conf/" folder and rename it to "carbon.properties".
Modify the parameters listed in the table below in the "spark-defaults.conf" file located in "<SPARK_HOME>/conf" (a sample snippet follows the table).
Property | Description | Value |
---|---|---|
spark.master | Set this value to run Spark in yarn cluster mode. | Set to "yarn-client" to run Spark in yarn cluster mode. |
spark.yarn.dist.files | Comma-separated list of files to be placed in the working directory of each executor. | <YOUR_SPARK_HOME_PATH>/conf/carbon.properties |
spark.yarn.dist.archives | Comma-separated list of archives to be extracted into the working directory of each executor. | <YOUR_SPARK_HOME_PATH>/carbonlib/carbondata_xxx.jar |
spark.executor.extraJavaOptions | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by space. | -Dcarbon.properties.filepath=carbon.properties |
spark.executor.extraClassPath | Extra classpath entries to prepend to the classpath of executors. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, then comment it out and append the values to this parameter (spark.executor.extraClassPath). | <YOUR_SPARK_HOME_PATH>/carbonlib/carbondata_xxx.jar |
spark.driver.extraClassPath | Extra classpath entries to prepend to the classpath of the driver. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, then comment it out and append the values to this parameter (spark.driver.extraClassPath). | <YOUR_SPARK_HOME_PATH>/carbonlib/carbondata_xxx.jar |
spark.driver.extraJavaOptions | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. | -Dcarbon.properties.filepath=<YOUR_SPARK_HOME_PATH>/conf/carbon.properties |
carbon.kettle.home | Path that will be used by CarbonData internally to create the graph for loading the data. | <YOUR_SPARK_HOME_PATH>/carbonlib/carbonplugins |
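For illustration only, a corresponding "spark-defaults.conf" fragment might look like the sketch below; <YOUR_SPARK_HOME_PATH> is written as $SPARK_HOME here and carbondata_xxx.jar stands for your actual assembly jar name:

```
# Sketch of the YARN-mode entries listed above; adjust paths and the jar name to your environment.
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<EOF
spark.master yarn-client
spark.yarn.dist.files $SPARK_HOME/conf/carbon.properties
spark.yarn.dist.archives $SPARK_HOME/carbonlib/carbondata_xxx.jar
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=carbon.properties
spark.executor.extraClassPath $SPARK_HOME/carbonlib/carbondata_xxx.jar
spark.driver.extraClassPath $SPARK_HOME/carbonlib/carbondata_xxx.jar
spark.driver.extraJavaOptions -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
carbon.kettle.home $SPARK_HOME/carbonlib/carbonplugins
EOF
```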
Add the below properties to the "<SPARK_HOME>/conf/carbon.properties" file:
Property | Required | Description | Example | Remark |
---|---|---|---|---|
carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. | hdfs://IP:PORT/Opt/CarbonStore | Propose to set an HDFS directory |
carbon.kettle.home | YES | Path that will be used by CarbonData internally to create the graph for loading the data. | $SPARK_HOME/carbonlib/carbonplugins | |
Start the Spark shell, for example:
./bin/spark-shell --master yarn-client --driver-memory 1g --executor-cores 2 --executor-memory 2G
Note: Make sure the user has the required permissions on the CarbonData jars and files through which the driver and executors will be started.
To get started with CarbonData: Quick Start, DDL Operations
To execute queries through the CarbonData thrift server, start the thrift server as follows:
a. cd <SPARK_HOME>
b. Run the below command to start the CarbonData thrift server:
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
Parameter | Description | Example |
---|---|---|
CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the "<SPARK_HOME>/carbonlib/" folder. | carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar |
carbon_store_path | Parameter passed to the CarbonThriftServer class. This is the HDFS path where the CarbonData files will be kept. It is strongly recommended to set this to the same value as the carbon.storelocation parameter in carbon.properties. | hdfs://hacluster/user/hive/warehouse/carbon.store, hdfs://10.10.10.10:54310/user/hive/warehouse/carbon.store |
Examples:
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer $SPARK_HOME/carbonlib/carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar hdfs://hacluster/user/hive/warehouse/carbon.store
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer --num-executors 3 --driver-memory 20g --executor-memory 250g --executor-cores 32 /srv/OSCON/BigData/HACluster/install/spark/sparkJdbc/lib/carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar hdfs://hacluster/user/hive/warehouse/carbon.store
Connect to the CarbonData thrift server using the Beeline client:
cd <SPARK_HOME>
./bin/beeline jdbc:hive2://<thriftserver_host>:port
Example: ./bin/beeline jdbc:hive2://10.10.10.10:10000