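This tutorial guides you through the installation and configuration of CarbonData in the following two modes: on a standalone Spark cluster and on a “Spark on YARN” cluster. It then describes how to run queries through the CarbonData Thrift server.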
Hadoop HDFS and Yarn should be installed and running.
Spark should be installed and running on all the cluster nodes.
CarbonData user should have permission to access HDFS.
Build the CarbonData project, take the assembly jar from “./assembly/target/scala-2.10/carbondata_xxx.jar”, and put it in the "<SPARK_HOME>/carbonlib" folder.
NOTE: Create the carbonlib folder if it does not exist inside the "<SPARK_HOME>" path.
Add the carbonlib folder path to the Spark classpath: edit the "<SPARK_HOME>/conf/spark-env.sh" file and append "<SPARK_HOME>/carbonlib/*" to the existing value of SPARK_CLASSPATH.
Copy carbon.properties.template from the “./conf/” folder of the CarbonData repository to "<SPARK_HOME>/conf/carbon.properties".
Copy the “carbonplugins” folder from the “./processing/” folder of the CarbonData repository to the "<SPARK_HOME>/carbonlib" folder.
NOTE: The carbonplugins folder contains the .kettle folder.
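The copy steps above can be sketched as a small shell function. This is only an illustration under stated assumptions: `install_carbondata` is a hypothetical helper name, the glob `carbondata_*.jar` stands in for the version-specific jar name, and the SPARK_CLASSPATH change in spark-env.sh is not covered and still has to be made by hand.

```shell
# Sketch: stage CarbonData artifacts into a Spark installation.
# install_carbondata is a hypothetical helper, not part of CarbonData.
# $1 = root of the built CarbonData repository, $2 = SPARK_HOME
install_carbondata() {
  carbon_repo="$1"
  spark_home="$2"

  # Create the carbonlib folder if it does not exist yet.
  mkdir -p "$spark_home/carbonlib" "$spark_home/conf" || return 1

  # Copy the assembly jar; the glob stands in for the version-specific name.
  cp "$carbon_repo"/assembly/target/scala-2.10/carbondata_*.jar \
     "$spark_home/carbonlib/" || return 1

  # Copy the properties template under its final file name.
  cp "$carbon_repo/conf/carbon.properties.template" \
     "$spark_home/conf/carbon.properties" || return 1

  # Copy the carbonplugins folder (it contains the .kettle folder).
  cp -r "$carbon_repo/processing/carbonplugins" "$spark_home/carbonlib/" || return 1
}
```

A call such as `install_carbondata /path/to/carbondata "$SPARK_HOME"` would then perform all three copies, stopping at the first one that fails.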
On the Spark node, configure the properties listed in the following table in the "<SPARK_HOME>/conf/spark-defaults.conf" file.
Property | Value | Description |
---|---|---|
carbon.kettle.home | $SPARK_HOME/carbonlib/carbonplugins | Path that CarbonData uses internally to create the graph for loading data. |
spark.driver.extraJavaOptions | -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. |
spark.executor.extraJavaOptions | -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by spaces. |
Configure the following properties in "<SPARK_HOME>/conf/carbon.properties":
Property | Required | Description | Example | Remark |
---|---|---|---|---|
carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | An HDFS directory is recommended. |
carbon.kettle.home | YES | Path that will be used by CarbonData internally to create graph for loading the data. | $SPARK_HOME/carbonlib/carbonplugins |
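The entries from the two tables above can be written out with a short shell function. This is a minimal sketch, not an official tool: `configure_carbondata` is a hypothetical helper, it writes the resolved path instead of the literal `$SPARK_HOME` so that nothing depends on variable expansion inside the files, and it appends without checking whether a key is already present.

```shell
# configure_carbondata is a hypothetical helper, not part of CarbonData.
# $1 = SPARK_HOME, $2 = value for carbon.storelocation (optional)
configure_carbondata() {
  spark_home="$1"
  store="${2:-}"
  props="$spark_home/conf/carbon.properties"
  kettle="$spark_home/carbonlib/carbonplugins"

  # spark-defaults.conf holds "key value" pairs separated by whitespace.
  {
    printf 'carbon.kettle.home %s\n' "$kettle"
    printf 'spark.driver.extraJavaOptions -Dcarbon.properties.filepath=%s\n' "$props"
    printf 'spark.executor.extraJavaOptions -Dcarbon.properties.filepath=%s\n' "$props"
  } >> "$spark_home/conf/spark-defaults.conf" || return 1

  # carbon.properties holds "key=value" pairs; carbon.kettle.home is required.
  {
    printf 'carbon.kettle.home=%s\n' "$kettle"
    # carbon.storelocation is optional, so write it only when a value is given.
    if [ -n "$store" ]; then
      printf 'carbon.storelocation=%s\n' "$store"
    fi
  } >> "$props" || return 1
}
```

If the copied template already defines one of these keys, edit that entry in place rather than appending a second one.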
./spark-shell --master spark://HOSTNAME:PORT --total-executor-cores 2 --executor-memory 2G
NOTE: Make sure you have permissions for CarbonData JARs and files through which driver and executor will start.
To get started with CarbonData, see Quick Start and DDL Operations on CarbonData.
This section provides the procedure to install CarbonData on “Spark on YARN” cluster.
The following steps apply only to driver nodes. (A driver node is the node that starts the Spark context.)
Build the CarbonData project, take the assembly jar from “./assembly/target/scala-2.10/carbondata_xxx.jar”, and put it in the "<SPARK_HOME>/carbonlib" folder.
NOTE: Create the carbonlib folder if it does not exist inside the "<SPARK_HOME>" path.
Copy the “carbonplugins” folder from the “./processing/” folder of the CarbonData repository to the "<SPARK_HOME>/carbonlib" folder. The carbonplugins folder contains the .kettle folder.
Copy “carbon.properties.template” from the conf folder of the CarbonData repository to "<SPARK_HOME>/conf/carbon.properties".
Modify the following parameters in “spark-defaults.conf”, located in the "<SPARK_HOME>/conf" folder:
Property | Description | Value |
---|---|---|
spark.master | Set this value to run Spark on the YARN cluster. | yarn-client |
spark.yarn.dist.files | Comma-separated list of files to be placed in the working directory of each executor. | "<YOUR_SPARK_HOME_PATH>"/conf/carbon.properties |
spark.yarn.dist.archives | Comma-separated list of archives to be extracted into the working directory of each executor. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbondata_xxx.jar |
spark.executor.extraJavaOptions | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by spaces. | -Dcarbon.properties.filepath="<YOUR_SPARK_HOME_PATH>"/conf/carbon.properties |
spark.executor.extraClassPath | Extra classpath entries to prepend to the classpath of executors. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append its value to spark.driver.extraClassPath. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbondata_xxx.jar |
spark.driver.extraClassPath | Extra classpath entries to prepend to the classpath of the driver. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append its value to spark.driver.extraClassPath. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbondata_xxx.jar |
spark.driver.extraJavaOptions | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. | -Dcarbon.properties.filepath="<YOUR_SPARK_HOME_PATH>"/conf/carbon.properties |
carbon.kettle.home | Path that will be used by CarbonData internally to create graph for loading the data. | "<YOUR_SPARK_HOME_PATH>"/carbonlib/carbonplugins |
Configure the following properties in "<SPARK_HOME>/conf/carbon.properties":
Property | Required | Description | Example | Remark |
---|---|---|---|---|
carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | An HDFS directory is recommended. |
carbon.kettle.home | YES | Path that will be used by CarbonData internally to create graph for loading the data. | $SPARK_HOME/carbonlib/carbonplugins |
./bin/spark-shell --master yarn-client --driver-memory 1g --executor-cores 2 --executor-memory 2G
NOTE: Make sure you have permissions for CarbonData JARs and files through which driver and executor will start.
To get started with CarbonData, see Quick Start and DDL Operations on CarbonData.
a. cd <SPARK_HOME>
b. Run the following command to start the CarbonData thrift server.
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
Parameter | Description | Example |
---|---|---|
CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the "<SPARK_HOME>"/carbonlib/ folder. | carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar |
carbon_store_path | A parameter to the CarbonThriftServer class. This is an HDFS path where CarbonData files will be kept. It is strongly recommended to use the same value as the carbon.storelocation parameter in carbon.properties. | hdfs://<host_name>:54310/user/hive/warehouse/carbon.store |
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer $SPARK_HOME/carbonlib/carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar hdfs://hacluster/user/hive/warehouse/carbon.store
./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer --num-executors 3 --driver-memory 20g --executor-memory 250g --executor-cores 32 /srv/OSCON/BigData/HACluster/install/spark/sparkJdbc/lib/carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar hdfs://hacluster/user/hive/warehouse/carbon.store
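Because the command is long and a malformed store path is easy to miss, it may help to assemble it from its parts. The following is a sketch under assumptions: `thrift_server_cmd` is a hypothetical helper that only prints the command line, and the `hdfs://` prefix check reflects the description of carbon_store_path as an HDFS path rather than any rule enforced by CarbonData itself.

```shell
# thrift_server_cmd is a hypothetical helper that prints (does not run) the
# spark-submit command line for the CarbonData Thrift server.
# $1 = SPARK_HOME, $2 = assembly jar file name, $3 = carbon store path
thrift_server_cmd() {
  spark_home="$1"
  jar="$2"
  store="$3"

  # Refuse store paths that are not written as an hdfs:// URI.
  case "$store" in
    hdfs://*) ;;
    *) echo "carbon store path should start with hdfs://, got: $store" >&2
       return 1 ;;
  esac

  printf '%s\n' "./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer $spark_home/carbonlib/$jar $store"
}
```

The printed line can be reviewed and then run from the Spark home directory; resource flags such as `--num-executors` would have to be added by hand.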
cd <SPARK_HOME>
./bin/beeline jdbc:hive2://<thriftserver_host>:port
Example:
./bin/beeline jdbc:hive2://10.10.10.10:10000
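For scripted use, the JDBC URL can be built from the host and port before it is handed to beeline. This is a small sketch: `carbon_jdbc_url` is a hypothetical helper, and the numeric-port check is an added safeguard rather than something the Thrift server requires.

```shell
# carbon_jdbc_url is a hypothetical helper that prints the JDBC URL
# for the Thrift server. $1 = Thrift server host, $2 = port
carbon_jdbc_url() {
  host="$1"
  port="$2"

  # Reject a port that is empty or contains non-digit characters.
  case "$port" in
    ''|*[!0-9]*) echo "port must be numeric, got: $port" >&2
                 return 1 ;;
  esac

  printf 'jdbc:hive2://%s:%s\n' "$host" "$port"
}
```

Assuming beeline's standard `-u` (connection URL) and `-e` (query) options, a one-off query could then be run as `./bin/beeline -u "$(carbon_jdbc_url 10.10.10.10 10000)" -e "SHOW TABLES;"`.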