This tutorial provides a quick introduction to using CarbonData. To follow along with this guide, first download a packaged release of CarbonData from the CarbonData website. Alternatively, it can be built by following the Building CarbonData steps.
CarbonData supports Spark versions up to 2.2.1. Please download the Spark package from the Spark website.
Create a sample.csv file using the following commands. The CSV file is required for loading data into CarbonData.
```
cd carbondata
cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
```
CarbonData can be integrated with the Spark, Presto and Hive execution engines. The documentation below explains how to install and configure CarbonData with these execution engines.
Installing and Configuring CarbonData to run locally with Spark Shell
Installing and Configuring CarbonData on Standalone Spark Cluster
Installing and Configuring CarbonData on Spark on YARN Cluster
Installing and Configuring CarbonData Thrift Server for Query Execution
Installing and Configuring CarbonData on Presto
Installing and Configuring CarbonData on Hive
CarbonData supports read and write with HDFS
CarbonData supports read and write with S3
CarbonData supports read and write with Alluxio
Apache Spark Shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. Please visit Apache Spark Documentation for more details on Spark shell.
Start Spark shell by running the following command in the Spark directory:
./bin/spark-shell --jars <carbondata assembly jar path>
NOTE: Use the path where the packaged release of CarbonData was downloaded, or the assembly jar produced by building CarbonData, which can be copied from ./assembly/target/scala-2.1x/carbondata_xxx.jar.
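For example, assuming the assembly jar is named carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar and sits in the current directory (the actual name and location depend on your build), the command might look like:

```
./bin/spark-shell --jars ./carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar
```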
In this shell, SparkSession is readily available as spark and the Spark context is readily available as sc.

In order to create a CarbonSession we will have to configure it explicitly in the following manner:
```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("<carbon_store_path>")
```
NOTE: By default the metastore location points to ../carbon.metastore; the user can provide their own metastore location to CarbonSession, like SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("<carbon_store_path>", "<local metastore path>"). The data storage location can be specified by <carbon_store_path>, like /carbon/data/store, hdfs://localhost:9000/carbon/data/store or s3a://carbon/data/store.
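For example, a CarbonSession backed by an HDFS store and a local metastore could be created as in the sketch below (both paths are illustrative; substitute your own locations):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// Illustrative store and metastore locations
val carbon = SparkSession.builder()
  .config(sc.getConf)
  .getOrCreateCarbonSession("hdfs://localhost:9000/carbon/data/store", "/tmp/carbon.metastore")
```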
Create a CarbonData table:

```
carbon.sql(
  s"""
     | CREATE TABLE IF NOT EXISTS test_table(
     |   id string,
     |   name string,
     |   city string,
     |   age Int)
     | STORED AS carbondata
   """.stripMargin)
```
carbon.sql("LOAD DATA INPATH '/path/to/sample.csv' INTO TABLE test_table")
NOTE: Please provide the real file path of sample.csv for the above script. If you get a "tablestatus.lock" issue, please refer to the FAQ.
carbon.sql("SELECT * FROM test_table").show() carbon.sql( s""" | SELECT city, avg(age), sum(age) | FROM test_table | GROUP BY city """.stripMargin).show()
Build the CarbonData project and get the assembly jar from ./assembly/target/scala-2.1x/carbondata_xxx.jar.
Copy ./assembly/target/scala-2.1x/carbondata_xxx.jar to the $SPARK_HOME/carbonlib folder.
NOTE: Create the carbonlib folder if it does not exist inside the $SPARK_HOME path.
Add the carbonlib folder path to the Spark classpath. (Edit the $SPARK_HOME/conf/spark-env.sh file and modify the value of SPARK_CLASSPATH by appending $SPARK_HOME/carbonlib/* to the existing value.)
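A minimal sketch of that spark-env.sh change, assuming SPARK_CLASSPATH may already hold other entries:

```
# $SPARK_HOME/conf/spark-env.sh
# Append the CarbonData jars to the existing Spark classpath
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$SPARK_HOME/carbonlib/*
```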
Copy the ./conf/carbon.properties.template file from the CarbonData repository to the $SPARK_HOME/conf/ folder and rename it to carbon.properties.
Repeat Step 2 to Step 5 on all the nodes of the cluster.
In the Spark master node, configure the properties mentioned in the following table in the $SPARK_HOME/conf/spark-defaults.conf file.
Property | Value | Description |
---|---|---|
spark.driver.extraJavaOptions | -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. |
spark.executor.extraJavaOptions | -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by spaces. |
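Put together, the corresponding entries in spark-defaults.conf would look like the sketch below (spark-defaults.conf does not expand shell variables, so in practice replace $SPARK_HOME with the absolute path of your Spark installation):

```
spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
```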
Add the following properties in the $SPARK_HOME/conf/carbon.properties file:

Property | Required | Description | Example | Remark |
---|---|---|---|---|
carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. If not specified then it takes the spark.sql.warehouse.dir path. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose to set HDFS directory |
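For example, the corresponding line in carbon.properties could be (the HDFS host, port, and path are illustrative):

```
carbon.storelocation=hdfs://HOSTNAME:PORT/Opt/CarbonStore
```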
Verify the installation by running the following command in the terminal:

```
./bin/spark-shell \
  --master spark://HOSTNAME:PORT \
  --total-executor-cores 2 \
  --executor-memory 2G
```
NOTE: Make sure you have permissions for the CarbonData JARs and files through which the driver and executors will start.
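For instance, read and execute permission on the CarbonData jars can be granted with something like the following (the path and permission mode are illustrative; adjust them to the user that runs Spark):

```
# Make the CarbonData jars readable by the user that starts the driver and executors
chmod -R 755 $SPARK_HOME/carbonlib
```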
This section provides the procedure to install CarbonData on “Spark on YARN” cluster.
The following steps are only for driver nodes. (Driver nodes are the ones which start the Spark context.)
Build the CarbonData project and get the assembly jar from ./assembly/target/scala-2.1x/carbondata_xxx.jar and copy it to the $SPARK_HOME/carbonlib folder.
NOTE: Create the carbonlib folder if it does not exist inside the $SPARK_HOME path.
Copy the ./conf/carbon.properties.template file from the CarbonData repository to the $SPARK_HOME/conf/ folder and rename it to carbon.properties.
Create a tar.gz file of the carbonlib folder and move it inside the carbonlib folder.
```
cd $SPARK_HOME
tar -zcvf carbondata.tar.gz carbonlib/
mv carbondata.tar.gz carbonlib/
```
Configure the properties mentioned in the following table in the $SPARK_HOME/conf/spark-defaults.conf file.

Property | Description | Value |
---|---|---|
spark.master | Set this value to run Spark in yarn cluster mode. | Set yarn-client to run Spark in yarn cluster mode. |
spark.yarn.dist.files | Comma-separated list of files to be placed in the working directory of each executor. | $SPARK_HOME/conf/carbon.properties |
spark.yarn.dist.archives | Comma-separated list of archives to be extracted into the working directory of each executor. | $SPARK_HOME/carbonlib/carbondata.tar.gz |
spark.executor.extraJavaOptions | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by spaces. | -Dcarbon.properties.filepath=carbon.properties |
spark.executor.extraClassPath | Extra classpath entries to prepend to the classpath of executors. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, then comment it out and append the values to the spark.driver.extraClassPath parameter below. | carbondata.tar.gz/carbonlib/* |
spark.driver.extraClassPath | Extra classpath entries to prepend to the classpath of the driver. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, then comment it out and append the value to the spark.driver.extraClassPath parameter below. | $SPARK_HOME/carbonlib/* |
spark.driver.extraJavaOptions | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. | -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties |
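Put together, the corresponding spark-defaults.conf entries could look like the sketch below (values are taken from the table above; replace $SPARK_HOME with the absolute path of your Spark installation, since spark-defaults.conf does not expand shell variables):

```
spark.master                     yarn-client
spark.yarn.dist.files            $SPARK_HOME/conf/carbon.properties
spark.yarn.dist.archives         $SPARK_HOME/carbonlib/carbondata.tar.gz
spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=carbon.properties
spark.executor.extraClassPath    carbondata.tar.gz/carbonlib/*
spark.driver.extraClassPath      $SPARK_HOME/carbonlib/*
spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
```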
Add the following properties in $SPARK_HOME/conf/carbon.properties:

Property | Required | Description | Example | Remark |
---|---|---|---|---|
carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. If not specified then it takes the spark.sql.warehouse.dir path. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose to set HDFS directory |
Verify the installation by running the following command in the terminal:

```
./bin/spark-shell \
  --master yarn-client \
  --driver-memory 1G \
  --executor-memory 2G \
  --executor-cores 2
```
NOTE: Make sure you have permissions for the CarbonData JARs and files through which the driver and executors will start.
a. cd $SPARK_HOME
b. Run the following command to start the CarbonData thrift server.
```
./bin/spark-submit \
  --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
  $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
```
Parameter | Description | Example |
---|---|---|
CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the $SPARK_HOME/carbonlib/ folder. | carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar |
carbon_store_path | This is a parameter to the CarbonThriftServer class. This is an HDFS path where CarbonData files will be kept. It is strongly recommended to set this to the same value as the carbon.storelocation parameter of carbon.properties. If not specified then it takes the spark.sql.warehouse.dir path. | hdfs://<host_name>:port/user/hive/warehouse/carbon.store |
NOTE: From Spark 1.6, by default the Thrift server runs in multi-session mode, which means each JDBC/ODBC connection owns a copy of its own SQL configuration and temporary function registry. Cached tables are still shared though. If you prefer to run the Thrift server in single-session mode and share all SQL configuration and temporary function registry, please set the option spark.sql.hive.thriftServer.singleSession to true. You may either add this option to spark-defaults.conf, or pass it to spark-submit.sh via --conf:
```
./bin/spark-submit \
  --conf spark.sql.hive.thriftServer.singleSession=true \
  --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
  $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
```
But in single-session mode, if one user changes the database from one connection, the database of the other connections will be changed too.
Examples
Start with default memory and executors:

```
./bin/spark-submit \
  --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
  $SPARK_HOME/carbonlib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
  hdfs://<host_name>:port/user/hive/warehouse/carbon.store
```
Start with fixed executors and resources:

```
./bin/spark-submit \
  --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
  --num-executors 3 \
  --driver-memory 20G \
  --executor-memory 250G \
  --executor-cores 32 \
  $SPARK_HOME/carbonlib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
  hdfs://<host_name>:port/user/hive/warehouse/carbon.store
```
Connect to the CarbonData Thrift Server using Beeline:

```
cd $SPARK_HOME
./sbin/start-thriftserver.sh
./bin/beeline -u jdbc:hive2://<thriftserver_host>:port
```

Example:

```
./bin/beeline -u jdbc:hive2://10.10.10.10:10000
```
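Once connected through Beeline, the same SQL shown in the Spark shell walkthrough can be run over JDBC, for example (the table name and CSV path are illustrative):

```
CREATE TABLE IF NOT EXISTS test_table(id string, name string, city string, age int) STORED AS carbondata;
LOAD DATA INPATH 'hdfs://localhost:9000/data/sample.csv' INTO TABLE test_table;
SELECT city, avg(age), sum(age) FROM test_table GROUP BY city;
```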
NOTE: CarbonData tables cannot be created or loaded from Presto. Users need to create a CarbonData table and load data into it with either Spark, the SDK, or the C++ SDK. Once the table is created, it can be queried from Presto.
Download the 0.210 version of Presto using:

```
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.210/presto-server-0.210.tar.gz
```
Extract the Presto tar file:

```
tar zxvf presto-server-0.210.tar.gz
```
Download the Presto CLI for the coordinator and name it presto.
```
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.210/presto-cli-0.210-executable.jar
mv presto-cli-0.210-executable.jar presto
chmod +x presto
```
Create an etc folder in the presto-server-0.210 directory, and create the config.properties, jvm.config, log.properties, and node.properties files inside it.

Install uuid to generate a node.id:

```
sudo apt-get install uuid
uuid
```
Contents of node.properties:

```
node.environment=production
node.id=<generated uuid>
node.data-dir=/home/ubuntu/data
```
Contents of jvm.config:

```
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
```
Contents of log.properties:

```
com.facebook.presto=INFO
```

The default minimum level is INFO. There are four levels: DEBUG, INFO, WARN and ERROR.
Contents of config.properties on the coordinator:

```
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8086
query.max-memory=5GB
query.max-total-memory-per-node=5GB
query.max-memory-per-node=3GB
memory.heap-headroom-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8086
task.max-worker-threads=4
optimizer.dictionary-aggregation=true
optimizer.optimize-hash-generation = false
```
The options node-scheduler.include-coordinator=false and coordinator=true indicate that the node is the coordinator, and tell the coordinator not to do any of the computation work itself and to use the workers.
Note: It is recommended to set query.max-memory-per-node to half of the JVM max memory configured in jvm.config; if the workload is highly concurrent, a lower value for query.max-memory-per-node should be used. Also, the relation between the two configuration properties should be: if query.max-memory-per-node=30GB, then query.max-memory=<30GB * number of nodes>.
Contents of config.properties on the workers:

```
coordinator=false
http-server.http.port=8086
query.max-memory=5GB
query.max-memory-per-node=2GB
discovery.uri=<coordinator_ip>:8086
```
Note: The jvm.config and node.properties files are the same for all the nodes (workers and coordinator). All the nodes should have a different node.id (generated by the uuid command).
Create a folder named catalog in the etc directory of Presto on all the nodes of the cluster, including the coordinator. Create a file named carbondata.properties in the catalog folder and set the required properties on all the nodes.
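A minimal sketch of carbondata.properties; connector.name is the only property shown here, and any additional required properties (such as the store or metastore location) depend on the CarbonData release, so refer to the properties template shipped with your version:

```
connector.name=carbondata
```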
Create a directory named carbondata in the plugin directory of Presto, and copy the carbondata jars to the plugin/carbondata directory on all nodes.

Start the Presto server on all nodes:

```
./presto-server-0.210/bin/launcher start
```

to run it as a background process, or

```
./presto-server-0.210/bin/launcher run
```

to run it in the foreground.
Start the Presto CLI:

```
./presto
```

To connect to the carbondata catalog, use the following command:
./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
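For example, with the coordinator on localhost and a schema named carbonschema (both illustrative), the command would be:

```
./presto --server localhost:8086 --catalog carbondata --schema carbonschema
```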
Execute the following command to ensure the workers are connected.
select * from system.runtime.nodes;
Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
List the available schemas (databases):
show schemas;
Select the schema where the CarbonData table resides:
use carbonschema;
List the available tables
show tables;
Query from the available tables
select * from carbon_table;
Note: Table creation and data loading should be done before executing queries, as CarbonData tables cannot be created from this interface.