This tutorial provides a quick introduction to using CarbonData.
Carbon Spark shell is a wrapper around the Apache Spark shell. It provides a simple way to learn the API, as well as a powerful tool for analyzing data interactively. Please visit the Apache Spark documentation for more details on the Spark shell.
Start the Spark shell by running the following command in the Carbon directory:
./bin/carbon-spark-shell
Note: In this shell, SparkContext is readily available as sc and CarbonContext is available as cc.
CarbonData stores and writes data in its own format at a default location on HDFS. By default, carbon.storelocation is set to:
hdfs://IP:PORT/Opt/CarbonStore
You can specify your own store location with the --conf option:
./bin/carbon-spark-shell --conf spark.carbon.storepath=<storelocation>
Prerequisites
Create a sample.csv file in the CarbonData directory. The CSV file is required for loading data into CarbonData.
$ cd carbondata
$ cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
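Before loading, it can be worth confirming the file looks right. The following sketch recreates the same sample.csv (so it is self-contained; run it from any scratch directory) and checks that it has one header line plus three data rows:

```shell
# Recreate the sample file (same contents as the step above)
cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF

# One header line plus three data rows: 4 lines in total
wc -l < sample.csv

# The header should match the column list used in the CREATE TABLE statement
head -n 1 sample.csv
```

If the line count or header differs, the data load will fail or produce unexpected rows.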
Create table
scala> cc.sql("create table if not exists test_table (id string, name string, city string, age Int) STORED BY 'carbondata'")
Load data to table
scala> import java.io.File
scala> val dataFilePath = new File("../carbondata/sample.csv").getCanonicalPath
scala> cc.sql(s"load data inpath '$dataFilePath' into table test_table")
Query data from table
scala> cc.sql("select * from test_table").show
scala> cc.sql("select city, avg(age), sum(age) from test_table group by city").show
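The group-by query can be sanity-checked outside Spark. The following shell sketch (self-contained: it recreates the sample.csv from the Prerequisites section) computes the same per-city average and sum of age with awk:

```shell
# Recreate the sample data from the Prerequisites section
cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF

# Equivalent of: select city, avg(age), sum(age) from test_table group by city
# Skip the header (NR > 1), accumulate sum and count per city, then print avg and sum
awk -F, 'NR > 1 { sum[$3] += $4; cnt[$3]++ }
         END { for (c in sum) printf "%s %.1f %d\n", c, sum[c] / cnt[c], sum[c] }' sample.csv | sort
```

For the three sample rows this prints shenzhen 29.0 58 and wuhan 35.0 35, which is what the SQL query should return.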
The Carbon Spark SQL CLI is a wrapper around the Apache Spark SQL CLI. It is a convenient tool for executing queries from the command line. Please visit the Apache Spark documentation for more information on the Spark SQL CLI.
To start the Carbon Spark SQL CLI, run the following in the Carbon directory:
./bin/carbon-spark-sql
CarbonData stores and writes data in its own format at a default location on HDFS. By default, carbon.storelocation is set to:
hdfs://IP:PORT/Opt/CarbonStore
You can specify your own store location with the --conf option:
./bin/carbon-spark-sql --conf spark.carbon.storepath=/home/root/carbonstore
spark-sql> create table if not exists test_table (id string, name string, city string, age Int) STORED BY 'carbondata';
spark-sql> load data inpath '../sample.csv' into table test_table;
spark-sql> select city, avg(age), sum(age) from test_table group by city;
To get started, download CarbonData from http://carbondata.incubator.apache.org. CarbonData uses Hadoop's client libraries for HDFS and YARN, as well as Spark's libraries. Downloads are pre-packaged for a handful of popular Spark versions.
If you'd like to build CarbonData from source, please visit Building CarbonData And IDE Configuration.