Getting started with Apache CarbonData

This tutorial provides a quick introduction to using CarbonData.

Install

Download released package of Spark 1.5.0 or later
Download and install Apache Thrift 0.9.3, make sure thrift is added to system path.
Download Apache CarbonData code and build it. Please visit Building CarbonData And IDE Configuration for more information.

Interactive Data Query

Prerequisite

Create sample.csv file in carbondata directory

$ cd carbondata
$ cat > sample.csv << EOF
  id,name,city,age
  1,david,shenzhen,31
  2,eason,shenzhen,27
  3,jarry,wuhan,35
  EOF

Carbon Spark Shell

Carbon Spark shell is a wrapper around Apache Spark Shell, it provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. Please visit Apache Spark Documentation for more details on Spark shell. Start Spark shell by running the following in the Carbon directory:

./bin/carbon-spark-shell

Note: In this shell SparkContext is readily available as sc and CarbonContext is available as cc.

Create table

scala>cc.sql("create table if not exists test_table (id string, name string, city string, age Int) STORED BY 'carbondata'")

Load data to table

scala>val dataFilePath = new File("../carbondata/sample.csv").getCanonicalPath
scala>cc.sql(s"load data inpath '$dataFilePath' into table test_table")

Query data from table

scala>cc.sql("select * from test_table").show
scala>cc.sql("select city, avg(age), sum(age) from test_table group by city").show

Carbon SQL CLI

The Carbon Spark SQL CLI is a wrapper around Apache Spark SQL CLI. It is a convenient tool to execute queries input from the command line. Please visit Apache Spark Documentation for more information Spark SQL CLI. Start the Carbon Spark SQL CLI, run the following in the Carbon directory

./bin/carbon-spark-sql

And you can provide your own store location by providing configuration using --conf option like:

./bin/carbon-spark-sql --conf spark.carbon.storepath=/home/root/carbonstore

Execute Queries in CLI

spark-sql> create table if not exists test_table (id string, name string, city string, age Int) STORED BY 'carbondata'
spark-sql> load data inpath '../sample.csv' into table test_table
spark-sql> select city, avg(age), sum(age) from test_table group by city