Quick Start

This tutorial provides a quick introduction to using CarbonData.

Getting started with Apache CarbonData

Installation

Interactive Analysis with Carbon-Spark Shell

The Carbon Spark shell is a wrapper around the Apache Spark shell. It provides a simple way to learn the API, as well as a powerful tool for analyzing data interactively. Please visit the Apache Spark documentation for more details on the Spark shell.

Basics

Start Spark shell by running the following in the Carbon directory:

./bin/carbon-spark-shell

Note: In this shell, SparkContext is readily available as sc and CarbonContext is available as cc.
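
If you would rather use a plain Spark shell, a CarbonContext can be constructed by hand. This is a minimal sketch, assuming the CarbonData jars are on the classpath and that your release provides org.apache.spark.sql.CarbonContext with a (SparkContext, store path) constructor:

scala> import org.apache.spark.sql.CarbonContext
scala> val cc = new CarbonContext(sc, "hdfs://IP:PORT/Opt/CarbonStore")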

CarbonData stores and writes data in its own format at the configured store location on HDFS. By default, carbon.storelocation is set to:

hdfs://IP:PORT/Opt/CarbonStore

You can provide your own store location using the --conf option, for example:

./bin/carbon-spark-shell --conf spark.carbon.storepath=<storelocation>
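
The store location can also be configured persistently. As a minimal sketch, assuming your installation reads a conf/carbon.properties file (the file path is an assumption; the property name comes from above):

carbon.storelocation=hdfs://IP:PORT/Opt/CarbonStore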

Executing Queries

Prerequisites

Create a sample.csv file in the CarbonData directory. The CSV file is required for loading data into Carbon.

$ cd carbondata
$ cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF

Create table

scala>cc.sql("create table if not exists test_table (id string, name string, city string, age Int) STORED BY 'carbondata'")

Load data to table

scala> import java.io.File
scala> val dataFilePath = new File("../carbondata/sample.csv").getCanonicalPath
scala> cc.sql(s"load data inpath '$dataFilePath' into table test_table")

Query data from table

scala>cc.sql("select * from test_table").show
scala>cc.sql("select city, avg(age), sum(age) from test_table group by city").show

Carbon SQL CLI

The Carbon Spark SQL CLI is a wrapper around the Apache Spark SQL CLI. It is a convenient tool for executing queries from the command line. Please visit the Apache Spark documentation for more information on the Spark SQL CLI.

Basics

To start the Carbon Spark SQL CLI, run the following in the Carbon directory:

./bin/carbon-spark-sql

CarbonData stores and writes data in its own format at the configured store location on HDFS. By default, carbon.storelocation is set to:

hdfs://IP:PORT/Opt/CarbonStore

You can provide your own store location using the --conf option, for example:

./bin/carbon-spark-sql --conf spark.carbon.storepath=/home/root/carbonstore

Execute Queries in CLI

spark-sql> create table if not exists test_table (id string, name string, city string, age Int) STORED BY 'carbondata';
spark-sql> load data inpath '../sample.csv' into table test_table;
spark-sql> select city, avg(age), sum(age) from test_table group by city;
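
When you are done experimenting, the table can be dropped with standard SQL:

spark-sql> drop table if exists test_table;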

Building CarbonData

To get started, download CarbonData from the downloads page at http://carbondata.incubator.apache.org. CarbonData uses Hadoop's client libraries for HDFS and YARN, as well as Spark's libraries. Downloads are pre-packaged for a handful of popular Spark versions.

If you'd like to build CarbonData from source, please visit Building CarbonData And IDE Configuration.
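
As a rough sketch, a Maven build from the source root generally looks like the command below; the exact profile flags for selecting Spark and Hadoop versions vary by release, so check the build guide first:

$ mvn clean package -DskipTests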