This tutorial provides a detailed overview of:
CarbonData (incubating) is a fully indexed, columnar, Hadoop-native data store for processing heavy analytical workloads and detailed queries on big data. CarbonData enables faster interactive queries by using advanced columnar storage, indexing, compression, and encoding techniques to improve computing efficiency, which in turn can speed up queries by an order of magnitude over petabytes of data.
In customer benchmarks, CarbonData has proven able to manage petabytes of data on extraordinarily low-cost hardware, and it answers queries around 10 times faster than current open-source solutions (column-oriented SQL-on-Hadoop data stores).
Some of the salient features of CarbonData are:
A CarbonData file contains groups of data called blocklets, along with all required information such as the schema, offsets, and indices, in a file footer, co-located in HDFS.
The file footer can be read once to build the indices in memory, which can be utilized for optimizing the scans and processing for all subsequent queries.
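To make the "read the footer once, reuse the index for every scan" idea concrete, here is a minimal sketch in plain Scala. It is not CarbonData's API: the names `BlockletIndexEntry`, `buildIndex`, and `blockletsToScan` are hypothetical, and it assumes the footer index keeps a min/max value of a single `Int` column per blocklet, which the text above does not specify.

```scala
// Hypothetical model, not CarbonData's API: each footer entry records where a
// blocklet starts and the min/max of one Int column inside it (an assumption).
final case class BlockletIndexEntry(blockletId: Int, offset: Long, min: Int, max: Int)

object FooterIndexSketch {
  // Built once from the footer entries and kept in memory for all later queries.
  def buildIndex(footer: Seq[BlockletIndexEntry]): Vector[BlockletIndexEntry] =
    footer.sortBy(_.offset).toVector

  // Return only the blocklets whose [min, max] range can overlap [lo, hi];
  // every other blocklet is skipped without being read.
  def blockletsToScan(index: Vector[BlockletIndexEntry], lo: Int, hi: Int): Vector[Int] =
    index.filter(e => e.max >= lo && e.min <= hi).map(_.blockletId)
}
```

Under these assumptions, a query with a range filter calls `blockletsToScan` against the in-memory index and touches only the blocklets it returns. The footer itself is never re-read.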
Each blocklet in the file is further divided into chunks of data called Data Chunks. Each data chunk is organized either in columnar format or row format, and stores the data of either a single column or a set of columns. All blocklets in one file contain the same number and type of Data Chunks.
Each Data Chunk contains multiple groups of data called Pages. There are three types of pages.
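The file, blocklet, Data Chunk, and Page hierarchy described above can be sketched as a small data model. This is only an illustration in plain Scala: all type names are hypothetical, pages are left opaque because the three page types are not modelled here, and the "type" of a Data Chunk is assumed to mean its layout plus the columns it stores.

```scala
// Hypothetical model of the layout described in the text, not CarbonData's classes.
sealed trait ChunkLayout
case object Columnar extends ChunkLayout
case object RowOriented extends ChunkLayout

// A page is left opaque: the page types are not modelled in this sketch.
final case class Page(bytes: Vector[Byte])

// A data chunk stores one column or a set of columns, split into pages.
final case class DataChunk(layout: ChunkLayout, columns: Vector[String], pages: Vector[Page])

final case class Blocklet(chunks: Vector[DataChunk])

final case class CarbonFileModel(blocklets: Vector[Blocklet]) {
  // The "shape" of a blocklet: layout and column set of each chunk, in order.
  private def shape(b: Blocklet): Vector[(ChunkLayout, Vector[String])] =
    b.chunks.map(c => (c.layout, c.columns))

  // True when every blocklet has the same number and type of data chunks.
  def hasUniformChunks: Boolean =
    blocklets.map(shape).distinct.size <= 1
}
```

`hasUniformChunks` encodes the stated invariant that all blocklets in one file contain the same number and type of Data Chunks. The number of pages per chunk is deliberately left out of the comparison.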
The CarbonData file format is a columnar store in HDFS. It has many of the features of a modern columnar format, such as being splittable and supporting compression schemes and complex data types, and it has the following unique features:
The following types are supported:
Numeric Types
Date/Time Types
String Types
Complex Types
Carbon provides the following JAR packages:
carbon-store.jar or carbondata-assembly.jar: This is the main JAR for the carbon project. Its target users are both application users and developers.
- For MapReduce application users, this JAR provides an API to read and write carbon files through CarbonInput/OutputFormat in the carbon-hadoop module.
- For developers, this JAR can be used to integrate carbon with processing engines such as Spark and Hive by leveraging the API in the carbon-processing module.
carbon-spark.jar (currently part of the assembly JAR): provides support for Spark users, who can manipulate carbon data files using the native Spark DataFrame/SQL interface. In addition, to leverage carbon's built-in lifecycle management functions, higher-level concepts such as Managed Carbon Table, Database, and the corresponding DDL are introduced.
carbon-hive.jar (not yet provided): similar to carbon-spark, it provides integration between carbon and Hive.
Carbon can be used in the following scenarios:
For MapReduce application users
This user API is provided by carbon-hadoop. In this scenario, users can process carbon files in their MapReduce applications by choosing CarbonInput/OutputFormat, and are responsible for using it correctly. Currently only CarbonInputFormat is provided; OutputFormat will be provided soon.
For Spark users
This user API is provided by Spark itself. There are two levels of API:
Carbon File
Similar to Parquet, JSON, or other data sources in Spark, carbon can be used with the data source API. For example (please refer to DataFrameAPIExample for more detail):

    import org.apache.spark.sql.SaveMode

    // User can create a DataFrame from any data source or transformation.
    val df = ...

    // Write data
    // User can write a DataFrame to a carbon file
    df.write
      .format("carbondata")
      .option("tableName", "carbontable")
      .mode(SaveMode.Overwrite)
      .save()

    // Read carbon data by data source API
    // (a new val is used because df is immutable and cannot be reassigned)
    val carbonDf = carbonContext.read
      .format("carbondata")
      .option("tableName", "carbontable")
      .load("/path")

    // User can then use the DataFrame for analysis
    carbonDf.count
    // Illustrative only: MLlib's SVMWithSGD.train expects an RDD[LabeledPoint]
    SVMWithSGD.train(carbonDf, numIterations)

    // User can also register the DataFrame with a table name, and use SQL for analysis
    carbonDf.registerTempTable("t1")  // register temporary table in SparkSQL catalog
    carbonDf.registerHiveTable("t2")  // or, use an implicit function to register to Hive metastore
    sqlContext.sql("select count(*) from t1").show
Managed Carbon Table
Carbon has built-in support for high-level concepts such as Table and Database, and supports full data lifecycle management. Instead of dealing with just files, users can use carbon-specific DDL to manipulate data at the Table and Database level. Please refer to DDL and DML.

    -- Use SQL to manage table and query data
    create database db1;
    use database db1;
    show databases;
    create table tbl1 using org.apache.carbondata.spark;
    load data into table tbl1 path 'some_files';
    select count(*) from tbl1;
For developers who want to integrate carbon into a processing engine such as Spark, Hive, or Flink, use the API provided by carbon-hadoop and carbon-processing:
Query: integrate carbon-hadoop with an engine-specific API, such as the Spark data source API.
Data life cycle management: carbon provides utility functions in carbon-processing to manage the data life cycle, such as data loading, compaction, retention, and schema evolution. Developers can implement DDLs of their choice and leverage these utility functions to do data life cycle management.