CarbonData Use Cases

This tutorial discusses the problems that CarbonData addresses and walks through the top use cases identified for CarbonData.

Introduction

In interactive big data analysis scenarios, many customers expect sub-second responses when querying terabyte- to petabyte-scale data on general-purpose hardware clusters with just a few nodes.

In the current big data ecosystem, there are a few columnar storage formats, such as ORC and Parquet, that are designed for SQL on big data. Apache Hive’s ORC format is a columnar storage format with basic indexing capability. However, ORC cannot meet the sub-second query response expectation on TB-level data, because the ORC format performs only stride-level dictionary encoding, and all analytical operations such as filtering and aggregation are performed on the actual data. Apache Parquet is a columnar storage format that can improve performance in comparison to ORC because of its more efficient storage organization. Though Parquet can provide query responses on TB-level data in a few seconds, it is still far from the sub-second expectation of interactive analysis users. Cloudera Kudu can effectively solve some query performance issues, but Kudu is not Hadoop-native and cannot seamlessly integrate historical HDFS data into a new Kudu system.

CarbonData, in contrast, uses specially engineered optimizations targeted at improving the performance of analytical queries, which can include filters, aggregations, and distinct counts. Because the required data is stored in an indexed, well-organized, read-optimized format, CarbonData’s query performance can achieve sub-second response.

Motivation: A Single Format to Provide Low-Latency Response for All Use Cases

The main motivation behind CarbonData is to provide a single storage format that covers all the use cases of querying big data on Hadoop.

Use Cases

  • Sequential Access

    • Supports queries that select only a few columns with a group by clause but do not contain any filters. This results in a full scan over the complete store for the selected columns.

    Scenario

    • ETL jobs
    • Log Analysis
  • Random Access

    • Supports point queries. These queries are issued by operational applications and usually select all or most of the columns, but they involve a large number of filters that reduce the result to a small size. Such queries generally do not involve any aggregation or group by clause.
      • Row-key query (like HBase)
      • Narrow scan
      • Requires second or sub-second latency

    Scenario

    • Operational Query
    • User Profiling
  • OLAP-Style Query

    • Supports interactive data analysis on any dimension. These queries are typically issued by interactive analysis tools. They often select a few columns but involve filters and a group by on a column or a grouping expression. It also supports:
      • Queries that involve aggregations or joins
      • Roll-up, drill-down, slicing, and dicing
      • Low-latency ad hoc queries

    Scenario

    • Dashboard reporting
    • Fraud & Ad-hoc Analysis
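
The three query shapes described above can be made concrete with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine, so it says nothing about CarbonData's own API or performance. The table name `events` and its columns are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory stand-in database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, country TEXT, device TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "IN", "mobile", 10.0),
        (2, "IN", "desktop", 20.0),
        (3, "US", "mobile", 5.0),
        (4, "US", "mobile", 15.0),
    ],
)

# Sequential access: a few columns, a GROUP BY, and no filter,
# so every row of the selected columns has to be scanned.
sequential = conn.execute(
    "SELECT country, SUM(amount) FROM events GROUP BY country ORDER BY country"
).fetchall()

# Random access (point query): all columns, selective filters,
# no aggregation, and a very small result.
point = conn.execute(
    "SELECT * FROM events WHERE user_id = ? AND country = ?", (3, "US")
).fetchall()

# OLAP-style query: a few columns, a filter, and a GROUP BY on a dimension.
olap = conn.execute(
    "SELECT country, COUNT(*), SUM(amount) FROM events "
    "WHERE device = 'mobile' GROUP BY country ORDER BY country"
).fetchall()

print(sequential)  # [('IN', 30.0), ('US', 20.0)]
print(point)       # [(3, 'US', 'mobile', 5.0)]
print(olap)        # [('IN', 1, 10.0), ('US', 2, 20.0)]
```

The same three statements differ only in how many columns they select, whether they filter, and whether they aggregate. Those are the properties a single storage format has to serve well to cover all three use cases.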