2c0ee8e2c0e74e9a2f1f3f62e13238a45442b012 - carbondata

commit	2c0ee8e2c0e74e9a2f1f3f62e13238a45442b012
author	ajantha-bhat <ajanthabhat@gmail.com>	Sat Jan 04 23:51:51 2020 +0800
committer	Jacky Li <jacky.likun@qq.com>	Wed Jan 08 09:32:15 2020 +0800
tree	2d495e046a4c38fb48ada678223f785dda6cb2d4
parent	71a4cf46a7dfe20698e77c83c6e3cff7cc3cd07d

[CARBONDATA-3653] Support huge data for complex child columns

Why is this PR needed?
Currently, string and binary complex child columns are stored with a short length, so data load fails for binary and long string columns when a value is longer than 32000 characters.

What changes were proposed in this PR?
Complex child columns of type string, binary, decimal, and date are stored as a byte_array page with a short length. Changed it to an int length. [Separating out just string and binary is hard for now; to be done in the future.]
Handled compatibility by introducing a new encoding type for complex child columns.

Does this PR introduce any user interface change?
No

Is any new testcase added?
Yes

This closes #3562
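The short-versus-int length limit described in the commit message can be illustrated with a small sketch. This is not CarbonData's actual page layout or encoder: the function names (`encode_short_len`, `encode_int_len`, `decode_int_len`) and the big-endian byte order are assumptions chosen for illustration. It only shows why a signed 2-byte length prefix cannot describe a value longer than 32767 bytes, while a 4-byte prefix can.

```python
import struct

def encode_short_len(values):
    """Prefix each value with a signed 2-byte length (hypothetical layout)."""
    out = bytearray()
    for v in values:
        # struct raises struct.error when len(v) exceeds 32767
        out += struct.pack(">h", len(v))
        out += v
    return bytes(out)

def encode_int_len(values):
    """Prefix each value with a signed 4-byte length (hypothetical layout)."""
    out = bytearray()
    for v in values:
        out += struct.pack(">i", len(v))
        out += v
    return bytes(out)

def decode_int_len(buf):
    """Read back values written by encode_int_len."""
    values, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values

small = [b"abc", b""]
huge = [b"x" * 40000]  # longer than a signed short can describe

# The int-length layout round-trips both small and huge values.
round_trip = decode_int_len(encode_int_len(small + huge))

# The short-length layout fails on the huge value.
try:
    encode_short_len(huge)
    short_failed = False
except struct.error:
    short_failed = True
```

The trade-off is two extra bytes per value, which is presumably why the commit message notes that limiting the change to string and binary columns is left for the future.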

README.md

Apache CarbonData is an indexed columnar data store solution for fast analytics on big data platforms, e.g. Apache Hadoop, Apache Spark, etc.

You can find the latest CarbonData documentation and learn more at: http://carbondata.apache.org

CarbonData cwiki

Features

The CarbonData file format is a columnar store in HDFS. It has many of the features of a modern columnar format, such as being splittable and supporting compression schemes and complex data types, and it has the following unique features:

Stores data along with index: this can significantly accelerate query performance and reduce I/O scans and CPU usage when the query contains filters. The CarbonData index consists of multiple levels of indices. A processing framework can leverage this index to reduce the number of tasks it needs to schedule and process, and it can also skip-scan at a finer-grained unit (called a blocklet) during task-side scanning instead of scanning the whole file.
Operable encoded data: by supporting efficient compression and global encoding schemes, CarbonData can query compressed/encoded data directly. The data is converted just before the results are returned to the user, which is "late materialization".
Supports various use cases with a single data format: interactive OLAP-style queries, sequential access (big scan), and random access (narrow scan).
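The blocklet-level skip scan mentioned in the first feature can be sketched as min/max pruning. This is a simplified illustration, not CarbonData's index implementation: the `Blocklet` class, the `build_blocklets` and `filter_equals` helpers, and the use of a single integer column are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Blocklet:
    """Hypothetical unit of rows with min/max statistics for one column."""
    min_value: int
    max_value: int
    rows: list

def build_blocklets(values, size):
    """Split values into fixed-size blocklets and record min/max for each."""
    blocklets = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        blocklets.append(Blocklet(min(chunk), max(chunk), chunk))
    return blocklets

def filter_equals(blocklets, target):
    """Return matching rows and the number of blocklets actually scanned."""
    hits, scanned = [], 0
    for b in blocklets:
        # Skip any blocklet whose value range cannot contain the target.
        if target < b.min_value or target > b.max_value:
            continue
        scanned += 1
        hits.extend(v for v in b.rows if v == target)
    return hits, scanned

blocklets = build_blocklets(list(range(100)), 10)
hits, scanned = filter_equals(blocklets, 42)
```

With sorted input, an equality filter touches one blocklet out of ten. Pruning is less effective when the filtered column is not clustered, because the min/max ranges of the blocklets then overlap.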

Building CarbonData

CarbonData is built using Apache Maven.

Online Documentation

Integration

Other Technical Material

Fork and Contribute

This is an active open source project for everyone, and we are always open to people who want to use this system or contribute to it. This guide document introduces how to contribute to CarbonData.

Contact us

To get involved in CarbonData:

First join by emailing dev-subscribe@carbondata.apache.org, then you can discuss issues by emailing dev@carbondata.apache.org or visiting http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com
Report issues on Apache Jira.

About

Apache CarbonData is an open source project of The Apache Software Foundation (ASF).