commit | be76a624eec6516a09691de1e553c7a8a2976b3b | |
---|---|---|
author | Jacky Li <jacky.likun@qq.com> | Wed Nov 13 19:05:00 2019 +0800 |
committer | xubo245 <601450868@qq.com> | Sun Nov 17 22:52:04 2019 +0800 |
tree | 7a8fcf2e82a1f15a4bd99dbbae89f5ca31967178 | |
parent | 86b9e5d5bbcef8fb2e36718db8a4dc14aa2bb9e9 | |

[CARBONDATA-3578] Make table status file smaller

Currently, each segment entry in the table status file occupies 347 bytes. With 10000 segments, the file grows to 3.47 MB. Since CarbonData relies heavily on this file, reducing its size improves IO, especially in data lake scenarios.

Each entry in the table status file is one LoadMetadataDetails object. This PR makes the following changes to LoadMetadataDetails to reduce its size:

- Do not write fields that have their default value, such as "visibility", "fileFormat", etc.
- Use shorter keys; for example, "loadStatus" is changed to "ls".

With this PR, the size of a table status entry is reduced to about 1/3.

Before change: 347 bytes

    {
      "timestamp": "1573635015982",
      "loadStatus": "Success",
      "loadName": "0",
      "partitionCount": "0",
      "isDeleted": "FALSE",
      "dataSize": "2977",
      "indexSize": "1469",
      "updateDeltaEndTimestamp": "",
      "updateDeltaStartTimestamp": "",
      "updateStatusFileName": "",
      "loadStartTime": "1573635014638",
      "visibility": "true",
      "fileFormat": "columnar_v3",
      "segmentFile": "0_1573635014638.segment"
    }

After change: 118 bytes (about 1/3 of the original size; "S" stands for Success)

    {
      "ts": "1573635284677",
      "ls": "S",
      "ln": "0",
      "ds": "2977",
      "is": "1469",
      "lt": "1573635284045",
      "sf": "0_1573635284045.segment"
    }

Regarding backward compatibility: this PR can still read the old table status file by using GSON's @SerializedName(alternate), so existing table status files remain readable.

This closes #3449

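
The two size reductions and the backward-compatible read described in the commit message can be illustrated with a small Python sketch. This is not CarbonData's implementation (which is Java and relies on GSON's @SerializedName annotation); the `compact` and `expand` helpers are hypothetical, the key mapping covers only the keys shown in the commit message, and the default values are assumed from the fields that disappear in the "after" example.

```python
import json

# Long key -> short key, limited to the pairs visible in the commit message.
SHORT_KEYS = {
    "timestamp": "ts",
    "loadStatus": "ls",
    "loadName": "ln",
    "dataSize": "ds",
    "indexSize": "is",
    "loadStartTime": "lt",
    "segmentFile": "sf",
}
LONG_KEYS = {short: long for long, short in SHORT_KEYS.items()}

# Assumption: these are the defaults, inferred from the fields that are
# absent in the "after" example of the commit message.
DEFAULTS = {
    "partitionCount": "0",
    "isDeleted": "FALSE",
    "updateDeltaEndTimestamp": "",
    "updateDeltaStartTimestamp": "",
    "updateStatusFileName": "",
    "visibility": "true",
    "fileFormat": "columnar_v3",
}

# Only the "Success" -> "S" abbreviation is stated in the commit message.
STATUS_CODES = {"Success": "S"}
STATUS_NAMES = {code: name for name, code in STATUS_CODES.items()}


def compact(entry):
    """Drop fields holding their default value and shorten the remaining keys."""
    out = {}
    for key, value in entry.items():
        if DEFAULTS.get(key) == value:
            continue  # default values are not written
        if key == "loadStatus":
            value = STATUS_CODES.get(value, value)
        out[SHORT_KEYS.get(key, key)] = value
    return out


def expand(entry):
    """Read an entry written with either the old long keys or the new short keys."""
    out = dict(DEFAULTS)  # missing fields fall back to their defaults
    for key, value in entry.items():
        long_key = LONG_KEYS.get(key, key)
        if long_key == "loadStatus":
            value = STATUS_NAMES.get(value, value)
        out[long_key] = value
    return out


# The "before" entry from the commit message.
OLD_ENTRY = {
    "timestamp": "1573635015982",
    "loadStatus": "Success",
    "loadName": "0",
    "partitionCount": "0",
    "isDeleted": "FALSE",
    "dataSize": "2977",
    "indexSize": "1469",
    "updateDeltaEndTimestamp": "",
    "updateDeltaStartTimestamp": "",
    "updateStatusFileName": "",
    "loadStartTime": "1573635014638",
    "visibility": "true",
    "fileFormat": "columnar_v3",
    "segmentFile": "0_1573635014638.segment",
}

new_entry = compact(OLD_ENTRY)
print(len(json.dumps(OLD_ENTRY)), "->", len(json.dumps(new_entry)), "characters")
```

Because `expand` accepts both spellings of each key, a reader built this way handles old and new entries alike, which is the same idea the PR achieves declaratively with alternate serialized names. The character counts printed here depend on JSON whitespace and will not match the byte counts quoted in the commit message exactly.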
Apache CarbonData is an indexed columnar data store solution for fast analytics on big data platforms, e.g. Apache Hadoop, Apache Spark, etc.
You can find the latest CarbonData documentation and learn more at: http://carbondata.apache.org
The CarbonData file format is a columnar store in HDFS. It has many features that a modern columnar format has, such as being splittable, compression schemes, complex data types, etc. In addition, CarbonData has the following unique features:
CarbonData is built using Apache Maven. To build CarbonData:
This is an active open source project for everyone, and we are always open to people who want to use this system or contribute to it. This guide introduces how to contribute to CarbonData.
To get involved in CarbonData:
Apache CarbonData is an open source project of The Apache Software Foundation (ASF).