commit | 3acc2de351940c00564744ddf5da2a681a481a75 | [log] [tgz] |
---|---|---|
author | h00424960 <h00424960@SZA150414400A> | Fri Feb 14 23:25:59 2020 +0800 |
committer | liuzhi <371684521@qq.com> | Thu Mar 05 21:48:43 2020 +0800 |
tree | e3afc4d941a9108b9b9db1ecfbbfa3f03073d9ae | |
parent | 8808e9c65b404007d0b553e6f722a5a83aef4b8d [diff] |
[CARBONDATA-3700] Optimize prune performance when pruning with multi-threads

Why is this PR needed?
1. When pruning with multi-threads, a bug hampers pruning performance heavily. When pruning yields no blocklets matching the query filter, the getExtendblocklet function is triggered to fetch the extended blocklet metadata. When the input to this function is an empty blocklet list, it is expected to return an empty extended blocklet list directly, but a bug causes a meaningless "hashset add operation" overhead. Moreover, when pruning with multi-threads, the getExtendblocklet function is triggered for each blocklet; it should instead be triggered once per segment.
2. When pruning, another bug hampers performance heavily: the validatePartitionInfo operation runs for every blocklet and iterates over all partition info each time. With millions of blocklets and hundreds of partitions, the computational complexity reaches hundreds of millions of iterations.
3. Pruning creates a filterexecuter per blocklet, which causes a severe performance degradation when there are several million blocklets. Specifically, creating a filterexecuter is a heavy operation involving a lot of time-consuming initialization work.

What changes were proposed in this PR?
1.1 If the input to the getExtendblocklet function is an empty blocklet list, return an empty extended blocklet list directly.
1.2 Trigger the getExtendblocklet function per segment instead of per blocklet.
2.1 Remove validatePartitionInfo; perform the partition validation in the getDataMap processing instead.
3.1 Create the filterexecuter per segment instead of per blocklet, and share it among all blocklets.
In the case where a column is added or a sort column is changed and the segment is then updated, blocklets with several different column schemas can exist in the same segment; the filterexecuter can be shared only if the column schemas of all blocklets are the same. So we add a fingerprint to each blocklet to identify its column schema: if the fingerprints are equal, the column schemas are equal, and the filterexecuter can be reused.

Does this PR introduce any user interface change? No.

Is any new testcase added? Yes.

This closes #3620
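The per-segment sharing described in 3.1 can be sketched as a cache keyed by a column-schema fingerprint, together with the empty-input early return from 1.1. This is a minimal illustrative sketch, not CarbonData's actual code: the class names (`Blocklet`, `FilterExecutor`) and the `Objects.hash`-based fingerprint are assumptions standing in for the real implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class SegmentPruneSketch {

    // Hypothetical stand-in for a blocklet; the fingerprint identifies its column schema.
    static class Blocklet {
        final int fingerprint;
        Blocklet(List<String> columnSchema) {
            // Equal column schemas produce equal fingerprints, so the executor can be shared.
            this.fingerprint = Objects.hash(columnSchema);
        }
    }

    // Hypothetical stand-in for the (expensive to create) filter executor.
    static class FilterExecutor {
        final int fingerprint;
        FilterExecutor(int fingerprint) { this.fingerprint = fingerprint; }
    }

    // One executor per distinct column-schema fingerprint in the segment,
    // instead of one per blocklet (change 3.1).
    static List<FilterExecutor> assignExecutors(List<Blocklet> blocklets) {
        Map<Integer, FilterExecutor> cache = new HashMap<>();
        List<FilterExecutor> assigned = new ArrayList<>();
        for (Blocklet b : blocklets) {
            assigned.add(cache.computeIfAbsent(b.fingerprint, FilterExecutor::new));
        }
        return assigned;
    }

    // Early return on empty input, mirroring the getExtendblocklet fix (change 1.1).
    static List<Blocklet> getExtendedBlocklets(List<Blocklet> pruned) {
        if (pruned.isEmpty()) {
            return new ArrayList<>();  // skip any per-element bookkeeping
        }
        // ... expensive extended-metadata lookup would happen here ...
        return pruned;
    }

    public static void main(String[] args) {
        List<String> schemaA = List.of("id", "name");
        List<String> schemaB = List.of("id", "name", "added_col");
        List<Blocklet> blocklets = List.of(
            new Blocklet(schemaA), new Blocklet(schemaA), new Blocklet(schemaB));
        List<FilterExecutor> assigned = assignExecutors(blocklets);
        // Blocklets 0 and 1 share one executor; blocklet 2 has a different schema.
        System.out.println(assigned.get(0) == assigned.get(1)); // true
        System.out.println(assigned.get(0) == assigned.get(2)); // false
        System.out.println(getExtendedBlocklets(new ArrayList<>()).size()); // 0
    }
}
```

With this shape, updating a segment after an add-column or change-sort-column operation simply yields a second cache entry, so mixed-schema segments stay correct while homogeneous segments pay the executor-creation cost only once.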
Apache CarbonData is an indexed columnar data store solution for fast analytics on big data platform, e.g.Apache Hadoop, Apache Spark, etc.
You can find the latest CarbonData documentation and learn more at: http://carbondata.apache.org
The CarbonData file format is a columnar store in HDFS. It has many features that a modern columnar format has, such as splittable files, compression schemes, complex data types, etc., and CarbonData has the following unique features:
CarbonData is built using Apache Maven. To build CarbonData:
This is an active open source project for everyone, and we are always open to people who want to use this system or contribute to it. This guide introduces how to contribute to CarbonData.
To get involved in CarbonData:
Apache CarbonData is an open source project of The Apache Software Foundation (ASF).