[CARBONDATA-3700] Optimize pruning performance when pruning with multiple threads

Why is this PR needed?
1. When pruning with multiple threads, a bug hampers pruning performance heavily.
When pruning yields no blocklets matching the query filter, the getExtendedBlocklets function is triggered to get the extended blocklet metadata. When the input of this function is an empty blocklet list, it is expected to return an empty extended blocklet list directly, but currently a bug leads to a meaningless "HashSet add" overhead.
Meanwhile, when pruning with multiple threads, the getExtendedBlocklets function is triggered for each blocklet; this should be avoided by triggering it once per segment instead.
2. When pruning, another bug hampers pruning performance heavily.
The validatePartitionInfo operation is executed for every blocklet, and it iterates over all the partition info for each blocklet. With millions of blocklets and hundreds of partitions, the computational complexity reaches hundreds of millions of operations.
3. During pruning, a FilterExecuter is created per blocklet, which causes a huge performance degradation when there are several million blocklets.
Specifically, creating a FilterExecuter is a heavy operation that involves a lot of time-consuming initialization work.

What changes were proposed in this PR?
1.1 If the input to getExtendedBlocklets is an empty blocklet list, return an empty extended blocklet list directly.
1.2 Trigger getExtendedBlocklets once per segment instead of once per blocklet (see the first sketch after this list).
2.1 Remove the per-blocklet validatePartitionInfo and perform the partition validation in the getDataMaps processing instead (second sketch below).
3.1 Create the FilterExecuter per segment instead of per blocklet, and share it among all blocklets (third sketch below).
In cases such as adding a column or changing a sort column and then updating the segment, blocklets with several different column schemas can exist in the same segment; the FilterExecuter can only be shared when the column schemas of all the blocklets are the same. We therefore add a fingerprint to each blocklet that identifies its column schema: if two fingerprints are equal, the column schemas are equal, so the FilterExecuter can be reused.
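The shape of changes 1.1 and 1.2 is roughly as follows. This is a minimal sketch, not the patched source: it assumes CarbonData-style types (Blocklet, ExtendedBlocklet, Segment, BlockletDetailsFetcher) are available, and the per-segment map of pruning results is an illustrative simplification.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of changes 1.1 and 1.2 (illustrative, not the exact patched code).
class ExtendedBlockletResolver {
  private final BlockletDetailsFetcher fetcher;

  ExtendedBlockletResolver(BlockletDetailsFetcher fetcher) {
    this.fetcher = fetcher;
  }

  List<ExtendedBlocklet> resolve(Map<Segment, List<Blocklet>> prunedBySegment)
      throws IOException {
    List<ExtendedBlocklet> result = new ArrayList<>();
    for (Map.Entry<Segment, List<Blocklet>> entry : prunedBySegment.entrySet()) {
      List<Blocklet> blocklets = entry.getValue();
      // 1.1: an empty pruning result returns immediately -- no metadata fetch
      // and no per-result bookkeeping (the needless "HashSet add" path)
      if (blocklets.isEmpty()) {
        continue;
      }
      // 1.2: fetch extended metadata once per segment, not once per blocklet
      result.addAll(fetcher.getExtendedBlocklets(blocklets, entry.getKey()));
    }
    return result;
  }
}
```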
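For 2.1, the per-blocklet validatePartitionInfo loop becomes a single partition check at the point where the data maps are obtained. A sketch under assumed names (loadDataMaps and getPartitionPath are hypothetical hooks; PartitionSpec approximates CarbonData's partition descriptor):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of change 2.1: validate partitions once per segment in the getDataMaps
// processing, instead of iterating all partitions for every blocklet.
abstract class PartitionAwareDataMapProvider {
  List<CoarseGrainDataMap> getDataMaps(Segment segment, List<PartitionSpec> partitions) {
    List<CoarseGrainDataMap> all = loadDataMaps(segment);
    if (partitions == null || partitions.isEmpty()) {
      return all;  // no partition filter to apply
    }
    // Build the valid-partition lookup once: O(partitions)
    Set<String> validPaths = new HashSet<>();
    for (PartitionSpec spec : partitions) {
      validPaths.add(spec.getLocation().toString());
    }
    // One pass over the data maps with O(1) membership tests, replacing the
    // old O(blocklets * partitions) per-blocklet validation
    List<CoarseGrainDataMap> valid = new ArrayList<>();
    for (CoarseGrainDataMap dataMap : all) {
      if (validPaths.contains(getPartitionPath(dataMap))) {
        valid.add(dataMap);
      }
    }
    return valid;
  }

  // assumed hooks standing in for the real data map loading and partition lookup
  protected abstract List<CoarseGrainDataMap> loadDataMaps(Segment segment);

  protected abstract String getPartitionPath(CoarseGrainDataMap dataMap);
}
```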
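For 3.1, a cache keyed by the column-schema fingerprint lets every blocklet with the same schema reuse one FilterExecuter. Another hedged sketch: BlockletHandle, getColumnSchemaFingerprint, and the buildFilterExecuter hook are assumed names standing in for the real fingerprint computation and the existing executer construction.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of change 3.1: one FilterExecuter per distinct column-schema fingerprint
// within a segment, shared by all blocklets carrying that fingerprint.
abstract class FilterExecuterCache {
  // column-schema fingerprint (e.g. a hash over the serialized ColumnSchema list)
  // -> the shared executor for that schema
  private final Map<String, FilterExecuter> cache = new HashMap<>();

  FilterExecuter getOrCreate(BlockletHandle blocklet, FilterResolverIntf filter) {
    // equal fingerprints imply equal column schemas, so the heavy executer
    // initialization is paid once per distinct schema instead of per blocklet
    String fingerprint = blocklet.getColumnSchemaFingerprint();
    return cache.computeIfAbsent(fingerprint,
        fp -> buildFilterExecuter(filter, blocklet.getSegmentProperties()));
  }

  // assumed hook: delegates to CarbonData's existing FilterExecuter construction
  protected abstract FilterExecuter buildFilterExecuter(
      FilterResolverIntf filter, SegmentProperties properties);
}
```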

Does this PR introduce any user interface change?
No.

Is any new testcase added?
Yes.

This closes #3620
16 files changed
README.md

Apache CarbonData is an indexed columnar data store solution for fast analytics on big data platforms, e.g. Apache Hadoop, Apache Spark, etc.

You can find the latest CarbonData documentation and learn more at: http://carbondata.apache.org

CarbonData cwiki



Features

The CarbonData file format is a columnar store in HDFS. It has many features that a modern columnar format has, such as splittability, compression schemes, and complex data types, and CarbonData has the following unique features:

  • Stores data along with index: this can significantly accelerate query performance and reduce I/O scans and CPU usage when the query contains filters. The CarbonData index consists of multiple levels of indices; a processing framework can leverage this index to reduce the tasks it needs to schedule and process, and it can also do skip scans at a finer-grained unit (called a blocklet) in task-side scanning instead of scanning the whole file.
  • Operable encoded data: by supporting efficient compression and global encoding schemes, CarbonData can query on compressed/encoded data; the data is converted just before returning the results to the users, i.e. it is "late materialized".
  • Support for various use cases with one single data format: e.g. interactive OLAP-style queries, sequential access (big scans), and random access (narrow scans).

Building CarbonData

CarbonData is built using Apache Maven. To build CarbonData:

Online Documentation

Integration

Other Technical Material

Fork and Contribute

This is an active open source project for everyone, and we are always open to people who want to use this system or contribute to it. This guide document introduces how to contribute to CarbonData.

Contact us

To get involved in CarbonData:

About

Apache CarbonData is an open source project of The Apache Software Foundation (ASF).