| commit | 97adba5192cff75f91255105bc871ec542f390de | |
|---|---|---|
| author | Riza Suminto <riza.suminto@cloudera.com> | Thu Feb 15 16:57:36 2024 -0800 |
| committer | Impala Public Jenkins <impala-public-jenkins@cloudera.com> | Thu Apr 04 04:46:19 2024 +0000 |
| tree | 6037435d5411224bce00d88104b20ee1e307bd66 | |
| parent | a6234472066e6fab4815568e63061df5930e4c2c | |
IMPALA-12881: Use getFkPkJoinCardinality to reduce scan cardinality

IMPALA-12018 added reduceCardinalityForScanNode() to lower the cardinality estimate when a runtime filter is involved. It calls JoinNode.computeGenericJoinCardinality(). However, if the originating join node has an FK-PK conjunct, a lower cardinality estimate can be obtained by calling JoinNode.getFkPkJoinCardinality() instead.

This patch adds that analysis and calls JoinNode.getFkPkJoinCardinality() when possible. It is, however, limited to runtime filters that are evaluated at the storage layer, such as partition filters and pushed-down Kudu filters. Row-level runtime filters that are evaluated at the scan node continue to use JoinNode.computeGenericJoinCardinality().

The distinction is made because a storage-layer filter is applied more consistently than a row-level filter. For example, a partition filter evaluates every partition_id and is never disabled, regardless of its precision (see HdfsScanNodeBase::PartitionPassesFilters). On the other hand, a scan node can later disable a row-level filter if it is deemed ineffective or not precise enough (see HdfsScanner::CheckFiltersEffectiveness, LocalFilterStats::enabled_for_row, and the min_filter_reject_ratio flag). For pushed-down Kudu filters, Impala relies on Kudu to evaluate the filter.

Runtime filters can also arrive late. For both storage-layer and row-level filters, the scan node can stop waiting and start scanning after runtime_filter_wait_time_ms has passed. The scan node will still evaluate a late runtime filter if the scan is still in progress.

Also note that this cardinality-reduction algorithm considers only highly selective runtime filters, to increase confidence in the estimate (see RuntimeFilter.isHighlySelective()).

Testing:
- Updated TpcdsCpuCostPlannerTest.
- Passed FE tests.
Change-Id: I6efafffc8f96247a860b88e85d9097b2b4327f32
Reviewed-on: http://gerrit.cloudera.org:8080/21118
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
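The decision the commit describes can be sketched as follows. This is a simplified, hypothetical illustration, not Impala's actual code: the method names and formulas here (genericJoinCardinality, fkPkJoinCardinality, reducedScanCardinality) are assumptions standing in for JoinNode.computeGenericJoinCardinality(), JoinNode.getFkPkJoinCardinality(), and the filter-type check. The key idea is that an FK-PK join bounds the result by the FK (probe) side's cardinality scaled by the PK side's selectivity, which is usually tighter than the generic NDV-based estimate, and that the tighter estimate is only trusted for storage-layer filters, which are never disabled at runtime.

```java
public class CardinalitySketch {
    // Generic join cardinality estimate: |L| * |R| / max(NDV_L, NDV_R).
    // (Illustrative formula; roughly what NDV-based join estimation does.)
    static long genericJoinCardinality(long lhsCard, long rhsCard,
            long lhsNdv, long rhsNdv) {
        long maxNdv = Math.max(lhsNdv, rhsNdv);
        if (maxNdv <= 0) return 0;
        return (long) Math.ceil((double) lhsCard * rhsCard / maxNdv);
    }

    // FK-PK join cardinality estimate: the FK (probe) side is scaled by the
    // PK side's selectivity (filtered PK rows / base PK rows), so the result
    // never exceeds the FK side's cardinality.
    static long fkPkJoinCardinality(long fkCard, long pkCard, long pkBaseCard) {
        if (pkBaseCard <= 0) return 0;
        double pkSelectivity = Math.min(1.0, (double) pkCard / pkBaseCard);
        return (long) Math.ceil(fkCard * pkSelectivity);
    }

    // Choose the scan-node estimate: only a storage-level filter (partition
    // filter or pushed-down Kudu filter) from an FK-PK join may use the
    // tighter FK-PK estimate, because a row-level filter can be disabled at
    // runtime if it proves ineffective.
    static long reducedScanCardinality(boolean storageLevelFilter,
            boolean fkPkJoin, long fkPkEstimate, long genericEstimate) {
        if (storageLevelFilter && fkPkJoin) {
            return Math.min(fkPkEstimate, genericEstimate);
        }
        return genericEstimate;
    }
}
```

For instance, a 1M-row fact table probed through a runtime filter built from a dimension table reduced to 10% of its rows would get an FK-PK estimate of 100K rows, but only when the filter is applied at the storage layer.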
Lightning-fast, distributed SQL queries for petabytes of data stored in open data and table formats.
Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources:
The fastest way to try out Impala is a quickstart Docker container. You can try out running queries and processing data sets in Impala on a single machine without installing dependencies. It can automatically load test data sets into Apache Kudu and Apache Parquet formats and you can start playing around with Apache Impala SQL within minutes.
To learn more about Impala as a user or administrator, or to try Impala, please visit the Impala homepage. Detailed documentation for administrators and users is available at Apache Impala documentation.
If you are interested in contributing to Impala as a developer, or learning more about Impala's internals and architecture, visit the Impala wiki.
Impala runs on Linux only. It supports x86_64 and has experimental support for arm64 (as of Impala 4.0). Impala Requirements contains more detailed information on the minimum CPU requirements. The supported distros are
Other systems, e.g. SLES12, may also be supported but are not tested by the community.
This distribution uses cryptographic software and may be subject to export controls. Please refer to EXPORT_CONTROL.md for more information.
See Impala's developer documentation to get started.
The detailed build notes contain more information on the project layout and build.