commit 15ce822e77325969cf75e198979c55acd671ef19
author:    Joe McDonnell <joemcdonnell@cloudera.com>  Fri Jul 29 15:56:21 2022 -0700
committer: Joe McDonnell <joemcdonnell@cloudera.com>  Fri Aug 12 03:13:55 2022 +0000
tree:      4e365a479db058bc9da19340807036ab514ceffc
parent:    ff0465bd075a95aa02e5ca3b2e94dc606cf40e33
IMPALA-11468: Port "Block Bloom filter false positive correction" from Kudu

Block Bloom filters have a higher false positive rate than standard Bloom filters, due to the uneven distribution of keys between buckets. This patch changes the code to match the theory, using an approximation from the paper that introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom Filters" by Putze et al.

In scan_predicate.cc, filters are created with BlockBloomFilter::MinLogSpace. Prior to this patch, that method would sometimes return a value lower than the true answer, leading to smaller filters and higher false positive probabilities than expected. This patch corrects BlockBloomFilter::MinLogSpace, so that filters achieve the expected false positive rate by dint of their larger size.

The performance impact depends on the extent to which a scan is bottlenecked by heap space for the filter vs. compute time spent applying the scan predicate to filter out false positives. For a false positive probability of 1%, as is currently set in scan_predicate.cc, this patch increases filter size by about 10% and decreases the filter false positive probability by about 50%. However, this effect is obscured by the coarseness imposed by the fact that filters are constrained to have a size in bytes that is a power of two. Loosening that restriction is potential future work.

Porting Notes:
 - The MaxNdv() function is not present in Impala, so it is omitted.
 - This resolves a test failure for ParquetBloomFilter.FindInvalid when building with GCC 10 and the associated libstdc++.
 - This adds a comment noting that the test is also dependent on the libstdc++ implementation of unordered_set.

Change-Id: Ic992e47976274e3ef0db3633d38e5a8e886274b4
Reviewed-on: http://gerrit.cloudera.org:8080/18807
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
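The correction the commit describes can be sketched in a few lines. This is a hedged illustration of the Putze et al. approximation, not Impala's actual C++ code: the function names, the 256-bit block size, and the 8 hash functions are assumptions chosen to mirror the commit's description. The key idea is that a block Bloom filter's false positive probability is the classic Bloom FPP averaged over the Poisson-distributed number of keys that land in each block, and a sizing routine must search using that corrected formula rather than the classic one.

```python
import math

def block_bloom_fpp(ndv, log_space_bytes, block_bits=256, k=8):
    """Approximate false positive probability of a block Bloom filter,
    per Putze et al.: average the classic Bloom filter FPP over the
    Poisson-distributed number of keys landing in each block.
    (Illustrative sketch; parameter choices are assumptions.)"""
    space_bits = 8 * (1 << log_space_bytes)
    mean = ndv * block_bits / space_bits  # expected keys per block
    if mean == 0:
        return 0.0
    fpp = 0.0
    # Sum far enough past the mean that the Poisson tail is negligible.
    for load in range(int(mean + 10 * math.sqrt(mean)) + 16):
        # Poisson P(load) computed in log space to avoid underflow.
        log_w = load * math.log(mean) - mean - math.lgamma(load + 1)
        # Classic Bloom FPP for `load` keys in a block_bits-bit filter.
        fpp += math.exp(log_w) * (1.0 - math.exp(-k * load / block_bits)) ** k
    return fpp

def min_log_space(ndv, target_fpp):
    """Smallest log2(bytes) whose filter meets target_fpp. The pre-patch
    bug described above is that a sizing routine based on the classic
    (uncorrected) formula can return a value that is too small."""
    for log_space in range(1, 61):
        if block_bloom_fpp(ndv, log_space) <= target_fpp:
            return log_space
    return 61
```

The power-of-two constraint the commit mentions is visible here: `min_log_space` can only step the filter size by factors of two, which is why a roughly 10% theoretical size increase may or may not change the allocated size for a given NDV.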
Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.
Impala is a modern, massively distributed, massively parallel C++ query engine that lets you analyze, transform, and combine data from a variety of data sources.
The fastest way to try out Impala is the quickstart Docker container. You can run queries and process data sets in Impala on a single machine without installing dependencies. The container can automatically load test data sets into Apache Kudu and Apache Parquet formats, and you can start playing around with Apache Impala SQL within minutes.
To learn more about Impala as a user or administrator, or to try Impala, please visit the Impala homepage. Detailed documentation for administrators and users is available at Apache Impala documentation.
If you are interested in contributing to Impala as a developer, or learning more about Impala's internals and architecture, visit the Impala wiki.
Impala runs only on Linux at the moment. It supports x86_64 and has experimental support for arm64 (as of Impala 4.0). Impala Requirements contains more detailed information on the minimum CPU requirements. The supported distros are:
Other systems, e.g. SLES12, may also be supported but are not tested by the community.
This distribution uses cryptographic software and may be subject to export controls. Please refer to EXPORT_CONTROL.md for more information.
See Impala's developer documentation to get started.
The detailed build notes contain information on the project layout and build.