commit | 5fedf7bf7247240ae1d356a393179f80b97d5cb5 | [log] [tgz] |
---|---|---|
author | Qifan Chen <qchen@cloudera.com> | Thu Jun 18 19:31:20 2020 -0700 |
committer | Impala Public Jenkins <impala-public-jenkins@cloudera.com> | Thu Aug 13 03:10:16 2020 +0000 |
tree | 9e2b30ce520dae468832bc0f076f6f23bf92fff3 | |
parent | 8addf4300be33c57b79580d879c08316a47e3f90 [diff] |
IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans This work addresses the current limitation in computing the total row count for a Hive table in a scan. The row count can be incorrectly computed as 0, even though there exists data in the Hive table. This is the stats corruption at table level. Similar stats corruption exists for a partition. The row count of a table or a partition sometime can also be -1 which indicates a missing stats situation. In the fix, as long as no partition in a Hive table exhibits any missing or corrupt stats, the total row count for the table is computed from the row counts in all partitions. Otherwise, Impala looks at the table level stats particularly the table row count. In addition, if the table stats is missing or corrupted, Impala estimates a row count for the table, if feasible. This row count is the sum of the row count from the partitions with good stats, and an estimation of the number of rows in the partitions with missing or corrupt stats. Such estimation also applies when some partition has corrupt stats. One way to observe the fix is through the explain of queries scanning Hive tables with missing or corrupted stats. The cardinality for any full scan should be a positive value (i.e. the estimated row count), instead of 'unavailable'. At the beginning of the explain output, that table is still listed in the WARNING section for potentially corrupt table statistics. Testing: 1. Ran unit tests with queries documented in the case against Hive tables with the following configrations: a. No stats corruption in any partitions b. Stats corruption in some partitions c. Stats corruption in all partitions 2. Added two new tests in test_compute_stats.py: a. test_corrupted_stats_in_partitioned_Hive_tables b. test_corrupted_stats_in_unpartitioned_Hive_tables 3. Fixed failures in corrupt-stats.test 4. Ran "core" test Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Reviewed-on: http://gerrit.cloudera.org:8080/16098 Reviewed-by: Sahil Takiar <stakiar@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.
Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources:
To learn more about Impala as a business user, or to try Impala live or in a VM, please visit the Impala homepage. Detailed documentation for administrators and users is available at Apache Impala documentation.
If you are interested in contributing to Impala as a developer, or learning more about Impala's internals and architecture, visit the Impala wiki.
Impala only supports Linux at the moment.
This distribution uses cryptographic software and may be subject to export controls. Please refer to EXPORT_CONTROL.md for more information.
See Impala's developer documentation to get started.
Detailed build notes has some detailed information on the project layout and build.