| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="perf_stats"> |
| |
| <title>Table and Column Statistics</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Performance"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Concepts"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Impala can do better optimization for complex or multi-table queries when it has access to |
| statistics about the volume of data and how the values are distributed. Impala uses this |
| information to help parallelize and distribute the work for a query. For example, |
| optimizing join queries requires a way of determining if one table is <q>bigger</q> than |
| another, which is a function of the number of rows and the average row size for each |
| table. The following sections describe the categories of statistics Impala can work with, |
| and how to produce them and keep them up to date. |
| </p> |
| |
| <p outputclass="toc inpage all"/> |
| |
| </conbody> |
| |
| <concept id="perf_table_stats"> |
| |
| <title id="table_stats">Overview of Table Statistics</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Concepts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| The Impala query planner can make use of statistics about entire tables and partitions. |
| This information includes physical characteristics such as the number of rows, number of |
| data files, the total size of the data files, and the file format. For partitioned |
| tables, the numbers are calculated per partition, and as totals for the whole table. |
| This metadata is stored in the metastore database, and can be updated by either Impala |
| or Hive. If a number is not available, the value -1 is used as a placeholder. Some |
| numbers, such as number and total sizes of data files, are always kept up to date |
| because they can be calculated cheaply, as part of gathering HDFS block metadata. |
| </p> |
| |
| <p> |
| The following example shows table stats for an unpartitioned Parquet table. The values |
| for the number and sizes of files are always available. Initially, the number of rows is |
| not known, because it requires a potentially expensive scan through the entire table, |
| and so that value is displayed as -1. The <codeph>COMPUTE STATS</codeph> statement fills |
| in any unknown table stats values. |
| </p> |
| |
| <codeblock> |
| show table stats parquet_snappy; |
| +-------+--------+---------+--------------+-------------------+---------+-------------------+... |
| | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |... |
| +-------+--------+---------+--------------+-------------------+---------+-------------------+... |
| | -1 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |... |
| +-------+--------+---------+--------------+-------------------+---------+-------------------+... |
| |
| compute stats parquet_snappy; |
| +-----------------------------------------+ |
| | summary | |
| +-----------------------------------------+ |
| | Updated 1 partition(s) and 6 column(s). | |
| +-----------------------------------------+ |
| |
| |
| show table stats parquet_snappy; |
| +------------+--------+---------+--------------+-------------------+---------+-------------------+... |
| | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |... |
| +------------+--------+---------+--------------+-------------------+---------+-------------------+... |
| | 1000000000 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |... |
| +------------+--------+---------+--------------+-------------------+---------+-------------------+... |
| </codeblock> |
| |
| <p> |
| Impala performs some optimizations using this metadata on its own, and other |
| optimizations by using a combination of table and column statistics. |
| </p> |
| |
| <p rev="1.2.1"> |
| To check that table statistics are available for a table, and see the details of those |
| statistics, use the statement <codeph>SHOW TABLE STATS |
| <varname>table_name</varname></codeph>. See <xref href="impala_show.xml#show"/> for |
| details. |
| </p> |
| |
| <p> |
| If you use the Hive-based methods of gathering statistics, see |
| <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the |
| Hive wiki</xref> for information about the required configuration on the Hive side. |
| Where practical, use the Impala <codeph>COMPUTE STATS</codeph> statement to avoid |
| potential configuration and scalability issues with the statistics-gathering process. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="perf_column_stats"> |
| |
| <title id="column_stats">Overview of Column Statistics</title> |
| |
| <conbody> |
| |
| <p> |
| The Impala query planner can make use of statistics about individual columns when that |
| metadata is available in the metastore database. This technique is most valuable for |
| columns compared across tables in <xref href="impala_perf_joins.xml#perf_joins">join |
| queries</xref>, to help estimate how many rows the query will retrieve from each table. |
| <ph rev="2.0.0"> These statistics are also important for correlated subqueries using the |
| <codeph>EXISTS()</codeph> or <codeph>IN()</codeph> operators, which are processed |
| internally the same way as join queries.</ph> |
| </p> |
| |
| <p> |
| The following example shows column stats for an unpartitioned Parquet table. The values |
| for the maximum and average sizes of some types are always available, because those |
| figures are constant for numeric and other fixed-size types. Initially, the number of |
| distinct values is not known, because it requires a potentially expensive scan through |
| the entire table, and so that value is displayed as -1. The same applies to maximum and |
| average sizes of variable-sized types, such as <codeph>STRING</codeph>. The |
| <codeph>COMPUTE STATS</codeph> statement fills in most unknown column stats values. (It |
| does not record the number of <codeph>NULL</codeph> values, because currently Impala |
| does not use that figure for query optimization.) |
| </p> |
| |
| <codeblock> |
| show column stats parquet_snappy; |
| +-------------+----------+------------------+--------+----------+----------+ |
| | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | |
| +-------------+----------+------------------+--------+----------+----------+ |
| | id | BIGINT | -1 | -1 | 8 | 8 | |
| | val | INT | -1 | -1 | 4 | 4 | |
| | zerofill | STRING | -1 | -1 | -1 | -1 | |
| | name | STRING | -1 | -1 | -1 | -1 | |
| | assertion | BOOLEAN | -1 | -1 | 1 | 1 | |
| | location_id | SMALLINT | -1 | -1 | 2 | 2 | |
| +-------------+----------+------------------+--------+----------+----------+ |
| |
| compute stats parquet_snappy; |
| +-----------------------------------------+ |
| | summary | |
| +-----------------------------------------+ |
| | Updated 1 partition(s) and 6 column(s). | |
| +-----------------------------------------+ |
| |
| show column stats parquet_snappy; |
| +-------------+----------+------------------+--------+----------+-------------------+ |
| | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | |
| +-------------+----------+------------------+--------+----------+-------------------+ |
| | id | BIGINT | 183861280 | -1 | 8 | 8 | |
| | val | INT | 139017 | -1 | 4 | 4 | |
| | zerofill | STRING | 101761 | -1 | 6 | 6 | |
| | name | STRING | 145636240 | -1 | 22 | 13.00020027160645 | |
| | assertion | BOOLEAN | 2 | -1 | 1 | 1 | |
| | location_id | SMALLINT | 339 | -1 | 2 | 2 | |
| +-------------+----------+------------------+--------+----------+-------------------+ |
| </codeblock> |
| |
| <note> |
| <p> |
| For column statistics to be effective in Impala, you also need to have table |
| statistics for the applicable tables, as described in |
| <xref href="impala_perf_stats.xml#perf_table_stats"/>. When you use the Impala |
| <codeph>COMPUTE STATS</codeph> statement, both table and column statistics are |
| automatically gathered at the same time, for all columns in the table. |
| </p> |
| </note> |
| |
| <note conref="../shared/impala_common.xml#common/compute_stats_nulls"/> |
| |
| <p rev="1.2.1"> |
| To check whether column statistics are available for a particular set of columns, use |
| the <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph> statement, or check |
| the extended <codeph>EXPLAIN</codeph> output for a query against that table that refers |
| to those columns. See <xref href="impala_show.xml#show"/> and |
| <xref href="impala_explain.xml#explain"/> for details. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="perf_stats_partitions"> |
| |
| <title id="stats_partitions">How Table and Column Statistics Work for Partitioned Tables</title> |
| |
| <conbody> |
| |
| <p> |
| When you use Impala for <q>big data</q>, you are highly likely to use partitioning for |
| your biggest tables, the ones representing data that can be logically divided based on |
| dates, geographic regions, or similar criteria. The table and column statistics are |
| especially useful for optimizing queries on such tables. For example, a query involving |
| one year might involve substantially more or less data than a query involving a |
| different year, or a range of several years. Each query might be optimized differently |
| as a result. |
| </p> |
| |
| <p> |
| The following examples show how table and column stats work with a partitioned table. |
| The table for this example is partitioned by year, month, and day. For simplicity, the |
| sample data consists of 5 partitions, all from the same year and month. Table stats are |
| collected independently for each partition. (In fact, the <codeph>SHOW |
| PARTITIONS</codeph> statement displays exactly the same information as <codeph>SHOW |
| TABLE STATS</codeph> for a partitioned table.) Column stats apply to the entire table, |
| not to individual partitions. Because the partition key column values are represented as |
| HDFS directories, their characteristics are typically known in advance, even when the |
| values for non-key columns are shown as -1. |
| </p> |
| |
| <codeblock> |
| show partitions year_month_day; |
| +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... |
| | year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... |
| +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... |
| | 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | Total | | | -1 | 5 | 12.58MB | 0B | | |... |
| +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... |
| |
| show table stats year_month_day; |
| +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... |
| | year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... |
| +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... |
| | 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | Total | | | -1 | 5 | 12.58MB | 0B | | |... |
| +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... |
| |
| show column stats year_month_day; |
| +-----------+---------+------------------+--------+----------+----------+ |
| | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | |
| +-----------+---------+------------------+--------+----------+----------+ |
| | id | INT | -1 | -1 | 4 | 4 | |
| | val | INT | -1 | -1 | 4 | 4 | |
| | zfill | STRING | -1 | -1 | -1 | -1 | |
| | name | STRING | -1 | -1 | -1 | -1 | |
| | assertion | BOOLEAN | -1 | -1 | 1 | 1 | |
| | year | INT | 1 | 0 | 4 | 4 | |
| | month | INT | 1 | 0 | 4 | 4 | |
| | day | INT | 5 | 0 | 4 | 4 | |
| +-----------+---------+------------------+--------+----------+----------+ |
| |
| compute stats year_month_day; |
| +-----------------------------------------+ |
| | summary | |
| +-----------------------------------------+ |
| | Updated 5 partition(s) and 5 column(s). | |
| +-----------------------------------------+ |
| |
| show table stats year_month_day; |
| +-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... |
| | year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... |
| +-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... |
| | 2013 | 12 | 1 | 93606 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 2 | 94158 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 3 | 94122 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 4 | 93559 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | 2013 | 12 | 5 | 93845 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... |
| | Total | | | 469290 | 5 | 12.58MB | 0B | | |... |
| +-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... |
| |
| show column stats year_month_day; |
| +-----------+---------+------------------+--------+----------+-------------------+ |
| | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | |
| +-----------+---------+------------------+--------+----------+-------------------+ |
| | id | INT | 511129 | -1 | 4 | 4 | |
| | val | INT | 364853 | -1 | 4 | 4 | |
| | zfill | STRING | 311430 | -1 | 6 | 6 | |
| | name | STRING | 471975 | -1 | 22 | 13.00160026550293 | |
| | assertion | BOOLEAN | 2 | -1 | 1 | 1 | |
| | year | INT | 1 | 0 | 4 | 4 | |
| | month | INT | 1 | 0 | 4 | 4 | |
| | day | INT | 5 | 0 | 4 | 4 | |
| +-----------+---------+------------------+--------+----------+-------------------+ |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="perf_generating_stats"> |
| |
| <title>Generating Table and Column Statistics</title> |
| |
| <conbody> |
| |
| <p> |
| Use the <codeph>COMPUTE STATS</codeph> family of commands to collect table and |
| column statistics. The <codeph>COMPUTE STATS</codeph> variants offer |
| different tradeoffs between computation cost, staleness, and maintenance |
| workflows which are explained below. |
| </p> |
| |
| <note type="important"> |
| <p conref="../shared/impala_common.xml#common/cs_or_cis"/> |
| </note> |
| |
| <!-- TODO: Commented out because it is inaccurate and confusing. Leaving this |
| material for future refactoring into a Hive-compatibility section. |
| <p> |
| If you use Hive as part of your ETL workflow, you can also use Hive to generate table |
| and column statistics. You might need to do extra configuration within Hive itself, the |
| metastore, or even set up a separate database to hold Hive-generated statistics. You |
| might need to run multiple statements to generate all the necessary statistics. |
| Therefore, prefer the Impala <codeph>COMPUTE STATS</codeph> statement where that |
| technique is practical. For details about collecting statistics through Hive, see |
| <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the |
| Hive wiki</xref>. |
| </p> |
| --> |
| |
| </conbody> |
| |
| <concept id="concept_y2f_nfl_mdb"> |
| |
| <title>COMPUTE STATS</title> |
| |
| <conbody> |
| |
| <p> |
| The <codeph>COMPUTE STATS</codeph> command collects and sets the table-level |
| and partition-level row counts as well as all column statistics for a given |
| table. The collection process is CPU-intensive and can take a long time to |
| complete for very large tables. |
| </p> |
| <p> |
| To speed up <codeph>COMPUTE STATS</codeph> consider the following options |
| which can be combined. |
| <ul> |
| <li> |
| <p> |
| Limit the number of columns for which statistics are collected to increase |
| the efficiency of COMPUTE STATS. Queries benefit from statistics for those |
| columns involved in filters, join conditions, group by or partition by |
| clauses. Other columns are good candidates to exclude from COMPUTE STATS. |
| This feature is available since Impala 2.12. |
| </p> |
| </li> |
| <li> |
| <p> |
| Set the MT_DOP query option to use more threads within each participating |
| impalad to compute the statistics faster - but not more efficiently. Note |
| that computing stats on a large table with a high MT_DOP value can |
| negatively affect other queries running at the same time if the |
| COMPUTE STATS claims most CPU cycles. |
| This feature is available since Impala 2.8. |
| </p> |
| </li> |
| <li> |
| <p> |
| Consider the experimental extrapolation and sampling features (see below) |
| to further increase the efficiency of computing stats. |
| </p> |
| </li> |
| </ul> |
| </p> |
| |
| <p> |
| <codeph>COMPUTE STATS</codeph> is intended to be run periodically, |
| e.g. weekly, or on-demand when the contents of a table have changed |
| significantly. Due to the high resource utilization and long repsonse |
| time of t<codeph>COMPUTE STATS</codeph>, it is most practical to run it |
| in a scheduled maintnance window where the Impala cluster is idle |
| enough to accommodate the expensive operation. The degree of change that |
| qualifies as <q>significant</q> depends on the query workload, but typically, |
| if 30% of the rows have changed then it is recommended to recompute |
| statistics. |
| </p> |
| |
| <p> |
| If you reload a complete new set of data for a table, but the number of rows and |
| number of distinct values for each column is relatively unchanged from before, you |
| do not need to recompute stats for the table. |
| </p> |
| |
| </conbody> |
| |
| <concept id="experimental_stats_features"> |
| <title>Experimental: Extrapolation and Sampling</title> |
| <conbody> |
| <p> |
| Impala 2.12 and higher includes two experimental features to alleviate |
| common issues for computing and maintaining statistics on very large tables. |
| The following shortcomings are improved upon: |
| <ul> |
| <li> |
| <p> |
| Newly added partitions do not have row count statistics. Table scans |
| that only access those new partitions are treated as not having stats. |
| Similarly, table scans that access both new and old partitions estimate |
| the scan cardinality based on those old partitions that have stats, and |
| the new partitions without stats are treated as having 0 rows. |
| </p> |
| </li> |
| <li> |
| <p> |
| The row counts of existing partitions become stale when data is added |
| or dropped. |
| </p> |
| </li> |
| <li> |
| <p> |
| Computing stats for tables with a 100,000 or more partitions might fail |
| or be very slow due to the high cost of updating the partition metadata |
| in the Hive Metastore. |
| </p> |
| </li> |
| <li> |
| <p> |
| With transient compute resources it is important to minimize the time |
| from starting a new cluster to successfully running queries. |
| Since the cluster might be relatively short-lived, users might prefer to |
| quickly collect stats that are "good enough" as opposed to spending |
| a lot of time and resouces on computing full-fidelity stats. |
| </p> |
| </li> |
| </ul> |
| For very large tables, it is often wasteful or impractical to run a full |
| COMPUTE STATS to address the scenarios above on a frequent basis. |
| </p> |
| <p> |
| The sampling feature makes COMPUTE STATS more efficient by processing a |
| fraction of the table data, and the extrapolation feature aims to reduce |
| the frequency at which COMPUTE STATS needs to be re-run by estimating |
| the row count of new and modified partitions. |
| </p> |
| <p> |
| The sampling and extrapolation features are disabled by default. |
| They can be enabled globally or for specific tables, as follows. |
| Set the impalad start-up configuration "--enable_stats_extrapolation" to |
| enable the features globally. To enable them only for a specific table, set |
| the "impala.enable.stats.extrapolation" table property to "true" for the |
| desired table. The table-level property overrides the global setting, so |
| it is also possible to enable sampling and extrapolation globally, but |
| disable it for specific tables by setting the table property to "false". |
| Example: |
| ALTER TABLE mytable test_table SET TBLPROPERTIES("impala.enable.stats.extrapolation"="true") |
| </p> |
| <note> |
| Why are these features experimental? Due to their probabilistic nature |
| it is possible that these features perform pathologically poorly on tables |
| with extreme data/file/size distributions. Since it is not feasible for us |
| to test all possible scenarios we only cautiously advertise these new |
| capabilities. That said, the features have been thoroughly tested and |
| are considered functionally stable. If you decide to give these features |
| a try, please tell us about your experience at user@impala.apache.org! |
| We rely on user feedback to guide future inprovements in statistics |
| collection. |
| </note> |
| </conbody> |
| |
| <concept id="experimental_stats_extrapolation"> |
| <title>Stats Extrapolation</title> |
| <conbody> |
| <p> |
| The main idea of stats extrapolation is to estimate the row count of new |
| and modified partitions based on the result of the last COMPUTE STATS. |
| Enabling stats extrapolation changes the behavior of COMPUTE STATS, |
| as well as the cardinality estimation of table scans. COMPUTE STATS no |
| longer computes and stores per-partition row counts, and instead, only |
| computes a table-level row count together with the total number of file |
| bytes in the table at that time. No partition metadata is modified. The |
| input cardinality of a table scan is estimated by converting the data |
| volume of relevant partitions to a row count, based on the table-level |
| row count and file bytes statistics. It is assumed that within the same |
| table, different sets of files with the same data volume correspond |
| to the similar number of rows on average. With extrapolation enabled, |
| the scan cardinality estimation ignores per-partition row counts. It |
| only relies on the table-level statistics and the scanned data volume. |
| </p> |
| <p> |
| The SHOW TABLE STATS and EXPLAIN commands distinguish between row counts |
| stored in the Hive Metastore, and the row counts extrapolated based on the |
| above process. Consult the SHOW TABLE STATS and EXPLAIN documentation |
| for more details. |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept id="experimental_stats_sampling"> |
| <title>Sampling</title> |
| <conbody> |
| <p> |
| A TABLESAMPLE clause may be added to COMPUTE STATS to limit the |
| percentage of data to be processed. The final statistics are obtained |
| by extrapolating the statistics from the data sample over the entire table. |
| The extrapolated statistics are stored in the Hive Metastore, just as if no |
| sampling was used. The following example runs COMPUTE STATS over a 10 percent |
| data sample: COMPUTE STATS test_table TABLESAMPLE SYSTEM(10) |
| </p> |
| <p> |
| We have found that a 10 percent sampling rate typically offers a good |
| tradeoff between statistics accuracy and execution cost. A sampling rate |
| well below 10 percent has shown poor results and is not recommended. |
| </p> |
| <note type="important"> |
| Sampling-based techniques sacrifice result accuracy for execution |
| efficiency, so your mileage may vary for different tables and columns |
| depending on their data distribution. The extrapolation procedure Impala |
| uses for estimating the number of distinct values per column is inherently |
| non-detetministic, so your results may even vary between runs of |
| COMPUTE STATS TABLESAMPLE, even if no data has changed. |
| </note> |
| </conbody> |
| </concept> |
| </concept> |
| </concept> |
| |
| <concept id="concept_bmk_pfl_mdb"> |
| |
| <title>COMPUTE INCREMENTAL STATS</title> |
| |
| <conbody> |
| |
| <p> |
| In Impala 2.1.0 and higher, you can use the |
| <codeph>COMPUTE INCREMENTAL STATS</codeph> and |
| <codeph>DROP INCREMENTAL STATS</codeph> commands. |
| The <codeph>INCREMENTAL</codeph> clauses work with incremental statistics, |
| a specialized feature for partitioned tables. |
| </p> |
| |
| <p> |
| When you compute incremental statistics for a partitioned table, by default Impala only |
| processes those partitions that do not yet have incremental statistics. By processing |
| only newly added partitions, you can keep statistics up to date without incurring the |
| overhead of reprocessing the entire table each time. |
| </p> |
| |
| <p> |
| You can also compute or drop statistics for a specified subset of partitions by |
| including a <codeph>PARTITION</codeph> clause in the |
| <codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph> |
| statement. |
| </p> |
| |
| <note type="important"> |
| <p |
| conref="../shared/impala_common.xml#common/incremental_stats_caveats"/> |
| <p |
| conref="../shared/impala_common.xml#common/incremental_stats_after_full"/> |
| </note> |
| |
| <p> |
| The metadata for incremental statistics is handled differently from the original style |
| of statistics: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Issuing a <codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition |
| clause causes Impala to compute incremental stats for all partitions that |
| do not already have incremental stats. This might be the entire table when |
| running the command for the first time, but subsequent runs should only |
| update new partitions. You can force updating a partition that already has |
| incremental stats by issuing a <codeph>DROP INCREMENTAL STATS</codeph> |
| before running <codeph>COMPUTE INCREMENTAL STATS</codeph>. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| The <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW PARTITIONS</codeph> |
| statements now include an additional column showing whether incremental statistics |
| are available for each column. A partition could already be covered by the original |
| type of statistics based on a prior <codeph>COMPUTE STATS</codeph> statement, as |
| indicated by a value other than <codeph>-1</codeph> under the <codeph>#Rows</codeph> |
| column. Impala query planning uses either kind of statistics when available. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| <codeph>COMPUTE INCREMENTAL STATS</codeph> takes more time than <codeph>COMPUTE |
| STATS</codeph> for the same volume of data. Therefore it is most suitable for tables |
| with large data volume where new partitions are added frequently, making it |
| impractical to run a full <codeph>COMPUTE STATS</codeph> operation for each new |
| partition. For unpartitioned tables, or partitioned tables that are loaded once and |
| not updated with new partitions, use the original <codeph>COMPUTE STATS</codeph> |
| syntax. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| <codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the |
| <cmdname>catalogd</cmdname> process, proportional to the number |
| of partitions and number of columns in the applicable table. The |
| memory overhead is approximately 400 bytes for each column in each |
| partition. This memory is reserved in the |
| <cmdname>catalogd</cmdname> daemon, the |
| <cmdname>statestored</cmdname> daemon, and in each instance of |
| the impalad daemon. </p> |
| </li> |
| |
| <li> |
| <p> |
| In cases where new files are added to an existing partition, issue a |
| <codeph>REFRESH</codeph> statement for the table, followed by a <codeph>DROP |
| INCREMENTAL STATS</codeph> and <codeph>COMPUTE INCREMENTAL STATS</codeph> sequence |
| for the changed partition. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| The <codeph>DROP INCREMENTAL STATS</codeph> statement operates only on a single |
| partition at a time. To remove statistics (whether incremental or not) from all |
| partitions of a table, issue a <codeph>DROP STATS</codeph> statement with no |
| <codeph>INCREMENTAL</codeph> or <codeph>PARTITION</codeph> clauses. |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| The following considerations apply to incremental statistics when the structure of an |
| existing table is changed (known as <term>schema evolution</term>): |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| If you use an <codeph>ALTER TABLE</codeph> statement to drop a column, the existing |
| statistics remain valid and <codeph>COMPUTE INCREMENTAL STATS</codeph> does not |
| rescan any partitions. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If you use an <codeph>ALTER TABLE</codeph> statement to add a column, Impala rescans |
| all partitions and fills in the appropriate column-level values the next time you |
| run <codeph>COMPUTE INCREMENTAL STATS</codeph>. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If you use an <codeph>ALTER TABLE</codeph> statement to change the data type of a |
| column, Impala rescans all partitions and fills in the appropriate column-level |
| values the next time you run <codeph>COMPUTE INCREMENTAL STATS</codeph>. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If you use an <codeph>ALTER TABLE</codeph> statement to change the file format of a |
| table, the existing statistics remain valid and a subsequent <codeph>COMPUTE |
| INCREMENTAL STATS</codeph> does not rescan any partitions. |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| See <xref href="impala_compute_stats.xml#compute_stats"/> and |
| <xref |
| href="impala_drop_stats.xml#drop_stats"/> for syntax details. |
| </p> |
| |
| </conbody> |
| <concept id="inc_stats_size_limit_bytes"> |
| <title>Maximum Serialized Stats Size</title> |
| <conbody> |
| <p>In Impala 3.0 and lower, when executing <codeph>COMPUTE INCREMENTAL |
| STATS</codeph> on very large tables, use the configuration setting |
| <codeph>--inc_stats_size_limit_bytes</codeph> to prevent Impala |
| from running out of memory while updating table metadata. If this |
| limit is reached, Impala will stop loading the table and return an |
| error. The error serves as an indication that <codeph>COMPUTE |
| INCREMENTAL STATS</codeph> should not be used on the particular |
| table. Consider spitting the table and using regular <codeph>COMPUTE |
| STATS</codeph> ]if possible. </p> |
| |
| <p> The <codeph>--inc_stats_size_limit_bytes</codeph> limit is set as |
| a safety check, to prevent Impala from hitting the maximum limit for |
| the table metadata. Note that this limit is only one part of the |
| entire table's metadata all of which together must be below 2 GB. </p> |
| |
| <p> The default value for |
| <codeph>--inc_stats_size_limit_bytes</codeph> is 209715200, 200 |
| MB. </p> |
| |
| <p> To change the <codeph>--inc_stats_size_limit_bytes</codeph> value, |
| restart impalad and catalogd with the new value specified in bytes, |
| for example, 1048576000 for 1 GB. See <xref |
| href="impala_config_options.xml#config_options"/> for the steps to |
| change the option and restart Impala daemons. </p> |
| |
| <note type="attention"> The |
| <codeph>--inc_stats_size_limit_bytes</codeph> setting should be |
| increased with care. A big value for the setting, such as 1 GB or |
| more, can result in a spike in heap usage as well as a crash of |
| Impala. </note> |
| <p>In Impala 3.1 and higher, Impala improved how metadata is updated |
| when executing <codeph>COMPUTE INCREMENTAL STATS</codeph>, |
| significantly reducing the need for |
| <codeph>--inc_stats_size_limit_bytes</codeph>. </p> |
| </conbody> |
| </concept> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept rev="2.1.0" id="perf_stats_checking"> |
| |
| <title>Detecting Missing Statistics</title> |
| |
| <conbody> |
| |
| <p> |
| You can check whether a specific table has statistics using the <codeph>SHOW TABLE |
| STATS</codeph> statement (for any table) or the <codeph>SHOW PARTITIONS</codeph> |
| statement (for a partitioned table). Both statements display the same information. If a |
| table or a partition does not have any statistics, the <codeph>#Rows</codeph> field |
| contains <codeph>-1</codeph>. Once you compute statistics for the table or partition, |
| the <codeph>#Rows</codeph> field changes to an accurate value. |
| </p> |
| |
| <p> |
| The following example shows a table that initially does not have any statistics. The |
| <codeph>SHOW TABLE STATS</codeph> statement displays different values for |
| <codeph>#Rows</codeph> before and after the <codeph>COMPUTE STATS</codeph> operation. |
| </p> |
| |
| <codeblock>[localhost:21000] > create table no_stats (x int); |
| [localhost:21000] > show table stats no_stats; |
| +-------+--------+------+--------------+--------+-------------------+ |
| | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | |
| +-------+--------+------+--------------+--------+-------------------+ |
| | -1 | 0 | 0B | NOT CACHED | TEXT | false | |
| +-------+--------+------+--------------+--------+-------------------+ |
| [localhost:21000] > compute stats no_stats; |
| +-----------------------------------------+ |
| | summary | |
| +-----------------------------------------+ |
| | Updated 1 partition(s) and 1 column(s). | |
| +-----------------------------------------+ |
| [localhost:21000] > show table stats no_stats; |
| +-------+--------+------+--------------+--------+-------------------+ |
| | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | |
| +-------+--------+------+--------------+--------+-------------------+ |
| | 0 | 0 | 0B | NOT CACHED | TEXT | false | |
| +-------+--------+------+--------------+--------+-------------------+ |
| </codeblock> |
| |
| <p> |
| The following example shows a similar progression with a partitioned table. Initially, |
| <codeph>#Rows</codeph> is <codeph>-1</codeph>. After a <codeph>COMPUTE STATS</codeph> |
| operation, <codeph>#Rows</codeph> changes to an accurate value. Any newly added |
| partition starts with no statistics, meaning that you must collect statistics after |
| adding a new partition. |
| </p> |
| |
| <codeblock>[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint); |
| [localhost:21000] > show table stats no_stats_partitioned; |
| +-------+-------+--------+------+--------------+--------+-------------------+ |
| | year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | |
| +-------+-------+--------+------+--------------+--------+-------------------+ |
| | Total | -1 | 0 | 0B | 0B | | | |
| +-------+-------+--------+------+--------------+--------+-------------------+ |
| [localhost:21000] > show partitions no_stats_partitioned; |
| +-------+-------+--------+------+--------------+--------+-------------------+ |
| | year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | |
| +-------+-------+--------+------+--------------+--------+-------------------+ |
| | Total | -1 | 0 | 0B | 0B | | | |
| +-------+-------+--------+------+--------------+--------+-------------------+ |
| [localhost:21000] > alter table no_stats_partitioned add partition (year=2013); |
| [localhost:21000] > compute stats no_stats_partitioned; |
| +-----------------------------------------+ |
| | summary | |
| +-----------------------------------------+ |
| | Updated 1 partition(s) and 1 column(s). | |
| +-----------------------------------------+ |
| [localhost:21000] > alter table no_stats_partitioned add partition (year=2014); |
| [localhost:21000] > show partitions no_stats_partitioned; |
| +-------+-------+--------+------+--------------+--------+-------------------+ |
| | year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | |
| +-------+-------+--------+------+--------------+--------+-------------------+ |
| | 2013 | 0 | 0 | 0B | NOT CACHED | TEXT | false | |
| | 2014 | -1 | 0 | 0B | NOT CACHED | TEXT | false | |
| | Total | 0 | 0 | 0B | 0B | | | |
| +-------+-------+--------+------+--------------+--------+-------------------+ |
| </codeblock> |
| |
| <note> |
| Because the default <codeph>COMPUTE STATS</codeph> statement creates and updates |
| statistics for all partitions in a table, if you expect to frequently add new |
| partitions, use the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax instead, which |
| lets you compute stats for a single specified partition, or only for those partitions |
| that do not already have incremental stats. |
| </note> |
| |
| <p> |
| If checking each individual table is impractical, due to a large number of tables or |
| views that hide the underlying base tables, you can also check for missing statistics |
| for a particular query. Use the <codeph>EXPLAIN</codeph> statement to preview query |
| efficiency before actually running the query. Use the query profile output available |
| through the <codeph>PROFILE</codeph> command in <cmdname>impala-shell</cmdname> or the |
| web UI to verify query execution and timing after running the query. Both the |
| <codeph>EXPLAIN</codeph> plan and the <codeph>PROFILE</codeph> output display a warning |
| if any tables or partitions involved in the query do not have statistics. |
| </p> |
| |
| <codeblock>[localhost:21000] > create table no_stats (x int); |
| [localhost:21000] > explain select count(*) from no_stats; |
| +------------------------------------------------------------------------------------+ |
| | Explain String | |
| +------------------------------------------------------------------------------------+ |
| | Estimated Per-Host Requirements: Memory=10.00MB VCores=1 | |
| | WARNING: The following tables are missing relevant table and/or column statistics. | |
| | incremental_stats.no_stats | |
| | | |
| | 03:AGGREGATE [FINALIZE] | |
| | | output: count:merge(*) | |
| | | | |
| | 02:EXCHANGE [UNPARTITIONED] | |
| | | | |
| | 01:AGGREGATE | |
| | | output: count(*) | |
| | | | |
| | 00:SCAN HDFS [incremental_stats.no_stats] | |
| | partitions=1/1 files=0 size=0B | |
| +------------------------------------------------------------------------------------+ |
| </codeblock> |
| |
| <p> |
| Because Impala uses the <term>partition pruning</term> technique when possible to only |
| evaluate certain partitions, if you have a partitioned table with statistics for some |
| partitions and not others, whether or not the <codeph>EXPLAIN</codeph> statement shows |
| the warning depends on the actual partitions used by the query. For example, you might |
| see warnings or not for different queries against the same table: |
| </p> |
| |
| <codeblock>-- No warning because all the partitions for the year 2012 have stats. |
| EXPLAIN SELECT ... FROM t1 WHERE year = 2012; |
| |
| -- Missing stats warning because one or more partitions in this range |
| -- do not have stats. |
| EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009; |
| </codeblock> |
| |
| <p> |
| To confirm if any partitions at all in the table are missing statistics, you might |
| explain a query that scans the entire table, such as <codeph>SELECT COUNT(*) FROM |
| <varname>table_name</varname></codeph>. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="concept_s3c_4gl_mdb"> |
| |
| <title>Manually Setting Table and Column Statistics with ALTER TABLE</title> |
| |
| <concept id="concept_wpt_pgl_mdb"> |
| |
| <title>Setting Table Statistics</title> |
| |
| <conbody> |
| |
| <p> |
| The most crucial piece of data in all the statistics is the number of rows in the |
| table (for an unpartitioned or partitioned table) and for each partition (for a |
| partitioned table). The <codeph>COMPUTE STATS</codeph> statement always gathers |
| statistics about all columns, as well as overall table statistics. If it is not |
| practical to do a full <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL |
| STATS</codeph> operation after adding a partition or inserting data, or if you can see |
| that Impala would produce a more efficient plan if the number of rows was different, |
| you can manually set the number of rows through an <codeph>ALTER TABLE</codeph> |
| statement: |
| </p> |
| |
| <codeblock> |
| -- Set total number of rows. Applies to both unpartitioned and partitioned tables. |
| alter table <varname>table_name</varname> set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true'); |
| |
| -- Set total number of rows for a specific partition. Applies to partitioned tables only. |
| -- You must specify all the partition key columns in the PARTITION clause. |
| alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=<varname>val1</varname>,<varname>keycol2</varname>=<varname>val2</varname>...) set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true'); |
| </codeblock> |
| |
| <p> |
| This statement avoids re-scanning any data files. (The requirement to include the |
| <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph> property is relatively new, as a |
| result of the issue |
| <xref |
| href="https://issues.apache.org/jira/browse/HIVE-8648" |
| scope="external" format="html">HIVE-8648</xref> |
| for the Hive metastore.) |
| </p> |
| |
| <codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/> |
| |
| <p> |
| For a partitioned table, update both the per-partition number of rows and the number |
| of rows for the whole table: |
| </p> |
| |
| <codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/> |
| |
| <p> |
| In practice, the <codeph>COMPUTE STATS</codeph> statement, or <codeph>COMPUTE |
| INCREMENTAL STATS</codeph> for a partitioned table, should be fast and convenient |
| enough that this technique is only useful for the very largest partitioned tables. |
| <!-- |
| It is most useful as a workaround for in case of performance issues where you might adjust the <codeph>numRows</codeph> value higher |
| or lower to produce the ideal join order. |
| --> |
| <!-- Following wording is duplicated from earlier. Consider conref'ing. --> |
| Because the column statistics might be left in a stale state, do not use this |
| technique as a replacement for <codeph>COMPUTE STATS</codeph>. Only use this technique |
| if all other means of collecting statistics are impractical, or as a low-overhead |
| operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or |
| <codeph>COMPUTE INCREMENTAL STATS</codeph> operations. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="concept_asb_vgl_mdb"> |
| |
| <title>Setting Column Statistics</title> |
| |
| <conbody> |
| |
| <p> |
| In <keyword keyref="impala26_full"/> and higher, you can also use the <codeph>SET |
| COLUMN STATS</codeph> clause of <codeph>ALTER TABLE</codeph> to manually set or change |
| column statistics. Only use this technique in cases where it is impractical to run |
| <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> |
| frequently enough to keep up with data changes for a huge table. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/set_column_stats_example"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept rev="1.2.2" id="perf_stats_examples"> |
| |
| <title>Examples of Using Table and Column Statistics with Impala</title> |
| |
| <conbody> |
| |
| <p> |
| The following examples walk through a sequence of <codeph>SHOW TABLE STATS</codeph>, |
| <codeph>SHOW COLUMN STATS</codeph>, <codeph>ALTER TABLE</codeph>, and |
| <codeph>SELECT</codeph> and <codeph>INSERT</codeph> statements to illustrate various |
| aspects of how Impala uses statistics to help optimize queries. |
| </p> |
| |
| <p> |
| This example shows table and column statistics for the <codeph>STORE</codeph> column |
| used in the <xref href="http://www.tpc.org/tpcds/" scope="external" format="html">TPC-DS |
| benchmarks for decision support</xref> systems. It is a tiny table holding data for 12 |
| stores. Initially, before any statistics are gathered by a <codeph>COMPUTE |
| STATS</codeph> statement, most of the numeric fields show placeholder values of -1, |
| indicating that the figures are unknown. The figures that are filled in are values that |
| are easily countable or deducible at the physical level, such as the number of files, |
| total data size of the files, and the maximum and average sizes for data types that have |
| a constant size such as <codeph>INT</codeph>, <codeph>FLOAT</codeph>, and |
| <codeph>TIMESTAMP</codeph>. |
| </p> |
| |
| <codeblock>[localhost:21000] > show table stats store; |
| +-------+--------+--------+--------+ |
| | #Rows | #Files | Size | Format | |
| +-------+--------+--------+--------+ |
| | -1 | 1 | 3.08KB | TEXT | |
| +-------+--------+--------+--------+ |
| Returned 1 row(s) in 0.03s |
| [localhost:21000] > show column stats store; |
| +--------------------+-----------+------------------+--------+----------+----------+ |
| | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | |
| +--------------------+-----------+------------------+--------+----------+----------+ |
| | s_store_sk | INT | -1 | -1 | 4 | 4 | |
| | s_store_id | STRING | -1 | -1 | -1 | -1 | |
| | s_rec_start_date | TIMESTAMP | -1 | -1 | 16 | 16 | |
| | s_rec_end_date | TIMESTAMP | -1 | -1 | 16 | 16 | |
| | s_closed_date_sk | INT | -1 | -1 | 4 | 4 | |
| | s_store_name | STRING | -1 | -1 | -1 | -1 | |
| | s_number_employees | INT | -1 | -1 | 4 | 4 | |
| | s_floor_space | INT | -1 | -1 | 4 | 4 | |
| | s_hours | STRING | -1 | -1 | -1 | -1 | |
| | s_manager | STRING | -1 | -1 | -1 | -1 | |
| | s_market_id | INT | -1 | -1 | 4 | 4 | |
| | s_geography_class | STRING | -1 | -1 | -1 | -1 | |
| | s_market_desc | STRING | -1 | -1 | -1 | -1 | |
| | s_market_manager | STRING | -1 | -1 | -1 | -1 | |
| | s_division_id | INT | -1 | -1 | 4 | 4 | |
| | s_division_name | STRING | -1 | -1 | -1 | -1 | |
| | s_company_id | INT | -1 | -1 | 4 | 4 | |
| | s_company_name | STRING | -1 | -1 | -1 | -1 | |
| | s_street_number | STRING | -1 | -1 | -1 | -1 | |
| | s_street_name | STRING | -1 | -1 | -1 | -1 | |
| | s_street_type | STRING | -1 | -1 | -1 | -1 | |
| | s_suite_number | STRING | -1 | -1 | -1 | -1 | |
| | s_city | STRING | -1 | -1 | -1 | -1 | |
| | s_county | STRING | -1 | -1 | -1 | -1 | |
| | s_state | STRING | -1 | -1 | -1 | -1 | |
| | s_zip | STRING | -1 | -1 | -1 | -1 | |
| | s_country | STRING | -1 | -1 | -1 | -1 | |
| | s_gmt_offset | FLOAT | -1 | -1 | 4 | 4 | |
| | s_tax_percentage | FLOAT | -1 | -1 | 4 | 4 | |
| +--------------------+-----------+------------------+--------+----------+----------+ |
| Returned 29 row(s) in 0.04s</codeblock> |
| |
| <p> |
| With the Hive <codeph>ANALYZE TABLE</codeph> statement for column statistics, you had to |
| specify each column for which to gather statistics. The Impala <codeph>COMPUTE |
| STATS</codeph> statement automatically gathers statistics for all columns, because it |
| reads through the entire table relatively quickly and can efficiently compute the values |
| for all the columns. This example shows how after running the <codeph>COMPUTE |
| STATS</codeph> statement, statistics are filled in for both the table and all its |
| columns: |
| </p> |
| |
| <codeblock>[localhost:21000] > compute stats store; |
| +------------------------------------------+ |
| | summary | |
| +------------------------------------------+ |
| | Updated 1 partition(s) and 29 column(s). | |
| +------------------------------------------+ |
| Returned 1 row(s) in 1.88s |
| [localhost:21000] > show table stats store; |
| +-------+--------+--------+--------+ |
| | #Rows | #Files | Size | Format | |
| +-------+--------+--------+--------+ |
| | 12 | 1 | 3.08KB | TEXT | |
| +-------+--------+--------+--------+ |
| Returned 1 row(s) in 0.02s |
| [localhost:21000] > show column stats store; |
| +--------------------+-----------+------------------+--------+----------+-------------------+ |
| | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | |
| +--------------------+-----------+------------------+--------+----------+-------------------+ |
| | s_store_sk | INT | 12 | -1 | 4 | 4 | |
| | s_store_id | STRING | 6 | -1 | 16 | 16 | |
| | s_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16 | |
| | s_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16 | |
| | s_closed_date_sk | INT | 3 | -1 | 4 | 4 | |
| | s_store_name | STRING | 8 | -1 | 5 | 4.25 | |
| | s_number_employees | INT | 9 | -1 | 4 | 4 | |
| | s_floor_space | INT | 10 | -1 | 4 | 4 | |
| | s_hours | STRING | 2 | -1 | 8 | 7.083300113677979 | |
| | s_manager | STRING | 7 | -1 | 15 | 12 | |
| | s_market_id | INT | 7 | -1 | 4 | 4 | |
| | s_geography_class | STRING | 1 | -1 | 7 | 7 | |
| | s_market_desc | STRING | 10 | -1 | 94 | 55.5 | |
| | s_market_manager | STRING | 7 | -1 | 16 | 14 | |
| | s_division_id | INT | 1 | -1 | 4 | 4 | |
| | s_division_name | STRING | 1 | -1 | 7 | 7 | |
| | s_company_id | INT | 1 | -1 | 4 | 4 | |
| | s_company_name | STRING | 1 | -1 | 7 | 7 | |
| | s_street_number | STRING | 9 | -1 | 3 | 2.833300113677979 | |
| | s_street_name | STRING | 12 | -1 | 11 | 6.583300113677979 | |
| | s_street_type | STRING | 8 | -1 | 9 | 4.833300113677979 | |
| | s_suite_number | STRING | 11 | -1 | 9 | 8.25 | |
| | s_city | STRING | 2 | -1 | 8 | 6.5 | |
| | s_county | STRING | 1 | -1 | 17 | 17 | |
| | s_state | STRING | 1 | -1 | 2 | 2 | |
| | s_zip | STRING | 2 | -1 | 5 | 5 | |
| | s_country | STRING | 1 | -1 | 13 | 13 | |
| | s_gmt_offset | FLOAT | 1 | -1 | 4 | 4 | |
| | s_tax_percentage | FLOAT | 5 | -1 | 4 | 4 | |
| +--------------------+-----------+------------------+--------+----------+-------------------+ |
| Returned 29 row(s) in 0.04s</codeblock> |
| |
| <p> |
| The following example shows how statistics are represented for a partitioned table. In |
| this case, we have set up a table to hold the world's most trivial census data, a single |
| <codeph>STRING</codeph> field, partitioned by a <codeph>YEAR</codeph> column. The table |
| statistics include a separate entry for each partition, plus final totals for the |
| numeric fields. The column statistics include some easily deducible facts for the |
| partitioning column, such as the number of distinct values (the number of partition |
| subdirectories). |
| <!-- and the number of <codeph>NULL</codeph> values (none in this case). --> |
| </p> |
| |
| <codeblock>localhost:21000] > describe census; |
| +------+----------+---------+ |
| | name | type | comment | |
| +------+----------+---------+ |
| | name | string | | |
| | year | smallint | | |
| +------+----------+---------+ |
| Returned 2 row(s) in 0.02s |
| [localhost:21000] > show table stats census; |
| +-------+-------+--------+------+---------+ |
| | year | #Rows | #Files | Size | Format | |
| +-------+-------+--------+------+---------+ |
| | 2000 | -1 | 0 | 0B | TEXT | |
| | 2004 | -1 | 0 | 0B | TEXT | |
| | 2008 | -1 | 0 | 0B | TEXT | |
| | 2010 | -1 | 0 | 0B | TEXT | |
| | 2011 | 0 | 1 | 22B | TEXT | |
| | 2012 | -1 | 1 | 22B | TEXT | |
| | 2013 | -1 | 1 | 231B | PARQUET | |
| | Total | 0 | 3 | 275B | | |
| +-------+-------+--------+------+---------+ |
| Returned 8 row(s) in 0.02s |
| [localhost:21000] > show column stats census; |
| +--------+----------+------------------+--------+----------+----------+ |
| | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | |
| +--------+----------+------------------+--------+----------+----------+ |
| | name | STRING | -1 | -1 | -1 | -1 | |
| | year | SMALLINT | 7 | -1 | 2 | 2 | |
| +--------+----------+------------------+--------+----------+----------+ |
| Returned 2 row(s) in 0.02s</codeblock> |
| |
| <p> |
| The following example shows how the statistics are filled in by a <codeph>COMPUTE |
| STATS</codeph> statement in Impala. |
| </p> |
| |
| <codeblock>[localhost:21000] > compute stats census; |
| +-----------------------------------------+ |
| | summary | |
| +-----------------------------------------+ |
| | Updated 3 partition(s) and 1 column(s). | |
| +-----------------------------------------+ |
| Returned 1 row(s) in 2.16s |
| [localhost:21000] > show table stats census; |
| +-------+-------+--------+------+---------+ |
| | year | #Rows | #Files | Size | Format | |
| +-------+-------+--------+------+---------+ |
| | 2000 | -1 | 0 | 0B | TEXT | |
| | 2004 | -1 | 0 | 0B | TEXT | |
| | 2008 | -1 | 0 | 0B | TEXT | |
| | 2010 | -1 | 0 | 0B | TEXT | |
| | 2011 | 4 | 1 | 22B | TEXT | |
| | 2012 | 4 | 1 | 22B | TEXT | |
| | 2013 | 1 | 1 | 231B | PARQUET | |
| | Total | 9 | 3 | 275B | | |
| +-------+-------+--------+------+---------+ |
| Returned 8 row(s) in 0.02s |
| [localhost:21000] > show column stats census; |
| +--------+----------+------------------+--------+----------+----------+ |
| | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | |
| +--------+----------+------------------+--------+----------+----------+ |
| | name | STRING | 4 | -1 | 5 | 4.5 | |
| | year | SMALLINT | 7 | -1 | 2 | 2 | |
| +--------+----------+------------------+--------+----------+----------+ |
| Returned 2 row(s) in 0.02s</codeblock> |
| |
| <p rev="1.4.0"> |
| For examples showing how some queries work differently when statistics are available, |
| see <xref href="impala_perf_joins.xml#perf_joins_examples"/>. You can see how Impala |
| executes a query differently in each case by observing the <codeph>EXPLAIN</codeph> |
| output before and after collecting statistics. Measure the before and after query times, |
| and examine the throughput numbers in before and after <codeph>SUMMARY</codeph> or |
| <codeph>PROFILE</codeph> output, to verify how much the improved plan speeds up |
| performance. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |