<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="perf_stats">
<title>Table and Column Statistics</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
Impala can optimize complex or multi-table queries more effectively when it has access to statistics about
the volume of data and how the values are distributed. Impala uses this information to help parallelize and
distribute the work for a query. For example, optimizing join queries requires a way of determining whether one
table is <q>bigger</q> than another, which is a function of the number of rows and the average row size
for each table. The following sections describe the categories of statistics Impala can work
with, and how to produce them and keep them up to date.
</p>
<note>
<p rev="1.2.2">
Originally, Impala relied on the Hive mechanism for collecting statistics, through the Hive <codeph>ANALYZE
TABLE</codeph> statement, which initiates a MapReduce job. For better usability and reliability,
Impala implements its own <codeph>COMPUTE STATS</codeph> statement in Impala 1.2.2 and higher, along with the
<codeph>DROP STATS</codeph>, <codeph>SHOW TABLE STATS</codeph>, and <codeph>SHOW COLUMN STATS</codeph>
statements.
</p>
</note>
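<p>
For reference, the following sketch shows the basic form of each of these Impala statements, using a
hypothetical table named <codeph>sales</codeph>:
</p>
<codeblock>-- Gather table statistics and column statistics for all columns in one operation.
compute stats sales;
-- Examine the statistics currently recorded in the metastore.
show table stats sales;
show column stats sales;
-- Discard the statistics, for example before recomputing them from scratch.
drop stats sales;
</codeblock>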
<p outputclass="toc inpage"/>
</conbody>
<concept id="perf_table_stats">
<title id="table_stats">Overview of Table Statistics</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody>
<!-- Hive background info: https://cwiki.apache.org/Hive/statsdev.html -->
<p>
The Impala query planner can make use of statistics about entire tables and partitions.
This information includes physical characteristics such as the number of rows, number of data files,
the total size of the data files, and the file format. For partitioned tables, the numbers
are calculated per partition, and as totals for the whole table.
This metadata is stored in the metastore database, and can be updated by either Impala or Hive.
If a number is not available, the value -1 is used as a placeholder.
Some numbers, such as the number and total sizes of the data files, are always kept up to date because
they can be calculated cheaply, as part of gathering HDFS block metadata.
</p>
<p>
The following example shows table stats for an unpartitioned Parquet table.
The values for the number and sizes of files are always available.
Initially, the number of rows is not known, because it requires a potentially expensive
scan through the entire table, and so that value is displayed as -1.
The <codeph>COMPUTE STATS</codeph> statement fills in any unknown table stats values.
</p>
<codeblock>
show table stats parquet_snappy;
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |...
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
| -1 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |...
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
compute stats parquet_snappy;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+
show table stats parquet_snappy;
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |...
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
| 1000000000 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |...
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
</codeblock>
<p>
Impala performs some optimizations using this metadata on its own, and other optimizations by
using a combination of table and column statistics.
</p>
<p rev="1.2.1">
To check that table statistics are available for a table, and see the details of those statistics, use the
statement <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph>. See
<xref href="impala_show.xml#show"/> for details.
</p>
<p>
If you use the Hive-based methods of gathering statistics, see
<xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the
Hive wiki</xref> for information about the required configuration on the Hive side. Where practical,
use the Impala <codeph>COMPUTE STATS</codeph> statement to avoid potential configuration and scalability
issues with the statistics-gathering process.
</p>
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
</conbody>
</concept>
<concept id="perf_column_stats">
<title id="column_stats">Overview of Column Statistics</title>
<conbody>
<p>
The Impala query planner can make use of statistics about individual columns when that metadata is
available in the metastore database. This technique is most valuable for columns compared across tables in
<xref href="impala_perf_joins.xml#perf_joins">join queries</xref>, to help estimate how many rows the query
will retrieve from each table. <ph rev="2.0.0">These statistics are also important for correlated
subqueries using the <codeph>EXISTS</codeph> or <codeph>IN</codeph> operators, which are processed
internally the same way as join queries.</ph>
</p>
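<p>
As a sketch of the kinds of queries that benefit, the following examples assume two hypothetical tables,
<codeph>big_fact</codeph> and <codeph>small_dim</codeph>; with column statistics available for the join key
and filter columns, Impala can better estimate how many rows match and choose an efficient join strategy:
</p>
<codeblock>-- Join query: column stats on the join keys help Impala estimate the
-- result size and pick the join order and join strategy.
select f.order_id, d.region_name
from big_fact f join small_dim d on f.region_id = d.region_id;

-- Correlated EXISTS subquery: processed internally like a join,
-- so the same column statistics apply.
select order_id
from big_fact f
where exists (select 1 from small_dim d
              where d.region_id = f.region_id and d.country = 'US');
</codeblock>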
<p>
The following example shows column stats for an unpartitioned Parquet table.
The values for the maximum and average sizes of some types are always available,
because those figures are constant for numeric and other fixed-size types.
Initially, the number of distinct values is not known, because it requires a potentially expensive
scan through the entire table, and so that value is displayed as -1.
The same applies to maximum and average sizes of variable-sized types, such as <codeph>STRING</codeph>.
The <codeph>COMPUTE STATS</codeph> statement fills in most unknown column stats values.
(It does not record the number of <codeph>NULL</codeph> values, because currently Impala
does not use that figure for query optimization.)
</p>
<codeblock>
show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-------------+----------+------------------+--------+----------+----------+
| id | BIGINT | -1 | -1 | 8 | 8 |
| val | INT | -1 | -1 | 4 | 4 |
| zerofill | STRING | -1 | -1 | -1 | -1 |
| name | STRING | -1 | -1 | -1 | -1 |
| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
| location_id | SMALLINT | -1 | -1 | 2 | 2 |
+-------------+----------+------------------+--------+----------+----------+
compute stats parquet_snappy;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+
show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-------------+----------+------------------+--------+----------+-------------------+
| id | BIGINT | 183861280 | -1 | 8 | 8 |
| val | INT | 139017 | -1 | 4 | 4 |
| zerofill | STRING | 101761 | -1 | 6 | 6 |
| name | STRING | 145636240 | -1 | 22 | 13.00020027160645 |
| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
| location_id | SMALLINT | 339 | -1 | 2 | 2 |
+-------------+----------+------------------+--------+----------+-------------------+
</codeblock>
<note>
<p>
For column statistics to be effective in Impala, you also need to have table statistics for the
applicable tables, as described in <xref href="impala_perf_stats.xml#perf_table_stats"/>. When you use
the Impala <codeph>COMPUTE STATS</codeph> statement, both table and column statistics are automatically
gathered at the same time, for all columns in the table.
</p>
</note>
<note conref="../shared/impala_common.xml#common/compute_stats_nulls"/>
<!-- Hive-based instructions are considered obsolete since the introduction of the Impala COMPUTE STATS statement.
<p>
Add settings like the following to the <filepath>hive-site.xml</filepath>
configuration file, in the Hive configuration directory, on every node where you run
<codeph>ANALYZE TABLE</codeph> statements through the
<codeph>hive</codeph> shell. The
<codeph>hive.stats.ndv.error</codeph> setting represents the standard error when
estimating the number of distinct values for a column. The value of 5.0 is recommended as a tradeoff between the
accuracy of the gathered statistics and the resource usage of the stats-gathering process.
</p>
<codeblock><![CDATA[<property>
<name>hive.stats.ndv.error</name>
<value>5.0</value>
</property>]]></codeblock>
<p>
5.0 is a relatively low value that devotes substantial computational resources to the statistics-gathering
process. To reduce the resource usage, you could increase this value; to make the statistics even more precise,
you could lower it.
</p>
<p>
The syntax for gathering column statistics uses the <codeph>ANALYZE TABLE ...
COMPUTE STATISTICS</codeph> clause, with an additional <codeph>FOR
COLUMNS</codeph> clause. For partitioned tables, you can gather statistics for specific partitions by including
a clause <codeph>PARTITION
(<varname>col1=val1</varname>,<varname>col2=val2</varname>,
...)</codeph>; but you cannot include the partitioning columns in the
<codeph>FOR COLUMNS</codeph> clause. Also, you cannot use fully qualified table
names, so issue a <codeph>USE</codeph> command first to switch to the
appropriate database. For example:
</p>
<codeblock>USE <varname>database_name</varname>;
ANALYZE TABLE <varname>table_name</varname> COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>;
ANALYZE TABLE <varname>table_name</varname> PARTITION (<varname>partition_specs</varname>) COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>;</codeblock>
-->
<p rev="1.2.1">
To check whether column statistics are available for a particular set of columns, use the <codeph>SHOW
COLUMN STATS <varname>table_name</varname></codeph> statement, or check the extended
<codeph>EXPLAIN</codeph> output for a query against that table that refers to those columns. See
<xref href="impala_show.xml#show"/> and <xref href="impala_explain.xml#explain"/> for details.
</p>
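<p>
For example, assuming the <codeph>parquet_snappy</codeph> table from the earlier listing, you might check
the column statistics directly and then examine the extended <codeph>EXPLAIN</codeph> output for a query
that refers to those columns:
</p>
<codeblock>show column stats parquet_snappy;
-- The EXTENDED explain level includes cardinality and size estimates
-- that are derived from the column statistics.
set explain_level=extended;
explain select name, count(*) from parquet_snappy group by name;
</codeblock>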
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
</conbody>
</concept>
<concept id="perf_stats_partitions">
<title id="stats_partitions">How Table and Column Statistics Work for Partitioned Tables</title>
<conbody>
<p>
When you use Impala for <q>big data</q>, you are highly likely to use partitioning
for your biggest tables, the ones representing data that can be logically divided
based on dates, geographic regions, or similar criteria. The table and column statistics
are especially useful for optimizing queries on such tables. For example, a query restricted to a single
year might process substantially more or less data than a query covering a different year, or a range of
several years. Each query might be optimized differently as a result.
</p>
<p>
The following examples show how table and column stats work with a partitioned table.
The table for this example is partitioned by year, month, and day.
For simplicity, the sample data consists of 5 partitions, all from the same year and month.
Table stats are collected independently for each partition. (In fact, the
<codeph>SHOW PARTITIONS</codeph> statement displays exactly the same information as
<codeph>SHOW TABLE STATS</codeph> for a partitioned table.) Column stats apply to
the entire table, not to individual partitions. Because the partition key column values
are represented as HDFS directories, their characteristics are typically known in advance,
even when the values for non-key columns are shown as -1.
</p>
<codeblock>
show partitions year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| Total | | | -1 | 5 | 12.58MB | 0B | | |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
show table stats year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| Total | | | -1 | 5 | 12.58MB | 0B | | |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
show column stats year_month_day;
+-----------+---------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------+---------+------------------+--------+----------+----------+
| id | INT | -1 | -1 | 4 | 4 |
| val | INT | -1 | -1 | 4 | 4 |
| zfill | STRING | -1 | -1 | -1 | -1 |
| name | STRING | -1 | -1 | -1 | -1 |
| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
| year | INT | 1 | 0 | 4 | 4 |
| month | INT | 1 | 0 | 4 | 4 |
| day | INT | 5 | 0 | 4 | 4 |
+-----------+---------+------------------+--------+----------+----------+
compute stats year_month_day;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 5 partition(s) and 5 column(s). |
+-----------------------------------------+
show table stats year_month_day;
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
| 2013 | 12 | 1 | 93606 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 2 | 94158 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 3 | 94122 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 4 | 93559 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 5 | 93845 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| Total | | | 469290 | 5 | 12.58MB | 0B | | |...
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
show column stats year_month_day;
+-----------+---------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------+---------+------------------+--------+----------+-------------------+
| id | INT | 511129 | -1 | 4 | 4 |
| val | INT | 364853 | -1 | 4 | 4 |
| zfill | STRING | 311430 | -1 | 6 | 6 |
| name | STRING | 471975 | -1 | 22 | 13.00160026550293 |
| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
| year | INT | 1 | 0 | 4 | 4 |
| month | INT | 1 | 0 | 4 | 4 |
| day | INT | 5 | 0 | 4 | 4 |
+-----------+---------+------------------+--------+----------+-------------------+
</codeblock>
<note>
Partitioned tables can grow so large that scanning the entire table, as the <codeph>COMPUTE STATS</codeph>
statement does, becomes impractical just to update the statistics for a new partition. A full
<codeph>COMPUTE STATS</codeph> operation on such a table might take hours, or even days. In that situation,
switch to using incremental statistics, a feature available in <keyword keyref="impala21_full"/> and higher.
See <xref href="impala_perf_stats.xml#perf_stats_incremental"/> for details about this feature
and the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax.
</note>
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
</conbody>
</concept>
<concept rev="2.1.0" id="perf_stats_incremental">
<title id="incremental_stats">Overview of Incremental Statistics</title>
<conbody>
<p>
In Impala 2.1.0 and higher, you can use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph> and
<codeph>DROP INCREMENTAL STATS</codeph>. The <codeph>INCREMENTAL</codeph> clauses work with incremental
statistics, a specialized feature for partitioned tables that are large or frequently updated with new
partitions.
</p>
<p>
When you compute incremental statistics for a partitioned table, by default Impala only processes those
partitions that do not yet have incremental statistics. By processing only newly added partitions, you can
keep statistics up to date for large partitioned tables, without incurring the overhead of reprocessing the
entire table each time.
</p>
<p>
You can also compute or drop statistics for a single partition by including a <codeph>PARTITION</codeph>
clause in the <codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph>
statement.
</p>
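<p>
For example, with a table partitioned by <codeph>year</codeph>, <codeph>month</codeph>, and
<codeph>day</codeph>, such as the <codeph>year_month_day</codeph> table shown earlier, the single-partition
forms might look like the following sketch (the partition values are hypothetical):
</p>
<codeblock>-- Compute incremental stats for one newly added partition only.
compute incremental stats year_month_day partition (year=2013, month=12, day=6);
-- Discard the incremental stats for a single partition, for example
-- before recomputing them after the underlying data changed.
drop incremental stats year_month_day partition (year=2013, month=12, day=5);
</codeblock>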
<p>
The metadata for incremental statistics is handled differently from the original style of statistics:
</p>
<ul>
<li>
<p>
If you have an existing partitioned table for which you have already computed statistics, issuing
<codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition clause causes Impala to rescan the
entire table. Once the incremental statistics are computed, any future <codeph>COMPUTE INCREMENTAL
STATS</codeph> statements scan only new partitions and any partitions on which you performed
<codeph>DROP INCREMENTAL STATS</codeph>.
</p>
</li>
<li>
<p>
The <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW PARTITIONS</codeph> statements now include an
additional column showing whether incremental statistics are available for each partition. A partition
could already be covered by the original type of statistics based on a prior <codeph>COMPUTE
STATS</codeph> statement, as indicated by a value other than <codeph>-1</codeph> under the
<codeph>#Rows</codeph> column. Impala query planning uses either kind of statistics when available.
</p>
</li>
<li>
<p>
<codeph>COMPUTE INCREMENTAL STATS</codeph> takes more time than <codeph>COMPUTE STATS</codeph> for the
same volume of data. Therefore it is most suitable for tables with large data volume where new
partitions are added frequently, making it impractical to run a full <codeph>COMPUTE STATS</codeph>
operation for each new partition. For unpartitioned tables, or partitioned tables that are loaded once
and not updated with new partitions, use the original <codeph>COMPUTE STATS</codeph> syntax.
</p>
</li>
<li>
<p>
<codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the <cmdname>catalogd</cmdname> process,
proportional to the number of partitions and number of columns in the applicable table. The memory
overhead is approximately 400 bytes for each column in each partition. This memory is reserved in the
<cmdname>catalogd</cmdname> daemon, the <cmdname>statestored</cmdname> daemon, and in each instance of
the <cmdname>impalad</cmdname> daemon.
</p>
</li>
<li>
<p>
In cases where new files are added to an existing partition, issue a <codeph>REFRESH</codeph> statement
for the table, followed by a <codeph>DROP INCREMENTAL STATS</codeph> and <codeph>COMPUTE INCREMENTAL
STATS</codeph> sequence for the changed partition, as illustrated in the sketch following this list.
</p>
</li>
<li>
<p>
The <codeph>DROP INCREMENTAL STATS</codeph> statement operates only on a single partition at a time. To
remove statistics (whether incremental or not) from all partitions of a table, issue a <codeph>DROP
STATS</codeph> statement with no <codeph>INCREMENTAL</codeph> or <codeph>PARTITION</codeph> clauses.
</p>
</li>
</ul>
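<p>
The following sketch illustrates the refresh-and-recompute sequence described in the list above, for a
partition of the <codeph>year_month_day</codeph> table whose data files changed outside of Impala:
</p>
<codeblock>-- Make Impala aware of the new data files in the existing partition.
refresh year_month_day;
-- Discard the now-stale incremental stats for that partition only...
drop incremental stats year_month_day partition (year=2013, month=12, day=3);
-- ...and recompute them by scanning just that one partition.
compute incremental stats year_month_day partition (year=2013, month=12, day=3);
</codeblock>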
<p>
The following considerations apply to incremental statistics when the structure of an existing table is
changed (known as <term>schema evolution</term>):
</p>
<ul>
<li>
<p>
If you use an <codeph>ALTER TABLE</codeph> statement to drop a column, the existing statistics remain
valid and <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any partitions.
</p>
</li>
<li>
<p>
If you use an <codeph>ALTER TABLE</codeph> statement to add a column, Impala rescans all partitions and
fills in the appropriate column-level values the next time you run <codeph>COMPUTE INCREMENTAL
STATS</codeph>.
</p>
</li>
<li>
<p>
If you use an <codeph>ALTER TABLE</codeph> statement to change the data type of a column, Impala
rescans all partitions and fills in the appropriate column-level values the next time you run
<codeph>COMPUTE INCREMENTAL STATS</codeph>.
</p>
</li>
<li>
<p>
If you use an <codeph>ALTER TABLE</codeph> statement to change the file format of a table, the existing
statistics remain valid and a subsequent <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any
partitions.
</p>
</li>
</ul>
<p>
See <xref href="impala_compute_stats.xml#compute_stats"/> and
<xref href="impala_drop_stats.xml#drop_stats"/> for syntax details.
</p>
</conbody>
</concept>
<concept id="perf_stats_computing">
<title>Generating Table and Column Statistics (COMPUTE STATS Statement)</title>
<conbody>
<p>
To gather table statistics after loading data into a table or partition, you typically use the
<codeph>COMPUTE STATS</codeph> statement. This statement is available in Impala 1.2.2 and higher.
It gathers both table statistics and column statistics for all columns in a single operation.
For large partitioned tables, where you frequently need to update statistics and it is impractical
to scan the entire table each time, use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph>,
which is available in <keyword keyref="impala21_full"/> and higher.
</p>
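<p>
For example, a typical load-then-analyze sequence might look like the following sketch; the table names
are hypothetical:
</p>
<codeblock>-- After the initial load of a table, do a full scan to gather all statistics.
insert into sales_history select * from staging_sales;
compute stats sales_history;

-- For a large partitioned table that gains new partitions regularly,
-- use the incremental form so that only new partitions are scanned.
compute incremental stats sales_by_day;
</codeblock>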
<p>
If you use Hive as part of your ETL workflow, you can also use Hive to generate table and
column statistics. You might need to do extra configuration within Hive itself or the metastore,
or even set up a separate database to hold Hive-generated statistics, and you might need to run
multiple statements to generate all the necessary statistics. Therefore, prefer the
Impala <codeph>COMPUTE STATS</codeph> statement where that technique is practical.
For details about collecting statistics through Hive, see
<xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the Hive wiki</xref>.
</p>
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
<!-- Commenting out over-detailed Hive instructions as part of stats reorg.
<li>
Issue an <codeph>ANALYZE TABLE</codeph> statement in Hive, for the entire table or a specific partition.
<codeblock>ANALYZE TABLE <varname>tablename</varname> [PARTITION(<varname>partcol1</varname>[=<varname>val1</varname>], <varname>partcol2</varname>[=<varname>val2</varname>], ...)] COMPUTE STATISTICS [NOSCAN];</codeblock>
For example, to gather statistics for a non-partitioned table:
<codeblock>ANALYZE TABLE customer COMPUTE STATISTICS;</codeblock>
To gather statistics for a <codeph>store</codeph> table partitioned by state and city, and both of its
partitions:
<codeblock>ANALYZE TABLE store PARTITION(s_state, s_county) COMPUTE STATISTICS;</codeblock>
To gather statistics for the <codeph>store</codeph> table and only the partitions for California:
<codeblock>ANALYZE TABLE store PARTITION(s_state='CA', s_county) COMPUTE STATISTICS;</codeblock>
</li>
<li>
Load the data through the <codeph>INSERT OVERWRITE</codeph> statement in Hive, while the Hive setting
<b>hive.stats.autogather</b> is enabled.
</li>
</ul>
-->
<p rev="2.0.1">
<!-- Additional info as a result of IMPALA-1420 -->
<!-- Keep checking if https://issues.apache.org/jira/browse/HIVE-8648 ever gets fixed and when that fix makes it into an Impala release. -->
For your very largest tables, you might find that <codeph>COMPUTE STATS</codeph> or even <codeph>COMPUTE INCREMENTAL STATS</codeph>
takes so long to scan the data that running it regularly is impractical. In such a case, after adding a partition or inserting new data,
you can update just the number-of-rows property through an <codeph>ALTER TABLE</codeph> statement.
See <xref href="impala_perf_stats.xml#perf_table_stats_manual"/> for details.
Because the column statistics might be left in a stale state, do not use this technique as a replacement
for <codeph>COMPUTE STATS</codeph>. Only use this technique if all other means of collecting statistics are impractical, or as a
low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations.
</p>
</conbody>
</concept>
<concept rev="2.1.0" id="perf_stats_checking">
<title>Detecting Missing Statistics</title>
<conbody>
<p>
You can check whether a specific table has statistics using the <codeph>SHOW TABLE STATS</codeph> statement
(for any table) or the <codeph>SHOW PARTITIONS</codeph> statement (for a partitioned table). Both
statements display the same information. If a table or a partition does not have any statistics, the
<codeph>#Rows</codeph> field contains <codeph>-1</codeph>. Once you compute statistics for the table or
partition, the <codeph>#Rows</codeph> field changes to an accurate value.
</p>
<p>
The following example shows a table that initially does not have any statistics. The <codeph>SHOW TABLE
STATS</codeph> statement displays different values for <codeph>#Rows</codeph> before and after the
<codeph>COMPUTE STATS</codeph> operation.
</p>
<codeblock>[localhost:21000] &gt; create table no_stats (x int);
[localhost:21000] &gt; show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| -1 | 0 | 0B | NOT CACHED | TEXT | false |
+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] &gt; compute stats no_stats;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] &gt; show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| 0 | 0 | 0B | NOT CACHED | TEXT | false |
+-------+--------+------+--------------+--------+-------------------+
</codeblock>
<p>
The following example shows a similar progression with a partitioned table. Initially,
<codeph>#Rows</codeph> is <codeph>-1</codeph>. After a <codeph>COMPUTE STATS</codeph> operation,
<codeph>#Rows</codeph> changes to an accurate value. Any newly added partition starts with no statistics,
meaning that you must collect statistics after adding a new partition.
</p>
<codeblock>[localhost:21000] &gt; create table no_stats_partitioned (x int) partitioned by (year smallint);
[localhost:21000] &gt; show table stats no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1 | 0 | 0B | 0B | | |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] &gt; show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1 | 0 | 0B | 0B | | |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] &gt; alter table no_stats_partitioned add partition (year=2013);
[localhost:21000] &gt; compute stats no_stats_partitioned;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] &gt; alter table no_stats_partitioned add partition (year=2014);
[localhost:21000] &gt; show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| 2013 | 0 | 0 | 0B | NOT CACHED | TEXT | false |
| 2014 | -1 | 0 | 0B | NOT CACHED | TEXT | false |
| Total | 0 | 0 | 0B | 0B | | |
+-------+-------+--------+------+--------------+--------+-------------------+
</codeblock>
<note>
Because the default <codeph>COMPUTE STATS</codeph> statement creates and updates statistics for all
partitions in a table, if you expect to frequently add new partitions, use the <codeph>COMPUTE INCREMENTAL
STATS</codeph> syntax instead, which lets you compute stats for a single specified partition, or only for
those partitions that do not already have incremental stats.
</note>
<p>
If checking each individual table is impractical, perhaps because of a large number of tables or because
views hide the underlying base tables, you can also check for missing statistics at the level of a
particular query. Use the <codeph>EXPLAIN</codeph> statement to preview query efficiency before actually
running the query. Use the query profile output available through the <codeph>PROFILE</codeph> command in
<cmdname>impala-shell</cmdname> or the web UI to verify query execution and timing after running the query.
Both the <codeph>EXPLAIN</codeph> plan and the <codeph>PROFILE</codeph> output display a warning if any
tables or partitions involved in the query do not have statistics.
</p>
<codeblock>[localhost:21000] &gt; create table no_stats (x int);
[localhost:21000] &gt; explain select count(*) from no_stats;
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| incremental_stats.no_stats |
| |
| 03:AGGREGATE [FINALIZE] |
| | output: count:merge(*) |
| | |
| 02:EXCHANGE [UNPARTITIONED] |
| | |
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HDFS [incremental_stats.no_stats] |
| partitions=1/1 files=0 size=0B |
+------------------------------------------------------------------------------------+
</codeblock>
<p>
Because Impala uses the <term>partition pruning</term> technique when possible to evaluate only the
relevant partitions, a partitioned table with statistics for some partitions and not others might or might
not trigger the warning in the <codeph>EXPLAIN</codeph> output, depending on which partitions the query
actually reads. For example, you might see the warning for some queries against the same table but not for others:
</p>
<codeblock>-- No warning because all the partitions for the year 2012 have stats.
EXPLAIN SELECT ... FROM t1 WHERE year = 2012;
-- Missing stats warning because one or more partitions in this range
-- do not have stats.
EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009;
</codeblock>
<p>
To confirm whether any partitions in the table are missing statistics, you can explain a query that
scans the entire table, such as <codeph>SELECT COUNT(*) FROM <varname>table_name</varname></codeph>.
</p>
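<p>
A minimal sketch, using the <codeph>t1</codeph> table from the previous example:
</p>
<codeblock>-- Scans all partitions, so the missing-stats warning appears
-- if statistics are absent for any partition of the table.
EXPLAIN SELECT COUNT(*) FROM t1;
</codeblock>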
</conbody>
</concept>
<concept rev="2.1.0" id="perf_stats_collecting">
<title>Keeping Statistics Up to Date</title>
<conbody>
<p>
When the contents of a table or partition change significantly, recompute the stats for the relevant table
or partition. The degree of change that qualifies as <q>significant</q> varies, depending on the absolute
and relative sizes of the tables. Typically, if you add more than 30% more data to a table, it is
worthwhile to recompute stats, because the differences in number of rows and number of distinct values
might cause Impala to choose a different join order when that table is used in join queries. This guideline
is most important for the largest tables. For example, adding 30% new data to a table containing 1 TB has a
greater effect on join order than adding 30% to a table containing only a few megabytes, and the larger
table has a greater effect on query performance if Impala chooses a suboptimal join order as a result of
outdated statistics.
</p>
<p>
If you reload a complete new set of data for a table, but the number of rows and number of distinct values
for each column is relatively unchanged from before, you do not need to recompute stats for the table.
</p>
<p>
If the statistics for a table are out of date, and the table's large size makes it impractical to recompute
new stats immediately, you can use the <codeph>DROP STATS</codeph> statement to remove the obsolete
statistics, making it easier to identify tables that need a new <codeph>COMPUTE STATS</codeph> operation.
</p>
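<p>
A minimal sketch, using a hypothetical table name:
</p>
<codeblock>-- Remove all table and column statistics; #Rows reverts to -1 in the
-- SHOW TABLE STATS output until the statistics are recomputed.
drop stats huge_logs;
</codeblock>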
<p>
For a large partitioned table, consider using the incremental stats feature available in Impala 2.1.0 and
higher, as explained in <xref href="impala_perf_stats.xml#perf_stats_incremental"/>. If you add a new
partition to a table, it is worthwhile to recompute incremental stats, because the operation only scans the
data for that one new partition.
</p>
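<p>
A minimal sketch of that workflow, assuming a hypothetical table partitioned by <codeph>year</codeph> and
<codeph>month</codeph>:
</p>
<codeblock>alter table sales_by_month add partition (year=2014, month=1);
-- ...load data into the new partition, then:
compute incremental stats sales_by_month;
-- Only the new partition is scanned; the existing incremental stats are reused.
</codeblock>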
</conbody>
</concept>
<!-- Might deserve its own conceptual topic at some point. -->
<concept audience="hidden" rev="1.2.2" id="perf_stats_joins">
<title>How Statistics Are Used in Join Queries</title>
<conbody>
<p></p>
</conbody>
</concept>
<!-- Might deserve its own conceptual topic at some point. -->
<concept audience="hidden" rev="1.2.2" id="perf_stats_inserts">
<title>How Statistics Are Used in INSERT Operations</title>
<conbody>
<p conref="../shared/impala_common.xml#common/insert_hints"/>
</conbody>
</concept>
<concept rev="1.2.2" id="perf_table_stats_manual">
<title>Setting the NUMROWS Value Manually through ALTER TABLE</title>
<conbody>
<p>
The most crucial piece of data in all the statistics is the number of rows in the table (for both
unpartitioned and partitioned tables) and in each partition (for partitioned tables). The <codeph>COMPUTE STATS</codeph>
statement always gathers statistics about all columns, as well as overall table statistics. If it is not
practical to do a full <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>
operation after adding a partition or inserting data, or if you can see that Impala would produce a more
efficient plan if the number of rows were different, you can manually set the number of rows through an
<codeph>ALTER TABLE</codeph> statement:
</p>
<codeblock>
-- Set total number of rows. Applies to both unpartitioned and partitioned tables.
alter table <varname>table_name</varname> set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true');
-- Set total number of rows for a specific partition. Applies to partitioned tables only.
-- You must specify all the partition key columns in the PARTITION clause.
alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=<varname>val1</varname>,<varname>keycol2</varname>=<varname>val2</varname>...) set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true');
</codeblock>
<p>
This statement avoids re-scanning any data files. (The requirement to include the <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph> property is relatively new, as a
result of the issue <xref href="https://issues.apache.org/jira/browse/HIVE-8648" scope="external" format="html">HIVE-8648</xref>
for the Hive metastore.)
</p>
<codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/>
<p>
For a partitioned table, update both the per-partition number of rows and the number of rows for the whole
table:
</p>
<codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/>
<p>
In practice, the <codeph>COMPUTE STATS</codeph> statement, or <codeph>COMPUTE INCREMENTAL STATS</codeph>
for a partitioned table, should be fast and convenient enough that this technique is only useful for the very
largest partitioned tables.
<!--
It is most useful as a workaround for in case of performance issues where you might adjust the <codeph>numRows</codeph> value higher
or lower to produce the ideal join order.
-->
<!-- Following wording is duplicated from earlier. Consider conref'ing. -->
Because the column statistics might be left in a stale state, do not use this technique as a replacement
for <codeph>COMPUTE STATS</codeph>. Only use this technique if all other means of collecting statistics are impractical, or as a
low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations.
</p>
</conbody>
</concept>
<concept id="perf_column_stats_manual" rev="2.6.0 IMPALA-3369">
<title>Setting Column Stats Manually through ALTER TABLE</title>
<conbody>
<p>
In <keyword keyref="impala26_full"/> and higher, you can also use the <codeph>SET COLUMN STATS</codeph>
clause of <codeph>ALTER TABLE</codeph> to manually set or change column statistics.
Only use this technique in cases where it is impractical to run
<codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>
frequently enough to keep up with data changes for a huge table.
</p>
<p conref="../shared/impala_common.xml#common/set_column_stats_example"/>
</conbody>
</concept>
<concept rev="1.2.2" id="perf_stats_examples">
<title>Examples of Using Table and Column Statistics with Impala</title>
<conbody>
<p>
The following examples walk through a sequence of <codeph>SHOW TABLE STATS</codeph>, <codeph>SHOW COLUMN
STATS</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>SELECT</codeph> and <codeph>INSERT</codeph>
statements to illustrate various aspects of how Impala uses statistics to help optimize queries.
</p>
<p>
This example shows table and column statistics for the <codeph>STORE</codeph> table used in the
<xref href="http://www.tpc.org/tpcds/" scope="external" format="html">TPC-DS benchmarks for decision
support</xref> systems. It is a tiny table holding data for 12 stores. Initially, before any statistics are
gathered by a <codeph>COMPUTE STATS</codeph> statement, most of the numeric fields show placeholder values
of -1, indicating that the figures are unknown. The figures that are filled in are values that are easily
countable or deducible at the physical level, such as the number of files, the total data size of the files,
and the maximum and average sizes for data types that have a constant size, such as <codeph>INT</codeph>,
<codeph>FLOAT</codeph>, and <codeph>TIMESTAMP</codeph>.
</p>
<codeblock>[localhost:21000] &gt; show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+--------+--------+
| -1 | 1 | 3.08KB | TEXT |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.03s
[localhost:21000] &gt; show column stats store;
+--------------------+-----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+----------+
| s_store_sk | INT | -1 | -1 | 4 | 4 |
| s_store_id | STRING | -1 | -1 | -1 | -1 |
| s_rec_start_date | TIMESTAMP | -1 | -1 | 16 | 16 |
| s_rec_end_date | TIMESTAMP | -1 | -1 | 16 | 16 |
| s_closed_date_sk | INT | -1 | -1 | 4 | 4 |
| s_store_name | STRING | -1 | -1 | -1 | -1 |
| s_number_employees | INT | -1 | -1 | 4 | 4 |
| s_floor_space | INT | -1 | -1 | 4 | 4 |
| s_hours | STRING | -1 | -1 | -1 | -1 |
| s_manager | STRING | -1 | -1 | -1 | -1 |
| s_market_id | INT | -1 | -1 | 4 | 4 |
| s_geography_class | STRING | -1 | -1 | -1 | -1 |
| s_market_desc | STRING | -1 | -1 | -1 | -1 |
| s_market_manager | STRING | -1 | -1 | -1 | -1 |
| s_division_id | INT | -1 | -1 | 4 | 4 |
| s_division_name | STRING | -1 | -1 | -1 | -1 |
| s_company_id | INT | -1 | -1 | 4 | 4 |
| s_company_name | STRING | -1 | -1 | -1 | -1 |
| s_street_number | STRING | -1 | -1 | -1 | -1 |
| s_street_name | STRING | -1 | -1 | -1 | -1 |
| s_street_type | STRING | -1 | -1 | -1 | -1 |
| s_suite_number | STRING | -1 | -1 | -1 | -1 |
| s_city | STRING | -1 | -1 | -1 | -1 |
| s_county | STRING | -1 | -1 | -1 | -1 |
| s_state | STRING | -1 | -1 | -1 | -1 |
| s_zip | STRING | -1 | -1 | -1 | -1 |
| s_country | STRING | -1 | -1 | -1 | -1 |
| s_gmt_offset | FLOAT | -1 | -1 | 4 | 4 |
| s_tax_percentage | FLOAT | -1 | -1 | 4 | 4 |
+--------------------+-----------+------------------+--------+----------+----------+
Returned 29 row(s) in 0.04s</codeblock>
<p>
With the Hive <codeph>ANALYZE TABLE</codeph> statement for column statistics, you had to specify each
column for which to gather statistics. The Impala <codeph>COMPUTE STATS</codeph> statement automatically
gathers statistics for all columns, because it reads through the entire table relatively quickly and can
efficiently compute the values for all the columns. This example shows how after running the
<codeph>COMPUTE STATS</codeph> statement, statistics are filled in for both the table and all its columns:
</p>
<codeblock>[localhost:21000] &gt; compute stats store;
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 1 partition(s) and 29 column(s). |
+------------------------------------------+
Returned 1 row(s) in 1.88s
[localhost:21000] &gt; show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+--------+--------+
| 12 | 1 | 3.08KB | TEXT |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] &gt; show column stats store;
+--------------------+-----------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+-------------------+
| s_store_sk | INT | 12 | -1 | 4 | 4 |
| s_store_id | STRING | 6 | -1 | 16 | 16 |
| s_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16 |
| s_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16 |
| s_closed_date_sk | INT | 3 | -1 | 4 | 4 |
| s_store_name | STRING | 8 | -1 | 5 | 4.25 |
| s_number_employees | INT | 9 | -1 | 4 | 4 |
| s_floor_space | INT | 10 | -1 | 4 | 4 |
| s_hours | STRING | 2 | -1 | 8 | 7.083300113677979 |
| s_manager | STRING | 7 | -1 | 15 | 12 |
| s_market_id | INT | 7 | -1 | 4 | 4 |
| s_geography_class | STRING | 1 | -1 | 7 | 7 |
| s_market_desc | STRING | 10 | -1 | 94 | 55.5 |
| s_market_manager | STRING | 7 | -1 | 16 | 14 |
| s_division_id | INT | 1 | -1 | 4 | 4 |
| s_division_name | STRING | 1 | -1 | 7 | 7 |
| s_company_id | INT | 1 | -1 | 4 | 4 |
| s_company_name | STRING | 1 | -1 | 7 | 7 |
| s_street_number | STRING | 9 | -1 | 3 | 2.833300113677979 |
| s_street_name | STRING | 12 | -1 | 11 | 6.583300113677979 |
| s_street_type | STRING | 8 | -1 | 9 | 4.833300113677979 |
| s_suite_number | STRING | 11 | -1 | 9 | 8.25 |
| s_city | STRING | 2 | -1 | 8 | 6.5 |
| s_county | STRING | 1 | -1 | 17 | 17 |
| s_state | STRING | 1 | -1 | 2 | 2 |
| s_zip | STRING | 2 | -1 | 5 | 5 |
| s_country | STRING | 1 | -1 | 13 | 13 |
| s_gmt_offset | FLOAT | 1 | -1 | 4 | 4 |
| s_tax_percentage | FLOAT | 5 | -1 | 4 | 4 |
+--------------------+-----------+------------------+--------+----------+-------------------+
Returned 29 row(s) in 0.04s</codeblock>
<p>
The following example shows how statistics are represented for a partitioned table. In this case, we have
set up a table to hold the world's most trivial census data, a single <codeph>STRING</codeph> field,
partitioned by a <codeph>YEAR</codeph> column. The table statistics include a separate entry for each
partition, plus final totals for the numeric fields. The column statistics include some easily deducible
facts for the partitioning column, such as the number of distinct values (the number of partition
subdirectories).
<!-- and the number of <codeph>NULL</codeph> values (none in this case). -->
</p>
<codeblock>[localhost:21000] &gt; describe census;
+------+----------+---------+
| name | type | comment |
+------+----------+---------+
| name | string | |
| year | smallint | |
+------+----------+---------+
Returned 2 row(s) in 0.02s
[localhost:21000] &gt; show table stats census;
+-------+-------+--------+------+---------+
| year | #Rows | #Files | Size | Format |
+-------+-------+--------+------+---------+
| 2000 | -1 | 0 | 0B | TEXT |
| 2004 | -1 | 0 | 0B | TEXT |
| 2008 | -1 | 0 | 0B | TEXT |
| 2010 | -1 | 0 | 0B | TEXT |
| 2011 | 0 | 1 | 22B | TEXT |
| 2012 | -1 | 1 | 22B | TEXT |
| 2013 | -1 | 1 | 231B | PARQUET |
| Total | 0 | 3 | 275B | |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] &gt; show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name | STRING | -1 | -1 | -1 | -1 |
| year | SMALLINT | 7 | -1 | 2 | 2 |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s</codeblock>
<p>
The following example shows how the statistics are filled in by a <codeph>COMPUTE STATS</codeph> statement
in Impala.
</p>
<codeblock>[localhost:21000] &gt; compute stats census;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 2.16s
[localhost:21000] &gt; show table stats census;
+-------+-------+--------+------+---------+
| year | #Rows | #Files | Size | Format |
+-------+-------+--------+------+---------+
| 2000 | -1 | 0 | 0B | TEXT |
| 2004 | -1 | 0 | 0B | TEXT |
| 2008 | -1 | 0 | 0B | TEXT |
| 2010 | -1 | 0 | 0B | TEXT |
| 2011 | 4 | 1 | 22B | TEXT |
| 2012 | 4 | 1 | 22B | TEXT |
| 2013 | 1 | 1 | 231B | PARQUET |
| Total | 9 | 3 | 275B | |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] &gt; show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name | STRING | 4 | -1 | 5 | 4.5 |
| year | SMALLINT | 7 | -1 | 2 | 2 |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s</codeblock>
<p rev="1.4.0">
For examples showing how some queries work differently when statistics are available, see
<xref href="impala_perf_joins.xml#perf_joins_examples"/>. You can see how Impala executes a query
differently in each case by observing the <codeph>EXPLAIN</codeph> output before and after collecting
statistics. Measure the query times before and after collecting statistics, and examine the throughput
numbers in the corresponding <codeph>SUMMARY</codeph> or <codeph>PROFILE</codeph> output, to verify how
much the improved plan speeds up performance.
</p>
</conbody>
</concept>
</concept>