 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="tablesample" rev="IMPALA-5309">

   <title>TABLESAMPLE Clause</title>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="SQL"/>
       <data name="Category" value="Querying"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need
       to explore the data distribution within the table, the table is very large,
       and it is impractical or unnecessary to process all the data from the table
       or selected partitions.
     </p>

     <p>
       The clause makes the query process a randomized set of data files from the
       table, so that the total volume of data is greater than or equal to the specified
       percentage of data bytes within that table or, for a partitioned table, within
       the set of partitions that remain after partition pruning.
     </p>

     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

 <codeblock>
   <ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph>
 </codeblock>

     <p>
       The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table alias.
     </p>
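
     <p>
       As a minimal sketch of this placement rule, the following queries attach the
       clause first to a bare table name and then to a table alias. The table
       <codeph>sample_demo</codeph> is the one created in the examples later in this
       topic. The alias <codeph>t</codeph> and the seed value are arbitrary choices
       made only for illustration:
     </p>

 <codeblock><![CDATA[
 -- Clause directly after the table name.
 select count(*) from sample_demo tablesample system(10);

 -- Clause after the table alias (alias t is illustrative).
 select count(t.x) from sample_demo t tablesample system(10) repeatable(1234);
 ]]>
 </codeblock>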

     <p>
       The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently,
       Impala only supports a single sampling method named <codeph>SYSTEM</codeph>.
     </p>

     <p>
       The <varname>percentage</varname> argument is an integer literal from 0 to 100.
       A percentage of 0 produces an empty result set for a particular table reference,
       while a percentage of 100 uses the entire contents. Because the sampling works by
       selecting a random set of data files, the proportion of sampled data from the
       table may be greater than the specified percentage, based on the number and sizes
       of the underlying data files. See the usage notes for details.
     </p>

     <p>
       The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary
       positive integer seed value that ensures that when the query is run again, the
       sampling selects the same set of data files each time. <codeph>REPEATABLE</codeph>
       does not have a default value. If you omit the <codeph>REPEATABLE</codeph> keyword,
       the random seed is derived from the current time.
     </p>

     <p conref="../shared/impala_common.xml#common/added_in_290"/>

     <p rev="2.12.0 IMPALA-5310">
       See <keyword keyref="compute_stats"/> for the
         <codeph>TABLESAMPLE</codeph> clause used in the <codeph>COMPUTE
         STATS</codeph> statement.
     </p>
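
     <p>
       For orientation only, a statement of that kind might look like the following
       sketch. The percentage and seed are arbitrary illustrative values, and the
       <codeph>COMPUTE STATS</codeph> topic remains the authoritative reference for
       the syntax and its restrictions:
     </p>

 <codeblock><![CDATA[
 -- Illustrative values: 50 percent sample, seed 1234.
 compute stats sample_demo tablesample system(50) repeatable(1234);
 ]]>
 </codeblock>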

     <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

     <p>
       You might use this clause with aggregation queries, such as finding
       the approximate average, minimum, or maximum where exact precision
       is not required. You can use these findings to plan the most effective
       strategy for constructing queries against the full table or designing
       a partitioning strategy for the data.
     </p>
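
     <p>
       A query along the following lines is one way to obtain such approximate
       aggregates. It assumes the <codeph>sample_demo</codeph> table from the
       examples later in this topic, and the 10 percent figure is arbitrary.
       The results vary from run to run unless you add a
       <codeph>REPEATABLE</codeph> clause:
     </p>

 <codeblock><![CDATA[
 -- Approximate aggregates computed from roughly 10% (or more) of the data bytes.
 select avg(x), min(x), max(x)
   from sample_demo tablesample system(10);
 ]]>
 </codeblock>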

     <p>
       Some other database systems have a <codeph>TABLESAMPLE</codeph> clause.
       The Impala syntax for this clause is modeled on the syntax for popular
       relational databases, not the Hive <codeph>TABLESAMPLE</codeph> clause.
       For example, there is no <codeph>BUCKETS</codeph> keyword as in HiveQL.
     </p>

     <p>
       The precision of the <varname>percentage</varname> threshold depends on
       the number and sizes of the underlying data files. Impala brings in
       additional data files, one at a time, until the number of bytes reaches or
       exceeds the specified percentage of the total number of bytes for the
       entire set of table data. The precision of the percentage threshold is higher
       when the table contains many data files with consistent sizes. See the
       code listings later in this section for examples.
     </p>

     <p>
       When you estimate characteristics of the data distribution based on sampling
       a percentage of the table data, be aware that the data might be unevenly distributed
       between different files. Do not assume that the percentage figure reflects the
       percentage of rows in the table. For example, one file might contain all blank values
       for a <codeph>STRING</codeph> column, while another file contains long strings
       in that column; therefore, two files of similar size could hold very different numbers of rows.
       Likewise, a table created with the <codeph>SORT BY</codeph> clause might
       contain narrow ranges of values for the sort columns, making it impractical to
       extrapolate the number of distinct values for those columns based on sampling
       only some of the data files.
     </p>

     <p>
       Because a sample of the table data might not contain all values for a particular
       column, if the <codeph>TABLESAMPLE</codeph> clause is used in a join query, the
       key relationships between the tables might produce incomplete result sets
       compared to joins using all the table data. For example, if you join 50%
       of table A with 50% of table B, some values in the join columns might
       not match between the two tables, even though overall there is a 1:1
       relationship between the tables.
     </p>
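
     <p>
       The following sketch illustrates the join scenario described above. The tables
       <codeph>orders</codeph> and <codeph>customers</codeph> and the column
       <codeph>customer_id</codeph> are hypothetical names used only for illustration.
       Sampling a single side of the join, as in the second query, is one possible way
       to keep every potential match available on the other side:
     </p>

 <codeblock><![CDATA[
 -- Hypothetical tables: orders and customers, joined on customer_id.
 -- Sampling both sides can drop matching rows from either side.
 select count(*)
   from orders o tablesample system(50) repeatable(1)
   join customers c tablesample system(50) repeatable(1)
     on o.customer_id = c.customer_id;

 -- Sampling only one side leaves all rows of the other side available to match.
 select count(*)
   from orders o tablesample system(50) repeatable(1)
   join customers c
     on o.customer_id = c.customer_id;
 ]]>
 </codeblock>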

     <p>
       The <codeph>REPEATABLE</codeph> keyword makes identical queries use a
       consistent set of data files when the query is repeated. You specify an
       arbitrary integer key that acts as a seed value when Impala randomly
       selects the set of data files to use in the query. This technique
       lets you verify correctness, examine performance, and so on for queries
       using the <codeph>TABLESAMPLE</codeph> clause without the sampled data
       being different each time. The repeatable aspect is reset (that is, the
       set of selected data files may change) any time the contents of the table
       change. The statements or operations that can make sampling results
       non-repeatable are:
     </p>

     <ul>
       <li>
         <codeph>INSERT</codeph>.
       </li>
       <li>
         <codeph>TRUNCATE TABLE</codeph>.
       </li>
       <li>
         <codeph>LOAD DATA</codeph>.
       </li>
       <li>
         <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph>
         after files are added or removed by a non-Impala mechanism.
       </li>
     </ul>

     <p>
       This clause is similar in some ways to the <codeph>LIMIT</codeph> clause,
       because both serve to limit the size of the intermediate data and final
       result set. <codeph>LIMIT 0</codeph> is more efficient than
       <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying that a query can execute
       without producing any results. <codeph>TABLESAMPLE SYSTEM(<varname>n</varname>)</codeph>
       often makes query processing more efficient than using a <codeph>LIMIT</codeph> clause
       by itself, because all phases of query execution use less data overall.
       If the intent is to retrieve some representative values from the table
       in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>,
       <codeph>ORDER BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query.
     </p>
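
     <p>
       As a rough sketch of that combination, the following query sorts and limits
       only the sampled rows. It assumes the <codeph>sample_demo</codeph> table from
       the examples later in this topic, and the percentage and row limit are
       arbitrary illustrative values:
     </p>

 <codeblock><![CDATA[
 -- Sample first, then sort and limit only the sampled rows.
 select x, s
   from sample_demo tablesample system(20)
   order by x desc
   limit 5;
 ]]>
 </codeblock>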

     <p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
     <p>
       When you query a partitioned table, any partition pruning happens
       before Impala selects the data files to sample. For example, in a
       table partitioned by year, a query with <codeph>WHERE year = 2017</codeph>
       and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause would sample
       data files representing at least 10% of the bytes present in the
       2017 partition.
     </p>
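
     <p>
       Such a query might be written as follows. The table name <codeph>sales</codeph>
       is hypothetical, and the sketch assumes that <codeph>year</codeph> is the
       partition key column:
     </p>

 <codeblock><![CDATA[
 -- Hypothetical table: sales, partitioned by a column named year.
 -- Pruning to the 2017 partition happens before the files are sampled.
 select count(*)
   from sales tablesample system(10)
   where year = 2017;
 ]]>
 </codeblock>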

     <p conref="../shared/impala_common.xml#common/s3_blurb"/>
     <p>
       This clause applies to S3 tables the same way as tables
       with data files stored on HDFS.
     </p>

     <p conref="../shared/impala_common.xml#common/adls_blurb"/>
     <p>
       This clause applies to ADLS tables the same way as tables
       with data files stored on HDFS.
     </p>

     <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
     <p>
       This clause does not apply to Kudu tables.
     </p>

     <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
     <p>
       This clause does not apply to HBase tables.
     </p>

     <p conref="../shared/impala_common.xml#common/performance_blurb"/>
     <p>
       From a performance perspective, the <codeph>TABLESAMPLE</codeph>
       clause is especially valuable for exploratory queries on
       text, Avro, or other non-Parquet file formats. Text-based
       or row-oriented file formats must process substantial amounts of
       redundant data for queries that derive aggregate results such as
       <codeph>MAX()</codeph>, <codeph>MIN()</codeph>, or <codeph>AVG()</codeph>
       for a single column. Therefore, you might use <codeph>TABLESAMPLE</codeph>
       early in the ETL pipeline, when data is still in raw text format
       and has not been converted to Parquet or moved into a partitioned
       table.
     </p>

     <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

     <p>
       This clause applies only to tables that use a storage layer
       with underlying raw data files, such as HDFS, Amazon S3,
       or Microsoft ADLS.
     </p>

     <p>
       This clause does not apply to table references that represent views.
       A query that applies the <codeph>TABLESAMPLE</codeph> clause to a
       view or a subquery fails with a semantic error.
     </p>
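
     <p>
       The following sketch contrasts the disallowed forms with one possible
       alternative, which applies the clause to the base table inside the subquery.
       The view name <codeph>sample_demo_view</codeph> is hypothetical, and the
       failing statements are commented out:
     </p>

 <codeblock><![CDATA[
 -- Hypothetical view over the sample_demo table.
 create view sample_demo_view as select x, s from sample_demo;

 -- Expected to fail: the clause is applied to a view.
 -- select count(*) from sample_demo_view tablesample system(10);

 -- Expected to fail: the clause is applied to a subquery.
 -- select count(*) from (select x from sample_demo) v tablesample system(10);

 -- Alternative: sample the base table inside the subquery.
 select count(*) from (select x from sample_demo tablesample system(10)) v;
 ]]>
 </codeblock>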

     <p>
       Because the sampling works at the level of entire data files, it
       is by nature coarse-grained. It is possible to specify a small
       sample percentage but still process a substantial portion of the
       table data if the table contains relatively few data files, if
       each data file is very large, or if the data files vary substantially
       in size. Be sure that you understand the data distribution and physical
       file layout so that you can verify whether the results are suitable for
       extrapolation. For example, if the table contains only a single data file,
       the <q>sample</q> will consist of all the table data regardless of
       the percentage you specify. If the table contains data files of
       1 GiB, 1 GiB, and 1 KiB, when you specify a sampling percentage of
       50 you would either process slightly more than 50% of the table
       (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB),
       depending on which data files were selected for sampling.
     </p>
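
     <p>
       One way to check the physical file layout before relying on a sample is to
       list the data files and their sizes. This sketch assumes the
       <codeph>sample_demo</codeph> table from the examples later in this topic:
     </p>

 <codeblock><![CDATA[
 -- List each data file and its size. Few files, or files of very
 -- different sizes, mean the sample can overshoot the percentage.
 show files in sample_demo;

 -- Show the file count and the total size of the table data.
 show table stats sample_demo;
 ]]>
 </codeblock>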

     <p>
       If data files are added by a non-Impala mechanism, and the
       table metadata is not updated by a <codeph>REFRESH</codeph>
       or <codeph>INVALIDATE METADATA</codeph> statement, the
       <codeph>TABLESAMPLE</codeph> clause does not consider those
       new files when computing the number of bytes in the table
       or selecting which files to sample.
     </p>

     <p>
       If data files are removed by a non-Impala mechanism, and the
       table metadata is not updated by a <codeph>REFRESH</codeph>
       or <codeph>INVALIDATE METADATA</codeph> statement, the
       query fails if the <codeph>TABLESAMPLE</codeph> clause
       attempts to reference any of the missing files.
     </p>
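
     <p>
       A minimal sketch of the remedy, assuming the <codeph>sample_demo</codeph>
       table from the examples later in this topic, is to reload the file metadata
       before running the sampling query:
     </p>

 <codeblock><![CDATA[
 -- After files are added or removed outside Impala, reload the file metadata.
 refresh sample_demo;

 -- The sampling now considers the current set of data files.
 select count(*) from sample_demo tablesample system(50);
 ]]>
 </codeblock>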

     <p conref="../shared/impala_common.xml#common/example_blurb"/>

     <p>
       The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause.
       These examples intentionally use very small data sets to illustrate how
       the number of files, size of each file, and overall size of data in the table
       interact with the percentage specified in the clause.
     </p>

     <p>
       These examples use an unpartitioned table, containing several files of roughly
       the same size:
     </p>

 <codeblock><![CDATA[
 create table sample_demo (x int, s string);

 insert into sample_demo values (1, 'one');
 insert into sample_demo values (2, 'two');
 insert into sample_demo values (3, 'three');
 insert into sample_demo values (4, 'four');
 insert into sample_demo values (5, 'five');

 show files in sample_demo;
 +---------------------+------+-----------+
 | Path                | Size | Partition |
 +---------------------+------+-----------+
 | 991213608_data.0.   | 7B   |           |
 | 982196806_data.0.   | 6B   |           |
 | _2122096884_data.0. | 8B   |           |
 | _586325431_data.0.  | 6B   |           |
 | 1894746258_data.0.  | 7B   |           |
 +---------------------+------+-----------+

 show table stats sample_demo;
 +-------+--------+------+--------+-------------------------+
 | #Rows | #Files | Size | Format | Location                |
 +-------+--------+------+--------+-------------------------+
 | -1    | 5      | 34B  | TEXT   | /tsample.db/sample_demo |
 +-------+--------+------+--------+-------------------------+
 ]]>
 </codeblock>

     <p>
       A query that samples 50% of the table must process at least
       17 bytes of data. Based on the sizes of the data files,
       we can predict that each such query uses 3 arbitrary files.
       Any 1 or 2 files are not enough to reach 50% of the total
       data in the table (34 bytes), so the query adds more files
       until it passes the 50% threshold:
     </p>

 <codeblock><![CDATA[
 select distinct x from sample_demo tablesample system(50);
 +---+
 | x |
 +---+
 | 4 |
 | 1 |
 | 5 |
 +---+

 select distinct x from sample_demo tablesample system(50);
 +---+
 | x |
 +---+
 | 5 |
 | 4 |
 | 2 |
 +---+

 select distinct x from sample_demo tablesample system(50);
 +---+
 | x |
 +---+
 | 5 |
 | 3 |
 | 2 |
 +---+
 ]]>
 </codeblock>

     <p>
       To help run reproducible experiments, the <codeph>REPEATABLE</codeph>
       clause causes Impala to choose the same set of files for each query.
       Although the data set being considered is deterministic, the order
       of results varies (in the absence of an <codeph>ORDER BY</codeph>
       clause) because of the way distributed queries are processed:
     </p>

 <codeblock><![CDATA[
 select distinct x from sample_demo
   tablesample system(50) repeatable (12345);
 +---+
 | x |
 +---+
 | 3 |
 | 2 |
 | 1 |
 +---+

 select distinct x from sample_demo
   tablesample system(50) repeatable (12345);
 +---+
 | x |
 +---+
 | 2 |
 | 1 |
 | 3 |
 +---+
 ]]>
 </codeblock>

     <p>
       The following examples show how uneven data distribution affects
       which data is sampled. Adding another data file containing a long
       string value changes the threshold for 50% of the total data in
       the table:
     </p>

 <codeblock><![CDATA[
 insert into sample_demo values (1000,
   'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington');

 show files in sample_demo;
 +---------------------+------+-----------+
 | Path                | Size | Partition |
 +---------------------+------+-----------+
 | 991213608_data.0.   | 7B   |           |
 | 982196806_data.0.   | 6B   |           |
 | _253317650_data.0.  | 196B |           |
 | _2122096884_data.0. | 8B   |           |
 | _586325431_data.0.  | 6B   |           |
 | 1894746258_data.0.  | 7B   |           |
 +---------------------+------+-----------+

 show table stats sample_demo;
 +-------+--------+------+--------+-------------------------+
 | #Rows | #Files | Size | Format | Location                |
 +-------+--------+------+--------+-------------------------+
 | -1    | 6      | 230B | TEXT   | /tsample.db/sample_demo |
 +-------+--------+------+--------+-------------------------+
 ]]>
 </codeblock>

     <p>
       Even though the queries do not refer to the <codeph>S</codeph>
       column containing the long value, all the sampling queries include
       the data file containing the column value <codeph>X=1000</codeph>,
       because the query cannot reach the 50% threshold (115 bytes) without
       including that file. The large file might be considered first, in which
       case it is the only file processed by the query. Alternatively, an arbitrary
       set of smaller files might be considered first, in which case the query
       processes those files in addition to the large file.
     </p>

 <codeblock><![CDATA[
 select distinct x from sample_demo tablesample system(50);
 +------+
 | x    |
 +------+
 | 1000 |
 | 3    |
 | 1    |
 +------+

 select distinct x from sample_demo tablesample system(50);
 +------+
 | x    |
 +------+
 | 1000 |
 +------+

 select distinct x from sample_demo tablesample system(50);
 +------+
 | x    |
 +------+
 | 1000 |
 | 4    |
 | 2    |
 | 1    |
 +------+
 ]]>
 </codeblock>

     <p>
       The following examples demonstrate how the <codeph>TABLESAMPLE</codeph>
       clause interacts with other table aspects, such as partitioning and file
       format:
     </p>

 <codeblock><![CDATA[
 create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;

 insert into sample_demo_partitions partition (n = 1) select * from sample_demo;
 insert into sample_demo_partitions partition (n = 2) select * from sample_demo;
 insert into sample_demo_partitions partition (n = 3) select * from sample_demo;

 show files in sample_demo_partitions;
 +--------------------------------+--------+-----------+
 | Path                           | Size   | Partition |
 +--------------------------------+--------+-----------+
 | 000000_364262785_data.0.parq   | 1.24KB | n=1       |
 | 000001_973526736_data.0.parq   | 566B   | n=1       |
 | 0000000_1300598134_data.0.parq | 1.24KB | n=2       |
 | 0000001_689099063_data.0.parq  | 568B   | n=2       |
 | 0000000_1861371709_data.0.parq | 1.24KB | n=3       |
 | 0000001_1065507912_data.0.parq | 566B   | n=3       |
 +--------------------------------+--------+-----------+

 show table stats sample_demo_partitions;
 +-------+-------+--------+--------+---------+----------------------------------------+
 | n     | #Rows | #Files | Size   | Format  | Location                               |
 +-------+-------+--------+--------+---------+----------------------------------------+
 | 1     | -1    | 2      | 1.79KB | PARQUET | /tsample.db/sample_demo_partitions/n=1 |
 | 2     | -1    | 2      | 1.80KB | PARQUET | /tsample.db/sample_demo_partitions/n=2 |
 | 3     | -1    | 2      | 1.79KB | PARQUET | /tsample.db/sample_demo_partitions/n=3 |
 | Total | -1    | 6      | 5.39KB |         |                                        |
 +-------+-------+--------+--------+---------+----------------------------------------+
 ]]>
 </codeblock>

     <p>
       If the query does not involve any partition pruning, the
       sampling applies to the data volume of the entire table:
     </p>

 <codeblock><![CDATA[
 -- 18 rows total.
 select count(*) from sample_demo_partitions;
 +----------+
 | count(*) |
 +----------+
 | 18       |
 +----------+

 -- The number of rows per data file is not
 -- perfectly balanced, therefore the count
 -- is different depending on which set of files
 -- is considered.
 select count(*) from sample_demo_partitions
   tablesample system(75);
 +----------+
 | count(*) |
 +----------+
 | 14       |
 +----------+

 select count(*) from sample_demo_partitions
   tablesample system(75);
 +----------+
 | count(*) |
 +----------+
 | 16       |
 +----------+
 ]]>
 </codeblock>

     <p>
       If the query only processes certain partitions,
       the query computes the sampling threshold based on
       the data size and set of files only from the
       relevant partitions:
     </p>

 <codeblock><![CDATA[
 select count(*) from sample_demo_partitions
   tablesample system(50) where n = 1;
 +----------+
 | count(*) |
 +----------+
 | 6        |
 +----------+

 select count(*) from sample_demo_partitions
   tablesample system(50) where n = 1;
 +----------+
 | count(*) |
 +----------+
 | 2        |
 +----------+
 ]]>
 </codeblock>

     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_select.xml#select"/>
     </p>

   </conbody>
 </concept>
	<?xml version="1.0" encoding="UTF-8"?>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->
	<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
	<concept id="tablesample" rev="IMPALA-5309">

	<title>TABLESAMPLE Clause</title>
	<prolog>
	<metadata>
	<data name="Category" value="Impala"/>
	<data name="Category" value="SQL"/>
	<data name="Category" value="Querying"/>
	<data name="Category" value="Developers"/>
	<data name="Category" value="Data Analysts"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need
	to explore the data distribution within the table, the table is very large,
	and it is impractical or unnecessary to process all the data from the table
	or selected partitions.
	</p>

	<p>
	The clause makes the query process a randomized set of data files from the
	table, so that the total volume of data is greater than or equal to the specified
	percentage of data bytes within that table. (Or the data bytes within the set of
	partitions that remain after partition pruning is performed.)
	</p>

	<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

	<codeblock>
	<ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph>
	</codeblock>

	<p>
	The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table alias.
	</p>

	<p>
	The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently,
	Impala only supports a single sampling method named <codeph>SYSTEM</codeph>.
	</p>

	<p>
	The <varname>percentage</varname> argument is an integer literal from 0 to 100.
	A percentage of 0 produces an empty result set for a particular table reference,
	while a percentage of 100 uses the entire contents. Because the sampling works by
	selecting a random set of data files, the proportion of sampled data from the
	table may be greater than the specified percentage, based on the number and sizes
	of the underlying data files. See the usage notes for details.
	</p>

	<p>
	The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary
	positive integer seed value that ensures that when the query is run again, the
	sampling selects the same set of data files each time. <codeph>REPEATABLE</codeph>
	does not have a default value. If you omit the <codeph>REPEATABLE</codeph> keyword,
	the random seed is derived from the current time.
	</p>

	<p conref="../shared/impala_common.xml#common/added_in_290"/>

	<p rev="2.12.0 IMPALA-5310">
	See <keyword keyref="compute_stats"/> for the
	<codeph>TABLESAMPLE</codeph> clause used in the <codeph>COMPUTE
	STATS</codeph> statement.
	</p>

	<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

	<p>
	You might use this clause with aggregation queries, such as finding
	the approximate average, minimum, or maximum where exact precision
	is not required. You can use these findings to plan the most effective
	strategy for constructing queries against the full table or designing
	a partitioning strategy for the data.
	</p>

	<p>
	Some other database systems have a <codeph>TABLESAMPLE</codeph> clause.
	The Impala syntax for this clause is modeled on the syntax for popular
	relational databases, not the Hive <codeph>TABLESAMPLE</codeph> clause.
	For example, there is no <codeph>BUCKETS</codeph> keyword as in HiveQL.
	</p>

	<p>
	The precision of the <varname>percentage</varname> threshold depends on
	the number and sizes of the underlying data files. Impala brings in
	additional data files, one at a time, until the number of bytes exceeds
	the specified percentage based on the total number of bytes for the
	entire set of table data. The precision of the percentage threshold is higher
	when the table contains many data files with consistent sizes. See the
	code listings later in this section for examples.
	</p>

	<p>
	When you estimate characteristics of the data distribution based on sampling
	a percentage of the table data, be aware that the data might be unevenly distributed
	between different files. Do not assume that the percentage figure reflects the
	percentage of rows in the table. For example, one file might contain all blank values
	for a <codeph>STRING</codeph> column, while another file contains long strings
	in that column; therefore, one file could contain many more rows than another.
	Likewise, a table created with the <codeph>SORT BY</codeph> clause might
	contain narrow ranges of values for the sort columns, making it impractical to
	extrapolate the number of distinct values for those columns based on sampling
	only some of the data files.
	</p>

	<p>
	Because a sample of the table data might not contain all values for a particular
	column, if the <codeph>TABLESAMPLE</codeph> is used in a join query, the
	key relationships between the tables might produce incomplete result sets
	compared to joins using all the table data. For example, if you join 50%
	of table A with 50% of table B, some values in the join columns might
	not match between the two tables, even though overall there is a 1:1
	relationship between the tables.
	</p>

	<p>
	The <codeph>REPEATABLE</codeph> keyword makes identical queries use a
	consistent set of data files when the query is repeated. You specify an
	arbitrary integer key that acts as a seed value when Impala randomly
	selects the set of data files to use in the query. This technique
	lets you verify correctness, examine performance, and so on for queries
	using the <codeph>TABLESAMPLE</codeph> clause without the sampled data
	being different each time. The repeatable aspect is reset (that is, the
	set of selected data files may change) any time the contents of the table
	change. The statements or operations that can make sampling results
	non-repeatable are:
	</p>

	<ul>
	<li>
	<codeph>INSERT</codeph>.
	</li>
	<li>
	<codeph>TRUNCATE TABLE</codeph>.
	</li>
	<li>
	<codeph>LOAD DATA</codeph>.
	</li>
	<li>
	<codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph>
	after files are added or removed by a non-Impala mechanism.
	</li>
	<li>
	</li>
	</ul>

	<p>
	This clause is similar in some ways to the <codeph>LIMIT</codeph> clause,
	because both serve to limit the size of the intermediate data and final
	result set. <codeph>LIMIT 0</codeph> is more efficient than
	<codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying that a query can execute
	without producing any results. <codeph>TABLESAMPLE SYSTEM(<varname>n</varname>)</codeph>
	often makes query processing more efficient than using a <codeph>LIMIT</codeph> clause
	by itself, because all phases of query execution use less data overall.
	If the intent is to retrieve some representative values from the table
	in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>,
	<codeph>ORDER BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query.
	</p>

	<p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
	<p>
	When you query a partitioned table, any partition pruning happens
	before Impala selects the data files to sample. For example, in a
	table partitioned by year, a query with <codeph>WHERE year = 2017</codeph>
	and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause would sample
	data files representing at least 10% of the bytes present in the
	2017 partition.
	</p>

	<p conref="../shared/impala_common.xml#common/s3_blurb"/>
	<p>
	This clause applies to S3 tables the same way as tables
	with data files stored on HDFS.
	</p>

	<p conref="../shared/impala_common.xml#common/adls_blurb"/>
	<p>
	This clause applies to ADLS tables the same way as tables
	with data files stored on HDFS.
	</p>

	<p conref="../shared/impala_common.xml#common/kudu_blurb"/>
	<p>
	This clause does not apply to Kudu tables.
	</p>

	<p conref="../shared/impala_common.xml#common/hbase_blurb"/>
	<p>
	This clause does not apply to HBase tables.
	</p>

	<p conref="../shared/impala_common.xml#common/performance_blurb"/>
	<p>
	From a performance perspective, the <codeph>TABLESAMPLE</codeph>
	clause is especially valuable for exploratory queries on
	text, Avro, or other file formats other than Parquet. Text-based
	or row-oriented file formats must process substantial amounts of
	redundant data for queries that derive aggregate results such as
	<codeph>MAX()</codeph>, <codeph>MIN()</codeph>, or <codeph>AVG()</codeph>
	for a single column. Therefore, you might use <codeph>TABLESAMPLE</codeph>
	early in the ETL pipeline, when data is still in raw text format
	and has not been converted to Parquet or moved into a partitioned
	table.
	</p>

	<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

	<p>
	This clause applies only to tables that use a storage layer
	with underlying raw data files, such as HDFS, Amazon S3,
	or Microsoft ADLS.
	</p>

	<p>
	This clause does not apply to table references that represent views.
	A query that applies the <codeph>TABLESAMPLE</codeph> clause to a
	view or a subquery fails with a semantic error.
	</p>

	<p>
	Because the sampling works at the level of entire data files, it
	is by nature coarse-grained. It is possible to specify a small
	sample percentage but still process a substantial portion of the
	table data if the table contains relatively few data files, if
	each data file is very large, or if the data files vary substantially
	in size. Be sure that you understand the data distribution and physical
	file layout so that you can verify whether the results are suitable for
	extrapolation. For example, if the table contains only a single data file,
	the <q>sample</q> consists of all the table data regardless of
	the percentage you specify. If the table contains data files of
	1 GiB, 1 GiB, and 1 KiB, a sampling percentage of 50 causes the
	query to process either slightly more than 50% of the table
	(1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB),
	depending on which data files are selected for sampling.
	</p>
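
	<p>
	One way to check the physical layout before relying on a sample is to list
	the data files and their sizes, as in the following sketch. The table name
	<codeph>big_table</codeph> is hypothetical:
	</p>

	<codeblock><![CDATA[
	-- Hypothetical table name. Few files, or files of very different sizes,
	-- mean that a sample can be much larger than the requested percentage.
	show files in big_table;

	-- Shows the number of files and total size, per partition if applicable.
	show table stats big_table;
	]]>
	</codeblock>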

	<p>
	If data files are added by a non-Impala mechanism, and the
	table metadata is not updated by a <codeph>REFRESH</codeph>
	or <codeph>INVALIDATE METADATA</codeph> statement, the
	<codeph>TABLESAMPLE</codeph> clause does not consider those
	new files when computing the number of bytes in the table
	or selecting which files to sample.
	</p>

	<p>
	If data files are removed by a non-Impala mechanism, and the
	table metadata is not updated by a <codeph>REFRESH</codeph>
	or <codeph>INVALIDATE METADATA</codeph> statement, the
	query fails if the <codeph>TABLESAMPLE</codeph> clause
	attempts to reference any of the missing files.
	</p>
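
	<p>
	A minimal sketch of avoiding both situations is to update the table
	metadata before sampling, shown here with the <codeph>sample_demo</codeph>
	table from the examples later in this topic:
	</p>

	<codeblock><![CDATA[
	-- After data files are added or removed outside of Impala,
	-- reload the file metadata for the table.
	refresh sample_demo;

	-- The sampling now considers the current set of data files.
	select count(*) from sample_demo tablesample system(50);
	]]>
	</codeblock>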

	<p conref="../shared/impala_common.xml#common/example_blurb"/>

	<p>
	The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause.
	These examples intentionally use very small data sets to illustrate how
	the number of files, size of each file, and overall size of data in the table
	interact with the percentage specified in the clause.
	</p>

	<p>
	These examples use an unpartitioned table, containing several files of roughly
	the same size:
	</p>

	<codeblock><![CDATA[
	create table sample_demo (x int, s string);

	insert into sample_demo values (1, 'one');
	insert into sample_demo values (2, 'two');
	insert into sample_demo values (3, 'three');
	insert into sample_demo values (4, 'four');
	insert into sample_demo values (5, 'five');

	show files in sample_demo;
	+---------------------+------+-----------+
	| Path                | Size | Partition |
	+---------------------+------+-----------+
	| 991213608_data.0.   | 7B   |           |
	| 982196806_data.0.   | 6B   |           |
	| _2122096884_data.0. | 8B   |           |
	| _586325431_data.0.  | 6B   |           |
	| 1894746258_data.0.  | 7B   |           |
	+---------------------+------+-----------+

	show table stats sample_demo;
	+-------+--------+------+--------+-------------------------+
	| #Rows | #Files | Size | Format | Location                |
	+-------+--------+------+--------+-------------------------+
	| -1    | 5      | 34B  | TEXT   | /tsample.db/sample_demo |
	+-------+--------+------+--------+-------------------------+
	]]>
	</codeblock>

	<p>
	A query that samples 50% of the table must process at least
	17 bytes of data. Based on the sizes of the data files,
	we can predict that each such query uses 3 arbitrary files.
	No combination of 1 or 2 files reaches 50% of the total
	data in the table (34 bytes), because the two largest files
	total only 15 bytes. Any 3 files total at least 19 bytes, so
	the query stops adding files once it has selected 3:
	</p>

	<codeblock><![CDATA[
	select distinct x from sample_demo tablesample system(50);
	+---+
	| x |
	+---+
	| 4 |
	| 1 |
	| 5 |
	+---+

	select distinct x from sample_demo tablesample system(50);
	+---+
	| x |
	+---+
	| 5 |
	| 4 |
	| 2 |
	+---+

	select distinct x from sample_demo tablesample system(50);
	+---+
	| x |
	+---+
	| 5 |
	| 3 |
	| 2 |
	+---+
	]]>
	</codeblock>

	<p>
	To help run reproducible experiments, the <codeph>REPEATABLE</codeph>
	clause causes Impala to choose the same set of files for each query
	that specifies the same seed value. Although the set of data being
	considered is deterministic, the order of results can vary (in the
	absence of an <codeph>ORDER BY</codeph> clause) because of the way
	distributed queries are processed:
	</p>

	<codeblock><![CDATA[
	select distinct x from sample_demo
	tablesample system(50) repeatable (12345);
	+---+
	| x |
	+---+
	| 3 |
	| 2 |
	| 1 |
	+---+

	select distinct x from sample_demo
	tablesample system(50) repeatable (12345);
	+---+
	| x |
	+---+
	| 2 |
	| 1 |
	| 3 |
	+---+
	]]>
	</codeblock>
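
	<p>
	For output that is reproducible in both content and order, one option is
	to combine <codeph>REPEATABLE</codeph> with an <codeph>ORDER BY</codeph>
	clause. This sketch reuses the seed value from the preceding examples;
	the output is omitted:
	</p>

	<codeblock><![CDATA[
	-- Same seed, so the same set of files is sampled on each run;
	-- ORDER BY makes the order of the result rows deterministic as well.
	select distinct x from sample_demo
	  tablesample system(50) repeatable (12345)
	  order by x;
	]]>
	</codeblock>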

	<p>
	The following examples show how uneven data distribution affects
	which data is sampled. Adding another data file containing a long
	string value changes the threshold for 50% of the total data in
	the table:
	</p>

	<codeblock><![CDATA[
	insert into sample_demo values (1000, 'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington');

	show files in sample_demo;
	+---------------------+------+-----------+
	| Path                | Size | Partition |
	+---------------------+------+-----------+
	| 991213608_data.0.   | 7B   |           |
	| 982196806_data.0.   | 6B   |           |
	| _253317650_data.0.  | 196B |           |
	| _2122096884_data.0. | 8B   |           |
	| _586325431_data.0.  | 6B   |           |
	| 1894746258_data.0.  | 7B   |           |
	+---------------------+------+-----------+

	show table stats sample_demo;
	+-------+--------+------+--------+-------------------------+
	| #Rows | #Files | Size | Format | Location                |
	+-------+--------+------+--------+-------------------------+
	| -1    | 6      | 230B | TEXT   | /tsample.db/sample_demo |
	+-------+--------+------+--------+-------------------------+
	]]>
	</codeblock>

	<p>
	Even though the queries do not refer to the <codeph>S</codeph>
	column containing the long value, all the sampling queries include
	the data file containing the column value <codeph>X=1000</codeph>,
	because the query cannot reach the 50% threshold (115 bytes) without
	including that file. If the large file is considered first, it is the
	only file processed by the query. Otherwise, the query processes an
	arbitrary set of the smaller files before it reaches the large one.
	</p>

	<codeblock><![CDATA[
	select distinct x from sample_demo tablesample system(50);
	+------+
	| x    |
	+------+
	| 1000 |
	| 3    |
	| 1    |
	+------+

	select distinct x from sample_demo tablesample system(50);
	+------+
	| x    |
	+------+
	| 1000 |
	+------+

	select distinct x from sample_demo tablesample system(50);
	+------+
	| x    |
	+------+
	| 1000 |
	| 4    |
	| 2    |
	| 1    |
	+------+
	]]>
	</codeblock>

	<p>
	The following examples demonstrate how the <codeph>TABLESAMPLE</codeph>
	clause interacts with other table aspects, such as partitioning and file
	format:
	</p>

	<codeblock><![CDATA[
	create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;

	insert into sample_demo_partitions partition (n = 1) select * from sample_demo;
	insert into sample_demo_partitions partition (n = 2) select * from sample_demo;
	insert into sample_demo_partitions partition (n = 3) select * from sample_demo;

	show files in sample_demo_partitions;
	+--------------------------------+--------+-----------+
	| Path                           | Size   | Partition |
	+--------------------------------+--------+-----------+
	| 000000_364262785_data.0.parq   | 1.24KB | n=1       |
	| 000001_973526736_data.0.parq   | 566B   | n=1       |
	| 0000000_1300598134_data.0.parq | 1.24KB | n=2       |
	| 0000001_689099063_data.0.parq  | 568B   | n=2       |
	| 0000000_1861371709_data.0.parq | 1.24KB | n=3       |
	| 0000001_1065507912_data.0.parq | 566B   | n=3       |
	+--------------------------------+--------+-----------+

	show table stats sample_demo_partitions;
	+-------+-------+--------+--------+---------+----------------------------------------+
	| n     | #Rows | #Files | Size   | Format  | Location                               |
	+-------+-------+--------+--------+---------+----------------------------------------+
	| 1     | -1    | 2      | 1.79KB | PARQUET | /tsample.db/sample_demo_partitions/n=1 |
	| 2     | -1    | 2      | 1.80KB | PARQUET | /tsample.db/sample_demo_partitions/n=2 |
	| 3     | -1    | 2      | 1.79KB | PARQUET | /tsample.db/sample_demo_partitions/n=3 |
	| Total | -1    | 6      | 5.39KB |         |                                        |
	+-------+-------+--------+--------+---------+----------------------------------------+
	]]>
	</codeblock>

	<p>
	If the query does not involve any partition pruning, the
	sampling applies to the data volume of the entire table:
	</p>

	<codeblock><![CDATA[
	-- 18 rows total.
	select count(*) from sample_demo_partitions;
	+----------+
	| count(*) |
	+----------+
	| 18       |
	+----------+

	-- The number of rows per data file is not
	-- perfectly balanced, therefore the count
	-- is different depending on which set of files
	-- is considered.
	select count(*) from sample_demo_partitions
	tablesample system(75);
	+----------+
	| count(*) |
	+----------+
	| 14       |
	+----------+

	select count(*) from sample_demo_partitions
	tablesample system(75);
	+----------+
	| count(*) |
	+----------+
	| 16       |
	+----------+
	]]>
	</codeblock>

	<p>
	If the query only processes certain partitions,
	the query computes the sampling threshold based on
	the data size and set of files only from the
	relevant partitions:
	</p>

	<codeblock><![CDATA[
	select count(*) from sample_demo_partitions
	tablesample system(50) where n = 1;
	+----------+
	| count(*) |
	+----------+
	| 6        |
	+----------+

	select count(*) from sample_demo_partitions
	tablesample system(50) where n = 1;
	+----------+
	| count(*) |
	+----------+
	| 2        |
	+----------+
	]]>
	</codeblock>

	<p conref="../shared/impala_common.xml#common/related_info"/>
	<p>
	<xref href="impala_select.xml#select"/>
	</p>

	</conbody>
	</concept>