docs/build/plain-html/topics/impala_tablesample.html - impala - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

 <meta name="copyright" content="(C) Copyright 2025" />
 <meta name="DC.rights.owner" content="(C) Copyright 2025" />
 <meta name="DC.Type" content="concept" />
 <meta name="DC.Title" content="TABLESAMPLE Clause" />
 <meta name="DC.Relation" scheme="URI" content="../topics/impala_select.html" />
 <meta name="prodname" content="Impala" />
 <meta name="prodname" content="Impala" />
 <meta name="version" content="Impala 3.4.x" />
 <meta name="version" content="Impala 3.4.x" />
 <meta name="DC.Format" content="XHTML" />
 <meta name="DC.Identifier" content="tablesample" />
 <link rel="stylesheet" type="text/css" href="../commonltr.css" />
 <title>TABLESAMPLE Clause</title>
 </head>
 <body id="tablesample">


   <h1 class="title topictitle1" id="ariaid-title1">TABLESAMPLE Clause</h1>


   <div class="body conbody">

     <p class="p">
       Specify the <code class="ph codeph">TABLESAMPLE</code> clause in cases where you need to explore the
       data distribution within the table, the table is very large, and it is impractical or
       unnecessary to process all the data from the table or selected partitions.
     </p>


     <p class="p">
       The clause makes the query process a randomized set of data files from the table, so that
       the total volume of data is greater than or equal to the specified percentage of data
       bytes within that table. (Or the data bytes within the set of partitions that remain after
       partition pruning is performed.)
     </p>


     <p class="p">
         <strong class="ph b">Syntax:</strong>
       </p>


 <pre class="pre codeblock"><code>
 <span class="ph">TABLESAMPLE SYSTEM(<var class="keyword varname">percentage</var>) [REPEATABLE(<var class="keyword varname">seed</var>)]</span>
 </code></pre>

     <p class="p">
       The <code class="ph codeph">TABLESAMPLE</code> clause comes immediately after a table name or table
       alias.
     </p>


     <p class="p">
       The <code class="ph codeph">SYSTEM</code> keyword represents the sampling method. Currently, Impala only
       supports a single sampling method named <code class="ph codeph">SYSTEM</code>.
     </p>


     <p class="p">
       The <var class="keyword varname">percentage</var> argument is an integer literal from 0 to 100. A
       percentage of 0 produces an empty result set for a particular table reference, while a
       percentage of 100 uses the entire contents. Because the sampling works by selecting a
       random set of data files, the proportion of sampled data from the table may be greater
       than the specified percentage, based on the number and sizes of the underlying data files.
       See the usage notes for details.
     </p>


     <p class="p">
       The optional <code class="ph codeph">REPEATABLE</code> keyword lets you specify an arbitrary positive
       integer seed value that ensures that when the query is run again, the sampling selects the
       same set of data files each time. <code class="ph codeph">REPEATABLE</code> does not have a default
       value. If you omit the <code class="ph codeph">REPEATABLE</code> keyword, the random seed is derived
       from the current time.
     </p>


     <p class="p">
         <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.9.0</span>
       </p>


     <p class="p">
       See <a href="impala_compute_stats.html"><span class="keyword">COMPUTE STATS Statement</span></a> for the <code class="ph codeph">TABLESAMPLE</code> clause used in
       the <code class="ph codeph">COMPUTE STATS</code> statement.
     </p>


     <p class="p">
         <strong class="ph b">Usage notes:</strong>
       </p>


     <p class="p">
       You might use this clause with aggregation queries, such as finding the approximate
       average, minimum, or maximum where exact precision is not required. You can use these
       findings to plan the most effective strategy for constructing queries against the full
       table or designing a partitioning strategy for the data.
     </p>


     <p class="p">
       Some other database systems have a <code class="ph codeph">TABLESAMPLE</code> clause. The Impala syntax
       for this clause is modeled on the syntax for popular relational databases, not the Hive
       <code class="ph codeph">TABLESAMPLE</code> clause. For example, there is no <code class="ph codeph">BUCKETS</code>
       keyword as in HiveQL.
     </p>


     <p class="p">
       The precision of the <var class="keyword varname">percentage</var> threshold depends on the number and
       sizes of the underlying data files. Impala brings in additional data files, one at a time,
       until the number of bytes exceeds the specified percentage based on the total number of
       bytes for the entire set of table data. The precision of the percentage threshold is
       higher when the table contains many data files with consistent sizes. See the code
       listings later in this section for examples.
     </p>


     <p class="p">
       When you estimate characteristics of the data distribution based on sampling a percentage
       of the table data, be aware that the data might be unevenly distributed between different
       files. Do not assume that the percentage figure reflects the percentage of rows in the
       table. For example, one file might contain all blank values for a <code class="ph codeph">STRING</code>
       column, while another file contains long strings in that column; therefore, one file could
       contain many more rows than another. Likewise, a table created with the <code class="ph codeph">SORT
       BY</code> clause might contain narrow ranges of values for the sort columns, making it
       impractical to extrapolate the number of distinct values for those columns based on
       sampling only some of the data files.
     </p>


     <p class="p">
       Because a sample of the table data might not contain all values for a particular column,
       if the <code class="ph codeph">TABLESAMPLE</code> is used in a join query, the key relationships between
       the tables might produce incomplete result sets compared to joins using all the table
       data. For example, if you join 50% of table A with 50% of table B, some values in the join
       columns might not match between the two tables, even though overall there is a 1:1
       relationship between the tables.
     </p>


     <p class="p">
       The <code class="ph codeph">REPEATABLE</code> keyword makes identical queries use a consistent set of
       data files when the query is repeated. You specify an arbitrary integer key that acts as a
       seed value when Impala randomly selects the set of data files to use in the query. This
       technique lets you verify correctness, examine performance, and so on for queries using
       the <code class="ph codeph">TABLESAMPLE</code> clause without the sampled data being different each
       time. The repeatable aspect is reset (that is, the set of selected data files may change)
       any time the contents of the table change. The statements or operations that can make
       sampling results non-repeatable are:
     </p>


     <ul class="ul">
       <li class="li">
         <code class="ph codeph">INSERT</code>.
       </li>


       <li class="li">
         <code class="ph codeph">TRUNCATE TABLE</code>.
       </li>


       <li class="li">
         <code class="ph codeph">LOAD DATA</code>.
       </li>


       <li class="li">
         <code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE METADATA</code> after files are added
         or removed by a non-Impala mechanism.
       </li>


       <li class="li"></li>

     </ul>


     <p class="p">
       This clause is similar in some ways to the <code class="ph codeph">LIMIT</code> clause, because both
       serve to limit the size of the intermediate data and final result set. <code class="ph codeph">LIMIT
       0</code> is more efficient than <code class="ph codeph">TABLESAMPLE SYSTEM(0)</code> for verifying
       that a query can execute without producing any results. <code class="ph codeph">TABLESAMPLE
       SYSTEM(<var class="keyword varname">n</var>)</code> often makes query processing more efficient than
       using a <code class="ph codeph">LIMIT</code> clause by itself, because all phases of query execution use
       less data overall. If the intent is to retrieve some representative values from the table
       in an efficient way, you might combine <code class="ph codeph">TABLESAMPLE</code>, <code class="ph codeph">ORDER
       BY</code>, and <code class="ph codeph">LIMIT</code> clauses within a single query.
     </p>


     <p class="p">
         <strong class="ph b">Partitioning:</strong>
       </p>


     <p class="p">
       When you query a partitioned table, any partition pruning happens before Impala selects
       the data files to sample. For example, in a table partitioned by year, a query with
       <code class="ph codeph">WHERE year = 2017</code> and a <code class="ph codeph">TABLESAMPLE SYSTEM(10)</code> clause
       would sample data files representing at least 10% of the bytes present in the 2017
       partition.
     </p>


     <p class="p">
         <strong class="ph b">Amazon S3 considerations:</strong>
       </p>


     <p class="p">
       This clause applies to S3 tables the same way as tables with data files stored on HDFS.
     </p>


     <p class="p">
         <strong class="ph b">ADLS considerations:</strong>
       </p>


     <p class="p">
       This clause applies to ADLS tables the same way as tables with data files stored on HDFS.
     </p>


     <p class="p">
         <strong class="ph b">Kudu considerations:</strong>
       </p>


     <p class="p">
       This clause does not apply to Kudu tables.
     </p>


     <p class="p">
         <strong class="ph b">HBase considerations:</strong>
       </p>


     <p class="p">
       This clause does not apply to HBase tables.
     </p>


     <p class="p">
         <strong class="ph b">Performance considerations:</strong>
       </p>


     <p class="p">
       From a performance perspective, the <code class="ph codeph">TABLESAMPLE</code> clause is especially
       valuable for exploratory queries on text, Avro, or other file formats other than Parquet.
       Text-based or row-oriented file formats must process substantial amounts of redundant data
       for queries that derive aggregate results such as <code class="ph codeph">MAX()</code>,
       <code class="ph codeph">MIN()</code>, or <code class="ph codeph">AVG()</code> for a single column. Therefore, you
       might use <code class="ph codeph">TABLESAMPLE</code> early in the ETL pipeline, when data is still in
       raw text format and has not been converted to Parquet or moved into a partitioned table.
     </p>


     <p class="p">
         <strong class="ph b">Restrictions:</strong>
       </p>


     <p class="p">
       This clause applies only to tables that use a storage layer with underlying raw data
       files, such as HDFS, Amazon S3, or Microsoft ADLS.
     </p>


     <p class="p">
       This clause does not apply to table references that represent views. A query that applies
       the <code class="ph codeph">TABLESAMPLE</code> clause to a view or a subquery fails with a semantic
       error.
     </p>


     <p class="p">
       Because the sampling works at the level of entire data files, it is by nature
       coarse-grained. It is possible to specify a small sample percentage but still process a
       substantial portion of the table data if the table contains relatively few data files, if
       each data file is very large, or if the data files vary substantially in size. Be sure
       that you understand the data distribution and physical file layout so that you can verify
       if the results are suitable for extrapolation. For example, if the table contains only a
       single data file, the <span class="q">"sample"</span> will consist of all the table data regardless of the
       percentage you specify. If the table contains data files of 1 GiB, 1 GiB, and 1 KiB, when
       you specify a sampling percentage of 50 you would either process slightly more than 50% of
       the table (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), depending on which
       data files were selected for sampling.
     </p>


     <p class="p">
       If data files are added by a non-Impala mechanism, and the table metadata is not updated
       by a <code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE METADATA</code> statement, the
       <code class="ph codeph">TABLESAMPLE</code> clause does not consider those new files when computing the
       number of bytes in the table or selecting which files to sample.
     </p>


     <p class="p">
       If data files are removed by a non-Impala mechanism, and the table metadata is not updated
       by a <code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE METADATA</code> statement, the query
       fails if the <code class="ph codeph">TABLESAMPLE</code> clause attempts to reference any of the missing
       files.
     </p>


     <p class="p">
         <strong class="ph b">Examples:</strong>
       </p>


     <p class="p">
       The following examples demonstrate the <code class="ph codeph">TABLESAMPLE</code> clause. These examples
       intentionally use very small data sets to illustrate how the number of files, size of each
       file, and overall size of data in the table interact with the percentage specified in the
       clause.
     </p>


     <p class="p">
       These examples use an unpartitioned table, containing several files of roughly the same
       size:
     </p>


 <pre class="pre codeblock"><code>create table sample_demo (x int, s string);

 insert into sample_demo values (1, 'one');
 insert into sample_demo values (2, 'two');
 insert into sample_demo values (3, 'three');
 insert into sample_demo values (4, 'four');
 insert into sample_demo values (5, 'five');

 show files in sample_demo;
 +---------------------+------+-----------+
 | Path                | Size | Partition |
 +---------------------+------+-----------+
 | 991213608_data.0.   | 7B   |           |
 | 982196806_data.0.   | 6B   |           |
 | _2122096884_data.0. | 8B   |           |
 | _586325431_data.0.  | 6B   |           |
 | 1894746258_data.0.  | 7B   |           |
 +---------------------+------+-----------+

 show table stats sample_demo;
 +-------+--------+------+--------+-------------------------+
 | #Rows | #Files | Size | Format | Location                |
 +-------+--------+------+--------+-------------------------+
 | -1    | 5      | 34B  | TEXT   | /tsample.db/sample_demo |
 +-------+--------+------+--------+-------------------------+</code></pre>

     <p class="p">
       A query that samples 50% of the table must process at least 17 bytes of data. Based on the
       sizes of the data files, we can predict that each such query uses 3 arbitrary files. Any 1
       or 2 files are not enough to reach 50% of the total data in the table (34 bytes), so the
       query adds more files until it passes the 50% threshold:
     </p>


 <pre class="pre codeblock"><code>select distinct x from sample_demo tablesample system(50);
 +---+
 | x |
 +---+
 | 4 |
 | 1 |
 | 5 |
 +---+

 select distinct x from sample_demo tablesample system(50);
 +---+
 | x |
 +---+
 | 5 |
 | 4 |
 | 2 |
 +---+

 select distinct x from sample_demo tablesample system(50);
 +---+
 | x |
 +---+
 | 5 |
 | 3 |
 | 2 |
 +---+</code></pre>

     <p class="p">
       To help run reproducible experiments, the <code class="ph codeph">REPEATABLE</code> clause causes Impala
       to choose the same set of files for each query. Although the data set being considered is
       deterministic, the order of results varies (in the absence of an <code class="ph codeph">ORDER BY</code>
       clause) because of the way distributed queries are processed:
     </p>


 <pre class="pre codeblock"><code>select distinct x from sample_demo
   tablesample system(50) repeatable (12345);
 +---+
 | x |
 +---+
 | 3 |
 | 2 |
 | 1 |
 +---+

 select distinct x from sample_demo
   tablesample system(50) repeatable (12345);
 +---+
 | x |
 +---+
 | 2 |
 | 1 |
 | 3 |
 +---+
 </code></pre>

     <p class="p">
       The following examples show how uneven data distribution affects which data is sampled.
       Adding another data file containing a long string value changes the threshold for 50% of
       the total data in the table:
     </p>


 <pre class="pre codeblock"><code>insert into sample_demo values
 (1000, 'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington');

 show files in sample_demo;
 +---------------------+------+-----------+
 | Path                | Size | Partition |
 +---------------------+------+-----------+
 | 991213608_data.0.   | 7B   |           |
 | 982196806_data.0.   | 6B   |           |
 | _253317650_data.0.  | 196B |           |
 | _2122096884_data.0. | 8B   |           |
 | _586325431_data.0.  | 6B   |           |
 | 1894746258_data.0.  | 7B   |           |
 +---------------------+------+-----------+

 show table stats sample_demo;
 +-------+--------+------+--------+-------------------------+
 | #Rows | #Files | Size | Format | Location                |
 +-------+--------+------+--------+-------------------------+
 | -1    | 6      | 230B | TEXT   | /tsample.db/sample_demo |
 +-------+--------+------+--------+-------------------------+
 </code></pre>

     <p class="p">
       Even though the queries do not refer to the <code class="ph codeph">S</code> column containing the long
       value, all the sampling queries include the data file containing the column value
       <code class="ph codeph">X=1000</code>, because the query cannot reach the 50% threshold (115 bytes)
       without including that file. The large file might be considered first, in which case it is
       the only file processed by the query. Or an arbitrary set of other files might be
       considered first.
     </p>


 <pre class="pre codeblock"><code>select distinct x from sample_demo tablesample system(50);
 +------+
 | x    |
 +------+
 | 1000 |
 | 3    |
 | 1    |
 +------+

 select distinct x from sample_demo tablesample system(50);
 +------+
 | x    |
 +------+
 | 1000 |
 +------+

 select distinct x from sample_demo tablesample system(50);
 +------+
 | x    |
 +------+
 | 1000 |
 | 4    |
 | 2    |
 | 1    |
 +------+</code></pre>

     <p class="p">
       The following examples demonstrate how the <code class="ph codeph">TABLESAMPLE</code> clause interacts
       with other table aspects, such as partitioning and file format:
     </p>


 <pre class="pre codeblock"><code>create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;

 insert into sample_demo_partitions partition (n = 1) select * from sample_demo;
 insert into sample_demo_partitions partition (n = 2) select * from sample_demo;
 insert into sample_demo_partitions partition (n = 3) select * from sample_demo;

 show files in sample_demo_partitions;
 +--------------------------------+--------+-----------+
 | Path                           | Size   | Partition |
 +--------------------------------+--------+-----------+
 | 000000_364262785_data.0.parq   | 1.24KB | n=1       |
 | 000001_973526736_data.0.parq   | 566B   | n=1       |
 | 0000000_1300598134_data.0.parq | 1.24KB | n=2       |
 | 0000001_689099063_data.0.parq  | 568B   | n=2       |
 | 0000000_1861371709_data.0.parq | 1.24KB | n=3       |
 | 0000001_1065507912_data.0.parq | 566B   | n=3       |
 +--------------------------------+--------+-----------+

 show table stats tablesample_demo_partitioned;
 +-------+-------+--------+--------+---------+----------------------------------------------+
 | n     | #Rows | #Files | Size   | Format  | Location                                     |
 +-------+-------+--------+--------+---------+----------------------------------------------+
 | 1     | -1    | 2      | 1.79KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=1 |
 | 2     | -1    | 2      | 1.80KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=2 |
 | 3     | -1    | 2      | 1.79KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=3 |
 | Total | -1    | 6      | 5.39KB |         |                                              |
 +-------+-------+--------+--------+---------+----------------------------------------------+
 </code></pre>

     <p class="p">
       If the query does not involve any partition pruning, the sampling applies to the data
       volume of the entire table:
     </p>


 <pre class="pre codeblock"><code>-- 18 rows total.
 select count(*) from sample_demo_partitions;
 +----------+
 | count(*) |
 +----------+
 | 18       |
 +----------+

 -- The number of rows per data file is not
 -- perfectly balanced, therefore the count
 -- is different depending on which set of files
 -- is considered.
 select count(*) from sample_demo_partitions
   tablesample system(75);
 +----------+
 | count(*) |
 +----------+
 | 14       |
 +----------+

 select count(*) from sample_demo_partitions
   tablesample system(75);
 +----------+
 | count(*) |
 +----------+
 | 16       |
 +----------+</code></pre>

     <p class="p">
       If the query only processes certain partitions, the query computes the sampling threshold
       based on the data size and set of files only from the relevant partitions:
     </p>


 <pre class="pre codeblock"><code>select count(*) from sample_demo_partitions
   tablesample system(50) where n = 1;
 +----------+
 | count(*) |
 +----------+
 | 6        |
 +----------+

 select count(*) from sample_demo_partitions
   tablesample system(50) where n = 1;
 +----------+
 | count(*) |
 +----------+
 | 2        |
 +----------+
 </code></pre>

     <p class="p">
         <strong class="ph b">Related information:</strong>
       </p>


     <p class="p">
       <a class="xref" href="impala_select.html#select">SELECT Statement</a>
     </p>


   </div>


 <div class="related-links">
 <div class="familylinks">
 <div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_select.html">SELECT Statement</a></div>
 </div>
 </div></body>
 </html>
	<?xml version="1.0" encoding="UTF-8"?>
	<!DOCTYPE html
	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
	<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
	<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

	<meta name="copyright" content="(C) Copyright 2025" />
	<meta name="DC.rights.owner" content="(C) Copyright 2025" />
	<meta name="DC.Type" content="concept" />
	<meta name="DC.Title" content="TABLESAMPLE Clause" />
	<meta name="DC.Relation" scheme="URI" content="../topics/impala_select.html" />
	<meta name="prodname" content="Impala" />
	<meta name="prodname" content="Impala" />
	<meta name="version" content="Impala 3.4.x" />
	<meta name="version" content="Impala 3.4.x" />
	<meta name="DC.Format" content="XHTML" />
	<meta name="DC.Identifier" content="tablesample" />
	<link rel="stylesheet" type="text/css" href="../commonltr.css" />
	<title>TABLESAMPLE Clause</title>
	</head>
	<body id="tablesample">


	<h1 class="title topictitle1" id="ariaid-title1">TABLESAMPLE Clause</h1>




	<div class="body conbody">

	<p class="p">
	Specify the <code class="ph codeph">TABLESAMPLE</code> clause in cases where you need to explore the
	data distribution within the table, the table is very large, and it is impractical or
	unnecessary to process all the data from the table or selected partitions.
	</p>


	<p class="p">
	The clause makes the query process a randomized set of data files from the table, so that
	the total volume of data is greater than or equal to the specified percentage of data
	bytes within that table. (Or the data bytes within the set of partitions that remain after
	partition pruning is performed.)
	</p>


	<p class="p">
	<strong class="ph b">Syntax:</strong>
	</p>


	<pre class="pre codeblock"><code>
	<span class="ph">TABLESAMPLE SYSTEM(<var class="keyword varname">percentage</var>) [REPEATABLE(<var class="keyword varname">seed</var>)]</span>
	</code></pre>

	<p class="p">
	The <code class="ph codeph">TABLESAMPLE</code> clause comes immediately after a table name or table
	alias.
	</p>


	<p class="p">
	The <code class="ph codeph">SYSTEM</code> keyword represents the sampling method. Currently, Impala only
	supports a single sampling method named <code class="ph codeph">SYSTEM</code>.
	</p>


	<p class="p">
	The <var class="keyword varname">percentage</var> argument is an integer literal from 0 to 100. A
	percentage of 0 produces an empty result set for a particular table reference, while a
	percentage of 100 uses the entire contents. Because the sampling works by selecting a
	random set of data files, the proportion of sampled data from the table may be greater
	than the specified percentage, based on the number and sizes of the underlying data files.
	See the usage notes for details.
	</p>


	<p class="p">
	The optional <code class="ph codeph">REPEATABLE</code> keyword lets you specify an arbitrary positive
	integer seed value that ensures that when the query is run again, the sampling selects the
	same set of data files each time. <code class="ph codeph">REPEATABLE</code> does not have a default
	value. If you omit the <code class="ph codeph">REPEATABLE</code> keyword, the random seed is derived
	from the current time.
	</p>


	<p class="p">
	<strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.9.0</span>
	</p>


	<p class="p">
	See <a href="impala_compute_stats.html"><span class="keyword">COMPUTE STATS Statement</span></a> for the <code class="ph codeph">TABLESAMPLE</code> clause used in
	the <code class="ph codeph">COMPUTE STATS</code> statement.
	</p>


	<p class="p">
	<strong class="ph b">Usage notes:</strong>
	</p>


	<p class="p">
	You might use this clause with aggregation queries, such as finding the approximate
	average, minimum, or maximum where exact precision is not required. You can use these
	findings to plan the most effective strategy for constructing queries against the full
	table or designing a partitioning strategy for the data.
	</p>


	<p class="p">
	Some other database systems have a <code class="ph codeph">TABLESAMPLE</code> clause. The Impala syntax
	for this clause is modeled on the syntax for popular relational databases, not the Hive
	<code class="ph codeph">TABLESAMPLE</code> clause. For example, there is no <code class="ph codeph">BUCKETS</code>
	keyword as in HiveQL.
	</p>


	<p class="p">
	The precision of the <var class="keyword varname">percentage</var> threshold depends on the number and
	sizes of the underlying data files. Impala brings in additional data files, one at a time,
	until the number of bytes exceeds the specified percentage based on the total number of
	bytes for the entire set of table data. The precision of the percentage threshold is
	higher when the table contains many data files with consistent sizes. See the code
	listings later in this section for examples.
	</p>


	<p class="p">
	When you estimate characteristics of the data distribution based on sampling a percentage
	of the table data, be aware that the data might be unevenly distributed between different
	files. Do not assume that the percentage figure reflects the percentage of rows in the
	table. For example, one file might contain all blank values for a <code class="ph codeph">STRING</code>
	column, while another file contains long strings in that column; therefore, one file could
	contain many more rows than another. Likewise, a table created with the <code class="ph codeph">SORT
	BY</code> clause might contain narrow ranges of values for the sort columns, making it
	impractical to extrapolate the number of distinct values for those columns based on
	sampling only some of the data files.
	</p>


	<p class="p">
	Because a sample of the table data might not contain all values for a particular column,
	if the <code class="ph codeph">TABLESAMPLE</code> is used in a join query, the key relationships between
	the tables might produce incomplete result sets compared to joins using all the table
	data. For example, if you join 50% of table A with 50% of table B, some values in the join
	columns might not match between the two tables, even though overall there is a 1:1
	relationship between the tables.
	</p>


	<p class="p">
	The <code class="ph codeph">REPEATABLE</code> keyword makes identical queries use a consistent set of
	data files when the query is repeated. You specify an arbitrary integer key that acts as a
	seed value when Impala randomly selects the set of data files to use in the query. This
	technique lets you verify correctness, examine performance, and so on for queries using
	the <code class="ph codeph">TABLESAMPLE</code> clause without the sampled data being different each
	time. The repeatable aspect is reset (that is, the set of selected data files may change)
	any time the contents of the table change. The statements or operations that can make
	sampling results non-repeatable are:
	</p>


	<ul class="ul">
	<li class="li">
	<code class="ph codeph">INSERT</code>.
	</li>


	<li class="li">
	<code class="ph codeph">TRUNCATE TABLE</code>.
	</li>


	<li class="li">
	<code class="ph codeph">LOAD DATA</code>.
	</li>


	<li class="li">
	<code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE METADATA</code> after files are added
	or removed by a non-Impala mechanism.
	</li>


	<li class="li"></li>

	</ul>


	<p class="p">
	This clause is similar in some ways to the <code class="ph codeph">LIMIT</code> clause, because both
	serve to limit the size of the intermediate data and final result set. <code class="ph codeph">LIMIT
	0</code> is more efficient than <code class="ph codeph">TABLESAMPLE SYSTEM(0)</code> for verifying
	that a query can execute without producing any results. <code class="ph codeph">TABLESAMPLE
	SYSTEM(<var class="keyword varname">n</var>)</code> often makes query processing more efficient than
	using a <code class="ph codeph">LIMIT</code> clause by itself, because all phases of query execution use
	less data overall. If the intent is to retrieve some representative values from the table
	in an efficient way, you might combine <code class="ph codeph">TABLESAMPLE</code>, <code class="ph codeph">ORDER
	BY</code>, and <code class="ph codeph">LIMIT</code> clauses within a single query.
	</p>


	<p class="p">
	<strong class="ph b">Partitioning:</strong>
	</p>


	<p class="p">
	When you query a partitioned table, any partition pruning happens before Impala selects
	the data files to sample. For example, in a table partitioned by year, a query with
	<code class="ph codeph">WHERE year = 2017</code> and a <code class="ph codeph">TABLESAMPLE SYSTEM(10)</code> clause
	would sample data files representing at least 10% of the bytes present in the 2017
	partition.
	</p>


	<p class="p">
	<strong class="ph b">Amazon S3 considerations:</strong>
	</p>


	<p class="p">
	This clause applies to S3 tables the same way as tables with data files stored on HDFS.
	</p>


	<p class="p">
	<strong class="ph b">ADLS considerations:</strong>
	</p>


	<p class="p">
	This clause applies to ADLS tables the same way as tables with data files stored on HDFS.
	</p>


	<p class="p">
	<strong class="ph b">Kudu considerations:</strong>
	</p>


	<p class="p">
	This clause does not apply to Kudu tables.
	</p>


	<p class="p">
	<strong class="ph b">HBase considerations:</strong>
	</p>


	<p class="p">
	This clause does not apply to HBase tables.
	</p>


	<p class="p">
	<strong class="ph b">Performance considerations:</strong>
	</p>


	<p class="p">
	From a performance perspective, the <code class="ph codeph">TABLESAMPLE</code> clause is especially
	valuable for exploratory queries on text, Avro, or other file formats other than Parquet.
	Text-based or row-oriented file formats must process substantial amounts of redundant data
	for queries that derive aggregate results such as <code class="ph codeph">MAX()</code>,
	<code class="ph codeph">MIN()</code>, or <code class="ph codeph">AVG()</code> for a single column. Therefore, you
	might use <code class="ph codeph">TABLESAMPLE</code> early in the ETL pipeline, when data is still in
	raw text format and has not been converted to Parquet or moved into a partitioned table.
	</p>


	<p class="p">
	<strong class="ph b">Restrictions:</strong>
	</p>


	<p class="p">
	This clause applies only to tables that use a storage layer with underlying raw data
	files, such as HDFS, Amazon S3, or Microsoft ADLS.
	</p>


	<p class="p">
	This clause does not apply to table references that represent views. A query that applies
	the <code class="ph codeph">TABLESAMPLE</code> clause to a view or a subquery fails with a semantic
	error.
	</p>


	<p class="p">
	Because the sampling works at the level of entire data files, it is by nature
	coarse-grained. It is possible to specify a small sample percentage but still process a
	substantial portion of the table data if the table contains relatively few data files, if
	each data file is very large, or if the data files vary substantially in size. Be sure
	that you understand the data distribution and physical file layout so that you can verify
	if the results are suitable for extrapolation. For example, if the table contains only a
	single data file, the <span class="q">"sample"</span> will consist of all the table data regardless of the
	percentage you specify. If the table contains data files of 1 GiB, 1 GiB, and 1 KiB, when
	you specify a sampling percentage of 50 you would either process slightly more than 50% of
	the table (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), depending on which
	data files were selected for sampling.
	</p>


	<p class="p">
	If data files are added by a non-Impala mechanism, and the table metadata is not updated
	by a <code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE METADATA</code> statement, the
	<code class="ph codeph">TABLESAMPLE</code> clause does not consider those new files when computing the
	number of bytes in the table or selecting which files to sample.
	</p>


	<p class="p">
	If data files are removed by a non-Impala mechanism, and the table metadata is not updated
	by a <code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE METADATA</code> statement, the query
	fails if the <code class="ph codeph">TABLESAMPLE</code> clause attempts to reference any of the missing
	files.
	</p>


	<p class="p">
	<strong class="ph b">Examples:</strong>
	</p>


	<p class="p">
	The following examples demonstrate the <code class="ph codeph">TABLESAMPLE</code> clause. These examples
	intentionally use very small data sets to illustrate how the number of files, size of each
	file, and overall size of data in the table interact with the percentage specified in the
	clause.
	</p>


	<p class="p">
	These examples use an unpartitioned table, containing several files of roughly the same
	size:
	</p>


	<pre class="pre codeblock"><code>create table sample_demo (x int, s string);

	insert into sample_demo values (1, 'one');
	insert into sample_demo values (2, 'two');
	insert into sample_demo values (3, 'three');
	insert into sample_demo values (4, 'four');
	insert into sample_demo values (5, 'five');

	show files in sample_demo;
	+---------------------+------+-----------+
	\| Path \| Size \| Partition \|
	+---------------------+------+-----------+
	\| 991213608_data.0. \| 7B \| \|
	\| 982196806_data.0. \| 6B \| \|
	\| _2122096884_data.0. \| 8B \| \|
	\| _586325431_data.0. \| 6B \| \|
	\| 1894746258_data.0. \| 7B \| \|
	+---------------------+------+-----------+

	show table stats sample_demo;
	+-------+--------+------+--------+-------------------------+
	\| #Rows \| #Files \| Size \| Format \| Location \|
	+-------+--------+------+--------+-------------------------+
	\| -1 \| 5 \| 34B \| TEXT \| /tsample.db/sample_demo \|
	+-------+--------+------+--------+-------------------------+</code></pre>

	<p class="p">
	A query that samples 50% of the table must process at least 17 bytes of data. Based on the
	sizes of the data files, we can predict that each such query uses 3 arbitrary files. Any 1
	or 2 files are not enough to reach 50% of the total data in the table (34 bytes), so the
	query adds more files until it passes the 50% threshold:
	</p>


	<pre class="pre codeblock"><code>select distinct x from sample_demo tablesample system(50);
	+---+
	\| x \|
	+---+
	\| 4 \|
	\| 1 \|
	\| 5 \|
	+---+

	select distinct x from sample_demo tablesample system(50);
	+---+
	\| x \|
	+---+
	\| 5 \|
	\| 4 \|
	\| 2 \|
	+---+

	select distinct x from sample_demo tablesample system(50);
	+---+
	\| x \|
	+---+
	\| 5 \|
	\| 3 \|
	\| 2 \|
	+---+</code></pre>

	<p class="p">
	To help run reproducible experiments, the <code class="ph codeph">REPEATABLE</code> clause causes Impala
	to choose the same set of files for each query. Although the data set being considered is
	deterministic, the order of results varies (in the absence of an <code class="ph codeph">ORDER BY</code>
	clause) because of the way distributed queries are processed:
	</p>


	<pre class="pre codeblock"><code>select distinct x from sample_demo
	tablesample system(50) repeatable (12345);
	+---+
	\| x \|
	+---+
	\| 3 \|
	\| 2 \|
	\| 1 \|
	+---+

	select distinct x from sample_demo
	tablesample system(50) repeatable (12345);
	+---+
	\| x \|
	+---+
	\| 2 \|
	\| 1 \|
	\| 3 \|
	+---+
	</code></pre>

	<p class="p">
	The following examples show how uneven data distribution affects which data is sampled.
	Adding another data file containing a long string value changes the threshold for 50% of
	the total data in the table:
	</p>


	<pre class="pre codeblock"><code>insert into sample_demo values
	(1000, 'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington');

	show files in sample_demo;
	+---------------------+------+-----------+
	\| Path \| Size \| Partition \|
	+---------------------+------+-----------+
	\| 991213608_data.0. \| 7B \| \|
	\| 982196806_data.0. \| 6B \| \|
	\| _253317650_data.0. \| 196B \| \|
	\| _2122096884_data.0. \| 8B \| \|
	\| _586325431_data.0. \| 6B \| \|
	\| 1894746258_data.0. \| 7B \| \|
	+---------------------+------+-----------+

	show table stats sample_demo;
	+-------+--------+------+--------+-------------------------+
	\| #Rows \| #Files \| Size \| Format \| Location \|
	+-------+--------+------+--------+-------------------------+
	\| -1 \| 6 \| 230B \| TEXT \| /tsample.db/sample_demo \|
	+-------+--------+------+--------+-------------------------+
	</code></pre>

	<p class="p">
	Even though the queries do not refer to the <code class="ph codeph">S</code> column containing the long
	value, all the sampling queries include the data file containing the column value
	<code class="ph codeph">X=1000</code>, because the query cannot reach the 50% threshold (115 bytes)
	without including that file. The large file might be considered first, in which case it is
	the only file processed by the query. Or an arbitrary set of other files might be
	considered first.
	</p>


	<pre class="pre codeblock"><code>select distinct x from sample_demo tablesample system(50);
	+------+
	\| x \|
	+------+
	\| 1000 \|
	\| 3 \|
	\| 1 \|
	+------+

	select distinct x from sample_demo tablesample system(50);
	+------+
	\| x \|
	+------+
	\| 1000 \|
	+------+

	select distinct x from sample_demo tablesample system(50);
	+------+
	\| x \|
	+------+
	\| 1000 \|
	\| 4 \|
	\| 2 \|
	\| 1 \|
	+------+</code></pre>

	<p class="p">
	The following examples demonstrate how the <code class="ph codeph">TABLESAMPLE</code> clause interacts
	with other table aspects, such as partitioning and file format:
	</p>


	<pre class="pre codeblock"><code>create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;

	insert into sample_demo_partitions partition (n = 1) select * from sample_demo;
	insert into sample_demo_partitions partition (n = 2) select * from sample_demo;
	insert into sample_demo_partitions partition (n = 3) select * from sample_demo;

	show files in sample_demo_partitions;
	+--------------------------------+--------+-----------+
	\| Path \| Size \| Partition \|
	+--------------------------------+--------+-----------+
	\| 000000_364262785_data.0.parq \| 1.24KB \| n=1 \|
	\| 000001_973526736_data.0.parq \| 566B \| n=1 \|
	\| 0000000_1300598134_data.0.parq \| 1.24KB \| n=2 \|
	\| 0000001_689099063_data.0.parq \| 568B \| n=2 \|
	\| 0000000_1861371709_data.0.parq \| 1.24KB \| n=3 \|
	\| 0000001_1065507912_data.0.parq \| 566B \| n=3 \|
	+--------------------------------+--------+-----------+

	show table stats tablesample_demo_partitioned;
	+-------+-------+--------+--------+---------+----------------------------------------------+
	\| n \| #Rows \| #Files \| Size \| Format \| Location \|
	+-------+-------+--------+--------+---------+----------------------------------------------+
	\| 1 \| -1 \| 2 \| 1.79KB \| PARQUET \| /tsample.db/tablesample_demo_partitioned/n=1 \|
	\| 2 \| -1 \| 2 \| 1.80KB \| PARQUET \| /tsample.db/tablesample_demo_partitioned/n=2 \|
	\| 3 \| -1 \| 2 \| 1.79KB \| PARQUET \| /tsample.db/tablesample_demo_partitioned/n=3 \|
	\| Total \| -1 \| 6 \| 5.39KB \| \| \|
	+-------+-------+--------+--------+---------+----------------------------------------------+
	</code></pre>

	<p class="p">
	If the query does not involve any partition pruning, the sampling applies to the data
	volume of the entire table:
	</p>


	<pre class="pre codeblock"><code>-- 18 rows total.
	select count(*) from sample_demo_partitions;
	+----------+
	\| count(*) \|
	+----------+
	\| 18 \|
	+----------+

	-- The number of rows per data file is not
	-- perfectly balanced, therefore the count
	-- is different depending on which set of files
	-- is considered.
	select count(*) from sample_demo_partitions
	tablesample system(75);
	+----------+
	\| count(*) \|
	+----------+
	\| 14 \|
	+----------+

	select count(*) from sample_demo_partitions
	tablesample system(75);
	+----------+
	\| count(*) \|
	+----------+
	\| 16 \|
	+----------+</code></pre>

	<p class="p">
	If the query only processes certain partitions, the query computes the sampling threshold
	based on the data size and set of files only from the relevant partitions:
	</p>


	<pre class="pre codeblock"><code>select count(*) from sample_demo_partitions
	tablesample system(50) where n = 1;
	+----------+
	\| count(*) \|
	+----------+
	\| 6 \|
	+----------+

	select count(*) from sample_demo_partitions
	tablesample system(50) where n = 1;
	+----------+
	\| count(*) \|
	+----------+
	\| 2 \|
	+----------+
	</code></pre>

	<p class="p">
	<strong class="ph b">Related information:</strong>
	</p>


	<p class="p">
	<a class="xref" href="impala_select.html#select">SELECT Statement</a>
	</p>


	</div>


	<div class="related-links">
	<div class="familylinks">
	<div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_select.html">SELECT Statement</a></div>
	</div>
	</div></body>
	</html>