| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE html |
| PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> |
| <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> |
| |
| <meta name="copyright" content="(C) Copyright 2025" /> |
| <meta name="DC.rights.owner" content="(C) Copyright 2025" /> |
| <meta name="DC.Type" content="concept" /> |
| <meta name="DC.Title" content="TABLESAMPLE Clause" /> |
| <meta name="DC.Relation" scheme="URI" content="../topics/impala_select.html" /> |
| <meta name="prodname" content="Impala" /> |
| <meta name="prodname" content="Impala" /> |
| <meta name="version" content="Impala 3.4.x" /> |
| <meta name="version" content="Impala 3.4.x" /> |
| <meta name="DC.Format" content="XHTML" /> |
| <meta name="DC.Identifier" content="tablesample" /> |
| <link rel="stylesheet" type="text/css" href="../commonltr.css" /> |
| <title>TABLESAMPLE Clause</title> |
| </head> |
| <body id="tablesample"> |
| |
| |
| <h1 class="title topictitle1" id="ariaid-title1">TABLESAMPLE Clause</h1> |
| |
| |
| |
| |
| <div class="body conbody"> |
| |
| <p class="p"> |
| Specify the <code class="ph codeph">TABLESAMPLE</code> clause in cases where you need to explore the |
| data distribution within the table, the table is very large, and it is impractical or |
| unnecessary to process all the data from the table or selected partitions. |
| </p> |
| |
| |
| <p class="p"> |
| The clause makes the query process a randomized set of data files from the table, so that |
| the total volume of data is greater than or equal to the specified percentage of data |
| bytes within that table. (Or the data bytes within the set of partitions that remain after |
| partition pruning is performed.) |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">Syntax:</strong> |
| </p> |
| |
| |
| <pre class="pre codeblock"><code> |
| <span class="ph">TABLESAMPLE SYSTEM(<var class="keyword varname">percentage</var>) [REPEATABLE(<var class="keyword varname">seed</var>)]</span> |
| </code></pre> |
| |
| <p class="p"> |
| The <code class="ph codeph">TABLESAMPLE</code> clause comes immediately after a table name or table |
| alias. |
| </p> |
| |
| |
| <p class="p"> |
| The <code class="ph codeph">SYSTEM</code> keyword represents the sampling method. Currently, Impala only |
| supports a single sampling method named <code class="ph codeph">SYSTEM</code>. |
| </p> |
| |
| |
| <p class="p"> |
| The <var class="keyword varname">percentage</var> argument is an integer literal from 0 to 100. A |
| percentage of 0 produces an empty result set for a particular table reference, while a |
| percentage of 100 uses the entire contents. Because the sampling works by selecting a |
| random set of data files, the proportion of sampled data from the table may be greater |
| than the specified percentage, based on the number and sizes of the underlying data files. |
| See the usage notes for details. |
| </p> |
| |
| |
| <p class="p"> |
| The optional <code class="ph codeph">REPEATABLE</code> keyword lets you specify an arbitrary positive |
| integer seed value that ensures that when the query is run again, the sampling selects the |
| same set of data files each time. <code class="ph codeph">REPEATABLE</code> does not have a default |
| value. If you omit the <code class="ph codeph">REPEATABLE</code> keyword, the random seed is derived |
| from the current time. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.9.0</span> |
| </p> |
| |
| |
| <p class="p"> |
| See <a href="impala_compute_stats.html"><span class="keyword">COMPUTE STATS Statement</span></a> for the <code class="ph codeph">TABLESAMPLE</code> clause used in |
| the <code class="ph codeph">COMPUTE STATS</code> statement. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">Usage notes:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| You might use this clause with aggregation queries, such as finding the approximate |
| average, minimum, or maximum where exact precision is not required. You can use these |
| findings to plan the most effective strategy for constructing queries against the full |
| table or designing a partitioning strategy for the data. |
| </p> |
| |
| |
| <p class="p"> |
| Some other database systems have a <code class="ph codeph">TABLESAMPLE</code> clause. The Impala syntax |
| for this clause is modeled on the syntax for popular relational databases, not the Hive |
| <code class="ph codeph">TABLESAMPLE</code> clause. For example, there is no <code class="ph codeph">BUCKETS</code> |
| keyword as in HiveQL. |
| </p> |
| |
| |
| <p class="p"> |
| The precision of the <var class="keyword varname">percentage</var> threshold depends on the number and |
| sizes of the underlying data files. Impala brings in additional data files, one at a time, |
| until the number of bytes exceeds the specified percentage based on the total number of |
| bytes for the entire set of table data. The precision of the percentage threshold is |
| higher when the table contains many data files with consistent sizes. See the code |
| listings later in this section for examples. |
| </p> |
| |
| |
| <p class="p"> |
| When you estimate characteristics of the data distribution based on sampling a percentage |
| of the table data, be aware that the data might be unevenly distributed between different |
| files. Do not assume that the percentage figure reflects the percentage of rows in the |
| table. For example, one file might contain all blank values for a <code class="ph codeph">STRING</code> |
| column, while another file contains long strings in that column; therefore, one file could |
| contain many more rows than another. Likewise, a table created with the <code class="ph codeph">SORT |
| BY</code> clause might contain narrow ranges of values for the sort columns, making it |
| impractical to extrapolate the number of distinct values for those columns based on |
| sampling only some of the data files. |
| </p> |
| |
| |
| <p class="p"> |
| Because a sample of the table data might not contain all values for a particular column, |
| if the <code class="ph codeph">TABLESAMPLE</code> is used in a join query, the key relationships between |
| the tables might produce incomplete result sets compared to joins using all the table |
| data. For example, if you join 50% of table A with 50% of table B, some values in the join |
| columns might not match between the two tables, even though overall there is a 1:1 |
| relationship between the tables. |
| </p> |
| |
| |
| <p class="p"> |
| The <code class="ph codeph">REPEATABLE</code> keyword makes identical queries use a consistent set of |
| data files when the query is repeated. You specify an arbitrary integer key that acts as a |
| seed value when Impala randomly selects the set of data files to use in the query. This |
| technique lets you verify correctness, examine performance, and so on for queries using |
| the <code class="ph codeph">TABLESAMPLE</code> clause without the sampled data being different each |
| time. The repeatable aspect is reset (that is, the set of selected data files may change) |
| any time the contents of the table change. The statements or operations that can make |
| sampling results non-repeatable are: |
| </p> |
| |
| |
| <ul class="ul"> |
| <li class="li"> |
| <code class="ph codeph">INSERT</code>. |
| </li> |
| |
| |
| <li class="li"> |
| <code class="ph codeph">TRUNCATE TABLE</code>. |
| </li> |
| |
| |
| <li class="li"> |
| <code class="ph codeph">LOAD DATA</code>. |
| </li> |
| |
| |
| <li class="li"> |
| <code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE METADATA</code> after files are added |
| or removed by a non-Impala mechanism. |
| </li> |
| |
| |
| <li class="li"></li> |
| |
| </ul> |
| |
| |
| <p class="p"> |
| This clause is similar in some ways to the <code class="ph codeph">LIMIT</code> clause, because both |
| serve to limit the size of the intermediate data and final result set. <code class="ph codeph">LIMIT |
| 0</code> is more efficient than <code class="ph codeph">TABLESAMPLE SYSTEM(0)</code> for verifying |
| that a query can execute without producing any results. <code class="ph codeph">TABLESAMPLE |
| SYSTEM(<var class="keyword varname">n</var>)</code> often makes query processing more efficient than |
| using a <code class="ph codeph">LIMIT</code> clause by itself, because all phases of query execution use |
| less data overall. If the intent is to retrieve some representative values from the table |
| in an efficient way, you might combine <code class="ph codeph">TABLESAMPLE</code>, <code class="ph codeph">ORDER |
| BY</code>, and <code class="ph codeph">LIMIT</code> clauses within a single query. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">Partitioning:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| When you query a partitioned table, any partition pruning happens before Impala selects |
| the data files to sample. For example, in a table partitioned by year, a query with |
| <code class="ph codeph">WHERE year = 2017</code> and a <code class="ph codeph">TABLESAMPLE SYSTEM(10)</code> clause |
| would sample data files representing at least 10% of the bytes present in the 2017 |
| partition. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">Amazon S3 considerations:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| This clause applies to S3 tables the same way as tables with data files stored on HDFS. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">ADLS considerations:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| This clause applies to ADLS tables the same way as tables with data files stored on HDFS. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">Kudu considerations:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| This clause does not apply to Kudu tables. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">HBase considerations:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| This clause does not apply to HBase tables. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">Performance considerations:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| From a performance perspective, the <code class="ph codeph">TABLESAMPLE</code> clause is especially |
| valuable for exploratory queries on text, Avro, or other file formats other than Parquet. |
| Text-based or row-oriented file formats must process substantial amounts of redundant data |
| for queries that derive aggregate results such as <code class="ph codeph">MAX()</code>, |
| <code class="ph codeph">MIN()</code>, or <code class="ph codeph">AVG()</code> for a single column. Therefore, you |
| might use <code class="ph codeph">TABLESAMPLE</code> early in the ETL pipeline, when data is still in |
| raw text format and has not been converted to Parquet or moved into a partitioned table. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">Restrictions:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| This clause applies only to tables that use a storage layer with underlying raw data |
| files, such as HDFS, Amazon S3, or Microsoft ADLS. |
| </p> |
| |
| |
| <p class="p"> |
| This clause does not apply to table references that represent views. A query that applies |
| the <code class="ph codeph">TABLESAMPLE</code> clause to a view or a subquery fails with a semantic |
| error. |
| </p> |
| |
| |
| <p class="p"> |
| Because the sampling works at the level of entire data files, it is by nature |
| coarse-grained. It is possible to specify a small sample percentage but still process a |
| substantial portion of the table data if the table contains relatively few data files, if |
| each data file is very large, or if the data files vary substantially in size. Be sure |
| that you understand the data distribution and physical file layout so that you can verify |
| if the results are suitable for extrapolation. For example, if the table contains only a |
| single data file, the <span class="q">"sample"</span> will consist of all the table data regardless of the |
| percentage you specify. If the table contains data files of 1 GiB, 1 GiB, and 1 KiB, when |
| you specify a sampling percentage of 50 you would either process slightly more than 50% of |
| the table (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), depending on which |
| data files were selected for sampling. |
| </p> |
| |
| |
| <p class="p"> |
| If data files are added by a non-Impala mechanism, and the table metadata is not updated |
| by a <code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE METADATA</code> statement, the |
| <code class="ph codeph">TABLESAMPLE</code> clause does not consider those new files when computing the |
| number of bytes in the table or selecting which files to sample. |
| </p> |
| |
| |
| <p class="p"> |
| If data files are removed by a non-Impala mechanism, and the table metadata is not updated |
| by a <code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE METADATA</code> statement, the query |
| fails if the <code class="ph codeph">TABLESAMPLE</code> clause attempts to reference any of the missing |
| files. |
| </p> |
| |
| |
| <p class="p"> |
| <strong class="ph b">Examples:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| The following examples demonstrate the <code class="ph codeph">TABLESAMPLE</code> clause. These examples |
| intentionally use very small data sets to illustrate how the number of files, size of each |
| file, and overall size of data in the table interact with the percentage specified in the |
| clause. |
| </p> |
| |
| |
| <p class="p"> |
| These examples use an unpartitioned table, containing several files of roughly the same |
| size: |
| </p> |
| |
| |
| <pre class="pre codeblock"><code>create table sample_demo (x int, s string); |
| |
| insert into sample_demo values (1, 'one'); |
| insert into sample_demo values (2, 'two'); |
| insert into sample_demo values (3, 'three'); |
| insert into sample_demo values (4, 'four'); |
| insert into sample_demo values (5, 'five'); |
| |
| show files in sample_demo; |
| +---------------------+------+-----------+ |
| | Path | Size | Partition | |
| +---------------------+------+-----------+ |
| | 991213608_data.0. | 7B | | |
| | 982196806_data.0. | 6B | | |
| | _2122096884_data.0. | 8B | | |
| | _586325431_data.0. | 6B | | |
| | 1894746258_data.0. | 7B | | |
| +---------------------+------+-----------+ |
| |
| show table stats sample_demo; |
| +-------+--------+------+--------+-------------------------+ |
| | #Rows | #Files | Size | Format | Location | |
| +-------+--------+------+--------+-------------------------+ |
| | -1 | 5 | 34B | TEXT | /tsample.db/sample_demo | |
| +-------+--------+------+--------+-------------------------+</code></pre> |
| |
| <p class="p"> |
| A query that samples 50% of the table must process at least 17 bytes of data. Based on the |
| sizes of the data files, we can predict that each such query uses 3 arbitrary files. Any 1 |
| or 2 files are not enough to reach 50% of the total data in the table (34 bytes), so the |
| query adds more files until it passes the 50% threshold: |
| </p> |
| |
| |
| <pre class="pre codeblock"><code>select distinct x from sample_demo tablesample system(50); |
| +---+ |
| | x | |
| +---+ |
| | 4 | |
| | 1 | |
| | 5 | |
| +---+ |
| |
| select distinct x from sample_demo tablesample system(50); |
| +---+ |
| | x | |
| +---+ |
| | 5 | |
| | 4 | |
| | 2 | |
| +---+ |
| |
| select distinct x from sample_demo tablesample system(50); |
| +---+ |
| | x | |
| +---+ |
| | 5 | |
| | 3 | |
| | 2 | |
| +---+</code></pre> |
| |
| <p class="p"> |
| To help run reproducible experiments, the <code class="ph codeph">REPEATABLE</code> clause causes Impala |
| to choose the same set of files for each query. Although the data set being considered is |
| deterministic, the order of results varies (in the absence of an <code class="ph codeph">ORDER BY</code> |
| clause) because of the way distributed queries are processed: |
| </p> |
| |
| |
| <pre class="pre codeblock"><code>select distinct x from sample_demo |
| tablesample system(50) repeatable (12345); |
| +---+ |
| | x | |
| +---+ |
| | 3 | |
| | 2 | |
| | 1 | |
| +---+ |
| |
| select distinct x from sample_demo |
| tablesample system(50) repeatable (12345); |
| +---+ |
| | x | |
| +---+ |
| | 2 | |
| | 1 | |
| | 3 | |
| +---+ |
| </code></pre> |
| |
| <p class="p"> |
| The following examples show how uneven data distribution affects which data is sampled. |
| Adding another data file containing a long string value changes the threshold for 50% of |
| the total data in the table: |
| </p> |
| |
| |
| <pre class="pre codeblock"><code>insert into sample_demo values |
| (1000, 'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington'); |
| |
| show files in sample_demo; |
| +---------------------+------+-----------+ |
| | Path | Size | Partition | |
| +---------------------+------+-----------+ |
| | 991213608_data.0. | 7B | | |
| | 982196806_data.0. | 6B | | |
| | _253317650_data.0. | 196B | | |
| | _2122096884_data.0. | 8B | | |
| | _586325431_data.0. | 6B | | |
| | 1894746258_data.0. | 7B | | |
| +---------------------+------+-----------+ |
| |
| show table stats sample_demo; |
| +-------+--------+------+--------+-------------------------+ |
| | #Rows | #Files | Size | Format | Location | |
| +-------+--------+------+--------+-------------------------+ |
| | -1 | 6 | 230B | TEXT | /tsample.db/sample_demo | |
| +-------+--------+------+--------+-------------------------+ |
| </code></pre> |
| |
| <p class="p"> |
| Even though the queries do not refer to the <code class="ph codeph">S</code> column containing the long |
| value, all the sampling queries include the data file containing the column value |
| <code class="ph codeph">X=1000</code>, because the query cannot reach the 50% threshold (115 bytes) |
| without including that file. The large file might be considered first, in which case it is |
| the only file processed by the query. Or an arbitrary set of other files might be |
| considered first. |
| </p> |
| |
| |
| <pre class="pre codeblock"><code>select distinct x from sample_demo tablesample system(50); |
| +------+ |
| | x | |
| +------+ |
| | 1000 | |
| | 3 | |
| | 1 | |
| +------+ |
| |
| select distinct x from sample_demo tablesample system(50); |
| +------+ |
| | x | |
| +------+ |
| | 1000 | |
| +------+ |
| |
| select distinct x from sample_demo tablesample system(50); |
| +------+ |
| | x | |
| +------+ |
| | 1000 | |
| | 4 | |
| | 2 | |
| | 1 | |
| +------+</code></pre> |
| |
| <p class="p"> |
| The following examples demonstrate how the <code class="ph codeph">TABLESAMPLE</code> clause interacts |
| with other table aspects, such as partitioning and file format: |
| </p> |
| |
| |
| <pre class="pre codeblock"><code>create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet; |
| |
| insert into sample_demo_partitions partition (n = 1) select * from sample_demo; |
| insert into sample_demo_partitions partition (n = 2) select * from sample_demo; |
| insert into sample_demo_partitions partition (n = 3) select * from sample_demo; |
| |
| show files in sample_demo_partitions; |
| +--------------------------------+--------+-----------+ |
| | Path | Size | Partition | |
| +--------------------------------+--------+-----------+ |
| | 000000_364262785_data.0.parq | 1.24KB | n=1 | |
| | 000001_973526736_data.0.parq | 566B | n=1 | |
| | 0000000_1300598134_data.0.parq | 1.24KB | n=2 | |
| | 0000001_689099063_data.0.parq | 568B | n=2 | |
| | 0000000_1861371709_data.0.parq | 1.24KB | n=3 | |
| | 0000001_1065507912_data.0.parq | 566B | n=3 | |
| +--------------------------------+--------+-----------+ |
| |
| show table stats tablesample_demo_partitioned; |
| +-------+-------+--------+--------+---------+----------------------------------------------+ |
| | n | #Rows | #Files | Size | Format | Location | |
| +-------+-------+--------+--------+---------+----------------------------------------------+ |
| | 1 | -1 | 2 | 1.79KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=1 | |
| | 2 | -1 | 2 | 1.80KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=2 | |
| | 3 | -1 | 2 | 1.79KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=3 | |
| | Total | -1 | 6 | 5.39KB | | | |
| +-------+-------+--------+--------+---------+----------------------------------------------+ |
| </code></pre> |
| |
| <p class="p"> |
| If the query does not involve any partition pruning, the sampling applies to the data |
| volume of the entire table: |
| </p> |
| |
| |
| <pre class="pre codeblock"><code>-- 18 rows total. |
| select count(*) from sample_demo_partitions; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 18 | |
| +----------+ |
| |
| -- The number of rows per data file is not |
| -- perfectly balanced, therefore the count |
| -- is different depending on which set of files |
| -- is considered. |
| select count(*) from sample_demo_partitions |
| tablesample system(75); |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 14 | |
| +----------+ |
| |
| select count(*) from sample_demo_partitions |
| tablesample system(75); |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 16 | |
| +----------+</code></pre> |
| |
| <p class="p"> |
| If the query only processes certain partitions, the query computes the sampling threshold |
| based on the data size and set of files only from the relevant partitions: |
| </p> |
| |
| |
| <pre class="pre codeblock"><code>select count(*) from sample_demo_partitions |
| tablesample system(50) where n = 1; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 6 | |
| +----------+ |
| |
| select count(*) from sample_demo_partitions |
| tablesample system(50) where n = 1; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 2 | |
| +----------+ |
| </code></pre> |
| |
| <p class="p"> |
| <strong class="ph b">Related information:</strong> |
| </p> |
| |
| |
| <p class="p"> |
| <a class="xref" href="impala_select.html#select">SELECT Statement</a> |
| </p> |
| |
| |
| </div> |
| |
| |
| <div class="related-links"> |
| <div class="familylinks"> |
| <div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_select.html">SELECT Statement</a></div> |
| </div> |
| </div></body> |
| </html> |