| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="partitioning"> |
| |
| <title>Partitioning for Impala Tables</title> |
| |
| <titlealts audience="PDF"> |
| |
| <navtitle>Partitioning</navtitle> |
| |
| </titlealts> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Performance"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">partitioning</indexterm> |
| By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the |
| data during loading, based on values from one or more columns, to speed up queries that test those columns. For example, with a |
| <codeph>school_records</codeph> table partitioned on a <codeph>year</codeph> column, there is a separate data directory for each |
| different year value, and all the data for that year is stored in a data file in that directory. A query that includes a |
| <codeph>WHERE</codeph> condition such as <codeph>YEAR=1966</codeph>, <codeph>YEAR IN (1989,1999)</codeph>, or <codeph>YEAR BETWEEN |
| 1984 AND 1989</codeph> can examine only the data files from the appropriate directory or directories, greatly reducing the amount of |
| data to read and test. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| <p> |
| See <xref href="impala_tutorial.xml#tut_external_partition_data"/> for an example that illustrates the syntax for creating partitioned |
| tables, the underlying directory structure in HDFS, and how to attach a partitioned Impala external table to data files stored |
| elsewhere in HDFS. |
| </p> |
| |
| <p> |
| Parquet is a popular format for partitioned Impala tables because it is well suited to handle huge data volumes. See |
| <xref href="impala_parquet.xml#parquet_performance"/> for performance considerations for partitioned Parquet tables. |
| </p> |
| |
| <p> |
| See <xref href="impala_literals.xml#null"/> for details about how <codeph>NULL</codeph> values are represented in partitioned tables. |
| </p> |
| |
| <p rev="2.2.0"> |
| See <xref href="impala_s3.xml#s3"/> for details about setting up tables where some or all partitions reside on the Amazon Simple |
| Storage Service (S3). |
| </p> |
| |
| </conbody> |
| |
| <concept id="partitioning_choosing"> |
| |
| <title>When to Use Partitioned Tables</title> |
| |
| <conbody> |
| |
| <p> |
| Partitioning is typically appropriate for: |
| </p> |
| |
| <ul> |
| <li> |
| Tables that are very large, where reading the entire data set takes an impractical amount of time. |
| </li> |
| |
| <li> |
| Tables that are always or almost always queried with conditions on the partitioning columns. In our example of a table partitioned |
| by year, <codeph>SELECT COUNT(*) FROM school_records WHERE year = 1985</codeph> is efficient, only examining a small fraction of |
| the data; but <codeph>SELECT COUNT(*) FROM school_records</codeph> has to process a separate data file for each year, resulting in |
| more overall work than in an unpartitioned table. You would probably not partition this way if you frequently queried the table |
| based on last name, student ID, and so on without testing the year. |
| </li> |
| |
| <li> |
| Columns that have reasonable cardinality (number of different values). If a column only has a small number of values, for example |
| <codeph>Male</codeph> or <codeph>Female</codeph>, you do not gain much efficiency by eliminating only about 50% of the data to |
| read for each query. If a column has only a few rows matching each value, the number of directories to process can become a |
| limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting |
| data in multi-megabyte blocks. For example, you might partition census data by year, store sales data by year and month, and web |
| traffic data by year, month, and day. (Some users with high volumes of incoming data might even partition down to the individual |
| hour and minute.) |
| </li> |
| |
| <li> |
| Data that already passes through an extract, transform, and load (ETL) pipeline. The values of the partitioning columns are |
| stripped from the original data files and represented by directory names, so loading data into a partitioned table involves some |
| sort of transformation or preprocessing. |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="partition_sql"> |
| |
| <title>SQL Statements for Partitioned Tables</title> |
| |
| <conbody> |
| |
| <p> |
| In terms of Impala SQL syntax, partitioning affects these statements: |
| </p> |
| |
| <ul> |
| <li> |
| <codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>: you specify a <codeph>PARTITIONED |
| BY</codeph> clause when creating the table to identify names and data types of the partitioning columns. These columns are not |
| included in the main list of columns for the table. |
| </li> |
| |
| <li rev="2.5.0"> |
| In <keyword keyref="impala25_full"/> and higher, you can also use the <codeph>PARTITIONED BY</codeph> clause in a <codeph>CREATE TABLE AS |
| SELECT</codeph> statement. This syntax lets you use a single statement to create a partitioned table, copy data into it, and |
| create new partitions based on the values in the inserted data. |
| </li> |
| |
| <li> |
| <codeph><xref href="impala_alter_table.xml#alter_table">ALTER |
| TABLE</xref></codeph>: you can add or drop partitions, to work |
| with different portions of a huge data set. You can designate the HDFS |
| directory that holds the data files for a specific partition. With |
| data partitioned by date values, you might <q>age out</q> data that is |
| no longer relevant. <note |
| conref="../shared/impala_common.xml#common/add_partition_set_location" |
| /> |
| </li> |
| |
| <li> |
| <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph>: |
| When you insert data into a partitioned table, you identify the |
| partitioning columns. One or more values from each inserted row are |
| not stored in data files, but instead determine the directory where |
| that row value is stored. You can also specify which partition to load |
| a set of data into, with <codeph>INSERT OVERWRITE</codeph> statements; |
| you can replace the contents of a specific partition but you cannot |
| append data to a specific partition. <p rev="1.3.1" |
| conref="../shared/impala_common.xml#common/insert_inherit_permissions" |
| /> |
| </li> |
| |
| <li> |
| Although the syntax of the <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement is the same whether or |
| not the table is partitioned, the way queries interact with partitioned tables can have a dramatic impact on performance and |
| scalability. The mechanism that lets queries skip certain partitions during a query is known as partition pruning; see |
| <xref href="impala_partitioning.xml#partition_pruning"/> for details. |
| </li> |
| |
| <li rev="1.4.0"> |
| In Impala 1.4 and later, there is a <codeph>SHOW PARTITIONS</codeph> statement that displays information about each partition in a |
| table. See <xref href="impala_show.xml#show"/> for details. |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="partition_static_dynamic"> |
| |
| <title>Static and Dynamic Partitioning Clauses</title> |
| |
| <conbody> |
| |
| <p> |
| Specifying all the partition columns in a SQL statement is called <term>static partitioning</term>, because the statement affects a |
| single predictable partition. For example, you use static partitioning with an <codeph>ALTER TABLE</codeph> statement that affects |
| only one partition, or with an <codeph>INSERT</codeph> statement that inserts all values into the same partition: |
| </p> |
| |
| <codeblock>insert into t1 <b>partition(x=10, y='a')</b> select c1 from some_other_table; |
| </codeblock> |
| |
| <p> |
| When you specify some partition key columns in an <codeph>INSERT</codeph> statement, but leave out the values, Impala determines |
| which partition to insert. This technique is called <term>dynamic partitioning</term>: |
| </p> |
| |
| <codeblock>insert into t1 <b>partition(x, y='b')</b> select c1, c2 from some_other_table; |
| -- Create new partition if necessary based on variable year, month, and day; insert a single value. |
| insert into weather <b>partition (year, month, day)</b> select 'cloudy',2014,4,21; |
| -- Create new partition if necessary for specified year and month but variable day; insert a single value. |
| insert into weather <b>partition (year=2014, month=04, day)</b> select 'sunny',22; |
| </codeblock> |
| |
| <p> |
| The more key columns you specify in the <codeph>PARTITION</codeph> clause, the fewer columns you need in the <codeph>SELECT</codeph> |
| list. The trailing columns in the <codeph>SELECT</codeph> list are substituted in order for the partition key columns with no |
| specified value. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="partition_refresh" rev="2.7.0 IMPALA-1683"> |
| |
| <title>Refreshing a Single Partition</title> |
| |
| <conbody> |
| |
| <p> |
| The <codeph>REFRESH</codeph> statement is typically used with partitioned tables when new data files are loaded into a partition by |
| some non-Impala mechanism, such as a Hive or Spark job. The <codeph>REFRESH</codeph> statement makes Impala aware of the new data |
| files so that they can be used in Impala queries. Because partitioned tables typically contain a high volume of data, the |
| <codeph>REFRESH</codeph> operation for a full partitioned table can take significant time. |
| </p> |
| |
| <p> |
| In <keyword keyref="impala27_full"/> and higher, you can include a <codeph>PARTITION (<varname>partition_spec</varname>)</codeph> clause in the |
| <codeph>REFRESH</codeph> statement so that only a single partition is refreshed. For example, <codeph>REFRESH big_table PARTITION |
| (year=2017, month=9, day=30)</codeph>. The partition spec must include all the partition key columns. See |
| <xref href="impala_refresh.xml#refresh"/> for more details and examples of <codeph>REFRESH</codeph> syntax and usage. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="partition_permissions"> |
| |
| <title>Permissions for Partition Subdirectories</title> |
| |
| <conbody> |
| |
| <p rev="1.3.1" |
| conref="../shared/impala_common.xml#common/insert_inherit_permissions"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="partition_pruning"> |
| |
| <title>Partition Pruning for Queries</title> |
| |
| <conbody> |
| |
| <p> |
| Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions. If |
| you can arrange for queries to prune large numbers of unnecessary partitions from the query execution plan, the queries use fewer |
| resources and are thus proportionally faster and more scalable. |
| </p> |
| |
| <p> |
| For example, if a table is partitioned by columns <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>, then |
| <codeph>WHERE</codeph> clauses such as <codeph>WHERE year = 2013</codeph>, <codeph>WHERE year < 2010</codeph>, or <codeph>WHERE |
| year BETWEEN 1995 AND 1998</codeph> allow Impala to skip the data files in all partitions outside the specified range. Likewise, |
| <codeph>WHERE year = 2013 AND month BETWEEN 1 AND 3</codeph> could prune even more partitions, reading the data files for only a |
| portion of one year. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="partition_pruning_checking"> |
| |
| <title>Checking if Partition Pruning Happens for a Query</title> |
| |
| <conbody> |
| |
| <p> |
| To check the effectiveness of partition pruning for a query, check the <codeph>EXPLAIN</codeph> output for the query before |
| running it. For example, this example shows a table with 3 partitions, where the query only reads 1 of them. The notation |
| <codeph>#partitions=1/3</codeph> in the <codeph>EXPLAIN</codeph> plan confirms that Impala can do the appropriate partition |
| pruning. |
| </p> |
| |
| <codeblock>[localhost:21000] > insert into census partition (year=2010) values ('Smith'),('Jones'); |
| [localhost:21000] > insert into census partition (year=2011) values ('Smith'),('Jones'),('Doe'); |
| [localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Doe'); |
| [localhost:21000] > select name from census where year=2010; |
| +-------+ |
| | name | |
| +-------+ |
| | Smith | |
| | Jones | |
| +-------+ |
| [localhost:21000] > explain select name from census <b>where year=2010</b>; |
| +------------------------------------------------------------------+ |
| | Explain String | |
| +------------------------------------------------------------------+ |
| | PLAN FRAGMENT 0 | |
| | PARTITION: UNPARTITIONED | |
| | | |
| | 1:EXCHANGE | |
| | | |
| | PLAN FRAGMENT 1 | |
| | PARTITION: RANDOM | |
| | | |
| | STREAM DATA SINK | |
| | EXCHANGE ID: 1 | |
| | UNPARTITIONED | |
| | | |
| | 0:SCAN HDFS | |
| | table=predicate_propagation.census <b>#partitions=1/3</b> size=12B | |
| +------------------------------------------------------------------+</codeblock> |
| |
| <p rev="1.4.0"> |
| For a report of the volume of data that was actually read and processed at each stage of the query, check the output of the |
| <codeph>SUMMARY</codeph> command immediately after running the query. For a more detailed analysis, look at the output of the |
| <codeph>PROFILE</codeph> command; it includes this same summary report near the start of the profile output. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="partition_pruning_sql"> |
| |
| <title>What SQL Constructs Work with Partition Pruning</title> |
| |
| <conbody> |
| |
| <p rev="1.2.2"> |
| <indexterm audience="hidden">predicate propagation</indexterm> |
| Impala can even do partition pruning in cases where the partition key column is not directly compared to a constant, by applying |
| the transitive property to other parts of the <codeph>WHERE</codeph> clause. This technique is known as predicate propagation, and |
| is available in Impala 1.2.2 and later. In this example, the census table includes another column indicating when the data was |
| collected, which happens in 10-year intervals. Even though the query does not compare the partition key column |
| (<codeph>YEAR</codeph>) to a constant value, Impala can deduce that only the partition <codeph>YEAR=2010</codeph> is required, and |
| again only reads 1 out of 3 partitions. |
| </p> |
| |
| <codeblock rev="1.2.2">[localhost:21000] > drop table census; |
| [localhost:21000] > create table census (name string, census_year int) partitioned by (year int); |
| [localhost:21000] > insert into census partition (year=2010) values ('Smith',2010),('Jones',2010); |
| [localhost:21000] > insert into census partition (year=2011) values ('Smith',2020),('Jones',2020),('Doe',2020); |
| [localhost:21000] > insert into census partition (year=2012) values ('Smith',2020),('Doe',2020); |
| [localhost:21000] > select name from census where year = census_year and census_year=2010; |
| +-------+ |
| | name | |
| +-------+ |
| | Smith | |
| | Jones | |
| +-------+ |
| [localhost:21000] > explain select name from census <b>where year = census_year and census_year=2010</b>; |
| +------------------------------------------------------------------+ |
| | Explain String | |
| +------------------------------------------------------------------+ |
| | PLAN FRAGMENT 0 | |
| | PARTITION: UNPARTITIONED | |
| | | |
| | 1:EXCHANGE | |
| | | |
| | PLAN FRAGMENT 1 | |
| | PARTITION: RANDOM | |
| | | |
| | STREAM DATA SINK | |
| | EXCHANGE ID: 1 | |
| | UNPARTITIONED | |
| | | |
| | 0:SCAN HDFS | |
| | table=predicate_propagation.census <b>#partitions=1/3</b> size=22B | |
| | predicates: census_year = 2010, year = census_year | |
| +------------------------------------------------------------------+ |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/partitions_and_views"/> |
| |
| <p |
| conref="../shared/impala_common.xml#common/analytic_partition_pruning_caveat"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="dynamic_partition_pruning"> |
| |
| <title>Dynamic Partition Pruning</title> |
| |
| <conbody> |
| |
| <p> |
| The original mechanism uses to prune partitions is <term>static partition pruning</term>, in which the conditions in the |
| <codeph>WHERE</codeph> clause are analyzed to determine in advance which partitions can be safely skipped. In <keyword keyref="impala25_full"/> |
| and higher, Impala can perform <term>dynamic partition pruning</term>, where information about the partitions is collected during |
| the query, and Impala prunes unnecessary partitions in ways that were impractical to predict in advance. |
| </p> |
| |
| <p> |
| For example, if partition key columns are compared to literal values in a <codeph>WHERE</codeph> clause, Impala can perform static |
| partition pruning during the planning phase to only read the relevant partitions: |
| </p> |
| |
| <codeblock> |
| -- The query only needs to read 3 partitions whose key values are known ahead of time. |
| -- That's static partition pruning. |
| SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015); |
| </codeblock> |
| |
| <p> |
| Dynamic partition pruning involves using information only available |
| at run time, such as the result of a subquery. The following example |
| shows a simple dynamic partition pruning. |
| </p> |
| |
| <codeblock conref="../shared/impala_common.xml#common/simple_dpp_example"/> |
| |
| <p> |
| In the above example, Impala evaluates the subquery, sends the |
| subquery results to all Impala nodes participating in the query, and |
| then each <cmdname>impalad</cmdname> daemon uses the dynamic partition |
| pruning optimization to read only the partitions with the relevant key |
| values. |
| </p> |
| |
| <p> |
| The output query plan from the <codeph>EXPLAIN</codeph> statement |
| shows that runtime filters are enabled. The plan also shows that it |
| expects to read all 5 partitions of the <codeph>yy</codeph> table, |
| indicating that static partition pruning will not happen. |
| </p> |
| |
| <p> |
| The Filter summary in the <codeph>PROFILE</codeph> output shows that |
| the scan node filtered out based on a runtime filter of dynamic |
| partition pruning. |
| </p> |
| |
| <codeblock>Filter 0 (1.00 MB): |
| - Files processed: 3 |
| - <b>Files rejected: 1 (1)</b> |
| - Files total: 3 (3) |
| </codeblock> |
| |
| <p> |
| Dynamic partition pruning is especially effective for queries involving joins of several large partitioned tables. Evaluating the |
| <codeph>ON</codeph> clauses of the join predicates might normally require reading data from all partitions of certain tables. If |
| the <codeph>WHERE</codeph> clauses of the query refer to the partition key columns, Impala can now often skip reading many of the |
| partitions while evaluating the <codeph>ON</codeph> clauses. The dynamic partition pruning optimization reduces the amount of I/O |
| and the amount of intermediate data stored and transmitted across the network during the query. |
| </p> |
| |
| <p |
| conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/> |
| |
| <p> |
| Dynamic partition pruning is part of the runtime filtering feature, which applies to other kinds of queries in addition to queries |
| against partitioned tables. See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for full details about this feature. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="partition_key_columns"> |
| |
| <title>Partition Key Columns</title> |
| |
| <conbody> |
| |
| <p> |
| The columns you choose as the partition keys should be ones that are frequently used to filter query results in important, |
| large-scale queries. Popular examples are some combination of year, month, and day when the data has associated time values, and |
| geographic region when the data is associated with some place. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| For time-based data, split out the separate parts into their own columns, because Impala cannot partition based on a |
| <codeph>TIMESTAMP</codeph> column. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| The data type of the partition columns does not have a significant effect on the storage required, because the values from those |
| columns are not stored in the data files, rather they are represented as strings inside HDFS directory names. |
| </p> |
| </li> |
| |
| <li rev="IMPALA-2499"> |
| <p> |
| In <keyword keyref="impala25_full"/> and higher, you can enable the <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph> query option to speed up |
| queries that only refer to partition key columns, such as <codeph>SELECT MAX(year)</codeph>. This setting is not enabled by |
| default because the query behavior is slightly different if the table contains partition directories without actual data inside. |
| See <xref href="impala_optimize_partition_key_scans.xml#optimize_partition_key_scans"/> for details. |
| </p> |
| </li> |
| |
| <li> |
| <p |
| conref="../shared/impala_common.xml#common/complex_types_partitioning"/> |
| </li> |
| |
| <li> |
| <p> |
| Remember that when Impala queries data stored in HDFS, it is most efficient to use multi-megabyte files to take advantage of the |
| HDFS block size. For Parquet tables, the block size (and ideal size of the data files) is <ph rev="parquet_block_size">256 MB in |
| Impala 2.0 and later</ph>. Therefore, avoid specifying too many partition key columns, which could result in individual |
| partitions containing only small amounts of data. For example, if you receive 1 GB of data per day, you might partition by year, |
| month, and day; while if you receive 5 GB of data per minute, you might partition by year, month, day, hour, and minute. If you |
| have data with a geographic component, you might partition based on postal code if you have many megabytes of data for each |
| postal code, but if not, you might partition by some larger region such as city, state, or country. state |
| </p> |
| </li> |
| </ul> |
| |
| <p conref="../shared/impala_common.xml#common/partition_key_optimization"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="mixed_format_partitions"> |
| |
| <title>Setting Different File Formats for Partitions</title> |
| |
| <conbody> |
| |
| <p> |
| Partitioned tables have the flexibility to use different file formats for different partitions. (For background information about |
| the different file formats Impala supports, see <xref href="impala_file_formats.xml#file_formats"/>.) For example, if you originally |
| received data in text format, then received new data in RCFile format, and eventually began receiving data in Parquet format, all |
| that data could reside in the same table for queries. You just need to ensure that the table is structured so that the data files |
| that use different file formats reside in separate partitions. |
| </p> |
| |
| <p> |
| For example, here is how you might switch from text to Parquet data as you receive data for different years: |
| </p> |
| |
| <codeblock>[localhost:21000] > create table census (name string) partitioned by (year smallint); |
| [localhost:21000] > alter table census add partition (year=2012); -- Text format; |
| |
| [localhost:21000] > alter table census add partition (year=2013); -- Text format switches to Parquet before data loaded; |
| [localhost:21000] > alter table census partition (year=2013) set fileformat parquet; |
| |
| [localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Jones'),('Lee'),('Singh'); |
| [localhost:21000] > insert into census partition (year=2013) values ('Flores'),('Bogomolov'),('Cooper'),('Appiah');</codeblock> |
| |
| <p> |
| At this point, the HDFS directory for <codeph>year=2012</codeph> contains a text-format data file, while the HDFS directory for |
| <codeph>year=2013</codeph> contains a Parquet data file. As always, when loading non-trivial data, you would use <codeph>INSERT ... |
| SELECT</codeph> or <codeph>LOAD DATA</codeph> to import data in large batches, rather than <codeph>INSERT ... VALUES</codeph> which |
| produces small files that are inefficient for real-world queries. |
| </p> |
| |
| <p> |
| For other file types that Impala cannot create natively, you can switch into Hive and issue the <codeph>ALTER TABLE ... SET |
| FILEFORMAT</codeph> statements and <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statements there. After switching back to |
| Impala, issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement so that Impala recognizes any partitions or new |
| data added through Hive. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="partition_management"> |
| |
| <title>Managing Partitions</title> |
| |
| <conbody> |
| |
| <p> |
| You can add, drop, set the expected file format, or set the HDFS location of the data files for individual partitions within an |
| Impala table. See <xref href="impala_alter_table.xml#alter_table"/> for syntax details, and |
| <xref href="impala_partitioning.xml#mixed_format_partitions"/> for tips on managing tables containing partitions with different file |
| formats. |
| </p> |
| |
| <note |
| conref="../shared/impala_common.xml#common/add_partition_set_location"/> |
| |
| <p> |
| What happens to the data files when a partition is dropped depends on whether the partitioned table is designated as internal or |
| external. For an internal (managed) table, the data files are deleted. For example, if data in the partitioned table is a copy of |
| raw data files stored elsewhere, you might save disk space by dropping older partitions that are no longer required for reporting, |
| knowing that the original data is still available if needed later. For an external table, the data files are left alone. For |
| example, dropping a partition without deleting the associated files lets Impala consider a smaller set of partitions, improving |
| query efficiency and reducing overhead for DDL operations on the table; if the data is needed again later, you can add the partition |
| again. See <xref href="impala_tables.xml#tables" /> for details and examples. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="kudu 2.8.0" id="partition_kudu"> |
| |
| <title>Using Partitioning with Kudu Tables</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Kudu"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Kudu tables use a more fine-grained partitioning scheme than tables containing HDFS data files. You specify a <codeph>PARTITION |
| BY</codeph> clause with the <codeph>CREATE TABLE</codeph> statement to identify how to divide the values from the partition key |
| columns. |
| </p> |
| |
| <p> |
| See <xref href="impala_kudu.xml#kudu_partitioning"/> for |
| details and examples of the partitioning techniques |
| for Kudu tables. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="partition_stats"> |
| <title>Keeping Statistics Up to Date for Partitioned Tables</title> |
| <conbody> |
| |
| <p> |
| Because the <codeph>COMPUTE STATS</codeph> statement can be resource-intensive to run on a partitioned table |
| as new partitions are added, Impala includes a variation of this statement that allows computing statistics |
| on a per-partition basis such that stats can be incrementally updated when new partitions are added. |
| </p> |
| |
| <note type="important"> |
| <p conref="../shared/impala_common.xml#common/cs_or_cis"/> |
| <p |
| conref="../shared/impala_common.xml#common/incremental_stats_after_full"/> |
| <p conref="../shared/impala_common.xml#common/incremental_stats_caveats"/> |
| </note> |
| |
| <p rev="2.1.0"> |
| The <codeph>COMPUTE INCREMENTAL STATS</codeph> variation computes statistics only for partitions that were |
| added or changed since the last <codeph>COMPUTE INCREMENTAL STATS</codeph> statement, rather than the entire |
| table. It is typically used for tables where a full <codeph>COMPUTE STATS</codeph> |
| operation takes too long to be practical each time a partition is added or dropped. See |
| <xref href="impala_perf_stats.xml#perf_stats_incremental"/> for full usage details. |
| </p> |
| |
| <codeblock conref="../shared/impala_common.xml#common/compute_stats_walkthrough"/> |
| |
| </conbody> |
| </concept> |
| |
| </concept> |