<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="ver" id="new_features">
<title><ph audience="standalone">New Features in Apache Impala</ph><ph audience="integrated">What's New in Apache Impala</ph></title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Release Notes"/>
<data name="Category" value="New Features"/>
<data name="Category" value="What's New"/>
<data name="Category" value="Getting Started"/>
<data name="Category" value="Upgrading"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
This release of Impala contains the following changes and enhancements from previous releases.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept rev="3.2.0" id="new_features_33">
<title>New Features in <keyword keyref="impala33"/></title>
<conbody>
<p> The following sections describe the noteworthy improvements made in
<keyword keyref="impala33"/>. </p>
<p> For the full list of issues closed in this release, see the <xref
keyref="changelog_33">changelog for <keyword keyref="impala33"
/></xref>. </p>
<section id="section_ezf_tnq_s3b">
<title>Increased Compatibility with Apache Projects</title>
<p>Impala is integrated with the following components:<ul>
<li dir="ltr">
<p dir="ltr">Apache Ranger: Use Apache Ranger to manage
authorization in Impala. See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_authorization.html"
format="html" scope="external"><u>Impala
Authorization</u></xref> for details.</p>
</li>
<li dir="ltr">
<p dir="ltr">Apache Atlas: Use Apache Atlas to manage data
governance in Impala.</p>
</li>
<li dir="ltr">
<p dir="ltr">Hive 3</p>
</li>
</ul></p>
</section>
<section id="section_ys5_k4n_t3b">
<title>Parquet Page Index </title>
<p>To improve performance when using Parquet files, Impala can now write
        page indexes in Parquet files and use those indexes to skip pages for
        faster scans.</p>
<p>See <xref href="impala_parquet.xml#parquet_performance"/> for
details.</p>
</section>
<section id="section_zs5_k4n_t3b">
<title>The Remote File Handle Cache Supports S3</title>
<p>Impala can now cache remote file handles for tables that
        store their data in Amazon S3 cloud storage.</p>
      <p>See <xref href="impala_scalability.xml#scalability_file_handle_cache"
        /> for information on the remote file handle cache.</p>
</section>
<section id="section_jls_hxj_s3b">
<title>Support for Kudu Integrated with Hive Metastore</title>
<p>In Impala 3.3 and Kudu 1.10, Kudu is integrated with the Hive Metastore
        (HMS), and from Impala, you can create, update, delete, and query
        tables in Kudu services that are integrated with HMS.</p>
<p>See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_kudu.html"
format="html" scope="external">Using Kudu with Impala</xref> for
information on using Kudu tables in Impala.</p>
</section>
<section id="section_dp4_mxj_s3b">
<title>Zstd Compression for Parquet files</title>
<p>Zstandard (Zstd) is a real-time compression algorithm offering a
        tradeoff between compression speed and compression ratio. Compression
        levels from 1 through 22 are supported. The lower the level, the
        faster the compression at the cost of compression ratio.</p>
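      <p>As a brief sketch (the table names are illustrative only, and the
        <codeph>ZSTD:level</codeph> form of the query option value is assumed
        from the level range described above), a compression level can be
        selected through the <codeph>COMPRESSION_CODEC</codeph> query option
        when writing Parquet files:</p>
      <codeblock>-- Write Parquet data files compressed with Zstd at level 12.
set COMPRESSION_CODEC=ZSTD:12;
create table sales_zstd stored as parquet as select * from sales;</codeblock>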
</section>
<section id="section_parquet_lz4_notes">
<title>Lz4 Compression for Parquet files</title>
<p>Lz4 is a lossless compression algorithm providing extremely fast
and scalable compression and decompression.</p>
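      <p>A minimal sketch of selecting Lz4 for Parquet writes through the same
        query option (the table names are illustrative only):</p>
      <codeblock>set COMPRESSION_CODEC=LZ4;
insert into sales_lz4 select * from sales;</codeblock>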
</section>
<section id="section_drv_nxj_s3b">
<title>Data Cache for Remote Reads</title>
<p>To improve performance on multi-cluster HDFS environments as well as
on object store environments, Impala now caches data for non-local
reads (e.g. S3, ABFS, ADLS) on local storage.</p>
<p>The data cache is enabled with the <codeph>--data_cache</codeph>
        startup flag.</p>
<p>See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_data_cache.html"
format="html" scope="external">Impala Remote Data Cache</xref> for
the information and steps to enable remote data cache.</p>
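      <p>As a sketch, assuming a local cache directory of
        <codeph>/data/impala/datacache</codeph> and a 500 GB capacity quota
        (both values are illustrative only), each <codeph>impalad</codeph>
        could be started with a flag of the form:</p>
      <codeblock>--data_cache=/data/impala/datacache:500GB</codeblock>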
</section>
<section id="section_xp4_b1f_t3b">
<title>Metadata Performance Improvements </title>
<p>The following features to improve metadata performance are enabled by
default in this release:</p>
<ul>
<li>
<p>Incremental stats are now compressed in memory in
              <codeph>catalogd</codeph>, reducing the memory footprint in
              <codeph>catalogd</codeph>.</p>
</li>
<li>
<p><codeph>impalad</codeph> coordinators fetch incremental stats from
<codeph>catalogd</codeph> on-demand, reducing the memory
footprint and the network requirements for broadcasting
metadata.</p>
</li>
<li>
<p>Time-based and memory-based automatic invalidation of metadata to
keep the size of metadata bounded and to reduce the chances of
<codeph>catalogd</codeph> cache running out of memory.</p>
</li>
<li>
<p>Automatic invalidation of metadata</p>
<p>With automatic metadata management enabled, you no longer have to
issue <codeph>INVALIDATE</codeph> / <codeph>REFRESH</codeph> in a
number of conditions.</p>
<p>In Impala 3.3, the following additional event in Hive Metastore
            can trigger automatic INVALIDATE / REFRESH of metadata:</p>
<ul>
<li>
<p>INSERT into tables and partitions from Impala or from Spark,
                in the same cluster or in a multi-cluster configuration</p>
</li>
</ul>
</li>
</ul>
<p>See <xref href="impala_metadata.xml#impala_metadata"/> for the
information on the above features.</p>
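      <p>As a sketch, the event-based and time-based invalidation behaviors are
        controlled by startup flags on <codeph>catalogd</codeph> and the
        coordinator <codeph>impalad</codeph> daemons; the specific values below
        are illustrative only:</p>
      <codeblock>--hms_event_polling_interval_s=2      # Poll HMS notification events every 2 seconds.
--invalidate_tables_timeout_s=3600    # Invalidate tables not accessed for an hour.</codeblock>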
</section>
<section id="section_ztf_c4q_s3b">
<title>Scalable Pool Configuration in Admission Controller</title>
<p>To offer more dynamic and flexible resource management, Impala
        supports new configuration parameters that scale with the number
        of hosts in the resource pool. You can use these parameters to control
        the number of running queries, the number of queued queries, and the
        maximum amount of memory allocated for Impala resource pools. See <xref
          href="impala_admission.xml#admission_control"/> for information
        about the new parameters and how to use them for admission control.</p>
</section>
<section id="section_b55_gxj_s3b">
<title>Query Profile</title>
<p>The following information was added to the Query Profile output for
better monitoring and troubleshooting of query performance.</p>
<ul>
<li>
<p>Network I/O throughput</p>
</li>
<li>
<p>System disk I/O throughput</p>
</li>
</ul>
<p>See <xref
          href="https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html"
          format="html" scope="external">Impala Query Profile</xref> for
        information on generating and reading query profiles.</p>
</section>
<section id="section_lbh_kzj_s3b">
<title>DATE Data Type and Functions</title>
<p>You can use the new DATE type to describe a particular
        year/month/day, in the form YYYY-MM-DD.</p>
      <p>This initial DATE type support covers the TEXT and Parquet file
        formats as well as HBase tables.</p>
<p>The support of DATE data type includes the following features:</p>
<ul>
<li><codeph>DATE</codeph> type column as a partitioning key
column</li>
<li><codeph>DATE</codeph> literal</li>
<li>Implicit casting between <codeph>DATE</codeph> and other types:
<codeph>STRING</codeph> and <codeph>TIMESTAMP</codeph></li>
<li>Most of the built-in functions for <codeph>TIMESTAMP</codeph> now
          accept <codeph>DATE</codeph> type arguments as well.</li>
</ul>
<p>See <xref href="impala_date.xml#date"/> and <xref
href="impala_datetime_functions.xml#datetime_functions"/> for using
the DATE type.</p>
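      <p>A minimal sketch of the type in DDL, DML, and queries; the table and
        column names are illustrative only:</p>
      <codeblock>create table events (id bigint, name string)
  partitioned by (event_date date) stored as textfile;

-- DATE literal, written into a DATE partition key column.
insert into events partition (event_date)
  values (1, 'signup', date '2019-08-30');

select count(*) from events
  where event_date between date '2019-01-01' and date '2019-12-31';</codeblock>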
</section>
<section id="section_wpm_zzj_s3b">
<title>Support Hive Insert-Only Transactional Tables</title>
<p>Impala added support for creating, dropping, querying, and inserting
        into insert-only transactional tables.</p>
<p>See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_transactions.html"
format="html" scope="external">Impala Transactions</xref> for
details.</p>
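      <p>A minimal sketch of creating and using such a table; the table name is
        illustrative only, and the table properties shown follow the Hive
        conventions for insert-only transactional tables:</p>
      <codeblock>create table ledger (id bigint, amount decimal(10,2))
  stored as parquet
  tblproperties ('transactional'='true',
                 'transactional_properties'='insert_only');

insert into ledger values (1, 99.95);
select sum(amount) from ledger;</codeblock>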
</section>
<section id="section_ab2_41k_s3b">
<title>HiveServer2 HTTP Connection for Clients</title>
<p>Client applications can now connect to Impala over HTTP via
        HiveServer2, with the option to use Kerberos SPNEGO or LDAP for
        authentication. See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_client.html"
format="html" scope="external">Impala Clients</xref> for
details.</p>
</section>
<section id="section_xxt_44q_s3b">
<title>Default File Format Changed to Parquet</title>
<p>When you create a table, the default format for that table data is
now Parquet.</p>
<p>For backward compatibility, you can use the
        <codeph>DEFAULT_FILE_FORMAT</codeph> query option to set the default
        file format back to the previous default, text, or to other formats.</p>
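      <p>For example, to keep creating text tables without spelling out
        <codeph>STORED AS TEXTFILE</codeph> in each statement (the table name
        is illustrative only):</p>
      <codeblock>set DEFAULT_FILE_FORMAT=TEXT;
create table t1 (x int);   -- Created as a text table rather than Parquet.</codeblock>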
</section>
<section id="section_m1h_mnf_t3b">
<title>Built-in Function to Process JSON Objects</title>
<p>The <codeph>GET_JSON_OBJECT()</codeph> function extracts a JSON object
        from a string based on the specified path and returns the extracted
        JSON object.</p>
<p>See <xref href="impala_misc_functions.xml#misc_functions">Impala
Miscellaneous Functions</xref>. for details.</p>
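      <p>A brief sketch of the function; the JSON string and path are
        illustrative only:</p>
      <codeblock>select get_json_object('{"name":"Impala","version":{"major":3,"minor":3}}',
                       '$.version.major');
-- Returns: 3</codeblock>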
</section>
<section id="section_acs_wck_s3b">
<title>Ubuntu 18.04</title>
<p>This version of Impala is certified to run on Ubuntu 18.04.</p>
</section>
</conbody>
</concept>
<concept rev="3.2.0" id="new_features_32">
<title>New Features in <keyword keyref="impala32"/></title>
<conbody>
<p> The following sections describe the noteworthy improvements made in
<keyword keyref="impala32"/>. </p>
<p> For the full list of issues closed in this release, see the <xref
keyref="changelog_32">changelog for <keyword keyref="impala32"
/></xref>. </p>
</conbody>
<concept id="rn_32_multi_cluster">
<title>Multi-cluster Support</title>
<conbody>
<ul>
<li dir="ltr">Remote File Handle Cache<p>Impala can now cache remote
HDFS file handles when the
<codeph>cache_remote_file_handles</codeph> impalad flag is set
to <codeph>true</codeph>. This feature does not apply to non-HDFS
            tables, such as Kudu or HBase tables, and does not apply to
            tables that store their data on cloud services, such as S3 or
            ADLS. See <xref
              href="https://impala.apache.org/docs/build/html/topics/impala_scalability.html"
              format="html" scope="external">Scalability Considerations</xref>
            for information on file handle caching in Impala.</p></li>
</ul>
</conbody>
</concept>
<concept id="rn_32_ac">
<title>Enhancements in Resource Management and Admission Control</title>
<conbody>
<ul>
<li>Admission Debug page is available in <xref
href="https://impala.apache.org/docs/build/html/topics/impala_webui.html"
format="html" scope="external">Impala Daemon (impalad) web
            UI</xref> at <codeph>/admission</codeph> and provides the
following information about Impala resource pools:<ul>
<li>Pool configuration</li>
<li>Relevant pool stats</li>
<li>Queued queries in order of being queued (local to the
coordinator)</li>
<li>Running queries (local to this coordinator)</li>
<li>Histogram of the distribution of peak memory usage by admitted
queries</li>
</ul></li>
</ul>
<ul>
<li>A new query option, <xref
href="https://impala.apache.org/docs/build/html/topics/impala_num_rows_produced_limit.html"
format="html" scope="external">NUM_ROWS_PRODUCED_LIMIT</xref>, was
added to limit the number of rows returned from queries.<p>Impala
will cancel a query if the query produces more rows than the limit
specified by this query option. The limit applies only when the
results are returned to a client, e.g. for a
<codeph>SELECT</codeph> query, but not an
<codeph>INSERT</codeph> query. This query option is a guardrail
against users accidentally submitting queries that return a large
number of rows.</p></li>
</ul>
</conbody>
</concept>
<concept id="rn_32_metadata">
<title>Metadata Performance Improvements</title>
<conbody>
<ul>
<li><xref
href="https://impala.apache.org/docs/build/html/topics/impala_metadata.html"
format="html" scope="external">Automatic Metadata Sync using Hive
Metastore Notification Events</xref><p>When enabled, the
<codeph>catalogd</codeph> polls Hive Metastore (HMS)
            notification events at a configurable interval and syncs with
HMS. You can use the new web UI pages of the
<codeph>catalogd</codeph> to check the state of the automatic
invalidate event processor. </p><p><b>Note</b>: This is a preview
feature in <keyword keyref="impala32">Impala
3.2</keyword>.</p></li>
</ul>
</conbody>
</concept>
<concept id="rn_32_usability">
<title>Compatibility and Usability Enhancements</title>
<conbody>
<ul>
<li>Impala can now read the <codeph>TIMESTAMP_MILLIS</codeph> and
<codeph>TIMESTAMP_MICROS</codeph> Parquet types. See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_parquet.html"
format="html" scope="external">Using Parquet File Format for
Impala Tables</xref> for the Parquet support in Impala.</li>
<li>Impala can now read the complex types in ORC such as ARRAY,
STRUCT, and MAP. See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_orc.html"
format="html" scope="external">Using ORC File Format for Impala
Tables</xref> for the ORC support in Impala.</li>
<li>The <xref
href="https://impala.apache.org/docs/build/html/topics/impala_string_functions.html"
format="html" scope="external">LEVENSHTEIN</xref> string function
          is supported.<p>The function returns the Levenshtein distance
            between two input strings, that is, the minimum number of
            single-character edits required to transform one string into the
            other. A brief usage sketch appears after this list.</p></li>
<li>The <codeph>IF NOT EXISTS</codeph> clause is supported in the
<xref
href="https://impala.apache.org/docs/build/html/topics/impala_alter_table.html"
format="html" scope="external"><codeph>ALTER TABLE</codeph></xref>
statement.</li>
<li>The new <xref
href="https://impala.apache.org/docs/build/html/topics/impala_default_file_format.html"
format="html" scope="external"
><codeph>DEFAULT_FILE_FORMAT</codeph></xref> query option allows
you to set the default table file format. This removes the need for
the <codeph>STORED AS &lt;format></codeph> clause. Set this option
if you prefer a value that is not <codeph>TEXT</codeph>. The
supported formats are: <ul>
<li><codeph>TEXT</codeph></li>
<li><codeph>RC_FILE</codeph></li>
<li><codeph>SEQUENCE_FILE</codeph></li>
<li><codeph>AVRO</codeph></li>
<li><codeph>PARQUET</codeph></li>
<li><codeph>KUDU</codeph></li>
<li><codeph>ORC</codeph></li>
</ul></li>
<li>The extended or verbose <xref
href="https://impala.apache.org/docs/build/html/topics/impala_explain.html"
format="html" scope="external"><codeph>EXPLAIN</codeph></xref>
output includes the following new information for queries:<ul>
<li>The text of the analyzed query that may have been rewritten to
include various optimizations and implicit casts. </li>
<li>The implicit casts and literals shown with the actual
types.</li>
</ul></li>
<li>CPU resource utilization (user, system, iowait) metrics were added
to the <xref
href="https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html"
format="html" scope="external">Impala profile</xref> output.</li>
</ul>
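      <p>A brief sketch of the <codeph>LEVENSHTEIN</codeph> function mentioned
        in the list above; the argument values are illustrative only:</p>
      <codeblock>select levenshtein('kitten', 'sitting');
-- Returns: 3 (substitute k->s, substitute e->i, append g)</codeblock>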
</conbody>
</concept>
<concept id="rn_32_security">
<title><b id="docs-internal-guid-e1c558d3-7fff-4d4e-0ec1-e40f60c9b64a"
><b>Security Enhancement</b></b></title>
<conbody>
<ul>
<li>The <xref
href="https://impala.apache.org/docs/build/html/topics/impala_refresh_authorization.html"
format="html" scope="external">REFRESH AUTHORIZATION</xref>
statement was implemented for refreshing authorization data.</li>
</ul>
</conbody>
</concept>
</concept>
<!-- All 3.1.x new features go under here -->
<concept rev="3.1.0" id="new_features_31">
<title>New Features in <keyword keyref="impala31"/></title>
<conbody>
<p> For the full list of issues closed in this release, including the
issues marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_31">changelog for <keyword keyref="impala31"
/></xref>. </p>
</conbody>
</concept>
<!-- All 3.0.x new features go under here -->
<concept rev="3.0.0" id="new_features_300">
<title>New Features in <keyword keyref="impala30"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the
issues marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_300">changelog for <keyword keyref="impala30"
/></xref>.
</p>
</conbody>
</concept>
<!-- All 2.12.x new features go under here -->
<concept rev="2.12.0" id="new_features_2120">
<title>New Features in <keyword keyref="impala212_full"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the issues
marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_212">changelog for <keyword keyref="impala212"/></xref>.
</p>
</conbody>
</concept>
<!-- All 2.11.x new features go under here -->
<concept rev="2.11.0" id="new_features_2110">
<title>New Features in <keyword keyref="impala211_full"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the issues
marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_211">changelog for <keyword keyref="impala211"/></xref>.
</p>
</conbody>
</concept>
<!-- All 2.10.x new features go under here -->
<concept rev="2.10.0" id="new_features_2100">
<title>New Features in <keyword keyref="impala210_full"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the issues
marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_210">changelog for <keyword keyref="impala210"/></xref>.
</p>
</conbody>
</concept>
<!-- All 2.9.x new features go under here -->
<concept rev="2.9.0" id="new_features_290">
<title>New Features in <keyword keyref="impala29_full"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the issues
marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_29">changelog for <keyword keyref="impala29"/></xref>.
</p>
<p>
The following are some of the most significant new features in this release:
</p>
<ul id="feature_list">
<li>
<p rev="IMPALA-4729">
A new function, <codeph>replace()</codeph>, which is faster than
<codeph>regexp_replace()</codeph> for simple string substitutions.
See <xref keyref="string_functions"/> for details.
</p>
</li>
<li>
<p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503">
Startup flags for the <cmdname>impalad</cmdname> daemon, <codeph>is_executor</codeph>
and <codeph>is_coordinator</codeph>, let you divide the work on a large, busy cluster
between a small number of hosts acting as query coordinators, and a larger number of
hosts acting as query executors. By default, each host can act in both roles,
potentially introducing bottlenecks during heavily concurrent workloads.
See <xref keyref="scalability_coordinator"/> for details.
</p>
</li>
</ul>
</conbody>
</concept>
<!-- All 2.8.x new features go under here -->
<concept rev="2.8.0" id="new_features_280">
<title>New Features in <keyword keyref="impala28_full"/></title>
<conbody>
<ul id="feature_list">
<li>
<p>
Performance and scalability improvements:
</p>
<ul>
<li>
<p rev="IMPALA-4572">
The <codeph>COMPUTE STATS</codeph> statement can
take advantage of multithreading.
</p>
</li>
<li>
<p rev="IMPALA-4135">
Improved scalability for highly concurrent loads by reducing the possibility of TCP/IP timeouts.
A configuration setting, <codeph>accepted_cnxn_queue_depth</codeph>, can be adjusted upwards to
avoid this type of timeout on large clusters.
</p>
</li>
<li>
<p>
Several performance improvements were made to the mechanism for generating native code:
</p>
<ul>
<li>
<p rev="IMPALA-3638">
Some queries involving analytic functions can take better advantage of native code generation.
</p>
</li>
<li>
<p rev="IMPALA-4008">
Modules produced during intermediate code generation are organized
to be easier to cache and reuse during the lifetime of a long-running or complicated query.
</p>
</li>
<li>
<p rev="IMPALA-4397 IMPALA-1430">
The <codeph>COMPUTE STATS</codeph> statement is more efficient
(less time for the codegen phase) for tables with a large number
of columns, especially for tables containing <codeph>TIMESTAMP</codeph>
columns.
</p>
</li>
<li>
<p rev="IMPALA-3838 IMPALA-4495">
The logic for determining whether or not to use a runtime filter is more reliable, and the
evaluation process itself is faster because of native code generation.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-3902">
The <codeph>MT_DOP</codeph> query option enables
multithreading for a number of Impala operations.
<codeph>COMPUTE STATS</codeph> statements for Parquet tables
use a default of <codeph>MT_DOP=4</codeph> to improve the
intra-node parallelism and CPU efficiency of this data-intensive
operation.
</p>
</li>
<li>
<p rev="IMPALA-4397">
The <codeph>COMPUTE STATS</codeph> statement is more efficient
(less time for the codegen phase) for tables with a large number
of columns.
</p>
</li>
<li>
<p rev="IMPALA-2521">
A new hint, <codeph>CLUSTERED</codeph>,
allows Impala <codeph>INSERT</codeph> operations on a Parquet table
that use dynamic partitioning to process a high number of
partitions in a single statement. The data is ordered based on the
partition key columns, and each partition is only written
by a single host, reducing the amount of memory needed to buffer
Parquet data while the data blocks are being constructed.
</p>
</li>
<li>
<p rev="IMPALA-3552">
The new configuration setting <codeph>inc_stats_size_limit_bytes</codeph>
lets you reduce the load on the catalog server when running the
<codeph>COMPUTE INCREMENTAL STATS</codeph> statement for very large tables.
</p>
</li>
<li>
<p rev="IMPALA-1788">
Impala folds many constant expressions within query statements,
rather than evaluating them for each row. This optimization
is especially useful when using functions to manipulate and
format <codeph>TIMESTAMP</codeph> values, such as the result
of an expression such as <codeph>to_date(now() - interval 1 day)</codeph>.
</p>
</li>
<li>
<p rev="IMPALA-4529">
Parsing of complicated expressions is faster. This speedup is
especially useful for queries containing large <codeph>CASE</codeph>
expressions.
</p>
</li>
<li>
<p rev="IMPALA-4302">
Evaluation is faster for <codeph>IN</codeph> operators with many constant
arguments. The same performance improvement applies to other functions
with many constant arguments.
</p>
</li>
<li>
<p rev="IMPALA-1286">
Impala optimizes identical comparison operators within multiple <codeph>OR</codeph>
blocks.
</p>
</li>
<li>
<p rev="IMPALA-4193 IMPALA-3342">
The reporting for wall-clock times and total CPU time in profile output is more accurate.
</p>
</li>
<li>
<p rev="IMPALA-3671">
A new query option, <codeph>SCRATCH_LIMIT</codeph>, lets you restrict the amount of
space used when a query exceeds the memory limit and activates the <q>spill to disk</q> mechanism.
This option helps to avoid runaway queries or make queries <q>fail fast</q> if they require more
memory than anticipated. You can prevent runaway queries from using excessive amounts of spill space,
without restarting the cluster to turn the spilling feature off entirely.
See <xref href="impala_scratch_limit.xml#scratch_limit"/> for details.
</p>
</li>
</ul>
</li>
<li>
<p>
Integration with Apache Kudu:
</p>
<ul>
<li>
<p rev="">
The experimental Impala support for the Kudu storage layer has been folded
into the main Impala development branch. Impala can now directly access Kudu tables,
opening up new capabilities such as enhanced DML operations and continuous ingestion.
</p>
</li>
<li>
<p rev="">
The <codeph>DELETE</codeph> statement is a flexible way to remove data from a Kudu table. Previously,
removing data from an Impala table involved removing or rewriting the underlying data files, dropping entire partitions,
or rewriting the entire table. This Impala statement only works for Kudu tables.
</p>
</li>
<li>
<p rev="">
The <codeph>UPDATE</codeph> statement is a flexible way to modify data within a Kudu table. Previously,
updating data in an Impala table involved replacing the underlying data files, dropping entire partitions,
or rewriting the entire table. This Impala statement only works for Kudu tables.
</p>
</li>
<li>
<p rev="IMPALA-3725">
            The <codeph>UPSERT</codeph> statement is a flexible way to ingest new data, modify existing data, or both, within a Kudu table. Previously,
ingesting data that might contain duplicates involved an inefficient multi-stage operation, and there was no
built-in protection against duplicate data. The <codeph>UPSERT</codeph> statement, in combination with
the primary key designation for Kudu tables, lets you add or replace rows in a single operation, and
automatically avoids creating any duplicate data.
</p>
</li>
<li>
<p rev="IMPALA-3719 IMPALA-3726">
The <codeph>CREATE TABLE</codeph> statement gains some new clauses that are specific to Kudu tables:
<codeph>PARTITION BY</codeph>, <codeph>PARTITIONS</codeph>, <codeph>STORED AS KUDU</codeph>, and column
attributes <codeph>PRIMARY KEY</codeph>, <codeph>NULL</codeph> and <codeph>NOT NULL</codeph>,
<codeph>ENCODING</codeph>, <codeph>COMPRESSION</codeph>, <codeph>DEFAULT</codeph>, and <codeph>BLOCK_SIZE</codeph>.
These clauses replace the explicit <codeph>TBLPROPERTIES</codeph> settings that were required in the
early experimental phases of integration between Impala and Kudu.
</p>
</li>
<li>
<p rev="IMPALA-2890">
The <codeph>ALTER TABLE</codeph> statement can change certain attributes of Kudu tables.
You can add, drop, or rename columns.
You can add or drop range partitions.
You can change the <codeph>TBLPROPERTIES</codeph> value to rename or point to a different underlying Kudu table,
independently from the Impala table name in the metastore database.
You cannot change the data type of an existing column in a Kudu table.
</p>
</li>
<li>
<p rev="IMPALA-4403">
The <codeph>SHOW PARTITIONS</codeph> statement displays information about the distribution of data
between partitions in Kudu tables. A new variation, <codeph>SHOW RANGE PARTITIONS</codeph>,
displays information about the Kudu-specific partitions that apply across ranges of key values.
</p>
</li>
<li>
<p rev="IMPALA-4379">
Not all Impala data types are supported in Kudu tables. In particular, currently the Impala
<codeph>TIMESTAMP</codeph> type is not allowed in a Kudu table. Impala does not recognize the
<codeph>UNIXTIME_MICROS</codeph> Kudu type when it is present in a Kudu table. (These two
representations of date/time data use different units and are not directly compatible.)
You cannot create columns of type <codeph>TIMESTAMP</codeph>, <codeph>DECIMAL</codeph>,
<codeph>VARCHAR</codeph>, or <codeph>CHAR</codeph> within a Kudu table. Within a query, you can
cast values in a result set to these types. Certain types, such as <codeph>BOOLEAN</codeph>,
cannot be used as primary key columns.
</p>
</li>
<li>
<p rev="">
Currently, Kudu tables are not interchangeable between Impala and Hive the way other kinds of Impala tables are.
Although the metadata for Kudu tables is stored in the metastore database, currently Hive cannot access Kudu tables.
</p>
</li>
<li>
<p rev="">
            The <codeph>INSERT</codeph> statement works for Kudu tables. The organization
            of Kudu data makes inserting data in small batches, such as with the
            <codeph>INSERT ... VALUES</codeph> syntax, more efficient than it is for HDFS-backed tables.
</p>
</li>
<li>
<p rev="IMPALA-4283">
Some audit data is recorded for data governance purposes.
All <codeph>UPDATE</codeph>, <codeph>DELETE</codeph>, and <codeph>UPSERT</codeph> statements are characterized
as <codeph>INSERT</codeph> operations in the audit log. Currently, lineage metadata is not generated for
<codeph>UPDATE</codeph> and <codeph>DELETE</codeph> operations on Kudu tables.
</p>
</li>
<li>
<p rev="IMPALA-4000">
Currently, Kudu tables have limited support for Sentry:
<ul>
<li>
<p>
Access to Kudu tables must be granted to roles as usual.
</p>
</li>
<li>
<p>
Currently, access to a Kudu table through Sentry is <q>all or nothing</q>.
You cannot enforce finer-grained permissions such as at the column level,
or permissions on certain operations such as <codeph>INSERT</codeph>.
</p>
</li>
<li>
<p>
Only users with <codeph>ALL</codeph> privileges on <codeph>SERVER</codeph> can create external Kudu tables.
</p>
</li>
</ul>
Because non-SQL APIs can access Kudu data without going through Sentry
authorization, currently the Sentry support is considered preliminary.
</p>
</li>
<li>
<p rev="IMPALA-4571">
Equality and <codeph>IN</codeph> predicates in Impala queries are pushed to
Kudu and evaluated efficiently by the Kudu storage layer.
</p>
</li>
</ul>
</li>
<li>
<p rev="">
<b>Security:</b>
</p>
<ul>
<li>
<p>
Impala can take advantage of the S3 encrypted credential
store, to avoid exposing the secret key when accessing
data stored on S3.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-1654">
[<xref keyref="IMPALA-1654">IMPALA-1654</xref>]
Several kinds of DDL operations
can now work on a range of partitions. The partitions can be specified
using operators such as <codeph>&lt;</codeph>, <codeph>&gt;=</codeph>, and
<codeph>!=</codeph> rather than just an equality predicate applying to a single
partition.
This new feature extends the syntax of several clauses
of the <codeph>ALTER TABLE</codeph> statement
(<codeph>DROP PARTITION</codeph>, <codeph>SET [UN]CACHED</codeph>,
<codeph>SET FILEFORMAT | SERDEPROPERTIES | TBLPROPERTIES</codeph>),
the <codeph>SHOW FILES</codeph> statement, and the
<codeph>COMPUTE INCREMENTAL STATS</codeph> statement.
It does not apply to statements that are defined to only apply to a single
partition, such as <codeph>LOAD DATA</codeph>, <codeph>ALTER TABLE ... ADD PARTITION</codeph>,
<codeph>SET LOCATION</codeph>, and <codeph>INSERT</codeph> with a static
partitioning clause.
</p>
</li>
<li>
<p rev="IMPALA-3973">
The <codeph>instr()</codeph> function has optional second and third arguments, representing
        the character position at which to begin searching for the substring, and the Nth occurrence
        of the substring to find.
</p>
</li>
<li>
<p rev="IMPALA-3441 IMPALA-4387">
Improved error handling for malformed Avro data. In particular, incorrect
precision or scale for <codeph>DECIMAL</codeph> types is now handled.
</p>
</li>
<li>
<p>
Impala debug web UI:
</p>
<ul>
<li>
<p rev="IMPALA-1169">
In addition to <q>inflight</q> and <q>finished</q> queries, the web UI
now also includes a section for <q>queued</q> queries.
</p>
</li>
<li>
<p rev="IMPALA-4048">
The <uicontrol>/sessions</uicontrol> tab now clarifies how many of the displayed
            sessions are active, and lets you sort by <uicontrol>Expired</uicontrol> status
to distinguish active sessions from expired ones.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-4020">
Improved stability when DDL operations such as <codeph>CREATE DATABASE</codeph>
or <codeph>DROP DATABASE</codeph> are run in Hive at the same time as an Impala
<codeph>INVALIDATE METADATA</codeph> statement.
</p>
</li>
<li>
<p rev="IMPALA-1616">
The <q>out of memory</q> error report was made more user-friendly, with additional
diagnostic information to help identify the spot where the memory limit was exceeded.
</p>
</li>
<li>
<p rev="IMPALA-3983 IMPALA-3974">
Improved disk space usage for Java-based UDFs. Temporary copies of the associated JAR
files are removed when no longer needed, so that they do not accumulate across restarts
of the <cmdname>catalogd</cmdname> daemon and potentially cause an out-of-space condition.
These temporary files are also created in the directory specified by the <codeph>local_library_dir</codeph>
configuration setting, so that the storage for these temporary files can be independent
from any capacity limits on the <filepath>/tmp</filepath> filesystem.
</p>
</li>
</ul>
</conbody>
</concept>
<!-- All 2.7.x new features go under here -->
<concept rev="2.7.0" id="new_features_270">
<title>New Features in <keyword keyref="impala27_full"/></title>
<conbody>
<ul id="feature_list">
<li>
<p>
Performance improvements:
</p>
<ul>
<li>
<p rev="IMPALA-3206">
[<xref keyref="IMPALA-3206">IMPALA-3206</xref>]
Speedup for queries against <codeph>DECIMAL</codeph> columns in Avro tables.
The code that parses <codeph>DECIMAL</codeph> values from Avro now uses
native code generation.
</p>
</li>
<li>
<p rev="IMPALA-3674">
[<xref keyref="IMPALA-3674">IMPALA-3674</xref>]
Improved efficiency in LLVM code generation can reduce codegen time, especially
for short queries.
</p>
</li>
<!-- Not actually a new feature, it's more a tip about when to expect remote reads and how to minimize them. To go somewhere in the performance / best practices / Parquet info.
<li>
<p rev="IMPALA-3885">
[<xref keyref="IMPALA-3885">IMPALA-3885</xref>]
Parquet files with multiple blocks can now be processed
without remote reads.
</p>
</li>
-->
<li>
<p rev="IMPALA-2979">
[<xref keyref="IMPALA-2979">IMPALA-2979</xref>]
Improvements to scheduling on worker nodes,
enabled by the <codeph>REPLICA_PREFERENCE</codeph> query option.
See <xref
href="impala_replica_preference.xml#replica_preference"/> for details.
</p>
</li>
</ul>
</li>
<li audience="hidden">
<p rev="IMPALA-3210"><!-- Patch didn't make it into in <keyword keyref="impala27_full"/> -->
[<xref keyref="IMPALA-3210">IMPALA-3210</xref>]
The analytic functions <codeph>FIRST_VALUE()</codeph> and <codeph>LAST_VALUE()</codeph>
accept a new clause, <codeph>IGNORE NULLS</codeph>.
See <xref href="impala_analytic_functions.xml#first_value"/>
and <xref href="impala_analytic_functions.xml#last_value"/>
for details.
</p>
</li>
<li>
<p rev="IMPALA-1683">
[<xref keyref="IMPALA-1683">IMPALA-1683</xref>]
The <codeph>REFRESH</codeph> statement can be applied to a single partition,
rather than the entire table. See <xref href="impala_refresh.xml#refresh"/>
and <xref href="impala_partitioning.xml#partition_refresh"/> for details.
</p>
</li>
<li>
<p>
Improvements to the Impala web user interface:
</p>
<ul>
<li>
<p rev="IMPALA-2767">
[<xref keyref="IMPALA-2767">IMPALA-2767</xref>]
You can now force a session to expire by clicking a link in the web UI,
on the <uicontrol>/sessions</uicontrol> tab.
</p>
</li>
<li>
<p rev="IMPALA-3715">
[<xref keyref="IMPALA-3715">IMPALA-3715</xref>]
The <uicontrol>/memz</uicontrol> tab includes more information about
Impala memory usage.
</p>
</li>
<li>
<p rev="IMPALA-3716">
[<xref keyref="IMPALA-3716">IMPALA-3716</xref>]
The <uicontrol>Details</uicontrol> page for a query now includes
a <uicontrol>Memory</uicontrol> tab.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-3499">
[<xref keyref="IMPALA-3499">IMPALA-3499</xref>]
Scalability improvements to the catalog server. Impala handles internal communication
more efficiently for tables with large numbers of columns and partitions, where the
size of the metadata exceeds 2 GiB.
</p>
</li>
<li>
<p rev="IMPALA-3677">
[<xref keyref="IMPALA-3677">IMPALA-3677</xref>]
You can send a <codeph>SIGUSR1</codeph> signal to any Impala-related daemon to write a
Breakpad minidump. For advanced troubleshooting, you can now produce a minidump
without triggering a crash. See <xref href="impala_breakpad.xml#breakpad"/> for
details about the Breakpad minidump feature.
</p>
</li>
<li>
<p rev="IMPALA-3687">
[<xref keyref="IMPALA-3687">IMPALA-3687</xref>]
The schema reconciliation rules for Avro tables have changed slightly
for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> columns. Now, if
the definition of such a column is changed in the Avro schema file,
the column retains its <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph>
type as specified in the SQL definition, but the column name and comment
from the Avro schema file take precedence.
See <xref href="impala_avro.xml#avro_create_table"/> for details about
column definitions in Avro tables.
</p>
</li>
<li>
<p rev="IMPALA-3575">
[<xref keyref="IMPALA-3575">IMPALA-3575</xref>]
Some network
        operations now have additional timeout and retry settings. The extra
        configuration helps avoid failed queries due to transient network
        problems, avoids hangs when a sender or receiver fails in the
        middle of a network transmission, and makes cancellation requests
        more reliable despite network issues.</p>
</li>
</ul>
</conbody>
</concept>
<!-- All 2.6.x new features go under here -->
<concept rev="2.6.0" id="new_features_260">
<title>New Features in <keyword keyref="impala26_full"/></title>
<conbody>
<ul>
<li>
<p>
Improvements to Impala support for the Amazon S3 filesystem:
</p>
<ul>
<li>
<p rev="IMPALA-1878">
Impala can now write to S3 tables through the <codeph>INSERT</codeph>
or <codeph>LOAD DATA</codeph> statements.
See <xref href="impala_s3.xml#s3"/> for general information about
using Impala with S3.
</p>
</li>
<li>
<p rev="IMPALA-3452">
A new query option, <codeph>S3_SKIP_INSERT_STAGING</codeph>, lets you
trade off between fast <codeph>INSERT</codeph> performance and
slower <codeph>INSERT</codeph>s that are more consistent if a
problem occurs during the statement. The new behavior is enabled by default.
See <xref href="impala_s3_skip_insert_staging.xml#s3_skip_insert_staging"/> for details
about this option.
</p>
</li>
</ul>
</li>
<li>
<p rev="">
Performance improvements for the runtime filtering feature:
</p>
<ul>
<li>
<p rev="IMPALA-3333">
The default for the <codeph>RUNTIME_FILTER_MODE</codeph>
query option is changed to <codeph>GLOBAL</codeph> (the highest setting).
See <xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/> for
details about this option.
</p>
</li>
<li rev="IMPALA-3007">
<p>
The <codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph> setting is now only used
as a fallback if statistics are not available; otherwise, Impala
uses the statistics to estimate the appropriate size to use for each filter.
See <xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size"/> for
details about this option.
</p>
</li>
<li rev="IMPALA-3480">
<p>
New query options <codeph>RUNTIME_FILTER_MIN_SIZE</codeph> and
<codeph>RUNTIME_FILTER_MAX_SIZE</codeph> let you fine-tune
the sizes of the Bloom filter structures used for runtime filtering.
If the filter size derived from Impala internal estimates or from
          the <codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph> setting falls outside the size
range specified by these options, any too-small filter size is adjusted
to the minimum, and any too-large filter size is adjusted to the maximum.
See <xref href="impala_runtime_filter_min_size.xml#runtime_filter_min_size"/>
and <xref href="impala_runtime_filter_max_size.xml#runtime_filter_max_size"/>
for details about these options.
</p>
</li>
<li rev="IMPALA-2956">
<p>
Runtime filter propagation now applies to all the
operands of <codeph>UNION</codeph> and <codeph>UNION ALL</codeph>
operators.
</p>
</li>
<li rev="IMPALA-3077">
<p>
Runtime filters can now be produced during join queries even
when the join processing activates the spill-to-disk mechanism.
</p>
</li>
</ul>
See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for
general information about the runtime filtering feature.
</li>
<!-- Have to look closer at resource management / admission control to see if
there are any ripple effects from this default change. -->
<li>
<p rev="IMPALA-3199">
Admission control and dynamic resource pools are enabled by default.
See <xref href="impala_admission.xml#admission_control"/> for details
about admission control.
</p>
</li>
<!-- Below here are features that are pretty well taken care of already;
some of them didn't need much if any doc in the first place. -->
<li>
<p rev="IMPALA-3369">
Impala can now manually set column statistics,
using the <codeph>ALTER TABLE</codeph> statement with a
<codeph>SET COLUMN STATS</codeph> clause.
See <xref href="impala_perf_stats.xml#perf_column_stats_manual"/> for details.
</p>
</li>
<li>
<p rev="IMPALA-3490 IMPALA-3581 IMPALA-2686">
Impala can now write lightweight <q>minidump</q> files, rather
than large core files, to save diagnostic information when
any of the Impala-related daemons crash. This feature uses the
open source <codeph>breakpad</codeph> framework.
See <xref href="impala_breakpad.xml#breakpad"/> for details.
</p>
</li>
<li>
<p>
New query options improve interoperability with Parquet files:
<ul>
<li>
<p rev="IMPALA-2835">
The <codeph>PARQUET_FALLBACK_SCHEMA_RESOLUTION</codeph> query option
lets Impala locate columns within Parquet files based on
column name rather than ordinal position.
This enhancement improves interoperability with applications
that write Parquet files with a different order or subset of
columns than are used in the Impala table.
See <xref href="impala_parquet_fallback_schema_resolution.xml#parquet_fallback_schema_resolution"/>
for details.
</p>
</li>
<li>
<p rev="IMPALA-2069">
The <codeph>PARQUET_ANNOTATE_STRINGS_UTF8</codeph> query option
makes Impala include the <codeph>UTF-8</codeph> annotation
metadata for <codeph>STRING</codeph>, <codeph>CHAR</codeph>,
and <codeph>VARCHAR</codeph> columns in Parquet files created
by <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph>
statements.
See <xref href="impala_parquet_annotate_strings_utf8.xml#parquet_annotate_strings_utf8"/>
for details.
</p>
</li>
</ul>
See <xref href="impala_parquet.xml#parquet"/> for general information about working
with Parquet files.
</p>
</li>
<li>
<p>
Improvements to security and reduction in overhead for secure clusters:
</p>
<ul>
<li>
<p rev="IMPALA-1928">
Overall performance improvements for secure clusters.
            (TPC-H queries on a secure cluster were benchmarked
            at roughly 3 times the speed of the previous release.)
</p>
</li>
<li>
<p rev="IMPALA-2660">
Impala now recognizes the <codeph>auth_to_local</codeph> setting,
specified through the HDFS configuration setting
<codeph>hadoop.security.auth_to_local</codeph>.
This feature is disabled by default; to enable it,
specify <codeph>--load_auth_to_local_rules=true</codeph>
in the <cmdname>impalad</cmdname> configuration settings.
See <xref href="impala_kerberos.xml#auth_to_local"/> for details.
</p>
</li>
<li>
<p rev="IMPALA-2599">
Timing improvements in the mechanism for the <cmdname>impalad</cmdname>
daemon to acquire Kerberos tickets. This feature spreads out the overhead
on the KDC during Impala startup, especially for large clusters.
</p>
</li>
<li>
<p rev="IMPALA-3554">
For Kerberized clusters, the Catalog service now uses
            the Kerberos principal instead of the operating system user that runs
the <cmdname>catalogd</cmdname> daemon.
This eliminates the requirement to configure a <codeph>hadoop.user.group.static.mapping.overrides</codeph>
setting to put the OS user into the Sentry administrative group, on clusters where the principal
and the OS user name for this user are different.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-3286">
Overall performance improvements for join queries, by using a prefetching mechanism
while building the in-memory hash table to evaluate join predicates.
See <xref href="impala_prefetch_mode.xml#prefetch_mode"/> for the query option
to control this optimization.
</p>
</li>
<li>
<p rev="IMPALA-3397">
The <cmdname>impala-shell</cmdname> interpreter has a new command,
<codeph>SOURCE</codeph>, that lets you run a set of SQL statements
or other <cmdname>impala-shell</cmdname> commands stored in a file.
You can run additional <codeph>SOURCE</codeph> commands from inside
a file, to set up flexible sequences of statements for use cases
such as schema setup, ETL, or reporting.
See <xref href="impala_shell_commands.xml#shell_commands"/> for details
and <xref href="impala_shell_running_commands.xml#shell_running_commands"/>
for examples.
</p>
</li>
<li>
<p rev="IMPALA-1772">
The <codeph>millisecond()</codeph> built-in function lets you extract
the fractional seconds part of a <codeph>TIMESTAMP</codeph> value.
See <xref href="impala_datetime_functions.xml#datetime_functions"/> for details.
</p>
</li>
<li>
<p rev="IMPALA-3092">
If an Avro table is created without column definitions in the
<codeph>CREATE TABLE</codeph> statement, and columns are later
added through <codeph>ALTER TABLE</codeph>, the resulting
table is now queryable. Missing values from the newly added
columns now default to <codeph>NULL</codeph>.
See <xref href="impala_avro.xml#avro"/> for general details about
working with Avro files.
</p>
</li>
<li>
<p>
The mechanism for interpreting <codeph>DECIMAL</codeph> literals is
improved, no longer going through an intermediate conversion step
to <codeph>DOUBLE</codeph>:
<ul>
<li>
<p rev="IMPALA-3163">
              Casting a <codeph>DECIMAL</codeph> value to <codeph>TIMESTAMP</codeph>
              produces a more precise value for the <codeph>TIMESTAMP</codeph>
              than formerly.
</p>
</li>
<li>
<p rev="IMPALA-3439">
Certain function calls involving <codeph>DECIMAL</codeph> literals
now succeed, when formerly they failed due to lack of a function
signature with a <codeph>DOUBLE</codeph> argument.
</p>
</li>
<li>
<p rev="">
Faster runtime performance for <codeph>DECIMAL</codeph> constant
values, through improved native code generation for all combinations
of precision and scale.
</p>
</li>
</ul>
See <xref href="impala_decimal.xml#decimal"/> for details about the <codeph>DECIMAL</codeph> type.
</p>
</li>
<li>
<p rev="IMPALA-3155">
Improved type accuracy for <codeph>CASE</codeph> return values.
If all <codeph>WHEN</codeph> clauses of the <codeph>CASE</codeph>
expression are of <codeph>CHAR</codeph> type, the final result
is also <codeph>CHAR</codeph> instead of being converted to
<codeph>STRING</codeph>.
See <xref href="impala_conditional_functions.xml#conditional_functions"/>
for details about the <codeph>CASE</codeph> function.
</p>
</li>
<li>
<p rev="IMPALA-3232">
Uncorrelated queries using the <codeph>NOT EXISTS</codeph> operator
are now supported. Formerly, the <codeph>NOT EXISTS</codeph>
operator was only available for correlated subqueries.
</p>
</li>
<li>
<p rev="IMPALA-2736">
Improved performance for reading Parquet files.
</p>
</li>
<li>
<p rev="IMPALA-3375">
Improved performance for <term>top-N</term> queries, that is,
those including both <codeph>ORDER BY</codeph> and
<codeph>LIMIT</codeph> clauses.
</p>
</li>
<!-- JIRA still in open state as of 5.8 / 2.6, commenting out.
<li>
<p rev="IMPALA-3471">
A top-N query can now also activate the spill-to-disk mechanism if
a host runs low on memory while evaluating it. For example, using
large <codeph>LIMIT</codeph> and/or <codeph>OFFSET</codeph> clauses
adds some memory overhead that could cause spilling.
</p>
</li>
-->
<li>
<p rev="IMPALA-1740">
Impala optionally skips an arbitrary number of header lines from text input
files on HDFS based on the <codeph>skip.header.line.count</codeph> value
in the <codeph>TBLPROPERTIES</codeph> field of the table metadata.
See <xref href="impala_txtfile.xml#text_data_files"/> for details.
</p>
</li>
<li>
<p rev="IMPALA-2336">
Trailing comments are now allowed in queries processed by
the <cmdname>impala-shell</cmdname> options <codeph>-q</codeph>
and <codeph>-f</codeph>.
</p>
</li>
<li>
<p rev="IMPALA-2844">
Impala can run <codeph>COUNT</codeph> queries for RCFile tables
that include complex type columns.
See <xref href="impala_complex_types.xml#complex_types"/> for
general information about working with complex types,
and <xref href="impala_array.xml#array"/>,
<xref href="impala_map.xml#map"/>, and <xref href="impala_struct.xml#struct"/>
for syntax details of each type.
</p>
</li>
</ul>
</conbody>
</concept>
<!-- All 2.5.x new features go under here -->
<concept rev="2.5.0" id="new_features_250">
<title>New Features in <keyword keyref="impala25_full"/></title>
<conbody>
<ul>
<li><!-- Spec: https://docs.google.com/document/d/1ambtYJ1t05iITCVIrN6N1A-e7PZBSetBPgjy8SLzJrA/edit#heading=h.vcftzwlpn845 -->
<p rev="IMPALA-2552 IMPALA-3054">
Dynamic partition pruning. When a query refers to a partition key column in a <codeph>WHERE</codeph>
        clause, and the exact set of column values is not known until the query is executed,
Impala evaluates the predicate and skips the I/O for entire partitions that are not needed.
For example, if a table was partitioned by year, Impala would apply this technique to a query
such as <codeph>SELECT c1 FROM partitioned_table WHERE year = (SELECT MAX(year) FROM other_table)</codeph>.
<ph audience="standalone">See <xref href="impala_partitioning.xml#dynamic_partition_pruning"/> for details.</ph>
</p>
<p>
The dynamic partition pruning optimization technique lets Impala avoid reading
data files from partitions that are not part of the result set, even when
that determination cannot be made in advance. This technique is especially valuable
when performing join queries involving partitioned tables. For example, if a join
query includes an <codeph>ON</codeph> clause and a <codeph>WHERE</codeph> clause
that refer to the same columns, the query can find the set of column values that
match the <codeph>WHERE</codeph> clause, and only scan the associated partitions
when evaluating the <codeph>ON</codeph> clause.
</p>
<p>
Dynamic partition pruning is controlled by the same settings as the runtime filtering feature.
By default, this feature is enabled at a medium level, because the maximum setting can use
slightly more memory for queries than in previous releases.
To fully enable this feature, set the query option <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>.
</p>
</li>
<li><!-- Spec: https://docs.google.com/document/d/1ambtYJ1t05iITCVIrN6N1A-e7PZBSetBPgjy8SLzJrA/edit#heading=h.vcftzwlpn845 -->
<p rev="IMPALA-2419 IMPALA-3001 IMPALA-3008 IMPALA-3039 IMPALA-3046 IMPALA-3054">
Runtime filtering. This is a wide-ranging set of optimizations that are especially valuable for join queries.
Using the same technique as with dynamic partition pruning,
Impala uses the predicates from <codeph>WHERE</codeph> and <codeph>ON</codeph> clauses
        to determine the subset of column values from one of the joined tables that could possibly be part of the
result set. Impala sends a compact representation of the filter condition to the hosts in the cluster,
instead of the full set of values or the entire table.
<ph audience="PDF">See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for details.</ph>
</p>
<p>
By default, this feature is enabled at a medium level, because the maximum setting can use
slightly more memory for queries than in previous releases.
To fully enable this feature, set the query option <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>.
<ph audience="PDF">See <xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/> for details.</ph>
</p>
<p>
This feature involves some new query options:
<xref audience="standalone" href="impala_runtime_filter_mode.xml">RUNTIME_FILTER_MODE</xref><codeph audience="integrated">RUNTIME_FILTER_MODE</codeph>,
<xref audience="standalone" href="impala_max_num_runtime_filters.xml">MAX_NUM_RUNTIME_FILTERS</xref><codeph audience="integrated">MAX_NUM_RUNTIME_FILTERS</codeph>,
<xref audience="standalone" href="impala_runtime_bloom_filter_size.xml">RUNTIME_BLOOM_FILTER_SIZE</xref><codeph audience="integrated">RUNTIME_BLOOM_FILTER_SIZE</codeph>,
<xref audience="standalone" href="impala_runtime_filter_wait_time_ms.xml">RUNTIME_FILTER_WAIT_TIME_MS</xref><codeph audience="integrated">RUNTIME_FILTER_WAIT_TIME_MS</codeph>,
and <xref audience="standalone" href="impala_disable_row_runtime_filtering.xml">DISABLE_ROW_RUNTIME_FILTERING</xref><codeph audience="integrated">DISABLE_ROW_RUNTIME_FILTERING</codeph>.
<ph audience="PDF">See
<xref href="impala_runtime_filter_mode.xml#runtime_filter_mode">RUNTIME_FILTER_MODE</xref>,
<xref href="impala_max_num_runtime_filters.xml#max_num_runtime_filters">MAX_NUM_RUNTIME_FILTERS</xref>,
<xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size">RUNTIME_BLOOM_FILTER_SIZE</xref>,
<xref href="impala_runtime_filter_wait_time_ms.xml#runtime_filter_wait_time_ms">RUNTIME_FILTER_WAIT_TIME_MS</xref>, and
<xref href="impala_disable_row_runtime_filtering.xml#disable_row_runtime_filtering">DISABLE_ROW_RUNTIME_FILTERING</xref>
for details.
</ph>
</p>
</li>
<li>
<p rev="IMPALA-2696">
More efficient use of the HDFS caching feature, to avoid
hotspots and bottlenecks that could occur if heavily used
cached data blocks were always processed by the same host.
By default, Impala now randomizes which host processes each cached
HDFS data block, when cached replicas are available on multiple hosts.
(Remember to use the <codeph>WITH REPLICATION</codeph> clause with the
<codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement
when enabling HDFS caching for a table or partition, to cache the same
data blocks across multiple hosts.)
The new query option <codeph>SCHEDULE_RANDOM_REPLICA</codeph>
<!-- and <codeph>REPLICA_PREFERENCE</codeph> -->
lets you fine-tune the interaction with HDFS caching even more.
<ph audience="PDF">See <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-2641">
The <codeph>TRUNCATE TABLE</codeph> statement now accepts an <codeph>IF EXISTS</codeph>
clause, making <codeph>TRUNCATE TABLE</codeph> easier to use in setup or ETL scripts where the table might or
might not exist.
<ph audience="PDF">See <xref href="impala_truncate_table.xml#truncate_table"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-2681 IMPALA-2688 IMPALA-2749">
Improved performance and reliability for the <codeph>DECIMAL</codeph> data type:
<ul>
<li>
<p rev="IMPALA-2681">
Using <codeph>DECIMAL</codeph> values in a <codeph>GROUP BY</codeph> clause now
triggers the native code generation optimization, speeding up queries that
group by values such as prices.
</p>
</li>
<li>
<p rev="IMPALA-2688">
Checking for overflow in <codeph>DECIMAL</codeph>
multiplication is now substantially faster, making <codeph>DECIMAL</codeph>
a more practical data type in some use cases where formerly <codeph>DECIMAL</codeph>
was much slower than <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph>.
</p>
</li>
<li>
<p rev="IMPALA-2749">
Multiplying a mixture of <codeph>DECIMAL</codeph>
and <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> values now returns the
<codeph>DOUBLE</codeph> rather than <codeph>DECIMAL</codeph>. This change avoids
some cases where an intermediate value would underflow or overflow and become
<codeph>NULL</codeph> unexpectedly.
</p>
</li>
</ul>
<ph audience="PDF">See <xref href="impala_decimal.xml"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-2382">
For UDFs written in Java, or Hive UDFs reused for Impala,
Impala now allows parameters and return values to be primitive types.
        Formerly, parameters and return values were required to be one of the <q>Writable</q>
        object types.
<ph audience="PDF">See <xref href="impala_udf.xml#udfs_hive"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-1588"><!-- This is from 2015, so perhaps it's really in an earlier release. -->
Performance improvements for HDFS I/O. Impala now caches HDFS file handles to avoid the
overhead of repeatedly opening the same file.
</p>
</li>
<!-- Kudu didn't make it into 2.5 / 5.7 release, so no DELETE or UPDATE statement. -->
<li>
<p><!-- Is there a JIRA for that one? Alex? -->
Performance improvements for queries involving nested complex types.
Certain basic query types, such as counting the elements of a complex column,
now use an optimized code path.
</p>
</li>
<li>
<p rev="IMPALA-3044 IMPALA-2538 IMPALA-1168">
Improvements to the memory reservation mechanism for the Impala
admission control feature. You can specify more settings, such
as the timeout period and maximum aggregate memory used, for each
resource pool instead of globally for the Impala instance. The
default limit for concurrent queries (the <uicontrol>max requests</uicontrol>
setting) is now unlimited instead of 200.
</p>
</li>
<li>
<p rev="IMPALA-1755">
Performance improvements related to code generation.
Even in queries where code generation is not performed
for some phases of execution (such as reading data from
Parquet tables), Impala can still use code generation in
other parts of the query, such as evaluating
functions in the <codeph>WHERE</codeph> clause.
</p>
</li>
<li>
<p rev="IMPALA-1305">
Performance improvements for queries using aggregation functions
on high-cardinality columns.
            Formerly, Impala could do unnecessary extra work to produce intermediate
            results for operations such as <codeph>DISTINCT</codeph> or <codeph>GROUP BY</codeph>
            on columns that were unique or had few duplicate values.
            Now, Impala decides at run time whether it is more efficient to
            do an initial aggregation phase and pass along a smaller set of intermediate data,
            or to pass raw intermediate data back to the next phase of query processing to be aggregated there.
            This feature is known as <term>streaming pre-aggregation</term>.
            If a performance regression occurs, this feature can be turned off
            using the <codeph>DISABLE_STREAMING_PREAGGREGATIONS</codeph> query option.
<ph audience="PDF">See <xref href="impala_disable_streaming_preaggregations.xml#disable_streaming_preaggregations"/> for details.</ph>
</p>
</li>
<li>
<p>
            The spill-to-disk feature is now always enabled. In earlier releases, the spill-to-disk feature
            could be turned off using a pair of configuration settings,
            <codeph>enable_partitioned_aggregation=false</codeph> and
            <codeph>enable_partitioned_hash_join=false</codeph>.
            The latest improvements in the spill-to-disk mechanism, and related features that
            interact with it, make this feature robust enough that disabling it is
            no longer needed or supported. In particular, some new features in <keyword keyref="impala25_full"/>
            and higher do not work when the spill-to-disk feature is disabled.
</p>
</li>
<li>
<p rev="IMPALA-1067">
Improvements to scripting capability for the <cmdname>impala-shell</cmdname> command,
through user-specified substitution variables that can appear in statements processed
by <cmdname>impala-shell</cmdname>:
</p>
<ul>
<li rev="IMPALA-2179">
<p>
The <codeph>--var</codeph> command-line option lets you pass key-value pairs to
<cmdname>impala-shell</cmdname>. The shell can substitute the values
into queries before executing them, where the query text contains the notation
<codeph>${var:<varname>varname</varname>}</codeph>. For example, you might prepare a SQL file
containing a set of DDL statements and queries containing variables for
database and table names, and then pass the applicable names as part of the
<codeph>impala-shell -f <varname>filename</varname></codeph> command.
<ph audience="PDF">See <xref href="impala_shell_running_commands.xml#shell_running_commands"/> for details.</ph>
</p>
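          <p>
            As a sketch of how this might look (the file, database, and table names are hypothetical),
            a script run with <codeph>impala-shell --var=db_name=sales_db -f setup.sql</codeph> could
            contain statements such as:
          </p>
          <codeblock>USE ${var:db_name};
CREATE TABLE IF NOT EXISTS daily_totals (day STRING, total BIGINT);</codeblock>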
</li>
<li rev="IMPALA-2180">
<p>
The <codeph>SET</codeph> and <codeph>UNSET</codeph> commands within the
<cmdname>impala-shell</cmdname> interpreter now work with user-specified
substitution variables, as well as the built-in query options.
            The two kinds of variables are listed separately in the <codeph>SET</codeph> output.
As with variables defined by the <codeph>--var</codeph> command-line option,
you refer to the user-specified substitution variables in queries by using
the notation <codeph>${var:<varname>varname</varname>}</codeph>
in the query text. Because the substitution variables are processed by
<cmdname>impala-shell</cmdname> instead of the <cmdname>impalad</cmdname>
backend, you cannot define your own substitution variables through the
<codeph>SET</codeph> statement in a JDBC or ODBC application.
<ph audience="PDF">See <xref href="impala_set.xml#set"/> for details.</ph>
</p>
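          <p>
            A brief interactive sketch, assuming the <codeph>VAR:</codeph> prefix form of
            <codeph>SET</codeph> for defining substitution variables (the database and table names
            are illustrative):
          </p>
          <codeblock>SET VAR:db_name=sales_db;
SELECT COUNT(*) FROM ${var:db_name}.orders;</codeblock>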
</li>
</ul>
</li>
<li>
<p rev="IMPALA-1599">
Performance improvements for query startup. Impala better parallelizes certain work
when coordinating plan distribution between <cmdname>impalad</cmdname> instances, which improves
startup time for queries involving tables with many partitions on large clusters,
or complicated queries with many plan fragments.
</p>
</li>
<li>
<p rev="IMPALA-2560">
Performance and scalability improvements for tables with many partitions.
The memory requirements on the coordinator node are reduced, making it substantially
faster and less resource-intensive
to do joins involving several tables with thousands of partitions each.
</p>
</li>
<li>
<p rev="IMPALA-3095">
Whitelisting for access to internal APIs. For applications that need direct access
to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can
specify a list of Kerberos users who are allowed to call those APIs. By default, the
<codeph>impala</codeph> and <codeph>hdfs</codeph> users are the only ones authorized
for this kind of access.
Any users not explicitly authorized through the <codeph>internal_principals_whitelist</codeph>
configuration setting are blocked from accessing the APIs. This setting applies to all the
Impala-related daemons, although currently it is primarily used for HDFS to control the
behavior of the catalog server.
</p>
</li>
<li>
<p rev="">
Improvements to Impala integration and usability for Hue. (The code changes
are actually on the Hue side.)
</p>
<ul>
<li>
<p rev="">
The list of tables now refreshes dynamically.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-1787">
Usability improvements for case-insensitive queries.
You can now use the operators <codeph>ILIKE</codeph> and <codeph>IREGEXP</codeph>
to perform case-insensitive wildcard matches or regular expression matches,
rather than explicitly converting column values with <codeph>UPPER</codeph>
or <codeph>LOWER</codeph>.
<ph audience="PDF">See <xref href="impala_operators.xml#ilike"/> and <xref href="impala_operators.xml#iregexp"/> for details.</ph>
</p>
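          <p>
            A minimal sketch (table and column names are illustrative):
          </p>
          <codeblock>-- Case-insensitive wildcard and regular expression matches,
-- without wrapping the column in UPPER() or LOWER().
SELECT name FROM customers WHERE name ILIKE 'smith%';
SELECT name FROM customers WHERE name IREGEXP '^smi';</codeblock>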
</li>
<li>
<p rev="IMPALA-1480">
Performance and reliability improvements for DDL and insert operations on partitioned tables with a large
number of partitions. Impala only re-evaluates metadata for partitions that are affected by
a DDL operation, not all partitions in the table. While a DDL or insert statement is in progress,
other Impala statements that attempt to modify metadata for the same table wait until the first one
finishes.
</p>
</li>
<li>
<p rev="IMPALA-2867">
Reliability improvements for the <codeph>LOAD DATA</codeph> statement.
Previously, this statement would fail if the source HDFS directory
contained any subdirectories at all. Now, the statement ignores
any hidden subdirectories, for example <filepath>_impala_insert_staging</filepath>.
</p>
</li>
<li>
<p rev="IMPALA-2147">
A new operator, <codeph>IS [NOT] DISTINCT FROM</codeph>, lets you compare values
and always get a <codeph>true</codeph> or <codeph>false</codeph> result,
even if one or both of the values are <codeph>NULL</codeph>.
The <codeph>IS NOT DISTINCT FROM</codeph> operator, or its equivalent
<codeph>&lt;=&gt;</codeph> notation, improves the efficiency of join queries that
treat key values that are <codeph>NULL</codeph> in both tables as equal.
<ph audience="PDF">See <xref href="impala_operators.xml#is_distinct_from"/> for details.</ph>
</p>
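          <p>
            A minimal sketch of a NULL-safe join condition (table and column names are illustrative):
          </p>
          <codeblock>-- Rows where cust_id is NULL on both sides are treated as matching.
SELECT t1.id, t2.id
FROM t1 JOIN t2
  ON t1.cust_id IS NOT DISTINCT FROM t2.cust_id;</codeblock>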
</li>
<li>
<p rev="IMPALA-1934">
Security enhancements for the <cmdname>impala-shell</cmdname> command.
A new option, <codeph>--ldap_password_cmd</codeph>, lets you specify
a command to retrieve the LDAP password. The resulting password is
then used to authenticate the <cmdname>impala-shell</cmdname> command
with the LDAP server.
<ph audience="PDF">See <xref href="impala_shell_options.xml"/> for details.</ph>
</p>
</li>
<li>
<p>
The <codeph>CREATE TABLE AS SELECT</codeph> statement now accepts a
<codeph>PARTITIONED BY</codeph> clause, which lets you create a
partitioned table and insert data into it with a single statement.
<ph audience="PDF">See <xref href="impala_create_table.xml#create_table"/> for details.</ph>
</p>
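          <p>
            For example (names are illustrative; the partition key column goes last in the select list):
          </p>
          <codeblock>CREATE TABLE sales_by_year PARTITIONED BY (year)
  AS SELECT id, amount, year FROM raw_sales;</codeblock>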
</li>
<li>
<p rev="IMPALA-1748">
User-defined functions (UDFs and UDAFs) written in C++ now persist automatically
when the <cmdname>catalogd</cmdname> daemon is restarted. You no longer
have to run the <codeph>CREATE FUNCTION</codeph> statements again after a restart.
</p>
</li>
<li>
<p rev="IMPALA-2843">
User-defined functions (UDFs) written in Java can now persist
when the <cmdname>catalogd</cmdname> daemon is restarted, and can be shared
            transparently between Impala and Hive. You must do a one-time operation to recreate these
            UDFs using the new <codeph>CREATE FUNCTION</codeph> syntax, without a signature for arguments
            or the return value. Afterwards, you no longer have to run the <codeph>CREATE FUNCTION</codeph>
statements again after a restart.
Although Impala does not have visibility into the UDFs that implement the
Hive built-in functions, user-created Hive UDFs are now automatically available
for calling through Impala.
<ph audience="PDF">See <xref href="impala_create_function.xml#create_function"/> for details.</ph>
</p>
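          <p>
            A sketch of the signature-less syntax (the JAR path and class name are hypothetical):
          </p>
          <codeblock>CREATE FUNCTION my_lower
  LOCATION '/user/impala/udfs/my-hive-udfs.jar'
  SYMBOL='com.example.hive.udf.MyLower';</codeblock>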
</li>
<li>
<!-- Listed as fixed in 2.6.0. Is this item inappropriate or did it actually come from a different JIRA? -->
<p rev="IMPALA-2728">
Reliability enhancements for memory management. Some aggregation and join queries
that formerly might have failed with an out-of-memory error due to memory contention,
now can succeed using the spill-to-disk mechanism.
</p>
</li>
<li>
<!-- Same blurb is under Incompatible Changes. Turn into a conref. -->
<p rev="IMPALA-2070">
The <codeph>SHOW DATABASES</codeph> statement now returns two columns rather than one.
The second column includes the associated comment string, if any, for each database.
Adjust any application code that examines the list of databases and assumes the
result set contains only a single column.
<ph audience="PDF">See <xref href="impala_show.xml#show_databases"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-2499">
A new optimization speeds up aggregation operations that involve only the partition key
columns of partitioned tables. For example, a query such as <codeph>SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1</codeph>
can avoid reading any data files if <codeph>T1</codeph> is a partitioned table and <codeph>K</codeph>
is one of the partition key columns. Because this technique can produce different results in cases
where HDFS files in a partition are manually deleted or are empty, you must enable the optimization
by setting the query option <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph>.
<ph audience="PDF">See <xref href="impala_optimize_partition_key_scans.xml"/> for details.</ph>
</p>
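          <p>
            For example (table and column names are illustrative):
          </p>
          <codeblock>SET OPTIMIZE_PARTITION_KEY_SCANS=1;
SELECT COUNT(DISTINCT year), MIN(year), MAX(year) FROM sales_data;</codeblock>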
</li>
<li audience="hidden"><!-- All the other undocumented query options are not really new features for this release, so hiding this whole bullet. -->
<p>
Other new query options:
</p>
<ul>
<li audience="hidden"><!-- Actually from a long way back, just never documented. Not sure if appropriate to keep internal-only or expose. -->
<codeph>DISABLE_OUTERMOST_TOPN</codeph>
</li>
<li audience="hidden"><!-- Actually from a long way back, just never documented. Not sure if appropriate to keep internal-only or expose. -->
<codeph>RM_INITIAL_MEM</codeph>
</li>
<li audience="hidden"><!-- Seems to be related to writing sequence files, a capability not externalized at this time. -->
<codeph>SEQ_COMPRESSION_MODE</codeph>
</li>
<li audience="hidden"><!-- Actually, was only used for working around one JIRA. Being deprecated now in Impala 2.3 via IMPALA-2963. -->
<codeph>DISABLE_CACHED_READS</codeph>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-2196">
The <codeph>DESCRIBE</codeph> statement can now display metadata about a database, using the
syntax <codeph>DESCRIBE DATABASE <varname>db_name</varname></codeph>.
<ph audience="PDF">See <xref href="impala_describe.xml#describe"/> for details.</ph>
</p>
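          <p>
            For example (the database name is illustrative):
          </p>
          <codeblock>DESCRIBE DATABASE sales_db;</codeblock>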
</li>
<li>
<p rev="IMPALA-1477">
The <codeph>uuid()</codeph> built-in function generates an
alphanumeric value that you can use as a guaranteed unique identifier.
The uniqueness applies even across tables, for cases where an ascending
numeric sequence is not suitable.
<ph audience="PDF">See <xref href="impala_misc_functions.xml#misc_functions"/> for details.</ph>
</p>
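          <p>
            For example (the column alias is arbitrary):
          </p>
          <codeblock>SELECT uuid() AS session_id;</codeblock>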
</li>
</ul>
</conbody>
</concept>
<!-- All 2.4.x new features go under here -->
<concept rev="2.4.0" id="new_features_240">
<title>New Features in <keyword keyref="impala24_full"/></title>
<conbody>
<ul>
<li>
<p>
Impala can be used on the DSSD D5 Storage Appliance.
From a user perspective, the Impala features are the same as in <keyword keyref="impala23_full"/>.
</p>
</li>
</ul>
</conbody>
</concept>
<!-- All 2.3.x subsections go under here -->
<!-- Actually for 2.3 / 5.5, let's get away from doing a separate subhead for each maintenance release,
because in the normal course of events there will be nothing to add here until 5.6. If something new
needs to get noted, just add a new bullet with wording to indicate which 5.5.x release it applies to. -->
<concept rev="2.3.0" id="new_features_230">
<title>New Features in <keyword keyref="impala23_full"/></title>
<conbody>
<p>
The following are the major new features in Impala 2.3.x. This major release
contains improvements to SQL syntax (particularly new support for complex types), performance,
        manageability, and security.
</p>
<ul>
<li>
<p>
Complex data types: <codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, and <codeph>MAP</codeph>. These
types can encode multiple named fields, positional items, or key-value pairs within a single column.
You can combine these types to produce nested types with arbitrarily deep nesting,
such as an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> values,
a <codeph>MAP</codeph> where each key-value pair is an <codeph>ARRAY</codeph> of other <codeph>MAP</codeph> values,
and so on. Currently, complex data types are only supported for the Parquet file format.
<ph audience="PDF">See <xref href="impala_complex_types.xml#complex_types"/> for usage details and <xref href="impala_array.xml#array"/>, <xref href="impala_struct.xml#struct"/>, and <xref href="impala_map.xml#map"/> for syntax.</ph>
</p>
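          <p>
            A minimal sketch of a Parquet table that combines the three types, and a query that
            expands an <codeph>ARRAY</codeph> column (names are illustrative):
          </p>
          <codeblock>CREATE TABLE contacts (
  id BIGINT,
  phones ARRAY&lt;STRING&gt;,
  address STRUCT&lt;street: STRING, city: STRING&gt;,
  properties MAP&lt;STRING, STRING&gt;
) STORED AS PARQUET;

SELECT c.id, ph.item FROM contacts c, c.phones ph;</codeblock>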
</li>
<li rev="collevelauth">
<p>
Column-level authorization lets you define access to particular columns within a table,
rather than the entire table. This feature lets you reduce the reliance on creating views to
set up authorization schemes for subsets of information.
See <xref keyref="sg_hive_sql"/> for background details, and
<xref href="impala_grant.xml#grant"/> and <xref href="impala_revoke.xml#revoke"/> for Impala-specific syntax.
</p>
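          <p>
            As a sketch of the column-level syntax (the role, table, and column names are hypothetical):
          </p>
          <codeblock>GRANT SELECT(customer_name, city) ON TABLE sales.customers TO ROLE support_role;</codeblock>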
</li>
<li rev="IMPALA-1139">
<p>
The <codeph>TRUNCATE TABLE</codeph> statement removes all the data from a table without removing the table itself.
<ph audience="PDF">See <xref href="impala_truncate_table.xml#truncate_table"/> for details.</ph>
</p>
</li>
<li id="IMPALA-2015">
<p>
Nested loop join queries. Some join queries that formerly required equality comparisons can now use
operators such as <codeph>&lt;</codeph> or <codeph>&gt;=</codeph>. This same join mechanism is used
internally to optimize queries that retrieve values from complex type columns.
<ph audience="PDF">See <xref href="impala_joins.xml#joins"/> for details about Impala join queries.</ph>
</p>
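          <p>
            A minimal sketch of a join with non-equality conditions (names are illustrative):
          </p>
          <codeblock>SELECT e.id, w.window_id
FROM events e JOIN time_windows w
  ON e.event_ts &gt;= w.start_ts AND e.event_ts &lt; w.end_ts;</codeblock>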
</li>
<li>
<p>
Reduced memory usage and improved performance and robustness for spill-to-disk feature.
<ph audience="PDF">See <xref href="impala_scalability.xml#spill_to_disk"/> for details about this feature.</ph>
</p>
</li>
<li rev="IMPALA-1881">
<p>
Performance improvements for querying Parquet data files containing multiple row groups
and multiple data blocks:
</p>
<ul>
<li>
<p> For files written by Hive, SparkSQL, and other Parquet MR writers
and spanning multiple HDFS blocks, Impala now scans the extra
data blocks locally when possible, rather than using remote
reads. </p>
</li>
<li>
<p>
Impala queries benefit from the improved alignment of row groups with HDFS blocks for Parquet
files written by Hive, MapReduce, and other components. (Impala itself never writes
multiblock Parquet files, so the alignment change does not apply to Parquet files produced by Impala.)
These Parquet writers now add padding to Parquet files that they write to align row groups with HDFS blocks.
The <codeph>parquet.writer.max-padding</codeph> setting specifies the maximum number of bytes, by default
8 megabytes, that can be added to the file between row groups to fill the gap at the end of one block
so that the next row group starts at the beginning of the next block.
If the gap is larger than this size, the writer attempts to fit another entire row group in the remaining space.
Include this setting in the <filepath>hive-site</filepath> configuration file to influence Parquet files written by Hive,
or the <filepath>hdfs-site</filepath> configuration file to influence Parquet files written by all non-Impala components.
</p>
</li>
</ul>
<p audience="PDF">
See <xref href="impala_parquet.xml#parquet"/> for instructions about using Parquet data files
with Impala.
</p>
</li>
<li id="IMPALA-1660">
<p>
Many new built-in scalar functions, for convenience and enhanced portability of SQL that uses common industry extensions.
</p>
<p rev="IMPALA-1771">
Math functions<ph audience="PDF"> (see <xref href="impala_math_functions.xml#math_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>ATAN2</codeph>
</li>
<li>
<codeph>COSH</codeph>
</li>
<li>
<codeph>COT</codeph>
</li>
<li>
<codeph>DCEIL</codeph>
</li>
<li>
<codeph>DEXP</codeph>
</li>
<li>
<codeph>DFLOOR</codeph>
</li>
<li>
<codeph>DLOG10</codeph>
</li>
<li>
<codeph>DPOW</codeph>
</li>
<li>
<codeph>DROUND</codeph>
</li>
<li>
<codeph>DSQRT</codeph>
</li>
<li>
<codeph>DTRUNC</codeph>
</li>
<li>
<codeph>FACTORIAL</codeph>, and corresponding <codeph>!</codeph> operator
</li>
<li>
<codeph>FPOW</codeph>
</li>
<li>
<codeph>RADIANS</codeph>
</li>
<li>
<codeph>RANDOM</codeph>
</li>
<li>
<codeph>SINH</codeph>
</li>
<li>
<codeph>TANH</codeph>
</li>
</ul>
<p>
String functions<ph audience="PDF"> (see <xref href="impala_string_functions.xml#string_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>BTRIM</codeph>
</li>
<li>
<codeph>CHR</codeph>
</li>
<li>
<codeph>REGEXP_LIKE</codeph>
</li>
<li>
<codeph>SPLIT_PART</codeph>
</li>
</ul>
<p>
Date and time functions<ph audience="PDF"> (see <xref href="impala_datetime_functions.xml#datetime_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>INT_MONTHS_BETWEEN</codeph>
</li>
<li>
<codeph>MONTHS_BETWEEN</codeph>
</li>
<li>
<codeph>TIMEOFDAY</codeph>
</li>
<li>
<codeph>TIMESTAMP_CMP</codeph>
</li>
</ul>
<p>
Bit manipulation functions<ph audience="PDF"> (see <xref href="impala_bit_functions.xml#bit_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>BITAND</codeph>
</li>
<li>
<codeph>BITNOT</codeph>
</li>
<li>
<codeph>BITOR</codeph>
</li>
<li>
<codeph>BITXOR</codeph>
</li>
<li>
<codeph>COUNTSET</codeph>
</li>
<li>
<codeph>GETBIT</codeph>
</li>
<li>
<codeph>ROTATELEFT</codeph>
</li>
<li>
<codeph>ROTATERIGHT</codeph>
</li>
<li>
<codeph>SETBIT</codeph>
</li>
<li>
<codeph>SHIFTLEFT</codeph>
</li>
<li>
<codeph>SHIFTRIGHT</codeph>
</li>
</ul>
<p>
Type conversion functions<ph audience="PDF"> (see <xref href="impala_conversion_functions.xml#conversion_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>TYPEOF</codeph>
</li>
</ul>
<p>
The <codeph>effective_user()</codeph> function<ph audience="PDF"> (see <xref href="impala_misc_functions.xml#misc_functions"/> for details)</ph>.
</p>
</li>
<li id="IMPALA-2081">
<p>
New built-in analytic functions: <codeph>PERCENT_RANK</codeph>, <codeph>NTILE</codeph>,
<codeph>CUME_DIST</codeph>.
<ph audience="PDF">See <xref href="impala_analytic_functions.xml#analytic_functions"/> for details.</ph>
</p>
</li>
<li id="IMPALA-595">
<p>
The <codeph>DROP DATABASE</codeph> statement now works for a non-empty database.
When you specify the optional <codeph>CASCADE</codeph> clause, any tables in the
database are dropped before the database itself is removed.
<ph audience="PDF">See <xref href="impala_drop_database.xml#drop_database"/> for details.</ph>
</p>
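          <p>
            For example (the database name is illustrative):
          </p>
          <codeblock>DROP DATABASE scratch_db CASCADE;</codeblock>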
</li>
<li>
<p>
The <codeph>DROP TABLE</codeph> and <codeph>ALTER TABLE DROP PARTITION</codeph> statements have a new optional keyword, <codeph>PURGE</codeph>.
This keyword causes Impala to immediately remove the relevant HDFS data files rather than sending them to the HDFS trashcan.
This feature can help to avoid out-of-space errors on storage devices, and to avoid files being left behind in case of
a problem with the HDFS trashcan, such as the trashcan not being configured or being in a different HDFS encryption zone
than the data files.
<ph audience="PDF">See <xref href="impala_drop_table.xml#drop_table"/> and <xref href="impala_alter_table.xml#alter_table"/> for syntax.</ph>
</p>
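          <p>
            A minimal sketch (table and partition names are illustrative):
          </p>
          <codeblock>DROP TABLE old_logs PURGE;
ALTER TABLE web_logs DROP PARTITION (year=2013) PURGE;</codeblock>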
</li>
<li id="IMPALA-80">
<p>
The <cmdname>impala-shell</cmdname> command has a new feature for live progress reporting. This feature
is enabled through the <codeph>--live_progress</codeph> and <codeph>--live_summary</codeph>
command-line options, or during a session through the <codeph>LIVE_SUMMARY</codeph> and
<codeph>LIVE_PROGRESS</codeph> query options.
<ph audience="PDF">See <xref href="impala_live_progress.xml#live_progress"/> and <xref href="impala_live_summary.xml#live_summary"/> for details.</ph>
</p>
</li>
<li>
<p>
The <cmdname>impala-shell</cmdname> command also now displays a random <q>tip of the day</q> when it starts.
</p>
</li>
<li id="IMPALA-1413">
<p>
The <cmdname>impala-shell</cmdname> option <codeph>-f</codeph> now recognizes a special filename
<codeph>-</codeph> to accept input from stdin.
<ph audience="PDF">See <xref href="impala_shell_options.xml#shell_options"/> for details about the options for running <cmdname>impala-shell</cmdname> in non-interactive mode.</ph>
</p>
</li>
<li id="IMPALA-1963">
<p>
Format strings for the <codeph>unix_timestamp()</codeph> function can now include numeric timezone offsets.
<ph audience="PDF">See <xref href="impala_datetime_functions.xml#datetime_functions"/> for details.</ph>
</p>
</li>
<li>
<p>
Impala can now run a specified command to obtain the password to decrypt a private-key PEM file,
rather than having the private-key file be unencrypted on disk.
<ph audience="PDF">See <xref href="impala_ssl.xml#ssl"/> for details.</ph>
</p>
</li>
<li id="IMPALA-859">
<p>
Impala components now can use SSL for more of their internal communication. SSL is used for
communication between all three Impala-related daemons when the configuration option
<codeph>ssl_server_certificate</codeph> is enabled. SSL is used for communication with client
applications when the configuration option <codeph>ssl_client_ca_certificate</codeph> is enabled.
<ph audience="PDF">See <xref href="impala_ssl.xml#ssl"/> for details.</ph>
</p>
<p>
Currently, you can only use one of server-to-server TLS/SSL encryption or Kerberos authentication.
This limitation is tracked by the issue
<xref keyref="IMPALA-2598">IMPALA-2598</xref>.
</p>
</li>
<li id="IMPALA-1829">
<p>
Improved flexibility for intermediate data types in user-defined aggregate functions (UDAFs).
<ph audience="PDF">See <xref href="impala_udf.xml#udafs"/> for details.</ph>
</p>
</li>
</ul>
<p>
In <keyword keyref="impala232"/>, the bug fix for <xref keyref="IMPALA-2598">IMPALA-2598</xref>
removes the restriction on using both Kerberos and SSL for internal communication between Impala components.
</p>
<!-- End of new feature list for 2.3 / 5.5. -->
</conbody>
</concept>
<!-- All 2.2.x subsections go under here -->
<concept rev="2.2.0" id="new_features_220">
<title>New Features in <keyword keyref="impala28_full"/></title>
<conbody>
<p>
The following are the major new features in <keyword keyref="impala22_full"/>. This release
contains improvements to performance, manageability, security, and SQL syntax.
</p>
<ul>
        <li>
          <p>
The <codeph>WITH REPLICATION</codeph> clause for the <codeph>CREATE TABLE</codeph> and
<codeph>ALTER TABLE</codeph> statements lets you control the replication factor for
HDFS caching for a specific table or partition. By default, each cached block is
only present on a single host, which can lead to CPU contention if the same host
processes each cached block. Increasing the replication factor lets Impala choose
different hosts to process different cached blocks, to better distribute the CPU load.
</p>
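          <p>
            A sketch of the clause (the cache pool and table names are hypothetical):
          </p>
          <codeblock>ALTER TABLE lookup_codes SET CACHED IN 'hot_data_pool' WITH REPLICATION = 3;</codeblock>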
        </li>
        <li>
          <p>
            Several improvements to date and time features enable higher interoperability with Hive and other
            database systems, provide more flexibility for handling time zones, and future-proof the handling of
            <codeph>TIMESTAMP</codeph> values:
          </p>
          <ul>
            <li>
              <p>
Startup flags for the <cmdname>impalad</cmdname> daemon enable a higher level of compatibility with
<codeph>TIMESTAMP</codeph> values written by Hive, and more flexibility for working with date and
time data using the local time zone instead of UTC. To enable these features, set the
<cmdname>impalad</cmdname> startup flags
<codeph>-use_local_tz_for_unix_timestamp_conversions=true</codeph> and
<codeph>-convert_legacy_hive_parquet_utc_timestamps=true</codeph>.
</p>
<p>
The <codeph>-use_local_tz_for_unix_timestamp_conversions</codeph> setting controls how the
<codeph>unix_timestamp()</codeph>, <codeph>from_unixtime()</codeph>, and <codeph>now()</codeph>
functions handle time zones. By default (when this setting is turned off), Impala considers all
<codeph>TIMESTAMP</codeph> values to be in the UTC time zone when converting to or from Unix time
values. When this setting is enabled, Impala treats <codeph>TIMESTAMP</codeph> values passed to or
returned from these functions to be in the local time zone. When this setting is enabled, take
particular care that all hosts in the cluster have the same timezone settings, to avoid
inconsistent results depending on which host reads or writes <codeph>TIMESTAMP</codeph> data.
</p>
<p>
The <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> setting causes Impala to convert
<codeph>TIMESTAMP</codeph> values to the local time zone when it reads them from Parquet files
written by Hive. This setting only applies to data using the Parquet file format, where Impala can
use metadata in the files to reliably determine that the files were written by Hive. If in the
future Hive changes the way it writes <codeph>TIMESTAMP</codeph> data in Parquet, Impala will
automatically handle that new <codeph>TIMESTAMP</codeph> encoding.
</p>
<p>
See <xref href="impala_timestamp.xml#timestamp"/> for details about time zone handling and the
configuration options for Impala / Hive compatibility with Parquet format.
</p>
</li>
<li>
<p conref="../shared/impala_common.xml#common/y2k38" />
<p>
See <xref href="impala_datetime_functions.xml#datetime_functions"/> for the current function
signatures.
</p>
</li>
</ul>
</li>
<li>
<p>
The <codeph>SHOW FILES</codeph> statement lets you view the names and sizes of the files that make up
an entire table or a specific partition. See <xref href="impala_show.xml#show_files"/> for details.
</p>
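          <p>
            For example (the table and partition names are illustrative):
          </p>
          <codeblock>SHOW FILES IN sales_data;
SHOW FILES IN sales_data PARTITION (year=2014);</codeblock>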
</li>
<li>
<p>
Impala can now run queries against Parquet data containing columns with complex or nested types, as
long as the query only refers to columns with scalar types.
</p>
</li>
<li>
<p>
Performance improvements for queries that include <codeph>IN()</codeph> operators and involve
partitioned tables.
</p>
</li>
<li>
<!-- Same text for this item in impala_fixed_issues.xml. Could turn into a conref. -->
<p>
The new <codeph>-max_log_files</codeph> configuration option specifies how many log files to keep at
each severity level. The default value is 10, meaning that Impala preserves the latest 10 log files for
each severity level (<codeph>INFO</codeph>, <codeph>WARNING</codeph>, and <codeph>ERROR</codeph>) for
each Impala-related daemon (<cmdname>impalad</cmdname>, <cmdname>statestored</cmdname>, and
<cmdname>catalogd</cmdname>). Impala checks to see if any old logs need to be removed based on the
interval specified in the <codeph>logbufsecs</codeph> setting, every 5 seconds by default. See
<xref href="impala_logging.xml#logs_rotate"/> for details.
</p>
</li>
<li>
<p>
Redaction of sensitive data from Impala log files. This feature protects details such as credit card
numbers or tax IDs from administrators who see the text of SQL statements in the course of monitoring
and troubleshooting a Hadoop cluster. See <xref href="impala_logging.xml#redaction"/> for background
information for Impala users, and <xref keyref="sg_redaction"/> for usage details.
</p>
</li>
<li>
<p>
Lineage information is available for data created or queried by Impala. This feature lets you track who
has accessed data through Impala SQL statements, down to the level of specific columns, and how data
            has been propagated between tables. See <xref href="impala_lineage.xml#lineage"/> for background
            information for Impala users, and <xref keyref="datamgmt_impala_lineage_log"/> for usage details and
            how to interpret the lineage information.
</p>
</li>
<li>
<p>
Impala tables and partitions can now be located on the Amazon Simple Storage Service (S3) filesystem,
for convenience in cases where data is already located in S3 and you prefer to query it in-place.
            Queries might have lower performance than when the data files reside on HDFS, because Impala cannot
            use certain HDFS-specific optimizations with S3 data. Impala can query data in S3, but cannot write to S3. Therefore, statements
such as <codeph>INSERT</codeph> and <codeph>LOAD DATA</codeph> are not available when the destination
table or partition is in S3. See <xref href="impala_s3.xml#s3"/> for details.
</p>
<note conref="../shared/impala_common.xml#common/s3_caveat" />
</li>
<li>
<!-- Only want the link out of the release notes to appear for HTML
(N.B. audience="PDF" means hide from PDF), and only in the HTML for the
integrated build where the topic is available for link resolution. -->
<p>
Improved support for HDFS encryption. The <codeph>LOAD DATA</codeph> statement now works when the
source directory and destination table are in different encryption zones. See
<xref keyref="cdh_sg_component_kms"/> for details about using HDFS encryption with
Impala.
</p>
</li>
<li>
<p>
Additional arithmetic function <codeph>mod()</codeph>. See
<xref href="impala_math_functions.xml#math_functions"/> for details.
</p>
</li>
<li>
<p>
Flexibility to interpret <codeph>TIMESTAMP</codeph> values using the UTC time zone (the traditional
Impala behavior) or using the local time zone (for compatibility with <codeph>TIMESTAMP</codeph> values
produced by Hive).
</p>
</li>
<li>
<p>
Enhanced support for ETL using tools such as Flume. Impala ignores temporary files typically produced
by these tools (filenames with suffixes <codeph>.copying</codeph> and <codeph>.tmp</codeph>).
</p>
</li>
<li>
<p>
The CPU requirement for Impala, which had become more restrictive in Impala 2.0.x and 2.1.x, has now
been relaxed.
</p>
<p conref="../shared/impala_common.xml#common/cpu_prereq" />
</li>
<li>
<p>
Enhanced support for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> types in the <codeph>COMPUTE
STATS</codeph> statement.
</p>
</li>
<li rev="">
<p>
The amount of memory required during setup for <q>spill to disk</q> operations is greatly reduced. This
enhancement reduces the chance of a memory-intensive join or aggregation query failing with an
out-of-memory error.
</p>
</li>
<li>
<p>
Several new conditional functions provide enhanced compatibility when porting code that uses industry
extensions. The new functions are: <codeph>isfalse()</codeph>, <codeph>isnotfalse()</codeph>,
<codeph>isnottrue()</codeph>, <codeph>istrue()</codeph>, <codeph>nonnullvalue()</codeph>, and
<codeph>nullvalue()</codeph>. See <xref href="impala_conditional_functions.xml#conditional_functions"/>
for details.
</p>
</li>
<li>
<p>
The Impala debug web UI now can display a visual representation of the query plan. On the
<uicontrol>/queries</uicontrol> tab, select <uicontrol>Details</uicontrol> for a particular query. The
<uicontrol>Details</uicontrol> page includes a <uicontrol>Plan</uicontrol> tab with a plan diagram that
            you can zoom in and out on, using mouse wheel or trackpad scroll gestures.
</p>
</li>
</ul>
<!-- End of new feature list for 5.4. -->
</conbody>
</concept>
<!-- All 2.1.x subsections go under here -->
<concept rev="2.1.0" id="new_features_210">
<title>New Features in <keyword keyref="impala21_full"/></title>
<conbody>
<p>
This release contains the following enhancements to query performance and system scalability:
</p>
<ul>
<li>
<p>
Impala can now collect statistics for individual partitions in a partitioned table, rather than
processing the entire table for each <codeph>COMPUTE STATS</codeph> statement. This feature is known as
incremental statistics, and is controlled by the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax.
(You can still use the original <codeph>COMPUTE STATS</codeph> statement for nonpartitioned tables or
partitioned tables that are unchanging or whose contents are entirely replaced all at once.) See
<xref href="impala_compute_stats.xml#compute_stats"/> and
<xref href="impala_perf_stats.xml#perf_stats"/> for details.
</p>
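          <p>
            For example (table and partition names are illustrative):
          </p>
          <codeblock>COMPUTE INCREMENTAL STATS sales_data;
-- Limit the work to a newly added partition:
COMPUTE INCREMENTAL STATS sales_data PARTITION (year=2014, month=12);</codeblock>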
</li>
<li>
<p>
Optimization for small queries lets Impala process queries that process very few rows without the
unnecessary overhead of parallelizing and generating native code. Reducing this overhead lets Impala
clear small queries quickly, keeping YARN resources and admission control slots available for
data-intensive queries. The number of rows considered to be a <q>small</q> query is controlled by the
<codeph>EXEC_SINGLE_NODE_ROWS_THRESHOLD</codeph> query option. See
<xref href="impala_exec_single_node_rows_threshold.xml#exec_single_node_rows_threshold"/> for details.
</p>
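          <p>
            For example, raising the threshold so that slightly larger queries also run on a single node
            (the value is illustrative):
          </p>
          <codeblock>SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=500;</codeblock>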
</li>
<li>
<p>
An enhancement to the statestore component lets it transmit heartbeat information independently of
broadcasting metadata updates. This optimization improves reliability of health checking on large
clusters with many tables and partitions.
</p>
</li>
<li>
<p>
The memory requirement for querying gzip-compressed text is reduced. Now Impala decompresses the data
as it is read, rather than reading the entire gzipped file and decompressing it in memory.
</p>
</li>
</ul>
</conbody>
</concept>
<!-- All 2.0.x subsections go under here -->
<concept rev="2.0.0" id="new_features_200">
<title>New Features in <keyword keyref="impala20_full"/></title>
<conbody>
<p>
The following are the major new features in <keyword keyref="impala20_full"/>. This major release
contains improvements to performance, scalability, security, and SQL syntax.
</p>
<ul>
<li>
<p>
Queries with joins or aggregation functions involving high volumes of data can now use temporary work
areas on disk, reducing the chance of failure due to out-of-memory errors. When the required memory for
the intermediate result set exceeds the amount available on a particular node, the query automatically
uses a temporary work area on disk. This <q>spill to disk</q> mechanism is similar to the <codeph>ORDER
BY</codeph> improvement from Impala 1.4. For details, see
<xref href="impala_scalability.xml#spill_to_disk"/>.
</p>
</li>
<li>
<p>
Subquery enhancements:
<ul>
<li>
Subqueries are now allowed in the <codeph>WHERE</codeph> clause, for example with the
<codeph>IN</codeph> operator.
</li>
<li>
The <codeph>EXISTS</codeph> and <codeph>NOT EXISTS</codeph> operators are available. They are
always used in conjunction with subqueries.
</li>
<li>
              The <codeph>IN</codeph> and <codeph>NOT IN</codeph> operators can now operate on the result set from
              a subquery, not just a hardcoded list of values.
</li>
<li>
Uncorrelated subqueries let you compare against one or more values for equality,
<codeph>IN</codeph>, and <codeph>EXISTS</codeph> comparisons. For example, you might use
              <codeph>WHERE</codeph> clauses such as <codeph>WHERE <varname>column</varname> = (SELECT
              MAX(<varname>some_other_column</varname>) FROM <varname>table</varname>)</codeph> or <codeph>WHERE
              <varname>column</varname> IN (SELECT <varname>some_other_column</varname> FROM
              <varname>table</varname> WHERE <varname>conditions</varname>)</codeph>.
</li>
<li>
Correlated subqueries let you cross-reference values from the outer query block and the subquery.
</li>
<li>
Scalar subqueries let you substitute the result of single-value aggregate functions such as
<codeph>MAX()</codeph>, <codeph>MIN()</codeph>, <codeph>COUNT()</codeph>, or
<codeph>AVG()</codeph>, where you would normally use a numeric value in a <codeph>WHERE</codeph>
clause.
</li>
</ul>
</p>
<p>
          For details about subqueries, see <xref href="impala_subqueries.xml#subqueries"/>. For information about
new and improved operators, see <xref href="impala_operators.xml#exists"/> and
<xref href="impala_operators.xml#in"/>.
</p>
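        <p>
          A minimal sketch of a correlated <codeph>EXISTS</codeph> subquery (names are illustrative):
        </p>
        <codeblock>SELECT c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);</codeblock>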
</li>
<li>
<p>
Analytic functions such as <codeph>RANK()</codeph>, <codeph>LAG()</codeph>, <codeph>LEAD()</codeph>,
and <codeph>FIRST_VALUE()</codeph> let you analyze sequences of rows with flexible ordering and
grouping. Existing aggregate functions such as <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, and
<codeph>COUNT()</codeph> can also be used in an analytic context. See
<xref href="impala_analytic_functions.xml#analytic_functions"/> for details. See
<xref href="impala_aggregate_functions.xml#aggregate_functions"/> for enhancements to existing
aggregate functions.
</p>
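        <p>
          A minimal sketch (table and column names are illustrative):
        </p>
        <codeblock>SELECT dept, name, salary,
       RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
FROM employees;</codeblock>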
</li>
<li>
<p>
New data types provide greater compatibility with source code from traditional database systems:
</p>
<ul>
<li>
<codeph>VARCHAR</codeph> is like the <codeph>STRING</codeph> data type, but with a maximum length.
See <xref href="impala_varchar.xml#varchar"/> for details.
</li>
<li>
<codeph>CHAR</codeph> is like the <codeph>STRING</codeph> data type, but with a precise length. Short
values are padded with spaces on the right. See <xref href="impala_char.xml#char"/> for details.
</li>
<li audience="hidden">
<!-- This feature will be undocumented in Impala 2.0, probably ready for prime time in 2.1. -->
<codeph>DATE</codeph>. See <xref href="impala_date.xml#date"/> for details.
</li>
</ul>
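        <p>
          A minimal sketch combining the two types (names and lengths are illustrative):
        </p>
        <codeblock>CREATE TABLE country_codes (code CHAR(2), country_name VARCHAR(100));</codeblock>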
</li>
<li>
<p>
Security enhancements:
<ul>
<li>
Formerly, Impala was restricted to using either Kerberos or LDAP / Active Directory authentication
within a cluster. Now, Impala can freely accept either kind of authentication request, allowing you
to set up some hosts with Kerberos authentication and others with LDAP or Active Directory. See
<xref href="impala_mixed_security.xml#mixed_security"/> for details.
</li>
<li>
<codeph>GRANT</codeph> statement. See <xref href="impala_grant.xml#grant"/> for details.
</li>
<li>
<codeph>REVOKE</codeph> statement. See <xref href="impala_revoke.xml#revoke"/> for details.
</li>
<li>
<codeph>CREATE ROLE</codeph> statement. See <xref href="impala_create_role.xml#create_role"/> for
details.
</li>
<li>
<codeph>DROP ROLE</codeph> statement. See <xref href="impala_drop_role.xml#drop_role"/> for
details.
</li>
<li>
<codeph>SHOW ROLES</codeph> and <codeph>SHOW ROLE GRANT</codeph> statements. See
<xref href="impala_show.xml#show"/> for details.
</li>
<li>
<p>
To complement the HDFS encryption feature, a new Impala configuration option,
                <codeph>--disk_spill_encryption</codeph>, secures sensitive data from being observed or tampered
with when temporarily stored on disk.
</p>
</li>
</ul>
</p>
<p>
The new security-related SQL statements work along with the Sentry authorization framework. See
<xref keyref="authorization"/> for details.
</p>
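        <p>
          A brief sketch of the role workflow (the role, group, and database names are hypothetical):
        </p>
        <codeblock>CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;
GRANT SELECT ON DATABASE sales TO ROLE analyst_role;
SHOW ROLE GRANT GROUP analysts;</codeblock>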
</li>
<li>
<p>
Impala can now read compressed text files compressed by gzip, bzip, or Snappy. These files do not
require any special table settings to work in an Impala text table. Impala recognizes the compression
type automatically based on file extensions of <codeph>.gz</codeph>, <codeph>.bz2</codeph>, and
<codeph>.snappy</codeph> respectively. These types of compressed text files are intended for
convenience with existing ETL pipelines. Their non-splittable nature means they are not optimal for
high-performance parallel queries. See <xref href="impala_txtfile.xml#gzip"/> for details.
</p>
</li>
<li>
<p>
Query hints can now use comment notation, <codeph>/* +<varname>hint_name</varname> */</codeph> or
<codeph>-- +<varname>hint_name</varname></codeph>, at the same places in the query where the hints
enclosed by <codeph>[ ]</codeph> are recognized. This enhancement makes it easier to reuse Impala
queries on other database systems. See <xref href="impala_hints.xml#hints"/> for details.
</p>
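        <p>
          For example, with the hint placed right after the <codeph>JOIN</codeph> keyword
          (names are illustrative):
        </p>
        <codeblock>SELECT c.name, o.total
FROM customers c JOIN /* +SHUFFLE */ orders o ON c.id = o.customer_id;</codeblock>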
</li>
<li>
<p>
A new query option, <codeph>QUERY_TIMEOUT_S</codeph>, lets you specify a timeout period in seconds for
individual queries.
</p>
<p>
          The behavior of the <codeph>--idle_query_timeout</codeph> configuration option is extended. If no
          <codeph>QUERY_TIMEOUT_S</codeph> query option is in effect, <codeph>--idle_query_timeout</codeph> works
          the same as before, setting the timeout interval. When the <codeph>QUERY_TIMEOUT_S</codeph> query option
          is specified, its maximum value is capped by the value of the <codeph>--idle_query_timeout</codeph>
          option.
</p>
<p>
That is, the system administrator sets the default and maximum timeout through the
<codeph>--idle_query_timeout</codeph> startup option, and then individual users or applications can set
a lower timeout value if desired through the <codeph>QUERY_TIMEOUT_S</codeph> query option. See
<xref href="impala_timeouts.xml#timeouts"/> and
<xref href="impala_query_timeout_s.xml#query_timeout_s"/> for details.
</p>
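        <p>
          For example, giving an individual session a 10-minute query timeout (the value is illustrative):
        </p>
        <codeblock>SET QUERY_TIMEOUT_S=600;</codeblock>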
</li>
<li>
<p>
New functions <codeph>VAR_SAMP()</codeph> and <codeph>VAR_POP()</codeph> are aliases for the existing
<codeph>VARIANCE_SAMP()</codeph> and <codeph>VARIANCE_POP()</codeph> functions.
</p>
</li>
<li>
<p>
A new date and time function, <codeph>DATE_PART()</codeph>, provides similar functionality to
<codeph>EXTRACT()</codeph>. You can also call the <codeph>EXTRACT()</codeph> function using the SQL-99
syntax, <codeph>EXTRACT(<varname>unit</varname> FROM <varname>timestamp</varname>)</codeph>. These
enhancements simplify the porting process for date-related code from other systems. See
<xref href="impala_datetime_functions.xml#datetime_functions"/> for details.
</p>
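        <p>
          For example, these two expressions return the same value:
        </p>
        <codeblock>SELECT date_part('year', now()) AS y1,
       EXTRACT(YEAR FROM now()) AS y2;</codeblock>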
</li>
<li>
<p>
New approximation features provide a fast way to get results when absolute precision is not required:
</p>
<ul>
<li>
The <codeph>APPX_COUNT_DISTINCT</codeph> query option lets Impala rewrite
<codeph>COUNT(DISTINCT)</codeph> calls to use <codeph>NDV()</codeph> instead, which speeds up the
operation and allows multiple <codeph>COUNT(DISTINCT)</codeph> operations in a single query. See
<xref href="impala_appx_count_distinct.xml#appx_count_distinct"/> for details.
</li>
</ul>
The <codeph>APPX_MEDIAN()</codeph> aggregate function produces an estimate for the median value of a
column by using sampling. See <xref href="impala_appx_median.xml#appx_median"/> for details.
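        <p>
          A minimal sketch combining both approximation features (table and column names are illustrative):
        </p>
        <codeblock>SET APPX_COUNT_DISTINCT=TRUE;
SELECT COUNT(DISTINCT visitor_id) AS approx_visitors,
       APPX_MEDIAN(session_seconds) AS median_session
FROM web_sessions;</codeblock>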
</li>
<li>
<p>
Impala now supports a <codeph>DECODE()</codeph> function. This function works as a shorthand for a
<codeph>CASE()</codeph> expression, and improves compatibility with SQL code containing vendor
extensions. See <xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
</p>
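        <p>
          For example (the column names and codes are illustrative):
        </p>
        <codeblock>SELECT id,
       DECODE(status_code, 1, 'active', 2, 'suspended', 'unknown') AS status
FROM accounts;</codeblock>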
</li>
<li>
<p>
The <codeph>STDDEV()</codeph>, <codeph>STDDEV_POP()</codeph>, <codeph>STDDEV_SAMP()</codeph>,
<codeph>VARIANCE()</codeph>, <codeph>VARIANCE_POP()</codeph>, <codeph>VARIANCE_SAMP()</codeph>, and
<codeph>NDV()</codeph> aggregate functions now all return <codeph>DOUBLE</codeph> results rather than
<codeph>STRING</codeph>. Formerly, you were required to <codeph>CAST()</codeph> the result to a numeric
type before using it in arithmetic operations.
</p>
</li>
<li id="parquet_block_size">
<p>
The default settings for Parquet block size, and the associated <codeph>PARQUET_FILE_SIZE</codeph>
query option, are changed. Now, Impala writes Parquet files with a size of 256 MB and an HDFS block
size of 256 MB. Previously, Impala attempted to write Parquet files with a size of 1 GB and an HDFS
block size of 1 GB. In practice, Impala used a conservative estimate of the disk space needed for each
Parquet block, leading to files that were typically 512 MB anyway. Thus, this change will make the file
size more accurate if you specify a value for the <codeph>PARQUET_FILE_SIZE</codeph> query option. It
also reduces the amount of memory reserved during <codeph>INSERT</codeph> into Parquet tables,
potentially avoiding out-of-memory errors and improving scalability when inserting data into Parquet
tables.
</p>
</li>
<li>
<p>
Anti-joins are now supported, expressed using the <codeph>LEFT ANTI JOIN</codeph> and <codeph>RIGHT
ANTI JOIN</codeph> clauses.
<!-- Maybe RIGHT SEMI JOIN is new too? -->
<!-- Make following statement true in the context of RIGHT ANTI JOIN. -->
            These clauses return results from one table that have no match in the other table. You might use this
type of join in the same sorts of use cases as the <codeph>NOT EXISTS</codeph> and <codeph>NOT
IN</codeph> operators. See <xref href="impala_joins.xml#joins"/> for details.
</p>
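        <p>
          A minimal sketch that finds customers with no orders (names are illustrative):
        </p>
        <codeblock>SELECT c.id, c.name
FROM customers c LEFT ANTI JOIN orders o ON c.id = o.customer_id;</codeblock>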
</li>
<li audience="hidden">
<!-- This feature will be undocumented in Impala 2.0, probably ready for prime time in 2.1. -->
<p>
Improved file format support. Impala can now write to Avro, compressed text, SequenceFile, and RCFile
tables using the <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> statements. See
<xref href="impala_file_formats.xml#file_formats"/> for details.
</p>
</li>
<li>
<p>
The <codeph>SET</codeph> command in <cmdname>impala-shell</cmdname> has been promoted to a real SQL
statement. You can now set query options such as <codeph>PARQUET_FILE_SIZE</codeph>,
<codeph>MEM_LIMIT</codeph>, and <codeph>SYNC_DDL</codeph> within JDBC, ODBC, or any other kind of
application that submits SQL without going through the <cmdname>impala-shell</cmdname> interpreter. See
<xref href="impala_set.xml#set"/> for details.
</p>
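        <p>
          For example, statements such as the following can now be submitted through JDBC or ODBC as
          ordinary SQL (the values are illustrative):
        </p>
        <codeblock>SET SYNC_DDL=1;
SET MEM_LIMIT=2000000000;</codeblock>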
</li>
<li>
<p>
The <cmdname>impala-shell</cmdname> interpreter now reads settings from an optional configuration file,
named <filepath>$HOME/.impalarc</filepath> by default. See
<xref href="impala_shell_options.xml#shell_config_file"/> for details.
</p>
</li>
<li audience="hidden">
<!-- This feature will be undocumented in Impala 2.0, probably ready for prime time in 2.1. -->
<p>
The <codeph>COMPUTE STATS</codeph> statement can now gather statistics for newly added partitions
rather than the entire table. This feature is known as <term>incremental statistics</term>. See
<xref href="impala_compute_stats.xml#compute_stats"/> for details.
</p>
</li>
<li>
<p>
The library used for regular expression parsing has changed from Boost to Google RE2. This
implementation change adds support for non-greedy matches using the <codeph>.*?</codeph> notation. This
          and other changes in the way regular expressions are interpreted mean you might need to re-test
queries that use functions such as <codeph>regexp_extract()</codeph> or
<codeph>regexp_replace()</codeph>, or operators such as <codeph>REGEXP</codeph> or
<codeph>RLIKE</codeph>. See <xref href="impala_incompatible_changes.xml#incompatible_changes"/> for
those details.
</p>
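        <p>
          A small sketch of a non-greedy match (the strings are illustrative; with RE2 this should
          return <codeph>alice</codeph> rather than the longer greedy match):
        </p>
        <codeblock>SELECT regexp_extract('name=alice;name=bob;', 'name=(.*?);', 1);</codeblock>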
</li>
</ul>
</conbody>
</concept>
<concept rev="1.4.0" id="new_features_140">
<title>New Features in <keyword keyref="impala14_full"/></title>
<conbody>
<p>
The following are the major new features in <keyword keyref="impala14_full"/>:
</p>
<ul>
<li>
<p>
The <codeph>DECIMAL</codeph> data type lets you store fixed-precision values, for working with currency
or other fractional values where it is important to represent values exactly and avoid rounding errors.
This feature includes enhancements to built-in functions, numeric literals, and arithmetic expressions.
<ph audience="PDF">See <xref href="impala_decimal.xml#decimal"/> for details.</ph>
</p>
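          <p>
            A minimal sketch (the names and precision are illustrative):
          </p>
          <codeblock>CREATE TABLE prices (item_id BIGINT, unit_price DECIMAL(9,2));
SELECT CAST(19.99 AS DECIMAL(9,2)) * 3;</codeblock>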
</li>
<li>
<p>
Where the underlying HDFS support exists, Impala can take advantage of the HDFS caching feature to <q>pin</q> entire tables or
individual partitions in memory, to speed up queries on frequently accessed data and reduce the CPU
overhead of memory-to-memory copying. When HDFS files are cached in memory, Impala can read the cached
data without any disk reads, and without making an additional copy of the data in memory. Other Hadoop
components that read the same data files also experience a performance benefit.
</p>
<p audience="PDF">
For background information about HDFS caching, see
<xref keyref="setup_hdfs_caching"/>. For performance information about using this feature with Impala, see
<xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>. For the <codeph>SET CACHED</codeph> and
<codeph>SET UNCACHED</codeph> clauses that let you control cached table data through DDL statements,
see <xref href="impala_create_table.xml#create_table"/> and
<xref href="impala_alter_table.xml#alter_table"/>.
</p>
</li>
<li>
<p>
Impala can now use Sentry-based authorization based either on the original policy file, or on rules
defined by <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements issued through Hive.
See <xref keyref="authorization"/> for details.
</p>
</li>
<li>
<p>
For interoperability with Parquet files created through other Hadoop components, such as Pig or
MapReduce jobs, you can create an Impala table that automatically sets up the column definitions based
on the layout of an existing Parquet data file. <ph audience="PDF">See
<xref href="impala_create_table.xml#create_table"/> for the syntax, and
<xref href="impala_parquet.xml#parquet_ddl"/> for usage information.</ph>
</p>
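          <p>
            A sketch of the syntax (the HDFS path is hypothetical):
          </p>
          <codeblock>CREATE TABLE events
  LIKE PARQUET '/user/etl/samples/events.parq'
  STORED AS PARQUET;</codeblock>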
</li>
<li>
<p>
<codeph>ORDER BY</codeph> queries no longer require a <codeph>LIMIT</codeph> clause. If the size of the
result set to be sorted exceeds the memory available to Impala, Impala uses a temporary work space on
disk to perform the sort operation. <ph audience="PDF">See <xref href="impala_order_by.xml#order_by"/>
for details.</ph>
</p>
</li>