<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="ver" id="new_features">
<title><ph audience="standalone">New Features in Apache Impala</ph><ph audience="integrated">What's New in Apache Impala</ph></title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Release Notes"/>
<data name="Category" value="New Features"/>
<data name="Category" value="What's New"/>
<data name="Category" value="Getting Started"/>
<data name="Category" value="Upgrading"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
This release of Impala contains the following changes and enhancements from previous releases.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept rev="3.2.0" id="new_features_33">
<title>New Features in <keyword keyref="impala33"/></title>
<conbody>
<p> The following sections describe the noteworthy improvements made in
<keyword keyref="impala33"/>. </p>
<p> For the full list of issues closed in this release, see the <xref
keyref="changelog_33">changelog for <keyword keyref="impala33"
/></xref>. </p>
<section id="section_ezf_tnq_s3b">
<title>Increased Compatibility with Apache Projects</title>
          <p>Impala is integrated with the following components:<ul>
<li dir="ltr">
<p dir="ltr">Apache Ranger: Use Apache Ranger to manage
authorization in Impala. See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_authorization.html"
format="html" scope="external"><u>Impala
Authorization</u></xref> for details.</p>
</li>
<li dir="ltr">
<p dir="ltr">Apache Atlas: Use Apache Atlas to manage data
governance in Impala.</p>
</li>
<li dir="ltr">
<p dir="ltr">Hive 3</p>
</li>
</ul></p>
</section>
<section id="section_ys5_k4n_t3b">
<title>Parquet Page Index </title>
        <p>To improve performance when using Parquet files, Impala can now write
          page indexes in Parquet files and use those indexes to skip pages for
          faster scans.</p>
<p>See <xref href="impala_parquet.xml#parquet_performance"/> for
details.</p>
</section>
<section id="section_zs5_k4n_t3b">
<title>The Remote File Handle Cache Supports S3</title>
        <p>Impala can now cache remote file handles for the tables that
          store their data in Amazon S3 cloud storage.</p>
        <p>See <xref href="impala_scalability.xml#scalability_file_handle_cache"
            /> for information on the remote file handle cache.</p>
</section>
<section id="section_jls_hxj_s3b">
<title>Support for Kudu Integrated with Hive Metastore</title>
        <p>In Impala 3.3 and Kudu 1.10, Kudu is integrated with the Hive Metastore
          (HMS), and from Impala you can create, update, delete, and query
          tables in Kudu services that are integrated with HMS.</p>
<p>See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_kudu.html"
format="html" scope="external">Using Kudu with Impala</xref> for
information on using Kudu tables in Impala.</p>
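        <p>For example, the following sketch (with a hypothetical table name) creates
          and queries a Kudu table entirely from Impala:</p>
<codeblock>CREATE TABLE kudu_metrics (
  id BIGINT PRIMARY KEY,
  metric STRING,
  val DOUBLE
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

INSERT INTO kudu_metrics VALUES (1, 'cpu', 0.75);
SELECT * FROM kudu_metrics WHERE id = 1;</codeblock>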
</section>
<section id="section_dp4_mxj_s3b">
        <title>Zstd Compression for Parquet Files</title>
        <p>Zstandard (Zstd) is a real-time compression algorithm offering a
          tradeoff between speed and compression ratio. Compression levels
          from 1 up to 22 are supported. The lower the level, the faster the
          speed, at the cost of compression ratio.</p>
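        <p>A minimal sketch, assuming a hypothetical <codeph>sales</codeph> table:
          you select the codec, and optionally the level, through the
          <codeph>COMPRESSION_CODEC</codeph> query option before writing Parquet data.</p>
<codeblock>-- In impala-shell: write subsequent Parquet files with Zstd at level 12.
SET COMPRESSION_CODEC=ZSTD:12;
CREATE TABLE sales_zstd STORED AS PARQUET AS SELECT * FROM sales;</codeblock>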
</section>
<section id="section_parquet_lz4_notes">
        <title>LZ4 Compression for Parquet Files</title>
        <p>LZ4 is a lossless compression algorithm providing extremely fast
          and scalable compression and decompression.</p>
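        <p>A similar sketch for LZ4, again with hypothetical table names:</p>
<codeblock>-- In impala-shell: write subsequent Parquet files with LZ4 compression.
SET COMPRESSION_CODEC=LZ4;
CREATE TABLE sales_lz4 STORED AS PARQUET AS SELECT * FROM sales;</codeblock>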
</section>
<section id="section_drv_nxj_s3b">
<title>Data Cache for Remote Reads</title>
<p>To improve performance on multi-cluster HDFS environments as well as
on object store environments, Impala now caches data for non-local
reads (e.g. S3, ABFS, ADLS) on local storage.</p>
        <p>The data cache is enabled with the <codeph>--data_cache</codeph>
          startup flag.</p>
        <p>See <xref
            href="https://impala.apache.org/docs/build/html/topics/impala_data_cache.html"
            format="html" scope="external">Impala Remote Data Cache</xref> for
          information and steps to enable the remote data cache.</p>
</section>
<section id="section_xp4_b1f_t3b">
<title>Metadata Performance Improvements </title>
<p>The following features to improve metadata performance are enabled by
default in this release:</p>
<ul>
<li>
<p>Incremental stats are now compressed in memory in
<codeph>catalogd</codeph>, reducing memory footprint in
<codeph>catalogd</codeph>.</p>
</li>
<li>
            <p><codeph>impalad</codeph> coordinators fetch incremental stats from
              <codeph>catalogd</codeph> on demand, reducing the memory
              footprint and the network requirements for broadcasting
              metadata.</p>
</li>
<li>
            <p>Time-based and memory-based automatic invalidation of metadata to
              keep the size of metadata bounded and to reduce the chances of the
              <codeph>catalogd</codeph> cache running out of memory.</p>
</li>
<li>
<p>Automatic invalidation of metadata</p>
            <p>With automatic metadata management enabled, you no longer have to
              issue <codeph>INVALIDATE</codeph> / <codeph>REFRESH</codeph> in a
              number of situations.</p>
            <p>In Impala 3.3, the following additional event in Hive Metastore
              can trigger an automatic INVALIDATE / REFRESH of metadata:</p>
<ul>
<li>
                <p>INSERT into tables and partitions from Impala or from Spark,
                  in single-cluster or multi-cluster configurations</p>
</li>
</ul>
</li>
</ul>
<p>See <xref href="impala_metadata.xml#impala_metadata"/> for the
information on the above features.</p>
</section>
<section id="section_ztf_c4q_s3b">
<title>Scalable Pool Configuration in Admission Controller</title>
        <p>To offer more dynamic and flexible resource management, Impala
          supports new configuration parameters that scale with the number
          of hosts in the resource pool. You can use these parameters to control
          the number of running queries, the number of queued queries, and the
          maximum amount of memory allocated for Impala resource pools. See <xref
            href="impala_admission.xml#admission_control"/> for information
          about the new parameters and how to use them for admission control.</p>
</section>
<section id="section_b55_gxj_s3b">
<title>Query Profile</title>
<p>The following information was added to the Query Profile output for
better monitoring and troubleshooting of query performance.</p>
<ul>
<li>
<p>Network I/O throughput</p>
</li>
<li>
<p>System disk I/O throughput</p>
</li>
</ul>
        <p>See <xref
            href="https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html"
            format="html" scope="external">Impala Query Profile</xref> for
          information on generating and reading query profiles.</p>
</section>
<section id="section_lbh_kzj_s3b">
<title>DATE Data Type and Functions</title>
        <p>You can use the new DATE type to describe a particular
          year/month/day, in the form YYYY-MM-DD.</p>
        <p>This initial DATE type support covers the TEXT, Parquet, and HBASE file
          formats.</p>
<p>The support of DATE data type includes the following features:</p>
<ul>
<li><codeph>DATE</codeph> type column as a partitioning key
column</li>
<li><codeph>DATE</codeph> literal</li>
<li>Implicit casting between <codeph>DATE</codeph> and other types:
<codeph>STRING</codeph> and <codeph>TIMESTAMP</codeph></li>
          <li>Most of the built-in functions for <codeph>TIMESTAMP</codeph> now
            accept <codeph>DATE</codeph> type arguments as well.</li>
</ul>
<p>See <xref href="impala_date.xml#date"/> and <xref
href="impala_datetime_functions.xml#datetime_functions"/> for using
the DATE type.</p>
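        <p>A short sketch of the DATE type in use, with hypothetical table and
          column names:</p>
<codeblock>CREATE TABLE events (name STRING)
  PARTITIONED BY (event_date DATE)
  STORED AS TEXTFILE;

-- DATE literal used as a partition value.
INSERT INTO events PARTITION (event_date = DATE '2019-08-01') VALUES ('release');

-- Explicit cast from STRING to DATE; implicit casts also apply in comparisons.
SELECT name FROM events WHERE event_date = CAST('2019-08-01' AS DATE);</codeblock>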
</section>
<section id="section_wpm_zzj_s3b">
<title>Support Hive Insert-Only Transactional Tables</title>
        <p>Impala added support for creating, dropping, querying, and inserting
          into insert-only transactional tables.</p>
<p>See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_transactions.html"
format="html" scope="external">Impala Transactions</xref> for
details.</p>
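        <p>A minimal sketch of an insert-only transactional table, assuming a
          hypothetical table name and the usual insert-only table properties:</p>
<codeblock>CREATE TABLE ins_only_txn (id BIGINT, note STRING)
  STORED AS PARQUET
  TBLPROPERTIES ('transactional'='true',
                 'transactional_properties'='insert_only');

INSERT INTO ins_only_txn VALUES (1, 'first row');
SELECT * FROM ins_only_txn;</codeblock>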
</section>
<section id="section_ab2_41k_s3b">
<title>HiveServer2 HTTP Connection for Clients</title>
        <p>Client applications can now connect to Impala over HTTP via
          HiveServer2, with the option to use Kerberos SPNEGO or LDAP for
          authentication. See <xref
            href="https://impala.apache.org/docs/build/html/topics/impala_client.html"
            format="html" scope="external">Impala Clients</xref> for
          details.</p>
</section>
<section id="section_xxt_44q_s3b">
<title>Default File Format Changed to Parquet</title>
        <p>When you create a table, the default file format for that table is
          now Parquet.</p>
        <p>For backward compatibility, you can use the
          <codeph>DEFAULT_FILE_FORMAT</codeph> query option to set the default
          file format back to the previous default, text, or to other formats.</p>
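        <p>For example, this sketch (with a hypothetical table name) restores the
          previous text default for the current session:</p>
<codeblock>-- In impala-shell: revert to the pre-3.3 default for new tables in this session.
SET DEFAULT_FILE_FORMAT=TEXT;
CREATE TABLE legacy_events (id BIGINT, msg STRING);  -- created as a text table</codeblock>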
</section>
<section id="section_m1h_mnf_t3b">
<title>Built-in Function to Process JSON Objects</title>
        <p>The <codeph>GET_JSON_OBJECT()</codeph> function extracts a JSON object
          from a string based on the specified path and returns the extracted
          JSON object.</p>
        <p>See <xref href="impala_misc_functions.xml#misc_functions">Impala
            Miscellaneous Functions</xref> for details.</p>
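        <p>A quick sketch of the function with inline JSON strings:</p>
<codeblock>-- Returns 2, the value at path $.a.b within the JSON string.
SELECT GET_JSON_OBJECT('{"a": {"b": 2}}', '$.a.b');

-- Returns apple, the first element of the fruit array.
SELECT GET_JSON_OBJECT('{"fruit": ["apple", "pear"]}', '$.fruit[0]');</codeblock>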
</section>
<section id="section_acs_wck_s3b">
<title>Ubuntu 18.04</title>
<p>This version of Impala is certified to run on Ubuntu 18.04.</p>
</section>
</conbody>
</concept>
<concept rev="3.2.0" id="new_features_32">
<title>New Features in <keyword keyref="impala32"/></title>
<conbody>
<p> The following sections describe the noteworthy improvements made in
<keyword keyref="impala32"/>. </p>
<p> For the full list of issues closed in this release, see the <xref
keyref="changelog_32">changelog for <keyword keyref="impala32"
/></xref>. </p>
</conbody>
<concept id="rn_32_multi_cluster">
<title>Multi-cluster Support</title>
<conbody>
<ul>
<li dir="ltr">Remote File Handle Cache<p>Impala can now cache remote
HDFS file handles when the
<codeph>cache_remote_file_handles</codeph> impalad flag is set
to <codeph>true</codeph>. This feature does not apply to non-HDFS
tables, such as Kudu or HBase tables, and does not apply to the
tables that store their data on cloud services, such as S3 or
ADLS. See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_scalability.html"
format="html" scope="external">Scalabilty Considerations</xref>
for file handle caching in Impala.</p></li>
</ul>
</conbody>
</concept>
<concept id="rn_32_ac">
<title>Enhancements in Resource Management and Admission Control</title>
<conbody>
<ul>
<li>Admission Debug page is available in <xref
href="https://impala.apache.org/docs/build/html/topics/impala_webui.html"
format="html" scope="external">Impala Daemon (impalad) web
              UI</xref> at <codeph>/admission</codeph> and provides the
following information about Impala resource pools:<ul>
<li>Pool configuration</li>
<li>Relevant pool stats</li>
<li>Queued queries in order of being queued (local to the
coordinator)</li>
<li>Running queries (local to this coordinator)</li>
<li>Histogram of the distribution of peak memory usage by admitted
queries</li>
</ul></li>
</ul>
<ul>
<li>A new query option, <xref
href="https://impala.apache.org/docs/build/html/topics/impala_num_rows_produced_limit.html"
format="html" scope="external">NUM_ROWS_PRODUCED_LIMIT</xref>, was
added to limit the number of rows returned from queries.<p>Impala
will cancel a query if the query produces more rows than the limit
specified by this query option. The limit applies only when the
results are returned to a client, e.g. for a
<codeph>SELECT</codeph> query, but not an
              <codeph>INSERT</codeph> query. This query option is a guardrail
              against users accidentally submitting queries that return a large
              number of rows. A usage sketch follows this list.</p></li>
</ul>
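        <p>A minimal sketch of this guardrail, with a hypothetical table name:</p>
<codeblock>-- In impala-shell: cancel any query in this session that produces more than 100,000 rows.
SET NUM_ROWS_PRODUCED_LIMIT=100000;
SELECT * FROM very_large_table;   -- cancelled if it produces more than 100,000 rows
SET NUM_ROWS_PRODUCED_LIMIT=0;    -- 0 removes the limit</codeblock>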
</conbody>
</concept>
<concept id="rn_32_metadata">
<title>Metadata Performance Improvements</title>
<conbody>
<ul>
<li><xref
href="https://impala.apache.org/docs/build/html/topics/impala_metadata.html"
format="html" scope="external">Automatic Metadata Sync using Hive
Metastore Notification Events</xref><p>When enabled, the
<codeph>catalogd</codeph> polls Hive Metastore (HMS)
              notification events at a configurable interval and syncs with
HMS. You can use the new web UI pages of the
<codeph>catalogd</codeph> to check the state of the automatic
invalidate event processor. </p><p><b>Note</b>: This is a preview
feature in <keyword keyref="impala32">Impala
3.2</keyword>.</p></li>
</ul>
</conbody>
</concept>
<concept id="rn_32_usability">
<title>Compatibility and Usability Enhancements</title>
<conbody>
<ul>
<li>Impala can now read the <codeph>TIMESTAMP_MILLIS</codeph> and
<codeph>TIMESTAMP_MICROS</codeph> Parquet types. See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_parquet.html"
format="html" scope="external">Using Parquet File Format for
            Impala Tables</xref> for details on Parquet support in Impala.</li>
          <li>Impala can now read complex types in ORC, such as ARRAY,
STRUCT, and MAP. See <xref
href="https://impala.apache.org/docs/build/html/topics/impala_orc.html"
format="html" scope="external">Using ORC File Format for Impala
Tables</xref> for the ORC support in Impala.</li>
          <li>The <xref
              href="https://impala.apache.org/docs/build/html/topics/impala_string_functions.html"
              format="html" scope="external">LEVENSHTEIN</xref> string function
            is supported.<p>The function returns the Levenshtein distance
              between two input strings, that is, the minimum number of
              single-character edits required to transform one string into the
              other. See the example after this list.</p></li>
<li>The <codeph>IF NOT EXISTS</codeph> clause is supported in the
<xref
href="https://impala.apache.org/docs/build/html/topics/impala_alter_table.html"
format="html" scope="external"><codeph>ALTER TABLE</codeph></xref>
statement.</li>
<li>The new <xref
href="https://impala.apache.org/docs/build/html/topics/impala_default_file_format.html"
format="html" scope="external"
><codeph>DEFAULT_FILE_FORMAT</codeph></xref> query option allows
you to set the default table file format. This removes the need for
            the <codeph>STORED AS &lt;format></codeph> clause. Set this option
            if you prefer a default other than <codeph>TEXT</codeph>. The
supported formats are: <ul>
<li><codeph>TEXT</codeph></li>
<li><codeph>RC_FILE</codeph></li>
<li><codeph>SEQUENCE_FILE</codeph></li>
<li><codeph>AVRO</codeph></li>
<li><codeph>PARQUET</codeph></li>
<li><codeph>KUDU</codeph></li>
<li><codeph>ORC</codeph></li>
</ul></li>
<li>The extended or verbose <xref
href="https://impala.apache.org/docs/build/html/topics/impala_explain.html"
format="html" scope="external"><codeph>EXPLAIN</codeph></xref>
output includes the following new information for queries:<ul>
<li>The text of the analyzed query that may have been rewritten to
include various optimizations and implicit casts. </li>
<li>The implicit casts and literals shown with the actual
types.</li>
</ul></li>
<li>CPU resource utilization (user, system, iowait) metrics were added
to the <xref
href="https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html"
format="html" scope="external">Impala profile</xref> output.</li>
</ul>
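        <p>A quick sketch of the <codeph>LEVENSHTEIN</codeph> function referenced above:</p>
<codeblock>-- Three single-character edits turn 'kitten' into 'sitting'.
SELECT LEVENSHTEIN('kitten', 'sitting');   -- returns 3</codeblock>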
</conbody>
</concept>
<concept id="rn_32_security">
<title><b id="docs-internal-guid-e1c558d3-7fff-4d4e-0ec1-e40f60c9b64a"
><b>Security Enhancement</b></b></title>
<conbody>
<ul>
<li>The <xref
href="https://impala.apache.org/docs/build/html/topics/impala_refresh_authorization.html"
format="html" scope="external">REFRESH AUTHORIZATION</xref>
statement was implemented for refreshing authorization data.</li>
</ul>
</conbody>
</concept>
</concept>
<!-- All 3.1.x new features go under here -->
<concept rev="3.1.0" id="new_features_31">
<title>New Features in <keyword keyref="impala31"/></title>
<conbody>
<p> For the full list of issues closed in this release, including the
issues marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_31">changelog for <keyword keyref="impala31"
/></xref>. </p>
</conbody>
</concept>
<!-- All 3.0.x new features go under here -->
<concept rev="3.0.0" id="new_features_300">
<title>New Features in <keyword keyref="impala30"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the
issues marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_300">changelog for <keyword keyref="impala30"
/></xref>.
</p>
</conbody>
</concept>
<!-- All 2.12.x new features go under here -->
<concept rev="2.12.0" id="new_features_2120">
<title>New Features in <keyword keyref="impala212_full"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the issues
marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_212">changelog for <keyword keyref="impala212"/></xref>.
</p>
</conbody>
</concept>
<!-- All 2.11.x new features go under here -->
<concept rev="2.11.0" id="new_features_2110">
<title>New Features in <keyword keyref="impala211_full"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the issues
marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_211">changelog for <keyword keyref="impala211"/></xref>.
</p>
</conbody>
</concept>
<!-- All 2.10.x new features go under here -->
<concept rev="2.10.0" id="new_features_2100">
<title>New Features in <keyword keyref="impala210_full"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the issues
marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_210">changelog for <keyword keyref="impala210"/></xref>.
</p>
</conbody>
</concept>
<!-- All 2.9.x new features go under here -->
<concept rev="2.9.0" id="new_features_290">
<title>New Features in <keyword keyref="impala29_full"/></title>
<conbody>
<p>
For the full list of issues closed in this release, including the issues
marked as <q>new features</q> or <q>improvements</q>, see the
<xref keyref="changelog_29">changelog for <keyword keyref="impala29"/></xref>.
</p>
<p>
The following are some of the most significant new features in this release:
</p>
<ul id="feature_list">
<li>
<p rev="IMPALA-4729">
          A new function, <codeph>replace()</codeph>, is faster than
          <codeph>regexp_replace()</codeph> for simple string substitutions.
          See <xref keyref="string_functions"/> for details, and the example after this list.
</p>
</li>
<li>
<p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503">
Startup flags for the <cmdname>impalad</cmdname> daemon, <codeph>is_executor</codeph>
and <codeph>is_coordinator</codeph>, let you divide the work on a large, busy cluster
between a small number of hosts acting as query coordinators, and a larger number of
hosts acting as query executors. By default, each host can act in both roles,
potentially introducing bottlenecks during heavily concurrent workloads.
See <xref keyref="scalability_coordinator"/> for details.
</p>
</li>
</ul>
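      <p>A quick sketch of the <codeph>replace()</codeph> function mentioned above:</p>
<codeblock>-- Simple literal substitution, with no regular expression processing.
SELECT replace('hello world', 'world', 'Impala');   -- returns 'hello Impala'</codeblock>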
</conbody>
</concept>
<!-- All 2.8.x new features go under here -->
<concept rev="2.8.0" id="new_features_280">
<title>New Features in <keyword keyref="impala28_full"/></title>
<conbody>
<ul id="feature_list">
<li>
<p>
Performance and scalability improvements:
</p>
<ul>
<li>
<p rev="IMPALA-4572">
The <codeph>COMPUTE STATS</codeph> statement can
take advantage of multithreading.
</p>
</li>
<li>
<p rev="IMPALA-4135">
Improved scalability for highly concurrent loads by reducing the possibility of TCP/IP timeouts.
A configuration setting, <codeph>accepted_cnxn_queue_depth</codeph>, can be adjusted upwards to
avoid this type of timeout on large clusters.
</p>
</li>
<li>
<p>
Several performance improvements were made to the mechanism for generating native code:
</p>
<ul>
<li>
<p rev="IMPALA-3638">
Some queries involving analytic functions can take better advantage of native code generation.
</p>
</li>
<li>
<p rev="IMPALA-4008">
Modules produced during intermediate code generation are organized
to be easier to cache and reuse during the lifetime of a long-running or complicated query.
</p>
</li>
<li>
<p rev="IMPALA-4397 IMPALA-1430">
The <codeph>COMPUTE STATS</codeph> statement is more efficient
(less time for the codegen phase) for tables with a large number
of columns, especially for tables containing <codeph>TIMESTAMP</codeph>
columns.
</p>
</li>
<li>
<p rev="IMPALA-3838 IMPALA-4495">
The logic for determining whether or not to use a runtime filter is more reliable, and the
evaluation process itself is faster because of native code generation.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-3902">
The <codeph>MT_DOP</codeph> query option enables
multithreading for a number of Impala operations.
<codeph>COMPUTE STATS</codeph> statements for Parquet tables
use a default of <codeph>MT_DOP=4</codeph> to improve the
intra-node parallelism and CPU efficiency of this data-intensive
operation.
</p>
</li>
<li>
<p rev="IMPALA-4397">
The <codeph>COMPUTE STATS</codeph> statement is more efficient
(less time for the codegen phase) for tables with a large number
of columns.
</p>
</li>
<li>
<p rev="IMPALA-2521">
A new hint, <codeph>CLUSTERED</codeph>,
allows Impala <codeph>INSERT</codeph> operations on a Parquet table
that use dynamic partitioning to process a high number of
partitions in a single statement. The data is ordered based on the
partition key columns, and each partition is only written
by a single host, reducing the amount of memory needed to buffer
Parquet data while the data blocks are being constructed.
</p>
</li>
<li>
<p rev="IMPALA-3552">
The new configuration setting <codeph>inc_stats_size_limit_bytes</codeph>
lets you reduce the load on the catalog server when running the
<codeph>COMPUTE INCREMENTAL STATS</codeph> statement for very large tables.
</p>
</li>
<li>
<p rev="IMPALA-1788">
Impala folds many constant expressions within query statements,
rather than evaluating them for each row. This optimization
is especially useful when using functions to manipulate and
format <codeph>TIMESTAMP</codeph> values, such as the result
of an expression such as <codeph>to_date(now() - interval 1 day)</codeph>.
</p>
</li>
<li>
<p rev="IMPALA-4529">
Parsing of complicated expressions is faster. This speedup is
especially useful for queries containing large <codeph>CASE</codeph>
expressions.
</p>
</li>
<li>
<p rev="IMPALA-4302">
Evaluation is faster for <codeph>IN</codeph> operators with many constant
arguments. The same performance improvement applies to other functions
with many constant arguments.
</p>
</li>
<li>
<p rev="IMPALA-1286">
Impala optimizes identical comparison operators within multiple <codeph>OR</codeph>
blocks.
</p>
</li>
<li>
<p rev="IMPALA-4193 IMPALA-3342">
The reporting for wall-clock times and total CPU time in profile output is more accurate.
</p>
</li>
<li>
<p rev="IMPALA-3671">
A new query option, <codeph>SCRATCH_LIMIT</codeph>, lets you restrict the amount of
space used when a query exceeds the memory limit and activates the <q>spill to disk</q> mechanism.
This option helps to avoid runaway queries or make queries <q>fail fast</q> if they require more
memory than anticipated. You can prevent runaway queries from using excessive amounts of spill space,
without restarting the cluster to turn the spilling feature off entirely.
See <xref href="impala_scratch_limit.xml#scratch_limit"/> for details.
</p>
</li>
</ul>
</li>
<li>
<p>
Integration with Apache Kudu:
</p>
<ul>
<li>
<p rev="">
The experimental Impala support for the Kudu storage layer has been folded
into the main Impala development branch. Impala can now directly access Kudu tables,
opening up new capabilities such as enhanced DML operations and continuous ingestion.
</p>
</li>
<li>
<p rev="">
The <codeph>DELETE</codeph> statement is a flexible way to remove data from a Kudu table. Previously,
removing data from an Impala table involved removing or rewriting the underlying data files, dropping entire partitions,
or rewriting the entire table. This Impala statement only works for Kudu tables.
</p>
</li>
<li>
<p rev="">
The <codeph>UPDATE</codeph> statement is a flexible way to modify data within a Kudu table. Previously,
updating data in an Impala table involved replacing the underlying data files, dropping entire partitions,
or rewriting the entire table. This Impala statement only works for Kudu tables.
</p>
</li>
<li>
<p rev="IMPALA-3725">
              The <codeph>UPSERT</codeph> statement is a flexible way to ingest data into, or modify data within, a Kudu table. Previously,
              ingesting data that might contain duplicates involved an inefficient multi-stage operation, and there was no
              built-in protection against duplicate data. The <codeph>UPSERT</codeph> statement, in combination with
              the primary key designation for Kudu tables, lets you add or replace rows in a single operation, and
              automatically avoids creating any duplicate data. See the sketch after this feature list.
</p>
</li>
<li>
<p rev="IMPALA-3719 IMPALA-3726">
The <codeph>CREATE TABLE</codeph> statement gains some new clauses that are specific to Kudu tables:
<codeph>PARTITION BY</codeph>, <codeph>PARTITIONS</codeph>, <codeph>STORED AS KUDU</codeph>, and column
attributes <codeph>PRIMARY KEY</codeph>, <codeph>NULL</codeph> and <codeph>NOT NULL</codeph>,
<codeph>ENCODING</codeph>, <codeph>COMPRESSION</codeph>, <codeph>DEFAULT</codeph>, and <codeph>BLOCK_SIZE</codeph>.
These clauses replace the explicit <codeph>TBLPROPERTIES</codeph> settings that were required in the
early experimental phases of integration between Impala and Kudu.
</p>
</li>
<li>
<p rev="IMPALA-2890">
The <codeph>ALTER TABLE</codeph> statement can change certain attributes of Kudu tables.
You can add, drop, or rename columns.
You can add or drop range partitions.
You can change the <codeph>TBLPROPERTIES</codeph> value to rename or point to a different underlying Kudu table,
independently from the Impala table name in the metastore database.
You cannot change the data type of an existing column in a Kudu table.
</p>
</li>
<li>
<p rev="IMPALA-4403">
The <codeph>SHOW PARTITIONS</codeph> statement displays information about the distribution of data
between partitions in Kudu tables. A new variation, <codeph>SHOW RANGE PARTITIONS</codeph>,
displays information about the Kudu-specific partitions that apply across ranges of key values.
</p>
</li>
<li>
<p rev="IMPALA-4379">
Not all Impala data types are supported in Kudu tables. In particular, currently the Impala
<codeph>TIMESTAMP</codeph> type is not allowed in a Kudu table. Impala does not recognize the
<codeph>UNIXTIME_MICROS</codeph> Kudu type when it is present in a Kudu table. (These two
representations of date/time data use different units and are not directly compatible.)
You cannot create columns of type <codeph>TIMESTAMP</codeph>, <codeph>DECIMAL</codeph>,
<codeph>VARCHAR</codeph>, or <codeph>CHAR</codeph> within a Kudu table. Within a query, you can
cast values in a result set to these types. Certain types, such as <codeph>BOOLEAN</codeph>,
cannot be used as primary key columns.
</p>
</li>
<li>
<p rev="">
Currently, Kudu tables are not interchangeable between Impala and Hive the way other kinds of Impala tables are.
Although the metadata for Kudu tables is stored in the metastore database, currently Hive cannot access Kudu tables.
</p>
</li>
<li>
<p rev="">
The <codeph>INSERT</codeph> statement works for Kudu tables. The organization
of the Kudu data makes it more efficient than with HDFS-backed tables to insert
data in small batches, such as with the <codeph>INSERT ... VALUES</codeph> syntax.
</p>
</li>
<li>
<p rev="IMPALA-4283">
Some audit data is recorded for data governance purposes.
All <codeph>UPDATE</codeph>, <codeph>DELETE</codeph>, and <codeph>UPSERT</codeph> statements are characterized
as <codeph>INSERT</codeph> operations in the audit log. Currently, lineage metadata is not generated for
<codeph>UPDATE</codeph> and <codeph>DELETE</codeph> operations on Kudu tables.
</p>
</li>
<li>
<p rev="IMPALA-4000">
Currently, Kudu tables have limited support for Sentry:
<ul>
<li>
<p>
Access to Kudu tables must be granted to roles as usual.
</p>
</li>
<li>
<p>
Currently, access to a Kudu table through Sentry is <q>all or nothing</q>.
You cannot enforce finer-grained permissions such as at the column level,
or permissions on certain operations such as <codeph>INSERT</codeph>.
</p>
</li>
<li>
<p>
Only users with <codeph>ALL</codeph> privileges on <codeph>SERVER</codeph> can create external Kudu tables.
</p>
</li>
</ul>
Because non-SQL APIs can access Kudu data without going through Sentry
authorization, currently the Sentry support is considered preliminary.
</p>
</li>
<li>
<p rev="IMPALA-4571">
Equality and <codeph>IN</codeph> predicates in Impala queries are pushed to
Kudu and evaluated efficiently by the Kudu storage layer.
</p>
</li>
</ul>
</li>
<li>
<p rev="">
<b>Security:</b>
</p>
<ul>
<li>
<p>
Impala can take advantage of the S3 encrypted credential
store, to avoid exposing the secret key when accessing
data stored on S3.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-1654">
[<xref keyref="IMPALA-1654">IMPALA-1654</xref>]
Several kinds of DDL operations
can now work on a range of partitions. The partitions can be specified
using operators such as <codeph>&lt;</codeph>, <codeph>&gt;=</codeph>, and
<codeph>!=</codeph> rather than just an equality predicate applying to a single
partition.
This new feature extends the syntax of several clauses
of the <codeph>ALTER TABLE</codeph> statement
(<codeph>DROP PARTITION</codeph>, <codeph>SET [UN]CACHED</codeph>,
<codeph>SET FILEFORMAT | SERDEPROPERTIES | TBLPROPERTIES</codeph>),
the <codeph>SHOW FILES</codeph> statement, and the
<codeph>COMPUTE INCREMENTAL STATS</codeph> statement.
It does not apply to statements that are defined to only apply to a single
partition, such as <codeph>LOAD DATA</codeph>, <codeph>ALTER TABLE ... ADD PARTITION</codeph>,
<codeph>SET LOCATION</codeph>, and <codeph>INSERT</codeph> with a static
partitioning clause.
</p>
</li>
<li>
<p rev="IMPALA-3973">
The <codeph>instr()</codeph> function has optional second and third arguments, representing
          the character position at which to begin searching for the substring, and the Nth occurrence
of the substring to find.
</p>
</li>
<li>
<p rev="IMPALA-3441 IMPALA-4387">
Improved error handling for malformed Avro data. In particular, incorrect
precision or scale for <codeph>DECIMAL</codeph> types is now handled.
</p>
</li>
<li>
<p>
Impala debug web UI:
</p>
<ul>
<li>
<p rev="IMPALA-1169">
In addition to <q>inflight</q> and <q>finished</q> queries, the web UI
now also includes a section for <q>queued</q> queries.
</p>
</li>
<li>
<p rev="IMPALA-4048">
The <uicontrol>/sessions</uicontrol> tab now clarifies how many of the displayed
              sessions are active, and lets you sort by <uicontrol>Expired</uicontrol> status
to distinguish active sessions from expired ones.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-4020">
Improved stability when DDL operations such as <codeph>CREATE DATABASE</codeph>
or <codeph>DROP DATABASE</codeph> are run in Hive at the same time as an Impala
<codeph>INVALIDATE METADATA</codeph> statement.
</p>
</li>
<li>
<p rev="IMPALA-1616">
The <q>out of memory</q> error report was made more user-friendly, with additional
diagnostic information to help identify the spot where the memory limit was exceeded.
</p>
</li>
<li>
<p rev="IMPALA-3983 IMPALA-3974">
Improved disk space usage for Java-based UDFs. Temporary copies of the associated JAR
files are removed when no longer needed, so that they do not accumulate across restarts
of the <cmdname>catalogd</cmdname> daemon and potentially cause an out-of-space condition.
These temporary files are also created in the directory specified by the <codeph>local_library_dir</codeph>
configuration setting, so that the storage for these temporary files can be independent
from any capacity limits on the <filepath>/tmp</filepath> filesystem.
</p>
</li>
</ul>
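      <p>A sketch of the Kudu DML statements described above, using a hypothetical
        Kudu table with a primary key:</p>
<codeblock>CREATE TABLE kudu_users (id BIGINT PRIMARY KEY, name STRING)
  PARTITION BY HASH (id) PARTITIONS 3
  STORED AS KUDU;

UPSERT INTO kudu_users VALUES (1, 'alice');    -- inserts a new row
UPSERT INTO kudu_users VALUES (1, 'alicia');   -- replaces the existing row with key 1
UPDATE kudu_users SET name = 'alice' WHERE id = 1;
DELETE FROM kudu_users WHERE id = 1;</codeblock>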
</conbody>
</concept>
<!-- All 2.7.x new features go under here -->
<concept rev="2.7.0" id="new_features_270">
<title>New Features in <keyword keyref="impala27_full"/></title>
<conbody>
<ul id="feature_list">
<li>
<p>
Performance improvements:
</p>
<ul>
<li>
<p rev="IMPALA-3206">
[<xref keyref="IMPALA-3206">IMPALA-3206</xref>]
Speedup for queries against <codeph>DECIMAL</codeph> columns in Avro tables.
The code that parses <codeph>DECIMAL</codeph> values from Avro now uses
native code generation.
</p>
</li>
<li>
<p rev="IMPALA-3674">
[<xref keyref="IMPALA-3674">IMPALA-3674</xref>]
Improved efficiency in LLVM code generation can reduce codegen time, especially
for short queries.
</p>
</li>
<!-- Not actually a new feature, it's more a tip about when to expect remote reads and how to minimize them. To go somewhere in the performance / best practices / Parquet info.
<li>
<p rev="IMPALA-3885">
[<xref keyref="IMPALA-3885">IMPALA-3885</xref>]
Parquet files with multiple blocks can now be processed
without remote reads.
</p>
</li>
-->
<li>
<p rev="IMPALA-2979">
[<xref keyref="IMPALA-2979">IMPALA-2979</xref>]
Improvements to scheduling on worker nodes,
enabled by the <codeph>REPLICA_PREFERENCE</codeph> query option.
See <xref
href="impala_replica_preference.xml#replica_preference"/> for details.
</p>
</li>
</ul>
</li>
<li audience="hidden">
<p rev="IMPALA-3210"><!-- Patch didn't make it into in <keyword keyref="impala27_full"/> -->
[<xref keyref="IMPALA-3210">IMPALA-3210</xref>]
The analytic functions <codeph>FIRST_VALUE()</codeph> and <codeph>LAST_VALUE()</codeph>
accept a new clause, <codeph>IGNORE NULLS</codeph>.
See <xref href="impala_analytic_functions.xml#first_value"/>
and <xref href="impala_analytic_functions.xml#last_value"/>
for details.
</p>
</li>
<li>
<p rev="IMPALA-1683">
[<xref keyref="IMPALA-1683">IMPALA-1683</xref>]
The <codeph>REFRESH</codeph> statement can be applied to a single partition,
          rather than the entire table. See <xref href="impala_refresh.xml#refresh"/>
          and <xref href="impala_partitioning.xml#partition_refresh"/> for details, and the sketch after this list.
</p>
</li>
<li>
<p>
Improvements to the Impala web user interface:
</p>
<ul>
<li>
<p rev="IMPALA-2767">
[<xref keyref="IMPALA-2767">IMPALA-2767</xref>]
You can now force a session to expire by clicking a link in the web UI,
on the <uicontrol>/sessions</uicontrol> tab.
</p>
</li>
<li>
<p rev="IMPALA-3715">
[<xref keyref="IMPALA-3715">IMPALA-3715</xref>]
The <uicontrol>/memz</uicontrol> tab includes more information about
Impala memory usage.
</p>
</li>
<li>
<p rev="IMPALA-3716">
[<xref keyref="IMPALA-3716">IMPALA-3716</xref>]
The <uicontrol>Details</uicontrol> page for a query now includes
a <uicontrol>Memory</uicontrol> tab.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-3499">
[<xref keyref="IMPALA-3499">IMPALA-3499</xref>]
Scalability improvements to the catalog server. Impala handles internal communication
more efficiently for tables with large numbers of columns and partitions, where the
size of the metadata exceeds 2 GiB.
</p>
</li>
<li>
<p rev="IMPALA-3677">
[<xref keyref="IMPALA-3677">IMPALA-3677</xref>]
You can send a <codeph>SIGUSR1</codeph> signal to any Impala-related daemon to write a
Breakpad minidump. For advanced troubleshooting, you can now produce a minidump
without triggering a crash. See <xref href="impala_breakpad.xml#breakpad"/> for
details about the Breakpad minidump feature.
</p>
</li>
<li>
<p rev="IMPALA-3687">
[<xref keyref="IMPALA-3687">IMPALA-3687</xref>]
The schema reconciliation rules for Avro tables have changed slightly
for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> columns. Now, if
the definition of such a column is changed in the Avro schema file,
the column retains its <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph>
type as specified in the SQL definition, but the column name and comment
from the Avro schema file take precedence.
See <xref href="impala_avro.xml#avro_create_table"/> for details about
column definitions in Avro tables.
</p>
</li>
<li>
<p rev="IMPALA-3575">
[<xref keyref="IMPALA-3575">IMPALA-3575</xref>]
          Some network
          operations now have additional timeout and retry settings. The extra
          configuration helps avoid failed queries due to transient network
          problems, avoids hangs when a sender or receiver fails in the
          middle of a network transmission, and makes cancellation requests
          more reliable despite network issues. </p>
</li>
</ul>
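      <p>A minimal sketch of the partition-level <codeph>REFRESH</codeph> mentioned
        above, with hypothetical table and partition names:</p>
<codeblock>-- Refresh metadata for a single partition rather than the whole table.
REFRESH sales PARTITION (year=2016, month=7);</codeblock>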
</conbody>
</concept>
<!-- All 2.6.x new features go under here -->
<concept rev="2.6.0" id="new_features_260">
<title>New Features in <keyword keyref="impala26_full"/></title>
<conbody>
<ul>
<li>
<p>
Improvements to Impala support for the Amazon S3 filesystem:
</p>
<ul>
<li>
<p rev="IMPALA-1878">
Impala can now write to S3 tables through the <codeph>INSERT</codeph>
or <codeph>LOAD DATA</codeph> statements.
See <xref href="impala_s3.xml#s3"/> for general information about
using Impala with S3.
</p>
</li>
<li>
<p rev="IMPALA-3452">
A new query option, <codeph>S3_SKIP_INSERT_STAGING</codeph>, lets you
trade off between fast <codeph>INSERT</codeph> performance and
slower <codeph>INSERT</codeph>s that are more consistent if a
problem occurs during the statement. The new behavior is enabled by default.
See <xref href="impala_s3_skip_insert_staging.xml#s3_skip_insert_staging"/> for details
about this option.
</p>
</li>
</ul>
</li>
<li>
<p rev="">
Performance improvements for the runtime filtering feature:
</p>
<ul>
<li>
<p rev="IMPALA-3333">
The default for the <codeph>RUNTIME_FILTER_MODE</codeph>
query option is changed to <codeph>GLOBAL</codeph> (the highest setting).
See <xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/> for
details about this option.
</p>
</li>
<li rev="IMPALA-3007">
<p>
The <codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph> setting is now only used
as a fallback if statistics are not available; otherwise, Impala
uses the statistics to estimate the appropriate size to use for each filter.
See <xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size"/> for
details about this option.
</p>
</li>
<li rev="IMPALA-3480">
<p>
New query options <codeph>RUNTIME_FILTER_MIN_SIZE</codeph> and
<codeph>RUNTIME_FILTER_MAX_SIZE</codeph> let you fine-tune
the sizes of the Bloom filter structures used for runtime filtering.
If the filter size derived from Impala internal estimates or from
            the <codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph> option falls outside the size
range specified by these options, any too-small filter size is adjusted
to the minimum, and any too-large filter size is adjusted to the maximum.
See <xref href="impala_runtime_filter_min_size.xml#runtime_filter_min_size"/>
and <xref href="impala_runtime_filter_max_size.xml#runtime_filter_max_size"/>
for details about these options.
</p>
</li>
<li rev="IMPALA-2956">
<p>
Runtime filter propagation now applies to all the
operands of <codeph>UNION</codeph> and <codeph>UNION ALL</codeph>
operators.
</p>
</li>
<li rev="IMPALA-3077">
<p>
Runtime filters can now be produced during join queries even
when the join processing activates the spill-to-disk mechanism.
</p>
</li>
</ul>
See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for
general information about the runtime filtering feature.
</li>
<!-- Have to look closer at resource management / admission control to see if
there are any ripple effects from this default change. -->
<li>
<p rev="IMPALA-3199">
Admission control and dynamic resource pools are enabled by default.
See <xref href="impala_admission.xml#admission_control"/> for details
about admission control.
</p>
</li>
<!-- Below here are features that are pretty well taken care of already;
some of them didn't need much if any doc in the first place. -->
<li>
<p rev="IMPALA-3369">
          You can now manually set column statistics in Impala,
          using the <codeph>ALTER TABLE</codeph> statement with a
          <codeph>SET COLUMN STATS</codeph> clause.
          See <xref href="impala_perf_stats.xml#perf_column_stats_manual"/> for details, and the sketch after this list.
</p>
</li>
<li>
<p rev="IMPALA-3490 IMPALA-3581 IMPALA-2686">
Impala can now write lightweight <q>minidump</q> files, rather
than large core files, to save diagnostic information when
any of the Impala-related daemons crash. This feature uses the
open source <codeph>breakpad</codeph> framework.
See <xref href="impala_breakpad.xml#breakpad"/> for details.
</p>
</li>
<li>
<p>
New query options improve interoperability with Parquet files:
<ul>
<li>
<p rev="IMPALA-2835">
The <codeph>PARQUET_FALLBACK_SCHEMA_RESOLUTION</codeph> query option
lets Impala locate columns within Parquet files based on
column name rather than ordinal position.
This enhancement improves interoperability with applications
that write Parquet files with a different order or subset of
columns than are used in the Impala table.
See <xref href="impala_parquet_fallback_schema_resolution.xml#parquet_fallback_schema_resolution"/>
for details.
</p>
</li>
<li>
<p rev="IMPALA-2069">
The <codeph>PARQUET_ANNOTATE_STRINGS_UTF8</codeph> query option
makes Impala include the <codeph>UTF-8</codeph> annotation
metadata for <codeph>STRING</codeph>, <codeph>CHAR</codeph>,
and <codeph>VARCHAR</codeph> columns in Parquet files created
by <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph>
statements.
See <xref href="impala_parquet_annotate_strings_utf8.xml#parquet_annotate_strings_utf8"/>
for details.
</p>
</li>
</ul>
See <xref href="impala_parquet.xml#parquet"/> for general information about working
with Parquet files.
</p>
</li>
<li>
<p>
Improvements to security and reduction in overhead for secure clusters:
</p>
<ul>
<li>
<p rev="IMPALA-1928">
Overall performance improvements for secure clusters.
(TPC-H queries on a secure cluster were benchmarked
at roughly 3x as fast as the previous release.)
</p>
</li>
<li>
<p rev="IMPALA-2660">
Impala now recognizes the <codeph>auth_to_local</codeph> setting,
specified through the HDFS configuration setting
<codeph>hadoop.security.auth_to_local</codeph>.
This feature is disabled by default; to enable it,
specify <codeph>--load_auth_to_local_rules=true</codeph>
in the <cmdname>impalad</cmdname> configuration settings.
See <xref href="impala_kerberos.xml#auth_to_local"/> for details.
</p>
</li>
<li>
<p rev="IMPALA-2599">
Timing improvements in the mechanism for the <cmdname>impalad</cmdname>
daemon to acquire Kerberos tickets. This feature spreads out the overhead
on the KDC during Impala startup, especially for large clusters.
</p>
</li>
<li>
<p rev="IMPALA-3554">
For Kerberized clusters, the Catalog service now uses
            the Kerberos principal instead of the operating system user that runs
the <cmdname>catalogd</cmdname> daemon.
This eliminates the requirement to configure a <codeph>hadoop.user.group.static.mapping.overrides</codeph>
setting to put the OS user into the Sentry administrative group, on clusters where the principal
and the OS user name for this user are different.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-3286">
Overall performance improvements for join queries, by using a prefetching mechanism
while building the in-memory hash table to evaluate join predicates.
See <xref href="impala_prefetch_mode.xml#prefetch_mode"/> for the query option
to control this optimization.
</p>
</li>
<li>
<p rev="IMPALA-3397">
The <cmdname>impala-shell</cmdname> interpreter has a new command,
<codeph>SOURCE</codeph>, that lets you run a set of SQL statements
or other <cmdname>impala-shell</cmdname> commands stored in a file.
You can run additional <codeph>SOURCE</codeph> commands from inside
a file, to set up flexible sequences of statements for use cases
such as schema setup, ETL, or reporting.
See <xref href="impala_shell_commands.xml#shell_commands"/> for details
and <xref href="impala_shell_running_commands.xml#shell_running_commands"/>
for examples.
</p>
</li>
<li>
<p rev="IMPALA-1772">
The <codeph>millisecond()</codeph> built-in function lets you extract
the fractional seconds part of a <codeph>TIMESTAMP</codeph> value.
See <xref href="impala_datetime_functions.xml#datetime_functions"/> for details.
</p>
</li>
<li>
<p rev="IMPALA-3092">
If an Avro table is created without column definitions in the
<codeph>CREATE TABLE</codeph> statement, and columns are later
added through <codeph>ALTER TABLE</codeph>, the resulting
table is now queryable. Missing values from the newly added
columns now default to <codeph>NULL</codeph>.
See <xref href="impala_avro.xml#avro"/> for general details about
working with Avro files.
</p>
</li>
<li>
<p>
The mechanism for interpreting <codeph>DECIMAL</codeph> literals is
improved, no longer going through an intermediate conversion step
to <codeph>DOUBLE</codeph>:
<ul>
<li>
<p rev="IMPALA-3163">
              Casting a <codeph>DECIMAL</codeph> value to <codeph>TIMESTAMP</codeph>
              produces a more precise value for the <codeph>TIMESTAMP</codeph> than formerly.
</p>
</li>
<li>
<p rev="IMPALA-3439">
Certain function calls involving <codeph>DECIMAL</codeph> literals
now succeed, when formerly they failed due to lack of a function
signature with a <codeph>DOUBLE</codeph> argument.
</p>
</li>
<li>
<p rev="">
Faster runtime performance for <codeph>DECIMAL</codeph> constant
values, through improved native code generation for all combinations
of precision and scale.
</p>
</li>
</ul>
See <xref href="impala_decimal.xml#decimal"/> for details about the <codeph>DECIMAL</codeph> type.
</p>
</li>
<li>
<p rev="IMPALA-3155">
Improved type accuracy for <codeph>CASE</codeph> return values.
If all <codeph>WHEN</codeph> clauses of the <codeph>CASE</codeph>
expression are of <codeph>CHAR</codeph> type, the final result
is also <codeph>CHAR</codeph> instead of being converted to
<codeph>STRING</codeph>.
See <xref href="impala_conditional_functions.xml#conditional_functions"/>
for details about the <codeph>CASE</codeph> function.
</p>
</li>
<li>
<p rev="IMPALA-3232">
Uncorrelated queries using the <codeph>NOT EXISTS</codeph> operator
are now supported. Formerly, the <codeph>NOT EXISTS</codeph>
operator was only available for correlated subqueries.
</p>
</li>
<li>
<p rev="IMPALA-2736">
Improved performance for reading Parquet files.
</p>
</li>
<li>
<p rev="IMPALA-3375">
Improved performance for <term>top-N</term> queries, that is,
those including both <codeph>ORDER BY</codeph> and
<codeph>LIMIT</codeph> clauses.
</p>
</li>
<!-- JIRA still in open state as of 5.8 / 2.6, commenting out.
<li>
<p rev="IMPALA-3471">
A top-N query can now also activate the spill-to-disk mechanism if
a host runs low on memory while evaluating it. For example, using
large <codeph>LIMIT</codeph> and/or <codeph>OFFSET</codeph> clauses
adds some memory overhead that could cause spilling.
</p>
</li>
-->
<li>
<p rev="IMPALA-1740">
Impala optionally skips an arbitrary number of header lines from text input
files on HDFS based on the <codeph>skip.header.line.count</codeph> value
in the <codeph>TBLPROPERTIES</codeph> field of the table metadata.
See <xref href="impala_txtfile.xml#text_data_files"/> for details.
</p>
</li>
<li>
<p rev="IMPALA-2336">
Trailing comments are now allowed in queries processed by
the <cmdname>impala-shell</cmdname> options <codeph>-q</codeph>
and <codeph>-f</codeph>.
</p>
</li>
<li>
<p rev="IMPALA-2844">
Impala can run <codeph>COUNT</codeph> queries for RCFile tables
that include complex type columns.
See <xref href="impala_complex_types.xml#complex_types"/> for
general information about working with complex types,
and <xref href="impala_array.xml#array"/>,
<xref href="impala_map.xml#map"/>, and <xref href="impala_struct.xml#struct"/>
for syntax details of each type.
</p>
</li>
</ul>
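      <p>A minimal sketch of manually setting column statistics, as mentioned above,
        with hypothetical table and column names:</p>
<codeblock>-- Manually supply selected statistics for one column, then verify.
ALTER TABLE sales SET COLUMN STATS customer_id ('numDVs'='1000000', 'maxSize'='8');
SHOW COLUMN STATS sales;</codeblock>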
</conbody>
</concept>
<!-- All 2.5.x new features go under here -->
<concept rev="2.5.0" id="new_features_250">
<title>New Features in <keyword keyref="impala25_full"/></title>
<conbody>
<ul>
<li><!-- Spec: https://docs.google.com/document/d/1ambtYJ1t05iITCVIrN6N1A-e7PZBSetBPgjy8SLzJrA/edit#heading=h.vcftzwlpn845 -->
<p rev="IMPALA-2552 IMPALA-3054">
Dynamic partition pruning. When a query refers to a partition key column in a <codeph>WHERE</codeph>
          clause, and the exact set of column values is not known until the query is executed,
Impala evaluates the predicate and skips the I/O for entire partitions that are not needed.
For example, if a table was partitioned by year, Impala would apply this technique to a query
such as <codeph>SELECT c1 FROM partitioned_table WHERE year = (SELECT MAX(year) FROM other_table)</codeph>.
<ph audience="standalone">See <xref href="impala_partitioning.xml#dynamic_partition_pruning"/> for details.</ph>
</p>
<p>
The dynamic partition pruning optimization technique lets Impala avoid reading
data files from partitions that are not part of the result set, even when
that determination cannot be made in advance. This technique is especially valuable
when performing join queries involving partitioned tables. For example, if a join
query includes an <codeph>ON</codeph> clause and a <codeph>WHERE</codeph> clause
that refer to the same columns, the query can find the set of column values that
match the <codeph>WHERE</codeph> clause, and only scan the associated partitions
when evaluating the <codeph>ON</codeph> clause.
</p>
<p>
Dynamic partition pruning is controlled by the same settings as the runtime filtering feature.
By default, this feature is enabled at a medium level, because the maximum setting can use
slightly more memory for queries than in previous releases.
To fully enable this feature, set the query option <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>.
</p>
</li>
<li><!-- Spec: https://docs.google.com/document/d/1ambtYJ1t05iITCVIrN6N1A-e7PZBSetBPgjy8SLzJrA/edit#heading=h.vcftzwlpn845 -->
<p rev="IMPALA-2419 IMPALA-3001 IMPALA-3008 IMPALA-3039 IMPALA-3046 IMPALA-3054">
Runtime filtering. This is a wide-ranging set of optimizations that are especially valuable for join queries.
Using the same technique as with dynamic partition pruning,
Impala uses the predicates from <codeph>WHERE</codeph> and <codeph>ON</codeph> clauses
          to determine which subset of column values from one of the joined tables could possibly be part of the
result set. Impala sends a compact representation of the filter condition to the hosts in the cluster,
instead of the full set of values or the entire table.
<ph audience="PDF">See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for details.</ph>
</p>
<p>
By default, this feature is enabled at a medium level, because the maximum setting can use
slightly more memory for queries than in previous releases.
To fully enable this feature, set the query option <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>.
<ph audience="PDF">See <xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/> for details.</ph>
</p>
<p>
This feature involves some new query options:
<xref audience="standalone" href="impala_runtime_filter_mode.xml">RUNTIME_FILTER_MODE</xref><codeph audience="integrated">RUNTIME_FILTER_MODE</codeph>,
<xref audience="standalone" href="impala_max_num_runtime_filters.xml">MAX_NUM_RUNTIME_FILTERS</xref><codeph audience="integrated">MAX_NUM_RUNTIME_FILTERS</codeph>,
<xref audience="standalone" href="impala_runtime_bloom_filter_size.xml">RUNTIME_BLOOM_FILTER_SIZE</xref><codeph audience="integrated">RUNTIME_BLOOM_FILTER_SIZE</codeph>,
<xref audience="standalone" href="impala_runtime_filter_wait_time_ms.xml">RUNTIME_FILTER_WAIT_TIME_MS</xref><codeph audience="integrated">RUNTIME_FILTER_WAIT_TIME_MS</codeph>,
and <xref audience="standalone" href="impala_disable_row_runtime_filtering.xml">DISABLE_ROW_RUNTIME_FILTERING</xref><codeph audience="integrated">DISABLE_ROW_RUNTIME_FILTERING</codeph>.
<ph audience="PDF">See
<xref href="impala_runtime_filter_mode.xml#runtime_filter_mode">RUNTIME_FILTER_MODE</xref>,
<xref href="impala_max_num_runtime_filters.xml#max_num_runtime_filters">MAX_NUM_RUNTIME_FILTERS</xref>,
<xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size">RUNTIME_BLOOM_FILTER_SIZE</xref>,
<xref href="impala_runtime_filter_wait_time_ms.xml#runtime_filter_wait_time_ms">RUNTIME_FILTER_WAIT_TIME_MS</xref>, and
<xref href="impala_disable_row_runtime_filtering.xml#disable_row_runtime_filtering">DISABLE_ROW_RUNTIME_FILTERING</xref>
for details.
</ph>
</p>
</li>
<li>
<p rev="IMPALA-2696">
More efficient use of the HDFS caching feature, to avoid
hotspots and bottlenecks that could occur if heavily used
cached data blocks were always processed by the same host.
By default, Impala now randomizes which host processes each cached
HDFS data block, when cached replicas are available on multiple hosts.
(Remember to use the <codeph>WITH REPLICATION</codeph> clause with the
<codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement
when enabling HDFS caching for a table or partition, to cache the same
data blocks across multiple hosts.)
The new query option <codeph>SCHEDULE_RANDOM_REPLICA</codeph>
<!-- and <codeph>REPLICA_PREFERENCE</codeph> -->
lets you fine-tune the interaction with HDFS caching even more.
<ph audience="PDF">See <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-2641">
The <codeph>TRUNCATE TABLE</codeph> statement now accepts an <codeph>IF EXISTS</codeph>
clause, making <codeph>TRUNCATE TABLE</codeph> easier to use in setup or ETL scripts where the table might or
might not exist.
<ph audience="PDF">See <xref href="impala_truncate_table.xml#truncate_table"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-2681 IMPALA-2688 IMPALA-2749">
Improved performance and reliability for the <codeph>DECIMAL</codeph> data type:
<ul>
<li>
<p rev="IMPALA-2681">
Using <codeph>DECIMAL</codeph> values in a <codeph>GROUP BY</codeph> clause now
triggers the native code generation optimization, speeding up queries that
group by values such as prices.
</p>
</li>
<li>
<p rev="IMPALA-2688">
Checking for overflow in <codeph>DECIMAL</codeph>
multiplication is now substantially faster, making <codeph>DECIMAL</codeph>
a more practical data type in some use cases where formerly <codeph>DECIMAL</codeph>
was much slower than <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph>.
</p>
</li>
<li>
<p rev="IMPALA-2749">
Multiplying a mixture of <codeph>DECIMAL</codeph>
and <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> values now returns the
<codeph>DOUBLE</codeph> rather than <codeph>DECIMAL</codeph>. This change avoids
some cases where an intermediate value would underflow or overflow and become
<codeph>NULL</codeph> unexpectedly.
</p>
</li>
</ul>
<ph audience="PDF">See <xref href="impala_decimal.xml"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-2382">
For UDFs written in Java, or Hive UDFs reused for Impala,
Impala now allows parameters and return values to be primitive types.
          Formerly, they were required to be one of the <q>Writable</q>
object types.
<ph audience="PDF">See <xref href="impala_udf.xml#udfs_hive"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-1588"><!-- This is from 2015, so perhaps it's really in an earlier release. -->
Performance improvements for HDFS I/O. Impala now caches HDFS file handles to avoid the
overhead of repeatedly opening the same file.
</p>
</li>
<!-- Kudu didn't make it into 2.5 / 5.7 release, so no DELETE or UPDATE statement. -->
<li>
<p><!-- Is there a JIRA for that one? Alex? -->
Performance improvements for queries involving nested complex types.
Certain basic query types, such as counting the elements of a complex column,
now use an optimized code path.
</p>
</li>
<li>
<p rev="IMPALA-3044 IMPALA-2538 IMPALA-1168">
Improvements to the memory reservation mechanism for the Impala
admission control feature. You can specify more settings, such
as the timeout period and maximum aggregate memory used, for each
resource pool instead of globally for the Impala instance. The
default limit for concurrent queries (the <uicontrol>max requests</uicontrol>
setting) is now unlimited instead of 200.
</p>
</li>
<li>
<p rev="IMPALA-1755">
Performance improvements related to code generation.
Even in queries where code generation is not performed
for some phases of execution (such as reading data from
Parquet tables), Impala can still use code generation in
other parts of the query, such as evaluating
functions in the <codeph>WHERE</codeph> clause.
</p>
</li>
<li>
<p rev="IMPALA-1305">
Performance improvements for queries using aggregation functions
on high-cardinality columns.
Formerly, Impala could do unnecessary extra work to produce intermediate
results for operations such as <codeph>DISTINCT</codeph> or <codeph>GROUP BY</codeph>
on columns that were unique or had few duplicate values.
Now, Impala decides at run time whether it is more efficient to
do an initial aggregation phase and pass along a smaller set of intermediate data,
          or to pass raw intermediate data back to the next phase of query processing to be aggregated there.
          This feature is known as <term>streaming pre-aggregation</term>.
          In the event of a performance regression, this feature can be turned off
using the <codeph>DISABLE_STREAMING_PREAGGREGATIONS</codeph> query option.
<ph audience="PDF">See <xref href="impala_disable_streaming_preaggregations.xml#disable_streaming_preaggregations"/> for details.</ph>
</p>
</li>
<li>
<p>
          The spill-to-disk feature is now always recommended. In earlier releases, the spill-to-disk feature
could be turned off using a pair of configuration settings,
<codeph>enable_partitioned_aggregation=false</codeph> and
<codeph>enable_partitioned_hash_join=false</codeph>.
The latest improvements in the spill-to-disk mechanism, and related features that
          interact with it, make this feature robust enough that disabling it is
no longer needed or supported. In particular, some new features in <keyword keyref="impala25_full"/>
and higher do not work when the spill-to-disk feature is disabled.
</p>
</li>
<li>
<p rev="IMPALA-1067">
Improvements to scripting capability for the <cmdname>impala-shell</cmdname> command,
through user-specified substitution variables that can appear in statements processed
by <cmdname>impala-shell</cmdname>:
</p>
<ul>
<li rev="IMPALA-2179">
<p>
The <codeph>--var</codeph> command-line option lets you pass key-value pairs to
<cmdname>impala-shell</cmdname>. The shell can substitute the values
into queries before executing them, where the query text contains the notation
<codeph>${var:<varname>varname</varname>}</codeph>. For example, you might prepare a SQL file
containing a set of DDL statements and queries containing variables for
database and table names, and then pass the applicable names as part of the
<codeph>impala-shell -f <varname>filename</varname></codeph> command.
<ph audience="PDF">See <xref href="impala_shell_running_commands.xml#shell_running_commands"/> for details.</ph>
</p>
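            <p>
              For illustration, a sketch using hypothetical file and object names, where
              <filepath>load.sql</filepath> contains the <codeph>${var:...}</codeph> references:
            </p>
<codeblock>-- Hypothetical contents of load.sql:
USE ${var:db};
SELECT COUNT(*) FROM ${var:tbl};</codeblock>
<codeblock>$ impala-shell --var=db=staging_db --var=tbl=events -f load.sql</codeblock>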
</li>
<li rev="IMPALA-2180">
<p>
The <codeph>SET</codeph> and <codeph>UNSET</codeph> commands within the
<cmdname>impala-shell</cmdname> interpreter now work with user-specified
substitution variables, as well as the built-in query options.
            The two kinds of variables are listed separately in the <codeph>SET</codeph> output.
As with variables defined by the <codeph>--var</codeph> command-line option,
you refer to the user-specified substitution variables in queries by using
the notation <codeph>${var:<varname>varname</varname>}</codeph>
in the query text. Because the substitution variables are processed by
<cmdname>impala-shell</cmdname> instead of the <cmdname>impalad</cmdname>
backend, you cannot define your own substitution variables through the
<codeph>SET</codeph> statement in a JDBC or ODBC application.
<ph audience="PDF">See <xref href="impala_set.xml#set"/> for details.</ph>
</p>
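            <p>
              As a sketch, assuming the <codeph>VAR:</codeph> prefix that <cmdname>impala-shell</cmdname>
              uses for user-specified substitution variables, with hypothetical object names:
            </p>
<codeblock>[localhost:21000] &gt; SET VAR:tbl=events;
[localhost:21000] &gt; SELECT COUNT(*) FROM ${var:tbl};
[localhost:21000] &gt; UNSET VAR:tbl;</codeblock>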
</li>
</ul>
</li>
<li>
<p rev="IMPALA-1599">
Performance improvements for query startup. Impala better parallelizes certain work
when coordinating plan distribution between <cmdname>impalad</cmdname> instances, which improves
startup time for queries involving tables with many partitions on large clusters,
or complicated queries with many plan fragments.
</p>
</li>
<li>
<p rev="IMPALA-2560">
Performance and scalability improvements for tables with many partitions.
The memory requirements on the coordinator node are reduced, making it substantially
faster and less resource-intensive
to do joins involving several tables with thousands of partitions each.
</p>
</li>
<li>
<p rev="IMPALA-3095">
Whitelisting for access to internal APIs. For applications that need direct access
to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can
specify a list of Kerberos users who are allowed to call those APIs. By default, the
<codeph>impala</codeph> and <codeph>hdfs</codeph> users are the only ones authorized
for this kind of access.
Any users not explicitly authorized through the <codeph>internal_principals_whitelist</codeph>
configuration setting are blocked from accessing the APIs. This setting applies to all the
Impala-related daemons, although currently it is primarily used for HDFS to control the
behavior of the catalog server.
</p>
</li>
<li>
<p rev="">
Improvements to Impala integration and usability for Hue. (The code changes
are actually on the Hue side.)
</p>
<ul>
<li>
<p rev="">
The list of tables now refreshes dynamically.
</p>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-1787">
Usability improvements for case-insensitive queries.
You can now use the operators <codeph>ILIKE</codeph> and <codeph>IREGEXP</codeph>
to perform case-insensitive wildcard matches or regular expression matches,
rather than explicitly converting column values with <codeph>UPPER</codeph>
or <codeph>LOWER</codeph>.
<ph audience="PDF">See <xref href="impala_operators.xml#ilike"/> and <xref href="impala_operators.xml#iregexp"/> for details.</ph>
</p>
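          <p>
            For illustration, using a hypothetical <codeph>customers</codeph> table:
          </p>
<codeblock>-- Matches 'Smith', 'SMITH', 'smith', and so on, without wrapping the column in UPPER() or LOWER().
SELECT name FROM customers WHERE name ILIKE 'smith%';
SELECT name FROM customers WHERE name IREGEXP '^smi(th|tt)';</codeblock>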
</li>
<li>
<p rev="IMPALA-1480">
Performance and reliability improvements for DDL and insert operations on partitioned tables with a large
number of partitions. Impala only re-evaluates metadata for partitions that are affected by
a DDL operation, not all partitions in the table. While a DDL or insert statement is in progress,
other Impala statements that attempt to modify metadata for the same table wait until the first one
finishes.
</p>
</li>
<li>
<p rev="IMPALA-2867">
Reliability improvements for the <codeph>LOAD DATA</codeph> statement.
Previously, this statement would fail if the source HDFS directory
contained any subdirectories at all. Now, the statement ignores
any hidden subdirectories, for example <filepath>_impala_insert_staging</filepath>.
</p>
</li>
<li>
<p rev="IMPALA-2147">
A new operator, <codeph>IS [NOT] DISTINCT FROM</codeph>, lets you compare values
and always get a <codeph>true</codeph> or <codeph>false</codeph> result,
even if one or both of the values are <codeph>NULL</codeph>.
The <codeph>IS NOT DISTINCT FROM</codeph> operator, or its equivalent
<codeph>&lt;=&gt;</codeph> notation, improves the efficiency of join queries that
treat key values that are <codeph>NULL</codeph> in both tables as equal.
<ph audience="PDF">See <xref href="impala_operators.xml#is_distinct_from"/> for details.</ph>
</p>
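          <p>
            A minimal sketch of the NULL-safe comparison, using hypothetical tables in the join:
          </p>
<codeblock>-- Ordinary equality yields NULL here; IS NOT DISTINCT FROM yields TRUE.
SELECT NULL IS NOT DISTINCT FROM NULL;

-- Join that treats NULL key values in both tables as matching (hypothetical tables t1, t2).
SELECT t1.id, t2.id FROM t1 JOIN t2 ON t1.k &lt;=&gt; t2.k;</codeblock>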
</li>
<li>
<p rev="IMPALA-1934">
Security enhancements for the <cmdname>impala-shell</cmdname> command.
A new option, <codeph>--ldap_password_cmd</codeph>, lets you specify
a command to retrieve the LDAP password. The resulting password is
then used to authenticate the <cmdname>impala-shell</cmdname> command
with the LDAP server.
<ph audience="PDF">See <xref href="impala_shell_options.xml"/> for details.</ph>
</p>
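          <p>
            For illustration, a hypothetical invocation where the password is produced by an arbitrary
            shell command:
          </p>
<codeblock>$ impala-shell -l -u jdoe --ldap_password_cmd="cat /home/jdoe/.ldap_pass"</codeblock>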
</li>
<li>
<p>
The <codeph>CREATE TABLE AS SELECT</codeph> statement now accepts a
<codeph>PARTITIONED BY</codeph> clause, which lets you create a
partitioned table and insert data into it with a single statement.
<ph audience="PDF">See <xref href="impala_create_table.xml#create_table"/> for details.</ph>
</p>
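          <p>
            For illustration, a sketch using hypothetical table names; the partition key columns come
            last in the <codeph>SELECT</codeph> list:
          </p>
<codeblock>CREATE TABLE sales_by_year PARTITIONED BY (year)
  AS SELECT amount, region, year FROM sales_raw;</codeblock>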
</li>
<li>
<p rev="IMPALA-1748">
User-defined functions (UDFs and UDAFs) written in C++ now persist automatically
when the <cmdname>catalogd</cmdname> daemon is restarted. You no longer
have to run the <codeph>CREATE FUNCTION</codeph> statements again after a restart.
</p>
</li>
<li>
<p rev="IMPALA-2843">
User-defined functions (UDFs) written in Java can now persist
when the <cmdname>catalogd</cmdname> daemon is restarted, and can be shared
transparently between Impala and Hive. You must do a one-time operation to recreate these
          UDFs using the new <codeph>CREATE FUNCTION</codeph> syntax, without a signature for arguments
or the return value. Afterwards, you no longer have to run the <codeph>CREATE FUNCTION</codeph>
statements again after a restart.
Although Impala does not have visibility into the UDFs that implement the
Hive built-in functions, user-created Hive UDFs are now automatically available
for calling through Impala.
<ph audience="PDF">See <xref href="impala_create_function.xml#create_function"/> for details.</ph>
</p>
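          <p>
            A minimal sketch of the new syntax, with a hypothetical JAR location and class name:
          </p>
<codeblock>-- No argument or return types in the new syntax; hypothetical HDFS path and class.
CREATE FUNCTION my_func
  LOCATION '/user/impala/udfs/my-udfs.jar'
  SYMBOL='com.example.MyUdf';</codeblock>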
</li>
<li>
<!-- Listed as fixed in 2.6.0. Is this item inappropriate or did it actually come from a different JIRA? -->
<p rev="IMPALA-2728">
Reliability enhancements for memory management. Some aggregation and join queries
that formerly might have failed with an out-of-memory error due to memory contention,
now can succeed using the spill-to-disk mechanism.
</p>
</li>
<li>
<!-- Same blurb is under Incompatible Changes. Turn into a conref. -->
<p rev="IMPALA-2070">
The <codeph>SHOW DATABASES</codeph> statement now returns two columns rather than one.
The second column includes the associated comment string, if any, for each database.
Adjust any application code that examines the list of databases and assumes the
result set contains only a single column.
<ph audience="PDF">See <xref href="impala_show.xml#show_databases"/> for details.</ph>
</p>
</li>
<li>
<p rev="IMPALA-2499">
A new optimization speeds up aggregation operations that involve only the partition key
columns of partitioned tables. For example, a query such as <codeph>SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1</codeph>
can avoid reading any data files if <codeph>T1</codeph> is a partitioned table and <codeph>K</codeph>
is one of the partition key columns. Because this technique can produce different results in cases
where HDFS files in a partition are manually deleted or are empty, you must enable the optimization
by setting the query option <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph>.
<ph audience="PDF">See <xref href="impala_optimize_partition_key_scans.xml"/> for details.</ph>
</p>
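          <p>
            For illustration, enabling the optimization for the kind of query described above
            (hypothetical table <codeph>t1</codeph> partitioned by <codeph>k</codeph>):
          </p>
<codeblock>SET OPTIMIZE_PARTITION_KEY_SCANS=1;
SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1;</codeblock>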
</li>
<li audience="hidden"><!-- All the other undocumented query options are not really new features for this release, so hiding this whole bullet. -->
<p>
Other new query options:
</p>
<ul>
<li audience="hidden"><!-- Actually from a long way back, just never documented. Not sure if appropriate to keep internal-only or expose. -->
<codeph>DISABLE_OUTERMOST_TOPN</codeph>
</li>
<li audience="hidden"><!-- Actually from a long way back, just never documented. Not sure if appropriate to keep internal-only or expose. -->
<codeph>RM_INITIAL_MEM</codeph>
</li>
<li audience="hidden"><!-- Seems to be related to writing sequence files, a capability not externalized at this time. -->
<codeph>SEQ_COMPRESSION_MODE</codeph>
</li>
<li audience="hidden"><!-- Actually, was only used for working around one JIRA. Being deprecated now in Impala 2.3 via IMPALA-2963. -->
<codeph>DISABLE_CACHED_READS</codeph>
</li>
</ul>
</li>
<li>
<p rev="IMPALA-2196">
The <codeph>DESCRIBE</codeph> statement can now display metadata about a database, using the
syntax <codeph>DESCRIBE DATABASE <varname>db_name</varname></codeph>.
<ph audience="PDF">See <xref href="impala_describe.xml#describe"/> for details.</ph>
</p>
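          <p>
            For example, with a hypothetical database name:
          </p>
<codeblock>DESCRIBE DATABASE staging_db;</codeblock>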
</li>
<li>
<p rev="IMPALA-1477">
The <codeph>uuid()</codeph> built-in function generates an
alphanumeric value that you can use as a guaranteed unique identifier.
The uniqueness applies even across tables, for cases where an ascending
numeric sequence is not suitable.
<ph audience="PDF">See <xref href="impala_misc_functions.xml#misc_functions"/> for details.</ph>
</p>
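          <p>
            For illustration, generating an identifier directly, and while copying rows into a
            hypothetical table:
          </p>
<codeblock>SELECT uuid();

-- Tag each copied row with a unique identifier (hypothetical tables).
INSERT INTO events_with_id SELECT uuid(), e.* FROM events e;</codeblock>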
</li>
</ul>
</conbody>
</concept>
<!-- All 2.4.x new features go under here -->
<concept rev="2.4.0" id="new_features_240">
<title>New Features in <keyword keyref="impala24_full"/></title>
<conbody>
<ul>
<li>
<p>
Impala can be used on the DSSD D5 Storage Appliance.
From a user perspective, the Impala features are the same as in <keyword keyref="impala23_full"/>.
</p>
</li>
</ul>
</conbody>
</concept>
<!-- All 2.3.x subsections go under here -->
<!-- Actually for 2.3 / 5.5, let's get away from doing a separate subhead for each maintenance release,
because in the normal course of events there will be nothing to add here until 5.6. If something new
needs to get noted, just add a new bullet with wording to indicate which 5.5.x release it applies to. -->
<concept rev="2.3.0" id="new_features_230">
<title>New Features in <keyword keyref="impala23_full"/></title>
<conbody>
<p>
The following are the major new features in Impala 2.3.x. This major release
contains improvements to SQL syntax (particularly new support for complex types), performance,
        manageability, and security.
</p>
<ul>
<li>
<p>
Complex data types: <codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, and <codeph>MAP</codeph>. These
types can encode multiple named fields, positional items, or key-value pairs within a single column.
You can combine these types to produce nested types with arbitrarily deep nesting,
such as an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> values,
a <codeph>MAP</codeph> where each key-value pair is an <codeph>ARRAY</codeph> of other <codeph>MAP</codeph> values,
and so on. Currently, complex data types are only supported for the Parquet file format.
<ph audience="PDF">See <xref href="impala_complex_types.xml#complex_types"/> for usage details and <xref href="impala_array.xml#array"/>, <xref href="impala_struct.xml#struct"/>, and <xref href="impala_map.xml#map"/> for syntax.</ph>
</p>
</li>
<li rev="collevelauth">
<p>
Column-level authorization lets you define access to particular columns within a table,
rather than the entire table. This feature lets you reduce the reliance on creating views to
set up authorization schemes for subsets of information.
See <xref keyref="sg_hive_sql"/> for background details, and
<xref href="impala_grant.xml#grant"/> and <xref href="impala_revoke.xml#revoke"/> for Impala-specific syntax.
</p>
</li>
<li rev="IMPALA-1139">
<p>
The <codeph>TRUNCATE TABLE</codeph> statement removes all the data from a table without removing the table itself.
<ph audience="PDF">See <xref href="impala_truncate_table.xml#truncate_table"/> for details.</ph>
</p>
</li>
<li id="IMPALA-2015">
<p>
Nested loop join queries. Some join queries that formerly required equality comparisons can now use
operators such as <codeph>&lt;</codeph> or <codeph>&gt;=</codeph>. This same join mechanism is used
internally to optimize queries that retrieve values from complex type columns.
<ph audience="PDF">See <xref href="impala_joins.xml#joins"/> for details about Impala join queries.</ph>
</p>
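          <p>
            For illustration, a non-equijoin that relies on this mechanism, using hypothetical tables:
          </p>
<codeblock>-- Match each transaction to the price band whose range contains its amount.
SELECT t.id, b.band_name
  FROM transactions t JOIN price_bands b
    ON t.amount &gt;= b.low_bound AND t.amount &lt; b.high_bound;</codeblock>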
</li>
<li>
<p>
Reduced memory usage and improved performance and robustness for spill-to-disk feature.
<ph audience="PDF">See <xref href="impala_scalability.xml#spill_to_disk"/> for details about this feature.</ph>
</p>
</li>
<li rev="IMPALA-1881">
<p>
Performance improvements for querying Parquet data files containing multiple row groups
and multiple data blocks:
</p>
<ul>
<li>
<p> For files written by Hive, SparkSQL, and other Parquet MR writers
and spanning multiple HDFS blocks, Impala now scans the extra
data blocks locally when possible, rather than using remote
reads. </p>
</li>
<li>
<p>
Impala queries benefit from the improved alignment of row groups with HDFS blocks for Parquet
files written by Hive, MapReduce, and other components. (Impala itself never writes
multiblock Parquet files, so the alignment change does not apply to Parquet files produced by Impala.)
These Parquet writers now add padding to Parquet files that they write to align row groups with HDFS blocks.
The <codeph>parquet.writer.max-padding</codeph> setting specifies the maximum number of bytes, by default
8 megabytes, that can be added to the file between row groups to fill the gap at the end of one block
so that the next row group starts at the beginning of the next block.
If the gap is larger than this size, the writer attempts to fit another entire row group in the remaining space.
Include this setting in the <filepath>hive-site</filepath> configuration file to influence Parquet files written by Hive,
or the <filepath>hdfs-site</filepath> configuration file to influence Parquet files written by all non-Impala components.
</p>
</li>
</ul>
<p audience="PDF">
See <xref href="impala_parquet.xml#parquet"/> for instructions about using Parquet data files
with Impala.
</p>
</li>
<li id="IMPALA-1660">
<p>
Many new built-in scalar functions, for convenience and enhanced portability of SQL that uses common industry extensions.
</p>
<p rev="IMPALA-1771">
Math functions<ph audience="PDF"> (see <xref href="impala_math_functions.xml#math_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>ATAN2</codeph>
</li>
<li>
<codeph>COSH</codeph>
</li>
<li>
<codeph>COT</codeph>
</li>
<li>
<codeph>DCEIL</codeph>
</li>
<li>
<codeph>DEXP</codeph>
</li>
<li>
<codeph>DFLOOR</codeph>
</li>
<li>
<codeph>DLOG10</codeph>
</li>
<li>
<codeph>DPOW</codeph>
</li>
<li>
<codeph>DROUND</codeph>
</li>
<li>
<codeph>DSQRT</codeph>
</li>
<li>
<codeph>DTRUNC</codeph>
</li>
<li>
<codeph>FACTORIAL</codeph>, and corresponding <codeph>!</codeph> operator
</li>
<li>
<codeph>FPOW</codeph>
</li>
<li>
<codeph>RADIANS</codeph>
</li>
<li>
<codeph>RANDOM</codeph>
</li>
<li>
<codeph>SINH</codeph>
</li>
<li>
<codeph>TANH</codeph>
</li>
</ul>
<p>
String functions<ph audience="PDF"> (see <xref href="impala_string_functions.xml#string_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>BTRIM</codeph>
</li>
<li>
<codeph>CHR</codeph>
</li>
<li>
<codeph>REGEXP_LIKE</codeph>
</li>
<li>
<codeph>SPLIT_PART</codeph>
</li>
</ul>
<p>
Date and time functions<ph audience="PDF"> (see <xref href="impala_datetime_functions.xml#datetime_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>INT_MONTHS_BETWEEN</codeph>
</li>
<li>
<codeph>MONTHS_BETWEEN</codeph>
</li>
<li>
<codeph>TIMEOFDAY</codeph>
</li>
<li>
<codeph>TIMESTAMP_CMP</codeph>
</li>
</ul>
<p>
Bit manipulation functions<ph audience="PDF"> (see <xref href="impala_bit_functions.xml#bit_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>BITAND</codeph>
</li>
<li>
<codeph>BITNOT</codeph>
</li>
<li>
<codeph>BITOR</codeph>
</li>
<li>
<codeph>BITXOR</codeph>
</li>
<li>
<codeph>COUNTSET</codeph>
</li>
<li>
<codeph>GETBIT</codeph>
</li>
<li>
<codeph>ROTATELEFT</codeph>
</li>
<li>
<codeph>ROTATERIGHT</codeph>
</li>
<li>
<codeph>SETBIT</codeph>
</li>
<li>
<codeph>SHIFTLEFT</codeph>
</li>
<li>
<codeph>SHIFTRIGHT</codeph>
</li>
</ul>
<p>
Type conversion functions<ph audience="PDF"> (see <xref href="impala_conversion_functions.xml#conversion_functions"/> for details)</ph>:
</p>
<ul>
<li>
<codeph>TYPEOF</codeph>
</li>
</ul>
<p>
The <codeph>effective_user()</codeph> function<ph audience="PDF"> (see <xref href="impala_misc_functions.xml#misc_functions"/> for details)</ph>.
</p>
</li>
<li id="IMPALA-2081">
<p>
New built-in analytic functions: <codeph>PERCENT_RANK</codeph>, <codeph>NTILE</codeph>,
<codeph>CUME_DIST</codeph>.
<ph audience="PDF">See <xref href="impala_analytic_functions.xml#analytic_functions"/> for details.</ph>
</p>
</li>
<li id="IMPALA-595">
<p>
The <codeph>DROP DATABASE</codeph> statement now works for a non-empty database.
When you specify the optional <codeph>CASCADE</codeph> clause, any tables in the
database are dropped before the database itself is removed.
<ph audience="PDF">See <xref href="impala_drop_database.xml#drop_database"/> for details.</ph>
</p>
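          <p>
            For example, with a hypothetical database that still contains tables:
          </p>
<codeblock>DROP DATABASE temp_etl_db CASCADE;</codeblock>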
</li>
<li>
<p>
The <codeph>DROP TABLE</codeph> and <codeph>ALTER TABLE DROP PARTITION</codeph> statements have a new optional keyword, <codeph>PURGE</codeph>.
This keyword causes Impala to immediately remove the relevant HDFS data files rather than sending them to the HDFS trashcan.
This feature can help to avoid out-of-space errors on storage devices, and to avoid files being left behind in case of
a problem with the HDFS trashcan, such as the trashcan not being configured or being in a different HDFS encryption zone
than the data files.
<ph audience="PDF">See <xref href="impala_drop_table.xml#drop_table"/> and <xref href="impala_alter_table.xml#alter_table"/> for syntax.</ph>
</p>
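          <p>
            For illustration, with hypothetical table and partition names:
          </p>
<codeblock>-- Skip the HDFS trashcan so the space is reclaimed immediately.
DROP TABLE old_staging PURGE;
ALTER TABLE sales DROP PARTITION (year = 2010) PURGE;</codeblock>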
</li>
<li id="IMPALA-80">
<p>
The <cmdname>impala-shell</cmdname> command has a new feature for live progress reporting. This feature
is enabled through the <codeph>--live_progress</codeph> and <codeph>--live_summary</codeph>
command-line options, or during a session through the <codeph>LIVE_SUMMARY</codeph> and
<codeph>LIVE_PROGRESS</codeph> query options.
<ph audience="PDF">See <xref href="impala_live_progress.xml#live_progress"/> and <xref href="impala_live_summary.xml#live_summary"/> for details.</ph>
</p>
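          <p>
            For illustration, enabling the feature from the command line or within a session:
          </p>
<codeblock>$ impala-shell --live_progress --live_summary

[localhost:21000] &gt; SET LIVE_PROGRESS=1;
[localhost:21000] &gt; SET LIVE_SUMMARY=1;</codeblock>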
</li>
<li>
<p>
The <cmdname>impala-shell</cmdname> command also now displays a random <q>tip of the day</q> when it starts.
</p>
</li>
<li id="IMPALA-1413">
<p>
The <cmdname>impala-shell</cmdname> option <codeph>-f</codeph> now recognizes a special filename
<codeph>-</codeph> to accept input from stdin.
<ph audience="PDF">See <xref href="impala_shell_options.xml#shell_options"/> for details about the options for running <cmdname>impala-shell</cmdname> in non-interactive mode.</ph>
</p>
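          <p>
            For example, piping a generated statement into <cmdname>impala-shell</cmdname> through
            standard input (hypothetical table name):
          </p>
<codeblock>$ echo "SELECT COUNT(*) FROM tpch.lineitem;" | impala-shell -f -</codeblock>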
</li>
<li id="IMPALA-1963">
<p>
Format strings for the <codeph>unix_timestamp()</codeph> function can now include numeric timezone offsets.
<ph audience="PDF">See <xref href="impala_datetime_functions.xml#datetime_functions"/> for details.</ph>
</p>
</li>
<li>
<p>
Impala can now run a specified command to obtain the password to decrypt a private-key PEM file,
rather than having the private-key file be unencrypted on disk.
<ph audience="PDF">See <xref href="impala_ssl.xml#ssl"/> for details.</ph>
</p>
</li>
<li id="IMPALA-859">
<p>
Impala components now can use SSL for more of their internal communication. SSL is used for
communication between all three Impala-related daemons when the configuration option
<codeph>ssl_server_certificate</codeph> is enabled. SSL is used for communication with client
applications when the configuration option <codeph>ssl_client_ca_certificate</codeph> is enabled.
<ph audience="PDF">See <xref href="impala_ssl.xml#ssl"/> for details.</ph>
</p>
<p>
Currently, you can only use one of server-to-server TLS/SSL encryption or Kerberos authentication.
This limitation is tracked by the issue
<xref keyref="IMPALA-2598">IMPALA-2598</xref>.
</p>
</li>
<li id="IMPALA-1829">
<p>
Improved flexibility for intermediate data types in user-defined aggregate functions (UDAFs).
<ph audience="PDF">See <xref href="impala_udf.xml#udafs"/> for details.</ph>
</p>
</li>
</ul>
<p>
In <keyword keyref="impala232"/>, the bug fix for <xref keyref="IMPALA-2598">IMPALA-2598</xref>
removes the restriction on using both Kerberos and SSL for internal communication between Impala components.
</p>
<!-- End of new feature list for 2.3 / 5.5. -->
</conbody>
</concept>
<!-- All 2.2.x subsections go under here -->
<concept rev="2.2.0" id="new_features_220">
    <title>New Features in <keyword keyref="impala22_full"/></title>
<conbody>
<p>
The following are the major new features in <keyword keyref="impala22_full"/>. This release
contains improvements to performance, manageability, security, and SQL syntax.
</p>
<ul>
        <li>
          <p>
            The <codeph>WITH REPLICATION</codeph> clause for the <codeph>CREATE TABLE</codeph> and
            <codeph>ALTER TABLE</codeph> statements lets you control the replication factor for
            HDFS caching for a specific table or partition. By default, each cached block is
            only present on a single host, which can lead to CPU contention if the same host
            processes each cached block. Increasing the replication factor lets Impala choose
            different hosts to process different cached blocks, to better distribute the CPU load.
          </p>
        </li>
        <li>
          <p>
            Several improvements to date and time features enable higher interoperability with Hive and other
            database systems, provide more flexibility for handling time zones, and future-proof the handling of
            <codeph>TIMESTAMP</codeph> values:
          </p>
          <ul>
<li>
<p>
Startup flags for the <cmdname>impalad</cmdname> daemon enable a higher level of compatibility with
<codeph>TIMESTAMP</codeph> values written by Hive, and more flexibility for working with date and
time data using the local time zone instead of UTC. To enable these features, set the
<cmdname>impalad</cmdname> startup flags
<codeph>-use_local_tz_for_unix_timestamp_conversions=true</codeph> and
<codeph>-convert_legacy_hive_parquet_utc_timestamps=true</codeph>.
</p>
<p>
The <codeph>-use_local_tz_for_unix_timestamp_conversions</codeph> setting controls how the
<codeph>unix_timestamp()</codeph>, <codeph>from_unixtime()</codeph>, and <codeph>now()</codeph>
functions handle time zones. By default (when this setting is turned off), Impala considers all
<codeph>TIMESTAMP</codeph> values to be in the UTC time zone when converting to or from Unix time
values. When this setting is enabled, Impala treats <codeph>TIMESTAMP</codeph> values passed to or
returned from these functions to be in the local time zone. When this setting is enabled, take
particular care that all hosts in the cluster have the same timezone settings, to avoid
inconsistent results depending on which host reads or writes <codeph>TIMESTAMP</codeph> data.
</p>
<p>
The <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> setting causes Impala to convert
<codeph>TIMESTAMP</codeph> values to the local time zone when it reads them from Parquet files
written by Hive. This setting only applies to data using the Parquet file format, where Impala can
use metadata in the files to reliably determine that the files were written by Hive. If in the
future Hive changes the way it writes <codeph>TIMESTAMP</codeph> data in Parquet, Impala will
automatically handle that new <codeph>TIMESTAMP</codeph> encoding.
</p>
<p>
See <xref href="impala_timestamp.xml#timestamp"/> for details about time zone handling and the
configuration options for Impala / Hive compatibility with Parquet format.
</p>
</li>
<li>
<p conref="../shared/impala_common.xml#common/y2k38" />
<p>
See <xref href="impala_datetime_functions.xml#datetime_functions"/> for the current function
signatures.
</p>
</li>
</ul>
</li>
<li>
<p>
The <codeph>SHOW FILES</codeph> statement lets you view the names and sizes of the files that make up
an entire table or a specific partition. See <xref href="impala_show.xml#show_files"/> for details.
</p>
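        <p>
          For illustration, with hypothetical table and partition names:
        </p>
<codeblock>SHOW FILES IN sales;
SHOW FILES IN sales PARTITION (year = 2015);</codeblock>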
</li>
<li>
<p>
Impala can now run queries against Parquet data containing columns with complex or nested types, as
long as the query only refers to columns with scalar types.
</p>
</li>
<li>
<p>
Performance improvements for queries that include <codeph>IN()</codeph> operators and involve
partitioned tables.
</p>
</li>
<li>
<!-- Same text for this item in impala_fixed_issues.xml. Could turn into a conref. -->
<p>
The new <codeph>-max_log_files</codeph> configuration option specifies how many log files to keep at
each severity level. The default value is 10, meaning that Impala preserves the latest 10 log files for
each severity level (<codeph>INFO</codeph>, <codeph>WARNING</codeph>, and <codeph>ERROR</codeph>) for
each Impala-related daemon (<cmdname>impalad</cmdname>, <cmdname>statestored</cmdname>, and
<cmdname>catalogd</cmdname>). Impala checks to see if any old logs need to be removed based on the
interval specified in the <codeph>logbufsecs</codeph> setting, every 5 seconds by default. See
<xref href="impala_logging.xml#logs_rotate"/> for details.
</p>
</li>
<li>
<p>
Redaction of sensitive data from Impala log files. This feature protects details such as credit card
numbers or tax IDs from administrators who see the text of SQL statements in the course of monitoring
and troubleshooting a Hadoop cluster. See <xref href="impala_logging.xml#redaction"/> for background
information for Impala users, and <xref keyref="sg_redaction"/> for usage details.
</p>
</li>
<li>
<p>
Lineage information is available for data created or queried by Impala. This feature lets you track who
has accessed data through Impala SQL statements, down to the level of specific columns, and how data
has been propagated between tables. See <xref href="impala_lineage.xml#lineage"/> for background
          information for Impala users, and <xref keyref="datamgmt_impala_lineage_log"/> for usage details and
how to interpret the lineage information.
</p>
</li>
<li>
<p>
Impala tables and partitions can now be located on the Amazon Simple Storage Service (S3) filesystem,
for convenience in cases where data is already located in S3 and you prefer to query it in-place.
          Queries might have lower performance than when the data files reside on HDFS, because some of the
          optimizations Impala uses are specific to HDFS. Impala can query data in S3, but cannot write to S3. Therefore, statements
such as <codeph>INSERT</codeph> and <codeph>LOAD DATA</codeph> are not available when the destination
table or partition is in S3. See <xref href="impala_s3.xml#s3"/> for details.
</p>
<note conref="../shared/impala_common.xml#common/s3_caveat" />
</li>
<li>
<!-- Only want the link out of the release notes to appear for HTML
(N.B. audience="PDF" means hide from PDF), and only in the HTML for the
integrated build where the topic is available for link resolution. -->
<p>
Improved support for HDFS encryption. The <codeph>LOAD DATA</codeph> statement now works when the
source directory and destination table are in different encryption zones. See
<xref keyref="cdh_sg_component_kms"/> for details about using HDFS encryption with
Impala.
</p>
</li>
<li>
<p>
Additional arithmetic function <codeph>mod()</codeph>. See
<xref href="impala_math_functions.xml#math_functions"/> for details.
</p>
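        <p>
          For example:
        </p>
<codeblock>SELECT mod(10, 3);   -- returns 1</codeblock>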
</li>
<li>
<p>
Flexibility to interpret <codeph>TIMESTAMP</codeph> values using the UTC time zone (the traditional
Impala behavior) or using the local time zone (for compatibility with <codeph>TIMESTAMP</codeph> values
produced by Hive).
</p>
</li>
<li>
<p>
Enhanced support for ETL using tools such as Flume. Impala ignores temporary files typically produced
by these tools (filenames with suffixes <codeph>.copying</codeph> and <codeph>.tmp</codeph>).
</p>
</li>
<li>
<p>
The CPU requirement for Impala, which had become more restrictive in Impala 2.0.x and 2.1.x, has now
been relaxed.
</p>
<p conref="../shared/impala_common.xml#common/cpu_prereq" />
</li>
<li>
<p>
Enhanced support for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> types in the <codeph>COMPUTE
STATS</codeph> statement.
</p>
</li>
<li rev="">
<p>
The amount of memory required during setup for <q>spill to disk</q> operations is greatly reduced. This
enhancement reduces the chance of a memory-intensive join or aggregation query failing with an
out-of-memory error.
</p>
</li>
<li>
<p>
Several new conditional functions provide enhanced compatibility when porting code that uses industry
extensions. The new functions are: <codeph>isfalse()</codeph>, <codeph>isnotfalse()</codeph>,
<codeph>isnottrue()</codeph>, <codeph>istrue()</codeph>, <codeph>nonnullvalue()</codeph>, and
<codeph>nullvalue()</codeph>. See <xref href="impala_conditional_functions.xml#conditional_functions"/>
for details.
</p>
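        <p>
          For illustration, each new function applied to a constant:
        </p>
<codeblock>SELECT istrue(NULL);       -- false
SELECT isnotfalse(NULL);   -- true
SELECT nullvalue(NULL);    -- true
SELECT nonnullvalue(5);    -- true</codeblock>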
</li>
<li>
<p>
The Impala debug web UI now can display a visual representation of the query plan. On the
<uicontrol>/queries</uicontrol> tab, select <uicontrol>Details</uicontrol> for a particular query. The
<uicontrol>Details</uicontrol> page includes a <uicontrol>Plan</uicontrol> tab with a plan diagram that
you can zoom in or out (using scroll gestures through mouse wheel or trackpad).
</p>
</li>
</ul>
<!-- End of new feature list for 5.4. -->
</conbody>
</concept>
<!-- All 2.1.x subsections go under here -->
<concept rev="2.1.0" id="new_features_210">
<title>New Features in <keyword keyref="impala21_full"/></title>
<conbody>
<p>
This release contains the following enhancements to query performance and system scalability:
</p>
<ul>
<li>
<p>
Impala can now collect statistics for individual partitions in a partitioned table, rather than
processing the entire table for each <codeph>COMPUTE STATS</codeph> statement. This feature is known as
incremental statistics, and is controlled by the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax.
(You can still use the original <codeph>COMPUTE STATS</codeph> statement for nonpartitioned tables or
partitioned tables that are unchanging or whose contents are entirely replaced all at once.) See
<xref href="impala_compute_stats.xml#compute_stats"/> and
<xref href="impala_perf_stats.xml#perf_stats"/> for details.
</p>
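        <p>
          For illustration, with a hypothetical partitioned table:
        </p>
<codeblock>-- Gather statistics only for partitions that do not have incremental stats yet.
COMPUTE INCREMENTAL STATS sales;

-- Or restrict the work to one newly added partition.
COMPUTE INCREMENTAL STATS sales PARTITION (year = 2015);</codeblock>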
</li>
<li>
<p>
          Optimization for small queries lets Impala handle queries that process very few rows without the
unnecessary overhead of parallelizing and generating native code. Reducing this overhead lets Impala
clear small queries quickly, keeping YARN resources and admission control slots available for
data-intensive queries. The number of rows considered to be a <q>small</q> query is controlled by the
<codeph>EXEC_SINGLE_NODE_ROWS_THRESHOLD</codeph> query option. See
<xref href="impala_exec_single_node_rows_threshold.xml#exec_single_node_rows_threshold"/> for details.
</p>
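        <p>
          For illustration, raising the threshold for the current session:
        </p>
<codeblock>-- Queries estimated to process 500 rows or fewer run on a single node.
SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=500;</codeblock>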
</li>
<li>
<p>
An enhancement to the statestore component lets it transmit heartbeat information independently of
broadcasting metadata updates. This optimization improves reliability of health checking on large
clusters with many tables and partitions.
</p>
</li>
<li>
<p>
The memory requirement for querying gzip-compressed text is reduced. Now Impala decompresses the data
as it is read, rather than reading the entire gzipped file and decompressing it in memory.
</p>
</li>
</ul>
</conbody>
</concept>
<!-- All 2.0.x subsections go under here -->
<concept rev="2.0.0" id="new_features_200">
<title>New Features in <keyword keyref="impala20_full"/></title>
<conbody>
<p>
The following are the major new features in <keyword keyref="impala20_full"/>. This major release
contains improvements to performance, scalability, security, and SQL syntax.
</p>
<ul>
<li>
<p>
Queries with joins or aggregation functions involving high volumes of data can now use temporary work
areas on disk, reducing the chance of failure due to out-of-memory errors. When the required memory for
the intermediate result set exceeds the amount available on a particular node, the query automatically
uses a temporary work area on disk. This <q>spill to disk</q> mechanism is similar to the <codeph>ORDER
BY</codeph> improvement from Impala 1.4. For details, see
<xref href="impala_scalability.xml#spill_to_disk"/>.
</p>
</li>
<li>
<p>
Subquery enhancements:
<ul>
<li>
Subqueries are now allowed in the <codeph>WHERE</codeph> clause, for example with the
<codeph>IN</codeph> operator.
</li>
<li>
The <codeph>EXISTS</codeph> and <codeph>NOT EXISTS</codeph> operators are available. They are
always used in conjunction with subqueries.
</li>
<li>
            The <codeph>IN</codeph> and <codeph>NOT IN</codeph> operators can now operate on the result set from
a subquery, not just a hardcoded list of values.
</li>
<li>
Uncorrelated subqueries let you compare against one or more values for equality,
<codeph>IN</codeph>, and <codeph>EXISTS</codeph> comparisons. For example, you might use
<codeph>WHERE</codeph> clauses such as <codeph>WHERE <varname>column</varname> = (SELECT
            MAX(<varname>some_other_column</varname>) FROM <varname>table</varname>)</codeph> or <codeph>WHERE
<varname>column</varname> IN (SELECT <varname>some_other_column</varname> FROM
<varname>table</varname> WHERE <varname>conditions</varname>)</codeph>.
</li>
<li>
Correlated subqueries let you cross-reference values from the outer query block and the subquery.
</li>
<li>
Scalar subqueries let you substitute the result of single-value aggregate functions such as
<codeph>MAX()</codeph>, <codeph>MIN()</codeph>, <codeph>COUNT()</codeph>, or
<codeph>AVG()</codeph>, where you would normally use a numeric value in a <codeph>WHERE</codeph>
clause.
</li>
</ul>
</p>
<p>
          For details about subqueries, see <xref href="impala_subqueries.xml#subqueries"/>. For information about
new and improved operators, see <xref href="impala_operators.xml#exists"/> and
<xref href="impala_operators.xml#in"/>.
</p>
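        <p>
          For illustration, two of the newly supported forms, using hypothetical tables:
        </p>
<codeblock>-- Uncorrelated subquery with IN.
SELECT c_name FROM customer
  WHERE c_custkey IN (SELECT o_custkey FROM orders WHERE o_totalprice &gt; 100000);

-- Correlated subquery with EXISTS.
SELECT c_name FROM customer c
  WHERE EXISTS (SELECT * FROM orders o WHERE o.o_custkey = c.c_custkey);</codeblock>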
</li>
<li>
<p>
Analytic functions such as <codeph>RANK()</codeph>, <codeph>LAG()</codeph>, <codeph>LEAD()</codeph>,
and <codeph>FIRST_VALUE()</codeph> let you analyze sequences of rows with flexible ordering and
grouping. Existing aggregate functions such as <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, and
<codeph>COUNT()</codeph> can also be used in an analytic context. See
<xref href="impala_analytic_functions.xml#analytic_functions"/> for details. See
<xref href="impala_aggregate_functions.xml#aggregate_functions"/> for enhancements to existing
aggregate functions.
</p>
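        <p>
          For illustration, ranking rows within groups in a hypothetical table:
        </p>
<codeblock>SELECT category, item, price,
       RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
  FROM items;</codeblock>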
</li>
<li>
<p>
New data types provide greater compatibility with source code from traditional database systems:
</p>
<ul>
<li>
<codeph>VARCHAR</codeph> is like the <codeph>STRING</codeph> data type, but with a maximum length.
See <xref href="impala_varchar.xml#varchar"/> for details.
</li>
<li>
<codeph>CHAR</codeph> is like the <codeph>STRING</codeph> data type, but with a precise length. Short
values are padded with spaces on the right. See <xref href="impala_char.xml#char"/> for details.
</li>
<li audience="hidden">
<!-- This feature will be undocumented in Impala 2.0, probably ready for prime time in 2.1. -->
<codeph>DATE</codeph>. See <xref href="impala_date.xml#date"/> for details.
</li>
</ul>
</li>
<li>
<p>
Security enhancements:
<ul>
<li>
Formerly, Impala was restricted to using either Kerberos or LDAP / Active Directory authentication
within a cluster. Now, Impala can freely accept either kind of authentication request, allowing you
to set up some hosts with Kerberos authentication and others with LDAP or Active Directory. See
<xref href="impala_mixed_security.xml#mixed_security"/> for details.
</li>
<li>
<codeph>GRANT</codeph> statement. See <xref href="impala_grant.xml#grant"/> for details.
</li>
<li>
<codeph>REVOKE</codeph> statement. See <xref href="impala_revoke.xml#revoke"/> for details.
</li>
<li>
<codeph>CREATE ROLE</codeph> statement. See <xref href="impala_create_role.xml#create_role"/> for
details.
</li>
<li>
<codeph>DROP ROLE</codeph> statement. See <xref href="impala_drop_role.xml#drop_role"/> for
details.
</li>
<li>
<codeph>SHOW ROLES</codeph> and <codeph>SHOW ROLE GRANT</codeph> statements. See
<xref href="impala_show.xml#show"/> for details.
</li>
<li>
<p>
To complement the HDFS encryption feature, a new Impala configuration option,
                <codeph>--disk_spill_encryption</codeph>, secures sensitive data from being observed or tampered
with when temporarily stored on disk.
</p>
</li>
</ul>
</p>
<p>
The new security-related SQL statements work along with the Sentry authorization framework. See
<xref keyref="authorization"/> for details.
</p>
</li>
<li>
<p>
          Impala can now read text files compressed by gzip, bzip2, or Snappy. These files do not
require any special table settings to work in an Impala text table. Impala recognizes the compression
type automatically based on file extensions of <codeph>.gz</codeph>, <codeph>.bz2</codeph>, and
<codeph>.snappy</codeph> respectively. These types of compressed text files are intended for
convenience with existing ETL pipelines. Their non-splittable nature means they are not optimal for
high-performance parallel queries. See <xref href="impala_txtfile.xml#gzip"/> for details.
</p>
</li>
<li>
<p>
Query hints can now use comment notation, <codeph>/* +<varname>hint_name</varname> */</codeph> or
<codeph>-- +<varname>hint_name</varname></codeph>, at the same places in the query where the hints
enclosed by <codeph>[ ]</codeph> are recognized. This enhancement makes it easier to reuse Impala
queries on other database systems. See <xref href="impala_hints.xml#hints"/> for details.
</p>
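        <p>
          For illustration, the comment notation alongside the original bracket notation, using
          hypothetical table names:
        </p>
<codeblock>-- Original notation:
SELECT COUNT(*) FROM big_tbl b JOIN [SHUFFLE] small_tbl s ON b.id = s.id;

-- Equivalent comment notation, portable to other database systems:
SELECT COUNT(*) FROM big_tbl b JOIN /* +SHUFFLE */ small_tbl s ON b.id = s.id;</codeblock>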
</li>
<li>
<p>
A new query option, <codeph>QUERY_TIMEOUT_S</codeph>, lets you specify a timeout period in seconds for
individual queries.
</p>
<p>
          The behavior of the <codeph>--idle_query_timeout</codeph> configuration option is extended. If no
          <codeph>QUERY_TIMEOUT_S</codeph> query option is in effect, <codeph>--idle_query_timeout</codeph> works
          the same as before, setting the timeout interval. When the <codeph>QUERY_TIMEOUT_S</codeph> query option
is specified, its maximum value is capped by the value of the <codeph>--idle_query_timeout</codeph>
option.
</p>
<p>
That is, the system administrator sets the default and maximum timeout through the
<codeph>--idle_query_timeout</codeph> startup option, and then individual users or applications can set
a lower timeout value if desired through the <codeph>QUERY_TIMEOUT_S</codeph> query option. See
<xref href="impala_timeouts.xml#timeouts"/> and
<xref href="impala_query_timeout_s.xml#query_timeout_s"/> for details.
</p>
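        <p>
          For illustration, a session-level setting below a hypothetical administrator-defined maximum:
        </p>
<codeblock>-- Cancel a query in this session if it remains idle for more than 10 minutes.
SET QUERY_TIMEOUT_S=600;</codeblock>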
</li>
<li>
<p>
New functions <codeph>VAR_SAMP()</codeph> and <codeph>VAR_POP()</codeph> are aliases for the existing
<codeph>VARIANCE_SAMP()</codeph> and <codeph>VARIANCE_POP()</codeph> functions.
</p>
</li>
<li>
<p>
A new date and time function, <codeph>DATE_PART()</codeph>, provides similar functionality to
<codeph>EXTRACT()</codeph>. You can also call the <codeph>EXTRACT()</codeph> function using the SQL-99
syntax, <codeph>EXTRACT(<varname>unit</varname> FROM <varname>timestamp</varname>)</codeph>. These
enhancements simplify the porting process for date-related code from other systems. See
<xref href="impala_datetime_functions.xml#datetime_functions"/> for details.
</p>
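        <p>
          For example, two equivalent ways to pull the year out of a <codeph>TIMESTAMP</codeph> value:
        </p>
<codeblock>SELECT date_part('year', now());
SELECT extract(year FROM now());</codeblock>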
</li>
<li>
<p>
New approximation features provide a fast way to get results when absolute precision is not required:
</p>
<ul>
<li>
The <codeph>APPX_COUNT_DISTINCT</codeph> query option lets Impala rewrite
<codeph>COUNT(DISTINCT)</codeph> calls to use <codeph>NDV()</codeph> instead, which speeds up the
operation and allows multiple <codeph>COUNT(DISTINCT)</codeph> operations in a single query. See
<xref href="impala_appx_count_distinct.xml#appx_count_distinct"/> for details.
</li>
          <li>
            The <codeph>APPX_MEDIAN()</codeph> aggregate function produces an estimate for the median value of a
            column by using sampling. See <xref href="impala_appx_median.xml#appx_median"/> for details.
          </li>
        </ul>
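        <p>
          For illustration, combining both approximation features on a hypothetical table:
        </p>
<codeblock>SET APPX_COUNT_DISTINCT=1;
-- COUNT(DISTINCT) is rewritten to use NDV(); APPX_MEDIAN() estimates the median by sampling.
SELECT COUNT(DISTINCT user_id), APPX_MEDIAN(session_length) FROM web_sessions;</codeblock>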
</li>
<li>
<p>
Impala now supports a <codeph>DECODE()</codeph> function. This function works as a shorthand for a
<codeph>CASE()</codeph> expression, and improves compatibility with SQL code containing vendor
extensions. See <xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
</p>
</li>
<li>
<p>
The <codeph>STDDEV()</codeph>, <codeph>STDDEV_POP()</codeph>, <codeph>STDDEV_SAMP()</codeph>,
<codeph>VARIANCE()</codeph>, <codeph>VARIANCE_POP()</codeph>, <codeph>VARIANCE_SAMP()</codeph>, and
<codeph>NDV()</codeph> aggregate functions now all return <codeph>DOUBLE</codeph> results rather than
<codeph>STRING</codeph>. Formerly, you were required to <codeph>CAST()</codeph> the result to a numeric
type before using it in arithmetic operations.
</p>
</li>
<li id="parquet_block_size">
<p>
The default settings for Parquet block size, and the associated <codeph>PARQUET_FILE_SIZE</codeph>
query option, are changed. Now, Impala writes Parquet files with a size of 256 MB and an HDFS block
size of 256 MB. Previously, Impala attempted to write Parquet files with a size of 1 GB and an HDFS
block size of 1 GB. In practice, Impala used a conservative estimate of the disk space needed for each
Parquet block, leading to files that were typically 512 MB anyway. Thus, this change will make the file
size more accurate if you specify a value for the <codeph>PARQUET_FILE_SIZE</codeph> query option. It
also reduces the amount of memory reserved during <codeph>INSERT</codeph> into Parquet tables,
potentially avoiding out-of-memory errors and improving scalability when inserting data into Parquet
tables.
</p>
</li>
<li>
<p>
Anti-joins are now supported, expressed using the <codeph>LEFT ANTI JOIN</codeph> and <codeph>RIGHT
ANTI JOIN</codeph> clauses.
<!-- Maybe RIGHT SEMI JOIN is new too? -->
<!-- Make following statement true in the context of RIGHT ANTI JOIN. -->
          These clauses return results from one table that have no match in the other table. You might use this
type of join in the same sorts of use cases as the <codeph>NOT EXISTS</codeph> and <codeph>NOT
IN</codeph> operators. See <xref href="impala_joins.xml#joins"/> for details.
</p>
</li>
<li audience="hidden">
<!-- This feature will be undocumented in Impala 2.0, probably ready for prime time in 2.1. -->
<p>
Improved file format support. Impala can now write to Avro, compressed text, SequenceFile, and RCFile
tables using the <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> statements. See
<xref href="impala_file_formats.xml#file_formats"/> for details.
</p>
</li>
<li>
<p>
The <codeph>SET</codeph> command in <cmdname>impala-shell</cmdname> has been promoted to a real SQL
statement. You can now set query options such as <codeph>PARQUET_FILE_SIZE</codeph>,
<codeph>MEM_LIMIT</codeph>, and <codeph>SYNC_DDL</codeph> within JDBC, ODBC, or any other kind of
application that submits SQL without going through the <cmdname>impala-shell</cmdname> interpreter. See
<xref href="impala_set.xml#set"/> for details.
</p>
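        <p>
          For illustration, the kinds of statements an application could now submit directly through
          JDBC or ODBC:
        </p>
<codeblock>SET MEM_LIMIT=2g;
SET SYNC_DDL=1;</codeblock>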
</li>
<li>
<p>
The <cmdname>impala-shell</cmdname> interpreter now reads settings from an optional configuration file,
named <filepath>$HOME/.impalarc</filepath> by default. See
<xref href="impala_shell_options.xml#shell_config_file"/> for details.
</p>
</li>
<li audience="hidden">
<!-- This feature will be undocumented in Impala 2.0, probably ready for prime time in 2.1. -->
<p>
The <codeph>COMPUTE STATS</codeph> statement can now gather statistics for newly added partitions
rather than the entire table. This feature is known as <term>incremental statistics</term>. See
<xref href="impala_compute_stats.xml#compute_stats"/> for details.
</p>
</li>
<li>
<p>
The library used for regular expression parsing has changed from Boost to Google RE2. This
implementation change adds support for non-greedy matches using the <codeph>.*?</codeph> notation. This
          and other changes in the way regular expressions are interpreted mean you might need to re-test
queries that use functions such as <codeph>regexp_extract()</codeph> or
<codeph>regexp_replace()</codeph>, or operators such as <codeph>REGEXP</codeph> or
<codeph>RLIKE</codeph>. See <xref href="impala_incompatible_changes.xml#incompatible_changes"/> for
those details.
</p>
</li>
</ul>
</conbody>
</concept>
<concept rev="1.4.0" id="new_features_140">
<title>New Features in <keyword keyref="impala14_full"/></title>
<conbody>
<p>
The following are the major new features in <keyword keyref="impala14_full"/>:
</p>
<ul>
<li>
<p>
The <codeph>DECIMAL</codeph> data type lets you store fixed-precision values, for working with currency
or other fractional values where it is important to represent values exactly and avoid rounding errors.
This feature includes enhancements to built-in functions, numeric literals, and arithmetic expressions.
<ph audience="PDF">See <xref href="impala_decimal.xml#decimal"/> for details.</ph>
</p>
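        <p>
          For illustration, a hypothetical table that stores currency amounts exactly:
        </p>
<codeblock>CREATE TABLE invoices (id BIGINT, amount DECIMAL(12,2));

-- Literals and arithmetic keep exact fixed-point semantics.
SELECT CAST(9.99 AS DECIMAL(5,2)) * 3;</codeblock>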
</li>
<li>
<p>
Where the underlying HDFS support exists, Impala can take advantage of the HDFS caching feature to <q>pin</q> entire tables or
individual partitions in memory, to speed up queries on frequently accessed data and reduce the CPU
overhead of memory-to-memory copying. When HDFS files are cached in memory, Impala can read the cached
data without any disk reads, and without making an additional copy of the data in memory. Other Hadoop
components that read the same data files also experience a performance benefit.
</p>
<p audience="PDF">
For background information about HDFS caching, see
<xref keyref="setup_hdfs_caching"/>. For performance information about using this feature with Impala, see
<xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>. For the <codeph>SET CACHED</codeph> and
<codeph>SET UNCACHED</codeph> clauses that let you control cached table data through DDL statements,
see <xref href="impala_create_table.xml#create_table"/> and
<xref href="impala_alter_table.xml#alter_table"/>.
</p>
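        <p>
          For illustration, pinning and unpinning a hypothetical table, assuming an HDFS cache pool
          named <codeph>cache_pool</codeph> has already been set up:
        </p>
<codeblock>ALTER TABLE lookup_codes SET CACHED IN 'cache_pool';
ALTER TABLE lookup_codes SET UNCACHED;</codeblock>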
</li>
<li>
<p>
Impala can now use Sentry-based authorization based either on the original policy file, or on rules
defined by <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements issued through Hive.
See <xref keyref="authorization"/> for details.
</p>
</li>
<li>
<p>
For interoperability with Parquet files created through other Hadoop components, such as Pig or
MapReduce jobs, you can create an Impala table that automatically sets up the column definitions based
on the layout of an existing Parquet data file. <ph audience="PDF">See
<xref href="impala_create_table.xml#create_table"/> for the syntax, and
<xref href="impala_parquet.xml#parquet_ddl"/> for usage information.</ph>
</p>
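        <p>
          For illustration, with a hypothetical path to an existing Parquet data file:
        </p>
<codeblock>CREATE TABLE events_clone LIKE PARQUET '/user/etl/sample/events.parquet'
  STORED AS PARQUET;</codeblock>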
</li>
<li>
<p>
<codeph>ORDER BY</codeph> queries no longer require a <codeph>LIMIT</codeph> clause. If the size of the
result set to be sorted exceeds the memory available to Impala, Impala uses a temporary work space on
disk to perform the sort operation. <ph audience="PDF">See <xref href="impala_order_by.xml#order_by"/>
for details.</ph>
</p>
</li>
<li>
<p>
LDAP connections can be secured through either SSL or TLS. <ph audience="PDF">See
<xref href="impala_ldap.xml#ldap"/> for details.</ph>
</p>
</li>
<li>
<p>
The following new built-in scalar and aggregate functions are available:
</p>
<ul>
<li>
<p>
A new built-in function, <codeph>EXTRACT()</codeph>, returns one date or time field from a
<codeph>TIMESTAMP</codeph> value. <ph audience="PDF">See
<xref href="impala_datetime_functions.xml#datetime_functions"/> for details.</ph>
</p>
</li>
<li>
<p>
A new built-in function, <codeph>TRUNC()</codeph>, truncates date/time values to a particular
granularity, such as year, month, day, hour, and so on. <ph audience="PDF">See
<xref href="impala_datetime_functions.xml#datetime_functions"/> for details.</ph>
</p>
</li>
<li>
<p>
<codeph>ADD_MONTHS()</codeph> built-in function, an alias for the existing
<codeph>MONTHS_ADD()</codeph> function. <ph audience="PDF">See
<xref href="impala_datetime_functions.xml#datetime_functions"/> for details.</ph>
</p>
</li>
<li>
<p>
A new built-in function, <codeph>ROUND()</codeph>, rounds <codeph>DECIMAL</codeph> values to a
specified number of fractional digits. <ph audience="PDF">See
<xref href="impala_math_functions.xml#math_functions"/> for details.</ph>
</p>
</li>
<li>
<p>
Several built-in aggregate functions for computing properties for statistical distributions:
<codeph>STDDEV()</codeph>, <codeph>STDDEV_SAMP()</codeph>, <codeph>STDDEV_POP()</codeph>,
<codeph>VARIANCE()</codeph>, <codeph>VARIANCE_SAMP()</codeph>, and <codeph>VARIANCE_POP()</codeph>.
<ph audience="PDF">See <xref href="impala_stddev.xml#stddev"/> and
<xref href="impala_variance.xml#variance"/> for details.</ph>
</p>
</li>
<li>
<p>
Several new built-in functions, such as <codeph>MAX_INT()</codeph>,
<codeph>MIN_SMALLINT()</codeph>, and so on, let you conveniently check whether data values are in
an expected range. You might be able to switch a column to a smaller type, saving memory during
processing. <ph audience="PDF">See <xref href="impala_math_functions.xml#math_functions"/> for
details.</ph>
</p>
</li>
<li>
<p>
New built-in functions, <codeph>IS_INF()</codeph> and <codeph>IS_NAN()</codeph>, check for the
special values infinity and <q>not a number</q>. These values could be specified as
<codeph>inf</codeph> or <codeph>nan</codeph> in text data files, or be produced by certain
arithmetic expressions. <ph audience="PDF">See
<xref href="impala_math_functions.xml#math_functions"/> for details.</ph>
</p>
</li>
</ul>
</li>
<li>
<p>
The <codeph>SHOW PARTITIONS</codeph> statement displays information about the structure of a
partitioned table. <ph audience="PDF">See <xref href="impala_show.xml#show"/> for details.</ph>
</p>
</li>
<li audience="hidden">
<!-- Not documenting for 1.4. Revisit in a future release. -->
<p>
Data sources. <ph audience="PDF">See <xref href="impala_data_sources.xml#data_sources"/> for
details.</ph>
</p>
</li>
<li>
<p>
New configuration options for the <cmdname>impalad</cmdname> daemon let you specify initial memory
usage for all queries. The initial resource requests handled by Llama and YARN can be expanded later if
needed, avoiding unnecessary over-allocation and reducing the chance of out-of-memory conditions.
<ph audience="PDF">See <xref href="impala_resource_management.xml#resource_management"/> for
details.</ph>
</p>
</li>
<li>
The Impala <codeph>CREATE TABLE</codeph> statement now has a <codeph>STORED AS AVRO</codeph> clause,
allowing you to create Avro tables through Impala. <ph audience="PDF">See
<xref href="impala_avro.xml#avro"/> for details and examples.</ph>
</li>
<li>
<p>
New <cmdname>impalad</cmdname> configuration options let you fine-tune the calculations Impala makes to
estimate resource requirements for each query. These options can help avoid problems due to
overconsumption due to too-low estimates, or underutilization due to too-high estimates.
<ph audience="PDF">See <xref href="impala_resource_management.xml#resource_management"/> for
details.</ph>
</p>
</li>
<li>
<p>
A new <codeph>SUMMARY</codeph> command in the <cmdname>impala-shell</cmdname> interpreter provides a
high-level summary of the work performed at each stage of the explain plan. The summary is also
included in output from the <codeph>PROFILE</codeph> command. <ph audience="PDF">See
<xref href="impala_shell_commands.xml#shell_commands"/> and
<xref href="impala_explain_plan.xml#perf_summary"/> for details.</ph>
</p>
</li>
<li>
<p>
Performance improvements for the <codeph>COMPUTE STATS</codeph> statement:
</p>
<ul>
<!-- This particular change has been pushed out to a later release. -->
<li audience="hidden">
Certain simple aggregation operations (with no <codeph>GROUP BY</codeph> step) are multi-threaded if
spare cores are available.
</li>
<li>
            The <codeph>NDV</codeph> function is sped up through native code generation.
</li>
<li>
Because the <codeph>NULL</codeph> count is not currently used by the Impala query planner, in Impala
1.4.0 and higher, <codeph>COMPUTE STATS</codeph> does not count the <codeph>NULL</codeph> values for
each column. (The <codeph>#Nulls</codeph> field of the stats table is left as -1, signifying that the
value is unknown.)
</li>
</ul>
<p audience="PDF">
See <xref href="impala_compute_stats.xml#compute_stats"/> for general details about the <codeph>COMPUTE
STATS</codeph> statement, and <xref href="impala_perf_stats.xml#perf_stats"/> for how to use the
statistics to improve query performance.
</p>
</li>
<li>
<p>
Performance improvements for partition pruning. This feature reduces the time spent in query planning,
for partitioned tables with thousands of partitions. Previously, Impala typically queried tables with
          up to approximately 3000 partitions. With the performance improvement in partition pruning, Impala
can comfortably handle tables with tens of thousands of partitions. <ph audience="PDF">See
<xref href="impala_partitioning.xml#partition_pruning"/> for information about partition pruning.</ph>
</p>
</li>
<li>
<p>
The documentation provides additional guidance for planning tasks. <ph audience="PDF">See
<xref href="impala_planning.xml#planning"/>.</ph>
</p>
</li>
<li>
<p>
The <cmdname>impala-shell</cmdname> interpreter now supports UTF-8 characters for input and output. You
can control whether <cmdname>impala-shell</cmdname> ignores invalid Unicode code points through the
          <codeph>--strict_unicode</codeph> option. (This option was removed in Impala 2.0.)
</p>
</li>
</ul>
</conbody>
</concept>
<concept rev="1.3.2" id="new_features_132">
<title>New Features in <keyword keyref="impala132"/></title>
<conbody>
<p>
No new features. This point release is exclusively a bug fix release for the IMPALA-1019 issue related to
HDFS caching.
</p>
</conbody>
</concept>
<concept rev="1.3.1" id="new_features_131">
<title>New Features in Impala 1.3.1</title>
<conbody>
<p>
This point release is primarily a vehicle to deliver bug fixes. Any new features are minor changes
resulting from fixes for performance, reliability, or usability issues.
</p>
<ul>
<li>
<p>
A new <cmdname>impalad</cmdname> startup option, <codeph>--insert_inherit_permissions</codeph>, causes
Impala <codeph>INSERT</codeph> statements to create each new partition with the same HDFS permissions
as its parent directory. By default, <codeph>INSERT</codeph> statements create directories for new
partitions using default HDFS permissions. See <xref href="impala_insert.xml#insert"/> for examples of
<codeph>INSERT</codeph> statements for partitioned tables.
</p>
</li>
<li>
<p>
The <codeph>SHOW FUNCTIONS</codeph> statement now displays the return type of each function, in
addition to the types of its arguments. See <xref href="impala_show.xml#show"/> for examples.
</p>
</li>
<li>
<p>
You can now specify the clause <codeph>FIELDS TERMINATED BY '\0'</codeph> with a <codeph>CREATE
TABLE</codeph> statement to use text data files that use ASCII 0 (<codeph>nul</codeph>) characters as a
delimiter. See <xref href="impala_txtfile.xml#txtfile"/> for details.
</p>
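          <p>
            For example, a minimal table definition for such data (the table and column names are illustrative):
          </p>
          <codeblock>CREATE TABLE nul_delimited_data (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'
  STORED AS TEXTFILE;</codeblock>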
</li>
<li>
<p conref="../shared/impala_common.xml#common/regexp_matching" />
</li>
</ul>
</conbody>
</concept>
<concept rev="1.3.0" id="new_features_130">
<title>New Features in <keyword keyref="impala13_full"/></title>
<conbody>
<ul>
<li>
<p>
The admission control feature lets you control and prioritize the volume and resource consumption of
concurrent queries. This mechanism reduces spikes in resource usage, helping Impala to run alongside
other kinds of workloads on a busy cluster. It also provides more user-friendly conflict resolution
when multiple memory-intensive queries are submitted concurrently, avoiding resource contention that
formerly resulted in out-of-memory errors. See <xref href="impala_admission.xml#admission_control"/>
for details.
</p>
</li>
<li>
<p>
Enhanced <codeph>EXPLAIN</codeph> plans provide more detail in an easier-to-read format. Now there are
four levels of verbosity: the <codeph>EXPLAIN_LEVEL</codeph> option can be set from 0 (most concise) to
3 (most verbose). See <xref href="impala_explain.xml#explain"/> for syntax and
<xref href="impala_explain_plan.xml#explain_plan"/> for usage information.
</p>
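          <p>
            For example, in <cmdname>impala-shell</cmdname> you might compare the most concise and most verbose
            plans for the same query (the table name here is illustrative):
          </p>
          <codeblock>SET EXPLAIN_LEVEL=0;
EXPLAIN SELECT COUNT(*) FROM sales WHERE year = 2014;
SET EXPLAIN_LEVEL=3;
EXPLAIN SELECT COUNT(*) FROM sales WHERE year = 2014;</codeblock>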
</li>
<li>
<p>
The <codeph>TIMESTAMP</codeph> data type accepts more kinds of input string formats through the
<codeph>UNIX_TIMESTAMP</codeph> function, and produces more varieties of string formats through the
<codeph>FROM_UNIXTIME</codeph> function. The documentation now also lists more functions for date
arithmetic, used for adding and subtracting <codeph>INTERVAL</codeph> expressions from
<codeph>TIMESTAMP</codeph> values. See <xref href="impala_datetime_functions.xml#datetime_functions"/>
for details.
</p>
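          <p>
            For example, the following sketches show string-to-number conversion, number-to-string conversion, and
            <codeph>INTERVAL</codeph> arithmetic (the literal values are arbitrary):
          </p>
          <codeblock>SELECT UNIX_TIMESTAMP('2014-03-01 09:30:00');
SELECT FROM_UNIXTIME(1393666200, 'yyyy-MM-dd HH:mm');
SELECT NOW() + INTERVAL 3 DAYS, NOW() - INTERVAL 4 HOURS;</codeblock>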
</li>
<li>
<p>
New conditional functions, <codeph>NULLIF()</codeph>, <codeph>NULLIFZERO()</codeph>, and
<codeph>ZEROIFNULL()</codeph>, simplify porting SQL containing vendor extensions to Impala. See
<xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
</p>
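          <p>
            For example, using literal arguments to show the behavior of each function:
          </p>
          <codeblock>SELECT NULLIF('a', 'a');               -- NULL, because the arguments are equal
SELECT NULLIFZERO(0);                  -- NULL
SELECT ZEROIFNULL(CAST(NULL AS INT));  -- 0</codeblock>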
</li>
<li>
<p>
New utility function, <codeph>CURRENT_DATABASE()</codeph>. See
<xref href="impala_misc_functions.xml#misc_functions"/> for details.
</p>
</li>
<li>
<p>
Integration with the YARN resource management framework. This
feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates
requests to YARN for Impala resources, so that the Impala query only proceeds when all requested
resources are available. See <xref href="impala_resource_management.xml#resource_management"/> for full
details.
</p>
<p>
On the Impala side, this feature involves some new startup options for the <cmdname>impalad</cmdname>
daemon:
</p>
<ul>
<li>
<codeph>-enable_rm</codeph>
</li>
<li>
<codeph>-llama_host</codeph>
</li>
<li>
<codeph>-llama_port</codeph>
</li>
<li>
<codeph>-llama_callback_port</codeph>
</li>
<li>
<codeph>-cgroup_hierarchy_path</codeph>
</li>
</ul>
<p>
For details of these startup options, see <xref href="impala_config_options.xml#config_options"/>.
</p>
<p>
This feature also involves several new or changed query options that you can set through the
<cmdname>impala-shell</cmdname> interpreter and apply within a specific session:
</p>
<ul>
<li>
<codeph>MEM_LIMIT</codeph>: the function of this existing option changes when Impala resource
management is enabled.
</li>
<li>
                <codeph>REQUEST_POOL</codeph>: a new option, replacing the <codeph>YARN_POOL</codeph> option from
                the Impala 1.2 releases.
</li>
<li>
<codeph>V_CPU_CORES</codeph>: a new option.
</li>
<li>
<codeph>RESERVATION_REQUEST_TIMEOUT</codeph>: a new option.
</li>
</ul>
<p>
For details of these query options, see <xref href="impala_resource_management.xml#rm_query_options"/>.
</p>
</li>
</ul>
</conbody>
</concept>
<concept rev="1.2.4" id="new_features_124">
<title>New Features in Impala 1.2.4</title>
<conbody>
<note>
Impala 1.2.4 is primarily a bug fix release for Impala 1.2.3, plus some performance
enhancements for the catalog server to minimize startup and DDL wait times for Impala deployments with
large numbers of databases, tables, and partitions.
</note>
<ul>
<li>
<p>
            On Impala startup, the metadata loading and synchronization mechanism has been improved and optimized
            to make Impala more responsive when starting on a system with a large number of databases, tables, or
            partitions. The initial metadata loading happens in the background, allowing queries to be run
before the entire process is finished. When a query refers to a table whose metadata is not yet loaded,
the query waits until the metadata for that table is loaded, and the load operation for that table is
prioritized to happen first.
</p>
</li>
<li>
<p>
Formerly, if you created a new table in Hive, you had to issue the <codeph>INVALIDATE METADATA</codeph>
            statement (with no table name), which was an expensive operation that reloaded metadata for all tables.
Impala did not recognize the name of the Hive-created table, so you could not do <codeph>INVALIDATE
METADATA <varname>new_table</varname></codeph> to get the metadata for just that one table. Now, when
you issue <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph>, Impala checks to see if
that name represents a table created in Hive, and if so recognizes the new table and loads the metadata
for it. Additionally, if the new table is in a database that was newly created in Hive, Impala also
recognizes the new database.
</p>
</li>
<li>
<p>
If you issue <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> and the table has been
dropped through Hive, Impala will recognize that the table no longer exists.
</p>
</li>
<li>
<p>
New startup options let you control the parallelism of the metadata loading during startup for the
<cmdname>catalogd</cmdname> daemon:
</p>
<ul>
<li>
<p>
<codeph>--load_catalog_in_background</codeph> makes Impala load and cache metadata using background
threads after startup. It is <codeph>true</codeph> by default. Previously, a system with a large
number of databases, tables, or partitions could be unresponsive or even time out during startup.
</p>
</li>
<li>
<p>
<codeph>--num_metadata_loading_threads</codeph> determines how much parallelism Impala devotes to
loading metadata in the background. The default is 16. You might increase this value for systems
with huge numbers of databases, tables, or partitions. You might lower this value for busy systems
that are CPU-constrained due to jobs from components other than Impala.
</p>
</li>
</ul>
</li>
</ul>
</conbody>
</concept>
<concept rev="1.2.3" id="new_features_123">
<title>New Features in Impala 1.2.3</title>
<conbody>
<p>
Impala 1.2.3 contains exactly the same feature set as Impala 1.2.2. Its only difference is one additional
fix for compatibility with Parquet files generated outside of Impala by components such as Hive, Pig, or
MapReduce. If you are upgrading from Impala 1.2.1 or earlier, see
<xref href="impala_new_features.xml#new_features_122"/> for the latest added features.
</p>
</conbody>
</concept>
<concept rev="1.2.2" id="new_features_122">
<title>New Features in Impala 1.2.2</title>
<conbody>
<p>
Impala 1.2.2 includes new features for performance, security, and flexibility. The major enhancements over
1.2.1 are performance related, primarily for join queries.
</p>
<p>
New user-visible features include:
</p>
<ul>
<li>
<p>
Join order optimizations. This highly valuable feature automatically distributes and parallelizes the
work for a join query to minimize disk I/O and network traffic. The automatic optimization reduces the
need to use query hints or to rewrite join queries with the tables in a specific order based on size or
cardinality. The new <codeph>COMPUTE STATS</codeph> statement gathers statistical information about
each table that is crucial for enabling the join optimizations. See
<xref href="impala_perf_joins.xml#perf_joins"/> for details.
</p>
</li>
<li>
<p>
<codeph>COMPUTE STATS</codeph> statement to collect both table statistics and column statistics with a
single statement. Intended to be more comprehensive, efficient, and reliable than the corresponding
Hive <codeph>ANALYZE TABLE</codeph> statement, which collects statistics in multiple phases through
MapReduce jobs. These statistics are important for query planning for join queries, queries on
partitioned tables, and other types of data-intensive operations. For optimal planning of join queries,
you need to collect statistics for each table involved in the join. See
<xref href="impala_compute_stats.xml#compute_stats"/> for details.
</p>
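          <p>
            For example, a typical sequence gathers the statistics and then confirms that they are in place (the
            table name is illustrative):
          </p>
          <codeblock>COMPUTE STATS store_sales;
SHOW TABLE STATS store_sales;
SHOW COLUMN STATS store_sales;</codeblock>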
</li>
<li>
<p>
Reordering of tables in a join query can be overridden by the <codeph>STRAIGHT_JOIN</codeph> operator,
allowing you to fine-tune the planning of the join query if necessary, by using the original technique
of ordering the joined tables in descending order of size. See
<xref href="impala_perf_joins.xml#straight_join"/> for details.
</p>
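          <p>
            For example, a sketch with illustrative table names, listing the largest table first because the join
            order is taken literally:
          </p>
          <codeblock>SELECT STRAIGHT_JOIN c.name, SUM(s.amount)
  FROM big_sales s JOIN small_customers c ON s.cust_id = c.id
  GROUP BY c.name;</codeblock>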
</li>
<li>
<p>
The <codeph>CROSS JOIN</codeph> clause in the
<codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement to allow Cartesian
products in queries, that is, joins without an equality comparison between columns in both tables.
Because such queries must be carefully checked to avoid accidental overconsumption of memory, you must
use the <codeph>CROSS JOIN</codeph> operator to explicitly select this kind of join. See
<xref href="impala_tutorial.xml#tut_cross_join"/> for examples.
</p>
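          <p>
            For example, producing every combination of rows from two small tables (the table and column names are
            illustrative):
          </p>
          <codeblock>SELECT s.size_code, c.color_name
  FROM sizes s CROSS JOIN colors c;</codeblock>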
</li>
<li>
<p>
The <codeph>ALTER TABLE</codeph> statement has new clauses that let you fine-tune table statistics. You
can use this technique as a less-expensive way to update specific statistics, in case the statistics
become stale, or to experiment with the effects of different data distributions on query planning.
</p>
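          <p>
            For example, a minimal sketch that sets the row count manually and then checks the result; the table
            name and value are illustrative, and <codeph>numRows</codeph> is the metastore property that holds the
            row count:
          </p>
          <codeblock>ALTER TABLE store_sales SET TBLPROPERTIES ('numRows'='1000000');
SHOW TABLE STATS store_sales;</codeblock>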
</li>
<li>
<p>
LDAP username/password authentication in JDBC/ODBC. See <xref href="impala_ldap.xml#ldap"/> for
details.
</p>
</li>
<li>
<p>
<xref href="impala_string_functions.xml#string_functions/group_concat">GROUP_CONCAT()</xref> aggregate
function to concatenate column values across all rows of a result set.
</p>
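          <p>
            For example, collapsing one column into a comma-separated list per group (the table and column names
            are illustrative):
          </p>
          <codeblock>SELECT year, GROUP_CONCAT(city, ', ') AS cities
  FROM events GROUP BY year;</codeblock>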
</li>
<li>
<p>
The <codeph>INSERT</codeph> statement now accepts hints, <codeph>[SHUFFLE]</codeph> and
<codeph>[NOSHUFFLE]</codeph>, to influence the way work is redistributed during
<codeph>INSERT...SELECT</codeph> operations. The hints are primarily useful for inserting into
partitioned Parquet tables, where using the <codeph>[SHUFFLE]</codeph> hint can avoid problems due to
memory consumption and simultaneous open files in HDFS, by collecting all the new data for each
partition on a specific node.
</p>
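          <p>
            For example, a sketch of loading a partitioned Parquet table with the <codeph>[SHUFFLE]</codeph> hint;
            the table and column names are illustrative:
          </p>
          <codeblock>INSERT INTO sales_parquet PARTITION (year, month) [SHUFFLE]
  SELECT id, amount, year, month FROM sales_staging;</codeblock>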
</li>
<li>
<p>
Several built-in functions and operators are now overloaded for more numeric data types, to reduce the
requirement to use <codeph>CAST()</codeph> for type coercion in <codeph>INSERT</codeph> statements. For
example, the expression <codeph>2+2</codeph> in an <codeph>INSERT</codeph> statement formerly produced
a <codeph>BIGINT</codeph> result, requiring a <codeph>CAST()</codeph> to be stored in an
<codeph>INT</codeph> variable. Now, addition, subtraction, and multiplication only produce a result
that is one step <q>bigger</q> than their arguments, and numeric and conditional functions can return
<codeph>SMALLINT</codeph>, <codeph>FLOAT</codeph>, and other smaller types rather than always
<codeph>BIGINT</codeph> or <codeph>DOUBLE</codeph>.
</p>
</li>
<li>
<p>
New <codeph>fnv_hash()</codeph> built-in function for constructing hashed values. See
<xref href="impala_math_functions.xml#math_functions"/> for details.
</p>
</li>
<li>
<p>
The clause <codeph>STORED AS PARQUET</codeph> is accepted as an equivalent for <codeph>STORED AS
PARQUETFILE</codeph>. This more concise form is recommended for new code.
</p>
</li>
</ul>
<p>
Because Impala 1.2.2 builds on a number of features introduced in 1.2.1, if you are upgrading from an older
1.1.x release straight to 1.2.2, also review <xref href="impala_new_features.xml#new_features_121"/> to see
features such as the <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW COLUMN STATS</codeph> statements,
and user-defined functions (UDFs).
</p>
</conbody>
</concept>
<concept rev="1.2" id="new_features_121">
<title>New Features in Impala 1.2.1</title>
<conbody>
<note>
The Impala 1.2.1 feature set is a superset of features in the Impala 1.2.0 beta, with the
exception of resource management, which relies on resource management infrastructure in the
underlying Hadoop distribution.
</note>
<p>
Impala 1.2.1 includes new features for security, performance, and flexibility.
</p>
<p>
New user-visible features include:
</p>
<ul>
<li rev="1.2.1">
<p>
<codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and <codeph>SHOW COLUMN STATS
<varname>table_name</varname></codeph> statements, to verify that statistics are available and to see
the values used during query planning.
</p>
</li>
<li rev="1.2.1">
<p>
<codeph>CREATE TABLE AS SELECT</codeph> syntax, to create a new table and transfer data into it in a
single operation.
</p>
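          <p>
            For example, creating and populating a table in one statement (the table names are illustrative):
          </p>
          <codeblock>CREATE TABLE sales_2013 AS SELECT * FROM sales WHERE year = 2013;</codeblock>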
</li>
<li rev="1.2.1">
<p>
<codeph>OFFSET</codeph> clause, for use with the <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph>
clauses to produce <q>paged</q> result sets such as items 1-10, then 11-20, and so on.
</p>
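          <p>
            For example, retrieving two consecutive <q>pages</q> of ten rows each (the table and column names are
            illustrative):
          </p>
          <codeblock>SELECT id, name FROM items ORDER BY id LIMIT 10 OFFSET 0;   -- rows 1-10
SELECT id, name FROM items ORDER BY id LIMIT 10 OFFSET 10;  -- rows 11-20</codeblock>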
</li>
<li rev="1.2.1">
<p>
<codeph>NULLS FIRST</codeph> and <codeph>NULLS LAST</codeph> clauses to ensure consistent placement of
<codeph>NULL</codeph> values in <codeph>ORDER BY</codeph> queries.
</p>
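          <p>
            For example, pushing rows with unknown values to the end of the result set (the table and column names
            are illustrative):
          </p>
          <codeblock>SELECT name, last_login FROM accounts
  ORDER BY last_login DESC NULLS LAST;</codeblock>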
</li>
<li rev="1.2.1">
<p>
New <xref href="impala_functions.xml#builtins">built-in functions</xref>: <codeph>least()</codeph>,
<codeph>greatest()</codeph>, <codeph>initcap()</codeph>.
</p>
</li>
<li rev="1.2.1">
<p>
New aggregate function: <codeph>ndv()</codeph>, a fast alternative to <codeph>COUNT(DISTINCT
<varname>col</varname>)</codeph> returning an approximate result.
</p>
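          <p>
            For example, comparing the approximate and exact counts for the same column (the table and column names
            are illustrative):
          </p>
          <codeblock>SELECT NDV(customer_id) FROM sales;             -- fast approximation
SELECT COUNT(DISTINCT customer_id) FROM sales;  -- exact but slower</codeblock>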
</li>
<li rev="1.2.1">
<p>
The <codeph>LIMIT</codeph> clause can now accept a numeric expression as an argument, rather than only
a literal constant.
</p>
</li>
<li rev="1.2.1">
<p>
The <codeph>SHOW CREATE TABLE</codeph> statement displays the end result of all the <codeph>CREATE
TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements for a particular table. You can use the
output to produce a simplified setup script for a schema.
</p>
</li>
<li rev="1.2.1">
<p>
The <codeph>--idle_query_timeout</codeph> and <codeph>--idle_session_timeout</codeph> options for
<cmdname>impalad</cmdname> control the time intervals after which idle queries are cancelled, and idle
sessions expire. See <xref href="impala_timeouts.xml#timeouts"/> for details.
</p>
</li>
<li>
<p>
User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is
important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required
switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala
can run high-performance functions written in C++, or you can reuse existing Hive functions written in
Java.
</p>
<p>
You create UDFs through the <codeph>CREATE FUNCTION</codeph> statement and drop them through the
<codeph>DROP FUNCTION</codeph> statement. See <xref href="impala_udf.xml#udfs"/> for instructions about
coding, building, and deploying UDFs, and <xref href="impala_create_function.xml#create_function"/> and
<xref href="impala_drop_function.xml#drop_function"/> for related SQL syntax.
</p>
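          <p>
            For example, a minimal sketch of registering, using, and later removing a C++ scalar UDF; the library
            path, symbol name, and table are hypothetical:
          </p>
          <codeblock>CREATE FUNCTION my_lower(STRING) RETURNS STRING
  LOCATION '/user/impala/udfs/libudfsample.so' SYMBOL='MyLower';

SELECT my_lower(name) FROM customers LIMIT 5;

DROP FUNCTION my_lower(STRING);</codeblock>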
</li>
<li>
<p>
A new service automatically propagates changes to table data and metadata made by one Impala node,
sending the new or updated metadata to all the other Impala nodes. The automatic synchronization
mechanism eliminates the need to use the <codeph>INVALIDATE METADATA</codeph> and
<codeph>REFRESH</codeph> statements after issuing Impala statements such as <codeph>CREATE
TABLE</codeph>, <codeph>ALTER TABLE</codeph>, <codeph>DROP TABLE</codeph>, <codeph>INSERT</codeph>, and
<codeph>LOAD DATA</codeph>.
</p>
<p>
For even more precise synchronization, you can enable the
<codeph><xref href="impala_sync_ddl.xml#sync_ddl">SYNC_DDL</xref></codeph> query option before issuing
a DDL, <codeph>INSERT</codeph>, or <codeph>LOAD DATA</codeph> statement. This option causes the
statement to wait, returning only after the catalog service has broadcast the applicable changes to all
Impala nodes in the cluster.
</p>
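          <p>
            For example, a minimal <cmdname>impala-shell</cmdname> session sketch (the table name is illustrative):
          </p>
          <codeblock>SET SYNC_DDL=1;
CREATE TABLE new_events (id INT, msg STRING);  -- returns only after all nodes see the new table
SET SYNC_DDL=0;</codeblock>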
<note>
<p>
Because the catalog service only monitors operations performed through Impala, <codeph>INVALIDATE
METADATA</codeph> and <codeph>REFRESH</codeph> are still needed on the Impala side after creating new
tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because
the catalog service broadcasts the result of the <codeph>REFRESH</codeph> and <codeph>INVALIDATE
METADATA</codeph> statements to all Impala nodes, when you do need to use those statements, you can
do so a single time rather than on every Impala node.
</p>
</note>
<p>
This service is implemented by the <cmdname>catalogd</cmdname> daemon. See
<xref href="impala_components.xml#intro_catalogd"/> for details.
</p>
</li>
<li>
<p>
The <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements have new clauses
<codeph>TBLPROPERTIES</codeph> and <codeph>WITH SERDEPROPERTIES</codeph>. The
<codeph>TBLPROPERTIES</codeph> clause lets you associate arbitrary items of metadata with a particular
table as key-value pairs. The <codeph>WITH SERDEPROPERTIES</codeph> clause lets you specify the
serializer/deserializer (SerDes) classes that read and write data for a table; although Impala does not
make use of these properties, sometimes particular values are needed for Hive compatibility. See
<xref href="impala_create_table.xml#create_table"/> and
<xref href="impala_alter_table.xml#alter_table"/> for details.
</p>
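          <p>
            For example, a sketch that attaches and later updates arbitrary key-value metadata; the property names
            shown here are made up and have no special meaning to Impala:
          </p>
          <codeblock>CREATE TABLE clickstream (id BIGINT, url STRING)
  TBLPROPERTIES ('source'='web_logs', 'retention_days'='90');
ALTER TABLE clickstream SET TBLPROPERTIES ('retention_days'='180');</codeblock>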
</li>
<li>
<p>
Delegation support lets you authorize certain OS users associated with applications (for example,
<codeph>hue</codeph>), to submit requests using the credentials of other users.
See <xref href="impala_delegation.xml#delegation"/> for details.
</p>
</li>
<li>
<p>
Enhancements to <codeph>EXPLAIN</codeph> output. In particular, when you enable the new
<codeph>EXPLAIN_LEVEL</codeph> query option, the <codeph>EXPLAIN</codeph> and <codeph>PROFILE</codeph>
statements produce more verbose output showing estimated resource requirements and whether table and
column statistics are available for the applicable tables and columns. See
<xref href="impala_explain.xml#explain"/> for details.
</p>
</li>
<li rev="1.2.1">
<p>
<codeph>SHOW CREATE TABLE</codeph> summarizes the effects of the original <codeph>CREATE TABLE</codeph>
statement and any subsequent <codeph>ALTER TABLE</codeph> statements, giving you a <codeph>CREATE
TABLE</codeph> statement that will re-create the current structure and layout for a table.
</p>
</li>
<li rev="1.2.1">
<p>
The <codeph>LIMIT</codeph> clause for queries now accepts an arithmetic expression, in addition to
numeric literals.
</p>
</li>
</ul>
</conbody>
</concept>
<concept rev="1.2" id="new_features_120">
<title>New Features in Impala 1.2.0 (Beta)</title>
<conbody>
<p>
The Impala 1.2.0 beta includes new features for security, performance, and flexibility.
</p>
<p>
New user-visible features include:
</p>
<ul>
<li>
<p>
User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is
important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required
switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala
can run high-performance functions written in C++, or you can reuse existing Hive functions written in
Java.
</p>
<p>
You create UDFs through the <codeph>CREATE FUNCTION</codeph> statement and drop them through the
<codeph>DROP FUNCTION</codeph> statement. See <xref href="impala_udf.xml#udfs"/> for instructions about
coding, building, and deploying UDFs, and <xref href="impala_create_function.xml#create_function"/> and
<xref href="impala_drop_function.xml#drop_function"/> for related SQL syntax.
</p>
</li>
<li>
<p>
A new service automatically propagates changes to table data and metadata made by one Impala node,
sending the new or updated metadata to all the other Impala nodes. The automatic synchronization
mechanism eliminates the need to use the <codeph>INVALIDATE METADATA</codeph> and
<codeph>REFRESH</codeph> statements after issuing Impala statements such as <codeph>CREATE
TABLE</codeph>, <codeph>ALTER TABLE</codeph>, <codeph>DROP TABLE</codeph>, <codeph>INSERT</codeph>, and
<codeph>LOAD DATA</codeph>.
</p>
<note>
<p>
Because this service only monitors operations performed through Impala, <codeph>INVALIDATE
METADATA</codeph> and <codeph>REFRESH</codeph> are still needed on the Impala side after creating new
tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because
the catalog service broadcasts the result of the <codeph>REFRESH</codeph> and <codeph>INVALIDATE
METADATA</codeph> statements to all Impala nodes, when you do need to use those statements, you can
do so a single time rather than on every Impala node.
</p>
</note>
<p>
This service is implemented by the <cmdname>catalogd</cmdname> daemon. See
<xref href="impala_components.xml#intro_catalogd"/> for details.
</p>
</li>
<li>
<p>
Integration with the YARN resource management framework. This
feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates
requests to YARN for Impala resources, so that the Impala query only proceeds when all requested
resources are available. See <xref href="impala_resource_management.xml#resource_management"/> for full
details.
</p>
<p>
On the Impala side, this feature involves some new startup options for the <cmdname>impalad</cmdname>
daemon:
</p>
<ul>
<li>
<codeph>-enable_rm</codeph>
</li>
<li>
<codeph>-llama_host</codeph>
</li>
<li>
<codeph>-llama_port</codeph>
</li>
<li>
<codeph>-llama_callback_port</codeph>
</li>
<li>
<codeph>-cgroup_hierarchy_path</codeph>
</li>
</ul>
<p>
For details of these startup options, see <xref href="impala_config_options.xml#config_options"/>.
</p>
<p>
This feature also involves several new or changed query options that you can set through the
<cmdname>impala-shell</cmdname> interpreter and apply within a specific session:
</p>
<ul>
<li>
<codeph>MEM_LIMIT</codeph>: the function of this existing option changes when Impala resource
management is enabled.
</li>
<li>
                <codeph>YARN_POOL</codeph>: a new option. (Renamed to <codeph>REQUEST_POOL</codeph> in Impala
                1.3.0.)
</li>
<li>
<codeph>V_CPU_CORES</codeph>: a new option.
</li>
<li>
<codeph>RESERVATION_REQUEST_TIMEOUT</codeph>: a new option.
</li>
</ul>
<p>
For details of these query options, see <xref href="impala_resource_management.xml#rm_query_options"/>.
</p>
</li>
<li>
<p>
<codeph>CREATE TABLE ... AS SELECT</codeph> syntax, to create a table and copy data into it in a single
operation. See <xref href="impala_create_table.xml#create_table"/> for details.
</p>
</li>
<li>
<p>
The <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements have a new
<codeph>TBLPROPERTIES</codeph> clause that lets you associate arbitrary items of metadata with a
particular table as key-value pairs. See <xref href="impala_create_table.xml#create_table"/> and
<xref href="impala_alter_table.xml#alter_table"/> for details.
</p>
</li>
<li>
<p>
Delegation support lets you authorize certain OS users associated with applications (for example,
<codeph>hue</codeph>), to submit requests using the credentials of other users.
See <xref href="impala_delegation.xml#delegation"/> for details.
</p>
</li>
<li>
<p>
Enhancements to <codeph>EXPLAIN</codeph> output. In particular, when you enable the new
<codeph>EXPLAIN_LEVEL</codeph> query option, the <codeph>EXPLAIN</codeph> and <codeph>PROFILE</codeph>
statements produce more verbose output showing estimated resource requirements and whether table and
column statistics are available for the applicable tables and columns. See
<xref href="impala_explain.xml#explain"/> for details.
</p>
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_111">
<title>New Features in Impala 1.1.1</title>
<conbody>
<p>
Impala 1.1.1 includes new features for security and stability.
</p>
<p>
New user-visible features include:
</p>
<ul>
<li>
Additional security feature: auditing. New startup options for <cmdname>impalad</cmdname> let you capture
information about Impala queries that succeed or are blocked due to insufficient privileges. For details,
see <xref href="impala_security.xml#security"/>.
</li>
<li>
Parquet data files generated by Impala 1.1.1 are now compatible with the Parquet support in Hive. See
<xref href="impala_incompatible_changes.xml#incompatible_changes"/> for the procedure to update older
Impala-created Parquet files to be compatible with the Hive Parquet support.
</li>
<li>
Additional improvements to stability and resource utilization for Impala queries.
</li>
<li>
Additional enhancements for compatibility with existing file formats.
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_11">
<title>New Features in Impala 1.1</title>
<conbody>
<p>
Impala 1.1 includes new features for security, performance, and usability.
</p>
<p>
New user-visible features include:
</p>
<ul>
<li>
Extensive new security features, built on top of the Sentry open source project. Impala now supports
fine-grained authorization based on roles. A policy file determines which privileges on which schema
objects (servers, databases, tables, and HDFS paths) are available to users based on their membership in
groups. By assigning privileges for views, you can control access to table data at the column level. For
details, see <xref href="impala_security.xml#security"/>.
</li>
<li>
Impala can now create, alter, drop, and query views. Views provide a flexible way to set up simple
aliases for complex queries; hide query details from applications and users; and simplify maintenance as
you rename or reorganize databases, tables, and columns. See the overview section
<xref href="impala_views.xml#views"/> and the statements
<xref href="impala_create_view.xml#create_view"/>, <xref href="impala_alter_view.xml#alter_view"/>, and
<xref href="impala_drop_view.xml#drop_view"/>.
</li>
<li>
Performance is improved through a number of automatic optimizations. Resource consumption is also reduced
for Impala queries. These improvements apply broadly across all kinds of workloads and file formats. The
major areas of performance enhancement include:
<ul>
<li>
Improved disk and thread scheduling, which applies to all queries.
</li>
<li>
Improved hash join and aggregation performance, which applies to queries with large build tables or a
large number of groups.
</li>
<li>
Dictionary encoding with Parquet, which applies to Parquet tables with short string columns.
</li>
<li>
Improved performance on systems with SSDs, which applies to all queries and file formats.
</li>
</ul>
</li>
<li>
Some new built-in functions are implemented:
<xref href="impala_string_functions.xml#string_functions/translate">translate()</xref> to substitute
          characters within strings, and
<!-- IMPALA-418 -->
<xref href="impala_misc_functions.xml#misc_functions/user">user()</xref> to check the login ID of the
connected user.
<!-- IMPALA-??? -->
</li>
<li>
The new <codeph>WITH</codeph> clause for <codeph>SELECT</codeph> statements lets you simplify complicated
queries in a way similar to creating a view. The effects of the <codeph>WITH</codeph> clause only last
for the duration of one query, unlike views, which are persistent schema objects that can be used by
multiple sessions or applications. See <xref href="impala_with.xml#with"/>.
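          <p>
            For example, a throwaway alias for a subquery that lasts only for the duration of one statement (the
            table and column names are illustrative):
          </p>
          <codeblock>WITH t2013 AS (SELECT * FROM sales WHERE year = 2013)
SELECT region, SUM(amount) FROM t2013 GROUP BY region;</codeblock>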
</li>
<li>
          An enhancement to the <codeph>DESCRIBE</codeph> statement, <codeph>DESCRIBE FORMATTED
          <varname>table_name</varname></codeph>, displays more detailed information about the table, including
          the file format, location, delimiter, ownership, whether the table is external or internal, creation
          and access times, and partitions. The information is returned as a result set that can be interpreted and
used by a management or monitoring application. See <xref href="impala_describe.xml#describe"/>.
</li>
<li>
          You can now insert values for a subset of the columns in a table, with the remaining columns left as
          <codeph>NULL</codeph> values. You can also specify the destination columns in any order, rather than
          having to match the order of the corresponding columns in the source query or <codeph>VALUES</codeph>
          clause. This feature is known as <q>column permutation</q>. See <xref href="impala_insert.xml#insert"/>.
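          <p>
            For example, a minimal sketch; assume a destination table <codeph>customers</codeph> with columns
            <codeph>(id, name, city)</codeph>, so <codeph>name</codeph> is left <codeph>NULL</codeph> for the
            inserted rows:
          </p>
          <codeblock>INSERT INTO customers (city, id) VALUES ('Berlin', 1001), ('Oslo', 1002);</codeblock>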
</li>
<li>
The new <codeph>LOAD DATA</codeph> statement lets you load data into a table directly from an HDFS data
file. This technique lets you minimize the number of steps in your ETL process, and provides more
flexibility. For example, you can bring data into an Impala table in one step. Formerly, you might have
created an external table where the data files are not entirely under your control, or copied the data
files to Impala data directories manually, or loaded the original data into one table and then used the
<codeph>INSERT</codeph> statement to copy it to a new table with a different file format, partitioning
scheme, and so on. See <xref href="impala_load_data.xml#load_data"/>.
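          <p>
            For example, a sketch that moves an HDFS directory of data files into a partition; the path, table
            name, and partition value are illustrative:
          </p>
          <codeblock>LOAD DATA INPATH '/user/etl/staging/sales_2013'
  INTO TABLE sales PARTITION (year=2013);</codeblock>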
</li>
<li>
Improvements to Impala-HBase integration:
<ul>
<li>
New query options for HBase performance:
<codeph><xref href="impala_hbase_cache_blocks.xml#hbase_cache_blocks">HBASE_CACHE_BLOCKS</xref></codeph>
and <codeph><xref href="impala_hbase_caching.xml#hbase_caching">HBASE_CACHING</xref></codeph>.
</li>
<li>
Support for binary data types in HBase tables. See <xref href="impala_hbase.xml#hbase_types"/> for
details.
</li>
</ul>
</li>
<li>
You can issue <codeph>REFRESH</codeph> as a SQL statement through any of the programming interfaces that
Impala supports. <codeph>REFRESH</codeph> formerly had to be issued as a command through the
<cmdname>impala-shell</cmdname> interpreter, and was not available through a JDBC or ODBC API call. As
part of this change, the functionality of the <codeph>REFRESH</codeph> statement is divided between two
statements. In Impala 1.1, <codeph>REFRESH</codeph> requires a table name argument and immediately
reloads the metadata; the new <codeph>INVALIDATE METADATA</codeph> statement works the same as the Impala
1.0 <codeph>REFRESH</codeph> did: the table name argument is optional, and the metadata for one or all
tables is marked as stale, but not actually reloaded until the table is queried. When you create a new
table in the Hive shell or through a different Impala node, you must enter <codeph>INVALIDATE
METADATA</codeph> with no table parameter before you can see the new table in
<cmdname>impala-shell</cmdname>. See <xref href="impala_refresh.xml#refresh"/> and
<xref href="impala_invalidate_metadata.xml#invalidate_metadata"/>.
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_101">
<title>New Features in Impala 1.0.1</title>
<conbody>
<p>
New user-visible features include:
</p>
<ul>
<li>
The <codeph>VALUES</codeph> clause lets you <codeph>INSERT</codeph> one or more rows using literals,
function return values, or other expressions. For performance and scalability, you should still use
<codeph>INSERT ... SELECT</codeph> for bringing large quantities of data into an Impala table. The
<codeph>VALUES</codeph> clause is a convenient way to set up small tables, particularly for initial
testing of SQL features that do not require large amounts of data. See
<xref href="impala_insert.xml#values"/> for details.
</li>
<li>
          The <codeph>-B</codeph> option of the <codeph>impala-shell</codeph> command produces query results as
          plain delimited text, and the <codeph>-o</codeph> option stores the results in an output file. The plain
          text results are useful with other Hadoop components or Unix tools. In benchmark tests, it is also faster
          to produce plain rather than pretty-printed results, and to write to a file rather than to the screen,
          giving a more accurate picture of the actual query time.
</li>
<li>
Several bug fixes. See <xref href="impala_fixed_issues.xml#fixed_issues_101"/> for details.
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_10">
<title>New Features in Impala 1.0</title>
<conbody>
<p>
This version has multiple performance improvements and adds the following functionality:
</p>
<ul>
<li>
Several bug fixes. See <xref href="impala_fixed_issues.xml#fixed_issues_10"/>.
</li>
<li>
<codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph> statement.
</li>
<li>
<xref href="impala_hints.xml#hints">Hints</xref> to allow specifying a particular join strategy.
</li>
<li>
<codeph><xref href="impala_refresh.xml#refresh">REFRESH</xref></codeph> for a single table.
</li>
<li>
Dynamic resource management, allowing high concurrency for Impala queries.
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_07">
<title>New Features in Version 0.7 of the Impala Beta Release</title>
<conbody>
<p>
This version has multiple performance improvements and adds the following functionality:
</p>
<ul>
<li>
Several bug fixes. See <xref href="impala_fixed_issues.xml#fixed_issues_07"/>.
</li>
<li>
Support for the Parquet file format. For more information on file formats, see
<xref href="impala_file_formats.xml#file_formats"/>.
</li>
<li>
Added support for Avro.
</li>
<li>
          Support for memory limits. For more information, see the example on modifying memory limits in
<xref href="impala_config_options.xml#config_options"/>.
</li>
<li>
Bigger and faster joins through the addition of partitioned joins to the already supported broadcast
joins.
</li>
<li>
Fully distributed aggregations.
</li>
<li>
Fully distributed top-n computation.
</li>
<li>
Support for creating and altering tables.
</li>
<li>
Support for GROUP BY with floats and doubles.
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_06">
<title>New Features in Version 0.6 of the Impala Beta Release</title>
<conbody>
<ul>
<li>
Several bug fixes. See <xref href="impala_fixed_issues.xml#fixed_issues_06"/>.
</li>
<li>
Added support for Impala on SUSE and Debian/Ubuntu. Impala is now supported on:
<ul>
<li>
              RHEL 5.7/6.2 and CentOS 5.7/6.2
</li>
<li>
SUSE 11 with Service Pack 1 or higher
</li>
<li>
Ubuntu 10.04/12.04 and Debian 6.03
</li>
</ul>
</li>
<li>
Support for the RCFile file format. For more information on file formats, see
<xref href="impala_file_formats.xml#file_formats">Understanding File Formats</xref>.
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_05">
<title>New Features in Version 0.5 of the Impala Beta Release</title>
<conbody>
<ul>
<li>
Several bug fixes. See <xref href="impala_fixed_issues.xml#fixed_issues_05"/>.
</li>
<li>
Added support for a JDBC driver that allows you to access Impala from a Java client. To use this feature,
follow the instructions in <xref href="impala_jdbc.xml#impala_jdbc"/> to install the JDBC
driver JARs on the client machine and modify the <codeph>CLASSPATH</codeph> on the client to include the
JARs.
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_04">
<title>New Features in Version 0.4 of the Impala Beta Release</title>
<conbody>
<ul>
<li>
Several bug fixes. See <xref href="impala_fixed_issues.xml#fixed_issues_04"/>.
</li>
<li>
          Added support for Impala on RHEL 5.7/CentOS 5.7. Impala is now supported on RHEL 5.7/6.2 and CentOS 5.7/6.2.
</li>
<li>
The Impala debug webserver now has the ability to serve static files from
<codeph>${IMPALA_HOME}/www</codeph>. This can be disabled by setting
<codeph>--enable_webserver_doc_root=false</codeph> on the command line. As a result, Impala now uses the
Twitter Bootstrap library to style its debug webpages, and the <codeph>/queries</codeph> page now tracks
the last 25 queries run by each Impala daemon.
</li>
<li>
Additional metrics available on the Impala Debug Webpage.
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_03">
<title>New Features in Version 0.3 of the Impala Beta Release</title>
<conbody>
<ul>
<li>
Several bug fixes. See <xref href="impala_fixed_issues.xml#fixed_issues_03"/>.
</li>
<li>
          The <codeph>state-store-service</codeph> binary has been renamed <codeph>statestored</codeph>.
</li>
<li>
The location of the Impala configuration files has changed from the <codeph>/usr/lib/impala/conf</codeph>
directory to the <codeph>/etc/impala/conf</codeph> directory.
</li>
</ul>
</conbody>
</concept>
<concept id="new_features_02">
<title>New Features in Version 0.2 of the Impala Beta Release</title>
<conbody>
<ul>
<li>
Several bug fixes. See <xref href="impala_fixed_issues.xml#fixed_issues_02"/>.
</li>
<li>
          <b>Added default query options.</b> The <codeph>-default_query_options</codeph> startup option overrides
          the built-in default values for query options when starting <codeph>impalad</codeph>. The format is:
<codeblock>-default_query_options='key=value;key=value'</codeblock>
</li>
</ul>
</conbody>
</concept>
</concept>