docs/topics/impala_porting.xml - impala - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="porting">

   <title>Porting SQL from Other Database Systems to Impala</title>
   <titlealts audience="PDF"><navtitle>Porting SQL</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="SQL"/>
       <data name="Category" value="Databases"/>
       <data name="Category" value="Hive"/>
       <data name="Category" value="Oracle"/>
       <data name="Category" value="MySQL"/>
       <data name="Category" value="PostgreSQL"/>
       <data name="Category" value="Troubleshooting"/>
       <data name="Category" value="Porting"/>
       <data name="Category" value="Data Analysts"/>
       <data name="Category" value="Developers"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       <indexterm audience="hidden">porting</indexterm>
       Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications
       to Impala, due to variations in data types, built-in functions, vendor language extensions, and
       Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for
       best performance.
     </p>

     <p outputclass="toc inpage"/>
   </conbody>

   <concept id="porting_ddl_dml">

     <title>Porting DDL and DML Statements</title>

     <conbody>

       <p>
         When adapting SQL code from a traditional database system to Impala, expect to find a number of differences
         in the DDL statements that you use to set up the schema. Clauses related to physical layout of files,
         tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to
         account for the Impala partitioning scheme and Hadoop file formats.
       </p>

       <p>
         Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor
         extensions and features not yet supported in Impala, you might be able to run identical or almost-identical
         query text on both systems.
       </p>

       <p>
         Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse
         and ongoing tuning efforts on the code for SQL queries.
       </p>
     </conbody>
   </concept>

   <concept id="porting_data_types">

     <title>Porting Data Types from Other Database Systems</title>

     <conbody>

       <ul>
         <li>
           <p>
             Change any <codeph>VARCHAR</codeph>, <codeph>VARCHAR2</codeph>, and <codeph>CHAR</codeph> columns to
             <codeph>STRING</codeph>. Remove any length constraints from the column declarations; for example,
             change <codeph>VARCHAR(32)</codeph> or <codeph>CHAR(1)</codeph> to <codeph>STRING</codeph>. Impala is
             very flexible about the length of string values; it does not impose any length constraints
             or do any special processing (such as blank-padding) for <codeph>STRING</codeph> columns.
             (In Impala 2.0 and higher, there are data types <codeph>VARCHAR</codeph> and <codeph>CHAR</codeph>,
             with length constraints for both types and blank-padding for <codeph>CHAR</codeph>.
             However, for performance reasons, it is still preferable to use <codeph>STRING</codeph>
             columns where practical.)
           </p>
         </li>

         <li>
           <p>
             For national language character types such as <codeph>NCHAR</codeph>, <codeph>NVARCHAR</codeph>, or
             <codeph>NCLOB</codeph>, be aware that while Impala can store and query UTF-8 character data, currently
             some string manipulation operations only work correctly with ASCII data. See
             <xref href="impala_string.xml#string"/> for details.
           </p>
         </li>

         <li>
           <p>
             Change any <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, or <codeph>TIME</codeph> columns to
             <codeph>TIMESTAMP</codeph>. Remove any precision constraints. Remove any timezone clauses, and make
             sure your application logic or ETL process accounts for the fact that Impala expects all
             <codeph>TIMESTAMP</codeph> values to be in
             <xref href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" scope="external" format="html">Coordinated
             Universal Time (UTC)</xref>. See <xref href="impala_timestamp.xml#timestamp"/> for information about
             the <codeph>TIMESTAMP</codeph> data type, and
             <xref href="impala_datetime_functions.xml#datetime_functions"/> for conversion functions for different
             date and time formats.
           </p>
           <p>
             You might also need to adapt date- and time-related literal values and format strings to use the
             supported Impala date and time formats. If you have date and time literals with different separators or
             different numbers of <codeph>YY</codeph>, <codeph>MM</codeph>, and so on placeholders than Impala
             expects, consider using calls to <codeph>regexp_replace()</codeph> to transform those values to the
             Impala-compatible format. See <xref href="impala_timestamp.xml#timestamp"/> for information about the
             allowed formats for date and time literals, and
             <xref href="impala_string_functions.xml#string_functions"/> for string conversion functions such as
             <codeph>regexp_replace()</codeph>.
           </p>
           <p>
             Instead of <codeph>SYSDATE</codeph>, call the function <codeph>NOW()</codeph>.
           </p>
           <p>
             Instead of adding or subtracting directly from a date value to produce a value <varname>N</varname>
             days in the past or future, use an <codeph>INTERVAL</codeph> expression, for example <codeph>NOW() +
             INTERVAL 30 DAYS</codeph>.
           </p>
         </li>

         <li>
           <p>
             Although Impala supports <codeph>INTERVAL</codeph> expressions for datetime arithmetic, as shown in
             <xref href="impala_timestamp.xml#timestamp"/>, <codeph>INTERVAL</codeph> is not available as a column
             data type in Impala. For any <codeph>INTERVAL</codeph> values stored in tables, convert them to numeric
             values that you can add or subtract using the functions in
             <xref href="impala_datetime_functions.xml#datetime_functions"/>. For example, if you had a table
             <codeph>DEADLINES</codeph> with an <codeph>INT</codeph> column <codeph>TIME_PERIOD</codeph>, you could
             construct dates N days in the future like so:
           </p>
 <codeblock>SELECT NOW() + INTERVAL time_period DAYS from deadlines;</codeblock>
         </li>

         <li>
           <p>
             For <codeph>YEAR</codeph> columns, change to the smallest Impala integer type that has sufficient
             range. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on
             for the various numeric data types.
           </p>
         </li>

         <li>
           <p>
             Change any <codeph>DECIMAL</codeph> and <codeph>NUMBER</codeph> types. If fixed-point precision is not
             required, you can use <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> on the Impala side depending on
             the range of values. For applications that require precise decimal values, such as financial data, you
             might need to make more extensive changes to table structure and application logic, such as using
             separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs
             to manipulate them. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges,
             casting, and so on for the various numeric data types.
           </p>
         </li>

         <li>
           <p>
             <codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, and <codeph>REAL</codeph> types are supported in
             Impala. Remove any precision and scale specifications. (In Impala, <codeph>REAL</codeph> is just an
             alias for <codeph>DOUBLE</codeph>; columns declared as <codeph>REAL</codeph> are turned into
             <codeph>DOUBLE</codeph> behind the scenes.) See <xref href="impala_datatypes.xml#datatypes"/> for
             details about ranges, casting, and so on for the various numeric data types.
           </p>
         </li>

         <li>
           <p>
             Most integer types from other systems have equivalents in Impala, perhaps under different names such as
             <codeph>BIGINT</codeph> instead of <codeph>INT8</codeph>. For any that are unavailable, for example
             <codeph>MEDIUMINT</codeph>, switch to the smallest Impala integer type that has sufficient range.
             Remove any precision specifications. See <xref href="impala_datatypes.xml#datatypes"/> for details
             about ranges, casting, and so on for the various numeric data types.
           </p>
         </li>

         <li>
           <p>
             Remove any <codeph>UNSIGNED</codeph> constraints. All Impala numeric types are signed. See
             <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on for the
             various numeric data types.
           </p>
         </li>

         <li>
           <p>
             For any types holding bitwise values, use an integer type with enough range to hold all the relevant
             bits within a positive integer. See <xref href="impala_datatypes.xml#datatypes"/> for details about
             ranges, casting, and so on for the various numeric data types.
           </p>
           <p>
             For example, <codeph>TINYINT</codeph> has a maximum positive value of 127, not 256, so to manipulate
             8-bit bitfields as positive numbers switch to the next largest type <codeph>SMALLINT</codeph>.
           </p>
 <codeblock>[localhost:21000] &gt; select cast(127*2 as tinyint);
 +--------------------------+
 | cast(127 * 2 as tinyint) |
 +--------------------------+
 | -2                       |
 +--------------------------+
 [localhost:21000] &gt; select cast(128 as tinyint);
 +----------------------+
 | cast(128 as tinyint) |
 +----------------------+
 | -128                 |
 +----------------------+
 [localhost:21000] &gt; select cast(127*2 as smallint);
 +---------------------------+
 | cast(127 * 2 as smallint) |
 +---------------------------+
 | 254                       |
 +---------------------------+</codeblock>
           <p>
             Impala does not support notation such as <codeph>b'0101'</codeph> for bit literals.
           </p>
         </li>

         <li>
           <p>
             For BLOB values, use <codeph>STRING</codeph> to represent <codeph>CLOB</codeph> or
             <codeph>TEXT</codeph> types (character based large objects) up to 32 KB in size. Binary large objects
             such as <codeph>BLOB</codeph>, <codeph>RAW</codeph> <codeph>BINARY</codeph>, and
             <codeph>VARBINARY</codeph> do not currently have an equivalent in Impala.
           </p>
         </li>

         <li>
           <p>
             For Boolean-like types such as <codeph>BOOL</codeph>, use the Impala <codeph>BOOLEAN</codeph> type.
           </p>
         </li>

         <li>
           <p>
             Because Impala currently does not support composite or nested types, any spatial data types in other
             database systems do not have direct equivalents in Impala. You could represent spatial values in string
             format and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. Where
             practical, separate spatial types into separate tables so that Impala can still work with the
             non-spatial data.
           </p>
         </li>

         <li>
           <p>
             Take out any <codeph>DEFAULT</codeph> clauses. Impala can use data files produced from many different
             sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of <codeph>LOAD DATA</codeph>
             and external tables mean that Impala is flexible about the format of data files, and Impala does not
             necessarily validate or cleanse data before querying it. When copying data through Impala
             <codeph>INSERT</codeph> statements, you can use conditional functions such as <codeph>CASE</codeph> or
             <codeph>NVL</codeph> to substitute some other value for <codeph>NULL</codeph> fields; see
             <xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
           </p>
         </li>

         <li>
           <p>
             Take out any constraints from your <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
             statements, for example <codeph>PRIMARY KEY</codeph>, <codeph>FOREIGN KEY</codeph>,
             <codeph>UNIQUE</codeph>, <codeph>NOT NULL</codeph>, <codeph>UNSIGNED</codeph>, or
             <codeph>CHECK</codeph> constraints. Impala can use data files produced from many different sources,
             such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen
             earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries
             to test for <codeph>NULL</codeph> values. When copying data through Impala <codeph>INSERT</codeph>
             statements, you can use conditional functions such as <codeph>CASE</codeph> or <codeph>NVL</codeph> to
             substitute some other value for <codeph>NULL</codeph> fields; see
             <xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
           </p>
           <p>
             Do as much verification as practical before loading data into Impala. After data is loaded into Impala,
             you can do further verification using SQL queries to check if values have expected ranges, if values
             are <codeph>NULL</codeph> or not, and so on. If there is a problem with the data, you will need to
             re-run earlier stages of the ETL process, or do an <codeph>INSERT ... SELECT</codeph> statement in
             Impala to copy the faulty data to a new table and transform or filter out the bad values.
           </p>
         </li>

         <li>
           <p>
             Take out any <codeph>CREATE INDEX</codeph>, <codeph>DROP INDEX</codeph>, and <codeph>ALTER
             INDEX</codeph> statements, and equivalent <codeph>ALTER TABLE</codeph> statements. Remove any
             <codeph>INDEX</codeph>, <codeph>KEY</codeph>, or <codeph>PRIMARY KEY</codeph> clauses from
             <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. Impala is optimized for bulk
             read operations for data warehouse-style queries, and therefore does not support indexes for its
             tables.
           </p>
         </li>

         <li>
           <p>
             Calls to built-in functions with out-of-range or otherwise incorrect arguments, return
             <codeph>NULL</codeph> in Impala as opposed to raising exceptions. (This rule applies even when the
             <codeph>ABORT_ON_ERROR=true</codeph> query option is in effect.) Run small-scale queries using
             representative data to doublecheck that calls to built-in functions are returning expected values
             rather than <codeph>NULL</codeph>. For example, unsupported <codeph>CAST</codeph> operations do not
             raise an error in Impala:
           </p>
 <codeblock>select cast('foo' as int);
 +--------------------+
 | cast('foo' as int) |
 +--------------------+
 | NULL               |
 +--------------------+</codeblock>
         </li>

         <li>
           <p>
             For any other type not supported in Impala, you could represent their values in string format and write
             UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details.
           </p>
         </li>

         <li>
           <p>
             To detect the presence of unsupported or unconvertable data types in data files, do initial testing
             with the <codeph>ABORT_ON_ERROR=true</codeph> query option in effect. This option causes queries to
             fail immediately if they encounter disallowed type conversions. See
             <xref href="impala_abort_on_error.xml#abort_on_error"/> for details. For example:
           </p>
 <codeblock>set abort_on_error=true;
 select count(*) from (select * from t1);
 -- The above query will fail if the data files for T1 contain any
 -- values that can't be converted to the expected Impala data types.
 -- For example, if T1.C1 is defined as INT but the column contains
 -- floating-point values like 1.1, the query will return an error.</codeblock>
         </li>
       </ul>
     </conbody>
   </concept>

   <concept id="porting_statements">

     <title>SQL Statements to Remove or Adapt</title>

     <conbody>

       <p>
         Some SQL statements or clauses that you might be familiar with are not currently supported in Impala:
       </p>

       <ul>
         <li>
           <p>
             Impala has no <codeph>DELETE</codeph> statement. Impala is intended for data warehouse-style operations
             where you do bulk moves and transforms of large quantities of data. Instead of using
             <codeph>DELETE</codeph>, use <codeph>INSERT OVERWRITE</codeph> to entirely replace the contents of a
             table or partition, or use <codeph>INSERT ... SELECT</codeph> to copy a subset of data (everything but
             the rows you intended to delete) from one table to another. See <xref href="impala_dml.xml#dml"/> for
             an overview of Impala DML statements.
           </p>
         </li>

         <li>
           <p>
             Impala has no <codeph>UPDATE</codeph> statement. Impala is intended for data warehouse-style operations
             where you do bulk moves and transforms of large quantities of data. Instead of using
             <codeph>UPDATE</codeph>, do all necessary transformations early in the ETL process, such as in the job
             that generates the original data, or when copying from one table to another to convert to a particular
             file format or partitioning scheme. See <xref href="impala_dml.xml#dml"/> for an overview of Impala DML
             statements.
           </p>
         </li>

         <li>
           <p>
             Impala has no transactional statements, such as <codeph>COMMIT</codeph> or <codeph>ROLLBACK</codeph>.
             Impala effectively works like the <codeph>AUTOCOMMIT</codeph> mode in some database systems, where
             changes take effect as soon as they are made.
           </p>
         </li>

         <li>
           <p>
             If your database, table, column, or other names conflict with Impala reserved words, use different
             names or quote the names with backticks. See <xref href="impala_reserved_words.xml#reserved_words"/>
             for the current list of Impala reserved words.
           </p>
           <p>
             Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or
             column alias. For example, in <codeph>SELECT * FROM t1 NATURAL JOIN t2</codeph>, Impala does not
             recognize the <codeph>NATURAL</codeph> keyword and interprets it as an alias for the table
             <codeph>t1</codeph>. If you experience any unexpected behavior with queries, check the list of reserved
             words to make sure all keywords in join and <codeph>WHERE</codeph> clauses are recognized.
           </p>
         </li>

         <li>
           <p>
             Impala supports subqueries only in the <codeph>FROM</codeph> clause of a query, not within the
             <codeph>WHERE</codeph> clauses. Therefore, you cannot use clauses such as <codeph>WHERE
             <varname>column</varname> IN (<varname>subquery</varname>)</codeph>. Also, Impala does not allow
             <codeph>EXISTS</codeph> or <codeph>NOT EXISTS</codeph> clauses (although <codeph>EXISTS</codeph> is a
             reserved keyword).
           </p>
         </li>

         <li>
           <p>
             Impala supports <codeph>UNION</codeph> and <codeph>UNION ALL</codeph> set operators, but not
             <codeph>INTERSECT</codeph>. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/>
           </p>
         </li>

         <li>
           <p>
             Within queries, Impala requires query aliases for any subqueries:
           </p>
 <codeblock>-- Without the alias 'contents_of_t1' at the end, query gives syntax error.
 select count(*) from (select * from t1) contents_of_t1;</codeblock>
         </li>

         <li>
           <p>
             When an alias is declared for an expression in a query, that alias cannot be referenced again within
             the same query block:
           </p>
 <codeblock>-- Can't reference AVERAGE twice in the SELECT list where it's defined.
 select avg(x) as average, average+1 from t1 group by x;
 ERROR: AnalysisException: couldn't resolve column reference: 'average'

 -- Although it can be referenced again later in the same query.
 select avg(x) as average from t1 group by x having average &gt; 3;</codeblock>
           <p>
             For Impala, either repeat the expression again, or abstract the expression into a <codeph>WITH</codeph>
             clause, creating named columns that can be referenced multiple times anywhere in the base query:
           </p>
 <codeblock>-- The following 2 query forms are equivalent.
 select avg(x) as average, avg(x)+1 from t1 group by x;
 with avg_t as (select avg(x) average from t1 group by x) select average, average+1 from avg_t;</codeblock>
 <!-- An alternative bunch of queries to use in the example above.
 [localhost:21000] > select x*x as x_squared from t1;

 [localhost:21000] > select x*x as x_squared from t1 where x_squared < 100;
 ERROR: AnalysisException: couldn't resolve column reference: 'x_squared'
 [localhost:21000] > select x*x as x_squared, x_squared * pi() as pi_x_squared from t1;
 ERROR: AnalysisException: couldn't resolve column reference: 'x_squared'
 [localhost:21000] > select x*x as x_squared from t1 group by x_squared;

 [localhost:21000] > select x*x as x_squared from t1 group by x_squared having x_squared < 100;
 -->
         </li>

         <li>
           <p>
             Impala does not support certain rarely used join types that are less appropriate for high-volume tables
             used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to
             ensure you do not do inefficient joins of huge tables by accident. For example, Impala does not support
             natural joins or anti-joins, and requires the <codeph>CROSS JOIN</codeph> operator for Cartesian
             products. See <xref href="impala_joins.xml#joins"/> for details on the syntax for Impala join clauses.
           </p>
         </li>

         <li>
           <p>
             Impala has a limited choice of partitioning types. Partitions are defined based on each distinct
             combination of values for one or more partition key columns. Impala does not redistribute or check data
             to create evenly distributed partitions; you must choose partition key columns based on your knowledge
             of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning
             to use the Impala partition syntax for <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
             statements. Impala partitioning is similar to range partitioning where every range has exactly one
             value, or key partitioning where the hash function produces a separate bucket for every combination of
             key values. See <xref href="impala_partitioning.xml#partitioning"/> for usage details, and
             <xref href="impala_create_table.xml#create_table"/> and
             <xref href="impala_alter_table.xml#alter_table"/> for syntax.
           </p>
           <note>
             Because the number of separate partitions is potentially higher than in other database systems, keep a
             close eye on the number of partitions and the volume of data in each one; scale back the number of
             partition key columns if you end up with too many partitions with a small volume of data in each one.
             Remember, to distribute work for a query across a cluster, you need at least one HDFS block per node.
             HDFS blocks are typically multiple megabytes, <ph rev="parquet_block_size">especially</ph> for Parquet
             files. Therefore, if each partition holds only a few megabytes of data, you are unlikely to see much
             parallelism in the query because such a small amount of data is typically processed by a single node.
           </note>
         </li>

         <li>
           <p>
             For <q>top-N</q> queries, Impala uses the <codeph>LIMIT</codeph> clause rather than comparing against a
             pseudocolumn named <codeph>ROWNUM</codeph> or <codeph>ROW_NUM</codeph>. See
             <xref href="impala_limit.xml#limit"/> for details.
           </p>
         </li>
       </ul>
     </conbody>
   </concept>

   <concept id="porting_antipatterns">

     <title>SQL Constructs to Doublecheck</title>

     <conbody>

       <p>
         Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than
         optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC
         applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, be alert
         and change these things where appropriate:
       </p>

       <ul>
         <li>
           <p>
             A <codeph>CREATE TABLE</codeph> statement with no <codeph>STORED AS</codeph> clause creates data files
             in plain text format, which is convenient for data interchange but not a good choice for high-volume
             data with high-performance queries. See <xref href="impala_file_formats.xml#file_formats"/> for why and
             how to use specific file formats for compact data and high-performance queries. Especially see
             <xref href="impala_parquet.xml#parquet"/>, for details about the file format most heavily optimized for
             large-scale data warehouse queries.
           </p>
         </li>

         <li>
           <p>
             A <codeph>CREATE TABLE</codeph> statement with no <codeph>PARTITIONED BY</codeph> clause stores all the
             data files in the same physical location, which can lead to scalability problems when the data volume
             becomes large.
           </p>
           <p>
             On the other hand, adapting tables that were already partitioned in a different database system could
             produce an Impala table with a high number of partitions and not enough data in each one, leading to
             underutilization of Impala's parallel query features.
           </p>
           <p>
             See <xref href="impala_partitioning.xml#partitioning"/> for details about setting up partitioning and
             tuning the performance of queries on partitioned tables.
           </p>
         </li>

         <li>
           <p>
             The <codeph>INSERT ... VALUES</codeph> syntax is suitable for setting up toy tables with a few rows for
             functional testing, but because each such statement creates a separate tiny file in HDFS, it is not a
             scalable technique for loading megabytes or gigabytes (let alone petabytes) of data. Consider revising
             your data load process to produce raw data files outside of Impala, then setting up Impala external
             tables or using the <codeph>LOAD DATA</codeph> statement to use those data files instantly in Impala
             tables, with no conversion or indexing stage. See <xref href="impala_tables.xml#external_tables"/> and
             <xref href="impala_load_data.xml#load_data"/> for details about the Impala techniques for working with
             data files produced outside of Impala; see <xref href="impala_tutorial.xml#tutorial_etl"/> for examples
             of ETL workflow for Impala.
           </p>
         </li>

         <li>
           <p>
             If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data
             files, or a single giant data file that cannot take advantage of distributed parallel queries or
             partitioning. In this case, use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a
             new table and reorganize into a more efficient layout in the same operation. See
             <xref href="impala_insert.xml#insert"/> for details about the <codeph>INSERT</codeph> statement.
           </p>
           <p>
             You can do <codeph>INSERT ... SELECT</codeph> into a table with a more efficient file format (see
             <xref href="impala_file_formats.xml#file_formats"/>) or from an unpartitioned table into a partitioned
             one (see <xref href="impala_partitioning.xml#partitioning"/>).
           </p>
         </li>

         <li>
           <p>
             The number of expressions allowed in an Impala query might be smaller than for some other database
             systems, causing failures for very complicated queries (typically produced by automated SQL
             generators). Where practical, keep the number of expressions in the <codeph>WHERE</codeph> clauses to
             approximately 2000 or fewer. As a workaround, set the query option
             <codeph>DISABLE_CODEGEN=true</codeph> if queries fail for this reason. See
             <xref href="impala_disable_codegen.xml#disable_codegen"/> for details.
           </p>
         </li>

         <li>
           <p>
             If practical, rewrite <codeph>UNION</codeph> queries to use the <codeph>UNION ALL</codeph> operator
             instead. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/>
           </p>
         </li>
       </ul>
     </conbody>
   </concept>

   <concept id="porting_next">

     <title>Next Porting Steps after Verifying Syntax and Semantics</title>

     <conbody>

       <p>
         Throughout this section, some of the decisions you make during the porting process also have a substantial
         impact on performance. After your SQL code is ported and working correctly, doublecheck the
         performance-related aspects of your schema design, physical layout, and queries to make sure that the
         ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and
         integration with Hadoop components.
       </p>

       <ul>
         <li>
           Have you run the <codeph>COMPUTE STATS</codeph> statement on each table involved in join queries? Have
           you also run <codeph>COMPUTE STATS</codeph> for each table used as the source table in an <codeph>INSERT
           ... SELECT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> statement?
         </li>

         <li>
           Are you using the most efficient file format for your data volumes, table structure, and query
           characteristics?
         </li>

         <li>
           Are you using partitioning effectively? That is, have you partitioned on columns that are often used for
           filtering in <codeph>WHERE</codeph> clauses? Have you partitioned at the right granularity so that there
           is enough data in each partition to parallelize the work for each query?
         </li>

         <li>
           Does your ETL process produce a relatively small number of multi-megabyte data files (good) rather than a
           huge number of small files (bad)?
         </li>
       </ul>

       <p>
         See <xref href="impala_performance.xml#performance"/> for details about the whole performance tuning
         process.
       </p>
     </conbody>
   </concept>
 </concept>