| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="porting"> |
| |
| <title>Porting SQL from Other Database Systems to Impala</title> |
| <titlealts audience="PDF"><navtitle>Porting SQL</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Databases"/> |
| <data name="Category" value="Hive"/> |
| <data name="Category" value="Oracle"/> |
| <data name="Category" value="MySQL"/> |
| <data name="Category" value="PostgreSQL"/> |
| <data name="Category" value="Troubleshooting"/> |
| <data name="Category" value="Porting"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Developers"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> Although Impala uses standard SQL for queries, you might need to modify |
| SQL source when bringing applications to Impala, due to variations in data |
| types, built-in functions, vendor language extensions, and Hadoop-specific |
| syntax. Even when SQL is working correctly, you might make further minor |
| modifications for best performance. </p> |
| |
| <p outputclass="toc inpage"/> |
| </conbody> |
| |
| <concept id="porting_ddl_dml"> |
| |
| <title>Porting DDL and DML Statements</title> |
| |
| <conbody> |
| |
| <p> |
| When adapting SQL code from a traditional database system to Impala, expect to find a number of differences |
| in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, |
| tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to |
| account for the Impala partitioning scheme and Hadoop file formats. |
| </p> |
| |
| <p> |
| Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor |
| extensions and features not yet supported in Impala, you might be able to run identical or almost-identical |
| query text on both systems. |
| </p> |
| |
| <p> |
| Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse |
| and ongoing tuning efforts on the code for SQL queries. |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept id="porting_data_types"> |
| |
| <title>Porting Data Types from Other Database Systems</title> |
| |
| <conbody> |
| |
| <ul> |
| <li> |
| <p> |
| Change any <codeph>VARCHAR</codeph>, <codeph>VARCHAR2</codeph>, and <codeph>CHAR</codeph> columns to |
| <codeph>STRING</codeph>. Remove any length constraints from the column declarations; for example, |
| change <codeph>VARCHAR(32)</codeph> or <codeph>CHAR(1)</codeph> to <codeph>STRING</codeph>. Impala is |
| very flexible about the length of string values; it does not impose any length constraints |
| or do any special processing (such as blank-padding) for <codeph>STRING</codeph> columns. |
| (In Impala 2.0 and higher, there are data types <codeph>VARCHAR</codeph> and <codeph>CHAR</codeph>, |
| with length constraints for both types and blank-padding for <codeph>CHAR</codeph>. |
| However, for performance reasons, it is still preferable to use <codeph>STRING</codeph> |
| columns where practical.) |
| </p> |
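            <p>
              As an illustration, a declaration like the following from another system (the
              table and column names here are hypothetical):
            </p>
<codeblock>CREATE TABLE customers (id INT, name VARCHAR2(64), region CHAR(2));</codeblock>
            <p>
              might be rewritten for Impala with the length constraints removed:
            </p>
<codeblock>CREATE TABLE customers (id INT, name STRING, region STRING);</codeblock>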
| </li> |
| |
| <li> |
| <p> |
| For national language character types such as <codeph>NCHAR</codeph>, <codeph>NVARCHAR</codeph>, or |
| <codeph>NCLOB</codeph>, be aware that while Impala can store and query UTF-8 character data, currently |
| some string manipulation operations only work correctly with ASCII data. See |
| <xref href="impala_string.xml#string"/> for details. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Change any <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, or <codeph>TIME</codeph> columns to |
| <codeph>TIMESTAMP</codeph>. Remove any precision constraints. Remove any timezone clauses, and make |
| sure your application logic or ETL process accounts for the fact that Impala expects all |
| <codeph>TIMESTAMP</codeph> values to be in |
| <xref href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" scope="external" format="html">Coordinated |
| Universal Time (UTC)</xref>. See <xref href="impala_timestamp.xml#timestamp"/> for information about |
| the <codeph>TIMESTAMP</codeph> data type, and |
| <xref href="impala_datetime_functions.xml#datetime_functions"/> for conversion functions for different |
| date and time formats. |
| </p> |
| <p> |
| You might also need to adapt date- and time-related literal values and format strings to use the |
| supported Impala date and time formats. If you have date and time literals with different separators or |
| different numbers of <codeph>YY</codeph>, <codeph>MM</codeph>, and so on placeholders than Impala |
| expects, consider using calls to <codeph>regexp_replace()</codeph> to transform those values to the |
| Impala-compatible format. See <xref href="impala_timestamp.xml#timestamp"/> for information about the |
| allowed formats for date and time literals, and |
| <xref href="impala_string_functions.xml#string_functions"/> for string conversion functions such as |
| <codeph>regexp_replace()</codeph>. |
| </p> |
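          <p>
            For example, assuming a hypothetical <codeph>events</codeph> table whose
            <codeph>event_date</codeph> column holds strings in <codeph>DD/MM/YYYY</codeph>
            form, a call along these lines could rearrange the values into a format that
            Impala recognizes:
          </p>
<codeblock>-- Rearrange '31/01/2019' into '2019-01-31' before casting to TIMESTAMP.
SELECT CAST(regexp_replace(event_date, '(\\d+)/(\\d+)/(\\d+)', '\\3-\\2-\\1') AS TIMESTAMP)
  FROM events;</codeblock>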
| <p> |
| Instead of <codeph>SYSDATE</codeph>, call the function <codeph>NOW()</codeph>. |
| </p> |
| <p> |
| Instead of adding or subtracting directly from a date value to produce a value <varname>N</varname> |
| days in the past or future, use an <codeph>INTERVAL</codeph> expression, for example <codeph>NOW() + |
| INTERVAL 30 DAYS</codeph>. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Although Impala supports <codeph>INTERVAL</codeph> expressions for datetime arithmetic, as shown in |
| <xref href="impala_timestamp.xml#timestamp"/>, <codeph>INTERVAL</codeph> is not available as a column |
| data type in Impala. For any <codeph>INTERVAL</codeph> values stored in tables, convert them to numeric |
| values that you can add or subtract using the functions in |
| <xref href="impala_datetime_functions.xml#datetime_functions"/>. For example, if you had a table |
| <codeph>DEADLINES</codeph> with an <codeph>INT</codeph> column <codeph>TIME_PERIOD</codeph>, you could |
          construct dates <varname>N</varname> days in the future as follows:
| </p> |
| <codeblock>SELECT NOW() + INTERVAL time_period DAYS from deadlines;</codeblock> |
| </li> |
| |
| <li> |
| <p> |
| For <codeph>YEAR</codeph> columns, change to the smallest Impala integer type that has sufficient |
| range. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on |
| for the various numeric data types. |
| </p> |
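        <p>
          For example, a MySQL-style <codeph>YEAR</codeph> column might become
          <codeph>SMALLINT</codeph>, whose range (-32768 to 32767) is the smallest that
          covers 4-digit years (the table name here is illustrative):
        </p>
<codeblock>-- TINYINT (max 127) is too small for 4-digit years; SMALLINT is the smallest fit.
CREATE TABLE events (event_year SMALLINT);</codeblock>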
| </li> |
| |
| <li> |
| <p> |
| Change any <codeph>DECIMAL</codeph> and <codeph>NUMBER</codeph> types. If fixed-point precision is not |
| required, you can use <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> on the Impala side depending on |
| the range of values. For applications that require precise decimal values, such as financial data, you |
| might need to make more extensive changes to table structure and application logic, such as using |
| separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs |
| to manipulate them. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, |
| casting, and so on for the various numeric data types. |
| </p> |
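        <p>
          For example, one possible restructuring (purely illustrative) stores a currency
          amount as separate integer columns and reassembles a display value at query time:
        </p>
<codeblock>-- Dollars and cents kept as exact integers instead of a DECIMAL column.
CREATE TABLE ledger (entry_id BIGINT, dollars BIGINT, cents TINYINT);

SELECT entry_id,
       concat(CAST(dollars AS STRING), '.', lpad(CAST(cents AS STRING), 2, '0')) AS amount
  FROM ledger;</codeblock>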
| </li> |
| |
| <li> |
| <p> |
| <codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, and <codeph>REAL</codeph> types are supported in |
| Impala. Remove any precision and scale specifications. (In Impala, <codeph>REAL</codeph> is just an |
| alias for <codeph>DOUBLE</codeph>; columns declared as <codeph>REAL</codeph> are turned into |
| <codeph>DOUBLE</codeph> behind the scenes.) See <xref href="impala_datatypes.xml#datatypes"/> for |
| details about ranges, casting, and so on for the various numeric data types. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Most integer types from other systems have equivalents in Impala, perhaps under different names such as |
| <codeph>BIGINT</codeph> instead of <codeph>INT8</codeph>. For any that are unavailable, for example |
| <codeph>MEDIUMINT</codeph>, switch to the smallest Impala integer type that has sufficient range. |
| Remove any precision specifications. See <xref href="impala_datatypes.xml#datatypes"/> for details |
| about ranges, casting, and so on for the various numeric data types. |
| </p> |
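        <p>
          For example, a MySQL <codeph>MEDIUMINT</codeph> (24-bit) column might map to the
          Impala <codeph>INT</codeph> type, the smallest Impala type whose range contains
          it (the names here are illustrative):
        </p>
<codeblock>-- MEDIUMINT range (-8388608 to 8388607) fits comfortably within INT.
CREATE TABLE measurements (reading INT);</codeblock>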
| </li> |
| |
| <li> |
| <p> |
| Remove any <codeph>UNSIGNED</codeph> constraints. All Impala numeric types are signed. See |
| <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on for the |
| various numeric data types. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For any types holding bitwise values, use an integer type with enough range to hold all the relevant |
| bits within a positive integer. See <xref href="impala_datatypes.xml#datatypes"/> for details about |
| ranges, casting, and so on for the various numeric data types. |
| </p> |
| <p> |
          For example, <codeph>TINYINT</codeph> has a maximum positive value of 127, not 255, so to manipulate
          8-bit bitfields as positive numbers, switch to the next larger type, <codeph>SMALLINT</codeph>.
| </p> |
| <codeblock>[localhost:21000] > select cast(127*2 as tinyint); |
| +--------------------------+ |
| | cast(127 * 2 as tinyint) | |
| +--------------------------+ |
| | -2 | |
| +--------------------------+ |
| [localhost:21000] > select cast(128 as tinyint); |
| +----------------------+ |
| | cast(128 as tinyint) | |
| +----------------------+ |
| | -128 | |
| +----------------------+ |
| [localhost:21000] > select cast(127*2 as smallint); |
| +---------------------------+ |
| | cast(127 * 2 as smallint) | |
| +---------------------------+ |
| | 254 | |
| +---------------------------+</codeblock> |
| <p> |
| Impala does not support notation such as <codeph>b'0101'</codeph> for bit literals. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
          For large object types, use <codeph>STRING</codeph> to represent <codeph>CLOB</codeph> or
          <codeph>TEXT</codeph> values (character-based large objects) up to 32 KB in size. Binary large
          object types such as <codeph>BLOB</codeph>, <codeph>RAW</codeph>, <codeph>BINARY</codeph>, and
          <codeph>VARBINARY</codeph> do not currently have an equivalent in Impala.
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For Boolean-like types such as <codeph>BOOL</codeph>, use the Impala <codeph>BOOLEAN</codeph> type. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Because Impala currently does not support composite or nested types, any spatial data types in other |
| database systems do not have direct equivalents in Impala. You could represent spatial values in string |
| format and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. Where |
| practical, separate spatial types into separate tables so that Impala can still work with the |
| non-spatial data. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Take out any <codeph>DEFAULT</codeph> clauses. Impala can use data files produced from many different |
| sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of <codeph>LOAD DATA</codeph> |
| and external tables mean that Impala is flexible about the format of data files, and Impala does not |
| necessarily validate or cleanse data before querying it. When copying data through Impala |
| <codeph>INSERT</codeph> statements, you can use conditional functions such as <codeph>CASE</codeph> or |
| <codeph>NVL</codeph> to substitute some other value for <codeph>NULL</codeph> fields; see |
| <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. |
| </p> |
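        <p>
          For example, a <codeph>DEFAULT 'unknown'</codeph> clause removed from the DDL can
          be approximated at copy time (the table and column names here are hypothetical):
        </p>
<codeblock>-- Supply the fallback value during the copy instead of in the column definition.
INSERT INTO customers_clean
  SELECT id, nvl(region, 'unknown') FROM customers_raw;</codeblock>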
| </li> |
| |
| <li> |
| <p> |
| Take out any constraints from your <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> |
| statements, for example <codeph>PRIMARY KEY</codeph>, <codeph>FOREIGN KEY</codeph>, |
| <codeph>UNIQUE</codeph>, <codeph>NOT NULL</codeph>, <codeph>UNSIGNED</codeph>, or |
| <codeph>CHECK</codeph> constraints. Impala can use data files produced from many different sources, |
| such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen |
| earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries |
| to test for <codeph>NULL</codeph> values. When copying data through Impala <codeph>INSERT</codeph> |
| statements, you can use conditional functions such as <codeph>CASE</codeph> or <codeph>NVL</codeph> to |
| substitute some other value for <codeph>NULL</codeph> fields; see |
| <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. |
| </p> |
| <p> |
| Do as much verification as practical before loading data into Impala. After data is loaded into Impala, |
| you can do further verification using SQL queries to check if values have expected ranges, if values |
| are <codeph>NULL</codeph> or not, and so on. If there is a problem with the data, you will need to |
| re-run earlier stages of the ETL process, or do an <codeph>INSERT ... SELECT</codeph> statement in |
| Impala to copy the faulty data to a new table and transform or filter out the bad values. |
| </p> |
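        <p>
          For example, verification queries along these lines can stand in for the
          constraints removed from the DDL (all names here are hypothetical):
        </p>
<codeblock>-- Rows that a NOT NULL constraint would have rejected.
SELECT COUNT(*) FROM t1 WHERE c1 IS NULL;

-- Rows outside the range a CHECK constraint would have enforced.
SELECT COUNT(*) FROM t1 WHERE c2 NOT BETWEEN 0 AND 100;</codeblock>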
| </li> |
| |
| <li> |
| <p> |
| Take out any <codeph>CREATE INDEX</codeph>, <codeph>DROP INDEX</codeph>, and <codeph>ALTER |
| INDEX</codeph> statements, and equivalent <codeph>ALTER TABLE</codeph> statements. Remove any |
| <codeph>INDEX</codeph>, <codeph>KEY</codeph>, or <codeph>PRIMARY KEY</codeph> clauses from |
| <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. Impala is optimized for bulk |
| read operations for data warehouse-style queries, and therefore does not support indexes for its |
| tables. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
          Calls to built-in functions with out-of-range or otherwise incorrect arguments return
          <codeph>NULL</codeph> in Impala rather than raising exceptions. (This rule applies even when the
          <codeph>ABORT_ON_ERROR=true</codeph> query option is in effect.) Run small-scale queries using
          representative data to double-check that calls to built-in functions are returning expected values
          rather than <codeph>NULL</codeph>. For example, unsupported <codeph>CAST</codeph> operations do not
          raise an error in Impala:
| </p> |
| <codeblock>select cast('foo' as int); |
| +--------------------+ |
| | cast('foo' as int) | |
| +--------------------+ |
| | NULL | |
| +--------------------+</codeblock> |
| </li> |
| |
| <li> |
| <p> |
          For any other data types not supported in Impala, you can represent their values in string format
          and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details.
| </p> |
| </li> |
| |
| <li> |
| <p> |
          To detect the presence of unsupported or unconvertible data types in data files, do initial testing
| with the <codeph>ABORT_ON_ERROR=true</codeph> query option in effect. This option causes queries to |
| fail immediately if they encounter disallowed type conversions. See |
| <xref href="impala_abort_on_error.xml#abort_on_error"/> for details. For example: |
| </p> |
| <codeblock>set abort_on_error=true; |
| select count(*) from (select * from t1); |
| -- The above query will fail if the data files for T1 contain any |
| -- values that can't be converted to the expected Impala data types. |
| -- For example, if T1.C1 is defined as INT but the column contains |
| -- floating-point values like 1.1, the query will return an error.</codeblock> |
| </li> |
| </ul> |
| </conbody> |
| </concept> |
| |
| <concept id="porting_statements"> |
| |
| <title>SQL Statements to Remove or Adapt</title> |
| |
| <conbody> |
| |
    <p> The following SQL statements or clauses are either not currently
      supported in Impala or are supported only with limitations: </p>
| |
| <ul> |
| <li> |
| <p> Impala supports the <codeph>DELETE</codeph> statement only for |
| Kudu tables. </p> |
| <p>Impala is intended for data warehouse-style operations where you do |
| bulk moves and transforms of large quantities of data. When not |
| using Kudu tables, instead of <codeph>DELETE</codeph>, use |
| <codeph>INSERT OVERWRITE</codeph> to entirely replace the contents |
| of a table or partition, or use <codeph>INSERT ... SELECT</codeph> |
| to copy a subset of data (everything but the rows you intended to |
| delete) from one table to another. See <xref |
| href="impala_dml.xml#dml"/> for an overview of Impala DML |
| statements. </p> |
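        <p>As a sketch of the copy-based approach (the table name and filter
          condition here are hypothetical):</p>
<codeblock>-- Keep everything except the rows you would have deleted.
INSERT OVERWRITE t1_copy
  SELECT * FROM t1 WHERE status != 'obsolete';</codeblock>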
| </li> |
| <li> |
| <p> Impala supports the <codeph>UPDATE</codeph> statement only for |
| Kudu tables.</p> |
| <p>When not using Kudu tables, instead of <codeph>UPDATE</codeph>, do |
| all necessary transformations early in the ETL process, such as in |
| the job that generates the original data, or when copying from one |
| table to another to convert to a particular file format or |
| partitioning scheme. See <xref href="impala_dml.xml#dml"/> for an |
| overview of Impala DML statements. </p> |
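        <p>For example, an update-like transformation can be folded into the
          copy step (all names here are hypothetical):</p>
<codeblock>-- Apply the would-be UPDATE while copying between tables.
INSERT INTO customers_v2
  SELECT id, upper(region) AS region FROM customers;</codeblock>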
| </li> |
| <li> |
| <p> Impala has no transactional statements, such as |
| <codeph>COMMIT</codeph> or <codeph>ROLLBACK</codeph>. </p> |
| <p>Impala effectively works like the <codeph>AUTOCOMMIT</codeph> mode |
| in some database systems, where changes take effect as soon as they |
| are made. </p> |
| </li> |
| <li> |
| <p> If your database, table, column, or other names conflict with |
| Impala reserved words, use different names or quote the names with |
| backticks. </p> |
| <p>See <xref href="impala_reserved_words.xml#reserved_words"/> for the |
| current list of Impala reserved words. </p> |
| <p> Conversely, if you use a keyword that Impala does not recognize, |
| it might be interpreted as a table or column alias. </p> |
| <p>For example, in <codeph>SELECT * FROM t1 NATURAL JOIN t2</codeph>, |
| Impala does not recognize the <codeph>NATURAL</codeph> keyword and |
| interprets it as an alias for the table <codeph>t1</codeph>. If you |
| experience any unexpected behavior with queries, check the list of |
| reserved words to make sure all keywords in join and |
| <codeph>WHERE</codeph> clauses are supported keywords in Impala. |
| </p> |
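        <p>For example, assuming a legacy table with a column named after the
          reserved word <codeph>location</codeph>, backtick quoting lets
          existing queries keep working:</p>
<codeblock>-- Backticks allow a reserved word to be used as an identifier.
SELECT `location` FROM facilities;</codeblock>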
| </li> |
| <li> |
| <p>Impala has some restrictions on subquery support. See <keyword |
| keyref="subqueries"/> for the current details.</p> |
| </li> |
| <li> |
| <p> Impala supports <codeph>UNION</codeph> and <codeph>UNION |
| ALL</codeph> set operators, but not <codeph>INTERSECT</codeph>. </p> |
| <p><ph conref="../shared/impala_common.xml#common/union_all_vs_union" |
| /> |
| </p> |
| </li> |
      <li>
        <p>Impala requires query aliases for the subqueries used as inline
          views in the <codeph>FROM</codeph> clause. </p>
        <p>For example, without the alias <codeph>contents_of_t1</codeph> at
          the end, the following query gives a syntax error:</p>
        <codeblock>SELECT COUNT(*) FROM (SELECT * FROM t1) contents_of_t1;</codeblock>
        <p>Aliases are not required for subqueries used in other parts of
          queries. For example:</p>
        <codeblock>SELECT * FROM functional.alltypes WHERE id = (SELECT MIN(id) FROM functional.alltypes);</codeblock>
      </li>
| <li> |
| <p> When an alias is declared for an expression in a query, that alias |
| cannot be referenced again within the same <codeph>SELECT</codeph> |
| list.</p> |
        <p>For example, the following query returns an error because the
          <codeph>average</codeph> alias is referenced a second time within
          the <codeph>SELECT</codeph> list:</p>
| <codeblock>SELECT AVG(x) AS average, average+1 FROM t1 GROUP BY x;</codeblock> |
        <p>An alias can be referenced again elsewhere in the same query,
          outside the <codeph>SELECT</codeph> list. For example, the
          <codeph>average</codeph> alias can be used in the
          <codeph>HAVING</codeph> clause:</p>
        <codeblock>SELECT AVG(x) AS average FROM t1 GROUP BY x HAVING average > 3;</codeblock>
| </li> |
| <li> |
| <p> Impala does not support <codeph>NATURAL JOIN</codeph>, and it does |
| not support the <codeph>USING</codeph> clause in joins. See <xref |
| href="impala_joins.xml#joins"/> for details on the syntax for |
| Impala join clauses. </p> |
| </li> |
| <li> |
| <p> Impala supports a limited choice of partitioning types. </p> |
| <p>Partitions are defined based on each distinct combination of values |
| for one or more partition key columns. Impala does not redistribute |
| or check data to create evenly distributed partitions. You must |
| choose partition key columns based on your knowledge of the data |
| volume and distribution. Adapt any tables that use range, list, |
| hash, or key partitioning to use the Impala partition syntax for |
| <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> |
| statements. </p> |
| <p>Impala partitioning is similar to range partitioning where every |
| range has exactly one value, or key partitioning where the hash |
| function produces a separate bucket for every combination of key |
| values. See <xref href="impala_partitioning.xml#partitioning"/> for |
| usage details, and <xref href="impala_create_table.xml#create_table" |
| /> and <xref href="impala_alter_table.xml#alter_table"/> for syntax. </p> |
| <note> Because the number of separate partitions is potentially higher |
| than in other database systems, keep a close eye on the number of |
| partitions and the volume of data in each one; scale back the number |
| of partition key columns if you end up with too many partitions with |
| a small volume of data in each one. <p>To distribute work for a |
| query across a cluster, you need at least one HDFS block per node. |
| HDFS blocks are typically multiple megabytes, <ph |
| rev="parquet_block_size">especially</ph> for Parquet files. |
| Therefore, if each partition holds only a few megabytes of data, |
| you are unlikely to see much parallelism in the query because such |
| a small amount of data is typically processed by a single node. |
| </p></note> |
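        <p>For example, a table that used explicit range partitioning by date
          in another system might be adapted along these lines (the schema is
          illustrative):</p>
<codeblock>-- Impala creates one partition per distinct (year, month) combination,
-- rather than using explicitly declared ranges.
CREATE TABLE sales (id BIGINT, amount DOUBLE)
  PARTITIONED BY (sale_year INT, sale_month INT);</codeblock>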
| </li> |
| <li> |
        <p> For <q>top-N</q> queries, Impala uses the
          <codeph>LIMIT</codeph> clause rather than comparing against a
          pseudocolumn named <codeph>ROWNUM</codeph> or
          <codeph>ROW_NUM</codeph>. </p>
| <p>See <xref href="impala_limit.xml#limit"/> for details. </p> |
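        <p>For example (the table and column names here are hypothetical):</p>
<codeblock>-- Replaces an Oracle-style ROWNUM restriction for a top-10 query.
SELECT * FROM t1 ORDER BY total DESC LIMIT 10;</codeblock>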
| </li> |
| </ul> |
| </conbody> |
| </concept> |
| |
| <concept id="porting_antipatterns"> |
| |
| <title>SQL Constructs to Double-check</title> |
| |
| <conbody> |
| |
| <p> Some SQL constructs that are supported have behavior or defaults more |
| oriented towards convenience than optimal performance. Also, sometimes |
| machine-generated SQL, perhaps issued through JDBC or ODBC applications, |
| might have inefficiencies or exceed internal Impala limits. As you port |
| SQL code, examine and possibly update the following where appropriate: </p> |
| |
| <ul> |
| <li> |
| <p> |
| A <codeph>CREATE TABLE</codeph> statement with no <codeph>STORED AS</codeph> clause creates data files |
| in plain text format, which is convenient for data interchange but not a good choice for high-volume |
| data with high-performance queries. See <xref href="impala_file_formats.xml#file_formats"/> for why and |
| how to use specific file formats for compact data and high-performance queries. Especially see |
| <xref href="impala_parquet.xml#parquet"/>, for details about the file format most heavily optimized for |
| large-scale data warehouse queries. |
| </p> |
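        <p>
          For example (the table name is illustrative):
        </p>
<codeblock>-- Explicit file format choice for a high-volume analytic table.
CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE)
  STORED AS PARQUET;</codeblock>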
| </li> |
| |
| <li> |
| <p> Adapting tables that were already partitioned in a different |
| database system could produce an Impala table with a high number of |
| partitions and not enough data in each one, leading to |
| underutilization of Impala's parallel query features. </p> |
| <p> |
| See <xref href="impala_partitioning.xml#partitioning"/> for details about setting up partitioning and |
| tuning the performance of queries on partitioned tables. |
| </p> |
| </li> |
| |
| <li> |
        <p> With HDFS-backed tables, the <codeph>INSERT ... VALUES</codeph>
          syntax is suitable only for setting up toy tables with a few rows
          for functional testing. Each such statement creates a separate tiny
          file in HDFS, so it is not a scalable technique for loading
          megabytes or gigabytes (let alone petabytes) of data. </p>
| <p>Consider revising your data load process to produce raw data files |
| outside of Impala, then setting up Impala external tables or using |
| the <codeph>LOAD DATA</codeph> statement to use those data files |
| instantly in Impala tables, with no conversion or indexing stage. |
| See <xref href="impala_tables.xml#external_tables"/> and <xref |
| href="impala_load_data.xml#load_data"/> for details about the |
| Impala techniques for working with data files produced outside of |
| Impala; see <xref href="impala_tutorial.xml#tutorial_etl"/> for |
| examples of ETL workflow for Impala. </p> |
        <p>For Kudu tables, <codeph>INSERT</codeph> works fine, although it
          is not particularly fast.</p>
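        <p>As a sketch of pointing Impala at externally produced files (the
          paths and names here are hypothetical):</p>
<codeblock>-- Attach existing HDFS data files without copying or converting them.
CREATE EXTERNAL TABLE page_views (url STRING, ts TIMESTAMP)
  LOCATION '/user/etl/page_views';

-- Or move already staged files into an existing table's directory.
LOAD DATA INPATH '/user/etl/staging' INTO TABLE page_views;</codeblock>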
| </li> |
| |
| <li> |
| <p> |
| If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data |
| files, or a single giant data file that cannot take advantage of distributed parallel queries or |
| partitioning. In this case, use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a |
| new table and reorganize into a more efficient layout in the same operation. See |
| <xref href="impala_insert.xml#insert"/> for details about the <codeph>INSERT</codeph> statement. |
| </p> |
| <p> You can do <codeph>INSERT ... SELECT</codeph> into a table with a |
| more efficient file format (see <xref |
| href="impala_file_formats.xml#file_formats"/>) or from an |
| unpartitioned table into a partitioned one. See <xref |
| href="impala_partitioning.xml#partitioning"/>. </p> |
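        <p>For example, a single statement can both convert the file format
          and introduce partitioning (all names here are illustrative):</p>
<codeblock>-- Copy from a text-format, unpartitioned table into a partitioned Parquet table;
-- the partition key columns come last in the SELECT list.
INSERT INTO sales_parquet PARTITION (sale_year, sale_month)
  SELECT id, amount, sale_year, sale_month FROM sales_text;</codeblock>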
| </li> |
| |
| <li> |
        <p> Complex queries can spend significant time on runtime code
          generation. If queries fail for this reason, set the query option
          <codeph>DISABLE_CODEGEN=true</codeph> as a workaround. See <xref
            href="impala_disable_codegen.xml#disable_codegen"/> for details. </p>
| </li> |
| |
| <li> |
| <p> |
| If practical, rewrite <codeph>UNION</codeph> queries to use the <codeph>UNION ALL</codeph> operator |
| instead. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/> |
| </p> |
| </li> |
| </ul> |
| </conbody> |
| </concept> |
| |
| <concept id="porting_next"> |
| |
| <title>Next Porting Steps after Verifying Syntax and Semantics</title> |
| |
| <conbody> |
| |
| <p> Some of the decisions you make during the porting process can have an |
| impact on performance. After your SQL code is ported and working |
| correctly, examine the performance-related aspects of your schema |
| design, physical layout, and queries to make sure that the ported |
| application is taking full advantage of Impala's parallelism, |
| performance-related SQL features, and integration with Hadoop |
| components. The following are a few of the areas you should examine:</p> |
| |
| <ul> |
      <li> For optimal performance, we recommend that you run
          <codeph>COMPUTE STATS</codeph> on all tables.</li>
| |
| <li> Use the most efficient file format for your data volumes, table |
| structure, and query characteristics.</li> |
| |
| <li> Partition on columns that are often used for filtering in |
| <codeph>WHERE</codeph> clauses.</li> |
| |
| <li> Your ETL process should produce a relatively small number of |
| multi-megabyte data files rather than a huge number of small |
| files.</li> |
| </ul> |
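
    <p> For example, gathering statistics is a single statement per table
      (the table name is illustrative): </p>
<codeblock>-- Collect table and column statistics that the planner uses, for example
-- to choose join orders.
COMPUTE STATS sales;</codeblock>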
| |
| <p> See <xref href="impala_performance.xml#performance"/> for details |
| about the performance tuning process. </p> |
| </conbody> |
| </concept> |
| </concept> |