| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="porting"> |
| |
| <title>Porting SQL from Other Database Systems to Impala</title> |
| <titlealts audience="PDF"><navtitle>Porting SQL</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Databases"/> |
| <data name="Category" value="Hive"/> |
| <data name="Category" value="Oracle"/> |
| <data name="Category" value="MySQL"/> |
| <data name="Category" value="PostgreSQL"/> |
| <data name="Category" value="Troubleshooting"/> |
| <data name="Category" value="Porting"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Developers"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">porting</indexterm> |
| Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications |
| to Impala, due to variations in data types, built-in functions, vendor language extensions, and |
| Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for |
| best performance. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| </conbody> |
| |
| <concept id="porting_ddl_dml"> |
| |
| <title>Porting DDL and DML Statements</title> |
| |
| <conbody> |
| |
| <p> |
| When adapting SQL code from a traditional database system to Impala, expect to find a number of differences |
| in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, |
| tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to |
| account for the Impala partitioning scheme and Hadoop file formats. |
| </p> |
| |
| <p> |
| Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor |
| extensions and features not yet supported in Impala, you might be able to run identical or almost-identical |
| query text on both systems. |
| </p> |
| |
| <p> |
| Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse |
| and ongoing tuning efforts on the code for SQL queries. |
| </p> |
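| <p> |
| As a rough sketch of that separation, the following setup script shows how one table definition might |
| be adapted. The <codeph>ORDERS</codeph> table, its columns, and the original vendor DDL shown in the |
| comment are hypothetical and serve only as illustration; the choice of Parquet as the file format is |
| likewise an assumption, not a requirement. |
| </p> |
| <codeblock>-- Original DDL from a traditional database system (hypothetical):
--   CREATE TABLE orders (
--     order_id NUMBER(10) PRIMARY KEY,
--     customer VARCHAR2(64) NOT NULL,
--     placed   DATE DEFAULT SYSDATE
--   ) TABLESPACE users;
--
-- One possible Impala version, kept in a separate Impala-specific setup script.
-- Constraints, the DEFAULT clause, and the tablespace clause are dropped,
-- and the column types are mapped to Impala types.
CREATE TABLE orders (
  order_id BIGINT,
  customer STRING,
  placed TIMESTAMP
)
STORED AS PARQUET;</codeblock> |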
| </conbody> |
| </concept> |
| |
| <concept id="porting_data_types"> |
| |
| <title>Porting Data Types from Other Database Systems</title> |
| |
| <conbody> |
| |
| <ul> |
| <li> |
| <p> |
| Change any <codeph>VARCHAR</codeph>, <codeph>VARCHAR2</codeph>, and <codeph>CHAR</codeph> columns to |
| <codeph>STRING</codeph>. Remove any length constraints from the column declarations; for example, |
| change <codeph>VARCHAR(32)</codeph> or <codeph>CHAR(1)</codeph> to <codeph>STRING</codeph>. Impala is |
| very flexible about the length of string values; it does not impose any length constraints |
| or do any special processing (such as blank-padding) for <codeph>STRING</codeph> columns. |
| (In Impala 2.0 and higher, there are data types <codeph>VARCHAR</codeph> and <codeph>CHAR</codeph>, |
| with length constraints for both types and blank-padding for <codeph>CHAR</codeph>. |
| However, for performance reasons, it is still preferable to use <codeph>STRING</codeph> |
| columns where practical.) |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For national language character types such as <codeph>NCHAR</codeph>, <codeph>NVARCHAR</codeph>, or |
| <codeph>NCLOB</codeph>, be aware that while Impala can store and query UTF-8 character data, currently |
| some string manipulation operations only work correctly with ASCII data. See |
| <xref href="impala_string.xml#string"/> for details. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Change any <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, or <codeph>TIME</codeph> columns to |
| <codeph>TIMESTAMP</codeph>. Remove any precision constraints. Remove any timezone clauses, and make |
| sure your application logic or ETL process accounts for the fact that Impala expects all |
| <codeph>TIMESTAMP</codeph> values to be in |
| <xref href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" scope="external" format="html">Coordinated |
| Universal Time (UTC)</xref>. See <xref href="impala_timestamp.xml#timestamp"/> for information about |
| the <codeph>TIMESTAMP</codeph> data type, and |
| <xref href="impala_datetime_functions.xml#datetime_functions"/> for conversion functions for different |
| date and time formats. |
| </p> |
| <p> |
| You might also need to adapt date- and time-related literal values and format strings to use the |
| supported Impala date and time formats. If you have date and time literals whose separators or |
| numbers of placeholder characters (<codeph>YY</codeph>, <codeph>MM</codeph>, and so on) differ from what Impala |
| expects, consider using calls to <codeph>regexp_replace()</codeph> to transform those values to the |
| Impala-compatible format. See <xref href="impala_timestamp.xml#timestamp"/> for information about the |
| allowed formats for date and time literals, and |
| <xref href="impala_string_functions.xml#string_functions"/> for string conversion functions such as |
| <codeph>regexp_replace()</codeph>. |
| </p> |
| <p> |
| Instead of <codeph>SYSDATE</codeph>, call the function <codeph>NOW()</codeph>. |
| </p> |
| <p> |
| Instead of adding or subtracting directly from a date value to produce a value <varname>N</varname> |
| days in the past or future, use an <codeph>INTERVAL</codeph> expression, for example <codeph>NOW() + |
| INTERVAL 30 DAYS</codeph>. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Although Impala supports <codeph>INTERVAL</codeph> expressions for datetime arithmetic, as shown in |
| <xref href="impala_timestamp.xml#timestamp"/>, <codeph>INTERVAL</codeph> is not available as a column |
| data type in Impala. For any <codeph>INTERVAL</codeph> values stored in tables, convert them to numeric |
| values that you can add or subtract using the functions in |
| <xref href="impala_datetime_functions.xml#datetime_functions"/>. For example, if you had a table |
| <codeph>DEADLINES</codeph> with an <codeph>INT</codeph> column <codeph>TIME_PERIOD</codeph>, you could |
| construct dates N days in the future like so: |
| </p> |
| <codeblock>SELECT NOW() + INTERVAL time_period DAYS FROM deadlines;</codeblock> |
| </li> |
| |
| <li> |
| <p> |
| For <codeph>YEAR</codeph> columns, change to the smallest Impala integer type that has sufficient |
| range. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on |
| for the various numeric data types. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Change any <codeph>DECIMAL</codeph> and <codeph>NUMBER</codeph> types. If fixed-point precision is not |
| required, you can use <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> on the Impala side depending on |
| the range of values. For applications that require precise decimal values, such as financial data, you |
| might need to make more extensive changes to table structure and application logic, such as using |
| separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs |
| to manipulate them. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, |
| casting, and so on for the various numeric data types. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| <codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, and <codeph>REAL</codeph> types are supported in |
| Impala. Remove any precision and scale specifications. (In Impala, <codeph>REAL</codeph> is just an |
| alias for <codeph>DOUBLE</codeph>; columns declared as <codeph>REAL</codeph> are turned into |
| <codeph>DOUBLE</codeph> behind the scenes.) See <xref href="impala_datatypes.xml#datatypes"/> for |
| details about ranges, casting, and so on for the various numeric data types. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Most integer types from other systems have equivalents in Impala, perhaps under different names such as |
| <codeph>BIGINT</codeph> instead of <codeph>INT8</codeph>. For any that are unavailable, for example |
| <codeph>MEDIUMINT</codeph>, switch to the smallest Impala integer type that has sufficient range. |
| Remove any precision specifications. See <xref href="impala_datatypes.xml#datatypes"/> for details |
| about ranges, casting, and so on for the various numeric data types. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Remove any <codeph>UNSIGNED</codeph> constraints. All Impala numeric types are signed. See |
| <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on for the |
| various numeric data types. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For any types holding bitwise values, use an integer type with enough range to hold all the relevant |
| bits within a positive integer. See <xref href="impala_datatypes.xml#datatypes"/> for details about |
| ranges, casting, and so on for the various numeric data types. |
| </p> |
| <p> |
| For example, <codeph>TINYINT</codeph> has a maximum positive value of 127, not 255, so to manipulate |
| 8-bit bitfields as positive numbers switch to the next larger type, <codeph>SMALLINT</codeph>. |
| </p> |
| <codeblock>[localhost:21000] > select cast(127*2 as tinyint); |
| +--------------------------+ |
| | cast(127 * 2 as tinyint) | |
| +--------------------------+ |
| | -2 | |
| +--------------------------+ |
| [localhost:21000] > select cast(128 as tinyint); |
| +----------------------+ |
| | cast(128 as tinyint) | |
| +----------------------+ |
| | -128 | |
| +----------------------+ |
| [localhost:21000] > select cast(127*2 as smallint); |
| +---------------------------+ |
| | cast(127 * 2 as smallint) | |
| +---------------------------+ |
| | 254 | |
| +---------------------------+</codeblock> |
| <p> |
| Impala does not support notation such as <codeph>b'0101'</codeph> for bit literals. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For character-based large objects such as <codeph>CLOB</codeph> or <codeph>TEXT</codeph>, use |
| <codeph>STRING</codeph> for values up to 32 KB in size. Binary large objects |
| such as <codeph>BLOB</codeph>, <codeph>RAW</codeph>, <codeph>BINARY</codeph>, and |
| <codeph>VARBINARY</codeph> do not currently have an equivalent in Impala. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For Boolean-like types such as <codeph>BOOL</codeph>, use the Impala <codeph>BOOLEAN</codeph> type. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Because Impala currently does not support composite or nested types, any spatial data types in other |
| database systems do not have direct equivalents in Impala. You could represent spatial values in string |
| format and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. Where |
| practical, separate spatial types into separate tables so that Impala can still work with the |
| non-spatial data. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Take out any <codeph>DEFAULT</codeph> clauses. Impala can use data files produced from many different |
| sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of <codeph>LOAD DATA</codeph> |
| and external tables mean that Impala is flexible about the format of data files, and Impala does not |
| necessarily validate or cleanse data before querying it. When copying data through Impala |
| <codeph>INSERT</codeph> statements, you can use conditional functions such as <codeph>CASE</codeph> or |
| <codeph>NVL</codeph> to substitute some other value for <codeph>NULL</codeph> fields; see |
| <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Take out any constraints from your <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> |
| statements, for example <codeph>PRIMARY KEY</codeph>, <codeph>FOREIGN KEY</codeph>, |
| <codeph>UNIQUE</codeph>, <codeph>NOT NULL</codeph>, <codeph>UNSIGNED</codeph>, or |
| <codeph>CHECK</codeph> constraints. Impala can use data files produced from many different sources, |
| such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen |
| earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries |
| to test for <codeph>NULL</codeph> values. When copying data through Impala <codeph>INSERT</codeph> |
| statements, you can use conditional functions such as <codeph>CASE</codeph> or <codeph>NVL</codeph> to |
| substitute some other value for <codeph>NULL</codeph> fields; see |
| <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. |
| </p> |
| <p> |
| Do as much verification as practical before loading data into Impala. After data is loaded into Impala, |
| you can do further verification using SQL queries to check if values have expected ranges, if values |
| are <codeph>NULL</codeph> or not, and so on. If there is a problem with the data, you will need to |
| re-run earlier stages of the ETL process, or do an <codeph>INSERT ... SELECT</codeph> statement in |
| Impala to copy the faulty data to a new table and transform or filter out the bad values. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Take out any <codeph>CREATE INDEX</codeph>, <codeph>DROP INDEX</codeph>, and <codeph>ALTER |
| INDEX</codeph> statements, and equivalent <codeph>ALTER TABLE</codeph> statements. Remove any |
| <codeph>INDEX</codeph>, <codeph>KEY</codeph>, or <codeph>PRIMARY KEY</codeph> clauses from |
| <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. Impala is optimized for bulk |
| read operations for data warehouse-style queries, and therefore does not support indexes for its |
| tables. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Calls to built-in functions with out-of-range or otherwise incorrect arguments return |
| <codeph>NULL</codeph> in Impala rather than raising exceptions. (This rule applies even when the |
| <codeph>ABORT_ON_ERROR=true</codeph> query option is in effect.) Run small-scale queries using |
| representative data to double-check that calls to built-in functions are returning expected values |
| rather than <codeph>NULL</codeph>. For example, unsupported <codeph>CAST</codeph> operations do not |
| raise an error in Impala: |
| </p> |
| <codeblock>select cast('foo' as int); |
| +--------------------+ |
| | cast('foo' as int) | |
| +--------------------+ |
| | NULL | |
| +--------------------+</codeblock> |
| </li> |
| |
| <li> |
| <p> |
| For any other type not supported in Impala, you could represent their values in string format and write |
| UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| To detect the presence of unsupported or unconvertible data types in data files, do initial testing |
| with the <codeph>ABORT_ON_ERROR=true</codeph> query option in effect. This option causes queries to |
| fail immediately if they encounter disallowed type conversions. See |
| <xref href="impala_abort_on_error.xml#abort_on_error"/> for details. For example: |
| </p> |
| <codeblock>set abort_on_error=true; |
select count(*) from (select * from t1) contents_of_t1;
| -- The above query will fail if the data files for T1 contain any |
| -- values that can't be converted to the expected Impala data types. |
| -- For example, if T1.C1 is defined as INT but the column contains |
| -- floating-point values like 1.1, the query will return an error.</codeblock> |
| </li> |
| </ul> |
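| <p> |
| Two of the adaptations above, reshaping date literals with <codeph>regexp_replace()</codeph> and |
| substituting values for <codeph>NULL</codeph> in place of a <codeph>DEFAULT</codeph> clause, might look |
| like the following sketch. The tables <codeph>STAGING_ORDERS</codeph> and <codeph>ORDERS_CLEAN</codeph> |
| and their columns are hypothetical. The example assumes that <codeph>PLACED_STR</codeph> holds |
| zero-padded <codeph>MM/DD/YYYY</codeph> strings and that <codeph>ORDERS_CLEAN</codeph> has two columns |
| of matching types. |
| </p> |
| <codeblock>-- Rearrange MM/DD/YYYY into YYYY-MM-DD before casting to TIMESTAMP.
-- Backslashes are doubled because the string literal itself uses \ for escaping.
SELECT CAST(regexp_replace(placed_str, '([0-9]+)/([0-9]+)/([0-9]+)', '\\3-\\1-\\2')
         AS TIMESTAMP) AS placed
  FROM staging_orders;

-- Stand in for a DEFAULT clause by replacing NULL values while copying the data.
INSERT INTO orders_clean
  SELECT order_id, NVL(customer, 'unknown') FROM staging_orders;</codeblock> |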
| </conbody> |
| </concept> |
| |
| <concept id="porting_statements"> |
| |
| <title>SQL Statements to Remove or Adapt</title> |
| |
| <conbody> |
| |
| <p> |
| Some SQL statements or clauses that you might be familiar with are not currently supported in Impala: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Impala has no <codeph>DELETE</codeph> statement. Impala is intended for data warehouse-style operations |
| where you do bulk moves and transforms of large quantities of data. Instead of using |
| <codeph>DELETE</codeph>, use <codeph>INSERT OVERWRITE</codeph> to entirely replace the contents of a |
| table or partition, or use <codeph>INSERT ... SELECT</codeph> to copy a subset of data (everything but |
| the rows you intended to delete) from one table to another. See <xref href="impala_dml.xml#dml"/> for |
| an overview of Impala DML statements. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Impala has no <codeph>UPDATE</codeph> statement. Impala is intended for data warehouse-style operations |
| where you do bulk moves and transforms of large quantities of data. Instead of using |
| <codeph>UPDATE</codeph>, do all necessary transformations early in the ETL process, such as in the job |
| that generates the original data, or when copying from one table to another to convert to a particular |
| file format or partitioning scheme. See <xref href="impala_dml.xml#dml"/> for an overview of Impala DML |
| statements. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Impala has no transactional statements, such as <codeph>COMMIT</codeph> or <codeph>ROLLBACK</codeph>. |
| Impala effectively works like the <codeph>AUTOCOMMIT</codeph> mode in some database systems, where |
| changes take effect as soon as they are made. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If your database, table, column, or other names conflict with Impala reserved words, use different |
| names or quote the names with backticks. See <xref href="impala_reserved_words.xml#reserved_words"/> |
| for the current list of Impala reserved words. |
| </p> |
| <p> |
| Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or |
| column alias. For example, in <codeph>SELECT * FROM t1 NATURAL JOIN t2</codeph>, Impala does not |
| recognize the <codeph>NATURAL</codeph> keyword and interprets it as an alias for the table |
| <codeph>t1</codeph>. If you experience any unexpected behavior with queries, check the list of reserved |
| words to make sure all keywords in join and <codeph>WHERE</codeph> clauses are recognized. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Impala supports subqueries only in the <codeph>FROM</codeph> clause of a query, not within the |
| <codeph>WHERE</codeph> clauses. Therefore, you cannot use clauses such as <codeph>WHERE |
| <varname>column</varname> IN (<varname>subquery</varname>)</codeph>. Also, Impala does not allow |
| <codeph>EXISTS</codeph> or <codeph>NOT EXISTS</codeph> clauses (although <codeph>EXISTS</codeph> is a |
| reserved keyword). |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Impala supports <codeph>UNION</codeph> and <codeph>UNION ALL</codeph> set operators, but not |
| <codeph>INTERSECT</codeph>. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/> |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Impala requires an alias for every subquery used in the <codeph>FROM</codeph> clause: |
| </p> |
| <codeblock>-- Without the alias 'contents_of_t1' at the end, query gives syntax error. |
| select count(*) from (select * from t1) contents_of_t1;</codeblock> |
| </li> |
| |
| <li> |
| <p> |
| When an alias is declared for an expression in a query, that alias cannot be referenced again within |
| the <codeph>SELECT</codeph> list where it is defined: |
| </p> |
| <codeblock>-- Can't reference AVERAGE twice in the SELECT list where it's defined. |
| select avg(x) as average, average+1 from t1 group by x; |
| ERROR: AnalysisException: couldn't resolve column reference: 'average' |
| |
| -- Although it can be referenced again later in the same query. |
| select avg(x) as average from t1 group by x having average > 3;</codeblock> |
| <p> |
| For Impala, either repeat the expression, or move the expression into a <codeph>WITH</codeph> |
| clause, creating named columns that can be referenced multiple times anywhere in the base query: |
| </p> |
| <codeblock>-- The following 2 query forms are equivalent. |
| select avg(x) as average, avg(x)+1 from t1 group by x; |
| with avg_t as (select avg(x) average from t1 group by x) select average, average+1 from avg_t;</codeblock> |
| <!-- An alternative bunch of queries to use in the example above. |
| [localhost:21000] > select x*x as x_squared from t1; |
| |
| [localhost:21000] > select x*x as x_squared from t1 where x_squared < 100; |
| ERROR: AnalysisException: couldn't resolve column reference: 'x_squared' |
| [localhost:21000] > select x*x as x_squared, x_squared * pi() as pi_x_squared from t1; |
| ERROR: AnalysisException: couldn't resolve column reference: 'x_squared' |
| [localhost:21000] > select x*x as x_squared from t1 group by x_squared; |
| |
| [localhost:21000] > select x*x as x_squared from t1 group by x_squared having x_squared < 100; |
| --> |
| </li> |
| |
| <li> |
| <p> |
| Impala does not support certain rarely used join types that are less appropriate for high-volume tables |
| used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to |
| ensure you do not do inefficient joins of huge tables by accident. For example, Impala does not support |
| natural joins or anti-joins, and requires the <codeph>CROSS JOIN</codeph> operator for Cartesian |
| products. See <xref href="impala_joins.xml#joins"/> for details on the syntax for Impala join clauses. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Impala has a limited choice of partitioning types. Partitions are defined based on each distinct |
| combination of values for one or more partition key columns. Impala does not redistribute or check data |
| to create evenly distributed partitions; you must choose partition key columns based on your knowledge |
| of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning |
| to use the Impala partition syntax for <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> |
| statements. Impala partitioning is similar to range partitioning where every range has exactly one |
| value, or key partitioning where the hash function produces a separate bucket for every combination of |
| key values. See <xref href="impala_partitioning.xml#partitioning"/> for usage details, and |
| <xref href="impala_create_table.xml#create_table"/> and |
| <xref href="impala_alter_table.xml#alter_table"/> for syntax. |
| </p> |
| <note> |
| Because the number of separate partitions is potentially higher than in other database systems, keep a |
| close eye on the number of partitions and the volume of data in each one; scale back the number of |
| partition key columns if you end up with too many partitions with a small volume of data in each one. |
| Remember, to distribute work for a query across a cluster, you need at least one HDFS block per node. |
| HDFS blocks are typically multiple megabytes, <ph rev="parquet_block_size">especially</ph> for Parquet |
| files. Therefore, if each partition holds only a few megabytes of data, you are unlikely to see much |
| parallelism in the query because such a small amount of data is typically processed by a single node. |
| </note> |
| </li> |
| |
| <li> |
| <p> |
| For <q>top-N</q> queries, Impala uses the <codeph>LIMIT</codeph> clause rather than comparing against a |
| pseudocolumn named <codeph>ROWNUM</codeph> or <codeph>ROW_NUM</codeph>. See |
| <xref href="impala_limit.xml#limit"/> for details. |
| </p> |
| </li> |
| </ul> |
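| <p> |
| As an illustration of two of the rewrites above, the following sketch replaces a <codeph>DELETE</codeph> |
| with a filtered copy and a <codeph>ROWNUM</codeph> comparison with a <codeph>LIMIT</codeph> clause. The |
| tables <codeph>EVENTS</codeph>, <codeph>EVENTS_KEPT</codeph>, and <codeph>SALES_TOTALS</codeph> and |
| their columns are hypothetical, and <codeph>EVENTS_KEPT</codeph> is assumed to already exist with the |
| same column layout as <codeph>EVENTS</codeph>. |
| </p> |
| <codeblock>-- Instead of: DELETE FROM events WHERE status = 'obsolete';
-- copy every row except the unwanted ones into another table.
-- The IS NULL test keeps rows that the original DELETE would not have matched.
INSERT OVERWRITE TABLE events_kept
  SELECT * FROM events WHERE status IS NULL OR status != 'obsolete';

-- Instead of filtering on a ROWNUM pseudocolumn, sort and apply LIMIT.
SELECT name, total FROM sales_totals ORDER BY total DESC LIMIT 10;</codeblock> |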
| </conbody> |
| </concept> |
| |
| <concept id="porting_antipatterns"> |
| |
| <title>SQL Constructs to Double-Check</title> |
| |
| <conbody> |
| |
| <p> |
| Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than |
| optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC |
| applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, be alert |
| and change these things where appropriate: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| A <codeph>CREATE TABLE</codeph> statement with no <codeph>STORED AS</codeph> clause creates data files |
| in plain text format, which is convenient for data interchange but not a good choice for high-volume |
| data with high-performance queries. See <xref href="impala_file_formats.xml#file_formats"/> for why and |
| how to use specific file formats for compact data and high-performance queries. In particular, see |
| <xref href="impala_parquet.xml#parquet"/> for details about the file format most heavily optimized for |
| large-scale data warehouse queries. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| A <codeph>CREATE TABLE</codeph> statement with no <codeph>PARTITIONED BY</codeph> clause stores all the |
| data files in the same physical location, which can lead to scalability problems when the data volume |
| becomes large. |
| </p> |
| <p> |
| On the other hand, adapting tables that were already partitioned in a different database system could |
| produce an Impala table with a high number of partitions and not enough data in each one, leading to |
| underutilization of Impala's parallel query features. |
| </p> |
| <p> |
| See <xref href="impala_partitioning.xml#partitioning"/> for details about setting up partitioning and |
| tuning the performance of queries on partitioned tables. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| The <codeph>INSERT ... VALUES</codeph> syntax is suitable for setting up toy tables with a few rows for |
| functional testing, but because each such statement creates a separate tiny file in HDFS, it is not a |
| scalable technique for loading megabytes or gigabytes (let alone petabytes) of data. Consider revising |
| your data load process to produce raw data files outside of Impala, then setting up Impala external |
| tables or using the <codeph>LOAD DATA</codeph> statement to use those data files instantly in Impala |
| tables, with no conversion or indexing stage. See <xref href="impala_tables.xml#external_tables"/> and |
| <xref href="impala_load_data.xml#load_data"/> for details about the Impala techniques for working with |
| data files produced outside of Impala; see <xref href="impala_tutorial.xml#tutorial_etl"/> for examples |
| of ETL workflow for Impala. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data |
| files, or a single giant data file that cannot take advantage of distributed parallel queries or |
| partitioning. In this case, use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a |
| new table and reorganize into a more efficient layout in the same operation. See |
| <xref href="impala_insert.xml#insert"/> for details about the <codeph>INSERT</codeph> statement. |
| </p> |
| <p> |
| You can do <codeph>INSERT ... SELECT</codeph> into a table with a more efficient file format (see |
| <xref href="impala_file_formats.xml#file_formats"/>) or from an unpartitioned table into a partitioned |
| one (see <xref href="impala_partitioning.xml#partitioning"/>). |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| The number of expressions allowed in an Impala query might be smaller than for some other database |
| systems, causing failures for very complicated queries (typically produced by automated SQL |
| generators). Where practical, keep the number of expressions in the <codeph>WHERE</codeph> clauses to |
| approximately 2000 or fewer. As a workaround, set the query option |
| <codeph>DISABLE_CODEGEN=true</codeph> if queries fail for this reason. See |
| <xref href="impala_disable_codegen.xml#disable_codegen"/> for details. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If practical, rewrite <codeph>UNION</codeph> queries to use the <codeph>UNION ALL</codeph> operator |
| instead. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/> |
| </p> |
| </li> |
| </ul> |
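| <p> |
| The following sketch combines several of these points: it creates a partitioned Parquet table and |
| reorganizes existing data into it with a single <codeph>INSERT ... SELECT</codeph> statement. The |
| tables <codeph>SALES_TEXT</codeph> and <codeph>SALES_BY_YEAR</codeph> and their columns are |
| hypothetical, and <codeph>SOLD_AT</codeph> is assumed to be a <codeph>TIMESTAMP</codeph> column. |
| Whether one partition per year is the right granularity depends on your data volume. |
| </p> |
| <codeblock>-- Explicit file format and partitioning instead of the plain-text, unpartitioned default.
CREATE TABLE sales_by_year (id BIGINT, amount DOUBLE)
  PARTITIONED BY (sale_year SMALLINT)
  STORED AS PARQUET;

-- Copy and reorganize in one operation. The partition key value comes last
-- in the select list; year() returns INT, so it is cast to match the column.
INSERT INTO sales_by_year PARTITION (sale_year)
  SELECT id, amount, CAST(year(sold_at) AS SMALLINT) FROM sales_text;</codeblock> |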
| </conbody> |
| </concept> |
| |
| <concept id="porting_next"> |
| |
| <title>Next Porting Steps after Verifying Syntax and Semantics</title> |
| |
| <conbody> |
| |
| <p> |
| Some of the decisions described throughout this section also have a substantial |
| impact on performance. After your SQL code is ported and working correctly, double-check the |
| performance-related aspects of your schema design, physical layout, and queries to make sure that the |
| ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and |
| integration with Hadoop components. |
| </p> |
| |
| <ul> |
| <li> |
| Have you run the <codeph>COMPUTE STATS</codeph> statement on each table involved in join queries? Have |
| you also run <codeph>COMPUTE STATS</codeph> for each table used as the source table in an <codeph>INSERT |
| ... SELECT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> statement? |
| </li> |
| |
| <li> |
| Are you using the most efficient file format for your data volumes, table structure, and query |
| characteristics? |
| </li> |
| |
| <li> |
| Are you using partitioning effectively? That is, have you partitioned on columns that are often used for |
| filtering in <codeph>WHERE</codeph> clauses? Have you partitioned at the right granularity so that there |
| is enough data in each partition to parallelize the work for each query? |
| </li> |
| |
| <li> |
| Does your ETL process produce a relatively small number of multi-megabyte data files (good) rather than a |
| huge number of small files (bad)? |
| </li> |
| </ul> |
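| <p> |
| For the first item in the checklist, a minimal sketch of gathering and then inspecting statistics |
| follows. The table name <codeph>SALES_BY_YEAR</codeph> is hypothetical; substitute each table |
| involved in your joins or used as a source for <codeph>INSERT ... SELECT</codeph>. |
| </p> |
| <codeblock>-- Gather table and column statistics after loading the data.
COMPUTE STATS sales_by_year;

-- Confirm that row counts, file counts, and column statistics are filled in.
SHOW TABLE STATS sales_by_year;
SHOW COLUMN STATS sales_by_year;</codeblock> |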
| |
| <p> |
| See <xref href="impala_performance.xml#performance"/> for details about the whole performance tuning |
| process. |
| </p> |
| </conbody> |
| </concept> |
| </concept> |