| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept rev="2.3.0" id="complex_types"> |
| |
| <title id="nested_types">Complex Types (<keyword keyref="impala23"/> or higher only)</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Impala Data Types"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p rev="2.3.0"> |
| <indexterm audience="hidden">complex types</indexterm> |
| |
| <indexterm audience="hidden">nested types</indexterm> |
| <term>Complex types</term> (also referred to as <term>nested types</term>) let you represent multiple data values within a single |
| row/column position. They differ from the familiar column types such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>, known as |
| <term>scalar types</term> or <term>primitive types</term>, which represent a single data value within a given row/column position. |
| Impala supports the complex types <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> in <keyword keyref="impala23_full"/> |
| and higher. The Hive <codeph>UNION</codeph> type is not currently supported. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| <p> |
| Once you understand the basics of complex types, refer to the individual type topics when you need to refresh your memory about syntax |
| and examples: |
| </p> |
| |
| <ul> |
| <li> |
| <xref href="impala_array.xml#array"/> |
| </li> |
| |
| <li> |
| <xref href="impala_struct.xml#struct"/> |
| </li> |
| |
| <li> |
| <xref href="impala_map.xml#map"/> |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| <concept id="complex_types_benefits"> |
| |
| <title>Benefits of Impala Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| The reasons for using Impala complex types include the following: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| You already have data produced by Hive or other non-Impala component that uses the complex type column names. You might need to |
| convert the underlying data to Parquet to use it with Impala. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Your data model originates with a non-SQL programming language or a NoSQL data management system. For example, if you are |
| representing Python data expressed as nested lists, dictionaries, and tuples, those data structures correspond closely to Impala |
| <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> types. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Your analytic queries involving multiple tables could benefit from greater locality during join processing. By packing more |
| related data items within each HDFS data block, complex types let join queries avoid the network overhead of the traditional |
| Hadoop shuffle or broadcast join techniques. |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| The Impala complex type support produces result sets with all scalar values, and the scalar components of complex types can be used |
| with all SQL clauses, such as <codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, all kinds of joins, subqueries, and inline |
| views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other |
| programming languages to deconstruct the underlying data structures. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_overview"> |
| |
| <title>Overview of Impala Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| <!-- |
| Each <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or <codeph>STRUCT</codeph> column can include multiple instances of scalar types |
| such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>. |
| --> |
| The <codeph>ARRAY</codeph> and <codeph>MAP</codeph> types are closely related: they represent collections with arbitrary numbers of |
| elements, where each element is the same type. In contrast, <codeph>STRUCT</codeph> groups together a fixed number of items into a |
| single element. The parts of a <codeph>STRUCT</codeph> element (the <term>fields</term>) can be of different types, and each field |
| has a name. |
| </p> |
| |
| <p> |
| The elements of an <codeph>ARRAY</codeph> or <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>, can also be other |
| complex types. You can construct elaborate data structures with up to 100 levels of nesting. For example, you can make an |
| <codeph>ARRAY</codeph> whose elements are <codeph>STRUCT</codeph>s. Within each <codeph>STRUCT</codeph>, you can have some fields |
| that are <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or another kind of <codeph>STRUCT</codeph>. The Impala documentation uses the |
| terms complex and nested types interchangeably; for simplicity, it primarily uses the term complex types, to encompass all the |
| properties of these types. |
| </p> |
| |
| <p> |
| When visualizing your data model in familiar SQL terms, you can think of each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> as a |
| miniature table, and each <codeph>STRUCT</codeph> as a row within such a table. By default, the table represented by an |
| <codeph>ARRAY</codeph> has two columns, <codeph>POS</codeph> to represent ordering of elements, and <codeph>ITEM</codeph> |
| representing the value of each element. Likewise, by default, the table represented by a <codeph>MAP</codeph> encodes key-value |
| pairs, and therefore has two columns, <codeph>KEY</codeph> and <codeph>VALUE</codeph>. |
| <!-- |
| When you use a <codeph>STRUCT</codeph> as an |
| <codeph>ARRAY</codeph> element or the <codeph>VALUE</codeph> part of a <codeph>MAP</codeph>, the field names of the |
| <codeph>STRUCT</codeph> become additional columns in the result set. |
| --> |
| </p> |
| |
| <p> |
| The <codeph>ITEM</codeph> and <codeph>VALUE</codeph> names are only required for the very simplest kinds of <codeph>ARRAY</codeph> |
| and <codeph>MAP</codeph> columns, ones that hold only scalar values. When the elements within the <codeph>ARRAY</codeph> or |
| <codeph>MAP</codeph> are of type <codeph>STRUCT</codeph> rather than a scalar type, then the result set contains columns with names |
| corresponding to the <codeph>STRUCT</codeph> fields rather than <codeph>ITEM</codeph> or <codeph>VALUE</codeph>. |
| </p> |
| |
| <!-- |
| <p> |
| <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are both <term>collection</term> types, which can have a variable number of |
| elements; <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are typically used as the top-level type of a table column. |
| <codeph>STRUCT</codeph> represents a single element and has a fixed number of fields; <codeph>STRUCT</codeph> is typically used as |
| the final, lowest level of a nested type definition. |
| </p> |
| --> |
| |
| <p> |
| You write most queries that process complex type columns using familiar join syntax, even though the data for both sides of the join |
| resides in a single table. The join notation brings together the scalar values from a row with the values from the complex type |
| columns for that same row. The final result set contains all scalar values, allowing you to do all the familiar filtering, |
| aggregation, ordering, and so on for the complex data entirely in SQL or using business intelligence tools that issue SQL queries. |
| <!-- |
| Instead of pulling together values from different tables, the join selects the specified values from both |
| the scalar columns, and from inside the complex type columns, producing a flattened result set consisting of all scalar values. When |
| doing a join query involving a complex type column, Impala derives the join key automatically, without the need to create additional |
| ID columns in the table. |
| --> |
| </p> |
| |
| <p> |
| Behind the scenes, Impala ensures that the processing for each row is done efficiently on a single host, without the network traffic |
| involved in broadcast or shuffle joins. The most common type of join query for tables with complex type columns is <codeph>INNER |
| JOIN</codeph>, which returns results only in those cases where the complex type contains some elements. Therefore, most query |
| examples in this section use either the <codeph>INNER JOIN</codeph> clause or the equivalent comma notation. |
| </p> |
| |
| <note> |
| <p> |
| Although Impala can query complex types that are present in Parquet files, Impala currently cannot create new Parquet files |
| containing complex types. Therefore, the discussion and examples presume that you are working with existing Parquet data produced |
| through Hive, Spark, or some other source. See <xref href="#complex_types_ex_hive_etl"/> for examples of constructing Parquet data |
| files with complex type columns. |
| </p> |
| |
| <p> |
| For learning purposes, you can create empty tables with complex type columns and practice query syntax, even if you do not have |
| sample data with the required structure. |
| </p> |
| </note> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_design"> |
| |
| <title>Design Considerations for Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| When planning to use Impala complex types, and designing the Impala schema, first learn how this kind of schema differs from |
| traditional table layouts from the relational database and data warehousing fields. Because you might have already encountered |
| complex types in a Hadoop context while using Hive for ETL, also learn how to write high-performance analytic queries for complex |
| type data using Impala SQL syntax. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="complex_types_vs_rdbms"> |
| |
| <title>How Complex Types Differ from Traditional Data Warehouse Schemas</title> |
| |
| <conbody> |
| |
| <p> |
| Complex types let you associate arbitrary data structures with a particular row. If you are familiar with schema design for |
| relational database management systems or data warehouses, a schema with complex types has the following differences: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Logically, related values can now be grouped tightly together in the same table. |
| </p> |
| |
| <p> |
| In traditional data warehousing, related values were typically arranged in one of two ways: |
| </p> |
| <ul> |
| <li> |
| <p> |
| Split across multiple normalized tables. Foreign key columns specified which rows from each table were associated with |
| each other. This arrangement avoided duplicate data and therefore the data was compact, but join queries could be |
| expensive because the related data had to be retrieved from separate locations. (In the case of distributed Hadoop |
| queries, the joined tables might even be transmitted between different hosts in a cluster.) |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Flattened into a single denormalized table. Although this layout eliminated some potential performance issues by removing |
| the need for join queries, the table typically became larger because values were repeated. The extra data volume could |
| cause performance issues in other parts of the workflow, such as longer ETL cycles or more expensive full-table scans |
| during queries. |
| </p> |
| </li> |
| </ul> |
| <p> |
| Complex types represent a middle ground that addresses these performance and volume concerns. By physically locating related |
| data within the same data files, complex types increase locality and reduce the expense of join queries. By associating an |
| arbitrary amount of data with a single row, complex types avoid the need to repeat lengthy values such as strings. Because |
| Impala knows which complex type values are associated with each row, you can save storage by avoiding artificial foreign key |
| values that are only used for joins. The flexibility of the <codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, and |
| <codeph>MAP</codeph> types lets you model familiar constructs such as fact and dimension tables from a data warehouse, and |
| wide tables representing sparse matrixes. |
| </p> |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_physical"> |
| |
| <title>Physical Storage for Complex Types in Parquet</title> |
| |
| <conbody> |
| |
| <p> |
| Physically, the scalar and complex columns in each row are located adjacent to each other in the same Parquet data file, ensuring |
| that they are processed on the same host rather than being broadcast across the network when cross-referenced within a query. This |
| co-location simplifies the process of copying, converting, and backing all the columns up at once. Because of the column-oriented |
| layout of Parquet files, you can still query only the scalar columns of a table without imposing the I/O penalty of reading the |
| (possibly large) values of the composite columns. |
| </p> |
| |
| <p> |
| Within each Parquet data file, the constituent parts of complex type columns are stored in column-oriented format: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Each field of a <codeph>STRUCT</codeph> type is stored like a column, with all the scalar values adjacent to each other and |
| encoded, compressed, and so on using the Parquet space-saving techniques. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For an <codeph>ARRAY</codeph> containing scalar values, all those values (represented by the <codeph>ITEM</codeph> |
| pseudocolumn) are stored adjacent to each other. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For a <codeph>MAP</codeph>, the values of the <codeph>KEY</codeph> pseudocolumn are stored adjacent to each other. If the |
| <codeph>VALUE</codeph> pseudocolumn is a scalar type, its values are also stored adjacent to each other. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If an <codeph>ARRAY</codeph> element, <codeph>STRUCT</codeph> field, or <codeph>MAP</codeph> <codeph>VALUE</codeph> part is |
| another complex type, the column-oriented storage applies to the next level down (or the next level after that, and so on for |
| deeply nested types) where the final elements, fields, or values are of scalar types. |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| The numbers represented by the <codeph>POS</codeph> pseudocolumn of an <codeph>ARRAY</codeph> are not physically stored in the |
| data files. They are synthesized at query time based on the order of the <codeph>ARRAY</codeph> elements associated with each row. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_file_formats"> |
| |
| <title>File Format Support for Impala Complex Types</title> |
| |
| <conbody> |
| |
| <p> Currently, Impala queries support complex type data in the Parquet |
| and ORC file formats. See <xref href="impala_parquet.xml#parquet"/> |
| for details about the performance benefits and physical layout of |
| Parquet file format. </p> |
| |
| <p> |
| Because Impala does not parse the data structures containing nested types for unsupported formats such as text, Avro, |
| SequenceFile, or RCFile, you cannot use data files in these formats with Impala, even if the query does not refer to the nested |
| type columns. Also, if a table using an unsupported format originally contained nested type columns, and then those columns were |
| dropped from the table using <codeph>ALTER TABLE ... DROP COLUMN</codeph>, any existing data files in the table still contain the |
| nested type data and Impala queries on that table will generate errors. |
| </p> |
| |
| <p rev="2.6.0 IMPALA-2844"> |
| The one exception to the preceding rule is <codeph>COUNT(*)</codeph> queries on RCFile tables that include complex types. |
| Such queries are allowed in <keyword keyref="impala26_full"/> and higher. |
| </p> |
| |
| <p> |
| You can perform DDL operations for tables involving complex types in |
| most file formats other than Parquet or ORC. You cannot create tables |
| in Impala with complex types using text files. |
| </p> |
| |
| <p> |
| You can have a partitioned table with complex type columns that uses |
| a format other than Parquet or ORC, and use |
| <codeph>ALTER TABLE</codeph> to change the file format to Parquet/ORC |
| for individual partitions. When you put Parquet/ORC files into those |
| partitions, Impala can execute queries against that data as long as |
| the query does not involve any of the non-Parquet and non-ORC |
| partitions. |
| </p> |
| |
| <p> |
| If you use the <cmdname>parquet-tools</cmdname> command to examine the structure of a Parquet data file that includes complex |
| types, you see that both <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are represented as a <codeph>Bag</codeph> in Parquet |
| terminology, with all fields marked <codeph>Optional</codeph> because Impala allows any column to be nullable. |
| </p> |
| |
| <p> |
| Impala supports either 2-level and 3-level encoding within each Parquet data file. When constructing Parquet data files outside |
| Impala, use either encoding style but do not mix 2-level and 3-level encoding within the same data file. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_vs_normalization"> |
| |
| <title>Choosing Between Complex Types and Normalized Tables</title> |
| |
| <conbody> |
| |
| <p> |
| Choosing between multiple normalized fact and dimension tables, or a single table containing complex types, is an important design |
| decision. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| If you are coming from a traditional database or data warehousing background, you might be familiar with how to split up data |
| between tables. Your business intelligence tools might already be optimized for dealing with this kind of multi-table scenario |
| through join queries. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If you are pulling data from Impala into an application written in a programming language that has data structures analogous |
| to the complex types, such as Python or Java, complex types in Impala could simplify data interchange and improve |
| understandability and reliability of your program logic. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| You might already be faced with existing infrastructure or receive high volumes of data that assume one layout or the other. |
| For example, complex types are popular with web-oriented applications, for example to keep information about an online user |
| all in one place for convenient lookup and analysis, or to deal with sparse or constantly evolving data fields. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If some parts of the data change over time while related data remains constant, using multiple normalized tables lets you |
| replace certain parts of the data without reloading the entire data set. Conversely, if you receive related data all bundled |
| together, such as in JSON files, using complex types can save the overhead of splitting the related items across multiple |
| tables. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| From a performance perspective: |
| </p> |
| <ul> |
| <li> |
| <p> |
| In Parquet or ORC tables, Impala can skip columns that are not referenced in a query, avoiding the I/O penalty of reading the |
| embedded data. When complex types are nested within a column, the data is physically divided at a very granular level; for |
| example, a query referring to data nested multiple levels deep in a complex type column does not have to read all the data |
| from that column, only the data for the relevant parts of the column type hierarchy. |
| <!-- Avro not supported in 5.5 / 2.3: Avro tables might experience some performance overhead due to |
| the need to skip past the complex type columns in each row when reading the data. --> |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Complex types avoid the possibility of expensive join queries when data from fact and dimension tables is processed in |
| parallel across multiple hosts. All the information for a row containing complex types is typically to be in the same data |
| block, and therefore does not need to be transmitted across the network when joining fields that are all part of the same |
| row. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| The tradeoff with complex types is that fewer rows fit in each data block. Whether it is better to have more data blocks |
| with fewer rows, or fewer data blocks with many rows, depends on the distribution of your data and the characteristics of |
| your query workload. If the complex columns are rarely referenced, using them might lower efficiency. If you are seeing |
| low parallelism due to a small volume of data (relatively few data blocks) in each table partition, increasing the row |
| size by including complex columns might produce more data blocks and thus spread the work more evenly across the cluster. |
| See <xref href="impala_scalability.xml#scalability"/> for more on this advanced topic. |
| </p> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_hive"> |
| |
| <title>Differences Between Impala and Hive Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| Impala can query Parquet and ORC tables containing <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> columns |
| produced by Hive. There are some differences to be aware of between the Impala SQL and HiveQL syntax for complex types, primarily |
| for queries. |
| </p> |
| <p> |
| Impala supports a subset of the syntax that Hive supports for |
| specifying <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and |
| <codeph>MAP</codeph> types in the <codeph>CREATE TABLE</codeph> |
| statements. |
| </p> |
| |
| <p> |
| Because Impala <codeph>STRUCT</codeph> columns include user-specified field names, you use the <codeph>NAMED_STRUCT()</codeph> |
| constructor in Hive rather than the <codeph>STRUCT()</codeph> constructor when you populate an Impala <codeph>STRUCT</codeph> |
| column using a Hive <codeph>INSERT</codeph> statement. |
| </p> |
| |
| <p> |
| The Hive <codeph>UNION</codeph> type is not currently supported in Impala. |
| </p> |
| |
| <p> |
| While Impala usually aims for a high degree of compatibility with HiveQL query syntax, Impala syntax differs from Hive for queries |
| involving complex types. The differences are intended to provide extra flexibility for queries involving these kinds of tables. |
| </p> |
| |
| <ul> |
| <li> |
| Impala uses dot notation for referring to element names or elements within complex types, and join notation for |
| cross-referencing scalar columns with the elements of complex types within the same row, rather than the <codeph>LATERAL |
| VIEW</codeph> clause and <codeph>EXPLODE()</codeph> function of HiveQL. |
| </li> |
| |
| <li> |
| Using join notation lets you use all the kinds of join queries with complex type columns. For example, you can use a |
| <codeph>LEFT OUTER JOIN</codeph>, <codeph>LEFT ANTI JOIN</codeph>, or <codeph>LEFT SEMI JOIN</codeph> query to evaluate |
| different scenarios where the complex columns do or do not contain any elements. |
| </li> |
| |
| <li> |
| You can include references to collection types inside subqueries and inline views. For example, you can construct a |
| <codeph>FROM</codeph> clause where one of the <q>tables</q> is a subquery against a complex type column, or use a subquery |
| against a complex type column as the argument to an <codeph>IN</codeph> or <codeph>EXISTS</codeph> clause. |
| </li> |
| |
| <li> |
| The Impala pseudocolumn <codeph>POS</codeph> lets you retrieve the position of elements in an array along with the elements |
| themselves, equivalent to the <codeph>POSEXPLODE()</codeph> function of HiveQL. You do not use index notation to retrieve a |
| single array element in a query; the join query loops through the array elements and you use <codeph>WHERE</codeph> clauses to |
| specify which elements to return. |
| </li> |
| |
| <li> |
| <p> |
| Join clauses involving complex type columns do not require an <codeph>ON</codeph> or <codeph>USING</codeph> clause. Impala |
| implicitly applies the join key so that the correct array entries or map elements are associated with the correct row from the |
| table. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Impala does not currently support the <codeph>UNION</codeph> complex type. |
| </p> |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_limits"> |
| |
| <title>Limitations and Restrictions for Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| Complex type columns can only be used in tables or partitions with the Parquet or ORC file format. |
| </p> |
| |
| <p> |
| Complex type columns cannot be used as partition key columns in a partitioned table. |
| </p> |
| |
| <p> |
| When you use complex types with the <codeph>ORDER BY</codeph>, <codeph>GROUP BY</codeph>, <codeph>HAVING</codeph>, or |
| <codeph>WHERE</codeph> clauses, you cannot refer to the column name by itself. Instead, you refer to the names of the scalar |
| values within the complex type, such as the <codeph>ITEM</codeph>, <codeph>POS</codeph>, <codeph>KEY</codeph>, or |
| <codeph>VALUE</codeph> pseudocolumns, or the field names from a <codeph>STRUCT</codeph>. |
| </p> |
| |
| <p> |
| The maximum depth of nesting for complex types is 100 levels. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/complex_types_max_length"/> |
| |
| <p> |
| For ideal performance and scalability, use small or medium-sized collections, where all the complex columns contain at most a few |
| hundred megabytes per row. Remember, all the columns of a row are stored in the same HDFS data block, whose size in Parquet files |
| typically ranges from 256 MB to 1 GB. |
| </p> |
| |
| <p> |
| Including complex type columns in a table introduces some overhead that might make queries that do not reference those columns |
| somewhat slower than Impala queries against tables without any complex type columns. Expect at most a 2x slowdown compared to |
| tables that do not have any complex type columns. |
| </p> |
| |
| <p> |
| Currently, the <codeph>COMPUTE STATS</codeph> statement does not collect any statistics for columns containing complex types. |
| Impala uses heuristics to construct execution plans involving complex type columns. |
| </p> |
| |
| <p> |
| Currently, Impala built-in functions and user-defined functions cannot accept complex types as parameters or produce them as |
| function return values. (When the complex type values are materialized in an Impala result set, the result set contains the scalar |
| components of the values, such as the <codeph>POS</codeph> or <codeph>ITEM</codeph> for an <codeph>ARRAY</codeph>, the |
| <codeph>KEY</codeph> or <codeph>VALUE</codeph> for a <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>; these |
| scalar data items <i>can</i> be used with built-in functions and UDFs as usual.) |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/complex_types_read_only"/> |
| |
| <p> |
| Currently, Impala can query complex type columns only from Parquet/ORC tables or Parquet/ORC partitions within partitioned tables. |
| Although you can use complex types in tables with Avro, text, and other file formats as part of your ETL pipeline, for example as |
| intermediate tables populated through Hive, doing analytics through Impala requires that the data eventually ends up in a Parquet/ORC |
| table. The requirement for Parquet/ORC data files means that you can use complex types with Impala tables hosted on other kinds of |
| file storage systems such as Isilon and Amazon S3, but you cannot use Impala to query complex types from HBase tables. See |
| <xref href="impala_complex_types.xml#complex_types_file_formats"/> for more details. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="complex_types_using"> |
| |
| <title>Using Complex Types from SQL</title> |
| |
| <conbody> |
| |
| <p> |
| When using complex types through SQL in Impala, you learn the notation for <codeph>< ></codeph> delimiters for the complex |
| type columns in <codeph>CREATE TABLE</codeph> statements, and how to construct join queries to <q>unpack</q> the scalar values |
| nested inside the complex data structures. You might need to condense a traditional RDBMS or data warehouse schema into a smaller |
| number of Parquet tables, and use Hive, Spark, Pig, or other mechanism outside Impala to populate the tables with data. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="nested_types_ddl"> |
| |
| <title>Complex Type Syntax for DDL Statements</title> |
| |
| <conbody> |
| |
| <p> |
| The definition of <varname>data_type</varname>, as seen in the <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> |
| statements, now includes complex types in addition to primitive types: |
| </p> |
| |
| <codeblock> primitive_type |
| | array_type |
| | map_type |
| | struct_type |
| </codeblock> |
| |
| <p> |
| Unions are not currently supported. |
| </p> |
| <p> |
| <codeph>Array</codeph>, <codeph>struct</codeph>, and |
| <codeph>map</codeph> column type declarations are specified in the |
| <codeph>CREATE TABLE</codeph> statement. You can also add or change |
| the type of complex columns through the <codeph>ALTER TABLE</codeph> |
| statement. </p> |
| <p> Currently, Impala queries allow complex types only in tables that |
| use the Parquet or ORC format. If an Impala query encounters complex |
| types in a table or partition using other file formats, the query |
| returns a runtime error. </p> |
| <p> You can use <codeph>ALTER TABLE ... SET FILEFORMAT PARQUET</codeph> |
| to change the file format of an existing table containing complex |
| types to Parquet, after which Impala can query it. Make sure to load |
| Parquet files into the table after changing the file format, because |
| the <codeph>ALTER TABLE ... SET FILEFORMAT</codeph> statement does not |
| convert existing data to the new file format. </p> |
| |
| <p conref="../shared/impala_common.xml#common/complex_types_partitioning"/> |
| |
| <p> |
| Because use cases for Impala complex types require that you already have Parquet/ORC data files produced outside of Impala, you can |
| use the Impala <codeph>CREATE TABLE LIKE PARQUET</codeph> syntax to produce a table with columns that match the structure of an |
| existing Parquet file, including complex type columns for nested data structures. Remember to include the <codeph>STORED AS |
| PARQUET</codeph> clause in this case, because even with <codeph>CREATE TABLE LIKE PARQUET</codeph>, the default file format of the |
| resulting table is still text. |
| </p> |
| |
| <p> |
| Because the complex columns are omitted from the result set of an Impala <codeph>SELECT *</codeph> or <codeph>SELECT |
| <varname>col_name</varname></codeph> query, and because Impala currently does not support writing Parquet files with complex type |
| columns, you cannot use the <codeph>CREATE TABLE AS SELECT</codeph> syntax to create a table with nested type columns. |
| </p> |
| |
| <note> |
| <p> |
| Once you have a table set up with complex type columns, use the <codeph>DESCRIBE</codeph> and <codeph>SHOW CREATE TABLE</codeph> |
| statements to see the correct notation with <codeph><</codeph> and <codeph>></codeph> delimiters and comma and colon |
| separators within the complex type definitions. If you do not have existing data with the same layout as the table, you can |
| query the empty table to practice with the notation for the <codeph>SELECT</codeph> statement. In the <codeph>SELECT</codeph> |
| list, you use dot notation and pseudocolumns such as <codeph>ITEM</codeph>, <codeph>KEY</codeph>, and <codeph>VALUE</codeph> for |
| referring to items within the complex type columns. In the <codeph>FROM</codeph> clause, you use join notation to construct |
| table aliases for any referenced <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns. |
| </p> |
| </note> |
| |
| <!-- To do: show some simple CREATE TABLE statements for each of the complex types, without so much backstory for the schema. --> |
| |
| <p> |
| For example, when defining a table that holds contact information, you might represent phone numbers differently depending on the |
| expected layout and relationships of the data, and how well you can predict those properties in advance. |
| </p> |
| |
| <p> |
| Here are different ways that you might represent phone numbers in a traditional relational schema, with equivalent representations |
| using complex types. |
| </p> |
| |
| <fig id="complex_types_phones_flat_fixed"> |
| |
| <title>Traditional Relational Representation of Phone Numbers: Single Table</title> |
| |
| <p> |
| The traditional, simplest way to represent phone numbers in a relational table is to store all contact info in a single table, |
| with all columns having scalar types, and each potential phone number represented as a separate column. In this example, each |
| person can only have these 3 types of phone numbers. If the person does not have a particular kind of phone number, the |
| corresponding column is <codeph>NULL</codeph> for that row. |
| </p> |
| |
| <codeblock> |
| CREATE TABLE contacts_fixed_phones |
| ( |
| id BIGINT |
| , name STRING |
| , address STRING |
| , home_phone STRING |
| , work_phone STRING |
| , mobile_phone STRING |
| ) STORED AS PARQUET; |
| </codeblock> |
| |
| </fig> |
| |
| <fig id="complex_types_phones_array"> |
| |
| <title>An Array of Phone Numbers</title> |
| |
| <p> |
| Using a complex type column to represent the phone numbers adds some extra flexibility. Now there could be an unlimited number |
| of phone numbers. Because the array elements have an order but not symbolic names, you could decide in advance that |
| phone_number[0] is the home number, [1] is the work number, [2] is the mobile number, and so on. (In subsequent examples, you |
| will see how to create a more flexible naming scheme using other complex type variations, such as a <codeph>MAP</codeph> or an |
| <codeph>ARRAY</codeph> where each element is a <codeph>STRUCT</codeph>.) |
| </p> |
| |
| <codeblock><![CDATA[ |
| CREATE TABLE contacts_array_of_phones |
| ( |
| id BIGINT |
| , name STRING |
| , address STRING |
| , phone_number ARRAY < STRING > |
| ) STORED AS PARQUET; |
| ]]> |
| </codeblock> |
| |
| </fig> |
| |
| <fig id="complex_types_phones_map"> |
| |
| <title>A Map of Phone Numbers</title> |
| |
| <p> |
| Another way to represent an arbitrary set of phone numbers is with a <codeph>MAP</codeph> column. With a <codeph>MAP</codeph>, |
| each element is associated with a key value that you specify, which could be a numeric, string, or other scalar type. This |
| example uses a <codeph>STRING</codeph> key to give each phone number a name, such as <codeph>'home'</codeph> or |
| <codeph>'mobile'</codeph>. A query could filter the data based on the key values, or display the key values in reports. |
| </p> |
| |
| <codeblock><![CDATA[ |
| CREATE TABLE contacts_unlimited_phones |
| ( |
| id BIGINT, name STRING, address STRING, phone_number MAP < STRING,STRING > |
| ) STORED AS PARQUET; |
| ]]> |
| </codeblock> |
| |
| </fig> |
| |
| <fig id="complex_types_phones_flat_normalized"> |
| |
| <title>Traditional Relational Representation of Phone Numbers: Normalized Tables</title> |
| |
| <p> |
| If you are an experienced database designer, you already know how to work around the limitations of the single-table schema from |
| <xref href="#nested_types_ddl/complex_types_phones_flat_fixed"/>. By normalizing the schema, with the phone numbers in their own |
| table, you can associate an arbitrary set of phone numbers with each person, and associate additional details with each phone |
| number, such as whether it is a home, work, or mobile phone. |
| </p> |
| |
| <p> |
| The flexibility of this approach comes with some drawbacks. Reconstructing all the data for a particular person requires a join |
| query, which might require performance tuning on Hadoop because the data from each table might be transmitted from a different |
| host. Data management tasks such as backups and refreshing the data require dealing with multiple tables instead of a single |
| table. |
| </p> |
| |
| <p> |
| This example illustrates a traditional database schema to store contact info normalized across 2 tables. The fact table |
| establishes the identity and basic information about person. A dimension table stores information only about phone numbers, |
| using an ID value to associate each phone number with a person ID from the fact table. Each person can have 0, 1, or many |
| phones; the categories are not restricted to a few predefined ones; and the phone table can contain as many columns as desired, |
| to represent all sorts of details about each phone number. |
| </p> |
| |
| <codeblock> |
| CREATE TABLE fact_contacts (id BIGINT, name STRING, address STRING) STORED AS PARQUET; |
| CREATE TABLE dim_phones |
| ( |
| contact_id BIGINT |
| , category STRING |
| , international_code STRING |
| , area_code STRING |
| , exchange STRING |
| , extension STRING |
| , mobile BOOLEAN |
| , carrier STRING |
| , current BOOLEAN |
| , service_start_date TIMESTAMP |
| , service_end_date TIMESTAMP |
| ) |
| STORED AS PARQUET; |
| </codeblock> |
| |
| </fig> |
| |
| <fig id="complex_types_phones_array_struct"> |
| |
| <title>Phone Numbers Represented as an Array of Structs</title> |
| |
| <p> |
| To represent a schema equivalent to the one from <xref href="#nested_types_ddl/complex_types_phones_flat_normalized"/> using |
| complex types, this example uses an <codeph>ARRAY</codeph> where each array element is a <codeph>STRUCT</codeph>. As with the |
| earlier complex type examples, each person can have an arbitrary set of associated phone numbers. Making each array element into |
| a <codeph>STRUCT</codeph> lets us associate multiple data items with each phone number, and give a separate name and type to |
| each data item. The <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph> elements reproduce the columns of the dimension |
| table from the previous example. |
| </p> |
| |
| <p> |
| You can do all the same kinds of queries with the complex type schema as with the normalized schema from the previous example. |
| The advantages of the complex type design are in the areas of convenience and performance. Now your backup and ETL processes |
| only deal with a single table. When a query uses a join to cross-reference the information about a person with their associated |
| phone numbers, all the relevant data for each row resides in the same HDFS data block, meaning each row can be processed on a |
| single host without requiring network transmission. |
| </p> |
| |
| <codeblock><![CDATA[ |
| CREATE TABLE contacts_detailed_phones |
| ( |
| id BIGINT, name STRING, address STRING |
| , phone ARRAY < STRUCT < |
| category: STRING |
| , international_code: STRING |
| , area_code: STRING |
| , exchange: STRING |
| , extension: STRING |
| , mobile: BOOLEAN |
| , carrier: STRING |
| , current: BOOLEAN |
| , service_start_date: TIMESTAMP |
| , service_end_date: TIMESTAMP |
| >> |
| ) STORED AS PARQUET; |
| ]]> |
| </codeblock> |
| |
| </fig> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_sql"> |
| |
| <title>SQL Statements that Support Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| The Impala SQL statements that support complex types are currently |
| <codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>, |
| <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph>, |
| <codeph><xref href="impala_describe.xml#describe">DESCRIBE</xref></codeph>, |
| <codeph><xref href="impala_load_data.xml#load_data">LOAD DATA</xref></codeph>, and |
| <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph>. That is, currently Impala can create or alter tables |
| containing complex type columns, examine the structure of a table containing complex type columns, import existing data files |
| containing complex type columns into a table, and query Parquet/ORC tables containing complex types. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/complex_types_read_only"/> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="complex_types_ddl"> |
| |
| <title>DDL Statements and Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| Column specifications for complex or nested types use <codeph><</codeph> and <codeph>></codeph> delimiters: |
| </p> |
| |
| <codeblock><![CDATA[-- What goes inside the < > for an ARRAY is a single type, either a scalar or another |
| -- complex type (ARRAY, STRUCT, or MAP). |
| CREATE TABLE array_t |
| ( |
| id BIGINT, |
| a1 ARRAY <STRING>, |
| a2 ARRAY <BIGINT>, |
| a3 ARRAY <TIMESTAMP>, |
| a4 ARRAY <STRUCT <f1: STRING, f2: INT, f3: BOOLEAN>> |
| ) |
| STORED AS PARQUET; |
| |
| -- What goes inside the < > for a MAP is two comma-separated types specifying the types of the key-value pair: |
| -- a scalar type representing the key, and a scalar or complex type representing the value. |
| CREATE TABLE map_t |
| ( |
| id BIGINT, |
| m1 MAP <STRING, STRING>, |
| m2 MAP <STRING, BIGINT>, |
| m3 MAP <BIGINT, STRING>, |
| m4 MAP <BIGINT, BIGINT>, |
| m5 MAP <STRING, ARRAY <STRING>> |
| ) |
| STORED AS PARQUET; |
| |
| -- What goes inside the < > for a STRUCT is a comma-separated list of fields, each field defined as |
| -- name:type. The type can be a scalar or a complex type. The field names for each STRUCT do not clash |
| -- with the names of table columns or fields in other STRUCTs. A STRUCT is most often used inside |
| -- an ARRAY or a MAP rather than as a top-level column. |
| CREATE TABLE struct_t |
| ( |
| id BIGINT, |
| s1 STRUCT <f1: STRING, f2: BIGINT>, |
| s2 ARRAY <STRUCT <f1: INT, f2: TIMESTAMP>>, |
| s3 MAP <BIGINT, STRUCT <name: STRING, birthday: TIMESTAMP>> |
| ) |
| STORED AS PARQUET; |
| ]]> |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_queries"> |
| |
| <title>Queries and Complex Types</title> |
| |
| <conbody> |
| |
| <!-- Hive does the JSON output business: http://www.datascience-labs.com/hive/hiveql-data-manipulation/ --> |
| |
| <!-- SELECT * works but skips any nested type coloumns. --> |
| |
| <p> |
| The result set of an Impala query always contains all scalar types; the elements and fields within any complex type queries must |
| be <q>unpacked</q> using join queries. A query cannot directly retrieve the entire value for a complex type column. Impala |
| returns an error in this case. Queries using <codeph>SELECT *</codeph> are allowed for tables with complex types, but the |
| columns with complex types are skipped. |
| </p> |
| |
| <p> |
| The following example shows how referring directly to a complex type column returns an error, while <codeph>SELECT *</codeph> on |
| the same table succeeds, but only retrieves the scalar columns. |
| </p> |
| |
| <note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/> |
| |
| <!-- Original error message: |
| ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING,o_totalprice:DECIMAL(12,2),o_orderdate:STRING,o_orderpriority:STRING,o_clerk:STRING,o_shippriority:INT,o_comment:STRING,o_lineitems:ARRAY<STRUCT<l_partkey:BIGINT,l_suppkey:BIGINT,l_linenumber:INT,l_quantity:DECIMAL(12,2),l_extendedprice:DECIMAL(12,2),l_discount:DECIMAL(12,2),l_tax:DECIMAL(12,2),l_returnflag:STRING,l_linestatus:STRING,l_shipdate:STRING,l_commitdate:STRING,l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'. |
| --> |
| |
| <codeblock><![CDATA[SELECT c_orders FROM customer LIMIT 1; |
| ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING, ... l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'. |
| Only scalar types are allowed in the select list. |
| |
| -- Original column has several scalar and one complex column. |
| DESCRIBE customer; |
| +--------------+------------------------------------+ |
| | name | type | |
| +--------------+------------------------------------+ |
| | c_custkey | bigint | |
| | c_name | string | |
| ... |
| | c_orders | array<struct< | |
| | | o_orderkey:bigint, | |
| | | o_orderstatus:string, | |
| | | o_totalprice:decimal(12,2), | |
| ... |
| | | >> | |
| +--------------+------------------------------------+ |
| |
| -- When we SELECT * from that table, only the scalar columns come back in the result set. |
| CREATE TABLE select_star_customer STORED AS PARQUET AS SELECT * FROM customer; |
| +------------------------+ |
| | summary | |
| +------------------------+ |
| | Inserted 150000 row(s) | |
| +------------------------+ |
| |
| -- The c_orders column, being of complex type, was not included in the SELECT * result set. |
| DESC select_star_customer; |
| +--------------+---------------+ |
| | name | type | |
| +--------------+---------------+ |
| | c_custkey | bigint | |
| | c_name | string | |
| | c_address | string | |
| | c_nationkey | smallint | |
| | c_phone | string | |
| | c_acctbal | decimal(12,2) | |
| | c_mktsegment | string | |
| | c_comment | string | |
| +--------------+---------------+ |
| ]]> |
| </codeblock> |
| |
| <!-- To do: These "references to..." bits could be promoted to their own 'expressions' subheads. --> |
| |
| <p> |
| References to fields within <codeph>STRUCT</codeph> columns use dot notation. If the field name is unambiguous, you can omit |
| qualifiers such as table name, column name, or even the <codeph>ITEM</codeph> or <codeph>VALUE</codeph> pseudocolumn names for |
| <codeph>STRUCT</codeph> elements inside an <codeph>ARRAY</codeph> or a <codeph>MAP</codeph>. |
| </p> |
| |
| <!-- To do: rewrite example to use CUSTOMER table. --> |
| |
| <!-- Don't think TPC-H schema has a bare STRUCT to use in such a simple query though. --> |
| |
| <!-- Perhaps reuse the STRUCT_DEMO example here. --> |
| |
| <codeblock>SELECT id, address.city FROM customers WHERE address.zip = 94305; |
| </codeblock> |
| |
| <p> |
| References to elements within <codeph>ARRAY</codeph> columns use the <codeph>ITEM</codeph> pseudocolumn: |
| </p> |
| |
| <!-- To do: shorten qualified names. --> |
| |
| <codeblock>select r_name, r_nations.item.n_name from region, region.r_nations limit 7; |
| +--------+----------------+ |
| | r_name | item.n_name | |
| +--------+----------------+ |
| | EUROPE | UNITED KINGDOM | |
| | EUROPE | RUSSIA | |
| | EUROPE | ROMANIA | |
| | EUROPE | GERMANY | |
| | EUROPE | FRANCE | |
| | ASIA | VIETNAM | |
| | ASIA | CHINA | |
| +--------+----------------+ |
| </codeblock> |
| |
| <p> |
| References to fields within <codeph>MAP</codeph> columns use the <codeph>KEY</codeph> and <codeph>VALUE</codeph> pseudocolumns. |
| In this example, once the query establishes the alias <codeph>MAP_FIELD</codeph> for a <codeph>MAP</codeph> column with a |
| <codeph>STRING</codeph> key and an <codeph>INT</codeph> value, the query can refer to <codeph>MAP_FIELD.KEY</codeph> and |
| <codeph>MAP_FIELD.VALUE</codeph>, which have zero, one, or many instances for each row from the containing table. |
| </p> |
| |
| <codeblock><![CDATA[DESCRIBE table_0; |
| +---------+-----------------------+ |
| | name | type | |
| +---------+-----------------------+ |
| | field_0 | string | |
| | field_1 | map<string,int> | |
| ... |
| |
| SELECT field_0, map_field.key, map_field.value |
| FROM table_0, table_0.field_1 AS map_field |
| WHERE length(field_0) = 1 |
| LIMIT 10; |
| +---------+-----------+-------+ |
| | field_0 | key | value | |
| +---------+-----------+-------+ |
| | b | gshsgkvd | NULL | |
| | b | twrtcxj6 | 18 | |
| | b | 2vp5 | 39 | |
| | b | fh0s | 13 | |
| | v | 2 | 41 | |
| | v | 8b58mz | 20 | |
| | v | hw | 16 | |
| | v | 65l388pyt | 29 | |
| | v | 03k68g91z | 30 | |
| | v | r2hlg5b | NULL | |
| +---------+-----------+-------+ |
| ]]> |
| </codeblock> |
| |
| <!-- To do: refer to or reuse examples from the other subtopics that discuss pseudocolumns etc. --> |
| |
| <p> |
| When complex types are nested inside each other, you use a combination of joins, pseudocolumn names, and dot notation to refer |
| to specific fields at the appropriate level. This is the most frequent form of query syntax for complex columns, because the |
| typical use case involves two levels of complex types, such as an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements. |
| </p> |
| |
| <!-- To do: rewrite example to use CUSTOMER table. --> |
| |
| <!-- This is my own manufactured example so I have the table, and the query works, but I don't have sample data to show. --> |
| |
| <codeblock>SELECT id, phone_numbers.area_code FROM contact_info_many_structs INNER JOIN contact_info_many_structs.phone_numbers phone_numbers LIMIT 3; |
| </codeblock> |
| |
| <p> |
| You can express relationships between <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns at different levels as joins. You |
| include comparison operators between fields at the top level and within the nested type columns so that Impala can do the |
| appropriate join operation. |
| </p> |
| |
| <!-- Don't think TPC-H schema has any instances where outer field matches up with inner one though. --> |
| |
| <!-- But don't think this usage is important enough to call out at this early point. Hide the example for now. --> |
| |
| <!-- |
| <codeblock>SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc; |
| SELECT c.id, o.txn_id FROM customers c, c.orders o; |
| </codeblock> |
| --> |
| |
| <!-- To do: move these examples down, to the examples subtopic at the end. --> |
| |
| <note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/> |
| |
| <p> |
| For example, the following queries work equivalently. They each return customer and order data for customers that have at least |
| one order. |
| </p> |
| |
| <codeblock>SELECT c.c_name, o.o_orderkey FROM customer c, c.c_orders o LIMIT 5; |
| +--------------------+------------+ |
| | c_name | o_orderkey | |
| +--------------------+------------+ |
| | Customer#000072578 | 558821 | |
| | Customer#000072578 | 2079810 | |
| | Customer#000072578 | 5768068 | |
| | Customer#000072578 | 1805604 | |
| | Customer#000072578 | 3436389 | |
| +--------------------+------------+ |
| |
| SELECT c.c_name, o.o_orderkey FROM customer c INNER JOIN c.c_orders o LIMIT 5; |
| +--------------------+------------+ |
| | c_name | o_orderkey | |
| +--------------------+------------+ |
| | Customer#000072578 | 558821 | |
| | Customer#000072578 | 2079810 | |
| | Customer#000072578 | 5768068 | |
| | Customer#000072578 | 1805604 | |
| | Customer#000072578 | 3436389 | |
| +--------------------+------------+ |
| </codeblock> |
| |
| <p> |
| The following query using an outer join returns customers that have orders, plus customers with no orders (no entries in the |
| <codeph>C_ORDERS</codeph> array): |
| </p> |
| |
| <codeblock><![CDATA[SELECT c.c_custkey, o.o_orderkey |
| FROM customer c LEFT OUTER JOIN c.c_orders o |
| LIMIT 5; |
| +-----------+------------+ |
| | c_custkey | o_orderkey | |
| +-----------+------------+ |
| | 60210 | NULL | |
| | 147873 | NULL | |
| | 72578 | 558821 | |
| | 72578 | 2079810 | |
| | 72578 | 5768068 | |
| +-----------+------------+ |
| ]]> |
| </codeblock> |
| |
| <p> |
| The following query returns <i>only</i> customers that have no orders. (With <codeph>LEFT ANTI JOIN</codeph> or <codeph>LEFT |
| SEMI JOIN</codeph>, the query can only refer to columns from the left-hand table, because by definition there is no matching |
| information in the right-hand table.) |
| </p> |
| |
| <codeblock><![CDATA[SELECT c.c_custkey, c.c_name |
| FROM customer c LEFT ANTI JOIN c.c_orders o |
| LIMIT 5; |
| +-----------+--------------------+ |
| | c_custkey | c_name | |
| +-----------+--------------------+ |
| | 60210 | Customer#000060210 | |
| | 147873 | Customer#000147873 | |
| | 141576 | Customer#000141576 | |
| | 85365 | Customer#000085365 | |
| | 70998 | Customer#000070998 | |
| +-----------+--------------------+ |
| ]]> |
| </codeblock> |
| |
| <!-- To do: promote the correlated subquery aspect into its own subtopic. --> |
| |
| <p> |
| You can also perform correlated subqueries to examine the properties of complex type columns for each row in the result set. |
| </p> |
| |
| <p> |
| Count the number of orders per customer. Note the correlated reference to the table alias <codeph>C</codeph>. The |
| <codeph>COUNT(*)</codeph> operation applies to all the elements of the <codeph>C_ORDERS</codeph> array for the corresponding |
| row, avoiding the need for a <codeph>GROUP BY</codeph> clause. |
| </p> |
| |
| <codeblock>select c_name, howmany FROM customer c, (SELECT COUNT(*) howmany FROM c.c_orders) v limit 5; |
| +--------------------+---------+ |
| | c_name | howmany | |
| +--------------------+---------+ |
| | Customer#000030065 | 15 | |
| | Customer#000065455 | 18 | |
| | Customer#000113644 | 21 | |
| | Customer#000111078 | 0 | |
| | Customer#000024621 | 0 | |
| +--------------------+---------+ |
| </codeblock> |
| |
| <p> |
| Count the number of orders per customer, ignoring any customers that have not placed any orders: |
| </p> |
| |
| <codeblock>SELECT c_name, howmany_orders |
| FROM |
| customer c, |
| (SELECT COUNT(*) howmany_orders FROM c.c_orders) subq1 |
| WHERE howmany_orders > 0 |
| LIMIT 5; |
| +--------------------+----------------+ |
| | c_name | howmany_orders | |
| +--------------------+----------------+ |
| | Customer#000072578 | 7 | |
| | Customer#000046378 | 26 | |
| | Customer#000069815 | 11 | |
| | Customer#000079058 | 12 | |
| | Customer#000092239 | 26 | |
| +--------------------+----------------+ |
| </codeblock> |
| |
| <p> |
| Count the number of line items in each order. The reference to <codeph>C.C_ORDERS</codeph> in the <codeph>FROM</codeph> clause |
| is needed because the <codeph>O_ORDERKEY</codeph> field is a member of the elements in the <codeph>C_ORDERS</codeph> array. The |
| subquery labelled <codeph>SUBQ1</codeph> is correlated: it is re-evaluated for the <codeph>C_ORDERS.O_LINEITEMS</codeph> array |
| from each row of the <codeph>CUSTOMERS</codeph> table. |
| </p> |
| |
| <codeblock>SELECT c_name, o_orderkey, howmany_line_items |
| FROM |
| customer c, |
| c.c_orders t2, |
| (SELECT COUNT(*) howmany_line_items FROM c.c_orders.o_lineitems) subq1 |
| WHERE howmany_line_items > 0 |
| LIMIT 5; |
| +--------------------+------------+--------------------+ |
| | c_name | o_orderkey | howmany_line_items | |
| +--------------------+------------+--------------------+ |
| | Customer#000020890 | 1884930 | 95 | |
| | Customer#000020890 | 4570754 | 95 | |
| | Customer#000020890 | 3771072 | 95 | |
| | Customer#000020890 | 2555489 | 95 | |
| | Customer#000020890 | 919171 | 95 | |
| +--------------------+------------+--------------------+ |
| </codeblock> |
| |
| <p> |
| Get the number of orders, the average order price, and the maximum items in any order per customer. For this example, the |
| subqueries labelled <codeph>SUBQ1</codeph> and <codeph>SUBQ2</codeph> are correlated: they are re-evaluated for each row from |
| the original <codeph>CUSTOMER</codeph> table, and only apply to the complex columns associated with that row. |
| </p> |
| |
| <codeblock>SELECT c_name, howmany, average_price, most_items |
| FROM |
| customer c, |
| (SELECT COUNT(*) howmany, AVG(o_totalprice) average_price FROM c.c_orders) subq1, |
| (SELECT MAX(l_quantity) most_items FROM c.c_orders.o_lineitems ) subq2 |
| LIMIT 5; |
| +--------------------+---------+---------------+------------+ |
| | c_name | howmany | average_price | most_items | |
| +--------------------+---------+---------------+------------+ |
| | Customer#000030065 | 15 | 128908.34 | 50.00 | |
| | Customer#000088191 | 0 | NULL | NULL | |
| | Customer#000101555 | 10 | 164250.31 | 50.00 | |
| | Customer#000022092 | 0 | NULL | NULL | |
| | Customer#000036277 | 27 | 166040.06 | 50.00 | |
| +--------------------+---------+---------------+------------+ |
| </codeblock> |
| |
| <p> |
| For example, these queries show how to access information about the <codeph>ARRAY</codeph> elements within the |
| <codeph>CUSTOMER</codeph> table from the <q>nested TPC-H</q> schema, starting with the initial <codeph>ARRAY</codeph> elements |
| and progressing to examine the <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph>, and then the elements nested within |
| another <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>: |
| </p> |
| |
| <codeblock><![CDATA[-- How many orders does each customer have? |
| -- The type of the ARRAY column doesn't matter, this is just counting the elements. |
| SELECT c_custkey, count(*) |
| FROM customer, customer.c_orders |
| GROUP BY c_custkey |
| LIMIT 5; |
| +-----------+----------+ |
| | c_custkey | count(*) | |
| +-----------+----------+ |
| | 61081 | 21 | |
| | 115987 | 15 | |
| | 69685 | 19 | |
| | 109124 | 15 | |
| | 50491 | 12 | |
| +-----------+----------+ |
| |
| -- How many line items are part of each customer order? |
| -- Now we examine a field from a STRUCT nested inside the ARRAY. |
| SELECT c_custkey, c_orders.o_orderkey, count(*) |
| FROM customer, customer.c_orders c_orders, c_orders.o_lineitems |
| GROUP BY c_custkey, c_orders.o_orderkey |
| LIMIT 5; |
| +-----------+------------+----------+ |
| | c_custkey | o_orderkey | count(*) | |
| +-----------+------------+----------+ |
| | 63367 | 4985959 | 7 | |
| | 53989 | 1972230 | 2 | |
| | 143513 | 5750498 | 5 | |
| | 17849 | 4857989 | 1 | |
| | 89881 | 1046437 | 1 | |
| +-----------+------------+----------+ |
| |
| -- What are the line items in each customer order? |
| -- One of the STRUCT fields inside the ARRAY is another |
| -- ARRAY containing STRUCT elements. The join finds |
| -- all the related items from both levels of ARRAY. |
| SELECT c_custkey, o_orderkey, l_partkey |
| FROM customer, customer.c_orders, c_orders.o_lineitems |
| LIMIT 5; |
| +-----------+------------+-----------+ |
| | c_custkey | o_orderkey | l_partkey | |
| +-----------+------------+-----------+ |
| | 113644 | 2738497 | 175846 | |
| | 113644 | 2738497 | 27309 | |
| | 113644 | 2738497 | 175873 | |
| | 113644 | 2738497 | 88559 | |
| | 113644 | 2738497 | 8032 | |
| +-----------+------------+-----------+ |
| ]]> |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="pseudocolumns"> |
| |
| <title>Pseudocolumns for ARRAY and MAP Types</title> |
| |
| <conbody> |
| |
| <p> |
| Each element in an <codeph>ARRAY</codeph> type has a position, indexed starting from zero, and a value. Each element in a |
| <codeph>MAP</codeph> type represents a key-value pair. Impala provides pseudocolumns that let you retrieve this metadata as part |
| of a query, or filter query results by including such things in a <codeph>WHERE</codeph> clause. You refer to the pseudocolumns as |
| part of qualified column names in queries: |
| </p> |
| |
| <ul> |
| <li> |
| <codeph>ITEM</codeph>: The value of an array element. If the <codeph>ARRAY</codeph> contains <codeph>STRUCT</codeph> elements, |
| you can refer to either <codeph><varname>array_name</varname>.ITEM.<varname>field_name</varname></codeph> or use the shorthand |
| <codeph><varname>array_name</varname>.<varname>field_name</varname></codeph>. |
| </li> |
| |
| <li> |
| <codeph>POS</codeph>: The position of an element within an array. |
| </li> |
| |
| <li> |
| <codeph>KEY</codeph>: The value forming the first part of a key-value pair in a map. It is not necessarily unique. |
| </li> |
| |
| <li> |
| <codeph>VALUE</codeph>: The data item forming the second part of a key-value pair in a map. If the <codeph>VALUE</codeph> part |
| of the <codeph>MAP</codeph> element is a <codeph>STRUCT</codeph>, you can refer to either |
| <codeph><varname>map_name</varname>.VALUE.<varname>field_name</varname></codeph> or use the shorthand |
| <codeph><varname>map_name</varname>.<varname>field_name</varname></codeph>. |
| </li> |
| </ul> |
| |
| <!-- To do: Consider whether to move the detailed subtopics underneath ARRAY and MAP instead of embedded here. --> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="item"> |
| |
| <title id="pos">ITEM and POS Pseudocolumns</title> |
| |
| <conbody> |
| |
| <p> |
| When an <codeph>ARRAY</codeph> column contains <codeph>STRUCT</codeph> elements, you can refer to a field within the |
| <codeph>STRUCT</codeph> using a qualified name of the form |
| <codeph><varname>array_column</varname>.<varname>field_name</varname></codeph>. If the <codeph>ARRAY</codeph> contains scalar |
| values, Impala recognizes the special name <codeph><varname>array_column</varname>.ITEM</codeph> to represent the value of each |
| scalar array element. For example, if a column contained an <codeph>ARRAY</codeph> where each element was a |
| <codeph>STRING</codeph>, you would use <codeph><varname>array_name</varname>.ITEM</codeph> to refer to each scalar value in the |
| <codeph>SELECT</codeph> list, or the <codeph>WHERE</codeph> or other clauses. |
| </p> |
| |
| <p> |
| This example shows a table with two <codeph>ARRAY</codeph> columns whose elements are of the scalar type |
| <codeph>STRING</codeph>. When referring to the values of the array elements in the <codeph>SELECT</codeph> list, |
| <codeph>WHERE</codeph> clause, or <codeph>ORDER BY</codeph> clause, you use the <codeph>ITEM</codeph> pseudocolumn because |
| within the array, the individual elements have no defined names. |
| </p> |
| |
| <codeblock><![CDATA[create TABLE persons_of_interest |
| ( |
| person_id BIGINT, |
| aliases ARRAY <STRING>, |
| associates ARRAY <STRING>, |
| real_name STRING |
| ) |
| STORED AS PARQUET; |
| |
| -- Get all the aliases of each person. |
| SELECT real_name, aliases.ITEM |
| FROM persons_of_interest, persons_of_interest.aliases |
| ORDER BY real_name, aliases.item; |
| |
| -- Search for particular associates of each person. |
| SELECT real_name, associates.ITEM |
| FROM persons_of_interest, persons_of_interest.associates |
| WHERE associates.item LIKE '% MacGuffin'; |
| ]]> |
| </codeblock> |
| |
| <p> |
| Because an array is inherently an ordered data structure, Impala recognizes the special name |
| <codeph><varname>array_column</varname>.POS</codeph> to represent the numeric position of each element within the array. The |
| <codeph>POS</codeph> pseudocolumn lets you filter or reorder the result set based on the sequence of array elements. |
| </p> |
| |
| <p> |
| The following example uses a table from a flattened version of the TPC-H schema. The <codeph>REGION</codeph> table only has a |
| few rows, such as one row for Europe and one for Asia. The row for each region represents all the countries in that region as an |
| <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements: |
| </p> |
| |
| <codeblock><![CDATA[[localhost:21000] > desc region; |
| +-------------+--------------------------------------------------------------------+ |
| | name | type | |
| +-------------+--------------------------------------------------------------------+ |
| | r_regionkey | smallint | |
| | r_name | string | |
| | r_comment | string | |
| | r_nations | array<struct<n_nationkey:smallint,n_name:string,n_comment:string>> | |
| +-------------+--------------------------------------------------------------------+ |
| ]]> |
| </codeblock> |
| |
| <p> |
| To find the countries within a specific region, you use a join query. To find out the order of elements in the array, you also |
| refer to the <codeph>POS</codeph> pseudocolumn in the select list: |
| </p> |
| |
| <codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, <b>r2.POS</b> |
| > FROM region r1 INNER JOIN r1.r_nations r2 |
| > WHERE r1.r_name = 'ASIA'; |
| +--------+-----------+-----+ |
| | r_name | n_name | pos | |
| +--------+-----------+-----+ |
| | ASIA | VIETNAM | 0 | |
| | ASIA | CHINA | 1 | |
| | ASIA | JAPAN | 2 | |
| | ASIA | INDONESIA | 3 | |
| | ASIA | INDIA | 4 | |
| +--------+-----------+-----+ |
| </codeblock> |
| |
| <p> |
| Once you know the positions of the elements, you can use that information in subsequent queries, for example to change the |
| ordering of results from the complex type column or to filter certain elements from the array: |
| </p> |
| |
| <codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS |
| > FROM region r1 INNER JOIN r1.r_nations r2 |
| > WHERE r1.r_name = 'ASIA' |
| > <b>ORDER BY r2.POS DESC</b>; |
| +--------+-----------+-----+ |
| | r_name | n_name | pos | |
| +--------+-----------+-----+ |
| | ASIA | INDIA | 4 | |
| | ASIA | INDONESIA | 3 | |
| | ASIA | JAPAN | 2 | |
| | ASIA | CHINA | 1 | |
| | ASIA | VIETNAM | 0 | |
| +--------+-----------+-----+ |
| [localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS |
| > FROM region r1 INNER JOIN r1.r_nations r2 |
| > WHERE r1.r_name = 'ASIA' AND <b>r2.POS BETWEEN 1 and 3</b>; |
| +--------+-----------+-----+ |
| | r_name | n_name | pos | |
| +--------+-----------+-----+ |
| | ASIA | CHINA | 1 | |
| | ASIA | JAPAN | 2 | |
| | ASIA | INDONESIA | 3 | |
| +--------+-----------+-----+ |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="key"> |
| |
| <title id="value">KEY and VALUE Pseudocolumns</title> |
| |
| <conbody> |
| |
| <p> |
| The <codeph>MAP</codeph> data type is suitable for representing sparse or wide data structures, where each row might only have |
| entries for a small subset of named fields. Because the element names (the map keys) vary depending on the row, a query must be |
| able to refer to both the key and the value parts of each key-value pair. The <codeph>KEY</codeph> and <codeph>VALUE</codeph> |
| pseudocolumns let you refer to the parts of the key-value pair independently within the query, as |
| <codeph><varname>map_column</varname>.KEY</codeph> and <codeph><varname>map_column</varname>.VALUE</codeph>. |
| </p> |
| |
| <p> |
| The <codeph>KEY</codeph> must always be a scalar type, such as <codeph>STRING</codeph>, <codeph>BIGINT</codeph>, or |
| <codeph>TIMESTAMP</codeph>. It can be <codeph>NULL</codeph>. Values of the <codeph>KEY</codeph> field are not necessarily unique |
| within the same <codeph>MAP</codeph>. You apply any required <codeph>DISTINCT</codeph>, <codeph>GROUP BY</codeph>, and other |
| clauses in the query, and loop through the result set to process all the values matching any specified keys. |
| </p> |
| |
| <p> |
| The <codeph>VALUE</codeph> can be either a scalar type or another complex type. If the <codeph>VALUE</codeph> is a |
| <codeph>STRUCT</codeph>, you can construct a qualified name |
| <codeph><varname>map_column</varname>.VALUE.<varname>struct_field</varname></codeph> to refer to the individual fields inside |
| the value part. If the <codeph>VALUE</codeph> is an <codeph>ARRAY</codeph> or another <codeph>MAP</codeph>, you must include |
| another join condition that establishes a table alias for <codeph><varname>map_column</varname>.VALUE</codeph>, and then |
| construct another qualified name using that alias, for example <codeph><varname>table_alias</varname>.ITEM</codeph> or |
| <codeph><varname>table_alias</varname>.KEY</codeph> and <codeph><varname>table_alias</varname>.VALUE</codeph> |
| </p> |
| |
| <p> |
| The following example shows different ways to access a <codeph>MAP</codeph> column using the <codeph>KEY</codeph> and |
| <codeph>VALUE</codeph> pseudocolumns. The <codeph>DETAILS</codeph> column has a <codeph>STRING</codeph> first part with short, |
| standardized values such as <codeph>'Recurring'</codeph>, <codeph>'Lucid'</codeph>, or <codeph>'Anxiety'</codeph>. This is the |
| <q>key</q> that is used to look up particular kinds of elements from the <codeph>MAP</codeph>. The second part, also a |
| <codeph>STRING</codeph>, is a longer free-form explanation. Impala gives you the standard pseudocolumn names |
| <codeph>KEY</codeph> and <codeph>VALUE</codeph> for the two parts, and you apply your own conventions and interpretations to the |
| underlying values. |
| </p> |
| |
| <note> |
| If you find that the single-item nature of the <codeph>VALUE</codeph> makes it difficult to model your data accurately, the |
| solution is typically to add some nesting to the complex type. For example, to have several sets of key-value pairs, make the |
| column an <codeph>ARRAY</codeph> whose elements are <codeph>MAP</codeph>. To make a set of key-value pairs that holds more |
| elaborate information, make a <codeph>MAP</codeph> column whose <codeph>VALUE</codeph> part contains an <codeph>ARRAY</codeph> |
| or a <codeph>STRUCT</codeph>. |
| </note> |
| |
| <codeblock><![CDATA[CREATE TABLE dream_journal |
| ( |
| dream_id BIGINT, |
| details MAP <STRING,STRING> |
| ) |
| STORED AS PARQUET; |
| ]]> |
| |
| -- What are all the types of dreams that are recorded? |
| SELECT DISTINCT details.KEY FROM dream_journal, dream_journal.details; |
| |
| -- How many lucid dreams were recorded? |
| -- Because there is no GROUP BY, we count the 'Lucid' keys across all rows. |
| SELECT <b>COUNT(details.KEY)</b> |
| FROM dream_journal, dream_journal.details |
| WHERE <b>details.KEY = 'Lucid'</b>; |
| |
| -- Print a report of a subset of dreams, filtering based on both the lookup key |
| -- and the detailed value. |
| SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE AS "Dream Summary"</b> |
| FROM dream_journal, dream_journal.details |
| WHERE |
| <b>details.KEY IN ('Happy', 'Pleasant', 'Joyous')</b> |
| AND <b>details.VALUE LIKE '%childhood%'</b>; |
| </codeblock> |
| |
| <p> |
| The following example shows a more elaborate version of the previous table, where the <codeph>VALUE</codeph> part of the |
| <codeph>MAP</codeph> entry is a <codeph>STRUCT</codeph> rather than a scalar type. Now instead of referring to the |
| <codeph>VALUE</codeph> pseudocolumn directly, you use dot notation to refer to the <codeph>STRUCT</codeph> fields inside it. |
| </p> |
| |
| <codeblock><![CDATA[CREATE TABLE better_dream_journal |
| ( |
| dream_id BIGINT, |
| details MAP <STRING,STRUCT <summary: STRING, when_happened: TIMESTAMP, duration: DECIMAL(5,2), woke_up: BOOLEAN> > |
| ) |
| STORED AS PARQUET; |
| ]]> |
| |
| -- Do more elaborate reporting and filtering by examining multiple attributes within the same dream. |
| SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE.summary AS "Dream Summary"</b>, <b>details.VALUE.duration AS "Duration"</b> |
| FROM better_dream_journal, better_dream_journal.details |
| WHERE |
| <b>details.KEY IN ('Anxiety', 'Nightmare')</b> |
| AND <b>details.VALUE.duration > 60</b> |
| AND <b>details.VALUE.woke_up = TRUE</b>; |
| |
| -- Remember that if the ITEM or VALUE contains a STRUCT, you can reference |
| -- the STRUCT fields directly without the .ITEM or .VALUE qualifier. |
| SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.summary AS "Dream Summary"</b>, <b>details.duration AS "Duration"</b> |
| FROM better_dream_journal, better_dream_journal.details |
| WHERE |
| <b>details.KEY IN ('Anxiety', 'Nightmare')</b> |
| AND <b>details.duration > 60</b> |
| AND <b>details.woke_up = TRUE</b>; |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="complex_types_etl"> |
| |
| <!-- This topic overlaps in many ways with the preceding one. See which theme resonates with users, and combine them under the better title. --> |
| |
| <title>Loading Data Containing Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| Because the Impala <codeph>INSERT</codeph> statement does not currently support creating new data with complex type columns, or |
| copying existing complex type values from one table to another, you primarily use Impala to query Parquet/ORC tables with complex |
| types where the data was inserted through Hive, or create tables with complex types where you already have existing Parquet/ORC |
| data files. |
| </p> |
| |
| <p> |
| If you have created a Hive table with the Parquet/ORC file format and containing complex types, use the same table for Impala queries |
| with no changes. If you have such a Hive table in some other format, use a Hive <codeph>CREATE TABLE AS SELECT ... STORED AS |
| PARQUET</codeph> or <codeph>INSERT ... SELECT</codeph> statement to produce an equivalent Parquet table that Impala can query. |
| </p> |
| |
| <p> If you have existing Parquet data files containing complex types, |
| located outside of any Impala or Hive table, such as data files |
| created by Spark jobs, you can use an Impala <codeph>CREATE TABLE ... |
| STORED AS PARQUET</codeph> statement, followed by an Impala |
| <codeph>LOAD DATA</codeph> statement to move the data files into the |
| table. As an alternative, you can use an Impala <codeph>CREATE |
| EXTERNAL TABLE</codeph> statement to create a table pointing to the |
| HDFS directory that already contains the Parquet or ORC data |
| files.</p> |
| |
| <p>The simplest way to get started with complex type data is to take a |
| denormalized table containing duplicated values, and use an |
| <codeph>INSERT ... SELECT</codeph> statement to copy the data into a |
| Parquet table and condense the repeated values into complex types. |
| With the Hive <codeph>INSERT</codeph> statement, you use the |
| <codeph>COLLECT_LIST()</codeph>, <codeph>NAMED_STRUCT()</codeph>, |
| and <codeph>MAP()</codeph> constructor functions within a |
| <codeph>GROUP BY</codeph> query to produce the complex type values. |
| <codeph>COLLECT_LIST()</codeph> turns a sequence of values into an |
| <codeph>ARRAY</codeph>. <codeph>NAMED_STRUCT()</codeph> uses the |
| first, third, and so on arguments as the field names for a |
| <codeph>STRUCT</codeph>, to match the field names from the |
| <codeph>CREATE TABLE</codeph> statement. </p> |
| |
| <note> |
| Because Hive currently cannot construct individual rows using complex types through the <codeph>INSERT ... VALUES</codeph> syntax, |
| you prepare the data in flat form in a separate table, then copy it to the table with complex columns using <codeph>INSERT ... |
| SELECT</codeph> and the complex type constructors. See <xref href="impala_complex_types.xml#complex_types_ex_hive_etl"/> for |
| examples. |
| </note> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_nesting"> |
| |
| <title>Using Complex Types as Nested Types</title> |
| |
| <conbody> |
| |
| <p> |
| The <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types can be the top-level types for <q>nested |
| type</q> columns. That is, each of these types can contain other complex or scalar types, with multiple levels of nesting to a |
| maximum depth of 100. For example, you can have an array of structures, a map containing other maps, a structure containing an |
| array of other structures, and so on. At the lowest level, there are always scalar types making up the fields of a |
| <codeph>STRUCT</codeph>, elements of an <codeph>ARRAY</codeph>, and keys and values of a <codeph>MAP</codeph>. |
| </p> |
| |
| <p> |
| Schemas involving complex types typically use some level of nesting for the complex type columns. |
| </p> |
| |
| <p> |
| For example, to model a relationship like a dimension table and a fact table, you typically use an <codeph>ARRAY</codeph> where |
| each array element is a <codeph>STRUCT</codeph>. The <codeph>STRUCT</codeph> fields represent what would traditionally be columns |
| in a separate joined table. It makes little sense to use a <codeph>STRUCT</codeph> as the top-level type for a column, because you |
| could just make the fields of the <codeph>STRUCT</codeph> into regular table columns. |
| </p> |
| |
| <!-- To do: this example might move somewhere else, under STRUCT itself or in a tips-and-tricks section. --> |
| |
| <p> |
| Perhaps the only use case for a top-level <codeph>STRUCT</codeph> would be to to allow <codeph>STRUCT</codeph> fields with the |
| same name as columns to coexist in the same table. The following example shows how a table could have a column named |
| <codeph>ID</codeph>, and two separate <codeph>STRUCT</codeph> fields also named <codeph>ID</codeph>. Because the |
| <codeph>STRUCT</codeph> fields are always referenced using qualified names, the identical <codeph>ID</codeph> names do not cause a |
| conflict. |
| </p> |
| |
| <codeblock><![CDATA[CREATE TABLE struct_namespaces |
| ( |
| id BIGINT |
| , s1 STRUCT < id: BIGINT, field1: STRING > |
| , s2 STRUCT < id: BIGINT, when_happened: TIMESTAMP > |
| ) |
| STORED AS PARQUET; |
| |
| select id, s1.id, s2.id from struct_namespaces; |
| ]]> |
| </codeblock> |
| |
| <p> |
| It is common to make the value portion of each key-value pair in a <codeph>MAP</codeph> a <codeph>STRUCT</codeph>, |
| <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, or other complex type variation. That way, each key in the <codeph>MAP</codeph> |
| can be associated with a flexible and extensible data structure. The key values are not predefined ahead of time (other than by |
| specifying their data type). Therefore, the <codeph>MAP</codeph> can accomodate a rapidly evolving schema, or sparse data |
| structures where each row contains only a few data values drawn from a large set of possible choices. |
| </p> |
| |
| <p> |
| Although you can use an <codeph>ARRAY</codeph> of scalar values as the top-level column in a table, such a simple array is |
| typically of limited use for analytic queries. The only property of the array elements, aside from the element value, is the |
| ordering sequence available through the <codeph>POS</codeph> pseudocolumn. To record any additional item about each array element, |
| such as a <codeph>TIMESTAMP</codeph> or a symbolic name, you use an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> rather than |
| of scalar values. |
| </p> |
| |
| <p> |
| If you are considering having multiple <codeph>ARRAY</codeph> or <codeph>MAP</codeph> columns, with related items under the same |
| position in each <codeph>ARRAY</codeph> or the same key in each <codeph>MAP</codeph>, prefer to use a <codeph>STRUCT</codeph> to |
| group all the related items into a single <codeph>ARRAY</codeph> or <codeph>MAP</codeph>. Doing so avoids the additional storage |
| overhead and potential duplication of key values from having an extra complex type column. Also, because each |
| <codeph>ARRAY</codeph> or <codeph>MAP</codeph> that you reference in the query <codeph>SELECT</codeph> list requires an additional |
| join clause, minimizing the number of complex type columns also makes the query easier to read and maintain, relying more on dot |
| notation to refer to the relevant fields rather than a sequence of join clauses. |
| </p> |
| |
| <p> |
| For example, here is a table with several complex type columns all at the top level and containing only scalar types. To retrieve |
| every data item for the row requires a separate join for each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> column. The fields of |
| the <codeph>STRUCT</codeph> can be referenced using dot notation, but there is no real advantage to using the |
| <codeph>STRUCT</codeph> at the top level rather than just making separate columns <codeph>FIELD1</codeph> and |
| <codeph>FIELD2</codeph>. |
| </p> |
| |
| <codeblock><![CDATA[CREATE TABLE complex_types_top_level |
| ( |
| id BIGINT, |
| a1 ARRAY<INT>, |
| a2 ARRAY<STRING>, |
| s STRUCT<field1: INT, field2: STRING>, |
| -- Numeric lookup key for a string value. |
| m1 MAP<INT,STRING>, |
| -- String lookup key for a numeric value. |
| m2 MAP<STRING,INT> |
| ) |
| STORED AS PARQUET; |
| |
| describe complex_types_top_level; |
| +------+-----------------+ |
| | name | type | |
| +------+-----------------+ |
| | id | bigint | |
| | a1 | array<int> | |
| | a2 | array<string> | |
| | s | struct< | |
| | | field1:int, | |
| | | field2:string | |
| | | > | |
| | m1 | map<int,string> | |
| | m2 | map<string,int> | |
| +------+-----------------+ |
| |
| select |
| id, |
| a1.item, |
| a2.item, |
| s.field1, |
| s.field2, |
| m1.key, |
| m1.value, |
| m2.key, |
| m2.value |
| from |
| complex_types_top_level, |
| complex_types_top_level.a1, |
| complex_types_top_level.a2, |
| complex_types_top_level.m1, |
| complex_types_top_level.m2; |
| ]]> |
| </codeblock> |
| |
| <p> |
| For example, here is a table with columns containing an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, a <codeph>MAP</codeph> |
| where each key value is a <codeph>STRUCT</codeph>, and a <codeph>MAP</codeph> where each key value is an <codeph>ARRAY</codeph> of |
| <codeph>STRUCT</codeph>. |
| </p> |
| |
| <codeblock><![CDATA[CREATE TABLE nesting_demo |
| ( |
| user_id BIGINT, |
| family_members ARRAY < STRUCT < name: STRING, email: STRING, date_joined: TIMESTAMP >>, |
| foo map < STRING, STRUCT < f1: INT, f2: INT, f3: TIMESTAMP, f4: BOOLEAN >>, |
| gameplay MAP < STRING , ARRAY < STRUCT < |
| name: STRING, highest: BIGINT, lives_used: INT, total_spent: DECIMAL(16,2) |
| >>> |
| ) |
| STORED AS PARQUET; |
| ]]> |
| </codeblock> |
| |
| <p> |
| The <codeph>DESCRIBE</codeph> statement rearranges the <codeph><</codeph> and <codeph>></codeph> separators and the field |
| names within each <codeph>STRUCT</codeph> for easy readability: |
| </p> |
| |
| <codeblock><![CDATA[DESCRIBE nesting_demo; |
| +----------------+-----------------------------+ |
| | name | type | |
| +----------------+-----------------------------+ |
| | user_id | bigint | |
| | family_members | array<struct< | |
| | | name:string, | |
| | | email:string, | |
| | | date_joined:timestamp | |
| | | >> | |
| | foo | map<string,struct< | |
| | | f1:int, | |
| | | f2:int, | |
| | | f3:timestamp, | |
| | | f4:boolean | |
| | | >> | |
| | gameplay | map<string,array<struct< | |
| | | name:string, | |
| | | highest:bigint, | |
| | | lives_used:int, | |
| | | total_spent:decimal(16,2) | |
| | | >>> | |
| +----------------+-----------------------------+ |
| ]]> |
| </codeblock> |
| |
| <p> |
| To query the complex type columns, you use join notation to refer to the lowest-level scalar values. If the value is an |
| <codeph>ARRAY</codeph> element, the fully qualified name includes the <codeph>ITEM</codeph> pseudocolumn. If the value is inside a |
| <codeph>MAP</codeph>, the fully qualified name includes the <codeph>KEY</codeph> or <codeph>VALUE</codeph> pseudocolumn. Each |
| reference to a different <codeph>ARRAY</codeph> or <codeph>MAP</codeph> (even if nested inside another complex type) requires an |
| additional join clause. |
| </p> |
| |
| <!-- To do: shorten qualified names. --> |
| |
| <codeblock><![CDATA[SELECT |
| -- The lone scalar field doesn't require any dot notation or join clauses. |
| user_id |
| -- Retrieve the fields of a STRUCT inside an ARRAY. |
| -- The FAMILY_MEMBERS name refers to the FAMILY_MEMBERS table alias defined later in the FROM clause. |
| , family_members.item.name |
| , family_members.item.email |
| , family_members.item.date_joined |
| -- Retrieve the KEY and VALUE fields of a MAP, with the value being a STRUCT consisting of more fields. |
| -- The FOO name refers to the FOO table alias defined later in the FROM clause. |
| , foo.key |
| , foo.value.f1 |
| , foo.value.f2 |
| , foo.value.f3 |
| , foo.value.f4 |
| -- Retrieve the KEY fields of a MAP, and expand the VALUE part into ARRAY items consisting of STRUCT fields. |
| -- The GAMEPLAY name refers to the GAMEPLAY table alias defined later in the FROM clause (referring to the MAP item). |
| -- The GAME_N name refers to the GAME_N table alias defined later in the FROM clause (referring to the ARRAY |
| -- inside the MAP item's VALUE part.) |
| , gameplay.key |
| , game_n.name |
| , game_n.highest |
| , game_n.lives_used |
| , game_n.total_spent |
| FROM |
| nesting_demo |
| , nesting_demo.family_members AS family_members |
| , nesting_demo.foo AS foo |
| , nesting_demo.gameplay AS gameplay |
| , nesting_demo.gameplay.value AS game_n; |
| ]]> |
| </codeblock> |
| |
| <p> |
| Once you understand the notation to refer to a particular data item in the <codeph>SELECT</codeph> list, you can use the same |
| qualified name to refer to that data item in other parts of the query, such as the <codeph>WHERE</codeph> clause, <codeph>ORDER |
| BY</codeph> or <codeph>GROUP BY</codeph> clauses, or calls to built-in functions. For example, you might frequently retrieve the |
| <codeph>VALUE</codeph> part of each <codeph>MAP</codeph> item in the <codeph>SELECT</codeph> list, while choosing the specific |
| <codeph>MAP</codeph> items by running comparisons against the <codeph>KEY</codeph> part in the <codeph>WHERE</codeph> clause. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_views"> |
| |
| <title>Accessing Complex Type Data in Flattened Form Using Views</title> |
| |
| <conbody> |
| |
| <p> |
| The layout of complex and nested types is largely a physical consideration. The complex type columns reside in the same data files |
| rather than in separate normalized tables, for your convenience in managing related data sets and performance in querying related |
| data sets. You can use views to treat tables with complex types as if they were flattened. By putting the join logic and |
| references to the complex type columns in the view definition, you can query the same tables using existing queries intended for |
| tables containing only scalar columns. This technique also lets you use tables with complex types with BI tools that are not aware |
| of the data types and query notation for accessing complex type columns. |
| </p> |
| |
| <!-- First conref contains a link back to complex_types topic so not appropriate here. |
| Second conref might not make sense without the first. Therefore hold off for now, |
| see later if there's a good way to work in these reusable tips. |
| |
| <p conref="../shared/impala_common.xml#common/complex_types_views"/> |
| <p conref="../shared/impala_common.xml#common/complex_types_views_caveat"/> |
| --> |
| |
| <p> |
| For example, the variation of the TPC-H schema containing complex types has a table <codeph>REGION</codeph>. This table has 5 |
| rows, corresponding to 5 regions such as <codeph>NORTH AMERICA</codeph> and <codeph>AFRICA</codeph>. Each row has an |
| <codeph>ARRAY</codeph> column, where each array item is a <codeph>STRUCT</codeph> containing details about a country in that |
| region. |
| </p> |
| |
| <codeblock><![CDATA[DESCRIBE region; |
| +-------------+-------------------------+ |
| | name | type | |
| +-------------+-------------------------+ |
| | r_regionkey | smallint | |
| | r_name | string | |
| | r_comment | string | |
| | r_nations | array<struct< | |
| | | n_nationkey:smallint, | |
| | | n_name:string, | |
| | | n_comment:string | |
| | | >> | |
| +-------------+-------------------------+ |
| ]]> |
| </codeblock> |
| |
| <p> |
| The same data could be represented in traditional denormalized form, as a single table where the information about each region is |
| repeated over and over, alongside the information about each country. The nested complex types let us avoid the repetition, while |
| still keeping the data in a single table rather than normalizing across multiple tables. |
| </p> |
| |
| <p> |
| To use this table with a JDBC or ODBC application that expected scalar columns, we could create a view that represented the result |
| set as a set of scalar columns (three columns from the original table, plus three more from the <codeph>STRUCT</codeph> fields of |
| the array elements). In the following examples, any column with an <codeph>R_*</codeph> prefix is taken unchanged from the |
| original table, while any column with an <codeph>N_*</codeph> prefix is extracted from the <codeph>STRUCT</codeph> inside the |
| <codeph>ARRAY</codeph>. |
| </p> |
| |
| <!-- To do: shorten qualified names. --> |
| |
| <codeblock>CREATE VIEW region_view AS |
| SELECT |
| r_regionkey, |
| r_name, |
| r_comment, |
| array_field.item.n_nationkey AS n_nationkey, |
| array_field.item.n_name AS n_name, |
| array_field.n_comment AS n_comment |
| FROM |
| region, region.r_nations AS array_field; |
| </codeblock> |
| |
| <p> |
| Then we point the application queries at the view rather than the original table. From the perspective of the view, there are 25 |
| rows in the result set, one for each nation in each region, and queries can refer freely to fields related to the region or the |
| nation. |
| </p> |
| |
| <codeblock>-- Retrieve info such as the nation name from the original R_NATIONS array elements. |
| select n_name from region_view where r_name in ('EUROPE', 'ASIA'); |
| +----------------+ |
| | n_name | |
| +----------------+ |
| | UNITED KINGDOM | |
| | RUSSIA | |
| | ROMANIA | |
| | GERMANY | |
| | FRANCE | |
| | VIETNAM | |
| | CHINA | |
| | JAPAN | |
| | INDONESIA | |
| | INDIA | |
| +----------------+ |
| |
| -- UNITED STATES in AMERICA and UNITED KINGDOM in EUROPE. |
| SELECT DISTINCT r_name FROM region_view WHERE n_name LIKE 'UNITED%'; |
| +---------+ |
| | r_name | |
| +---------+ |
| | AMERICA | |
| | EUROPE | |
| +---------+ |
| |
| -- For conciseness, we only list some view columns in the SELECT list. |
| -- SELECT * would bring back all the data, unlike SELECT * |
| -- queries on the original table with complex type columns. |
| SELECT r_regionkey, r_name, n_nationkey, n_name FROM region_view LIMIT 7; |
| +-------------+--------+-------------+----------------+ |
| | r_regionkey | r_name | n_nationkey | n_name | |
| +-------------+--------+-------------+----------------+ |
| | 3 | EUROPE | 23 | UNITED KINGDOM | |
| | 3 | EUROPE | 22 | RUSSIA | |
| | 3 | EUROPE | 19 | ROMANIA | |
| | 3 | EUROPE | 7 | GERMANY | |
| | 3 | EUROPE | 6 | FRANCE | |
| | 2 | ASIA | 21 | VIETNAM | |
| | 2 | ASIA | 18 | CHINA | |
| +-------------+--------+-------------+----------------+ |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="complex_types_examples"> |
| |
| <title>Tutorials and Examples for Complex Types</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Tutorials"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| The following examples illustrate the query syntax for some common use cases involving complex type columns. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="complex_sample_schema"> |
| |
| <title>Sample Schema and Data for Experimenting with Impala Complex Types</title> |
| |
| <conbody> |
| |
| <!-- To do: find where the schema and data creation scripts are posted so I can refer readers to a download location. --> |
| |
| <p> |
| The tables used for earlier examples of complex type syntax are trivial ones with no actual data. The more substantial examples of |
| the complex type feature use these tables, adapted from the schema used for TPC-H testing: |
| </p> |
| |
| <codeblock><![CDATA[SHOW TABLES; |
| +----------+ |
| | name | |
| +----------+ |
| | customer | |
| | part | |
| | region | |
| | supplier | |
| +----------+ |
| |
| DESCRIBE customer; |
| +--------------+------------------------------------+ |
| | name | type | |
| +--------------+------------------------------------+ |
| | c_custkey | bigint | |
| | c_name | string | |
| | c_address | string | |
| | c_nationkey | smallint | |
| | c_phone | string | |
| | c_acctbal | decimal(12,2) | |
| | c_mktsegment | string | |
| | c_comment | string | |
| | c_orders | array<struct< | |
| | | o_orderkey:bigint, | |
| | | o_orderstatus:string, | |
| | | o_totalprice:decimal(12,2), | |
| | | o_orderdate:string, | |
| | | o_orderpriority:string, | |
| | | o_clerk:string, | |
| | | o_shippriority:int, | |
| | | o_comment:string, | |
| | | o_lineitems:array<struct< | |
| | | l_partkey:bigint, | |
| | | l_suppkey:bigint, | |
| | | l_linenumber:int, | |
| | | l_quantity:decimal(12,2), | |
| | | l_extendedprice:decimal(12,2), | |
| | | l_discount:decimal(12,2), | |
| | | l_tax:decimal(12,2), | |
| | | l_returnflag:string, | |
| | | l_linestatus:string, | |
| | | l_shipdate:string, | |
| | | l_commitdate:string, | |
| | | l_receiptdate:string, | |
| | | l_shipinstruct:string, | |
| | | l_shipmode:string, | |
| | | l_comment:string | |
| | | >> | |
| | | >> | |
| +--------------+------------------------------------+ |
| |
| DESCRIBE part; |
| +---------------+---------------+ |
| | name | type | |
| +---------------+---------------+ |
| | p_partkey | bigint | |
| | p_name | string | |
| | p_mfgr | string | |
| | p_brand | string | |
| | p_type | string | |
| | p_size | int | |
| | p_container | string | |
| | p_retailprice | decimal(12,2) | |
| | p_comment | string | |
| +---------------+---------------+ |
| |
| DESCRIBE region; |
| +-------------+--------------------------------------------------------------------+ |
| | name | type | |
| +-------------+--------------------------------------------------------------------+ |
| | r_regionkey | smallint | |
| | r_name | string | |
| | r_comment | string | |
| | r_nations | array<struct<n_nationkey:smallint,n_name:string,n_comment:string>> | |
| +-------------+--------------------------------------------------------------------+ |
| |
| DESCRIBE supplier; |
| +-------------+----------------------------------------------+ |
| | name | type | |
| +-------------+----------------------------------------------+ |
| | s_suppkey | bigint | |
| | s_name | string | |
| | s_address | string | |
| | s_nationkey | smallint | |
| | s_phone | string | |
| | s_acctbal | decimal(12,2) | |
| | s_comment | string | |
| | s_partsupps | array<struct<ps_partkey:bigint, | |
| | | ps_availqty:int,ps_supplycost:decimal(12,2), | |
| | | ps_comment:string>> | |
| +-------------+----------------------------------------------+ |
| ]]> |
| </codeblock> |
| |
| <p> |
| The volume of data used in the following examples is: |
| </p> |
| |
| <codeblock><![CDATA[SELECT count(*) FROM customer; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 150000 | |
| +----------+ |
| |
| SELECT count(*) FROM part; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 200000 | |
| +----------+ |
| |
| SELECT count(*) FROM region; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 5 | |
| +----------+ |
| |
| SELECT count(*) FROM supplier; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 10000 | |
| +----------+ |
| ]]> |
| </codeblock> |
| |
| </conbody> |
| |
| <concept audience="hidden" id="complex_python"> |
| |
| <!-- Hiding this subtopic for the moment because there isn't enough material related to Ibis yet, which would be useful for background or to construct examples. --> |
| |
| <title>Representing Python Data Types with Impala Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| If you use Python data types for data science computations or for convenience of manipulating data structures, you might think |
| of complex or nested data in terms of lists, tuples, and dictionaries and their textual representations. These types map well to |
| Impala complex types, as follows: |
| </p> |
| |
| <!-- To do: add some examples in both languages to show the equivalences. --> |
| |
| <ul> |
| <li> |
| <p> |
| An Impala <codeph>ARRAY</codeph> is like a Python list. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| An Impala <codeph>STRUCT</codeph> is like a Python tuple. The elements of an Impala <codeph>STRUCT</codeph> are accessed by |
| name, rather than by numeric index as in Python. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| An Impala <codeph>MAP</codeph> is like a Python dictionary. To emulate the Python <codeph>defaultdict</codeph> type, which |
| returns a default value in the case of non-existent elements, use an <codeph>OUTER JOIN</codeph> clause in the Impala query |
| that retrieves the <codeph>MAP</codeph> elements, and translate any <codeph>NULL</codeph> values corresponding to missing |
| elements using Impala conditional functions. |
| </p> |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="complex_types_ex_aggregation" audience="hidden"> |
| |
| <title>Aggregating the Elements in an ARRAY</title> |
| |
| <conbody> |
| |
| <p></p> |
| |
| <codeblock> |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_ex_map_keys" audience="hidden"> |
| |
| <title>Finding the Distinct Keys in a MAP</title> |
| |
| <conbody> |
| |
| <p></p> |
| |
| <codeblock> |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_ex_map_struct" audience="hidden"> |
| |
| <title>Using a STRUCT as the Value Part of a MAP</title> |
| |
| <conbody> |
| |
| <p></p> |
| |
| <codeblock> |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_types_ex_hive_etl"> |
| |
| <title>Constructing Parquet/ORC Files with Complex Columns Using Hive</title> |
| |
| <conbody> |
| |
| <p> |
| The following examples demonstrate the Hive syntax to transform flat data (tables with all scalar columns) into Parquet/ORC tables |
| where Impala can query the complex type columns. Each example shows the full sequence of steps, including switching back and forth |
| between Impala and Hive. Although the source table can use any file format, the destination table must use the Parquet/ORC file |
| format. We take Parquet in the following examples. You can replace Parquet with ORC to do the same things in ORC file format. |
| </p> |
| |
| <p> |
| <b>Create table with <codeph>ARRAY</codeph> in Impala, load data in Hive, query in Impala:</b> |
| </p> |
| |
| <p> |
| This example shows the cycle of creating the tables and querying the complex data in Impala, and using Hive (either the |
| <codeph>hive</codeph> shell or <codeph>beeline</codeph>) for the data loading step. The data starts in flattened, denormalized |
| form in a text table. Hive writes the corresponding Parquet data, including an <codeph>ARRAY</codeph> column. Then Impala can run |
| analytic queries on the Parquet table, using join notation to unpack the <codeph>ARRAY</codeph> column. |
| </p> |
| |
| <codeblock>/* Initial DDL and loading of flat, denormalized data happens in impala-shell */<![CDATA[CREATE TABLE flat_array (country STRING, city STRING);INSERT INTO flat_array VALUES |
| ('Canada', 'Toronto') , ('Canada', 'Vancouver') , ('Canada', "St. John\'s") |
| , ('Canada', 'Saint John') , ('Canada', 'Montreal') , ('Canada', 'Halifax') |
| , ('Canada', 'Winnipeg') , ('Canada', 'Calgary') , ('Canada', 'Saskatoon') |
| , ('Canada', 'Ottawa') , ('Canada', 'Yellowknife') , ('France', 'Paris') |
| , ('France', 'Nice') , ('France', 'Marseilles') , ('France', 'Cannes') |
| , ('Greece', 'Athens') , ('Greece', 'Piraeus') , ('Greece', 'Hania') |
| , ('Greece', 'Heraklion') , ('Greece', 'Rethymnon') , ('Greece', 'Fira'); |
| |
| CREATE TABLE complex_array (country STRING, city ARRAY <STRING>) STORED AS PARQUET; |
| ]]> |
| </codeblock> |
| |
| <codeblock><![CDATA[/* Conversion to Parquet and complex and/or nested columns happens in Hive */ |
| |
| INSERT INTO complex_array SELECT country, collect_list(city) FROM flat_array GROUP BY country; |
| Query ID = dev_20151108160808_84477ff2-82bd-4ba4-9a77-554fa7b8c0cb |
| Total jobs = 1 |
| Launching Job 1 out of 1 |
| ... |
| ]]> |
| </codeblock> |
| |
| <codeblock><![CDATA[/* Back to impala-shell again for analytic queries */ |
| |
| REFRESH complex_array; |
| SELECT country, city.item FROM complex_array, complex_array.city |
| +---------+-------------+ |
| | country | item | |
| +---------+-------------+ |
| | Canada | Toronto | |
| | Canada | Vancouver | |
| | Canada | St. John's | |
| | Canada | Saint John | |
| | Canada | Montreal | |
| | Canada | Halifax | |
| | Canada | Winnipeg | |
| | Canada | Calgary | |
| | Canada | Saskatoon | |
| | Canada | Ottawa | |
| | Canada | Yellowknife | |
| | France | Paris | |
| | France | Nice | |
| | France | Marseilles | |
| | France | Cannes | |
| | Greece | Athens | |
| | Greece | Piraeus | |
| | Greece | Hania | |
| | Greece | Heraklion | |
| | Greece | Rethymnon | |
| | Greece | Fira | |
| +---------+-------------+ |
| ]]> |
| </codeblock> |
| |
| <p> |
| <b>Create table with <codeph>STRUCT</codeph> and <codeph>ARRAY</codeph> in Impala, load data in Hive, query in Impala:</b> |
| </p> |
| |
| <p> |
| This example shows the cycle of creating the tables and querying the complex data in Impala, and using Hive (either the |
| <codeph>hive</codeph> shell or <codeph>beeline</codeph>) for the data loading step. The data starts in flattened, denormalized |
| form in a text table. Hive writes the corresponding Parquet data, including a <codeph>STRUCT</codeph> column with an |
| <codeph>ARRAY</codeph> field. Then Impala can run analytic queries on the Parquet table, using join notation to unpack the |
| <codeph>ARRAY</codeph> field from the <codeph>STRUCT</codeph> column. |
| </p> |
| |
| <!-- To do: enhance the complex table in this example, with COUNTRY being an ARRAY of STRUCT, with the STRUCT containing another ARRAY field. --> |
| |
| <codeblock><![CDATA[/* Initial DDL and loading of flat, denormalized data happens in impala-shell */ |
| |
| CREATE TABLE flat_struct_array (continent STRING, country STRING, city STRING); |
| INSERT INTO flat_struct_array VALUES |
| ('North America', 'Canada', 'Toronto') , ('North America', 'Canada', 'Vancouver') |
| , ('North America', 'Canada', "St. John\'s") , ('North America', 'Canada', 'Saint John') |
| , ('North America', 'Canada', 'Montreal') , ('North America', 'Canada', 'Halifax') |
| , ('North America', 'Canada', 'Winnipeg') , ('North America', 'Canada', 'Calgary') |
| , ('North America', 'Canada', 'Saskatoon') , ('North America', 'Canada', 'Ottawa') |
| , ('North America', 'Canada', 'Yellowknife') , ('Europe', 'France', 'Paris') |
| , ('Europe', 'France', 'Nice') , ('Europe', 'France', 'Marseilles') |
| , ('Europe', 'France', 'Cannes') , ('Europe', 'Greece', 'Athens') |
| , ('Europe', 'Greece', 'Piraeus') , ('Europe', 'Greece', 'Hania') |
| , ('Europe', 'Greece', 'Heraklion') , ('Europe', 'Greece', 'Rethymnon') |
| , ('Europe', 'Greece', 'Fira'); |
| |
| CREATE TABLE complex_struct_array (continent STRING, country STRUCT <name: STRING, city: ARRAY <STRING> >) STORED AS PARQUET; |
| ]]> |
| </codeblock> |
| |
| <codeblock><![CDATA[/* Conversion to Parquet and complex and/or nested columns happens in Hive */ |
| |
| INSERT INTO complex_struct_array SELECT continent, named_struct('name', country, 'city', collect_list(city)) FROM flat_array_array GROUP BY continent, country; |
| Query ID = dev_20151108163535_11a4fa53-0003-4638-97e6-ef13cdb8e09e |
| Total jobs = 1 |
| Launching Job 1 out of 1 |
| ... |
| ]]> |
| </codeblock> |
| |
| <codeblock><![CDATA[/* Back to impala-shell again for analytic queries */ |
| |
| REFRESH complex_struct_array; |
| SELECT t1.continent, t1.country.name, t2.item |
| FROM complex_struct_array t1, t1.country.city t2 |
| +---------------+--------------+-------------+ |
| | continent | country.name | item | |
| +---------------+--------------+-------------+ |
| | Europe | France | Paris | |
| | Europe | France | Nice | |
| | Europe | France | Marseilles | |
| | Europe | France | Cannes | |
| | Europe | Greece | Athens | |
| | Europe | Greece | Piraeus | |
| | Europe | Greece | Hania | |
| | Europe | Greece | Heraklion | |
| | Europe | Greece | Rethymnon | |
| | Europe | Greece | Fira | |
| | North America | Canada | Toronto | |
| | North America | Canada | Vancouver | |
| | North America | Canada | St. John's | |
| | North America | Canada | Saint John | |
| | North America | Canada | Montreal | |
| | North America | Canada | Halifax | |
| | North America | Canada | Winnipeg | |
| | North America | Canada | Calgary | |
| | North America | Canada | Saskatoon | |
| | North America | Canada | Ottawa | |
| | North America | Canada | Yellowknife | |
| +---------------+--------------+-------------+ |
| ]]> |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept audience="hidden" id="complex_types_no_joins"> |
| |
| <title>Using Complex Types without Join Queries</title> |
| |
| <conbody> |
| |
| <p> |
| Although most discussion and examples of complex types revolve around join queries, you can make use of complex columns without |
| joins. In these cases, you refer to the complex object (possibly nested under multiple levels of other complex types) as the only |
| <q>table</q> in the <codeph>FROM</codeph> clause, and the columns of this table are the scalar values named <codeph>POS</codeph> |
| (for <codeph>ARRAY</codeph>), <codeph>ITEM</codeph> (for <codeph>ARRAY</codeph> of scalar values), <codeph>KEY</codeph> (for |
| <codeph>MAP</codeph>), <codeph>VALUE</codeph> (for <codeph>MAP</codeph> with scalar values), and the names of |
| <codeph>STRUCT</codeph> fields (where the <codeph>ARRAY</codeph> element or <codeph>MAP</codeph> value is a |
| <codeph>STRUCT</codeph>). |
| </p> |
| |
| <p> |
| This technique lets you use aggregation operation such as <codeph>COUNT()</codeph>, <codeph>SUM()</codeph>, |
| <codeph>MAX()</codeph>, or <codeph>DISTINCT</codeph> across all the data items from a complex type, across all rows of the table. |
| For users new to complex types, this can be a good practice exercise to understand the notation for referring to data items within |
| deeply nested types, and how the pseudocolumns such as <codeph>POS</codeph> and <codeph>KEY</codeph> are treated as columns |
| alongside the fields of a <codeph>STRUCT</codeph>. |
| </p> |
| |
| <note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/> |
| |
| <p> |
| For example, these queries perform aggregate functions using the <codeph>CUSTOMER</codeph> table from the <q>nested TPC-H</q> |
| schema: |
| </p> |
| |
| <codeblock><![CDATA[ |
| ]]> |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_denormalizing"> |
| |
| <title>Flattening Normalized Tables into a Single Table with Complex Types</title> |
| |
| <conbody> |
| |
| <p> |
| One common use for complex types is to embed the contents of one table into another. The traditional technique of denormalizing |
| results in a huge number of rows with some column values repeated over and over. With complex types, you can keep the same number |
| of rows as in the original normalized table, and put all the associated data from the other table in a single new column. |
| </p> |
| |
| <p> |
| In this flattening scenario, you might frequently use a column that is an <codeph>ARRAY</codeph> consisting of |
| <codeph>STRUCT</codeph> elements, where each field within the <codeph>STRUCT</codeph> corresponds to a column name from the table |
| that you are combining. |
| </p> |
| |
| <p> |
| The following example shows a traditional normalized layout using two tables, and then an equivalent layout using complex types in |
| a single table. |
| </p> |
| |
| <codeblock><![CDATA[/* Traditional relational design */ |
| |
| -- This table just stores numbers, allowing us to look up details about the employee |
| -- and details about their vacation time using a three-table join query. |
| CREATE table employee_vacations |
| ( |
| employee_id BIGINT, |
| vacation_id BIGINT |
| ) |
| STORED AS PARQUET; |
| |
| -- Each kind of information to track gets its own "fact table". |
| CREATE table vacation_details |
| ( |
| vacation_id BIGINT, |
| vacation_start TIMESTAMP, |
| duration INT |
| ) |
| STORED AS PARQUET; |
| |
| -- Any time we print a human-readable report, we join with this table to |
| -- display info about employee #1234. |
| CREATE TABLE employee_contact |
| ( |
| employee_id BIGINT, |
| name STRING, |
| address STRING, |
| phone STRING, |
| email STRING, |
| address_type STRING /* 'home', 'work', 'remote', etc. */ |
| ) |
| STORED AS PARQUET; |
| |
| /* Equivalent flattened schema using complex types */ |
| |
| -- For analytic queries using complex types, we can bundle the dimension table |
| -- and multiple fact tables into a single table. |
| CREATE TABLE employee_vacations_nested_types |
| ( |
| -- We might still use the employee_id for other join queries. |
| -- The table needs at least one scalar column to serve as an identifier |
| -- for the complex type columns. |
| employee_id BIGINT, |
| |
| -- Columns of the VACATION_DETAILS table are folded into a STRUCT. |
| -- We drop the VACATION_ID column because Impala doesn't need |
| -- synthetic IDs to join a complex type column. |
| -- Each row from the VACATION_DETAILS table becomes an array element. |
| vacation ARRAY < STRUCT < |
| vacation_start: TIMESTAMP, |
| duration: INT |
| >>, |
| |
| -- The ADDRESS_TYPE column, with a small number of predefined values that are distinct |
| -- for each employee, makes the EMPLOYEE_CONTACT table a good candidate to turn into a MAP, |
| -- with each row represented as a STRUCT. The string value from ADDRESS_TYPE becomes the |
| -- "key" (the anonymous first field) of the MAP. |
| contact MAP < STRING, STRUCT < |
| address: STRING, |
| phone: STRING, |
| email: STRING |
| >> |
| ) |
| STORED AS PARQUET; |
| ]]> |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_inference"> |
| |
| <title>Interchanging Complex Type Tables and Data Files with Hive and Other Components</title> |
| |
| <conbody> |
| |
| <p> |
| You can produce Parquet data files through several Hadoop components and APIs. |
| </p> |
| |
| <p> |
| If you have a Hive-created Parquet table that includes <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, or <codeph>MAP</codeph> |
| columns, Impala can query that same table in <keyword keyref="impala23_full"/> and higher, subject to the usual restriction that all other |
| columns are of data types supported by Impala, and also that the file type of the table must be Parquet. |
| </p> |
| |
| <p> |
| If you have a Parquet data file produced outside of Impala, Impala can automatically deduce the appropriate table structure using |
| the syntax <codeph>CREATE TABLE ... LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'</codeph>. In <keyword keyref="impala23_full"/> |
| and higher, this feature works for Parquet files that include <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, or |
| <codeph>MAP</codeph> types. |
| </p> |
| |
| <codeblock><![CDATA[/* In impala-shell, find the HDFS data directory of the original table. |
| DESCRIBE FORMATTED tpch_nested_parquet.customer; |
| ... |
| | Location: | hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer | NULL | |
| ... |
| |
| # In the Unix shell, find the path of any Parquet data file in that HDFS directory. |
| $ hdfs dfs -ls hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer |
| Found 4 items |
| -rwxr-xr-x 3 dev supergroup 171298918 2015-09-22 23:30 hdfs://localhost:20500/blah/tpch_nested_parquet.db/customer/000000_0 |
| ... |
| |
| /* Back in impala-shell, use the HDFS path in a CREATE TABLE LIKE PARQUET statement. */ |
| CREATE TABLE customer_ctlp |
| LIKE PARQUET 'hdfs://localhost:20500/blah/tpch_nested_parquet.db/customer/000000_0' |
| STORED AS PARQUET; |
| |
| /* Confirm that old and new tables have the same column layout, including complex types. */ |
| DESCRIBE tpch_nested_parquet.customer |
| +--------------+------------------------------------+---------+ |
| | name | type | comment | |
| +--------------+------------------------------------+---------+ |
| | c_custkey | bigint | | |
| | c_name | string | | |
| | c_address | string | | |
| | c_nationkey | smallint | | |
| | c_phone | string | | |
| | c_acctbal | decimal(12,2) | | |
| | c_mktsegment | string | | |
| | c_comment | string | | |
| | c_orders | array<struct< | | |
| | | o_orderkey:bigint, | | |
| | | o_orderstatus:string, | | |
| | | o_totalprice:decimal(12,2), | | |
| | | o_orderdate:string, | | |
| | | o_orderpriority:string, | | |
| | | o_clerk:string, | | |
| | | o_shippriority:int, | | |
| | | o_comment:string, | | |
| | | o_lineitems:array<struct< | | |
| | | l_partkey:bigint, | | |
| | | l_suppkey:bigint, | | |
| | | l_linenumber:int, | | |
| | | l_quantity:decimal(12,2), | | |
| | | l_extendedprice:decimal(12,2), | | |
| | | l_discount:decimal(12,2), | | |
| | | l_tax:decimal(12,2), | | |
| | | l_returnflag:string, | | |
| | | l_linestatus:string, | | |
| | | l_shipdate:string, | | |
| | | l_commitdate:string, | | |
| | | l_receiptdate:string, | | |
| | | l_shipinstruct:string, | | |
| | | l_shipmode:string, | | |
| | | l_comment:string | | |
| | | >> | | |
| | | >> | | |
| +--------------+------------------------------------+---------+ |
| |
| describe customer_ctlp; |
| +--------------+------------------------------------+-----------------------------+ |
| | name | type | comment | |
| +--------------+------------------------------------+-----------------------------+ |
| | c_custkey | bigint | Inferred from Parquet file. | |
| | c_name | string | Inferred from Parquet file. | |
| | c_address | string | Inferred from Parquet file. | |
| | c_nationkey | int | Inferred from Parquet file. | |
| | c_phone | string | Inferred from Parquet file. | |
| | c_acctbal | decimal(12,2) | Inferred from Parquet file. | |
| | c_mktsegment | string | Inferred from Parquet file. | |
| | c_comment | string | Inferred from Parquet file. | |
| | c_orders | array<struct< | Inferred from Parquet file. | |
| | | o_orderkey:bigint, | | |
| | | o_orderstatus:string, | | |
| | | o_totalprice:decimal(12,2), | | |
| | | o_orderdate:string, | | |
| | | o_orderpriority:string, | | |
| | | o_clerk:string, | | |
| | | o_shippriority:int, | | |
| | | o_comment:string, | | |
| | | o_lineitems:array<struct< | | |
| | | l_partkey:bigint, | | |
| | | l_suppkey:bigint, | | |
| | | l_linenumber:int, | | |
| | | l_quantity:decimal(12,2), | | |
| | | l_extendedprice:decimal(12,2), | | |
| | | l_discount:decimal(12,2), | | |
| | | l_tax:decimal(12,2), | | |
| | | l_returnflag:string, | | |
| | | l_linestatus:string, | | |
| | | l_shipdate:string, | | |
| | | l_commitdate:string, | | |
| | | l_receiptdate:string, | | |
| | | l_shipinstruct:string, | | |
| | | l_shipmode:string, | | |
| | | l_comment:string | | |
| | | >> | | |
| | | >> | | |
| +--------------+------------------------------------+-----------------------------+ |
| ]]> |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| </concept> |