<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.3.0" id="complex_types">
<title id="nested_types">Complex Types (<keyword keyref="impala23"/> or higher only)</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.3.0">
<indexterm audience="hidden">complex types</indexterm>
<indexterm audience="hidden">nested types</indexterm>
<term>Complex types</term> (also referred to as <term>nested types</term>) let you represent multiple data values within a single
row/column position. They differ from the familiar column types such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>, known as
<term>scalar types</term> or <term>primitive types</term>, which represent a single data value within a given row/column position.
Impala supports the complex types <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> in <keyword keyref="impala23_full"/>
and higher. The Hive <codeph>UNION</codeph> type is not currently supported.
</p>
<p outputclass="toc inpage"/>
<p>
Once you understand the basics of complex types, refer to the individual type topics when you need to refresh your memory about syntax
and examples:
</p>
<ul>
<li>
<xref href="impala_array.xml#array"/>
</li>
<li>
<xref href="impala_struct.xml#struct"/>
</li>
<li>
<xref href="impala_map.xml#map"/>
</li>
</ul>
</conbody>
<concept id="complex_types_benefits">
<title>Benefits of Impala Complex Types</title>
<conbody>
<p>
The reasons for using Impala complex types include the following:
</p>
<ul>
<li>
<p>
You already have data produced by Hive or another non-Impala component that uses complex type columns. You might need to
convert the underlying data to Parquet to use it with Impala.
</p>
</li>
<li>
<p>
Your data model originates with a non-SQL programming language or a NoSQL data management system. For example, if you are
representing Python data expressed as nested lists, dictionaries, and tuples, those data structures correspond closely to Impala
<codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> types.
</p>
</li>
<li>
<p>
Your analytic queries involving multiple tables could benefit from greater locality during join processing. By packing more
related data items within each HDFS data block, complex types let join queries avoid the network overhead of the traditional
Hadoop shuffle or broadcast join techniques.
</p>
</li>
</ul>
<p>
Impala's complex type support produces result sets consisting entirely of scalar values, and those scalar components of complex types can be used
with all SQL clauses, such as <codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, all kinds of joins, subqueries, and inline
views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other
programming languages to deconstruct the underlying data structures.
</p>
</conbody>
</concept>
<concept id="complex_types_overview">
<title>Overview of Impala Complex Types</title>
<conbody>
<p>
<!--
Each <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or <codeph>STRUCT</codeph> column can include multiple instances of scalar types
such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>.
-->
The <codeph>ARRAY</codeph> and <codeph>MAP</codeph> types are closely related: they represent collections with arbitrary numbers of
elements, where each element is the same type. In contrast, <codeph>STRUCT</codeph> groups together a fixed number of items into a
single element. The parts of a <codeph>STRUCT</codeph> element (the <term>fields</term>) can be of different types, and each field
has a name.
</p>
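<p>
As a sketch of the declaration syntax (the table and column names here are hypothetical), an <codeph>ARRAY</codeph> or
<codeph>MAP</codeph> holds an arbitrary number of elements of one type, while a <codeph>STRUCT</codeph> names each of its
fixed set of fields:
</p>
<codeblock><![CDATA[
-- Hypothetical table showing one column of each complex type.
CREATE TABLE complex_demo
(
  id BIGINT
, tags ARRAY < STRING >                       -- many elements, one type
, metrics MAP < STRING,BIGINT >               -- arbitrary key-value pairs
, location STRUCT < city: STRING, zip: INT >  -- fixed, named fields
) STORED AS PARQUET;
]]>
</codeblock>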
<p>
The elements of an <codeph>ARRAY</codeph> or <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>, can also be other
complex types. You can construct elaborate data structures with up to 100 levels of nesting. For example, you can make an
<codeph>ARRAY</codeph> whose elements are <codeph>STRUCT</codeph>s. Within each <codeph>STRUCT</codeph>, you can have some fields
that are <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or another kind of <codeph>STRUCT</codeph>. The Impala documentation uses the
terms complex and nested types interchangeably; for simplicity, it primarily uses the term complex types to encompass all the
properties of these types.
</p>
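<p>
For example, such nesting might be declared as follows (a hypothetical sketch, not a schema from the examples later in this topic):
</p>
<codeblock><![CDATA[
-- An ARRAY whose elements are STRUCTs; each STRUCT in turn contains
-- a scalar field and another ARRAY field.
CREATE TABLE nesting_demo
(
  id BIGINT
, events ARRAY < STRUCT <
    name: STRING,
    attendees: ARRAY < STRING >
  > >
) STORED AS PARQUET;
]]>
</codeblock>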
<p>
When visualizing your data model in familiar SQL terms, you can think of each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> as a
miniature table, and each <codeph>STRUCT</codeph> as a row within such a table. By default, the table represented by an
<codeph>ARRAY</codeph> has two columns, <codeph>POS</codeph> to represent ordering of elements, and <codeph>ITEM</codeph>
representing the value of each element. Likewise, by default, the table represented by a <codeph>MAP</codeph> encodes key-value
pairs, and therefore has two columns, <codeph>KEY</codeph> and <codeph>VALUE</codeph>.
<!--
When you use a <codeph>STRUCT</codeph> as an
<codeph>ARRAY</codeph> element or the <codeph>VALUE</codeph> part of a <codeph>MAP</codeph>, the field names of the
<codeph>STRUCT</codeph> become additional columns in the result set.
-->
</p>
<p>
The <codeph>ITEM</codeph> and <codeph>VALUE</codeph> names are only required for the very simplest kinds of <codeph>ARRAY</codeph>
and <codeph>MAP</codeph> columns, ones that hold only scalar values. When the elements within the <codeph>ARRAY</codeph> or
<codeph>MAP</codeph> are of type <codeph>STRUCT</codeph> rather than a scalar type, then the result set contains columns with names
corresponding to the <codeph>STRUCT</codeph> fields rather than <codeph>ITEM</codeph> or <codeph>VALUE</codeph>.
</p>
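<p>
For example, assuming a hypothetical table <codeph>T</codeph> with a column <codeph>A ARRAY&lt;STRING&gt;</codeph>, a query
refers to the <codeph>POS</codeph> and <codeph>ITEM</codeph> pseudocolumns through a table alias for the array:
</p>
<codeblock>-- POS is the position of each element; ITEM is its value.
SELECT t.id, a.pos, a.item
  FROM t, t.a a
 WHERE a.item IS NOT NULL;
</codeblock>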
<!--
<p>
<codeph>ARRAY</codeph> and <codeph>MAP</codeph> are both <term>collection</term> types, which can have a variable number of
elements; <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are typically used as the top-level type of a table column.
<codeph>STRUCT</codeph> represents a single element and has a fixed number of fields; <codeph>STRUCT</codeph> is typically used as
the final, lowest level of a nested type definition.
</p>
-->
<p>
You write most queries that process complex type columns using familiar join syntax, even though the data for both sides of the join
resides in a single table. The join notation brings together the scalar values from a row with the values from the complex type
columns for that same row. The final result set contains all scalar values, allowing you to do all the familiar filtering,
aggregation, ordering, and so on for the complex data entirely in SQL or using business intelligence tools that issue SQL queries.
<!--
Instead of pulling together values from different tables, the join selects the specified values from both
the scalar columns, and from inside the complex type columns, producing a flattened result set consisting of all scalar values. When
doing a join query involving a complex type column, Impala derives the join key automatically, without the need to create additional
ID columns in the table.
-->
</p>
<p>
Behind the scenes, Impala ensures that the processing for each row is done efficiently on a single host, without the network traffic
involved in broadcast or shuffle joins. The most common type of join query for tables with complex type columns is <codeph>INNER
JOIN</codeph>, which returns results only in those cases where the complex type contains some elements. Therefore, most query
examples in this section use either the <codeph>INNER JOIN</codeph> clause or the equivalent comma notation.
</p>
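<p>
A minimal sketch of this join notation, using a hypothetical <codeph>CONTACTS</codeph> table with a
<codeph>PHONE_NUMBER ARRAY&lt;STRING&gt;</codeph> column:
</p>
<codeblock>-- Comma notation: produces one result row per array element,
-- only for rows where the array contains at least one element.
SELECT c.name, p.item AS phone
  FROM contacts c, c.phone_number p;

-- Equivalent explicit INNER JOIN syntax. No ON clause is needed;
-- Impala applies the join key implicitly.
SELECT c.name, p.item AS phone
  FROM contacts c INNER JOIN c.phone_number p;
</codeblock>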
<note>
<p>
Although Impala can query complex types that are present in Parquet files, Impala currently cannot create new Parquet files
containing complex types. Therefore, the discussion and examples presume that you are working with existing Parquet data produced
through Hive, Spark, or some other source. See <xref href="#complex_types_ex_hive_etl"/> for examples of constructing Parquet data
files with complex type columns.
</p>
<p>
For learning purposes, you can create empty tables with complex type columns and practice query syntax, even if you do not have
sample data with the required structure.
</p>
</note>
</conbody>
</concept>
<concept id="complex_types_design">
<title>Design Considerations for Complex Types</title>
<conbody>
<p>
When planning to use Impala complex types, and designing the Impala schema, first learn how this kind of schema differs from
traditional table layouts used in relational databases and data warehouses. Because you might have already encountered
complex types in a Hadoop context while using Hive for ETL, also learn how to write high-performance analytic queries for complex
type data using Impala SQL syntax.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="complex_types_vs_rdbms">
<title>How Complex Types Differ from Traditional Data Warehouse Schemas</title>
<conbody>
<p>
Complex types let you associate arbitrary data structures with a particular row. If you are familiar with schema design for
relational database management systems or data warehouses, a schema with complex types has the following differences:
</p>
<ul>
<li>
<p>
Logically, related values can now be grouped tightly together in the same table.
</p>
<p>
In traditional data warehousing, related values were typically arranged in one of two ways:
</p>
<ul>
<li>
<p>
Split across multiple normalized tables. Foreign key columns specified which rows from each table were associated with
each other. This arrangement avoided duplicate data and therefore the data was compact, but join queries could be
expensive because the related data had to be retrieved from separate locations. (In the case of distributed Hadoop
queries, the joined tables might even be transmitted between different hosts in a cluster.)
</p>
</li>
<li>
<p>
Flattened into a single denormalized table. Although this layout eliminated some potential performance issues by removing
the need for join queries, the table typically became larger because values were repeated. The extra data volume could
cause performance issues in other parts of the workflow, such as longer ETL cycles or more expensive full-table scans
during queries.
</p>
</li>
</ul>
<p>
Complex types represent a middle ground that addresses these performance and volume concerns. By physically locating related
data within the same data files, complex types increase locality and reduce the expense of join queries. By associating an
arbitrary amount of data with a single row, complex types avoid the need to repeat lengthy values such as strings. Because
Impala knows which complex type values are associated with each row, you can save storage by avoiding artificial foreign key
values that are only used for joins. The flexibility of the <codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, and
<codeph>MAP</codeph> types lets you model familiar constructs such as fact and dimension tables from a data warehouse, and
wide tables representing sparse matrices.
</p>
</li>
</ul>
</conbody>
</concept>
<concept id="complex_types_physical">
<title>Physical Storage for Complex Types in Parquet</title>
<conbody>
<p>
Physically, the scalar and complex columns in each row are located adjacent to each other in the same Parquet data file, ensuring
that they are processed on the same host rather than being broadcast across the network when cross-referenced within a query. This
co-location simplifies the process of copying, converting, and backing up all the columns at once. Because of the column-oriented
layout of Parquet files, you can still query only the scalar columns of a table without imposing the I/O penalty of reading the
(possibly large) values of the composite columns.
</p>
<p>
Within each Parquet data file, the constituent parts of complex type columns are stored in column-oriented format:
</p>
<ul>
<li>
<p>
Each field of a <codeph>STRUCT</codeph> type is stored like a column, with all the scalar values adjacent to each other and
encoded, compressed, and so on using the Parquet space-saving techniques.
</p>
</li>
<li>
<p>
For an <codeph>ARRAY</codeph> containing scalar values, all those values (represented by the <codeph>ITEM</codeph>
pseudocolumn) are stored adjacent to each other.
</p>
</li>
<li>
<p>
For a <codeph>MAP</codeph>, the values of the <codeph>KEY</codeph> pseudocolumn are stored adjacent to each other. If the
<codeph>VALUE</codeph> pseudocolumn is a scalar type, its values are also stored adjacent to each other.
</p>
</li>
<li>
<p>
If an <codeph>ARRAY</codeph> element, <codeph>STRUCT</codeph> field, or <codeph>MAP</codeph> <codeph>VALUE</codeph> part is
another complex type, the column-oriented storage applies to the next level down (or the next level after that, and so on for
deeply nested types) where the final elements, fields, or values are of scalar types.
</p>
</li>
</ul>
<p>
The numbers represented by the <codeph>POS</codeph> pseudocolumn of an <codeph>ARRAY</codeph> are not physically stored in the
data files. They are synthesized at query time based on the order of the <codeph>ARRAY</codeph> elements associated with each row.
</p>
</conbody>
</concept>
<concept id="complex_types_file_formats">
<title>File Format Support for Impala Complex Types</title>
<conbody>
<p> Currently, Impala queries support complex type data in the Parquet
and ORC file formats. See <xref href="impala_parquet.xml#parquet"/>
for details about the performance benefits and physical layout of
the Parquet file format. </p>
<p>
Because Impala does not parse the data structures containing nested types for unsupported formats such as text, Avro,
SequenceFile, or RCFile, you cannot use data files in these formats with Impala, even if the query does not refer to the nested
type columns. Also, if a table using an unsupported format originally contained nested type columns, and then those columns were
dropped from the table using <codeph>ALTER TABLE ... DROP COLUMN</codeph>, any existing data files in the table still contain the
nested type data and Impala queries on that table will generate errors.
</p>
<p rev="2.6.0 IMPALA-2844">
The one exception to the preceding rule is <codeph>COUNT(*)</codeph> queries on RCFile tables that include complex types.
Such queries are allowed in <keyword keyref="impala26_full"/> and higher.
</p>
<p>
You can perform DDL operations on tables containing complex types in
most file formats other than Parquet or ORC. However, you cannot create
Impala tables with complex type columns that use the text file format.
</p>
<p>
You can have a partitioned table with complex type columns that uses
a format other than Parquet or ORC, and use
<codeph>ALTER TABLE</codeph> to change the file format to Parquet/ORC
for individual partitions. When you put Parquet/ORC files into those
partitions, Impala can execute queries against that data as long as
the query does not involve any of the non-Parquet and non-ORC
partitions.
</p>
<p>
If you use the <cmdname>parquet-tools</cmdname> command to examine the structure of a Parquet data file that includes complex
types, you see that both <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are represented as a <codeph>Bag</codeph> in Parquet
terminology, with all fields marked <codeph>Optional</codeph> because Impala allows any column to be nullable.
</p>
<p>
Impala supports both 2-level and 3-level encoding within each Parquet data file. When constructing Parquet data files outside
Impala, use either encoding style, but do not mix 2-level and 3-level encoding within the same data file.
</p>
</conbody>
</concept>
<concept id="complex_types_vs_normalization">
<title>Choosing Between Complex Types and Normalized Tables</title>
<conbody>
<p>
Choosing between multiple normalized fact and dimension tables, or a single table containing complex types, is an important design
decision.
</p>
<ul>
<li>
<p>
If you are coming from a traditional database or data warehousing background, you might be familiar with how to split up data
between tables. Your business intelligence tools might already be optimized for dealing with this kind of multi-table scenario
through join queries.
</p>
</li>
<li>
<p>
If you are pulling data from Impala into an application written in a programming language that has data structures analogous
to the complex types, such as Python or Java, complex types in Impala could simplify data interchange and improve
understandability and reliability of your program logic.
</p>
</li>
<li>
<p>
You might already be faced with existing infrastructure or receive high volumes of data that assume one layout or the other.
For example, complex types are popular in web-oriented applications, where they keep information about an online user
all in one place for convenient lookup and analysis, or accommodate sparse or constantly evolving data fields.
</p>
</li>
<li>
<p>
If some parts of the data change over time while related data remains constant, using multiple normalized tables lets you
replace certain parts of the data without reloading the entire data set. Conversely, if you receive related data all bundled
together, such as in JSON files, using complex types can save the overhead of splitting the related items across multiple
tables.
</p>
</li>
<li>
<p>
From a performance perspective:
</p>
<ul>
<li>
<p>
In Parquet or ORC tables, Impala can skip columns that are not referenced in a query, avoiding the I/O penalty of reading the
embedded data. When complex types are nested within a column, the data is physically divided at a very granular level; for
example, a query referring to data nested multiple levels deep in a complex type column does not have to read all the data
from that column, only the data for the relevant parts of the column type hierarchy.
<!-- Avro not supported in 5.5 / 2.3: Avro tables might experience some performance overhead due to
the need to skip past the complex type columns in each row when reading the data. -->
</p>
</li>
<li>
<p>
Complex types avoid the possibility of expensive join queries when data from fact and dimension tables is processed in
parallel across multiple hosts. All the information for a row containing complex types typically resides in the same data
block, and therefore does not need to be transmitted across the network when joining fields that are all part of the same
row.
</p>
</li>
<li>
<p>
The tradeoff with complex types is that fewer rows fit in each data block. Whether it is better to have more data blocks
with fewer rows, or fewer data blocks with many rows, depends on the distribution of your data and the characteristics of
your query workload. If the complex columns are rarely referenced, using them might lower efficiency. If you are seeing
low parallelism due to a small volume of data (relatively few data blocks) in each table partition, increasing the row
size by including complex columns might produce more data blocks and thus spread the work more evenly across the cluster.
See <xref href="impala_scalability.xml#scalability"/> for more on this advanced topic.
</p>
</li>
</ul>
</li>
</ul>
</conbody>
</concept>
<concept id="complex_types_hive">
<title>Differences Between Impala and Hive Complex Types</title>
<conbody>
<p>
Impala can query Parquet and ORC tables containing <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> columns
produced by Hive. There are some differences to be aware of between the Impala SQL and HiveQL syntax for complex types, primarily
for queries.
</p>
<p>
Impala supports a subset of the syntax that Hive supports for
specifying <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and
<codeph>MAP</codeph> types in the <codeph>CREATE TABLE</codeph>
statements.
</p>
<p>
Because Impala <codeph>STRUCT</codeph> columns include user-specified field names, you use the <codeph>NAMED_STRUCT()</codeph>
constructor in Hive rather than the <codeph>STRUCT()</codeph> constructor when you populate an Impala <codeph>STRUCT</codeph>
column using a Hive <codeph>INSERT</codeph> statement.
</p>
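<p>
For example, to populate a hypothetical <codeph>STRUCT&lt;city: STRING, zip: INT&gt;</codeph> column, a Hive
<codeph>INSERT</codeph> statement would use <codeph>NAMED_STRUCT()</codeph> so that the field names line up. (A sketch;
exact <codeph>INSERT</codeph> forms vary by Hive version.)
</p>
<codeblock>-- Run in Hive, not Impala; Impala cannot write complex type data.
INSERT INTO struct_demo
  SELECT 1, NAMED_STRUCT('city', 'San Jose', 'zip', 95113);
</codeblock>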
<p>
The Hive <codeph>UNION</codeph> type is not currently supported in Impala.
</p>
<p>
While Impala usually aims for a high degree of compatibility with HiveQL query syntax, Impala syntax differs from Hive for queries
involving complex types. The differences are intended to provide extra flexibility for queries involving these kinds of tables.
</p>
<ul>
<li>
Impala uses dot notation for referring to element names or elements within complex types, and join notation for
cross-referencing scalar columns with the elements of complex types within the same row, rather than the <codeph>LATERAL
VIEW</codeph> clause and <codeph>EXPLODE()</codeph> function of HiveQL.
</li>
<li>
Using join notation lets you use all the kinds of join queries with complex type columns. For example, you can use a
<codeph>LEFT OUTER JOIN</codeph>, <codeph>LEFT ANTI JOIN</codeph>, or <codeph>LEFT SEMI JOIN</codeph> query to evaluate
different scenarios where the complex columns do or do not contain any elements.
</li>
<li>
You can include references to collection types inside subqueries and inline views. For example, you can construct a
<codeph>FROM</codeph> clause where one of the <q>tables</q> is a subquery against a complex type column, or use a subquery
against a complex type column as the argument to an <codeph>IN</codeph> or <codeph>EXISTS</codeph> clause.
</li>
<li>
The Impala pseudocolumn <codeph>POS</codeph> lets you retrieve the position of elements in an array along with the elements
themselves, equivalent to the <codeph>POSEXPLODE()</codeph> function of HiveQL. You do not use index notation to retrieve a
single array element in a query; the join query loops through the array elements and you use <codeph>WHERE</codeph> clauses to
specify which elements to return.
</li>
<li>
<p>
Join clauses involving complex type columns do not require an <codeph>ON</codeph> or <codeph>USING</codeph> clause. Impala
implicitly applies the join key so that the correct array entries or map elements are associated with the correct row from the
table.
</p>
</li>
</ul>
</conbody>
</concept>
<concept id="complex_types_limits">
<title>Limitations and Restrictions for Complex Types</title>
<conbody>
<p>
Complex type columns can only be used in tables or partitions with the Parquet or ORC file format.
</p>
<p>
Complex type columns cannot be used as partition key columns in a partitioned table.
</p>
<p>
When you use complex types with the <codeph>ORDER BY</codeph>, <codeph>GROUP BY</codeph>, <codeph>HAVING</codeph>, or
<codeph>WHERE</codeph> clauses, you cannot refer to the column name by itself. Instead, you refer to the names of the scalar
values within the complex type, such as the <codeph>ITEM</codeph>, <codeph>POS</codeph>, <codeph>KEY</codeph>, or
<codeph>VALUE</codeph> pseudocolumns, or the field names from a <codeph>STRUCT</codeph>.
</p>
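<p>
For example, with a hypothetical table <codeph>T</codeph> containing a column <codeph>A ARRAY&lt;STRING&gt;</codeph>, the
<codeph>WHERE</codeph> and <codeph>ORDER BY</codeph> clauses refer to the <codeph>ITEM</codeph> pseudocolumn, never to
<codeph>A</codeph> by itself:
</p>
<codeblock>-- Not allowed: WHERE a = ... or ORDER BY a.
-- Allowed: referring to the scalar ITEM pseudocolumn.
SELECT t.id, a.item
  FROM t, t.a a
 WHERE a.item LIKE 'A%'
 ORDER BY a.item;
</codeblock>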
<p>
The maximum depth of nesting for complex types is 100 levels.
</p>
<p conref="../shared/impala_common.xml#common/complex_types_max_length"/>
<p>
For ideal performance and scalability, use small or medium-sized collections, where all the complex columns contain at most a few
hundred megabytes per row. Remember, all the columns of a row are stored in the same HDFS data block, whose size in Parquet files
typically ranges from 256 MB to 1 GB.
</p>
<p>
Including complex type columns in a table introduces some overhead that might make queries that do not reference those columns
somewhat slower than Impala queries against tables without any complex type columns. Expect at most a 2x slowdown compared to
tables that do not have any complex type columns.
</p>
<p>
Currently, the <codeph>COMPUTE STATS</codeph> statement does not collect any statistics for columns containing complex types.
Impala uses heuristics to construct execution plans involving complex type columns.
</p>
<p>
Currently, Impala built-in functions and user-defined functions cannot accept complex types as parameters or produce them as
function return values. (When the complex type values are materialized in an Impala result set, the result set contains the scalar
components of the values, such as the <codeph>POS</codeph> or <codeph>ITEM</codeph> for an <codeph>ARRAY</codeph>, the
<codeph>KEY</codeph> or <codeph>VALUE</codeph> for a <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>; these
scalar data items <i>can</i> be used with built-in functions and UDFs as usual.)
</p>
<p conref="../shared/impala_common.xml#common/complex_types_read_only"/>
<p>
Currently, Impala can query complex type columns only from Parquet/ORC tables or Parquet/ORC partitions within partitioned tables.
Although you can use complex types in tables with Avro, text, and other file formats as part of your ETL pipeline, for example as
intermediate tables populated through Hive, doing analytics through Impala requires that the data eventually ends up in a Parquet/ORC
table. The requirement for Parquet/ORC data files means that you can use complex types with Impala tables hosted on other kinds of
file storage systems such as Isilon and Amazon S3, but you cannot use Impala to query complex types from HBase tables. See
<xref href="impala_complex_types.xml#complex_types_file_formats"/> for more details.
</p>
</conbody>
</concept>
</concept>
<concept id="complex_types_using">
<title>Using Complex Types from SQL</title>
<conbody>
<p>
When using complex types through SQL in Impala, you learn the <codeph>&lt; &gt;</codeph> delimiter notation for the complex
type columns in <codeph>CREATE TABLE</codeph> statements, and how to construct join queries to <q>unpack</q> the scalar values
nested inside the complex data structures. You might need to condense a traditional RDBMS or data warehouse schema into a smaller
number of Parquet tables, and use Hive, Spark, Pig, or another mechanism outside Impala to populate the tables with data.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="nested_types_ddl">
<title>Complex Type Syntax for DDL Statements</title>
<conbody>
<p>
The definition of <varname>data_type</varname>, as seen in the <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
statements, now includes complex types in addition to primitive types:
</p>
<codeblock> primitive_type
| array_type
| map_type
| struct_type
</codeblock>
<p>
Unions are not currently supported.
</p>
<p>
<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and
<codeph>MAP</codeph> column type declarations are specified in the
<codeph>CREATE TABLE</codeph> statement. You can also add or change
the type of complex columns through the <codeph>ALTER TABLE</codeph>
statement. </p>
<p> Currently, Impala queries allow complex types only in tables that
use the Parquet or ORC format. If an Impala query encounters complex
types in a table or partition using other file formats, the query
returns a runtime error. </p>
<p> You can use <codeph>ALTER TABLE ... SET FILEFORMAT PARQUET</codeph>
to change the file format of an existing table containing complex
types to Parquet, after which Impala can query it. Make sure to load
Parquet files into the table after changing the file format, because
the <codeph>ALTER TABLE ... SET FILEFORMAT</codeph> statement does not
convert existing data to the new file format. </p>
<p conref="../shared/impala_common.xml#common/complex_types_partitioning"/>
<p>
Because use cases for Impala complex types require that you already have Parquet/ORC data files produced outside of Impala, you can
use the Impala <codeph>CREATE TABLE LIKE PARQUET</codeph> syntax to produce a table with columns that match the structure of an
existing Parquet file, including complex type columns for nested data structures. Remember to include the <codeph>STORED AS
PARQUET</codeph> clause in this case, because even with <codeph>CREATE TABLE LIKE PARQUET</codeph>, the default file format of the
resulting table is still text.
</p>
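<p>
For example (the HDFS path below is hypothetical):
</p>
<codeblock>-- Column definitions, including complex types, are derived from the
-- schema embedded in the existing Parquet data file.
CREATE TABLE contacts_copy
  LIKE PARQUET '/user/hive/warehouse/contacts/data.parq'
  STORED AS PARQUET;  -- without this clause, the table defaults to text format
</codeblock>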
<p>
Because the complex columns are omitted from the result set of an Impala <codeph>SELECT *</codeph> or <codeph>SELECT
<varname>col_name</varname></codeph> query, and because Impala currently does not support writing Parquet files with complex type
columns, you cannot use the <codeph>CREATE TABLE AS SELECT</codeph> syntax to create a table with nested type columns.
</p>
<note>
<p>
Once you have a table set up with complex type columns, use the <codeph>DESCRIBE</codeph> and <codeph>SHOW CREATE TABLE</codeph>
statements to see the correct notation with <codeph>&lt;</codeph> and <codeph>&gt;</codeph> delimiters and comma and colon
separators within the complex type definitions. If you do not have existing data with the same layout as the table, you can
query the empty table to practice with the notation for the <codeph>SELECT</codeph> statement. In the <codeph>SELECT</codeph>
list, you use dot notation and pseudocolumns such as <codeph>ITEM</codeph>, <codeph>KEY</codeph>, and <codeph>VALUE</codeph> for
referring to items within the complex type columns. In the <codeph>FROM</codeph> clause, you use join notation to construct
table aliases for any referenced <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns.
</p>
</note>
<!-- To do: show some simple CREATE TABLE statements for each of the complex types, without so much backstory for the schema. -->
<p>
For example, when defining a table that holds contact information, you might represent phone numbers differently depending on the
expected layout and relationships of the data, and how well you can predict those properties in advance.
</p>
<p>
Here are different ways that you might represent phone numbers in a traditional relational schema, with equivalent representations
using complex types.
</p>
<fig id="complex_types_phones_flat_fixed">
<title>Traditional Relational Representation of Phone Numbers: Single Table</title>
<p>
The traditional, simplest way to represent phone numbers in a relational table is to store all contact info in a single table,
with all columns having scalar types, and each potential phone number represented as a separate column. In this example, each
person can have only these three types of phone numbers. If a person does not have a particular kind of phone number, the
corresponding column is <codeph>NULL</codeph> for that row.
</p>
<codeblock>
CREATE TABLE contacts_fixed_phones
(
id BIGINT
, name STRING
, address STRING
, home_phone STRING
, work_phone STRING
, mobile_phone STRING
) STORED AS PARQUET;
</codeblock>
</fig>
<fig id="complex_types_phones_array">
<title>An Array of Phone Numbers</title>
<p>
Using a complex type column to represent the phone numbers adds some extra flexibility. Now there could be an unlimited number
of phone numbers. Because the array elements have an order but not symbolic names, you could decide in advance that
phone_number[0] is the home number, [1] is the work number, [2] is the mobile number, and so on. (In subsequent examples, you
will see how to create a more flexible naming scheme using other complex type variations, such as a <codeph>MAP</codeph> or an
<codeph>ARRAY</codeph> where each element is a <codeph>STRUCT</codeph>.)
</p>
<codeblock><![CDATA[
CREATE TABLE contacts_array_of_phones
(
id BIGINT
, name STRING
, address STRING
, phone_number ARRAY <STRING>
) STORED AS PARQUET;
]]>
</codeblock>
</fig>
<fig id="complex_types_phones_map">
<title>A Map of Phone Numbers</title>
<p>
Another way to represent an arbitrary set of phone numbers is with a <codeph>MAP</codeph> column. With a <codeph>MAP</codeph>,
each element is associated with a key value that you specify, which could be a numeric, string, or other scalar type. This
example uses a <codeph>STRING</codeph> key to give each phone number a name, such as <codeph>'home'</codeph> or
<codeph>'mobile'</codeph>. A query could filter the data based on the key values, or display the key values in reports.
</p>
<codeblock><![CDATA[
CREATE TABLE contacts_unlimited_phones
(
id BIGINT, name STRING, address STRING, phone_number MAP <STRING, STRING>
) STORED AS PARQUET;
]]>
</codeblock>
</fig>
<fig id="complex_types_phones_flat_normalized">
<title>Traditional Relational Representation of Phone Numbers: Normalized Tables</title>
<p>
If you are an experienced database designer, you already know how to work around the limitations of the single-table schema from
<xref href="#nested_types_ddl/complex_types_phones_flat_fixed"/>. By normalizing the schema, with the phone numbers in their own
table, you can associate an arbitrary set of phone numbers with each person, and associate additional details with each phone
number, such as whether it is a home, work, or mobile phone.
</p>
<p>
The flexibility of this approach comes with some drawbacks. Reconstructing all the data for a particular person requires a join
query, which might require performance tuning on Hadoop because the data from each table might be transmitted from a different
host. Data management tasks such as backups and refreshing the data require dealing with multiple tables instead of a single
table.
</p>
<p>
This example illustrates a traditional database schema that stores contact information normalized across two tables. The fact
table establishes the identity and basic information about each person. A dimension table stores information only about phone
numbers, using an ID value to associate each phone number with a person ID from the fact table. Each person can have zero, one,
or many phones; the categories are not restricted to a few predefined ones; and the phone table can contain as many columns as
desired, to represent all sorts of details about each phone number.
</p>
<codeblock>
CREATE TABLE fact_contacts (id BIGINT, name STRING, address STRING) STORED AS PARQUET;
CREATE TABLE dim_phones
(
contact_id BIGINT
, category STRING
, international_code STRING
, area_code STRING
, exchange STRING
, extension STRING
, mobile BOOLEAN
, carrier STRING
, current BOOLEAN
, service_start_date TIMESTAMP
, service_end_date TIMESTAMP
)
STORED AS PARQUET;
</codeblock>
</fig>
<fig id="complex_types_phones_array_struct">
<title>Phone Numbers Represented as an Array of Structs</title>
<p>
To represent a schema equivalent to the one from <xref href="#nested_types_ddl/complex_types_phones_flat_normalized"/> using
complex types, this example uses an <codeph>ARRAY</codeph> where each array element is a <codeph>STRUCT</codeph>. As with the
earlier complex type examples, each person can have an arbitrary set of associated phone numbers. Making each array element into
a <codeph>STRUCT</codeph> lets us associate multiple data items with each phone number, and give a separate name and type to
each data item. The <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph> elements reproduce the columns of the dimension
table from the previous example.
</p>
<p>
You can run all the same kinds of queries against the complex type schema as against the normalized schema from the previous
example. The advantages of the complex type design are convenience and performance. Your backup and ETL processes now deal with
only a single table, and when a query uses a join to cross-reference the information about a person with their associated
phone numbers, all the relevant data for each row resides in the same HDFS data block, so each row can be processed on a
single host without requiring network transmission.
</p>
<codeblock><![CDATA[
CREATE TABLE contacts_detailed_phones
(
id BIGINT, name STRING, address STRING
, phone ARRAY < STRUCT <
category: STRING
, international_code: STRING
, area_code: STRING
, exchange: STRING
, extension: STRING
, mobile: BOOLEAN
, carrier: STRING
, current: BOOLEAN
, service_start_date: TIMESTAMP
, service_end_date: TIMESTAMP
>>
) STORED AS PARQUET;
]]>
</codeblock>
</fig>
</conbody>
</concept>
<concept id="complex_types_sql">
<title>SQL Statements that Support Complex Types</title>
<conbody>
<p>
The Impala SQL statements that support complex types are currently
<codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>,
<codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph>,
<codeph><xref href="impala_describe.xml#describe">DESCRIBE</xref></codeph>,
<codeph><xref href="impala_load_data.xml#load_data">LOAD DATA</xref></codeph>, and
<codeph><xref href="impala_select.xml#select">SELECT</xref></codeph>. That is, currently Impala can create or alter tables
containing complex type columns, examine the structure of a table containing complex type columns, import existing data files
containing complex type columns into a table, and query Parquet/ORC tables containing complex types.
</p>
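<p>
For example, a typical workflow defines a table with a complex type column, inspects its layout, and then queries it. The
following sketch is illustrative only; the table name and staging path are hypothetical:
</p>
<codeblock><![CDATA[-- Create a table with a complex type column.
CREATE TABLE contacts_example (id BIGINT, phones ARRAY <STRING>) STORED AS PARQUET;
-- Examine the structure of the table, including the complex column.
DESCRIBE contacts_example;
-- Import existing Parquet data files with a matching layout (hypothetical path).
LOAD DATA INPATH '/staging/contacts_example' INTO TABLE contacts_example;
-- Query the complex column using join notation.
SELECT id, phones.ITEM FROM contacts_example, contacts_example.phones;
]]>
</codeblock>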
<p conref="../shared/impala_common.xml#common/complex_types_read_only"/>
<p outputclass="toc inpage"/>
</conbody>
<concept id="complex_types_ddl">
<title>DDL Statements and Complex Types</title>
<conbody>
<p>
Column specifications for complex or nested types use <codeph>&lt;</codeph> and <codeph>&gt;</codeph> delimiters:
</p>
<codeblock><![CDATA[-- What goes inside the < > for an ARRAY is a single type, either a scalar or another
-- complex type (ARRAY, STRUCT, or MAP).
CREATE TABLE array_t
(
id BIGINT,
a1 ARRAY <STRING>,
a2 ARRAY <BIGINT>,
a3 ARRAY <TIMESTAMP>,
a4 ARRAY <STRUCT <f1: STRING, f2: INT, f3: BOOLEAN>>
)
STORED AS PARQUET;
-- What goes inside the < > for a MAP is two comma-separated types specifying the types of the key-value pair:
-- a scalar type representing the key, and a scalar or complex type representing the value.
CREATE TABLE map_t
(
id BIGINT,
m1 MAP <STRING, STRING>,
m2 MAP <STRING, BIGINT>,
m3 MAP <BIGINT, STRING>,
m4 MAP <BIGINT, BIGINT>,
m5 MAP <STRING, ARRAY <STRING>>
)
STORED AS PARQUET;
-- What goes inside the < > for a STRUCT is a comma-separated list of fields, each field defined as
-- name:type. The type can be a scalar or a complex type. The field names for each STRUCT do not clash
-- with the names of table columns or fields in other STRUCTs. A STRUCT is most often used inside
-- an ARRAY or a MAP rather than as a top-level column.
CREATE TABLE struct_t
(
id BIGINT,
s1 STRUCT <f1: STRING, f2: BIGINT>,
s2 ARRAY <STRUCT <f1: INT, f2: TIMESTAMP>>,
s3 MAP <BIGINT, STRUCT <name: STRING, birthday: TIMESTAMP>>
)
STORED AS PARQUET;
]]>
</codeblock>
</conbody>
</concept>
<concept id="complex_types_queries">
<title>Queries and Complex Types</title>
<conbody>
<!-- Hive does the JSON output business: http://www.datascience-labs.com/hive/hiveql-data-manipulation/ -->
<!-- SELECT * works but skips any nested type coloumns. -->
<p>
The result set of an Impala query always contains only scalar types; the elements and fields within any complex type columns must
be <q>unpacked</q> using join queries. A query cannot directly retrieve the entire value of a complex type column; Impala
returns an error in this case. Queries using <codeph>SELECT *</codeph> are allowed for tables with complex types, but the
columns with complex types are skipped.
</p>
<p>
The following example shows how referring directly to a complex type column returns an error, while <codeph>SELECT *</codeph> on
the same table succeeds, but only retrieves the scalar columns.
</p>
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
<!-- Original error message:
ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING,o_totalprice:DECIMAL(12,2),o_orderdate:STRING,o_orderpriority:STRING,o_clerk:STRING,o_shippriority:INT,o_comment:STRING,o_lineitems:ARRAY<STRUCT<l_partkey:BIGINT,l_suppkey:BIGINT,l_linenumber:INT,l_quantity:DECIMAL(12,2),l_extendedprice:DECIMAL(12,2),l_discount:DECIMAL(12,2),l_tax:DECIMAL(12,2),l_returnflag:STRING,l_linestatus:STRING,l_shipdate:STRING,l_commitdate:STRING,l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
-->
<codeblock><![CDATA[SELECT c_orders FROM customer LIMIT 1;
ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING, ... l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
Only scalar types are allowed in the select list.
-- The original table has several scalar columns and one complex column.
DESCRIBE customer;
+--------------+------------------------------------+
| name | type |
+--------------+------------------------------------+
| c_custkey | bigint |
| c_name | string |
...
| c_orders | array<struct< |
| | o_orderkey:bigint, |
| | o_orderstatus:string, |
| | o_totalprice:decimal(12,2), |
...
| | >> |
+--------------+------------------------------------+
-- When we SELECT * from that table, only the scalar columns come back in the result set.
CREATE TABLE select_star_customer STORED AS PARQUET AS SELECT * FROM customer;
+------------------------+
| summary |
+------------------------+
| Inserted 150000 row(s) |
+------------------------+
-- The c_orders column, being of complex type, was not included in the SELECT * result set.
DESC select_star_customer;
+--------------+---------------+
| name | type |
+--------------+---------------+
| c_custkey | bigint |
| c_name | string |
| c_address | string |
| c_nationkey | smallint |
| c_phone | string |
| c_acctbal | decimal(12,2) |
| c_mktsegment | string |
| c_comment | string |
+--------------+---------------+
]]>
</codeblock>
<!-- To do: These "references to..." bits could be promoted to their own 'expressions' subheads. -->
<p>
References to fields within <codeph>STRUCT</codeph> columns use dot notation. If the field name is unambiguous, you can omit
qualifiers such as table name, column name, or even the <codeph>ITEM</codeph> or <codeph>VALUE</codeph> pseudocolumn names for
<codeph>STRUCT</codeph> elements inside an <codeph>ARRAY</codeph> or a <codeph>MAP</codeph>.
</p>
<!-- To do: rewrite example to use CUSTOMER table. -->
<!-- Don't think TPC-H schema has a bare STRUCT to use in such a simple query though. -->
<!-- Perhaps reuse the STRUCT_DEMO example here. -->
<codeblock>SELECT id, address.city FROM customers WHERE address.zip = 94305;
</codeblock>
<p>
References to elements within <codeph>ARRAY</codeph> columns use the <codeph>ITEM</codeph> pseudocolumn:
</p>
<!-- To do: shorten qualified names. -->
<codeblock>SELECT r_name, r_nations.item.n_name FROM region, region.r_nations LIMIT 7;
+--------+----------------+
| r_name | item.n_name |
+--------+----------------+
| EUROPE | UNITED KINGDOM |
| EUROPE | RUSSIA |
| EUROPE | ROMANIA |
| EUROPE | GERMANY |
| EUROPE | FRANCE |
| ASIA | VIETNAM |
| ASIA | CHINA |
+--------+----------------+
</codeblock>
<p>
References to fields within <codeph>MAP</codeph> columns use the <codeph>KEY</codeph> and <codeph>VALUE</codeph> pseudocolumns.
In this example, once the query establishes the alias <codeph>MAP_FIELD</codeph> for a <codeph>MAP</codeph> column with a
<codeph>STRING</codeph> key and an <codeph>INT</codeph> value, the query can refer to <codeph>MAP_FIELD.KEY</codeph> and
<codeph>MAP_FIELD.VALUE</codeph>, which have zero, one, or many instances for each row from the containing table.
</p>
<codeblock><![CDATA[DESCRIBE table_0;
+---------+-----------------------+
| name | type |
+---------+-----------------------+
| field_0 | string |
| field_1 | map<string,int> |
...
SELECT field_0, map_field.key, map_field.value
FROM table_0, table_0.field_1 AS map_field
WHERE length(field_0) = 1
LIMIT 10;
+---------+-----------+-------+
| field_0 | key | value |
+---------+-----------+-------+
| b | gshsgkvd | NULL |
| b | twrtcxj6 | 18 |
| b | 2vp5 | 39 |
| b | fh0s | 13 |
| v | 2 | 41 |
| v | 8b58mz | 20 |
| v | hw | 16 |
| v | 65l388pyt | 29 |
| v | 03k68g91z | 30 |
| v | r2hlg5b | NULL |
+---------+-----------+-------+
]]>
</codeblock>
<!-- To do: refer to or reuse examples from the other subtopics that discuss pseudocolumns etc. -->
<p>
When complex types are nested inside each other, you use a combination of joins, pseudocolumn names, and dot notation to refer
to specific fields at the appropriate level. This is the most frequent form of query syntax for complex columns, because the
typical use case involves two levels of complex types, such as an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements.
</p>
<!-- To do: rewrite example to use CUSTOMER table. -->
<!-- This is my own manufactured example so I have the table, and the query works, but I don't have sample data to show. -->
<codeblock>SELECT id, phone_numbers.area_code FROM contact_info_many_structs INNER JOIN contact_info_many_structs.phone_numbers phone_numbers LIMIT 3;
</codeblock>
<p>
You can express relationships between <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns at different levels as joins. You
include comparison operators between fields at the top level and within the nested type columns so that Impala can do the
appropriate join operation.
</p>
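<p>
For example, with a hypothetical <codeph>CUSTOMERS</codeph> table where each row holds a preferred credit card column alongside
an <codeph>ARRAY</codeph> of order <codeph>STRUCT</codeph> elements, a query might compare a top-level column with a field
inside the array:
</p>
<codeblock><![CDATA[-- Hypothetical schema: customers (id BIGINT, preferred_cc STRING,
--   orders ARRAY <STRUCT <txn_id: BIGINT, cc: STRING>>).
-- Return only the orders paid for with each customer's preferred credit card.
SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc;
]]>
</codeblock>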
<!-- Don't think TPC-H schema has any instances where outer field matches up with inner one though. -->
<!-- But don't think this usage is important enough to call out at this early point. Hide the example for now. -->
<!--
<codeblock>SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc;
SELECT c.id, o.txn_id FROM customers c, c.orders o;
</codeblock>
-->
<!-- To do: move these examples down, to the examples subtopic at the end. -->
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
<p>
For example, the following queries work equivalently. They each return customer and order data for customers that have at least
one order.
</p>
<codeblock>SELECT c.c_name, o.o_orderkey FROM customer c, c.c_orders o LIMIT 5;
+--------------------+------------+
| c_name | o_orderkey |
+--------------------+------------+
| Customer#000072578 | 558821 |
| Customer#000072578 | 2079810 |
| Customer#000072578 | 5768068 |
| Customer#000072578 | 1805604 |
| Customer#000072578 | 3436389 |
+--------------------+------------+
SELECT c.c_name, o.o_orderkey FROM customer c INNER JOIN c.c_orders o LIMIT 5;
+--------------------+------------+
| c_name | o_orderkey |
+--------------------+------------+
| Customer#000072578 | 558821 |
| Customer#000072578 | 2079810 |
| Customer#000072578 | 5768068 |
| Customer#000072578 | 1805604 |
| Customer#000072578 | 3436389 |
+--------------------+------------+
</codeblock>
<p>
The following query using an outer join returns customers that have orders, plus customers with no orders (no entries in the
<codeph>C_ORDERS</codeph> array):
</p>
<codeblock><![CDATA[SELECT c.c_custkey, o.o_orderkey
FROM customer c LEFT OUTER JOIN c.c_orders o
LIMIT 5;
+-----------+------------+
| c_custkey | o_orderkey |
+-----------+------------+
| 60210 | NULL |
| 147873 | NULL |
| 72578 | 558821 |
| 72578 | 2079810 |
| 72578 | 5768068 |
+-----------+------------+
]]>
</codeblock>
<p>
The following query returns <i>only</i> customers that have no orders. (With <codeph>LEFT ANTI JOIN</codeph> or <codeph>LEFT
SEMI JOIN</codeph>, the query can only refer to columns from the left-hand table, because by definition there is no matching
information in the right-hand table.)
</p>
<codeblock><![CDATA[SELECT c.c_custkey, c.c_name
FROM customer c LEFT ANTI JOIN c.c_orders o
LIMIT 5;
+-----------+--------------------+
| c_custkey | c_name |
+-----------+--------------------+
| 60210 | Customer#000060210 |
| 147873 | Customer#000147873 |
| 141576 | Customer#000141576 |
| 85365 | Customer#000085365 |
| 70998 | Customer#000070998 |
+-----------+--------------------+
]]>
</codeblock>
<!-- To do: promote the correlated subquery aspect into its own subtopic. -->
<p>
You can also perform correlated subqueries to examine the properties of complex type columns for each row in the result set.
</p>
<p>
Count the number of orders per customer. Note the correlated reference to the table alias <codeph>C</codeph>. The
<codeph>COUNT(*)</codeph> operation applies to all the elements of the <codeph>C_ORDERS</codeph> array for the corresponding
row, avoiding the need for a <codeph>GROUP BY</codeph> clause.
</p>
<codeblock>SELECT c_name, howmany FROM customer c, (SELECT COUNT(*) howmany FROM c.c_orders) v LIMIT 5;
+--------------------+---------+
| c_name | howmany |
+--------------------+---------+
| Customer#000030065 | 15 |
| Customer#000065455 | 18 |
| Customer#000113644 | 21 |
| Customer#000111078 | 0 |
| Customer#000024621 | 0 |
+--------------------+---------+
</codeblock>
<p>
Count the number of orders per customer, ignoring any customers that have not placed any orders:
</p>
<codeblock>SELECT c_name, howmany_orders
FROM
customer c,
(SELECT COUNT(*) howmany_orders FROM c.c_orders) subq1
WHERE howmany_orders > 0
LIMIT 5;
+--------------------+----------------+
| c_name | howmany_orders |
+--------------------+----------------+
| Customer#000072578 | 7 |
| Customer#000046378 | 26 |
| Customer#000069815 | 11 |
| Customer#000079058 | 12 |
| Customer#000092239 | 26 |
+--------------------+----------------+
</codeblock>
<p>
Count the number of line items in each order. The reference to <codeph>C.C_ORDERS</codeph> in the <codeph>FROM</codeph> clause
is needed because the <codeph>O_ORDERKEY</codeph> field is a member of the elements in the <codeph>C_ORDERS</codeph> array. The
subquery labelled <codeph>SUBQ1</codeph> is correlated: it is re-evaluated for the <codeph>C_ORDERS.O_LINEITEMS</codeph> array
from each row of the <codeph>CUSTOMER</codeph> table.
</p>
<codeblock>SELECT c_name, o_orderkey, howmany_line_items
FROM
customer c,
c.c_orders t2,
(SELECT COUNT(*) howmany_line_items FROM c.c_orders.o_lineitems) subq1
WHERE howmany_line_items > 0
LIMIT 5;
+--------------------+------------+--------------------+
| c_name | o_orderkey | howmany_line_items |
+--------------------+------------+--------------------+
| Customer#000020890 | 1884930 | 95 |
| Customer#000020890 | 4570754 | 95 |
| Customer#000020890 | 3771072 | 95 |
| Customer#000020890 | 2555489 | 95 |
| Customer#000020890 | 919171 | 95 |
+--------------------+------------+--------------------+
</codeblock>
<p>
Get the number of orders, the average order price, and the maximum number of items in any order, per customer. For this example, the
subqueries labelled <codeph>SUBQ1</codeph> and <codeph>SUBQ2</codeph> are correlated: they are re-evaluated for each row from
the original <codeph>CUSTOMER</codeph> table, and only apply to the complex columns associated with that row.
</p>
<codeblock>SELECT c_name, howmany, average_price, most_items
FROM
customer c,
(SELECT COUNT(*) howmany, AVG(o_totalprice) average_price FROM c.c_orders) subq1,
(SELECT MAX(l_quantity) most_items FROM c.c_orders.o_lineitems ) subq2
LIMIT 5;
+--------------------+---------+---------------+------------+
| c_name | howmany | average_price | most_items |
+--------------------+---------+---------------+------------+
| Customer#000030065 | 15 | 128908.34 | 50.00 |
| Customer#000088191 | 0 | NULL | NULL |
| Customer#000101555 | 10 | 164250.31 | 50.00 |
| Customer#000022092 | 0 | NULL | NULL |
| Customer#000036277 | 27 | 166040.06 | 50.00 |
+--------------------+---------+---------------+------------+
</codeblock>
<p>
For example, these queries show how to access information about the <codeph>ARRAY</codeph> elements within the
<codeph>CUSTOMER</codeph> table from the <q>nested TPC-H</q> schema, starting with the initial <codeph>ARRAY</codeph> elements
and progressing to examine the <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph>, and then the elements nested within
another <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>:
</p>
<codeblock><![CDATA[-- How many orders does each customer have?
-- The type of the ARRAY column doesn't matter; this is just counting the elements.
SELECT c_custkey, count(*)
FROM customer, customer.c_orders
GROUP BY c_custkey
LIMIT 5;
+-----------+----------+
| c_custkey | count(*) |
+-----------+----------+
| 61081 | 21 |
| 115987 | 15 |
| 69685 | 19 |
| 109124 | 15 |
| 50491 | 12 |
+-----------+----------+
-- How many line items are part of each customer order?
-- Now we examine a field from a STRUCT nested inside the ARRAY.
SELECT c_custkey, c_orders.o_orderkey, count(*)
FROM customer, customer.c_orders c_orders, c_orders.o_lineitems
GROUP BY c_custkey, c_orders.o_orderkey
LIMIT 5;
+-----------+------------+----------+
| c_custkey | o_orderkey | count(*) |
+-----------+------------+----------+
| 63367 | 4985959 | 7 |
| 53989 | 1972230 | 2 |
| 143513 | 5750498 | 5 |
| 17849 | 4857989 | 1 |
| 89881 | 1046437 | 1 |
+-----------+------------+----------+
-- What are the line items in each customer order?
-- One of the STRUCT fields inside the ARRAY is another
-- ARRAY containing STRUCT elements. The join finds
-- all the related items from both levels of ARRAY.
SELECT c_custkey, o_orderkey, l_partkey
FROM customer, customer.c_orders, c_orders.o_lineitems
LIMIT 5;
+-----------+------------+-----------+
| c_custkey | o_orderkey | l_partkey |
+-----------+------------+-----------+
| 113644 | 2738497 | 175846 |
| 113644 | 2738497 | 27309 |
| 113644 | 2738497 | 175873 |
| 113644 | 2738497 | 88559 |
| 113644 | 2738497 | 8032 |
+-----------+------------+-----------+
]]>
</codeblock>
</conbody>
</concept>
</concept>
<concept id="pseudocolumns">
<title>Pseudocolumns for ARRAY and MAP Types</title>
<conbody>
<p>
Each element in an <codeph>ARRAY</codeph> type has a position, indexed starting from zero, and a value. Each element in a
<codeph>MAP</codeph> type represents a key-value pair. Impala provides pseudocolumns that let you retrieve this metadata as part
of a query, or filter query results by referring to the pseudocolumns in a <codeph>WHERE</codeph> clause. You refer to the
pseudocolumns as part of qualified column names in queries:
</p>
<ul>
<li>
<codeph>ITEM</codeph>: The value of an array element. If the <codeph>ARRAY</codeph> contains <codeph>STRUCT</codeph> elements,
you can refer to either <codeph><varname>array_name</varname>.ITEM.<varname>field_name</varname></codeph> or use the shorthand
<codeph><varname>array_name</varname>.<varname>field_name</varname></codeph>.
</li>
<li>
<codeph>POS</codeph>: The position of an element within an array.
</li>
<li>
<codeph>KEY</codeph>: The value forming the first part of a key-value pair in a map. It is not necessarily unique.
</li>
<li>
<codeph>VALUE</codeph>: The data item forming the second part of a key-value pair in a map. If the <codeph>VALUE</codeph> part
of the <codeph>MAP</codeph> element is a <codeph>STRUCT</codeph>, you can refer to either
<codeph><varname>map_name</varname>.VALUE.<varname>field_name</varname></codeph> or use the shorthand
<codeph><varname>map_name</varname>.<varname>field_name</varname></codeph>.
</li>
</ul>
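<p>
For example, given a hypothetical table with one <codeph>ARRAY</codeph> column and one <codeph>MAP</codeph> column, queries can
refer to all four pseudocolumns:
</p>
<codeblock><![CDATA[CREATE TABLE pseudocolumn_demo
(
  id BIGINT,
  a ARRAY <STRING>,
  m MAP <STRING, BIGINT>
)
STORED AS PARQUET;
-- ITEM and POS apply to the array elements.
SELECT id, a.POS, a.ITEM FROM pseudocolumn_demo, pseudocolumn_demo.a;
-- KEY and VALUE apply to the map entries.
SELECT id, m.KEY, m.VALUE FROM pseudocolumn_demo, pseudocolumn_demo.m;
]]>
</codeblock>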
<!-- To do: Consider whether to move the detailed subtopics underneath ARRAY and MAP instead of embedded here. -->
<p outputclass="toc inpage"/>
</conbody>
<concept id="item">
<title id="pos">ITEM and POS Pseudocolumns</title>
<conbody>
<p>
When an <codeph>ARRAY</codeph> column contains <codeph>STRUCT</codeph> elements, you can refer to a field within the
<codeph>STRUCT</codeph> using a qualified name of the form
<codeph><varname>array_column</varname>.<varname>field_name</varname></codeph>. If the <codeph>ARRAY</codeph> contains scalar
values, Impala recognizes the special name <codeph><varname>array_column</varname>.ITEM</codeph> to represent the value of each
scalar array element. For example, if a column contained an <codeph>ARRAY</codeph> where each element was a
<codeph>STRING</codeph>, you would use <codeph><varname>array_name</varname>.ITEM</codeph> to refer to each scalar value in the
<codeph>SELECT</codeph> list, the <codeph>WHERE</codeph> clause, or other clauses.
</p>
<p>
This example shows a table with two <codeph>ARRAY</codeph> columns whose elements are of the scalar type
<codeph>STRING</codeph>. When referring to the values of the array elements in the <codeph>SELECT</codeph> list,
<codeph>WHERE</codeph> clause, or <codeph>ORDER BY</codeph> clause, you use the <codeph>ITEM</codeph> pseudocolumn because
within the array, the individual elements have no defined names.
</p>
<codeblock><![CDATA[create TABLE persons_of_interest
(
person_id BIGINT,
aliases ARRAY <STRING>,
associates ARRAY <STRING>,
real_name STRING
)
STORED AS PARQUET;
-- Get all the aliases of each person.
SELECT real_name, aliases.ITEM
FROM persons_of_interest, persons_of_interest.aliases
ORDER BY real_name, aliases.item;
-- Search for particular associates of each person.
SELECT real_name, associates.ITEM
FROM persons_of_interest, persons_of_interest.associates
WHERE associates.item LIKE '% MacGuffin';
]]>
</codeblock>
<p>
Because an array is inherently an ordered data structure, Impala recognizes the special name
<codeph><varname>array_column</varname>.POS</codeph> to represent the numeric position of each element within the array. The
<codeph>POS</codeph> pseudocolumn lets you filter or reorder the result set based on the sequence of array elements.
</p>
<p>
The following example uses a table from the nested TPC-H schema. The <codeph>REGION</codeph> table has only a few rows, such as
one row for Europe and one for Asia. The row for each region represents all the countries in that region as an
<codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements:
</p>
<codeblock><![CDATA[[localhost:21000] > desc region;
+-------------+--------------------------------------------------------------------+
| name | type |
+-------------+--------------------------------------------------------------------+
| r_regionkey | smallint |
| r_name | string |
| r_comment | string |
| r_nations | array<struct<n_nationkey:smallint,n_name:string,n_comment:string>> |
+-------------+--------------------------------------------------------------------+
]]>
</codeblock>
<p>
To find the countries within a specific region, you use a join query. To find out the order of elements in the array, you also
refer to the <codeph>POS</codeph> pseudocolumn in the select list:
</p>
<codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, <b>r2.POS</b>
> FROM region r1 INNER JOIN r1.r_nations r2
> WHERE r1.r_name = 'ASIA';
+--------+-----------+-----+
| r_name | n_name | pos |
+--------+-----------+-----+
| ASIA | VIETNAM | 0 |
| ASIA | CHINA | 1 |
| ASIA | JAPAN | 2 |
| ASIA | INDONESIA | 3 |
| ASIA | INDIA | 4 |
+--------+-----------+-----+
</codeblock>
<p>
Once you know the positions of the elements, you can use that information in subsequent queries, for example to change the
ordering of results from the complex type column or to filter certain elements from the array:
</p>
<codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS
> FROM region r1 INNER JOIN r1.r_nations r2
> WHERE r1.r_name = 'ASIA'
> <b>ORDER BY r2.POS DESC</b>;
+--------+-----------+-----+
| r_name | n_name | pos |
+--------+-----------+-----+
| ASIA | INDIA | 4 |
| ASIA | INDONESIA | 3 |
| ASIA | JAPAN | 2 |
| ASIA | CHINA | 1 |
| ASIA | VIETNAM | 0 |
+--------+-----------+-----+
[localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS
> FROM region r1 INNER JOIN r1.r_nations r2
> WHERE r1.r_name = 'ASIA' AND <b>r2.POS BETWEEN 1 and 3</b>;
+--------+-----------+-----+
| r_name | n_name | pos |
+--------+-----------+-----+
| ASIA | CHINA | 1 |
| ASIA | JAPAN | 2 |
| ASIA | INDONESIA | 3 |
+--------+-----------+-----+
</codeblock>
</conbody>
</concept>
<concept id="key">
<title id="value">KEY and VALUE Pseudocolumns</title>
<conbody>
<p>
The <codeph>MAP</codeph> data type is suitable for representing sparse or wide data structures, where each row might only have
entries for a small subset of named fields. Because the element names (the map keys) vary depending on the row, a query must be
able to refer to both the key and the value parts of each key-value pair. The <codeph>KEY</codeph> and <codeph>VALUE</codeph>
pseudocolumns let you refer to the parts of the key-value pair independently within the query, as
<codeph><varname>map_column</varname>.KEY</codeph> and <codeph><varname>map_column</varname>.VALUE</codeph>.
</p>
<p>
The <codeph>KEY</codeph> must always be a scalar type, such as <codeph>STRING</codeph>, <codeph>BIGINT</codeph>, or
<codeph>TIMESTAMP</codeph>. It can be <codeph>NULL</codeph>. Values of the <codeph>KEY</codeph> field are not necessarily unique
within the same <codeph>MAP</codeph>. You apply any required <codeph>DISTINCT</codeph>, <codeph>GROUP BY</codeph>, and other
clauses in the query, and loop through the result set to process all the values matching any specified keys.
</p>
<p>
The <codeph>VALUE</codeph> can be either a scalar type or another complex type. If the <codeph>VALUE</codeph> is a
<codeph>STRUCT</codeph>, you can construct a qualified name
<codeph><varname>map_column</varname>.VALUE.<varname>struct_field</varname></codeph> to refer to the individual fields inside
the value part. If the <codeph>VALUE</codeph> is an <codeph>ARRAY</codeph> or another <codeph>MAP</codeph>, you must include
another join condition that establishes a table alias for <codeph><varname>map_column</varname>.VALUE</codeph>, and then
construct another qualified name using that alias, for example <codeph><varname>table_alias</varname>.ITEM</codeph> or
<codeph><varname>table_alias</varname>.KEY</codeph> and <codeph><varname>table_alias</varname>.VALUE</codeph>.
</p>
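<p>
For example, with a hypothetical <codeph>MAP</codeph> column whose <codeph>VALUE</codeph> part is an <codeph>ARRAY</codeph>, the
query joins twice: once to unpack the key-value pairs, and again to unpack the array inside each value:
</p>
<codeblock><![CDATA[-- Hypothetical table: each tag name maps to an arbitrary list of strings.
CREATE TABLE tagged_notes
(
  note_id BIGINT,
  tags MAP <STRING, ARRAY <STRING>>
)
STORED AS PARQUET;
-- MAP_ENTRY is an alias for the key-value pairs; NESTED_ITEM is an alias
-- for the ARRAY in the VALUE part of each pair.
SELECT note_id, map_entry.KEY, nested_item.ITEM
FROM tagged_notes, tagged_notes.tags map_entry, map_entry.VALUE nested_item;
]]>
</codeblock>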
<p>
The following example shows different ways to access a <codeph>MAP</codeph> column using the <codeph>KEY</codeph> and
<codeph>VALUE</codeph> pseudocolumns. The <codeph>DETAILS</codeph> column has a <codeph>STRING</codeph> first part with short,
standardized values such as <codeph>'Recurring'</codeph>, <codeph>'Lucid'</codeph>, or <codeph>'Anxiety'</codeph>. This is the
<q>key</q> that is used to look up particular kinds of elements from the <codeph>MAP</codeph>. The second part, also a
<codeph>STRING</codeph>, is a longer free-form explanation. Impala gives you the standard pseudocolumn names
<codeph>KEY</codeph> and <codeph>VALUE</codeph> for the two parts, and you apply your own conventions and interpretations to the
underlying values.
</p>
<note>
If you find that the single-item nature of the <codeph>VALUE</codeph> makes it difficult to model your data accurately, the
solution is typically to add some nesting to the complex type. For example, to have several sets of key-value pairs, make the
column an <codeph>ARRAY</codeph> whose elements are <codeph>MAP</codeph>. To make a set of key-value pairs that holds more
elaborate information, make a <codeph>MAP</codeph> column whose <codeph>VALUE</codeph> part contains an <codeph>ARRAY</codeph>
or a <codeph>STRUCT</codeph>.
</note>
<codeblock><![CDATA[CREATE TABLE dream_journal
(
dream_id BIGINT,
details MAP <STRING,STRING>
)
STORED AS PARQUET;
]]>
-- What are all the types of dreams that are recorded?
SELECT DISTINCT details.KEY FROM dream_journal, dream_journal.details;
-- How many lucid dreams were recorded?
-- Because there is no GROUP BY, we count the 'Lucid' keys across all rows.
SELECT <b>COUNT(details.KEY)</b>
FROM dream_journal, dream_journal.details
WHERE <b>details.KEY = 'Lucid'</b>;
-- Print a report of a subset of dreams, filtering based on both the lookup key
-- and the detailed value.
SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE AS "Dream Summary"</b>
FROM dream_journal, dream_journal.details
WHERE
<b>details.KEY IN ('Happy', 'Pleasant', 'Joyous')</b>
AND <b>details.VALUE LIKE '%childhood%'</b>;
</codeblock>
<p>
The following example shows a more elaborate version of the previous table, where the <codeph>VALUE</codeph> part of the
<codeph>MAP</codeph> entry is a <codeph>STRUCT</codeph> rather than a scalar type. Now instead of referring to the
<codeph>VALUE</codeph> pseudocolumn directly, you use dot notation to refer to the <codeph>STRUCT</codeph> fields inside it.
</p>
<codeblock><![CDATA[CREATE TABLE better_dream_journal
(
dream_id BIGINT,
details MAP <STRING,STRUCT <summary: STRING, when_happened: TIMESTAMP, duration: DECIMAL(5,2), woke_up: BOOLEAN> >
)
STORED AS PARQUET;
]]>
-- Do more elaborate reporting and filtering by examining multiple attributes within the same dream.
SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE.summary AS "Dream Summary"</b>, <b>details.VALUE.duration AS "Duration"</b>
FROM better_dream_journal, better_dream_journal.details
WHERE
<b>details.KEY IN ('Anxiety', 'Nightmare')</b>
AND <b>details.VALUE.duration > 60</b>
AND <b>details.VALUE.woke_up = TRUE</b>;
-- Remember that if the ITEM or VALUE contains a STRUCT, you can reference
-- the STRUCT fields directly without the .ITEM or .VALUE qualifier.
SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.summary AS "Dream Summary"</b>, <b>details.duration AS "Duration"</b>
FROM better_dream_journal, better_dream_journal.details
WHERE
<b>details.KEY IN ('Anxiety', 'Nightmare')</b>
AND <b>details.duration > 60</b>
AND <b>details.woke_up = TRUE</b>;
</codeblock>
</conbody>
</concept>
</concept>
<concept id="complex_types_etl">
<!-- This topic overlaps in many ways with the preceding one. See which theme resonates with users, and combine them under the better title. -->
<title>Loading Data Containing Complex Types</title>
<conbody>
<p>
Because the Impala <codeph>INSERT</codeph> statement does not currently support creating new data with complex type columns, or
copying existing complex type values from one table to another, you primarily use Impala to query Parquet/ORC tables with complex
types where the data was inserted through Hive, or create tables with complex types where you already have existing Parquet/ORC
data files.
</p>
<p>
If you have created a Hive table with the Parquet/ORC file format and containing complex types, use the same table for Impala queries
with no changes. If you have such a Hive table in some other format, use a Hive <codeph>CREATE TABLE AS SELECT ... STORED AS
PARQUET</codeph> or <codeph>INSERT ... SELECT</codeph> statement to produce an equivalent Parquet table that Impala can query.
</p>
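<p>
For example, assuming an existing Hive table in text format named <codeph>events_txt</codeph> (a hypothetical name) that contains
complex type columns, a Hive session could produce a Parquet equivalent that Impala can then query:
</p>
<codeblock><![CDATA[-- In Hive: create a Parquet copy of a hypothetical text-format table
-- so that Impala can query its complex type columns.
CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events_txt;
]]>
</codeblock>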
<p> If you have existing Parquet data files containing complex types,
located outside of any Impala or Hive table, such as data files
created by Spark jobs, you can use an Impala <codeph>CREATE TABLE ...
STORED AS PARQUET</codeph> statement, followed by an Impala
<codeph>LOAD DATA</codeph> statement to move the data files into the
table. As an alternative, you can use an Impala <codeph>CREATE
EXTERNAL TABLE</codeph> statement to create a table pointing to the
HDFS directory that already contains the Parquet or ORC data
files.</p>
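<p>
As a sketch of the external table approach, assume that Parquet files written by a Spark job already reside in the hypothetical
HDFS directory <codeph>/user/etl/spark_output</codeph>:
</p>
<codeblock><![CDATA[-- The column definitions must match the schema of the existing data files.
CREATE EXTERNAL TABLE spark_events
(
  id BIGINT,
  tags ARRAY <STRING>
)
STORED AS PARQUET
LOCATION '/user/etl/spark_output';
]]>
</codeblock>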
<p>The simplest way to get started with complex type data is to take a
denormalized table containing duplicated values, and use an
<codeph>INSERT ... SELECT</codeph> statement to copy the data into a
Parquet table and condense the repeated values into complex types.
With the Hive <codeph>INSERT</codeph> statement, you use the
<codeph>COLLECT_LIST()</codeph>, <codeph>NAMED_STRUCT()</codeph>,
and <codeph>MAP()</codeph> constructor functions within a
<codeph>GROUP BY</codeph> query to produce the complex type values.
<codeph>COLLECT_LIST()</codeph> turns a sequence of values into an
<codeph>ARRAY</codeph>. <codeph>NAMED_STRUCT()</codeph> uses the
first, third, and so on arguments as the field names for a
<codeph>STRUCT</codeph>, to match the field names from the
<codeph>CREATE TABLE</codeph> statement. </p>
<note>
Because Hive currently cannot construct individual rows using complex types through the <codeph>INSERT ... VALUES</codeph> syntax,
you prepare the data in flat form in a separate table, then copy it to the table with complex columns using <codeph>INSERT ...
SELECT</codeph> and the complex type constructors. See <xref href="impala_complex_types.xml#complex_types_ex_hive_etl"/> for
examples.
</note>
</conbody>
</concept>
<concept id="complex_types_nesting">
<title>Using Complex Types as Nested Types</title>
<conbody>
<p>
The <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types can be the top-level types for <q>nested
type</q> columns. That is, each of these types can contain other complex or scalar types, with multiple levels of nesting to a
maximum depth of 100. For example, you can have an array of structures, a map containing other maps, a structure containing an
array of other structures, and so on. At the lowest level, there are always scalar types making up the fields of a
<codeph>STRUCT</codeph>, elements of an <codeph>ARRAY</codeph>, and keys and values of a <codeph>MAP</codeph>.
</p>
<p>
Schemas involving complex types typically use some level of nesting for the complex type columns.
</p>
<p>
For example, to model a relationship like a dimension table and a fact table, you typically use an <codeph>ARRAY</codeph> where
each array element is a <codeph>STRUCT</codeph>. The <codeph>STRUCT</codeph> fields represent what would traditionally be columns
in a separate joined table. It makes little sense to use a <codeph>STRUCT</codeph> as the top-level type for a column, because you
could just make the fields of the <codeph>STRUCT</codeph> into regular table columns.
</p>
<!-- To do: this example might move somewhere else, under STRUCT itself or in a tips-and-tricks section. -->
<p>
Perhaps the only use case for a top-level <codeph>STRUCT</codeph> would be to allow <codeph>STRUCT</codeph> fields with the
same name as columns to coexist in the same table. The following example shows how a table could have a column named
<codeph>ID</codeph>, and two separate <codeph>STRUCT</codeph> fields also named <codeph>ID</codeph>. Because the
<codeph>STRUCT</codeph> fields are always referenced using qualified names, the identical <codeph>ID</codeph> names do not cause a
conflict.
</p>
<codeblock><![CDATA[CREATE TABLE struct_namespaces
(
id BIGINT
, s1 STRUCT < id: BIGINT, field1: STRING >
, s2 STRUCT < id: BIGINT, when_happened: TIMESTAMP >
)
STORED AS PARQUET;
select id, s1.id, s2.id from struct_namespaces;
]]>
</codeblock>
<p>
It is common to make the value portion of each key-value pair in a <codeph>MAP</codeph> a <codeph>STRUCT</codeph>,
<codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, or other complex type variation. That way, each key in the <codeph>MAP</codeph>
can be associated with a flexible and extensible data structure. The key values are not predefined ahead of time (other than by
specifying their data type). Therefore, the <codeph>MAP</codeph> can accommodate a rapidly evolving schema, or sparse data
structures where each row contains only a few data values drawn from a large set of possible choices.
</p>
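<p>
For example, a hypothetical <codeph>METRICS</codeph> table could use a <codeph>MAP</codeph> whose <codeph>VALUE</codeph> part is an
<codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, so that each row stores only the measurements it actually has, and new
measurement names require no schema change. (The table and field names here are illustrative, not part of any sample schema.)
</p>
<codeblock><![CDATA[CREATE TABLE metrics
(
  entity_id BIGINT,
  -- Key: measurement name. Value: all observations of that measurement for this entity.
  measurements MAP < STRING, ARRAY < STRUCT < val: DOUBLE, taken_at: TIMESTAMP >>>
)
STORED AS PARQUET;
]]>
</codeblock>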
<p>
Although you can use an <codeph>ARRAY</codeph> of scalar values as the top-level column in a table, such a simple array is
typically of limited use for analytic queries. The only property of the array elements, aside from the element value, is the
ordering sequence available through the <codeph>POS</codeph> pseudocolumn. To record any additional information about each array element,
such as a <codeph>TIMESTAMP</codeph> or a symbolic name, you use an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> rather than
of scalar values.
</p>
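<p>
The following sketch contrasts the two designs, using hypothetical table names. The first table can record only the reading values
and their order; the second can also record when and where each reading was taken:
</p>
<codeblock><![CDATA[-- Bare ARRAY: only the element value (ITEM) and its position (POS) are available.
CREATE TABLE readings_simple
(
  sensor_id BIGINT,
  readings ARRAY <DOUBLE>
)
STORED AS PARQUET;
-- ARRAY of STRUCT: each element carries extra attributes alongside the value.
CREATE TABLE readings_detailed
(
  sensor_id BIGINT,
  readings ARRAY < STRUCT < val: DOUBLE, taken_at: TIMESTAMP, location: STRING >>
)
STORED AS PARQUET;
]]>
</codeblock>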
<p>
If you are considering having multiple <codeph>ARRAY</codeph> or <codeph>MAP</codeph> columns, with related items under the same
position in each <codeph>ARRAY</codeph> or the same key in each <codeph>MAP</codeph>, prefer to use a <codeph>STRUCT</codeph> to
group all the related items into a single <codeph>ARRAY</codeph> or <codeph>MAP</codeph>. Doing so avoids the additional storage
overhead and potential duplication of key values from having an extra complex type column. Also, because each
<codeph>ARRAY</codeph> or <codeph>MAP</codeph> that you reference in the query <codeph>SELECT</codeph> list requires an additional
join clause, minimizing the number of complex type columns also makes the query easier to read and maintain, relying more on dot
notation to refer to the relevant fields rather than a sequence of join clauses.
</p>
<p>
For example, here is a table with several complex type columns all at the top level and containing only scalar types. To retrieve
every data item for the row requires a separate join for each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> column. The fields of
the <codeph>STRUCT</codeph> can be referenced using dot notation, but there is no real advantage to using the
<codeph>STRUCT</codeph> at the top level rather than just making separate columns <codeph>FIELD1</codeph> and
<codeph>FIELD2</codeph>.
</p>
<codeblock><![CDATA[CREATE TABLE complex_types_top_level
(
id BIGINT,
a1 ARRAY<INT>,
a2 ARRAY<STRING>,
s STRUCT<field1: INT, field2: STRING>,
-- Numeric lookup key for a string value.
m1 MAP<INT,STRING>,
-- String lookup key for a numeric value.
m2 MAP<STRING,INT>
)
STORED AS PARQUET;
describe complex_types_top_level;
+------+-----------------+
| name | type |
+------+-----------------+
| id | bigint |
| a1 | array<int> |
| a2 | array<string> |
| s | struct< |
| | field1:int, |
| | field2:string |
| | > |
| m1 | map<int,string> |
| m2 | map<string,int> |
+------+-----------------+
select
id,
a1.item,
a2.item,
s.field1,
s.field2,
m1.key,
m1.value,
m2.key,
m2.value
from
complex_types_top_level,
complex_types_top_level.a1,
complex_types_top_level.a2,
complex_types_top_level.m1,
complex_types_top_level.m2;
]]>
</codeblock>
<p>
For example, here is a table with columns containing an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, a <codeph>MAP</codeph>
where the value in each key-value pair is a <codeph>STRUCT</codeph>, and a <codeph>MAP</codeph> where the value in each key-value
pair is an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>.
</p>
<codeblock><![CDATA[CREATE TABLE nesting_demo
(
user_id BIGINT,
family_members ARRAY < STRUCT < name: STRING, email: STRING, date_joined: TIMESTAMP >>,
foo map < STRING, STRUCT < f1: INT, f2: INT, f3: TIMESTAMP, f4: BOOLEAN >>,
gameplay MAP < STRING , ARRAY < STRUCT <
name: STRING, highest: BIGINT, lives_used: INT, total_spent: DECIMAL(16,2)
>>>
)
STORED AS PARQUET;
]]>
</codeblock>
<p>
The <codeph>DESCRIBE</codeph> statement breaks the <codeph>&lt;</codeph> and <codeph>&gt;</codeph> separators and the field
names within each <codeph>STRUCT</codeph> across multiple lines for easy readability:
</p>
<codeblock><![CDATA[DESCRIBE nesting_demo;
+----------------+-----------------------------+
| name | type |
+----------------+-----------------------------+
| user_id | bigint |
| family_members | array<struct< |
| | name:string, |
| | email:string, |
| | date_joined:timestamp |
| | >> |
| foo | map<string,struct< |
| | f1:int, |
| | f2:int, |
| | f3:timestamp, |
| | f4:boolean |
| | >> |
| gameplay | map<string,array<struct< |
| | name:string, |
| | highest:bigint, |
| | lives_used:int, |
| | total_spent:decimal(16,2) |
| | >>> |
+----------------+-----------------------------+
]]>
</codeblock>
<p>
To query the complex type columns, you use join notation to refer to the lowest-level scalar values. If the value is an
<codeph>ARRAY</codeph> element, the fully qualified name includes the <codeph>ITEM</codeph> pseudocolumn. If the value is inside a
<codeph>MAP</codeph>, the fully qualified name includes the <codeph>KEY</codeph> or <codeph>VALUE</codeph> pseudocolumn. Each
reference to a different <codeph>ARRAY</codeph> or <codeph>MAP</codeph> (even if nested inside another complex type) requires an
additional join clause.
</p>
<!-- To do: shorten qualified names. -->
<codeblock><![CDATA[SELECT
-- The lone scalar field doesn't require any dot notation or join clauses.
user_id
-- Retrieve the fields of a STRUCT inside an ARRAY.
-- The FAMILY_MEMBERS name refers to the FAMILY_MEMBERS table alias defined later in the FROM clause.
, family_members.item.name
, family_members.item.email
, family_members.item.date_joined
-- Retrieve the KEY and VALUE fields of a MAP, with the value being a STRUCT consisting of more fields.
-- The FOO name refers to the FOO table alias defined later in the FROM clause.
, foo.key
, foo.value.f1
, foo.value.f2
, foo.value.f3
, foo.value.f4
-- Retrieve the KEY fields of a MAP, and expand the VALUE part into ARRAY items consisting of STRUCT fields.
-- The GAMEPLAY name refers to the GAMEPLAY table alias defined later in the FROM clause (referring to the MAP item).
-- The GAME_N name refers to the GAME_N table alias defined later in the FROM clause (referring to the ARRAY
-- inside the MAP item's VALUE part).
, gameplay.key
, game_n.name
, game_n.highest
, game_n.lives_used
, game_n.total_spent
FROM
nesting_demo
, nesting_demo.family_members AS family_members
, nesting_demo.foo AS foo
, nesting_demo.gameplay AS gameplay
, nesting_demo.gameplay.value AS game_n;
]]>
</codeblock>
<p>
Once you understand the notation to refer to a particular data item in the <codeph>SELECT</codeph> list, you can use the same
qualified name to refer to that data item in other parts of the query, such as the <codeph>WHERE</codeph> clause, <codeph>ORDER
BY</codeph> or <codeph>GROUP BY</codeph> clauses, or calls to built-in functions. For example, you might frequently retrieve the
<codeph>VALUE</codeph> part of each <codeph>MAP</codeph> item in the <codeph>SELECT</codeph> list, while choosing the specific
<codeph>MAP</codeph> items by running comparisons against the <codeph>KEY</codeph> part in the <codeph>WHERE</codeph> clause.
</p>
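<p>
For example, the following query against the <codeph>NESTING_DEMO</codeph> table filters on the <codeph>KEY</codeph> part in the
<codeph>WHERE</codeph> clause and aggregates over a <codeph>STRUCT</codeph> field from the <codeph>VALUE</codeph> part. (The key
value <codeph>'obsolete'</codeph> is a hypothetical placeholder, not part of any sample data.)
</p>
<codeblock><![CDATA[SELECT user_id, foo.key, MAX(foo.value.f1)
FROM nesting_demo, nesting_demo.foo AS foo
WHERE foo.key != 'obsolete'
GROUP BY user_id, foo.key
ORDER BY user_id, foo.key;
]]>
</codeblock>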
</conbody>
</concept>
<concept id="complex_types_views">
<title>Accessing Complex Type Data in Flattened Form Using Views</title>
<conbody>
<p>
The layout of complex and nested types is largely a physical consideration. The complex type columns reside in the same data files
rather than in separate normalized tables, for your convenience in managing related data sets and performance in querying related
data sets. You can use views to treat tables with complex types as if they were flattened. By putting the join logic and
references to the complex type columns in the view definition, you can query the same tables using existing queries intended for
tables containing only scalar columns. This technique also lets you use tables with complex types with BI tools that are not aware
of the data types and query notation for accessing complex type columns.
</p>
<!-- First conref contains a link back to complex_types topic so not appropriate here.
Second conref might not make sense without the first. Therefore hold off for now,
see later if there's a good way to work in these reusable tips.
<p conref="../shared/impala_common.xml#common/complex_types_views"/>
<p conref="../shared/impala_common.xml#common/complex_types_views_caveat"/>
-->
<p>
For example, the variation of the TPC-H schema containing complex types has a table <codeph>REGION</codeph>. This table has 5
rows, corresponding to 5 regions such as <codeph>NORTH AMERICA</codeph> and <codeph>AFRICA</codeph>. Each row has an
<codeph>ARRAY</codeph> column, where each array item is a <codeph>STRUCT</codeph> containing details about a country in that
region.
</p>
<codeblock><![CDATA[DESCRIBE region;
+-------------+-------------------------+
| name | type |
+-------------+-------------------------+
| r_regionkey | smallint |
| r_name | string |
| r_comment | string |
| r_nations | array<struct< |
| | n_nationkey:smallint, |
| | n_name:string, |
| | n_comment:string |
| | >> |
+-------------+-------------------------+
]]>
</codeblock>
<p>
The same data could be represented in traditional denormalized form, as a single table where the information about each region is
repeated over and over, alongside the information about each country. The nested complex types let us avoid the repetition, while
still keeping the data in a single table rather than normalizing across multiple tables.
</p>
<p>
To use this table with a JDBC or ODBC application that expected scalar columns, we could create a view that represented the result
set as a set of scalar columns (three columns from the original table, plus three more from the <codeph>STRUCT</codeph> fields of
the array elements). In the following examples, any column with an <codeph>R_*</codeph> prefix is taken unchanged from the
original table, while any column with an <codeph>N_*</codeph> prefix is extracted from the <codeph>STRUCT</codeph> inside the
<codeph>ARRAY</codeph>.
</p>
<!-- To do: shorten qualified names. -->
<codeblock>CREATE VIEW region_view AS
SELECT
r_regionkey,
r_name,
r_comment,
array_field.item.n_nationkey AS n_nationkey,
array_field.item.n_name AS n_name,
array_field.item.n_comment AS n_comment
FROM
region, region.r_nations AS array_field;
</codeblock>
<p>
Then we point the application queries at the view rather than the original table. From the perspective of the view, there are 25
rows in the result set, one for each nation in each region, and queries can refer freely to fields related to the region or the
nation.
</p>
<codeblock>-- Retrieve info such as the nation name from the original R_NATIONS array elements.
select n_name from region_view where r_name in ('EUROPE', 'ASIA');
+----------------+
| n_name |
+----------------+
| UNITED KINGDOM |
| RUSSIA |
| ROMANIA |
| GERMANY |
| FRANCE |
| VIETNAM |
| CHINA |
| JAPAN |
| INDONESIA |
| INDIA |
+----------------+
-- UNITED STATES in AMERICA and UNITED KINGDOM in EUROPE.
SELECT DISTINCT r_name FROM region_view WHERE n_name LIKE 'UNITED%';
+---------+
| r_name |
+---------+
| AMERICA |
| EUROPE |
+---------+
-- For conciseness, we only list some view columns in the SELECT list.
-- SELECT * would bring back all the data, unlike SELECT *
-- queries on the original table with complex type columns.
SELECT r_regionkey, r_name, n_nationkey, n_name FROM region_view LIMIT 7;
+-------------+--------+-------------+----------------+
| r_regionkey | r_name | n_nationkey | n_name |
+-------------+--------+-------------+----------------+
| 3 | EUROPE | 23 | UNITED KINGDOM |
| 3 | EUROPE | 22 | RUSSIA |
| 3 | EUROPE | 19 | ROMANIA |
| 3 | EUROPE | 7 | GERMANY |
| 3 | EUROPE | 6 | FRANCE |
| 2 | ASIA | 21 | VIETNAM |
| 2 | ASIA | 18 | CHINA |
+-------------+--------+-------------+----------------+
</codeblock>
</conbody>
</concept>
</concept>
<concept id="complex_types_examples">
<title>Tutorials and Examples for Complex Types</title>
<prolog>
<metadata>
<data name="Category" value="Tutorials"/>
</metadata>
</prolog>
<conbody>
<p>
The following examples illustrate the query syntax for some common use cases involving complex type columns.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="complex_sample_schema">
<title>Sample Schema and Data for Experimenting with Impala Complex Types</title>
<conbody>
<!-- To do: find where the schema and data creation scripts are posted so I can refer readers to a download location. -->
<p>
The tables used for earlier examples of complex type syntax are trivial ones with no actual data. The more substantial examples of
the complex type feature use these tables, adapted from the schema used for TPC-H testing:
</p>
<codeblock><![CDATA[SHOW TABLES;
+----------+
| name |
+----------+
| customer |
| part |
| region |
| supplier |
+----------+
DESCRIBE customer;
+--------------+------------------------------------+
| name | type |
+--------------+------------------------------------+
| c_custkey | bigint |
| c_name | string |
| c_address | string |
| c_nationkey | smallint |
| c_phone | string |
| c_acctbal | decimal(12,2) |
| c_mktsegment | string |
| c_comment | string |
| c_orders | array<struct< |
| | o_orderkey:bigint, |
| | o_orderstatus:string, |
| | o_totalprice:decimal(12,2), |
| | o_orderdate:string, |
| | o_orderpriority:string, |
| | o_clerk:string, |
| | o_shippriority:int, |
| | o_comment:string, |
| | o_lineitems:array<struct< |
| | l_partkey:bigint, |
| | l_suppkey:bigint, |
| | l_linenumber:int, |
| | l_quantity:decimal(12,2), |
| | l_extendedprice:decimal(12,2), |
| | l_discount:decimal(12,2), |
| | l_tax:decimal(12,2), |
| | l_returnflag:string, |
| | l_linestatus:string, |
| | l_shipdate:string, |
| | l_commitdate:string, |
| | l_receiptdate:string, |
| | l_shipinstruct:string, |
| | l_shipmode:string, |
| | l_comment:string |
| | >> |
| | >> |
+--------------+------------------------------------+
DESCRIBE part;
+---------------+---------------+
| name | type |
+---------------+---------------+
| p_partkey | bigint |
| p_name | string |
| p_mfgr | string |
| p_brand | string |
| p_type | string |
| p_size | int |
| p_container | string |
| p_retailprice | decimal(12,2) |
| p_comment | string |
+---------------+---------------+
DESCRIBE region;
+-------------+--------------------------------------------------------------------+
| name | type |
+-------------+--------------------------------------------------------------------+
| r_regionkey | smallint |
| r_name | string |
| r_comment | string |
| r_nations | array<struct<n_nationkey:smallint,n_name:string,n_comment:string>> |
+-------------+--------------------------------------------------------------------+
DESCRIBE supplier;
+-------------+----------------------------------------------+
| name | type |
+-------------+----------------------------------------------+
| s_suppkey | bigint |
| s_name | string |
| s_address | string |
| s_nationkey | smallint |
| s_phone | string |
| s_acctbal | decimal(12,2) |
| s_comment | string |
| s_partsupps | array<struct<ps_partkey:bigint, |
| | ps_availqty:int,ps_supplycost:decimal(12,2), |
| | ps_comment:string>> |
+-------------+----------------------------------------------+
]]>
</codeblock>
<p>
The volume of data used in the following examples is:
</p>
<codeblock><![CDATA[SELECT count(*) FROM customer;
+----------+
| count(*) |
+----------+
| 150000 |
+----------+
SELECT count(*) FROM part;
+----------+
| count(*) |
+----------+
| 200000 |
+----------+
SELECT count(*) FROM region;
+----------+
| count(*) |
+----------+
| 5 |
+----------+
SELECT count(*) FROM supplier;
+----------+
| count(*) |
+----------+
| 10000 |
+----------+
]]>
</codeblock>
</conbody>
<concept audience="hidden" id="complex_python">
<!-- Hiding this subtopic for the moment because there isn't enough material related to Ibis yet, which would be useful for background or to construct examples. -->
<title>Representing Python Data Types with Impala Complex Types</title>
<conbody>
<p>
If you use Python data types for data science computations or for convenience of manipulating data structures, you might think
of complex or nested data in terms of lists, tuples, and dictionaries and their textual representations. These types map well to
Impala complex types, as follows:
</p>
<!-- To do: add some examples in both languages to show the equivalences. -->
<ul>
<li>
<p>
An Impala <codeph>ARRAY</codeph> is like a Python list.
</p>
</li>
<li>
<p>
An Impala <codeph>STRUCT</codeph> is like a Python tuple. The elements of an Impala <codeph>STRUCT</codeph> are accessed by
name, rather than by numeric index as in Python.
</p>
</li>
<li>
<p>
An Impala <codeph>MAP</codeph> is like a Python dictionary. To emulate the Python <codeph>defaultdict</codeph> type, which
returns a default value in the case of non-existent elements, use an <codeph>OUTER JOIN</codeph> clause in the Impala query
that retrieves the <codeph>MAP</codeph> elements, and translate any <codeph>NULL</codeph> values corresponding to missing
elements using Impala conditional functions.
</p>
</li>
</ul>
</conbody>
</concept>
</concept>
<concept id="complex_types_ex_aggregation" audience="hidden">
<title>Aggregating the Elements in an ARRAY</title>
<conbody>
<p></p>
<codeblock>
</codeblock>
</conbody>
</concept>
<concept id="complex_types_ex_map_keys" audience="hidden">
<title>Finding the Distinct Keys in a MAP</title>
<conbody>
<p></p>
<codeblock>
</codeblock>
</conbody>
</concept>
<concept id="complex_types_ex_map_struct" audience="hidden">
<title>Using a STRUCT as the Value Part of a MAP</title>
<conbody>
<p></p>
<codeblock>
</codeblock>
</conbody>
</concept>
<concept id="complex_types_ex_hive_etl">
<title>Constructing Parquet/ORC Files with Complex Columns Using Hive</title>
<conbody>
<p>
The following examples demonstrate the Hive syntax to transform flat data (tables with all scalar columns) into Parquet/ORC tables
where Impala can query the complex type columns. Each example shows the full sequence of steps, including switching back and forth
between Impala and Hive. Although the source table can use any file format, the destination table must use the Parquet or ORC file
format. The following examples use Parquet; substitute ORC throughout to produce the same results in the ORC file format.
</p>
<p>
<b>Create table with <codeph>ARRAY</codeph> in Impala, load data in Hive, query in Impala:</b>
</p>
<p>
This example shows the cycle of creating the tables and querying the complex data in Impala, and using Hive (either the
<codeph>hive</codeph> shell or <codeph>beeline</codeph>) for the data loading step. The data starts in flattened, denormalized
form in a text table. Hive writes the corresponding Parquet data, including an <codeph>ARRAY</codeph> column. Then Impala can run
analytic queries on the Parquet table, using join notation to unpack the <codeph>ARRAY</codeph> column.
</p>
<codeblock>/* Initial DDL and loading of flat, denormalized data happens in impala-shell */<![CDATA[CREATE TABLE flat_array (country STRING, city STRING);
INSERT INTO flat_array VALUES
('Canada', 'Toronto') , ('Canada', 'Vancouver') , ('Canada', "St. John\'s")
, ('Canada', 'Saint John') , ('Canada', 'Montreal') , ('Canada', 'Halifax')
, ('Canada', 'Winnipeg') , ('Canada', 'Calgary') , ('Canada', 'Saskatoon')
, ('Canada', 'Ottawa') , ('Canada', 'Yellowknife') , ('France', 'Paris')
, ('France', 'Nice') , ('France', 'Marseilles') , ('France', 'Cannes')
, ('Greece', 'Athens') , ('Greece', 'Piraeus') , ('Greece', 'Hania')
, ('Greece', 'Heraklion') , ('Greece', 'Rethymnon') , ('Greece', 'Fira');
CREATE TABLE complex_array (country STRING, city ARRAY <STRING>) STORED AS PARQUET;
]]>
</codeblock>
<codeblock><![CDATA[/* Conversion to Parquet and complex and/or nested columns happens in Hive */
INSERT INTO complex_array SELECT country, collect_list(city) FROM flat_array GROUP BY country;
Query ID = dev_20151108160808_84477ff2-82bd-4ba4-9a77-554fa7b8c0cb
Total jobs = 1
Launching Job 1 out of 1
...
]]>
</codeblock>
<codeblock><![CDATA[/* Back to impala-shell again for analytic queries */
REFRESH complex_array;
SELECT country, city.item FROM complex_array, complex_array.city;
+---------+-------------+
| country | item |
+---------+-------------+
| Canada | Toronto |
| Canada | Vancouver |
| Canada | St. John's |
| Canada | Saint John |
| Canada | Montreal |
| Canada | Halifax |
| Canada | Winnipeg |
| Canada | Calgary |
| Canada | Saskatoon |
| Canada | Ottawa |
| Canada | Yellowknife |
| France | Paris |
| France | Nice |
| France | Marseilles |
| France | Cannes |
| Greece | Athens |
| Greece | Piraeus |
| Greece | Hania |
| Greece | Heraklion |
| Greece | Rethymnon |
| Greece | Fira |
+---------+-------------+
]]>
</codeblock>
<p>
<b>Create table with <codeph>STRUCT</codeph> and <codeph>ARRAY</codeph> in Impala, load data in Hive, query in Impala:</b>
</p>
<p>
This example shows the cycle of creating the tables and querying the complex data in Impala, and using Hive (either the
<codeph>hive</codeph> shell or <codeph>beeline</codeph>) for the data loading step. The data starts in flattened, denormalized
form in a text table. Hive writes the corresponding Parquet data, including a <codeph>STRUCT</codeph> column with an
<codeph>ARRAY</codeph> field. Then Impala can run analytic queries on the Parquet table, using join notation to unpack the
<codeph>ARRAY</codeph> field from the <codeph>STRUCT</codeph> column.
</p>
<!-- To do: enhance the complex table in this example, with COUNTRY being an ARRAY of STRUCT, with the STRUCT containing another ARRAY field. -->
<codeblock><![CDATA[/* Initial DDL and loading of flat, denormalized data happens in impala-shell */
CREATE TABLE flat_struct_array (continent STRING, country STRING, city STRING);
INSERT INTO flat_struct_array VALUES
('North America', 'Canada', 'Toronto') , ('North America', 'Canada', 'Vancouver')
, ('North America', 'Canada', "St. John\'s") , ('North America', 'Canada', 'Saint John')
, ('North America', 'Canada', 'Montreal') , ('North America', 'Canada', 'Halifax')
, ('North America', 'Canada', 'Winnipeg') , ('North America', 'Canada', 'Calgary')
, ('North America', 'Canada', 'Saskatoon') , ('North America', 'Canada', 'Ottawa')
, ('North America', 'Canada', 'Yellowknife') , ('Europe', 'France', 'Paris')
, ('Europe', 'France', 'Nice') , ('Europe', 'France', 'Marseilles')
, ('Europe', 'France', 'Cannes') , ('Europe', 'Greece', 'Athens')
, ('Europe', 'Greece', 'Piraeus') , ('Europe', 'Greece', 'Hania')
, ('Europe', 'Greece', 'Heraklion') , ('Europe', 'Greece', 'Rethymnon')
, ('Europe', 'Greece', 'Fira');
CREATE TABLE complex_struct_array (continent STRING, country STRUCT <name: STRING, city: ARRAY <STRING> >) STORED AS PARQUET;
]]>
</codeblock>
<codeblock><![CDATA[/* Conversion to Parquet and complex and/or nested columns happens in Hive */
INSERT INTO complex_struct_array SELECT continent, named_struct('name', country, 'city', collect_list(city)) FROM flat_struct_array GROUP BY continent, country;
Query ID = dev_20151108163535_11a4fa53-0003-4638-97e6-ef13cdb8e09e
Total jobs = 1
Launching Job 1 out of 1
...
]]>
</codeblock>
<codeblock><![CDATA[/* Back to impala-shell again for analytic queries */
REFRESH complex_struct_array;
SELECT t1.continent, t1.country.name, t2.item
FROM complex_struct_array t1, t1.country.city t2;
+---------------+--------------+-------------+
| continent | country.name | item |
+---------------+--------------+-------------+
| Europe | France | Paris |
| Europe | France | Nice |
| Europe | France | Marseilles |
| Europe | France | Cannes |
| Europe | Greece | Athens |
| Europe | Greece | Piraeus |
| Europe | Greece | Hania |
| Europe | Greece | Heraklion |
| Europe | Greece | Rethymnon |
| Europe | Greece | Fira |
| North America | Canada | Toronto |
| North America | Canada | Vancouver |
| North America | Canada | St. John's |
| North America | Canada | Saint John |
| North America | Canada | Montreal |
| North America | Canada | Halifax |
| North America | Canada | Winnipeg |
| North America | Canada | Calgary |
| North America | Canada | Saskatoon |
| North America | Canada | Ottawa |
| North America | Canada | Yellowknife |
+---------------+--------------+-------------+
]]>
</codeblock>
</conbody>
</concept>
<concept audience="hidden" id="complex_types_no_joins">
<title>Using Complex Types without Join Queries</title>
<conbody>
<p>
Although most discussion and examples of complex types revolve around join queries, you can also use complex columns without a
join. In these cases, you refer to the complex object (possibly nested under multiple levels of other complex types) as the only
<q>table</q> in the <codeph>FROM</codeph> clause. The columns of this table are the pseudocolumns <codeph>POS</codeph>
(for <codeph>ARRAY</codeph>), <codeph>ITEM</codeph> (for an <codeph>ARRAY</codeph> of scalar values), <codeph>KEY</codeph> (for
<codeph>MAP</codeph>), and <codeph>VALUE</codeph> (for a <codeph>MAP</codeph> with scalar values), plus the names of the
<codeph>STRUCT</codeph> fields (where the <codeph>ARRAY</codeph> element or <codeph>MAP</codeph> value is a
<codeph>STRUCT</codeph>).
</p>
<p>
This technique lets you apply aggregation operations such as <codeph>COUNT()</codeph>, <codeph>SUM()</codeph>, and
<codeph>MAX()</codeph>, or the <codeph>DISTINCT</codeph> operator, to all the data items from a complex type, across all rows of
the table. For users new to complex types, this can be a good practice exercise for understanding the notation for referring to
data items within deeply nested types, and how the pseudocolumns such as <codeph>POS</codeph> and <codeph>KEY</codeph> are treated
as columns alongside the fields of a <codeph>STRUCT</codeph>.
</p>
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
<p>
For example, queries like the following sketches (illustrative, not taken from a tested session) apply aggregate functions to the
<codeph>CUSTOMER</codeph> table from the <q>nested TPC-H</q> schema:
</p>
<codeblock><![CDATA[/* Sketches against the nested TPC-H CUSTOMER table, where C_ORDERS is an ARRAY of STRUCT. */

-- Count every order element across all rows of the table, with no join.
SELECT COUNT(*) FROM tpch_nested_parquet.customer.c_orders;

-- When the ARRAY elements are STRUCTs, the field names act as regular columns.
SELECT MAX(o_totalprice) FROM tpch_nested_parquet.customer.c_orders;

SELECT COUNT(DISTINCT o_orderpriority) FROM tpch_nested_parquet.customer.c_orders;
]]>
</codeblock>
</conbody>
</concept>
<concept id="complex_denormalizing">
<title>Flattening Normalized Tables into a Single Table with Complex Types</title>
<conbody>
<p>
One common use for complex types is to embed the contents of one table into another. Traditional denormalization produces a huge
number of rows, with some column values repeated over and over. With complex types, you can keep the same number of rows as in
the original normalized table and put all the associated data from the other table in a single new column.
</p>
<p>
In this flattening scenario, you might frequently use a column that is an <codeph>ARRAY</codeph> consisting of
<codeph>STRUCT</codeph> elements, where each field within the <codeph>STRUCT</codeph> corresponds to a column name from the table
that you are combining.
</p>
<p>
The following example shows a traditional normalized layout using two tables, and then an equivalent layout using complex types in
a single table.
</p>
<codeblock><![CDATA[/* Traditional relational design */
-- This table just stores numbers, allowing us to look up details about the employee
-- and details about their vacation time using a three-table join query.
CREATE TABLE employee_vacations
(
employee_id BIGINT,
vacation_id BIGINT
)
STORED AS PARQUET;
-- Each kind of information to track gets its own "fact table".
CREATE TABLE vacation_details
(
vacation_id BIGINT,
vacation_start TIMESTAMP,
duration INT
)
STORED AS PARQUET;
-- Any time we print a human-readable report, we join with this table to
-- display info about employee #1234.
CREATE TABLE employee_contact
(
employee_id BIGINT,
name STRING,
address STRING,
phone STRING,
email STRING,
address_type STRING /* 'home', 'work', 'remote', etc. */
)
STORED AS PARQUET;
/* Equivalent flattened schema using complex types */
-- For analytic queries using complex types, we can bundle the dimension table
-- and multiple fact tables into a single table.
CREATE TABLE employee_vacations_nested_types
(
-- We might still use the employee_id for other join queries.
-- The table needs at least one scalar column to serve as an identifier
-- for the complex type columns.
employee_id BIGINT,
-- Columns of the VACATION_DETAILS table are folded into a STRUCT.
-- We drop the VACATION_ID column because Impala doesn't need
-- synthetic IDs to join a complex type column.
-- Each row from the VACATION_DETAILS table becomes an array element.
vacation ARRAY < STRUCT <
vacation_start: TIMESTAMP,
duration: INT
>>,
-- The ADDRESS_TYPE column, with a small number of predefined values that are distinct
-- for each employee, makes the EMPLOYEE_CONTACT table a good candidate to turn into a MAP,
-- with each row represented as a STRUCT. The string value from ADDRESS_TYPE becomes the
-- "key" (the anonymous first field) of the MAP.
contact MAP < STRING, STRUCT <
address: STRING,
phone: STRING,
email: STRING
>>
)
STORED AS PARQUET;
]]>
</codeblock>
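<p>
With the flattened schema in place, a query expands the complex columns using the same join notation as elsewhere in this topic.
The following sketch (illustrative, not from a tested session) uses the <codeph>KEY</codeph> and <codeph>VALUE</codeph>
pseudocolumns of the <codeph>MAP</codeph> column:
</p>
<codeblock><![CDATA[-- One result row per vacation ARRAY element, per employee.
SELECT e.employee_id, v.vacation_start, v.duration
FROM employee_vacations_nested_types e, e.vacation v;

-- The MAP key picks out one kind of contact record for each employee.
SELECT e.employee_id, c.value.phone, c.value.email
FROM employee_vacations_nested_types e, e.contact c
WHERE c.key = 'home';
]]>
</codeblock>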
</conbody>
</concept>
<concept id="complex_inference">
<title>Interchanging Complex Type Tables and Data Files with Hive and Other Components</title>
<conbody>
<p>
You can produce Parquet data files through several Hadoop components and APIs.
</p>
<p>
If you have a Hive-created Parquet table that includes <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, or <codeph>MAP</codeph>
columns, Impala can query that same table in <keyword keyref="impala23_full"/> and higher, subject to the usual restrictions: all
other columns must use data types supported by Impala, and the file format of the table must be Parquet.
</p>
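<p>
For example, the round trip might look like the following sketch (the table name is hypothetical; the <codeph>CREATE
TABLE</codeph> statement is issued in the Hive shell, the rest in impala-shell):
</p>
<codeblock><![CDATA[/* In the Hive shell: create a Parquet table with a complex column. */
CREATE TABLE hive_complex (id BIGINT, tags ARRAY<STRING>) STORED AS PARQUET;

/* In impala-shell: pick up the Hive-created table, then query the complex column. */
INVALIDATE METADATA hive_complex;
SELECT t.id, tag.item FROM hive_complex t, t.tags tag;
]]>
</codeblock>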
<p>
If you have a Parquet data file produced outside of Impala, Impala can automatically deduce the appropriate table structure using
the syntax <codeph>CREATE TABLE ... LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'</codeph>. In <keyword keyref="impala23_full"/>
and higher, this feature works for Parquet files that include <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, or
<codeph>MAP</codeph> types.
</p>
<codeblock><![CDATA[/* In impala-shell, find the HDFS data directory of the original table. */
DESCRIBE FORMATTED tpch_nested_parquet.customer;
...
| Location: | hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer | NULL |
...
# In the Unix shell, find the path of any Parquet data file in that HDFS directory.
$ hdfs dfs -ls hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer
Found 4 items
-rwxr-xr-x 3 dev supergroup 171298918 2015-09-22 23:30 hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer/000000_0
...
/* Back in impala-shell, use the HDFS path in a CREATE TABLE LIKE PARQUET statement. */
CREATE TABLE customer_ctlp
LIKE PARQUET 'hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer/000000_0'
STORED AS PARQUET;
/* Confirm that old and new tables have the same column layout, including complex types. */
DESCRIBE tpch_nested_parquet.customer;
+--------------+------------------------------------+---------+
| name | type | comment |
+--------------+------------------------------------+---------+
| c_custkey | bigint | |
| c_name | string | |
| c_address | string | |
| c_nationkey | smallint | |
| c_phone | string | |
| c_acctbal | decimal(12,2) | |
| c_mktsegment | string | |
| c_comment | string | |
| c_orders | array<struct< | |
| | o_orderkey:bigint, | |
| | o_orderstatus:string, | |
| | o_totalprice:decimal(12,2), | |
| | o_orderdate:string, | |
| | o_orderpriority:string, | |
| | o_clerk:string, | |
| | o_shippriority:int, | |
| | o_comment:string, | |
| | o_lineitems:array<struct< | |
| | l_partkey:bigint, | |
| | l_suppkey:bigint, | |
| | l_linenumber:int, | |
| | l_quantity:decimal(12,2), | |
| | l_extendedprice:decimal(12,2), | |
| | l_discount:decimal(12,2), | |
| | l_tax:decimal(12,2), | |
| | l_returnflag:string, | |
| | l_linestatus:string, | |
| | l_shipdate:string, | |
| | l_commitdate:string, | |
| | l_receiptdate:string, | |
| | l_shipinstruct:string, | |
| | l_shipmode:string, | |
| | l_comment:string | |
| | >> | |
| | >> | |
+--------------+------------------------------------+---------+
DESCRIBE customer_ctlp;
+--------------+------------------------------------+-----------------------------+
| name | type | comment |
+--------------+------------------------------------+-----------------------------+
| c_custkey | bigint | Inferred from Parquet file. |
| c_name | string | Inferred from Parquet file. |
| c_address | string | Inferred from Parquet file. |
| c_nationkey | int | Inferred from Parquet file. |
| c_phone | string | Inferred from Parquet file. |
| c_acctbal | decimal(12,2) | Inferred from Parquet file. |
| c_mktsegment | string | Inferred from Parquet file. |
| c_comment | string | Inferred from Parquet file. |
| c_orders | array<struct< | Inferred from Parquet file. |
| | o_orderkey:bigint, | |
| | o_orderstatus:string, | |
| | o_totalprice:decimal(12,2), | |
| | o_orderdate:string, | |
| | o_orderpriority:string, | |
| | o_clerk:string, | |
| | o_shippriority:int, | |
| | o_comment:string, | |
| | o_lineitems:array<struct< | |
| | l_partkey:bigint, | |
| | l_suppkey:bigint, | |
| | l_linenumber:int, | |
| | l_quantity:decimal(12,2), | |
| | l_extendedprice:decimal(12,2), | |
| | l_discount:decimal(12,2), | |
| | l_tax:decimal(12,2), | |
| | l_returnflag:string, | |
| | l_linestatus:string, | |
| | l_shipdate:string, | |
| | l_commitdate:string, | |
| | l_receiptdate:string, | |
| | l_shipinstruct:string, | |
| | l_shipmode:string, | |
| | l_comment:string | |
| | >> | |
| | >> | |
+--------------+------------------------------------+-----------------------------+
]]>
</codeblock>
</conbody>
</concept>
</concept>
</concept>