<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.3.0" id="complex_types">
<title id="nested_types">Complex Types (<keyword keyref="impala23"/> or higher only)</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.3.0">
<indexterm audience="hidden">complex types</indexterm>
<indexterm audience="hidden">nested types</indexterm>
<term>Complex types</term> (also referred to as <term>nested types</term>) let you represent multiple data values within a single
row/column position. They differ from the familiar column types such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>, known as
<term>scalar types</term> or <term>primitive types</term>, which represent a single data value within a given row/column position.
Impala supports the complex types <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> in <keyword keyref="impala23_full"/>
and higher. The Hive <codeph>UNION</codeph> type is not currently supported.
</p>
<p outputclass="toc inpage"/>
<p>
Once you understand the basics of complex types, refer to the individual type topics when you need to refresh your memory about syntax
and examples:
</p>
<ul>
<li>
<xref href="impala_array.xml#array"/>
</li>
<li>
<xref href="impala_struct.xml#struct"/>
</li>
<li>
<xref href="impala_map.xml#map"/>
</li>
</ul>
</conbody>
<concept id="complex_types_benefits">
<title>Benefits of Impala Complex Types</title>
<conbody>
<p>
The reasons for using Impala complex types include the following:
</p>
<ul>
<li>
<p>
You already have data produced by Hive or another non-Impala component that uses complex type columns. You might need to
convert the underlying data to Parquet to use it with Impala.
</p>
</li>
<li>
<p>
Your data model originates with a non-SQL programming language or a NoSQL data management system. For example, if you are
representing Python data expressed as nested lists, dictionaries, and tuples, those data structures correspond closely to Impala
<codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph> types.
</p>
</li>
<li>
<p>
Your analytic queries involving multiple tables could benefit from greater locality during join processing. By packing more
related data items within each HDFS data block, complex types let join queries avoid the network overhead of the traditional
Hadoop shuffle or broadcast join techniques.
</p>
</li>
</ul>
<p>
Impala's complex type support produces result sets consisting entirely of scalar values, and those scalar components of complex types can be used
with all SQL clauses, such as <codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, all kinds of joins, subqueries, and inline
views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other
programming languages to deconstruct the underlying data structures.
</p>
</conbody>
</concept>
<concept id="complex_types_overview">
<title>Overview of Impala Complex Types</title>
<conbody>
<p>
<!--
Each <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or <codeph>STRUCT</codeph> column can include multiple instances of scalar types
such as <codeph>BIGINT</codeph> and <codeph>STRING</codeph>.
-->
The <codeph>ARRAY</codeph> and <codeph>MAP</codeph> types are closely related: they represent collections with arbitrary numbers of
elements, where each element is the same type. In contrast, <codeph>STRUCT</codeph> groups together a fixed number of items into a
single element. The parts of a <codeph>STRUCT</codeph> element (the <term>fields</term>) can be of different types, and each field
has a name.
</p>
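<p>
As a sketch of the declaration syntax (the table and column names here are hypothetical), an <codeph>ARRAY</codeph> or
<codeph>MAP</codeph> holds an arbitrary number of elements of one type, while a <codeph>STRUCT</codeph> names each of its
fixed set of fields:
</p>
<codeblock><![CDATA[
-- Hypothetical table showing one column of each complex type.
CREATE TABLE complex_demo
(
  id BIGINT
, tags ARRAY < STRING >                       -- many elements, one type
, metrics MAP < STRING,BIGINT >               -- arbitrary key-value pairs
, location STRUCT < city: STRING, zip: INT >  -- fixed, named fields
) STORED AS PARQUET;
]]>
</codeblock>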
<p>
The elements of an <codeph>ARRAY</codeph> or <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>, can also be other
complex types. You can construct elaborate data structures with up to 100 levels of nesting. For example, you can make an
<codeph>ARRAY</codeph> whose elements are <codeph>STRUCT</codeph>s. Within each <codeph>STRUCT</codeph>, you can have some fields
that are <codeph>ARRAY</codeph>, <codeph>MAP</codeph>, or another kind of <codeph>STRUCT</codeph>. The Impala documentation uses the
terms complex and nested types interchangeably; for simplicity, it primarily uses the term complex types to encompass all the
properties of these types.
</p>
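<p>
For example, such nesting might be declared as follows (a hypothetical sketch, not a schema from the examples later in this topic):
</p>
<codeblock><![CDATA[
-- An ARRAY whose elements are STRUCTs; each STRUCT in turn contains
-- a scalar field and another ARRAY field.
CREATE TABLE nesting_demo
(
  id BIGINT
, events ARRAY < STRUCT <
    name: STRING,
    attendees: ARRAY < STRING >
  > >
) STORED AS PARQUET;
]]>
</codeblock>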
<p>
When visualizing your data model in familiar SQL terms, you can think of each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> as a
miniature table, and each <codeph>STRUCT</codeph> as a row within such a table. By default, the table represented by an
<codeph>ARRAY</codeph> has two columns, <codeph>POS</codeph> to represent ordering of elements, and <codeph>ITEM</codeph>
representing the value of each element. Likewise, by default, the table represented by a <codeph>MAP</codeph> encodes key-value
pairs, and therefore has two columns, <codeph>KEY</codeph> and <codeph>VALUE</codeph>.
<!--
When you use a <codeph>STRUCT</codeph> as an
<codeph>ARRAY</codeph> element or the <codeph>VALUE</codeph> part of a <codeph>MAP</codeph>, the field names of the
<codeph>STRUCT</codeph> become additional columns in the result set.
-->
</p>
<p>
The <codeph>ITEM</codeph> and <codeph>VALUE</codeph> names are only required for the very simplest kinds of <codeph>ARRAY</codeph>
and <codeph>MAP</codeph> columns, ones that hold only scalar values. When the elements within the <codeph>ARRAY</codeph> or
<codeph>MAP</codeph> are of type <codeph>STRUCT</codeph> rather than a scalar type, then the result set contains columns with names
corresponding to the <codeph>STRUCT</codeph> fields rather than <codeph>ITEM</codeph> or <codeph>VALUE</codeph>.
</p>
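<p>
For example, assuming a hypothetical table <codeph>T</codeph> with a column <codeph>A ARRAY&lt;STRING&gt;</codeph>, a query
refers to the <codeph>POS</codeph> and <codeph>ITEM</codeph> pseudocolumns through a table alias for the array:
</p>
<codeblock>-- POS is the position of each element; ITEM is its value.
SELECT t.id, a.pos, a.item
  FROM t, t.a a
 WHERE a.item IS NOT NULL;
</codeblock>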
<!--
<p>
<codeph>ARRAY</codeph> and <codeph>MAP</codeph> are both <term>collection</term> types, which can have a variable number of
elements; <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are typically used as the top-level type of a table column.
<codeph>STRUCT</codeph> represents a single element and has a fixed number of fields; <codeph>STRUCT</codeph> is typically used as
the final, lowest level of a nested type definition.
</p>
-->
<p>
You write most queries that process complex type columns using familiar join syntax, even though the data for both sides of the join
resides in a single table. The join notation brings together the scalar values from a row with the values from the complex type
columns for that same row. The final result set contains all scalar values, allowing you to do all the familiar filtering,
aggregation, ordering, and so on for the complex data entirely in SQL or using business intelligence tools that issue SQL queries.
<!--
Instead of pulling together values from different tables, the join selects the specified values from both
the scalar columns, and from inside the complex type columns, producing a flattened result set consisting of all scalar values. When
doing a join query involving a complex type column, Impala derives the join key automatically, without the need to create additional
ID columns in the table.
-->
</p>
<p>
Behind the scenes, Impala ensures that the processing for each row is done efficiently on a single host, without the network traffic
involved in broadcast or shuffle joins. The most common type of join query for tables with complex type columns is <codeph>INNER
JOIN</codeph>, which returns results only in those cases where the complex type contains some elements. Therefore, most query
examples in this section use either the <codeph>INNER JOIN</codeph> clause or the equivalent comma notation.
</p>
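<p>
A minimal sketch of this join notation, using a hypothetical <codeph>CONTACTS</codeph> table with a
<codeph>PHONE_NUMBER ARRAY&lt;STRING&gt;</codeph> column:
</p>
<codeblock>-- Comma notation: produces one result row per array element,
-- only for rows where the array contains at least one element.
SELECT c.name, p.item AS phone
  FROM contacts c, c.phone_number p;

-- Equivalent explicit INNER JOIN syntax. No ON clause is needed;
-- Impala applies the join key implicitly.
SELECT c.name, p.item AS phone
  FROM contacts c INNER JOIN c.phone_number p;
</codeblock>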
<note>
<p>
Although Impala can query complex types that are present in Parquet files, Impala currently cannot create new Parquet files
containing complex types. Therefore, the discussion and examples presume that you are working with existing Parquet data produced
through Hive, Spark, or some other source. See <xref href="#complex_types_ex_hive_etl"/> for examples of constructing Parquet data
files with complex type columns.
</p>
<p>
For learning purposes, you can create empty tables with complex type columns and practice query syntax, even if you do not have
sample data with the required structure.
</p>
</note>
</conbody>
</concept>
<concept id="complex_types_design">
<title>Design Considerations for Complex Types</title>
<conbody>
<p>
When planning to use Impala complex types, and designing the Impala schema, first learn how this kind of schema differs from
traditional table layouts used in relational databases and data warehouses. Because you might have already encountered
complex types in a Hadoop context while using Hive for ETL, also learn how to write high-performance analytic queries for complex
type data using Impala SQL syntax.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="complex_types_vs_rdbms">
<title>How Complex Types Differ from Traditional Data Warehouse Schemas</title>
<conbody>
<p>
Complex types let you associate arbitrary data structures with a particular row. If you are familiar with schema design for
relational database management systems or data warehouses, a schema with complex types has the following differences:
</p>
<ul>
<li>
<p>
Logically, related values can now be grouped tightly together in the same table.
</p>
<p>
In traditional data warehousing, related values were typically arranged in one of two ways:
</p>
<ul>
<li>
<p>
Split across multiple normalized tables. Foreign key columns specified which rows from each table were associated with
each other. This arrangement avoided duplicate data and therefore the data was compact, but join queries could be
expensive because the related data had to be retrieved from separate locations. (In the case of distributed Hadoop
queries, the joined tables might even be transmitted between different hosts in a cluster.)
</p>
</li>
<li>
<p>
Flattened into a single denormalized table. Although this layout eliminated some potential performance issues by removing
the need for join queries, the table typically became larger because values were repeated. The extra data volume could
cause performance issues in other parts of the workflow, such as longer ETL cycles or more expensive full-table scans
during queries.
</p>
</li>
</ul>
<p>
Complex types represent a middle ground that addresses these performance and volume concerns. By physically locating related
data within the same data files, complex types increase locality and reduce the expense of join queries. By associating an
arbitrary amount of data with a single row, complex types avoid the need to repeat lengthy values such as strings. Because
Impala knows which complex type values are associated with each row, you can save storage by avoiding artificial foreign key
values that are only used for joins. The flexibility of the <codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, and
<codeph>MAP</codeph> types lets you model familiar constructs such as fact and dimension tables from a data warehouse, and
wide tables representing sparse matrices.
</p>
</li>
</ul>
</conbody>
</concept>
<concept id="complex_types_physical">
<title>Physical Storage for Complex Types in Parquet</title>
<conbody>
<p>
Physically, the scalar and complex columns in each row are located adjacent to each other in the same Parquet data file, ensuring
that they are processed on the same host rather than being broadcast across the network when cross-referenced within a query. This
co-location simplifies the process of copying, converting, and backing up all the columns at once. Because of the column-oriented
layout of Parquet files, you can still query only the scalar columns of a table without imposing the I/O penalty of reading the
(possibly large) values of the composite columns.
</p>
<p>
Within each Parquet data file, the constituent parts of complex type columns are stored in column-oriented format:
</p>
<ul>
<li>
<p>
Each field of a <codeph>STRUCT</codeph> type is stored like a column, with all the scalar values adjacent to each other and
encoded, compressed, and so on using the Parquet space-saving techniques.
</p>
</li>
<li>
<p>
For an <codeph>ARRAY</codeph> containing scalar values, all those values (represented by the <codeph>ITEM</codeph>
pseudocolumn) are stored adjacent to each other.
</p>
</li>
<li>
<p>
For a <codeph>MAP</codeph>, the values of the <codeph>KEY</codeph> pseudocolumn are stored adjacent to each other. If the
<codeph>VALUE</codeph> pseudocolumn is a scalar type, its values are also stored adjacent to each other.
</p>
</li>
<li>
<p>
If an <codeph>ARRAY</codeph> element, <codeph>STRUCT</codeph> field, or <codeph>MAP</codeph> <codeph>VALUE</codeph> part is
another complex type, the column-oriented storage applies to the next level down (or the next level after that, and so on for
deeply nested types) where the final elements, fields, or values are of scalar types.
</p>
</li>
</ul>
<p>
The numbers represented by the <codeph>POS</codeph> pseudocolumn of an <codeph>ARRAY</codeph> are not physically stored in the
data files. They are synthesized at query time based on the order of the <codeph>ARRAY</codeph> elements associated with each row.
</p>
</conbody>
</concept>
<concept id="complex_types_file_formats">
<title>File Format Support for Impala Complex Types</title>
<conbody>
<p> Currently, Impala queries support complex type data in the Parquet
and ORC file formats. See <xref href="impala_parquet.xml#parquet"/>
for details about the performance benefits and physical layout of
the Parquet file format. </p>
<p>
Because Impala does not parse the data structures containing nested types for unsupported formats such as text, Avro,
SequenceFile, or RCFile, you cannot use data files in these formats with Impala, even if the query does not refer to the nested
type columns. Also, if a table using an unsupported format originally contained nested type columns, and then those columns were
dropped from the table using <codeph>ALTER TABLE ... DROP COLUMN</codeph>, any existing data files in the table still contain the
nested type data and Impala queries on that table will generate errors.
</p>
<p rev="2.6.0 IMPALA-2844">
The one exception to the preceding rule is <codeph>COUNT(*)</codeph> queries on RCFile tables that include complex types.
Such queries are allowed in <keyword keyref="impala26_full"/> and higher.
</p>
<p>
You can perform DDL operations on tables containing complex types in
most file formats other than Parquet or ORC. However, you cannot create
Impala tables with complex type columns that use the text file format.
</p>
<p>
You can have a partitioned table with complex type columns that uses
a format other than Parquet or ORC, and use
<codeph>ALTER TABLE</codeph> to change the file format to Parquet/ORC
for individual partitions. When you put Parquet/ORC files into those
partitions, Impala can execute queries against that data as long as
the query does not involve any of the non-Parquet and non-ORC
partitions.
</p>
<p>
If you use the <cmdname>parquet-tools</cmdname> command to examine the structure of a Parquet data file that includes complex
types, you see that both <codeph>ARRAY</codeph> and <codeph>MAP</codeph> are represented as a <codeph>Bag</codeph> in Parquet
terminology, with all fields marked <codeph>Optional</codeph> because Impala allows any column to be nullable.
</p>
<p>
Impala supports both 2-level and 3-level encoding within each Parquet data file. When constructing Parquet data files outside
Impala, use either encoding style, but do not mix 2-level and 3-level encoding within the same data file.
</p>
</conbody>
</concept>
<concept id="complex_types_vs_normalization">
<title>Choosing Between Complex Types and Normalized Tables</title>
<conbody>
<p>
Choosing between multiple normalized fact and dimension tables, or a single table containing complex types, is an important design
decision.
</p>
<ul>
<li>
<p>
If you are coming from a traditional database or data warehousing background, you might be familiar with how to split up data
between tables. Your business intelligence tools might already be optimized for dealing with this kind of multi-table scenario
through join queries.
</p>
</li>
<li>
<p>
If you are pulling data from Impala into an application written in a programming language that has data structures analogous
to the complex types, such as Python or Java, complex types in Impala could simplify data interchange and improve
understandability and reliability of your program logic.
</p>
</li>
<li>
<p>
You might already be faced with existing infrastructure or receive high volumes of data that assume one layout or the other.
For example, complex types are popular in web-oriented applications, where they keep information about an online user
all in one place for convenient lookup and analysis, or accommodate sparse or constantly evolving data fields.
</p>
</li>
<li>
<p>
If some parts of the data change over time while related data remains constant, using multiple normalized tables lets you
replace certain parts of the data without reloading the entire data set. Conversely, if you receive related data all bundled
together, such as in JSON files, using complex types can save the overhead of splitting the related items across multiple
tables.
</p>
</li>
<li>
<p>
From a performance perspective:
</p>
<ul>
<li>
<p>
In Parquet or ORC tables, Impala can skip columns that are not referenced in a query, avoiding the I/O penalty of reading the
embedded data. When complex types are nested within a column, the data is physically divided at a very granular level; for
example, a query referring to data nested multiple levels deep in a complex type column does not have to read all the data
from that column, only the data for the relevant parts of the column type hierarchy.
<!-- Avro not supported in 5.5 / 2.3: Avro tables might experience some performance overhead due to
the need to skip past the complex type columns in each row when reading the data. -->
</p>
</li>
<li>
<p>
Complex types avoid the possibility of expensive join queries when data from fact and dimension tables is processed in
parallel across multiple hosts. All the information for a row containing complex types typically resides in the same data
block, and therefore does not need to be transmitted across the network when joining fields that are all part of the same
row.
</p>
</li>
<li>
<p>
The tradeoff with complex types is that fewer rows fit in each data block. Whether it is better to have more data blocks
with fewer rows, or fewer data blocks with many rows, depends on the distribution of your data and the characteristics of
your query workload. If the complex columns are rarely referenced, using them might lower efficiency. If you are seeing
low parallelism due to a small volume of data (relatively few data blocks) in each table partition, increasing the row
size by including complex columns might produce more data blocks and thus spread the work more evenly across the cluster.
See <xref href="impala_scalability.xml#scalability"/> for more on this advanced topic.
</p>
</li>
</ul>
</li>
</ul>
</conbody>
</concept>
<concept id="complex_types_hive">
<title>Differences Between Impala and Hive Complex Types</title>
<conbody>
<p>
Impala can query Parquet and ORC tables containing <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> columns
produced by Hive. There are some differences to be aware of between the Impala SQL and HiveQL syntax for complex types, primarily
for queries.
</p>
<p>
Impala supports a subset of the syntax that Hive supports for
specifying <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and
<codeph>MAP</codeph> types in the <codeph>CREATE TABLE</codeph>
statements.
</p>
<p>
Because Impala <codeph>STRUCT</codeph> columns include user-specified field names, you use the <codeph>NAMED_STRUCT()</codeph>
constructor in Hive rather than the <codeph>STRUCT()</codeph> constructor when you populate an Impala <codeph>STRUCT</codeph>
column using a Hive <codeph>INSERT</codeph> statement.
</p>
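<p>
For example, to populate a hypothetical <codeph>STRUCT&lt;city: STRING, zip: INT&gt;</codeph> column, a Hive
<codeph>INSERT</codeph> statement would use <codeph>NAMED_STRUCT()</codeph> so that the field names line up. (A sketch;
exact <codeph>INSERT</codeph> forms vary by Hive version.)
</p>
<codeblock>-- Run in Hive, not Impala; Impala cannot write complex type data.
INSERT INTO struct_demo
  SELECT 1, NAMED_STRUCT('city', 'San Jose', 'zip', 95113);
</codeblock>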
<p>
The Hive <codeph>UNION</codeph> type is not currently supported in Impala.
</p>
<p>
While Impala usually aims for a high degree of compatibility with HiveQL query syntax, Impala syntax differs from Hive for queries
involving complex types. The differences are intended to provide extra flexibility for queries involving these kinds of tables.
</p>
<ul>
<li>
Impala uses dot notation for referring to element names or elements within complex types, and join notation for
cross-referencing scalar columns with the elements of complex types within the same row, rather than the <codeph>LATERAL
VIEW</codeph> clause and <codeph>EXPLODE()</codeph> function of HiveQL.
</li>
<li>
Using join notation lets you use all the kinds of join queries with complex type columns. For example, you can use a
<codeph>LEFT OUTER JOIN</codeph>, <codeph>LEFT ANTI JOIN</codeph>, or <codeph>LEFT SEMI JOIN</codeph> query to evaluate
different scenarios where the complex columns do or do not contain any elements.
</li>
<li>
You can include references to collection types inside subqueries and inline views. For example, you can construct a
<codeph>FROM</codeph> clause where one of the <q>tables</q> is a subquery against a complex type column, or use a subquery
against a complex type column as the argument to an <codeph>IN</codeph> or <codeph>EXISTS</codeph> clause.
</li>
<li>
The Impala pseudocolumn <codeph>POS</codeph> lets you retrieve the position of elements in an array along with the elements
themselves, equivalent to the <codeph>POSEXPLODE()</codeph> function of HiveQL. You do not use index notation to retrieve a
single array element in a query; the join query loops through the array elements and you use <codeph>WHERE</codeph> clauses to
specify which elements to return.
</li>
<li>
<p>
Join clauses involving complex type columns do not require an <codeph>ON</codeph> or <codeph>USING</codeph> clause. Impala
implicitly applies the join key so that the correct array entries or map elements are associated with the correct row from the
table.
</p>
</li>
</ul>
</conbody>
</concept>
<concept id="complex_types_limits">
<title>Limitations and Restrictions for Complex Types</title>
<conbody>
<p>
Complex type columns can only be used in tables or partitions with the Parquet or ORC file format.
</p>
<p>
Complex type columns cannot be used as partition key columns in a partitioned table.
</p>
<p>
When you use complex types with the <codeph>ORDER BY</codeph>, <codeph>GROUP BY</codeph>, <codeph>HAVING</codeph>, or
<codeph>WHERE</codeph> clauses, you cannot refer to the column name by itself. Instead, you refer to the names of the scalar
values within the complex type, such as the <codeph>ITEM</codeph>, <codeph>POS</codeph>, <codeph>KEY</codeph>, or
<codeph>VALUE</codeph> pseudocolumns, or the field names from a <codeph>STRUCT</codeph>.
</p>
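<p>
For example, with a hypothetical table <codeph>T</codeph> containing a column <codeph>A ARRAY&lt;STRING&gt;</codeph>, the
<codeph>WHERE</codeph> and <codeph>ORDER BY</codeph> clauses refer to the <codeph>ITEM</codeph> pseudocolumn, never to
<codeph>A</codeph> by itself:
</p>
<codeblock>-- Not allowed: WHERE a = ... or ORDER BY a.
-- Allowed: referring to the scalar ITEM pseudocolumn.
SELECT t.id, a.item
  FROM t, t.a a
 WHERE a.item LIKE 'A%'
 ORDER BY a.item;
</codeblock>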
<p>
The maximum depth of nesting for complex types is 100 levels.
</p>
<p conref="../shared/impala_common.xml#common/complex_types_max_length"/>
<p>
For ideal performance and scalability, use small or medium-sized collections, where all the complex columns contain at most a few
hundred megabytes per row. Remember, all the columns of a row are stored in the same HDFS data block, whose size in Parquet files
typically ranges from 256 MB to 1 GB.
</p>
<p>
Including complex type columns in a table introduces some overhead that might make queries that do not reference those columns
somewhat slower than Impala queries against tables without any complex type columns. Expect at most a 2x slowdown compared to
tables that do not have any complex type columns.
</p>
<p>
Currently, the <codeph>COMPUTE STATS</codeph> statement does not collect any statistics for columns containing complex types.
Impala uses heuristics to construct execution plans involving complex type columns.
</p>
<p>
Currently, Impala built-in functions and user-defined functions cannot accept complex types as parameters or produce them as
function return values. (When the complex type values are materialized in an Impala result set, the result set contains the scalar
components of the values, such as the <codeph>POS</codeph> or <codeph>ITEM</codeph> for an <codeph>ARRAY</codeph>, the
<codeph>KEY</codeph> or <codeph>VALUE</codeph> for a <codeph>MAP</codeph>, or the fields of a <codeph>STRUCT</codeph>; these
scalar data items <i>can</i> be used with built-in functions and UDFs as usual.)
</p>
<p conref="../shared/impala_common.xml#common/complex_types_read_only"/>
<p>
Currently, Impala can query complex type columns only from Parquet/ORC tables or Parquet/ORC partitions within partitioned tables.
Although you can use complex types in tables with Avro, text, and other file formats as part of your ETL pipeline, for example as
intermediate tables populated through Hive, doing analytics through Impala requires that the data eventually ends up in a Parquet/ORC
table. The requirement for Parquet/ORC data files means that you can use complex types with Impala tables hosted on other kinds of
file storage systems such as Isilon and Amazon S3, but you cannot use Impala to query complex types from HBase tables. See
<xref href="impala_complex_types.xml#complex_types_file_formats"/> for more details.
</p>
</conbody>
</concept>
</concept>
<concept id="complex_types_using">
<title>Using Complex Types from SQL</title>
<conbody>
<p>
When using complex types through SQL in Impala, you learn the <codeph>&lt; &gt;</codeph> delimiter notation for the complex
type columns in <codeph>CREATE TABLE</codeph> statements, and how to construct join queries to <q>unpack</q> the scalar values
nested inside the complex data structures. You might need to condense a traditional RDBMS or data warehouse schema into a smaller
number of Parquet tables, and use Hive, Spark, Pig, or another mechanism outside Impala to populate the tables with data.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="nested_types_ddl">
<title>Complex Type Syntax for DDL Statements</title>
<conbody>
<p>
The definition of <varname>data_type</varname>, as seen in the <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
statements, now includes complex types in addition to primitive types:
</p>
<codeblock> primitive_type
| array_type
| map_type
| struct_type
</codeblock>
<p>
Unions are not currently supported.
</p>
<p>
<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and
<codeph>MAP</codeph> column type declarations are specified in the
<codeph>CREATE TABLE</codeph> statement. You can also add or change
the type of complex columns through the <codeph>ALTER TABLE</codeph>
statement. </p>
<p> Currently, Impala queries allow complex types only in tables that
use the Parquet or ORC format. If an Impala query encounters complex
types in a table or partition using other file formats, the query
returns a runtime error. </p>
<p> You can use <codeph>ALTER TABLE ... SET FILEFORMAT PARQUET</codeph>
to change the file format of an existing table containing complex
types to Parquet, after which Impala can query it. Make sure to load
Parquet files into the table after changing the file format, because
the <codeph>ALTER TABLE ... SET FILEFORMAT</codeph> statement does not
convert existing data to the new file format. </p>
<p conref="../shared/impala_common.xml#common/complex_types_partitioning"/>
<p>
Because use cases for Impala complex types require that you already have Parquet/ORC data files produced outside of Impala, you can
use the Impala <codeph>CREATE TABLE LIKE PARQUET</codeph> syntax to produce a table with columns that match the structure of an
existing Parquet file, including complex type columns for nested data structures. Remember to include the <codeph>STORED AS
PARQUET</codeph> clause in this case, because even with <codeph>CREATE TABLE LIKE PARQUET</codeph>, the default file format of the
resulting table is still text.
</p>
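<p>
For example (the HDFS path below is hypothetical):
</p>
<codeblock>-- Column definitions, including complex types, are derived from the
-- schema embedded in the existing Parquet data file.
CREATE TABLE contacts_copy
  LIKE PARQUET '/user/hive/warehouse/contacts/data.parq'
  STORED AS PARQUET;  -- without this clause, the table defaults to text format
</codeblock>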
<p>
Because the complex columns are omitted from the result set of an Impala <codeph>SELECT *</codeph> or <codeph>SELECT
<varname>col_name</varname></codeph> query, and because Impala currently does not support writing Parquet files with complex type
columns, you cannot use the <codeph>CREATE TABLE AS SELECT</codeph> syntax to create a table with nested type columns.
</p>
<note>
<p>
Once you have a table set up with complex type columns, use the <codeph>DESCRIBE</codeph> and <codeph>SHOW CREATE TABLE</codeph>
statements to see the correct notation with <codeph>&lt;</codeph> and <codeph>&gt;</codeph> delimiters and comma and colon
separators within the complex type definitions. If you do not have existing data with the same layout as the table, you can
query the empty table to practice with the notation for the <codeph>SELECT</codeph> statement. In the <codeph>SELECT</codeph>
list, you use dot notation and pseudocolumns such as <codeph>ITEM</codeph>, <codeph>KEY</codeph>, and <codeph>VALUE</codeph> for
referring to items within the complex type columns. In the <codeph>FROM</codeph> clause, you use join notation to construct
table aliases for any referenced <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns.
</p>
</note>
<!-- To do: show some simple CREATE TABLE statements for each of the complex types, without so much backstory for the schema. -->
<p>
For example, when defining a table that holds contact information, you might represent phone numbers differently depending on the
expected layout and relationships of the data, and how well you can predict those properties in advance.
</p>
<p>
Here are different ways that you might represent phone numbers in a traditional relational schema, with equivalent representations
using complex types.
</p>
<fig id="complex_types_phones_flat_fixed">
<title>Traditional Relational Representation of Phone Numbers: Single Table</title>
<p>
The traditional, simplest way to represent phone numbers in a relational table is to store all contact info in a single table,
with all columns having scalar types, and each potential phone number represented as a separate column. In this example, each
person can have only these three types of phone numbers. If a person does not have a particular kind of phone number, the
corresponding column is <codeph>NULL</codeph> for that row.
</p>
<codeblock>
CREATE TABLE contacts_fixed_phones
(
id BIGINT
, name STRING
, address STRING
, home_phone STRING
, work_phone STRING
, mobile_phone STRING
) STORED AS PARQUET;
</codeblock>
</fig>
<fig id="complex_types_phones_array">
<title>An Array of Phone Numbers</title>
<p>
Using a complex type column to represent the phone numbers adds some extra flexibility. Now there could be an unlimited number
of phone numbers. Because the array elements have an order but not symbolic names, you could decide in advance that
phone_number[0] is the home number, [1] is the work number, [2] is the mobile number, and so on. (In subsequent examples, you
will see how to create a more flexible naming scheme using other complex type variations, such as a <codeph>MAP</codeph> or an
<codeph>ARRAY</codeph> where each element is a <codeph>STRUCT</codeph>.)
</p>
<codeblock><![CDATA[
CREATE TABLE contacts_array_of_phones
(
id BIGINT
, name STRING
, address STRING
, phone_number ARRAY <STRING>
) STORED AS PARQUET;
]]>
</codeblock>
</fig>
<fig id="complex_types_phones_map">
<title>A Map of Phone Numbers</title>
<p>
Another way to represent an arbitrary set of phone numbers is with a <codeph>MAP</codeph> column. With a <codeph>MAP</codeph>,
each element is associated with a key value that you specify, which could be a numeric, string, or other scalar type. This
example uses a <codeph>STRING</codeph> key to give each phone number a name, such as <codeph>'home'</codeph> or
<codeph>'mobile'</codeph>. A query could filter the data based on the key values, or display the key values in reports.
</p>
<codeblock><![CDATA[
CREATE TABLE contacts_unlimited_phones
(
id BIGINT, name STRING, address STRING, phone_number MAP <STRING, STRING>
) STORED AS PARQUET;
]]>
</codeblock>
</fig>
<fig id="complex_types_phones_flat_normalized">
<title>Traditional Relational Representation of Phone Numbers: Normalized Tables</title>
<p>
If you are an experienced database designer, you already know how to work around the limitations of the single-table schema from
<xref href="#nested_types_ddl/complex_types_phones_flat_fixed"/>. By normalizing the schema, with the phone numbers in their own
table, you can associate an arbitrary set of phone numbers with each person, and associate additional details with each phone
number, such as whether it is a home, work, or mobile phone.
</p>
<p>
The flexibility of this approach comes with some drawbacks. Reconstructing all the data for a particular person requires a join
query, which might require performance tuning on Hadoop because the data from each table might be transmitted from a different
host. Data management tasks such as backups and refreshing the data require dealing with multiple tables instead of a single
table.
</p>
<p>
This example illustrates a traditional database schema that stores contact information normalized across two tables. The fact
table establishes the identity and basic information about each person. A dimension table stores information only about phone
numbers, using an ID value to associate each phone number with a person ID from the fact table. Each person can have zero, one,
or many phones; the categories are not restricted to a few predefined ones; and the phone table can contain as many columns as
desired, to represent all sorts of details about each phone number.
</p>
<codeblock>
CREATE TABLE fact_contacts (id BIGINT, name STRING, address STRING) STORED AS PARQUET;
CREATE TABLE dim_phones
(
contact_id BIGINT
, category STRING
, international_code STRING
, area_code STRING
, exchange STRING
, extension STRING
, mobile BOOLEAN
, carrier STRING
, current BOOLEAN
, service_start_date TIMESTAMP
, service_end_date TIMESTAMP
)
STORED AS PARQUET;
</codeblock>
</fig>
<fig id="complex_types_phones_array_struct">
<title>Phone Numbers Represented as an Array of Structs</title>
<p>
To represent a schema equivalent to the one from <xref href="#nested_types_ddl/complex_types_phones_flat_normalized"/> using
complex types, this example uses an <codeph>ARRAY</codeph> where each array element is a <codeph>STRUCT</codeph>. As with the
earlier complex type examples, each person can have an arbitrary set of associated phone numbers. Making each array element into
a <codeph>STRUCT</codeph> lets us associate multiple data items with each phone number, and give a separate name and type to
each data item. The <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph> elements reproduce the columns of the dimension
table from the previous example.
</p>
<p>
You can run all the same kinds of queries against the complex type schema as against the normalized schema from the previous
example. The advantages of the complex type design are convenience and performance. Your backup and ETL processes now deal with
only a single table, and when a query uses a join to cross-reference the information about a person with their associated
phone numbers, all the relevant data for each row resides in the same HDFS data block, so each row can be processed on a
single host without requiring network transmission.
</p>
<codeblock><![CDATA[
CREATE TABLE contacts_detailed_phones
(
id BIGINT, name STRING, address STRING
, phone ARRAY < STRUCT <
category: STRING
, international_code: STRING
, area_code: STRING
, exchange: STRING
, extension: STRING
, mobile: BOOLEAN
, carrier: STRING
, current: BOOLEAN
, service_start_date: TIMESTAMP
, service_end_date: TIMESTAMP
>>
) STORED AS PARQUET;
]]>
</codeblock>
</fig>
</conbody>
</concept>
<concept id="complex_types_sql">
<title>SQL Statements that Support Complex Types</title>
<conbody>
<p>
The Impala SQL statements that support complex types are currently
<codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>,
<codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph>,
<codeph><xref href="impala_describe.xml#describe">DESCRIBE</xref></codeph>,
<codeph><xref href="impala_load_data.xml#load_data">LOAD DATA</xref></codeph>, and
<codeph><xref href="impala_select.xml#select">SELECT</xref></codeph>. That is, currently Impala can create or alter tables
containing complex type columns, examine the structure of a table containing complex type columns, import existing data files
containing complex type columns into a table, and query Parquet/ORC tables containing complex types.
</p>
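<p>
For example, a typical workflow defines a table with a complex type column, inspects its layout, and then queries it. The
following sketch is illustrative only; the table name and staging path are hypothetical:
</p>
<codeblock><![CDATA[-- Create a table with a complex type column.
CREATE TABLE contacts_example (id BIGINT, phones ARRAY <STRING>) STORED AS PARQUET;
-- Examine the structure of the table, including the complex column.
DESCRIBE contacts_example;
-- Import existing Parquet data files with a matching layout (hypothetical path).
LOAD DATA INPATH '/staging/contacts_example' INTO TABLE contacts_example;
-- Query the complex column using join notation.
SELECT id, phones.ITEM FROM contacts_example, contacts_example.phones;
]]>
</codeblock>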
<p conref="../shared/impala_common.xml#common/complex_types_read_only"/>
<p outputclass="toc inpage"/>
</conbody>
<concept id="complex_types_ddl">
<title>DDL Statements and Complex Types</title>
<conbody>
<p>
Column specifications for complex or nested types use <codeph>&lt;</codeph> and <codeph>&gt;</codeph> delimiters:
</p>
<codeblock><![CDATA[-- What goes inside the < > for an ARRAY is a single type, either a scalar or another
-- complex type (ARRAY, STRUCT, or MAP).
CREATE TABLE array_t
(
id BIGINT,
a1 ARRAY <STRING>,
a2 ARRAY <BIGINT>,
a3 ARRAY <TIMESTAMP>,
a4 ARRAY <STRUCT <f1: STRING, f2: INT, f3: BOOLEAN>>
)
STORED AS PARQUET;
-- What goes inside the < > for a MAP is two comma-separated types specifying the types of the key-value pair:
-- a scalar type representing the key, and a scalar or complex type representing the value.
CREATE TABLE map_t
(
id BIGINT,
m1 MAP <STRING, STRING>,
m2 MAP <STRING, BIGINT>,
m3 MAP <BIGINT, STRING>,
m4 MAP <BIGINT, BIGINT>,
m5 MAP <STRING, ARRAY <STRING>>
)
STORED AS PARQUET;
-- What goes inside the < > for a STRUCT is a comma-separated list of fields, each field defined as
-- name:type. The type can be a scalar or a complex type. The field names for each STRUCT do not clash
-- with the names of table columns or fields in other STRUCTs. A STRUCT is most often used inside
-- an ARRAY or a MAP rather than as a top-level column.
CREATE TABLE struct_t
(
id BIGINT,
s1 STRUCT <f1: STRING, f2: BIGINT>,
s2 ARRAY <STRUCT <f1: INT, f2: TIMESTAMP>>,
s3 MAP <BIGINT, STRUCT <name: STRING, birthday: TIMESTAMP>>
)
STORED AS PARQUET;
]]>
</codeblock>
</conbody>
</concept>
<concept id="complex_types_queries">
<title>Queries and Complex Types</title>
<conbody>
<!-- Hive does the JSON output business: http://www.datascience-labs.com/hive/hiveql-data-manipulation/ -->
<!-- SELECT * works but skips any nested type coloumns. -->
<p>
The result set of an Impala query always contains only scalar types; the elements and fields within any complex type columns must
be <q>unpacked</q> using join queries. A query cannot directly retrieve the entire value of a complex type column; Impala
returns an error in this case. Queries using <codeph>SELECT *</codeph> are allowed for tables with complex types, but the
columns with complex types are skipped.
</p>
<p>
The following example shows how referring directly to a complex type column returns an error, while <codeph>SELECT *</codeph> on
the same table succeeds, but only retrieves the scalar columns.
</p>
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
<!-- Original error message:
ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING,o_totalprice:DECIMAL(12,2),o_orderdate:STRING,o_orderpriority:STRING,o_clerk:STRING,o_shippriority:INT,o_comment:STRING,o_lineitems:ARRAY<STRUCT<l_partkey:BIGINT,l_suppkey:BIGINT,l_linenumber:INT,l_quantity:DECIMAL(12,2),l_extendedprice:DECIMAL(12,2),l_discount:DECIMAL(12,2),l_tax:DECIMAL(12,2),l_returnflag:STRING,l_linestatus:STRING,l_shipdate:STRING,l_commitdate:STRING,l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
-->
<codeblock><![CDATA[SELECT c_orders FROM customer LIMIT 1;
ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING, ... l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
Only scalar types are allowed in the select list.
-- The original table has several scalar columns and one complex column.
DESCRIBE customer;
+--------------+------------------------------------+
| name | type |
+--------------+------------------------------------+
| c_custkey | bigint |
| c_name | string |
...
| c_orders | array<struct< |
| | o_orderkey:bigint, |
| | o_orderstatus:string, |
| | o_totalprice:decimal(12,2), |
...
| | >> |
+--------------+------------------------------------+
-- When we SELECT * from that table, only the scalar columns come back in the result set.
CREATE TABLE select_star_customer STORED AS PARQUET AS SELECT * FROM customer;
+------------------------+
| summary |
+------------------------+
| Inserted 150000 row(s) |
+------------------------+
-- The c_orders column, being of complex type, was not included in the SELECT * result set.
DESC select_star_customer;
+--------------+---------------+
| name | type |
+--------------+---------------+
| c_custkey | bigint |
| c_name | string |
| c_address | string |
| c_nationkey | smallint |
| c_phone | string |
| c_acctbal | decimal(12,2) |
| c_mktsegment | string |
| c_comment | string |
+--------------+---------------+
]]>
</codeblock>
<!-- To do: These "references to..." bits could be promoted to their own 'expressions' subheads. -->
<p>
References to fields within <codeph>STRUCT</codeph> columns use dot notation. If the field name is unambiguous, you can omit
qualifiers such as table name, column name, or even the <codeph>ITEM</codeph> or <codeph>VALUE</codeph> pseudocolumn names for
<codeph>STRUCT</codeph> elements inside an <codeph>ARRAY</codeph> or a <codeph>MAP</codeph>.
</p>
<!-- To do: rewrite example to use CUSTOMER table. -->
<!-- Don't think TPC-H schema has a bare STRUCT to use in such a simple query though. -->
<!-- Perhaps reuse the STRUCT_DEMO example here. -->
<codeblock>SELECT id, address.city FROM customers WHERE address.zip = 94305;
</codeblock>
<p>
References to elements within <codeph>ARRAY</codeph> columns use the <codeph>ITEM</codeph> pseudocolumn:
</p>
<!-- To do: shorten qualified names. -->
<codeblock>SELECT r_name, r_nations.item.n_name FROM region, region.r_nations LIMIT 7;
+--------+----------------+
| r_name | item.n_name |
+--------+----------------+
| EUROPE | UNITED KINGDOM |
| EUROPE | RUSSIA |
| EUROPE | ROMANIA |
| EUROPE | GERMANY |
| EUROPE | FRANCE |
| ASIA | VIETNAM |
| ASIA | CHINA |
+--------+----------------+
</codeblock>
<p>
References to fields within <codeph>MAP</codeph> columns use the <codeph>KEY</codeph> and <codeph>VALUE</codeph> pseudocolumns.
In this example, once the query establishes the alias <codeph>MAP_FIELD</codeph> for a <codeph>MAP</codeph> column with a
<codeph>STRING</codeph> key and an <codeph>INT</codeph> value, the query can refer to <codeph>MAP_FIELD.KEY</codeph> and
<codeph>MAP_FIELD.VALUE</codeph>, which have zero, one, or many instances for each row from the containing table.
</p>
<codeblock><![CDATA[DESCRIBE table_0;
+---------+-----------------------+
| name | type |
+---------+-----------------------+
| field_0 | string |
| field_1 | map<string,int> |
...
SELECT field_0, map_field.key, map_field.value
FROM table_0, table_0.field_1 AS map_field
WHERE length(field_0) = 1
LIMIT 10;
+---------+-----------+-------+
| field_0 | key | value |
+---------+-----------+-------+
| b | gshsgkvd | NULL |
| b | twrtcxj6 | 18 |
| b | 2vp5 | 39 |
| b | fh0s | 13 |
| v | 2 | 41 |
| v | 8b58mz | 20 |
| v | hw | 16 |
| v | 65l388pyt | 29 |
| v | 03k68g91z | 30 |
| v | r2hlg5b | NULL |
+---------+-----------+-------+
]]>
</codeblock>
<!-- To do: refer to or reuse examples from the other subtopics that discuss pseudocolumns etc. -->
<p>
When complex types are nested inside each other, you use a combination of joins, pseudocolumn names, and dot notation to refer
to specific fields at the appropriate level. This is the most frequent form of query syntax for complex columns, because the
typical use case involves two levels of complex types, such as an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements.
</p>
<!-- To do: rewrite example to use CUSTOMER table. -->
<!-- This is my own manufactured example so I have the table, and the query works, but I don't have sample data to show. -->
<codeblock>SELECT id, phone_numbers.area_code FROM contact_info_many_structs INNER JOIN contact_info_many_structs.phone_numbers phone_numbers LIMIT 3;
</codeblock>
<p>
You can express relationships between <codeph>ARRAY</codeph> and <codeph>MAP</codeph> columns at different levels as joins. You
include comparison operators between fields at the top level and within the nested type columns so that Impala can do the
appropriate join operation.
</p>
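<p>
For example, with a hypothetical <codeph>CUSTOMERS</codeph> table where each row holds a preferred credit card column alongside
an <codeph>ARRAY</codeph> of order <codeph>STRUCT</codeph> elements, a query might compare a top-level column with a field
inside the array:
</p>
<codeblock><![CDATA[-- Hypothetical schema: customers (id BIGINT, preferred_cc STRING,
--   orders ARRAY <STRUCT <txn_id: BIGINT, cc: STRING>>).
-- Return only the orders paid for with each customer's preferred credit card.
SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc;
]]>
</codeblock>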
<!-- Don't think TPC-H schema has any instances where outer field matches up with inner one though. -->
<!-- But don't think this usage is important enough to call out at this early point. Hide the example for now. -->
<!--
<codeblock>SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc;
SELECT c.id, o.txn_id FROM customers c, c.orders o;
</codeblock>
-->
<!-- To do: move these examples down, to the examples subtopic at the end. -->
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
<p>
For example, the following queries work equivalently. They each return customer and order data for customers that have at least
one order.
</p>
<codeblock>SELECT c.c_name, o.o_orderkey FROM customer c, c.c_orders o LIMIT 5;
+--------------------+------------+
| c_name | o_orderkey |
+--------------------+------------+
| Customer#000072578 | 558821 |
| Customer#000072578 | 2079810 |
| Customer#000072578 | 5768068 |
| Customer#000072578 | 1805604 |
| Customer#000072578 | 3436389 |
+--------------------+------------+
SELECT c.c_name, o.o_orderkey FROM customer c INNER JOIN c.c_orders o LIMIT 5;
+--------------------+------------+
| c_name | o_orderkey |
+--------------------+------------+
| Customer#000072578 | 558821 |
| Customer#000072578 | 2079810 |
| Customer#000072578 | 5768068 |
| Customer#000072578 | 1805604 |
| Customer#000072578 | 3436389 |
+--------------------+------------+
</codeblock>
<p>
The following query using an outer join returns customers that have orders, plus customers with no orders (no entries in the
<codeph>C_ORDERS</codeph> array):
</p>
<codeblock><![CDATA[SELECT c.c_custkey, o.o_orderkey
FROM customer c LEFT OUTER JOIN c.c_orders o
LIMIT 5;
+-----------+------------+
| c_custkey | o_orderkey |
+-----------+------------+
| 60210 | NULL |
| 147873 | NULL |
| 72578 | 558821 |
| 72578 | 2079810 |
| 72578 | 5768068 |
+-----------+------------+
]]>
</codeblock>
<p>
The following query returns <i>only</i> customers that have no orders. (With <codeph>LEFT ANTI JOIN</codeph> or <codeph>LEFT
SEMI JOIN</codeph>, the query can only refer to columns from the left-hand table, because by definition there is no matching
information in the right-hand table.)
</p>
<codeblock><![CDATA[SELECT c.c_custkey, c.c_name
FROM customer c LEFT ANTI JOIN c.c_orders o
LIMIT 5;
+-----------+--------------------+
| c_custkey | c_name |
+-----------+--------------------+
| 60210 | Customer#000060210 |
| 147873 | Customer#000147873 |
| 141576 | Customer#000141576 |
| 85365 | Customer#000085365 |
| 70998 | Customer#000070998 |
+-----------+--------------------+
]]>
</codeblock>
<!-- To do: promote the correlated subquery aspect into its own subtopic. -->
<p>
You can also perform correlated subqueries to examine the properties of complex type columns for each row in the result set.
</p>
<p>
Count the number of orders per customer. Note the correlated reference to the table alias <codeph>C</codeph>. The
<codeph>COUNT(*)</codeph> operation applies to all the elements of the <codeph>C_ORDERS</codeph> array for the corresponding
row, avoiding the need for a <codeph>GROUP BY</codeph> clause.
</p>
<codeblock>SELECT c_name, howmany FROM customer c, (SELECT COUNT(*) howmany FROM c.c_orders) v LIMIT 5;
+--------------------+---------+
| c_name | howmany |
+--------------------+---------+
| Customer#000030065 | 15 |
| Customer#000065455 | 18 |
| Customer#000113644 | 21 |
| Customer#000111078 | 0 |
| Customer#000024621 | 0 |
+--------------------+---------+
</codeblock>
<p>
Count the number of orders per customer, ignoring any customers that have not placed any orders:
</p>
<codeblock>SELECT c_name, howmany_orders
FROM
customer c,
(SELECT COUNT(*) howmany_orders FROM c.c_orders) subq1
WHERE howmany_orders > 0
LIMIT 5;
+--------------------+----------------+
| c_name | howmany_orders |
+--------------------+----------------+
| Customer#000072578 | 7 |
| Customer#000046378 | 26 |
| Customer#000069815 | 11 |
| Customer#000079058 | 12 |
| Customer#000092239 | 26 |
+--------------------+----------------+
</codeblock>
<p>
Count the number of line items in each order. The reference to <codeph>C.C_ORDERS</codeph> in the <codeph>FROM</codeph> clause
is needed because the <codeph>O_ORDERKEY</codeph> field is a member of the elements in the <codeph>C_ORDERS</codeph> array. The
subquery labelled <codeph>SUBQ1</codeph> is correlated: it is re-evaluated for the <codeph>C_ORDERS.O_LINEITEMS</codeph> array
from each row of the <codeph>CUSTOMER</codeph> table.
</p>
<codeblock>SELECT c_name, o_orderkey, howmany_line_items
FROM
customer c,
c.c_orders t2,
(SELECT COUNT(*) howmany_line_items FROM c.c_orders.o_lineitems) subq1
WHERE howmany_line_items > 0
LIMIT 5;
+--------------------+------------+--------------------+
| c_name | o_orderkey | howmany_line_items |
+--------------------+------------+--------------------+
| Customer#000020890 | 1884930 | 95 |
| Customer#000020890 | 4570754 | 95 |
| Customer#000020890 | 3771072 | 95 |
| Customer#000020890 | 2555489 | 95 |
| Customer#000020890 | 919171 | 95 |
+--------------------+------------+--------------------+
</codeblock>
<p>
Get the number of orders, the average order price, and the maximum number of items in any order, per customer. For this example, the
subqueries labelled <codeph>SUBQ1</codeph> and <codeph>SUBQ2</codeph> are correlated: they are re-evaluated for each row from
the original <codeph>CUSTOMER</codeph> table, and only apply to the complex columns associated with that row.
</p>
<codeblock>SELECT c_name, howmany, average_price, most_items
FROM
customer c,
(SELECT COUNT(*) howmany, AVG(o_totalprice) average_price FROM c.c_orders) subq1,
(SELECT MAX(l_quantity) most_items FROM c.c_orders.o_lineitems ) subq2
LIMIT 5;
+--------------------+---------+---------------+------------+
| c_name | howmany | average_price | most_items |
+--------------------+---------+---------------+------------+
| Customer#000030065 | 15 | 128908.34 | 50.00 |
| Customer#000088191 | 0 | NULL | NULL |
| Customer#000101555 | 10 | 164250.31 | 50.00 |
| Customer#000022092 | 0 | NULL | NULL |
| Customer#000036277 | 27 | 166040.06 | 50.00 |
+--------------------+---------+---------------+------------+
</codeblock>
<p>
For example, these queries show how to access information about the <codeph>ARRAY</codeph> elements within the
<codeph>CUSTOMER</codeph> table from the <q>nested TPC-H</q> schema, starting with the initial <codeph>ARRAY</codeph> elements
and progressing to examine the <codeph>STRUCT</codeph> fields of the <codeph>ARRAY</codeph>, and then the elements nested within
another <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>:
</p>
<codeblock><![CDATA[-- How many orders does each customer have?
-- The type of the ARRAY column doesn't matter; this is just counting the elements.
SELECT c_custkey, count(*)
FROM customer, customer.c_orders
GROUP BY c_custkey
LIMIT 5;
+-----------+----------+
| c_custkey | count(*) |
+-----------+----------+
| 61081 | 21 |
| 115987 | 15 |
| 69685 | 19 |
| 109124 | 15 |
| 50491 | 12 |
+-----------+----------+
-- How many line items are part of each customer order?
-- Now we examine a field from a STRUCT nested inside the ARRAY.
SELECT c_custkey, c_orders.o_orderkey, count(*)
FROM customer, customer.c_orders c_orders, c_orders.o_lineitems
GROUP BY c_custkey, c_orders.o_orderkey
LIMIT 5;
+-----------+------------+----------+
| c_custkey | o_orderkey | count(*) |
+-----------+------------+----------+
| 63367 | 4985959 | 7 |
| 53989 | 1972230 | 2 |
| 143513 | 5750498 | 5 |
| 17849 | 4857989 | 1 |
| 89881 | 1046437 | 1 |
+-----------+------------+----------+
-- What are the line items in each customer order?
-- One of the STRUCT fields inside the ARRAY is another
-- ARRAY containing STRUCT elements. The join finds
-- all the related items from both levels of ARRAY.
SELECT c_custkey, o_orderkey, l_partkey
FROM customer, customer.c_orders, c_orders.o_lineitems
LIMIT 5;
+-----------+------------+-----------+
| c_custkey | o_orderkey | l_partkey |
+-----------+------------+-----------+
| 113644 | 2738497 | 175846 |
| 113644 | 2738497 | 27309 |
| 113644 | 2738497 | 175873 |
| 113644 | 2738497 | 88559 |
| 113644 | 2738497 | 8032 |
+-----------+------------+-----------+
]]>
</codeblock>
</conbody>
</concept>
</concept>
<concept id="pseudocolumns">
<title>Pseudocolumns for ARRAY and MAP Types</title>
<conbody>
<p>
Each element in an <codeph>ARRAY</codeph> type has a position, indexed starting from zero, and a value. Each element in a
<codeph>MAP</codeph> type represents a key-value pair. Impala provides pseudocolumns that let you retrieve this metadata as part
of a query, or filter query results by referring to the pseudocolumns in a <codeph>WHERE</codeph> clause. You refer to the
pseudocolumns as part of qualified column names in queries:
</p>
<ul>
<li>
<codeph>ITEM</codeph>: The value of an array element. If the <codeph>ARRAY</codeph> contains <codeph>STRUCT</codeph> elements,
you can refer to either <codeph><varname>array_name</varname>.ITEM.<varname>field_name</varname></codeph> or use the shorthand
<codeph><varname>array_name</varname>.<varname>field_name</varname></codeph>.
</li>
<li>
<codeph>POS</codeph>: The position of an element within an array.
</li>
<li>
<codeph>KEY</codeph>: The value forming the first part of a key-value pair in a map. It is not necessarily unique.
</li>
<li>
<codeph>VALUE</codeph>: The data item forming the second part of a key-value pair in a map. If the <codeph>VALUE</codeph> part
of the <codeph>MAP</codeph> element is a <codeph>STRUCT</codeph>, you can refer to either
<codeph><varname>map_name</varname>.VALUE.<varname>field_name</varname></codeph> or use the shorthand
<codeph><varname>map_name</varname>.<varname>field_name</varname></codeph>.
</li>
</ul>
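<p>
For example, given a hypothetical table with one <codeph>ARRAY</codeph> column and one <codeph>MAP</codeph> column, queries can
refer to all four pseudocolumns:
</p>
<codeblock><![CDATA[CREATE TABLE pseudocolumn_demo
(
  id BIGINT,
  a ARRAY <STRING>,
  m MAP <STRING, BIGINT>
)
STORED AS PARQUET;
-- ITEM and POS apply to the array elements.
SELECT id, a.POS, a.ITEM FROM pseudocolumn_demo, pseudocolumn_demo.a;
-- KEY and VALUE apply to the map entries.
SELECT id, m.KEY, m.VALUE FROM pseudocolumn_demo, pseudocolumn_demo.m;
]]>
</codeblock>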
<!-- To do: Consider whether to move the detailed subtopics underneath ARRAY and MAP instead of embedded here. -->
<p outputclass="toc inpage"/>
</conbody>
<concept id="item">
<title id="pos">ITEM and POS Pseudocolumns</title>
<conbody>
<p>
When an <codeph>ARRAY</codeph> column contains <codeph>STRUCT</codeph> elements, you can refer to a field within the
<codeph>STRUCT</codeph> using a qualified name of the form
<codeph><varname>array_column</varname>.<varname>field_name</varname></codeph>. If the <codeph>ARRAY</codeph> contains scalar
values, Impala recognizes the special name <codeph><varname>array_column</varname>.ITEM</codeph> to represent the value of each
scalar array element. For example, if a column contained an <codeph>ARRAY</codeph> where each element was a
<codeph>STRING</codeph>, you would use <codeph><varname>array_name</varname>.ITEM</codeph> to refer to each scalar value in the
<codeph>SELECT</codeph> list, the <codeph>WHERE</codeph> clause, or other clauses.
</p>
<p>
This example shows a table with two <codeph>ARRAY</codeph> columns whose elements are of the scalar type
<codeph>STRING</codeph>. When referring to the values of the array elements in the <codeph>SELECT</codeph> list,
<codeph>WHERE</codeph> clause, or <codeph>ORDER BY</codeph> clause, you use the <codeph>ITEM</codeph> pseudocolumn because
within the array, the individual elements have no defined names.
</p>
<codeblock><![CDATA[create TABLE persons_of_interest
(
person_id BIGINT,
aliases ARRAY <STRING>,
associates ARRAY <STRING>,
real_name STRING
)
STORED AS PARQUET;
-- Get all the aliases of each person.
SELECT real_name, aliases.ITEM
FROM persons_of_interest, persons_of_interest.aliases
ORDER BY real_name, aliases.item;
-- Search for particular associates of each person.
SELECT real_name, associates.ITEM
FROM persons_of_interest, persons_of_interest.associates
WHERE associates.item LIKE '% MacGuffin';
]]>
</codeblock>
<p>
Because an array is inherently an ordered data structure, Impala recognizes the special name
<codeph><varname>array_column</varname>.POS</codeph> to represent the numeric position of each element within the array. The
<codeph>POS</codeph> pseudocolumn lets you filter or reorder the result set based on the sequence of array elements.
</p>
<p>
The following example uses a table from the nested TPC-H schema. The <codeph>REGION</codeph> table has only a few rows, such as
one row for Europe and one for Asia. The row for each region represents all the countries in that region as an
<codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements:
</p>
<codeblock><![CDATA[[localhost:21000] > desc region;
+-------------+--------------------------------------------------------------------+
| name | type |
+-------------+--------------------------------------------------------------------+
| r_regionkey | smallint |
| r_name | string |
| r_comment | string |
| r_nations | array<struct<n_nationkey:smallint,n_name:string,n_comment:string>> |
+-------------+--------------------------------------------------------------------+
]]>
</codeblock>
<p>
To find the countries within a specific region, you use a join query. To find out the order of elements in the array, you also
refer to the <codeph>POS</codeph> pseudocolumn in the select list:
</p>
<codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, <b>r2.POS</b>
> FROM region r1 INNER JOIN r1.r_nations r2
> WHERE r1.r_name = 'ASIA';
+--------+-----------+-----+
| r_name | n_name | pos |
+--------+-----------+-----+
| ASIA | VIETNAM | 0 |
| ASIA | CHINA | 1 |
| ASIA | JAPAN | 2 |
| ASIA | INDONESIA | 3 |
| ASIA | INDIA | 4 |
+--------+-----------+-----+
</codeblock>
<p>
Once you know the positions of the elements, you can use that information in subsequent queries, for example to change the
ordering of results from the complex type column or to filter certain elements from the array:
</p>
<codeblock>[localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS
> FROM region r1 INNER JOIN r1.r_nations r2
> WHERE r1.r_name = 'ASIA'
> <b>ORDER BY r2.POS DESC</b>;
+--------+-----------+-----+
| r_name | n_name | pos |
+--------+-----------+-----+
| ASIA | INDIA | 4 |
| ASIA | INDONESIA | 3 |
| ASIA | JAPAN | 2 |
| ASIA | CHINA | 1 |
| ASIA | VIETNAM | 0 |
+--------+-----------+-----+
[localhost:21000] > SELECT r1.r_name, r2.n_name, r2.POS
> FROM region r1 INNER JOIN r1.r_nations r2
> WHERE r1.r_name = 'ASIA' AND <b>r2.POS BETWEEN 1 and 3</b>;
+--------+-----------+-----+
| r_name | n_name | pos |
+--------+-----------+-----+
| ASIA | CHINA | 1 |
| ASIA | JAPAN | 2 |
| ASIA | INDONESIA | 3 |
+--------+-----------+-----+
</codeblock>
</conbody>
</concept>
<concept id="key">
<title id="value">KEY and VALUE Pseudocolumns</title>
<conbody>
<p>
The <codeph>MAP</codeph> data type is suitable for representing sparse or wide data structures, where each row might only have
entries for a small subset of named fields. Because the element names (the map keys) vary depending on the row, a query must be
able to refer to both the key and the value parts of each key-value pair. The <codeph>KEY</codeph> and <codeph>VALUE</codeph>
pseudocolumns let you refer to the parts of the key-value pair independently within the query, as
<codeph><varname>map_column</varname>.KEY</codeph> and <codeph><varname>map_column</varname>.VALUE</codeph>.
</p>
<p>
The <codeph>KEY</codeph> must always be a scalar type, such as <codeph>STRING</codeph>, <codeph>BIGINT</codeph>, or
<codeph>TIMESTAMP</codeph>. It can be <codeph>NULL</codeph>. Values of the <codeph>KEY</codeph> field are not necessarily unique
within the same <codeph>MAP</codeph>. You apply any required <codeph>DISTINCT</codeph>, <codeph>GROUP BY</codeph>, and other
clauses in the query, and loop through the result set to process all the values matching any specified keys.
</p>
<p>
The <codeph>VALUE</codeph> can be either a scalar type or another complex type. If the <codeph>VALUE</codeph> is a
<codeph>STRUCT</codeph>, you can construct a qualified name
<codeph><varname>map_column</varname>.VALUE.<varname>struct_field</varname></codeph> to refer to the individual fields inside
the value part. If the <codeph>VALUE</codeph> is an <codeph>ARRAY</codeph> or another <codeph>MAP</codeph>, you must include
another join condition that establishes a table alias for <codeph><varname>map_column</varname>.VALUE</codeph>, and then
construct another qualified name using that alias, for example <codeph><varname>table_alias</varname>.ITEM</codeph> or
<codeph><varname>table_alias</varname>.KEY</codeph> and <codeph><varname>table_alias</varname>.VALUE</codeph>.
</p>
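<p>
For example, with a hypothetical <codeph>MAP</codeph> column whose <codeph>VALUE</codeph> part is an <codeph>ARRAY</codeph>, the
query joins twice: once to unpack the key-value pairs, and again to unpack the array inside each value:
</p>
<codeblock><![CDATA[-- Hypothetical table: each tag name maps to an arbitrary list of strings.
CREATE TABLE tagged_notes
(
  note_id BIGINT,
  tags MAP <STRING, ARRAY <STRING>>
)
STORED AS PARQUET;
-- MAP_ENTRY is an alias for the key-value pairs; NESTED_ITEM is an alias
-- for the ARRAY in the VALUE part of each pair.
SELECT note_id, map_entry.KEY, nested_item.ITEM
FROM tagged_notes, tagged_notes.tags map_entry, map_entry.VALUE nested_item;
]]>
</codeblock>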
<p>
The following example shows different ways to access a <codeph>MAP</codeph> column using the <codeph>KEY</codeph> and
<codeph>VALUE</codeph> pseudocolumns. The <codeph>DETAILS</codeph> column has a <codeph>STRING</codeph> first part with short,
standardized values such as <codeph>'Recurring'</codeph>, <codeph>'Lucid'</codeph>, or <codeph>'Anxiety'</codeph>. This is the
<q>key</q> that is used to look up particular kinds of elements from the <codeph>MAP</codeph>. The second part, also a
<codeph>STRING</codeph>, is a longer free-form explanation. Impala gives you the standard pseudocolumn names
<codeph>KEY</codeph> and <codeph>VALUE</codeph> for the two parts, and you apply your own conventions and interpretations to the
underlying values.
</p>
<note>
If you find that the single-item nature of the <codeph>VALUE</codeph> makes it difficult to model your data accurately, the
solution is typically to add some nesting to the complex type. For example, to have several sets of key-value pairs, make the
column an <codeph>ARRAY</codeph> whose elements are <codeph>MAP</codeph>. To make a set of key-value pairs that holds more
elaborate information, make a <codeph>MAP</codeph> column whose <codeph>VALUE</codeph> part contains an <codeph>ARRAY</codeph>
or a <codeph>STRUCT</codeph>.
</note>
<codeblock><![CDATA[CREATE TABLE dream_journal
(
dream_id BIGINT,
details MAP <STRING,STRING>
)
STORED AS PARQUET;
]]>
-- What are all the types of dreams that are recorded?
SELECT DISTINCT details.KEY FROM dream_journal, dream_journal.details;
-- How many lucid dreams were recorded?
-- Because there is no GROUP BY, we count the 'Lucid' keys across all rows.
SELECT <b>COUNT(details.KEY)</b>
FROM dream_journal, dream_journal.details
WHERE <b>details.KEY = 'Lucid'</b>;
-- Print a report of a subset of dreams, filtering based on both the lookup key
-- and the detailed value.
SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE AS "Dream Summary"</b>
FROM dream_journal, dream_journal.details
WHERE
<b>details.KEY IN ('Happy', 'Pleasant', 'Joyous')</b>
AND <b>details.VALUE LIKE '%childhood%'</b>;
</codeblock>
<p>
The following example shows a more elaborate version of the previous table, where the <codeph>VALUE</codeph> part of the
<codeph>MAP</codeph> entry is a <codeph>STRUCT</codeph> rather than a scalar type. Now instead of referring to the
<codeph>VALUE</codeph> pseudocolumn directly, you use dot notation to refer to the <codeph>STRUCT</codeph> fields inside it.
</p>
<codeblock><![CDATA[CREATE TABLE better_dream_journal
(
dream_id BIGINT,
details MAP <STRING,STRUCT <summary: STRING, when_happened: TIMESTAMP, duration: DECIMAL(5,2), woke_up: BOOLEAN> >
)
STORED AS PARQUET;
]]>
-- Do more elaborate reporting and filtering by examining multiple attributes within the same dream.
SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.VALUE.summary AS "Dream Summary"</b>, <b>details.VALUE.duration AS "Duration"</b>
FROM better_dream_journal, better_dream_journal.details
WHERE
<b>details.KEY IN ('Anxiety', 'Nightmare')</b>
AND <b>details.VALUE.duration > 60</b>
AND <b>details.VALUE.woke_up = TRUE</b>;
-- Remember that if the ITEM or VALUE contains a STRUCT, you can reference
-- the STRUCT fields directly without the .ITEM or .VALUE qualifier.
SELECT dream_id, <b>details.KEY AS "Dream Type"</b>, <b>details.summary AS "Dream Summary"</b>, <b>details.duration AS "Duration"</b>
FROM better_dream_journal, better_dream_journal.details
WHERE
<b>details.KEY IN ('Anxiety', 'Nightmare')</b>
AND <b>details.duration > 60</b>
AND <b>details.woke_up = TRUE</b>;
</codeblock>
</conbody>
</concept>
</concept>
<concept id="complex_types_etl">
<!-- This topic overlaps in many ways with the preceding one. See which theme resonates with users, and combine them under the better title. -->
<title>Loading Data Containing Complex Types</title>
<conbody>
<p>
Because the Impala <codeph>INSERT</codeph> statement does not currently support creating new data with complex type columns, or
copying existing complex type values from one table to another, you primarily use Impala to query Parquet/ORC tables with complex
types where the data was inserted through Hive, or create tables with complex types where you already have existing Parquet/ORC
data files.
</p>
<p>
If you have created a Hive table with the Parquet/ORC file format and containing complex types, use the same table for Impala queries
with no changes. If you have such a Hive table in some other format, use a Hive <codeph>CREATE TABLE AS SELECT ... STORED AS
PARQUET</codeph> or <codeph>INSERT ... SELECT</codeph> statement to produce an equivalent Parquet table that Impala can query.
</p>
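<p>
For example, assuming an existing Hive table in text format named <codeph>events_txt</codeph> (a hypothetical name) that contains
complex type columns, a Hive session could produce a Parquet equivalent that Impala can then query:
</p>
<codeblock><![CDATA[-- In Hive: create a Parquet copy of a hypothetical text-format table
-- so that Impala can query its complex type columns.
CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events_txt;
]]>
</codeblock>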
<p> If you have existing Parquet data files containing complex types,
located outside of any Impala or Hive table, such as data files
created by Spark jobs, you can use an Impala <codeph>CREATE TABLE ...
STORED AS PARQUET</codeph> statement, followed by an Impala
<codeph>LOAD DATA</codeph> statement to move the data files into the
table. As an alternative, you can use an Impala <codeph>CREATE
EXTERNAL TABLE</codeph> statement to create a table pointing to the
HDFS directory that already contains the Parquet or ORC data
files.</p>
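<p>
As a sketch of the external table approach, assume that Parquet files written by a Spark job already reside in the hypothetical
HDFS directory <codeph>/user/etl/spark_output</codeph>:
</p>
<codeblock><![CDATA[-- The column definitions must match the schema of the existing data files.
CREATE EXTERNAL TABLE spark_events
(
  id BIGINT,
  tags ARRAY <STRING>
)
STORED AS PARQUET
LOCATION '/user/etl/spark_output';
]]>
</codeblock>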
<p>The simplest way to get started with complex type data is to take a
denormalized table containing duplicated values, and use an
<codeph>INSERT ... SELECT</codeph> statement to copy the data into a
Parquet table and condense the repeated values into complex types.
With the Hive <codeph>INSERT</codeph> statement, you use the
<codeph>COLLECT_LIST()</codeph>, <codeph>NAMED_STRUCT()</codeph>,
and <codeph>MAP()</codeph> constructor functions within a
<codeph>GROUP BY</codeph> query to produce the complex type values.
<codeph>COLLECT_LIST()</codeph> turns a sequence of values into an
<codeph>ARRAY</codeph>. <codeph>NAMED_STRUCT()</codeph> uses the
first, third, and so on arguments as the field names for a
<codeph>STRUCT</codeph>, to match the field names from the
<codeph>CREATE TABLE</codeph> statement. </p>
<note>
Because Hive currently cannot construct individual rows using complex types through the <codeph>INSERT ... VALUES</codeph> syntax,
you prepare the data in flat form in a separate table, then copy it to the table with complex columns using <codeph>INSERT ...
SELECT</codeph> and the complex type constructors. See <xref href="impala_complex_types.xml#complex_types_ex_hive_etl"/> for
examples.
</note>
</conbody>
</concept>
<concept id="complex_types_nesting">
<title>Using Complex Types as Nested Types</title>
<conbody>
<p>
The <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types can be the top-level types for <q>nested
type</q> columns. That is, each of these types can contain other complex or scalar types, with multiple levels of nesting to a
maximum depth of 100. For example, you can have an array of structures, a map containing other maps, a structure containing an
array of other structures, and so on. At the lowest level, there are always scalar types making up the fields of a
<codeph>STRUCT</codeph>, elements of an <codeph>ARRAY</codeph>, and keys and values of a <codeph>MAP</codeph>.
</p>
<p>
Schemas involving complex types typically use some level of nesting for the complex type columns.
</p>
<p>
For example, to model a relationship like a dimension table and a fact table, you typically use an <codeph>ARRAY</codeph> where
each array element is a <codeph>STRUCT</codeph>. The <codeph>STRUCT</codeph> fields represent what would traditionally be columns
in a separate joined table. It makes little sense to use a <codeph>STRUCT</codeph> as the top-level type for a column, because you
could just make the fields of the <codeph>STRUCT</codeph> into regular table columns.
</p>
<!-- To do: this example might move somewhere else, under STRUCT itself or in a tips-and-tricks section. -->
<p>
Perhaps the only use case for a top-level <codeph>STRUCT</codeph> would be to allow <codeph>STRUCT</codeph> fields with the
same name as columns to coexist in the same table. The following example shows how a table could have a column named
<codeph>ID</codeph>, and two separate <codeph>STRUCT</codeph> fields also named <codeph>ID</codeph>. Because the
<codeph>STRUCT</codeph> fields are always referenced using qualified names, the identical <codeph>ID</codeph> names do not cause a
conflict.
</p>
<codeblock><![CDATA[CREATE TABLE struct_namespaces
(
id BIGINT
, s1 STRUCT < id: BIGINT, field1: STRING >
, s2 STRUCT < id: BIGINT, when_happened: TIMESTAMP >
)
STORED AS PARQUET;
select id, s1.id, s2.id from struct_namespaces;
]]>
</codeblock>
<p>
It is common to make the value portion of each key-value pair in a <codeph>MAP</codeph> a <codeph>STRUCT</codeph>,
<codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, or other complex type variation. That way, each key in the <codeph>MAP</codeph>
can be associated with a flexible and extensible data structure. The key values are not predefined ahead of time (other than by
specifying their data type). Therefore, the <codeph>MAP</codeph> can accommodate a rapidly evolving schema, or sparse data
structures where each row contains only a few data values drawn from a large set of possible choices.
</p>
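<p>
For example, a hypothetical <codeph>METRICS</codeph> table could use a <codeph>MAP</codeph> whose <codeph>VALUE</codeph> part is an
<codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, so that each row stores only the measurements it actually has, and new
measurement names require no schema change. (The table and field names here are illustrative, not part of any sample schema.)
</p>
<codeblock><![CDATA[CREATE TABLE metrics
(
  entity_id BIGINT,
  -- Key: measurement name. Value: all observations of that measurement for this entity.
  measurements MAP < STRING, ARRAY < STRUCT < val: DOUBLE, taken_at: TIMESTAMP >>>
)
STORED AS PARQUET;
]]>
</codeblock>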
<p>
Although you can use an <codeph>ARRAY</codeph> of scalar values as the top-level column in a table, such a simple array is
typically of limited use for analytic queries. The only property of the array elements, aside from the element value, is the
ordering sequence available through the <codeph>POS</codeph> pseudocolumn. To record any additional information about each array element,
such as a <codeph>TIMESTAMP</codeph> or a symbolic name, you use an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> rather than
of scalar values.
</p>
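<p>
The following sketch contrasts the two designs, using hypothetical table names. The first table can record only the reading values
and their order; the second can also record when and where each reading was taken:
</p>
<codeblock><![CDATA[-- Bare ARRAY: only the element value (ITEM) and its position (POS) are available.
CREATE TABLE readings_simple
(
  sensor_id BIGINT,
  readings ARRAY <DOUBLE>
)
STORED AS PARQUET;
-- ARRAY of STRUCT: each element carries extra attributes alongside the value.
CREATE TABLE readings_detailed
(
  sensor_id BIGINT,
  readings ARRAY < STRUCT < val: DOUBLE, taken_at: TIMESTAMP, location: STRING >>
)
STORED AS PARQUET;
]]>
</codeblock>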
<p>
If you are considering having multiple <codeph>ARRAY</codeph> or <codeph>MAP</codeph> columns, with related items under the same
position in each <codeph>ARRAY</codeph> or the same key in each <codeph>MAP</codeph>, prefer to use a <codeph>STRUCT</codeph> to
group all the related items into a single <codeph>ARRAY</codeph> or <codeph>MAP</codeph>. Doing so avoids the additional storage
overhead and potential duplication of key values from having an extra complex type column. Also, because each
<codeph>ARRAY</codeph> or <codeph>MAP</codeph> that you reference in the query <codeph>SELECT</codeph> list requires an additional
join clause, minimizing the number of complex type columns also makes the query easier to read and maintain, relying more on dot
notation to refer to the relevant fields rather than a sequence of join clauses.
</p>
<p>
For example, here is a table with several complex type columns all at the top level and containing only scalar types. To retrieve
every data item for the row requires a separate join for each <codeph>ARRAY</codeph> or <codeph>MAP</codeph> column. The fields of
the <codeph>STRUCT</codeph> can be referenced using dot notation, but there is no real advantage to using the
<codeph>STRUCT</codeph> at the top level rather than just making separate columns <codeph>FIELD1</codeph> and
<codeph>FIELD2</codeph>.
</p>
<codeblock><![CDATA[CREATE TABLE complex_types_top_level
(
id BIGINT,
a1 ARRAY<INT>,
a2 ARRAY<STRING>,
s STRUCT<field1: INT, field2: STRING>,
-- Numeric lookup key for a string value.
m1 MAP<INT,STRING>,
-- String lookup key for a numeric value.
m2 MAP<STRING,INT>
)
STORED AS PARQUET;
describe complex_types_top_level;
+------+-----------------+
| name | type |
+------+-----------------+
| id | bigint |
| a1 | array<int> |
| a2 | array<string> |
| s | struct< |
| | field1:int, |
| | field2:string |
| | > |
| m1 | map<int,string> |
| m2 | map<string,int> |
+------+-----------------+
select
id,
a1.item,
a2.item,
s.field1,
s.field2,
m1.key,
m1.value,
m2.key,
m2.value
from
complex_types_top_level,
complex_types_top_level.a1,
complex_types_top_level.a2,
complex_types_top_level.m1,
complex_types_top_level.m2;
]]>
</codeblock>
<p>
For example, here is a table with columns containing an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>, a <codeph>MAP</codeph>
where the value in each key-value pair is a <codeph>STRUCT</codeph>, and a <codeph>MAP</codeph> where the value in each key-value
pair is an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>.
</p>
<codeblock><![CDATA[CREATE TABLE nesting_demo
(
user_id BIGINT,
family_members ARRAY < STRUCT < name: STRING, email: STRING, date_joined: TIMESTAMP >>,
foo map < STRING, STRUCT < f1: INT, f2: INT, f3: TIMESTAMP, f4: BOOLEAN >>,
gameplay MAP < STRING , ARRAY < STRUCT <
name: STRING, highest: BIGINT, lives_used: INT, total_spent: DECIMAL(16,2)
>>>
)
STORED AS PARQUET;
]]>
</codeblock>
<p>
The <codeph>DESCRIBE</codeph> statement breaks the <codeph>&lt;</codeph> and <codeph>&gt;</codeph> separators and the field
names within each <codeph>STRUCT</codeph> across multiple lines for easy readability:
</p>
<codeblock><![CDATA[DESCRIBE nesting_demo;
+----------------+-----------------------------+
| name | type |
+----------------+-----------------------------+
| user_id | bigint |
| family_members | array<struct< |
| | name:string, |
| | email:string, |
| | date_joined:timestamp |
| | >> |
| foo | map<string,struct< |
| | f1:int, |
| | f2:int, |
| | f3:timestamp, |
| | f4:boolean |
| | >> |
| gameplay | map<string,array<struct< |
| | name:string, |
| | highest:bigint, |
| | lives_used:int, |
| | total_spent:decimal(16,2) |
| | >>> |
+----------------+-----------------------------+
]]>
</codeblock>
<p>
To query the complex type columns, you use join notation to refer to the lowest-level scalar values. If the value is an
<codeph>ARRAY</codeph> element, the fully qualified name includes the <codeph>ITEM</codeph> pseudocolumn. If the value is inside a
<codeph>MAP</codeph>, the fully qualified name includes the <codeph>KEY</codeph> or <codeph>VALUE</codeph> pseudocolumn. Each
reference to a different <codeph>ARRAY</codeph> or <codeph>MAP</codeph> (even if nested inside another complex type) requires an
additional join clause.
</p>
<!-- To do: shorten qualified names. -->
<codeblock><![CDATA[SELECT
-- The lone scalar field doesn't require any dot notation or join clauses.
user_id
-- Retrieve the fields of a STRUCT inside an ARRAY.
-- The FAMILY_MEMBERS name refers to the FAMILY_MEMBERS table alias defined later in the FROM clause.
, family_members.item.name
, family_members.item.email
, family_members.item.date_joined
-- Retrieve the KEY and VALUE fields of a MAP, with the value being a STRUCT consisting of more fields.
-- The FOO name refers to the FOO table alias defined later in the FROM clause.
, foo.key
, foo.value.f1
, foo.value.f2
, foo.value.f3
, foo.value.f4
-- Retrieve the KEY fields of a MAP, and expand the VALUE part into ARRAY items consisting of STRUCT fields.
-- The GAMEPLAY name refers to the GAMEPLAY table alias defined later in the FROM clause (referring to the MAP item).
-- The GAME_N name refers to the GAME_N table alias defined later in the FROM clause (referring to the ARRAY
-- inside the MAP item's VALUE part).
, gameplay.key
, game_n.name
, game_n.highest
, game_n.lives_used
, game_n.total_spent
FROM
nesting_demo
, nesting_demo.family_members AS family_members
, nesting_demo.foo AS foo
, nesting_demo.gameplay AS gameplay
, nesting_demo.gameplay.value AS game_n;
]]>
</codeblock>
<p>
Once you understand the notation to refer to a particular data item in the <codeph>SELECT</codeph> list, you can use the same
qualified name to refer to that data item in other parts of the query, such as the <codeph>WHERE</codeph> clause, <codeph>ORDER
BY</codeph> or <codeph>GROUP BY</codeph> clauses, or calls to built-in functions. For example, you might frequently retrieve the
<codeph>VALUE</codeph> part of each <codeph>MAP</codeph> item in the <codeph>SELECT</codeph> list, while choosing the specific
<codeph>MAP</codeph> items by running comparisons against the <codeph>KEY</codeph> part in the <codeph>WHERE</codeph> clause.
</p>
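<p>
For example, the following query against the <codeph>NESTING_DEMO</codeph> table filters on the <codeph>KEY</codeph> part in the
<codeph>WHERE</codeph> clause and aggregates over a <codeph>STRUCT</codeph> field from the <codeph>VALUE</codeph> part. (The key
value <codeph>'obsolete'</codeph> is a hypothetical placeholder, not part of any sample data.)
</p>
<codeblock><![CDATA[SELECT user_id, foo.key, MAX(foo.value.f1)
FROM nesting_demo, nesting_demo.foo AS foo
WHERE foo.key != 'obsolete'
GROUP BY user_id, foo.key
ORDER BY user_id, foo.key;
]]>
</codeblock>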
</conbody>
</concept>
<concept id="complex_types_views">
<title>Accessing Complex Type Data in Flattened Form Using Views</title>
<conbody>
<p>
The layout of complex and nested types is largely a physical consideration. The complex type columns reside in the same data files
rather than in separate normalized tables, for your convenience in managing related data sets and performance in querying related
data sets. You can use views to treat tables with complex types as if they were flattened. By putting the join logic and
references to the complex type columns in the view definition, you can query the same tables using existing queries intended for
tables containing only scalar columns. This technique also lets you use tables with complex types with BI tools that are not aware
of the data types and query notation for accessing complex type columns.
</p>
<!-- First conref contains a link back to complex_types topic so not appropriate here.
Second conref might not make sense without the first. Therefore hold off for now,
see later if there's a good way to work in these reusable tips.
<p conref="../shared/impala_common.xml#common/complex_types_views"/>
<p conref="../shared/impala_common.xml#common/complex_types_views_caveat"/>
-->
<p>
For example, the variation of the TPC-H schema containing complex types has a table <codeph>REGION</codeph>. This table has 5
rows, corresponding to 5 regions such as <codeph>NORTH AMERICA</codeph> and <codeph>AFRICA</codeph>. Each row has an
<codeph>ARRAY</codeph> column, where each array item is a <codeph>STRUCT</codeph> containing details about a country in that
region.
</p>
<codeblock><![CDATA[DESCRIBE region;
+-------------+-------------------------+
| name | type |
+-------------+-------------------------+
| r_regionkey | smallint |
| r_name | string |
| r_comment | string |
| r_nations | array<struct< |
| | n_nationkey:smallint, |
| | n_name:string, |
| | n_comment:string |
| | >> |
+-------------+-------------------------+
]]>
</codeblock>
<p>
The same data could be represented in traditional denormalized form, as a single table where the information about each region is
repeated over and over, alongside the information about each country. The nested complex types let us avoid the repetition, while
still keeping the data in a single table rather than normalizing across multiple tables.
</p>
<p>
To use this table with a JDBC or ODBC application that expected scalar columns, we could create a view that represented the result
set as a set of scalar columns (three columns from the original table, plus three more from the <codeph>STRUCT</codeph> fields of
the array elements). In the following examples, any column with an <codeph>R_*</codeph> prefix is taken unchanged from the
original table, while any column with an <codeph>N_*</codeph> prefix is extracted from the <codeph>STRUCT</codeph> inside the
<codeph>ARRAY</codeph>.
</p>
<!-- To do: shorten qualified names. -->
<codeblock>CREATE VIEW region_view AS
SELECT
r_regionkey,
r_name,
r_comment,
array_field.item.n_nationkey AS n_nationkey,
array_field.item.n_name AS n_name,
array_field.item.n_comment AS n_comment
FROM
region, region.r_nations AS array_field;
</codeblock>
<p>
Then we point the application queries at the view rather than the original table. From the perspective of the view, there are 25
rows in the result set, one for each nation in each region, and queries can refer freely to fields related to the region or the
nation.
</p>
<codeblock>-- Retrieve info such as the nation name from the original R_NATIONS array elements.
select n_name from region_view where r_name in ('EUROPE', 'ASIA');
+----------------+
| n_name |
+----------------+
| UNITED KINGDOM |
| RUSSIA |
| ROMANIA |
| GERMANY |
| FRANCE |
| VIETNAM |
| CHINA |
| JAPAN |
| INDONESIA |
| INDIA |
+----------------+
-- UNITED STATES in AMERICA and UNITED KINGDOM in EUROPE.
SELECT DISTINCT r_name FROM region_view WHERE n_name LIKE 'UNITED%';
+---------+
| r_name |
+---------+
| AMERICA |
| EUROPE |
+---------+
-- For conciseness, we only list some view columns in the SELECT list.
-- SELECT * would bring back all the data, unlike SELECT *
-- queries on the original table with complex type columns.
SELECT r_regionkey, r_name, n_nationkey, n_name FROM region_view LIMIT 7;
+-------------+--------+-------------+----------------+
| r_regionkey | r_name | n_nationkey | n_name |
+-------------+--------+-------------+----------------+
| 3 | EUROPE | 23 | UNITED KINGDOM |
| 3 | EUROPE | 22 | RUSSIA |
| 3 | EUROPE | 19 | ROMANIA |
| 3 | EUROPE | 7 | GERMANY |
| 3 | EUROPE | 6 | FRANCE |
| 2 | ASIA | 21 | VIETNAM |
| 2 | ASIA | 18 | CHINA |
+-------------+--------+-------------+----------------+
</codeblock>
</conbody>
</concept>
</concept>
<concept id="complex_types_examples">
<title>Tutorials and Examples for Complex Types</title>
<prolog>
<metadata>
<data name="Category" value="Tutorials"/>
</metadata>
</prolog>
<conbody>
<p>
The following examples illustrate the query syntax for some common use cases involving complex type columns.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="complex_sample_schema">
<title>Sample Schema and Data for Experimenting with Impala Complex Types</title>
<conbody>
<!-- To do: find where the schema and data creation scripts are posted so I can refer readers to a download location. -->
<p>
The tables used for earlier examples of complex type syntax are trivial ones with no actual data. The more substantial examples of
the complex type feature use these tables, adapted from the schema used for TPC-H testing:
</p>
<codeblock><![CDATA[SHOW TABLES;
+----------+
| name |
+----------+
| customer |
| part |
| region |
| supplier |
+----------+
DESCRIBE customer;
+--------------+------------------------------------+
| name | type |
+--------------+------------------------------------+
| c_custkey | bigint |
| c_name | string |
| c_address | string |
| c_nationkey | smallint |
| c_phone | string |
| c_acctbal | decimal(12,2) |
| c_mktsegment | string |
| c_comment | string |
| c_orders | array<struct< |
| | o_orderkey:bigint, |
| | o_orderstatus:string, |
| | o_totalprice:decimal(12,2), |
| | o_orderdate:string, |
| | o_orderpriority:string, |
| | o_clerk:string, |
| | o_shippriority:int, |
| | o_comment:string, |
| | o_lineitems:array<struct< |
| | l_partkey:bigint, |
| | l_suppkey:bigint, |
| | l_linenumber:int, |
| | l_quantity:decimal(12,2), |
| | l_extendedprice:decimal(12,2), |
| | l_discount:decimal(12,2), |
| | l_tax:decimal(12,2), |
| | l_returnflag:string, |
| | l_linestatus:string, |
| | l_shipdate:string, |
| | l_commitdate:string, |
| | l_receiptdate:string, |
| | l_shipinstruct:string, |
| | l_shipmode:string, |
| | l_comment:string |
| | >> |
| | >> |
+--------------+------------------------------------+
DESCRIBE part;
+---------------+---------------+
| name | type |
+---------------+---------------+
| p_partkey | bigint |
| p_name | string |
| p_mfgr | string |
| p_brand | string |
| p_type | string |
| p_size | int |
| p_container | string |
| p_retailprice | decimal(12,2) |
| p_comment | string |
+---------------+---------------+
DESCRIBE region;
+-------------+--------------------------------------------------------------------+
| name | type |
+-------------+--------------------------------------------------------------------+
| r_regionkey | smallint |
| r_name | string |
| r_comment | string |
| r_nations | array<struct<n_nationkey:smallint,n_name:string,n_comment:string>> |
+-------------+--------------------------------------------------------------------+
DESCRIBE supplier;
+-------------+----------------------------------------------+
| name | type |
+-------------+----------------------------------------------+
| s_suppkey | bigint |
| s_name | string |
| s_address | string |
| s_nationkey | smallint |
| s_phone | string |
| s_acctbal | decimal(12,2) |
| s_comment | string |
| s_partsupps | array<struct<ps_partkey:bigint, |
| | ps_availqty:int,ps_supplycost:decimal(12,2), |
| | ps_comment:string>> |
+-------------+----------------------------------------------+
]]>
</codeblock>
<p>
The volume of data used in the following examples is:
</p>
<codeblock><![CDATA[SELECT count(*) FROM customer;
+----------+
| count(*) |
+----------+
| 150000 |
+----------+
SELECT count(*) FROM part;
+----------+
| count(*) |
+----------+
| 200000 |
+----------+
SELECT count(*) FROM region;
+----------+
| count(*) |
+----------+
| 5 |
+----------+
SELECT count(*) FROM supplier;
+----------+
| count(*) |
+----------+
| 10000 |
+----------+
]]>
</codeblock>
</conbody>
<concept audience="hidden" id="complex_python">
<!-- Hiding this subtopic for the moment because there isn't enough material related to Ibis yet, which would be useful for background or to construct examples. -->
<title>Representing Python Data Types with Impala Complex Types</title>
<conbody>
<p>
If you use Python data types for data science computations or for convenience of manipulating data structures, you might think
of complex or nested data in terms of lists, tuples, and dictionaries and their textual representations. These types map well to
Impala complex types, as follows:
</p>
<!-- To do: add some examples in both languages to show the equivalences. -->
<ul>
<li>
<p>
An Impala <codeph>ARRAY</codeph> is like a Python list.
</p>
</li>
<li>
<p>
An Impala <codeph>STRUCT</codeph> is like a Python tuple. The elements of an Impala <codeph>STRUCT</codeph> are accessed by
name, rather than by numeric index as in Python.
</p>
</li>
<li>
<p>
An Impala <codeph>MAP</codeph> is like a Python dictionary. To emulate the Python <codeph>defaultdict</codeph> type, which
returns a default value in the case of non-existent elements, use an <codeph>OUTER JOIN</codeph> clause in the Impala query
that retrieves the <codeph>MAP</codeph> elements, and translate any <codeph>NULL</codeph> values corresponding to missing
elements using Impala conditional functions.
</p>
</li>
</ul>
</conbody>
</concept>
</concept>
<concept id="complex_types_ex_aggregation" audience="hidden">
<title>Aggregating the Elements in an ARRAY</title>
<conbody>
<p></p>
<codeblock>
</codeblock>
</conbody>
</concept>
<concept id="complex_types_ex_map_keys" audience="hidden">
<title>Finding the Distinct Keys in a MAP</title>
<conbody>
<p></p>
<codeblock>
</codeblock>
</conbody>
</concept>
<concept id="complex_types_ex_map_struct" audience="hidden">
<title>Using a STRUCT as the Value Part of a MAP</title>
<conbody>
<p></p>
<codeblock>
</codeblock>
</conbody>
</concept>
<concept id="complex_types_ex_hive_etl">
<title>Constructing Parquet/ORC Files with Complex Columns Using Hive</title>
<conbody>
<p>
The following examples demonstrate the Hive syntax to transform flat data (tables with all scalar columns) into Parquet/ORC tables
where Impala can query the complex type columns. Each example shows the full sequence of steps, including switching back and forth
between Impala and Hive. Although the source table can use any file format, the destination table must use the Parquet or ORC file
format. The following examples use Parquet; substitute ORC throughout to produce the same results in the ORC file format.
</p>
<p>
<b>Create table with <codeph>ARRAY</codeph> in Impala, load data in Hive, query in Impala:</b>
</p>
<p>
This example shows the cycle of creating the tables and querying the complex data in Impala, and using Hive (either the
<codeph>hive</codeph> shell or <codeph>beeline</codeph>) for the data loading step. The data starts in flattened, denormalized
form in a text table. Hive writes the corresponding Parquet data, including an <codeph>ARRAY</codeph> column. Then Impala can run
analytic queries on the Parquet table, using join notation to unpack the <codeph>ARRAY</codeph> column.
</p>
<codeblock>/* Initial DDL and loading of flat, denormalized data happens in impala-shell */<![CDATA[CREATE TABLE flat_array (country STRING, city STRING);
INSERT INTO flat_array VALUES
('Canada', 'Toronto') , ('Canada', 'Vancouver') , ('Canada', "St. John\'s")
, ('Canada', 'Saint John') , ('Canada', 'Montreal') , ('Canada', 'Halifax')
, ('Canada', 'Winnipeg') , ('Canada', 'Calgary') , ('Canada', 'Saskatoon')
, ('Canada', 'Ottawa') , ('Canada', 'Yellowknife') , ('France', 'Paris')
, ('France', 'Nice') , ('France', 'Marseilles') , ('France', 'Cannes')
, ('Greece', 'Athens') , ('Greece', 'Piraeus') , ('Greece', 'Hania')
, ('Greece', 'Heraklion') , ('Greece', 'Rethymnon') , ('Greece', 'Fira');
CREATE TABLE complex_array (country STRING, city ARRAY <STRING>) STORED AS PARQUET;
]]>
</codeblock>
<codeblock><![CDATA[/* Conversion to Parquet and complex and/or nested columns happens in Hive */
INSERT INTO complex_array SELECT country, collect_list(city) FROM flat_array GROUP BY country;
Query ID = dev_20151108160808_84477ff2-82bd-4ba4-9a77-554fa7b8c0cb
Total jobs = 1
Launching Job 1 out of 1
...
]]>
</codeblock>
<codeblock><![CDATA[/* Back to impala-shell again for analytic queries */
REFRESH complex_array;
SELECT country, city.item FROM complex_array, complex_array.city;
+---------+-------------+
| country | item |
+---------+-------------+
| Canada | Toronto |
| Canada | Vancouver |
| Canada | St. John's |
| Canada | Saint John |
| Canada | Montreal |
| Canada | Halifax |
| Canada | Winnipeg |
| Canada | Calgary |
| Canada | Saskatoon |
| Canada | Ottawa |
| Canada | Yellowknife |
| France | Paris |
| France | Nice |
| France | Marseilles |
| France | Cannes |
| Greece | Athens |
| Greece | Piraeus |
| Greece | Hania |
| Greece | Heraklion |
| Greece | Rethymnon |
| Greece | Fira |
+---------+-------------+
]]>
</codeblock>
<p>
<b>Create table with <codeph>STRUCT</codeph> and <codeph>ARRAY</codeph> in Impala, load data in Hive, query in Impala:</b>
</p>
<p>
This example shows the cycle of creating the tables and querying the complex data in Impala, and using Hive (either the
<codeph>hive</codeph> shell or <codeph>beeline</codeph>) for the data loading step. The data starts in flattened, denormalized
form in a text table. Hive writes the corresponding Parquet data, including a <codeph>STRUCT</codeph> column with an
<codeph>ARRAY</codeph> field. Then Impala can run analytic queries on the Parquet table, using join notation to unpack the
<codeph>ARRAY</codeph> field from the <codeph>STRUCT</codeph> column.
</p>
<!-- To do: enhance the complex table in this example, with COUNTRY being an ARRAY of STRUCT, with the STRUCT containing another ARRAY field. -->
<codeblock><![CDATA[/* Initial DDL and loading of flat, denormalized data happens in impala-shell */
CREATE TABLE flat_struct_array (continent STRING, country STRING, city STRING);
INSERT INTO flat_struct_array VALUES
('North America', 'Canada', 'Toronto') , ('North America', 'Canada', 'Vancouver')
, ('North America', 'Canada', "St. John\'s") , ('North America', 'Canada', 'Saint John')
, ('North America', 'Canada', 'Montreal') , ('North America', 'Canada', 'Halifax')
, ('North America', 'Canada', 'Winnipeg') , ('North America', 'Canada', 'Calgary')
, ('North America', 'Canada', 'Saskatoon') , ('North America', 'Canada', 'Ottawa')
, ('North America', 'Canada', 'Yellowknife') , ('Europe', 'France', 'Paris')
, ('Europe', 'France', 'Nice') , ('Europe', 'France', 'Marseilles')
, ('Europe', 'France', 'Cannes') , ('Europe', 'Greece', 'Athens')
, ('Europe', 'Greece', 'Piraeus') , ('Europe', 'Greece', 'Hania')
, ('Europe', 'Greece', 'Heraklion') , ('Europe', 'Greece', 'Rethymnon')
, ('Europe', 'Greece', 'Fira');
CREATE TABLE complex_struct_array (continent STRING, country STRUCT <name: STRING, city: ARRAY <STRING> >) STORED AS PARQUET;
]]>
</codeblock>
<codeblock><![CDATA[/* Conversion to Parquet and complex and/or nested columns happens in Hive */
INSERT INTO complex_struct_array SELECT continent, named_struct('name', country, 'city', collect_list(city)) FROM flat_struct_array GROUP BY continent, country;
Query ID = dev_20151108163535_11a4fa53-0003-4638-97e6-ef13cdb8e09e
Total jobs = 1
Launching Job 1 out of 1
...
]]>
</codeblock>
<codeblock><![CDATA[/* Back to impala-shell again for analytic queries */
REFRESH complex_struct_array;
SELECT t1.continent, t1.country.name, t2.item
FROM complex_struct_array t1, t1.country.city t2;
+---------------+--------------+-------------+
| continent | country.name | item |
+---------------+--------------+-------------+
| Europe | France | Paris |
| Europe | France | Nice |
| Europe | France | Marseilles |
| Europe | France | Cannes |
| Europe | Greece | Athens |
| Europe | Greece | Piraeus |
| Europe | Greece | Hania |
| Europe | Greece | Heraklion |
| Europe | Greece | Rethymnon |
| Europe | Greece | Fira |
| North America | Canada | Toronto |
| North America | Canada | Vancouver |
| North America | Canada | St. John's |
| North America | Canada | Saint John |
| North America | Canada | Montreal |
| North America | Canada | Halifax |
| North America | Canada | Winnipeg |
| North America | Canada | Calgary |
| North America | Canada | Saskatoon |
| North America | Canada | Ottawa |
| North America | Canada | Yellowknife |
+---------------+--------------+-------------+
]]>
</codeblock>
</conbody>
</concept>
<concept audience="hidden" id="complex_types_no_joins">
<title>Using Complex Types without Join Queries</title>
<conbody>
<p>
Although most discussion and examples of complex types revolve around join queries, you can also use complex columns without a
join. In these cases, you refer to the complex object (possibly nested under multiple levels of other complex types) as the only
<q>table</q> in the <codeph>FROM</codeph> clause. The columns of this table are the pseudocolumns <codeph>POS</codeph>
(for <codeph>ARRAY</codeph>), <codeph>ITEM</codeph> (for an <codeph>ARRAY</codeph> of scalar values), <codeph>KEY</codeph> (for
<codeph>MAP</codeph>), and <codeph>VALUE</codeph> (for a <codeph>MAP</codeph> with scalar values), plus the names of the
<codeph>STRUCT</codeph> fields (where the <codeph>ARRAY</codeph> element or <codeph>MAP</codeph> value is a
<codeph>STRUCT</codeph>).
</p>
<p>
This technique lets you apply aggregation operations such as <codeph>COUNT()</codeph>, <codeph>SUM()</codeph>, and
<codeph>MAX()</codeph>, or the <codeph>DISTINCT</codeph> operator, to all the data items from a complex type, across all rows of
the table. For users new to complex types, this can be a good practice exercise for understanding the notation for referring to
data items within deeply nested types, and how the pseudocolumns such as <codeph>POS</codeph> and <codeph>KEY</codeph> are treated
as columns alongside the fields of a <codeph>STRUCT</codeph>.
</p>
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
<p>
For example, queries like the following sketches (illustrative, not taken from a tested session) apply aggregate functions to the
<codeph>CUSTOMER</codeph> table from the <q>nested TPC-H</q> schema:
</p>
<codeblock><![CDATA[/* Sketches against the nested TPC-H CUSTOMER table, where C_ORDERS is an ARRAY of STRUCT. */

-- Count every order element across all rows of the table, with no join.
SELECT COUNT(*) FROM tpch_nested_parquet.customer.c_orders;

-- When the ARRAY elements are STRUCTs, the field names act as regular columns.
SELECT MAX(o_totalprice) FROM tpch_nested_parquet.customer.c_orders;

SELECT COUNT(DISTINCT o_orderpriority) FROM tpch_nested_parquet.customer.c_orders;
]]>
</codeblock>
</conbody>
</concept>
<concept id="complex_denormalizing">
<title>Flattening Normalized Tables into a Single Table with Complex Types</title>
<conbody>
<p>
One common use for complex types is to embed the contents of one table into another. Traditional denormalization produces a huge
number of rows, with some column values repeated over and over. With complex types, you can keep the same number of rows as in
the original normalized table and put all the associated data from the other table in a single new column.
</p>
<p>
In this flattening scenario, you might frequently use a column that is an <codeph>ARRAY</codeph> consisting of
<codeph>STRUCT</codeph> elements, where each field within the <codeph>STRUCT</codeph> corresponds to a column name from the table
that you are combining.
</p>
<p>
The following example shows a traditional normalized layout using two tables, and then an equivalent layout using complex types in
a single table.
</p>
<codeblock><![CDATA[/* Traditional relational design */
-- This table just stores numbers, allowing us to look up details about the employee
-- and details about their vacation time using a three-table join query.
CREATE TABLE employee_vacations
(
employee_id BIGINT,
vacation_id BIGINT
)
STORED AS PARQUET;
-- Each kind of information to track gets its own "fact table".
CREATE TABLE vacation_details
(
vacation_id BIGINT,
vacation_start TIMESTAMP,
duration INT
)
STORED AS PARQUET;
-- Any time we print a human-readable report, we join with this table to
-- display info about employee #1234.
CREATE TABLE employee_contact
(
employee_id BIGINT,
name STRING,
address STRING,
phone STRING,
email STRING,
address_type STRING /* 'home', 'work', 'remote', etc. */
)
STORED AS PARQUET;
/* Equivalent flattened schema using complex types */
-- For analytic queries using complex types, we can bundle the dimension table
-- and multiple fact tables into a single table.
CREATE TABLE employee_vacations_nested_types
(
-- We might still use the employee_id for other join queries.
-- The table needs at least one scalar column to serve as an identifier
-- for the complex type columns.
employee_id BIGINT,
-- Columns of the VACATION_DETAILS table are folded into a STRUCT.
-- We drop the VACATION_ID column because Impala doesn't need
-- synthetic IDs to join a complex type column.
-- Each row from the VACATION_DETAILS table becomes an array element.
vacation ARRAY < STRUCT <
vacation_start: TIMESTAMP,
duration: INT
>>,
-- The ADDRESS_TYPE column, with a small number of predefined values that are distinct
-- for each employee, makes the EMPLOYEE_CONTACT table a good candidate to turn into a MAP,
-- with each row represented as a STRUCT. The string value from ADDRESS_TYPE becomes the
-- "key" (the anonymous first field) of the MAP.
contact MAP < STRING, STRUCT <
address: STRING,
phone: STRING,
email: STRING
>>
)
STORED AS PARQUET;
]]>
</codeblock>
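<p>
With the flattened schema in place, a query expands the complex columns using the same join notation as elsewhere in this topic.
The following sketch (illustrative, not from a tested session) uses the <codeph>KEY</codeph> and <codeph>VALUE</codeph>
pseudocolumns of the <codeph>MAP</codeph> column:
</p>
<codeblock><![CDATA[-- One result row per vacation ARRAY element, per employee.
SELECT e.employee_id, v.vacation_start, v.duration
FROM employee_vacations_nested_types e, e.vacation v;

-- The MAP key picks out one kind of contact record for each employee.
SELECT e.employee_id, c.value.phone, c.value.email
FROM employee_vacations_nested_types e, e.contact c
WHERE c.key = 'home';
]]>
</codeblock>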
</conbody>
</concept>
<concept id="complex_inference">
<title>Interchanging Complex Type Tables and Data Files with Hive and Other Components</title>
<conbody>
<p>
You can produce Parquet data files through several Hadoop components and APIs.
</p>
<p>
If you have a Hive-created Parquet table that includes <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, or <codeph>MAP</codeph>
columns, Impala can query that same table in <keyword keyref="impala23_full"/> and higher, subject to the usual restrictions: all
other columns must use data types supported by Impala, and the file format of the table must be Parquet.
</p>
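<p>
For example, the round trip might look like the following sketch (the table name is hypothetical; the <codeph>CREATE
TABLE</codeph> statement is issued in the Hive shell, the rest in impala-shell):
</p>
<codeblock><![CDATA[/* In the Hive shell: create a Parquet table with a complex column. */
CREATE TABLE hive_complex (id BIGINT, tags ARRAY<STRING>) STORED AS PARQUET;

/* In impala-shell: pick up the Hive-created table, then query the complex column. */
INVALIDATE METADATA hive_complex;
SELECT t.id, tag.item FROM hive_complex t, t.tags tag;
]]>
</codeblock>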
<p>
If you have a Parquet data file produced outside of Impala, Impala can automatically deduce the appropriate table structure using
the syntax <codeph>CREATE TABLE ... LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'</codeph>. In <keyword keyref="impala23_full"/>
and higher, this feature works for Parquet files that include <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, or
<codeph>MAP</codeph> types.
</p>
<codeblock><![CDATA[/* In impala-shell, find the HDFS data directory of the original table. */
DESCRIBE FORMATTED tpch_nested_parquet.customer;
...
| Location: | hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer | NULL |
...
# In the Unix shell, find the path of any Parquet data file in that HDFS directory.
$ hdfs dfs -ls hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer
Found 4 items
-rwxr-xr-x 3 dev supergroup 171298918 2015-09-22 23:30 hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer/000000_0
...
/* Back in impala-shell, use the HDFS path in a CREATE TABLE LIKE PARQUET statement. */
CREATE TABLE customer_ctlp
LIKE PARQUET 'hdfs://localhost:20500/test-warehouse/tpch_nested_parquet.db/customer/000000_0'
STORED AS PARQUET;
/* Confirm that old and new tables have the same column layout, including complex types. */
DESCRIBE tpch_nested_parquet.customer;
+--------------+------------------------------------+---------+
| name | type | comment |
+--------------+------------------------------------+---------+
| c_custkey | bigint | |
| c_name | string | |
| c_address | string | |
| c_nationkey | smallint | |
| c_phone | string | |
| c_acctbal | decimal(12,2) | |
| c_mktsegment | string | |
| c_comment | string | |
| c_orders | array<struct< | |
| | o_orderkey:bigint, | |
| | o_orderstatus:string, | |
| | o_totalprice:decimal(12,2), | |
| | o_orderdate:string, | |
| | o_orderpriority:string, | |
| | o_clerk:string, | |
| | o_shippriority:int, | |
| | o_comment:string, | |
| | o_lineitems:array<struct< | |
| | l_partkey:bigint, | |
| | l_suppkey:bigint, | |
| | l_linenumber:int, | |
| | l_quantity:decimal(12,2), | |
| | l_extendedprice:decimal(12,2), | |
| | l_discount:decimal(12,2), | |
| | l_tax:decimal(12,2), | |
| | l_returnflag:string, | |
| | l_linestatus:string, | |
| | l_shipdate:string, | |
| | l_commitdate:string, | |
| | l_receiptdate:string, | |
| | l_shipinstruct:string, | |
| | l_shipmode:string, | |
| | l_comment:string | |
| | >> | |
| | >> | |
+--------------+------------------------------------+---------+
DESCRIBE customer_ctlp;
+--------------+------------------------------------+-----------------------------+
| name | type | comment |
+--------------+------------------------------------+-----------------------------+
| c_custkey | bigint | Inferred from Parquet file. |
| c_name | string | Inferred from Parquet file. |
| c_address | string | Inferred from Parquet file. |
| c_nationkey | int | Inferred from Parquet file. |
| c_phone | string | Inferred from Parquet file. |
| c_acctbal | decimal(12,2) | Inferred from Parquet file. |
| c_mktsegment | string | Inferred from Parquet file. |
| c_comment | string | Inferred from Parquet file. |
| c_orders | array<struct< | Inferred from Parquet file. |
| | o_orderkey:bigint, | |
| | o_orderstatus:string, | |
| | o_totalprice:decimal(12,2), | |
| | o_orderdate:string, | |
| | o_orderpriority:string, | |
| | o_clerk:string, | |
| | o_shippriority:int, | |
| | o_comment:string, | |
| | o_lineitems:array<struct< | |
| | l_partkey:bigint, | |
| | l_suppkey:bigint, | |
| | l_linenumber:int, | |
| | l_quantity:decimal(12,2), | |
| | l_extendedprice:decimal(12,2), | |
| | l_discount:decimal(12,2), | |
| | l_tax:decimal(12,2), | |
| | l_returnflag:string, | |
| | l_linestatus:string, | |
| | l_shipdate:string, | |
| | l_commitdate:string, | |
| | l_receiptdate:string, | |
| | l_shipinstruct:string, | |
| | l_shipmode:string, | |
| | l_comment:string | |
| | >> | |
| | >> | |
+--------------+------------------------------------+-----------------------------+
]]>
</codeblock>
</conbody>
</concept>
</concept>
</concept>