blob: b4289ab79f307f5bb9b70f42ab4d4294ceee8f20 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="struct">
<title>STRUCT Complex Type (<keyword keyref="impala23"/> or higher only)</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
A complex data type, representing multiple fields of a single item. Frequently used as the element type of an <codeph>ARRAY</codeph>
or the <codeph>VALUE</codeph> part of a <codeph>MAP</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock><varname>column_name</varname> STRUCT &lt; <varname>name</varname> : <varname>type</varname> [COMMENT '<varname>comment_string</varname>'], ... &gt;
type ::= <varname>primitive_type</varname> | <varname>complex_type</varname>
</codeblock>
<p>
The names and number of fields within the <codeph>STRUCT</codeph> are fixed. Each field can be a different type. A field within a
<codeph>STRUCT</codeph> can also be another <codeph>STRUCT</codeph>, or an <codeph>ARRAY</codeph> or a <codeph>MAP</codeph>, allowing
you to create nested data structures with a maximum nesting depth of 100.
</p>
<p>
A <codeph>STRUCT</codeph> can be the top-level type for a column, or can itself be an item within an <codeph>ARRAY</codeph> or the
value part of the key-value pair in a <codeph>MAP</codeph>.
</p>
<p>
When a <codeph>STRUCT</codeph> is used as an <codeph>ARRAY</codeph> element or a <codeph>MAP</codeph> value, you use a join clause to
bring the <codeph>ARRAY</codeph> or <codeph>MAP</codeph> elements into the result set, and then refer to
<codeph><varname>array_name</varname>.ITEM.<varname>field</varname></codeph> or
<codeph><varname>map_name</varname>.VALUE.<varname>field</varname></codeph>. In the case of a <codeph>STRUCT</codeph> directly inside
an <codeph>ARRAY</codeph> or <codeph>MAP</codeph>, you can omit the <codeph>.ITEM</codeph> and <codeph>.VALUE</codeph> pseudocolumns
and refer directly to <codeph><varname>array_name</varname>.<varname>field</varname></codeph> or
<codeph><varname>map_name</varname>.<varname>field</varname></codeph>.
</p>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p conref="../shared/impala_common.xml#common/complex_types_combo"/>
<p>
A <codeph>STRUCT</codeph> is similar conceptually to a table row: it contains a fixed number of named fields, each with a predefined
type. To combine two related tables, while using complex types to minimize repetition, the typical way to represent that data is as an
<codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements.
</p>
<p>
Because a <codeph>STRUCT</codeph> has a fixed number of named fields, it typically does not make sense to have a
<codeph>STRUCT</codeph> as the type of a table column. In such a case, you could just make each field of the <codeph>STRUCT</codeph>
into a separate column of the table. The <codeph>STRUCT</codeph> type is most useful as an item of an <codeph>ARRAY</codeph> or the
value part of the key-value pair in a <codeph>MAP</codeph>. A nested type column with a <codeph>STRUCT</codeph> at the lowest level
lets you associate a variable number of row-like objects with each row of the table.
</p>
<p>
The <codeph>STRUCT</codeph> type is straightforward to reference within a query. You do not need to include the
<codeph>STRUCT</codeph> column in a join clause or give it a table alias, as is required for the <codeph>ARRAY</codeph> and
<codeph>MAP</codeph> types. You refer to the individual fields using dot notation, such as
<codeph><varname>struct_column_name</varname>.<varname>field_name</varname></codeph>, without any pseudocolumn such as
<codeph>ITEM</codeph> or <codeph>VALUE</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/complex_types_describe"/>
<p conref="../shared/impala_common.xml#common/internals_blurb"/>
<p>
Within the Parquet data file, the values for each <codeph>STRUCT</codeph> field are stored adjacent to each other, so that they can be
encoded and compressed using all the Parquet techniques for storing sets of similar or repeated values. The adjacency applies even
when the <codeph>STRUCT</codeph> values are part of an <codeph>ARRAY</codeph> or <codeph>MAP</codeph>. During a query, Impala avoids
unnecessary I/O by reading only the portions of the Parquet data file containing the requested <codeph>STRUCT</codeph> fields.
</p>
<p conref="../shared/impala_common.xml#common/added_in_230"/>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<ul conref="../shared/impala_common.xml#common/complex_types_restrictions">
<li/>
</ul>
<p conref="../shared/impala_common.xml#common/kudu_blurb"/>
<p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
<p>
The following example shows a table with various kinds of <codeph>STRUCT</codeph> columns, both at the top level and nested within
other complex types. Practice the <codeph>CREATE TABLE</codeph> and query notation for complex type columns using empty tables, until
you can visualize a complex data structure and construct corresponding SQL statements reliably.
</p>
<codeblock><![CDATA[CREATE TABLE struct_demo
(
id BIGINT,
name STRING,
-- A STRUCT as a top-level column. Demonstrates how the table ID column
-- and the ID field within the STRUCT can coexist without a name conflict.
employee_info STRUCT < employer: STRING, id: BIGINT, address: STRING >,
-- A STRUCT as the element type of an ARRAY.
places_lived ARRAY < STRUCT <street: STRING, city: STRING, country: STRING >>,
-- A STRUCT as the value portion of the key-value pairs in a MAP.
memorable_moments MAP < STRING, STRUCT < year: INT, place: STRING, details: STRING >>,
-- A STRUCT where one of the fields is another STRUCT.
current_address STRUCT < street_address: STRUCT <street_number: INT, street_name: STRING, street_type: STRING>, country: STRING, postal_code: STRING >
)
STORED AS PARQUET;
]]>
</codeblock>
<p>
The following example shows how to examine the structure of a table containing one or more <codeph>STRUCT</codeph> columns by using
the <codeph>DESCRIBE</codeph> statement. You can visualize each <codeph>STRUCT</codeph> as its own table, with columns named the same
as each field of the <codeph>STRUCT</codeph>. If the <codeph>STRUCT</codeph> is nested inside another complex type, such as
<codeph>ARRAY</codeph>, you can extend the qualified name passed to <codeph>DESCRIBE</codeph> until the output shows just the
<codeph>STRUCT</codeph> fields.
</p>
<codeblock><![CDATA[DESCRIBE struct_demo;
+-------------------+--------------------------+
| name | type |
+-------------------+--------------------------+
| id | bigint |
| name | string |
| employee_info | struct< |
| | employer:string, |
| | id:bigint, |
| | address:string |
| | > |
| places_lived | array<struct< |
| | street:string, |
| | city:string, |
| | country:string |
| | >> |
| memorable_moments | map<string,struct< |
| | year:int, |
| | place:string, |
| | details:string |
| | >> |
| current_address | struct< |
| | street_address:struct< |
| | street_number:int, |
| | street_name:string, |
| | street_type:string |
| | >, |
| | country:string, |
| | postal_code:string |
| | > |
+-------------------+--------------------------+
]]>
</codeblock>
<p>
The top-level column <codeph>EMPLOYEE_INFO</codeph> is a <codeph>STRUCT</codeph>. Describing
<codeph><varname>table_name</varname>.<varname>struct_name</varname></codeph> displays the fields of the <codeph>STRUCT</codeph> as if
they were columns of a table:
</p>
<codeblock><![CDATA[DESCRIBE struct_demo.employee_info;
+----------+--------+
| name | type |
+----------+--------+
| employer | string |
| id | bigint |
| address | string |
+----------+--------+
]]>
</codeblock>
<p>
Because <codeph>PLACES_LIVED</codeph> is a <codeph>STRUCT</codeph> inside an <codeph>ARRAY</codeph>, the initial
<codeph>DESCRIBE</codeph> shows the structure of the <codeph>ARRAY</codeph>:
</p>
<codeblock><![CDATA[DESCRIBE struct_demo.places_lived;
+------+------------------+
| name | type |
+------+------------------+
| item | struct< |
| | street:string, |
| | city:string, |
| | country:string |
| | > |
| pos | bigint |
+------+------------------+
]]>
</codeblock>
<p>
Ask for the details of the <codeph>ITEM</codeph> field of the <codeph>ARRAY</codeph> to see just the layout of the
<codeph>STRUCT</codeph>:
</p>
<codeblock><![CDATA[DESCRIBE struct_demo.places_lived.item;
+---------+--------+
| name | type |
+---------+--------+
| street | string |
| city | string |
| country | string |
+---------+--------+
]]>
</codeblock>
<p>
Likewise, <codeph>MEMORABLE_MOMENTS</codeph> has a <codeph>STRUCT</codeph> inside a <codeph>MAP</codeph>, which requires an extra
level of qualified name to see just the <codeph>STRUCT</codeph> part:
</p>
<codeblock><![CDATA[DESCRIBE struct_demo.memorable_moments;
+-------+------------------+
| name | type |
+-------+------------------+
| key | string |
| value | struct< |
| | year:int, |
| | place:string, |
| | details:string |
| | > |
+-------+------------------+
]]>
</codeblock>
<p>
For a <codeph>MAP</codeph>, ask to see the <codeph>VALUE</codeph> field to see the corresponding <codeph>STRUCT</codeph> fields in a
table-like structure:
</p>
<codeblock><![CDATA[DESCRIBE struct_demo.memorable_moments.value;
+---------+--------+
| name | type |
+---------+--------+
| year | int |
| place | string |
| details | string |
+---------+--------+
]]>
</codeblock>
<p>
For a <codeph>STRUCT</codeph> inside a <codeph>STRUCT</codeph>, we can see the fields of the outer <codeph>STRUCT</codeph>:
</p>
<codeblock><![CDATA[DESCRIBE struct_demo.current_address;
+----------------+-----------------------+
| name | type |
+----------------+-----------------------+
| street_address | struct< |
| | street_number:int, |
| | street_name:string, |
| | street_type:string |
| | > |
| country | string |
| postal_code | string |
+----------------+-----------------------+
]]>
</codeblock>
<p>
Then we can use a further qualified name to see just the fields of the inner <codeph>STRUCT</codeph>:
</p>
<codeblock><![CDATA[DESCRIBE struct_demo.current_address.street_address;
+---------------+--------+
| name | type |
+---------------+--------+
| street_number | int |
| street_name | string |
| street_type | string |
+---------------+--------+
]]>
</codeblock>
<p>
The following example shows how to examine the structure of a table containing one or more <codeph>STRUCT</codeph> columns by using
the <codeph>DESCRIBE</codeph> statement. You can visualize each <codeph>STRUCT</codeph> as its own table, with columns named the same
as each field of the <codeph>STRUCT</codeph>. If the <codeph>STRUCT</codeph> is nested inside another complex type, such as
<codeph>ARRAY</codeph>, you can extend the qualified name passed to <codeph>DESCRIBE</codeph> until the output shows just the
<codeph>STRUCT</codeph> fields.
</p>
<!-- To do: See why the most verbose query form gives an error. -->
<codeblock><![CDATA[DESCRIBE struct_demo;
+-------------------+--------------------------+---------+
| name | type | comment |
+-------------------+--------------------------+---------+
| id | bigint | |
| name | string | |
| employee_info | struct< | |
| | employer:string, | |
| | id:bigint, | |
| | address:string | |
| | > | |
| places_lived | array<struct< | |
| | street:string, | |
| | city:string, | |
| | country:string | |
| | >> | |
| memorable_moments | map<string,struct< | |
| | year:int, | |
| | place:string, | |
| | details:string | |
| | >> | |
| current_address | struct< | |
| | street_address:struct< | |
| | street_number:int, | |
| | street_name:string, | |
| | street_type:string | |
| | >, | |
| | country:string, | |
| | postal_code:string | |
| | > | |
+-------------------+--------------------------+---------+
SELECT id, employee_info.id FROM struct_demo;
SELECT id, employee_info.id AS employee_id FROM struct_demo;
SELECT id, employee_info.id AS employee_id, employee_info.employer
FROM struct_demo;
SELECT id, name, street, city, country
FROM struct_demo, struct_demo.places_lived;
SELECT id, name, places_lived.pos, places_lived.street, places_lived.city, places_lived.country
FROM struct_demo, struct_demo.places_lived;
SELECT id, name, pl.pos, pl.street, pl.city, pl.country
FROM struct_demo, struct_demo.places_lived AS pl;
SELECT id, name, places_lived.pos, places_lived.street, places_lived.city, places_lived.country
FROM struct_demo, struct_demo.places_lived;
SELECT id, name, pos, street, city, country
FROM struct_demo, struct_demo.places_lived;
SELECT id, name, memorable_moments.key,
memorable_moments.value.year,
memorable_moments.value.place,
memorable_moments.value.details
FROM struct_demo, struct_demo.memorable_moments
WHERE memorable_moments.key IN ('Birthday','Anniversary','Graduation');
SELECT id, name, mm.key, mm.value.year, mm.value.place, mm.value.details
FROM struct_demo, struct_demo.memorable_moments AS mm
WHERE mm.key IN ('Birthday','Anniversary','Graduation');
SELECT id, name, memorable_moments.key, memorable_moments.value.year,
memorable_moments.value.place, memorable_moments.value.details
FROM struct_demo, struct_demo.memorable_moments
WHERE key IN ('Birthday','Anniversary','Graduation');
SELECT id, name, key, value.year, value.place, value.details
FROM struct_demo, struct_demo.memorable_moments
WHERE key IN ('Birthday','Anniversary','Graduation');
SELECT id, name, key, year, place, details
FROM struct_demo, struct_demo.memorable_moments
WHERE key IN ('Birthday','Anniversary','Graduation');
SELECT id, name,
current_address.street_address.street_number,
current_address.street_address.street_name,
current_address.street_address.street_type,
current_address.country,
current_address.postal_code
FROM struct_demo;
]]>
</codeblock>
<p>
For example, this table uses a struct that encodes several data values for each phone number associated with a person. Each person can
have a variable-length array of associated phone numbers, and queries can refer to the category field to locate specific home, work,
mobile, and so on kinds of phone numbers.
</p>
<codeblock><![CDATA[CREATE TABLE contact_info_many_structs
(
id BIGINT, name STRING,
phone_numbers ARRAY < STRUCT <category:STRING, country_code:STRING, area_code:SMALLINT, full_number:STRING, mobile:BOOLEAN, carrier:STRING > >
) STORED AS PARQUET;
]]>
</codeblock>
<p>
Because structs are naturally suited to composite values where the fields have different data types, you might use them to decompose
things such as addresses:
</p>
<codeblock><![CDATA[CREATE TABLE contact_info_detailed_address
(
id BIGINT, name STRING,
address STRUCT < house_number:INT, street:STRING, street_type:STRING, apartment:STRING, city:STRING, region:STRING, country:STRING >
);
]]>
</codeblock>
<p>
In a big data context, splitting out data fields such as the number part of the address and the street name could let you do analysis
on each field independently. For example, which streets have the largest number range of addresses, what are the statistical
properties of the street names, which areas have a higher proportion of <q>Roads</q>, <q>Courts</q> or <q>Boulevards</q>, and so on.
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_complex_types.xml#complex_types"/>, <xref href="impala_array.xml#array"/>,
<!-- <xref href="impala_struct.xml#struct"/>, -->
<xref href="impala_map.xml#map"/>
</p>
</conbody>
</concept>