blob: facd6f37bff233e2676af9817453a7910e8938a1 [file] [log] [blame]
<!DOCTYPE html>
<!--
| Generated by Apache Maven Doxia Site Renderer 1.8.1 from src/site/markdown/sqlpp/parquet.md at 2022-05-12
| Rendered using Apache Maven Fluido Skin 1.7
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="Date-Revision-yyyymmdd" content="20220512" />
<meta http-equiv="Content-Language" content="en" />
<title>AsterixDB &#x2013; Querying Parquet Files</title>
<link rel="stylesheet" href="../css/apache-maven-fluido-1.7.min.css" />
<link rel="stylesheet" href="../css/site.css" />
<link rel="stylesheet" href="../css/print.css" media="print" />
<script type="text/javascript" src="../js/apache-maven-fluido-1.7.min.js"></script>
</head>
<body class="topBarDisabled">
<div class="container-fluid">
<div id="banner">
<div class="pull-left"><a href=".././" id="bannerLeft"><img src="../images/asterixlogo.png" alt="AsterixDB"/></a></div>
<div class="pull-right"></div>
<div class="clear"><hr/></div>
</div>
<div id="breadcrumbs">
<ul class="breadcrumb">
<li id="publishDate">Last Published: 2022-05-12</li>
<li id="projectVersion" class="pull-right">Version: 0.9.8</li>
<li class="pull-right"><a href="../index.html" title="Documentation Home">Documentation Home</a></li>
</ul>
</div>
<div class="row-fluid">
<div id="leftColumn" class="span2">
<div class="well sidebar-nav">
<ul class="nav nav-list">
<li class="nav-header">Get Started - Installation</li>
<li><a href="../ncservice.html" title="Option 1: using NCService"><span class="none"></span>Option 1: using NCService</a></li>
<li><a href="../ansible.html" title="Option 2: using Ansible"><span class="none"></span>Option 2: using Ansible</a></li>
<li><a href="../aws.html" title="Option 3: using Amazon Web Services"><span class="none"></span>Option 3: using Amazon Web Services</a></li>
<li class="nav-header">AsterixDB Primer</li>
<li><a href="../sqlpp/primer-sqlpp.html" title="Using SQL++"><span class="none"></span>Using SQL++</a></li>
<li class="nav-header">Data Model</li>
<li><a href="../datamodel.html" title="The Asterix Data Model"><span class="none"></span>The Asterix Data Model</a></li>
<li class="nav-header">Queries</li>
<li><a href="../sqlpp/manual.html" title="The SQL++ Query Language"><span class="none"></span>The SQL++ Query Language</a></li>
<li><a href="../SQLPP.html" title="Raw SQL++ Grammar"><span class="none"></span>Raw SQL++ Grammar</a></li>
<li><a href="../sqlpp/builtins.html" title="Builtin Functions"><span class="none"></span>Builtin Functions</a></li>
<li class="nav-header">API/SDK</li>
<li><a href="../api.html" title="HTTP API"><span class="none"></span>HTTP API</a></li>
<li><a href="../csv.html" title="CSV Output"><span class="none"></span>CSV Output</a></li>
<li class="nav-header">Advanced Features</li>
<li><a href="../aql/externaldata.html" title="Accessing External Data"><span class="none"></span>Accessing External Data</a></li>
<li><a href="../feeds.html" title="Data Ingestion with Feeds"><span class="none"></span>Data Ingestion with Feeds</a></li>
<li><a href="../udf.html" title="User Defined Functions"><span class="none"></span>User Defined Functions</a></li>
<li><a href="../sqlpp/filters.html" title="Filter-Based LSM Index Acceleration"><span class="none"></span>Filter-Based LSM Index Acceleration</a></li>
<li><a href="../sqlpp/fulltext.html" title="Support of Full-text Queries"><span class="none"></span>Support of Full-text Queries</a></li>
<li><a href="../sqlpp/similarity.html" title="Support of Similarity Queries"><span class="none"></span>Support of Similarity Queries</a></li>
<li><a href="../geo/quickstart.html" title="GIS Support Overview"><span class="none"></span>GIS Support Overview</a></li>
<li><a href="../geo/functions.html" title="GIS Functions"><span class="none"></span>GIS Functions</a></li>
<li><a href="../interval_join.html" title="Support of Interval Joins"><span class="none"></span>Support of Interval Joins</a></li>
<li><a href="../spatial_join.html" title="Support of Spatial Joins"><span class="none"></span>Support of Spatial Joins</a></li>
<li><a href="../sqlpp/arrayindex.html" title="Support of Array Indexes"><span class="none"></span>Support of Array Indexes</a></li>
<li class="nav-header">Deprecated</li>
<li><a href="../aql/primer.html" title="AsterixDB Primer: Using AQL"><span class="none"></span>AsterixDB Primer: Using AQL</a></li>
<li><a href="../aql/manual.html" title="Queries: The Asterix Query Language (AQL)"><span class="none"></span>Queries: The Asterix Query Language (AQL)</a></li>
<li><a href="../aql/builtins.html" title="Queries: Builtin Functions (AQL)"><span class="none"></span>Queries: Builtin Functions (AQL)</a></li>
</ul>
<hr />
<div id="poweredBy">
<div class="clear"></div>
<div class="clear"></div>
<div class="clear"></div>
<div class="clear"></div>
<a href=".././" title="AsterixDB" class="builtBy"><img class="builtBy" alt="AsterixDB" src="../images/asterixlogo.png" /></a>
</div>
</div>
</div>
<div id="bodyColumn" class="span10" >
<!--
! Licensed to the Apache Software Foundation (ASF) under one
! or more contributor license agreements. See the NOTICE file
! distributed with this work for additional information
! regarding copyright ownership. The ASF licenses this file
! to you under the Apache License, Version 2.0 (the
! "License"); you may not use this file except in compliance
! with the License. You may obtain a copy of the License at
!
! http://www.apache.org/licenses/LICENSE-2.0
!
! Unless required by applicable law or agreed to in writing,
! software distributed under the License is distributed on an
! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
! KIND, either express or implied. See the License for the
! specific language governing permissions and limitations
! under the License.
!-->
<h1>Querying Parquet Files</h1>
<div class="section">
<h2><a name="Table_of_Contents"></a><a name="toc" id="toc">Table of Contents</a></h2>
<ul>
<li><a href="#Overview">Overview</a></li>
<li><a href="#DDL">DDL</a></li>
<li><a href="#QueryParquetFiles">Query Parquet Files</a></li>
<li><a href="#TypeCompatibility">Type Compatibility</a></li>
<li><a href="#ParquetTypeFlags">Parquet Type Flags</a></li>
</ul></div>
<div class="section">
<h2><a name="Overview_.5BBack_to_TOC.5D"></a><a name="Overview" id="Overview">Overview</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
<p><a class="externalLink" href="https://parquet.apache.org/">Apache Parquet</a> is a columnar file format for storing semi-structured data (like JSON). Apache AsterixDB supports running queries against Parquet files that are stored in Amazon S3 and Microsoft Azure Blob Storage as <a href="../aql/externaldata.html">External Datasets</a>.</p></div>
<div class="section">
<h2><a name="DDL_.5BBack_to_TOC.5D"></a><a name="DDL" id="DDL">DDL</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
<p>To start, an end-user needs to create a type as follows:</p>
<div>
<div>
<pre class="source">-- The type should not contain any declared fields
CREATE TYPE ParquetType AS {
}
</pre></div></div>
<p>Note that the created type does not have any declared fields. The reason is that Parquet files embed the schema within each file. Thus, no type is needed to be declared, and it is up to AsterixDB to read each file&#x2019;s schema. If the created type contains any declared type, AsterixDB will throw an error:</p>
<div>
<div>
<pre class="source">Type 'ParquetType' contains declared fields, which is not supported for 'parquet' format
</pre></div></div>
<p>Next, the user can create an external dataset - using the declared type - as follows:</p>
<div class="section">
<h3><a name="Amazon_S3"></a>Amazon S3</h3>
<div>
<div>
<pre class="source">CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING S3
(
-- Replace &lt;ACCESS-KEY&gt; with your access key
(&quot;accessKeyId&quot;=&quot;&lt;ACCESS-KEY&gt;&quot;),
-- Replace &lt;SECRET-ACCESS-KEY&gt; with your access key
(&quot;secretAccessKey&quot; = &quot;&lt;SECRET-ACCESS-KEY&gt;&quot;),
-- S3 bucket
(&quot;container&quot;=&quot;parquetBucket&quot;),
-- Path to the parquet files within the bucket
(&quot;definition&quot;=&quot;path/to/parquet/files&quot;),
-- Specifying the format as parquet
(&quot;format&quot; = &quot;parquet&quot;)
);
</pre></div></div>
</div>
<div class="section">
<h3><a name="Microsoft_Azure_Blob_Storage"></a>Microsoft Azure Blob Storage</h3>
<div>
<div>
<pre class="source">CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING AZUREBLOB
(
-- Replace &lt;ACCOUNT-NAME&gt; with your account name
(&quot;accountName&quot;=&quot;&lt;ACCOUNT-NAME&gt;&quot;),
-- Replace &lt;ACCOUNT-KEY&gt; with your account key
(&quot;accountKey&quot;=&quot;&lt;ACCOUNT-KEY&gt;&quot;),
-- Azure Blob container
(&quot;container&quot;=&quot;parquetContainer&quot;),
-- Path to the parquet files within the bucket
(&quot;definition&quot;=&quot;path/to/parquet/files&quot;),
-- Specifying the format as parquet
(&quot;format&quot; = &quot;parquet&quot;)
);
</pre></div></div>
<p><i><b>Additional setting/properties could be set as detailed later in <a href="#ParquetTypeFlags">Parquet Type Flags</a></b></i></p></div></div>
<div class="section">
<h2><a name="Query_Parquet_Files_.5BBack_to_TOC.5D"></a><a name="QueryParquetFiles" id="QueryParquetFiles">Query Parquet Files</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
<p>To query the data stored in Parquet files, one can simply write a query against the created External Dataset. For example:</p>
<div>
<div>
<pre class="source">SELECT COUNT(*)
FROM ParquetDataset;
</pre></div></div>
<p>Another example:</p>
<div>
<div>
<pre class="source">SELECT pd.age, COUNT(*) cnt
FROM ParquetDataset pd
GROUP BY pd.age;
</pre></div></div>
</div>
<div class="section">
<h2><a name="Type_Compatibility_.5BBack_to_TOC.5D"></a><a name="TypeCompatibility" id="TypeCompatibility">Type Compatibility</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
<p>AsterixDB supports Parquet&#x2019;s generic types such <tt>STRING</tt>, <tt>INT</tt> and <tt>DOUBLE</tt>. However, Parquet files could contain <a class="externalLink" href="https://github.com/apache/parquet-format/blob/master/LogicalTypes.md">additional types</a> such as <tt>DATE</tt> and <tt>DATETIME</tt> like types. The following table show the type mapping between Apache Parquet and AsterixDB:</p>
<table border="0" class="table table-striped">
<thead>
<tr class="a">
<th>Parquet</th>
<th>AsterixDB</th>
<th>Value Examples</th>
<th>Comment</th>
</tr>
</thead>
<tbody>
<tr class="b">
<td><tt>BOOLEAN</tt></td>
<td><tt>BOOLEAN</tt></td>
<td><tt>true</tt> / <tt>false</tt></td>
<td>-</td>
</tr>
<tr class="a">
<td><tt>INT_8</tt></td>
<td rowspan="8"><tt>BIGINT</tt></td>
<td rowspan="8">
AsterixDB <tt>BIGINT</tt> Range:
<ul>
<li><b>Min</b>:-9,223,372,036,854,775,808</li>
<li><b>Max</b>: 9,223,372,036,854,775,807</li>
</ul>
</td>
<td rowspan="7">-</td>
</tr>
<tr class="b">
<td><tt>INT_16</tt></td>
</tr>
<tr class="a">
<td><tt>INT_32</tt></td>
</tr>
<tr class="b">
<td><tt>INT_64</tt></td>
</tr>
<tr class="a">
<td><tt>UNIT_8</tt></td>
</tr>
<tr class="b">
<td><tt>UINT_16</tt></td>
</tr>
<tr class="a">
<td><tt>UINT_32</tt></td>
</tr>
<tr class="b">
<td><tt>UINT_64</tt></td>
<td>There is a possibility that a value overflows. A warning will be issued in case of an overflow and
<tt>MISSING</tt> would be returned.
</td>
</tr>
<tr class="a">
<td><tt>FLOAT</tt></td>
<td rowspan="4"><tt>DOUBLE</tt></td>
<td rowspan="4">
AsterixDB <tt>DOUBLE</tt> Range:
<ul>
<li><b>Min Positive Value</b>: 2^-1074</li>
<li><b>Max Positive Value</b>: 2^1023</li>
</ul>
</td>
<td rowspan="2">-</td>
</tr>
<tr class="b">
<td><tt>DOUBLE</tt></td>
</tr>
<tr class="a">
<td><tt>FIXED_LEN_BYTE_ARRAY (DECIMAL)</tt></td>
<td rowspan="2">
Parquet <tt>DECIMAL</tt> values are converted to doubles, with the possibility of precision loss.
The flag <tt>decimal-to-double</tt> must be set upon creating the dataset.
<ul>
<li><i>See <a href="#ParquetTypeFlags">Parquet Type Flags</a></i></li></ul>
</td>
</tr>
<tr class="b">
<td><tt>BINARY (DECIMAL)</tt></td>
</tr>
<tr class="a">
<td><tt>BINARY (ENUM)</tt></td>
<td><tt>&quot;Fruit&quot;</tt></td>
<td>Parquet Enum values are parsed as Strings</td>
</tr>
<tr class="b">
<td><tt>BINARY (UTF8)</tt></td>
<td><tt>STRING</tt></td>
<td><tt>&quot;Hello World&quot;</tt></td>
<td>-</td>
</tr>
<tr class="a">
<td><tt>FIXED_LEN_BYTE_ARRAY (UUID)</tt></td>
<td><tt>UUID</tt></td>
<td><tt>uuid(&quot;123e4567-e89b-12d3-a456-426614174000&quot;)</tt></td>
<td>-</td>
</tr>
<tr class="b">
<td><tt>INT_32 (DATE)</tt></td>
<td><tt>DATE</tt></td>
<td><tt>date(&quot;2021-11-01&quot;)</tt></td>
<td>-</td>
</tr>
<tr class="a">
<td><tt>INT_32 (TIME)</tt></td>
<td><tt>TIME</tt></td>
<td rowspan="2"><tt>time(&quot;00:00:00.000&quot;)</tt></td>
<td>Time in milliseconds.</td>
</tr>
<tr class="b">
<td><tt>INT_64 (TIME)</tt></td>
<td><tt>TIME</tt></td>
<td>Time in micro/nano seconds.</td>
</tr>
<tr class="a">
<td><tt>INT_64 (TIMESTAMP)</tt></td>
<td rowspan="2"><tt>DATETIME</tt></td>
<td rowspan="2"><tt>datetime(&quot;2021-11-01T21:37:13.738&quot;)&quot;</tt></td>
<td>Timestamp in milli/micro/nano seconds. Parquet also can store the timestamp values with the option
<tt>isAdjustedToUTC = true</tt>. To get the local timestamp value, the user can set the time zone ID
by setting the value using the option <tt>timezone</tt> to get the local <tt>DATETIME</tt> value.
<ul>
<li><i>See <a href="#ParquetTypeFlags">Parquet Type Flags</a></i></li></ul>
</td>
</tr>
<tr class="b">
<td><tt>INT96</tt></td>
<td>A timestamp values that separate days and time to form a timestamp. INT96 is always in localtime.</td>
</tr>
<tr class="a">
<td><tt>BINARY (JSON)</tt></td>
<td>any type</td>
<td>
<ul>
<li><tt>{&quot;name&quot;: &quot;John&quot;}</tt></li>
<li><tt>[1, 2, 3]</tt></li>
</ul>
</td>
<td>
Parse JSON string into internal AsterixDB value.
The flag <tt>parse-json-string</tt> is set by default. To get the string value (i.e., not parsed as
AsterixDB value), unset the flag <tt>parse-json-string</tt>.
<ul>
<li><i>See <a href="#ParquetTypeFlags">Parquet Type Flags</a></i></li></ul>
</td>
</tr>
<tr class="b">
<td><tt>BINARY</tt></td>
<td rowspan="2"><tt>BINARY</tt></td>
<td><tt>hex(&quot;0101FF&quot;)</tt></td>
<td>-</td>
</tr>
<tr class="a">
<td><tt>BSON</tt></td>
<td>N/A</td>
<td>BSON values will be returned as <tt>BINARY</tt></td>
</tr>
<tr class="b">
<td><tt>LIST</tt></td>
<td><tt>ARRAY</tt></td>
<td><tt>[1, 2, 3]</tt></td>
<td>Parquet's <tt>LIST</tt> type is converted into <tt>ARRAY</tt></td>
</tr>
<tr class="a">
<td><tt>MAP</tt></td>
<td><tt>ARRAY</tt> of <tt>OBJECT</tt></td>
<td><tt>[{&quot;key&quot;:1, &quot;value&quot;:1}, {&quot;key&quot;:2, &quot;value&quot;:2}]</tt></td>
<td>Parquet's <tt>MAP</tt> types are converted into an <tt>ARRAY</tt> of <tt>OBJECT</tt>. Each
<tt>OBJECT</tt> value consists of two fields: <tt>key</tt> and <tt>value</tt>
</td>
</tr>
<tr class="b">
<td><tt>FIXED_LEN_BYTE_ARRAY (INTERVAL)</tt></td>
<td>-</td>
<td>N/A</td>
<td><tt>INTERVAL</tt> is not supported. A warning will be issued and <tt>MISSING</tt> value
will be returned.
</td>
</tr>
</tbody>
</table>
</div>
<div class="section">
<h2><a name="Parquet_Type_Flags_.5BBack_to_TOC.5D"></a><a name="ParquetTypeFlags" id="ParquetTypeFlags">Parquet Type Flags</a> <font size="4"><a href="#toc">[Back to TOC]</a></font></h2>
<p>The table in <a href="#TypeCompatibility">Type Compatibility</a> shows the type mapping between Parquet and AsterixDB. Some of the Parquet types are not parsed by default as those type are not natively supported in AsterixDB. However, the user can set a flag to convert some of those types into a supported AsterixDB type.</p>
<div class="section">
<div class="section">
<div class="section">
<h5><a name="DECIMAL_TYPE"></a>DECIMAL TYPE</h5>
<p>The user can enable parsing <tt>DECIMAL</tt> Parquet values by enabling a certain flag as in the following example:</p>
<div>
<div>
<pre class="source">CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING S3
(
-- Credintials and path to Parquet files
...
-- Enable converting decimal values to double
(&quot;decimal-to-double&quot; = &quot;true&quot;)
);
</pre></div></div>
<p>This flag will enable parsing/converting <tt>DECIMAL</tt> values/types into <tt>DOUBLE</tt>. For example, if the flag <tt>decimal-to-double</tt> is not set and a Parquet file contains a <tt>DECIMAL</tt> value, the following error will be thrown when running a query that request a <tt>DECIMAL</tt> value:</p>
<div>
<div>
<pre class="source">Parquet type &quot;optional fixed_len_byte_array(16) decimalType (DECIMAL(38,18))&quot; is not supported by default. To enable type conversion, recreate the external dataset with the option &quot;decimal-to-double&quot; enabled
</pre></div></div>
<p>and the returned value will be <tt>MISSING</tt>. If the flag <tt>decimal-to-double</tt> is set, the converted <tt>DOUBLE</tt> value will be returned.</p></div>
<div class="section">
<h5><a name="TEMPORAL_TYPES"></a>TEMPORAL TYPES</h5>
<p>For the temporal types (namely <tt>DATETIME</tt>), their values could be stored in Parquet with the option <tt>isAdjustedToUTC = true</tt>. Hence, the user has to provide the timezone ID to adjust their values to the local value by setting the flag <tt>timezone</tt>. To do so, a user can set the timezone ID to &#x201c;<b>PST</b>&#x201d; upon creating a dataset as in the following example:</p>
<div>
<div>
<pre class="source">CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING S3
(
-- Credintials and path to Parquet files
...
-- Converting UTC time to PST time
(&quot;timezone&quot; = &quot;PST&quot;)
);
</pre></div></div>
<p>If the flag <tt>timezone</tt> is not set, a warning will appear when running a query:</p>
<div>
<div>
<pre class="source">Parquet file(s) contain &quot;datetime&quot; values that are adjusted to UTC. Recreate the external dataset and set &quot;timezone&quot; to get the local &quot;datetime&quot; value.
</pre></div></div>
<p>and the UTC <tt>DATETIME</tt> will be returned.</p></div>
<div class="section">
<h5><a name="JSON_TYPE"></a>JSON TYPE</h5>
<p>By default, we parse the JSON values into AsterixDB values, where a user can process those values using <tt>SQL++</tt> queries. However, one could disable the parsing of JSON string values (which stored as <tt>STRING</tt>) by unsetting the flag <tt>parseJsonString</tt> as in the following example:</p>
<div>
<div>
<pre class="source">CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING S3
(
-- Credintials and path to Parquet files
...
-- Stop parsing JSON string values
(&quot;parse-json-string&quot; = &quot;false&quot;)
);
</pre></div></div>
<p>And the returned value will be of type <tt>STRING</tt>.</p></div>
<div class="section">
<h5><a name="INTERVAL_TYPE"></a>INTERVAL TYPE</h5>
<p>Currently, AsterixDB do not support Parquet&#x2019;s <tt>INTERVAL</tt> type. When a query requests (or projects) an <tt>INTERVAL</tt> value, a warning will be issued and <tt>MISSING</tt> value will be returned instead.</p></div></div></div></div>
</div>
</div>
</div>
<hr/>
<footer>
<div class="container-fluid">
<div class="row-fluid">
<div class="row-fluid">Apache AsterixDB, AsterixDB, Apache, the Apache
feather logo, and the Apache AsterixDB project logo are either
registered trademarks or trademarks of The Apache Software
Foundation in the United States and other countries.
All other marks mentioned may be trademarks or registered
trademarks of their respective owners.
</div>
</div>
</div>
</footer>
</body>
</html>