<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="s3" rev="2.2.0">
<title>Using Impala with Amazon S3 Object Store</title>
<titlealts audience="PDF"><navtitle>S3 Tables</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Amazon"/>
<data name="Category" value="S3"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Preview Features"/>
</metadata>
</prolog>
<conbody>
<p rev="2.2.0"> You can use Impala to query data residing on the Amazon S3
object store. This capability allows convenient access to a storage system
that is remotely managed, accessible from anywhere, and integrated with
various cloud-based services. Impala can query files in any supported file
format from S3. The S3 storage location can be for an entire table, or
individual partitions in a partitioned table. </p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="s3_best_practices" rev="2.6.0 IMPALA-1878">
<title>Best Practices for Using Impala with S3</title>
<prolog>
<metadata>
<data name="Category" value="Guidelines"/>
<data name="Category" value="Best Practices"/>
</metadata>
</prolog>
<conbody>
<p> The following guidelines summarize the best practices described in the
rest of this topic: </p>
<ul>
<li>
<p> Any reference to an S3 location must be fully qualified when S3 is
not designated as the default storage, for example,
<codeph>s3a://[s3-bucket-name]</codeph>.</p>
</li>
<li>
<p> Set <codeph>fs.s3a.connection.maximum</codeph> to 1500 for
<cmdname>impalad</cmdname> (see the configuration sketch after this
list). </p>
<li>
<p> Set <codeph>fs.s3a.block.size</codeph> to 134217728 (128 MB in
bytes) if most Parquet files queried by Impala were written by Hive
or ParquetMR jobs. </p>
<p>Set the block size to 268435456 (256 MB in bytes) if most Parquet
files queried by Impala were written by Impala. </p>
<p>Starting in Impala 3.4.0, instead of
<codeph>fs.s3a.block.size</codeph>, the
<codeph>PARQUET_OBJECT_STORE_SPLIT_SIZE</codeph> query option
controls the Parquet-specific split size. The default value is 256
MB.</p>
</li>
<li>
<p>
<codeph>DROP TABLE ... PURGE</codeph> is much faster than the default
<codeph>DROP TABLE</codeph>. The same applies to <codeph>ALTER
TABLE ... DROP PARTITION ... PURGE</codeph> versus the default
<codeph>DROP PARTITION</codeph> operation. The default <codeph>DROP
TABLE/PARTITION</codeph> is slow because Impala copies the files to
the HDFS trash folder and waits until all the data is moved.
<codeph>DROP TABLE/PARTITION ... PURGE</codeph> is a fast delete
operation, and the Impala statement finishes quickly even though the
change might not have propagated fully throughout S3. Due to the
eventually consistent nature of S3, the files for that table or
partition could remain visible for some unbounded time after the
<codeph>PURGE</codeph> statement finishes. </p>
</li>
<li>
<p>
<codeph>INSERT</codeph> statements are faster than <codeph>INSERT
OVERWRITE</codeph> for S3. The query option
<codeph>S3_SKIP_INSERT_STAGING</codeph>, which is set to
<codeph>true</codeph> by default, skips the staging step for
regular <codeph>INSERT</codeph> (but not <codeph>INSERT
OVERWRITE</codeph>). This makes the operation much faster, but
consistency is not guaranteed: if a node fails during execution, the
table could end up with inconsistent data. Set this option to
<codeph>false</codeph> if stronger consistency is required; however,
that setting makes <codeph>INSERT</codeph> operations slower. </p>
<ul>
<li>
<p> For Impala-ACID tables, both <codeph>INSERT</codeph> and
<codeph>INSERT OVERWRITE</codeph> operations on S3 tables are fast,
regardless of the setting of
<codeph>S3_SKIP_INSERT_STAGING</codeph>. In addition, consistency is
guaranteed with ACID tables.</p>
</li>
</ul>
</li>
<li>Enable the <xref href="impala_data_cache.xml#data_cache">data cache for
remote reads</xref>.</li>
<li>Enable <xref
href="https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/s3guard.html"
format="html" scope="external">S3Guard</xref> in your cluster for
data consistency.</li>
<li>
<p> A large number of files in a table can make metadata loading and
updating slow in S3. If too many requests are made to S3, S3 applies
a back-off mechanism and responds more slowly than usual.</p>
<ul>
<li>If you have many small files due to over-granular partitioning,
configure partitions with many megabytes of data so that even a
query against a single partition can be parallelized effectively. </li>
<li>If you have many small files because of many small
<codeph>INSERT</codeph> queries, use bulk
<codeph>INSERT</codeph>s so that more data is written to fewer
files. </li>
</ul>
</li>
</ul>
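<p> As an illustration of the <codeph>fs.s3a.*</codeph> settings above, a
<filepath>core-site.xml</filepath> fragment for the
<cmdname>impalad</cmdname> hosts might look like the following sketch.
The values shown are the recommendations from this list; adjust them for
your own workload. </p>
<codeblock>
&lt;property&gt;
  &lt;name&gt;fs.s3a.connection.maximum&lt;/name&gt;
  &lt;value&gt;1500&lt;/value&gt;
&lt;/property&gt;
&lt;!-- 134217728 = 128 MB, for Parquet files written by Hive or ParquetMR.
     Use 268435456 (256 MB) if most Parquet files were written by Impala. --&gt;
&lt;property&gt;
  &lt;name&gt;fs.s3a.block.size&lt;/name&gt;
  &lt;value&gt;134217728&lt;/value&gt;
&lt;/property&gt;
</codeblock>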
</conbody>
</concept>
<concept id="s3_sql">
<title>How Impala SQL Statements Work with S3</title>
<conbody>
<p> Impala SQL statements work with data in S3 as follows: </p>
<ul>
<li>
<p> The <xref href="impala_create_table.xml#create_table">CREATE
TABLE</xref> or <xref href="impala_alter_table.xml#alter_table"
>ALTER TABLE</xref> statement can specify that a table resides in
the S3 object store by encoding an <codeph>s3a://</codeph> prefix
for the <codeph>LOCATION</codeph> property. <codeph>ALTER
TABLE</codeph> can also set the <codeph>LOCATION</codeph> property
for an individual partition so that some data in a table resides in
S3 and other data in the same table resides on HDFS. </p>
</li>
<li>
<p> Once a table or partition is designated as residing in S3, the
<xref href="impala_select.xml#select"/> statement transparently
accesses the data files from the appropriate storage layer. </p>
</li>
<li>
<p>
If the S3 table is an internal table, the <xref
href="impala_drop_table.xml#drop_table">DROP TABLE</xref> statement
removes the corresponding data files from S3 when the table is dropped.
</p>
</li>
<li>
<p> The <xref href="impala_truncate_table.xml#truncate_table">TRUNCATE
TABLE</xref> statement always removes the corresponding
data files from S3 when the table is truncated. </p>
</li>
<li>
<p>
The <xref href="impala_load_data.xml#load_data">LOAD DATA</xref>
statement can move data files residing in HDFS into
an S3 table.
</p>
</li>
<li>
<p>
The <xref href="impala_insert.xml#insert">INSERT</xref> statement, or the <codeph>CREATE TABLE AS SELECT</codeph>
form of the <codeph>CREATE TABLE</codeph> statement, can copy data from an HDFS table or another S3
table into an S3 table. The <xref
href="impala_s3_skip_insert_staging.xml#s3_skip_insert_staging">S3_SKIP_INSERT_STAGING</xref>
query option controls whether Impala uses a fast code path for these write operations to S3,
at the cost of potential inconsistency if a failure occurs during the statement.
</p>
</li>
</ul>
<p>
For usage information about Impala SQL statements with S3 tables, see <xref href="impala_s3.xml#s3_ddl"/>
and <xref href="impala_s3.xml#s3_dml"/>.
</p>
</conbody>
</concept>
<concept id="s3_creds">
<title>Specifying Impala Credentials to Access Data in S3</title>
<conbody>
<p> To allow Impala to access data in S3, specify values for the following
configuration settings in your <filepath>core-site.xml</filepath> file: </p>
<codeblock>
&lt;property&gt;
  &lt;name&gt;fs.s3a.access.key&lt;/name&gt;
  &lt;value&gt;<varname>your_access_key</varname>&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;fs.s3a.secret.key&lt;/name&gt;
  &lt;value&gt;<varname>your_secret_key</varname>&lt;/value&gt;
&lt;/property&gt;
</codeblock>
<p> After specifying the credentials, restart both the Impala and Hive
services. Restarting Hive is required because Impala statements, such as
<codeph>CREATE TABLE</codeph>, go through the Hive Metastore. </p>
<note type="important">
<p>
Although you can specify the access key ID and secret key as part of the <codeph>s3a://</codeph> URL in the
<codeph>LOCATION</codeph> attribute, doing so makes this sensitive information visible in many places, such
as <codeph>DESCRIBE FORMATTED</codeph> output and Impala log files. Therefore, specify this information
centrally in the <filepath>core-site.xml</filepath> file, and restrict read access to that file to only
trusted users.
</p>
</note>
<p>See <xref
href="https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3"
format="html" scope="external">Authenticating with S3</xref> for
additional authentication mechanisms to access S3.</p>
</conbody>
</concept>
<concept id="s3_etl">
<title>Loading Data into S3 for Impala Queries</title>
<prolog>
<metadata>
<data name="Category" value="ETL"/>
<data name="Category" value="Ingest"/>
</metadata>
</prolog>
<conbody>
<p>
If your ETL pipeline involves moving data into S3 and then querying through Impala,
you can either use Impala DML statements to create, move, or copy the data, or
use the same data loading techniques as you would for non-Impala data.
</p>
</conbody>
<concept id="s3_dml" rev="2.6.0 IMPALA-1878">
<title>Using Impala DML Statements for S3 Data</title>
<conbody>
<p>The Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD
DATA</codeph>, and <codeph>CREATE TABLE AS SELECT</codeph>) can
write data into a table or partition that resides in S3. The syntax of
the DML statements is the same as for any other tables because the S3
location for tables and partitions is specified by an
<codeph>s3a://</codeph> prefix in the <codeph>LOCATION</codeph>
attribute of <codeph>CREATE TABLE</codeph> or <codeph>ALTER
TABLE</codeph> statements. If you bring data into S3 using the
normal S3 transfer mechanisms instead of Impala DML statements, issue
a <codeph>REFRESH</codeph> statement for the table before using Impala
to query the S3 data.</p>
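<p> As a minimal sketch (the table name, columns, and bucket path are
hypothetical), the following statements create an S3-backed table, write
to it with Impala DML, and refresh it after data files are added through
S3 transfer mechanisms outside of Impala: </p>
<codeblock>-- Hypothetical table and bucket path, for illustration only.
CREATE TABLE sales_s3 (id BIGINT, amount DECIMAL(10,2))
  LOCATION 's3a://impala-demo/sales_s3';

-- Impala DML writes directly to the S3 location.
INSERT INTO sales_s3 VALUES (1, 9.99), (2, 19.99);

-- After files are added by S3 transfer mechanisms rather than Impala DML,
-- make Impala aware of the new data files.
REFRESH sales_s3;
</codeblock>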
<p conref="../shared/impala_common.xml#common/s3_dml_performance"/>
</conbody>
</concept>
<concept id="s3_manual_etl">
<title>Manually Loading Data into Impala Tables in S3</title>
<conbody>
<p>
As an alternative, or on earlier Impala releases without DML support for S3,
you can use the Amazon-provided methods to bring data files into S3 for querying through Impala. See
<xref href="http://aws.amazon.com/s3/" scope="external" format="html">the Amazon S3 web site</xref> for
details.
</p>
<note type="important">
<p conref="../shared/impala_common.xml#common/s3_drop_table_purge"/>
</note>
<p> After you upload data files to a location already mapped to an
Impala table or partition, or if you delete files in S3 from such a
location, issue the <codeph>REFRESH</codeph> statement to make Impala
aware of the new set of data files. </p>
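<p> For example (the file, bucket, and table names are hypothetical), you
might upload a data file with a standard S3 or Hadoop client and then
refresh the corresponding table in <cmdname>impala-shell</cmdname>: </p>
<codeblock>-- Upload a data file with a standard client, outside of Impala, for example:
--   hdfs dfs -put sales_2018.parq s3a://impala-demo/sales_s3/
--   aws s3 cp sales_2018.parq s3://impala-demo/sales_s3/

-- Then, in impala-shell, make Impala aware of the new data files:
REFRESH sales_s3;
</codeblock>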
</conbody>
</concept>
</concept>
<concept id="s3_ddl">
<title>Creating Impala Databases, Tables, and Partitions for Data Stored in
S3</title>
<prolog>
<metadata>
<data name="Category" value="Databases"/>
</metadata>
</prolog>
<conbody>
<p>To create a table that resides in S3, run the <codeph>CREATE
TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement with the
<codeph>LOCATION</codeph> clause. </p>
<p><codeph>ALTER TABLE</codeph> can set the <codeph>LOCATION</codeph>
property for an individual partition, so that some data in a table
resides in S3 and other data in the same table resides on HDFS.</p>
<p>The syntax for the <codeph>LOCATION</codeph> clause is:</p>
<codeblock>LOCATION 's3a://<varname>bucket_name</varname>/<varname>path</varname>/<varname>to</varname>/<varname>file</varname>'</codeblock>
<p>The file system prefix is always <codeph>s3a://</codeph>. Impala does
not support the <codeph>s3://</codeph> or <codeph>s3n://</codeph>
prefixes. </p>
<p> For a partitioned table, either specify a separate
<codeph>LOCATION</codeph> clause for each new partition, or specify a
base <codeph>LOCATION</codeph> for the table and set up a directory
structure in S3 to mirror the way Impala partitioned tables are
structured in HDFS. </p>
<p> You point a nonpartitioned table or an individual partition at S3 by
specifying a single directory path in S3, which could be any arbitrary
directory. To replicate the structure of an entire Impala partitioned
table or database in S3 requires more care, with directories and
subdirectories nested and named to match the equivalent directory tree
in HDFS. Consider setting up an empty staging area if necessary in HDFS,
and recording the complete directory structure so that you can replicate
it in S3. </p>
<p> When working with multiple tables with data files stored in S3, you can
create a database with a <codeph>LOCATION</codeph> attribute pointing to
an S3 path. Specify a URL of the form
<codeph>s3a://<varname>bucket</varname>/<varname>root</varname>/<varname>path</varname>/<varname>for</varname>/<varname>database</varname></codeph>
for the <codeph>LOCATION</codeph> attribute of the database. Any tables
created inside that database automatically create directories underneath
the one specified by the database <codeph>LOCATION</codeph> attribute. </p>
<p>The following example creates a table where the partition for the year
2017 resides on HDFS and the partition for the year 2018 resides in
S3.</p>
<p>The partition for year 2018 includes a <codeph>LOCATION</codeph>
attribute with an <codeph>s3a://</codeph> URL, and so refers to data
residing in S3, under a specific path underneath the bucket
<codeph>impala-demo</codeph>. </p>
<codeblock>CREATE TABLE mostly_on_hdfs (x int) PARTITIONED BY (year INT);
ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2017);
ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2018)
LOCATION 's3a://impala-demo/dir1/dir2/dir3/t1';
</codeblock>
<p> The following session creates a database and a partitioned table
residing entirely in S3, partitioned by multiple columns
(<codeph>year</codeph>, <codeph>month</codeph>, and
<codeph>day</codeph>). </p>
<ul>
<li>Because a <codeph>LOCATION</codeph> attribute with an
<codeph>s3a://</codeph> URL is specified for the database, the
tables inside that database are automatically created in S3 underneath
the database directory. </li>
<li>To see the names of the associated subdirectories, including the
partition key values, use an S3 client tool to examine how the
directory structure is organized in S3. </li>
</ul>
<codeblock>CREATE DATABASE db_on_s3 LOCATION 's3a://impala-demo/dir1/dir2/dir3';
CREATE TABLE partitioned_multiple_keys (x INT)
PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT);
ALTER TABLE partitioned_multiple_keys
ADD PARTITION (year=2015,month=1,day=1);
ALTER TABLE partitioned_multiple_keys
ADD PARTITION (year=2015,month=1,day=31);
!hdfs dfs -ls -R s3a://impala-demo/dir1/dir2/dir3
2015-03-17 13:56:34 0 dir1/dir2/dir3/
2015-03-17 16:47:13 0 dir1/dir2/dir3/partitioned_multiple_keys/
2015-03-17 16:47:44 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/
2015-03-17 16:47:50 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/</codeblock>
<p>
The <codeph>CREATE DATABASE</codeph> and <codeph>CREATE TABLE</codeph> statements create the associated
directory paths if they do not already exist. You can specify multiple levels of directories, and the
<codeph>CREATE</codeph> statement creates all appropriate levels, similar to using <codeph>mkdir
-p</codeph>.
</p>
<p> Use the standard S3 file upload methods to put the actual data files
into the right locations. You can also put the directory paths and data
files in place before creating the associated Impala databases or
tables, and Impala automatically uses the data from the appropriate
location after the associated databases and tables are created. </p>
<p>Use the <codeph>ALTER TABLE</codeph> statement with the
<codeph>LOCATION</codeph> clause to switch whether an existing table
or partition points to data in HDFS or S3. For example, if you have an
Impala table or partition pointing to data files in HDFS or S3, and you
later transfer those data files to the other filesystem, use the
<codeph>ALTER TABLE</codeph> statement to adjust the
<codeph>LOCATION</codeph> attribute of the corresponding table or
partition to reflect that change. </p>
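<p> A minimal sketch of such a switch, assuming the data files for the
2017 partition of the earlier <codeph>mostly_on_hdfs</codeph> example
have already been copied to a hypothetical S3 path: </p>
<codeblock>-- Point the existing partition at its new S3 location.
ALTER TABLE mostly_on_hdfs PARTITION (year=2017)
  SET LOCATION 's3a://impala-demo/dir1/dir2/dir3/year2017';

-- Reload the file metadata for the table.
REFRESH mostly_on_hdfs;
</codeblock>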
</conbody>
</concept>
<concept id="s3_internal_external">
<title>Internal and External Tables Located in S3</title>
<conbody>
<p> Just as with tables located on HDFS storage, you can designate
S3-based tables as either internal (managed by Impala) or external, by
using the syntax <codeph>CREATE TABLE</codeph> or <codeph>CREATE
EXTERNAL TABLE</codeph> respectively. </p>
<p>When you drop an internal table, the files associated with the table
are removed, even if they are in S3 storage. When you drop an external
table, the files associated with the table are left alone, and are still
available for access by other tools or components.</p>
<p> If the data in S3 is intended to be long-lived and accessed by other
tools in addition to Impala, create any associated S3 tables with the
<codeph>CREATE EXTERNAL TABLE</codeph> syntax, so that the files are
not deleted from S3 when the table is dropped. </p>
<p> If the data in S3 is only needed for querying by Impala and can be
safely discarded once the Impala workflow is complete, create the
associated S3 tables using the <codeph>CREATE TABLE</codeph> syntax, so
that dropping the table also deletes the corresponding data files in S3. </p>
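<p> For example (the table name, columns, and S3 path are hypothetical),
declaring the table as external keeps the long-lived data files in S3
when the table is dropped: </p>
<codeblock>-- The data files under this S3 path are shared with other tools.
CREATE EXTERNAL TABLE audit_events_s3 (event STRING, ts TIMESTAMP)
  LOCATION 's3a://impala-demo/audit_events';

-- Dropping the external table removes only the table metadata;
-- the data files remain in S3.
DROP TABLE audit_events_s3;
</codeblock>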
</conbody>
</concept>
<concept id="s3_queries">
<title>Running and Tuning Impala Queries for Data Stored in S3</title>
<conbody>
<p> Once a table or partition is designated as residing in S3, the
<codeph>SELECT</codeph> statement transparently accesses the data
files from the appropriate storage layer. </p>
<ul>
<li>
Queries against S3 data support all the same file formats as for HDFS data.
</li>
<li>
Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in S3
corresponding to the HDFS directories representing partition key values, or use <codeph>ALTER TABLE ...
ADD PARTITION</codeph> to set up the appropriate paths in S3.
</li>
<li>
HDFS and HBase tables can be joined to S3 tables, or S3 tables can be joined
with each other (see the sketch after this list).
</li>
<li> Authorization to control access to databases, tables, or columns
works the same whether the data is in HDFS or in S3. </li>
<li> The Catalog Server (<cmdname>catalogd</cmdname>) daemon caches
metadata for both HDFS and S3 tables.</li>
<li>
Queries against S3 tables are subject to the same kinds of admission control and resource management as
HDFS tables.
</li>
<li> Metadata about S3 tables is stored in the same Metastore database
as for HDFS tables. </li>
<li>
You can set up views referring to S3 tables, the same as for HDFS tables.
</li>
<li> The <codeph>COMPUTE STATS</codeph>, <codeph>SHOW TABLE
STATS</codeph>, and <codeph>SHOW COLUMN STATS</codeph> statements
work for S3 tables. </li>
</ul>
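<p> The following sketch (table names are hypothetical) illustrates two of
the points above: computing statistics on an S3 table, and joining it
with a table whose data resides on HDFS: </p>
<codeblock>-- Gather statistics so the planner can estimate row counts for the S3 table.
COMPUTE STATS sales_s3;

-- Join the S3 table with an HDFS-resident table.
SELECT c.name, SUM(s.amount) AS total
FROM sales_s3 s JOIN customers_hdfs c ON (s.id = c.id)
GROUP BY c.name;
</codeblock>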
</conbody>
<concept id="s3_performance">
<title>Understanding and Tuning Impala Query Performance for S3 Data</title>
<prolog>
<metadata>
<data name="Category" value="Performance"/>
</metadata>
</prolog>
<conbody>
<p>Here are techniques you can use to interpret explain plans and
profiles for queries against S3 data, and tips to achieve the best
performance possible for such queries. </p>
<p> All else being equal, performance is expected to be lower for
queries running against data in S3 rather than HDFS. The actual
mechanics of the <codeph>SELECT</codeph> statement are somewhat
different when the data is in S3. Although the work is still
distributed across the hosts of the cluster, Impala might
parallelize the work for a distributed query differently for data on
HDFS and S3.</p>
<p>S3 does not have the same block notion as HDFS, so Impala uses
heuristics to determine how to split up large S3 files for processing
in parallel. Because all hosts can access any S3 data file with equal
efficiency, the distribution of work might be different than for HDFS
data, where the data blocks are physically read using short-circuit
local reads by hosts that contain the appropriate block replicas.
Although the I/O to read the S3 data might be spread evenly across the
hosts of the cluster, the fact that all data is initially retrieved
across the network means that the overall query performance is likely
to be lower for S3 data than for HDFS data. </p>
<p>Use the <codeph>PARQUET_OBJECT_STORE_SPLIT_SIZE</codeph> query option
to control the Parquet-specific split size. The default value is 256
MB.</p>
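<p> For example, the following sketch lowers the split size to 128 MB
(expressed in bytes) for a session before querying a large Parquet table
in S3; the table name is hypothetical: </p>
<codeblock>SET PARQUET_OBJECT_STORE_SPLIT_SIZE=134217728; -- 128 MB, in bytes
SELECT COUNT(*) FROM sales_s3;
</codeblock>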
<p> When optimizing aspects of complex queries, such as the join order,
Impala treats tables on HDFS and S3 the same way. Therefore, follow
all the same tuning recommendations for S3 tables as for HDFS ones,
such as using the <codeph>COMPUTE STATS</codeph> statement to help
Impala construct accurate estimates of row counts and cardinality. See
<xref href="impala_performance.xml#performance"/> for details. </p>
<p> In query profile reports, the numbers for
<codeph>BytesReadLocal</codeph>,
<codeph>BytesReadShortCircuit</codeph>,
<codeph>BytesReadDataNodeCached</codeph>, and
<codeph>BytesReadRemoteUnexpected</codeph> are blank because those
metrics come from HDFS. By definition, all the I/O for S3 tables
involves remote reads. </p>
</conbody>
</concept>
</concept>
<concept id="s3_restrictions">
<title>Restrictions on Impala Support for S3</title>
<conbody>
<p>The following restrictions apply when using Impala with S3:</p>
<ul>
<li> Impala does not support the old <codeph>s3://</codeph> block-based
and <codeph>s3n://</codeph> filesystem schemes, and it only supports
<codeph>s3a://</codeph>. </li>
<li>Although S3 is often used to store JSON-formatted data, the current
Impala support for S3 does not include directly querying JSON data.
For Impala queries, use data files in one of the file formats listed
in <xref href="impala_file_formats.xml#file_formats"/>. If you have
data in JSON format, you can prepare a flattened version of that data
for querying by Impala as part of your ETL cycle. </li>
<li>You cannot use the <codeph>ALTER TABLE ... SET CACHED</codeph>
statement for tables or partitions that are located in S3. </li>
</ul>
</conbody>
</concept>
</concept>