 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="disk_space">

   <title>Managing Disk Space for Impala Data</title>

   <titlealts audience="PDF">

     <navtitle>Managing Disk Space</navtitle>

   </titlealts>

   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Disk Storage"/>
       <data name="Category" value="Administrators"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
       <data name="Category" value="Tables"/>
       <data name="Category" value="Compression"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       Although Impala typically works with many large files in an HDFS storage system with
       plenty of capacity, there are times when you might perform some file cleanup to reclaim
       space, or advise developers on techniques to minimize space consumption and file
       duplication.
     </p>

     <ul>
       <li>
         <p>
           Use compact binary file formats where practical. Numeric and time-based data in
           particular can be stored in more compact form in binary data files. Depending on the
           file format, various compression and encoding features can reduce file size even
           further. You can specify the <codeph>STORED AS</codeph> clause as part of the
           <codeph>CREATE TABLE</codeph> statement, or <codeph>ALTER TABLE</codeph> with the
           <codeph>SET FILEFORMAT</codeph> clause for an existing table or partition within a
           partitioned table. See <xref
             href="impala_file_formats.xml#file_formats"/>
           for details about file formats, especially <xref href="impala_parquet.xml#parquet"/>.
           See <xref href="impala_create_table.xml#create_table"/> and
           <xref
             href="impala_alter_table.xml#alter_table"/> for syntax details.
         </p>
       </li>

       <li>
         <p>
           You manage underlying data files differently depending on whether the corresponding
           Impala table is defined as an
           <xref
             href="impala_tables.xml#internal_tables">internal</xref> or
           <xref
             href="impala_tables.xml#external_tables">external</xref> table:
         </p>
         <ul>
           <li>
             Use the <codeph>DESCRIBE FORMATTED</codeph> statement to check if a particular table
             is internal (managed by Impala) or external, and to see the physical location of the
             data files in HDFS. See <xref
               href="impala_describe.xml#describe"/>
             for details.
           </li>

           <li>
             For Impala-managed (<q>internal</q>) tables, use <codeph>DROP TABLE</codeph>
             statements to remove data files. See
             <xref
               href="impala_drop_table.xml#drop_table"/> for details.
           </li>

           <li>
             For tables not managed by Impala (<q>external</q> tables), use appropriate
             HDFS-related commands such as <codeph>hadoop fs</codeph>, <codeph>hdfs dfs</codeph>,
             or <codeph>distcp</codeph>, to create, move, copy, or delete files within HDFS
             directories that are accessible by the <codeph>impala</codeph> user. Issue a
             <codeph>REFRESH <varname>table_name</varname></codeph> statement after adding or
             removing any files from the data directory of an external table. See
             <xref href="impala_refresh.xml#refresh"/> for details.
           </li>

           <li>
             Use external tables to reference HDFS data files in their original location. With
             this technique, you avoid copying the files, and you can map more than one Impala
             table to the same set of data files. When you drop the Impala table, the data files
             are left undisturbed. See <xref href="impala_tables.xml#external_tables"/> for
             details.
           </li>

           <li>
             Use the <codeph>LOAD DATA</codeph> statement to move HDFS files into the data
             directory for an Impala table from inside Impala, without the need to specify the
             HDFS path of the destination directory. This technique works for both internal and
             external tables. See <xref href="impala_load_data.xml#load_data"/> for details.
           </li>
         </ul>
       </li>

       <li>
         <p>
           Make sure that the HDFS trashcan is configured correctly. When you remove files from
           HDFS, the space might not be reclaimed for use by other files until sometime later,
           when the trashcan is emptied. See <xref href="impala_drop_table.xml#drop_table"/> for
           details. See <xref href="impala_prereqs.xml#prereqs_account"/> for permissions needed
           for the HDFS trashcan to operate correctly.
         </p>
       </li>

       <li>
         <p>
           Drop all tables in a database before dropping the database itself. See
           <xref href="impala_drop_database.xml#drop_database"/> for details.
         </p>
       </li>

       <li>
         <p>
           Clean up temporary files after failed <codeph>INSERT</codeph> statements. If an
           <codeph>INSERT</codeph> statement encounters an error, and you see a directory named
           <filepath>.impala_insert_staging</filepath> or
           <filepath>_impala_insert_staging</filepath> left behind in the data directory for the
           table, it might contain temporary data files taking up space in HDFS. You might be
           able to salvage these data files, for example if they are complete but could not be
           moved into place due to a permission error. Or, you might delete those files through
           commands such as <codeph>hadoop fs</codeph> or <codeph>hdfs dfs</codeph>, to reclaim
           space before re-trying the <codeph>INSERT</codeph>. Issue <codeph>DESCRIBE FORMATTED
           <varname>table_name</varname></codeph> to see the HDFS path where you can check for
           temporary files.
         </p>
       </li>

       <li rev="2.2.0">
         <p>
           If you use the Amazon Simple Storage Service (S3) as a place to offload data to reduce
           the volume of local storage, Impala 2.2.0 and higher can query the data directly from
           S3. See <xref
             href="impala_s3.xml#s3"/> for details.
         </p>
       </li>
     </ul>
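     <p>
       As a minimal sketch of the statements named in the list above, the following sequence
       creates a table in a compact binary format, changes the file format declared for an
       existing table, checks whether a table is internal or external, and refreshes an
       external table after its files change. The table and column names
       (<codeph>sales_parquet</codeph>, <codeph>sales_staging</codeph>,
       <codeph>sales_external</codeph>) are hypothetical placeholders, not objects defined
       elsewhere in this documentation.
     </p>
<codeblock>-- Hypothetical table and column names, for illustration only.
CREATE TABLE sales_parquet (id BIGINT, amount DECIMAL(10,2), sold_at TIMESTAMP)
  STORED AS PARQUET;

-- For an existing table: changes only the file format recorded in the table metadata.
-- Existing data files are not converted.
ALTER TABLE sales_staging SET FILEFORMAT PARQUET;

-- The output includes the table type (managed or external) and the HDFS location.
DESCRIBE FORMATTED sales_parquet;

-- Run after adding or removing files in the data directory of an external table.
REFRESH sales_external;
</codeblock>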

     <section id="section_vrg_fjb_3jb">

       <title>Configuring Scratch Space for Spilling to Disk</title>
       <p>
         Impala uses intermediate files during large sort, join, aggregation, or analytic
         function operations. The files are removed when the operation finishes. You can
         specify the locations of the intermediate files by starting the
         <cmdname>impalad</cmdname> daemon with the
         <codeph>--scratch_dirs="<varname>path_to_directory</varname>"</codeph> configuration
         option. By default, intermediate files are stored in the directory
         <filepath>/tmp/impala-scratch</filepath>.
       </p>
       <p
         id="order_by_scratch_dir">
         <ul>
           <li>
             You can specify a single directory or a comma-separated list of directories.
           </li>

           <li>
             You can specify an optional capacity quota per scratch directory, using the colon
             (:) as the delimiter.
             <p>
               The capacity quota of <codeph>-1</codeph> or <codeph>0</codeph> is the same as no
               quota for the directory.
             </p>
           </li>

           <li>
             The scratch directories must be on the local filesystem, not in HDFS.
           </li>

           <li>
             You might specify different directory paths for different hosts, depending on the
             capacity and speed of the available storage devices.
           </li>
         </ul>
       </p>
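       <p>
         As an illustration of the last point in the list above, two hosts with different
         storage might start <cmdname>impalad</cmdname> with different values. The directory
         paths and the 50GB quota are assumptions chosen for this sketch, not recommended
         settings.
       </p>
<codeblock># Host with two dedicated local disks (hypothetical paths): no quota on either directory.
--scratch_dirs=/data1/impala-scratch,/data2/impala-scratch

# Host with one small shared disk (hypothetical path): cap scratch usage at 50GB.
--scratch_dirs=/data1/impala-scratch:50GB
</codeblock>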

       <p>
         If there is less than 1 GB free on the filesystem where a scratch directory resides,
         Impala still runs, but writes a warning message to its log.
       </p>

       <p>
         Impala still starts successfully, with a warning written to the log, if it cannot
         create files, or cannot read and write files, in one of the scratch directories.
       </p>

       <p>
         The following are examples for specifying scratch directories.
         <table frame="all" rowsep="1" colsep="1"
           id="table_a4d_myg_3jb">
           <tgroup cols="2" align="left">
             <colspec colname="c1" colnum="1"/>
             <colspec colname="c2" colnum="2"/>
             <thead>
               <row>
                 <entry>
                   Config option
                 </entry>
                 <entry>
                   Description
                 </entry>
               </row>
             </thead>
             <tbody>
               <row>
                 <entry>
                   <codeph>--scratch_dirs=/dir1,/dir2</codeph>
                 </entry>
                 <entry>
                   Use /dir1 and /dir2 as scratch directories with no capacity quota.
                 </entry>
               </row>
               <row>
                 <entry>
                   <codeph>--scratch_dirs=/dir1,/dir2:25G</codeph>
                 </entry>
                 <entry>
                   Use /dir1 and /dir2 as scratch directories, with no capacity quota on /dir1
                   and a 25GB quota on /dir2.
                 </entry>
               </row>
               <row>
                 <entry>
                   <codeph>--scratch_dirs=/dir1:5MB,/dir2</codeph>
                 </entry>
                 <entry>
                   Use /dir1 and /dir2 as scratch directories, with a capacity quota of 5MB on
                   /dir1 and no quota on /dir2.
                 </entry>
               </row>
               <row>
                 <entry>
                   <codeph>--scratch_dirs=/dir1:-1,/dir2:0</codeph>
                 </entry>
                 <entry>
                   Use /dir1 and /dir2 as scratch directories with no capacity quota.
                 </entry>
               </row>
             </tbody>
           </tgroup>
         </table>
       </p>

       <p>
         Allocation from a scratch directory will fail if the specified limit for the directory
         is exceeded.
       </p>

       <p>
         If Impala encounters an error reading or writing files in a scratch directory during a
         query, Impala logs the error, and the query fails.
       </p>

     </section>
     <section>
       <title>Priority Based Scratch Directory Selection</title>
       <p>The location of the intermediate files is configured by starting the
         <cmdname>impalad</cmdname> daemon with the flag
         <codeph>--scratch_dirs="path_to_directory"</codeph>. By default, Impala uses the
         configured scratch directories in a round robin fashion. Round robin selection is not
         always ideal, because the directories can reside on different classes of storage
         volumes with different performance characteristics (SSD vs. HDD, local storage vs.
         network-attached storage, and so on). To optimize your workload, you can configure the
         priority of each scratch directory based on your storage system configuration.</p>
       <p>Scratch directories are selected for spilling according to the priorities you
         configure. If you give the same priority to multiple directories, those directories are
         selected in a round robin fashion.</p>
       <p>The valid formats for specifying a scratch directory with a priority are:
         <codeblock><varname>dir-path</varname>:<varname>limit</varname>:<varname>priority</varname>
 <varname>dir-path</varname>::<varname>priority</varname>
 </codeblock></p>
         <p>Example:</p>
       <p>
         <codeblock>/dir1:200GB:0
 /dir1::0
 </codeblock>
       </p>
       <p>The following formats use the default priority:
         <codeblock>/dir1
 /dir1:200GB
 /dir1:200GB:
 </codeblock>
       </p>
       <p>In the example below, /dir1 is used for spilling until it is full, and then /dir2,
         /dir3, and /dir4 are used in a round robin fashion.</p>
       <p>
         <codeblock>--scratch_dirs="/dir1:200GB:0, /dir2:1024GB:1, /dir3:1024GB:1, /dir4:1024GB:1"
 </codeblock>
       </p>
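       <p>
         The <codeph><varname>dir-path</varname>::<varname>priority</varname></codeph> form
         applies the same idea when no capacity limit is wanted. The following sketch assumes,
         as in the example above, that a lower number means the directory is used first: a
         hypothetical SSD-backed directory is preferred over two HDD-backed directories that
         share a priority and are therefore used in a round robin fashion. All paths are
         placeholders.
       </p>
<codeblock>--scratch_dirs="/ssd1/impala-scratch::0,/hdd1/impala-scratch::1,/hdd2/impala-scratch::1"
</codeblock>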
     </section>
     <section>
       <title>Increasing Scratch Capacity</title>
       <p> You can compress the data spilled to disk to increase the effective scratch capacity.
         Compression typically more than doubles the effective capacity and reduces the amount
         of data written to disk. Use the <codeph>--disk_spill_compression_codec</codeph> and
         <codeph>--disk_spill_punch_holes</codeph> startup options. The
         <codeph>--disk_spill_compression_codec</codeph> option takes any value supported by the
         <codeph>COMPRESSION_CODEC</codeph> query option. The value is not case-sensitive. A
         value of <codeph>ZSTD</codeph> or <codeph>LZ4</codeph> is recommended (the default is
         <codeph>NONE</codeph>).</p>
       <p>For example:</p>
 <codeblock>--disk_spill_compression_codec=LZ4
 --disk_spill_punch_holes=true
 </codeblock>
       <p>
         If you set <codeph>--disk_spill_compression_codec</codeph> to a value other than <codeph>NONE</codeph>, you must set <codeph>--disk_spill_punch_holes</codeph> to true.
       </p>
       <p>
         The hole punching feature supported by many filesystems is used to reclaim space in scratch files during execution
         of a query that spills to disk. This results in lower scratch space requirements in many cases, especially when
         combined with disk spill compression. When this option is not enabled, scratch space is still recycled by a query,
         but less effectively in many cases.
       </p>
       <p> You can specify a compression level for <codeph>ZSTD</codeph> only. For example: </p>
 <codeblock>--disk_spill_compression_codec=ZSTD:10
 --disk_spill_punch_holes=true
 </codeblock>
       <p> Compression levels from 1 to 22 (default 3) are supported for <codeph>ZSTD</codeph>.
         Lower compression levels are faster but achieve a lower compression ratio.</p>
     </section>
     <section>
       <title>Configure Impala Daemon to Spill to S3</title>
       <p>Impala occasionally needs to use persistent storage for writing intermediate files during
         large sorts, joins, aggregations, or analytic function operations. If your workload results
         in large volumes of intermediate data being written, it is recommended to configure the
         heavy spilling queries to use a remote storage location rather than the local one. The
         advantage of using remote storage for scratch space is that it is elastic and can handle
         any amount of spilling.</p>
       <p><b>Before you begin</b></p>
       <p>Identify the URL of the S3 bucket to which you want Impala to write the temporary
         data. If you use the S3 bucket that is associated with the environment, navigate to the
         S3 bucket and copy the URL. If you want to use an external S3 bucket, you must first
         configure your environment to use the external S3 bucket with the correct read/write
         permissions.</p>
       <p><b>Configuring the Start-up Option in Impala daemon</b></p>
       <p>You can use the <cmdname>impalad</cmdname> startup option
         <codeph>--scratch_dirs</codeph> to specify the locations of the intermediate files. The
         format of the option is:</p>
       <codeblock>--scratch_dirs="<varname>remote_dir</varname>, <varname>local_buffer_dir</varname> (,<varname>local_dir</varname>…)"</codeblock>
       <p>where <varname>local_buffer_dir</varname> and <varname>local_dir</varname> conform to the
         earlier descriptions for scratch directories.</p>
       <p>With the option specified above:</p>
       <ul>
         <li>You can specify only one remote directory. When you configure a remote directory, you
           must also specify a local buffer directory. However, you can use multiple local
           directories with the remote directory. If you specify multiple local directories, the
           first local directory is used as the local buffer directory.</li>
         <li>If you configure both remote and local directories, the remote directory is used only
           when the local directories are fully utilized.</li>
         <li>The size of a remote intermediate file can affect query performance. You can set it
           with the <codeph>--remote_tmp_file_size=<varname>size</varname></codeph> startup
           option. The default size of a remote intermediate file is 16MB, and the maximum is
           512MB.</li>
       </ul>
       <p><b>Examples</b></p>
       <ul>
         <li>A remote scratch dir with a local buffer dir, file size 64MB.
           <codeblock>--scratch_dirs=s3a://remote_dir,/local_buffer_dir --remote_tmp_file_size=64M</codeblock></li>
         <li>A remote scratch dir with a local buffer dir limited to 256MB, and one local dir
           limited to 10GB.
           <codeblock>--scratch_dirs=s3a://remote_dir,/local_buffer_dir:256M,/local_dir:10G</codeblock></li>
         <li>A remote scratch dir with a local buffer dir, and multiple prioritized local dirs.
           <codeblock>--scratch_dirs=s3a://remote_dir,/local_buffer_dir,/local_dir_1:5G:1,/local_dir_2:5G:2</codeblock></li>
       </ul>
     </section>
     <section>
       <title>Configure Impala Daemon to Spill to HDFS</title>
       <p>Impala occasionally needs to use persistent storage for writing intermediate files during
         large sorts, joins, aggregations, or analytic function operations. If your workload results
         in large volumes of intermediate data being written, it is recommended to configure the
         heavy spilling queries to use a remote storage location rather than the local one. The
         advantage of using remote storage for scratch space is that it is elastic and can handle
         any amount of spilling.</p>
       <p><b>Before you begin</b></p>
       <ul>
         <li>Identify the HDFS scratch directory where you want your new Impala to write the
           temporary data.</li>
         <li>Identify the IP address, host name, or service identifier of HDFS.</li>
         <li>Identify the port number of the HDFS NameNode (if it is not the default).</li>
         <li>Configure Impala to write temporary data to disk during query processing.</li>
       </ul>
       <p><b>Configuring the Start-up Option in Impala daemon</b></p>
       <p>You can use the <cmdname>impalad</cmdname> startup option
         <codeph>--scratch_dirs</codeph> to specify the locations of the intermediate files.</p>
       <p>Use the following format for this start up option:</p>
       <codeblock>--scratch_dirs="hdfs://<varname>authority</varname>/<varname>path</varname>(:<varname>max_bytes</varname>), <varname>local_buffer_dir</varname> (,<varname>local_dir</varname>…)"</codeblock>
       <ul>
         <li>Where <codeph>hdfs://<varname>authority</varname>/<varname>path</varname></codeph> is
           the remote directory.</li>
         <li><varname>authority</varname> may include <codeph>ip_address</codeph> or
           <codeph>hostname</codeph> and <codeph>port</codeph>, or <codeph>service_id</codeph>.</li>
         <li><varname>max_bytes</varname> is optional.</li>
       </ul>
       <p>Using the above format:</p>
       <ul>
         <li>You can specify only one remote directory. When you configure a remote directory, you
           must also specify a local buffer directory. However, you can use multiple local
           directories with the remote directory. If you specify multiple local directories, the
           first local directory is used as the local buffer directory.</li>
         <li>If you configure both remote and local directories, the remote directory is used only
           when the local directories are fully utilized.</li>
         <li>The size of a remote intermediate file can affect query performance. You can set it
           with the <codeph>--remote_tmp_file_size=<varname>size</varname></codeph> startup
           option. The default size of a remote intermediate file is 16MB, and the maximum is
           512MB.</li>
       </ul>
       <p><b>Examples</b></p>
       <ul>
         <li>An HDFS scratch dir with one local buffer dir, file size 64MB. The space of the HDFS
           scratch dir is limited to 300G.
           <codeblock>--scratch_dirs=hdfs://10.0.0.49:20500/tmp:300G,/local_buffer_dir --remote_tmp_file_size=64M</codeblock></li>
         <li>An HDFS scratch dir with one local buffer dir limited to 512MB, and one local dir
           limited to 10GB. The space of the HDFS scratch dir is limited to 300G. The HDFS
           NameNode uses its default port (8020).
           <codeblock>--scratch_dirs=hdfs://hdfsnn/tmp:300G,/local_buffer_dir:512M,/local_dir:10G</codeblock></li>
         <li>An HDFS scratch dir with one local buffer dir, and multiple prioritized local dirs.
           The space of the HDFS scratch dir is unlimited. The HDFS service identifier is <codeph>hdfs1</codeph>.
           <codeblock>--scratch_dirs=hdfs://hdfs1/tmp,/local_buffer_dir,/local_dir_1:5G:1,/local_dir_2:5G:2</codeblock></li>
       </ul>
       <p>Even though <varname>max_bytes</varname> is optional, setting it is highly recommended
         when spilling to HDFS, because the HDFS cluster space is limited.</p>
     </section>
     <section>
       <title>Configure Impala Daemon to Spill to Ozone</title>
       <p><b>Before you begin</b></p>
       <ul>
         <li>Identify the Ozone scratch directory where you want your new Impala to write the
           temporary data.</li>
         <li>Identify the IP address, host name, or service identifier of Ozone.</li>
         <li>Identify the port number of the Ozone Manager (if it is not the default).</li>
       </ul>
       <p><b>Configuring the Start-up Option in Impala daemon</b></p>
       <p>You can use the <cmdname>impalad</cmdname> startup option
         <codeph>--scratch_dirs</codeph> to specify the locations of the intermediate files.</p>
       <codeblock>--scratch_dirs="ofs://<varname>authority</varname>/<varname>path</varname>(:<varname>max_bytes</varname>), <varname>local_buffer_dir</varname> (,<varname>local_dir</varname>…)"</codeblock>
       <ul>
         <li>Where <codeph>ofs://<varname>authority</varname>/<varname>path</varname></codeph> is
           the remote directory.</li>
         <li><varname>authority</varname> may include <codeph>ip_address</codeph> or
           <codeph>hostname</codeph> and <codeph>port</codeph>, or <codeph>service_id</codeph>.</li>
         <li><varname>max_bytes</varname> is optional.</li>
       </ul>
       <p>Using the above format:</p>
       <ul>
         <li>You can specify only one remote directory. When you configure a remote directory, you
           must also specify a local buffer directory. However, you can use multiple local
           directories with the remote directory. If you specify multiple local directories, the
           first local directory is used as the local buffer directory.</li>
         <li>If you configure both remote and local directories, the remote directory is used only
           when the local directories are fully utilized.</li>
         <li>The size of a remote intermediate file can affect query performance. You can set it
           with the <codeph>--remote_tmp_file_size=<varname>size</varname></codeph> startup
           option. The default size of a remote intermediate file is 16MB, and the maximum is
           512MB.</li>
       </ul>
       <p><b>Examples</b></p>
       <ul>
         <li>An Ozone scratch dir with one local buffer dir, file size 64MB. The space of Ozone
           scratch dir is limited to 300G.
           <codeblock>--scratch_dirs=ofs://10.0.0.49:29000/tmp:300G,/local_buffer_dir --remote_tmp_file_size=64M</codeblock></li>
         <li>An Ozone scratch dir with one local buffer dir limited to 512MB, and one local dir
           limited to 10GB. The space of Ozone scratch dir is limited to 300G. The Ozone Manager
           uses its default port (9862).
           <codeblock>--scratch_dirs=ofs://ozonemgr/tmp:300G,/local_buffer_dir:512M,/local_dir:10G</codeblock></li>
         <li>An Ozone scratch dir with one local buffer dir, and multiple prioritized local dirs. The
           space of Ozone scratch dir is unlimited. The Ozone service identifier is <codeph>ozone1</codeph>.
           <codeblock>--scratch_dirs=ofs://ozone1/tmp,/local_buffer_dir,/local_dir_1:5G:1,/local_dir_2:5G:2</codeblock></li>
       </ul>
       <p>Even though <varname>max_bytes</varname> is optional, setting it is highly recommended
         when spilling to Ozone, because the Ozone cluster space is limited.</p>
     </section>
   </conbody>

 </concept>