<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="disk_space">
<title>Managing Disk Space for Impala Data</title>
<titlealts audience="PDF">
<navtitle>Managing Disk Space</navtitle>
</titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Disk Storage"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Compression"/>
</metadata>
</prolog>
<conbody>
<p>
Although Impala typically works with many large files in an HDFS storage system with
plenty of capacity, there are times when you might perform some file cleanup to reclaim
space, or advise developers on techniques to minimize space consumption and file
duplication.
</p>
<ul>
<li>
<p>
Use compact binary file formats where practical. Numeric and time-based data in
particular can be stored in more compact form in binary data files. Depending on the
file format, various compression and encoding features can reduce file size even
further. You can specify the <codeph>STORED AS</codeph> clause as part of the
<codeph>CREATE TABLE</codeph> statement, or <codeph>ALTER TABLE</codeph> with the
<codeph>SET FILEFORMAT</codeph> clause for an existing table or partition within a
partitioned table. See <xref
href="impala_file_formats.xml#file_formats"/>
for details about file formats, especially <xref href="impala_parquet.xml#parquet"/>.
See <xref href="impala_create_table.xml#create_table"/> and
<xref
href="impala_alter_table.xml#alter_table"/> for syntax details.
</p>
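        <p>
          For example, a hypothetical table could be created in Parquet format, or an
          existing table switched to Parquet for newly added data files (the table and
          column names here are illustrative only):
        </p>
<codeblock>-- Create a new table whose data files are stored in Parquet format.
CREATE TABLE sales_parquet (id BIGINT, amount DECIMAL(10,2), sale_date TIMESTAMP)
  STORED AS PARQUET;

-- Switch an existing table to Parquet; data files added from now on use the new format.
ALTER TABLE sales_text SET FILEFORMAT PARQUET;</codeblock>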
</li>
<li>
<p>
You manage underlying data files differently depending on whether the corresponding
Impala table is defined as an
<xref
href="impala_tables.xml#internal_tables">internal</xref> or
<xref
href="impala_tables.xml#external_tables">external</xref> table:
</p>
<ul>
<li>
Use the <codeph>DESCRIBE FORMATTED</codeph> statement to check if a particular table
is internal (managed by Impala) or external, and to see the physical location of the
data files in HDFS. See <xref
href="impala_describe.xml#describe"/>
for details.
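          For example (the table name is illustrative only), check the
          <codeph>Table Type:</codeph> and <codeph>Location:</codeph> fields that
          typically appear in the output:
<codeblock>DESCRIBE FORMATTED sales_parquet;
-- Typical lines of interest in the output:
--   Table Type:  MANAGED_TABLE (internal) or EXTERNAL_TABLE (external)
--   Location:    hdfs://.../user/hive/warehouse/sales_parquet</codeblock>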
</li>
<li>
For Impala-managed (<q>internal</q>) tables, use <codeph>DROP TABLE</codeph>
statements to remove data files. See
<xref
href="impala_drop_table.xml#drop_table"/> for details.
</li>
<li>
For tables not managed by Impala (<q>external</q> tables), use appropriate
HDFS-related commands such as <codeph>hadoop fs</codeph>, <codeph>hdfs dfs</codeph>,
or <codeph>distcp</codeph>, to create, move, copy, or delete files within HDFS
directories that are accessible by the <codeph>impala</codeph> user. Issue a
<codeph>REFRESH <varname>table_name</varname></codeph> statement after adding or
removing any files from the data directory of an external table. See
<xref href="impala_refresh.xml#refresh"/> for details.
</li>
<li>
Use external tables to reference HDFS data files in their original location. With
this technique, you avoid copying the files, and you can map more than one Impala
table to the same set of data files. When you drop the Impala table, the data files
are left undisturbed. See <xref href="impala_tables.xml#external_tables"/> for
details.
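          For example (the table name and HDFS path are illustrative only), an external
          table can point at data files that already exist in HDFS:
<codeblock>CREATE EXTERNAL TABLE sales_external (id BIGINT, amount DECIMAL(10,2))
  STORED AS PARQUET
  LOCATION '/user/impala/external/sales';</codeblock>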
</li>
<li>
Use the <codeph>LOAD DATA</codeph> statement to move HDFS files into the data
directory for an Impala table from inside Impala, without the need to specify the
HDFS path of the destination directory. This technique works for both internal and
external tables. See <xref href="impala_load_data.xml#load_data"/> for details.
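          For example (the path and table name are illustrative only), the statement moves
          the file out of its original HDFS location and into the table's data directory:
<codeblock>LOAD DATA INPATH '/user/impala/staging/new_batch.parq' INTO TABLE sales_parquet;</codeblock>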
</li>
</ul>
</li>
<li>
<p>
Make sure that the HDFS trashcan is configured correctly. When you remove files from
HDFS, the space might not be reclaimed for use by other files until sometime later,
when the trashcan is emptied. See <xref href="impala_drop_table.xml#drop_table"/> for
details. See <xref href="impala_prereqs.xml#prereqs_account"/> for permissions needed
for the HDFS trashcan to operate correctly.
</p>
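        <p>
          As a minimal sketch, the HDFS trashcan behavior is typically controlled by the
          <codeph>fs.trash.interval</codeph> property (in minutes) in
          <filepath>core-site.xml</filepath>; a value of <codeph>0</codeph> disables the
          trashcan entirely. The value shown here is illustrative only:
        </p>
<codeblock>&lt;property&gt;
  &lt;name&gt;fs.trash.interval&lt;/name&gt;
  &lt;!-- Keep deleted files in .Trash for 1 day (1440 minutes) before reclaiming space. --&gt;
  &lt;value&gt;1440&lt;/value&gt;
&lt;/property&gt;</codeblock>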
</li>
<li>
<p>
Drop all tables in a database before dropping the database itself. See
<xref href="impala_drop_database.xml#drop_database"/> for details.
</p>
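        <p>
          For example (the database and table names are illustrative only):
        </p>
<codeblock>USE scratch_db;
SHOW TABLES;
DROP TABLE t1;
DROP TABLE t2;
-- Switch away from the database before dropping it.
USE default;
DROP DATABASE scratch_db;</codeblock>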
</li>
<li>
<p>
Clean up temporary files after failed <codeph>INSERT</codeph> statements. If an
<codeph>INSERT</codeph> statement encounters an error, and you see a directory named
<filepath>.impala_insert_staging</filepath> or
<filepath>_impala_insert_staging</filepath> left behind in the data directory for the
table, it might contain temporary data files taking up space in HDFS. You might be
able to salvage these data files, for example if they are complete but could not be
        moved into place due to a permission error. Or, you might delete those files through
        commands such as <codeph>hadoop fs</codeph> or <codeph>hdfs dfs</codeph> to reclaim
        space before retrying the <codeph>INSERT</codeph>. Issue <codeph>DESCRIBE FORMATTED
<varname>table_name</varname></codeph> to see the HDFS path where you can check for
temporary files.
</p>
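        <p>
          For example, a hypothetical cleanup might look like the following (the table name
          and warehouse path are illustrative only):
        </p>
<codeblock>-- In impala-shell: find the table's HDFS data directory (the Location: field).
DESCRIBE FORMATTED sales_parquet;

$ hdfs dfs -ls /user/hive/warehouse/sales_parquet
$ hdfs dfs -rm -r /user/hive/warehouse/sales_parquet/_impala_insert_staging</codeblock>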
</li>
<li rev="2.2.0">
<p>
If you use the Amazon Simple Storage Service (S3) as a place to offload data to reduce
the volume of local storage, Impala 2.2.0 and higher can query the data directly from
S3. See <xref
href="impala_s3.xml#s3"/> for details.
</p>
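        <p>
          For example (the bucket name and path are illustrative only), an external table
          can point directly at data files already residing in S3:
        </p>
<codeblock>CREATE EXTERNAL TABLE sales_s3 (id BIGINT, amount DECIMAL(10,2))
  STORED AS PARQUET
  LOCATION 's3a://example-bucket/warehouse/sales/';</codeblock>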
</li>
</ul>
<section id="section_vrg_fjb_3jb">
    <title>Configuring Scratch Space for Spilling to Disk</title>
    Impala uses intermediate files during large sort, join, aggregation, or analytic
    function operations. The files are
removed when the operation finishes. You can specify locations of the
intermediate files by starting the <cmdname>impalad</cmdname> daemon with
the
<codeph>&#8209;&#8209;scratch_dirs="<varname>path_to_directory</varname>"</codeph>
configuration option. By default, intermediate files are stored in the
    directory <filepath>/tmp/impala-scratch</filepath>.
    <p id="order_by_scratch_dir">
<ul>
<li>
You can specify a single directory or a comma-separated list of directories.
</li>
<li>
          You can specify an optional capacity quota per scratch directory, using a colon
          (:) as the delimiter.
<p>
            A capacity quota of <codeph>-1</codeph> or <codeph>0</codeph> is the same as no
quota for the directory.
</p>
</li>
<li>
The scratch directories must be on the local filesystem, not in HDFS.
</li>
<li>
You might specify different directory paths for different hosts, depending on the
capacity and speed of the available storage devices.
</li>
</ul>
</p>
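      <p>
        As a minimal sketch, the option can be passed on the <cmdname>impalad</cmdname>
        command line or added to the daemon's startup flags (the paths and quota here are
        illustrative only):
      </p>
<codeblock>impalad --scratch_dirs=/data1/impala-scratch:200G,/data2/impala-scratch</codeblock>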
<p>
      If there is less than 1 GB free on the filesystem where a scratch directory resides, Impala
still runs, but writes a warning message to its log.
</p>
<p>
      Impala still starts successfully (with a warning written to the log) even if it cannot
      create, or read and write, files in one of the scratch directories.
</p>
<p>
      The following examples show how to specify scratch directories.
<table frame="all" rowsep="1" colsep="1"
id="table_a4d_myg_3jb">
<tgroup cols="2" align="left">
<colspec colname="c1" colnum="1"/>
<colspec colname="c2" colnum="2"/>
<thead>
<row>
<entry>
Config option
</entry>
<entry>
Description
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<codeph>--scratch_dirs=/dir1,/dir2</codeph>
</entry>
<entry>
Use /dir1 and /dir2 as scratch directories with no capacity quota.
</entry>
</row>
<row>
<entry>
<codeph>--scratch_dirs=/dir1,/dir2:25G</codeph>
</entry>
<entry>
                Use /dir1 and /dir2 as scratch directories with no capacity quota on /dir1 and
                a 25 GB quota on /dir2.
</entry>
</row>
<row>
<entry>
<codeph>--scratch_dirs=/dir1:5MB,/dir2</codeph>
</entry>
<entry>
                Use /dir1 and /dir2 as scratch directories with a capacity quota of 5 MB on
                /dir1 and no quota on /dir2.
</entry>
</row>
<row>
<entry>
<codeph>--scratch_dirs=/dir1:-1,/dir2:0</codeph>
</entry>
<entry>
Use /dir1 and /dir2 as scratch directories with no capacity quota.
</entry>
</row>
</tbody>
</tgroup>
</table>
</p>
<p>
Allocation from a scratch directory will fail if the specified limit for the directory
is exceeded.
</p>
<p>
If Impala encounters an error reading or writing files in a scratch directory during a
query, Impala logs the error, and the query fails.
</p>
</section>
</conbody>
</concept>