| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="disk_space"> |
| |
| <title>Managing Disk Space for Impala Data</title> |
| |
| <titlealts audience="PDF"> |
| |
| <navtitle>Managing Disk Space</navtitle> |
| |
| </titlealts> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Disk Storage"/> |
| <data name="Category" value="Administrators"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Tables"/> |
| <data name="Category" value="Compression"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Although Impala typically works with many large files in an HDFS storage system with |
| plenty of capacity, there are times when you might perform some file cleanup to reclaim |
| space, or advise developers on techniques to minimize space consumption and file |
| duplication. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Use compact binary file formats where practical. Numeric and time-based data in |
| particular can be stored in more compact form in binary data files. Depending on the |
| file format, various compression and encoding features can reduce file size even |
| further. You can specify the <codeph>STORED AS</codeph> clause as part of the |
| <codeph>CREATE TABLE</codeph> statement, or <codeph>ALTER TABLE</codeph> with the |
| <codeph>SET FILEFORMAT</codeph> clause for an existing table or partition within a |
| partitioned table. See <xref |
| href="impala_file_formats.xml#file_formats"/> |
| for details about file formats, especially <xref href="impala_parquet.xml#parquet"/>. |
| See <xref href="impala_create_table.xml#create_table"/> and |
| <xref |
| href="impala_alter_table.xml#alter_table"/> for syntax details. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| You manage underlying data files differently depending on whether the corresponding |
| Impala table is defined as an |
| <xref |
| href="impala_tables.xml#internal_tables">internal</xref> or |
| <xref |
| href="impala_tables.xml#external_tables">external</xref> table: |
| </p> |
| <ul> |
| <li> |
| Use the <codeph>DESCRIBE FORMATTED</codeph> statement to check if a particular table |
| is internal (managed by Impala) or external, and to see the physical location of the |
| data files in HDFS. See <xref |
| href="impala_describe.xml#describe"/> |
| for details. |
| </li> |
| |
| <li> |
| For Impala-managed (<q>internal</q>) tables, use <codeph>DROP TABLE</codeph> |
| statements to remove data files. See |
| <xref |
| href="impala_drop_table.xml#drop_table"/> for details. |
| </li> |
| |
| <li> |
| For tables not managed by Impala (<q>external</q> tables), use appropriate |
| HDFS-related commands such as <codeph>hadoop fs</codeph>, <codeph>hdfs dfs</codeph>, |
| or <codeph>distcp</codeph>, to create, move, copy, or delete files within HDFS |
| directories that are accessible by the <codeph>impala</codeph> user. Issue a |
| <codeph>REFRESH <varname>table_name</varname></codeph> statement after adding or |
| removing any files from the data directory of an external table. See |
| <xref href="impala_refresh.xml#refresh"/> for details. |
| </li> |
| |
| <li> |
| Use external tables to reference HDFS data files in their original location. With |
| this technique, you avoid copying the files, and you can map more than one Impala |
| table to the same set of data files. When you drop the Impala table, the data files |
| are left undisturbed. See <xref href="impala_tables.xml#external_tables"/> for |
| details. |
| </li> |
| |
| <li> |
| Use the <codeph>LOAD DATA</codeph> statement to move HDFS files into the data |
| directory for an Impala table from inside Impala, without the need to specify the |
| HDFS path of the destination directory. This technique works for both internal and |
| external tables. See <xref href="impala_load_data.xml#load_data"/> for details. |
| </li> |
| </ul> |
| </li> |
| |
| <li> |
| <p> |
| Make sure that the HDFS trashcan is configured correctly. When you remove files from |
| HDFS, the space might not be reclaimed for use by other files until sometime later, |
| when the trashcan is emptied. See <xref href="impala_drop_table.xml#drop_table"/> for |
| details. See <xref href="impala_prereqs.xml#prereqs_account"/> for permissions needed |
| for the HDFS trashcan to operate correctly. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Drop all tables in a database before dropping the database itself. See |
| <xref href="impala_drop_database.xml#drop_database"/> for details. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Clean up temporary files after failed <codeph>INSERT</codeph> statements. If an |
| <codeph>INSERT</codeph> statement encounters an error, and you see a directory named |
| <filepath>.impala_insert_staging</filepath> or |
| <filepath>_impala_insert_staging</filepath> left behind in the data directory for the |
| table, it might contain temporary data files taking up space in HDFS. You might be |
| able to salvage these data files, for example if they are complete but could not be |
| moved into place due to a permission error. Or, you might delete those files through |
| commands such as <codeph>hadoop fs</codeph> or <codeph>hdfs dfs</codeph>, to reclaim |
| space before re-trying the <codeph>INSERT</codeph>. Issue <codeph>DESCRIBE FORMATTED |
| <varname>table_name</varname></codeph> to see the HDFS path where you can check for |
| temporary files. |
| </p> |
| </li> |
| |
| <li rev="2.2.0"> |
| <p> |
| If you use the Amazon Simple Storage Service (S3) as a place to offload data to reduce |
| the volume of local storage, Impala 2.2.0 and higher can query the data directly from |
| S3. See <xref |
| href="impala_s3.xml#s3"/> for details. |
| </p> |
| </li> |
| </ul> |
| |
| <section id="section_vrg_fjb_3jb"> |
| |
| <title>Configuring Scratch Space for Spilling to Disk</title>Impala uses intermediate files during large |
| sort, join, aggregation, or analytic function operations The files are |
| removed when the operation finishes. You can specify locations of the |
| intermediate files by starting the <cmdname>impalad</cmdname> daemon with |
| the |
| <codeph>‑‑scratch_dirs="<varname>path_to_directory</varname>"</codeph> |
| configuration option. By default, intermediate files are stored in the |
| directory <filepath>/tmp/impala-scratch</filepath>.<p |
| id="order_by_scratch_dir"> |
| <ul> |
| <li> |
| You can specify a single directory or a comma-separated list of directories. |
| </li> |
| |
| <li> |
| You can specify an optional a capacity quota per scratch directory using the colon |
| (:) as the delimiter. |
| <p> |
| The capacity quota of <codeph>-1</codeph> or <codeph>0</codeph> is the same as no |
| quota for the directory. |
| </p> |
| </li> |
| |
| <li> |
| The scratch directories must be on the local filesystem, not in HDFS. |
| </li> |
| |
| <li> |
| You might specify different directory paths for different hosts, depending on the |
| capacity and speed of the available storage devices. |
| </li> |
| </ul> |
| </p> |
| |
| <p> |
| If there is less than 1 GB free on the filesystem where that directory resides, Impala |
| still runs, but writes a warning message to its log. |
| </p> |
| |
| <p> |
| Impala successfully starts (with a warning written to the log) if it cannot create or |
| read and write files in one of the scratch directories. |
| </p> |
| |
| <p> |
| The following are examples for specifying scratch directories. |
| <table frame="all" rowsep="1" colsep="1" |
| id="table_a4d_myg_3jb"> |
| <tgroup cols="2" align="left"> |
| <colspec colname="c1" colnum="1"/> |
| <colspec colname="c2" colnum="2"/> |
| <thead> |
| <row> |
| <entry> |
| Config option |
| </entry> |
| <entry> |
| Description |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry> |
| <codeph>--scratch_dirs=/dir1,/dir2</codeph> |
| </entry> |
| <entry> |
| Use /dir1 and /dir2 as scratch directories with no capacity quota. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <codeph>--scratch_dirs=/dir1,/dir2:25G</codeph> |
| </entry> |
| <entry> |
| Use /dir1 and /dir2 as scratch directories with no capacity quota on /dir1 and |
| the 25GB quota on /dir2. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <codeph>--scratch_dirs=/dir1:5MB,/dir2</codeph> |
| </entry> |
| <entry> |
| Use /dir1 and /dir2 as scratch directories with the capacity quota of 5MB on |
| /dir1 and no quota on /dir2. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <codeph>--scratch_dirs=/dir1:-1,/dir2:0</codeph> |
| </entry> |
| <entry> |
| Use /dir1 and /dir2 as scratch directories with no capacity quota. |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </p> |
| |
| <p> |
| Allocation from a scratch directory will fail if the specified limit for the directory |
| is exceeded. |
| </p> |
| |
| <p> |
| If Impala encounters an error reading or writing files in a scratch directory during a |
| query, Impala logs the error, and the query fails. |
| </p> |
| |
| </section> |
| <section> |
| <title>Increasing Scratch Capacity</title> |
| <p> You can compress the data spilled to disk to increase the effective scratch capacity. You |
| typically more than double capacity using compression and reduce spilling to disk. Use the |
| --disk_spill_compression_codec and –-disk_spill_punch_holes startup options. The |
| --disk_spill_compression_codec takes any value supported by the COMPRESSION_CODEC query |
| option. The value is not case-sensitive. A value of <codeph>ZSTD</codeph> or |
| <codeph>LZ4</codeph> is recommended (default is NONE).</p> |
| <p>For example:</p> |
| <codeblock>--disk_spill_compression_codec=LZ4 |
| --disk_spill_punch_holes=true |
| </codeblock> |
| <p> |
| If you set <codeph>--disk_spill_compression_codec</codeph> to a value other than <codeph>NONE</codeph>, you must set <codeph>--disk_spill_punch_holes</codeph> to true. |
| </p> |
| <p> |
| The hole punching feature supported by many filesystems is used to reclaim space in scratch files during execution |
| of a query that spills to disk. This results in lower scratch space requirements in many cases, especially when |
| combined with disk spill compression. When this option is not enabled, scratch space is still recycled by a query, |
| but less effectively in many cases. |
| </p> |
| <p> You can specify a compression level for <codeph>ZSTD</codeph> only. For example: </p> |
| <codeblock>--disk_spill_compression_codec=ZSTD:10 |
| --disk_spill_punch_holes=true |
| </codeblock> |
| <p> Compression levels from 1 up to 22 (default 3) are supported for <codeph>ZSTD</codeph>. |
| The lower the compression level, the faster the speed at the cost of compression ratio.</p> |
| </section> |
| </conbody> |
| |
| </concept> |