docs/topics/impala_perf_hdfs_caching.xml - impala - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept rev="1.4" id="hdfs_caching">

   <title>Using HDFS Caching with Impala (<keyword keyref="impala21"/> or higher only)</title>
   <titlealts audience="PDF"><navtitle>HDFS Caching</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Performance"/>
       <data name="Category" value="Scalability"/>
       <data name="Category" value="HDFS"/>
       <data name="Category" value="HDFS Caching"/>
       <data name="Category" value="Memory"/>
       <data name="Category" value="Administrators"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       HDFS caching provides performance and scalability benefits in production environments where Impala queries
       and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes,
       making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory.
       Data read from the HDFS cache avoids the overhead of checksumming and memory-to-memory copying involved when
       using data from the Linux OS cache.
     </p>

     <note>
       <p>
         On a small or lightly loaded cluster, HDFS caching might not produce any speedup. It might even lead to
         slower queries, if I/O read operations that were performed in parallel across the entire cluster are replaced by in-memory
         operations operating on a smaller number of hosts. The hosts where the HDFS blocks are cached can become
         bottlenecks because they experience high CPU load while processing the cached data blocks, while other hosts remain idle.
         Therefore, always compare performance with and without this feature enabled, using a realistic workload.
       </p>
       <p rev="2.2.0">
         In <keyword keyref="impala22_full"/> and higher, you can spread the CPU load more evenly by specifying the <codeph>WITH REPLICATION</codeph>
         clause of the <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements.
         This clause lets you control the replication factor for
         HDFS caching for a specific table or partition. By default, each cached block is
         only present on a single host, which can lead to CPU contention if the same host
         processes each cached block. Increasing the replication factor lets Impala choose
         different hosts to process different cached blocks, to better distribute the CPU load.
         Always use a <codeph>WITH REPLICATION</codeph> setting of at least 3, and adjust upward
         if necessary to match the replication factor for the underlying HDFS data files.
       </p>
       <p rev="2.5.0">
         In <keyword keyref="impala25_full"/> and higher, Impala automatically randomizes which host processes
         a cached HDFS block, to avoid CPU hotspots. For tables where HDFS caching is not applied,
         Impala designates which host to process a data block using an algorithm that estimates
         the load on each host. If CPU hotspots still arise during queries,
         you can enable additional randomization for the scheduling algorithm for non-HDFS cached data
         by setting the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option.
       </p>
     </note>

     <p outputclass="toc inpage"/>

 <!-- Could conref this background link; haven't decided yet the best place or if it's needed twice. -->

     <p>
       For background information about how to set up and manage HDFS caching for a <keyword keyref="distro"/> cluster, see
       <xref keyref="setup_hdfs_caching"/>.
     </p>
   </conbody>

   <concept id="hdfs_caching_overview">

     <title>Overview of HDFS Caching for Impala</title>
   <prolog>
     <metadata>
       <data name="Category" value="Concepts"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         In <keyword keyref="impala14_full"/> and higher, Impala can use the HDFS caching feature to make more effective use of RAM, so that
         repeated queries can take advantage of data <q>pinned</q> in memory regardless of how much data is
         processed overall. The HDFS caching feature lets you designate a subset of frequently accessed data to be
         pinned permanently in memory, remaining in the cache across multiple queries and never being evicted. This
         technique is suitable for tables or partitions that are frequently accessed and are small enough to fit
         entirely within the HDFS memory cache. For example, you might designate several dimension tables to be
         pinned in the cache, to speed up many different join queries that reference them. Or in a partitioned
         table, you might pin a partition holding data from the most recent time period because that data will be
         queried intensively; then when the next set of data arrives, you could unpin the previous partition and pin
         the partition holding the new data.
       </p>

       <p>
         Because this Impala performance feature relies on HDFS infrastructure, it only applies to Impala tables
         that use HDFS data files. HDFS caching for Impala does not apply to HBase tables, S3 tables,
         Kudu tables,
         or Isilon tables.
       </p>

     </conbody>
   </concept>

   <concept id="hdfs_caching_prereqs">

     <title>Setting Up HDFS Caching for Impala</title>

     <conbody>

       <p>
         To use HDFS caching with Impala, first set up that feature for your <keyword keyref="distro"/> cluster:
       </p>

       <ul>
         <li>
           <p>
           Decide how much memory to devote to the HDFS cache on each host. Remember that the total memory available
           for cached data is the sum of the cache sizes on all the hosts. By default, any data block is only cached on one
           host, although you can cache a block across multiple hosts by increasing the replication factor.
           <!-- Obsoleted in Impala 2.2 and higher by IMPALA-1587.
           Once a data block is cached on one host, all requests to process that block are routed to that same host.)
           -->
           </p>
         </li>

         <li>
           <p>
           Issue <cmdname>hdfs cacheadmin</cmdname> commands to set up one or more cache pools, owned by the same
           user as the <cmdname>impalad</cmdname> daemon (typically <codeph>impala</codeph>). For example:
 <codeblock>hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000
 </codeblock>
           For details about the <cmdname>hdfs cacheadmin</cmdname> command, see
           <xref keyref="setup_hdfs_caching"/>.
           </p>
         </li>
       </ul>

       <p>
         Once HDFS caching is enabled and one or more pools are available, see
         <xref href="impala_perf_hdfs_caching.xml#hdfs_caching_ddl"/> for how to choose which Impala data to load
         into the HDFS cache. On the Impala side, you specify the cache pool name defined by the <codeph>hdfs
         cacheadmin</codeph> command in the Impala DDL statements that enable HDFS caching for a table or partition,
         such as <codeph>CREATE TABLE ... CACHED IN <varname>pool</varname></codeph> or <codeph>ALTER TABLE ... SET
         CACHED IN <varname>pool</varname></codeph>.
       </p>
     </conbody>
   </concept>

   <concept id="hdfs_caching_ddl">

     <title>Enabling HDFS Caching for Impala Tables and Partitions</title>

     <conbody>

       <p>
         Begin by choosing which tables or partitions to cache. For example, these might be lookup tables that are
         accessed by many different join queries, or partitions corresponding to the most recent time period that
         are analyzed by different reports or ad hoc queries.
       </p>

       <p>
         In your SQL statements, you specify logical divisions such as tables and partitions to be cached. Impala
         translates these requests into HDFS-level directives that apply to particular directories and files. For
         example, given a partitioned table <codeph>CENSUS</codeph> with a partition key column
         <codeph>YEAR</codeph>, you could choose to cache all or part of the data as follows:
       </p>

       <p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/>

 <codeblock>-- Cache the entire table (all partitions).
 alter table census set cached in '<varname>pool_name</varname>';

 -- Remove the entire table from the cache.
 alter table census set uncached;

 -- Cache a portion of the table (a single partition).
 -- If the table is partitioned by multiple columns (such as year, month, day),
 -- the ALTER TABLE command must specify values for all those columns.
 alter table census partition (year=1960) set cached in '<varname>pool_name</varname>';

 <ph rev="2.2.0">-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any
 -- single host when the same data block is processed multiple times.
 alter table census partition (year=1970)
   set cached in '<varname>pool_name</varname>' with replication = 4;</ph>

 -- At each stage, check the volume of cached data.
 -- For large tables or partitions, the background loading might take some time,
 -- so you might have to wait and reissue the statement until all the data
 -- has finished being loaded into the cache.
 show table stats census;
 +-------+-------+--------+------+--------------+--------+
 | year  | #Rows | #Files | Size | Bytes Cached | Format |
 +-------+-------+--------+------+--------------+--------+
 | 1900  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
 | 1940  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
 | 1960  | -1    | 1      | 11B  | 11B          | TEXT   |
 | 1970  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
 | Total | -1    | 4      | 44B  | 11B          |        |
 +-------+-------+--------+------+--------------+--------+
 </codeblock>

       <p>
         <b>CREATE TABLE considerations:</b>
       </p>

       <p>
         The HDFS caching feature affects the Impala <codeph>CREATE TABLE</codeph> statement as follows:
       </p>

       <ul>
         <li>
         <p>
           You can put a <codeph>CACHED IN '<varname>pool_name</varname>'</codeph> clause
           <ph rev="2.2.0">and optionally a <codeph>WITH REPLICATION = <varname>number_of_hosts</varname></codeph> clause</ph>
           at the end of a
           <codeph>CREATE TABLE</codeph> statement to automatically cache the entire contents of the table,
           including any partitions added later. The <varname>pool_name</varname> is a pool that you previously set
           up with the <cmdname>hdfs cacheadmin</cmdname> command.
         </p>
         </li>

         <li>
         <p>
           Once a table is designated for HDFS caching through the <codeph>CREATE TABLE</codeph> statement, if new
           partitions are added later through <codeph>ALTER TABLE ... ADD PARTITION</codeph> statements, the data in
           those new partitions is automatically cached in the same pool.
         </p>
         </li>

         <li>
         <p>
           If you want to perform repetitive queries on a subset of data from a large table, and it is not practical
           to designate the entire table or specific partitions for HDFS caching, you can create a new cached table
           with just a subset of the data by using <codeph>CREATE TABLE ... CACHED IN '<varname>pool_name</varname>'
           AS SELECT ... WHERE ...</codeph>. When you are finished with generating reports from this subset of data,
           drop the table and both the data files and the data cached in RAM are automatically deleted.
         </p>
         </li>
       </ul>

       <p>
         See <xref href="impala_create_table.xml#create_table"/> for the full syntax.
       </p>

       <p>
         <b>Other memory considerations:</b>
       </p>

       <p>
         Certain DDL operations, such as <codeph>ALTER TABLE ... SET LOCATION</codeph>, are blocked while the
         underlying HDFS directories contain cached files. You must uncache the files first, before changing the
         location, dropping the table, and so on.
       </p>

       <p> When data is requested to be pinned in memory, that process happens in
         the background without blocking access to the data while the caching is
         in progress. Loading the data from disk could take some time. Impala
         reads each HDFS data block from memory if it has been pinned already, or
         from disk if it has not been pinned yet.</p>

       <p>
         The amount of data that you can pin on each node through the HDFS caching mechanism is subject to a quota
         that is enforced by the underlying HDFS service. Before requesting to pin an Impala table or partition in
         memory, check that its size does not exceed this quota.
       </p>

       <note>
         Because the HDFS cache consists of combined memory from all the DataNodes in the cluster, cached tables or
         partitions can be bigger than the amount of HDFS cache memory on any single host.
       </note>
     </conbody>
   </concept>

   <concept id="hdfs_caching_etl">

     <title>Loading and Removing Data with HDFS Caching Enabled</title>
   <prolog>
     <metadata>
       <data name="Category" value="ETL"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         When HDFS caching is enabled, extra processing happens in the background when you add or remove data
         through statements such as <codeph>INSERT</codeph> and <codeph>DROP TABLE</codeph>.
       </p>

       <p>
         <b>Inserting or loading data:</b>
       </p>

       <ul>
         <li>
           When Impala performs an <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> or
           <codeph><xref href="impala_load_data.xml#load_data">LOAD DATA</xref></codeph> statement for a table or
           partition that is cached, the new data files are automatically cached and Impala recognizes that fact
           automatically.
         </li>

         <li>
           If you perform an <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> through Hive, as always, Impala
           only recognizes the new data files after a <codeph>REFRESH <varname>table_name</varname></codeph>
           statement in Impala.
         </li>

         <li>
           If the cache pool is entirely full, or becomes full before all the requested data can be cached, the
           Impala DDL statement returns an error. This is to avoid situations where only some of the requested data
           could be cached.
         </li>

         <li>
           When HDFS caching is enabled for a table or partition, new data files are cached automatically when they
           are added to the appropriate directory in HDFS, without the need for a <codeph>REFRESH</codeph> statement
           in Impala. Impala automatically performs a <codeph>REFRESH</codeph> once the new data is loaded into the
           HDFS cache.
         </li>
       </ul>

       <p>
         <b>Dropping tables, partitions, or cache pools:</b>
       </p>

       <p>
         The HDFS caching feature interacts with the Impala
         <codeph><xref href="impala_drop_table.xml#drop_table">DROP TABLE</xref></codeph> and
         <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE ... DROP PARTITION</xref></codeph>
         statements as follows:
       </p>

       <ul>
         <li>
           When you issue a <codeph>DROP TABLE</codeph> for a table that is entirely cached, or has some partitions
           cached, the <codeph>DROP TABLE</codeph> succeeds and all the cache directives Impala submitted for that
           table are removed from the HDFS cache system.
         </li>

         <li>
           The same applies to <codeph>ALTER TABLE ... DROP PARTITION</codeph>. The operation succeeds and any cache
           directives are removed.
         </li>

         <li>
           As always, the underlying data files are removed if the dropped table is an internal table, or the
           dropped partition is in its default location underneath an internal table. The data files are left alone
           if the dropped table is an external table, or if the dropped partition is in a non-default location.
         </li>

         <li>
           If you designated the data files as cached through the <cmdname>hdfs cacheadmin</cmdname> command, and
           the data files are left behind as described in the previous item, the data files remain cached. Impala
           only removes the cache directives submitted by Impala through the <codeph>CREATE TABLE</codeph> or
           <codeph>ALTER TABLE</codeph> statements. It is OK to have multiple redundant cache directives pertaining
           to the same files; the directives all have unique IDs and owners so that the system can tell them apart.
         </li>

         <li>
           If you drop an HDFS cache pool through the <cmdname>hdfs cacheadmin</cmdname> command, all the Impala
           data files are preserved, just no longer cached. After a subsequent <codeph>REFRESH</codeph>,
           <codeph>SHOW TABLE STATS</codeph> reports 0 bytes cached for each associated Impala table or partition.
         </li>
       </ul>

       <p>
         <b>Relocating a table or partition:</b>
       </p>

       <p>
         The HDFS caching feature interacts with the Impala
         <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE ... SET LOCATION</xref></codeph>
         statement as follows:
       </p>

       <ul>
         <li>
           If you have designated a table or partition as cached through the <codeph>CREATE TABLE</codeph> or
           <codeph>ALTER TABLE</codeph> statements, subsequent attempts to relocate the table or partition through
           an <codeph>ALTER TABLE ... SET LOCATION</codeph> statement will fail. You must issue an <codeph>ALTER
           TABLE ... SET UNCACHED</codeph> statement for the table or partition first. Otherwise, Impala would lose
           track of some cached data files and have no way to uncache them later.
         </li>
       </ul>
     </conbody>
   </concept>

   <concept id="hdfs_caching_admin">

     <title>Administration for HDFS Caching with Impala</title>

     <conbody>

       <p>
         Here are the guidelines and steps to check or change the status of HDFS caching for Impala data:
       </p>

       <p>
         <b>hdfs cacheadmin command:</b>
       </p>

       <ul>
         <li>
           If you drop a cache pool with the <cmdname>hdfs cacheadmin</cmdname> command, Impala queries against the
           associated data files will still work, by falling back to reading the files from disk. After performing a
           <codeph>REFRESH</codeph> on the table, Impala reports the number of bytes cached as 0 for all associated
           tables and partitions.
         </li>

         <li>
           You might use <cmdname>hdfs cacheadmin</cmdname> to get a list of existing cache pools, or detailed
           information about the pools, as follows:
 <codeblock scale="60">hdfs cacheadmin -listDirectives         # Basic info
 Found 122 entries
   ID POOL       REPL EXPIRY  PATH
  123 testPool      1 never   /user/hive/warehouse/tpcds.store_sales
  124 testPool      1 never   /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15
  125 testPool      1 never   /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01
 ...

 hdfs cacheadmin -listDirectives -stats  # More details
 Found 122 entries
   ID POOL       REPL EXPIRY  PATH                                                        BYTES_NEEDED  BYTES_CACHED  FILES_NEEDED  FILES_CACHED
  123 testPool      1 never   /user/hive/warehouse/tpcds.store_sales                                 0             0             0             0
  124 testPool      1 never   /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15         143169        143169             1             1
  125 testPool      1 never   /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01         112447        112447             1             1
 ...
 </codeblock>
         </li>
       </ul>

       <p>
         <b>Impala SHOW statement:</b>
       </p>

       <ul>
         <li>
           For each table or partition, the <codeph>SHOW TABLE STATS</codeph> or <codeph>SHOW PARTITIONS</codeph>
           statement displays the number of bytes currently cached by the HDFS caching feature. If there are no
           cache directives in place for that table or partition, the result set displays <codeph>NOT
           CACHED</codeph>. A value of 0, or a smaller number than the overall size of the table or partition,
           indicates that the cache request has been submitted but the data has not been entirely loaded into memory
           yet. See <xref href="impala_show.xml#show"/> for details.
         </li>
       </ul>

       <p>
         <b>Impala memory limits:</b>
       </p>

       <p>
         The Impala HDFS caching feature interacts with the Impala memory limits as follows:
       </p>

       <ul>
         <li>
           The maximum size of each HDFS cache pool is specified externally to Impala, through the <cmdname>hdfs
           cacheadmin</cmdname> command.
         </li>

         <li>
           All the memory used for HDFS caching is separate from the <cmdname>impalad</cmdname> daemon address space
           and does not count towards the limits of the <codeph>--mem_limit</codeph> startup option,
           <codeph>MEM_LIMIT</codeph> query option, or further limits imposed through YARN resource management or
           the Linux <codeph>cgroups</codeph> mechanism.
         </li>

         <li>
           Because accessing HDFS cached data avoids a memory-to-memory copy operation, queries involving cached
           data require less memory on the Impala side than the equivalent queries on uncached data. In addition to
           any performance benefits in a single-user environment, the reduced memory helps to improve scalability
           under high-concurrency workloads.
         </li>
       </ul>
     </conbody>
   </concept>

   <concept id="hdfs_caching_performance">

     <title>Performance Considerations for HDFS Caching with Impala</title>

     <conbody>

       <p>
         In Impala 1.4.0 and higher, Impala supports efficient reads from data that is pinned in memory through HDFS
         caching. Impala takes advantage of the HDFS API and reads the data from memory rather than from disk
         whether the data files are pinned using Impala DDL statements, or using the command-line mechanism where
         you specify HDFS paths.
       </p>

       <p>
         When you examine the output of the <cmdname>impala-shell</cmdname> <cmdname>SUMMARY</cmdname> command, or
         look in the metrics report for the <cmdname>impalad</cmdname> daemon, you see how many bytes are read from
         the HDFS cache. For example, this excerpt from a query profile illustrates that all the data read during a
         particular phase of the query came from the HDFS cache, because the <codeph>BytesRead</codeph> and
         <codeph>BytesReadDataNodeCache</codeph> values are identical.
       </p>

 <codeblock>HDFS_SCAN_NODE (id=0):(Total: 11s114ms, non-child: 11s114ms, % non-child: 100.00%)
         - AverageHdfsReadThreadConcurrency: 0.00
         - AverageScannerThreadConcurrency: 32.75
 <b>        - BytesRead: 10.47 GB (11240756479)
         - BytesReadDataNodeCache: 10.47 GB (11240756479)</b>
         - BytesReadLocal: 10.47 GB (11240756479)
         - BytesReadShortCircuit: 10.47 GB (11240756479)
         - DecompressionTime: 27s572ms
 </codeblock>

       <p>
         For queries involving smaller amounts of data, or in single-user workloads, you might not notice a
         significant difference in query response time with or without HDFS caching. Even with HDFS caching turned
         off, the data for the query might still be in the Linux OS buffer cache. The benefits become clearer as
         data volume increases, and especially as the system processes more concurrent queries. HDFS caching
         improves the scalability of the overall system. That is, it prevents query performance from declining when
         the workload outstrips the capacity of the Linux OS cache.
       </p>

       <p conref="../shared/impala_common.xml#common/hdfs_caching_encryption_caveat"/>

       <p>
         <b>SELECT considerations:</b>
       </p>

       <p>
         The Impala HDFS caching feature interacts with the
         <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement and query performance as
         follows:
       </p>

       <ul>
         <li>
           Impala automatically reads from memory any data that has been designated as cached and actually loaded
           into the HDFS cache. (It could take some time after the initial request to fully populate the cache for a
           table with large size or many partitions.) The speedup comes from two aspects: reading from RAM instead
           of disk, and accessing the data straight from the cache area instead of copying from one RAM area to
           another. This second aspect yields further performance improvement over the standard OS caching
           mechanism, which still results in memory-to-memory copying of cached data.
         </li>

         <li>
           For small amounts of data, the query speedup might not be noticeable in terms of wall clock time. The
           performance might be roughly the same with HDFS caching turned on or off, due to recently used data being
           held in the Linux OS cache. The difference is more pronounced with:
           <ul>
             <li>
               Data volumes (for all queries running concurrently) that exceed the size of the Linux OS cache.
             </li>

             <li>
               A busy cluster running many concurrent queries, where the reduction in memory-to-memory copying and
               overall memory usage during queries results in greater scalability and throughput.
             </li>

             <li>
               Thus, to really exercise and benchmark this feature in a development environment, you might need to
               simulate realistic workloads and concurrent queries that match your production environment.
             </li>

             <li>
               One way to simulate a heavy workload on a lightly loaded system is to flush the OS buffer cache (on
               each DataNode) between iterations of queries against the same tables or partitions:
 <codeblock>$ sync
 $ echo 1 &gt; /proc/sys/vm/drop_caches
 </codeblock>
             </li>
           </ul>
         </li>

         <li>
           Impala queries take advantage of HDFS cached data regardless of whether the cache directive was issued by
           Impala or externally through the <cmdname>hdfs cacheadmin</cmdname> command, for example for an external
           table where the cached data files might be accessed by several different Hadoop components.
         </li>

         <li>
           If your query returns a large result set, the time reported for the query could be dominated by the time
           needed to print the results on the screen. To measure the time for the underlying query processing, query
           the <codeph>COUNT()</codeph> of the big result set, which does all the same processing but only prints a
           single line to the screen.
         </li>
       </ul>
     </conbody>
   </concept>
 </concept>