<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="data_cache">
<title>Data Cache for Remote Reads</title>
<conbody>
<p>
When Impala compute nodes and their storage are not co-located, the network bandwidth
requirement goes up because the network traffic includes the data fetches as well as the
shuffling exchange traffic of intermediate results.
</p>
<p>
To mitigate the pressure on the network, you can enable the compute nodes to cache the
working set read from remote filesystems, such as remote HDFS data nodes, S3, ABFS, and ADLS.
</p>
<p>
To enable the remote data cache, set the <codeph>--data_cache</codeph> Impala Daemon start-up
flag as follows:
</p>
<codeblock>--data_cache=<varname>dir1</varname>,<varname>dir2</varname>,<varname>dir3</varname>,...:<varname>quota</varname></codeblock>
<p>
The flag is set to a list of directories, separated by <codeph>,</codeph>, followed by a
<codeph>:</codeph>, and a capacity <codeph><varname>quota</varname></codeph> per
directory.
</p>
<p>
If set to an empty string, data caching is disabled.
</p>
<p>
Cached data is stored in the specified directories.
</p>
<p>
The specified directories must exist in the local filesystem of each Impala Daemon, or
Impala will fail to start.
</p>
<p>
In addition, the filesystem on which each directory resides must support hole punching.
</p>
<p>
The cache can consume up to <codeph>quota</codeph> bytes in each of the specified
directories.
</p>
<p>
The default setting for <codeph>--data_cache</codeph> is an empty string.
</p>
<p>
For example, with the following setting, the data cache may use up to 1 TB in total, with a
maximum of 500 GB in each of <codeph>/data/0</codeph> and <codeph>/data/1</codeph>.
</p>
<codeblock>--data_cache=/data/0,/data/1:500GB</codeblock>
<p> In Impala 3.4 and higher, you can configure one of the following cache eviction policies for
the data cache: <ul>
<li>LRU (Least Recently Used), the default</li>
<li>LIRS (Low Inter-reference Recency Set)</li>
</ul> LIRS is a scan-resistant policy with low performance overhead. You configure the cache
eviction policy using the <codeph>--data_cache_eviction_policy</codeph> Impala Daemon start-up
flag: </p>
<p>
<codeblock>--data_cache_eviction_policy=<varname>policy</varname>
</codeblock>
</p>
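<p>
As an illustration only, the following start-up flags combine the two example cache
directories shown earlier with the LIRS policy. The value <codeph>LIRS</codeph> is assumed
to match the policy name listed above; check the flag's help text in your Impala release
for the exact accepted values.
</p>
<codeblock>--data_cache=/data/0,/data/1:500GB
--data_cache_eviction_policy=LIRS</codeblock>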
<note>A cache entry does not expire as long as queries use the same file metadata.
This is because the cache key consists of the filename, the mtime (last modified time of the
file), and the file offset. If the mtime in the file metadata remains unchanged, scan requests
continue to be served from the cache, provided that the cache has enough capacity.</note>
</conbody>
</concept>