commit	6121c4f7d61fb9f2341cf14e1be3404325fb35b9	[log] [tgz]
author	Michael Smith <michael.smith@cloudera.com>	Tue Mar 12 16:58:35 2024 -0700
committer	Joe McDonnell <joemcdonnell@cloudera.com>	Wed Apr 10 03:11:49 2024 +0000
tree	4d2748b9792d8a611fc16b2bb888e0ddef9f0452
parent	4764b91f42820e0cd114268592e090285ea79b3c [diff]

IMPALA-12905: Disk-based tuple caching

This implements on-disk caching for the tuple cache. The
TupleCacheNode uses the TupleFileWriter and TupleFileReader
to write and read back tuples from local files. The file format
uses RowBatch's standard serialization used for KRPC data streams.

The TupleCacheMgr is the daemon-level structure that coordinates
the state machine for cache entries, including eviction. When a
writer is adding an entry, it inserts an IN_PROGRESS entry before
starting to write data. This does not count towards cache capacity,
because the total size is not known yet. This IN_PROGRESS entry
prevents other writers from concurrently writing the same entry.
If the write is successful, the entry transitions to the COMPLETE
state and updates the total size of the entry. If the write is
unsuccessful and a new execution might succeed, then the entry is
removed. If the write is unsuccessful and won't succeed later
(e.g. if the total size of the entry exceeds the max size of an
entry), then it transitions to the TOMBSTONE state. TOMBSTONE
entries avoid the overhead of trying to write entries that are
too large.

Given these states, when a TupleCacheNode is doing its initial
Lookup() call, one of three things can happen:
 1. It can find a COMPLETE entry and read it.
 2. It can find an IN_PROGRESS/TOMBSTONE entry, which means it
    cannot read or write the entry.
 3. It finds no entry and inserts its own IN_PROGRESS entry
    to start a write.

The tuple cache is configured using the tuple_cache parameter,
which is a combination of the cache directory and the capacity
similar to the data_cache parameter. For example, /data/0:100GB
uses directory /data/0 for the cache with a total capacity of
100GB. This currently supports a single directory, but it can
be expanded to multiple directories later if needed. The cache
eviction policy can be specified via the tuple_cache_eviction_policy
parameter, which currently supports LRU or LIRS. The tuple_cache
parameter cannot be specified if allow_tuple_caching=false.

This contains contributions from Michael Smith, Yida Wu,
and Joe McDonnell.

Testing:
 - This adds basic custom cluster tests for the tuple cache.

Change-Id: I13a65c4c0559cad3559d5f714a074dd06e9cc9bf
Reviewed-on: http://gerrit.cloudera.org:8080/21171
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>

18 files changed

tree: 4d2748b9792d8a611fc16b2bb888e0ddef9f0452

README.md

Welcome to Impala

Lightning-fast, distributed SQL queries for petabytes of data stored in open data and table formats.

Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources:

Best of breed performance and scalability.
Support for data stored in Apache Iceberg, HDFS, Apache HBase, Apache Kudu, Amazon S3, Azure Data Lake Storage, Apache Hadoop Ozone and more!
Wide analytic SQL support, including window functions and subqueries.
On-the-fly code generation using LLVM to generate lightning-fast code tailored specifically to each individual query.
Support for the most commonly-used Hadoop file formats, including Apache Parquet and Apache ORC.
Support for industry-standard security protocols, including Kerberos, LDAP and TLS.
Apache-licensed, 100% open source.

More about Impala

The fastest way to try out Impala is a quickstart Docker container. You can try out running queries and processing data sets in Impala on a single machine without installing dependencies. It can automatically load test data sets into Apache Kudu and Apache Parquet formats and you can start playing around with Apache Impala SQL within minutes.

To learn more about Impala as a user or administrator, or to try Impala, please visit the Impala homepage. Detailed documentation for administrators and users is available at Apache Impala documentation.

If you are interested in contributing to Impala as a developer, or learning more about Impala's internals and architecture, visit the Impala wiki.

Supported Platforms

Impala only supports Linux at the moment. Impala supports x86_64 and has experimental support for arm64 (as of Impala 4.0). Impala Requirements contains more detailed information on the minimum CPU requirements.

Supported OS Distributions

Impala runs on Linux systems only. The supported distros are

Ubuntu 16.04/18.04
CentOS/RHEL 7/8

Other systems, e.g. SLES12, may also be supported but are not tested by the community.

Export Control Notice

This distribution uses cryptographic software and may be subject to export controls. Please refer to EXPORT_CONTROL.md for more information.

Build Instructions

See Impala's developer documentation to get started.

Detailed build notes has some detailed information on the project layout and build.