blob: d1fb08c4ea2feca1afa673c551616bd7c7edfe62 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu - Pushing Down Predicate Evaluation in Apache Kudu</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href=""
<!-- Custom styles for this template -->
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="" />
<link rel="alternate" type="application/atom+xml"
title="RSS Feed for Apache Kudu blog"
href="/feed.xml" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src=""></script>
<script src=""></script>
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<a class="logo" href="/"><img
srcset="// 1x, // 2x"
alt="Apache Kudu"/></a>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li >
<a href="/">Home</a>
<li >
<a href="/overview.html">Overview</a>
<li >
<a href="/docs/">Documentation</a>
<li >
<a href="/releases/">Releases</a>
<li class="active">
<a href="/blog/">Blog</a>
<!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here
that doesn't also appear elsewhere on the site. -->
<li class="dropdown">
<a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li class="dropdown-header">GET IN TOUCH</li>
<li><a class="icon email" href="/community.html">Mailing Lists</a></li>
<li><a class="icon slack" href="">Slack Channel</a></li>
<li role="separator" class="divider"></li>
<li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li>
<li><a href="/committers.html">Project Committers</a></li>
<!--<li><a href="/roadmap.html">Roadmap</a></li>-->
<li><a href="/community.html#contributions">How to Contribute</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">DEVELOPER RESOURCES</li>
<li><a class="icon github" href="">GitHub</a></li>
<li><a class="icon gerrit" href="">Gerrit Code Review</a></li>
<li><a class="icon jira" href="">JIRA Issue Tracker</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">SOCIAL MEDIA</li>
<li><a class="icon twitter" href="">Twitter</a></li>
<li><a href="">Reddit</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li>
<li><a href="" target="_blank">Security</a></li>
<li><a href="" target="_blank">Sponsorship</a></li>
<li><a href="" target="_blank">Thanks</a></li>
<li><a href="" target="_blank">License</a></li>
<li >
<a href="/faq.html">FAQ</a>
</ul><!-- /.nav -->
</div><!-- /#navbar -->
</div><!-- /.container-fluid -->
<div class="row header">
<div class="col-lg-12">
<h2><a href="/blog">Apache Kudu Blog</a></h2>
<div class="row-fluid">
<div class="col-lg-9">
<h1 class="entry-title">Pushing Down Predicate Evaluation in Apache Kudu</h1>
<p class="meta">Posted 16 Sep 2016 by Andrew Wong</p>
<div class="entry-content">
<p>I had the pleasure of interning with the Apache Kudu team at Cloudera this
summer. This project was my summer contribution to Kudu: a restructuring of the
scan path to speed up queries.</p>
<h2 id="introduction">Introduction</h2>
<p>In Kudu, <em>predicate pushdown</em> refers to the way in which predicates are
handled. When a scan is requested, its predicates are passed through the
different layers of Kudu’s storage hierarchy, allowing for pruning and other
optimizations to happen at each level before reaching the underlying data.</p>
<p>While predicates are pushed down, predicate evaluation itself occurs at a fairly
high level, precluding the evaluation process from certain data-specific
optimizations. These optimizations can make tablet scans an order of magnitude
faster, if not more.</p>
<h2 id="a-day-in-the-life-of-a-query">A Day in the Life of a Query</h2>
<p>Because Kudu is a columnar storage engine, its scan path has a number of
optimizations to avoid extraneous reads, copies, and computation. When a query
is sent to a tablet server, the server prunes tablets based on the
primary key, directing the request to only the tablets that contain the key
range of interest. Once at a tablet, only the columns relevant to the query are
scanned. Further pruning is done over the primary key, and if the query is
predicated on non-key columns, the entire column is scanned. The columns in a
tablet are stored as <em>cfiles</em>, which are split into encoded <em>blocks</em>. Once the
relevant cfiles are determined, the data are materialized by the block
decoders, i.e. their underlying data are decoded and copied into a buffer,
which is passed back to the tablet layer. The tablet can then evaluate the
predicate on the batch of data and mark which rows should be returned to the
<p>One of the encoding types I worked very closely with is <em>dictionary encoding</em>,
an encoding type for strings that performs particularly well for cfiles that
have repeating values. Rather than storing every row’s string, each unique
string is assigned a numeric codeword, and the rows are stored numerically on
disk. When materializing a dictionary block, all of the numeric data are scanned
and all of the corresponding strings are copied and buffered for evaluation.
When the vocabulary of a dictionary-encoded cfile gets too large, the blocks
begin switching to <em>plain encoding mode</em> to act like <em>plain-encoded</em> blocks.</p>
<p>In a plain-encoded block, strings are stored contiguously and the character
offsets to the start of each string are stored as a list of integers. When
materializing, all of the strings are copied to a buffer for evaluation.</p>
<p>Therein lies room for improvement: this predicate evaluation path is the same
for all data types and encoding types. Within the tablet, the correct cfiles
are determined, the cfiles’ decoders are opened, all of the data are copied to
a buffer, and the predicates are evaluated on this buffered data via
type-specific comparators. This path is extremely flexible, but because it was
designed to be encoding-independent, there is room for improvement.</p>
<h2 id="trimming-the-fat">Trimming the Fat</h2>
<p>The first step is to allow the decoders access to the predicate. In doing so,
each encoding type can specialize its evaluation. Additionally, this puts the
decoder in a position where it can determine whether a given row satisfies the
query, which in turn, allows the decoders to determine what data gets copied
instead of eagerly copying all of its data to get evaluated.</p>
<p>Take the case of dictionary-encoded strings as an example. With the existing
scan path, not only are all of the strings in a column copied into a buffer, but
string comparisons are done on every row. By taking advantage of the fact that
the data can be represented as integers, the cost of determining the query
results can be greatly reduced. The string comparisons can be swapped out with
evaluation based on the codewords, in which case the room for improvement boils
down to how to most quickly determine whether or not a given codeword
corresponds to a string that satisfies the predicate. Dictionary columns will
now use a bitset to store the codewords that match the predicates. It will then
scan through the integer-valued data and checks the bitset to determine whether
it should copy the corresponding string over.</p>
<p>This is great in the best case scenario where a cfile’s vocabulary is small,
but when the vocabulary gets too large and the dictionary blocks switch to plain
encoding mode, performance is hampered. In this mode, the blocks don’t utilize
any dictionary metadata and end up wasting the codeword bitset. That isn’t to
say all is lost: the decoders can still evaluate a predicate via string
comparison, and the fact that evaluation can still occur at the decoder-level
means the eager buffering can still be avoided.</p>
<p>Dictionary encoding is a perfect storm in that the decoders can completely
evaluate the predicates. This is not the case for most other encoding types,
but having decoders support evaluation leaves the door open for other encoding
types to extend this idea.</p>
<h2 id="performance">Performance</h2>
<p>Depending on the dataset and query, predicate pushdown can lead to significant
improvements. Tablet scans were timed with datasets consisting of repeated
string patterns of tunable length and tunable cardinality.</p>
<p><img src="/img/predicate-pushdown/pushdown-10.png" alt="png" class="img-responsive" />
<img src="/img/predicate-pushdown/pushdown-10M.png" alt="png" class="img-responsive" /></p>
<p>The above plots show the time taken to completely scan a single tablet, recorded
using a dataset of ten million rows of strings with length ten. Predicates were
designed to select values out of bounds (Empty), select a single value (Equal,
i.e. for cardinality <em>k</em>, this would select 1/<em>k</em> of the dataset), select half
of the full range (Half), and select the full range of values (All).</p>
<p>With the original evaluation implementation, the tablet must copy and scan
through the tablet to determine whether any values match. This means that even
when the result set is small, the full column is still copied. This is avoided
by pushing down predicates, which only copies as needed, and can be seen in the
above queries: those with near-empty result sets (Empty and Equal) have shorter
scan times than those with larger result sets (Half and All).</p>
<p>Note that for dictionary encoding, given a low cardinality, Kudu can completely
rely on the dictionary codewords to evaluate, making the query significantly
faster. At higher cardinalities, the dictionaries completely fill up and the
blocks fall back on plain encoding. The slower, albeit still improved,
performance on the dataset containing 10M unique values reflects this.</p>
<p><img src="/img/predicate-pushdown/pushdown-tpch.png" alt="png" class="img-responsive" /></p>
<p>Similar predicates were run with the TPC-H dataset, querying on the shipdate
column. The full path of a query includes not only the tablet scanning itself,
but also RPCs and batched data transfer to the caller as the scan progresses.
As such, the times plotted above refer to the average end-to-end time required
to scan and return a batch of rows. Regardless of this additional overhead,
significant improvements on the scan path still yield substantial improvements
to the query performance as a whole.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Pushing down predicate evaluation in Kudu yielded substantial improvements to
the scan path. For dictionary encoding, pushdown can be particularly powerful,
and other encoding types are either unaffected or also improved. This change has
been pushed to the main branch of Kudu, and relevant commits can be found
<a href="">here</a>
<a href="">here</a>.</p>
<p>This summer has been a phenomenal learning experience for me, in terms of the
tools, the workflow, the datasets, the thought-processes that go into building
something at Kudu’s scale. I am extremely thankful for all of the mentoring and
support I received, and that I got to be a part of Kudu’s journey from
incubating to a Top Level Apache project. I can’t express enough how grateful I
am for the amount of support I got from the Kudu team, from the intern
coordinators, and from the Cloudera community as a whole.</p>
<div class="col-lg-3 recent-posts">
<h3>Recent posts</h3>
<li> <a href="/2020/07/30/building-near-real-time-big-data-lake.html">Building Near Real-time Big Data Lake</a> </li>
<li> <a href="/2020/05/18/apache-kudu-1-12-0-release.html">Apache Kudu 1.12.0 released</a> </li>
<li> <a href="/2019/11/20/apache-kudu-1-11-1-release.html">Apache Kudu 1.11.1 released</a> </li>
<li> <a href="/2019/11/20/apache-kudu-1-10-1-release.html">Apache Kudu 1.10.1 released</a> </li>
<li> <a href="/2019/07/09/apache-kudu-1-10-0-release.html">Apache Kudu 1.10.0 Released</a> </li>
<li> <a href="/2019/04/30/location-awareness.html">Location Awareness in Kudu</a> </li>
<li> <a href="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html">Fine-Grained Authorization with Apache Kudu and Impala</a> </li>
<li> <a href="/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html">Testing Apache Kudu Applications on the JVM</a> </li>
<li> <a href="/2019/03/15/apache-kudu-1-9-0-release.html">Apache Kudu 1.9.0 Released</a> </li>
<li> <a href="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html">Transparent Hierarchical Storage Management with Apache Kudu and Impala</a> </li>
<li> <a href="/2018/12/11/call-for-posts.html">Call for Posts</a> </li>
<li> <a href="/2018/10/26/apache-kudu-1-8-0-released.html">Apache Kudu 1.8.0 Released</a> </li>
<li> <a href="/2018/09/26/index-skip-scan-optimization-in-kudu.html">Index Skip Scan Optimization in Kudu</a> </li>
<li> <a href="/2018/09/11/simplified-pipelines-with-kudu.html">Simplified Data Pipelines with Kudu</a> </li>
<li> <a href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">Getting Started with Kudu - an O'Reilly Title</a> </li>
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p class="small">
Copyright &copy; 2019 The Apache Software Foundation.
<p class="small">
Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu
project logo are either registered trademarks or trademarks of The
Apache Software Foundation in the United States and other countries.
<div class="col-md-3">
<a class="pull-right" href="">
<img src=""/>
<script src=""></script>
// Try to detect touch-screen devices. Note: Many laptops have touch screens.
$(document).ready(function() {
if ("ontouchstart" in document.documentElement) {
} else {
<script src=""
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
ga('create', 'UA-68448017-1', 'auto');
ga('send', 'pageview');
<script src=""></script>
anchors.options = {
placement: 'right',
visible: 'touch',