<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2021-01-28T14:27:14-08:00</updated><id>/feed.xml</id><entry><title type="html">Optimized joins &amp;amp; filtering with Bloom filter predicate in Kudu</title><link href="/2021/01/15/bloom-filter-predicate.html" rel="alternate" type="text/html" title="Optimized joins &amp;amp; filtering with Bloom filter predicate in Kudu" /><published>2021-01-15T00:00:00-08:00</published><updated>2021-01-15T00:00:00-08:00</updated><id>/2021/01/15/bloom-filter-predicate</id><content type="html" xml:base="/2021/01/15/bloom-filter-predicate.html">&lt;p&gt;Note: This is a cross-post from the Cloudera Engineering Blog
&lt;a href=&quot;https://blog.cloudera.com/optimized-joins-filtering-with-bloom-filter-predicate-in-kudu/&quot;&gt;Optimized joins &amp;amp; filtering with Bloom filter predicate in Kudu&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Cloudera’s CDP Runtime version 7.1.5 maps to Apache Kudu 1.13 and the upcoming Apache Impala 4.0.&lt;/p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In database systems one of the most effective ways to improve performance is to avoid doing
unnecessary work, such as network transfers and reading data from disk. One of the ways Apache
Kudu achieves this is by supporting column predicates with scanners. Pushing down column predicate
filters to Kudu allows for optimized execution by skipping reading column values for filtered out
rows and reducing network IO between a client, like the distributed query engine Apache Impala, and
Kudu. See the documentation on
&lt;a href=&quot;https://docs.cloudera.com/runtime/latest/impala-reference/topics/impala-runtime-filtering.html&quot;&gt;runtime filtering in Impala&lt;/a&gt;
for details.&lt;/p&gt;
&lt;p&gt;CDP Runtime 7.1.5 and CDP Public Cloud added support for Bloom filter column predicate pushdown in
Kudu and the associated integration in Impala.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;bloom-filter&quot;&gt;Bloom filter&lt;/h2&gt;
&lt;p&gt;A Bloom filter is a space-efficient probabilistic data structure used to test set membership with a
possibility of false positive matches. In database systems these are used to determine whether a
set of data can be ignored when only a subset of the records are required. See the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Bloom_filter&quot;&gt;wikipedia page&lt;/a&gt; for more details.&lt;/p&gt;
&lt;p&gt;The implementation used in Kudu is a space-, hash-, and cache-efficient block-based Bloom filter from
&lt;a href=&quot;https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf&quot;&gt;“Cache-, Hash- and Space-Efficient Bloom Filters”&lt;/a&gt;
by Putze et al. This Bloom filter was taken from the implementation in Impala and further enhanced.
The block based Bloom filter is designed to fit in CPU cache, and it allows SIMD operations using
AVX2, when available, for efficient lookup and insertion.&lt;/p&gt;
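&lt;p&gt;To make the layout concrete, here is a minimal, hedged sketch of a block-based Bloom filter in C++. It is illustrative only: the salt constants and 256-bit block shape loosely follow the Impala-derived design linked under References below, and it omits proper sizing, value hashing, and the AVX2 vectorization mentioned above.&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;vector&amp;gt;

// A key touches exactly one 256-bit block, so a lookup costs at most one
// cache miss; production code vectorizes the per-block bit math with AVX2.
class BlockBloomFilter {
 public:
  explicit BlockBloomFilter(size_t num_blocks) : blocks_(num_blocks) {}

  void Insert(uint64_t hash) {
    Block&amp;amp; b = blocks_[(hash &amp;gt;&amp;gt; 32) % blocks_.size()];
    for (int i = 0; i &amp;lt; 8; ++i) {
      b.words[i] |= 1u &amp;lt;&amp;lt; BitPos(hash, i);
    }
  }

  // False negatives are impossible; false positives are possible.
  bool Find(uint64_t hash) const {
    const Block&amp;amp; b = blocks_[(hash &amp;gt;&amp;gt; 32) % blocks_.size()];
    for (int i = 0; i &amp;lt; 8; ++i) {
      if ((b.words[i] &amp;amp; (1u &amp;lt;&amp;lt; BitPos(hash, i))) == 0) return false;
    }
    return true;
  }

 private:
  struct Block { uint32_t words[8] = {0}; };  // 256 bits per block

  // Derive one bit position per 32-bit word from the low hash bits using
  // odd multiplier constants (values here are illustrative).
  static uint32_t BitPos(uint64_t hash, int i) {
    static const uint32_t kSalt[8] = {
        0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
        0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
    return (static_cast&amp;lt;uint32_t&amp;gt;(hash) * kSalt[i]) &amp;gt;&amp;gt; 27;  // 0..31
  }

  std::vector&amp;lt;Block&amp;gt; blocks_;
};&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;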
&lt;p&gt;Consider the case of a broadcast hash join between a small table and a big table where predicate
push down is not available. This typically involves the following steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the entire small table and construct a hash table from it.&lt;/li&gt;
&lt;li&gt;Broadcast the generated hash table to all worker nodes.&lt;/li&gt;
&lt;li&gt;On the worker nodes start fetching and iterating on slices of the big table, check whether the
key in the big table exists in the hash table, and only return the matched rows.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Step 3 is the heaviest since it involves reading the entire big table, and it can involve heavy
network IO if the workers and the nodes hosting the big table are not co-located.&lt;/p&gt;
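&lt;p&gt;As a minimal sketch (with hypothetical names), the loop below models step 3 on a worker: the broadcast hash table is the build side, and a pushed-down Bloom filter stands in for the predicate that, as described next, Kudu can evaluate during the scan so most non-matching rows are skipped before they are read in full or sent over the network.&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;unordered_set&amp;gt;
#include &amp;lt;vector&amp;gt;

// hash_table: build side broadcast from the small table (steps 1-2).
// big_rows: one slice of the big table on this worker.
// pushed_filter: models a Bloom filter predicate evaluated by the storage
// layer; without pushdown every big-table row crosses the network first.
std::vector&amp;lt;uint64_t&amp;gt; ProbeSlice(
    const std::unordered_set&amp;lt;uint64_t&amp;gt;&amp;amp; hash_table,
    const std::vector&amp;lt;uint64_t&amp;gt;&amp;amp; big_rows,
    const std::function&amp;lt;bool(uint64_t)&amp;gt;&amp;amp; pushed_filter) {
  std::vector&amp;lt;uint64_t&amp;gt; matches;
  for (uint64_t key : big_rows) {      // step 3
    if (!pushed_filter(key)) continue; // cheap test; may false-positive
    if (hash_table.count(key) &amp;gt; 0) {   // exact join check
      matches.push_back(key);
    }
  }
  return matches;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;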
&lt;p&gt;Before 7.1.5, Impala supported pushing down only the Minimum/Maximum (MIN_MAX) runtime filter to
Kudu which filters out values not within the specified bounds. In addition to the MIN_MAX runtime
filter, Impala in CDP 7.1.5+ now supports pushing down a runtime Bloom filter to Kudu. With the
newly introduced Bloom filter predicate support in Kudu, Impala can use this feature to perform
drastically more efficient joins for data stored in Kudu.&lt;/p&gt;
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;As in the scenario described above, we ran an Impala query which joins a big table stored on Kudu
and a small table stored as Parquet on HDFS. The small table was created using Parquet on HDFS to
isolate the new feature, but could also be stored in Kudu just the same. We ran the queries first
using only the MIN_MAX filter and then using both the MIN_MAX and BLOOM filter
(ALL runtime filters). For comparison, we created the same big table in Parquet on HDFS. Using
Parquet on HDFS is a great baseline for comparison because Impala already supports both MIN_MAX and
BLOOM filters for Parquet on HDFS.&lt;/p&gt;
&lt;h2 id=&quot;setup&quot;&gt;Setup&lt;/h2&gt;
&lt;p&gt;The following test was performed on a 6 node cluster with CDP Runtime 7.1.5.&lt;/p&gt;
&lt;p&gt;Hardware Configuration:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dell PowerEdge R430, 20c/40t Xeon E5-2630 v4 @ 2.2GHz, 128GB RAM, 4 x 2TB HDDs with 1 for WAL and 3
for data directories.&lt;/code&gt;&lt;/p&gt;
&lt;h3 id=&quot;schema&quot;&gt;Schema:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Big table consists of 260 million rows with randomly generated data hash partitioned by primary
key across 20 partitions on Kudu. The Kudu table was explicitly rebalanced to ensure a balanced
layout after the load.&lt;/li&gt;
&lt;li&gt;Small table consists of 2000 rows: the top 1000 and bottom 1000 keys from the big table, stored
as Parquet on HDFS. This prevents the MIN_MAX filters from filtering anything on the big table, as
all rows fall within the range bounds of the MIN_MAX filters.&lt;/li&gt;
&lt;li&gt;COMPUTE STATS was run on all tables to gather table metadata and help
Impala optimize the query plan.&lt;/li&gt;
&lt;li&gt;All queries were run 10 times and the mean query runtime is depicted below.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;join-queries&quot;&gt;Join Queries&lt;/h2&gt;
&lt;p&gt;For join queries, we saw performance improvements of 3X to 5X in Kudu with Bloom filter predicate
pushdown. We expect to see even better performance multiples with larger data sizes and more
selective queries.&lt;/p&gt;
&lt;p&gt;Compared to Parquet on HDFS, Kudu performance is now better by around 17-33%.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/bloom-filter-join-queries.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;update-query&quot;&gt;Update Query&lt;/h2&gt;
&lt;p&gt;For an update query that upserts the entire small table into the existing big table, we saw a
15X improvement, primarily due to the increased query performance when selecting the rows to
update.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/bloom-filter-update-query.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;See references section below for details on the table schema, loading process, and queries that were
run.&lt;/p&gt;
&lt;h2 id=&quot;tpc-h&quot;&gt;TPC-H&lt;/h2&gt;
&lt;p&gt;We also ran the TPC-H benchmark on a single node cluster with a scale factor of 30 and saw
performance improvements in the range of 19% to 31% with different block cache capacity settings.&lt;/p&gt;
&lt;p&gt;Kudu automatically disables Bloom filter predicates that are not effectively filtering data to avoid
any performance penalties from the new feature. During development of the feature, query 9 of the
TPC-H benchmark (TPCH-Q9) exhibited a regression of 50-96%: the time required to scan rows from
Kudu increased by up to 2X. Investigating the regression, we found that the Bloom filter predicate
being pushed down was filtering out less than 10% of the rows, and the resulting extra CPU usage in
Kudu outweighed the benefit of the filter. To resolve the regression we added a heuristic in Kudu:
if a Bloom filter predicate is not filtering out a sufficient percentage of rows, it is
automatically disabled for the remainder of the scan. This is safe because Bloom filters can return
false positives, so any false matches returned to the client are expected to be filtered out by
other deterministic filters.&lt;/p&gt;
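&lt;p&gt;A sketch of that kind of heuristic is below; the sampling window and the 10% threshold are illustrative assumptions, not Kudu’s actual values.&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;#include &amp;lt;cstdint&amp;gt;

// Tracks how selective a Bloom filter predicate is early in a scan and
// disables it if it does not reject enough rows to pay for its CPU cost.
struct BloomPredicateStats {
  uint64_t considered = 0;
  uint64_t rejected = 0;
  bool enabled = true;

  void Observe(bool row_rejected) {
    if (!enabled) return;
    ++considered;
    if (row_rejected) ++rejected;
    // Disabling is safe: a disabled Bloom filter merely lets extra rows
    // through, and downstream deterministic predicates remove them.
    if (considered == 4096 &amp;amp;&amp;amp; rejected * 10 &amp;lt; considered) {
      enabled = false;  // rejecting under 10% of rows: not worth the CPU
    }
  }
};&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;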
&lt;h2 id=&quot;feature-availability&quot;&gt;Feature Availability&lt;/h2&gt;
&lt;p&gt;Users querying Kudu through Impala have the feature enabled by default from CDP 7.1.5 onward and
in CDP Public Cloud. We highly recommend users upgrade to get this performance enhancement and many
other performance enhancements in the release. For custom applications that use the Kudu client API
directly, the Kudu C++ client also has the Bloom filter predicate available from CDP 7.1.5 onward.
The Kudu Java client does not have the Bloom filter predicate available yet; see
&lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-3221&quot;&gt;KUDU-3221&lt;/a&gt;.&lt;/p&gt;
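&lt;p&gt;For C++ client users, here is a hedged sketch of constructing and pushing a Bloom filter predicate, modeled on the predicate-test.cc example linked under References below. The exact builder and predicate signatures should be checked against the client header for your Kudu version, and the INT64 column &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;key&lt;/code&gt; is an assumption.&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;vector&amp;gt;

#include &quot;kudu/client/client.h&quot;

using kudu::Slice;
using kudu::Status;
using kudu::client::KuduBloomFilter;
using kudu::client::KuduBloomFilterBuilder;
using kudu::client::KuduPredicate;
using kudu::client::KuduScanner;
using kudu::client::KuduTable;

// Builds a Bloom filter from the probe-side keys and attaches it to an
// open scanner on `table` as an IN-Bloom-filter predicate.
Status AddBloomFilterPredicate(KuduTable* table, KuduScanner* scanner,
                               const std::vector&amp;lt;int64_t&amp;gt;&amp;amp; probe_keys) {
  KuduBloomFilterBuilder builder(probe_keys.size());
  builder.false_positive_probability(0.01);
  KuduBloomFilter* bf = nullptr;
  Status s = builder.Build(&amp;amp;bf);
  if (!s.ok()) return s;
  for (int64_t k : probe_keys) {
    bf-&amp;gt;Insert(Slice(reinterpret_cast&amp;lt;const char*&amp;gt;(&amp;amp;k), sizeof(k)));
  }
  // On success the returned predicate takes ownership of the filters.
  std::vector&amp;lt;KuduBloomFilter*&amp;gt; filters;
  filters.push_back(bf);
  KuduPredicate* pred = table-&amp;gt;NewInBloomFilterPredicate(&quot;key&quot;, &amp;amp;filters);
  return scanner-&amp;gt;AddConjunctPredicate(pred);
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;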
&lt;h2 id=&quot;references&quot;&gt;References:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Performance testing related schema and queries:
&lt;a href=&quot;https://gist.github.com/bbhavsar/006df9c40b4b0528e297fac29824ceb4&quot;&gt;https://gist.github.com/bbhavsar/006df9c40b4b0528e297fac29824ceb4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kudu C++ client documentation:
&lt;a href=&quot;https://kudu.apache.org/cpp-client-api/classkudu_1_1client_1_1KuduTable.html#a356e8d0d10491d4d8540adefac86be94&quot;&gt;https://kudu.apache.org/cpp-client-api/classkudu_1_1client_1_1KuduTable.html#a356e8d0d10491d4d8540adefac86be94&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Example code to create and pass Bloom filter predicate:
&lt;a href=&quot;https://github.com/apache/kudu/blob/master/src/kudu/client/predicate-test.cc#L1416&quot;&gt;https://github.com/apache/kudu/blob/master/src/kudu/client/predicate-test.cc#L1416&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Block based Bloom filter:
&lt;a href=&quot;https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h#L51&quot;&gt;https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h#L51&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;
&lt;p&gt;This feature was implemented jointly by Bankim Bhavsar and Wenzhe Zhou with guidance and feedback
from Tim Armstrong, Adar Dembo, Thomas Tauber-Marshall, Andrew Wong, and Grant Henke. We are also
grateful to our customers, especially Mauricio Aristizabal from Impact, for providing valuable
feedback and benchmarks.&lt;/p&gt;</content><author><name>Bankim Bhavsar</name></author><summary type="html">Note: This is a cross-post from the Cloudera Engineering Blog Optimized joins &amp;amp; filtering with Bloom filter predicate in Kudu Cloudera’s CDP Runtime version 7.1.5 maps to Apache Kudu 1.13 and upcoming Apache Impala 4.0 Introduction In database systems one of the most effective ways to improve performance is to avoid doing unnecessary work, such as network transfers and reading data from disk. One of the ways Apache Kudu achieves this is by supporting column predicates with scanners. Pushing down column predicate filters to Kudu allows for optimized execution by skipping reading column values for filtered out rows and reducing network IO between a client, like the distributed query engine Apache Impala, and Kudu. See the documentation on runtime filtering in Impala for details. CDP Runtime 7.1.5 and CDP Public Cloud added support for Bloom filter column predicate pushdown in Kudu and the associated integration in Impala.</summary></entry><entry><title type="html">Apache Kudu 1.13.0 released</title><link href="/2020/09/21/apache-kudu-1-13-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.13.0 released" /><published>2020-09-21T00:00:00-07:00</published><updated>2020-09-21T00:00:00-07:00</updated><id>/2020/09/21/apache-kudu-1-13-0-release</id><content type="html" xml:base="/2020/09/21/apache-kudu-1-13-0-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.13.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;Added table ownership support. All newly created tables are automatically
owned by the user creating them. It is also possible to change the owner by
altering the table. You can also assign privileges to table owners via Apache
Ranger.&lt;/li&gt;
&lt;li&gt;An experimental feature has been added to Kudu that allows it to automatically
rebalance tablet replicas among tablet servers. The background task can be
enabled by setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--auto_rebalancing_enabled&lt;/code&gt; flag on the Kudu masters.
Before starting auto-rebalancing on an existing cluster, the CLI rebalancer
tool should be run first.&lt;/li&gt;
&lt;li&gt;Bloom filter column predicate pushdown has been added to allow optimized
execution of filters that match on a set of column values with a tolerable
false-positive rate. Support for Impala queries utilizing the Bloom filter
predicate is available, yielding performance improvements of 19% to 30% in TPC-H
benchmarks and around 41% for distributed joins across large
tables. Support for Spark is not yet available.&lt;/li&gt;
&lt;li&gt;ARM-based architectures are now supported.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above is just a list of the highlights; for a more complete list of new
features, improvements and fixes please refer to the &lt;a href=&quot;/releases/1.13.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.13.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.13.0&quot;&gt;1.13.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.13.0/docs/installation.html#build_from_source&quot;&gt;1.13.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.13.0&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;, including for AArch64-based
architectures (ARM).&lt;/p&gt;</content><author><name>Attila Bukor</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.13.0! The new release adds several new features and improvements, including the following:</summary></entry><entry><title type="html">Fine-Grained Authorization with Apache Kudu and Apache Ranger</title><link href="/2020/08/11/fine-grained-authz-ranger.html" rel="alternate" type="text/html" title="Fine-Grained Authorization with Apache Kudu and Apache Ranger" /><published>2020-08-11T00:00:00-07:00</published><updated>2020-08-11T00:00:00-07:00</updated><id>/2020/08/11/fine-grained-authz-ranger</id><content type="html" xml:base="/2020/08/11/fine-grained-authz-ranger.html">&lt;p&gt;When Apache Kudu was first released in September 2016, it didn’t support any
kind of authorization. Anyone who could access the cluster could do anything
they wanted. To remedy this, coarse-grained authorization was added along with
authentication in Kudu 1.3.0. This meant allowing only certain users to access
Kudu, but those who were allowed access could still do whatever they wanted. The
only way to achieve finer-grained access control was to limit access to Apache
Impala where access control &lt;a href=&quot;/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html&quot;&gt;could be enforced&lt;/a&gt; by
fine-grained policies in Apache Sentry. This method limited how Kudu could be
accessed, so we saw a need to implement fine-grained access control in a way
that wouldn’t limit access to Impala only.&lt;/p&gt;
&lt;p&gt;Kudu 1.10.0 integrated with Apache Sentry to enable finer-grained authorization
policies. This integration was rather short-lived as it was deprecated in Kudu
1.12.0 and will be completely removed in Kudu 1.13.0.&lt;/p&gt;
&lt;p&gt;Most recently, since 1.12.0, Kudu supports fine-grained authorization by
integrating with Apache Ranger 2.1 and later. In this post, we’ll cover how this
works and how to set it up.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;
&lt;p&gt;Ranger supports a wide range of software across the Apache Hadoop ecosystem, but
unlike Sentry, it doesn’t depend on any of them for fine-grained authorization,
making it an ideal choice for Kudu.&lt;/p&gt;
&lt;p&gt;Ranger consists of an Admin server that has a web UI and a REST API where admins
can create policies. The policies are stored in a database (supported database
systems are Microsoft SQL Server, MySQL, Oracle, PostgreSQL, and SQL Anywhere)
and are periodically fetched and cached by the Ranger plugin that runs on the
Kudu Masters. The Ranger plugin is responsible for authorizing the requests
against the cached policies. At the time of writing this post, the Ranger plugin
base is available only in Java, as most Hadoop ecosystem projects, including
Ranger, are written in Java.&lt;/p&gt;
&lt;p&gt;Unlike Sentry’s client, which we reimplemented in C++, the Ranger plugin is a fat
client that handles the evaluation of the policies (which are much richer and
more complex than Sentry policies) locally, so we decided not to reimplement it
in C++.&lt;/p&gt;
&lt;p&gt;Each Kudu Master spawns a JVM child process that is effectively a wrapper around
the Ranger plugin and communicates with it via named pipes.&lt;/p&gt;
&lt;h2 id=&quot;prerequisites&quot;&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;This post assumes the Admin Tool of a compatible Ranger version is
&lt;a href=&quot;https://ranger.apache.org/quick_start_guide.html&quot;&gt;installed&lt;/a&gt; on a host that is
reachable both by you and by all Kudu Master servers.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: At the time of writing this post, Ranger 2.0 is the most recent release
which does NOT support Kudu yet. Ranger 2.1 will be the first version that
supports Kudu. If you wish to use Kudu with Ranger before this is released, you
either need to build Ranger from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;master&lt;/code&gt; branch or use a distribution that
has already backported the relevant bits
(&lt;a href=&quot;https://issues.apache.org/jira/browse/RANGER-2684&quot;&gt;RANGER-2684&lt;/a&gt;:
0b23df7801062cc7836f2e162e1775101898add4).&lt;/p&gt;
&lt;p&gt;To enable Ranger integration in Kudu, Java 8 or later has to be available on the
Master servers.&lt;/p&gt;
&lt;p&gt;You can build the Ranger subprocess by navigating to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java/&lt;/code&gt; directory inside the Kudu
source tree, then running the command below:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./gradlew :kudu-subprocess:jar&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
&lt;p&gt;This will build the subprocess JAR which you can find in the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu-subprocess/build/libs&lt;/code&gt; directory.&lt;/p&gt;
&lt;h2 id=&quot;setting-up-kudu-with-ranger&quot;&gt;Setting up Kudu with Ranger&lt;/h2&gt;
&lt;p&gt;The first step is to add Kudu in Ranger Admin and set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tag.download.auth.users&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;policy.download.auth.users&lt;/code&gt; to the user or service principal name running
the Kudu process (typically &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu&lt;/code&gt;). The former is for downloading tag-based
policies which Kudu doesn’t currently support, so this is only for forward
compatibility and can be safely omitted.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-ranger/create-service.png&quot; alt=&quot;create-service&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Next, you’ll have to configure the Ranger plugin. As it’s written in Java and is
part of the Hadoop ecosystem, it expects to find a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core-site.xml&lt;/code&gt; in its
classpath that at a minimum configures the authentication types (simple or
Kerberos) and the group mapping. If your Kudu is co-located with a Hadoop
cluster, you can simply use your Hadoop’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core-site.xml&lt;/code&gt; and it should work.
Otherwise, you can use the below sample &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core-site.xml&lt;/code&gt; assuming you have
Kerberos enabled and shell-based groups mapping works for you:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-xml&quot; data-lang=&quot;xml&quot;&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;hadoop.security.authentication&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;kerberos&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;hadoop.security.group.mapping&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;org.apache.hadoop.security.ShellBasedUnixGroupsMapping&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
&lt;p&gt;In addition to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core-site.xml&lt;/code&gt; file, you’ll also need a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger-kudu-security.xml&lt;/code&gt; in the same directory that looks like this:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-xml&quot; data-lang=&quot;xml&quot;&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.policy.cache.dir&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;/path/to/policy/cache/&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.service.name&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;kudu&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.policy.rest.url&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;http://ranger-admin:6080&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.policy.source.impl&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;org.apache.ranger.admin.client.RangerAdminRESTClient&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.policy.pollIntervalMs&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;30000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.access.cluster.name&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Cluster 1&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.policy.cache.dir&lt;/code&gt; - A directory that is writable by the
user running the Master process where the plugin will cache the policies it
fetches from Ranger Admin.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.service.name&lt;/code&gt; - This needs to be set to whatever the
service name was set to on Ranger Admin.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.policy.rest.url&lt;/code&gt; - The URL of the Ranger Admin REST API.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.policy.source.impl&lt;/code&gt; - This should always be
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;org.apache.ranger.admin.client.RangerAdminRESTClient&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.policy.pollIntervalMs&lt;/code&gt; - This is the interval at which the
plugin will fetch policies from the Ranger Admin.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.access.cluster.name&lt;/code&gt; - The name of the cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: This is a minimal config. For more options refer to the &lt;a href=&quot;https://cwiki.apache.org/confluence/display/RANGER/Index&quot;&gt;Ranger
documentation&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Once these files are created, you need to point Kudu Masters to the directory
containing them with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ranger_config_path&lt;/code&gt; flag. In addition,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ranger_jar_path&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ranger_java_path&lt;/code&gt; should be configured. The Java path
defaults to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$JAVA_HOME/bin/java&lt;/code&gt; if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$JAVA_HOME&lt;/code&gt; is set and falls back to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$PATH&lt;/code&gt; if not. The JAR path defaults to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu-subprocess.jar&lt;/code&gt; in the
directory containing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu-master&lt;/code&gt; binary.&lt;/p&gt;
&lt;p&gt;As the last step, you need to set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-tserver_enforce_access_control&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt; on
the Tablet Servers to make sure access control is respected across the cluster.&lt;/p&gt;
&lt;h2 id=&quot;creating-policies&quot;&gt;Creating policies&lt;/h2&gt;
&lt;p&gt;After setting up the integration it’s time to create some policies: now only
trusted users are allowed to perform any action, and everyone else is locked out.&lt;/p&gt;
&lt;p&gt;To create your first policy, log in to Ranger Admin, click on the Kudu service
you created in the first step of setup, then on the “Add New Policy” button in
the top right corner. You’ll need to name the policy and set the resource it
will apply to. Kudu doesn’t support databases, but with Ranger integration
enabled, it will treat the part of the table name before the first period as the
database name, or default to “default” if the table name doesn’t contain a
period (configurable with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ranger_default_database&lt;/code&gt; flag on the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu-master&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;There is no implicit hierarchy in the resources, which means that granting
privileges on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db=foo&lt;/code&gt; won’t imply privileges on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;foo.bar&lt;/code&gt;. To create a policy
that applies to all tables and all columns in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;foo&lt;/code&gt; database you need to
create a policy for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db=foo-&amp;gt;tbl=*-&amp;gt;col=*&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-ranger/create-policy.png&quot; alt=&quot;create-policy&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For a list of the required privileges to perform operations please refer to our
&lt;a href=&quot;/docs/security.html#policy-for-kudu-masters&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;table-ownership&quot;&gt;Table ownership&lt;/h2&gt;
&lt;p&gt;Kudu 1.13 will introduce table ownership, which enhances the authorization
experience when Ranger integration is enabled. Tables are automatically owned by
the user creating them, and it’s possible to change the owner as part of
an alter table operation.&lt;/p&gt;
&lt;p&gt;Ranger supports granting privileges to the table owners via a special &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{OWNER}&lt;/code&gt;
user. You can, for example, grant the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL&lt;/code&gt; privilege and delegate admin (this
is required to change the owner of a table) to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{OWNER}&lt;/code&gt; on
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db=*-&amp;gt;table=*-&amp;gt;column=*&lt;/code&gt;. This way your users will be able to perform any
actions on the tables they created without having to explicitly assign
privileges per table. They will, of course, need to be granted the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE&lt;/code&gt;
privilege on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db=*&lt;/code&gt; or on a specific database to actually be able to create
their own tables.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-ranger/allow-conditions.png&quot; alt=&quot;allow-conditions&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this post we’ve covered how to set up and use the newest Kudu integration,
Apache Ranger, and taken a sneak peek at the table ownership feature. Please try
them out if you have a chance, and let us know what you think on our &lt;a href=&quot;mailto:user@kudu.apache.org&quot;&gt;mailing
list&lt;/a&gt; or &lt;a href=&quot;https://getkudu.slack.com&quot;&gt;Slack&lt;/a&gt;. If you
run into any issues, feel free to reach out to us on either platform, or open a
&lt;a href=&quot;https://issues.apache.org/jira/projects/KUDU&quot;&gt;bug report&lt;/a&gt;.&lt;/p&gt;</content><author><name>Attila Bukor</name></author><summary type="html">When Apache Kudu was first released in September 2016, it didn’t support any kind of authorization. Anyone who could access the cluster could do anything they wanted. To remedy this, coarse-grained authorization was added along with authentication in Kudu 1.3.0. This meant allowing only certain users to access Kudu, but those who were allowed access could still do whatever they wanted. The only way to achieve finer-grained access control was to limit access to Apache Impala where access control could be enforced by fine-grained policies in Apache Sentry. This method limited how Kudu could be accessed, so we saw a need to implement fine-grained access control in a way that wouldn’t limit access to Impala only. Kudu 1.10.0 integrated with Apache Sentry to enable finer-grained authorization policies. This integration was rather short-lived as it was deprecated in Kudu 1.12.0 and will be completely removed in Kudu 1.13.0. Most recently, since 1.12.0 Kudu supports fine-grained authorization by integrating with Apache Ranger 2.1 and later. In this post, we’ll cover how this works and how to set it up.</summary></entry><entry><title type="html">Building Near Real-time Big Data Lake</title><link href="/2020/07/30/building-near-real-time-big-data-lake.html" rel="alternate" type="text/html" title="Building Near Real-time Big Data Lake" /><published>2020-07-30T00:00:00-07:00</published><updated>2020-07-30T00:00:00-07:00</updated><id>/2020/07/30/building-near-real-time-big-data-lake</id><content type="html" xml:base="/2020/07/30/building-near-real-time-big-data-lake.html">&lt;p&gt;Note: This is a cross-post from the Boris Tyukin’s personal blog &lt;a href=&quot;https://boristyukin.com/building-near-real-time-big-data-lake-part-2/&quot;&gt;Building Near Real-time Big Data Lake: Part 2&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is the second part of the series. In &lt;a href=&quot;https://boristyukin.com/building-near-real-time-big-data-lake-part-i/&quot;&gt;Part 1&lt;/a&gt;
I wrote about our use-case for the Data Lake architecture and shared our success story.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;requirements&quot;&gt;Requirements&lt;/h2&gt;
&lt;p&gt;Before we embarked on our journey, we had identified high-level requirements and guiding principles.
It is crucial to think this through and envision who will use your Data Lake and how. Identify your
first three projects and keep them in mind while you are building the Data Lake.&lt;/p&gt;
&lt;p&gt;The best way is to start a few smaller proof-of-concept projects: play with various distributed
engines and tools, run tons of benchmarks, and learn from others, who implemented a similar solution
successfully. Do not forget to learn from others’ mistakes too.&lt;/p&gt;
&lt;p&gt;We had settled on these 7 guiding principles before we started looking at technology and architecture:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Scale-out, not scale-up.&lt;/li&gt;
&lt;li&gt;Design for resiliency and availability.&lt;/li&gt;
&lt;li&gt;Support both real-time and batch ingestion into a Data Lake.&lt;/li&gt;
&lt;li&gt;Enable both ad-hoc exploratory analysis as well as interactive queries.&lt;/li&gt;
&lt;li&gt;Replicate in near real-time 300+ Cerner Millennium tables from 3 remote-hosted Cerner Oracle RAC
instances with average latency less than 10 seconds (time between a change made in Cerner EHR system
by clinicians and data ingested and ready for consumption in Data Lake).&lt;/li&gt;
&lt;li&gt;Have robust logging and monitoring processes to ensure reliability of the pipeline and to simplify
troubleshooting.&lt;/li&gt;
&lt;li&gt;Reduce manual work greatly and ease the ongoing support.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We decided to embrace the benefits and scalability of Big Data technology. In fact, it was a pretty
easy sell as our leadership was tired of constantly buying expensive software and hardware from
big-name vendors and not being able to scale-out to support an avalanche of new projects and requests.&lt;/p&gt;
&lt;p&gt;We started looking at Change Data Capture (CDC) products to mine and ship database logs from Oracle.&lt;/p&gt;
&lt;p&gt;We knew we had to implement a metadata- or code-as-configuration driven solution to manage hundreds
of tables, without expanding our team.&lt;/p&gt;
&lt;p&gt;We needed a flexible orchestration and scheduling tool, designed with real-time workloads in mind.&lt;/p&gt;
&lt;p&gt;Finally, we engaged our and Cerner’s leadership early, as it would take time to hash out all the
details, and to make their DBAs confident that we were not going to break their production systems
by streaming 1000s of messages every second 24x7. In fact, one of the goals was to relieve production
systems from analytical workloads.&lt;/p&gt;
&lt;h2 id=&quot;platform-selection&quot;&gt;Platform selection&lt;/h2&gt;
&lt;p&gt;First off, we had to decide on the actual platform. After 3 months of research, 4 options
emerged, given the realities of our organization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;On-premises virtualized cluster, using preferred vendors, recommended by our infrastructure team.&lt;/li&gt;
&lt;li&gt;On-premises Big Data appliance (bundled hardware and software, optimized for Big Data workloads).&lt;/li&gt;
&lt;li&gt;Big Data cluster in the cloud, managed by ourselves (IaaS model, which just means renting a bunch of
VMs and running a Cloudera or Hortonworks Big Data distribution).&lt;/li&gt;
&lt;li&gt;A fully managed cloud data platform and native cloud data warehouse (Snowflake, Google BigQuery,
Amazon Redshift, etc.)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each option had a long list of pros and cons, but ultimately we went with option 2. The price was
really attractive, it was a capital expense (our finance people rightfully hate subscriptions), and
it offered the best performance, security, and control.&lt;/p&gt;
&lt;p&gt;We made the decision in 2017. While we could not provision cluster resources and add nodes
with the click of a button, and we learned that software and hardware upgrades were a real chore, it
was still very much worth it, as we saved the organization a seven-figure sum while getting the
performance we needed.&lt;/p&gt;
&lt;p&gt;Owning hardware also made a lot of sense for us: we could not forecast our needs far enough into the
future, and we could get a really powerful 6 node cluster for a fraction of the cost that we would end
up paying in subscription fees over the next 12 months. Of course, it did help that we already had a
state-of-the-art data center and people managing it.&lt;/p&gt;
&lt;p&gt;Fully-managed or serverless architecture was not really an option back then, but it would be the
first thing I would look at if I had to build a data lake today
(definitely check AWS Lake Formation, AWS Athena, Amazon Redshift, Azure Synapse, Snowflake and
Google BigQuery).&lt;/p&gt;
&lt;p&gt;Your organization, goals, projects and situation could be very different, and you should definitely
evaluate cloud solutions, especially in 2020 when prices are decreasing, cloud providers are
extremely competitive, and there are new attractive pricing options with 3-year commitments. Make sure
you understand the cost and billing model. Or hire a company (there are plenty now) that will
explain your cloud bills before you get a horrifying check.&lt;/p&gt;
&lt;p&gt;Some of the things to consider:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Existing data center infrastructure and access to people, supporting it.&lt;/li&gt;
&lt;li&gt;Integration with current tools (BI, ETL, advanced analytics, etc.). Do they stay on-premises, or can
they be moved into the cloud to avoid network lag or charges for data egress?&lt;/li&gt;
&lt;li&gt;Total ownership cost and cost to performance ratio.&lt;/li&gt;
&lt;li&gt;Do you really need elasticity? This is the first thing that cloud advocates preach, but
think about whether and how this applies to you.&lt;/li&gt;
&lt;li&gt;Is time-to-market so crucial for you, or can you wait a few months to build Big Data
infrastructure on-premises to save some money and get much better performance and control of the
physical hardware?&lt;/li&gt;
&lt;li&gt;Are you okay with locking yourself into vendor XYZ’s solution? This is an especially crucial
question if you are selecting a fully managed platform.&lt;/li&gt;
&lt;li&gt;Can you easily change your cloud provider? Or can you even afford to put all your trust and faith
in a single cloud provider?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Do your homework, spend a lot of time reading and talking to other people (engineers and architects,
not sales reps), and make sure you understand what you are signing up for.&lt;/p&gt;
&lt;p&gt;And remember, there is no magic! You still need to architect, design, build, support, test, and make
good choices and use common sense. No matter what your favorite vendor tells you. You might save
time by spinning up a cluster in minutes, but you still need people to manage all that. You still
need great architects and engineers to realize benefits from all that hot new tech.&lt;/p&gt;
&lt;h2 id=&quot;building-blocks&quot;&gt;Building blocks&lt;/h2&gt;
&lt;p&gt;Once we agreed on the platform of our choice, powered by Cloudera Enterprise Data Hub, we started
prototyping and benchmarking various engines and tools that came with it. We looked at other
open-source projects, as nothing really prevents you from installing and using any open-source
product you desire and trust. One of these products for us was Apache NiFi, which proved to be a
tremendous value.&lt;/p&gt;
&lt;p&gt;After a lot of trials and errors, we decided on this architecture:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/pipelinearchitecture.png&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;One of the toughest challenges we faced right away was the fact that most Big Data
engines were not designed to support mutable data but rather immutable, append-only data. All the
workarounds we had tried did not work for us, and no matter what we did with our partitioning strategy,
we just needed a simple ability to update and delete data, not only insert it. Anyone who has worked with
an RDBMS or legacy columnar databases takes this capability for granted, but surprisingly it is a very
difficult task in the Big Data world.&lt;/p&gt;
&lt;p&gt;We considered Apache HBase, but the performance of analytics-style ETL and interactive queries was
really bad. We were blown away by Apache Impala’s performance on HDFS as no matter what we threw at
Impala, it was hundreds of times faster…but we could not update data in place.&lt;/p&gt;
&lt;p&gt;At about the same time, Cloudera released and open-sourced Apache Kudu project that became part of
its official distribution. We got very excited about it (refer to our benchmarks &lt;a href=&quot;http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/&quot;&gt;here&lt;/a&gt;), and decided
to proceed with Kudu as a storage engine, while using Apache Impala as SQL query engine. One of the
ambitious goals of Apache Kudu is to cut the need for the infamous &lt;a href=&quot;https://en.wikipedia.org/wiki/Lambda_architecture&quot;&gt;Lambda architecture&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After talking to 7 vendors and playing with our top picks, we selected a Change Data Capture product
(Oracle GoldenGate for Big Data edition). It deserves a separate post, but let’s just say it was the
only product out of the 7 that was able to handle the complexities of the source Oracle RAC systems and
offer great performance without the need to install any agents or software on the actual production
database. Other solutions had a very long list of limitations for Oracle systems; make sure to read
and understand those limitations.&lt;/p&gt;
&lt;p&gt;Our homegrown tool &lt;a href=&quot;http://boristyukin.com/how-to-ingest-a-large-number-of-tables-into-a-big-data-lake-or-why-i-built-metazoo/&quot;&gt;MetaZoo&lt;/a&gt;
has been instrumental in bringing order and peace, and that’s why it earned
its own blog post!&lt;/p&gt;
&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;
&lt;p&gt;Initial ingest is pretty typical - we use Sqoop to extract data from Cerner Oracle databases, and
NiFi helps orchestrate the initial load. In fact, the single NiFi flow below can
handle the initial ingest of hundreds of tables!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/nifi_initial.png&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Our secret sauce though is &lt;a href=&quot;http://boristyukin.com/how-to-ingest-a-large-number-of-tables-into-a-big-data-lake-or-why-i-built-metazoo/&quot;&gt;MetaZoo&lt;/a&gt;.
MetaZoo generates optimal parameters for Sqoop (such as the number of mappers, the split-by column, and
so forth), generates DDLs for staging and final tables, and SQL commands to transform data before they
land in the Data Lake. MetaZoo also provides control tables to record the status of every table.&lt;/p&gt;
&lt;p&gt;The throughput of Sqoop is nothing short of amazing. Gone are the days when we had to ask Cerner to dump
tables on a hard-drive and ship it by snail mail (do not ask how much it cost us!). And we like how
YARN queues help to limit the load on production databases.&lt;/p&gt;
&lt;p&gt;To give you one example, a few years ago it took us 4 weeks to reload the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clinical_event&lt;/code&gt; table from
Cerner using Informatica into our local Oracle database. With Sqoop and Big Data, it was done in 11
hours!&lt;/p&gt;
&lt;p&gt;This is what happens during the initial ingest.&lt;/p&gt;
&lt;p&gt;First, MetaZoo gathers relevant metadata from the source system about the tables to ingest and, based
on that metadata, generates DDL scripts, SQL command snippets, Sqoop parameters, and more. It also
initializes the tables in the MetaZoo control tables.&lt;/p&gt;
&lt;p&gt;Then NiFi picks a list of tables to ingest from the MetaZoo control tables and runs the following
steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Execute and wait for Sqoop to finish.&lt;/li&gt;
&lt;li&gt;Apply some basic rules to map data types to the corresponding data types in the lake. We convert
timestamps to a proper time zone as well. While you do not want to do any heavy processing or any
data modeling in the Data Lake, and should keep data as close to raw format as you can, some light
processing upfront goes a long way and makes it easier for analysts and developers to use these
tables later.&lt;/li&gt;
&lt;li&gt;Load processed data into final tables after some basic validation.&lt;/li&gt;
&lt;li&gt;Compute Impala statistics.&lt;/li&gt;
&lt;li&gt;Set initial ingest status to completed in MetaZoo control tables so it is ready for real-time
streaming.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Before we kick off the initial ingest process, we start Oracle GoldenGate extracts and replicats
(that’s the actual term) to begin capturing changes from a database and send them into Kafka. Every
message, depending on database operation type and GoldenGate configuration, might have before/after
table row values, operation type and database commit transaction time (it only extracts changes for
committed transactions). Once the initial ingest is finished, and because GoldenGate has been
sending changes since the moment we started it, we can start the real-time ingest flow in NiFi.&lt;/p&gt;
&lt;p&gt;A side benefit of decoupling GoldenGate, Kafka, NiFi, and Kudu is that it makes this process
resilient to failures. It also allows us to bring one of these systems down for maintenance
without much impact.&lt;/p&gt;
&lt;p&gt;Below is the NiFi flow that handles real-time streaming from Oracle/GoldenGate/Kafka and persists
data into Kudu:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/nifi_rt.png&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;NiFi flow consumes Kafka messages, produced by GoldenGate. Every table from every domain has
its own Kafka topic. Topics have only one partition to preserve the original order of messages.&lt;/li&gt;
&lt;li&gt;New messages are queued in NiFi, using a simple First-In-First-Out pattern, and grouped by
table. It is important to preserve the order of messages while still processing tables concurrently.&lt;/li&gt;
&lt;li&gt;Messages are transformed, using the same basic rules we apply during the initial ingest.&lt;/li&gt;
&lt;li&gt;Finally, messages are persisted into Kudu. Some of them represent INSERT type operations, which
result in brand new rows added to Kudu tables. Other messages are UPDATE and DELETE operations.
And we have to deal with an exotic PK_UPDATE operation, when a primary key was changed for some
reason in the source system (e.g. PK=111 was renamed to 222). We had to write a custom Kudu client
to handle all these cases using the Java Kudu API, which was fun to use (see the sketch after this
list). NiFi allowed us to write custom processors and integrate that custom Kudu code directly into
our flow.&lt;/li&gt;
&lt;li&gt;Useful metrics are stored in a separate Kudu table. We collect the number of messages processed,
the operation type (insert, update, delete or primary key update), latency, and important timestamps.
Using this data, we can optimize and tweak the performance of the pipeline and monitor it by
visualizing the data on a dashboard.&lt;/li&gt;
&lt;/ol&gt;
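&lt;p&gt;To make step 4 concrete, here is a hedged sketch of how each message type can map onto Kudu write operations. Our custom client is written against the Java Kudu API; this C++ analogue uses the same concepts, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OpType&lt;/code&gt; enum, the column names, the error-handling macro, and the PK_UPDATE decomposition are assumptions for illustration.&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;memory&amp;gt;
#include &amp;lt;string&amp;gt;

#include &quot;kudu/client/client.h&quot;

using kudu::Status;
using kudu::client::KuduDelete;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::client::KuduUpsert;

// CDC operation types as described in step 4 above (names hypothetical).
enum class OpType { kInsert, kUpdate, kDelete, kPkUpdate };

Status ApplyCdcMessage(KuduSession* session, KuduTable* table, OpType op,
                       int64_t old_pk, int64_t new_pk,
                       const std::string&amp;amp; value) {
  if (op == OpType::kDelete || op == OpType::kPkUpdate) {
    // A primary key change (e.g. PK=111 renamed to 222) has no single
    // Kudu operation: delete the old row first, then upsert the new key.
    std::unique_ptr&amp;lt;KuduDelete&amp;gt; del(table-&amp;gt;NewDelete());
    KUDU_RETURN_NOT_OK(del-&amp;gt;mutable_row()-&amp;gt;SetInt64(&quot;pk&quot;, old_pk));
    KUDU_RETURN_NOT_OK(session-&amp;gt;Apply(del.release()));
    if (op == OpType::kDelete) return Status::OK();
  }
  // INSERT, UPDATE, and the second half of PK_UPDATE all become upserts:
  // insert the row if it is absent, overwrite it if it is present.
  std::unique_ptr&amp;lt;KuduUpsert&amp;gt; upsert(table-&amp;gt;NewUpsert());
  KUDU_RETURN_NOT_OK(upsert-&amp;gt;mutable_row()-&amp;gt;SetInt64(&quot;pk&quot;, new_pk));
  KUDU_RETURN_NOT_OK(upsert-&amp;gt;mutable_row()-&amp;gt;SetString(&quot;value&quot;, value));
  return session-&amp;gt;Apply(upsert.release());
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;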
&lt;p&gt;The entire flow handles 900+ tables today (as we capture 300 tables from 3 Cerner domains).&lt;/p&gt;
&lt;p&gt;We process ~2,000 messages per second, or 125MM messages per day. GoldenGate accumulates 150GB worth
of database changes per day. In Kudu, we store over 120B rows of data.&lt;/p&gt;
&lt;p&gt;Our average latency is 6 seconds and the pipeline is running 24x7.&lt;/p&gt;
&lt;h2 id=&quot;user-experience&quot;&gt;User experience&lt;/h2&gt;
&lt;p&gt;I am biased, but I think this is a game-changer for analysts, BI developers, or any data people.
What they get is an ability to access near real-time production data, with all the benefits and
scalability of Big Data technology.&lt;/p&gt;
&lt;p&gt;Here, I run a query in Impala to count patients admitted to our hospitals within the last 7 days
who are still in the hospital (not discharged yet):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/query1.png&quot; alt=&quot;query 1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then 5 seconds later I run the same query again to see the numbers change - more patients got admitted
and discharged:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/query2.png&quot; alt=&quot;query 2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The query below counts certain clinical events in the 20B row Kudu table (which is updated in near
real-time). While it takes 28 seconds to finish, this query would never even finish if I ran it against
our Oracle database. It found 13.7B events:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/query3.png&quot; alt=&quot;query 3&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;credits&quot;&gt;Credits&lt;/h2&gt;
&lt;p&gt;Apache Impala, Apache Kudu and Apache NiFi were the pillars of our real-time pipeline. Back in 2017,
Impala was already a rock-solid, battle-tested project, while NiFi and Kudu were relatively new. We
did have some reservations about using them and were concerned about support if/when we needed it
(and we did need it a few times).&lt;/p&gt;
&lt;p&gt;We were amazed by all the help, dedication, knowledge sharing, friendliness, and openness of the
Impala, NiFi and Kudu developers. A huge thank you to all of you who helped us along the way. You guys are
amazing and you are building fantastic products!&lt;/p&gt;
&lt;p&gt;To be continued…&lt;/p&gt;</content><author><name>Boris Tyukin</name></author><summary type="html">Note: This is a cross-post from the Boris Tyukin’s personal blog Building Near Real-time Big Data Lake: Part 2 This is the second part of the series. In Part 1 I wrote about our use-case for the Data Lake architecture and shared our success story.</summary></entry><entry><title type="html">Apache Kudu 1.12.0 released</title><link href="/2020/05/18/apache-kudu-1-12-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.12.0 released" /><published>2020-05-18T00:00:00-07:00</published><updated>2020-05-18T00:00:00-07:00</updated><id>/2020/05/18/apache-kudu-1-12-0-release</id><content type="html" xml:base="/2020/05/18/apache-kudu-1-12-0-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.12.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;Kudu now supports native fine-grained authorization via integration with
Apache Ranger. Kudu may now enforce access control policies defined for
Kudu tables and columns stored in Ranger. See the
&lt;a href=&quot;/releases/1.12.0/docs/security.html#fine_grained_authz&quot;&gt;authorization documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu’s web UI now supports proxying via Apache Knox. Kudu may be deployed
in a firewalled state behind a Knox Gateway which will forward HTTP requests
and responses between clients and the Kudu web UI.&lt;/li&gt;
&lt;li&gt;Kudu’s web UI now supports HTTP keep-alive. Operations that access multiple
URLs will now reuse a single HTTP connection, improving their performance.&lt;/li&gt;
&lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu tserver quiesce&lt;/code&gt; tool has been added to quiesce tablet servers. While a
tablet server is quiescing, it will stop hosting tablet leaders and stop
serving new scan requests. This can be used to orchestrate a rolling restart
without stopping ongoing Kudu workloads.&lt;/li&gt;
&lt;li&gt;Introduced &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto&lt;/code&gt; time source for HybridClock timestamps. With
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--time_source=auto&lt;/code&gt; in AWS and GCE cloud environments, Kudu masters and
tablet servers use the built-in NTP client synchronized with dedicated NTP
servers available via host-only networks. With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--time_source=auto&lt;/code&gt; in
environments other than AWS/GCE, Kudu masters and tablet servers rely on
their local machine’s clock synchronized by NTP. The default setting for
the HybridClock time source (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--time_source=system&lt;/code&gt;) is backward-compatible,
requiring the local machine’s clock to be synchronized by the kernel’s NTP
discipline.&lt;/li&gt;
&lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu cluster rebalance&lt;/code&gt; tool now supports moving replicas away from
specific tablet servers by supplying the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--ignored_tservers&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--move_replicas_from_ignored_tservers&lt;/code&gt; arguments (see
&lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2914&quot;&gt;KUDU-2914&lt;/a&gt; for more
details).&lt;/li&gt;
&lt;li&gt;Write Ahead Log file segments and index chunks are now managed by Kudu’s file
cache. With that, all long-lived file descriptors used by Kudu are managed by
the file cache, and there’s no longer a need for capacity planning of file
descriptor usage.&lt;/li&gt;
&lt;/ul&gt;
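&lt;p&gt;As a minimal sketch of how a couple of these tools might be used together
(the host names and the tablet server UUID below are placeholders; consult each
tool’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--help&lt;/code&gt; output for the authoritative syntax):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Start quiescing a tablet server: it steps down its tablet leaders and
# stops serving new scan requests while quiescing.
sudo -u kudu kudu tserver quiesce start tserver-01.example.com:7050

# ... restart the tablet server process, then let it resume normal duty:
sudo -u kudu kudu tserver quiesce stop tserver-01.example.com:7050

# Separately, the rebalancer can drain replicas away from specific tablet
# servers identified by UUID; such servers typically need to be put into
# maintenance mode first.
sudo -u kudu kudu cluster rebalance master-01.example.com:7051 \
    --ignored_tservers=&amp;lt;tserver_uuid&amp;gt; \
    --move_replicas_from_ignored_tservers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;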
&lt;p&gt;The above is just a list of the highlights; for a more complete list of new
features, improvements, and fixes, please refer to the &lt;a href=&quot;/releases/1.12.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.12.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.12.0&quot;&gt;1.12.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.12.0/docs/installation.html#build_from_source&quot;&gt;1.12.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.12.0&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;.&lt;/p&gt;</content><author><name>Hao Hao</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.12.0! The new release adds several new features and improvements, including the following:</summary></entry><entry><title type="html">Apache Kudu 1.10.1 released</title><link href="/2019/11/20/apache-kudu-1-10-1-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.10.1 released" /><published>2019-11-20T00:00:00-08:00</published><updated>2019-11-20T00:00:00-08:00</updated><id>/2019/11/20/apache-kudu-1-10-1-release</id><content type="html" xml:base="/2019/11/20/apache-kudu-1-10-1-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.10.1!&lt;/p&gt;
&lt;p&gt;Apache Kudu 1.10.1 is a bug fix release which fixes critical issues discovered
in Apache Kudu 1.10.0. In particular, this fixes a licensing issue with
distributing the libnuma library with the kudu-binary JAR artifact. Users of
Kudu 1.10.0 are encouraged to upgrade to 1.10.1 as soon as possible. See the
&lt;a href=&quot;/releases/1.10.1/docs/release_notes.html&quot;&gt;release notes&lt;/a&gt; for details.&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.10.1, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.10.1&quot;&gt;1.10.1 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.10.1/docs/installation.html#build_from_source&quot;&gt;1.10.1 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.10.1&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;</content><author><name>Alexey Serbin</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.10.1! Apache Kudu 1.10.1 is a bug fix release which fixes critical issues discovered in Apache Kudu 1.10.0. In particular, this fixes a licensing issue with distributing libnuma library with the kudu-binary JAR artifact. Users of Kudu 1.10.0 are encouraged to upgrade to 1.10.1 as soon as possible. See the release notes for details.</summary></entry><entry><title type="html">Apache Kudu 1.11.1 released</title><link href="/2019/11/20/apache-kudu-1-11-1-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.11.1 released" /><published>2019-11-20T00:00:00-08:00</published><updated>2019-11-20T00:00:00-08:00</updated><id>/2019/11/20/apache-kudu-1-11-1-release</id><content type="html" xml:base="/2019/11/20/apache-kudu-1-11-1-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.11.1!&lt;/p&gt;
&lt;p&gt;This release contains a fix for a critical issue discovered in 1.10.0 and
1.11.0, and adds several new features and improvements since 1.10.0.&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;Apache Kudu 1.11.1 is a bug fix release which fixes critical issues discovered
in Apache Kudu 1.11.0. In particular, this release fixes a licensing issue with
distributing the libnuma library with the kudu-binary JAR artifact. Users of
Kudu 1.11.0 are encouraged to upgrade to 1.11.1 as soon as possible.&lt;/p&gt;
&lt;p&gt;Apache Kudu 1.11.1 adds several new features and improvements since
Apache Kudu 1.10.0, including the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Kudu now supports putting tablet servers into maintenance mode: while in this
mode, the tablet server’s replicas will not be re-replicated if the server
fails. Administrative CLI tools have been added to orchestrate tablet server maintenance
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2069&quot;&gt;KUDU-2069&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Kudu now has a built-in NTP client which maintains the internal wallclock
time used for generation of HybridTime timestamps. When enabled, system clock
synchronization for nodes running Kudu is no longer necessary
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2935&quot;&gt;KUDU-2935&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Aggregated table statistics are now available to Kudu clients. This allows
for various query optimizations. For example, Spark now uses it to perform
join optimizations
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2797&quot;&gt;KUDU-2797&lt;/a&gt; and
&lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2721&quot;&gt;KUDU-2721&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The kudu CLI tool now supports creating and dropping range partitions
for a table
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2881&quot;&gt;KUDU-2881&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The kudu CLI tool now supports altering and dropping table columns (these CLI additions are sketched after this list).&lt;/li&gt;
&lt;li&gt;The kudu CLI tool now supports getting and setting extra configuration
properties for a table
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2514&quot;&gt;KUDU-2514&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
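&lt;p&gt;A minimal sketch of the new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu table&lt;/code&gt; subcommands mentioned above (the table
name, column names, range bounds, and configuration value below are hypothetical
placeholders; check &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu table --help&lt;/code&gt; for the authoritative syntax):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Add and drop a range partition; the bounds are JSON arrays of values
# for the table's range-partitioning columns.
kudu table add_range_partition &amp;lt;master_addresses&amp;gt; my_table '[0]' '[100]'
kudu table drop_range_partition &amp;lt;master_addresses&amp;gt; my_table '[0]' '[100]'

# Rename and drop table columns.
kudu table rename_column &amp;lt;master_addresses&amp;gt; my_table old_name new_name
kudu table delete_column &amp;lt;master_addresses&amp;gt; my_table unused_col

# Get and set extra configuration properties for a table.
kudu table get_extra_configs &amp;lt;master_addresses&amp;gt; my_table
kudu table set_extra_config &amp;lt;master_addresses&amp;gt; my_table kudu.table.history_max_age_sec 7200
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;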
&lt;p&gt;See the &lt;a href=&quot;/releases/1.11.1/docs/release_notes.html&quot;&gt;release notes&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.11.1, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.11.1&quot;&gt;1.11.1 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.11.1/docs/installation.html#build_from_source&quot;&gt;1.11.1 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.11.1&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;.&lt;/p&gt;</content><author><name>Alexey Serbin</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.11.1! This release contains a fix which addresses a critical issue discovered in 1.10.0 and 1.11.0 and adds several new features and improvements since 1.10.0.</summary></entry><entry><title type="html">Apache Kudu 1.10.0 Released</title><link href="/2019/07/09/apache-kudu-1-10-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.10.0 Released" /><published>2019-07-09T00:00:00-07:00</published><updated>2019-07-09T00:00:00-07:00</updated><id>/2019/07/09/apache-kudu-1-10-0-release</id><content type="html" xml:base="/2019/07/09/apache-kudu-1-10-0-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.10.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;Kudu now supports both full and incremental table backups via a job
implemented using Apache Spark. Additionally, it supports restoring
tables from full and incremental backups via a restore job implemented using
Apache Spark. See the
&lt;a href=&quot;/releases/1.10.0/docs/administration.html#backup&quot;&gt;backup documentation&lt;/a&gt;
for more details, and the sketch after this list for how the jobs are launched.&lt;/li&gt;
&lt;li&gt;Kudu can now synchronize its internal catalog with the Apache Hive Metastore,
automatically updating Hive Metastore table entries upon table creation,
deletion, and alterations in Kudu. See the
&lt;a href=&quot;/releases/1.10.0/docs/hive_metastore.html#metadata_sync&quot;&gt;HMS synchronization documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu now supports native fine-grained authorization via integration with
Apache Sentry. Kudu may now enforce access control policies defined for Kudu
tables and columns, as well as policies defined on Hive servers and databases
that may store Kudu tables. See the
&lt;a href=&quot;/releases/1.10.0/docs/security.html#fine_grained_authz&quot;&gt;authorization documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu’s web UI now supports SPNEGO, a protocol for securing HTTP requests with
Kerberos by passing negotiation through HTTP headers. To enable, set the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--webserver_require_spnego&lt;/code&gt; command line flag.&lt;/li&gt;
&lt;li&gt;Column comments can now be stored in Kudu tables, and can be updated using
the AlterTable API
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-1711&quot;&gt;KUDU-1711&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The performance of mutations (i.e. UPDATE, DELETE, and re-INSERT) to
not-yet-flushed Kudu data has been significantly optimized
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2826&quot;&gt;KUDU-2826&lt;/a&gt; and
&lt;a href=&quot;https://github.com/apache/kudu/commit/f9f9526d3&quot;&gt;f9f9526d3&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Predicate performance for primitive columns and IS NULL and IS NOT NULL
has been optimized
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2846&quot;&gt;KUDU-2846&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
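&lt;p&gt;A minimal sketch of the Spark-based backup and restore jobs (the jar file name,
master address, and root path below are placeholders; see the backup documentation
linked above for the authoritative options):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Back up a table to a root path: the first run against a given root path
# produces a full backup, subsequent runs produce incremental backups.
spark-submit --class org.apache.kudu.backup.KuduBackup \
    kudu-backup2_2.11-1.10.0.jar \
    --kuduMasterAddresses master-01.example.com:7051 \
    --rootPath hdfs:///kudu-backups \
    my_table

# Restore the table from the backups under the same root path.
spark-submit --class org.apache.kudu.backup.KuduRestore \
    kudu-backup2_2.11-1.10.0.jar \
    --kuduMasterAddresses master-01.example.com:7051 \
    --rootPath hdfs:///kudu-backups \
    my_table
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;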
&lt;p&gt;The above is just a list of the highlights; for a more complete list of new
features, improvements, and fixes, please refer to the &lt;a href=&quot;/releases/1.10.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.10.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.10.0&quot;&gt;1.10.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.10.0/docs/installation.html#build_from_source&quot;&gt;1.10.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.10.0&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;.&lt;/p&gt;</content><author><name>Grant Henke</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.10.0! The new release adds several new features and improvements, including the following:</summary></entry><entry><title type="html">Location Awareness in Kudu</title><link href="/2019/04/30/location-awareness.html" rel="alternate" type="text/html" title="Location Awareness in Kudu" /><published>2019-04-30T00:00:00-07:00</published><updated>2019-04-30T00:00:00-07:00</updated><id>/2019/04/30/location-awareness</id><content type="html" xml:base="/2019/04/30/location-awareness.html">&lt;p&gt;This post is about location awareness in Kudu. It gives an overview
of the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;principles of the design&lt;/li&gt;
&lt;li&gt;restrictions of the current implementation&lt;/li&gt;
&lt;li&gt;potential future enhancements and extensions&lt;/li&gt;
&lt;/ul&gt;
&lt;!--more--&gt;
&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Kudu supports location awareness starting with the 1.9.0 release. The
initial implementation of location awareness in Kudu is built to satisfy the
following requirement:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In a Kudu cluster consisting of multiple servers spread over several racks,
place the replicas of a tablet in such a way that the tablet stays available
even if all the servers in a single rack become unavailable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A rack failure can occur when a hardware component shared among servers in the
rack, such as a network switch or power supply, fails. More generally,
replace ‘rack’ with any other aggregation of nodes (e.g., chassis, site,
cloud availability zone, etc.) where some or all nodes in an aggregate become
unavailable in case of a failure. This even applies to a datacenter if the
network latency between datacenters is low. This is why we call the feature
&lt;em&gt;location awareness&lt;/em&gt; and not &lt;em&gt;rack awareness&lt;/em&gt;.&lt;/p&gt;
&lt;h1 id=&quot;locations-in-kudu&quot;&gt;Locations in Kudu&lt;/h1&gt;
&lt;p&gt;In Kudu, a location is defined by a string that begins with a slash (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&lt;/code&gt;) and
consists of slash-separated tokens each of which contains only characters from
the set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[a-zA-Z0-9_-.]&lt;/code&gt;. The components of the location string hierarchy
should correspond to the physical or cloud-defined hierarchy of the deployed
cluster, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/data-center-0/rack-09&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/region-0/availability-zone-01&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The design choice of using hierarchical paths for location strings is
partially influenced by HDFS. The intention was to make it possible to reuse
the locations already assigned to existing HDFS nodes, because it’s common to
deploy Kudu alongside HDFS. In addition, the hierarchical structure of location
strings allows them to be interpreted in terms of common ancestry and
relative proximity. As of now, Kudu does not exploit the hierarchical
structure of the location except in the client’s logic to find the closest
tablet server. However, we plan to leverage the hierarchical structure
in future releases.&lt;/p&gt;
&lt;h1 id=&quot;defining-and-assigning-locations&quot;&gt;Defining and assigning locations&lt;/h1&gt;
&lt;p&gt;Kudu masters assign locations to tablet servers and clients.&lt;/p&gt;
&lt;p&gt;Every Kudu master runs the location assignment procedure to assign a location
to a tablet server when it registers. To determine the location for a tablet
server, the master invokes an executable that takes the IP address or hostname
of the tablet server and outputs the corresponding location string for the
specified IP address or hostname. If the executable exits with a non-zero
status, the master interprets that as an error and adds a corresponding error
message to its log. In the case of a tablet server registration, such an
outcome is treated as a registration failure, and the tablet server is not
added to the master’s registry. That renders the tablet server unusable to
Kudu clients, since unregistered tablet servers are not discoverable via the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetTableLocations&lt;/code&gt; RPC.&lt;/p&gt;
&lt;p&gt;The master associates the produced location string with the registered tablet
server and keeps it until the tablet server re-registers, which only occurs
if the master or tablet server restarts. Masters use the assigned location
information internally to make replica placement decisions, trying to place
replicas evenly across locations and to keep tablets available in case all
tablet servers in a single location fail (see
&lt;a href=&quot;https://s.apache.org/location-awareness-design&quot;&gt;the design document&lt;/a&gt;
for details). In addition, masters provide connected clients with
the information on the client’s assigned location, so the clients can make
informed decisions when they attempt to read from the closest tablet server.
Kudu tablet servers themselves are location agnostic, at least for now,
so the assigned location is not reported back to a registered tablet server.&lt;/p&gt;
&lt;h1 id=&quot;the-location-aware-placement-policy-for-tablet-replicas-in-kudu&quot;&gt;The location-aware placement policy for tablet replicas in Kudu&lt;/h1&gt;
&lt;p&gt;While placing replicas of tablets in a location-aware cluster, Kudu uses a
best-effort approach to adhere to the following principle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Spread replicas across locations so that the failure of tablet servers
in one location does not make tablets unavailable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s referred to as the &lt;em&gt;replica placement policy&lt;/em&gt; or just &lt;em&gt;placement policy&lt;/em&gt;.
In Kudu, both the initial placement of tablet replicas and the automatic
re-replication are governed by that policy. As of now, that’s the only
replica placement policy available in Kudu. The placement policy isn’t
customizable and doesn’t have any configurable parameters.&lt;/p&gt;
&lt;h1 id=&quot;automatic-re-replication-and-placement-policy&quot;&gt;Automatic re-replication and placement policy&lt;/h1&gt;
&lt;p&gt;By design, keeping the target replication factor for tablets has higher
priority than conforming to the replica placement policy. In other words,
when bringing up tablet replicas to replace failed ones, Kudu uses a best-effort
approach with regard to conforming to the constraints of the placement policy.
Essentially, that means that if there isn’t a way to place a replica in
conformance with the placement policy, the system places the replica anyway.
The resulting violation of the placement policy can be rectified later on, when
unreachable tablet servers become available again or the misconfiguration is
fixed. As of now, fixing the resulting placement policy violations requires
running the CLI rebalancer tool manually (see below for details),
but in future releases that might be done &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2780&quot;&gt;automatically in the background&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&quot;an-example-of-location-aware-rebalancing&quot;&gt;An example of location-aware rebalancing&lt;/h1&gt;
&lt;p&gt;This section illustrates what happens during each phase of the location-aware
rebalancing process.&lt;/p&gt;
&lt;p&gt;In the diagrams below, the larger outer boxes denote locations, and the
smaller inner ones denote tablet servers. As for the real-world objects behind
locations in this example, one might think of server racks with a shared power
supply or a shared network switch. It’s assumed that no more than one tablet
server is run on each node (i.e. machine) in a rack.&lt;/p&gt;
&lt;p&gt;The first phase of the rebalancing process is about detecting violations and
reinstating the placement policy in the cluster. In the diagram below, there
are three locations defined: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L0&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L1&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L2&lt;/code&gt;. Each location has two tablet
servers. Table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; has a replication factor of three (RF=3) and consists of
four tablets: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A0&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A1&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A2&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A3&lt;/code&gt;. Table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; has a replication factor of five
(RF=5) and consists of three tablets: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B0&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B1&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The distribution of the replicas for tablet &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A0&lt;/code&gt; violates the placement policy.
Why? Because replicas &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A0.0&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A0.1&lt;/code&gt; constitute the majority of replicas
(two out of three) and reside in the same location &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L0&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; /L0 /L1 /L2
+-------------------+ +-------------------+ +-------------------+
| TS0 TS1 | | TS2 TS3 | | TS4 TS5 |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | | A0.1 | | | | A0.2 | | | | | | | | | |
| | | | A1.0 | | | | A1.1 | | | | | | A1.2 | | | |
| | | | A2.0 | | | | A2.1 | | | | | | A2.2 | | | |
| | | | A3.0 | | | | A3.1 | | | | | | A3.2 | | | |
| | B0.0 | | B0.1 | | | | B0.2 | | B0.3 | | | | B0.4 | | | |
| | B1.0 | | B1.1 | | | | B1.2 | | B1.3 | | | | B1.4 | | | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | B2.4 | | | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The location-aware rebalancer should initiate the movement of either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A0.0&lt;/code&gt; or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A0.1&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L0&lt;/code&gt; to another location, so the resulting replica distribution would
&lt;em&gt;not&lt;/em&gt; contain the majority of replicas in any single location. In addition to
that, the rebalancer tool tries to spread the load evenly across all locations
and across the tablet servers within each location. The latter narrows down the list
of candidate replicas to move: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A0.1&lt;/code&gt; is the best candidate to move from
location &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L0&lt;/code&gt;, so that location &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L0&lt;/code&gt; would no longer contain the majority of replicas
for tablet &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A0&lt;/code&gt;. The same principle dictates the target location and the target
tablet server to receive &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A0.1&lt;/code&gt;: tablet server &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TS5&lt;/code&gt; in
location &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L2&lt;/code&gt;. The resulting distribution of the tablet replicas after the move
is represented in the diagram below.&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; /L0 /L1 /L2
+-------------------+ +-------------------+ +-------------------+
| TS0 TS1 | | TS2 TS3 | | TS4 TS5 |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | | | | | | A0.2 | | | | | | | | A0.1 | |
| | | | A1.0 | | | | A1.1 | | | | | | A1.2 | | | |
| | | | A2.0 | | | | A2.1 | | | | | | A2.2 | | | |
| | | | A3.0 | | | | A3.1 | | | | | | A3.2 | | | |
| | B0.0 | | B0.1 | | | | B0.2 | | B0.3 | | | | B0.4 | | | |
| | B1.0 | | B1.1 | | | | B1.2 | | B1.3 | | | | B1.4 | | | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | B2.4 | | | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The second phase of the location-aware rebalancing is about moving tablet
replicas across locations to make the locations’ load more balanced. For the
number &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; of tablet servers in a location and the total number &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;R&lt;/code&gt; of replicas
in the location, the &lt;em&gt;load of the location&lt;/em&gt; is defined as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;R/S&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;At this stage all violations of the placement policy are already rectified. The
rebalancer tool doesn’t attempt to make any moves which would violate the
placement policy.&lt;/p&gt;
&lt;p&gt;The load of the locations in the diagram above (10, 10, and 7 replicas
hosted by 2 tablet servers in each location, respectively):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L0&lt;/code&gt;: 10/2 = 5&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L1&lt;/code&gt;: 10/2 = 5&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L2&lt;/code&gt;: 7/2 = 3.5&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A possible distribution of the tablet replicas after the second phase is
represented below. Each location then hosts 9 replicas on 2 tablet servers,
so the resulting load of the locations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L0&lt;/code&gt;: 9/2 = 4.5&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L1&lt;/code&gt;: 9/2 = 4.5&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/L2&lt;/code&gt;: 9/2 = 4.5&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; /L0 /L1 /L2
+-------------------+ +-------------------+ +-------------------+
| TS0 TS1 | | TS2 TS3 | | TS4 TS5 |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | | | | | | A0.2 | | | | | | | | A0.1 | |
| | | | A1.0 | | | | A1.1 | | | | | | A1.2 | | | |
| | | | A2.0 | | | | A2.1 | | | | | | A2.2 | | | |
| | | | A3.0 | | | | A3.1 | | | | | | A3.2 | | | |
| | B0.0 | | | | | | B0.2 | | B0.3 | | | | B0.4 | | B0.1 | |
| | B1.0 | | B1.1 | | | | | | B1.3 | | | | B1.4 | | B1.2 | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | B2.4 | | | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The third phase of the location-aware rebalancing is about moving tablet
replicas within each location to make the distribution of replicas even,
both per-table and per-server.&lt;/p&gt;
&lt;p&gt;See below for a possible distribution of the replicas in the example scenario
after the third phase of the location-aware rebalancing successfully completes.&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; /L0 /L1 /L2
+-------------------+ +-------------------+ +-------------------+
| TS0 TS1 | | TS2 TS3 | | TS4 TS5 |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | | | | | | | | A0.2 | | | | | | A0.1 | |
| | | | A1.0 | | | | A1.1 | | | | | | A1.2 | | | |
| | | | A2.0 | | | | A2.1 | | | | | | A2.2 | | | |
| | | | A3.0 | | | | A3.1 | | | | | | A3.2 | | | |
| | B0.0 | | | | | | B0.2 | | B0.3 | | | | B0.4 | | B0.1 | |
| | B1.0 | | B1.1 | | | | | | B1.3 | | | | B1.4 | | B1.2 | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | | | B2.4 | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;h1 id=&quot;how-to-make-a-kudu-cluster-location-aware&quot;&gt;How to make a Kudu cluster location-aware&lt;/h1&gt;
&lt;p&gt;To make a Kudu cluster location-aware, it’s necessary to set the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--location_mapping_cmd&lt;/code&gt; flag for the Kudu master(s) and make the corresponding
executable (a binary or a script) available on the nodes where the Kudu masters run.
With multiple masters, it’s important to make sure that the location
mappings stay the same regardless of the node where the location assignment
command runs.&lt;/p&gt;
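&lt;p&gt;For example, a minimal sketch of the corresponding master configuration,
assuming the mapping script is installed at a placeholder path:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# A flag for every Kudu master (e.g. in the master's gflagfile). The command
# must print a location string such as /dc0/rack01 for the IP address or
# hostname passed as its single argument, and exit 0 on success.
--location_mapping_cmd=/usr/local/bin/kudu_location_mapping.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;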
&lt;p&gt;It’s recommended to have at least three locations defined in a Kudu
cluster so that no location contains a majority of tablet replicas.
With two or fewer locations, it isn’t possible to spread the replicas of
a tablet with a replication factor of three or higher such that no location
contains a majority of the replicas.&lt;/p&gt;
&lt;p&gt;For example, when running a Kudu cluster in a single datacenter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dc0&lt;/code&gt;, assign
location &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dc0/rack0&lt;/code&gt; to tablet servers running on machines in rack &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rack0&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dc0/rack1&lt;/code&gt; to tablet servers running on machines in rack &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rack1&lt;/code&gt;,
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dc0/rack2&lt;/code&gt; to tablet servers running on machines in rack &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rack2&lt;/code&gt;.
Similarly, when running in the cloud, assign location &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/regionA/az0&lt;/code&gt;
to tablet servers running in availability zone &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;az0&lt;/code&gt; of region &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;regionA&lt;/code&gt;,
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/regionA/az1&lt;/code&gt; to tablet servers running in zone &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;az1&lt;/code&gt; of the same region.&lt;/p&gt;
&lt;h1 id=&quot;an-example-of-location-assignment-script-for-kudu&quot;&gt;An example of a location assignment script for Kudu&lt;/h1&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/sh
#
# It's assumed a Kudu cluster consists of nodes with IPv4 addresses in the
# private 192.168.100.0/24 subnet. The nodes are hosted in racks, where
# each rack can contain at most 32 nodes. This results in 8 locations,
# one location per rack.
#
# This example script maps IP addresses into locations assuming that RPC
# endpoints of tablet servers are specified via IPv4 addresses. If tablet
# servers' RPC endpoints are specified using DNS hostnames (and that's how
# it's done by default), the script should consume DNS hostname instead of
# an IP address as an input parameter. Check the `--rpc_bind_addresses` and
# `--rpc_advertised_addresses` command line flags of kudu-tserver for details.
#
# DISCLAIMER:
# This is an example Bourne shell script for Kudu location assignment. Please
# note it's just a toy script created for illustrative purposes only.
# The error handling and the input validation are minimalistic. Also, the
# network topology choice, supportability and capacity planning aspects of
# this script might be sub-optimal if applied as-is for real-world use cases.
set -e
if [ $# -ne 1 ]; then
echo &quot;usage: $0 &amp;lt;ip_address&amp;gt;&quot;
exit 1
fi
ip_address=$1
shift
suffix=${ip_address##192.168.100.}
if [ -z &quot;${suffix##*.*}&quot; ]; then
# An IP address from a non-controlled subnet: maps into the 'other' location.
echo &quot;/other&quot;
exit 0
fi
# Validate the host part of the address and map it into one of the locations.
case &quot;$suffix&quot; in
''|*[!0-9]*)
echo &quot;ERROR: '$ip_address' is not a valid IPv4 address&quot;
exit 2
;;
esac
if [ &quot;$suffix&quot; -gt 255 ]; then
echo &quot;ERROR: '$ip_address' is not a valid IPv4 address&quot;
exit 2
fi
if [ &quot;$suffix&quot; -eq 0 -o &quot;$suffix&quot; -eq 255 ]; then
echo &quot;ERROR: '$ip_address' is the network or broadcast address of the subnet&quot;
exit 3
fi
if [ $suffix -lt 32 ]; then
echo &quot;/dc0/rack00&quot;
elif [ $suffix -ge 32 -a $suffix -lt 64 ]; then
echo &quot;/dc0/rack01&quot;
elif [ $suffix -ge 64 -a $suffix -lt 96 ]; then
echo &quot;/dc0/rack02&quot;
elif [ $suffix -ge 96 -a $suffix -lt 128 ]; then
echo &quot;/dc0/rack03&quot;
elif [ $suffix -ge 128 -a $suffix -lt 160 ]; then
echo &quot;/dc0/rack04&quot;
elif [ $suffix -ge 160 -a $suffix -lt 192 ]; then
echo &quot;/dc0/rack05&quot;
elif [ $suffix -ge 192 -a $suffix -lt 224 ]; then
echo &quot;/dc0/rack06&quot;
else
echo &quot;/dc0/rack07&quot;
fi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
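&lt;p&gt;A hypothetical invocation of the script above, assuming it’s saved as
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu_location_mapping.sh&lt;/code&gt; and made executable:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ./kudu_location_mapping.sh 192.168.100.42
/dc0/rack01
$ ./kudu_location_mapping.sh 10.20.30.40
/other
$ ./kudu_location_mapping.sh 192.168.100.255; echo &quot;exit status: $?&quot;
ERROR: '192.168.100.255' is the network or broadcast address of the subnet
exit status: 3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;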
&lt;h1 id=&quot;reinstating-the-placement-policy-in-a-location-aware-kudu-cluster&quot;&gt;Reinstating the placement policy in a location-aware Kudu cluster&lt;/h1&gt;
&lt;p&gt;As explained earlier, even if the initial placement of tablet replicas conforms
to the placement policy, the cluster might get to a point where there are not
enough tablet servers to place a new or a replacement replica. Ideally, such
situations should be handled automatically: once there are enough tablet servers
in the cluster or the misconfiguration is fixed, the placement policy should
be reinstated. Currently, it’s possible to reinstate the placement policy using
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu&lt;/code&gt; CLI tool:&lt;/p&gt;
&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo -u kudu kudu cluster rebalance &amp;lt;master_rpc_endpoints&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;In the first phase, the location-aware rebalancing process tries to
reestablish the placement policy. If that’s not possible, the tool
terminates. Use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--disable_policy_fixer&lt;/code&gt; flag to skip this phase and
continue to the cross-location rebalancing phase.&lt;/p&gt;
&lt;p&gt;The second phase is cross-location rebalancing, i.e. moving tablet replicas
between different locations in an attempt to spread tablet replicas among
locations evenly, equalizing the loads of locations throughout the cluster.
If the benefits of spreading the load among locations do not justify the cost
of the cross-location replica movement, the tool can be instructed to skip the
second phase of the location-aware rebalancing. Use the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--disable_cross_location_rebalancing&lt;/code&gt; command line flag for that.&lt;/p&gt;
&lt;p&gt;The third phase is intra-location rebalancing, i.e. balancing the distribution
of tablet replicas within each location as if each location is a cluster on its
own. Use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--disable_intra_location_rebalancing&lt;/code&gt; flag to skip this phase.&lt;/p&gt;
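&lt;p&gt;For instance, a minimal sketch of running only the intra-location phase
(the master RPC endpoint below is a placeholder):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Skip the policy-reinstating and cross-location phases; only balance the
# distribution of tablet replicas within each location.
sudo -u kudu kudu cluster rebalance master-01.example.com:7051 \
    --disable_policy_fixer \
    --disable_cross_location_rebalancing
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;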
&lt;h1 id=&quot;future-work&quot;&gt;Future work&lt;/h1&gt;
&lt;p&gt;Having a CLI tool to reinstate placement policy is nice, but it would be great
to run the location-aware rebalancing in the background, automatically reinstating
the placement policy and making tablet replica distribution even
across a Kudu cluster.&lt;/p&gt;
&lt;p&gt;In addition to that, there is an idea to make it possible to have
multiple customizable placement policies in the system. As of now, there is
a request to implement so-called ‘table pinning’, i.e. to make it possible
to specify a placement policy where the replicas of particular tables’ tablets
are placed only on nodes within the specified locations. The table pinning
request is tracked as &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2604&quot;&gt;KUDU-2604&lt;/a&gt;
in Apache JIRA.&lt;/p&gt;
&lt;h1 id=&quot;references&quot;&gt;References&lt;/h1&gt;
&lt;p&gt;[1] Location awareness in Kudu: &lt;a href=&quot;https://github.com/apache/kudu/blob/master/docs/design-docs/location-awareness.md&quot;&gt;design document&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[2] A proposal for Kudu tablet server labeling: &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2604&quot;&gt;KUDU-2604&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[3] Further improvement: &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2780&quot;&gt;automatic cluster rebalancing&lt;/a&gt;.&lt;/p&gt;</content><author><name>Alexey Serbin</name></author><summary type="html">This post is about location awareness in Kudu. It gives an overview of the following: principles of the design restrictions of the current implementation potential future enhancements and extensions</summary></entry><entry><title type="html">Fine-Grained Authorization with Apache Kudu and Impala</title><link href="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html" rel="alternate" type="text/html" title="Fine-Grained Authorization with Apache Kudu and Impala" /><published>2019-04-22T00:00:00-07:00</published><updated>2019-04-22T00:00:00-07:00</updated><id>/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala</id><content type="html" xml:base="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html">&lt;p&gt;Note: This is a cross-post from the Cloudera Engineering Blog
&lt;a href=&quot;https://blog.cloudera.com/blog/2019/04/fine-grained-authorization-with-apache-kudu-and-impala/&quot;&gt;Fine-Grained Authorization with Apache Kudu and Impala&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Apache Impala supports fine-grained authorization via Apache Sentry on all of the tables it
manages including Apache Kudu tables. Given Impala is a very common way to access the data stored
in Kudu, this capability allows users deploying Impala and Kudu to fully secure the Kudu data in
multi-tenant clusters even though Kudu does not yet have native fine-grained authorization of its
own. This solution works because Kudu natively supports coarse-grained (all or nothing)
authorization which enables blocking all access to Kudu directly except for the impala user and
an optional whitelist of other trusted users. This post will describe how to use Apache Impala’s
fine-grained authorization support along with Apache Kudu’s coarse-grained authorization to
achieve a secure multi-tenant deployment.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;sample-workflow&quot;&gt;Sample Workflow&lt;/h2&gt;
&lt;p&gt;The examples in this post enable a workflow that uses Apache Spark to ingest data directly into
Kudu and Impala to run analytic queries on that data. The Spark job, run as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;etl_service&lt;/code&gt; user,
is permitted to access the Kudu data via coarse-grained authorization. Even though this gives
access to all the data in Kudu, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;etl_service&lt;/code&gt; user is only used for scheduled jobs or by an
administrator. All queries on the data, from a wide array of users, will use Impala and leverage
Impala’s fine-grained authorization. Impala’s
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_grant.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GRANT&lt;/code&gt; statements&lt;/a&gt;
allow you to flexibly control the privileges on the Kudu storage tables. Impala’s fine-grained
privileges along with support for
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_select.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_insert.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_update.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_upsert.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPSERT&lt;/code&gt;&lt;/a&gt;,
and &lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_delete.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt;&lt;/a&gt;
statements, allow you to finely control who can read and write data to your Kudu tables while
using Impala. Below is a diagram showing the workflow described:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/fine-grained-authorization-with-apache-kudu.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: The examples below assume that Authorization has already been configured for Kudu, Impala,
and Spark. For help configuring authorization see the Cloudera
&lt;a href=&quot;https://www.cloudera.com/documentation/enterprise/latest/topics/sg_auth_overview.html&quot;&gt;authorization documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;configuring-kudus-coarse-grained-authorization&quot;&gt;Configuring Kudu’s Coarse-Grained Authorization&lt;/h2&gt;
&lt;p&gt;Kudu supports coarse-grained authorization of client requests based on the authenticated client
Kerberos principal. The two levels of access which can be configured are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Superuser&lt;/em&gt; – principals authorized as a superuser are able to perform certain administrative
functionality such as using the kudu command line tool to diagnose or repair cluster issues.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;User&lt;/em&gt; – principals authorized as a user are able to access and modify all data in the Kudu
cluster. This includes the ability to create, drop, and alter tables as well as read, insert,
update, and delete data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Access levels are granted using whitelist-style Access Control Lists (ACLs), one for each of the
two levels. Each access control list either specifies a comma-separated list of users, or may be
set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*&lt;/code&gt; to indicate that all authenticated users are able to gain access at the specified level.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: The default value for the User ACL is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*&lt;/code&gt;, which allows all users access to the cluster.&lt;/p&gt;
&lt;h3 id=&quot;example-configuration&quot;&gt;Example Configuration&lt;/h3&gt;
&lt;p&gt;The first and most important step is to remove the default ACL of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*&lt;/code&gt; from Kudu’s
&lt;a href=&quot;https://kudu.apache.org/docs/configuration_reference.html#kudu-master_user_acl&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user_acl&lt;/code&gt; configuration&lt;/a&gt;.
This will ensure only the users you list will have access to the Kudu cluster. Then, to allow the
Impala service to access all of the data in Kudu, the Impala service user, usually impala, should
be added to the Kudu &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user_acl&lt;/code&gt; configuration. Any user that is not using Impala will also need
to be added to this list. For example, an Apache Spark job might be used to load data directly
into Kudu. Generally, a single user is used to run scheduled jobs of applications that do not
support fine-grained authorization on their own. For this example, that user is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;etl_service&lt;/code&gt;. The
full &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user_acl&lt;/code&gt; configuration is:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nt&quot;&gt;--user_acl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;impala,etl_service&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
&lt;p&gt;For more details see the Kudu
&lt;a href=&quot;https://kudu.apache.org/docs/security.html#_coarse_grained_authorization&quot;&gt;authorization documentation&lt;/a&gt;.&lt;/p&gt;
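&lt;p&gt;Putting the two access levels together, a minimal sketch of the ACL-related
flags for the Kudu master and tablet server processes (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;admin&lt;/code&gt; principal
is a placeholder):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Whitelist-style ACLs: 'admin' may run administrative tooling, while only
# 'impala' and 'etl_service' may read and write data directly.
--superuser_acl=admin
--user_acl=impala,etl_service
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;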
&lt;h2 id=&quot;using-impalas-fine-grained-authorization&quot;&gt;Using Impala’s Fine-Grained Authorization&lt;/h2&gt;
&lt;p&gt;Follow Impala’s
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_authorization.html&quot;&gt;authorization documentation&lt;/a&gt;
to configure fine-grained authorization. Once configured, you can use Impala’s
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_grant.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GRANT&lt;/code&gt; statements&lt;/a&gt;
to control the privileges of Kudu tables. These fine-grained privileges can be set at the database,
table, and column level. Additionally, you can individually control &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALTER&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DROP&lt;/code&gt; privileges.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: A user needs the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL&lt;/code&gt; privilege in order to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPSERT&lt;/code&gt;
statements against a Kudu table.&lt;/p&gt;
&lt;p&gt;Below is a brief example with a couple tables stored in Kudu:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TIMESTAMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;message&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HASH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PARTITIONS&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;KUDU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GRANT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;host&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INT64&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;DOUBLE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HASH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PARTITIONS&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;KUDU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GRANT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userB&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This brief example that combines Kudu’s coarse-grained authorization and Impala’s fine-grained
authorization should enable you to meet the security needs of your data workflow today. The
pattern described here can be applied to other services and workflows using Kudu as well. For
greater authorization flexibility, you can look forward to the near future when Kudu supports
native fine-grained authorization on its own. The Apache Kudu contributors understand the
importance of native fine-grained authorization and they are working on integrations with
Apache Sentry and Apache Ranger.&lt;/p&gt;</content><author><name>Grant Henke</name></author><summary type="html">Note: This is a cross-post from the Cloudera Engineering Blog Fine-Grained Authorization with Apache Kudu and Impala Apache Impala supports fine-grained authorization via Apache Sentry on all of the tables it manages including Apache Kudu tables. Given Impala is a very common way to access the data stored in Kudu, this capability allows users deploying Impala and Kudu to fully secure the Kudu data in multi-tenant clusters even though Kudu does not yet have native fine-grained authorization of its own. This solution works because Kudu natively supports coarse-grained (all or nothing) authorization which enables blocking all access to Kudu directly except for the impala user and an optional whitelist of other trusted users. This post will describe how to use Apache Impala’s fine-grained authorization support along with Apache Kudu’s coarse-grained authorization to achieve a secure multi-tenant deployment.</summary></entry></feed>