<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2021-06-22T14:15:16-07:00</updated><id>/feed.xml</id><entry><title type="html">Apache Kudu 1.15.0 Released</title><link href="/2021/06/22/apache-kudu-1-15-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.15.0 Released" /><published>2021-06-22T00:00:00-07:00</published><updated>2021-06-22T00:00:00-07:00</updated><id>/2021/06/22/apache-kudu-1-15-0-released</id><content type="html" xml:base="/2021/06/22/apache-kudu-1-15-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.15.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Kudu now experimentally supports multi-row transactions. Currently only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT_IGNORE&lt;/code&gt; operations are supported.
See &lt;a href=&quot;https://github.com/apache/kudu/blob/master/docs/design-docs/transactions.adoc&quot;&gt;here&lt;/a&gt; for a
design overview of this feature.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Kudu now supports Raft configuration change for Kudu masters and CLI tools for orchestrating
addition and removal of masters in a Kudu cluster. These tools substantially simplify the process
of migrating to multiple masters, recovering a dead master and removing masters from a Kudu
cluster. For detailed steps, see the latest administration documentation. This feature is evolving
and the steps to add, remove and recover masters may change in the future.
See &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2181&quot;&gt;KUDU-2181&lt;/a&gt; for details.&lt;/p&gt;
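&lt;p&gt;For illustration, a hedged sketch of how this CLI tooling might be invoked; host names are
placeholders and the exact subcommand syntax is described in the administration documentation:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Add a second master to an existing deployment (assumed syntax: the current master
# addresses followed by the new master's host:port).
$ kudu master add master-1:7051 master-2:7051

# Remove a master from the Raft configuration, e.g. before decommissioning its host.
$ kudu master remove master-1:7051,master-2:7051 master-2:7051&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;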
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Kudu now supports table comments directly on Kudu tables; the comments are automatically synchronized
when the Hive Metastore integration is enabled. Comments can be added at table creation time
and changed via table alteration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Kudu now experimentally supports per-table size limits based on leader disk space usage or number
of rows. When generating new authorization tokens, Masters will now consider the size limits and
strip tokens of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; privileges if either limit is reached. To enable this
feature, set the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--enable_table_write_limit&lt;/code&gt; master flag; adjust the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--table_disk_size_limit&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--table_row_count_limit&lt;/code&gt; flags as desired or use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu table set_limit&lt;/code&gt; tool to set
limits per table.&lt;/p&gt;
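&lt;p&gt;As a rough sketch (the limit values below are only illustrative, and the exact argument order of
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu table set_limit&lt;/code&gt; is an assumption; check its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--help&lt;/code&gt; output):&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# On the masters: enable the feature and set cluster-wide default limits (example values).
$ kudu-master --enable_table_write_limit \
    --table_disk_size_limit=107374182400 \
    --table_row_count_limit=1000000000 ...

# Override the limit for a single table via the CLI tool (assumed argument order).
$ kudu table set_limit disk_size master-1:7051 my_table 53687091200&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;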
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It is now possible to change the Kerberos Service Principal Name using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--principal&lt;/code&gt; flag. The
default SPN is still &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu/_HOST&lt;/code&gt;. Clients connecting to a cluster using a non-default SPN must
set the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sasl_protocol_name&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;saslProtocolName&lt;/code&gt; option to match the SPN base
(i.e. “kudu” if the SPN is “kudu/_HOST”) in the client builder or the Kudu CLI.
See &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-1884&quot;&gt;KUDU-1884&lt;/a&gt; for details.&lt;/p&gt;
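&lt;p&gt;For example (a hedged sketch; host names and the custom principal are placeholders, and the CLI
flag spelling is assumed to mirror the client builder option):&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Servers: run with a non-default Kerberos service principal name.
$ kudu-master --principal=kudu_prod/_HOST ...
$ kudu-tserver --principal=kudu_prod/_HOST ...

# CLI client: match the SPN base when connecting.
$ kudu table list master-1:7051 --sasl_protocol_name=kudu_prod&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;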
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Kudu RPC now supports TLSv1.3. Kudu servers and clients automatically negotiate TLSv1.3 for Kudu
RPC if the OpenSSL library (or, for Java clients, the Java runtime) on each side supports TLSv1.3.
If necessary, use the newly introduced flag &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--rpc_tls_ciphersuites&lt;/code&gt; to customize TLSv1.3-specific
cipher suites at the server side.
See &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2871&quot;&gt;KUDU-2871&lt;/a&gt; for details.&lt;/p&gt;
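&lt;p&gt;For instance (a hedged example; the cipher suite names are standard TLSv1.3 suites, and the
colon-separated value format is assumed to follow OpenSSL conventions):&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Restrict the TLSv1.3 cipher suites used for Kudu RPC on a server.
$ kudu-tserver --rpc_tls_ciphersuites=TLS_AES_256_GCM_SHA384:TLS_AES_128_GCM_SHA256 ...&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;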
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above is just a list of the highlights; for a more complete list of new
features, improvements, and fixes, please refer to the &lt;a href=&quot;/releases/1.15.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.15.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.15.0&quot;&gt;1.15.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.15.0/docs/installation.html#build_from_source&quot;&gt;1.15.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.15.0&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;, including for AArch64-based
architectures (ARM).&lt;/p&gt;</content><author><name>Bankim Bhavsar</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.15.0! The new release adds several new features and improvements, including the following:</summary></entry><entry><title type="html">Apache Kudu 1.14.0 Released</title><link href="/2021/01/28/apache-kudu-1-14-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.14.0 Released" /><published>2021-01-28T00:00:00-08:00</published><updated>2021-01-28T00:00:00-08:00</updated><id>/2021/01/28/apache-kudu-1-14-0-release</id><content type="html" xml:base="/2021/01/28/apache-kudu-1-14-0-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.14.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Full support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT_IGNORE&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE_IGNORE&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE_IGNORE&lt;/code&gt; operations
was added. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT_IGNORE&lt;/code&gt; operation will insert a row if one matching the key
does not exist and ignore the operation if one already exists. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE_IGNORE&lt;/code&gt;
operation will update the row if one matching the key exists and ignore the operation
if one does not exist. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE_IGNORE&lt;/code&gt; operation will delete the row if one matching
the key exists and ignore the operation if one does not exist. These operations are
particularly useful in situations where retries or duplicate operations could occur and
you do not want to manually handle the resulting errors, or where you do not want to cause
unnecessary writes and compaction work by using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPSERT&lt;/code&gt; operation.
The Java client can check if the cluster it is communicating with supports these operations
by calling the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;supportsIgnoreOperations()&lt;/code&gt; method on the KuduClient.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Spark 3 compatible JARs compiled for Scala 2.12 are now published for the Kudu Spark integration.
See &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-3202&quot;&gt;KUDU-3202&lt;/a&gt; for more details.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Every Kudu cluster now has an automatically generated cluster Id that can be used to uniquely
identify a cluster. The cluster Id is shown in the masters web-UI, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu master list&lt;/code&gt; tool,
and in master server logs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Downloading the WAL data and data blocks when copying tablets to another tablet server is now
parallelized, resulting in much faster tablet copy operations. These operations occur when
recovering from a down tablet server or when running the cluster rebalancer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The HMS integration now supports multiple Kudu clusters associated with a single HMS
including Kudu clusters that do not have HMS synchronization enabled. This is possible
because the Kudu master will now leverage the cluster Id to ignore notifications from
tables in a different cluster. Additionally, the HMS plugin will check if the Kudu cluster
associated with a table has HMS synchronization enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;DeltaMemStores will now be flushed as long as any DMS in a tablet is older than the point
defined by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--flush_threshold_secs&lt;/code&gt;, rather than flushing once every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--flush_threshold_secs&lt;/code&gt;
period. This can reduce memory pressure under update- or delete-heavy workloads, and lower tablet
server restart times following such workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above is just a list of the highlights; for a more complete list of new
features, improvements, and fixes, please refer to the &lt;a href=&quot;/releases/1.14.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.14.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.14.0&quot;&gt;1.14.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.14.0/docs/installation.html#build_from_source&quot;&gt;1.14.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.14.0&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;, including for AArch64-based
architectures (ARM).&lt;/p&gt;</content><author><name>Grant Henke</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.14.0! The new release adds several new features and improvements, including the following:</summary></entry><entry><title type="html">Optimized joins &amp;amp; filtering with Bloom filter predicate in Kudu</title><link href="/2021/01/15/bloom-filter-predicate.html" rel="alternate" type="text/html" title="Optimized joins &amp;amp; filtering with Bloom filter predicate in Kudu" /><published>2021-01-15T00:00:00-08:00</published><updated>2021-01-15T00:00:00-08:00</updated><id>/2021/01/15/bloom-filter-predicate</id><content type="html" xml:base="/2021/01/15/bloom-filter-predicate.html">&lt;p&gt;Note: This is a cross-post from the Cloudera Engineering Blog
&lt;a href=&quot;https://blog.cloudera.com/optimized-joins-filtering-with-bloom-filter-predicate-in-kudu/&quot;&gt;Optimized joins &amp;amp; filtering with Bloom filter predicate in Kudu&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Cloudera’s CDP Runtime version 7.1.5 maps to Apache Kudu 1.13 and upcoming Apache Impala 4.0&lt;/p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In database systems one of the most effective ways to improve performance is to avoid doing
unnecessary work, such as network transfers and reading data from disk. One of the ways Apache
Kudu achieves this is by supporting column predicates with scanners. Pushing down column predicate
filters to Kudu allows for optimized execution by skipping reading column values for filtered out
rows and reducing network IO between a client, like the distributed query engine Apache Impala, and
Kudu. See the documentation on
&lt;a href=&quot;https://docs.cloudera.com/runtime/latest/impala-reference/topics/impala-runtime-filtering.html&quot;&gt;runtime filtering in Impala&lt;/a&gt;
for details.&lt;/p&gt;
&lt;p&gt;CDP Runtime 7.1.5 and CDP Public Cloud added support for Bloom filter column predicate pushdown in
Kudu and the associated integration in Impala.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;bloom-filter&quot;&gt;Bloom filter&lt;/h2&gt;
&lt;p&gt;A Bloom filter is a space-efficient probabilistic data structure used to test set membership with a
possibility of false positive matches. In database systems these are used to determine whether a
set of data can be ignored when only a subset of the records are required. See the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Bloom_filter&quot;&gt;wikipedia page&lt;/a&gt; for more details.&lt;/p&gt;
&lt;p&gt;The implementation used in Kudu is a space, hash, and cache efficient block-based Bloom filter from
&lt;a href=&quot;https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf&quot;&gt;“Cache-, Hash- and Space-Efficient Bloom Filters”&lt;/a&gt;
by Putze et al. This Bloom filter was taken from the implementation in Impala and further enhanced.
The block based Bloom filter is designed to fit in CPU cache, and it allows SIMD operations using
AVX2, when available, for efficient lookup and insertion.&lt;/p&gt;
&lt;p&gt;Consider the case of a broadcast hash join between a small table and a big table where predicate
push down is not available. This typically involves the following steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the entire small table and construct a hash table from it.&lt;/li&gt;
&lt;li&gt;Broadcast the generated hash table to all worker nodes.&lt;/li&gt;
&lt;li&gt;On the worker nodes, fetch and iterate over slices of the big table, check whether each
key from the big table exists in the hash table, and return only the matched rows.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Step 3 is the heaviest since it involves reading the entire big table, and it could involve heavy
network IO if the worker nodes and the nodes hosting the big table are not the same servers.&lt;/p&gt;
&lt;p&gt;Before 7.1.5, Impala supported pushing down only the Minimum/Maximum (MIN_MAX) runtime filter to
Kudu which filters out values not within the specified bounds. In addition to the MIN_MAX runtime
filter, Impala in CDP 7.1.5+ now supports pushing down a runtime Bloom filter to Kudu. With the
newly introduced Bloom filter predicate support in Kudu, Impala can use this feature to perform
drastically more efficient joins for data stored in Kudu.&lt;/p&gt;
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;As in the scenario described above, we ran an Impala query which joins a big table stored on Kudu
and a small table stored as Parquet on HDFS. The small table was created using Parquet on HDFS to
isolate the new feature, but could also be stored in Kudu just the same. We ran the queries first
using only the MIN_MAX filter and then using both the MIN_MAX and BLOOM filters
(ALL runtime filters). For comparison, we created the same big table in Parquet on HDFS. Using
Parquet on HDFS is a great baseline for comparison because Impala already supports both MIN_MAX and
BLOOM filters for Parquet on HDFS.&lt;/p&gt;
&lt;h2 id=&quot;setup&quot;&gt;Setup&lt;/h2&gt;
&lt;p&gt;The following test was performed on a 6 node cluster with CDP Runtime 7.1.5.&lt;/p&gt;
&lt;p&gt;Hardware Configuration:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dell PowerEdge R430, 20c/40t Xeon e5-2630 v4 @ 2.2GHz, 128GB RAM, 4 x 2TB HDDs with 1 for WAL and 3
for data directories.&lt;/code&gt;&lt;/p&gt;
&lt;h3 id=&quot;schema&quot;&gt;Schema:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Big table consists of 260 million rows with randomly generated data hash partitioned by primary
key across 20 partitions on Kudu. The Kudu table was explicitly rebalanced to ensure a balanced
layout after the load.&lt;/li&gt;
&lt;li&gt;Small table consists of 2000 rows of top 1000 and bottom 1000 keys from the big table stored as
Parquet on HDFS. This prevents the MIN_MAX filters from doing any filtering on the big table as
all rows would fall under the range bounds of the MIN_MAX filters.&lt;/li&gt;
&lt;li&gt;COMPUTE STATS was run on all tables to help gather information about the table metadata and help
Impala optimize the query plan.&lt;/li&gt;
&lt;li&gt;All queries were run 10 times and the mean query runtime is depicted below.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;join-queries&quot;&gt;Join Queries&lt;/h2&gt;
&lt;p&gt;For join queries, we saw performance improvements of 3X to 5X in Kudu with Bloom filter predicate
pushdown. We expect to see even better performance multiples with larger data sizes and more
selective queries.&lt;/p&gt;
&lt;p&gt;Compared to Parquet on HDFS, Kudu performance is now better by around 17-33%.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/bloom-filter-join-queries.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;update-query&quot;&gt;Update Query&lt;/h2&gt;
&lt;p&gt;For an update query that basically upserts the entire small table into the existing big table, we
saw a 15X improvement. This is primarily due to the increased query performance when selecting the
rows to update.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/bloom-filter-update-query.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;See references section below for details on the table schema, loading process, and queries that were
run.&lt;/p&gt;
&lt;h2 id=&quot;tpc-h&quot;&gt;TPC-H&lt;/h2&gt;
&lt;p&gt;We also ran the TPC-H benchmark on a single node cluster with a scale factor of 30 and saw
performance improvements in the range of 19% to 31% with different block cache capacity settings.&lt;/p&gt;
&lt;p&gt;Kudu automatically disables Bloom filter predicates that are not effectively filtering data to avoid
any performance penalties from the new feature. During development of the feature, query 9 in the
TPC-H benchmark (TPC-H Q9) exhibited a regression of 50-96%: the time required
to scan the rows from Kudu increased by up to 2X. When investigating this regression we found that
the Bloom filter predicate that was pushed down was filtering out less than 10% of the rows, leading
to increased CPU usage in Kudu which outweighed the benefit of the filter. To resolve the regression
we added a heuristic in Kudu wherein if a Bloom filter predicate is not filtering out a sufficient
percentage of rows then it’s disabled automatically for the remainder of the scan. This is safe
because Bloom filters can return false positives and hence false matches returned to the client are
expected to be filtered out using other deterministic filters.&lt;/p&gt;
&lt;h2 id=&quot;feature-availability&quot;&gt;Feature Availability&lt;/h2&gt;
&lt;p&gt;Users querying Kudu using Impala will have the feature enabled by default from CDP 7.1.5 onward
and in CDP Public Cloud. We highly recommend users upgrade to get this performance enhancement and many
other performance enhancements in the release. For custom applications that use the Kudu client API
directly, the Kudu C++ client also has the Bloom filter predicate available from CDP 7.1.5 onward.
The Kudu Java client does not have the Bloom filter predicate available yet; see
&lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-3221&quot;&gt;KUDU-3221&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;references&quot;&gt;References:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Performance testing related schema and queries:
&lt;a href=&quot;https://gist.github.com/bbhavsar/006df9c40b4b0528e297fac29824ceb4&quot;&gt;https://gist.github.com/bbhavsar/006df9c40b4b0528e297fac29824ceb4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kudu C++ client documentation:
&lt;a href=&quot;https://kudu.apache.org/cpp-client-api/classkudu_1_1client_1_1KuduTable.html#a356e8d0d10491d4d8540adefac86be94&quot;&gt;https://kudu.apache.org/cpp-client-api/classkudu_1_1client_1_1KuduTable.html#a356e8d0d10491d4d8540adefac86be94&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Example code to create and pass Bloom filter predicate:
&lt;a href=&quot;https://github.com/apache/kudu/blob/master/src/kudu/client/predicate-test.cc#L1416&quot;&gt;https://github.com/apache/kudu/blob/master/src/kudu/client/predicate-test.cc#L1416&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Block based Bloom filter:
&lt;a href=&quot;https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h#L51&quot;&gt;https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h#L51&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;
&lt;p&gt;This feature was implemented jointly by Bankim Bhavsar and Wenzhe Zhou with guidance and feedback
from Tim Armstrong, Adar Dembo, Thomas Tauber-Marshall, Andrew Wong, and Grant Henke. We are also
grateful for our customers especially Mauricio Aristizabal from Impact for providing us valuable
feedback and benchmarks.&lt;/p&gt;</content><author><name>Bankim Bhavsar</name></author><summary type="html">Note: This is a cross-post from the Cloudera Engineering Blog Optimized joins &amp;amp; filtering with Bloom filter predicate in Kudu Cloudera’s CDP Runtime version 7.1.5 maps to Apache Kudu 1.13 and upcoming Apache Impala 4.0 Introduction In database systems one of the most effective ways to improve performance is to avoid doing unnecessary work, such as network transfers and reading data from disk. One of the ways Apache Kudu achieves this is by supporting column predicates with scanners. Pushing down column predicate filters to Kudu allows for optimized execution by skipping reading column values for filtered out rows and reducing network IO between a client, like the distributed query engine Apache Impala, and Kudu. See the documentation on runtime filtering in Impala for details. CDP Runtime 7.1.5 and CDP Public Cloud added support for Bloom filter column predicate pushdown in Kudu and the associated integration in Impala.</summary></entry><entry><title type="html">Apache Kudu 1.13.0 released</title><link href="/2020/09/21/apache-kudu-1-13-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.13.0 released" /><published>2020-09-21T00:00:00-07:00</published><updated>2020-09-21T00:00:00-07:00</updated><id>/2020/09/21/apache-kudu-1-13-0-release</id><content type="html" xml:base="/2020/09/21/apache-kudu-1-13-0-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.13.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;Added table ownership support. All newly created tables are automatically
owned by the user creating them. It is also possible to change the owner by
altering the table. You can also assign privileges to table owners via Apache
Ranger.&lt;/li&gt;
&lt;li&gt;An experimental feature was added to Kudu that allows it to automatically
rebalance tablet replicas among tablet servers. The background task can be
enabled by setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--auto_rebalancing_enabled&lt;/code&gt; flag on the Kudu masters.
Before starting auto-rebalancing on an existing cluster, the CLI rebalancer
tool should be run first (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Bloom filter column predicate pushdown has been added to allow optimized
execution of filters which match on a set of column values with a
false-positive rate. Support for Impala queries utilizing the Bloom filter
predicate is available, yielding performance improvements of 19% to 30% in TPC-H
benchmarks and around 41% improvement for distributed joins across large
tables. Support for Spark is not yet available.&lt;/li&gt;
&lt;li&gt;ARM-based architectures are now supported.&lt;/li&gt;
&lt;/ul&gt;
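&lt;p&gt;As a hedged sketch of the auto-rebalancing workflow mentioned above (master addresses are
placeholders):&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Run the CLI rebalancer once on the existing cluster before enabling the background task.
$ kudu cluster rebalance master-1:7051,master-2:7051,master-3:7051

# Then enable the auto-rebalancing background task on the masters.
$ kudu-master --auto_rebalancing_enabled ...&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;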
&lt;p&gt;The above is just a list of the highlights; for a more complete list of new
features, improvements, and fixes, please refer to the &lt;a href=&quot;/releases/1.13.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.13.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.13.0&quot;&gt;1.13.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.13.0/docs/installation.html#build_from_source&quot;&gt;1.13.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.13.0&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;, including for AArch64-based
architectures (ARM).&lt;/p&gt;</content><author><name>Attila Bukor</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.13.0! The new release adds several new features and improvements, including the following:</summary></entry><entry><title type="html">Fine-Grained Authorization with Apache Kudu and Apache Ranger</title><link href="/2020/08/11/fine-grained-authz-ranger.html" rel="alternate" type="text/html" title="Fine-Grained Authorization with Apache Kudu and Apache Ranger" /><published>2020-08-11T00:00:00-07:00</published><updated>2020-08-11T00:00:00-07:00</updated><id>/2020/08/11/fine-grained-authz-ranger</id><content type="html" xml:base="/2020/08/11/fine-grained-authz-ranger.html">&lt;p&gt;When Apache Kudu was first released in September 2016, it didn’t support any
kind of authorization. Anyone who could access the cluster could do anything
they wanted. To remedy this, coarse-grained authorization was added along with
authentication in Kudu 1.3.0. This meant allowing only certain users to access
Kudu, but those who were allowed access could still do whatever they wanted. The
only way to achieve finer-grained access control was to limit access to Apache
Impala where access control &lt;a href=&quot;/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html&quot;&gt;could be enforced&lt;/a&gt; by
fine-grained policies in Apache Sentry. This method limited how Kudu could be
accessed, so we saw a need to implement fine-grained access control in a way
that wouldn’t limit access to Impala only.&lt;/p&gt;
&lt;p&gt;Kudu 1.10.0 integrated with Apache Sentry to enable finer-grained authorization
policies. This integration was rather short-lived as it was deprecated in Kudu
1.12.0 and will be completely removed in Kudu 1.13.0.&lt;/p&gt;
&lt;p&gt;Most recently, since 1.12.0 Kudu supports fine-grained authorization by
integrating with Apache Ranger 2.1 and later. In this post, we’ll cover how this
works and how to set it up.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;
&lt;p&gt;Ranger supports a wide range of software across the Apache Hadoop ecosystem, but
unlike Sentry, it doesn’t depend on any of them for fine-grained authorization,
making it an ideal choice for Kudu.&lt;/p&gt;
&lt;p&gt;Ranger consists of an Admin server that has a web UI and a REST API where admins
can create policies. The policies are stored in a database (supported database
systems are Microsoft SQL Server, MySQL, Oracle, PostgreSQL, and SQL Anywhere)
and are periodically fetched and cached by the Ranger plugin that runs on the
Kudu Masters. The Ranger plugin is responsible for authorizing the requests
against the cached policies. At the time of writing this post, the Ranger plugin
base is available only in Java, as most Hadoop ecosystem projects, including
Ranger, are written in Java.&lt;/p&gt;
&lt;p&gt;Unlike Sentry’s client which we reimplemented in C++, the Ranger plugin is a fat
client that handles the evaluation of the policies (which are much richer and
more complex than Sentry policies) locally, so we decided not to reimplement it
in C++.&lt;/p&gt;
&lt;p&gt;Each Kudu Master spawns a JVM child process that is effectively a wrapper around
the Ranger plugin and communicates with it via named pipes.&lt;/p&gt;
&lt;h2 id=&quot;prerequisites&quot;&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;This post assumes the Admin Tool of a compatible Ranger version is
&lt;a href=&quot;https://ranger.apache.org/quick_start_guide.html&quot;&gt;installed&lt;/a&gt; on a host that is
reachable by both you and by all Kudu Master servers.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: At the time of writing this post, Ranger 2.0 is the most recent release
which does NOT support Kudu yet. Ranger 2.1 will be the first version that
supports Kudu. If you wish to use Kudu with Ranger before this is released, you
either need to build Ranger from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;master&lt;/code&gt; branch or use a distribution that
has already backported the relevant bits
(&lt;a href=&quot;https://issues.apache.org/jira/browse/RANGER-2684&quot;&gt;RANGER-2684&lt;/a&gt;:
0b23df7801062cc7836f2e162e1775101898add4).&lt;/p&gt;
&lt;p&gt;To enable Ranger integration in Kudu, Java 8 or later has to be available on the
Master servers.&lt;/p&gt;
&lt;p&gt;You can build the Ranger subprocess by navigating to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java/&lt;/code&gt; directory inside the Kudu
source tree, then running the below command:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./gradlew :kudu-subprocess:jar&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
&lt;p&gt;This will build the subprocess JAR which you can find in the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu-subprocess/build/libs&lt;/code&gt; directory.&lt;/p&gt;
&lt;h2 id=&quot;setting-up-kudu-with-ranger&quot;&gt;Setting up Kudu with Ranger&lt;/h2&gt;
&lt;p&gt;The first step is to add Kudu in Ranger Admin and set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tag.download.auth.users&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;policy.download.auth.users&lt;/code&gt; to the user or service principal name running
the Kudu process (typically &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu&lt;/code&gt;). The former is for downloading tag-based
policies which Kudu doesn’t currently support, so this is only for forward
compatibility and can be safely omitted.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-ranger/create-service.png&quot; alt=&quot;create-service&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Next, you’ll have to configure the Ranger plugin. As it’s written in Java and is
part of the Hadoop ecosystem, it expects to find a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core-site.xml&lt;/code&gt; in its
classpath that at a minimum configures the authentication types (simple or
Kerberos) and the group mapping. If your Kudu is co-located with a Hadoop
cluster, you can simply use your Hadoop’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core-site.xml&lt;/code&gt; and it should work.
Otherwise, you can use the below sample &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core-site.xml&lt;/code&gt; assuming you have
Kerberos enabled and shell-based groups mapping works for you:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-xml&quot; data-lang=&quot;xml&quot;&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;hadoop.security.authentication&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;kerberos&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;hadoop.security.group.mapping&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;org.apache.hadoop.security.ShellBasedUnixGroupsMapping&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
&lt;p&gt;In addition to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;core-site.xml&lt;/code&gt; file, you’ll also need a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger-kudu-security.xml&lt;/code&gt; in the same directory that looks like this:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-xml&quot; data-lang=&quot;xml&quot;&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.policy.cache.dir&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;/path/to/policy/cache/&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.service.name&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;kudu&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.policy.rest.url&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;http://ranger-admin:6080&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.policy.source.impl&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;org.apache.ranger.admin.client.RangerAdminRESTClient&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.policy.pollIntervalMs&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;30000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ranger.plugin.kudu.access.cluster.name&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Cluster 1&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.policy.cache.dir&lt;/code&gt; - A directory that is writable by the
user running the Master process where the plugin will cache the policies it
fetches from Ranger Admin.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.service.name&lt;/code&gt; - This needs to be set to whatever the
service name was set to on Ranger Admin.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.policy.rest.url&lt;/code&gt; - The URL of the Ranger Admin REST API.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.policy.source.impl&lt;/code&gt; - This should always be
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;org.apache.ranger.admin.client.RangerAdminRESTClient&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.policy.pollIntervalMs&lt;/code&gt; - This is the interval at which the
plugin will fetch policies from the Ranger Admin.&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ranger.plugin.kudu.access.cluster.name&lt;/code&gt; - The name of the cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: This is a minimal config. For more options refer to the &lt;a href=&quot;https://cwiki.apache.org/confluence/display/RANGER/Index&quot;&gt;Ranger
documentation&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Once these files are created, you need to point Kudu Masters to the directory
containing them with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ranger_config_path&lt;/code&gt; flag. In addition,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ranger_jar_path&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ranger_java_path&lt;/code&gt; should be configured. The Java path
defaults to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$JAVA_HOME/bin/java&lt;/code&gt; if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$JAVA_HOME&lt;/code&gt; is set and falls back to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$PATH&lt;/code&gt; if not. The JAR path defaults to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu-subprocess.jar&lt;/code&gt; in the
directory containing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu-master&lt;/code&gt; binary.&lt;/p&gt;
&lt;p&gt;As the last step, you need to set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-tserver_enforce_access_control&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt; on
the Tablet Servers to make sure access control is respected across the cluster.&lt;/p&gt;
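&lt;p&gt;Putting the flags from the last two paragraphs together, a hedged example (paths and the Java
location are placeholders for your environment):&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Masters: point Kudu at the Ranger client configuration, the subprocess JAR, and Java.
$ kudu-master --ranger_config_path=/etc/kudu/ranger \
    --ranger_jar_path=/opt/kudu/kudu-subprocess.jar \
    --ranger_java_path=/usr/lib/jvm/java-8/bin/java ...

# Tablet servers: enforce the authorization decisions made by the masters.
$ kudu-tserver --tserver_enforce_access_control=true ...&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;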
&lt;h2 id=&quot;creating-policies&quot;&gt;Creating policies&lt;/h2&gt;
&lt;p&gt;After setting up the integration it’s time to create some policies, as now only
trusted users are allowed to perform any action; everyone else is locked out.&lt;/p&gt;
&lt;p&gt;To create your first policy, log in to Ranger Admin, click on the Kudu service
you created in the first step of setup, then on the “Add New Policy” button in
the top right corner. You’ll need to name the policy and set the resource it
will apply to. Kudu doesn’t support databases, but with Ranger integration
enabled, it will treat the part of the table name before the first period as the
database name, or default to “default” if the table name doesn’t contain a
period (configurable with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ranger_default_database&lt;/code&gt; flag on the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu-master&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;There is no implicit hierarchy in the resources, which means that granting
privileges on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db=foo&lt;/code&gt; won’t imply privileges on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;foo.bar&lt;/code&gt;. To create a policy
that applies to all tables and all columns in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;foo&lt;/code&gt; database you need to
create a policy for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db=foo-&amp;gt;tbl=*-&amp;gt;col=*&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-ranger/create-policy.png&quot; alt=&quot;create-policy&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For a list of the required privileges to perform operations please refer to our
&lt;a href=&quot;/docs/security.html#policy-for-kudu-masters&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;table-ownership&quot;&gt;Table ownership&lt;/h2&gt;
&lt;p&gt;Kudu 1.13 will introduce table ownership, which enhances the authorization
experience when Ranger integration is enabled. Tables are automatically owned by
the user who created them, and it’s possible to change the owner as part of
an alter table operation.&lt;/p&gt;
&lt;p&gt;Ranger supports granting privileges to the table owners via a special &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{OWNER}&lt;/code&gt;
user. You can, for example, grant the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALL&lt;/code&gt; privilege and delegate admin (this
is required to change the owner of a table) to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{OWNER}&lt;/code&gt; on
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db=*-&amp;gt;table=*-&amp;gt;column=*&lt;/code&gt;. This way your users will be able to perform any
actions on the tables they created without having to explicitly assign
privileges per table. They will, of course, need to be granted the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE&lt;/code&gt;
privilege on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db=*&lt;/code&gt; or on a specific database to actually be able to create
their own tables.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog-ranger/allow-conditions.png&quot; alt=&quot;allow-conditions&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this post we’ve covered how to set up and use the newest Kudu integration,
Apache Ranger, and a sneak peek into the table ownership feature. Please try
them out if you have a chance, and let us know what you think on our &lt;a href=&quot;mailto:user@kudu.apache.org&quot;&gt;mailing
list&lt;/a&gt; or &lt;a href=&quot;https://getkudu.slack.com&quot;&gt;Slack&lt;/a&gt;. If you
run into any issues, feel free to reach out to us on either platform, or open a
&lt;a href=&quot;https://issues.apache.org/jira/projects/KUDU&quot;&gt;bug report&lt;/a&gt;.&lt;/p&gt;</content><author><name>Attila Bukor</name></author><summary type="html">When Apache Kudu was first released in September 2016, it didn’t support any kind of authorization. Anyone who could access the cluster could do anything they wanted. To remedy this, coarse-grained authorization was added along with authentication in Kudu 1.3.0. This meant allowing only certain users to access Kudu, but those who were allowed access could still do whatever they wanted. The only way to achieve finer-grained access control was to limit access to Apache Impala where access control could be enforced by fine-grained policies in Apache Sentry. This method limited how Kudu could be accessed, so we saw a need to implement fine-grained access control in a way that wouldn’t limit access to Impala only. Kudu 1.10.0 integrated with Apache Sentry to enable finer-grained authorization policies. This integration was rather short-lived as it was deprecated in Kudu 1.12.0 and will be completely removed in Kudu 1.13.0. Most recently, since 1.12.0 Kudu supports fine-grained authorization by integrating with Apache Ranger 2.1 and later. In this post, we’ll cover how this works and how to set it up.</summary></entry><entry><title type="html">Building Near Real-time Big Data Lake</title><link href="/2020/07/30/building-near-real-time-big-data-lake.html" rel="alternate" type="text/html" title="Building Near Real-time Big Data Lake" /><published>2020-07-30T00:00:00-07:00</published><updated>2020-07-30T00:00:00-07:00</updated><id>/2020/07/30/building-near-real-time-big-data-lake</id><content type="html" xml:base="/2020/07/30/building-near-real-time-big-data-lake.html">&lt;p&gt;Note: This is a cross-post from the Boris Tyukin’s personal blog &lt;a href=&quot;https://boristyukin.com/building-near-real-time-big-data-lake-part-2/&quot;&gt;Building Near Real-time Big Data Lake: Part 2&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is the second part of the series. In &lt;a href=&quot;https://boristyukin.com/building-near-real-time-big-data-lake-part-i/&quot;&gt;Part 1&lt;/a&gt;
I wrote about our use-case for the Data Lake architecture and shared our success story.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;requirements&quot;&gt;Requirements&lt;/h2&gt;
&lt;p&gt;Before we embarked on our journey, we had identified high-level requirements and guiding principles.
It is crucial to think it through and envision who will use your Data Lake and how. Identify your
first three projects and keep them in mind while you are building the Data Lake.&lt;/p&gt;
&lt;p&gt;The best way is to start a few smaller proof-of-concept projects: play with various distributed
engines and tools, run tons of benchmarks, and learn from others, who implemented a similar solution
successfully. Do not forget to learn from others’ mistakes too.&lt;/p&gt;
&lt;p&gt;We had settled on these 7 guiding principles before we started looking at technology and architecture:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Scale-out, not scale-up.&lt;/li&gt;
&lt;li&gt;Design for resiliency and availability.&lt;/li&gt;
&lt;li&gt;Support both real-time and batch ingestion into a Data Lake.&lt;/li&gt;
&lt;li&gt;Enable both ad-hoc exploratory analysis as well as interactive queries.&lt;/li&gt;
&lt;li&gt;Replicate in near real-time 300+ Cerner Millennium tables from 3 remote-hosted Cerner Oracle RAC
instances with average latency less than 10 seconds (time between a change made in Cerner EHR system
by clinicians and data ingested and ready for consumption in Data Lake).&lt;/li&gt;
&lt;li&gt;Have robust logging and monitoring processes to ensure reliability of the pipeline and to simplify
troubleshooting.&lt;/li&gt;
&lt;li&gt;Reduce manual work greatly and ease the ongoing support.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We decided to embrace the benefits and scalability of Big Data technology. In fact, it was a pretty
easy sell as our leadership was tired of constantly buying expensive software and hardware from
big-name vendors and not being able to scale-out to support an avalanche of new projects and requests.&lt;/p&gt;
&lt;p&gt;We started looking at Change Data Capture (CDC) products to mine and ship database logs from Oracle.&lt;/p&gt;
&lt;p&gt;We knew we had to implement a metadata- or code-as-configuration driven solution to manage hundreds
of tables, without expanding our team.&lt;/p&gt;
&lt;p&gt;We needed a flexible orchestration and scheduling tool, designed with real-time workloads in mind.&lt;/p&gt;
&lt;p&gt;Finally, we engaged our and Cerner’s leadership early, as it would take time to hash out all the
details, and to make their DBAs confident that we were not going to break their production systems
by streaming 1000s of messages every second 24x7. In fact, one of the goals was to relieve production
systems from analytical workloads.&lt;/p&gt;
&lt;h2 id=&quot;platform-selection&quot;&gt;Platform selection&lt;/h2&gt;
&lt;p&gt;First off, we had to decide on the actual platform. After spending 3 months researching, 4 options
emerged, given the realities of our organization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;On-premises virtualized cluster, using preferred vendors, recommended by our infrastructure team.&lt;/li&gt;
&lt;li&gt;On-premises Big Data appliance (bundled hardware and software, optimized for Big Data workloads).&lt;/li&gt;
&lt;li&gt;Big Data cluster in cloud, managed by ourselves (IaaS model, which just means renting a bunch of
VMs and running Cloudera or Hortonworks Big Data distribution).&lt;/li&gt;
&lt;li&gt;A fully managed cloud data platform and native cloud data warehouse (Snowflake, Google BigQuery,
Amazon Redshift, etc.)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each option had a long list of pros and cons, but ultimately we went with option 2. The price was
really attractive, it was a capital expense (our finance people rightfully hate subscriptions), and it
offered the best performance, security, and control.&lt;/p&gt;
&lt;p&gt;We made the decision in 2017. While we could not provision cluster resources and add nodes
with the click of a button, and we learned that software and hardware upgrades were a real chore, it was
still very much worth it, as we saved a seven-figure sum for the organization while getting the
performance we needed.&lt;/p&gt;
&lt;p&gt;Owning hardware also made a lot of sense for us as we could not forecast our needs far enough in the
future, and we could get a really powerful 6-node cluster for a fraction of the cost that we would end
up paying in subscription fees in the next 12 months. Of course, it did help that we already had a
state-of-the-art data center and people managing it.&lt;/p&gt;
&lt;p&gt;Fully-managed or serverless architecture was not really an option back then, but if you asked me
now, this would be the first thing I would look at if I had to build a data lake today
(definitely check AWS Lake Formation, AWS Athena, Amazon Redshift, Azure Synapse, Snowflake and
Google BigQuery).&lt;/p&gt;
&lt;p&gt;Your organization, goals, projects and situation could be very different and you should definitely
evaluate cloud solutions, especially in 2020 when prices are decreasing, cloud providers are
extremely competitive, and there are new attractive pricing options with a 3-year commitment. Make sure
you understand the cost and billing model. Or hire a company (there are plenty now) that will
explain your cloud bills before you get a horrifying check.&lt;/p&gt;
&lt;p&gt;Some of the things to consider:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Existing data center infrastructure and access to people, supporting it.&lt;/li&gt;
&lt;li&gt;Integration with current tools (BI, ETL, Advanced Analytics etc.) Do they stay on-premises or can
be moved into cloud to avoid networking lags or charges for data egress?&lt;/li&gt;
&lt;li&gt;Total ownership cost and cost to performance ratio.&lt;/li&gt;
&lt;li&gt;Do you really need elasticity? This is the first thing that cloud advocates are preaching, but
think about whether and how this applies to you.&lt;/li&gt;
&lt;li&gt;Is time-to-market so crucial for you, or can you wait a few months to build Big Data
infrastructure on-premises to save some money and get much better performance and control of
physical hardware?&lt;/li&gt;
&lt;li&gt;Are you okay with locking yourself in to vendor’s solution XYZ? This is an especially crucial
question if you are selecting a fully managed platform.&lt;/li&gt;
&lt;li&gt;Can you easily change your cloud provider? Or can you even afford to put all your trust and faith
in a single cloud provider?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Do your homework, spend a lot of time reading and talking to other people (engineers and architects,
not sales reps), and make sure you understand what you are signing up for.&lt;/p&gt;
&lt;p&gt;And remember, there is no magic! You still need to architect, design, build, support, test, and make
good choices and use common sense. No matter what your favorite vendor tells you. You might save
time by spinning up a cluster in minutes, but you still need people to manage all that. You still
need great architects and engineers to realize benefits from all that hot new tech.&lt;/p&gt;
&lt;h2 id=&quot;building-blocks&quot;&gt;Building blocks&lt;/h2&gt;
&lt;p&gt;Once we agreed on the platform of our choice, powered by Cloudera Enterprise Data Hub, we started
prototyping and benchmarking various engines and tools that came with it. We looked at other
open-source projects, as nothing really prevents you from installing and using any open-source
product you desire and trust. One of these products for us was Apache NiFi, which proved to be a
tremendous value.&lt;/p&gt;
&lt;p&gt;After a lot of trials and errors, we decided on this architecture:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/pipelinearchitecture.png&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;One of the toughest challenges we faced right away was the fact that most of the Big Data data
engines were not designed to support mutable data but rather immutable append-only data. All the
workarounds we had tried did not work for us; no matter what we did with the partitioning strategy,
we still needed a simple ability to update and delete data, not only insert it. Anyone who worked with
RDBMS or legacy columnar databases takes this capability for granted, but surprisingly it is a very
difficult task in Big Data world.&lt;/p&gt;
&lt;p&gt;We considered Apache HBase, but the performance of analytics-style ETL and interactive queries was
really bad. We were blown away by Apache Impala’s performance on HDFS as no matter what we threw at
Impala, it was hundreds of times faster…but we could not update data in place.&lt;/p&gt;
&lt;p&gt;At about the same time, Cloudera released and open-sourced Apache Kudu project that became part of
its official distribution. We got very excited about it (refer to our benchmarks &lt;a href=&quot;http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/&quot;&gt;here&lt;/a&gt;), and decided
to proceed with Kudu as a storage engine, while using Apache Impala as SQL query engine. One of the
ambitious goals of Apache Kudu is to cut the need for the infamous &lt;a href=&quot;https://en.wikipedia.org/wiki/Lambda_architecture&quot;&gt;Lambda architecture&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After talking to 7 vendors and playing with our top picks, we selected a Change Data Capture product
(Oracle GoldenGate for Big Data edition). It deserves a separate post, but let’s just say it was the
only product out of the 7 that was able to handle the complexities of the source Oracle RAC systems and
offer great performance without the need to install any agents or software on the actual production
database. Other solutions had a very long list of limitations for Oracle systems; make sure to read
and understand these limitations.&lt;/p&gt;
&lt;p&gt;Our homegrown tool &lt;a href=&quot;http://boristyukin.com/how-to-ingest-a-large-number-of-tables-into-a-big-data-lake-or-why-i-built-metazoo/&quot;&gt;MetaZoo&lt;/a&gt;
has been instrumental in bringing order and peace, and that’s why it earned
its own blog post!&lt;/p&gt;
&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;
&lt;p&gt;Initial ingest is pretty typical: we use Sqoop to extract data from Cerner Oracle databases, and
NiFi orchestrates the initial load. In fact, the NiFi flow below can handle the initial ingest of
hundreds of tables!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/nifi_initial.png&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Our secret sauce, though, is &lt;a href=&quot;http://boristyukin.com/how-to-ingest-a-large-number-of-tables-into-a-big-data-lake-or-why-i-built-metazoo/&quot;&gt;MetaZoo&lt;/a&gt;.
MetaZoo generates optimal parameters for Sqoop (such as the number of mappers, the split-by column, and
so forth), DDLs for staging and final tables, and SQL commands to transform data before it
lands in the Data Lake. MetaZoo also provides control tables to record the status of every table.&lt;/p&gt;
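&lt;p&gt;To give a feel for what MetaZoo produces, here is a minimal, hypothetical sketch of metadata-driven
Sqoop parameter generation. The method, its arguments, and the target directory are made up for
illustration; the real generator also emits DDLs, transformation SQL, and control table entries.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;/** Hypothetical illustration of MetaZoo-style Sqoop parameter generation. */
public static String buildSqoopCommand(String jdbcUrl, String tableName,
                                        String splitByColumn, int numMappers) {
  StringBuilder cmd = new StringBuilder(&quot;sqoop import&quot;);
  cmd.append(&quot; --connect &quot;).append(jdbcUrl);
  cmd.append(&quot; --table &quot;).append(tableName);
  cmd.append(&quot; --split-by &quot;).append(splitByColumn);   // column chosen from source metadata
  cmd.append(&quot; --num-mappers &quot;).append(numMappers);    // sized per table and YARN queue
  cmd.append(&quot; --target-dir /data/staging/&quot;).append(tableName.toLowerCase());
  return cmd.toString();
}
&lt;/code&gt;&lt;/pre&gt;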
&lt;p&gt;The throughput of Sqoop is nothing short of amazing. Gone are the days when we had to ask Cerner to
dump tables onto a hard drive and ship it by snail mail (do not ask how much that cost us!). And we
like how YARN queues help to limit the load on production databases.&lt;/p&gt;
&lt;p&gt;To give you one example, a few years ago it took us 4 weeks to reload the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clinical_event&lt;/code&gt; table from
Cerner into our local Oracle database using Informatica. With Sqoop and Big Data, it was done in 11
hours!&lt;/p&gt;
&lt;p&gt;This is what happens during the initial ingest.&lt;/p&gt;
&lt;p&gt;First, MetaZoo gathers relevant metadata from the source system about the tables to ingest and,
based on that metadata, generates DDL scripts, SQL command snippets, Sqoop parameters, and more. It
also initializes entries for these tables in the MetaZoo control tables.&lt;/p&gt;
&lt;p&gt;Then NiFi picks the list of tables to ingest from the MetaZoo control tables and runs the following
steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Execute and wait for Sqoop to finish.&lt;/li&gt;
&lt;li&gt;Apply some basic rules to map data types to the corresponding data types in the lake (a small
sketch of this mapping follows the list). We convert timestamps to a proper time zone as well. While
you do not want to do any heavy processing or data modeling in the Data Lake, and should keep data as
close to raw format as you can, some light processing upfront goes a long way and makes it easier for
analysts and developers to use these tables later.&lt;/li&gt;
&lt;li&gt;Load processed data into final tables after some basic validation.&lt;/li&gt;
&lt;li&gt;Compute Impala statistics.&lt;/li&gt;
&lt;li&gt;Set the initial ingest status to completed in the MetaZoo control tables so the table is ready for
real-time streaming.&lt;/li&gt;
&lt;/ul&gt;
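&lt;p&gt;As an illustration of the type-mapping step above, the rules are conceptually as simple as the
sketch below. The cases shown are assumptions for illustration only; our actual rules are driven by
MetaZoo metadata and also normalize time zones.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;/** Rough illustration of mapping Oracle source types to Kudu/Impala column types.
 *  The real mapping is metadata-driven; these cases are examples only. */
public static String mapOracleType(String oracleType, int scale) {
  switch (oracleType) {
    case &quot;VARCHAR2&quot;:
    case &quot;CHAR&quot;:
    case &quot;CLOB&quot;:
      return &quot;STRING&quot;;
    case &quot;NUMBER&quot;:
      // Whole numbers become BIGINT; fractional numbers become DOUBLE here for simplicity.
      return scale == 0 ? &quot;BIGINT&quot; : &quot;DOUBLE&quot;;
    case &quot;DATE&quot;:
    case &quot;TIMESTAMP&quot;:
      return &quot;TIMESTAMP&quot;;
    default:
      return &quot;STRING&quot;;
  }
}
&lt;/code&gt;&lt;/pre&gt;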
&lt;p&gt;Before we kick off the initial ingest process, we start Oracle GoldenGate extracts and replicats
(that’s the actual term) to begin capturing changes from the database and sending them into Kafka. Every
message, depending on the database operation type and the GoldenGate configuration, might contain
before/after table row values, the operation type, and the database commit transaction time (GoldenGate
only extracts changes for committed transactions). Once the initial ingest is finished, we can start the
real-time ingest flow in NiFi, because GoldenGate has been sending changes since the moment we started it.&lt;/p&gt;
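&lt;p&gt;Conceptually, each change record we read from Kafka can be thought of as a small envelope like the
sketch below. The field names are an assumption for illustration; the exact layout depends on how the
GoldenGate handler is configured.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import java.time.Instant;

/** Simplified, assumed view of one change event published by GoldenGate to a per-table Kafka topic. */
public class ChangeEvent {
  public String sourceTable;      // e.g. &quot;CLINICAL_EVENT&quot;
  public String opType;           // insert, update, delete, or primary key update
  public Instant commitTimestamp; // commit time of the originating database transaction
  public String beforeImage;      // row values before the change (updates and deletes)
  public String afterImage;       // row values after the change (inserts and updates)
}
&lt;/code&gt;&lt;/pre&gt;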
&lt;p&gt;A side benefit of decoupling GoldenGate, Kafka, NiFi, and Kudu is that the process is resilient to
failures. It also allows us to bring any one of these systems down for maintenance without much impact.&lt;/p&gt;
&lt;p&gt;Below is the NiFi flow that handles real-time streaming from Oracle/GoldenGate/Kafka and persists
data into Kudu:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/nifi_rt.png&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The NiFi flow consumes Kafka messages produced by GoldenGate. Every table from every domain has
its own Kafka topic, and each topic has only one partition to preserve the original order of messages.&lt;/li&gt;
&lt;li&gt;New messages are queued in NiFi using a simple first-in-first-out pattern and grouped by table.
This preserves the order of messages within a table while still allowing tables to be processed concurrently.&lt;/li&gt;
&lt;li&gt;Messages are transformed, using the same basic rules we apply during the initial ingest.&lt;/li&gt;
&lt;li&gt;Finally, messages are persisted into Kudu. Some of them represent INSERT operations, which
result in brand-new rows added to Kudu tables. Others are UPDATE and DELETE operations, and we also
have to deal with the exotic PK_UPDATE operation, where a primary key was changed for some reason in
the source system (e.g. PK=111 was renamed to 222). We wrote a custom Kudu client to handle all of
these cases using the Kudu Java API, which was fun to use (a minimal sketch of this logic follows the
list). NiFi allowed us to write custom processors and integrate that custom Kudu code directly into our flow.&lt;/li&gt;
&lt;li&gt;Useful metrics are stored in a separate Kudu table. We collect the number of messages processed,
the operation type (insert, update, delete, or primary key update), latency, and important timestamps.
Using this data, we can optimize and tweak the performance of the pipeline and monitor it by
visualizing the data on a dashboard.&lt;/li&gt;
&lt;/ol&gt;
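&lt;p&gt;To make the last step more concrete, below is a minimal sketch of the kind of logic our custom Kudu
code runs with the Kudu Java client. The table name, column names, and operation-type labels are
illustrative assumptions rather than our actual schema, and the real NiFi processor also handles
batching, retries, and error routing.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.kudu.client.Delete;
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.SessionConfiguration;
import org.apache.kudu.client.Update;

public class ChangeApplier {

  private final KuduClient client;

  public ChangeApplier(String masterAddresses) {
    this.client = new KuduClient.KuduClientBuilder(masterAddresses).build();
  }

  /** Applies a single change event to a Kudu table; names are illustrative. */
  public void apply(String tableName, String opType,
                    long oldKey, long newKey, String eventCode) throws KuduException {
    KuduTable table = client.openTable(tableName);
    KuduSession session = client.newSession();
    session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);

    if (opType.equals(&quot;INSERT&quot;)) {
      Insert insert = table.newInsert();
      fillRow(insert.getRow(), newKey, eventCode);
      session.apply(insert);
    } else if (opType.equals(&quot;UPDATE&quot;)) {
      Update update = table.newUpdate();
      fillRow(update.getRow(), newKey, eventCode);
      session.apply(update);
    } else if (opType.equals(&quot;DELETE&quot;)) {
      Delete delete = table.newDelete();
      delete.getRow().addLong(&quot;event_id&quot;, oldKey);
      session.apply(delete);
    } else if (opType.equals(&quot;PK_UPDATE&quot;)) {
      // A changed primary key becomes a delete of the old key plus an insert of the new row.
      Delete delete = table.newDelete();
      delete.getRow().addLong(&quot;event_id&quot;, oldKey);
      session.apply(delete);
      Insert insert = table.newInsert();
      fillRow(insert.getRow(), newKey, eventCode);
      session.apply(insert);
    }
    session.flush();
    session.close();
  }

  private void fillRow(PartialRow row, long key, String eventCode) {
    row.addLong(&quot;event_id&quot;, key);
    row.addString(&quot;event_cd&quot;, eventCode);
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The interesting case is PK_UPDATE: since Kudu identifies rows by primary key, we treat a key change
as a delete of the old key followed by an insert of the full new row.&lt;/p&gt;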
&lt;p&gt;The entire flow handles 900+ tables today (we capture 300 tables from each of 3 Cerner domains).&lt;/p&gt;
&lt;p&gt;We process ~2,000 messages per second, or 125MM messages per day. GoldenGate accumulates 150 GB worth
of database changes per day. In Kudu, we store over 120B rows of data.&lt;/p&gt;
&lt;p&gt;Our average latency is 6 seconds, and the pipeline runs 24x7.&lt;/p&gt;
&lt;h2 id=&quot;user-experience&quot;&gt;User experience&lt;/h2&gt;
&lt;p&gt;I am biased, but I think this is a game-changer for analysts, BI developers, and data people in
general. What they get is the ability to access near real-time production data with all the benefits
and scalability of Big Data technology.&lt;/p&gt;
&lt;p&gt;Here, I run a query in Impala to count patients admitted to our hospitals within the last 7 days
who are still in the hospital (not yet discharged):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/query1.png&quot; alt=&quot;query 1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then, 5 seconds later, I run the same query again and the numbers have already changed as more
patients have been admitted and discharged:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/query2.png&quot; alt=&quot;query 2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The query below counts certain clinical events in a 20B-row Kudu table (which is updated in near
real-time). While it takes 28 seconds to finish, this query would never even complete if I ran it against
our Oracle database. It found 13.7B events:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://boristyukin.com/content/images/query3.png&quot; alt=&quot;query 3&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;credits&quot;&gt;Credits&lt;/h2&gt;
&lt;p&gt;Apache Impala, Apache Kudu, and Apache NiFi were the pillars of our real-time pipeline. Back in 2017,
Impala was already a rock-solid, battle-tested project, while NiFi and Kudu were relatively new. We
did have some reservations about using them and were concerned about support if/when we needed it
(and we did need it a few times).&lt;/p&gt;
&lt;p&gt;We were amazed by all the help, dedication, knowledge sharing, friendliness, and openness of the
Impala, NiFi, and Kudu developers. A huge thank you to all of you who helped us along the way. You
guys are amazing and you are building fantastic products!&lt;/p&gt;
&lt;p&gt;To be continued…&lt;/p&gt;</content><author><name>Boris Tyukin</name></author><summary type="html">Note: This is a cross-post from the Boris Tyukin’s personal blog Building Near Real-time Big Data Lake: Part 2 This is the second part of the series. In Part 1 I wrote about our use-case for the Data Lake architecture and shared our success story.</summary></entry><entry><title type="html">Apache Kudu 1.12.0 released</title><link href="/2020/05/18/apache-kudu-1-12-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.12.0 released" /><published>2020-05-18T00:00:00-07:00</published><updated>2020-05-18T00:00:00-07:00</updated><id>/2020/05/18/apache-kudu-1-12-0-release</id><content type="html" xml:base="/2020/05/18/apache-kudu-1-12-0-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.12.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;Kudu now supports native fine-grained authorization via integration with
Apache Ranger. Kudu may now enforce access control policies defined for
Kudu tables and columns stored in Ranger. See the
&lt;a href=&quot;/releases/1.12.0/docs/security.html#fine_grained_authz&quot;&gt;authorization documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu’s web UI now supports proxying via Apache Knox. Kudu may be deployed
in a firewalled state behind a Knox Gateway which will forward HTTP requests
and responses between clients and the Kudu web UI.&lt;/li&gt;
&lt;li&gt;Kudu’s web UI now supports HTTP keep-alive. Operations that access multiple
URLs will now reuse a single HTTP connection, improving their performance.&lt;/li&gt;
&lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu tserver quiesce&lt;/code&gt; tool has been added to quiesce tablet servers. While a
tablet server is quiescing, it will stop hosting tablet leaders and stop
serving new scan requests. This can be used to orchestrate a rolling restart
without stopping ongoing Kudu workloads.&lt;/li&gt;
&lt;li&gt;Introduced &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto&lt;/code&gt; time source for HybridClock timestamps. With
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--time_source=auto&lt;/code&gt; in AWS and GCE cloud environments, Kudu masters and
tablet servers use the built-in NTP client synchronized with dedicated NTP
servers available via host-only networks. With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--time_source=auto&lt;/code&gt; in
environments other than AWS/GCE, Kudu masters and tablet servers rely on
their local machine’s clock synchronized by NTP. The default setting for
the HybridClock time source (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--time_source=system&lt;/code&gt;) is backward-compatible,
requiring the local machine’s clock to be synchronized by the kernel’s NTP
discipline.&lt;/li&gt;
&lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kudu cluster rebalance&lt;/code&gt; tool now supports moving replicas away from
specific tablet servers by supplying the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--ignored_tservers&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--move_replicas_from_ignored_tservers&lt;/code&gt; arguments (see
&lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2914&quot;&gt;KUDU-2914&lt;/a&gt; for more
details).&lt;/li&gt;
&lt;li&gt;Write Ahead Log file segments and index chunks are now managed by Kudu’s file
cache. With that, all long-lived file descriptors used by Kudu are managed by
the file cache, and there’s no longer a need for capacity planning of file
descriptor usage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above is just a list of the highlights; for a more complete list of new
features, improvements, and fixes, please refer to the &lt;a href=&quot;/releases/1.12.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.12.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.12.0&quot;&gt;1.12.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.12.0/docs/installation.html#build_from_source&quot;&gt;1.12.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.12.0&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;.&lt;/p&gt;</content><author><name>Hao Hao</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.12.0! The new release adds several new features and improvements, including the following:</summary></entry><entry><title type="html">Apache Kudu 1.10.1 released</title><link href="/2019/11/20/apache-kudu-1-10-1-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.10.1 released" /><published>2019-11-20T00:00:00-08:00</published><updated>2019-11-20T00:00:00-08:00</updated><id>/2019/11/20/apache-kudu-1-10-1-release</id><content type="html" xml:base="/2019/11/20/apache-kudu-1-10-1-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.10.1!&lt;/p&gt;
&lt;p&gt;Apache Kudu 1.10.1 is a bug fix release which fixes critical issues discovered
in Apache Kudu 1.10.0. In particular, this fixes a licensing issue with
distributing the libnuma library with the kudu-binary JAR artifact. Users of
Kudu 1.10.0 are encouraged to upgrade to 1.10.1 as soon as possible. See the
&lt;a href=&quot;/releases/1.10.1/docs/release_notes.html&quot;&gt;release notes&lt;/a&gt; for details.&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.10.1, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.10.1&quot;&gt;1.10.1 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.10.1/docs/installation.html#build_from_source&quot;&gt;1.10.1 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.10.1&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;</content><author><name>Alexey Serbin</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.10.1! Apache Kudu 1.10.1 is a bug fix release which fixes critical issues discovered in Apache Kudu 1.10.0. In particular, this fixes a licensing issue with distributing libnuma library with the kudu-binary JAR artifact. Users of Kudu 1.10.0 are encouraged to upgrade to 1.10.1 as soon as possible. See the release notes for details.</summary></entry><entry><title type="html">Apache Kudu 1.11.1 released</title><link href="/2019/11/20/apache-kudu-1-11-1-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.11.1 released" /><published>2019-11-20T00:00:00-08:00</published><updated>2019-11-20T00:00:00-08:00</updated><id>/2019/11/20/apache-kudu-1-11-1-release</id><content type="html" xml:base="/2019/11/20/apache-kudu-1-11-1-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.11.1!&lt;/p&gt;
&lt;p&gt;This release contains a fix which addresses a critical issue discovered in
1.10.0 and 1.11.0 and adds several new features and improvements since 1.10.0.&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;Apache Kudu 1.11.1 is a bug fix release which fixes critical issues discovered
in Apache Kudu 1.11.0. In particular, this release fixes a licensing issue with
distributing the libnuma library with the kudu-binary JAR artifact. Users of
Kudu 1.11.0 are encouraged to upgrade to 1.11.1 as soon as possible.&lt;/p&gt;
&lt;p&gt;Apache Kudu 1.11.1 adds several new features and improvements since
Apache Kudu 1.10.0, including the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Kudu now supports putting tablet servers into maintenance mode: while in this
mode, the tablet server’s replicas will not be re-replicated if the server
fails. Administrative CLI tools have been added to orchestrate tablet server maintenance
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2069&quot;&gt;KUDU-2069&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Kudu now has a built-in NTP client which maintains the internal wallclock
time used for generation of HybridTime timestamps. When enabled, system clock
synchronization for nodes running Kudu is no longer necessary
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2935&quot;&gt;KUDU-2935&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Aggregated table statistics are now available to Kudu clients. This allows
for various query optimizations. For example, Spark now uses them to perform
join optimizations
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2797&quot;&gt;KUDU-2797&lt;/a&gt; and
&lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2721&quot;&gt;KUDU-2721&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The kudu CLI tool now supports creating and dropping range partitions
for a table
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2881&quot;&gt;KUDU-2881&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The kudu CLI tool now supports altering and dropping table columns.&lt;/li&gt;
&lt;li&gt;The kudu CLI tool now supports getting and setting extra configuration
properties for a table
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2514&quot;&gt;KUDU-2514&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;See the &lt;a href=&quot;/releases/1.11.1/docs/release_notes.html&quot;&gt;release notes&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.11.1, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.11.1&quot;&gt;1.11.1 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.11.1/docs/installation.html#build_from_source&quot;&gt;1.11.1 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.11.1&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;.&lt;/p&gt;</content><author><name>Alexey Serbin</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.11.1! This release contains a fix which addresses a critical issue discovered in 1.10.0 and 1.11.0 and adds several new features and improvements since 1.10.0.</summary></entry><entry><title type="html">Apache Kudu 1.10.0 Released</title><link href="/2019/07/09/apache-kudu-1-10-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.10.0 Released" /><published>2019-07-09T00:00:00-07:00</published><updated>2019-07-09T00:00:00-07:00</updated><id>/2019/07/09/apache-kudu-1-10-0-release</id><content type="html" xml:base="/2019/07/09/apache-kudu-1-10-0-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.10.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;Kudu now supports both full and incremental table backups via a job
implemented using Apache Spark. Additionally, it supports restoring
tables from full and incremental backups via a restore job implemented using
Apache Spark. See the
&lt;a href=&quot;/releases/1.10.0/docs/administration.html#backup&quot;&gt;backup documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu can now synchronize its internal catalog with the Apache Hive Metastore,
automatically updating Hive Metastore table entries upon table creation,
deletion, and alterations in Kudu. See the
&lt;a href=&quot;/releases/1.10.0/docs/hive_metastore.html#metadata_sync&quot;&gt;HMS synchronization documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu now supports native fine-grained authorization via integration with
Apache Sentry. Kudu may now enforce access control policies defined for Kudu
tables and columns, as well as policies defined on Hive servers and databases
that may store Kudu tables. See the
&lt;a href=&quot;/releases/1.10.0/docs/security.html#fine_grained_authz&quot;&gt;authorization documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu’s web UI now supports SPNEGO, a protocol for securing HTTP requests with
Kerberos by passing negotiation through HTTP headers. To enable, set the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--webserver_require_spnego&lt;/code&gt; command line flag.&lt;/li&gt;
&lt;li&gt;Column comments can now be stored in Kudu tables, and can be updated using
the AlterTable API
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-1711&quot;&gt;KUDU-1711&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The performance of mutations (i.e. UPDATE, DELETE, and re-INSERT) to
not-yet-flushed Kudu data has been significantly optimized
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2826&quot;&gt;KUDU-2826&lt;/a&gt; and
&lt;a href=&quot;https://github.com/apache/kudu/commit/f9f9526d3&quot;&gt;f9f9526d3&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Predicate performance for primitive columns and IS NULL and IS NOT NULL
has been optimized
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2846&quot;&gt;KUDU-2846&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above is just a list of the highlights; for a more complete list of new
features, improvements, and fixes, please refer to the &lt;a href=&quot;/releases/1.10.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.10.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.10.0&quot;&gt;1.10.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.10.0/docs/installation.html#build_from_source&quot;&gt;1.10.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.10.0&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;.&lt;/p&gt;</content><author><name>Grant Henke</name></author><summary type="html">The Apache Kudu team is happy to announce the release of Kudu 1.10.0! The new release adds several new features and improvements, including the following:</summary></entry></feed>