
 ---
 title: Introducing Apache Kudu
 layout: default
 active_nav: docs
 last_updated: 'Last updated 2024-10-28 17:38:51 -0700'
 ---
 <!--

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->


 <div class="container">
   <div class="row">
     <div class="col-md-9">

 <h1>Introducing Apache Kudu</h1>
       <div id="preamble">
 <div class="sectionbody">
 <div class="paragraph">
 <p>Kudu is a distributed columnar storage engine optimized for OLAP workloads.
 Kudu runs on commodity hardware, is horizontally scalable, and supports highly
 available operation.</p>
 </div>
 <div class="paragraph">
 <p>Kudu&#8217;s design sets it apart. Some of Kudu&#8217;s benefits include:</p>
 </div>
 <div class="ulist">
 <ul>
 <li>
 <p>Fast processing of OLAP workloads.</p>
 </li>
 <li>
 <p>Strong but flexible consistency model, allowing you to choose consistency
 requirements on a per-request basis, including the option for
 strict-serializable consistency.</p>
 </li>
 <li>
 <p>Structured data model.</p>
 </li>
 <li>
 <p>Strong performance for running sequential and random workloads simultaneously.</p>
 </li>
 <li>
 <p>Tight integration with Apache Impala, making it a good, mutable alternative to
 using HDFS with Apache Parquet.</p>
 </li>
 <li>
 <p>Integration with Apache NiFi and Apache Spark.</p>
 </li>
 <li>
 <p>Integration with Hive Metastore (HMS) and Apache Ranger to provide
 fine-grained authorization and access control.</p>
 </li>
 <li>
 <p>Authenticated and encrypted RPC communication.</p>
 </li>
 <li>
 <p>High availability: Tablet Servers and Masters use the <a href="#raft">Raft Consensus Algorithm</a>, which ensures
 that as long as more than half the total number of tablet replicas is
 available, the tablet is available for reads and writes. For instance,
 if 2 out of 3 replicas (or 3 out of 5 replicas, etc.) are available,
 the tablet is available. Reads can be serviced by read-only follower tablet
 replicas, even in the event of a leader replica&#8217;s failure.</p>
 </li>
 <li>
 <p>Automatic fault detection and self-healing: to keep data highly available,
 the system detects failed tablet replicas and re-replicates data from
 available ones, so failed replicas are automatically replaced when enough
 Tablet Servers are available in the cluster.</p>
 </li>
 <li>
 <p>Location awareness (a.k.a. rack awareness) to keep the system available
 in case of correlated failures and to allow Kudu clusters to span
 multiple availability zones.</p>
 </li>
 <li>
 <p>Logical backup (full and incremental) and restore.</p>
 </li>
 <li>
 <p>Multi-row transactions (only for INSERT/INSERT_IGNORE operations as of
 the Kudu 1.15 release).</p>
 </li>
 <li>
 <p>Easy to administer and manage.</p>
 </li>
 </ul>
 </div>
 <div class="paragraph">
 <p>By combining all of these properties, Kudu targets families of
 applications that are difficult or impossible to implement using Hadoop storage
 technologies, while remaining compatible with most of the data processing
 frameworks in the Hadoop ecosystem.</p>
 </div>
 <div class="paragraph">
 <p>A few examples of applications for which Kudu is a great solution are:</p>
 </div>
 <div class="ulist">
 <ul>
 <li>
 <p>Reporting applications where newly arrived data needs to be immediately available to end users</p>
 </li>
 <li>
 <p>Time-series applications that must simultaneously support:</p>
 <div class="ulist">
 <ul>
 <li>
 <p>queries across large amounts of historic data</p>
 </li>
 <li>
 <p>granular queries about an individual entity that must return very quickly</p>
 </li>
 </ul>
 </div>
 </li>
 <li>
 <p>Applications that use predictive models to make real-time decisions with periodic
 refreshes of the predictive model based on all historic data</p>
 </li>
 </ul>
 </div>
 <div class="paragraph">
 <p>For more information about these and other scenarios, see <a href="#kudu_use_cases">Example Use Cases</a>.</p>
 </div>
 </div>
 </div>
 <div class="sect1">
 <h2 id="_kudu_impala_integration_features"><a class="link" href="#_kudu_impala_integration_features">Kudu-Impala Integration Features</a></h2>
 <div class="sectionbody">
 <div class="dlist">
 <dl>
 <dt class="hdlist1"><code>CREATE/ALTER/DROP TABLE</code></dt>
 <dd>
 <p>Impala supports creating, altering, and dropping tables using Kudu as the persistence layer.
 The tables follow the same internal / external approach as other tables in Impala,
 allowing for flexible data ingestion and querying.</p>
 </dd>
 <dt class="hdlist1"><code>INSERT</code></dt>
 <dd>
 <p>Data can be inserted into Kudu tables in Impala using the same syntax as
 for any other Impala table, such as those using HDFS or HBase for persistence.</p>
 </dd>
 <dt class="hdlist1"><code>UPDATE</code> / <code>DELETE</code></dt>
 <dd>
 <p>Impala supports the <code>UPDATE</code> and <code>DELETE</code> SQL commands to modify existing data in
 a Kudu table row-by-row or as a batch. The syntax of the SQL commands is chosen
 to be as compatible as possible with existing standards. In addition to simple <code>DELETE</code>
 or <code>UPDATE</code> commands, you can specify complex joins with a <code>FROM</code> clause in a subquery.</p>
 </dd>
 <dt class="hdlist1">Flexible Partitioning</dt>
 <dd>
 <p>Similar to partitioning of tables in Hive, Kudu allows you to dynamically
 pre-split tables by hash or range into a predefined number of tablets, in order
 to distribute writes and queries evenly across your cluster. You can partition by
 any number of primary key columns, by any number of hashes, and an optional list of
 split rows. See <a href="schema_design.html#schema_design">Schema Design</a>.</p>
 </dd>
 <dt class="hdlist1">Parallel Scan</dt>
 <dd>
 <p>To achieve the highest possible performance on modern hardware, the Kudu client
 used by Impala parallelizes scans across multiple tablets.</p>
 </dd>
 <dt class="hdlist1">High-efficiency queries</dt>
 <dd>
 <p>Where possible, Impala pushes down predicate evaluation to Kudu, so that predicates
 are evaluated as close as possible to the data. Query performance is comparable
 to Parquet in many workloads.</p>
 </dd>
 </dl>
 </div>
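 <div class="paragraph">
 <p>The effect of predicate pushdown can be sketched in a few lines of Python. This is a
 toy model, not the Kudu or Impala API: <code>ROWS</code>, <code>scan_without_pushdown</code>, and
 <code>scan_with_pushdown</code> are hypothetical names, and the in-memory list stands in for a tablet.
 It only shows that filtering at the storage layer reduces how many rows are handed to the query engine.</p>
 </div>

```python
# Hypothetical in-memory stand-in for the rows of one tablet (not Kudu's API).
ROWS = [
    {"host": "a", "ts": 1, "cpu": 0.20},
    {"host": "b", "ts": 1, "cpu": 0.95},
    {"host": "a", "ts": 2, "cpu": 0.91},
    {"host": "b", "ts": 2, "cpu": 0.10},
]


def scan_without_pushdown(rows, predicate):
    """Storage hands over every row; the query engine filters afterwards."""
    transferred = list(rows)
    matches = [row for row in transferred if predicate(row)]
    return matches, len(transferred)


def scan_with_pushdown(rows, predicate):
    """Storage evaluates the predicate and hands over only matching rows."""
    transferred = [row for row in rows if predicate(row)]
    return transferred, len(transferred)


def is_busy(row):
    return row["cpu"] >= 0.9


matches_a, moved_a = scan_without_pushdown(ROWS, is_busy)
matches_b, moved_b = scan_with_pushdown(ROWS, is_busy)
print(moved_a, moved_b)  # prints: 4 2
```

 <div class="paragraph">
 <p>Both scans return the same two rows; the difference is only in how many rows cross the
 boundary between storage and the query engine.</p>
 </div>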
 <div class="paragraph">
 <p>For more details regarding querying data stored in Kudu using Impala, please
 refer to the Impala documentation.</p>
 </div>
 </div>
 </div>
 <div class="sect1">
 <h2 id="_concepts_and_terms"><a class="link" href="#_concepts_and_terms">Concepts and Terms</a></h2>
 <div class="sectionbody">
 <div id="kudu_columnar_data_store" class="paragraph">
 <div class="title">Columnar Data Store</div>
 <p>Kudu is a <em>columnar data store</em>. A columnar data store stores data in strongly-typed
 columns. With a proper design, it is superior for analytical or data warehousing
 workloads for several reasons.</p>
 </div>
 <div class="dlist">
 <dl>
 <dt class="hdlist1">Read Efficiency</dt>
 <dd>
 <p>For analytical queries, you can read a single column, or a portion
 of that column, while ignoring other columns. This means you can fulfill your query
 while reading a minimal number of blocks on disk. With a row-based store, you need
 to read the entire row, even if you only return values from a few columns.</p>
 </dd>
 <dt class="hdlist1">Data Compression</dt>
 <dd>
 <p>Because a given column contains only one type of data,
 pattern-based compression can be orders of magnitude more efficient than
 compressing mixed data types, which are used in row-based solutions. Combined
 with the efficiencies of reading data from columns, compression allows you to
 fulfill your query while reading even fewer blocks from disk. See
 <a href="schema_design.html#encoding">Data Compression</a>.</p>
 </dd>
 </dl>
 </div>
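 <div class="paragraph">
 <p>Both points can be illustrated with a small Python sketch. It is an assumption-laden toy:
 the sample rows and the helper names <code>to_columns</code> and <code>run_length_encode</code> are
 invented for illustration, and run-length encoding is used only as one simple example of
 pattern-based compression, not as a description of how Kudu lays out data on disk.</p>
 </div>

```python
from itertools import groupby

# Hypothetical sample rows: (day, service, status).
rows = [
    ("2024-01-01", "web", 200),
    ("2024-01-01", "web", 200),
    ("2024-01-01", "api", 200),
    ("2024-01-02", "api", 500),
]


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows, ["day", "service", "status"])

# Read efficiency: a query on status touches one column, not whole rows.
status_only = columns["status"]

# Compression: a single-typed column with repeated values encodes compactly.
encoded_day = run_length_encode(columns["day"])
print(encoded_day)  # prints: [('2024-01-01', 3), ('2024-01-02', 1)]
```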
 <div class="paragraph">
 <div class="title">Table</div>
 <p>A <em>table</em> is where your data is stored in Kudu. A table has a schema and
 a totally ordered primary key. A table is split into segments called tablets.</p>
 </div>
 <div class="paragraph">
 <div class="title">Tablet</div>
 <p>A <em>tablet</em> is a contiguous segment of a table, similar to a <em>partition</em> in
 other data storage engines or relational databases. A given tablet is
 replicated on multiple tablet servers, and at any given point in time,
 one of these replicas is considered the leader tablet. Any replica can service
 reads, and writes require consensus among the set of tablet servers serving the tablet.</p>
 </div>
 <div class="paragraph">
 <div class="title">Tablet Server</div>
 <p>A <em>tablet server</em> stores and serves tablets to clients. For a
 given tablet, one tablet server acts as a leader, and the others act as
 follower replicas of that tablet. Only leaders service write requests, while
 both leaders and followers can service read requests. Leaders are elected using the
 <a href="#raft">Raft Consensus Algorithm</a>. One tablet server can serve multiple tablets, and one tablet can be served
 by multiple tablet servers.</p>
 </div>
 <div class="paragraph">
 <div class="title">Master</div>
 <p>The <em>master</em> keeps track of all the tablets, tablet servers, the
 <a href="#catalog_table">Catalog Table</a>, and other metadata related to the cluster. At a given point
 in time, there can only be one acting master (the leader). If the current leader
 disappears, a new master is elected using the <a href="#raft">Raft Consensus Algorithm</a>.</p>
 </div>
 <div class="paragraph">
 <p>The master also coordinates metadata operations for clients. For example, when
 creating a new table, the client internally sends the request to the master. The
 master writes the metadata for the new table into the catalog table, and
 coordinates the process of creating tablets on the tablet servers.</p>
 </div>
 <div class="paragraph">
 <p>All the master&#8217;s data is stored in a tablet, which can be replicated to all the
 other candidate masters.</p>
 </div>
 <div class="paragraph">
 <p>Tablet servers heartbeat to the master at a set interval (the default is once
 per second).</p>
 </div>
 <div id="raft" class="paragraph">
 <div class="title">Raft Consensus Algorithm</div>
 <p>Kudu uses the <a href="https://raft.github.io/">Raft consensus algorithm</a> as
 a means to guarantee fault-tolerance and consistency, both for regular tablets and for master
 data. Through Raft, multiple replicas of a tablet elect a <em>leader</em>, which is responsible
 for accepting and replicating writes to <em>follower</em> replicas. Once a write is persisted
 in a majority of replicas, it is acknowledged to the client. A given group of <code>N</code> replicas
 (usually 3 or 5) is able to accept writes with at most <code>(N - 1)/2</code> faulty replicas.</p>
 </div>
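 <div class="paragraph">
 <p>The majority arithmetic can be checked with a short Python sketch. The function names are
 hypothetical, and <code>(N - 1)/2</code> is read here as integer division, which is an assumption
 that only matters for even values of <code>N</code>.</p>
 </div>

```python
def majority(n):
    """Smallest number of replicas that is more than half of n."""
    return n // 2 + 1


def max_faulty(n):
    """Largest number of failed replicas that still leaves a majority."""
    return (n - 1) // 2


def can_accept_writes(n, available):
    """True when the available replicas still form a majority of n."""
    return available >= majority(n)


for n in (3, 5):
    print(n, majority(n), max_faulty(n))
# prints:
# 3 2 1
# 5 3 2
```

 <div class="paragraph">
 <p>With 3 replicas, one failure is tolerated; with 5 replicas, two failures are tolerated.</p>
 </div>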
 <div id="catalog_table" class="paragraph">
 <div class="title">Catalog Table</div>
 <p>The <em>catalog table</em> is the central location for
 Kudu&#8217;s metadata. It stores information about tables and tablets. The catalog
 table may not be read or written directly. Instead, it is accessible
 only via metadata operations exposed in the client API.</p>
 </div>
 <div class="paragraph">
 <p>The catalog table stores two categories of metadata:</p>
 </div>
 <div class="dlist">
 <dl>
 <dt class="hdlist1">Tables</dt>
 <dd>
 <p>table schemas, locations, and states</p>
 </dd>
 <dt class="hdlist1">Tablets</dt>
 <dd>
 <p>the list of existing tablets, which tablet servers have replicas of
 each tablet, the tablet&#8217;s current state, and start and end keys.</p>
 </dd>
 </dl>
 </div>
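 <div class="paragraph">
 <p>As a rough illustration of how start and end keys plus replica locations are enough to route
 a request, the following Python sketch looks up the tablet covering a key. Everything in it is
 hypothetical: the <code>TABLETS</code> list, the tablet server names, the use of plain strings as keys,
 and the convention that a range includes its start key and excludes its end key are assumptions
 made for the example, not the catalog table&#8217;s actual format.</p>
 </div>

```python
from bisect import bisect_right

# Hypothetical tablet metadata, sorted by start key.
# "" means unbounded below; None means unbounded above.
TABLETS = [
    {"id": "t1", "start": "", "end": "g", "replicas": ["ts1", "ts2", "ts3"]},
    {"id": "t2", "start": "g", "end": "p", "replicas": ["ts2", "ts3", "ts4"]},
    {"id": "t3", "start": "p", "end": None, "replicas": ["ts1", "ts3", "ts4"]},
]
START_KEYS = [tablet["start"] for tablet in TABLETS]


def find_tablet(key):
    """Return the tablet whose [start, end) key range contains key."""
    index = bisect_right(START_KEYS, key) - 1
    return TABLETS[index]


print(find_tablet("apple")["id"])  # prints: t1
print(find_tablet("g")["id"])      # prints: t2
print(find_tablet("zebra")["id"])  # prints: t3
```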
 <div class="paragraph">
 <div class="title">Logical Replication</div>
 <p>Kudu replicates operations, not on-disk data. This is referred to as <em>logical replication</em>,
 as opposed to <em>physical replication</em>. This has several advantages:</p>
 </div>
 <div class="ulist">
 <ul>
 <li>
 <p>Although inserts and updates do transmit data over the network, deletes do not need
 to move any data. The delete operation is sent to each tablet server, which performs
 the delete locally.</p>
 </li>
 <li>
 <p>Physical operations, such as compaction, do not need to transmit the data over the
 network in Kudu. This is different from storage systems that use HDFS, where
 the blocks need to be transmitted over the network to fulfill the required number of
 replicas.</p>
 </li>
 <li>
 <p>Tablets do not need to perform compactions at the same time or on the same schedule,
 or otherwise remain in sync on the physical storage layer. This decreases the chances
 of all tablet servers experiencing high latency at the same time, due to compactions
 or heavy write loads.</p>
 </li>
 </ul>
 </div>
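 <div class="paragraph">
 <p>The idea behind the list above can be sketched with a toy model in Python. The
 <code>Replica</code> class and <code>replicate</code> function are hypothetical and ignore consensus,
 ordering guarantees, and durability; they only show that replicas which apply the same
 operations end up with the same logical rows, while physical maintenance such as compaction
 stays local to each replica.</p>
 </div>

```python
class Replica:
    """Toy tablet replica: logical rows plus a count of local compactions."""

    def __init__(self):
        self.rows = {}
        self.compactions = 0

    def apply(self, op):
        """Apply one replicated operation of the form (kind, key, value)."""
        kind, key, value = op
        if kind in ("insert", "update"):
            self.rows[key] = value
        elif kind == "delete":
            # Only the key travels with a delete; no row data is shipped.
            self.rows.pop(key, None)

    def compact(self):
        """Physical maintenance: local only, nothing is sent to other replicas."""
        self.compactions += 1


def replicate(ops, replicas):
    """Send every operation, in order, to every replica."""
    for op in ops:
        for replica in replicas:
            replica.apply(op)


replicas = [Replica(), Replica(), Replica()]
ops = [
    ("insert", "k1", "v1"),
    ("insert", "k2", "v2"),
    ("update", "k1", "v9"),
    ("delete", "k2", None),
]
replicate(ops, replicas)
replicas[0].compact()  # one replica compacts on its own schedule

print([replica.rows for replica in replicas])
print([replica.compactions for replica in replicas])  # prints: [1, 0, 0]
```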
 </div>
 </div>
 <div class="sect1">
 <h2 id="_architectural_overview"><a class="link" href="#_architectural_overview">Architectural Overview</a></h2>
 <div class="sectionbody">
 <div class="paragraph">
 <p>The following diagram shows a Kudu cluster with three masters and multiple tablet
 servers, each serving multiple tablets. It illustrates how Raft consensus is used
 to establish leaders and followers among both the masters and the tablet servers. In
 addition, a tablet server can be a leader for some tablets and a follower for others.
 Leaders are shown in gold, while followers are shown in blue.</p>
 </div>
 <div class="imageblock">
 <div class="content">
 <img src="./images/kudu-architecture-2.png" alt="Kudu Architecture" width="800">
 </div>
 </div>
 </div>
 </div>
 <div class="sect1">
 <h2 id="kudu_use_cases"><a class="link" href="#kudu_use_cases">Example Use Cases</a></h2>
 <div class="sectionbody">
 <div class="paragraph">
 <div class="title">Streaming Input with Near Real Time Availability</div>
 <p>A common challenge in data analysis is one where new data arrives rapidly and constantly,
 and the same data needs to be available in near real time for reads, scans, and
 updates. Kudu offers the powerful combination of fast inserts and updates with
 efficient columnar scans to enable real-time analytics use cases on a single storage layer.</p>
 </div>
 <div class="paragraph">
 <div class="title">Time-Series Application with Widely Varying Access Patterns</div>
 <p>A time-series schema is one in which data points are organized and keyed according
 to the time at which they occurred. This can be useful for investigating the
 performance of metrics over time or attempting to predict future behavior based
 on past data. For instance, time-series customer data might be used to store
 purchase click-stream history, to predict future purchases, or to assist a
 customer support representative. While these different types of analysis are occurring,
 inserts and mutations may also be occurring individually and in bulk, and they become available
 immediately to read workloads. Kudu can handle all of these access patterns
 simultaneously in a scalable and efficient manner.</p>
 </div>
 <div class="paragraph">
 <p>Kudu is a good fit for time-series workloads for several reasons. With Kudu&#8217;s support for
 hash-based partitioning, combined with its native support for compound row keys, it is
 simple to set up a table spread across many servers without the risk of "hotspotting"
 that is commonly observed when range partitioning is used. Kudu&#8217;s columnar storage engine
 is also beneficial in this context, because many time-series workloads read only a few columns,
 as opposed to the whole row.</p>
 </div>
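 <div class="paragraph">
 <p>The hotspotting effect can be illustrated with a small Python sketch. It is a simplified model
 under stated assumptions: <code>NUM_TABLETS</code>, <code>range_tablet</code>, <code>hash_tablet</code>, and the
 synthetic <code>writes</code> are hypothetical, and <code>zlib.crc32</code> is only a convenient stable hash
 for the example, not the hash function Kudu uses.</p>
 </div>

```python
import zlib
from collections import Counter

NUM_TABLETS = 4  # hypothetical number of hash buckets


def range_tablet(ts, tablet_width=1000):
    """Range partitioning on time: nearby timestamps share a tablet."""
    return ts // tablet_width


def hash_tablet(host):
    """Hash partitioning on host: a stable hash of the key picks the bucket."""
    return zlib.crc32(host.encode("utf-8")) % NUM_TABLETS


# Synthetic recent writes: 16 hosts reporting at consecutive timestamps.
writes = [(f"host-{i % 16}", 5000 + i) for i in range(64)]

range_load = Counter(range_tablet(ts) for _, ts in writes)
hash_load = Counter(hash_tablet(host) for host, _ in writes)

print(dict(range_load))  # every write lands in the newest time range
print(dict(hash_load))   # writes are divided among several buckets
```

 <div class="paragraph">
 <p>In this model, all 64 writes fall into a single range tablet because their timestamps are
 adjacent, whereas hashing on the host column spreads the same writes over more than one bucket.</p>
 </div>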
 <div class="paragraph">
 <p>In the past, you might have needed to use multiple data stores to handle different
 data access patterns. This practice adds complexity to your application and operations,
 and duplicates your data, doubling (or worse) the amount of storage
 required. Kudu can handle all of these access patterns natively and efficiently,
 without the need to off-load work to other data stores.</p>
 </div>
 <div class="paragraph">
 <div class="title">Predictive Modeling</div>
 <p>Data scientists often develop predictive learning models from large sets of data. The
 model and the data may need to be updated or modified often as the learning takes
 place or as the situation being modeled changes. In addition, the scientist may want
 to change one or more factors in the model to see what happens over time. Updating
 a large set of data stored in files in HDFS is resource-intensive, as each file needs
 to be completely rewritten. In Kudu, updates happen in near real time. The scientist
 can tweak the value, re-run the query, and refresh the graph in seconds or minutes,
 rather than hours or days. In addition, batch or incremental algorithms can be run
 across the data at any time, with near-real-time results.</p>
 </div>
 <div class="paragraph">
 <div class="title">Combining Data In Kudu With Legacy Systems</div>
 <p>Companies generate data from multiple sources and store it in a variety of systems
 and formats. For instance, some of your data may be stored in Kudu, some in a traditional
 RDBMS, and some in files in HDFS. You can access and query all of these sources and
 formats using Impala, without the need to change your legacy systems.</p>
 </div>
 </div>
 </div>
 <div class="sect1">
 <h2 id="_next_steps"><a class="link" href="#_next_steps">Next Steps</a></h2>
 <div class="sectionbody">
 <div class="ulist">
 <ul>
 <li>
 <p><a href="quickstart.html">Get Started With Kudu</a></p>
 </li>
 <li>
 <p><a href="installation.html">Installing Kudu</a></p>
 </li>
 </ul>
 </div>
 </div>
 </div>
     </div>
     <div class="col-md-3">

   <div id="toc" data-spy="affix" data-offset-top="70">
   <ul>

       <li>
 <span class="active-toc">Introducing Kudu</span>
             <ul class="sectlevel1">
 <li><a href="#_kudu_impala_integration_features">Kudu-Impala Integration Features</a></li>
 <li><a href="#_concepts_and_terms">Concepts and Terms</a></li>
 <li><a href="#_architectural_overview">Architectural Overview</a></li>
 <li><a href="#kudu_use_cases">Example Use Cases</a></li>
 <li><a href="#_next_steps">Next Steps</a></li>
 </ul>
       </li>
       <li>

           <a href="release_notes.html">Kudu Release Notes</a>
       </li>
       <li>

           <a href="quickstart.html">Quickstart Guide</a>
       </li>
       <li>

           <a href="installation.html">Installation Guide</a>
       </li>
       <li>

           <a href="configuration.html">Configuring Kudu</a>
       </li>
       <li>

           <a href="hive_metastore.html">Using the Hive Metastore with Kudu</a>
       </li>
       <li>

           <a href="kudu_impala_integration.html">Using Impala with Kudu</a>
       </li>
       <li>

           <a href="administration.html">Administering Kudu</a>
       </li>
       <li>

           <a href="troubleshooting.html">Troubleshooting Kudu</a>
       </li>
       <li>

           <a href="developing.html">Developing Applications with Kudu</a>
       </li>
       <li>

           <a href="schema_design.html">Kudu Schema Design</a>
       </li>
       <li>

           <a href="scaling_guide.html">Kudu Scaling Guide</a>
       </li>
       <li>

           <a href="security.html">Kudu Security</a>
       </li>
       <li>

           <a href="transaction_semantics.html">Kudu Transaction Semantics</a>
       </li>
       <li>

           <a href="background_tasks.html">Background Maintenance Tasks</a>
       </li>
       <li>

           <a href="configuration_reference.html">Kudu Configuration Reference</a>
       </li>
       <li>

           <a href="command_line_tools_reference.html">Kudu Command Line Tools Reference</a>
       </li>
       <li>

           <a href="metrics_reference.html">Kudu Metrics Reference</a>
       </li>
       <li>

           <a href="known_issues.html">Known Issues and Limitations</a>
       </li>
       <li>

           <a href="contributing.html">Contributing to Kudu</a>
       </li>
       <li>

           <a href="export_control.html">Export Control Notice</a>
       </li>
   </ul>
   </div>
     </div>
   </div>
 </div>
	---
	title: Introducing Apache Kudu
	layout: default
	active_nav: docs
	last_updated: 'Last updated 2024-10-28 17:38:51 -0700'
	---
	<!--

	Licensed under the Apache License, Version 2.0 (the "License");
	you may not use this file except in compliance with the License.
	You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	-->


	<div class="container">
	<div class="row">
	<div class="col-md-9">

	<h1>Introducing Apache Kudu</h1>
	<div id="preamble">
	<div class="sectionbody">
	<div class="paragraph">
	<p>Kudu is a distributed columnar storage engine optimized for OLAP workloads.
	Kudu runs on commodity hardware, is horizontally scalable, and supports highly
	available operation.</p>
	</div>
	<div class="paragraph">
	<p>Kudu’s design sets it apart. Some of Kudu’s benefits include:</p>
	</div>
	<div class="ulist">
	<ul>
	<li>
	<p>Fast processing of OLAP workloads.</p>
	</li>
	<li>
	<p>Strong but flexible consistency model, allowing you to choose consistency
	requirements on a per-request basis, including the option for
	strict-serializable consistency.</p>
	</li>
	<li>
	<p>Structured data model.</p>
	</li>
	<li>
	<p>Strong performance for running sequential and random workloads simultaneously.</p>
	</li>
	<li>
	<p>Tight integration with Apache Impala, making it a good, mutable alternative to
	using HDFS with Apache Parquet.</p>
	</li>
	<li>
	<p>Integration with Apache NiFi and Apache Spark.</p>
	</li>
	<li>
	<p>Integration with Hive Metastore (HMS) and Apache Ranger to provide
	fine-grain authorization and access control.</p>
	</li>
	<li>
	<p>Authenticated and encrypted RPC communication.</p>
	</li>
	<li>
	<p>High availability: Tablet Servers and Masters use the <a href="#raft">Raft Consensus Algorithm</a>, which ensures
	that as long as more than half the total number of tablet replicas is
	available, the tablet is available for reads and writes. For instance,
	if 2 out of 3 replicas (or 3 out of 5 replicas, etc.) are available,
	the tablet is available. Reads can be serviced by read-only follower tablet
	replicas, even in the event of a leader replica’s failure.</p>
	</li>
	<li>
	<p>Automatic fault detection and self-healing: to keep data highly available,
	the system detects failed tablet replicas and re-replicates data from
	available ones, so failed replicas are automatically replaced when enough
	Tablet Servers are available in the cluster.</p>
	</li>
	<li>
	<p>Location awareness (a.k.a. rack awareness) to keep the system available
	in case of correlated failures and allowing Kudu clusters to span over
	multiple availability zones.</p>
	</li>
	<li>
	<p>Logical backup (full and incremental) and restore.</p>
	</li>
	<li>
	<p>Multi-row transactions (only for INSERT/INSERT_IGNORE operations as of
	Kudu 1.15 release).</p>
	</li>
	<li>
	<p>Easy to administer and manage.</p>
	</li>
	</ul>
	</div>
	<div class="paragraph">
	<p>By combining all of these properties, Kudu targets support for families of
	applications that are difficult or impossible to implement using Hadoop storage
	technologies, while it is compatible with most of the data processing
	frameworks in the Hadoop ecosystem.</p>
	</div>
	<div class="paragraph">
	<p>A few examples of applications for which Kudu is a great solution are:</p>
	</div>
	<div class="ulist">
	<ul>
	<li>
	<p>Reporting applications where newly-arrived data needs to be immediately available for end users</p>
	</li>
	<li>
	<p>Time-series applications that must simultaneously support:</p>
	<div class="ulist">
	<ul>
	<li>
	<p>queries across large amounts of historic data</p>
	</li>
	<li>
	<p>granular queries about an individual entity that must return very quickly</p>
	</li>
	</ul>
	</div>
	</li>
	<li>
	<p>Applications that use predictive models to make real-time decisions with periodic
	refreshes of the predictive model based on all historic data</p>
	</li>
	</ul>
	</div>
	<div class="paragraph">
	<p>For more information about these and other scenarios, see <a href="#kudu_use_cases">Example Use Cases</a>.</p>
	</div>
	</div>
	</div>
	<div class="sect1">
	<h2 id="_kudu_impala_integration_features"><a class="link" href="#_kudu_impala_integration_features">Kudu-Impala Integration Features</a></h2>
	<div class="sectionbody">
	<div class="dlist">
	<dl>
	<dt class="hdlist1"><code>CREATE/ALTER/DROP TABLE</code></dt>
	<dd>
	<p>Impala supports creating, altering, and dropping tables using Kudu as the persistence layer.
	The tables follow the same internal / external approach as other tables in Impala,
	allowing for flexible data ingestion and querying.</p>
	</dd>
	<dt class="hdlist1"><code>INSERT</code></dt>
	<dd>
	<p>Data can be inserted into Kudu tables in Impala using the same syntax as
	any other Impala table like those using HDFS or HBase for persistence.</p>
	</dd>
	<dt class="hdlist1"><code>UPDATE</code> / <code>DELETE</code></dt>
	<dd>
	<p>Impala supports the <code>UPDATE</code> and <code>DELETE</code> SQL commands to modify existing data in
	a Kudu table row-by-row or as a batch. The syntax of the SQL commands is chosen
	to be as compatible as possible with existing standards. In addition to simple <code>DELETE</code>
	or <code>UPDATE</code> commands, you can specify complex joins with a <code>FROM</code> clause in a subquery.</p>
	</dd>
	<dt class="hdlist1">Flexible Partitioning</dt>
	<dd>
	<p>Similar to partitioning of tables in Hive, Kudu allows you to dynamically
	pre-split tables by hash or range into a predefined number of tablets, in order
	to distribute writes and queries evenly across your cluster. You can partition by
	any number of primary key columns, by any number of hashes, and an optional list of
	split rows. See <a href="schema_design.html#schema_design">Schema Design</a>.</p>
	</dd>
	<dt class="hdlist1">Parallel Scan</dt>
	<dd>
	<p>To achieve the highest possible performance on modern hardware, the Kudu client
	used by Impala parallelizes scans across multiple tablets.</p>
	</dd>
	<dt class="hdlist1">High-efficiency queries</dt>
	<dd>
	<p>Where possible, Impala pushes down predicate evaluation to Kudu, so that predicates
	are evaluated as close as possible to the data. Query performance is comparable
	to Parquet in many workloads.</p>
	</dd>
	</dl>
	</div>
	<div class="paragraph">
	<p>For more details regarding querying data stored in Kudu using Impala, please
	refer to the Impala documentation.</p>
	</div>
	</div>
	</div>
	<div class="sect1">
	<h2 id="_concepts_and_terms"><a class="link" href="#_concepts_and_terms">Concepts and Terms</a></h2>
	<div class="sectionbody">
	<div id="kudu_columnar_data_store" class="paragraph">
	<div class="title">Columnar Data Store</div>
	<p>Kudu is a <em>columnar data store</em>. A columnar data store stores data in strongly-typed
	columns. With a proper design, it is superior for analytical or data warehousing
	workloads for several reasons.</p>
	</div>
	<div class="dlist">
	<dl>
	<dt class="hdlist1">Read Efficiency</dt>
	<dd>
	<p>For analytical queries, you can read a single column, or a portion
	of that column, while ignoring other columns. This means you can fulfill your query
	while reading a minimal number of blocks on disk. With a row-based store, you need
	to read the entire row, even if you only return values from a few columns.</p>
	</dd>
	<dt class="hdlist1">Data Compression</dt>
	<dd>
	<p>Because a given column contains only one type of data,
	pattern-based compression can be orders of magnitude more efficient than
	compressing mixed data types, which are used in row-based solutions. Combined
	with the efficiencies of reading data from columns, compression allows you to
	fulfill your query while reading even fewer blocks from disk. See
	<a href="schema_design.html#encoding">Data Compression</a></p>
	</dd>
	</dl>
	</div>
	<div class="paragraph">
	<div class="title">Table</div>
	<p>A <em>table</em> is where your data is stored in Kudu. A table has a schema and
	a totally ordered primary key. A table is split into segments called tablets.</p>
	</div>
	<div class="paragraph">
	<div class="title">Tablet</div>
	<p>A <em>tablet</em> is a contiguous segment of a table, similar to a <em>partition</em> in
	other data storage engines or relational databases. A given tablet is
	replicated on multiple tablet servers, and at any given point in time,
	one of these replicas is considered the leader tablet. Any replica can service
	reads, and writes require consensus among the set of tablet servers serving the tablet.</p>
	</div>
	<div class="paragraph">
	<div class="title">Tablet Server</div>
	<p>A <em>tablet server</em> stores and serves tablets to clients. For a
	given tablet, one tablet server acts as a leader, and the others act as
	follower replicas of that tablet. Only leaders service write requests, while
	leaders or followers each service read requests. Leaders are elected using
	<a href="#raft">Raft Consensus Algorithm</a>. One tablet server can serve multiple tablets, and one tablet can be served
	by multiple tablet servers.</p>
	</div>
	<div class="paragraph">
	<div class="title">Master</div>
	<p>The <em>master</em> keeps track of all the tablets, tablet servers, the
	<a href="#catalog_table">Catalog Table</a>, and other metadata related to the cluster. At a given point
	in time, there can only be one acting master (the leader). If the current leader
	disappears, a new master is elected using <a href="#raft">Raft Consensus Algorithm</a>.</p>
	</div>
	<div class="paragraph">
	<p>The master also coordinates metadata operations for clients. For example, when
	creating a new table, the client internally sends the request to the master. The
	master writes the metadata for the new table into the catalog table, and
	coordinates the process of creating tablets on the tablet servers.</p>
	</div>
	<div class="paragraph">
	<p>All the master’s data is stored in a tablet, which can be replicated to all the
	other candidate masters.</p>
	</div>
	<div class="paragraph">
	<p>Tablet servers heartbeat to the master at a set interval (the default is once
	per second).</p>
	</div>
	<div id="raft" class="paragraph">
	<div class="title">Raft Consensus Algorithm</div>
	<p>Kudu uses the <a href="https://raft.github.io/">Raft consensus algorithm</a> as
	a means to guarantee fault-tolerance and consistency, both for regular tablets and for master
	data. Through Raft, multiple replicas of a tablet elect a <em>leader</em>, which is responsible
	for accepting and replicating writes to <em>follower</em> replicas. Once a write is persisted
	in a majority of replicas it is acknowledged to the client. A given group of <code>N</code> replicas
	(usually 3 or 5) is able to accept writes with at most <code>(N - 1)/2</code> faulty replicas.</p>
	</div>
	<div id="catalog_table" class="paragraph">
	<div class="title">Catalog Table</div>
	<p>The <em>catalog table</em> is the central location for
	metadata of Kudu. It stores information about tables and tablets. The catalog
	table may not be read or written directly. Instead, it is accessible
	only via metadata operations exposed in the client API.</p>
	</div>
	<div class="paragraph">
	<p>The catalog table stores two categories of metadata:</p>
	</div>
	<div class="dlist">
	<dl>
	<dt class="hdlist1">Tables</dt>
	<dd>
	<p>table schemas, locations, and states</p>
	</dd>
	<dt class="hdlist1">Tablets</dt>
	<dd>
	<p>the list of existing tablets, the tablet servers that host replicas of
	each tablet, each tablet’s current state, and its start and end keys</p>
	</dd>
	</dl>
	</div>
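	<div class="paragraph">
	<p>To make the two categories concrete, here is a minimal Python sketch of how such metadata
	could be modeled and used to locate the tablet for a key. Everything in it is an assumption
	made for illustration: the class names <code>TableInfo</code> and <code>TabletInfo</code>,
	the helper <code>find_tablet</code>, the string keys, and the convention that a start key is
	inclusive, an end key is exclusive, and an empty string means "unbounded". It does not
	reflect the catalog table's real layout.</p>
	</div>

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class TabletInfo:
    tablet_id: str
    start_key: str        # inclusive lower bound; "" means unbounded (assumption)
    end_key: str          # exclusive upper bound; "" means unbounded (assumption)
    state: str
    replicas: list[str]   # tablet servers that host a replica of this tablet


@dataclass
class TableInfo:
    name: str
    schema: dict[str, str]   # column name -> type name
    state: str
    tablets: list[TabletInfo] = field(default_factory=list)  # sorted by start_key


def find_tablet(table: TableInfo, key: str) -> TabletInfo:
    """Return the tablet whose key range contains `key`.

    Assumes the tablets are sorted by start_key and cover the whole key space.
    """
    starts = [tablet.start_key for tablet in table.tablets]
    index = bisect_right(starts, key) - 1
    return table.tablets[index]


# Hypothetical example data: one table split into three tablets.
table = TableInfo(
    name="metrics",
    schema={"host": "string", "value": "double"},
    state="RUNNING",
    tablets=[
        TabletInfo("t1", "", "g", "RUNNING", ["ts-1", "ts-2", "ts-3"]),
        TabletInfo("t2", "g", "p", "RUNNING", ["ts-2", "ts-3", "ts-4"]),
        TabletInfo("t3", "p", "", "RUNNING", ["ts-1", "ts-3", "ts-4"]),
    ],
)

if __name__ == "__main__":
    tablet = find_tablet(table, "host-42")
    print(tablet.tablet_id, tablet.replicas)
```

	<div class="paragraph">
	<p>A client-side lookup of this kind only needs the start keys to be sorted, which is why the
	sketch uses a binary search rather than scanning every tablet.</p>
	</div>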
	<div class="paragraph">
	<div class="title">Logical Replication</div>
	<p>Kudu replicates operations, not on-disk data. This is referred to as <em>logical replication</em>,
	as opposed to <em>physical replication</em>. This has several advantages:</p>
	</div>
	<div class="ulist">
	<ul>
	<li>
	<p>Although inserts and updates do transmit data over the network, deletes do not need
	to move any data. The delete operation is sent to each tablet server, which performs
	the delete locally.</p>
	</li>
	<li>
	<p>Physical operations, such as compaction, do not need to transmit the data over the
	network in Kudu. This is different from storage systems that use HDFS, where
	the blocks need to be transmitted over the network to fulfill the required number of
	replicas.</p>
	</li>
	<li>
	<p>Tablets do not need to perform compactions at the same time or on the same schedule,
	or otherwise remain in sync on the physical storage layer. This decreases the chances
	of all tablet servers experiencing high latency at the same time, due to compactions
	or heavy write loads.</p>
	</li>
	</ul>
	</div>
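	<div class="paragraph">
	<p>The following toy model illustrates the idea behind these points. It is a sketch under
	stated assumptions, not Kudu's storage engine: the <code>Replica</code> class, its
	"segments", and the tombstone representation are all hypothetical. Each replica receives the
	same operations, but flushes and compacts on its own schedule, so the physical layouts
	diverge while the logical contents stay identical.</p>
	</div>

```python
class Replica:
    """Toy replica: replicated operations in, purely local physical layout."""

    def __init__(self, name):
        self.name = name
        self.segments = [{}]  # local "files", oldest first (hypothetical layout)

    def apply(self, op):
        """Apply one replicated operation to the newest local segment."""
        kind, key, value = op
        # A delete carries only the key; it is recorded locally as a tombstone.
        self.segments[-1][key] = None if kind == "delete" else value

    def flush(self):
        """Start a new local segment. Nothing is sent to other replicas."""
        self.segments.append({})

    def compact(self):
        """Merge local segments into one and drop tombstones, all locally."""
        self.segments = [self.rows()]

    def rows(self):
        """Logical contents, independent of how the segments are laid out."""
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        return {k: v for k, v in merged.items() if v is not None}


replicas = [Replica("a"), Replica("b"), Replica("c")]
operations = [
    ("insert", 1, "x"),
    ("insert", 2, "y"),
    ("update", 1, "z"),
    ("delete", 2, None),
]

for op in operations[:2]:
    for replica in replicas:
        replica.apply(op)
replicas[1].flush()      # only replica "b" flushes at this point
for op in operations[2:]:
    for replica in replicas:
        replica.apply(op)
replicas[2].compact()    # only replica "c" compacts

if __name__ == "__main__":
    for replica in replicas:
        print(replica.name, len(replica.segments), replica.rows())
```

	<div class="paragraph">
	<p>In this model, the only data that would cross the network is the list of operations.
	Flushing and compaction touch local state only.</p>
	</div>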
	</div>
	</div>
	<div class="sect1">
	<h2 id="_architectural_overview"><a class="link" href="#_architectural_overview">Architectural Overview</a></h2>
	<div class="sectionbody">
	<div class="paragraph">
	<p>The following diagram shows a Kudu cluster with three masters and multiple tablet
	servers, each serving multiple tablets. It illustrates how Raft consensus gives both
	the masters and the tablet servers leader and follower roles. In addition, a tablet
	server can be a leader for some tablets and a follower for others.
	Leaders are shown in gold, while followers are shown in blue.</p>
	</div>
	<div class="imageblock">
	<div class="content">
	<img src="./images/kudu-architecture-2.png" alt="Kudu Architecture" width="800">
	</div>
	</div>
	</div>
	</div>
	<div class="sect1">
	<h2 id="kudu_use_cases"><a class="link" href="#kudu_use_cases">Example Use Cases</a></h2>
	<div class="sectionbody">
	<div class="paragraph">
	<div class="title">Streaming Input with Near Real Time Availability</div>
	<p>A common challenge in data analysis is that new data arrives rapidly and constantly,
	and that same data must be available in near real time for reads, scans, and
	updates. Kudu combines fast inserts and updates with efficient columnar scans,
	enabling real-time analytics use cases on a single storage layer.</p>
	</div>
	<div class="paragraph">
	<div class="title">Time-Series Applications with Widely Varying Access Patterns</div>
	<p>A time-series schema is one in which data points are organized and keyed according
	to the time at which they occurred. This can be useful for investigating the
	performance of metrics over time or attempting to predict future behavior based
	on past data. For instance, time-series customer data might be used to store
	purchase click-stream history, to predict future purchases, or to support a
	customer support representative. While these different types of analysis are occurring,
	inserts and mutations may also be occurring individually and in bulk, and they become
	available immediately to read workloads. Kudu can handle all of these access patterns
	simultaneously in a scalable and efficient manner.</p>
	</div>
	<div class="paragraph">
	<p>Kudu is a good fit for time-series workloads for several reasons. With Kudu’s support for
	hash-based partitioning, combined with its native support for compound row keys, it is
	simple to set up a table spread across many servers without the risk of "hotspotting"
	that is commonly observed when range partitioning is used. Kudu’s columnar storage engine
	is also beneficial in this context, because many time-series workloads read only a few columns,
	as opposed to the whole row.</p>
	</div>
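	<div class="paragraph">
	<p>The hotspotting effect can be illustrated with a small simulation. This is a sketch, not
	Kudu's partitioning code: the compound key <code>(host, timestamp)</code>, the bucket count,
	the hour-wide ranges, and the use of CRC32 as the hash function are all assumptions chosen
	for the example. It counts how many partitions receive writes when many hosts report
	metrics during the same short time window.</p>
	</div>

```python
import zlib
from collections import Counter

NUM_BUCKETS = 4  # hypothetical number of hash buckets


def hash_bucket(host: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map the host part of the compound key to a bucket (CRC32 as a stand-in)."""
    return zlib.crc32(host.encode("utf-8")) % num_buckets


def range_partition(timestamp: int, width_seconds: int = 3600) -> int:
    """Map a timestamp to an hour-wide range partition."""
    return timestamp // width_seconds


# 100 hosts each report one row per second during the same 10-second window.
rows = [
    (f"host-{h:03d}", 1_700_000_000 + s)
    for h in range(100)
    for s in range(10)
]

# Range partitioning on time alone: every current write targets one partition.
by_range = Counter(range_partition(ts) for _, ts in rows)

# Hash partitioning on host: the same writes spread over several buckets.
by_hash = Counter(hash_bucket(host) for host, _ in rows)

if __name__ == "__main__":
    print("range partitions receiving writes:", len(by_range))
    print("hash buckets receiving writes:", len(by_hash))
    print("rows per hash bucket:", dict(sorted(by_hash.items())))
```

	<div class="paragraph">
	<p>With range partitioning on the timestamp alone, all 1,000 rows land in a single partition,
	whereas hashing on the host column spreads the same rows across multiple buckets.</p>
	</div>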
	<div class="paragraph">
	<p>In the past, you might have needed to use multiple data stores to handle different
	data access patterns. This practice adds complexity to your application and operations,
	and duplicates your data, doubling (or worse) the amount of storage
	required. Kudu can handle all of these access patterns natively and efficiently,
	without the need to off-load work to other data stores.</p>
	</div>
	<div class="paragraph">
	<div class="title">Predictive Modeling</div>
	<p>Data scientists often develop predictive learning models from large sets of data. The
	model and the data may need to be updated or modified often as the learning takes
	place or as the situation being modeled changes. In addition, the scientist may want
	to change one or more factors in the model to see what happens over time. Updating
	a large set of data stored in files in HDFS is resource-intensive, as each file needs
	to be completely rewritten. In Kudu, updates happen in near real time. The scientist
	can tweak the value, re-run the query, and refresh the graph in seconds or minutes,
	rather than hours or days. In addition, batch or incremental algorithms can be run
	across the data at any time, with near-real-time results.</p>
	</div>
	<div class="paragraph">
	<div class="title">Combining Data In Kudu With Legacy Systems</div>
	<p>Companies generate data from multiple sources and store it in a variety of systems
	and formats. For instance, some of your data may be stored in Kudu, some in a traditional
	RDBMS, and some in files in HDFS. You can access and query all of these sources and
	formats using Impala, without the need to change your legacy systems.</p>
	</div>
	</div>
	</div>
	<div class="sect1">
	<h2 id="_next_steps"><a class="link" href="#_next_steps">Next Steps</a></h2>
	<div class="sectionbody">
	<div class="ulist">
	<ul>
	<li>
	<p><a href="quickstart.html">Get Started With Kudu</a></p>
	</li>
	<li>
	<p><a href="installation.html">Installing Kudu</a></p>
	</li>
	</ul>
	</div>
	</div>
	</div>
	</div>
	<div class="col-md-3">

	<div id="toc" data-spy="affix" data-offset-top="70">
	<ul>

	<li>
	<span class="active-toc">Introducing Kudu</span>
	<ul class="sectlevel1">
	<li><a href="#_kudu_impala_integration_features">Kudu-Impala Integration Features</a></li>
	<li><a href="#_concepts_and_terms">Concepts and Terms</a></li>
	<li><a href="#_architectural_overview">Architectural Overview</a></li>
	<li><a href="#kudu_use_cases">Example Use Cases</a></li>
	<li><a href="#_next_steps">Next Steps</a></li>
	</ul>
	</li>
	<li>

	<a href="release_notes.html">Kudu Release Notes</a>
	</li>
	<li>

	<a href="quickstart.html">Quickstart Guide</a>
	</li>
	<li>

	<a href="installation.html">Installation Guide</a>
	</li>
	<li>

	<a href="configuration.html">Configuring Kudu</a>
	</li>
	<li>

	<a href="hive_metastore.html">Using the Hive Metastore with Kudu</a>
	</li>
	<li>

	<a href="kudu_impala_integration.html">Using Impala with Kudu</a>
	</li>
	<li>

	<a href="administration.html">Administering Kudu</a>
	</li>
	<li>

	<a href="troubleshooting.html">Troubleshooting Kudu</a>
	</li>
	<li>

	<a href="developing.html">Developing Applications with Kudu</a>
	</li>
	<li>

	<a href="schema_design.html">Kudu Schema Design</a>
	</li>
	<li>

	<a href="scaling_guide.html">Kudu Scaling Guide</a>
	</li>
	<li>

	<a href="security.html">Kudu Security</a>
	</li>
	<li>

	<a href="transaction_semantics.html">Kudu Transaction Semantics</a>
	</li>
	<li>

	<a href="background_tasks.html">Background Maintenance Tasks</a>
	</li>
	<li>

	<a href="configuration_reference.html">Kudu Configuration Reference</a>
	</li>
	<li>

	<a href="command_line_tools_reference.html">Kudu Command Line Tools Reference</a>
	</li>
	<li>

	<a href="metrics_reference.html">Kudu Metrics Reference</a>
	</li>
	<li>

	<a href="known_issues.html">Known Issues and Limitations</a>
	</li>
	<li>

	<a href="contributing.html">Contributing to Kudu</a>
	</li>
	<li>

	<a href="export_control.html">Export Control Notice</a>
	</li>
	</ul>
	</div>
	</div>
	</div>
	</div>