blob: 4ccf84c155aaab4a8794e077410866b69cb8186b [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu - Overview</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"
integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7"
crossorigin="anonymous">
<!-- Custom styles for this template -->
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.1/css/font-awesome.min.css" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="logo" href="/"><img
src="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png"
srcset="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png 1x, //d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_160px.png 2x"
alt="Apache Kudu"/></a>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li >
<a href="/">Home</a>
</li>
<li class="active">
<a href="/overview.html">Overview</a>
</li>
<li >
<a href="/docs/">Documentation</a>
</li>
<li >
<a href="/releases/">Releases</a>
</li>
<li >
<a href="/blog/">Blog</a>
</li>
<!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here
that doesn't also appear elsewhere on the site. -->
<li class="dropdown">
<a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li class="dropdown-header">GET IN TOUCH</li>
<li><a class="icon email" href="/community.html">Mailing Lists</a></li>
<li><a class="icon slack" href="https://join.slack.com/t/getkudu/shared_invite/zt-244b4zvki-hB1q9IbAk6CqHNMZHvUALA">Slack Channel</a></li>
<li role="separator" class="divider"></li>
<li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li>
<li><a href="/committers.html">Project Committers</a></li>
<li><a href="/ecosystem.html">Ecosystem</a></li>
<!--<li><a href="/roadmap.html">Roadmap</a></li>-->
<li><a href="/community.html#contributions">How to Contribute</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">DEVELOPER RESOURCES</li>
<li><a class="icon github" href="https://github.com/apache/incubator-kudu">GitHub</a></li>
<li><a class="icon gerrit" href="http://gerrit.cloudera.org:8080/#/q/status:open+project:kudu">Gerrit Code Review</a></li>
<li><a class="icon jira" href="https://issues.apache.org/jira/browse/KUDU">JIRA Issue Tracker</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">SOCIAL MEDIA</li>
<li><a class="icon twitter" href="https://twitter.com/ApacheKudu">Twitter</a></li>
<li><a href="https://www.reddit.com/r/kudu/">Reddit</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li>
<li><a href="https://www.apache.org/security/" target="_blank">Security</a></li>
<li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li>
<li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
<li><a href="https://www.apache.org/licenses/" target="_blank">License</a></li>
</ul>
</li>
<li >
<a href="/faq.html">FAQ</a>
</li>
</ul><!-- /.nav -->
</div><!-- /#navbar -->
</div><!-- /.container-fluid -->
</nav>
<!------------------------------------------------------------->
<div class="row header">
<div class="col-lg-12">
<h2>Apache Kudu Overview</h2>
</div>
</div>
<div class="row overview">
<div class="col-md-6">
<h3>Data Model</h3>
<p>
A Kudu cluster stores tables that look just like tables you're used to from relational (SQL) databases.
A table can be as simple as an binary <code>key</code> and <code>value</code>, or as complex
as a few hundred different strongly-typed attributes.
</p>
<p>
Just like SQL, every table has a <code>PRIMARY KEY</code> made up of one or more columns.
This might be a single column like a unique user identifier, or a compound key such as a
<code>(host, metric, timestamp)</code> tuple for a machine time series database. Rows can be efficiently
read, updated, or deleted by their primary key.
</p>
<p>
Kudu's simple data model makes it breeze to port legacy applications or build new ones:
no need to worry about how to encode your data into binary blobs or make sense of a
huge database full of hard-to-interpret JSON. Tables are self-describing, so you can
use standard tools like SQL engines or Spark to analyze your data.
</p>
<ul>
<li><a href="docs/schema_design.html">Learn more about schema design with Kudu</a></li>
</ul>
</div>
<div class="col-md-6">
<img class="img-responsive" src="/img/twitter-firehose-schema.png" />
</div>
</div>
<div class="row overview">
<div class="col-md-6">
<img class="img-responsive" src="/img/java-code.png" />
</div>
<div class="col-md-6">
<h3>Low-latency random access</h3>
<p>
Unlike other storage for big data analytics, Kudu isn't just a file format. It's a live storage
system which supports low-latency millisecond-scale access to individual rows. For
"NoSQL"-style access, you can choose between Java, C++, or Python APIs. And of course these
random access APIs can be used in conjunction with batch access for machine learning or analytics.
</p>
<p>
Kudu's APIs are designed to be easy to use. The data model is fully typed, so you don't
need to worry about binary encodings or exotic serialization. You can just store primitive
types, like when you use JDBC or ODBC.
</p>
<p>
Kudu isn't designed to be an OLTP system, but if you have some subset of data which fits
in memory, it offers competitive random access performance. We've measured 99th percentile
latencies of 6ms or below using YCSB with a uniform random access workload over a billion
rows. Being able to run low-latency online workloads on the same storage as back-end
data analytics can dramatically simplify application architecture.
</p>
<ul>
<li><a href="apidocs/">View the Java API docs</a></li>
<li><a href="cpp-client-api/">View the C++ API docs</a></li>
<li><a href="docs/developing.html">Learn more about developing applications with Kudu</a></li>
<!-- TODO: include a link to YCSB results -->
</ul>
</div>
</div>
<div class="row overview">
<div class="col-md-6">
<h3>Apache Hadoop Ecosystem Integration</h3>
<p>
Kudu was designed to fit in with the Hadoop ecosystem, and integrating it with other
data processing frameworks is simple. You can stream data in from live real-time data sources
using the Java client, and then process it immediately upon arrival using Spark, Impala,
or MapReduce. You can even transparently join Kudu tables with data stored in other Hadoop
storage such as HDFS or HBase.
</p>
<p>
Kudu is a good citizen on a Hadoop cluster: it can easily share data
disks with HDFS DataNodes, and can operate in a RAM footprint as small as 1 GB for
light workloads.
</p>
<ul>
<li><a href="docs/kudu_impala_integration.html">Learn more about integration with Impala</a>
<li><a href="https://github.com/apache/incubator-kudu/blob/master/java/kudu-client-tools/src/main/java/org/kududb/mapreduce/tools/RowCounter.java">
View an example of a MapReduce job on Kudu
</a></li>
</ul>
</div>
<div class="col-md-6">
<img class="img-responsive" src="/img/impala.png" />
</div>
</div>
<div class="row overview">
<div class="col-md-6">
<img class="img-responsive" src="/img/tracing.png" />
</div>
<div class="col-md-6">
<h3>Built by and for Operators</h3>
<p>
Kudu was built by a group of engineers who have spent many late nights providing
on-call production support for critical Hadoop clusters across hundreds of
enterprise use cases. We know how frustrating it is to debug software
without good metrics, tracing, or administrative tools.
</p>
<p>
Ever since its first beta release, Kudu has included advanced in-process tracing capabilities,
extensive metrics support, and even watchdog threads which check for latency
outliers and dump "smoking gun" stack traces to get to the root of the problem
quickly.
</p>
<ul>
<li><a href="docs/administration.html">Learn more about administering Kudu</a></li>
<li><a href="docs/troubleshooting.html#kudu_tracing">Learn more about Kudu's tracing capabilities</a></li>
</ul>
</div>
</div>
<div class="row overview noborder">
<div class="col-md-6">
<h3><a name="opensource">Open Source</a></h3>
<p>
Kudu is Open Source software, licensed under the Apache 2.0 license and
governed under the aegis of the Apache Software Foundation. We believe
that Kudu's long-term success depends on building a vibrant community of
developers and users from diverse organizations and backgrounds.
</p>
<ul>
<li><a href="/community.html">Learn more about how to contribute</a></li>
<li><a href="https://github.com/apache/incubator-kudu">View the Kudu github repository</a></li>
</ul>
</div>
<div class="col-md-6">
<img class="img-responsive" src="/img/asf_logo.png" size="75%"/>
</div>
</div>
<!------------------------------------------------------------->
<div class="row header">
<div class="col-lg-12">
<h2><a name="architecture">Kudu Architecture</a></h2>
</div>
</div>
<div class="row overview">
<div class="col-md-6">
<h3>Super-fast Columnar Storage</h3>
<p>
Like most modern analytic data stores, Kudu internally organizes its data by column rather than
row. Columnar storage allows efficient encoding and compression. For example, a string field with
only a few unique values can use only a few bits per row of storage. With techniques such as
run-length encoding, differential encoding, and vectorized bit-packing, Kudu is as fast at
reading the data as it is space-efficient at storing it.
</p>
<p>
Columnar storage also dramatically reduces the amount of data IO required to service analytic
queries. Using techniques such as lazy data materialization and predicate pushdown, Kudu can perform
drill-down and needle-in-a-haystack queries over billions of rows and terabytes of data in seconds.
</p>
<ul>
<li><a href="kudu.pdf">Read the Kudu paper for more details and a performance evaluation</a></li>
</ul>
</div>
<div class="col-md-6">
<img class="img-responsive" src="/img/kudu_vs_parquet.png" />
</div>
</div>
<div class="row overview">
<div class="col-md-6">
<img class="img-responsive" src="/img/raft.png" width="400"/>
</div>
<div class="col-md-6">
<h3>Distribution and Fault Tolerance</h3>
<p>
In order to scale out to large datasets and large clusters, Kudu splits tables
into smaller units called <em>tablets</em>. This splitting can be configured
on a per-table basis to be based on hashing, range partitioning, or a combination
thereof. This allows the operator to easily trade off between parallelism for
analytic workloads and high concurrency for more online ones.
</p>
<p>
In order to keep your data safe and available at all times, Kudu uses the
<a href="http://raft.github.io">Raft</a> consensus algorithm to replicate
all operations for a given tablet. Raft, like Paxos, ensures that every
write is persisted by at least two nodes before responding to
the client request, ensuring that no data is ever lost due to a
machine failure. When machines do fail, replicas reconfigure
themselves within a few seconds to maintain extremely high system
availability.
</p>
<p>
The use of majority consensus provides very low tail latencies
even when some nodes may be stressed by concurrent workloads such as
Spark jobs or heavy Impala queries. But unlike eventually
consistent systems, Raft consensus ensures that all replicas will
come to agreement around the state of the data, and by using a
combination of logical and physical clocks, Kudu can offer strict
snapshot consistency to clients that demand it.
</p>
<ul>
<li><a href="http://raft.github.io/">Learn more about Raft Consensus</a></li>
<li><a href="kudu.pdf">Read the Kudu paper for more details on its architecture</a></li>
</ul>
</div>
</div>
<div class="row overview noborder">
<div class="col-md-6">
<h3>Designed for Next-Generation Hardware</h3>
<p>
The Kudu team has worked closely with engineers at Intel to harness the power
of the next generation of hardware technologies.
Kudu's storage is designed to take advantage of the IO
characteristics of solid state drives, and it includes an
experimental cache implementation based on the <a href="http://pmem.io">libpmem</a>
library which can store data in persistent memory.
</p>
<p>
Kudu is implemented in C++, so it can scale easily to large amounts
of memory per node. And because key storage data structures are designed to
be highly concurrent, it can scale easily to tens of cores. With an
in-memory columnar execution path, Kudu achieves good instruction-level
parallelism using SIMD operations from the SSE4 and AVX instruction sets.
</p>
</div>
<div class="col-md-6">
<img class="img-responsive" src="/img/intel_logo.png" size="75%"/>
</div>
</div>
<!------------------------------------------------------------->
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p class="small">
Copyright &copy; 2023 The Apache Software Foundation.
</p>
<p class="small">
Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu
project logo are either registered trademarks or trademarks of The
Apache Software Foundation in the United States and other countries.
</p>
</div>
<div class="col-md-3">
<a class="pull-right" href="https://www.apache.org/events/current-event.html">
<img src="https://www.apache.org/events/current-event-234x60.png"/>
</a>
</div>
</div>
</footer>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script>
// Try to detect touch-screen devices. Note: Many laptops have touch screens.
$(document).ready(function() {
if ("ontouchstart" in document.documentElement) {
$(document.documentElement).addClass("touch");
} else {
$(document.documentElement).addClass("no-touch");
}
});
</script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"
integrity="sha384-0mSbJDEHialfmuBBQP6A4Qrprq5OVfW37PRR3j5ELqxss1yVqOtnepnHVP9aJ7xS"
crossorigin="anonymous"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68448017-1', 'auto');
ga('send', 'pageview');
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.js"></script>
<script>
anchors.options = {
placement: 'right',
visible: 'touch',
};
anchors.add();
</script>
</body>
</html>