| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8" /> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge" /> |
| <meta name="viewport" content="width=device-width, initial-scale=1" /> |
| <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> |
| <meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu (incubating) completes Hadoop's storage layer to enable fast analytics on fast data" /> |
| <meta name="author" content="Cloudera" /> |
| <title>Apache Kudu (incubating) - Overview</title> |
| <!-- Bootstrap core CSS --> |
| <link href="/css/bootstrap.min.css" rel="stylesheet" /> |
| |
| <!-- Custom styles for this template --> |
| <link href="/css/justified-nav.css" rel="stylesheet" /> |
| |
| <link href="/css/kudu.css" rel="stylesheet"/> |
| <link href="/css/asciidoc.css" rel="stylesheet"/> |
| <link rel="shortcut icon" href="/img/logo-favicon.ico" /> |
| <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.1/css/font-awesome.min.css" /> |
| |
| |
| |
| <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries --> |
| <!--[if lt IE 9]> |
| <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script> |
| <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> |
| <![endif]--> |
| </head> |
| <body> |
| <!-- Fork me on GitHub --> |
| <a class="fork-me-on-github" href="https://github.com/apache/incubator-kudu"><img src="//aral.github.io/fork-me-on-github-retina-ribbons/right-cerulean@2x.png" alt="Fork me on GitHub" /></a> |
| |
| <div class="kudu-site container-fluid"> |
| <!-- Static navbar --> |
| <nav class="container-fluid navbar-default"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="logo" href="/"><img src="/img/logo_small.png" width="80" /></a> |
| |
| </div> |
| <div id="navbar" class="navbar-collapse collapse navbar-right"> |
| <ul class="nav navbar-nav"> |
| <li > |
| <a href="/">Home</a> |
| </li> |
| <li class="active"> |
| <a href="/overview.html">Overview</a> |
| </li> |
| <li > |
| <a href="/docs/">Documentation</a> |
| </li> |
| <li > |
| <a href="/releases/">Download</a> |
| </li> |
| <li > |
| <a href="/blog/">Blog</a> |
| </li> |
| <li > |
| <a href="/community.html">Community</a> |
| </li> |
| <li > |
| <a href="/faq.html">FAQ</a> |
| </li> |
| </ul> |
| </div><!--/.nav-collapse --> |
| </nav> |
| |
| <!-------------------------------------------------------------> |
| <div class="row header"> |
| <div class="col-lg-12"> |
| <h2>Apache Kudu (incubating) Overview</h2> |
| </div> |
| </div> |
| |
| <div class="row overview"> |
| <div class="col-md-6"> |
| <h3>Data Model</h3> |
| <p> |
| A Kudu cluster stores tables that look just like tables you're used to from relational (SQL) databases. |
| A table can be as simple as a binary <code>key</code> and <code>value</code>, or as complex |
| as a few hundred different strongly-typed attributes. |
| </p> |
| <p> |
| Just as in SQL, every table has a <code>PRIMARY KEY</code> made up of one or more columns. |
| This might be a single column like a unique user identifier, or a compound key such as a |
| <code>(host, metric, timestamp)</code> tuple for a machine time series database. Rows can be efficiently |
| read, updated, or deleted by their primary key. |
| </p> |
| <p> |
| Kudu's simple data model makes it a breeze to port legacy applications or build new ones: |
| no need to worry about how to encode your data into binary blobs or make sense of a |
| huge database full of hard-to-interpret JSON. Tables are self-describing, so you can |
| use standard tools like SQL engines or Spark to analyze your data. |
| </p> |
| <ul> |
| <li><a href="docs/schema_design.html">Learn more about schema design with Kudu</a></li> |
| </ul> |
| </div> |
| <div class="col-md-6"> |
| <img class="img-responsive" src="/img/twitter-firehose-schema.png" /> |
| </div> |
| </div> |
| |
| <div class="row overview"> |
| <div class="col-md-6"> |
| <img class="img-responsive" src="/img/java-code.png" /> |
| </div> |
| <div class="col-md-6"> |
| <h3>Low-latency random access</h3> |
| <p> |
| Unlike other storage for big data analytics, Kudu isn't just a file format. It's a live storage |
| system which supports low-latency millisecond-scale access to individual rows. You can choose |
| between Java and C++ APIs, with Python support under development. And of course these random access |
| APIs can be used in conjunction with batch access for machine learning or analytics. |
| </p> |
| <p> |
| Kudu's APIs are designed to be easy to use. The data model is fully typed, so you don't |
| need to worry about binary encodings or exotic serialization. You can just store primitive |
| types, like when you use JDBC or ODBC. |
| </p> |
| <p> |
| Kudu isn't designed to be an OLTP system, but if you have some subset of data which fits |
| in memory, it offers competitive random access performance. We've measured 99th percentile |
| latencies of 6 ms or below using YCSB with a uniform random access workload over a billion |
| rows. Being able to run low-latency online workloads on the same storage as back-end |
| data analytics can dramatically simplify application architecture. |
| </p> |
| <ul> |
| <li><a href="apidocs/">View the Java API docs</a></li> |
| <li><a href="https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/client.h">View the C++ client API</a></li> |
| <li><a href="docs/developing.html">Learn more about developing applications with Kudu</a></li> |
| <!-- TODO: include a link to YCSB results --> |
| </ul> |
| </div> |
| </div> |
| |
| |
| <div class="row overview"> |
| <div class="col-md-6"> |
| <h3>Integration with the Hadoop Ecosystem</h3> |
| <p> |
| Kudu was designed to fit in with the Hadoop ecosystem, and integrating it with other |
| data processing frameworks is simple. You can stream data in from real-time data sources |
| using the Java client, and then process it immediately upon arrival using Spark, Impala, |
| or MapReduce. You can even transparently join Kudu tables with data stored in other Hadoop |
| storage such as HDFS or HBase. |
| </p> |
| <p> |
| Kudu is a good citizen on a Hadoop cluster: it can easily share data |
| disks with HDFS DataNodes, and can operate in a RAM footprint as small as 1 GB for |
| light workloads. |
| </p> |
| <ul> |
| <li><a href="docs/kudu_impala_integration.html">Learn more about integration with Impala</a></li> |
| <li><a href="https://github.com/apache/incubator-kudu/blob/master/java/kudu-client-tools/src/main/java/org/kududb/mapreduce/tools/RowCounter.java"> |
| View an example of a MapReduce job on Kudu |
| </a></li> |
| </ul> |
| </div> |
| <div class="col-md-6"> |
| <img class="img-responsive" src="/img/impala.png" /> |
| </div> |
| </div> |
| |
| <div class="row overview"> |
| <div class="col-md-6"> |
| <img class="img-responsive" src="/img/tracing.png" /> |
| </div> |
| <div class="col-md-6"> |
| <h3>Built by and for Operators</h3> |
| <p> |
| Kudu was built by a group of engineers who have spent many late nights providing |
| on-call production support for critical Hadoop clusters across hundreds of |
| enterprise use cases. We know how frustrating it is to debug software |
| without good metrics, tracing, or administrative tools. |
| </p> |
| <p> |
| Even in its first beta release, Kudu includes advanced in-process tracing capabilities, |
| extensive metrics support, and even watchdog threads which check for latency |
| outliers and dump "smoking gun" stack traces to get to the root of the problem |
| quickly. |
| </p> |
| <ul> |
| <li><a href="docs/administration.html">Learn more about administering Kudu</a></li> |
| <li><a href="docs/troubleshooting.html#kudu_tracing">Learn more about Kudu's tracing capabilities</a></li> |
| </ul> |
| </div> |
| </div> |
| |
| <div class="row overview noborder"> |
| <div class="col-md-6"> |
| <h3><a name="opensource">Open Source</a></h3> |
| <p> |
| Kudu is open source software, licensed under the Apache 2.0 license. Although |
| most contributions so far have come from developers at a single company, |
| we believe that Kudu's long-term success depends on |
| building a vibrant community of developers and users from diverse organizations and backgrounds. |
| </p> |
| <p> |
| Our first order of business is to migrate all of our code reviews and |
| daily development discussions into the open. Kudu is also now undergoing |
| incubation at the Apache Software Foundation. With |
| several Apache Members and Committers on the development team, we are |
| familiar with the Apache Way and look forward to growing this |
| project at the ASF. |
| </p> |
| <ul> |
| <li><a href="/community.html">Learn more about how to contribute</a></li> |
| <li><a href="https://github.com/apache/incubator-kudu">View the Kudu GitHub repository</a></li> |
| </ul> |
| </div> |
| <div class="col-md-6"> |
| <img class="img-responsive" src="/img/github.png" size="75%"/> |
| </div> |
| </div> |
| |
| <!-------------------------------------------------------------> |
| |
| <div class="row header"> |
| <div class="col-lg-12"> |
| <h2><a name="architecture">Kudu Architecture</a></h2> |
| </div> |
| </div> |
| |
| <div class="row overview"> |
| <div class="col-md-6"> |
| <h3>Super-fast Columnar Storage</h3> |
| <p> |
| Like most modern analytic data stores, Kudu internally organizes its data by column rather than |
| row. Columnar storage allows efficient encoding and compression. For example, a string field with |
| only a few unique values can use only a few bits per row of storage. With techniques such as |
| run-length encoding, differential encoding, and vectorized bit-packing, Kudu is as fast at |
| reading the data as it is space-efficient at storing it. |
| </p> |
| <p> |
| Columnar storage also dramatically reduces the amount of data IO required to service analytic |
| queries. Using techniques such as lazy data materialization and predicate pushdown, Kudu can perform |
| drill-down and needle-in-a-haystack queries over billions of rows and terabytes of data in seconds. |
| </p> |
| <ul> |
| <li><a href="kudu.pdf">Read the Kudu paper for more details and a performance evaluation</a></li> |
| </ul> |
| </div> |
| <div class="col-md-6"> |
| <img class="img-responsive" src="/img/kudu_vs_parquet.png" /> |
| </div> |
| </div> |
| |
| <div class="row overview"> |
| <div class="col-md-6"> |
| <img class="img-responsive" src="/img/raft.png" width="400"/> |
| </div> |
| <div class="col-md-6"> |
| <h3>Distribution and Fault Tolerance</h3> |
| <p> |
| In order to scale out to large datasets and large clusters, Kudu splits tables |
| into smaller units called <em>tablets</em>. This splitting can be configured |
| on a per-table basis to be based on hashing, range partitioning, or a combination |
| thereof. This allows the operator to easily trade off between parallelism for |
| analytic workloads and high concurrency for more online ones. |
| </p> |
| <p> |
| In order to keep your data safe and available at all times, Kudu uses the |
| <a href="http://raft.github.io">Raft</a> consensus algorithm to replicate |
| all operations for a given tablet. Raft, like Paxos, ensures that every |
| write is persisted by at least two nodes before Kudu responds to |
| the client request, so that no data is lost due to a |
| machine failure. When machines do fail, replicas reconfigure |
| themselves within a few seconds to maintain extremely high system |
| availability. |
| </p> |
| <p> |
| The use of majority consensus provides very low tail latencies |
| even when some nodes may be stressed by concurrent workloads such as |
| MapReduce jobs or heavy Impala queries. But unlike eventually |
| consistent systems, Raft consensus ensures that all replicas will |
| come to agreement on the state of the data, and by using a |
| combination of logical and physical clocks, Kudu can offer strict |
| snapshot consistency to clients that demand it. |
| </p> |
| <ul> |
| <li><a href="http://raft.github.io/">Learn more about Raft Consensus</a></li> |
| <li><a href="kudu.pdf">Read the Kudu paper for more details on its architecture</a></li> |
| </ul> |
| </div> |
| </div> |
| |
| <div class="row overview noborder"> |
| <div class="col-md-6"> |
| <h3>Designed for Next-Generation Hardware</h3> |
| <p> |
| The Kudu team has worked closely with engineers at Intel to harness the power |
| of the next generation of hardware technologies. |
| Kudu's storage is designed to take advantage of the IO |
| characteristics of solid state drives, and it includes an |
| experimental cache implementation based on the <a href="http://pmem.io">libpmem</a> |
| library which can store data in persistent memory. |
| </p> |
| <p> |
| Kudu is implemented in C++, so it can scale easily to large amounts |
| of memory per node. And because key storage data structures are designed to |
| be highly concurrent, it can scale easily to tens of cores. With an |
| in-memory columnar execution path, Kudu achieves good instruction-level |
| parallelism using SIMD operations from the SSE4 and AVX instruction sets. |
| </p> |
| </div> |
| <div class="col-md-6"> |
| <img class="img-responsive" src="/img/intel_logo.png" size="75%"/> |
| </div> |
| </div> |
| |
| |
| <!-------------------------------------------------------------> |
| |
| <footer class="footer"> |
| <p class="pull-left"> |
| <a href="http://incubator.apache.org"><img src="/img/apache-incubator.png" width="225" height="53" align="right"/></a> |
| </p> |
| <p class="small"> |
| Apache Kudu (incubating) is an effort undergoing incubation at the Apache Software |
| Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is |
| required of all newly accepted projects until a further review |
| indicates that the infrastructure, communications, and decision making |
| process have stabilized in a manner consistent with other successful |
| ASF projects. While incubation status is not necessarily a reflection |
| of the completeness or stability of the code, it does indicate that the |
| project has yet to be fully endorsed by the ASF. |
| |
| Copyright © 2016 The Apache Software Foundation. |
| </p> |
| </footer> |
| </div> |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script> |
| <script src="/js/bootstrap.js"></script> |
| <script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| |
| ga('create', 'UA-68448017-1', 'auto'); |
| ga('send', 'pageview'); |
| |
| </script> |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.js"></script> |
| <script> |
| anchors.options = { |
| placement: 'right', |
| visible: 'touch', |
| }; |
| anchors.add(); |
| </script> |
| </body> |
| </html> |
| |