blob: fc9fd562a50911800bb90e7c7dbc12922516ade6 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | Schema Design</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
<script>
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
</head>
<body>
<!-- Start page_header include -->
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<div class="top-navigator">
<div class="container">
<div class="left-cont">
<a class="logo" href="/"><span class="druid-logo"></span></a>
</div>
<div class="right-cont">
<ul class="links">
<li class=""><a href="/technology">Technology</a></li>
<li class=""><a href="/use-cases">Use Cases</a></li>
<li class=""><a href="/druid-powered">Powered By</a></li>
<li class=""><a href="/docs/latest/design/">Docs</a></li>
<li class=""><a href="/community/">Community</a></li>
<li class="header-dropdown">
<a>Apache</a>
<div class="header-dropdown-menu">
<a href="https://www.apache.org/" target="_blank">Foundation</a>
<a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
<a href="https://www.apache.org/licenses/" target="_blank">License</a>
<a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
<a href="https://www.apache.org/security/" target="_blank">Security</a>
<a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
</div>
</li>
<li class=" button-link"><a href="/downloads.html">Download</a></li>
</ul>
</div>
</div>
<div class="action-button menu-icon">
<span class="fa fa-bars"></span> MENU
</div>
<div class="action-button menu-icon-close">
<span class="fa fa-times"></span> MENU
</div>
</div>
<script type="text/javascript">
var $menu = $('.right-cont');
var $menuIcon = $('.menu-icon');
var $menuIconClose = $('.menu-icon-close');
function showMenu() {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeIn(100);
}
$menuIcon.click(showMenu);
function hideMenu() {
$menu.fadeOut(100);
$menuIconClose.fadeOut(100);
$menuIcon.fadeIn(100);
}
$menuIconClose.click(hideMenu);
$(window).resize(function() {
if ($(window).width() >= 840) {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeOut(100);
}
else {
$menu.fadeOut(100);
$menuIcon.fadeIn(100);
$menuIconClose.fadeOut(100);
}
});
</script>
<!-- Stop page_header include -->
<div class="container doc-container">
<p> Looking for the <a href="/docs/0.20.0/">latest stable documentation</a>?</p>
<div class="row">
<div class="col-md-9 doc-content">
<p>
<a class="btn btn-default btn-xs visible-xs-inline-block visible-sm-inline-block" href="#toc">Table of Contents</a>
</p>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<h1 id="schema-design">Schema Design</h1>
<p>This page is meant to assist users in designing a schema for data to be ingested in Druid. Druid intakes denormalized data
and columns are one of three types: a timestamp, a dimension, or a measure (or a metric/aggregator as they are
known in Druid). This follows the <a href="https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems">standard naming convention</a>
of OLAP data.</p>
<p>For more detailed information:</p>
<ul>
<li>Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter. Query results can also be broken down by time buckets like minutes, hours, days, and so on.</li>
<li>Dimensions are fields that can be filtered on or grouped by. They are always single Strings, arrays of Strings, single Longs, single Doubles or single Floats.</li>
<li>Metrics are fields that can be aggregated. They are often stored as numbers (integers or floats) but can also be stored as complex objects like HyperLogLog sketches or approximate histogram sketches.</li>
</ul>
<p>Typical production tables (or datasources as they are known in Druid) have fewer than 100 dimensions and fewer
than 100 metrics, although, based on user testimony, datasources with thousands of dimensions have been created.</p>
<p>Below, we outline some best practices with schema design:</p>
<h2 id="numeric-dimensions">Numeric dimensions</h2>
<p>If the user wishes to ingest a column as a numeric-typed dimension (Long, Double or Float), it is necessary to specify the type of the column in the <code>dimensions</code> section of the <code>dimensionsSpec</code>. If the type is omitted, Druid will ingest a column as the default String type.</p>
<p>There are performance tradeoffs between string and numeric columns. Numeric columns are generally faster to group on
than string columns. But unlike string columns, numeric columns don&#39;t have indexes, so they are generally slower to
filter on.</p>
<p>See <a href="../ingestion/index.html#dimension-schema">Dimension Schema</a> for more information.</p>
<h2 id="high-cardinality-dimensions-e-g-unique-ids">High cardinality dimensions (e.g. unique IDs)</h2>
<p>In practice, we see that exact counts for unique IDs are often not required. Storing unique IDs as a column will kill
<a href="../ingestion/index.html#rollup">roll-up</a>, and impact compression. Instead, storing a sketch of the number of the unique IDs seen, and using that
sketch as part of aggregations, will greatly improve performance (up to orders of magnitude performance improvement), and significantly reduce storage.
Druid&#39;s <code>hyperUnique</code> aggregator is based off of Hyperloglog and can be used for unique counts on a high cardinality dimension.
For more information, see <a href="https://www.youtube.com/watch?v=Hpd3f_MLdXo">here</a>.</p>
<h2 id="nested-dimensions">Nested dimensions</h2>
<p>At the time of this writing, Druid does not support nested dimensions. Nested dimensions need to be flattened. For example,
if you have data of the following form:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>{&quot;foo&quot;:{&quot;bar&quot;: 3}}
</code></pre></div>
<p>then before indexing it, you should transform it to:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>{&quot;foo_bar&quot;: 3}
</code></pre></div>
<p>Druid is capable of flattening JSON input data, please see <a href="../ingestion/flatten-json.html">Flatten JSON</a> for more details.</p>
<h2 id="counting-the-number-of-ingested-events">Counting the number of ingested events</h2>
<p>A count aggregator at ingestion time can be used to count the number of events ingested. However, it is important to note
that when you query for this metric, you should use a <code>longSum</code> aggregator. A <code>count</code> aggregator at query time will return
the number of Druid rows for the time interval, which can be used to determine what the roll-up ratio was.</p>
<p>To clarify with an example, if your ingestion spec contains:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>...
&quot;metricsSpec&quot; : [
{
&quot;type&quot; : &quot;count&quot;,
&quot;name&quot; : &quot;count&quot;
},
...
</code></pre></div>
<p>You should query for the number of ingested rows with:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>...
&quot;aggregations&quot;: [
{ &quot;type&quot;: &quot;longSum&quot;, &quot;name&quot;: &quot;numIngestedEvents&quot;, &quot;fieldName&quot;: &quot;count&quot; },
...
</code></pre></div>
<h2 id="schema-less-dimensions">Schema-less dimensions</h2>
<p>If the <code>dimensions</code> field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
a dimension that has been excluded, or a metric column as a dimension. It should be noted that because of <a href="https://github.com/apache/incubator-druid/issues/658">#658</a>
these segments will be slightly larger than if the list of dimensions was explicitly specified in lexicographic order. This limitation
does not impact query correctness- just storage requirements.</p>
<p>Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.</p>
<h2 id="including-the-same-column-as-a-dimension-and-a-metric">Including the same column as a dimension and a metric</h2>
<p>One workflow with unique IDs is to be able to filter on a particular ID, while still being able to do fast unique counts on the ID column.
If you are not using schema-less dimensions, this use case is supported by setting the <code>name</code> of the metric to something different than the dimension.
If you are using schema-less dimensions, the best practice here is to include the same column twice, once as a dimension, and as a <code>hyperUnique</code> metric. This may involve
some work at ETL time.</p>
<p>As an example, for schema-less dimensions, repeat the same column:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>{&quot;device_id_dim&quot;:123, &quot;device_id_met&quot;:123}
</code></pre></div>
<p>and in your <code>metricsSpec</code>, include:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>{ &quot;type&quot; : &quot;hyperUnique&quot;, &quot;name&quot; : &quot;devices&quot;, &quot;fieldName&quot; : &quot;device_id_met&quot; }
</code></pre></div>
<p><code>device_id_dim</code> should automatically get picked up as a dimension.</p>
</div>
<div class="col-md-3">
<div class="searchbox">
<gcse:searchbox-only></gcse:searchbox-only>
</div>
<div id="toc" class="nav toc hidden-print">
</div>
</div>
</div>
</div>
<!-- Start page_footer include -->
<footer class="druid-footer">
<div class="container">
<div class="text-center">
<p>
<a href="/technology">Technology</a>&ensp;·&ensp;
<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
<a href="/docs/latest/">Docs</a>&ensp;·&ensp;
<a href="/community/">Community</a>&ensp;·&ensp;
<a href="/downloads.html">Download</a>&ensp;·&ensp;
<a href="/faq">FAQ</a>
</p>
</div>
<div class="text-center">
<a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
<a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
<a title="GitHub" href="https://github.com/apache/druid" target="_blank"><span class="fab fa-github"></span></a>
</div>
<div class="text-center license">
Copyright © 2020 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
</div>
</div>
</footer>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-131010415-1');
</script>
<script>
function trackDownload(type, url) {
ga('send', 'event', 'download', type, url);
}
</script>
<script src="//code.jquery.com/jquery.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script src="/assets/js/druid.js"></script>
<!-- stop page_footer include -->
<script>
$(function() {
$(".toc").load("/docs/0.13.0-incubating/toc.html");
// There is no way to tell when .gsc-input will be async loaded into the page so just try to set a placeholder until it works
var tries = 0;
var timer = setInterval(function() {
tries++;
if (tries > 300) clearInterval(timer);
var searchInput = $('input.gsc-input');
if (searchInput.length) {
searchInput.attr('placeholder', 'Search');
clearInterval(timer);
}
}, 100);
});
</script>
</body>
</html>