blob: 35359825ceffef078e1c60a651657a10ff4c684a [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | Segment Size Optimization</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
<script>
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
</head>
<body>
<!-- Start page_header include -->
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<div class="top-navigator">
<div class="container">
<div class="left-cont">
<a class="logo" href="/"><span class="druid-logo"></span></a>
</div>
<div class="right-cont">
<ul class="links">
<li class=""><a href="/technology">Technology</a></li>
<li class=""><a href="/use-cases">Use Cases</a></li>
<li class=""><a href="/druid-powered">Powered By</a></li>
<li class=""><a href="/docs/latest/design/">Docs</a></li>
<li class=""><a href="/community/">Community</a></li>
<li class="header-dropdown">
<a>Apache</a>
<div class="header-dropdown-menu">
<a href="https://www.apache.org/" target="_blank">Foundation</a>
<a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
<a href="https://www.apache.org/licenses/" target="_blank">License</a>
<a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
<a href="https://www.apache.org/security/" target="_blank">Security</a>
<a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
</div>
</li>
<li class=" button-link"><a href="/downloads.html">Download</a></li>
</ul>
</div>
</div>
<div class="action-button menu-icon">
<span class="fa fa-bars"></span> MENU
</div>
<div class="action-button menu-icon-close">
<span class="fa fa-times"></span> MENU
</div>
</div>
<script type="text/javascript">
var $menu = $('.right-cont');
var $menuIcon = $('.menu-icon');
var $menuIconClose = $('.menu-icon-close');
function showMenu() {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeIn(100);
}
$menuIcon.click(showMenu);
function hideMenu() {
$menu.fadeOut(100);
$menuIconClose.fadeOut(100);
$menuIcon.fadeIn(100);
}
$menuIconClose.click(hideMenu);
$(window).resize(function() {
if ($(window).width() >= 840) {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeOut(100);
}
else {
$menu.fadeOut(100);
$menuIcon.fadeIn(100);
$menuIconClose.fadeOut(100);
}
});
</script>
<!-- Stop page_header include -->
<div class="container doc-container">
<p> Looking for the <a href="/docs/0.19.0/">latest stable documentation</a>?</p>
<div class="row">
<div class="col-md-9 doc-content">
<p>
<a class="btn btn-default btn-xs visible-xs-inline-block visible-sm-inline-block" href="#toc">Table of Contents</a>
</p>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<h1 id="segment-size-optimization">Segment Size Optimization</h1>
<p>In Apache Druid (incubating), it&#39;s important to optimize the segment size because</p>
<ol>
<li>Druid stores data in segments. If you&#39;re using the <a href="../design/index.html#roll-up-modes">best-effort roll-up</a> mode,
increasing the segment size might introduce further aggregation which reduces the dataSource size.</li>
<li>When a query is submitted, that query is distributed to all Historicals and realtime tasks
which hold the input segments of the query. Each process and task picks a thread from its own processing thread pool
to process a single segment. If segment sizes are too large, data might not be well distributed between data
servers, decreasing the degree of parallelism possible during query processing.
At the other extreme where segment sizes are too small, the scheduling
overhead of processing a larger number of segments per query can reduce
performance, as the threads that process each segment compete for the fixed
slots of the processing pool.</li>
</ol>
<p>It would be best if you can optimize the segment size at ingestion time, but sometimes it&#39;s not easy
especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case,
you can create segments with a sub-optimzed size first and optimize them later.</p>
<p>You may need to consider the followings to optimize your segments.</p>
<ul>
<li>Number of rows per segment: it&#39;s generally recommended for each segment to have around 5 million rows.
This setting is usually <em>more</em> important than the below &quot;segment byte size&quot;.
This is because Druid uses a single thread to process each segment,
and thus this setting can directly control how many rows each thread processes,
which in turn means how well the query execution is parallelized.</li>
<li>Segment byte size: it&#39;s recommended to set 300 ~ 700MB. If this value
doesn&#39;t match with the &quot;number of rows per segment&quot;, please consider optimizing
number of rows per segment rather than this value.</li>
</ul>
<div class="note">
The above recommendation works in general, but the optimal setting can
vary based on your workload. For example, if most of your queries
are heavy and take a long time to process each row, you may want to make
segments smaller so that the query processing can be more parallelized.
If you still see some performance issue after optimizing segment size,
you may need to find the optimal settings for your workload.
</div>
<p>There might be several ways to check if the compaction is necessary. One way
is using the <a href="../querying/sql.html#system-schema">System Schema</a>. The
system schema provides several tables about the current system status including the <code>segments</code> table.
By running the below query, you can get the average number of rows and average size for published segments.</p>
<div class="highlight"><pre><code class="language-sql" data-lang="sql"><span></span><span class="k">SELECT</span>
<span class="ss">&quot;start&quot;</span><span class="p">,</span>
<span class="ss">&quot;end&quot;</span><span class="p">,</span>
<span class="k">version</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">num_segments</span><span class="p">,</span>
<span class="k">AVG</span><span class="p">(</span><span class="ss">&quot;num_rows&quot;</span><span class="p">)</span> <span class="k">AS</span> <span class="n">avg_num_rows</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="ss">&quot;num_rows&quot;</span><span class="p">)</span> <span class="k">AS</span> <span class="n">total_num_rows</span><span class="p">,</span>
<span class="k">AVG</span><span class="p">(</span><span class="ss">&quot;size&quot;</span><span class="p">)</span> <span class="k">AS</span> <span class="n">avg_size</span><span class="p">,</span>
<span class="k">SUM</span><span class="p">(</span><span class="ss">&quot;size&quot;</span><span class="p">)</span> <span class="k">AS</span> <span class="n">total_size</span>
<span class="k">FROM</span>
<span class="n">sys</span><span class="p">.</span><span class="n">segments</span> <span class="n">A</span>
<span class="k">WHERE</span>
<span class="n">datasource</span> <span class="o">=</span> <span class="s1">&#39;your_dataSource&#39;</span> <span class="k">AND</span>
<span class="n">is_published</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span> <span class="k">DESC</span><span class="p">;</span>
</code></pre></div>
<p>Please note that the query result might include overshadowed segments.
In this case, you may want to see only rows of the max version per interval (pair of <code>start</code> and <code>end</code>).</p>
<p>Once you find your segments need compaction, you can consider the below two options:</p>
<ul>
<li>Turning on the <a href="../design/coordinator.html#compacting-segments">automatic compaction of Coordinators</a>.
The Coordinator periodically submits <a href="../ingestion/tasks.html#compaction-task">compaction tasks</a> to re-index small segments.
To enable the automatic compaction, you need to configure it for each dataSource via Coordinator&#39;s dynamic configuration.
See <a href="../operations/api-reference.html#compaction-configuration">Compaction Configuration API</a>
and <a href="../configuration/index.html#compaction-dynamic-configuration">Compaction Configuration</a> for details.</li>
<li>Running periodic Hadoop batch ingestion jobs and using a <code>dataSource</code>
inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel.
Details on how to do this can be found under <a href="../ingestion/update-existing-data.html">&#39;Updating Existing Data&#39;</a>.</li>
</ul>
</div>
<div class="col-md-3">
<div class="searchbox">
<gcse:searchbox-only></gcse:searchbox-only>
</div>
<div id="toc" class="nav toc hidden-print">
</div>
</div>
</div>
</div>
<!-- Start page_footer include -->
<footer class="druid-footer">
<div class="container">
<div class="text-center">
<p>
<a href="/technology">Technology</a>&ensp;·&ensp;
<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
<a href="/docs/latest/">Docs</a>&ensp;·&ensp;
<a href="/community/">Community</a>&ensp;·&ensp;
<a href="/downloads.html">Download</a>&ensp;·&ensp;
<a href="/faq">FAQ</a>
</p>
</div>
<div class="text-center">
<a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
<a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
<a title="GitHub" href="https://github.com/apache/druid" target="_blank"><span class="fab fa-github"></span></a>
</div>
<div class="text-center license">
Copyright © 2020 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
</div>
</div>
</footer>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-131010415-1');
</script>
<script>
function trackDownload(type, url) {
ga('send', 'event', 'download', type, url);
}
</script>
<script src="//code.jquery.com/jquery.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script src="/assets/js/druid.js"></script>
<!-- stop page_footer include -->
<script>
$(function() {
$(".toc").load("/docs/0.14.2-incubating/toc.html");
// There is no way to tell when .gsc-input will be async loaded into the page so just try to set a placeholder until it works
var tries = 0;
var timer = setInterval(function() {
tries++;
if (tries > 300) clearInterval(timer);
var searchInput = $('input.gsc-input');
if (searchInput.length) {
searchInput.attr('placeholder', 'Search');
clearInterval(timer);
}
}, 100);
});
</script>
</body>
</html>