blob: 327e041f0ef8936f060afef6409f017a2224586f [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | Updating Existing Data</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
<script>
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
</head>
<body>
<!-- Start page_header include -->
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<div class="top-navigator">
<div class="container">
<div class="left-cont">
<a class="logo" href="/"><span class="druid-logo"></span></a>
</div>
<div class="right-cont">
<ul class="links">
<li class=""><a href="/technology">Technology</a></li>
<li class=""><a href="/use-cases">Use Cases</a></li>
<li class=""><a href="/druid-powered">Powered By</a></li>
<li class=""><a href="/docs/latest/design/">Docs</a></li>
<li class=""><a href="/community/">Community</a></li>
<li class="header-dropdown">
<a>Apache</a>
<div class="header-dropdown-menu">
<a href="https://www.apache.org/" target="_blank">Foundation</a>
<a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
<a href="https://www.apache.org/licenses/" target="_blank">License</a>
<a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
<a href="https://www.apache.org/security/" target="_blank">Security</a>
<a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
</div>
</li>
<li class=" button-link"><a href="/downloads.html">Download</a></li>
</ul>
</div>
</div>
<div class="action-button menu-icon">
<span class="fa fa-bars"></span> MENU
</div>
<div class="action-button menu-icon-close">
<span class="fa fa-times"></span> MENU
</div>
</div>
<script type="text/javascript">
var $menu = $('.right-cont');
var $menuIcon = $('.menu-icon');
var $menuIconClose = $('.menu-icon-close');
function showMenu() {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeIn(100);
}
$menuIcon.click(showMenu);
function hideMenu() {
$menu.fadeOut(100);
$menuIconClose.fadeOut(100);
$menuIcon.fadeIn(100);
}
$menuIconClose.click(hideMenu);
$(window).resize(function() {
if ($(window).width() >= 840) {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeOut(100);
}
else {
$menu.fadeOut(100);
$menuIcon.fadeIn(100);
$menuIconClose.fadeOut(100);
}
});
</script>
<!-- Stop page_header include -->
<div class="container doc-container">
<p> Looking for the <a href="/docs/0.20.0/">latest stable documentation</a>?</p>
<div class="row">
<div class="col-md-9 doc-content">
<p>
<a class="btn btn-default btn-xs visible-xs-inline-block visible-sm-inline-block" href="#toc">Table of Contents</a>
</p>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<h1 id="updating-existing-data">Updating Existing Data</h1>
<p>Once you ingest some data in a dataSource for an interval and create Apache Druid (incubating) segments, you might want to make changes to
the ingested data. There are several ways this can be done.</p>
<h5 id="updating-dimension-values">Updating Dimension Values</h5>
<p>If you have a dimension where values need to be updated frequently, try first using <a href="../querying/lookups.html">lookups</a>. A
classic use case of lookups is when you have an ID dimension stored in a Druid segment, and want to map the ID dimension to a
human-readable String value that may need to be updated periodically.</p>
<h5 id="rebuilding-segments-reindexing">Rebuilding Segments (Reindexing)</h5>
<p>If lookups are not sufficient, you can entirely rebuild Druid segments for specific intervals of time. Rebuilding a segment
is known as reindexing the data. For example, if you want to add or remove columns from your existing segments, or you want to
change the rollup granularity of your segments, you will have to reindex your data.</p>
<p>We recommend keeping a copy of your raw data around in case you ever need to reindex your data.</p>
<h5 id="dealing-with-delayed-events-delta-ingestion">Dealing with Delayed Events (Delta Ingestion)</h5>
<p>If you have a batch ingestion pipeline and have delayed events come in and want to append these events to existing
segments and avoid the overhead of rebuilding new segments with reindexing, you can use delta ingestion.</p>
<h3 id="reindexing-and-delta-ingestion-with-hadoop-batch-ingestion">Reindexing and Delta Ingestion with Hadoop Batch Ingestion</h3>
<p>This section assumes the reader understands how to do batch ingestion using Hadoop. See
<a href="./hadoop.html">Hadoop batch ingestion</a> for more information. Hadoop batch-ingestion can be used for reindexing and delta ingestion.</p>
<p>Druid uses an <code>inputSpec</code> in the <code>ioConfig</code> to know where the data to be ingested is located and how to read it.
For simple Hadoop batch ingestion, <code>static</code> or <code>granularity</code> spec types allow you to read data stored in deep storage.</p>
<p>There are other types of <code>inputSpec</code> to enable reindexing and delta ingestion.</p>
<h4 id="datasource"><code>dataSource</code></h4>
<p>This is a type of <code>inputSpec</code> that reads data already stored inside Druid. This is used to allow &quot;re-indexing&quot; data and for &quot;delta-ingestion&quot; described later in <code>multi</code> type inputSpec.</p>
<table><thead>
<tr>
<th>Field</th>
<th>Type</th>
<th>Description</th>
<th>Required</th>
</tr>
</thead><tbody>
<tr>
<td>type</td>
<td>String.</td>
<td>This should always be &#39;dataSource&#39;.</td>
<td>yes</td>
</tr>
<tr>
<td>ingestionSpec</td>
<td>JSON object.</td>
<td>Specification of Druid segments to be loaded. See below.</td>
<td>yes</td>
</tr>
<tr>
<td>maxSplitSize</td>
<td>Number</td>
<td>Enables combining multiple segments into single Hadoop InputSplit according to size of segments. With -1, druid calculates max split size based on user specified number of map task(mapred.map.tasks or mapreduce.job.maps). By default, one split is made for one segment. maxSplitSize is specified in bytes.</td>
<td>no</td>
</tr>
<tr>
<td>useNewAggs</td>
<td>Boolean</td>
<td>If &quot;false&quot;, then list of aggregators in &quot;metricsSpec&quot; of hadoop indexing task must be same as that used in original indexing task while ingesting raw data. Default value is &quot;false&quot;. This field can be set to &quot;true&quot; when &quot;inputSpec&quot; type is &quot;dataSource&quot; and not &quot;multi&quot; to enable arbitrary aggregators while reindexing. See below for &quot;multi&quot; type support for delta-ingestion.</td>
<td>no</td>
</tr>
</tbody></table>
<p>Here is what goes inside <code>ingestionSpec</code>:</p>
<table><thead>
<tr>
<th>Field</th>
<th>Type</th>
<th>Description</th>
<th>Required</th>
</tr>
</thead><tbody>
<tr>
<td>dataSource</td>
<td>String</td>
<td>Druid dataSource name from which you are loading the data.</td>
<td>yes</td>
</tr>
<tr>
<td>intervals</td>
<td>List</td>
<td>A list of strings representing ISO-8601 Intervals.</td>
<td>yes</td>
</tr>
<tr>
<td>segments</td>
<td>List</td>
<td>List of segments from which to read data from, by default it is obtained automatically. You can obtain list of segments to put here by making a POST query to Coordinator at url /druid/coordinator/v1/metadata/datasources/segments?full with list of intervals specified in the request paylod e.g. [&quot;2012-01-01T00:00:00.000/2012-01-03T00:00:00.000&quot;, &quot;2012-01-05T00:00:00.000/2012-01-07T00:00:00.000&quot;]. You may want to provide this list manually in order to ensure that segments read are exactly same as they were at the time of task submission, task would fail if the list provided by the user does not match with state of database when the task actually runs.</td>
<td>no</td>
</tr>
<tr>
<td>filter</td>
<td>JSON</td>
<td>See <a href="../querying/filters.html">Filters</a></td>
<td>no</td>
</tr>
<tr>
<td>dimensions</td>
<td>Array of String</td>
<td>Name of dimension columns to load. By default, the list will be constructed from parseSpec. If parseSpec does not have an explicit list of dimensions then all the dimension columns present in stored data will be read.</td>
<td>no</td>
</tr>
<tr>
<td>metrics</td>
<td>Array of String</td>
<td>Name of metric columns to load. By default, the list will be constructed from the &quot;name&quot; of all the configured aggregators.</td>
<td>no</td>
</tr>
<tr>
<td>ignoreWhenNoSegments</td>
<td>boolean</td>
<td>Whether to ignore this ingestionSpec if no segments were found. Default behavior is to throw error when no segments were found.</td>
<td>no</td>
</tr>
</tbody></table>
<p>For example</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="s2">&quot;ioConfig&quot;</span> <span class="err">:</span> <span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;hadoop&quot;</span><span class="p">,</span>
<span class="nt">&quot;inputSpec&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;dataSource&quot;</span><span class="p">,</span>
<span class="nt">&quot;ingestionSpec&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;dataSource&quot;</span><span class="p">:</span> <span class="s2">&quot;wikipedia&quot;</span><span class="p">,</span>
<span class="nt">&quot;intervals&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;2014-10-20T00:00:00Z/P2W&quot;</span><span class="p">]</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="err">...</span>
<span class="p">}</span>
</code></pre></div>
<h4 id="multi"><code>multi</code></h4>
<p>This is a composing inputSpec to combine other inputSpecs. This inputSpec is used for delta ingestion. You can also use a <code>multi</code> inputSpec to combine data from multiple dataSources. However, each particular dataSource can only be specified one time.
Note that, &quot;useNewAggs&quot; must be set to default value false to support delta-ingestion.</p>
<table><thead>
<tr>
<th>Field</th>
<th>Type</th>
<th>Description</th>
<th>Required</th>
</tr>
</thead><tbody>
<tr>
<td>children</td>
<td>Array of JSON objects</td>
<td>List of JSON objects containing other inputSpecs.</td>
<td>yes</td>
</tr>
</tbody></table>
<p>For example:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="s2">&quot;ioConfig&quot;</span> <span class="err">:</span> <span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;hadoop&quot;</span><span class="p">,</span>
<span class="nt">&quot;inputSpec&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;multi&quot;</span><span class="p">,</span>
<span class="nt">&quot;children&quot;</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;dataSource&quot;</span><span class="p">,</span>
<span class="nt">&quot;ingestionSpec&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;dataSource&quot;</span><span class="p">:</span> <span class="s2">&quot;wikipedia&quot;</span><span class="p">,</span>
<span class="nt">&quot;intervals&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;2012-01-01T00:00:00.000/2012-01-03T00:00:00.000&quot;</span><span class="p">,</span> <span class="s2">&quot;2012-01-05T00:00:00.000/2012-01-07T00:00:00.000&quot;</span><span class="p">],</span>
<span class="nt">&quot;segments&quot;</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">&quot;dataSource&quot;</span><span class="p">:</span> <span class="s2">&quot;test1&quot;</span><span class="p">,</span>
<span class="nt">&quot;interval&quot;</span><span class="p">:</span> <span class="s2">&quot;2012-01-01T00:00:00.000/2012-01-03T00:00:00.000&quot;</span><span class="p">,</span>
<span class="nt">&quot;version&quot;</span><span class="p">:</span> <span class="s2">&quot;v2&quot;</span><span class="p">,</span>
<span class="nt">&quot;loadSpec&quot;</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;local&quot;</span><span class="p">,</span>
<span class="nt">&quot;path&quot;</span><span class="p">:</span> <span class="s2">&quot;/tmp/index1.zip&quot;</span>
<span class="p">},</span>
<span class="nt">&quot;dimensions&quot;</span><span class="p">:</span> <span class="s2">&quot;host&quot;</span><span class="p">,</span>
<span class="nt">&quot;metrics&quot;</span><span class="p">:</span> <span class="s2">&quot;visited_sum,unique_hosts&quot;</span><span class="p">,</span>
<span class="nt">&quot;shardSpec&quot;</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;none&quot;</span>
<span class="p">},</span>
<span class="nt">&quot;binaryVersion&quot;</span><span class="p">:</span> <span class="mi">9</span><span class="p">,</span>
<span class="nt">&quot;size&quot;</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="nt">&quot;identifier&quot;</span><span class="p">:</span> <span class="s2">&quot;test1_2000-01-01T00:00:00.000Z_3000-01-01T00:00:00.000Z_v2&quot;</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;static&quot;</span><span class="p">,</span>
<span class="nt">&quot;paths&quot;</span><span class="p">:</span> <span class="s2">&quot;/path/to/more/wikipedia/data/&quot;</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">},</span>
<span class="err">...</span>
<span class="p">}</span>
</code></pre></div>
<p>It is STRONGLY RECOMMENDED to provide list of segments in <code>dataSource</code> inputSpec explicitly so that your delta ingestion task is idempotent. You can obtain that list of segments by making following call to the Coordinator.
POST <code>/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full</code>
Request Body: [interval1, interval2,...] for example [&quot;2012-01-01T00:00:00.000/2012-01-03T00:00:00.000&quot;, &quot;2012-01-05T00:00:00.000/2012-01-07T00:00:00.000&quot;]</p>
<h3 id="reindexing-with-native-batch-ingestion">Reindexing with Native Batch Ingestion</h3>
<p>This section assumes the reader understands how to do batch ingestion without Hadoop using <a href="../ingestion/native_tasks.html">Native Batch Indexing</a>,
which uses a &quot;firehose&quot; to know where and how to read the input data. <a href="firehose.html#ingestsegmentfirehose">IngestSegmentFirehose</a>
can be used to read data from segments inside Druid. Note that IndexTask is to be used for prototyping purposes only as
it has to do all processing inside a single process and can&#39;t scale. Please use Hadoop batch ingestion for production
scenarios dealing with more than 1GB of data.</p>
</div>
<div class="col-md-3">
<div class="searchbox">
<gcse:searchbox-only></gcse:searchbox-only>
</div>
<div id="toc" class="nav toc hidden-print">
</div>
</div>
</div>
</div>
<!-- Start page_footer include -->
<footer class="druid-footer">
<div class="container">
<div class="text-center">
<p>
<a href="/technology">Technology</a>&ensp;·&ensp;
<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
<a href="/docs/latest/">Docs</a>&ensp;·&ensp;
<a href="/community/">Community</a>&ensp;·&ensp;
<a href="/downloads.html">Download</a>&ensp;·&ensp;
<a href="/faq">FAQ</a>
</p>
</div>
<div class="text-center">
<a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
<a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
<a title="GitHub" href="https://github.com/apache/druid" target="_blank"><span class="fab fa-github"></span></a>
</div>
<div class="text-center license">
Copyright © 2020 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
</div>
</div>
</footer>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-131010415-1');
</script>
<script>
function trackDownload(type, url) {
ga('send', 'event', 'download', type, url);
}
</script>
<script src="//code.jquery.com/jquery.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script src="/assets/js/druid.js"></script>
<!-- stop page_footer include -->
<script>
$(function() {
$(".toc").load("/docs/0.14.1-incubating/toc.html");
// There is no way to tell when .gsc-input will be async loaded into the page so just try to set a placeholder until it works
var tries = 0;
var timer = setInterval(function() {
tries++;
if (tries > 300) clearInterval(timer);
var searchInput = $('input.gsc-input');
if (searchInput.length) {
searchInput.attr('placeholder', 'Search');
clearInterval(timer);
}
}, 100);
});
</script>
</body>
</html>