blob: fcb8d0f638c7040ea677dca313356edb6bc4cf49 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | Apache Druid (incubating) Firehoses</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
<script>
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
</head>
<body>
<!-- Start page_header include -->
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<div class="top-navigator">
<div class="container">
<div class="left-cont">
<a class="logo" href="/"><span class="druid-logo"></span></a>
</div>
<div class="right-cont">
<ul class="links">
<li class=""><a href="/technology">Technology</a></li>
<li class=""><a href="/use-cases">Use Cases</a></li>
<li class=""><a href="/druid-powered">Powered By</a></li>
<li class=""><a href="/docs/latest/design/">Docs</a></li>
<li class=""><a href="/community/">Community</a></li>
<li class="header-dropdown">
<a>Apache</a>
<div class="header-dropdown-menu">
<a href="https://www.apache.org/" target="_blank">Foundation</a>
<a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
<a href="https://www.apache.org/licenses/" target="_blank">License</a>
<a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
<a href="https://www.apache.org/security/" target="_blank">Security</a>
<a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
</div>
</li>
<li class=" button-link"><a href="/downloads.html">Download</a></li>
</ul>
</div>
</div>
<div class="action-button menu-icon">
<span class="fa fa-bars"></span> MENU
</div>
<div class="action-button menu-icon-close">
<span class="fa fa-times"></span> MENU
</div>
</div>
<script type="text/javascript">
var $menu = $('.right-cont');
var $menuIcon = $('.menu-icon');
var $menuIconClose = $('.menu-icon-close');
function showMenu() {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeIn(100);
}
$menuIcon.click(showMenu);
function hideMenu() {
$menu.fadeOut(100);
$menuIconClose.fadeOut(100);
$menuIcon.fadeIn(100);
}
$menuIconClose.click(hideMenu);
$(window).resize(function() {
if ($(window).width() >= 840) {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeOut(100);
}
else {
$menu.fadeOut(100);
$menuIcon.fadeIn(100);
$menuIconClose.fadeOut(100);
}
});
</script>
<!-- Stop page_header include -->
<div class="container doc-container">
<p> Looking for the <a href="/docs/0.23.0/">latest stable documentation</a>?</p>
<div class="row">
<div class="col-md-9 doc-content">
<p>
<a class="btn btn-default btn-xs visible-xs-inline-block visible-sm-inline-block" href="#toc">Table of Contents</a>
</p>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<h1 id="apache-druid-incubating-firehoses">Apache Druid (incubating) Firehoses</h1>
<p>Firehoses are used in <a href="../ingestion/native_tasks.html">native batch ingestion tasks</a>, stream push tasks automatically created by <a href="../ingestion/stream-push.html">Tranquility</a>, and the <a href="../ingestion/stream-pull.html">stream-pull (deprecated)</a> ingestion model.</p>
<p>They are pluggable and thus the configuration schema can and will vary based on the <code>type</code> of the firehose.</p>
<table><thead>
<tr>
<th>Field</th>
<th>Type</th>
<th>Description</th>
<th>Required</th>
</tr>
</thead><tbody>
<tr>
<td>type</td>
<td>String</td>
<td>Specifies the type of firehose. Each value will have its own configuration schema, firehoses packaged with Druid are described below.</td>
<td>yes</td>
</tr>
</tbody></table>
<h2 id="additional-firehoses">Additional Firehoses</h2>
<p>There are several firehoses readily available in Druid, some are meant for examples, others can be used directly in a production environment.</p>
<p>For additional firehoses, please see our <a href="../development/extensions.html">extensions list</a>.</p>
<h3 id="localfirehose">LocalFirehose</h3>
<p>This Firehose can be used to read the data from files on local disk.
It can be used for POCs to ingest data on disk.
This firehose is <em>splittable</em> and can be used by <a href="./native_tasks.html#parallel-index-task">native parallel index tasks</a>.
Since each split represents a file in this firehose, each worker task of <code>index_parallel</code> will read a file.
A sample local firehose spec is shown below:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;local&quot;</span><span class="p">,</span>
<span class="nt">&quot;filter&quot;</span> <span class="p">:</span> <span class="s2">&quot;*.csv&quot;</span><span class="p">,</span>
<span class="nt">&quot;baseDir&quot;</span> <span class="p">:</span> <span class="s2">&quot;/data/directory&quot;</span>
<span class="p">}</span>
</code></pre></div>
<table><thead>
<tr>
<th>property</th>
<th>description</th>
<th>required?</th>
</tr>
</thead><tbody>
<tr>
<td>type</td>
<td>This should be &quot;local&quot;.</td>
<td>yes</td>
</tr>
<tr>
<td>filter</td>
<td>A wildcard filter for files. See <a href="http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html">here</a> for more information.</td>
<td>yes</td>
</tr>
<tr>
<td>baseDir</td>
<td>directory to search recursively for files to be ingested.</td>
<td>yes</td>
</tr>
</tbody></table>
<h3 id="httpfirehose">HttpFirehose</h3>
<p>This Firehose can be used to read the data from remote sites via HTTP.
This firehose is <em>splittable</em> and can be used by <a href="./native_tasks.html#parallel-index-task">native parallel index tasks</a>.
Since each split represents a file in this firehose, each worker task of <code>index_parallel</code> will read a file.
A sample http firehose spec is shown below:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;http&quot;</span><span class="p">,</span>
<span class="nt">&quot;uris&quot;</span> <span class="p">:</span> <span class="p">[</span><span class="s2">&quot;http://example.com/uri1&quot;</span><span class="p">,</span> <span class="s2">&quot;http://example2.com/uri2&quot;</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p>The below configurations can be optionally used for tuning the firehose performance.</p>
<table><thead>
<tr>
<th>property</th>
<th>description</th>
<th>default</th>
</tr>
</thead><tbody>
<tr>
<td>maxCacheCapacityBytes</td>
<td>Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.</td>
<td>1073741824</td>
</tr>
<tr>
<td>maxFetchCapacityBytes</td>
<td>Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.</td>
<td>1073741824</td>
</tr>
<tr>
<td>prefetchTriggerBytes</td>
<td>Threshold to trigger prefetching http objects.</td>
<td>maxFetchCapacityBytes / 2</td>
</tr>
<tr>
<td>fetchTimeout</td>
<td>Timeout for fetching a http object.</td>
<td>60000</td>
</tr>
<tr>
<td>maxFetchRetry</td>
<td>Maximum retry for fetching a http object.</td>
<td>3</td>
</tr>
</tbody></table>
<h3 id="ingestsegmentfirehose">IngestSegmentFirehose</h3>
<p>This Firehose can be used to read the data from existing druid segments.
It can be used ingest existing druid segments using a new schema and change the name, dimensions, metrics, rollup, etc. of the segment.
A sample ingest firehose spec is shown below -</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;ingestSegment&quot;</span><span class="p">,</span>
<span class="nt">&quot;dataSource&quot;</span> <span class="p">:</span> <span class="s2">&quot;wikipedia&quot;</span><span class="p">,</span>
<span class="nt">&quot;interval&quot;</span> <span class="p">:</span> <span class="s2">&quot;2013-01-01/2013-01-02&quot;</span>
<span class="p">}</span>
</code></pre></div>
<table><thead>
<tr>
<th>property</th>
<th>description</th>
<th>required?</th>
</tr>
</thead><tbody>
<tr>
<td>type</td>
<td>This should be &quot;ingestSegment&quot;.</td>
<td>yes</td>
</tr>
<tr>
<td>dataSource</td>
<td>A String defining the data source to fetch rows from, very similar to a table in a relational database</td>
<td>yes</td>
</tr>
<tr>
<td>interval</td>
<td>A String representing ISO-8601 Interval. This defines the time range to fetch the data over.</td>
<td>yes</td>
</tr>
<tr>
<td>dimensions</td>
<td>The list of dimensions to select. If left empty, no dimensions are returned. If left null or not defined, all dimensions are returned.</td>
<td>no</td>
</tr>
<tr>
<td>metrics</td>
<td>The list of metrics to select. If left empty, no metrics are returned. If left null or not defined, all metrics are selected.</td>
<td>no</td>
</tr>
<tr>
<td>filter</td>
<td>See <a href="../querying/filters.html">Filters</a></td>
<td>no</td>
</tr>
</tbody></table>
<h4 id="sqlfirehose">SqlFirehose</h4>
<p>SqlFirehoseFactory can be used to ingest events residing in RDBMS. The database connection information is provided as part of the ingestion spec. For each query, the results are fetched locally and indexed. If there are multiple queries from which data needs to be indexed, queries are prefetched in the background upto <code>maxFetchCapacityBytes</code> bytes.
An example is shown below:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;sql&quot;</span><span class="p">,</span>
<span class="nt">&quot;database&quot;</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;mysql&quot;</span><span class="p">,</span>
<span class="nt">&quot;connectorConfig&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;connectURI&quot;</span> <span class="p">:</span> <span class="s2">&quot;jdbc:mysql://host:port/schema&quot;</span><span class="p">,</span>
<span class="nt">&quot;user&quot;</span> <span class="p">:</span> <span class="s2">&quot;user&quot;</span><span class="p">,</span>
<span class="nt">&quot;password&quot;</span> <span class="p">:</span> <span class="s2">&quot;password&quot;</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="nt">&quot;sqls&quot;</span> <span class="p">:</span> <span class="p">[</span><span class="s2">&quot;SELECT * FROM table1&quot;</span><span class="p">,</span> <span class="s2">&quot;SELECT * FROM table2&quot;</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<table><thead>
<tr>
<th>property</th>
<th>description</th>
<th>default</th>
<th>required?</th>
</tr>
</thead><tbody>
<tr>
<td>type</td>
<td>This should be &quot;sql&quot;.</td>
<td></td>
<td>Yes</td>
</tr>
<tr>
<td>database</td>
<td>Specifies the database connection details.<code>type</code> should specify the database type and <code>connectorConfig</code> should specify the database connection properties via <code>connectURI</code>, <code>user</code> and <code>password</code></td>
<td></td>
<td>Yes</td>
</tr>
<tr>
<td>maxCacheCapacityBytes</td>
<td>Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.</td>
<td>1073741824</td>
<td>No</td>
</tr>
<tr>
<td>maxFetchCapacityBytes</td>
<td>Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.</td>
<td>1073741824</td>
<td>No</td>
</tr>
<tr>
<td>prefetchTriggerBytes</td>
<td>Threshold to trigger prefetching SQL result objects.</td>
<td>maxFetchCapacityBytes / 2</td>
<td>No</td>
</tr>
<tr>
<td>fetchTimeout</td>
<td>Timeout for fetching the result set.</td>
<td>60000</td>
<td>No</td>
</tr>
<tr>
<td>foldCase</td>
<td>Toggle case folding of database column names. This may be enabled in cases where the database returns case insensitive column names in query results.</td>
<td>false</td>
<td>No</td>
</tr>
<tr>
<td>sqls</td>
<td>List of SQL queries where each SQL query would retrieve the data to be indexed.</td>
<td></td>
<td>Yes</td>
</tr>
</tbody></table>
<h3 id="combiningfirehose">CombiningFirehose</h3>
<p>This firehose can be used to combine and merge data from a list of different firehoses.
This can be used to merge data from more than one firehose.</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;combining&quot;</span><span class="p">,</span>
<span class="nt">&quot;delegates&quot;</span> <span class="p">:</span> <span class="p">[</span> <span class="p">{</span> <span class="kc">f</span><span class="err">irehose</span><span class="mi">1</span> <span class="p">},</span> <span class="p">{</span> <span class="kc">f</span><span class="err">irehose</span><span class="mi">2</span> <span class="p">},</span> <span class="err">.....</span> <span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<table><thead>
<tr>
<th>property</th>
<th>description</th>
<th>required?</th>
</tr>
</thead><tbody>
<tr>
<td>type</td>
<td>This should be &quot;combining&quot;</td>
<td>yes</td>
</tr>
<tr>
<td>delegates</td>
<td>list of firehoses to combine data from</td>
<td>yes</td>
</tr>
</tbody></table>
<h3 id="streaming-firehoses">Streaming Firehoses</h3>
<p>The firehoses shown below should only be used with the <a href="../ingestion/stream-pull.html">stream-pull (deprecated)</a> ingestion model, as they are not suitable for batch ingestion.</p>
<p>The EventReceiverFirehose is also used in tasks automatically generated by <a href="../ingestion/stream-push.html">Tranquility stream push</a>.</p>
<h4 id="eventreceiverfirehose">EventReceiverFirehose</h4>
<p>EventReceiverFirehoseFactory can be used to ingest events using an http endpoint.</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">{</span>
<span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;receiver&quot;</span><span class="p">,</span>
<span class="nt">&quot;serviceName&quot;</span><span class="p">:</span> <span class="s2">&quot;eventReceiverServiceName&quot;</span><span class="p">,</span>
<span class="nt">&quot;bufferSize&quot;</span><span class="p">:</span> <span class="mi">10000</span>
<span class="p">}</span>
</code></pre></div>
<p>When using this firehose, events can be sent by submitting a POST request to the http endpoint:</p>
<p><code>http://&lt;peonHost&gt;:&lt;port&gt;/druid/worker/v1/chat/&lt;eventReceiverServiceName&gt;/push-events/</code></p>
<table><thead>
<tr>
<th>property</th>
<th>description</th>
<th>required?</th>
</tr>
</thead><tbody>
<tr>
<td>type</td>
<td>This should be &quot;receiver&quot;</td>
<td>yes</td>
</tr>
<tr>
<td>serviceName</td>
<td>name used to announce the event receiver service endpoint</td>
<td>yes</td>
</tr>
<tr>
<td>bufferSize</td>
<td>size of buffer used by firehose to store events</td>
<td>no default(100000)</td>
</tr>
</tbody></table>
<p>Shut down time for EventReceiverFirehose can be specified by submitting a POST request to</p>
<p><code>http://&lt;peonHost&gt;:&lt;port&gt;/druid/worker/v1/chat/&lt;eventReceiverServiceName&gt;/shutdown?shutoffTime=&lt;shutoffTime&gt;</code></p>
<p>If shutOffTime is not specified, the firehose shuts off immediately.</p>
<h4 id="timedshutofffirehose">TimedShutoffFirehose</h4>
<p>This can be used to start a firehose that will shut down at a specified time.
An example is shown below:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">{</span>
<span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;timed&quot;</span><span class="p">,</span>
<span class="nt">&quot;shutoffTime&quot;</span><span class="p">:</span> <span class="s2">&quot;2015-08-25T01:26:05.119Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;delegate&quot;</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;receiver&quot;</span><span class="p">,</span>
<span class="nt">&quot;serviceName&quot;</span><span class="p">:</span> <span class="s2">&quot;eventReceiverServiceName&quot;</span><span class="p">,</span>
<span class="nt">&quot;bufferSize&quot;</span><span class="p">:</span> <span class="mi">100000</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<table><thead>
<tr>
<th>property</th>
<th>description</th>
<th>required?</th>
</tr>
</thead><tbody>
<tr>
<td>type</td>
<td>This should be &quot;timed&quot;</td>
<td>yes</td>
</tr>
<tr>
<td>shutoffTime</td>
<td>time at which the firehose should shut down, in ISO8601 format</td>
<td>yes</td>
</tr>
<tr>
<td>delegate</td>
<td>firehose to use</td>
<td>yes</td>
</tr>
</tbody></table>
</div>
<div class="col-md-3">
<div class="searchbox">
<gcse:searchbox-only></gcse:searchbox-only>
</div>
<div id="toc" class="nav toc hidden-print">
</div>
</div>
</div>
</div>
<!-- Start page_footer include -->
<footer class="druid-footer">
<div class="container">
<div class="text-center">
<p>
<a href="/technology">Technology</a>&ensp;·&ensp;
<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
<a href="/docs/latest/">Docs</a>&ensp;·&ensp;
<a href="/community/">Community</a>&ensp;·&ensp;
<a href="/downloads.html">Download</a>&ensp;·&ensp;
<a href="/faq">FAQ</a>
</p>
</div>
<div class="text-center">
<a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
<a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
<a title="GitHub" href="https://github.com/apache/druid" target="_blank"><span class="fab fa-github"></span></a>
</div>
<div class="text-center license">
Copyright © 2020 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
</div>
</div>
</footer>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-131010415-1');
</script>
<script>
function trackDownload(type, url) {
ga('send', 'event', 'download', type, url);
}
</script>
<script src="//code.jquery.com/jquery.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script src="/assets/js/druid.js"></script>
<!-- stop page_footer include -->
<script>
$(function() {
$(".toc").load("/docs/0.14.1-incubating/toc.html");
// There is no way to tell when .gsc-input will be async loaded into the page so just try to set a placeholder until it works
var tries = 0;
var timer = setInterval(function() {
tries++;
if (tries > 300) clearInterval(timer);
var searchInput = $('input.gsc-input');
if (searchInput.length) {
searchInput.attr('placeholder', 'Search');
clearInterval(timer);
}
}, 100);
});
</script>
</body>
</html>