blob: 6abe29da1cd4d00bbeda1b85f2234884b94b910c [file] [log] [blame]
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
<link rel="icon" href="/favicon.ico" type="image/x-icon">
<title>Common Topology Patterns</title>
<!-- Bootstrap core CSS -->
<link href="/assets/css/bootstrap.min.css" rel="stylesheet">
<!-- Bootstrap theme -->
<link href="/assets/css/bootstrap-theme.min.css" rel="stylesheet">
<!-- Custom styles for this template -->
<link rel="stylesheet" href="http://fortawesome.github.io/Font-Awesome/assets/font-awesome/css/font-awesome.css">
<link href="/css/style.css" rel="stylesheet">
<link href="/assets/css/owl.theme.css" rel="stylesheet">
<link href="/assets/css/owl.carousel.css" rel="stylesheet">
<script type="text/javascript" src="/assets/js/jquery.min.js"></script>
<script type="text/javascript" src="/assets/js/bootstrap.min.js"></script>
<script type="text/javascript" src="/assets/js/owl.carousel.min.js"></script>
<script type="text/javascript" src="/assets/js/storm.js"></script>
<!-- Just for debugging purposes. Don't actually copy these 2 lines! -->
<!--[if lt IE 9]><script src="../../assets/js/ie8-responsive-file-warning.js"></script><![endif]-->
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<header>
<div class="container-fluid">
<div class="row">
<div class="col-md-5">
<a href="/index.html"><img src="/images/logo.png" class="logo" /></a>
</div>
<div class="col-md-5">
<h1>Version: 2.0.0</h1>
</div>
<div class="col-md-2">
<a href="/downloads.html" class="btn-std btn-block btn-download">Download</a>
</div>
</div>
</div>
</header>
<!--Header End-->
<!--Navigation Begin-->
<div class="navbar" role="banner">
<div class="container-fluid">
<div class="navbar-header">
<button class="navbar-toggle" type="button" data-toggle="collapse" data-target=".bs-navbar-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<nav class="collapse navbar-collapse bs-navbar-collapse" role="navigation">
<ul class="nav navbar-nav">
<li><a href="/index.html" id="home">Home</a></li>
<li><a href="/getting-help.html" id="getting-help">Getting Help</a></li>
<li><a href="/about/integrates.html" id="project-info">Project Information</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" id="documentation">Documentation <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="/releases/2.3.0/index.html">2.3.0</a></li>
<li><a href="/releases/2.2.0/index.html">2.2.0</a></li>
<li><a href="/releases/2.1.0/index.html">2.1.0</a></li>
<li><a href="/releases/2.0.0/index.html">2.0.0</a></li>
<li><a href="/releases/1.2.3/index.html">1.2.3</a></li>
</ul>
</li>
<li><a href="/talksAndVideos.html">Talks and Slideshows</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" id="contribute">Community <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="/contribute/Contributing-to-Storm.html">Contributing</a></li>
<li><a href="/contribute/People.html">People</a></li>
<li><a href="/contribute/BYLAWS.html">ByLaws</a></li>
</ul>
</li>
<li><a href="/2021/09/27/storm230-released.html" id="news">News</a></li>
</ul>
</nav>
</div>
</div>
<div class="container-fluid">
<h1 class="page-title">Common Topology Patterns</h1>
<div class="row">
<div class="col-md-12">
<!-- Documentation -->
<p class="post-meta"></p>
<div class="documentation-content"><p>This page lists a variety of common patterns in Storm topologies.</p>
<ol>
<li>Batching</li>
<li>BasicBolt</li>
<li>In-memory caching + fields grouping combo</li>
<li>Streaming top N</li>
<li>TimeCacheMap for efficiently keeping a cache of things that have been recently updated</li>
<li>CoordinatedBolt and KeyedFairBolt for Distributed RPC</li>
</ol>
<h3 id="batching">Batching</h3>
<p>Oftentimes for efficiency reasons or otherwise, you want to process a group of tuples in batch rather than individually. For example, you may want to batch updates to a database or do a streaming aggregation of some sort.</p>
<p>If you want reliability in your data processing, the right way to do this is to hold on to tuples in an instance variable while the bolt waits to do the batching. Once you do the batch operation, you then ack all the tuples you were holding onto.</p>
<p>If the bolt emits tuples, then you may want to use multi-anchoring to ensure reliability. It all depends on the specific application. See <a href="Guaranteeing-message-processing.html">Guaranteeing message processing</a> for more details on how reliability works.</p>
<h3 id="basicbolt">BasicBolt</h3>
<p>Many bolts follow a similar pattern of reading an input tuple, emitting zero or more tuples based on that input tuple, and then acking that input tuple immediately at the end of the execute method. Bolts that match this pattern are things like functions and filters. This is such a common pattern that Storm exposes an interface called <a href="javadocs/org/apache/storm/topology/IBasicBolt.html">IBasicBolt</a> that automates this pattern for you. See <a href="Guaranteeing-message-processing.html">Guaranteeing message processing</a> for more information.</p>
<h3 id="in-memory-caching-fields-grouping-combo">In-memory caching + fields grouping combo</h3>
<p>It&#39;s common to keep caches in-memory in Storm bolts. Caching becomes particularly powerful when you combine it with a fields grouping. For example, suppose you have a bolt that expands short URLs (like bit.ly, t.co, etc.) into long URLs. You can increase performance by keeping an LRU cache of short URL to long URL expansions to avoid doing the same HTTP requests over and over. Suppose component &quot;urls&quot; emits short URLS, and component &quot;expand&quot; expands short URLs into long URLs and keeps a cache internally. Consider the difference between the two following snippets of code:</p>
<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">"expand"</span><span class="o">,</span> <span class="k">new</span> <span class="n">ExpandUrl</span><span class="o">(),</span> <span class="n">parallelism</span><span class="o">)</span>
<span class="o">.</span><span class="na">shuffleGrouping</span><span class="o">(</span><span class="mi">1</span><span class="o">);</span>
</code></pre></div><div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">"expand"</span><span class="o">,</span> <span class="k">new</span> <span class="n">ExpandUrl</span><span class="o">(),</span> <span class="n">parallelism</span><span class="o">)</span>
<span class="o">.</span><span class="na">fieldsGrouping</span><span class="o">(</span><span class="s">"urls"</span><span class="o">,</span> <span class="k">new</span> <span class="n">Fields</span><span class="o">(</span><span class="s">"url"</span><span class="o">));</span>
</code></pre></div>
<p>The second approach will have vastly more effective caches, since the same URL will always go to the same task. This avoids having duplication across any of the caches in the tasks and makes it much more likely that a short URL will hit the cache.</p>
<h3 id="streaming-top-n">Streaming top N</h3>
<p>A common continuous computation done on Storm is a &quot;streaming top N&quot; of some sort. Suppose you have a bolt that emits tuples of the form [&quot;value&quot;, &quot;count&quot;] and you want a bolt that emits the top N tuples based on count. The simplest way to do this is to have a bolt that does a global grouping on the stream and maintains a list in memory of the top N items.</p>
<p>This approach obviously doesn&#39;t scale to large streams since the entire stream has to go through one task. A better way to do the computation is to do many top N&#39;s in parallel across partitions of the stream, and then merge those top N&#39;s together to get the global top N. The pattern looks like this:</p>
<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">"rank"</span><span class="o">,</span> <span class="k">new</span> <span class="n">RankObjects</span><span class="o">(),</span> <span class="n">parallelism</span><span class="o">)</span>
<span class="o">.</span><span class="na">fieldsGrouping</span><span class="o">(</span><span class="s">"objects"</span><span class="o">,</span> <span class="k">new</span> <span class="n">Fields</span><span class="o">(</span><span class="s">"value"</span><span class="o">));</span>
<span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">"merge"</span><span class="o">,</span> <span class="k">new</span> <span class="n">MergeObjects</span><span class="o">())</span>
<span class="o">.</span><span class="na">globalGrouping</span><span class="o">(</span><span class="s">"rank"</span><span class="o">);</span>
</code></pre></div>
<p>This pattern works because of the fields grouping done by the first bolt which gives the partitioning you need for this to be semantically correct. You can see an example of this pattern in storm-starter <a href="http://github.com/apache/storm/blob/v2.0.0/examples/storm-starter/src/jvm/org/apache/storm/starter/RollingTopWords.java">here</a>.</p>
<p>If however you have a known skew in the data being processed it can be advantageous to use partialKeyGrouping instead of fieldsGrouping. This will distribute the load for each key between two downstream bolts instead of a single one.</p>
<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">"count"</span><span class="o">,</span> <span class="k">new</span> <span class="n">CountObjects</span><span class="o">(),</span> <span class="n">parallelism</span><span class="o">)</span>
<span class="o">.</span><span class="na">partialKeyGrouping</span><span class="o">(</span><span class="s">"objects"</span><span class="o">,</span> <span class="k">new</span> <span class="n">Fields</span><span class="o">(</span><span class="s">"value"</span><span class="o">));</span>
<span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">"rank"</span> <span class="k">new</span> <span class="n">AggregateCountsAndRank</span><span class="o">(),</span> <span class="n">parallelism</span><span class="o">)</span>
<span class="o">.</span><span class="na">fieldsGrouping</span><span class="o">(</span><span class="s">"count"</span><span class="o">,</span> <span class="k">new</span> <span class="n">Fields</span><span class="o">(</span><span class="s">"key"</span><span class="o">))</span>
<span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">"merge"</span><span class="o">,</span> <span class="k">new</span> <span class="n">MergeRanksObjects</span><span class="o">())</span>
<span class="o">.</span><span class="na">globalGrouping</span><span class="o">(</span><span class="s">"rank"</span><span class="o">);</span>
</code></pre></div>
<p>The topology needs an extra layer of processing to aggregate the partial counts from the upstream bolts but this only processes aggregated values now so the bolt it is not subject to the load caused by the skewed data. You can see an example of this pattern in storm-starter <a href="http://github.com/apache/storm/blob/v2.0.0/examples/storm-starter/src/jvm/org/apache/storm/starter/SkewedRollingTopWords.java">here</a>.</p>
<h3 id="timecachemap-for-efficiently-keeping-a-cache-of-things-that-have-been-recently-updated">TimeCacheMap for efficiently keeping a cache of things that have been recently updated</h3>
<p>You sometimes want to keep a cache in memory of items that have been recently &quot;active&quot; and have items that have been inactive for some time be automatically expires. <a href="javadocs/org/apache/storm/utils/TimeCacheMap.html">TimeCacheMap</a> is an efficient data structure for doing this and provides hooks so you can insert callbacks whenever an item is expired.</p>
<h3 id="coordinatedbolt-and-keyedfairbolt-for-distributed-rpc">CoordinatedBolt and KeyedFairBolt for Distributed RPC</h3>
<p>When building distributed RPC applications on top of Storm, there are two common patterns that are usually needed. These are encapsulated by <a href="javadocs/org/apache/storm/task/CoordinatedBolt.html">CoordinatedBolt</a> and <a href="javadocs/org/apache/storm/task/KeyedFairBolt.html">KeyedFairBolt</a> which are part of the &quot;standard library&quot; that ships with the Storm codebase.</p>
<p><code>CoordinatedBolt</code> wraps the bolt containing your logic and figures out when your bolt has received all the tuples for any given request. It makes heavy use of direct streams to do this.</p>
<p><code>KeyedFairBolt</code> also wraps the bolt containing your logic and makes sure your topology processes multiple DRPC invocations at the same time, instead of doing them serially one at a time.</p>
<p>See <a href="Distributed-RPC.html">Distributed RPC</a> for more details.</p>
</div>
</div>
</div>
</div>
<footer>
<div class="container-fluid">
<div class="row">
<div class="col-md-3">
<div class="footer-widget">
<h5>Meetups</h5>
<ul class="latest-news">
<li><a href="http://www.meetup.com/Apache-Storm-Apache-Kafka/">Apache Storm & Apache Kafka</a> <span class="small">(Sunnyvale, CA)</span></li>
<li><a href="http://www.meetup.com/Apache-Storm-Kafka-Users/">Apache Storm & Kafka Users</a> <span class="small">(Seattle, WA)</span></li>
<li><a href="http://www.meetup.com/New-York-City-Storm-User-Group/">NYC Storm User Group</a> <span class="small">(New York, NY)</span></li>
<li><a href="http://www.meetup.com/Bay-Area-Stream-Processing">Bay Area Stream Processing</a> <span class="small">(Emeryville, CA)</span></li>
<li><a href="http://www.meetup.com/Boston-Storm-Users/">Boston Realtime Data</a> <span class="small">(Boston, MA)</span></li>
<li><a href="http://www.meetup.com/storm-london">London Storm User Group</a> <span class="small">(London, UK)</span></li>
<!-- <li><a href="http://www.meetup.com/Apache-Storm-Kafka-Users/">Seatle, WA</a> <span class="small">(27 Jun 2015)</span></li> -->
</ul>
</div>
</div>
<div class="col-md-3">
<div class="footer-widget">
<h5>About Apache Storm</h5>
<p>Apache Storm integrates with any queueing system and any database system. Apache Storm's spout abstraction makes it easy to integrate a new queuing system. Likewise, integrating Apache Storm with database systems is easy.</p>
</div>
</div>
<div class="col-md-3">
<div class="footer-widget">
<h5>First Look</h5>
<ul class="footer-list">
<li><a href="/releases/current/Rationale.html">Rationale</a></li>
<li><a href="/releases/current/Tutorial.html">Tutorial</a></li>
<li><a href="/releases/current/Setting-up-development-environment.html">Setting up development environment</a></li>
<li><a href="/releases/current/Creating-a-new-Storm-project.html">Creating a new Apache Storm project</a></li>
</ul>
</div>
</div>
<div class="col-md-3">
<div class="footer-widget">
<h5>Documentation</h5>
<ul class="footer-list">
<li><a href="/releases/current/index.html">Index</a></li>
<li><a href="/releases/current/javadocs/index.html">Javadoc</a></li>
<li><a href="/releases/current/FAQ.html">FAQ</a></li>
</ul>
</div>
</div>
</div>
<hr/>
<div class="row">
<div class="col-md-12">
<p align="center">Copyright © 2019 <a href="http://www.apache.org">Apache Software Foundation</a>. All Rights Reserved.
<br>Apache Storm, Apache, the Apache feather logo, and the Apache Storm project logos are trademarks of The Apache Software Foundation.
<br>All other marks mentioned may be trademarks or registered trademarks of their respective owners.</p>
</div>
</div>
</div>
</footer>
<!--Footer End-->
<!-- Scroll to top -->
<span class="totop"><a href="#"><i class="fa fa-angle-up"></i></a></span>
</body>
</html>