<!DOCTYPE html>
<!--
| Generated by Apache Maven Doxia Site Renderer 1.8 from src/site/markdown/metron-platform/metron-parsing/metron-parsing-storm/index.md at 2019-05-14
| Rendered using Apache Maven Fluido Skin 1.7
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="Date-Revision-yyyymmdd" content="20190514" />
<meta http-equiv="Content-Language" content="en" />
<title>Metron &#x2013; Parsers</title>
<link rel="stylesheet" href="../../../css/apache-maven-fluido-1.7.min.css" />
<link rel="stylesheet" href="../../../css/site.css" />
<link rel="stylesheet" href="../../../css/print.css" media="print" />
<script type="text/javascript" src="../../../js/apache-maven-fluido-1.7.min.js"></script>
<script type="text/javascript">
$( document ).ready( function() { $( '.carousel' ).carousel( { interval: 3500 } ) } );
</script>
</head>
<body class="topBarDisabled">
<div class="container-fluid">
<div id="banner">
<div class="pull-left"><a href="http://metron.apache.org/" id="bannerLeft"><img src="../../../images/metron-logo.png" alt="Apache Metron" width="148px" height="48px"/></a></div>
<div class="pull-right"></div>
<div class="clear"><hr/></div>
</div>
<div id="breadcrumbs">
<ul class="breadcrumb">
<li class=""><a href="http://www.apache.org" class="externalLink" title="Apache">Apache</a><span class="divider">/</span></li>
<li class=""><a href="http://metron.apache.org/" class="externalLink" title="Metron">Metron</a><span class="divider">/</span></li>
<li class=""><a href="../../../index.html" title="Documentation">Documentation</a><span class="divider">/</span></li>
<li class="active ">Parsers</li>
<li id="publishDate" class="pull-right"><span class="divider">|</span> Last Published: 2019-05-14</li>
<li id="projectVersion" class="pull-right">Version: 0.7.1</li>
</ul>
</div>
<div class="row-fluid">
<div id="leftColumn" class="span2">
<div class="well sidebar-nav">
<ul class="nav nav-list">
<li class="nav-header">User Documentation</li>
<li><a href="../../../index.html" title="Metron"><span class="icon-chevron-down"></span>Metron</a>
<ul class="nav nav-list">
<li><a href="../../../CONTRIBUTING.html" title="CONTRIBUTING"><span class="none"></span>CONTRIBUTING</a></li>
<li><a href="../../../Upgrading.html" title="Upgrading"><span class="none"></span>Upgrading</a></li>
<li><a href="../../../metron-analytics/index.html" title="Analytics"><span class="icon-chevron-right"></span>Analytics</a></li>
<li><a href="../../../metron-contrib/metron-docker/index.html" title="Docker"><span class="none"></span>Docker</a></li>
<li><a href="../../../metron-contrib/metron-performance/index.html" title="Performance"><span class="none"></span>Performance</a></li>
<li><a href="../../../metron-deployment/index.html" title="Deployment"><span class="icon-chevron-right"></span>Deployment</a></li>
<li><a href="../../../metron-interface/index.html" title="Interface"><span class="icon-chevron-right"></span>Interface</a></li>
<li><a href="../../../metron-platform/index.html" title="Platform"><span class="icon-chevron-down"></span>Platform</a>
<ul class="nav nav-list">
<li><a href="../../../metron-platform/Performance-tuning-guide.html" title="Performance-tuning-guide"><span class="none"></span>Performance-tuning-guide</a></li>
<li><a href="../../../metron-platform/metron-common/index.html" title="Common"><span class="none"></span>Common</a></li>
<li><a href="../../../metron-platform/metron-data-management/index.html" title="Data-management"><span class="none"></span>Data-management</a></li>
<li><a href="../../../metron-platform/metron-elasticsearch/index.html" title="Elasticsearch"><span class="none"></span>Elasticsearch</a></li>
<li><a href="../../../metron-platform/metron-enrichment/index.html" title="Enrichment"><span class="icon-chevron-right"></span>Enrichment</a></li>
<li><a href="../../../metron-platform/metron-hbase-server/index.html" title="Hbase-server"><span class="none"></span>Hbase-server</a></li>
<li><a href="../../../metron-platform/metron-indexing/index.html" title="Indexing"><span class="none"></span>Indexing</a></li>
<li><a href="../../../metron-platform/metron-job/index.html" title="Job"><span class="none"></span>Job</a></li>
<li><a href="../../../metron-platform/metron-management/index.html" title="Management"><span class="none"></span>Management</a></li>
<li><a href="../../../metron-platform/metron-parsing/index.html" title="Parsing"><span class="icon-chevron-down"></span>Parsing</a>
<ul class="nav nav-list">
<li><a href="../../../metron-platform/metron-parsing/metron-parsers/index.html" title="Parsers"><span class="icon-chevron-right"></span>Parsers</a></li>
<li><a href="../../../metron-platform/metron-parsing/metron-parsers-common/index.html" title="Parsers-common"><span class="icon-chevron-right"></span>Parsers-common</a></li>
<li class="active"><a href="#"><span class="none"></span>Parsing-storm</a></li>
</ul>
</li>
<li><a href="../../../metron-platform/metron-pcap-backend/index.html" title="Pcap-backend"><span class="none"></span>Pcap-backend</a></li>
<li><a href="../../../metron-platform/metron-solr/index.html" title="Solr"><span class="none"></span>Solr</a></li>
<li><a href="../../../metron-platform/metron-writer/index.html" title="Writer"><span class="none"></span>Writer</a></li>
</ul>
</li>
<li><a href="../../../metron-sensors/index.html" title="Sensors"><span class="icon-chevron-right"></span>Sensors</a></li>
<li><a href="../../../metron-stellar/stellar-3rd-party-example/index.html" title="Stellar-3rd-party-example"><span class="none"></span>Stellar-3rd-party-example</a></li>
<li><a href="../../../metron-stellar/stellar-common/index.html" title="Stellar-common"><span class="icon-chevron-right"></span>Stellar-common</a></li>
<li><a href="../../../metron-stellar/stellar-zeppelin/index.html" title="Stellar-zeppelin"><span class="none"></span>Stellar-zeppelin</a></li>
<li><a href="../../../use-cases/index.html" title="Use-cases"><span class="icon-chevron-right"></span>Use-cases</a></li>
</ul>
</li>
</ul>
<hr />
<div id="poweredBy">
<div class="clear"></div>
<div class="clear"></div>
<div class="clear"></div>
<div class="clear"></div>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy"><img class="builtBy" alt="Built by Maven" src="../../../images/logos/maven-feather.png" /></a>
</div>
</div>
</div>
<div id="bodyColumn" class="span10" >
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<h1>Parsers</h1>
<p><a name="Parsers"></a></p>
<div class="section">
<h2><a name="Introduction"></a>Introduction</h2>
<p>Metron&#x2019;s parsers can be run in Storm topologies, each with its own set of configuration options (e.g. parallelism). A script is provided to deploy a parser as a Storm topology.</p></div>
<div class="section">
<h2><a name="Parser_Configuration"></a>Parser Configuration</h2>
<ul>
<li><tt>spoutParallelism</tt> : The Kafka spout parallelism (defaults to <tt>1</tt>). This can be overridden on the command line. If there are multiple sensors, provide a comma-separated list in the same order as the sensors.</li>
<li><tt>spoutNumTasks</tt> : The number of tasks for the spout (defaults to <tt>1</tt>). This can be overridden on the command line. If there are multiple sensors, provide a comma-separated list in the same order as the sensors.</li>
<li><tt>parserParallelism</tt> : The parser bolt parallelism (defaults to <tt>1</tt>). If there are multiple sensors, the last one&#x2019;s configuration will be used. This can be overridden on the command line.</li>
<li><tt>parserNumTasks</tt> : The number of tasks for the parser bolt (defaults to <tt>1</tt>). If there are multiple sensors, the last one&#x2019;s configuration will be used. This can be overridden on the command line.</li>
<li><tt>errorWriterParallelism</tt> : The error writer bolt parallelism (defaults to <tt>1</tt>). This can be overridden on the command line.</li>
<li><tt>errorWriterNumTasks</tt> : The number of tasks for the error writer bolt (defaults to <tt>1</tt>). This can be overridden on the command line.</li>
<li><tt>numWorkers</tt> : The number of workers to use in the topology (defaults to the Storm default of <tt>1</tt>).</li>
<li><tt>numAckers</tt> : The number of acker executors to use in the topology (defaults to the Storm default of <tt>1</tt>).</li>
<li><tt>spoutConfig</tt> : A map representing a custom spout config. If there are multiple sensors, the configs will be merged, with the last one specified taking precedence. This can be overridden on the command line.</li>
<li><tt>stormConfig</tt> : A map representing the Storm config to use. This can also be specified on the command line; if both are specified, they are merged, with command-line properties taking precedence.</li>
</ul>
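<p>As a rough sketch, the topology-related portion of a sensor&#x2019;s parser configuration might look like the following. The values are illustrative only and the parser&#x2019;s other settings are omitted. The use of a <tt>spout.*</tt> key inside <tt>spoutConfig</tt> is an assumption based on the keys accepted by <tt>--extra_kafka_spout_config</tt>.</p>
<div>
<div>
<pre class="source">{
  &quot;spoutParallelism&quot; : 2,
  &quot;spoutNumTasks&quot; : 2,
  &quot;parserParallelism&quot; : 4,
  &quot;parserNumTasks&quot; : 4,
  &quot;errorWriterParallelism&quot; : 1,
  &quot;errorWriterNumTasks&quot; : 1,
  &quot;numWorkers&quot; : 2,
  &quot;numAckers&quot; : 2,
  &quot;spoutConfig&quot; : {
    &quot;spout.firstPollOffsetStrategy&quot; : &quot;UNCOMMITTED_EARLIEST&quot;
  },
  &quot;stormConfig&quot; : {
    &quot;storm.local.dir&quot; : &quot;/opt/my/path&quot;
  }
}
</pre></div></div>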
<p><a name="Starting_the_Parser_Topology"></a></p>
<h1>Starting the Parser Topology</h1>
<p>To start a particular parser topology on a running Metron deployment, run the <tt>start_parser_topology.sh</tt> script located in <tt>$METRON_HOME/bin</tt>. This utility configures and starts the topology, provided that the sensor-specific parser configuration already exists in ZooKeeper.</p>
<p>The usage for <tt>start_parser_topology.sh</tt> is as follows:</p>
<div>
<div>
<pre class="source">usage: start_parser_topology.sh
-e,--extra_topology_options &lt;JSON_FILE&gt; Extra options in the form of a JSON file with a map for content.
-esc,--extra_kafka_spout_config &lt;JSON_FILE&gt; Extra spout config options in the form of a JSON file with a map for content. Possible keys are:
    retryDelayMaxMs,retryDelayMultiplier,retryInitialDelayMs,
    stateUpdateIntervalMs,bufferSizeBytes,fetchMaxWait,fetchSizeBytes,
    maxOffsetBehind,metricsTimeBucketSizeInSecs,socketTimeoutMs
-ewnt,--error_writer_num_tasks &lt;NUM_TASKS&gt; Error Writer Num Tasks
-ewp,--error_writer_p &lt;PARALLELISM_HINT&gt; Error Writer Parallelism Hint
-h,--help This screen
-iwnt,--invalid_writer_num_tasks &lt;NUM_TASKS&gt; Invalid Writer Num Tasks
-iwp,--invalid_writer_p &lt;PARALLELISM_HINT&gt; Invalid Message Writer Parallelism Hint
-k,--kafka &lt;BROKER_URL&gt; Kafka Broker URL
-ksp,--kafka_security_protocol &lt;SECURITY_PROTOCOL&gt; Kafka Security Protocol
-mt,--message_timeout &lt;TIMEOUT_IN_SECS&gt; Message Timeout in Seconds
-mtp,--max_task_parallelism &lt;MAX_TASK&gt; Max task parallelism
-na,--num_ackers &lt;NUM_ACKERS&gt; Number of Ackers
-nw,--num_workers &lt;NUM_WORKERS&gt; Number of Workers
-ot,--output_topic &lt;KAFKA_TOPIC&gt; Output Kafka Topic
-pnt,--parser_num_tasks &lt;NUM_TASKS&gt; Parser Num Tasks
-pp,--parser_p &lt;PARALLELISM_HINT&gt; Parser Parallelism Hint
-s,--sensor &lt;SENSOR_TYPE&gt; Sensor Type
-snt,--spout_num_tasks &lt;NUM_TASKS&gt; Spout Num Tasks
-sp,--spout_p &lt;SPOUT_PARALLELISM_HINT&gt; Spout Parallelism Hint
-t,--test &lt;TEST&gt; Run in Test Mode
-z,--zk &lt;ZK_QUORUM&gt; Zookeeper Quorum URL (zk1:2181,zk2:2181,...)
</pre></div></div>
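<p>For example, a minimal invocation for a single sensor might look like the following sketch. The broker address <tt>node1:6667</tt>, the ZooKeeper address <tt>node1:2181</tt>, and the sensor name <tt>bro</tt> are placeholders, not values taken from this page; substitute those of your own deployment.</p>
<div>
<div>
<pre class="source">$METRON_HOME/bin/start_parser_topology.sh \
  -k node1:6667 \
  -z node1:2181 \
  -s bro
</pre></div></div>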
</div>
<div class="section">
<h2><a name="The_--extra_kafka_spout_config_Option"></a>The <tt>--extra_kafka_spout_config</tt> Option</h2>
<p>These options are intended to configure the Storm Kafka spout more completely. They are specified in a JSON file containing a map that associates each Kafka spout configuration parameter with a value. The configurable options are:</p>
<ul>
<li><tt>spout.pollTimeoutMs</tt> - Specifies the time, in milliseconds, spent waiting in poll if data is not available. Default is 2s.</li>
<li><tt>spout.firstPollOffsetStrategy</tt> - Sets the offset used by the Kafka spout in the first poll to the Kafka broker upon process start. One of:
<ul>
<li><tt>EARLIEST</tt></li>
<li><tt>LATEST</tt></li>
<li><tt>UNCOMMITTED_EARLIEST</tt> - Last uncommitted and if offsets aren&#x2019;t found, defaults to earliest. NOTE: This is the default.</li>
<li><tt>UNCOMMITTED_LATEST</tt> - Last uncommitted and if offsets aren&#x2019;t found, defaults to latest.</li>
</ul>
</li>
<li><tt>spout.offsetCommitPeriodMs</tt> - Specifies the period, in milliseconds, at which the offset commit task is called. Default is 15s.</li>
<li><tt>spout.maxUncommittedOffsets</tt> - Defines the max number of polled offsets (records) that can be pending commit before another poll can take place. Once this limit is reached, no more offsets (records) can be polled until the next successful commit(s) brings the number of pending offsets below the threshold. The default is 10,000,000.</li>
<li><tt>spout.maxRetries</tt> - Defines the max number of retries in case of tuple failure. The default is to retry forever, which means that no new records are committed until the previously polled records have been acked. This guarantees at-least-once delivery of all the previously polled records. By specifying a finite value for <tt>maxRetries</tt>, the user sacrifices the delivery guarantee for the previously polled records in favor of processing more records.</li>
<li>Any of the configs in the Consumer API for <a class="externalLink" href="http://kafka.apache.org/0100/documentation.html#newconsumerconfigs">Kafka 0.10.x</a></li>
</ul>
<p>For instance, the following JSON file sets the first-poll offset strategy to <tt>UNCOMMITTED_EARLIEST</tt>:</p>
<div>
<div>
<pre class="source">{
&quot;spout.firstPollOffsetStrategy&quot; : &quot;UNCOMMITTED_EARLIEST&quot;
}
</pre></div></div>
<p>This file is loaded by passing its path as the argument to <tt>--extra_kafka_spout_config</tt>.</p></div>
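<p>A sketch of such an invocation, assuming the JSON above was saved as <tt>spout_config.json</tt> (a hypothetical file name) and using placeholder broker, ZooKeeper, and sensor values:</p>
<div>
<div>
<pre class="source">$METRON_HOME/bin/start_parser_topology.sh \
  -k node1:6667 \
  -z node1:2181 \
  -s bro \
  --extra_kafka_spout_config spout_config.json
</pre></div></div>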
<div class="section">
<h2><a name="The_--extra_topology_options_Option"></a>The <tt>--extra_topology_options</tt> Option</h2>
<p>These options are Storm configuration options. They live in a JSON file whose contents are loaded into the Storm config. For instance, to set the Storm property <tt>topology.tick.tuple.freq.secs</tt> to 1000 and <tt>storm.local.dir</tt> to <tt>/opt/my/path</tt>, you could create a file called <tt>custom_config.json</tt> containing</p>
<div>
<div>
<pre class="source">{
&quot;topology.tick.tuple.freq.secs&quot; : 1000,
&quot;storm.local.dir&quot; : &quot;/opt/my/path&quot;
}
</pre></div></div>
<p>and pass <tt>--extra_topology_options custom_config.json</tt> to <tt>start_parser_topology.sh</tt>.</p></div>
<div class="section">
<h2><a name="Parser_Topology"></a>Parser Topology</h2>
<p>The parser topology as started by the <tt>$METRON_HOME/bin/start_parser_topology.sh</tt> script uses a default of one executor per bolt. In a real production system, this should be customized by modifying the arguments sent to this utility.</p>
<ul>
<li>Topology Wide
<ul>
<li><tt>--num_workers</tt> : The number of workers for the topology</li>
<li><tt>--num_ackers</tt> : The number of ackers for the topology</li>
</ul>
</li>
<li>The Kafka Spout
<ul>
<li><tt>--spout_num_tasks</tt> : The number of tasks for the spout</li>
<li><tt>--spout_p</tt> : The parallelism hint for the spout</li>
<li>Ensure that the spout has enough parallelism so that it can dedicate a worker per partition in your Kafka topic.</li>
</ul>
</li>
<li>The Parser Bolt
<ul>
<li><tt>--parser_num_tasks</tt> : The number of tasks for the parser bolt</li>
<li><tt>--parser_p</tt> : The parallelism hint for the parser bolt</li>
<li>This is the bolt that does the most processing, so ensure that it is configured with sufficient parallelism to match your throughput expectations.</li>
</ul>
</li>
<li>The Error Message Writer Bolt
<ul>
<li><tt>--error_writer_num_tasks</tt> : The number of tasks for the error writer bolt</li>
<li><tt>--error_writer_p</tt> : The parallelism hint for the error writer bolt</li>
</ul>
</li>
</ul>
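<p>Putting these options together, a tuned invocation might look like the following sketch. All numbers are illustrative rather than recommendations, and the broker, ZooKeeper, and sensor values are placeholders.</p>
<div>
<div>
<pre class="source">$METRON_HOME/bin/start_parser_topology.sh \
  -k node1:6667 \
  -z node1:2181 \
  -s bro \
  --num_workers 2 \
  --num_ackers 2 \
  --spout_p 4 \
  --spout_num_tasks 4 \
  --parser_p 8 \
  --parser_num_tasks 8 \
  --error_writer_p 1 \
  --error_writer_num_tasks 1
</pre></div></div>
<p>Here the spout parallelism of 4 assumes a Kafka topic with four partitions.</p>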
<p>Finally, if workers and executors are new to you, the following might be of use to you:</p>
<ul>
<li><a class="externalLink" href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">Understanding the Parallelism of a Storm Topology</a></li>
</ul></div>
<div class="section">
<h2><a name="Parser_Aggregation"></a>Parser Aggregation</h2>
<p>For performance reasons, multiple sensors can be aggregated into a single Storm topology. When this is done, there will be multiple Kafka spouts, but only a single parser bolt, which delegates to the correct parser as needed. This places some constraints on configuration: as described under Parser Configuration, spout settings are given per sensor as a comma-separated list, while the parser bolt settings are taken from the last sensor&#x2019;s configuration. Additionally, all sensors must flow to the same error topic. The Kafka topic is retrieved from the input tuple itself.</p>
<p>A worked example of this can be found in the <a href="../../../use-cases/parser_chaining/index.html#aggregated-parsers-with-parser-chaining">Parser Chaining use case</a>.</p></div>
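<p>As a sketch, an aggregated topology for two sensors might be started as follows. This assumes that <tt>--sensor</tt> accepts a comma-separated list of sensor names, which the usage text on this page does not spell out. The sensor names <tt>bro</tt> and <tt>snort</tt> and the broker and ZooKeeper addresses are placeholders. The spout parallelism is given once per sensor, in the same order as the sensors.</p>
<div>
<div>
<pre class="source">$METRON_HOME/bin/start_parser_topology.sh \
  -k node1:6667 \
  -z node1:2181 \
  -s bro,snort \
  --spout_p 2,1 \
  --spout_num_tasks 2,1 \
  --parser_p 4
</pre></div></div>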
</div>
</div>
</div>
<hr/>
<footer>
<div class="container-fluid">
<div class="row-fluid">
© 2015-2016 The Apache Software Foundation. Apache Metron, Metron, Apache, the Apache feather logo,
and the Apache Metron project logo are trademarks of The Apache Software Foundation.
</div>
</div>
</footer>
</body>
</html>