<!DOCTYPE html>
<!--
| Generated by Apache Maven Doxia Site Renderer 1.8 from src/site/markdown/use-cases/parser_chaining/index.md at 2019-05-14
| Rendered using Apache Maven Fluido Skin 1.7
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="Date-Revision-yyyymmdd" content="20190514" />
<meta http-equiv="Content-Language" content="en" />
<title>Metron &#x2013; Problem Statement</title>
<link rel="stylesheet" href="../../css/apache-maven-fluido-1.7.min.css" />
<link rel="stylesheet" href="../../css/site.css" />
<link rel="stylesheet" href="../../css/print.css" media="print" />
<script type="text/javascript" src="../../js/apache-maven-fluido-1.7.min.js"></script>
<script type="text/javascript">
$( document ).ready( function() { $( '.carousel' ).carousel( { interval: 3500 } ) } );
</script>
</head>
<body class="topBarDisabled">
<div class="container-fluid">
<div id="banner">
<div class="pull-left"><a href="http://metron.apache.org/" id="bannerLeft"><img src="../../images/metron-logo.png" alt="Apache Metron" width="148px" height="48px"/></a></div>
<div class="pull-right"></div>
<div class="clear"><hr/></div>
</div>
<div id="breadcrumbs">
<ul class="breadcrumb">
<li class=""><a href="http://www.apache.org" class="externalLink" title="Apache">Apache</a><span class="divider">/</span></li>
<li class=""><a href="http://metron.apache.org/" class="externalLink" title="Metron">Metron</a><span class="divider">/</span></li>
<li class=""><a href="../../index.html" title="Documentation">Documentation</a><span class="divider">/</span></li>
<li class="active ">Problem Statement</li>
<li id="publishDate" class="pull-right"><span class="divider">|</span> Last Published: 2019-05-14</li>
<li id="projectVersion" class="pull-right">Version: 0.7.1</li>
</ul>
</div>
<div class="row-fluid">
<div id="leftColumn" class="span2">
<div class="well sidebar-nav">
<ul class="nav nav-list">
<li class="nav-header">User Documentation</li>
<li><a href="../../index.html" title="Metron"><span class="icon-chevron-down"></span>Metron</a>
<ul class="nav nav-list">
<li><a href="../../CONTRIBUTING.html" title="CONTRIBUTING"><span class="none"></span>CONTRIBUTING</a></li>
<li><a href="../../Upgrading.html" title="Upgrading"><span class="none"></span>Upgrading</a></li>
<li><a href="../../metron-analytics/index.html" title="Analytics"><span class="icon-chevron-right"></span>Analytics</a></li>
<li><a href="../../metron-contrib/metron-docker/index.html" title="Docker"><span class="none"></span>Docker</a></li>
<li><a href="../../metron-contrib/metron-performance/index.html" title="Performance"><span class="none"></span>Performance</a></li>
<li><a href="../../metron-deployment/index.html" title="Deployment"><span class="icon-chevron-right"></span>Deployment</a></li>
<li><a href="../../metron-interface/index.html" title="Interface"><span class="icon-chevron-right"></span>Interface</a></li>
<li><a href="../../metron-platform/index.html" title="Platform"><span class="icon-chevron-right"></span>Platform</a></li>
<li><a href="../../metron-sensors/index.html" title="Sensors"><span class="icon-chevron-right"></span>Sensors</a></li>
<li><a href="../../metron-stellar/stellar-3rd-party-example/index.html" title="Stellar-3rd-party-example"><span class="none"></span>Stellar-3rd-party-example</a></li>
<li><a href="../../metron-stellar/stellar-common/index.html" title="Stellar-common"><span class="icon-chevron-right"></span>Stellar-common</a></li>
<li><a href="../../metron-stellar/stellar-zeppelin/index.html" title="Stellar-zeppelin"><span class="none"></span>Stellar-zeppelin</a></li>
<li><a href="../../use-cases/index.html" title="Use-cases"><span class="icon-chevron-down"></span>Use-cases</a>
<ul class="nav nav-list">
<li><a href="../../use-cases/forensic_clustering/index.html" title="Forensic_clustering"><span class="none"></span>Forensic_clustering</a></li>
<li><a href="../../use-cases/geographic_login_outliers/index.html" title="Geographic_login_outliers"><span class="none"></span>Geographic_login_outliers</a></li>
<li class="active"><a href="#"><span class="none"></span>Parser_chaining</a></li>
<li><a href="../../use-cases/typosquat_detection/index.html" title="Typosquat_detection"><span class="none"></span>Typosquat_detection</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<hr />
<div id="poweredBy">
<div class="clear"></div>
<div class="clear"></div>
<div class="clear"></div>
<div class="clear"></div>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy"><img class="builtBy" alt="Built by Maven" src="../../images/logos/maven-feather.png" /></a>
</div>
</div>
</div>
<div id="bodyColumn" class="span10" >
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<h1>Problem Statement</h1>
<p><a name="Problem_Statement"></a></p>
<p>Aggregating many different types of sensors into a single data source (e.g. syslog) and ingesting that aggregate sensor into Metron is a common pattern. It is not obvious how to manage these aggregate sensors, because they require two-pass parsing. This document walks through an example of supporting this kind of multi-pass ingest.</p>
<p>Multi-pass parsing involves the following requirements:</p>
<ul>
<li>The enveloping parser (e.g. the aggregation format such as syslog or plain CSV) may contain metadata which should be ingested along with the data.</li>
<li>The enveloping sensor contains many different sensor types.</li>
</ul>
<p><a name="High_Level_Solution"></a></p>
<h1>High Level Solution</h1>
<p><img src="../../images/message_routing_high_level.svg" alt="High Level Approach" /></p>
<p>At a high level, we continue to maintain the architectural invariant of a 1-1 relationship between logical sensors and Storm topologies. Eventually this relationship may become more complex, but at the moment the approach is to construct a routing parser which will have two responsibilities:</p>
<ul>
<li>Parse the envelope (e.g. syslog data) and extract any metadata fields from the envelope to pass along</li>
<li>Route the unfolded data to the appropriate Kafka topic associated with the enveloped sensor data</li>
</ul>
<p>The data emitted from the routing parser is a JSON blob, just like the data emitted from any other parser. We therefore need to adjust the downstream parsers to extract the enveloped data from that JSON blob and treat it as the data to parse.</p>
<p><a name="Example"></a></p>
<h1>Example</h1>
<div class="section">
<h2><a name="Preliminaries"></a>Preliminaries</h2>
<p>We assume that the following environment variables are set:</p>
<ul>
<li><tt>METRON_HOME</tt> - The home directory for Metron</li>
<li><tt>ZOOKEEPER</tt> - The ZooKeeper quorum (comma separated with port specified: e.g. <tt>node1:2181</tt> for full-dev)</li>
<li><tt>BROKERLIST</tt> - The Kafka broker list (comma separated with port specified: e.g. <tt>node1:6667</tt> for full-dev)</li>
<li><tt>ES_HOST</tt> - The Elasticsearch master (and port) e.g. <tt>node1:9200</tt> for full-dev.</li>
</ul>
<p>Before editing configurations, be sure to pull the configs from ZooKeeper locally via:</p>
<div>
<div>
<pre class="source">$METRON_HOME/bin/zk_load_configs.sh --mode PULL -z $ZOOKEEPER -o $METRON_HOME/config/zookeeper/ -f
</pre></div></div>
</div>
<div class="section">
<h2><a name="The_Scenario"></a>The Scenario</h2>
<p>Consider the following situation: we have some logs from a Cisco PIX device that we would like to ingest. The format is syslog, but several kinds of messages exist in the same log file. Specifically, let&#x2019;s consider the sample logs <a class="externalLink" href="http://www.monitorware.com/en/logsamples/cisco-pix-61(2).php">here</a>.</p>
<p>The log lines in general have the following components:</p>
<ul>
<li>A timestamp</li>
<li>A message type tag</li>
<li>The message payload that is dependent upon the tag</li>
</ul>
<p>Let&#x2019;s consider two types of messages that we&#x2019;d like to parse:</p>
<ul>
<li>Tag <tt>6-302*</tt> which are connection creation and teardown messages e.g. <tt>Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53</tt></li>
<li>Tag <tt>5-304*</tt> which are URL access events e.g. <tt>192.168.0.2 Accessed URL 66.102.9.99:/</tt></li>
</ul>
<p>A couple of things are apparent from this:</p>
<ul>
<li>The formats we care about are easy to represent in grok, but they are very different from each other and logically represent very different sensors.</li>
<li>The syslog loglines output by this device contain many types of events that we do not care about (yet).</li>
</ul>
<p>We will proceed to create 3 separate parsers:</p>
<ul>
<li>A <tt>pix_syslog_router</tt> parser which will:
<ul>
<li>Parse the timestamp field</li>
<li>Parse the payload into a field called <tt>data</tt></li>
<li>Parse the tag into a field called <tt>pix_type</tt></li>
<li>Route the enveloped messages to the appropriate Kafka topic based on the tag</li>
</ul>
</li>
<li>A <tt>cisco-6-302</tt> parser and a <tt>cisco-5-304</tt> parser, each of which will append the sensor-specific fields for its tag type to the existing fields from the <tt>pix_syslog_router</tt>.</li>
</ul></div>
<div class="section">
<h2><a name="Cisco_PIX_Grok_Patterns"></a>Cisco PIX Grok Patterns</h2>
<p>To support these parsers, we&#x2019;re going to collect the grok expressions they need in a single patterns file.</p>
<ul>
<li>Open a file <tt>~/cisco_patterns</tt> and place the following in it:</li>
</ul>
<div>
<div>
<pre class="source">CISCO_ACTION Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted
CISCO_REASON Duplicate TCP SYN|Failed to locate egress interface|Invalid transport field|No matching connection|DNS Response|DNS Query|(?:%{WORD}\s*)*
CISCO_DIRECTION Inbound|inbound|Outbound|outbound
CISCOFW302020_302021 %{CISCO_ACTION:action}(?:%{CISCO_DIRECTION:direction})? %{WORD:protocol} connection %{GREEDYDATA:ignore} faddr %{IP:ip_dst_addr}/%{INT:icmp_seq_num}(?:\(%{DATA:fwuser}\))? gaddr %{IP:ip_src_xlated}/%{INT:icmp_code_xlated} laddr %{IP:ip_src_addr}/%{INT:icmp_code}( \(%{DATA:user}\))?
ACCESSED %{URIHOST:ip_src_addr} Accessed URL %{IP:ip_dst_addr}:%{URIPATHPARAM:uri_path}
CISCO_PIX %{GREEDYDATA:timestamp}: %PIX-%{NOTSPACE:pix_type}: %{GREEDYDATA:data}
</pre></div></div>
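<p>As a rough way to see what the <tt>CISCO_PIX</tt> pattern is meant to extract, the sketch below approximates it with a plain Python regular expression. This is only an illustration, not how Metron evaluates grok: the function name <tt>parse_envelope</tt> is hypothetical, positional groups stand in for grok&#x2019;s named captures, and the timestamp and tag in the sample line are illustrative values wrapped around the payload quoted earlier.</p>
<div>
<div>
<pre class="source">import re

# Approximates CISCO_PIX: GREEDYDATA becomes (.*), NOTSPACE becomes (\S+).
# This is an assumption-laden stand-in for the grok pattern, not grok itself.
CISCO_PIX = re.compile(r'^(.*): %PIX-(\S+): (.*)$')

def parse_envelope(line):
    # Return the three envelope fields, or None if the line does not match.
    match = CISCO_PIX.match(line)
    if match is None:
        return None
    timestamp, pix_type, data = match.groups()
    return {'timestamp': timestamp, 'pix_type': pix_type, 'data': data}

# Illustrative log line: timestamp and tag are sample values.
sample = ('Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for '
          'faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53')
fields = parse_envelope(sample)
print(fields['pix_type'])
print(fields['data'])
</pre></div></div>
<p>Lines that do not carry a <tt>%PIX-</tt> tag yield <tt>None</tt> in this sketch, which is a convenient way to spot loglines the pattern would not cover before loading it into HDFS.</p>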
<ul>
<li>Place this pattern in HDFS at <tt>/tmp/cisco_patterns</tt> via <tt>hadoop fs -put ~/cisco_patterns /tmp</tt>
<ul>
<li>NOTE: In production, we&#x2019;d use more battle-hardened patterns and place them in a more sensible location.</li>
</ul>
</li>
</ul></div>
<div class="section">
<h2><a name="The_pix_syslog_router_Parser"></a>The <tt>pix_syslog_router</tt> Parser</h2>
<ul>
<li>Create the <tt>pix_syslog_router</tt> Kafka topic via:</li>
</ul>
<div>
<div>
<pre class="source">/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $ZOOKEEPER --create --topic pix_syslog_router --partitions 1 --replication-factor 1
</pre></div></div>
<ul>
<li>Create the <tt>pix_syslog_router</tt> parser by opening <tt>$METRON_HOME/config/zookeeper/parsers/pix_syslog_router.json</tt> and placing the following:</li>
</ul>
<div>
<div>
<pre class="source">{
&quot;parserClassName&quot; : &quot;org.apache.metron.parsers.GrokParser&quot;
,&quot;sensorTopic&quot; : &quot;pix_syslog_router&quot;
, &quot;parserConfig&quot;: {
&quot;grokPath&quot;: &quot;/tmp/cisco_patterns&quot;,
&quot;batchSize&quot; : 1,
&quot;patternLabel&quot;: &quot;CISCO_PIX&quot;,
&quot;timestampField&quot;: &quot;timestamp&quot;,
&quot;timeFields&quot; : [ &quot;timestamp&quot; ],
&quot;dateFormat&quot; : &quot;MMM dd yyyy HH:mm:ss&quot;,
&quot;kafka.topicField&quot; : &quot;logical_source_type&quot;
}
,&quot;fieldTransformations&quot; : [
{
&quot;transformation&quot; : &quot;REGEX_SELECT&quot;
,&quot;input&quot; : &quot;pix_type&quot;
,&quot;output&quot; : &quot;logical_source_type&quot;
,&quot;config&quot; : {
&quot;cisco-6-302&quot; : &quot;^6-302.*&quot;,
&quot;cisco-5-304&quot; : &quot;^5-304.*&quot;
}
}
]
}
</pre></div></div>
<p>A couple of things to note about this config:</p>
<ul>
<li>In the <tt>parserConfig</tt> section, note that we set <tt>kafka.topicField</tt> to <tt>logical_source_type</tt>. This specifies that the parser will send each message to the topic named in its <tt>logical_source_type</tt> field. If the field does not exist, then the message is not sent.</li>
<li>The <tt>REGEX_SELECT</tt> field transformation sets the <tt>logical_source_type</tt> field based on the value in the <tt>pix_type</tt> field, which, recall, is our tag. This enables us to route the broad category of Cisco firewall messages along to the specific parser.</li>
</ul></div>
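<p>The routing decision described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Metron&#x2019;s implementation: the names <tt>ROUTES</tt>, <tt>select_topic</tt> and <tt>route</tt> are hypothetical, and the choice of returning the first matching entry is an assumption about how overlapping patterns would be resolved.</p>
<div>
<div>
<pre class="source">import re

# Mirrors the 'config' map of the REGEX_SELECT transformation:
# output value -> regex tested against the pix_type field.
ROUTES = {
    'cisco-6-302': r'^6-302.*',
    'cisco-5-304': r'^5-304.*',
}

def select_topic(message):
    # Return the first key whose pattern matches pix_type, or None.
    # First-match-wins is an assumption made for this sketch.
    pix_type = message.get('pix_type', '')
    for topic, pattern in ROUTES.items():
        if re.search(pattern, pix_type):
            return topic
    return None

def route(message):
    # Return (topic, message with logical_source_type set), or None
    # when no pattern matches, in which case nothing would be sent.
    topic = select_topic(message)
    if topic is None:
        return None
    routed = dict(message)
    routed['logical_source_type'] = topic
    return topic, routed

print(route({'pix_type': '6-302005', 'data': 'Built UDP connection ...'}))
print(route({'pix_type': '4-106023', 'data': 'a tag we do not route'}))
</pre></div></div>
<p>The second call shows the case from the note above: a tag that matches neither pattern leaves <tt>logical_source_type</tt> unset, so the message is dropped by the router rather than forwarded.</p>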
<div class="section">
<h2><a name="The_cisco-6-302_Parser"></a>The <tt>cisco-6-302</tt> Parser</h2>
<ul>
<li>Create the <tt>cisco-6-302</tt> Kafka topic via:</li>
</ul>
<div>
<div>
<pre class="source">/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $ZOOKEEPER --create --topic cisco-6-302 --partitions 1 --replication-factor 1
</pre></div></div>
<ul>
<li>Create the <tt>cisco-6-302</tt> parser by opening <tt>$METRON_HOME/config/zookeeper/parsers/cisco-6-302.json</tt> and placing the following:</li>
</ul>
<div>
<div>
<pre class="source">{
&quot;parserClassName&quot; : &quot;org.apache.metron.parsers.GrokParser&quot;
,&quot;sensorTopic&quot; : &quot;cisco-6-302&quot;
,&quot;rawMessageStrategy&quot; : &quot;ENVELOPE&quot;
,&quot;rawMessageStrategyConfig&quot; : {
&quot;messageField&quot; : &quot;data&quot;,
&quot;metadataPrefix&quot; : &quot;&quot;
}
, &quot;parserConfig&quot;: {
&quot;grokPath&quot;: &quot;/tmp/cisco_patterns&quot;,
&quot;batchSize&quot; : 1,
&quot;patternLabel&quot;: &quot;CISCOFW302020_302021&quot;
}
}
</pre></div></div>
<p>Note a couple of things:</p>
<ul>
<li>We are specifying the <tt>rawMessageStrategy</tt> to be <tt>ENVELOPE</tt> to indicate that it is not a straight data feed, but rather it&#x2019;s enveloped in a JSON map (i.e. the output of the <tt>pix_syslog_router</tt>).</li>
<li>Because this is enveloped, we must specify the field which contains the actual raw data by setting <tt>messageField</tt> in <tt>rawMessageStrategyConfig</tt>.</li>
<li>You may be wondering why we specify <tt>metadataPrefix</tt> to be the empty string. We want some of the fields in the enveloped message to be merged in without a prefix. Most specifically, we want the <tt>timestamp</tt> field. By default, the prefix is <tt>metron.metadata</tt>.</li>
</ul></div>
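<p>To make the effect of <tt>messageField</tt> and <tt>metadataPrefix</tt> concrete, here is a small Python sketch of unwrapping an enveloped message. It is an illustration only, not Metron&#x2019;s code: the function <tt>unwrap_envelope</tt> is hypothetical, joining the prefix and the field name with a dot is an assumption, and the timestamp value in the example is a placeholder.</p>
<div>
<div>
<pre class="source">import json

def unwrap_envelope(raw_json, message_field='data', metadata_prefix=''):
    # Split an enveloped JSON message into the raw payload to parse and
    # the remaining envelope fields, which are carried along as metadata.
    envelope = json.loads(raw_json)
    raw_message = envelope[message_field]
    metadata = {}
    for key, value in envelope.items():
        if key == message_field:
            continue
        # Assumed naming rule: prefix and key joined by a dot, or the
        # bare key when the prefix is the empty string.
        name = metadata_prefix + '.' + key if metadata_prefix else key
        metadata[name] = value
    return raw_message, metadata

# Placeholder envelope in the shape the router is described as emitting.
envelope = json.dumps({
    'timestamp': 1080000000000,
    'pix_type': '5-304001',
    'data': '192.168.0.2 Accessed URL 66.102.9.99:/',
})

raw, meta = unwrap_envelope(envelope)
print(raw)
print(sorted(meta))

_, prefixed = unwrap_envelope(envelope, metadata_prefix='metron.metadata')
print(sorted(prefixed))
</pre></div></div>
<p>With the empty prefix, <tt>timestamp</tt> keeps its plain name and can serve directly as the timestamp of the downstream message; with the default prefix it would arrive under a prefixed name instead.</p>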
<div class="section">
<h2><a name="The_cisco-5-304_Parser"></a>The <tt>cisco-5-304</tt> Parser</h2>
<ul>
<li>Create the <tt>cisco-5-304</tt> Kafka topic via:</li>
</ul>
<div>
<div>
<pre class="source">/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $ZOOKEEPER --create --topic cisco-5-304 --partitions 1 --replication-factor 1
</pre></div></div>
<ul>
<li>Create the <tt>cisco-5-304</tt> parser by opening <tt>$METRON_HOME/config/zookeeper/parsers/cisco-5-304.json</tt> and placing the following:</li>
</ul>
<div>
<div>
<pre class="source">{
&quot;parserClassName&quot; : &quot;org.apache.metron.parsers.GrokParser&quot;
,&quot;sensorTopic&quot; : &quot;cisco-5-304&quot;
,&quot;rawMessageStrategy&quot; : &quot;ENVELOPE&quot;
,&quot;rawMessageStrategyConfig&quot; : {
&quot;messageField&quot; : &quot;data&quot;,
&quot;metadataPrefix&quot; : &quot;&quot;
}
, &quot;parserConfig&quot;: {
&quot;grokPath&quot;: &quot;/tmp/cisco_patterns&quot;,
&quot;batchSize&quot; : 1,
&quot;patternLabel&quot;: &quot;ACCESSED&quot;
}
}
</pre></div></div>
<p>Mostly the same comments from the previous parser apply here; we are just using a different pattern label.</p>
<p><a name="Start_the_Parsers"></a></p>
<h1>Start the Parsers</h1>
<p>Now we should start the parsers.</p>
<ul>
<li>Push the configs that we&#x2019;ve created for the 3 parsers:</li>
</ul>
<div>
<div>
<pre class="source">$METRON_HOME/bin/zk_load_configs.sh --mode PUSH -z $ZOOKEEPER -i $METRON_HOME/config/zookeeper/
</pre></div></div>
<ul>
<li>Start the <tt>cisco-6-302</tt> parser via:</li>
</ul>
<div>
<div>
<pre class="source">$METRON_HOME/bin/start_parser_topology.sh -k $BROKERLIST -z $ZOOKEEPER -s cisco-6-302
</pre></div></div>
<ul>
<li>Start the <tt>cisco-5-304</tt> parser via:</li>
</ul>
<div>
<div>
<pre class="source">$METRON_HOME/bin/start_parser_topology.sh -k $BROKERLIST -z $ZOOKEEPER -s cisco-5-304
</pre></div></div>
<ul>
<li>Start the <tt>pix_syslog_router</tt> parser via:</li>
</ul>
<div>
<div>
<pre class="source">$METRON_HOME/bin/start_parser_topology.sh -k $BROKERLIST -z $ZOOKEEPER -s pix_syslog_router
</pre></div></div>
<p><a name="Send_Data"></a></p>
<h1>Send Data</h1>
<ul>
<li>Create a file called <tt>~/data.log</tt> with the sample syslog loglines <a class="externalLink" href="http://www.monitorware.com/en/logsamples/cisco-pix-61(2).php">here</a>.</li>
<li>Send the data in via the Kafka console producer:</li>
</ul>
<div>
<div>
<pre class="source">cat ~/data.log | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list $BROKERLIST --topic pix_syslog_router
</pre></div></div>
<p>You should see indices created for the <tt>cisco-5-304</tt> and <tt>cisco-6-302</tt> data with appropriate fields created for each type.</p>
<p><a name="Aggregated_Parsers_with_Parser_Chaining"></a></p>
<h1>Aggregated Parsers with Parser Chaining</h1>
<p>Chained parsers can be run as aggregated parsers. These parsers continue to use the sensor-specific Kafka topics, and do not do internal routing to the appropriate sensor.</p>
<p>Instead of creating a topology per sensor, all 3 (<tt>pix_syslog_router</tt>, <tt>cisco-5-304</tt>, and <tt>cisco-6-302</tt>) can be run in a single aggregated parser. It&#x2019;s also possible to aggregate a subset of these parsers (e.g. run <tt>cisco-6-302</tt> as its own topology, and aggregate the other 2).</p>
<p>The step to start the parsers then becomes:</p>
<div>
<div>
<pre class="source">$METRON_HOME/bin/start_parser_topology.sh -k $BROKERLIST -z $ZOOKEEPER -s cisco-6-302,cisco-5-304,pix_syslog_router
</pre></div></div>
<p>The flow through the Storm topology and Kafka topics:</p>
<p><img src="../../images/aggregated_parser_chaining_flow.svg" alt="Aggregated Flow" /></p></div>
</div>
</div>
</div>
<hr/>
<footer>
<div class="container-fluid">
<div class="row-fluid">
© 2015-2016 The Apache Software Foundation. Apache Metron, Metron, Apache, the Apache feather logo,
and the Apache Metron project logo are trademarks of The Apache Software Foundation.
</div>
</div>
</footer>
</body>
</html>