<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Regex Parser - Apache Apex Malhar Documentation</title>
<link rel="shortcut icon" href="../../favicon.ico">
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="../../css/highlight.css">
<script>
// Current page data
var mkdocs_page_name = "Regex Parser";
var mkdocs_page_input_path = "operators/regexparser.md";
var mkdocs_page_url = "/operators/regexparser/";
</script>
<script src="../../js/jquery-2.1.1.min.js"></script>
<script src="../../js/modernizr-2.8.3.min.js"></script>
<script type="text/javascript" src="../../js/highlight.pack.js"></script>
<script src="../../js/theme.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Apache Apex Malhar Documentation</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li>
<li class="toctree-l1 ">
<a class="" href="../..">Apache Apex Malhar</a>
</li>
<li>
<li>
<ul class="subnav">
<li><span>APIs</span></li>
<li class="toctree-l1 ">
<a class="" href="../../apis/calcite/">SQL</a>
</li>
</ul>
<li>
<li>
<ul class="subnav">
<li><span>Operators</span></li>
<li class="toctree-l1 ">
<a class="" href="../block_reader/">Block Reader</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../csvformatter/">CSV Formatter</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../csvParserOperator/">CSV Parser</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../deduper/">Deduper</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../enricher/">Enricher</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../fsInputOperator/">File Input</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../file_output/">File Output</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../file_splitter/">File Splitter</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../filter/">Filter</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../fixedWidthParserOperator/">Fixed Width Parser</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../ftpInputOperator/">FTP Input Operator</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../AbstractJdbcTransactionableOutputOperator/">Jdbc Output Operator</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../jdbcPollInputOperator/">JDBC Poller Input</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../jmsInputOperator/">JMS Input</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../jsonFormatter/">JSON Formatter</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../jsonParser/">JSON Parser</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../kafkaInputOperator/">Kafka Input</a>
</li>
<li class="toctree-l1 current">
<a class="current" href="./">Regex Parser</a>
<ul>
<li class="toctree-l3"><a href="#regex-parser-operator">Regex Parser Operator</a></li>
<li><a class="toctree-l4" href="#operator-objective">Operator Objective</a></li>
<li><a class="toctree-l4" href="#overview">Overview</a></li>
<li><a class="toctree-l4" href="#operator-information">Operator Information</a></li>
<li><a class="toctree-l4" href="#platform-attributes-that-influence-operator-behavior">Platform Attributes that influence operator behavior</a></li>
<li><a class="toctree-l4" href="#ports">Ports</a></li>
<li><a class="toctree-l4" href="#partitioning">Partitioning</a></li>
<li><a class="toctree-l4" href="#example">Example</a></li>
</ul>
</li>
<li class="toctree-l1 ">
<a class="" href="../s3outputmodule/">S3 Output Module</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../transform/">Transformer</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../windowedOperator/">Windowed Operator</a>
</li>
<li class="toctree-l1 ">
<a class="" href="../xmlParserOperator/">XML Parser</a>
</li>
</ul>
<li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Apache Apex Malhar Documentation</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>Operators &raquo;</li>
<li>Regex Parser</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h1 id="regex-parser-operator">Regex Parser Operator</h1>
<h2 id="operator-objective">Operator Objective</h2>
<p><strong>RegexParser</strong> parses records using a regex pattern and constructs a concrete Java class, also known as a <a href="https://en.wikipedia.org/wiki/Plain_Old_Java_Object">"POJO"</a>, from each record. The user provides the regex pattern and a schema definition that together describe the data. The operator splits each record using the regex pattern and then uses the schema definition to map the captured values to the POJO. A date format, if needed, can also be specified in the schema. The supported constraints are listed in the <a href="#constraints">constraints table</a>.</p>
<p>The regex pattern has to match the tuple in its entirety. Valid records are emitted as POJOs, while invalid ones are emitted on the error port with an error message, provided the corresponding ports are connected.</p>
<p><strong>Note</strong>: The field names of the POJO must match the field names in the schema, and the schema fields must be listed in the same order as the values appear in the incoming data.</p>
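<p>The requirement that the pattern match the whole tuple can be illustrated with the standard <code>java.util.regex</code> API. This is a minimal sketch, assuming the behaviour corresponds to <code>Matcher.matches()</code> rather than <code>Matcher.find()</code>; the class name, method names and the short pattern are hypothetical and are not part of the operator.</p>

```java
import java.util.regex.Pattern;

public class FullMatchSketch {
    // Hypothetical pattern for illustration: a numeric id, one space, then a name.
    static final Pattern PATTERN = Pattern.compile("(\\d+) (\\w+)");

    // True only when the entire record matches the pattern.
    static boolean matchesEntirely(String record) {
        return PATTERN.matcher(record).matches();
    }

    // True when any part of the record matches; shown only for contrast.
    static boolean matchesSomewhere(String record) {
        return PATTERN.matcher(record).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesEntirely("101 alice"));                 // true
        System.out.println(matchesEntirely("101 alice trailing-noise"));  // false
        System.out.println(matchesSomewhere("101 alice trailing-noise")); // true
    }
}
```

<p>Under this assumption, a record with unexpected leading or trailing text would be treated as invalid unless the pattern explicitly allows for it, for example with a leading <code>.+</code> or a trailing <code>.*</code>.</p>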
<h2 id="overview">Overview</h2>
<p>The operator is <strong>idempotent</strong>, <strong>fault-tolerant</strong> and <strong>partitionable</strong>.</p>
<h2 id="operator-information">Operator Information</h2>
<ol>
<li>Operator location: <strong><em>malhar-contrib</em></strong></li>
<li>Available since: <strong><em>3.7.0</em></strong></li>
<li>Operator state: <strong><em>Evolving</em></strong></li>
<li>Java Package: <a href="https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/parser/RegexParser.java">com.datatorrent.contrib.parser.RegexParser</a></li>
</ol>
<h2 id="properties-of-regexparser"><a name="props"></a>Properties of RegexParser</h2>
<p>The user needs to set the <code>schema</code> property, a JSON string that describes the data and specifies the format of any date fields.
<strong>Note</strong>: In the examples below, {ApplicationName} and {OperatorName} are placeholders for the respective names of the application and the operator.</p>
<p>For example:</p>
<pre><code class="xml"> &lt;property&gt;
&lt;name&gt;dt.application.{ApplicationName}.operator.{OperatorName}.prop.schema&lt;/name&gt;
&lt;value&gt;{
&quot;fields&quot;: [
{
&quot;name&quot;: &quot;date&quot;,
&quot;type&quot;: &quot;Date&quot;,
&quot;constraints&quot;: {
&quot;format&quot;: &quot;yyyy:MM:dd:hh:mm:ss&quot;
}
},
{
&quot;name&quot;: &quot;id&quot;,
&quot;type&quot;: &quot;Integer&quot;
},
{
&quot;name&quot;: &quot;signInId&quot;,
&quot;type&quot;: &quot;String&quot;
},
{
&quot;name&quot;: &quot;ipAddress&quot;,
&quot;type&quot;: &quot;String&quot;
},
{
&quot;name&quot;: &quot;serviceId&quot;,
&quot;type&quot;: &quot;Double&quot;
},
{
&quot;name&quot;: &quot;accountId&quot;,
&quot;type&quot;: &quot;Long&quot;
},
{
&quot;name&quot;: &quot;platform&quot;,
&quot;type&quot;: &quot;Boolean&quot;
}
]
}
&lt;/value&gt;
&lt;/property&gt;
</code></pre>
<p>Note that the <code>Boolean</code> type in the above example accepts case-insensitive values for either true or false.</p>
<p>The user also needs to set the <code>splitRegexPattern</code> property, whose value is the regular expression that describes the pattern of the incoming data.
The following example sets <code>splitRegexPattern</code> in the <code>properties.xml</code> file of the application.</p>
<pre><code class="xml"> &lt;property&gt;
&lt;name&gt;dt.application.{ApplicationName}.operator.{OperatorName}.prop.splitRegexPattern&lt;/name&gt;
&lt;value&gt;.+\[SEQ=\w+\]\s*(\d+:[\d\d:]+)\s(\d+)\s* sign-in_id=(\S+) .*ip_address=(\S+).* service_id=(\S+).*account_id=(\S+).*platform=(\S+)&lt;/value&gt;
&lt;/property&gt;
</code></pre>
<table>
<thead>
<tr>
<th><strong>Property</strong></th>
<th><strong>Description</strong></th>
<th><strong>Type</strong></th>
<th><strong>Mandatory</strong></th>
<th><strong>Default Value</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><em>schema</em></td>
<td><a href="https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/parser/DelimitedSchema.java">Schema</a> describing data (see example above)</td>
<td>String</td>
<td>YES</td>
<td>N/A</td>
</tr>
<tr>
<td><em>splitRegexPattern</em></td>
<td>Regular expression that describes the pattern of the incoming data</td>
<td>String</td>
<td>YES</td>
<td>N/A</td>
</tr>
</tbody>
</table>
<h2 id="platform-attributes-that-influence-operator-behavior">Platform Attributes that influence operator behavior</h2>
<table>
<thead>
<tr>
<th><strong>Attribute</strong></th>
<th><strong>Description</strong></th>
<th><strong>Type</strong></th>
<th><strong>Mandatory</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><em>TUPLE_CLASS</em></td>
<td>Attribute on the output port that tells the operator which POJO class to emit</td>
<td>Class</td>
<td>Yes</td>
</tr>
</tbody>
</table>
<p>The operator converts the byte array received on the input port to a string by decoding it with the JVM's default <code>Charset</code>. It then splits the string using the <code>splitRegexPattern</code> and populates an object using the <code>schema</code>. The Apex platform converts this object to an object of the class given by the <code>TUPLE_CLASS</code> attribute when emitting it.</p>
<p>The following example sets the <code>TUPLE_CLASS</code> attribute on the output port in the <code>properties.xml</code> file of the application.</p>
<pre><code class="xml"> &lt;property&gt;
&lt;name&gt;dt.application.{ApplicationName}.operator.{OperatorName}.port.out.attr.TUPLE_CLASS&lt;/name&gt;
&lt;value&gt;com.datatorrent.tutorial.regexparser.ServerLog&lt;/value&gt;
&lt;/property&gt;
</code></pre>
<p>The following example sets the <code>TUPLE_CLASS</code> attribute on the output port in the <code>Application.java</code> file of the application.</p>
<pre><code class="java">RegexParser regexParser = dag.addOperator(&quot;regexParser&quot;, RegexParser.class);
dag.setOutputPortAttribute(regexParser.out, Context.PortContext.TUPLE_CLASS, ServerLog.class);
</code></pre>
<p>Here, the value set above (<code>ServerLog</code>) is the POJO class that the operator is expected to emit. An example of this class is shown below.</p>
<pre><code class="java"> import java.util.Date;

 public class ServerLog
{
private Date date;
private int id;
private String signInId;
private String ipAddress;
private double serviceId;
private long accountId;
private boolean platform;
public int getId()
{
return id;
}
public void setId(int id)
{
this.id = id;
}
public Date getDate()
{
return date;
}
public void setDate(Date date)
{
this.date = date;
}
public String getSignInId()
{
return signInId;
}
public void setSignInId(String signInId)
{
this.signInId = signInId;
}
public String getIpAddress()
{
return ipAddress;
}
public void setIpAddress(String ipAddress)
{
this.ipAddress = ipAddress;
}
public double getServiceId()
{
return serviceId;
}
public void setServiceId(double serviceId)
{
this.serviceId = serviceId;
}
public long getAccountId()
{
return accountId;
}
public void setAccountId(long accountId)
{
this.accountId = accountId;
}
public boolean getPlatform()
{
return platform;
}
public void setPlatform(boolean platform)
{
this.platform = platform;
}
}
</code></pre>
<p>Let us look at how the data is populated into the POJO using the example <code>schema</code>, <code>splitRegexPattern</code> and <code>TUPLE_CLASS</code> definitions given above.</p>
<p>Consider the sample event log below, which matches the <code>splitRegexPattern</code>.</p>
<pre><code>2015-10-01T03:14:49.000-07:00 lvn-d1-dev DevServer[9876]: INFO: [EVENT][SEQ=248717] 2015:10:01:03:14:49 101 sign-in_id=11111@psop.com ip_address=1.1.1.1 service_id=IP1234-NPB12345_00 result=RESULT_SUCCES console_id=0000000138e91b4e58236bf32besdafasdfasdfasdfsadf account_id=11111 platform=pik
</code></pre>
<p>The images below show the expression match on the data. The parentheses corresponding to <a href="https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#cg">capturing groups</a> are highlighted in green, and each such group corresponds to one field of the POJO. There are 7 such groups in the current example.</p>
<p><img alt="Regular Expression pattern match" src="../images/regexparser/regexcapturedgroups.png" /></p>
<p>The matched data in the event log is highlighted with 7 different colors below.</p>
<p><img alt="Matched Log Data" src="../images/regexparser/logcapturedgroups.png" /></p>
<p>The matched fields above are populated onto an object according to the <code>schema</code> definition given earlier. The population is a one-to-one mapping from the matched data to the <code>schema</code> fields, in match order. Once the object is populated, the Apex platform converts it to the <code>TUPLE_CLASS</code> type when emitting it on the output port <code>out</code>.</p>
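<p>The capture step described above can be reproduced outside the operator with plain <code>java.util.regex</code>. The following is a minimal sketch and not the operator's implementation: the class name <code>CapturedGroupsSketch</code> and the method <code>capture</code> are hypothetical, and the use of <code>Matcher.matches()</code> is an assumption based on the statement that the pattern has to match the tuple in its entirety. It decodes the bytes with the default charset, applies the example <code>splitRegexPattern</code>, and pairs each captured group with the schema field at the same position.</p>

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CapturedGroupsSketch {
    // The example splitRegexPattern, with backslashes escaped for a Java string literal.
    static final Pattern SPLIT = Pattern.compile(
        ".+\\[SEQ=\\w+\\]\\s*(\\d+:[\\d\\d:]+)\\s(\\d+)\\s* sign-in_id=(\\S+) .*ip_address=(\\S+)"
        + ".* service_id=(\\S+).*account_id=(\\S+).*platform=(\\S+)");

    // Decodes the tuple with the JVM default charset and returns the captured groups
    // in match order, or an empty list when the record does not match entirely.
    static List<String> capture(byte[] tuple) {
        String record = new String(tuple, Charset.defaultCharset());
        Matcher matcher = SPLIT.matcher(record);
        List<String> groups = new ArrayList<>();
        if (matcher.matches()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                groups.add(matcher.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String log = "2015-10-01T03:14:49.000-07:00 lvn-d1-dev DevServer[9876]: INFO: [EVENT][SEQ=248717] "
            + "2015:10:01:03:14:49 101 sign-in_id=11111@psop.com ip_address=1.1.1.1 "
            + "service_id=IP1234-NPB12345_00 result=RESULT_SUCCES "
            + "console_id=0000000138e91b4e58236bf32besdafasdfasdfasdfsadf account_id=11111 platform=pik";
        // Schema field names in the order they are declared in the example schema.
        String[] fields = {"date", "id", "signInId", "ipAddress", "serviceId", "accountId", "platform"};
        List<String> groups = capture(log.getBytes(Charset.defaultCharset()));
        for (int i = 0; i < groups.size(); i++) {
            System.out.println(fields[i] + " = " + groups.get(i));
        }
    }
}
```

<p>The sketch stops at the captured strings. Converting each string to the type declared in the schema is left to the operator and is not shown here.</p>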
<h2 id="supported-datatypes-in-schema"><a name="dataTypes"></a>Supported DataTypes in Schema</h2>
<ul>
<li>Integer</li>
<li>Long</li>
<li>Double</li>
<li>Character</li>
<li>String</li>
<li>Boolean</li>
<li>Date</li>
<li>Float</li>
</ul>
<h2 id="schema-constraints"><a name="constraints"></a>Schema Constraints</h2>
<p>Currently, the operator supports constraints only on Date fields.</p>
<table>
<thead>
<tr>
<th><strong>DataType</strong></th>
<th><strong>Constraints</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><em>Date</em></td>
<td>format</td>
<td>A simple date format as specified in the <a href="http://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html?is-external=true">SimpleDateFormat</a> class</td>
</tr>
</tbody>
</table>
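<p>To show how a <code>format</code> constraint relates to the captured text, the following sketch parses the date string from the sample log with <code>SimpleDateFormat</code> and the format used in the example schema. The class name <code>DateFormatSketch</code> and the method <code>parse</code> are hypothetical, and wrapping the checked <code>ParseException</code> in an <code>IllegalArgumentException</code> is a choice made for the sketch only, not a description of how the operator reports errors.</p>

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateFormatSketch {
    // Parses a captured value using the pattern given in the schema's "format" constraint.
    static Date parse(String value, String format) {
        try {
            return new SimpleDateFormat(format).parse(value);
        } catch (ParseException e) {
            // Assumption for this sketch: surface a mismatch as an unchecked exception.
            throw new IllegalArgumentException(
                "Value '" + value + "' does not match format '" + format + "'", e);
        }
    }

    public static void main(String[] args) {
        Date date = parse("2015:10:01:03:14:49", "yyyy:MM:dd:hh:mm:ss");
        // Print the parsed date in a different layout to show that the fields were understood.
        System.out.println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date));
    }
}
```

<p>The pattern letters follow the <code>SimpleDateFormat</code> rules, so the separators in the format must be the same as those in the captured text.</p>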
<h2 id="ports">Ports</h2>
<table>
<thead>
<tr>
<th><strong>Port</strong></th>
<th><strong>Description</strong></th>
<th><strong>Type</strong></th>
<th><strong>Mandatory</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><em>in</em></td>
<td>Tuples that need to be parsed are received on this port</td>
<td>byte[]</td>
<td>Yes</td>
</tr>
<tr>
<td><em>out</em></td>
<td>Valid tuples are emitted as POJOs on this port</td>
<td>Object (POJO)</td>
<td>No</td>
</tr>
<tr>
<td><em>err</em></td>
<td>Invalid tuples are emitted with an error message on this port</td>
<td>KeyValPair &lt;String, String&gt;</td>
<td>No</td>
</tr>
</tbody>
</table>
<h2 id="partitioning">Partitioning</h2>
<p>Regex Parser can be statically or dynamically partitioned.</p>
<h3 id="static-partitioning">Static Partitioning</h3>
<p>This can be achieved in either of the following two ways.</p>
<p>Specifying the partitioner and the number of partitions in the <code>populateDAG()</code> method:</p>
<pre><code class="java"> RegexParser regexParser = dag.addOperator(&quot;regexParser&quot;, RegexParser.class);
StatelessPartitioner&lt;RegexParser&gt; partitioner = new StatelessPartitioner&lt;RegexParser&gt;(2);
dag.setAttribute(regexParser, Context.OperatorContext.PARTITIONER, partitioner);
</code></pre>
<p>Specifying the partitioner in the properties file:</p>
<pre><code class="xml"> &lt;property&gt;
&lt;name&gt;dt.application.{ApplicationName}.operator.{OperatorName}.attr.PARTITIONER&lt;/name&gt;
&lt;value&gt;com.datatorrent.common.partitioner.StatelessPartitioner:2&lt;/value&gt;
&lt;/property&gt;
</code></pre>
<p>Either of the above settings statically partitions RegexParser into 2 partitions. Change the value to set a different number of static partitions.</p>
<h3 id="dynamic-partitioning">Dynamic Partitioning</h3>
<p>RegexParser can be dynamically partitioned using the out-of-the-box partitioner:</p>
<h4 id="throughput-based">Throughput based</h4>
<p>The following code can be added to the <code>populateDAG</code> method of the application to dynamically partition RegexParser:</p>
<pre><code class="java"> RegexParser regexParser = dag.addOperator(&quot;regexParser&quot;, RegexParser.class);
StatelessThroughputBasedPartitioner&lt;RegexParser&gt; partitioner = new StatelessThroughputBasedPartitioner&lt;&gt;();
partitioner.setCooldownMillis(conf.getLong(COOL_DOWN_MILLIS, 10000));
partitioner.setMaximumEvents(conf.getLong(MAX_THROUGHPUT, 30000));
partitioner.setMinimumEvents(conf.getLong(MIN_THROUGHPUT, 10000));
dag.setAttribute(regexParser, OperatorContext.STATS_LISTENERS, Arrays.asList(new StatsListener[]{partitioner}));
dag.setAttribute(regexParser, OperatorContext.PARTITIONER, partitioner);
</code></pre>
<p>The above code dynamically partitions RegexParser when the throughput changes.
If the overall throughput of RegexParser rises above 30000 or falls below 10000, the platform repartitions RegexParser
so that the throughput of a single partition stays between 10000 and 30000.
The CooldownMillis value of 10000 is the threshold time, in milliseconds, over which the throughput change is observed.</p>
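<p>The threshold rule described above can be pictured with a small stand-alone sketch. This is only an illustration of the comparison against the minimum and maximum values, not the code of <code>StatelessThroughputBasedPartitioner</code>; the class name, the <code>Action</code> enum and the <code>decide</code> method are hypothetical, and the cooldown period is left out.</p>

```java
public class ThroughputThresholdSketch {
    // Hypothetical outcomes of comparing observed throughput with the configured bounds.
    enum Action { SPLIT, MERGE, KEEP }

    // Returns SPLIT above the maximum, MERGE below the minimum, and KEEP otherwise.
    static Action decide(long observedEvents, long minimumEvents, long maximumEvents) {
        if (observedEvents > maximumEvents) {
            return Action.SPLIT;
        }
        if (observedEvents < minimumEvents) {
            return Action.MERGE;
        }
        return Action.KEEP;
    }

    public static void main(String[] args) {
        // Bounds taken from the example configuration above.
        System.out.println(decide(35000, 10000, 30000)); // SPLIT
        System.out.println(decide(5000, 10000, 30000));  // MERGE
        System.out.println(decide(20000, 10000, 30000)); // KEEP
    }
}
```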
<h2 id="example">Example</h2>
<p>Coming Soon</p>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../s3outputmodule/" class="btn btn-neutral float-right" title="S3 Output Module">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../kafkaInputOperator/" class="btn btn-neutral" title="Kafka Input"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../kafkaInputOperator/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../s3outputmodule/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
</body>
</html>