Think of Grok as modular, pre-defined regular expressions for parsing access logs or any other data where a hand-written regular expression would no longer be comprehensible. One very simple example is QUOTEDSTRING
QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))
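As a quick illustration of how such a pattern is used from a template, here is a minimal sketch based on the tools.grok API shown later on this page; the sample input line and the "value" capture name are made up for this example:

<#ftl output_format="plainText">
<#-- Minimal sketch: capture the quoted part of a made-up sample line using QUOTEDSTRING -->
<#assign grok = tools.grok.create("%{QUOTEDSTRING:value}")>
<#assign parts = grok.match("GET \"/index.html\" 200").capture()>
${parts["value"]} <#-- prints "/index.html" including the surrounding quotes -->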
And with Grok, QUOTEDSTRING is just a building block for an even more complex regular expression such as COMBINEDAPACHELOG
bin/freemarker-generator -t examples/templates/accesslog/combined-access.ftl examples/data/accesslog/combined-access.log
which gives you the following output
TIMESTAMP;VERB;REQUEST;HTTPVERSION
19/Jun/2005:06:44:17 +0200;GET;/wximages/wxwidgets02-small.png;1.1
19/Jun/2005:06:46:05 +0200;GET;/wximages/wxwidgets02-small.png;1.1
19/Jun/2005:06:47:37 +0200;GET;/wximages/wxwidgets02-small.png;1.1
19/Jun/2005:06:48:40 +0200;GET;/wiki.pl?WxWidgets_Bounties;1.1
19/Jun/2005:06:50:49 +0200;GET;/wiki.pl?WxWidgets_Compared_To_Other_Toolkits;1.1
19/Jun/2005:06:50:49 +0200;GET;/wxwiki.css;1.1
19/Jun/2005:06:50:49 +0200;GET;/wximages/wxwidgets02-small.png;1.1
19/Jun/2005:06:50:50 +0200;GET;/favicon.ico;1.1
19/Jun/2005:06:52:36 +0200;GET;/wximages/wxwidgets02-small.png;1.1
19/Jun/2005:06:53:14 +0200;GET;/;1.0
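For context, COMBINEDAPACHELOG is itself assembled from smaller building blocks. In the standard Logstash pattern set its definition looks roughly like this (the exact definitions vary between pattern library versions, so treat this as an illustration rather than the authoritative source):

QS %{QUOTEDSTRING}
COMMONAPACHELOG %{IPORHOST:clientip} %{HTTPDUSER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
COMBINEDAPACHELOG %{COMMONAPACHELOG} %{QS:referrer} %{QS:agent}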
This output is produced by the following FreeMarker template
<#ftl output_format="plainText" strip_whitespace=true>
<#assign grok = tools.grok.create("%{COMBINEDAPACHELOG}")>
<#assign dataSource = dataSources[0]>
<#assign lines = dataSource.getLineIterator()>
<#compress>
TIMESTAMP;VERB;REQUEST;HTTPVERSION
<#list lines as line>
<#assign parts = grok.match(line).capture()>
<#assign timestamp = parts["timestamp"]>
<#assign verb = parts["verb"]>
<#assign request = parts["request"]>
<#assign httpversion = parts["httpversion"]>
${timestamp};${verb};${request};${httpversion}
</#list>
</#compress>
While this looks small and tidy, there are some nifty features
- tools.grok.create("%{COMBINEDAPACHELOG}") builds the Grok instance that parses access logs in Combined Log Format
- the input is consumed line by line via getLineIterator, which also works when reading from stdin, so you are able to parse gigabytes of access logs or other files
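Since COMBINEDAPACHELOG captures more than the four fields used above, the capture map can be queried for additional values. A minimal sketch that would slot into the loop of the template above; the field names (clientip, response, bytes, agent) are the standard Logstash capture names and are not used in the original template:

<#-- Sketch: print additional COMBINEDAPACHELOG fields, reusing "lines" and "grok" from the template above -->
<#list lines as line>
    <#assign parts = grok.match(line).capture()>
    ${parts["clientip"]};${parts["response"]};${parts["bytes"]};${parts["agent"]}
</#list>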
A more practical example consists of parsing dozens of server log files to determine the response time of message processing, e.g.
2021-05-18 20:00:32,140 INFO [aa.bb.ccc] (Thread-99) message response handled in: 62 ms; message counter: 2048; total message counter: 7094
In technical terms the FTL
- defines a custom Grok pattern MY_SERVERLOG (together with a few helper patterns) to match the server log line shown above
- skips all response times of less than 5 ms since these are only pings we are not interested in

<#ftl output_format="plainText" strip_whitespace=true>
<#--
    Define custom grok patterns as a map to match something like the following line using "MY_SERVERLOG"
    2019-05-17 20:00:32,140 INFO [xx.yyy.zzzz] (Thread-99) message response handled in: 62 ms; message counter: 2048; total message counter: 7094
-->
<#assign patternDefinitions = {
    "MY_DATE": "%{YEAR}-%{MONTHNUM}-%{MONTHDAY}",
    "MY_TIMESTAMP": "%{MY_DATE:date} %{TIME:time},%{INT:millis}",
    "MY_MODULE": "\\[%{NOTSPACE}\\]",
    "MY_THREAD": "\\(%{NOTSPACE}\\)",
    "MY_SERVERLOG": "^%{MY_TIMESTAMP} %{LOGLEVEL}%{SPACE:UNWANTED}%{MY_MODULE} %{MY_THREAD} message response handled in: %{INT:response_time} ms; %{GREEDYDATA:UNWANTED}$"
}>
<#-- Instantiate the grok tool -->
<#assign grok = tools.grok.create("%{MY_SERVERLOG}", patternDefinitions)>
<#-- Iterate over all data sources and convert matching lines to CSV output -->
<#compress>
    TIMESTAMP;MILLIS
    <#if dataSources?has_content>
        <#list dataSources as dataSource>
            <#list dataSource.getLineIterator() as line>
                <#assign parts = grok.match(line).capture()>
                <#if parts?has_content>
                    <#-- Skip all response times less than 5 ms because these are boring pings -->
                    <#if parts.response_time?number gt 5>
                        ${parts.date}T${parts.time}.${parts.millis}+02:00;${parts.response_time}
                    </#if>
                </#if>
            </#list>
        </#list>
    </#if>
</#compress>
Executing the FTL yields the following output
> bin/freemarker-generator -t examples/templates/logs/csv/serverlog-to-csv.ftl examples/data/logs; echo
TIMESTAMP;MILLIS
2021-05-18T20:00:32.140;62
2021-05-18T21:00:32.140;162
When processing large log files you may experience performance issues; using regexp anchors, as described in "Killing your Logstash performance with Grok", might help. Using regexp anchors reduced the execution time by 50% when parsing through 3 GB of logs with only a small percentage of matching lines.
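As a minimal illustration, reusing the MY_SERVERLOG pattern from the example above: without anchors the regex engine retries the match at every position of a non-matching line, whereas with ^ and $ a line that does not start with a timestamp is rejected after a single match attempt. The two alternative entries for the patternDefinitions map would look like this:

<#-- Unanchored: the engine retries the match at every offset before giving up on a non-matching line -->
"MY_SERVERLOG": "%{MY_TIMESTAMP} %{LOGLEVEL}%{SPACE:UNWANTED}%{MY_MODULE} %{MY_THREAD} message response handled in: %{INT:response_time} ms; %{GREEDYDATA:UNWANTED}"

<#-- Anchored: non-matching lines fail fast at the start of the line -->
"MY_SERVERLOG": "^%{MY_TIMESTAMP} %{LOGLEVEL}%{SPACE:UNWANTED}%{MY_MODULE} %{MY_THREAD} message response handled in: %{INT:response_time} ms; %{GREEDYDATA:UNWANTED}$"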