The Profiler is a feature extraction mechanism that can generate a profile describing the behavior of an entity on a network. An entity might be a server, user, subnet, or application. Once a profile has been generated defining what normal behavior looks like, models can be built that identify anomalous behavior.
This is achieved by summarizing the streaming telemetry data consumed by Metron over sliding windows. A summary statistic is applied to the data received within a given window. Collecting this summary across many windows results in a time series that is useful for analysis.
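The idea of turning a stream into a time series can be illustrated with a small sketch. It is not Profiler code: it assumes fixed, non-overlapping windows as a simplification, and the function name `summarize` and the sample events are hypothetical.

```python
from collections import defaultdict

def summarize(events, window_seconds, statistic):
    """Group (timestamp, value) pairs into fixed windows and summarize each one."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # Integer division assigns each event to exactly one window.
        windows[timestamp // window_seconds].append(value)
    # One summary value per window, ordered by window start time: a time series.
    return [(index * window_seconds, statistic(values))
            for index, values in sorted(windows.items())]

# Hypothetical telemetry: (timestamp in seconds, bytes observed).
events = [(0, 10), (5, 20), (12, 30), (19, 50), (25, 5)]
series = summarize(events, window_seconds=10, statistic=sum)
print(series)
```

Swapping `sum` for another function such as `max` or `len` changes the summary statistic without changing the windowing.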
Any field contained within a message can be used to generate a profile. A profile can even be produced by combining fields that originate in different data sources. A user has considerable power to transform the data used in a profile by leveraging the Stellar language. A user need only configure the desired profiles in Zookeeper and ensure that the Profiler topology is running.
The Profiler configuration requires a JSON-formatted set of elements, many of which can contain Stellar code. The configuration contains the following elements.
Name | Required | Description |
---|---|---|
profile | Required | Unique name identifying the profile. |
foreach | Required | A separate profile is maintained “for each” of these. |
onlyif | Optional | Boolean expression that determines if a message should be applied to the profile. |
groupBy | Optional | One or more Stellar expressions used to group the profile measurements when persisted. |
init | Optional | One or more expressions executed at the start of a window period. |
update | Required | One or more expressions executed when a message is applied to the profile. |
result | Required | A Stellar expression that is executed when the window period expires. |
expires | Optional | Profile data is purged after this period of time, specified in days. |
profile
Required
A unique name identifying the profile. The field is treated as a string.
foreach
Required
A separate profile is maintained ‘for each’ of these. This is effectively the entity that the profile is describing. The field is expected to contain a Stellar expression whose result is the entity name.
For example, if this is set to ip_src_addr, then a separate profile is maintained for each unique IP source address in the data: 10.0.0.1, 10.0.0.2, etc.
onlyif
Optional
An expression that determines if a message should be applied to the profile. A Stellar expression that returns a Boolean is expected. A message is only applied to a profile if this expression is true. This allows a profile to filter the messages that get applied to it.
groupBy
Optional
One or more Stellar expressions used to group the profile measurements when persisted. This is intended to sort the Profile data to allow for a contiguous scan when accessing subsets of the data.
The ‘groupBy’ expressions can refer to any field within an org.apache.metron.profiler.ProfileMeasurement. A common use case would be grouping by day of week. This allows a contiguous scan to access all profile data for Mondays only. Using the following definition would achieve this.
"groupBy": [ "DAY_OF_WEEK()" ]
init
Optional
One or more expressions executed at the start of a window period. A map is expected where the key is the variable name and the value is a Stellar expression. The map can contain 0 or more variables/expressions. At the start of each window period, each expression is executed once and its result is stored in a variable with the given name.
"init": { "var1": "0", "var2": "1" }
update
Required
One or more expressions executed when a message is applied to the profile. A map is expected where the key is the variable name and the value is a Stellar expression. The map can include 0 or more variables/expressions. When each message is applied to the profile, each expression is executed and its result is stored in a variable with the given name.
"update": { "var1": "var1 + 1", "var2": "var2 + 1" }
result
Required
A Stellar expression that is executed when the window period expires. The expression is expected to summarize the messages that were applied to the profile over the window period. The expression must result in a numeric value such as a Double, Long, Float, Short, or Integer.
expires
Optional
A numeric value that defines how many days the profile data is retained. After this time, the data expires and is no longer accessible. If no value is defined, the data does not expire.
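How these elements fit together over one window period can be sketched in Python. This is only an illustration under stated assumptions: plain Python callables stand in for the Stellar expressions, `run_window` and the sample messages are hypothetical, and `init` is run lazily on an entity's first message rather than at the exact start of the window.

```python
def run_window(messages, foreach, onlyif, init, update, result):
    """Apply one window of messages to a profile; return {entity: measurement}."""
    state = {}
    for message in messages:
        if not onlyif(message):
            continue                    # 'onlyif' filters out the message
        entity = foreach(message)       # 'foreach' resolves the entity name
        if entity not in state:
            state[entity] = init()      # 'init' sets the starting variables
        state[entity] = update(state[entity], message)   # 'update' per message
    # 'result' summarizes each entity's variables when the window expires.
    return {entity: result(variables) for entity, variables in state.items()}

# Hypothetical messages observed during a single window period.
messages = [
    {"ip_src_addr": "10.0.0.1", "protocol": "HTTP", "bytes_in": 100},
    {"ip_src_addr": "10.0.0.1", "protocol": "HTTP", "bytes_in": 50},
    {"ip_src_addr": "10.0.0.2", "protocol": "DNS", "bytes_in": 70},
    {"ip_src_addr": "10.0.0.2", "protocol": "HTTP", "bytes_in": 30},
]

measurements = run_window(
    messages,
    foreach=lambda m: m["ip_src_addr"],
    onlyif=lambda m: m["protocol"] == "HTTP",
    init=lambda: {"total_bytes": 0.0},
    update=lambda v, m: {"total_bytes": v["total_bytes"] + m["bytes_in"]},
    result=lambda v: v["total_bytes"],
)
print(measurements)
```

Each entity ends the window with a single numeric measurement, which matches the requirement that `result` produce a numeric value.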
Examples of the types of profiles that can be built include the following. Each shows the configuration that would be required to produce the profile. These examples assume fictitious input messages that look something like the following.
{ "ip_src_addr": "10.0.0.1", "protocol": "HTTPS", "length": "10", "bytes_in": "234" }, { "ip_src_addr": "10.0.0.2", "protocol": "HTTP", "length": "20", "bytes_in": "390" }, { "ip_src_addr": "10.0.0.3", "protocol": "DNS", "length": "30", "bytes_in": "560" }
The total number of bytes of HTTP data for each host. The following configuration would be used to generate this profile.
{ "profiles": [ { "profile": "example1", "foreach": "ip_src_addr", "onlyif": "protocol == 'HTTP'", "init": { "total_bytes": 0.0 }, "update": { "total_bytes": "total_bytes + bytes_in" }, "result": "total_bytes", "expires": 30 } ] }
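This creates a profile named 'example1' that, for each IP source address, sums the bytes_in field of every HTTP message received during the window period and retains the result for 30 days.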
The ratio of DNS traffic to HTTP traffic for each host. The following configuration would be used to generate this profile.
{ "profiles": [ { "profile": "example2", "foreach": "ip_src_addr", "onlyif": "protocol == 'DNS' or protocol == 'HTTP'", "init": { "num_dns": 1.0, "num_http": 1.0 }, "update": { "num_dns": "num_dns + (if protocol == 'DNS' then 1 else 0)", "num_http": "num_http + (if protocol == 'HTTP' then 1 else 0)" }, "result": "num_dns / num_http" } ] }
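This creates a profile named 'example2' that, for each IP source address, counts the DNS and HTTP messages received during the window period and reports the ratio of the two counts when the window period expires.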
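The arithmetic of the ratio profile can be checked with a short sketch. It mirrors the configuration in plain Python rather than Stellar; the function name and the sample protocol lists are hypothetical. One effect of starting both counters at 1.0 is that the division is defined even when a host sends no HTTP traffic in a window.

```python
def dns_to_http_ratio(protocols):
    """Mirror the ratio profile for the messages of one host in one window."""
    num_dns, num_http = 1.0, 1.0             # mirrors "init"
    for protocol in protocols:
        if protocol not in ("DNS", "HTTP"):
            continue                          # mirrors "onlyif"
        num_dns += 1 if protocol == "DNS" else 0     # mirrors "update"
        num_http += 1 if protocol == "HTTP" else 0
    return num_dns / num_http                 # mirrors "result"

# Three DNS messages and one HTTP message: (1 + 3) / (1 + 1)
ratio = dns_to_http_ratio(["DNS", "DNS", "HTTP", "DNS", "HTTPS"])
# Two DNS messages and no HTTP: (1 + 2) / (1 + 0)
dns_only = dns_to_http_ratio(["DNS", "DNS"])
print(ratio, dns_only)
```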
The average of the length field of HTTP traffic. The following configuration would be used to generate this profile.
{ "profiles": [ { "profile": "example3", "foreach": "ip_src_addr", "onlyif": "protocol == 'HTTP'", "update": { "s": "STATS_ADD(s, length)" }, "result": "STATS_MEAN(s)" } ] }
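This creates a profile named 'example3' that, for each IP source address, adds the length field from each HTTP message to a running statistical summary and reports the mean when the window period expires.
The Profiler topology also accepts the following configuration settings.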
Setting | Description |
---|---|
profiler.workers | The number of worker processes to create for the topology. |
profiler.executors | The number of executors to spawn per component. |
profiler.input.topic | The name of the Kafka topic from which to consume data. |
profiler.flush.interval.seconds | The duration of a profile's sliding window before it is flushed. |
profiler.hbase.salt.divisor | A salt is prepended to the row key to help prevent hotspotting. This constant is used to generate the salt. Ideally, this constant should be roughly equal to the number of nodes in the HBase cluster. |
profiler.hbase.table | The name of the HBase table that profiles are written to. |
profiler.hbase.batch | The number of puts that are written in a single batch. |
profiler.hbase.flush.interval.seconds | The maximum number of seconds between batch writes to HBase. |
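The role of the salt divisor can be illustrated with a minimal sketch. The hashing scheme here is an assumption made for illustration (an MD5 digest reduced modulo the divisor); the source does not state how the Profiler actually derives the salt, and `salted_key` and the key layout are hypothetical.

```python
import hashlib
from collections import Counter

def salted_key(row_key: bytes, salt_divisor: int) -> bytes:
    """Prepend a one-byte salt derived from the key (assumed scheme, not Metron's)."""
    # A single salt byte limits this sketch to divisors of at most 256.
    digest = hashlib.md5(row_key).digest()
    salt = int.from_bytes(digest[:4], "big") % salt_divisor
    return bytes([salt]) + row_key

# Hypothetical row keys for consecutive periods of one profile and entity.
# Without a salt these share a prefix and would sort next to each other.
keys = [f"test|10.0.0.1|{period}".encode() for period in range(1000)]
buckets = Counter(salted_key(key, 4)[0] for key in keys)
print(sorted(buckets.items()))
```

Because the salt is a function of the key itself, a reader can recompute it to locate a row, while writes for adjacent periods are spread across up to `salt_divisor` key ranges.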
This section describes the steps required to get your first profile running.
Launch the ‘Quick Dev’ environment.
$ cd metron-deployment/vagrant/quick-dev-platform/
$ ./run.sh
After the environment has been deployed, log in to the host.
$ vagrant ssh
$ sudo su -
$ cd /usr/metron/0.2.1BETA/
Create a table within HBase that will store the profile data. The table name and column family must match the Profiler topology configuration stored at /usr/metron/0.2.1BETA/config/profiler.properties.
$ /usr/hdp/current/hbase-client/bin/hbase shell
hbase(main):001:0> create 'profiler', 'P'
Shorten the flush intervals to see results more quickly. Edit the Profiler topology properties located at /usr/metron/0.2.1BETA/config/profiler.properties. Alter the following properties.
profiler.period.duration=30
profiler.period.duration.units=SECONDS
profiler.hbase.flush.interval.seconds=5
Create the Profiler definition in a file located at /usr/metron/0.2.1BETA/config/zookeeper/profiler.json. The following JSON will create a profile that simply counts the number of messages.
{ "profiles": [ { "profile": "test", "foreach": "ip_src_addr", "onlyif": "true", "init": { "sum": 0 }, "update": { "sum": "sum + 1" }, "result": "sum" } ] }
Upload the Profiler definition to Zookeeper.
$ bin/zk_load_configs.sh -m PUSH -i config/zookeeper/ -z node1:2181
Start the Profiler topology.
$ bin/start_profiler_topology.sh
Ensure that test messages are being sent to the Profiler's input topic in Kafka. The Profiler consumes messages from the Kafka topic named by the profiler.input.topic setting.
Check the HBase table to validate that the Profiler is working.
$ /usr/hdp/current/hbase-client/bin/hbase shell
hbase(main):001:0> count 'profiler'
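When a profile produces no rows, one thing worth ruling out is a definition that is missing a required element. The following is a minimal local check, assuming only the required elements listed in the configuration table (profile, foreach, update, result). It is not a Metron tool, and `missing_fields` is a hypothetical helper.

```python
import json

REQUIRED = ("profile", "foreach", "update", "result")

def missing_fields(definition_json: str):
    """Return {profile name: [missing required elements]} for a Profiler definition."""
    definition = json.loads(definition_json)
    problems = {}
    for index, profile in enumerate(definition.get("profiles", [])):
        missing = [field for field in REQUIRED if field not in profile]
        if missing:
            # Fall back to a positional label when the name itself is missing.
            problems[profile.get("profile", f"<profile #{index}>")] = missing
    return problems

good = ('{"profiles": [{"profile": "test", "foreach": "ip_src_addr", '
        '"onlyif": "true", "init": {"sum": 0}, '
        '"update": {"sum": "sum + 1"}, "result": "sum"}]}')
bad = '{"profiles": [{"profile": "broken", "foreach": "ip_src_addr"}]}'

print(missing_fields(good))
print(missing_fields(bad))
```

This checks structure only; it does not parse or evaluate the Stellar expressions themselves.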
ProfileMeasurement - Represents a single data point within a Profile. A Profile is effectively a time series. To this end, a Profile is composed of many ProfileMeasurement values which in aggregate form a time series.
ProfilePeriod - The Profiler captures one ProfileMeasurement each ProfilePeriod. A ProfilePeriod will occur at fixed, deterministic points in time. This allows for efficient retrieval of profile data.
RowKeyBuilder - Builds row keys that can be used to read or write profile data to HBase.
ColumnBuilder - Defines the columns of data stored with a profile measurement.
ProfileHBaseMapper - Defines for the HBaseBolt how profile measurements are stored in HBase. This class leverages a RowKeyBuilder and ColumnBuilder.
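Why fixed, deterministic periods make retrieval efficient can be shown with a small sketch. It assumes that periods are aligned to the epoch by integer division, which the text above does not state explicitly; `period_id` and `periods_between` are hypothetical names, not methods of ProfilePeriod.

```python
def period_id(timestamp_ms: int, duration_ms: int) -> int:
    """Map a timestamp to the index of the period that contains it."""
    return timestamp_ms // duration_ms

def periods_between(start_ms: int, end_ms: int, duration_ms: int):
    """List every period index that overlaps the inclusive range [start_ms, end_ms]."""
    first = period_id(start_ms, duration_ms)
    last = period_id(end_ms, duration_ms)
    return list(range(first, last + 1))

FIFTEEN_MINUTES_MS = 15 * 60 * 1000   # 900,000 ms

wanted = periods_between(1_000_000, 2_800_000, FIFTEEN_MINUTES_MS)
print(wanted)
```

Because the periods covering a time range can be computed up front, a reader can derive the exact keys to fetch instead of scanning for whatever happens to exist.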
The Profiler is implemented as a Storm topology using the following bolts and spouts.
KafkaSpout - A spout that consumes messages from a single Kafka topic. In most cases, the Profiler topology will consume messages from the indexing topic. This topic contains fully enriched messages that are ready to be indexed. This ensures that profiles can take advantage of all the available data elements.
ProfileSplitterBolt - The bolt responsible for filtering incoming messages and directing each to the one or more downstream bolts that are responsible for building a profile. Each message may be needed by 0, 1 or even many profiles. Each emitted tuple contains the ‘resolved’ entity name, the profile definition, and the input message.
ProfileBuilderBolt - This bolt maintains all of the state required to build a profile. When the window period expires, the data is summarized as a ProfileMeasurement, all state is flushed, and the ProfileMeasurement is emitted. Each instance of this bolt is responsible for maintaining the state for a single Profile-Entity pair.
HBaseBolt - A bolt that is responsible for writing to HBase. Most profiles will be flushed every 15 minutes or so. If each ProfileBuilderBolt were responsible for writing to HBase itself, there would be little to no opportunity to optimize these writes. By aggregating the writes from multiple Profile-Entity pairs these writes can be batched, for example.
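The batching behavior described for the HBaseBolt, together with the profiler.hbase.batch and profiler.hbase.flush.interval.seconds settings, can be sketched as a small buffer that flushes on size or on elapsed time. This is an illustrative sketch, not the bolt's implementation: `BatchingWriter` is a hypothetical class, the sink is a plain callable, and the time check only happens when a write arrives.

```python
import time

class BatchingWriter:
    """Buffer writes and hand them to a sink in batches (hypothetical helper)."""

    def __init__(self, sink, batch_size, flush_interval_seconds, clock=time.monotonic):
        self.sink = sink
        self.batch_size = batch_size
        self.flush_interval = flush_interval_seconds
        self.clock = clock
        self.pending = []
        self.last_flush = clock()

    def write(self, put):
        self.pending.append(put)
        batch_full = len(self.pending) >= self.batch_size
        interval_elapsed = self.clock() - self.last_flush >= self.flush_interval
        if batch_full or interval_elapsed:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink(list(self.pending))   # one call to the sink per batch
            self.pending.clear()
        self.last_flush = self.clock()

# Usage with a fake clock so the behavior is deterministic.
batches, now = [], [0.0]
writer = BatchingWriter(batches.append, batch_size=3,
                        flush_interval_seconds=5, clock=lambda: now[0])
for i in range(4):
    writer.write(f"put-{i}")     # the third write fills a batch
now[0] = 6.0                     # more than 5 seconds since the last flush
writer.write("put-4")            # flushed because the interval elapsed
print(batches)
```

Measurements from many Profile-Entity pairs pass through the same buffer, which is what creates the opportunity to write them together.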