 # Statistics and Mathematical Functions

 A variety of non-trivial and advanced analytics make use of statistics
 and advanced mathematical functions.  In particular, capturing
 statistical snapshots in a scalable way can open the door to more
 advanced analytics such as outlier analysis.  As such, this project
 aims to capture a robust set of statistical functions and
 statistics-based algorithms in the form of Stellar functions.  These
 functions can be used anywhere Stellar is used.

 ## Stellar Functions

 ### Approximation Statistics

 #### `HLLP_ADD`
   * Description: Add value to the HyperLogLogPlus estimator set. See [HLLP README](HLLP.md)
   * Input:
     * hyperLogLogPlus - the hllp estimator to add a value to
     * value+ - value to add to the set. Takes a single item or a list.
   * Returns: The HyperLogLogPlus set with a new value added

 #### `HLLP_CARDINALITY`
   * Description: Returns HyperLogLogPlus-estimated cardinality for this set. See [HLLP README](HLLP.md)
   * Input:
     * hyperLogLogPlus - the hllp set
   * Returns: Long value representing the cardinality for this set

 #### `HLLP_INIT`
   * Description: Initializes the HyperLogLogPlus estimator set. p must be a value between 4 and sp and sp must be less than 32 and greater than 4. See [HLLP README](HLLP.md)
   * Input:
     * p - the precision value for the normal set
     * sp - the precision value for the sparse set. If p is set, but sp is 0 or not specified, the sparse set will be disabled.
   * Returns: A new HyperLogLogPlus set

 #### `HLLP_MERGE`
   * Description: Merge hllp sets together. The resulting estimator is initialized with p and sp precision values from the first provided hllp estimator set. See [HLLP README](HLLP.md)
   * Input:
     * hllp - List of hllp estimators to merge. Takes a single hllp set or a list.
   * Returns: A new merged HyperLogLogPlus estimator set

 ### Mathematical Functions

 #### `ABS`
 * Description: Returns the absolute value of a number.
 * Input:
   * number - The number to take the absolute value of
 * Returns: The absolute value of the number passed in.

 #### `BIN`
 * Description: Computes the bin that the value is in given a set of bounds.
 * Input:
   * value - The value to bin
   * bounds - A list of value bounds (excluding min and max) in sorted order.
 * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N).  No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
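
The binning rule can be illustrated outside of Stellar. The following is a minimal Python sketch of the same semantics, not Metron's implementation; the name `bin_index` is hypothetical. It relies on `bisect_left`, which returns the first index whose bound is greater than or equal to the value.

```python
from bisect import bisect_left

def bin_index(value, bounds):
    """Return N such that bounds[N-1] < value <= bounds[N].

    `bounds` must be sorted in ascending order.  Values at or below the
    first bound land in bin 0; values above the last bound land in
    bin len(bounds).
    """
    return bisect_left(bounds, value)

bounds = [10, 20, 30]
for v in (5, 10, 25, 31):
    print(v, bin_index(v, bounds))
```

With three bounds there are four possible bins, numbered 0 through 3.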


 ### Distributional Statistics

 #### `STATS_ADD`
   * Description: Adds one or more input values to those that are used to calculate the summary statistics.
   * Input:
     * stats - The Stellar statistics object.  If null, then a new one is initialized.
     * value+ - One or more numbers to add
   * Returns: A Stellar statistics object

 #### `STATS_BIN`
   * Description: Computes the bin that the value is in based on the statistical distribution.
   * Input:
     * stats - The Stellar statistics object
     * value - The value to bin
     * bounds? - A list of percentile bin bounds (excluding min and max) or a string representing a known and common set of bins.  For convenience, we have provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg. If this argument is omitted, then we assume a Quartile bin split.
   * Returns: Which bin N the value falls in such that bound(N-1) < value <= bound(N). No min and max bounds are provided, so values smaller than the 0'th bound go in the 0'th bin, and values greater than the last bound go in the M'th bin.
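
As a rough illustration of binning against a distribution, the sketch below derives quartile bounds from sample data with Python's `statistics.quantiles` and then applies the same rule as `BIN`. It is an approximation under stated assumptions: Metron estimates percentiles from its own statistics object, so the exact cut points may differ, and `quartile_bin` is a hypothetical name.

```python
from bisect import bisect_left
from statistics import quantiles

def quartile_bin(samples, value):
    """Bin `value` against the quartile cut points of `samples` (bins 0..3)."""
    # Three cut points: roughly the 25th, 50th and 75th percentiles.
    bounds = quantiles(samples, n=4)
    return bisect_left(bounds, value)

samples = list(range(1, 101))
for v in (0, 30, 60, 1000):
    print(v, quartile_bin(samples, v))
```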

 #### `STATS_COUNT`
   * Description: Calculates the count of the values accumulated (or in the window if a window is used).
   * Input:
     * stats - The Stellar statistics object
   * Returns: The count of the values in the window or NaN if the statistics object is null.

 #### `STATS_GEOMETRIC_MEAN`
   * Description: Calculates the geometric mean of the accumulated values (or in the window if a window is used). See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
   * Input:
     * stats - The Stellar statistics object
   * Returns: The geometric mean of the values in the window or NaN if the statistics object is null.

 #### `STATS_INIT`
   * Description: Initializes a statistics object
   * Input:
     * window_size - The number of input data values to maintain in a rolling window in memory.  If window_size is equal to 0, then no rolling window is maintained. Using no rolling window is less memory intensive, but cannot calculate certain statistics like percentiles and kurtosis.
   * Returns: A Stellar statistics object
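
The `window_size` trade-off can be sketched in Python. This is an illustration only, not Metron's implementation, and the class `RollingStats` is hypothetical: with a window the raw values are retained, so order statistics such as the median are available; without one, only running accumulators are kept.

```python
from collections import deque
import statistics

class RollingStats:
    """Toy model of a statistics object with an optional rolling window."""

    def __init__(self, window_size=0):
        # window_size == 0 means no values are retained, only accumulators.
        self.window = deque(maxlen=window_size) if window_size > 0 else None
        self.count = 0
        self.total = 0.0

    def add(self, x):
        if self.window is not None:
            self.window.append(x)  # the oldest value is dropped once the window is full
        else:
            self.count += 1
            self.total += x
        return self

    def mean(self):
        if self.window is not None:
            return statistics.mean(self.window) if self.window else float("nan")
        return self.total / self.count if self.count else float("nan")

    def median(self):
        if self.window is None:
            raise ValueError("median needs the retained values; use window_size > 0")
        return statistics.median(self.window)

windowed = RollingStats(window_size=3)
for x in (1, 2, 3, 4):
    windowed.add(x)
print(windowed.mean(), windowed.median())
```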

 #### `STATS_KURTOSIS`
   * Description: Calculates the kurtosis of the accumulated values (or in the window if a window is used).  See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
   * Input:
     * stats - The Stellar statistics object
   * Returns: The kurtosis of the values in the window or NaN if the statistics object is null.

 #### `STATS_MAX`
   * Description: Calculates the maximum of the accumulated values (or in the window if a window is used).
   * Input:
     * stats - The Stellar statistics object
   * Returns: The maximum of the accumulated values in the window or NaN if the statistics object is null.

 #### `STATS_MEAN`
   * Description: Calculates the mean of the accumulated values (or in the window if a window is used).
   * Input:
     * stats - The Stellar statistics object
   * Returns: The mean of the values in the window or NaN if the statistics object is null.

 #### `STATS_MERGE`
   * Description: Merges statistics objects.
   * Input:
     * statistics - A list of statistics objects
   * Returns: A Stellar statistics object

 #### `STATS_MIN`
   * Description: Calculates the minimum of the accumulated values (or in the window if a window is used).
   * Input:
     * stats - The Stellar statistics object
   * Returns: The minimum of the accumulated values in the window or NaN if the statistics object is null.

 #### `STATS_PERCENTILE`
   * Description: Computes the p'th percentile of the accumulated values (or in the window if a window is used).
   * Input:
     * stats - The Stellar statistics object
     * p - a double where 0 <= p < 1 representing the percentile
   * Returns: The p'th percentile of the data or NaN if the statistics object is null

 #### `STATS_POPULATION_VARIANCE`
   * Description: Calculates the population variance of the accumulated values (or in the window if a window is used).  See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
   * Input:
     * stats - The Stellar statistics object
   * Returns: The population variance of the values in the window or NaN if the statistics object is null.

 #### `STATS_QUADRATIC_MEAN`
   * Description: Calculates the quadratic mean of the accumulated values (or in the window if a window is used).  See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
   * Input:
     * stats - The Stellar statistics object
   * Returns: The quadratic mean of the values in the window or NaN if the statistics object is null.

 #### `STATS_SD`
   * Description: Calculates the standard deviation of the accumulated values (or in the window if a window is used).  See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
   * Input:
     * stats - The Stellar statistics object
   * Returns: The standard deviation of the values in the window or NaN if the statistics object is null.

 #### `STATS_SKEWNESS`
   * Description: Calculates the skewness of the accumulated values (or in the window if a window is used).  See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
   * Input:
     * stats - The Stellar statistics object
   * Returns: The skewness of the values in the window or NaN if the statistics object is null.

 #### `STATS_SUM`
   * Description: Calculates the sum of the accumulated values (or in the window if a window is used).
   * Input:
     * stats - The Stellar statistics object
   * Returns: The sum of the values in the window or NaN if the statistics object is null.

 #### `STATS_SUM_LOGS`
   * Description: Calculates the sum of the (natural) log of the accumulated values (or in the window if a window is used).  See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
   * Input:
     * stats - The Stellar statistics object
   * Returns: The sum of the (natural) log of the values in the window or NaN if the statistics object is null.

 #### `STATS_SUM_SQUARES`
   * Description: Calculates the sum of the squares of the accumulated values (or in the window if a window is used).
   * Input:
     * stats - The Stellar statistics object
   * Returns: The sum of the squares of the values in the window or NaN if the statistics object is null.

 #### `STATS_VARIANCE`
   * Description: Calculates the variance of the accumulated values (or in the window if a window is used).  See http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
   * Input:
     * stats - The Stellar statistics object
   * Returns: The variance of the values in the window or NaN if the statistics object is null.


 ### Statistical Outlier Detection

 #### `OUTLIER_MAD_STATE_MERGE`
   * Description: Update the statistical state required to compute the Median Absolute Deviation.
   * Input:
     * [state] - A list of Median Absolute Deviation States to merge.  Generally these are states across time.
     * currentState? - The current state (optional)
   * Returns: The Median Absolute Deviation state

 #### `OUTLIER_MAD_ADD`
   * Description: Add a piece of data to the state.
   * Input:
     * state - The MAD state
     * value - The numeric value to add
   * Returns: The MAD state

 #### `OUTLIER_MAD_SCORE`
   * Description: Get the modified z-score normalized by the MAD: scale * | x_i - median(X) | / MAD.  See the first page of http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
   * Input:
     * state - The MAD state
     * value - The numeric value to score
     * scale? - Optionally the scale to use when computing the modified z-score.  Default is `0.6745`, see the first page of http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
   * Returns: The modified z-score

 # Outlier Analysis

 A common desire is to find anomalies in numerical data.  To that end,
 we have some simple statistical anomaly detectors.

 ## Median Absolute Deviation

 Much has been written about this robust estimator.  See the first page
 of http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
 for good coverage of the strengths and weaknesses of MAD.  The usage,
 however, is fairly straightforward:
 * Gather the statistical state required to compute the MAD
   * The distribution of the values of a univariate random variable over time.
   * The distribution of the absolute deviations of the values from the median.
 * Use this statistical state to score unseen values.  The higher the score, the more unlike the previously seen data the value is.
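
The two steps above can be sketched in Python with exact medians. This is a minimal illustration of the scoring formula `scale * | x_i - median(X) | / MAD`, not Metron's implementation, which works from an approximate sketch; the function name `mad_score` and the handling of a zero MAD are assumptions of this example.

```python
import statistics

def mad_score(history, value, scale=0.6745):
    """Modified z-score of `value` relative to previously seen `history`."""
    med = statistics.median(history)
    # Median of the absolute deviations from the median.
    mad = statistics.median(abs(x - med) for x in history)
    if mad == 0:
        return float("nan")  # assumption of this sketch: undefined when all deviations are zero
    return scale * abs(value - med) / mad

history = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for v in (5, 7, 1000):
    print(v, mad_score(history, v))
```

A value near the median scores close to zero, while a value far from the previously seen data scores well above a cutoff such as 3.5.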

 There are a couple of issues which make MAD a bit hard to compute.
 First, the statistical state requires the median, which can be
 computationally expensive to compute exactly.  To get around this, we
 use the OnlineStatisticalProvider to compute a sketch rather than the
 exact median.  Second, the statistical state for seasonal data should
 be limited to a fixed, trailing window.  We do this by ensuring that the
 MAD state is mergeable and able to be queried from within the Profiler.

 ### Example

 We will create a dummy data stream of Gaussian noise to illustrate how
 to use the MAD functionality along with the profiler to tag messages as
 outliers or not.

 To do this, we will create a
 * data generator
 * parser
 * profiler profile
 * enrichment and threat triage

 #### Data Generator

 We can create a simple Python script that generates a stream of Gaussian
 noise at a fixed interval (for example, one message per second).  Save it
 at `~/rand_gen.py`:
 ```
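 #!/usr/bin/python
 import random
 import sys
 import time

 def main():
   # Arguments: mean, standard deviation, seconds between messages
   mu = float(sys.argv[1])
   sigma = float(sys.argv[2])
   freq_s = int(sys.argv[3])
   while True:
     # print() with a single argument behaves the same on Python 2 and 3
     print(str(random.gauss(mu, sigma)))
     sys.stdout.flush()
     time.sleep(freq_s)

 if __name__ == '__main__':
   main()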
 ```

 This script will take the following as arguments:
 * The mean of the data generated
 * The standard deviation of the data generated
 * The frequency (in seconds) of the data generated

 If, however, you'd like to test a longer-tailed distribution, such as
 Student's t-distribution, and have numpy installed, you can use the
 following as `~/rand_gen.py`:
 ```
 #!/usr/bin/python
 import sys
 import time
 import numpy as np

 def main():
   # Arguments: degrees of freedom, seconds between messages
   df = float(sys.argv[1])
   freq_s = int(sys.argv[2])
   while True:
     # print() with a single argument behaves the same on Python 2 and 3
     print(str(np.random.standard_t(df)))
     sys.stdout.flush()
     time.sleep(freq_s)

 if __name__ == '__main__':
   main()
 ```

 This script will take the following as arguments:
 * The degrees of freedom for the distribution
 * The frequency (in seconds) of the data generated

 #### The Parser

 We will use the `CSVParser` to create a parser that takes each single
 number and produces a message with a field called `value`.

 Add the following file to
 `$METRON_HOME/config/zookeeper/parsers/mad.json`:
 ```
 {
   "parserClassName" : "org.apache.metron.parsers.csv.CSVParser"
  ,"sensorTopic" : "mad"
  ,"parserConfig" : {
     "columns" : {
       "value_str" : 0
                 }
                    }
  ,"fieldTransformations" : [
     {
     "transformation" : "STELLAR"
    ,"output" : [ "value" ]
    ,"config" : {
       "value" : "TO_DOUBLE(value_str)"
                }
     }
                            ]
 }
 ```
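
To make the effect of this configuration concrete, the following Python sketch mimics what happens to one input line: column 0 becomes `value_str`, and the Stellar transformation `TO_DOUBLE(value_str)` is approximated with `float()`. It is a simplified model; the function `parse_line` is hypothetical, and any other fields a real parser adds to the message are omitted.

```python
def parse_line(line):
    """Approximate the parser config: CSV column 0 -> value_str -> value."""
    # "columns": {"value_str": 0} maps the first CSV column to `value_str`.
    message = {"value_str": line.strip().split(",")[0]}
    # "value": "TO_DOUBLE(value_str)" converts the string to a number.
    message["value"] = float(message["value_str"])
    return message

print(parse_line("0.42\n"))
```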

 #### Enrichment and Threat Intel

 We will set a threat triage level of `10` if a message generates an outlier score of more than 3.5.
 This cutoff will depend on your data and should be adjusted based on the
 assumed underlying distribution.  Note that under the assumption of
 normality, MAD will act as a robust estimator of the standard deviation, so the cutoff
 should be considered the number of standard deviations away.  For other
 distributions, there are other interpretations which will make sense in
 the context of measuring the "degree different".  See
 http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/
 for a brief discussion of this.

 Create the following in
 `$METRON_HOME/config/zookeeper/enrichments/mad.json`:

 ```
 {
   "index": "mad",
   "batchSize": 1,
   "enrichment": {
     "fieldMap": {
       "stellar" : {
         "config" : {
           "parser_score" : "OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad', 'global', PROFILE_FIXED(10, 'MINUTES'))), value)"
          ,"is_alert" : "if parser_score > 3.5 then true else is_alert"
         }
       }
     }
   ,"fieldToTypeMap": { }
   },
   "threatIntel": {
     "fieldMap": { },
     "fieldToTypeMap": { },
     "triageConfig" : {
       "riskLevelRules" : [
         {
           "rule" : "parser_score > 3.5",
           "score" : 10
         }
       ],
       "aggregator" : "MAX"
     }
   }
 }
 ```
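
The logic expressed by this configuration can be sketched in Python. This is an illustration of the two enrichment expressions and the single triage rule, not Metron's code; the function names are hypothetical, and returning `None` when no rule matches is an assumption of the sketch.

```python
THRESHOLD = 3.5

def enrich(message, parser_score):
    """Mimic the two Stellar enrichment expressions for one message."""
    enriched = dict(message)
    enriched["parser_score"] = parser_score
    # "if parser_score > 3.5 then true else is_alert": otherwise keep any earlier flag.
    enriched["is_alert"] = True if parser_score > THRESHOLD else message.get("is_alert")
    return enriched

def triage_level(message, rules):
    """Apply (predicate, score) rules and combine matching scores with MAX."""
    scores = [score for predicate, score in rules if predicate(message)]
    return max(scores) if scores else None  # None for "no rule matched" is an assumption

rules = [(lambda m: m["parser_score"] > THRESHOLD, 10)]
outlier = enrich({"value": 1000.0}, parser_score=335.6)
normal = enrich({"value": 0.3}, parser_score=0.2)
print(outlier["is_alert"], triage_level(outlier, rules))
print(normal["is_alert"], triage_level(normal, rules))
```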

 #### The Profiler

 We can set up the profiler to track the MAD statistical state required
 to compute MAD.  For the purposes of this demonstration, we will
 configure the profiler to capture statistics on the minute mark.  We
 will capture a global statistical state for the `value` field and we
 will look back for a 5 minute window when computing the median.

 Create the following file at
 `$METRON_HOME/config/zookeeper/profiler.json`:

 ```
 {
   "profiles": [
     {
       "profile": "sketchy_mad",
       "foreach": "'global'",
       "onlyif": "true",
       "init" : {
         "s": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad', 'global', PROFILE_FIXED(5, 'MINUTES')))"
                },
       "update": {
         "s": "OUTLIER_MAD_ADD(s, value)"
                 },
       "result": "s"
     }
   ]
 }
 ```
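
The `init`, `update`, and `result` steps of this profile can be modeled in a few lines of Python. The sketch below is a simplified reading of the configuration, not the Profiler's implementation: each period's state is a plain list of values rather than a mergeable sketch, the lookback is counted in periods, and `run_profile` is a hypothetical name.

```python
from collections import deque

def run_profile(periods, lookback=5):
    """Simulate the profile over `periods`, a list of per-period value lists."""
    stored = deque(maxlen=lookback)  # results still inside the trailing lookback
    for values in periods:
        # init: seed the state by merging the results from the lookback window
        s = [v for state in stored for v in state]
        # update: add each message's value to the state
        for v in values:
            s.append(v)
        # result: persist the state for this period
        stored.append(s)
    return list(stored)

print(run_profile([[1], [2]]))
```

In this model, the state written for a period already contains the values merged in at `init`, so earlier observations carry forward into later periods until they fall outside the lookback.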

 Edit `$METRON_HOME/config/zookeeper/global.json` to set the capture duration:
 ```
  "profiler.client.period.duration" : "1",
  "profiler.client.period.duration.units" : "MINUTES"
 ```

 Edit `$METRON_HOME/config/profiler.properties` to set the capture
 duration by changing `profiler.period.duration=15` to `profiler.period.duration=1`.

 #### Execute the Flow

 1. Install the elasticsearch head plugin by executing:
 `/usr/share/elasticsearch/bin/plugin install mobz/elasticsearch-head`

 2. Stop all other parser topologies via monit

 3. Create the `mad` kafka topic by executing:
 `/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper node1:2181 --create --topic mad --partitions 1 --replication-factor 1`

 4. Push the modified configs by executing:
 `$METRON_HOME/bin/zk_load_configs.sh --mode PUSH -z node1:2181 -i $METRON_HOME/config/zookeeper/`

 5. Start the profiler by executing:
 `$METRON_HOME/bin/start_profiler_topology.sh`

 6. Start the parser topology by executing:
 `$METRON_HOME/bin/start_parser_topology.sh -k node1:6667 -z node1:2181 -s mad`

 7. Ensure that the enrichment and indexing topologies are started.  If not, then start those via monit or by hand.

 8. Generate data into kafka by executing the following for at least 10 minutes:
 `~/rand_gen.py 0 1 1 | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list node1:6667 --topic mad`
 Note: if you chose to use the t-distribution script above, adjust the parameters of the `rand_gen.py` script accordingly.

 9. Stop the above with ctrl-c and send an obvious outlier into kafka:
 `echo "1000" | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list node1:6667 --topic mad`

 You should be able to find the outlier via the elasticsearch head plugin by
 searching for the messages where `is_alert` is `true`.