| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| # Performance Utilities |
| |
| This project provides utilities for monitoring and measuring |
| performance. |
| |
| ## `load_tool.sh` |
| |
| The load tool is intended to do the following: |
| * Generate load into Kafka at a specified rate of events per second |
|   * Messages are taken from a template file containing one message template per line |
|   * The load can be biased (e.g. 80% of the load can come from 20% of the templates) |
| * Monitor the Kafka offsets for a topic to determine the events per second written |
|   * This could be the topic that you are generating load on |
|   * This could be another topic that represents the output of some topology (e.g. generate load on `enrichments` and monitor `indexing` to determine the throughput of the enrichment topology) |
| |
| ``` |
| usage: Generator |
| -bs,--sample_bias <BIAS_FILE> The discrete distribution to bias |
| the sampling. This is a CSV of 2 |
| columns. The first column is the % |
| of the templates and the 2nd column |
| is the probability (0-100) that |
| it's chosen. For instance: |
| 20,80 |
| 80,20 |
| implies that 20% of the templates |
| will comprise 80% of the output and |
| the remaining 80% of the templates |
| will comprise 20% of the output. |
| -c,--csv <CSV_FILE> A CSV file to emit monitoring data |
| to. The format is a CSV with the |
| following schema: timestamp, (name, |
| eps, historical_mean, |
| historical_stddev)+ |
| -cg,--consumer_group <GROUP_ID> Consumer Group. The default is |
| load.group |
| -e,--eps <EPS> The target events per second |
| -h,--help Generate Help screen |
| -k,--kafka_config <CONFIG_FILE> The kafka config. This is a file |
| containing a JSON map with the |
| kafka config. |
| -l,--lookback <LOOKBACK> When summarizing, how many |
| monitoring periods should we |
| summarize over? If 0, then no |
| summary. Default: 5 |
| -md,--monitor_delta_ms <TIME_IN_MS> The time (in ms) between monitoring |
| output. Default is 10000 |
| -mt,--monitor_topic <TOPIC> The kafka topic to monitor. |
| -ot,--output_topic <TOPIC> The kafka topic to write to |
| -p,--threads <NUM_THREADS> The number of threads to use when |
| extracting data. The default is |
| the number of cores of your |
| machine. |
| -sd,--send_delta_ms <TIME_IN_MS> The time (in ms) between sending a |
| batch of messages. Default is 100 |
| -t,--template <TEMPLATE_FILE> The template file to use for |
| generation. This should be a file |
| with a template per line with |
| $METRON_TS and $METRON_GUID in the |
| spots for timestamp and guid, if |
| you so desire them. |
| -tl,--time_limit_ms <MS> The total amount of time to run |
| this in milliseconds. By default, |
| it never stops. |
| -z,--zk_quorum <QUORUM> zookeeper quorum |
| ``` |
| |
| ## Templates |
| Messages are drawn from a template file. A template file has a message template per line. |
| For instance, let's say we want to generate JSON maps with fields: `source.type`, `ip_src_addr` |
| and `ip_dst_addr`. We can generate a template file with a template like the following per line: |
| ``` |
| { "source.type" : "asa", "ip_src_addr" : "127.0.0.1", "ip_dst_addr" : "191.168.1.1" } |
| ``` |
| |
| When messages are generated, there are some special replacements that can be used: `$METRON_TS` and `$METRON_GUID`. |
| We can adjust our previous template to use these like so: |
| ``` |
| { "source.type" : "asa", "ip_src_addr" : "127.0.0.1", "ip_dst_addr" : "191.168.1.1", "timestamp" : $METRON_TS, "guid" : "$METRON_GUID" } |
| ``` |
| A note about the generated GUIDs: they are not globally unique UUIDs. They are unique only within the context of a given generator run. |
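
The replacement step can be pictured with a short Python sketch. This is not the load tool's implementation: `render_template`, the counter-based GUID scheme, and the use of epoch milliseconds for `$METRON_TS` are assumptions made for illustration, chosen to mirror the note that GUIDs are unique only within a single run.

```python
import itertools
import json
import time
from typing import Optional

# Hypothetical run-local counter: values are unique within this process only,
# mirroring the note that generated GUIDs are not globally unique.
_guid_counter = itertools.count()


def render_template(template: str, now_ms: Optional[int] = None) -> str:
    """Fill $METRON_TS and $METRON_GUID in a single template line."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # assumption: epoch milliseconds
    guid = f"guid-{next(_guid_counter)}"
    return template.replace("$METRON_TS", str(now_ms)).replace("$METRON_GUID", guid)


template = (
    '{ "source.type" : "asa", "ip_src_addr" : "127.0.0.1", '
    '"ip_dst_addr" : "191.168.1.1", "timestamp" : $METRON_TS, '
    '"guid" : "$METRON_GUID" }'
)

# After substitution the line should parse as a JSON map.
message = json.loads(render_template(template, now_ms=1520506955047))
print(message["timestamp"], message["guid"])
```

Note that `$METRON_TS` is unquoted in the template, so it becomes a JSON number, while `$METRON_GUID` sits inside quotes and becomes a JSON string.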
| |
| ## Biased Sampling |
| |
| The load tool can be configured to use biased sampling. This is useful when you are trying to model data that is not uniformly distributed, |
| as is the case for many types of network data. Generating synthetic data with a distribution similar to that of your real data exercises the caches |
| in the same way and yields a more realistic scenario. |
| |
| You specify the biases in a CSV file with 2 columns: |
| * The first column represents the % of the templates |
| * The second column represents the % of the generated output. |
| |
| A simple example would be to generate samples based on Pareto's principle: |
| ``` |
| 20,80 |
| 80,20 |
| ``` |
| With these biases, the first 20% of the templates in the template file would comprise 80% of the output, and the remaining 80% of the templates would comprise the other 20%. |
| |
| A more complex example might be: |
| ``` |
| 20,80 |
| 20,5 |
| 50,1 |
| 10,14 |
| ``` |
| This would imply: |
| * The first 20% of the templates would comprise 80% of the output |
| * The next 20% of the templates would comprise 5% of the output |
| * The next 50% of the templates would comprise 1% of the output |
| * The next 10% of the templates would comprise 14% of the output. |
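
One way to read the bias file is as a two-stage draw: pick a row in proportion to its share of the output, then pick uniformly among the templates in that row's slice of the template file. The Python sketch below illustrates that reading. It is an assumption about the semantics, not the load tool's code, and `parse_bias` and `biased_choice` are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple


def parse_bias(csv_text: str) -> List[Tuple[float, float]]:
    """Parse 'template %,output %' rows into (template_pct, output_pct) pairs."""
    rows = []
    for line in csv_text.strip().splitlines():
        template_pct, output_pct = (float(x) for x in line.split(","))
        rows.append((template_pct, output_pct))
    return rows


def biased_choice(templates: Sequence[str],
                  bias: List[Tuple[float, float]],
                  rng: random.Random) -> str:
    """Draw one template according to the bias rows."""
    # Stage 1: choose a row with probability proportional to its output share.
    weights = [output_pct for _, output_pct in bias]
    idx = rng.choices(range(len(bias)), weights=weights, k=1)[0]
    # Stage 2: map the row's cumulative template percentages to an index range.
    start_pct = sum(template_pct for template_pct, _ in bias[:idx])
    end_pct = start_pct + bias[idx][0]
    n = len(templates)
    lo = min(int(n * start_pct / 100), n - 1)
    hi = min(max(lo + 1, int(n * end_pct / 100)), n)
    return templates[rng.randrange(lo, hi)]


templates = [f"template-{i}" for i in range(10)]
bias = parse_bias("20,80\n80,20")
rng = random.Random(0)
samples = [biased_choice(templates, bias, rng) for _ in range(10_000)]
hot = sum(s in templates[:2] for s in samples) / len(samples)
print(f"share of output from the first 20% of templates: {hot:.2f}")
```

With the Pareto-style bias, the printed share should land close to 0.80. The exact value depends on the random seed.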
| |
| ## CSV Output |
| |
| If you would prefer a different visualization, or wish to incorporate the output of this tool into an automated test, |
| you can use the `-c` or `--csv` option to specify a file to which the monitoring data is written in CSV format. |
| |
| The CSV columns are as follows: |
| * timestamp in epoch millis |
| |
| If you are generating synthetic data, then: |
| * "generated" |
| * The events per second generated |
| * The mean of the events per second generated for the last `k` monitoring periods, where `k` is the lookback (set via `-l` and defaulted to `5`) |
| * The standard deviation of the events per second generated for the last `k` monitoring periods, where `k` is the lookback (set via `-l` and defaulted to `5`) |
| |
| If you are monitoring a topic, then: |
| * "throughput measured" |
| * The events per second measured |
| * The mean of the events per second measured for the last `k` monitoring periods, where `k` is the lookback (set via `-l` and defaulted to `5`) |
| * The standard deviation of the events per second measured for the last `k` monitoring periods, where `k` is the lookback (set via `-l` and defaulted to `5`) |
| |
| If you are both generating load and monitoring the throughput of a topic, both groups of columns are emitted, with the generated columns first. |
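
The mean and standard deviation columns can be thought of as a rolling summary over the last `k` values. The Python sketch below is an assumption about the arithmetic, not the tool's code: it uses a sample standard deviation and truncates both statistics to integers, a combination that happens to match the example output in this section. `summarize` is a hypothetical name, and a lookback of `0` (no summary) is not modeled.

```python
import statistics
from collections import deque
from typing import Deque, List, Tuple


def summarize(eps_values: List[int], lookback: int = 5) -> List[Tuple[int, int, int]]:
    """Return (eps, rolling_mean, rolling_stddev) for each monitoring period.

    Assumes lookback >= 1. Statistics are taken over at most the last
    `lookback` values and truncated to integers.
    """
    window: Deque[int] = deque(maxlen=lookback)
    summary = []
    for eps in eps_values:
        window.append(eps)
        values = list(window)
        mean = statistics.mean(values)
        # A sample standard deviation needs at least two points.
        stddev = statistics.stdev(values) if len(values) > 1 else 0.0
        summary.append((eps, int(mean), int(stddev)))
    return summary


# The generated events-per-second values from the example output.
for row in summarize([1045, 1000, 999, 1000, 1000]):
    print(row)
```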
| |
| An example of CSV output is: |
| ``` |
| 1520506955047,generated,,,,throughput measured,,, |
| 1520506964896,generated,1045,1045,0,throughput measured,,, |
| 1520506974896,generated,1000,1022,31,throughput measured,1002,1002,0 |
| 1520506984904,generated,999,1014,26,throughput measured,999,1000,2 |
| 1520506994896,generated,1000,1011,22,throughput measured,1000,1000,1 |
| 1520507004896,generated,1000,1008,20,throughput measured,1000,1000,1 |
| ``` |
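
For use in an automated test, the output above can be read back into named fields. The following Python sketch assumes only the schema described in this section (a timestamp followed by repeating groups of name, eps, mean and standard deviation). `parse_monitor_csv` is a hypothetical helper, and empty fields are mapped to `None`.

```python
import csv
import io
from typing import Dict, List, Optional


def _num(field: str) -> Optional[float]:
    """Convert a CSV field to a number, or None when the field is empty."""
    return float(field) if field else None


def parse_monitor_csv(text: str) -> List[Dict[str, object]]:
    """Turn load tool CSV rows into dicts keyed by metric name."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if not rec:
            continue
        row: Dict[str, object] = {"timestamp": int(rec[0])}
        # After the timestamp, fields repeat in groups of four:
        # name, eps, historical_mean, historical_stddev.
        for i in range(1, len(rec) - 3, 4):
            name, eps, mean, stddev = rec[i:i + 4]
            row[name] = {"eps": _num(eps), "mean": _num(mean), "stddev": _num(stddev)}
        rows.append(row)
    return rows


sample = """\
1520506955047,generated,,,,throughput measured,,,
1520506964896,generated,1045,1045,0,throughput measured,,,
1520506974896,generated,1000,1022,31,throughput measured,1002,1002,0
"""

rows = parse_monitor_csv(sample)
for row in rows:
    print(row["timestamp"], row["generated"]["eps"], row["throughput measured"]["eps"])
```

A test could then assert, for example, that the measured events per second stays within some tolerance of the target rate once the first few periods have passed.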
| |
| ## Use-cases for the Load Tool |
| |
| ### Measure Throughput of a Topology |
| |
| The load tool can be used to monitor the performance of a Kafka-to-Kafka topology. |
| For instance, we could monitor the throughput of the enrichment topology by monitoring the `enrichments` Kafka topic: |
| ``` |
| $METRON_HOME/bin/load_tool.sh -mt enrichments -z $ZOOKEEPER |
| ``` |
| |
| ### Generate Synthetic Load and Measure Performance |
| |
| The load tool can also generate synthetic load while monitoring the performance of a Kafka-to-Kafka topology, such as the enrichment |
| topology. It is advisable to have the enrichment topology read from a new topic and write to a new topic so as not to pollute |
| your downstream indices. For instance, we could create a Kafka topic called `enrichments_load` by generating load on it, create |
| another Kafka topic called `indexing_load`, and configure the enrichment topology to output to it. We would then generate load |
| on `enrichments_load` and monitor `indexing_load`. |
| ``` |
| # Thread pool of size 5; use somewhere between 5 and 10 depending on the throughput you are trying to drive |
| # Messages are drawn from ~/dummy.templates, which has one message template per line |
| # Generate at a rate of 9000 messages per second |
| # Emit the monitoring data to the CSV file ~/measurements.csv |
| $METRON_HOME/bin/load_tool.sh -p 5 -ot enrichments_load -mt indexing_load -t ~/dummy.templates -e 9000 -z $ZOOKEEPER -c ~/measurements.csv |
| ``` |
| |
| Now, with the help of a bash function and gnuplot we can generate a plot |
| of the historical throughput measurements for `indexing_load`: |
| ``` |
| # Ensure that you have installed gnuplot and the liberation font package |
| # via yum install -y gnuplot liberation-sans-fonts |
| # We will define a plot function that will generate a png plot. It takes |
| # one arg, the output file. It expects to have a 2 column CSV streamed |
| # with the first dimension being the timestamp and the second dimension |
| # being what you want plotted. |
| plot() { |
| awk -F, '{printf "%d %d\n", $1/1000, $2} END { print "e" }' | gnuplot -e "reset;clear;set style fill solid 1.0 border -1; set nokey;set title 'Throughput Measured'; set xlabel 'Time'; set boxwidth 0.5; set xtics rotate; set ylabel 'events/sec';set xdata time; set timefmt '%s';set format x '%H:%M:%S';set term png enhanced font '/usr/share/fonts/liberation/LiberationSans-Regular.ttf' 12 size 900,400; set output '$1';plot '< cat -' using 1:2 with line lt -1 lw 2;" |
| } |
| |
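| # Reduce the CSV file to two comma-separated columns: the timestamp and |
| # column 8, which is the lookback mean of the measured throughput when both |
| # generating and monitoring (column 7 holds the per-period value). The plot |
| # function converts these to the space-separated form passed to gnuplot. |
| awk -F, '{printf "%d,%d\n", $1, $8 }' ~/measurements.csv | plot performance_measurement.png |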
| ``` |
| This writes a plot like the following to `performance_measurement.png`: |
|  |