| # Indexing |
| |
| ## Introduction |
| |
The `indexing` topology takes the enriched data produced by the
enrichment topology and stores it in one or more supported indices:
| * HDFS as rolled text files, one JSON blob per line |
| * Elasticsearch |
| * Solr |
| |
By default, this topology writes out to both HDFS and either
Elasticsearch or Solr.
| |
Indices are written in batches; the batch size and batch timeout are specified in the
[Sensor Indexing Configuration](#sensor-indexing-configuration) via the `batchSize` and `batchTimeout` parameters.
These settings are configurable per sensor type.
| |
| ## Indexing Architecture |
| |
|  |
| |
The indexing topology is extremely simple. Data is consumed from the `indexing` Kafka topic
and sent to:
* An indexing bolt configured to write to either Elasticsearch or Solr
* An indexing bolt configured to write to HDFS under `/apps/metron/enrichment/indexed`
| |
By default, errors during indexing are sent back into the `indexing` Kafka topic so that they can be indexed and archived.
| |
| ## Sensor Indexing Configuration |
The sensor-specific configuration is intended to configure the
indexing used for a given sensor type (e.g. `snort`).
| |
Just like the global config, the format is JSON, stored in ZooKeeper and on disk at `$METRON_HOME/config/zookeeper/indexing`. Within the sensor-specific configuration, you can configure the individual writers. The writers currently supported are:
| * `elasticsearch` |
| * `hdfs` |
| * `solr` |
| |
Depending on how you start the indexing topology, it will run either
the Elasticsearch or the Solr writer, along with the HDFS writer.
| |
The configuration for an individual writer is a JSON map with the following fields:
* `index` : The name of the index to write to (defaults to the name of the sensor).
| * `batchSize` : The size of the batch that is written to the indices at once. Defaults to `1` (no batching). |
* `batchTimeout` : The timeout after which a batch will be flushed even if `batchSize` has not been met. Optional.
If unspecified, or set to `0`, it defaults to a system-determined duration that is a fraction of the Storm
parameter `topology.message.timeout.secs`. Ignored if `batchSize` is `1`, since that disables batching.
| * `enabled` : Whether the writer is enabled (default `true`). |
| |
| ### Indexing Configuration Examples |
For a given sensor, the following cases illustrate the resulting
writer configurations:
| #### Base Case |
| ``` |
| { |
| } |
| ``` |
| or no file at all. |
| |
* elasticsearch writer
  * enabled
  * batch size of 1
  * batch timeout system default
  * index name the same as the sensor
* hdfs writer
  * enabled
  * batch size of 1
  * batch timeout system default
  * index name the same as the sensor
| |
If a writer config is unspecified, a warning is written to the
Storm console, e.g.:
`WARNING: Default and (likely) unoptimized writer config used for hdfs writer and sensor squid`
| |
| #### Fully specified |
| ``` |
| { |
| "elasticsearch": { |
| "index": "foo", |
| "batchSize" : 100, |
| "batchTimeout" : 0, |
| "enabled" : true |
| }, |
| "hdfs": { |
| "index": "foo", |
| "batchSize": 1, |
| "batchTimeout" : 0, |
| "enabled" : true |
| } |
| } |
| ``` |
* elasticsearch writer
  * enabled
  * batch size of 100
  * batch timeout system default
  * index name of "foo"
* hdfs writer
  * enabled
  * batch size of 1
  * batch timeout system default
  * index name of "foo"
| |
| #### HDFS Writer turned off |
| ``` |
| { |
| "elasticsearch": { |
| "index": "foo", |
| "enabled" : true |
| }, |
| "hdfs": { |
| "index": "foo", |
| "batchSize": 100, |
| "batchTimeout" : 0, |
| "enabled" : false |
| } |
| } |
| ``` |
* elasticsearch writer
  * enabled
  * batch size of 1
  * batch timeout system default
  * index name of "foo"
* hdfs writer
  * disabled
| |
| # Updates to Indexed Data |
| |
There are clear use cases where we would want to incorporate the capability to update indexed data.
Thus far, we have limited capabilities to support this use case:
* Updates to the random access index (e.g. Elasticsearch and Solr) should be supported
* Updates to the cold storage index (e.g. HDFS) are not currently supported; however, to support the batch
use case, updated documents will be provided in a NoSQL write-ahead log (e.g. an HBase table) and a Java API
will be provided to retrieve those updates scalably (i.e. a scan-free architecture).
| |
Put simply, the random access index will always be up-to-date, but the HDFS index will need to be
| joined to the NoSQL write-ahead log to get current updates. |
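
A minimal sketch of that join, assuming the write-ahead log stores full updated documents (the class and method names below are hypothetical, not part of Metron's API):

```
import java.util.Map;

final class CurrentView {
  // If the write-ahead log holds a newer version of the document, it wins
  // wholesale; otherwise the batch record read from HDFS stands.
  static Map<String, Object> of(Map<String, Object> hdfsRecord,
                                Map<String, Object> walUpdate) {
    return walUpdate != null ? walUpdate : hdfsRecord;
  }
}
```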
| |
| ## The `IndexDao` Abstraction |
| |
The indices mentioned above as part of the update capability should be pluggable by the developer, so that
new write-ahead logs or real-time indices can be supported by providing an implementation that supports
the data access patterns.
| |
| To support a new index, one would need to implement the `org.apache.metron.indexing.dao.IndexDao` abstraction |
| and provide update and search capabilities. IndexDaos may be composed and updates will be performed |
| in parallel. This enables a flexible strategy for specifying your backing store for updates at runtime. |
For instance, the REST API currently supports the update functionality and may be configured with a list of
IndexDao implementations to use for those updates.
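
As a rough sketch of what composition with parallel updates could look like (a simplified stand-in, not the actual interface: `org.apache.metron.indexing.dao.IndexDao` has more methods and different signatures, so consult the source tree before implementing):

```
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Simplified stand-in for the update and search capabilities an IndexDao
// must provide; the real interface differs.
interface SimpleIndexDao {
  void update(String guid, String sensorType, Map<String, Object> doc,
              Optional<String> index) throws IOException;
  Map<String, Object> getLatest(String guid, String sensorType) throws IOException;
}

// Composite DAO that fans each update out to all of its delegates in
// parallel, mirroring the composition behavior described above.
class CompositeIndexDao implements SimpleIndexDao {
  private final List<SimpleIndexDao> delegates;

  CompositeIndexDao(List<SimpleIndexDao> delegates) {
    this.delegates = delegates;
  }

  @Override
  public void update(String guid, String sensorType, Map<String, Object> doc,
                     Optional<String> index) {
    // Each backing store (e.g. Elasticsearch and an HBase write-ahead log)
    // is updated concurrently.
    delegates.parallelStream().forEach(dao -> {
      try {
        dao.update(guid, sensorType, doc, index);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }

  @Override
  public Map<String, Object> getLatest(String guid, String sensorType) throws IOException {
    // Reads are served from the first (random access) delegate.
    return delegates.get(0).getLatest(guid, sensorType);
  }
}
```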
| |
| # Notes on Performance Tuning |
| |
A default installation of Metron is untuned for production deployment. By far,
the component most likely to require attention from a performance
perspective is the indexing layer. An index that does not keep up will
back up, and you will see errors in the Kafka bolt. There
are a few knobs to tune to get the most out of your system.
| |
| ## Kafka Queue |
The `indexing` Kafka topic is a collection point from the enrichment
topology. As such, make sure that the number of partitions in
the Kafka topic is sufficient to handle the throughput that you expect.
| |
| ## Indexing Topology |
| The `indexing` topology as started by the `$METRON_HOME/bin/start_elasticsearch_topology.sh` |
| or `$METRON_HOME/bin/start_solr_topology.sh` |
| script uses a default of one executor per bolt. In a real production system, this should |
| be customized by modifying the flux file in |
| `$METRON_HOME/flux/indexing/remote.yaml`. |
* Add a `parallelism` field to the bolts to give Storm a parallelism
hint for the various components; give bolts which appear to be bottlenecks (e.g. the indexing bolt) a larger hint (see the illustrative fragment below).
* Add a `parallelism` field to the Kafka spout which matches the number of partitions for the `indexing` Kafka topic.
* Adjust the number of workers for the topology by adjusting the
`topology.workers` field for the topology.
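
As a rough illustration only (the component ids, class names, and values below are hypothetical, and the actual layout of `remote.yaml` may differ between versions), the relevant flux fields look like this:

```
# Hypothetical fragment; ids, class names, and values are illustrative.
spouts:
  - id: "kafkaSpout"
    className: "org.apache.storm.kafka.spout.KafkaSpout"
    parallelism: 3        # match the partition count of the indexing topic

bolts:
  - id: "indexingBolt"
    className: "org.apache.metron.writer.bolt.BulkMessageWriterBolt"
    parallelism: 4        # give bottleneck bolts a larger hint

config:
  topology.workers: 2
```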
| |
Finally, if workers and executors are new to you, or you don't know where
to modify the flux file, the following might be of use:
| * [Understanding the Parallelism of a Storm Topology](http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/) |
| * [Flux Docs](http://storm.apache.org/releases/current/flux.html) |
| |
| ## Zeppelin Notebooks |
Zeppelin notebooks can be added to `/src/main/config/zeppelin/` (subdirectories can be created for organization). The files must be `.json` files and be named appropriately.
These files must be added to the metron.spec file, and the RPMs rebuilt, for the notebooks to be available to load into Ambari.
| |
The notebook files can be found on the server in `$METRON_HOME/config/zeppelin`.
| |
The Ambari Management Pack has a custom action, ZEPPELIN_DASHBOARD_INSTALL, that imports these templates into Zeppelin.