Indexing

Introduction

The indexing topology is a topology dedicated to taking the data from the enrichment topology that have been enriched and storing the data in one or more supported indices

HDFS as rolled text files, one JSON blob per line
Elasticsearch
Solr

By default, this topology writes out to both HDFS and one of Elasticsearch and Solr.

Minimal Assumptions for Message Structure

If a message is missing the source.type field, the message tuple will be failed and not written with an appropriate error indicated in the Storm UI and logs.

Indexing Architecture

Architecture

The indexing topology is extremely simple. Data is ingested into kafka and sent to

An indexing bolt configured to write to either elasticsearch or Solr
An indexing bolt configured to write to HDFS under /apps/metron/enrichment/indexed

By default, errors during indexing are sent back into the indexing kafka queue so that they can be indexed and archived.

Indexing Topology

The indexing topology as started by the $METRON_HOME/bin/start_elasticsearch_topology.sh or $METRON_HOME/bin/start_solr_topology.sh script uses a default of one executor per bolt. In a real production system, this should be customized by modifying the flux file in $METRON_HOME/flux/indexing/remote.yaml.

Add a parallelism field to the bolts to give Storm a parallelism hint for the various components. Give bolts which appear to be bottlenecks (e.g. the indexing bolt) a larger hint.
Add a parallelism field to the kafka spout which matches the number of partitions for the enrichment kafka queue.
Adjust the number of workers for the topology by adjusting the topology.workers field for the topology.

Finally, if workers and executors are new to you or you don't know where to modify the flux file, the following might be of use to you:

Rest endpoints

There are rest endpoints available to perform operations like start, stop, activate, deactivate on the indexing topologies.


`GET /api/v1/storm/indexing/batch`
`GET /api/v1/storm/indexing/batch/activate`
`GET /api/v1/storm/indexing/batch/deactivate`
`GET /api/v1/storm/indexing/batch/start`
`GET /api/v1/storm/indexing/batch/stop`
`GET /api/v1/storm/indexing/randomaccess`
`GET /api/v1/storm/indexing/randomaccess/activate`
`GET /api/v1/storm/indexing/randomaccess/deactivate`
`GET /api/v1/storm/indexing/randomaccess/start`
`GET /api/v1/storm/indexing/randomaccess/stop`