 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
 # Solr in Metron

 ## Table of Contents

 * [Introduction](#introduction)
 * [Configuration](#configuration)
 * [Installing](#installing)
 * [Enabling Solr](#enabling-solr)
 * [Schemas](#schemas)
 * [Collections](#collections)

 ## Introduction

 Metron ships with Solr 6.6.2 support. Solr Cloud can be used as the real-time portion of the datastore resulting from [metron-indexing](../metron-indexing/README.md).

 ## Configuration

 ### The Indexing Topology

 Solr is a viable option for indexing data in Metron and, similar to the Elasticsearch Writer, can be configured
 via the global config.  The following settings are possible as part of the global config:
 * `solr.zookeeper`
   * The zookeeper quorum associated with the SolrCloud instance.  This is a required field with no default.
 * `solr.commitPerBatch`
   * This is a boolean which defines whether the writer commits every batch.  The default is `true`.
   * _WARNING_: If you set this to `false`, then commits will happen based on the SolrClient's internal mechanism and
     worker failure *may* result in data being acknowledged in Storm but not written to Solr.
 * `solr.commit.soft`
   * This is a boolean which defines whether the writer makes a soft commit or a durable commit.  See the [Solr documentation](https://lucene.apache.org/solr/guide/6_6/near-real-time-searching.html#NearRealTimeSearching-AutoCommits).  The default is `false`.
   * _WARNING_: If you set this to `true`, then commits will happen based on the SolrClient's internal mechanism and
     worker failure *may* result in data being acknowledged in Storm but not written to Solr.
 * `solr.commit.waitSearcher`
   * This is a boolean which defines whether the writer blocks the commit until the data is available to search.  See the [Solr documentation](https://lucene.apache.org/solr/guide/6_6/near-real-time-searching.html#NearRealTimeSearching-AutoCommits).  The default is `true`.
   * _WARNING_: If you set this to `false`, then commits will happen based on the SolrClient's internal mechanism and
     worker failure *may* result in data being acknowledged in Storm but not written to Solr.
 * `solr.commit.waitFlush`
   * This is a boolean which defines whether the writer blocks the commit until the data is flushed.  See the [Solr documentation](https://lucene.apache.org/solr/guide/6_6/near-real-time-searching.html#NearRealTimeSearching-AutoCommits).  The default is `true`.
   * _WARNING_: If you set this to `false`, then commits will happen based on the SolrClient's internal mechanism and
     worker failure *may* result in data being acknowledged in Storm but not written to Solr.
 * `solr.collection`
   * The default Solr collection.  If unspecified, the name is `metron`.  By default, sensors write to a collection associated with the index name in the
   indexing config for that sensor.  If that index name is the empty string, then the default collection is used.
 * `solr.http.config`
   * This is a map which allows users to configure the Solr client's HTTP client.
   * Possible fields here are:
     * `socketTimeout` : Socket timeout measured in ms.  The socket is closed if a read takes longer than this to complete,
     and a `java.net.SocketTimeoutException: Read timed out` exception is thrown
     * `connTimeout` : Connection timeout measured in ms.  The socket is closed if a connection cannot be established within this time,
     with a `java.net.SocketTimeoutException: Connection timed out`
     * `maxConectionsPerHost` : Maximum connections allowed per host
     * `maxConnections` :  Maximum total connections allowed
     * `retry` : Retry http requests on error
     * `allowCompression` :  Allow compression (deflate,gzip) if server supports it
     * `followRedirects` : Follow redirects
     * `httpBasicAuthUser` : Basic auth username
     * `httpBasicAuthPassword` : Basic auth password
     * `solr.ssl.checkPeerName` : Check peer name
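
As a rough illustration of how the settings above fit together, the sketch below writes a Solr-related fragment of a global config to a scratch file.  The file location, the ZooKeeper address and the timeout values are assumptions chosen for the example, not recommended values, and the fragment would still need to be merged into your actual global config.

```shell
# Sketch only: write an illustrative Solr fragment of the global config to a
# scratch file.  The file location and every value below are assumptions.
SOLR_FRAGMENT="${TMPDIR:-/tmp}/solr_global_fragment.json"

cat > "$SOLR_FRAGMENT" <<'EOF'
{
  "solr.zookeeper": "node1:2181/solr",
  "solr.collection": "metron",
  "solr.commitPerBatch": true,
  "solr.commit.soft": false,
  "solr.commit.waitSearcher": true,
  "solr.commit.waitFlush": true,
  "solr.http.config": {
    "socketTimeout": 10000,
    "connTimeout": 5000
  }
}
EOF

# Print the fragment so it can be reviewed before it is merged anywhere.
cat "$SOLR_FRAGMENT"
```

Only `solr.zookeeper` is required; the commit settings shown here simply restate the documented defaults.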


 ## Installing

 Solr is installed in the [full dev environment for CentOS](../../metron-deployment/development/centos6) by default but is not started initially.  Navigate to `$METRON_HOME/bin`
 and start Solr Cloud by running `start_solr.sh`.

 Metron's Ambari MPack installs several scripts in `$METRON_HOME/bin` that can be used to manage Solr.  A script is also provided for installing Solr Cloud outside of full dev.
 The script performs the following tasks:

 * Stops ES and Kibana
 * Downloads Solr
 * Installs Solr
 * Starts Solr Cloud

 _Note: for details on setting up Solr Cloud in production mode, see https://lucene.apache.org/solr/guide/6_6/taking-solr-to-production.html_

 Navigate to `$METRON_HOME/bin` and spin up Solr Cloud by running `install_solr.sh`.  After running this script,
 Elasticsearch and Kibana will have been stopped and you should now have an instance of Solr Cloud up and running at http://localhost:8983/solr/#/~cloud.  This manner of starting Solr
 will also spin up an embedded Zookeeper instance at port 9983. More information can be found [here](https://lucene.apache.org/solr/guide/6_6/getting-started-with-solrcloud.html).

 Solr can also be installed using [HDP Search 3](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_solr-search-installation/content/ch_hdp_search_30.html).  HDP Search 3 sets the Zookeeper root to
 `/solr` so this will need to be added to each url in the comma-separated list in Ambari UI -> Services -> Metron -> Configs -> Index Settings -> Solr Zookeeper Urls.  For example, in full dev
 this would be `node1:2181/solr`.
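
For a multi-node quorum, appending the Zookeeper root to each URL by hand is error-prone.  The following is a minimal sketch of building the value for the Solr Zookeeper Urls setting; the hosts `node2` and `node3` are hypothetical and only illustrate the comma-separated case.

```shell
# Sketch: append the Zookeeper root to every host in a comma-separated list.
# node2 and node3 are hypothetical hosts used only for illustration.
ZK_HOSTS="node1:2181,node2:2181,node3:2181"
ZK_ROOT="/solr"

# Insert the root before each comma, then append it once more for the last host.
SOLR_ZK_URLS="$(printf '%s' "$ZK_HOSTS" | sed "s|,|${ZK_ROOT},|g")${ZK_ROOT}"

echo "$SOLR_ZK_URLS"
```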

 ## Enabling Solr

 Elasticsearch is the real-time store used by default in Metron.  Solr can be enabled by following these steps:

 1. Stop the Metron Indexing component in Ambari.
 1. Update Ambari UI -> Services -> Metron -> Configs -> Index Settings -> Solr Zookeeper Urls to match the Solr installation described in the previous section.
 1. Change Ambari UI -> Services -> Metron -> Configs -> Indexing -> Index Writer - Random Access -> Random Access Search Engine to `Solr`.
 1. Change Ambari UI -> Services -> Metron -> Configs -> REST -> Source Type Field Name to `source.type`.
 1. Change Ambari UI -> Services -> Metron -> Configs -> REST -> Threat Triage Score Field Name to `threat.triage.score`.
 1. Start the Metron Indexing component in Ambari.
 1. Restart Metron REST and the Alerts UI in Ambari.

 This will automatically create collections for the schemas shipped with Metron:

 * bro
 * snort
 * yaf
 * error (used internally by Metron)
 * metaalert (used internally by Metron)

 Any other collections must be created manually before starting the Indexing component.  Alerts should be present in the Alerts UI after enabling Solr.

 ## Schemas

 As of now, we have mapped out the schemas in `src/main/config/schema`.
 Ambari will eventually install these, but at the moment it's manual.
 Refer to the Solr documentation [here](https://lucene.apache.org/solr/guide/6_6) in general
 and [here](https://lucene.apache.org/solr/guide/6_6/documents-fields-and-schema-design.html) if you'd like to know more about schemas in Solr.

 In Metron's Solr DAO implementation, document updates involve reading a document, applying the update and replacing the original by reindexing the whole document.
 Indexing LatLonType and PointType field types stores data in internal fields that should not be returned in search results.  For these fields a dynamic field type matching the suffix needs to be added to store the data points.
 Solr 6+ comes with a new LatLonPointSpatialField field type that should be used instead of LatLonType if possible.  Otherwise, a LatLonType field should be defined as:
 ```
 <dynamicField name="*.location_point" type="location" multiValued="false" docValues="false"/>
 <dynamicField name="*_coordinate" type="pdouble" indexed="true" stored="false" docValues="false"/>
 <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
 ```
 A PointType field should be defined as:
 ```
 <dynamicField name="*.point" type="point" multiValued="false" docValues="false"/>
 <dynamicField name="*_point" type="pdouble" indexed="true" stored="false" docValues="false"/>
 <fieldType name="point" class="solr.PointType" subFieldSuffix="_point"/>
 ```
 If any copy fields are defined, `stored` and `docValues` should be set to `false` on them.

 ## Collections

 Convenience scripts are provided with Metron to create and delete collections.  Ambari uses these scripts to automatically create collections.  To use them outside of Ambari, a few environment variables must be set first:
 ```
 # Path to the zookeeper node used by Solr
 export ZOOKEEPER=node1:2181/solr
 # Set to true if Kerberos is enabled
 export SECURITY_ENABLED=true
 ```
 The scripts can then be called directly with the collection name as the first argument.  For example, to create the bro collection:
 ```
 $METRON_HOME/bin/create_collection.sh bro
 ```
 To delete the bro collection:
 ```
 $METRON_HOME/bin/delete_collection.sh bro
 ```
 The `create_collection.sh` script depends on schemas installed in `$METRON_HOME/config/schema`.  There are several schemas that come with Metron:

 * bro
 * snort
 * yaf
 * metaalert
 * error

 Additional schemas should be installed in that location if using the `create_collection.sh` script.  Any collection can be deleted with the `delete_collection.sh` script.
 These scripts use the [Solr Collection API](http://lucene.apache.org/solr/guide/6_6/collections-api.html).
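
Because `create_collection.sh` depends on a schema being installed, it can help to check which schemas are present before creating several collections.  The sketch below is one possible way to do that; it assumes each schema lives in its own directory under the schema root, and `plan_collections` and the `squid` sensor are hypothetical names introduced for the example.

```shell
# Sketch: report which collections have a schema available under a schema root.
# Assumption: each schema is installed as a directory named after the collection.
plan_collections() {
  schema_root="$1"
  shift
  for name in "$@"; do
    if [ -d "$schema_root/$name" ]; then
      # A schema is present, so create_collection.sh could be called for it.
      echo "create $name"
    else
      echo "missing-schema $name"
    fi
  done
}

# Demo against a throwaway directory standing in for $METRON_HOME/config/schema.
DEMO_SCHEMA_ROOT="$(mktemp -d)"
mkdir "$DEMO_SCHEMA_ROOT/bro" "$DEMO_SCHEMA_ROOT/snort"
plan_collections "$DEMO_SCHEMA_ROOT" bro snort squid
```

On a real installation, the `create` branch could call `$METRON_HOME/bin/create_collection.sh "$name"` instead of printing, with `ZOOKEEPER` and `SECURITY_ENABLED` exported as described above.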