The indexer-solr plugin sends documents from one or more segments to a Solr server. Index writers are configured in the conf/index-writers.xml file, which is included in the official Nutch distribution. The configuration for this writer has the following form:
<writer id="<writer_id>" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <mapping> ... </mapping>
  <parameters> ... </parameters>
</writer>
Each <writer> element has two mandatory attributes:
<writer_id> (the id attribute) is a unique identifier for each configuration. It lets Nutch distinguish between configurations, even when they are for the same index writer, so the same index writer can be instantiated several times with different configurations.
org.apache.nutch.indexwriter.solr.SolrIndexWriter (the class attribute) is the canonical name of the class that implements the IndexWriter extension point. This value should not be modified for the indexer-solr plugin.
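As an illustration of the point about multiple instances, the fragment below sketches two configurations of the same index writer. The ids indexer_solr_1 and indexer_solr_2 are hypothetical names chosen for this example, and the mapping and parameters bodies are left elided as in the template above.

```xml
<!-- Two configurations of the same index writer, told apart only by id. -->
<!-- Both id values are hypothetical; any unique strings would do. -->
<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <mapping> ... </mapping>
  <parameters> ... </parameters>
</writer>
<writer id="indexer_solr_2" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <mapping> ... </mapping>
  <parameters> ... </parameters>
</writer>
```

Each writer can then carry its own parameters, for example a different url value.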
The mapping section is documented separately; its structure is the same for all index writers.
Each parameter has the form <param name="<name>" value="<value>"/>. The parameters for this index writer are:
Parameter Name | Description | Default value |
---|---|---|
type | Specifies the SolrClient implementation to use. Must be one of the strings cloud or http, which select CloudSolrServer or HttpSolrServer respectively. | http |
url | Defines the fully qualified URL of the Solr instance into which data should be indexed. Multiple URLs can be provided, separated by commas. When type is cloud, the URL should not include any collection or core, only the root Solr path. | http://localhost:8983/solr/nutch |
collection | The collection used in requests. Only used when the value of type property is cloud. | |
weight.field | Name of the field where the weight of each document is written. If empty, no field is used. | |
commitSize | Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory. Note: It does not explicitly trigger a server side commit. | 1000 |
auth | Whether to enable HTTP basic authentication for communicating with Solr. Use the username and password properties to configure your credentials. | false |
username | The username for the Solr server. | username |
password | The password for the Solr server. | password |
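To show how the parameters in the table combine, here is a minimal sketch of a cloud-mode configuration with basic authentication enabled. The writer id, host names, collection name, and credentials are hypothetical placeholders, and the commitSize of 250 is an arbitrary illustration of lowering the batch size for large documents, not a recommended value.

```xml
<!-- Sketch only: id, hosts, collection and credentials are hypothetical. -->
<writer id="indexer_solr_cloud" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <mapping> ... </mapping>
  <parameters>
    <!-- cloud selects the cloud client instead of the default http one -->
    <param name="type" value="cloud"/>
    <!-- Comma-separated root Solr paths, with no collection or core -->
    <param name="url" value="http://solr1.example.org:8983/solr,http://solr2.example.org:8983/solr"/>
    <!-- Only read when type is cloud -->
    <param name="collection" value="nutch"/>
    <!-- Empty value: no weight field is written -->
    <param name="weight.field" value=""/>
    <!-- Smaller batches than the default 1000; this does not trigger a server side commit -->
    <param name="commitSize" value="250"/>
    <!-- Enable HTTP basic authentication with the credentials below -->
    <param name="auth" value="true"/>
    <param name="username" value="nutch_indexer"/>
    <param name="password" value="change-me"/>
  </parameters>
</writer>
```

For a single local Solr instance, the defaults listed in the table (type http and url http://localhost:8983/solr/nutch) would apply instead, and collection can be left empty.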
The indexer-solr plugin distribution includes a schema.xml file. Nutch does not use this file; it is provided to Solr users as a reference to help with configuring Solr.