 = Distributed Search with Index Sharding
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 When using traditional index sharding, you will need to consider how to query your documents.

 It is highly recommended that you use <<solrcloud.adoc#,SolrCloud>> when needing to scale up or scale out. The setup described below is legacy and was used prior to the existence of SolrCloud. SolrCloud provides for a truly distributed set of features with support for things like automatic routing, leader election, optimistic concurrency and other sanity checks that are expected out of a distributed system.

 Everything on this page is specific to legacy setup of distributed search. Users trying out SolrCloud should not follow any of the steps or information below.

 Updates may be reordered (i.e., replica A may see update X then Y, and replica B may see update Y then X). Such reorders are handled, including for *deleteByQuery*, so that all replicas of a shard stay consistent even if the updates arrive in a different order on different replicas.

 == Distributing Documents across Shards

 When not using SolrCloud, it is up to you to get all your documents indexed on each shard of your server farm. Solr supports distributed indexing (routing) in its true form only in the SolrCloud mode.

 In the legacy distributed mode, Solr does not calculate universal term/doc frequencies. For most large-scale implementations, it is not likely to matter that Solr calculates TF/IDF at the shard level. However, if your collection is heavily skewed in its distribution across servers, you may find misleading relevancy results in your searches. In general, it is probably best to randomly distribute documents to your shards.
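
 A minimal sketch of one way to spread documents evenly, assuming that a hash of the document ID is an acceptable stand-in for random assignment. The helper name `shard_for_id` is hypothetical, and `cksum` is used here only as a convenient, stable checksum:

 [source,bash]
 ----
 # Hypothetical helper: map a document ID to a shard index in [0, shard_count).
 shard_for_id() {
   local id=$1 shard_count=$2 crc
   # cksum prints "<checksum> <byte count>"; keep only the checksum.
   crc=$(printf '%s' "$id" | cksum | cut -d ' ' -f 1)
   echo $(( crc % shard_count ))
 }

 # Example: pick one of two shards for a document ID.
 shard_for_id 3007WFP 2
 ----

 Because the same ID always maps to the same index, a re-indexed document lands on the same shard, which also helps avoid the same unique key appearing on more than one shard.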

 == Executing Distributed Searches with the shards Parameter

 If a query request includes the `shards` parameter, the Solr server distributes the request across all the shards listed as arguments to the parameter. The `shards` parameter uses this syntax:

 `host:port/base_url[,host:port/base_url]*`

 For example, the `shards` parameter below causes the search to be distributed across two Solr servers, *solr1* and *solr2*, both of which are running on port 8983:

 `\http://localhost:8983/solr/core1/select?shards=solr1:8983/solr/core1,solr2:8983/solr/core1&indent=true&q=ipod+solr`
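
 When the list of shards is long, it can be easier to build the parameter value programmatically. The sketch below assumes bash; the helper name `join_shards` is hypothetical and simply joins its arguments with commas:

 [source,bash]
 ----
 # Hypothetical helper: join shard addresses with commas to form the
 # value of the shards parameter.
 join_shards() {
   local IFS=,
   printf '%s' "$*"
 }

 shards=$(join_shards solr1:8983/solr/core1 solr2:8983/solr/core1)
 url="http://localhost:8983/solr/core1/select?shards=${shards}&indent=true&q=ipod+solr"
 printf '%s\n' "$url"
 ----

 The resulting URL should be quoted when it is passed to a command such as `curl`, so that the shell does not interpret the `&` characters.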

 Rather than requiring users to include the `shards` parameter explicitly, it is usually preferable to configure it as a default in the request handler section of `solrconfig.xml`.

 [IMPORTANT]
 ====
 Do not add the `shards` parameter to the standard request handler; doing so may cause search queries to enter an infinite loop. Instead, define a new request handler that uses the `shards` parameter, and pass distributed search requests to that handler.
 ====
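
 A minimal sketch of such a handler in `solrconfig.xml`, assuming the two local test cores used later on this page. The handler name `/distrib` is hypothetical:

 [source,xml]
 ----
 <requestHandler name="/distrib" class="solr.SearchHandler">
   <lst name="defaults">
     <!-- Shards queried when the request does not pass its own shards parameter -->
     <str name="shards">localhost:8983/solr/core1,localhost:8984/solr/core1</str>
   </lst>
 </requestHandler>
 ----

 Requests sent to `/distrib` would then be distributed without an explicit `shards` parameter, while the standard handler is left without a `shards` default.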

 In legacy mode, only query requests are distributed. This includes requests to the SearchHandler (or any handler extending from `org.apache.solr.handler.component.SearchHandler`) using standard components that support distributed search.

 As in SolrCloud mode, when `shards.info=true`, distributed responses will include information about the shard (where each shard represents a logically different index or physical location).
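
 As an illustration, the request below asks for per-shard details. It is a sketch only: the helper name `shards_info_query` is hypothetical, and the node, core, and shard addresses are assumed to match the two local test servers described later on this page.

 [source,bash]
 ----
 # Hypothetical helper: query one node and ask for per-shard details.
 shards_info_query() {
   local node=$1 core=$2 shards=$3
   # -G appends the URL-encoded parameters to the URL as a GET query string.
   curl --silent -G \
     --data-urlencode 'q=*:*' \
     --data-urlencode "shards=${shards}" \
     --data-urlencode 'shards.info=true' \
     --data-urlencode 'wt=xml' \
     --data-urlencode 'indent=true' \
     "http://${node}/solr/${core}/select"
 }
 ----

 It could be called as `shards_info_query localhost:8983 core1 localhost:8983/solr/core1,localhost:8984/solr/core1`.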

 The following components support distributed search:

 * The *Query* component, which returns documents matching a query.
 * The *Facet* component, which processes `facet.query` and `facet.field` requests where facets are sorted by count (the default).
 * The *Highlighting* component, which enables Solr to include "highlighted" matches in field values.
 * The *Stats* component, which returns simple statistics for numeric fields within the DocSet.
 * The *Debug* component, which helps with debugging.

 === Shards Whitelist

 The nodes allowed in the `shards` parameter are configurable through the `shardsWhitelist` property in `solr.xml`. This whitelist is automatically configured for SolrCloud but needs explicit configuration for leader/follower mode. Read more details in the section <<distributed-requests.adoc#configuring-the-shardhandlerfactory,Configuring the ShardHandlerFactory>>.
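
 A sketch of what this might look like in `solr.xml` for the two local test nodes described later on this page. The host list is illustrative, and its exact format is an assumption here; check it against the ShardHandlerFactory documentation for your Solr version:

 [source,xml]
 ----
 <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
   <!-- Assumed format: comma-separated nodes that may appear in the shards parameter -->
   <str name="shardsWhitelist">localhost:8983/solr,localhost:8984/solr</str>
 </shardHandlerFactory>
 ----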

 == Limitations to Distributed Search

 Distributed searching in Solr has the following limitations:

 * Each document indexed must have a unique key.
 * If Solr discovers duplicate document IDs, Solr selects the first document and discards subsequent documents.
 * The index for distributed searching may become momentarily out of sync if a commit happens between the first and second phase of the distributed search. This might cause a situation where a document that once matched a query and was subsequently changed may no longer match the query but will still be retrieved. This situation is expected to be quite rare, however, and is only possible for a single query request.
 * The number of shards is limited by the number of characters allowed in the URI of a GET request; most Web servers generally support at least 4000 characters, but many servers limit URI length to reduce their vulnerability to Denial of Service (DoS) attacks.
 * Shard information can be returned with each document in a distributed search by including `fl=id, [shard]` in the search request. This returns the shard URL.
 * In a distributed search, the data directory from the core descriptor overrides any data directory in `solrconfig.xml`.
 * Update commands may be sent to any server with distributed indexing configured correctly. Document adds and deletes are forwarded to the appropriate server/shard based on a hash of the unique document id. *commit* commands and *deleteByQuery* commands are sent to every server in `shards`.
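
 Two of the points above can be sketched together, assuming the query handler accepts form-encoded POST requests: sending the parameters in the request body keeps a long `shards` list out of the URI, and `fl=id,[shard]` asks for the shard URL of each document. The helper name `distributed_query` is hypothetical:

 [source,bash]
 ----
 # Hypothetical helper: send a distributed query as a POST so that the
 # parameters travel in the request body rather than in the URI.
 distributed_query() {
   local endpoint=$1 shards=$2
   # Without -G, curl sends --data-urlencode values as a POST body.
   curl --silent \
     --data-urlencode 'q=*:*' \
     --data-urlencode "shards=${shards}" \
     --data-urlencode 'fl=id,[shard]' \
     --data-urlencode 'wt=xml' \
     "$endpoint"
 }
 ----

 It could be called as `distributed_query http://localhost:8983/solr/core1/select localhost:8983/solr/core1,localhost:8984/solr/core1`. Passing the parameters with `--data-urlencode` also keeps `[shard]` out of the URL itself, where `curl` would otherwise try to interpret the square brackets as a glob range.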

 Formerly, a limitation was that TF/IDF relevancy computations used only shard-local statistics. This is still the default behavior. If your data isn't randomly distributed, or if you want more exact statistics, configure the `ExactStatsCache`.
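
 As a sketch, the stats cache is selected with a `<statsCache>` element in `solrconfig.xml`; where exactly it sits relative to your other settings is left to your existing configuration:

 [source,xml]
 ----
 <!-- Use global (cross-shard) term statistics instead of shard-local ones -->
 <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
 ----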

 == Avoiding Distributed Deadlock with Distributed Search

 As in SolrCloud mode, inter-shard requests can lead to a distributed deadlock. This can be avoided by following the instructions in the section <<distributed-requests.adoc#,Distributed Requests>>.

 == Testing Index Sharding on Two Local Servers

 For simple functional testing, it's easiest to just set up two local Solr servers on different ports. (In a production environment, of course, these servers would be deployed on separate machines.)

 .  Make two Solr home directories and copy `solr.xml` into the new directories:
 +
 [source,bash]
 ----
 mkdir example/nodes
 mkdir example/nodes/node1
 # Copy solr.xml into this solr.home
 cp server/solr/solr.xml example/nodes/node1/.
 # Repeat the above steps for the second node
 mkdir example/nodes/node2
 cp server/solr/solr.xml example/nodes/node2/.
 ----
 .  Start the two Solr instances:
 +
 [source,bash]
 ----
 # Start first node on port 8983
 bin/solr start -s example/nodes/node1 -p 8983

 # Start second node on port 8984
 bin/solr start -s example/nodes/node2 -p 8984
 ----
 .  Create a core on both nodes with the `sample_techproducts_configs` configset:
 +
 [source,bash]
 ----
 # Create a core on the Solr node running on port 8983
 bin/solr create_core -c core1 -p 8983 -d sample_techproducts_configs
 # Create a core on the Solr node running on port 8984
 bin/solr create_core -c core1 -p 8984 -d sample_techproducts_configs
 ----
 .  In a third window, index an example document to each of the servers:
 +
 [source,bash]
 ----
 bin/post -c core1 example/exampledocs/monitor.xml -port 8983

 bin/post -c core1 example/exampledocs/monitor2.xml -port 8984
 ----
 .  Search on the node on port 8983:
 +
 [source,bash]
 ----
 curl "http://localhost:8983/solr/core1/select?q=*:*&wt=xml&indent=true"
 ----
 +
 This should bring back one document.
 +
 Search on the node on port 8984:
 +
 [source,bash]
 ----
 curl "http://localhost:8984/solr/core1/select?q=*:*&wt=xml&indent=true"
 ----
 +
 This should also bring back a single document.
 +
 Now do a distributed search across both servers with your browser or `curl`. In the example below, an extra parameter, `fl`, is passed to restrict the returned fields to `id` and `name`.
 +
 [source,bash]
 ----
 curl "http://localhost:8983/solr/core1/select?q=*:*&indent=true&shards=localhost:8983/solr/core1,localhost:8984/solr/core1&fl=id,name&wt=xml"
 ----
 +
 The response should contain both documents, as shown below:
 +
 [source,xml]
 ----
 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">8</int>
     <lst name="params">
       <str name="q">*:*</str>
       <str name="shards">localhost:8983/solr/core1,localhost:8984/solr/core1</str>
       <str name="indent">true</str>
       <str name="fl">id,name</str>
       <str name="wt">xml</str>
     </lst>
   </lst>
   <result name="response" numFound="2" start="0" maxScore="1.0">
     <doc>
       <str name="id">3007WFP</str>
       <str name="name">Dell Widescreen UltraSharp 3007WFP</str>
     </doc>
     <doc>
       <str name="id">VA902B</str>
       <str name="name">ViewSonic VA902B - flat panel display - TFT - 19"</str>
     </doc>
   </result>
 </response>
 ----