| = De-Duplication |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| If duplicate or near-duplicate documents are a concern in your index, de-duplication may be worth implementing. |
| |
| A low-collision or fuzzy hash algorithm can efficiently prevent duplicate or near-duplicate documents from entering an index, or tag documents with a signature/fingerprint for duplicate field collapsing. Solr natively supports de-duplication techniques of this type via the `Signature` class and allows for the easy addition of new hash/signature implementations. The following `Signature` implementations are available: |
| |
| * MD5Signature: 128-bit hash used for exact duplicate detection. |
| * Lookup3Signature: 64-bit hash used for exact duplicate detection. This is much faster than MD5 and smaller to index. |
| * https://cwiki.apache.org/confluence/display/solr/TextProfileSignature[TextProfileSignature]: Fuzzy hashing implementation from Apache Nutch for near duplicate detection. It's tunable but works best on longer text. |
| |
| Other, more sophisticated algorithms for fuzzy/near hashing can be added later. |
| |
| [IMPORTANT] |
| ==== |
| Adding in the de-duplication process will change the `allowDups` setting so that it applies to an update term (with `signatureField` in this case) rather than the unique field Term. |
| |
| Of course the `signatureField` could be the unique field, but generally you want the unique field to be unique. When a document is added, a signature will automatically be generated and attached to the document in the specified `signatureField`. |
| ==== |
| |
| == Configuration Options |
| |
| There are two places in Solr to configure de-duplication: in `solrconfig.xml` and in `schema.xml`. |
| |
| === In solrconfig.xml |
| |
| The `SignatureUpdateProcessorFactory` has to be registered in `solrconfig.xml` as part of an <<update-request-processors.adoc#,Update Request Processor Chain>>, as in this example: |
| |
| [source,xml] |
| ---- |
| <updateRequestProcessorChain name="dedupe"> |
|   <processor class="solr.processor.SignatureUpdateProcessorFactory"> |
|     <bool name="enabled">true</bool> |
|     <str name="signatureField">id</str> |
|     <bool name="overwriteDupes">false</bool> |
|     <str name="fields">name,features,cat</str> |
|     <str name="signatureClass">solr.processor.Lookup3Signature</str> |
|   </processor> |
|   <processor class="solr.LogUpdateProcessorFactory" /> |
|   <processor class="solr.RunUpdateProcessorFactory" /> |
| </updateRequestProcessorChain> |
| ---- |
| |
| The `SignatureUpdateProcessorFactory` takes several properties: |
| |
| signatureClass:: |
| A Signature implementation for generating a signature hash. The default is `org.apache.solr.update.processor.Lookup3Signature`. |
| + |
| The full class name of the implementation must be specified. The available options are described above; the associated class names to use are: |
| |
| * `org.apache.solr.update.processor.Lookup3Signature` |
| * `org.apache.solr.update.processor.MD5Signature` |
| * `org.apache.solr.update.processor.TextProfileSignature` |
| |
| fields:: |
| A comma-separated list of the fields used to generate the signature hash. By default, all fields on the document are used. |
| |
| signatureField:: |
| The name of the field used to hold the fingerprint/signature. The field should be defined in `schema.xml`. The default is `signatureField`. |
| |
| enabled:: |
| Set to *false* to disable de-duplication processing. The default is *true*. |
| |
| overwriteDupes:: |
| If *true* (the default), an existing document that already matches this signature will be overwritten. Set to *false* to leave existing documents in place. |
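| |
| The properties above can be combined in other ways. The following is a minimal sketch, not a recommended configuration, of a chain that tags documents with a near-duplicate signature in a separate field instead of overwriting them. The chain name `dedupe-tag` is hypothetical, and the sketch assumes a `signatureField` field is defined in `schema.xml` and that the documents have `name`, `features`, and `cat` fields: |
| |
| [source,xml] |
| ---- |
| <updateRequestProcessorChain name="dedupe-tag"> |
|   <processor class="solr.processor.SignatureUpdateProcessorFactory"> |
|     <bool name="enabled">true</bool> |
|     <!-- Store the signature in its own field so the unique field stays unique --> |
|     <str name="signatureField">signatureField</str> |
|     <!-- Keep existing documents; the signature is only attached for later collapsing --> |
|     <bool name="overwriteDupes">false</bool> |
|     <str name="fields">name,features,cat</str> |
|     <!-- Fuzzy hash for near-duplicate detection --> |
|     <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str> |
|   </processor> |
|   <processor class="solr.LogUpdateProcessorFactory" /> |
|   <processor class="solr.RunUpdateProcessorFactory" /> |
| </updateRequestProcessorChain> |
| ---- |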
| |
| === In schema.xml |
| |
| If you are using a separate field for storing the signature, you must have it indexed: |
| |
| [source,xml] |
| ---- |
| <field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" /> |
| ---- |
| |
| Be sure to change your update handlers in `solrconfig.xml` to use the defined chain, as below: |
| |
| [source,xml] |
| ---- |
| <requestHandler name="/update" class="solr.UpdateRequestHandler"> |
|   <lst name="defaults"> |
|     <str name="update.chain">dedupe</str> |
|   </lst> |
|   ... |
| </requestHandler> |
| ---- |
| |
| This example assumes you have other sections of your request handler defined. |
| |
| [TIP] |
| ==== |
| The update processor can also be specified per request with a parameter of `update.chain=dedupe`. |
| ==== |
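| |
| As an illustration of the per-request form, the following is a sketch of an XML update message that could be posted to `/update?update.chain=dedupe`. The field values are hypothetical; `name`, `features`, and `cat` are the fields listed in the `fields` property of the example chain. The sketch assumes that chain is in use, so the signature is written to `id` and the document does not supply an `id` itself: |
| |
| [source,xml] |
| ---- |
| <add> |
|   <doc> |
|     <!-- Hypothetical values; these three fields feed the signature hash --> |
|     <field name="name">Example product</field> |
|     <field name="features">Example feature text</field> |
|     <field name="cat">example-category</field> |
|   </doc> |
| </add> |
| ---- |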