De-Duplication
If duplicate or near-duplicate documents are a concern in your index, de-duplication may be worth implementing.
Preventing duplicate or near-duplicate documents from entering an index, or tagging documents with a signature/fingerprint for duplicate field collapsing, can be achieved efficiently with a low-collision or fuzzy hash algorithm.
Solr natively supports de-duplication techniques of this type via the Signature class and allows for the easy addition of new hash/signature implementations.
A Signature can be implemented in a few ways:
- MD5Signature: 128-bit hash used for exact duplicate detection.
- Lookup3Signature: 64-bit hash used for exact duplicate detection. This is much faster than MD5 and smaller to index.
- TextProfileSignature: Fuzzy hashing implementation from Apache Nutch for near-duplicate detection. It's tunable but works best on longer text.
Other, more sophisticated algorithms for fuzzy/near hashing can be added later.
| Adding in the de-duplication process will change the  Of course the  | 
Configuration Options
There are two places in Solr to configure de-duplication: in solrconfig.xml and in the schema.
In solrconfig.xml
The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of an Update Request Processor Chain, as in this example:
<updateRequestProcessorChain default="true">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <str name="signatureField">id</str>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
The SignatureUpdateProcessorFactory takes several properties:
- signatureClass (optional; default: org.apache.solr.update.processor.Lookup3Signature): A Signature implementation for generating a signature hash. The fully qualified class name of the implementation must be specified. The available options are described above; the associated class names to use are:
  - org.apache.solr.update.processor.Lookup3Signature
  - org.apache.solr.update.processor.MD5Signature
  - org.apache.solr.update.processor.TextProfileSignature
- fields (optional; default: all fields): The fields to use to generate the signature hash, given as a comma-separated list. By default, all fields on the document will be used.
- signatureField (optional; default: signatureField): The name of the field used to hold the fingerprint/signature. The field should be defined in your schema.
- enabled (optional; default: true): Set to false to disable de-duplication processing.
- overwriteDupes (optional; default: true): If true, when a document exists that already matches this signature, it will be overwritten. If you are using overwriteDupes=true, the signatureField must be indexed="true" in your schema.
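To show how these properties fit together, here is a minimal sketch of a second, non-default chain that only tags documents with a near-duplicate signature instead of overwriting them. The chain name "dedupe" and the field name "signature" are assumptions made for illustration. The sketch also assumes that a "signature" field is already defined in your schema, and it reuses the name, features, and cat fields from the example above.

```xml
<!-- Hypothetical chain name "dedupe"; it is not the default chain, so it has to be selected per request. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <!-- Every property is spelled out here for clarity; all of them are optional. -->
    <bool name="enabled">true</bool>
    <!-- Assumed field name; it must exist in the schema. -->
    <str name="signatureField">signature</str>
    <!-- Keep documents that share a signature rather than overwriting them. -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <!-- Fuzzy hash for near-duplicate detection; works best on longer text. -->
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

A named chain like this is typically selected on an update request with the update.chain request parameter (here, update.chain=dedupe). With overwriteDupes set to false, the signature is simply stored on each document, which suits the field-collapsing use case described earlier.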
| Using SignatureUpdateProcessorFactory in SolrCloud: There are 2 important things to keep in mind when using  
 (Using any other  |