De-Duplication

If duplicate, or near-duplicate documents are a concern in your index, de-duplication may be worth implementing.

Preventing duplicate or near duplicate documents from entering an index or tagging documents with a signature/fingerprint for duplicate field collapsing can be efficiently achieved with a low collision or fuzzy hash algorithm. Solr natively supports de-duplication techniques of this type via the Signature class and allows for the easy addition of new hash/signature implementations. A Signature can be implemented in a few ways:

  • MD5Signature: 128-bit hash used for exact duplicate detection.

  • Lookup3Signature: 64-bit hash used for exact duplicate detection. This is much faster than MD5 and smaller to index.

  • TextProfileSignature: Fuzzy hashing implementation from Apache Nutch for near duplicate detection. It’s tunable but works best on longer text.

Other, more sophisticated algorithms for fuzzy/near hashing can be added later.

Adding in the de-duplication process will change the allowDups setting so that it applies to an update term (with signatureField in this case) rather than the unique field Term.

Of course the signatureField could be the unique field, but generally you want the unique field to be unique. When a document is added, a signature will automatically be generated and attached to the document in the specified signatureField.

Configuration Options

There are two places in Solr to configure de-duplication: in solrconfig.xml and in the schema.

In solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of an Update Request Processor Chain, as in this example:

<updateRequestProcessorChain default="true">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <str name="signatureField">id</str>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The SignatureUpdateProcessorFactory takes several properties:

signatureClass

Optional

Default: org.apache.solr.update.processor.Lookup3Signature

A Signature implementation for generating a signature hash.

The full classpath of the implementation must be specified. The available options are described above, the associated classpaths to use are:

  • org.apache.solr.update.processor.Lookup3Signature

  • org.apache.solr.update.processor.MD5Signature

  • org.apache.solr.update.process.TextProfileSignature

fields

Optional

Default: all fields

The fields to use to generate the signature hash in a comma separated list. By default, all fields on the document will be used.

signatureField

Optional

Default: signatureField

The name of the field used to hold the fingerprint/signature. The field should be defined in your schema.

enabled

Optional

Default: true

Set to false to disable de-duplication processing.

overwriteDupes

Optional

Default: true

If true, when a document exists that already matches this signature, it will be overwritten. If you are using overwriteDupes=true the signatureField must be indexed="true" in your Schema.

Using SignatureUpdateProcessorFactory in SolrCloud

There are 2 important things to keep in mind when using SignatureUpdateProcessorFactory with SolrCloud:

  1. The overwriteDupes=true setting does not work except in the special case of using the uniqueKey field as the signatureField. Attempting De-duplication on any other signatureField will not work correctly because of how updates are forwarded to replicas

  2. When using the uniqueKey field as the signatureField, SignatureUpdateProcessorFactory must be run prior to the DistributedUpdateProcessor to ensure that documents can be routed to the correct shard leader based on the (generated) uniqueKey field.

(Using any other signatureField with overwriteDupes=false — to generate a Signature for each document with out De-duplication — has no limitations.)