If duplicate or near-duplicate documents are a concern in your index, de-duplication may be worth implementing.
Preventing duplicate or near-duplicate documents from entering an index, or tagging documents with a signature/fingerprint for duplicate field collapsing, can be achieved efficiently with a low-collision or fuzzy hash algorithm. Solr natively supports de-duplication techniques of this type via the Signature class and allows for the easy addition of new hash/signature implementations. A Signature can be implemented in several ways:
Method | Description |
---|---|
MD5Signature | 128-bit hash used for exact duplicate detection. |
Lookup3Signature | 64-bit hash used for exact duplicate detection. This is much faster than MD5 and smaller to index. |
TextProfileSignature | Fuzzy hashing implementation from Apache Nutch for near-duplicate detection. It is tunable but works best on longer text. |
Other, more sophisticated algorithms for fuzzy/near hashing can be added later.
Adding in the de-duplication process will change the Of course the |
Configuration Options
There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.
In solrconfig.xml
The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of an Update Request Processor Chain, as in this example:
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
The SignatureUpdateProcessorFactory takes several properties:
Parameter | Default | Description |
---|---|---|
signatureClass | | A Signature implementation for generating a signature hash. The fully qualified class name of the implementation must be specified. The available options are described above; the example configuration uses solr.processor.Lookup3Signature. |
fields | all fields | The fields to use to generate the signature hash, as a comma-separated list. By default, all fields on the document will be used. |
signatureField | signatureField | The name of the field used to hold the fingerprint/signature. The field should be defined in schema.xml. |
enabled | true | Enable/disable de-duplication processing. |
overwriteDupes | true | If true, when a document exists that already matches this signature, it will be overwritten. |
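As a sketch of how these parameters combine, the chain below writes the hash to a dedicated field rather than to id, and lets a newly added document replace any existing one with the same signature. The chain name dedupe-signature is a hypothetical name chosen for illustration, and the field name signatureField is simply the parameter's documented default; every other element is taken from the example above.

```xml
<!-- Illustrative variant only; the chain name "dedupe-signature" is an assumption -->
<updateRequestProcessorChain name="dedupe-signature">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Store the hash in a dedicated field, which must be defined in schema.xml -->
    <str name="signatureField">signatureField</str>
    <!-- Overwrite an existing document that already matches this signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- Only these fields contribute to the signature hash -->
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With this layout the document's own id is left untouched, which is usually preferable when the unique key carries meaning elsewhere in the application.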
In schema.xml
If you are using a separate field for storing the signature, you must have it indexed:
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
Be sure to change your update handlers to use the defined chain, as below:
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
  ...
</requestHandler>
This example assumes you have other sections of your request handler defined.
The update processor chain can also be specified per request with the update.chain parameter, the same name used in the handler defaults above.
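For illustration, a per-request selection might look like the following XML update message. The host, port, core name mycore, and all field values are assumptions made for this sketch. The request parameter is assumed to be the same update.chain name that appears in the handler defaults above.

```xml
<!-- Hypothetical request: POST this body with Content-Type text/xml to
     http://localhost:8983/solr/mycore/update?update.chain=dedupe&commit=true
     (host, port, and core name are placeholders) -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- name, features, and cat are the fields hashed by the dedupe chain above -->
    <field name="name">Example product</field>
    <field name="features">Lightweight</field>
    <field name="cat">electronics</field>
  </doc>
</add>
```

Selecting the chain per request can be useful when only some indexing clients should go through de-duplication while others use the handler's default chain.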