Package org.apache.solr.update.processor
Class TextProfileSignature
java.lang.Object
org.apache.solr.update.processor.Signature
org.apache.solr.update.processor.MD5Signature
org.apache.solr.update.processor.TextProfileSignature
This implementation is copied from Apache Nutch.
An implementation of a page signature. It calculates an MD5 hash of a plain text "profile" of a page.
The algorithm to calculate a page "profile" takes the plain text version of a page and performs the following steps:
- remove all characters except letters and digits, and bring all characters to lower case,
- split the text into tokens (all consecutive non-whitespace characters),
- discard tokens equal to or shorter than MIN_TOKEN_LEN (default 2 characters),
- sort the list of tokens by decreasing frequency,
- round down the counts of tokens to the nearest multiple of QUANT (QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and maxFreq is the maximum token frequency). If maxFreq is higher than 1, then QUANT is always at least 2 (which means that tokens with frequency 1 are always discarded).
- tokens whose frequency after quantization falls below QUANT are discarded,
- create a list of tokens and their quantized frequency, separated by spaces, in the order of decreasing frequency.
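The steps above can be sketched in standalone Java as follows. This is a minimal illustration, not the Solr source: the class name TextProfileSketch and its methods are hypothetical, and several details are assumptions made for the example. Any character that is not a letter or digit is treated as a token separator, QUANT is forced to at least 2 when maxFreq is higher than 1, ties in frequency are broken alphabetically so the output is deterministic, and profile entries are joined with newlines.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the profile algorithm; not the Solr implementation.
class TextProfileSketch {

  /** Builds the text profile: one "token count" entry per surviving token. */
  static String profile(String text, int minTokenLen, float quantRate) {
    Map<String, Integer> counts = new HashMap<>();
    StringBuilder cur = new StringBuilder();
    int maxFreq = 0;
    // The trailing space flushes the last token.
    for (char c : (text + " ").toCharArray()) {
      if (Character.isLetterOrDigit(c)) {
        cur.append(Character.toLowerCase(c));
      } else {
        // Assumption: every non-alphanumeric character ends the current token.
        if (cur.length() > minTokenLen) {
          int n = counts.merge(cur.toString(), 1, Integer::sum);
          maxFreq = Math.max(maxFreq, n);
        }
        cur.setLength(0);
      }
    }

    // QUANT = QUANT_RATE * maxFreq; assumed floor of 2 when maxFreq > 1.
    int quant = Math.round(maxFreq * quantRate);
    if (quant < 2) {
      quant = (maxFreq > 1) ? 2 : 1;
    }

    // Round each count down to a multiple of QUANT, dropping those below QUANT.
    Map<String, Integer> quantized = new HashMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      int q = (e.getValue() / quant) * quant;
      if (q >= quant) {
        quantized.put(e.getKey(), q);
      }
    }

    // Decreasing frequency; alphabetical tie-break is an assumption.
    List<String> tokens = new ArrayList<>(quantized.keySet());
    tokens.sort((a, b) -> {
      int byCount = Integer.compare(quantized.get(b), quantized.get(a));
      return byCount != 0 ? byCount : a.compareTo(b);
    });

    StringBuilder out = new StringBuilder();
    for (String t : tokens) {
      if (out.length() > 0) {
        out.append('\n');
      }
      out.append(t).append(' ').append(quantized.get(t));
    }
    return out.toString();
  }

  /** MD5 of the profile as lower-case hex, using the default parameters. */
  static String signatureHex(String text) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(profile(text, 2, 0.01f).getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String a = "Hello hello HELLO world world foo a an";
    String b = "Hello, hello HELLO! World world bar";
    System.out.println(profile(a, 2, 0.01f)); // prints "hello 2" then "world 2"
    System.out.println(signatureHex(a).equals(signatureHex(b))); // prints "true"
  }
}
```

In this example both inputs reduce to the same profile, because the tokens that occur only once (foo, bar) fall below QUANT and the short tokens (a, an) are dropped, so near-duplicate texts get the same signature.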
Constructor Summary
Constructors
TextProfileSignature()
Method Summary
Modifier and Type    Method
void                 add(String content)
byte[]               getSignature()
void                 init(org.apache.solr.common.params.SolrParams params)
Constructor Details
TextProfileSignature
public TextProfileSignature()
Method Details
init
public void init(org.apache.solr.common.params.SolrParams params)
getSignature
public byte[] getSignature()
Overrides: getSignature in class MD5Signature
add
public void add(String content)
Overrides: add in class MD5Signature