Package org.apache.solr.update.processor
Class LanguageIdentifierUpdateProcessor
- java.lang.Object
-
- org.apache.solr.update.processor.UpdateRequestProcessor
-
- org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor
-
- All Implemented Interfaces:
Closeable,AutoCloseable,LangIdParams
- Direct Known Subclasses:
LangDetectLanguageIdentifierUpdateProcessor,OpenNLPLangDetectUpdateProcessor,TikaLanguageIdentifierUpdateProcessor
public abstract class LanguageIdentifierUpdateProcessor extends UpdateRequestProcessor implements LangIdParams
Identifies the language of a set of input fields. Also supports mapping of field names based on the detected language. See "Detecting Languages During Indexing" in the reference guide.
- Since:
- 3.5
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
-
Field Summary
Fields (Modifier and Type, Field):
protected HashSet<String> allMapFieldsSet
protected String docIdField
protected boolean enabled
protected boolean enableMapping
protected boolean enforceSchema
protected String[] fallbackFields
protected String fallbackValue
protected String[] inputFields
protected HashSet<String> langAllowlist
protected String langField
protected Pattern langPattern
protected String langsField
protected HashMap<String,String> lcMap
protected String[] mapFields
protected boolean mapIndividual
protected HashSet<String> mapIndividualFieldsSet
protected boolean mapKeepOrig
protected HashMap<String,String> mapLcMap
protected boolean mapOverwrite
protected Pattern mapPattern
protected String mapReplaceStr
protected int maxFieldValueChars
protected int maxTotalChars
protected boolean overwrite
protected IndexSchema schema
protected double threshold
protected Pattern tikaSimilarityPattern
-
Fields inherited from class org.apache.solr.update.processor.UpdateRequestProcessor
next
-
Fields inherited from interface org.apache.solr.update.processor.LangIdParams
DOCID_FIELD_DEFAULT, DOCID_LANGFIELD_DEFAULT, DOCID_LANGSFIELD_DEFAULT, DOCID_PARAM, DOCID_THRESHOLD_DEFAULT, ENFORCE_SCHEMA, FALLBACK, FALLBACK_FIELDS, FIELDS_PARAM, LANG_ALLOWLIST, LANG_FIELD, LANG_WHITELIST, LANGS_FIELD, LANGUAGE_ID, LCMAP, MAP_ENABLE, MAP_FL, MAP_INDIVIDUAL, MAP_INDIVIDUAL_FL, MAP_KEEP_ORIG, MAP_LCMAP, MAP_OVERWRITE, MAP_PATTERN, MAP_PATTERN_DEFAULT, MAP_REPLACE, MAP_REPLACE_DEFAULT, MAX_FIELD_VALUE_CHARS, MAX_FIELD_VALUE_CHARS_DEFAULT, MAX_TOTAL_CHARS, MAX_TOTAL_CHARS_DEFAULT, OVERWRITE, THRESHOLD
-
-
Constructor Summary
Constructors
LanguageIdentifierUpdateProcessor(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
-
Method Summary
Methods (Modifier and Type, Method, Description):
protected String concatFields(org.apache.solr.common.SolrInputDocument doc)
Concatenates content from input fields defined in langid.fl.
-
protected abstract List<DetectedLanguage> detectLanguage(Reader solrDocReader)
Detects language(s) from a reader, typically based on some fields in SolrInputDocument. Classes wishing to implement their own language detection module should override this method.
-
protected List<DetectedLanguage> detectLanguage(org.apache.solr.common.SolrInputDocument doc)
Detects language(s) from all configured fields.
-
protected String getMappedField(String currentField, String language)
Returns the name of the field to map the current contents into, so that they are properly analyzed.
-
boolean isEnabled()
Tells if this processor is enabled or not.
-
protected String normalizeLangCode(String langCode)
Looks up language code in map (langid.lcmap) and returns mapped value.
-
protected void process(org.apache.solr.common.SolrInputDocument doc)
This is the main process method called from processAdd().
-
void processAdd(AddUpdateCommand cmd)
-
protected String resolveLanguage(String language, String fallbackLang)
Chooses a language based on the list of candidates detected.
-
protected String resolveLanguage(List<DetectedLanguage> languages, String fallbackLang)
Chooses a language based on the list of candidates detected.
-
void setEnabled(boolean enabled)
-
protected SolrInputDocumentReader solrDocReader(org.apache.solr.common.SolrInputDocument doc, String[] fields)
Returns a reader that streams String content from fields.
-
Methods inherited from class org.apache.solr.update.processor.UpdateRequestProcessor
close, doClose, finish, processCommit, processDelete, processMergeIndexes, processRollback
-
-
-
-
Field Detail
-
enabled
protected boolean enabled
-
inputFields
protected String[] inputFields
-
mapFields
protected String[] mapFields
-
mapPattern
protected Pattern mapPattern
-
mapReplaceStr
protected String mapReplaceStr
-
langField
protected String langField
-
langsField
protected String langsField
-
docIdField
protected String docIdField
-
fallbackValue
protected String fallbackValue
-
fallbackFields
protected String[] fallbackFields
-
enableMapping
protected boolean enableMapping
-
mapKeepOrig
protected boolean mapKeepOrig
-
overwrite
protected boolean overwrite
-
mapOverwrite
protected boolean mapOverwrite
-
mapIndividual
protected boolean mapIndividual
-
enforceSchema
protected boolean enforceSchema
-
threshold
protected double threshold
-
schema
protected IndexSchema schema
-
maxFieldValueChars
protected int maxFieldValueChars
-
maxTotalChars
protected int maxTotalChars
-
tikaSimilarityPattern
protected final Pattern tikaSimilarityPattern
-
langPattern
protected final Pattern langPattern
-
-
Constructor Detail
-
LanguageIdentifierUpdateProcessor
public LanguageIdentifierUpdateProcessor(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
-
-
Method Detail
-
processAdd
public void processAdd(AddUpdateCommand cmd) throws IOException
- Overrides:
processAdd in class UpdateRequestProcessor
- Throws:
IOException
-
process
protected void process(org.apache.solr.common.SolrInputDocument doc)
This is the main process method called from processAdd().
- Parameters:
doc - the SolrInputDocument to modify
-
detectLanguage
protected List<DetectedLanguage> detectLanguage(org.apache.solr.common.SolrInputDocument doc)
Detects language(s) from all configured fields.
- Parameters:
doc - The Solr document
- Returns:
- List of detected language(s) according to RFC-3066
-
detectLanguage
protected abstract List<DetectedLanguage> detectLanguage(Reader solrDocReader)
Detects language(s) from a reader, typically based on some fields in SolrInputDocument. Classes wishing to implement their own language detection module should override this method.
- Parameters:
solrDocReader - A reader serving the text from the document to detect
- Returns:
- List of detected language(s) according to RFC-3066
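The override pattern described above can be illustrated without Solr on the classpath. The following is only a sketch: `Detected` and `Base` are hypothetical stand-ins for `DetectedLanguage` and this abstract class, and the stopword counting is a toy heuristic chosen for brevity, not how any of the bundled detectors work.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DetectorSketch {
    // Hypothetical stand-in for Solr's DetectedLanguage: a code plus a certainty score.
    static final class Detected {
        final String langCode;
        final double certainty;

        Detected(String langCode, double certainty) {
            this.langCode = langCode;
            this.certainty = certainty;
        }
    }

    // Hypothetical stand-in for the abstract hook on LanguageIdentifierUpdateProcessor.
    abstract static class Base {
        protected abstract List<Detected> detectLanguage(Reader solrDocReader);
    }

    // Toy detector: counts a handful of stopwords. Not a real detection algorithm.
    static final class StopwordDetector extends Base {
        private static final Set<String> EN = Set.of("the", "and", "of", "is");
        private static final Set<String> FR = Set.of("le", "la", "et", "est");

        @Override
        protected List<Detected> detectLanguage(Reader solrDocReader) {
            int en = 0;
            int fr = 0;
            for (String token : readAll(solrDocReader).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (EN.contains(token)) en++;
                if (FR.contains(token)) fr++;
            }
            List<Detected> result = new ArrayList<>();
            int total = en + fr;
            if (total > 0) {
                // Report only the stronger candidate, scored by its share of the hits.
                result.add(en >= fr
                        ? new Detected("en", (double) en / total)
                        : new Detected("fr", (double) fr / total));
            }
            return result;
        }

        // Drains the reader into a String; acceptable for a toy, wasteful for large input.
        private static String readAll(Reader reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            try {
                int n;
                while ((n = reader.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return sb.toString();
        }
    }

    // Convenience entry point for trying the toy detector on a plain string.
    static List<Detected> detect(String text) {
        return new StopwordDetector().detectLanguage(new StringReader(text));
    }

    public static void main(String[] args) {
        for (Detected d : detect("The language of the document is English")) {
            System.out.println(d.langCode + " " + d.certainty);
        }
    }
}
```

A real subclass would extend LanguageIdentifierUpdateProcessor directly and return DetectedLanguage instances; an empty list signals that nothing was detected.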
-
resolveLanguage
protected String resolveLanguage(String language, String fallbackLang)
Chooses a language based on the list of candidates detected.
- Parameters:
language - language code as a string
fallbackLang - the language code to use as a fallback
- Returns:
- a string of the chosen language
-
resolveLanguage
protected String resolveLanguage(List<DetectedLanguage> languages, String fallbackLang)
Chooses a language based on the list of candidates detected.
- Parameters:
languages - a List of DetectedLanguages with certainty score
fallbackLang - the language code to use as a fallback
- Returns:
- a string of the chosen language
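How the candidate list, the `threshold` field, the `langAllowlist` field, and the fallback might interact can be sketched as follows. This is a standalone illustration under stated assumptions, not the class's actual code: it assumes only the top-ranked candidate is considered, that an empty allowlist permits every language, and that `Candidate` is a hypothetical stand-in for `DetectedLanguage`.

```java
import java.util.List;
import java.util.Set;

class ResolveLanguageSketch {
    // Hypothetical stand-in for Solr's DetectedLanguage.
    static final class Candidate {
        final String langCode;
        final double certainty;

        Candidate(String langCode, double certainty) {
            this.langCode = langCode;
            this.certainty = certainty;
        }
    }

    // Assumption: only the first (best) candidate is checked; anything that
    // fails the allowlist or the certainty threshold falls back.
    static String resolve(List<Candidate> candidates, String fallbackLang,
                          double threshold, Set<String> allowlist) {
        if (candidates.isEmpty()) {
            return fallbackLang;
        }
        Candidate top = candidates.get(0);
        boolean allowed = allowlist.isEmpty() || allowlist.contains(top.langCode);
        boolean confident = top.certainty >= threshold;
        return (allowed && confident) ? top.langCode : fallbackLang;
    }

    public static void main(String[] args) {
        List<Candidate> detected = List.of(new Candidate("en", 0.9));
        System.out.println(resolve(detected, "und", 0.5, Set.of()));
        System.out.println(resolve(detected, "und", 0.95, Set.of()));
    }
}
```

The fallback code "und" in the usage lines is an arbitrary placeholder; in the processor the fallback comes from the `fallbackValue` and `fallbackFields` configuration.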
-
normalizeLangCode
protected String normalizeLangCode(String langCode)
Looks up language code in map (langid.lcmap) and returns mapped value.
- Parameters:
langCode - the language code string returned from detector
- Returns:
- the normalized/mapped language code
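A minimal sketch of such a lookup is shown below. Two points are assumptions rather than facts taken from this page: that the langid.lcmap value is a list of "from:to" pairs separated by spaces or commas, and that a code with no mapping is returned unchanged.

```java
import java.util.HashMap;
import java.util.Map;

class LangCodeMapSketch {
    // Assumed format: "from:to" pairs separated by whitespace or commas.
    static Map<String, String> parse(String spec) {
        Map<String, String> map = new HashMap<>();
        for (String pair : spec.trim().split("[,\\s]+")) {
            String[] keyVal = pair.split(":");
            if (keyVal.length == 2) {
                map.put(keyVal[0], keyVal[1]);
            }
        }
        return map;
    }

    // Assumption: unmapped codes pass through unchanged.
    static String normalize(Map<String, String> lcMap, String langCode) {
        return lcMap.getOrDefault(langCode, langCode);
    }

    public static void main(String[] args) {
        Map<String, String> lcMap = parse("zh_cn:zh zh_tw:zh");
        System.out.println(normalize(lcMap, "zh_tw"));
        System.out.println(normalize(lcMap, "en"));
    }
}
```

Collapsing regional variants such as zh_cn and zh_tw onto one code keeps the number of language-specific fields small.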
-
getMappedField
protected String getMappedField(String currentField, String language)
Returns the name of the field to map the current contents into, so that they are properly analyzed. For instance if the currentField is "text" and the code is "en", the new field would by default be "text_en". This method also performs custom regex pattern replace if configured. If enforceSchema=true and the resulting field name doesn't exist, then null is returned.
- Parameters:
currentField - The current field name
language - the language code
- Returns:
- The new schema field name, based on pattern and replace, or null if illegal
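The "text" to "text_en" mapping can be sketched with plain java.util.regex. The pattern `(.*)` and replacement `$1_{lang}` used here are assumed defaults (the page names MAP_PATTERN_DEFAULT and MAP_REPLACE_DEFAULT but does not show their values), and the `schemaFields` set is a hypothetical stand-in for a lookup against IndexSchema.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MappedFieldSketch {
    // Assumed defaults; the real values live in LangIdParams.
    static final Pattern MAP_PATTERN = Pattern.compile("(.*)");
    static final String MAP_REPLACE = "$1_{lang}";
    static final Pattern LANG_PLACEHOLDER = Pattern.compile("\\{lang\\}");

    static String mappedField(String currentField, String language,
                              Set<String> schemaFields, boolean enforceSchema) {
        // replaceFirst, not replaceAll: "(.*)" also matches the empty string at
        // the end of the input, so replaceAll would append the suffix twice.
        String withPlaceholder = MAP_PATTERN.matcher(currentField).replaceFirst(MAP_REPLACE);
        String newName = LANG_PLACEHOLDER.matcher(withPlaceholder)
                .replaceFirst(Matcher.quoteReplacement(language));
        if (enforceSchema && !schemaFields.contains(newName)) {
            return null; // resulting field is not defined in the (stand-in) schema
        }
        return newName;
    }

    public static void main(String[] args) {
        Set<String> schemaFields = Set.of("text_en");
        System.out.println(mappedField("text", "en", schemaFields, true));
        System.out.println(mappedField("text", "fr", schemaFields, true));
    }
}
```

Callers need to handle the null return, which in this sketch means the mapped field was rejected by the schema check.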
-
isEnabled
public boolean isEnabled()
Tells if this processor is enabled or not.
- Returns:
- true if enabled, else false
-
setEnabled
public void setEnabled(boolean enabled)
-
solrDocReader
protected SolrInputDocumentReader solrDocReader(org.apache.solr.common.SolrInputDocument doc, String[] fields)
Returns a reader that streams String content from fields. This is more memory efficient than building a full string buffer.
- Parameters:
doc - the Solr document
fields - the field names to read
- Returns:
- a reader over the fields
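The memory argument can be illustrated with a small Reader that walks field values one at a time instead of concatenating them up front. This is a sketch, not SolrInputDocumentReader: `ChainedReader` is a hypothetical class, the values are plain strings rather than document fields, and the single-space separator is an assumption.

```java
import java.io.Reader;
import java.util.Iterator;
import java.util.List;

class FieldStreamSketch {
    // Streams the given values one after another, separated by a single space.
    static final class ChainedReader extends Reader {
        private final Iterator<String> values;
        private String current = "";
        private int pos = 0;
        private boolean first = true;

        ChainedReader(List<String> values) {
            this.values = values.iterator();
        }

        @Override
        public int read(char[] buf, int off, int len) {
            if (len == 0) {
                return 0;
            }
            // Move to the next value once the current one is exhausted.
            while (pos >= current.length()) {
                if (!values.hasNext()) {
                    return -1; // end of stream
                }
                String next = values.next();
                current = first ? next : " " + next;
                first = false;
                pos = 0;
            }
            int n = Math.min(len, current.length() - pos);
            current.getChars(pos, pos + n, buf, off);
            pos += n;
            return n;
        }

        @Override
        public void close() {
            // Nothing to release; values are held by the caller.
        }
    }

    // Helper for demonstration only: drains the reader through a tiny buffer.
    static String drain(ChainedReader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4];
        int n;
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(drain(new ChainedReader(List.of("Hello world", "Bonjour"))));
    }
}
```

A detector that consumes the reader incrementally never needs the whole document text in memory at once; the `drain` helper exists only to make the output visible.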
-
concatFields
protected String concatFields(org.apache.solr.common.SolrInputDocument doc)
Concatenates content from input fields defined in langid.fl. For test purposes only.
-
-