Class LanguageIdentifierUpdateProcessor

    • Field Detail

      • enabled

        protected boolean enabled
      • inputFields

        protected String[] inputFields
      • mapFields

        protected String[] mapFields
      • mapPattern

        protected Pattern mapPattern
      • mapReplaceStr

        protected String mapReplaceStr
      • langField

        protected String langField
      • langsField

        protected String langsField
      • docIdField

        protected String docIdField
      • fallbackValue

        protected String fallbackValue
      • fallbackFields

        protected String[] fallbackFields
      • enableMapping

        protected boolean enableMapping
      • mapKeepOrig

        protected boolean mapKeepOrig
      • overwrite

        protected boolean overwrite
      • mapOverwrite

        protected boolean mapOverwrite
      • mapIndividual

        protected boolean mapIndividual
      • enforceSchema

        protected boolean enforceSchema
      • threshold

        protected double threshold
      • mapIndividualFieldsSet

        protected HashSet<String> mapIndividualFieldsSet
      • maxFieldValueChars

        protected int maxFieldValueChars
      • maxTotalChars

        protected int maxTotalChars
      • tikaSimilarityPattern

        protected final Pattern tikaSimilarityPattern
      • langPattern

        protected final Pattern langPattern
    • Method Detail

      • process

        protected void process(org.apache.solr.common.SolrInputDocument doc)
        The main processing method, called from processAdd().
        Parameters:
        doc - the SolrInputDocument to modify
      • detectLanguage

        protected List<DetectedLanguage> detectLanguage(org.apache.solr.common.SolrInputDocument doc)
        Detects language(s) from all configured fields.
        Parameters:
        doc - the Solr document
        Returns:
        List of detected language(s) according to RFC-3066
      • detectLanguage

        protected abstract List<DetectedLanguage> detectLanguage(Reader solrDocReader)
        Detects language(s) from a reader, typically based on some fields in SolrInputDocument. Because this method is abstract, classes that provide their own language detection module must implement it.
        Parameters:
        solrDocReader - A reader serving the text from the document to detect
        Returns:
        List of detected language(s) according to RFC-3066
      • resolveLanguage

        protected String resolveLanguage(String language,
                                         String fallbackLang)
        Chooses a language based on a single detected language code.
        Parameters:
        language - language code as a string
        fallbackLang - the language code to use as a fallback
        Returns:
        a string of the chosen language
      • resolveLanguage

        protected String resolveLanguage(List<DetectedLanguage> languages,
                                         String fallbackLang)
        Chooses a language based on the list of detected candidates.
        Parameters:
        languages - a List of DetectedLanguages with certainty score
        fallbackLang - the language code to use as a fallback
        Returns:
        a string of the chosen language
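        The selection rule is not spelled out above, so the following is only a minimal sketch of one plausible reading: take the first candidate, and use it only if its certainty reaches the threshold field; otherwise return fallbackLang. The Candidate class is a hypothetical stand-in for DetectedLanguage, and the best-first ordering of the list is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ResolveLanguageSketch {

    /** Hypothetical stand-in for DetectedLanguage: a language code plus a certainty score. */
    static final class Candidate {
        final String langCode;
        final double certainty;

        Candidate(String langCode, double certainty) {
            this.langCode = langCode;
            this.certainty = certainty;
        }
    }

    /**
     * Returns the first candidate's code if its certainty reaches the threshold,
     * otherwise the fallback. Assumes the list is ordered best-first.
     */
    static String resolve(List<Candidate> candidates, String fallbackLang, double threshold) {
        if (candidates == null || candidates.isEmpty()) {
            return fallbackLang; // nothing was detected
        }
        Candidate best = candidates.get(0);
        return best.certainty >= threshold ? best.langCode : fallbackLang;
    }

    public static void main(String[] args) {
        List<Candidate> detected =
                Arrays.asList(new Candidate("de", 0.92), new Candidate("nl", 0.05));
        System.out.println(resolve(detected, "en", 0.5));  // prints de
        System.out.println(resolve(detected, "en", 0.95)); // prints en
        System.out.println(resolve(Collections.<Candidate>emptyList(), "en", 0.5)); // prints en
    }
}
```

        Any additional filtering the real processor may apply before choosing a language is left out of this sketch.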
      • normalizeLangCode

        protected String normalizeLangCode(String langCode)
        Looks up the language code in the map (langid.lcmap) and returns the mapped value.
        Parameters:
        langCode - the language code string returned from detector
        Returns:
        the normalized/mapped language code
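        The lookup can be pictured with a small stand-alone sketch. It assumes that the langid.lcmap value is a space-separated list of from:to pairs and that a code without a mapping is returned unchanged; both points are assumptions, and LangCodeMapSketch is a hypothetical class, not part of Solr.

```java
import java.util.HashMap;
import java.util.Map;

final class LangCodeMapSketch {
    private final Map<String, String> lcMap = new HashMap<>();

    /** Parses a spec such as "ja:cjk zh:cjk" (assumed format: space-separated from:to pairs). */
    LangCodeMapSketch(String spec) {
        for (String pair : spec.trim().split("\\s+")) {
            String[] parts = pair.split(":");
            if (parts.length == 2) {
                lcMap.put(parts[0], parts[1]);
            }
        }
    }

    /** Returns the mapped code, or the input unchanged when no mapping exists (assumed). */
    String normalize(String langCode) {
        return lcMap.getOrDefault(langCode, langCode);
    }

    public static void main(String[] args) {
        LangCodeMapSketch map = new LangCodeMapSketch("ja:cjk zh:cjk ko:cjk");
        System.out.println(map.normalize("ja")); // prints cjk
        System.out.println(map.normalize("en")); // prints en
    }
}
```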
      • getMappedField

        protected String getMappedField(String currentField,
                                        String language)
        Returns the name of the field to map the current contents into, so that they are properly analyzed. For instance, if currentField is "text" and the language code is "en", the new field would by default be "text_en". This method also performs a custom regex pattern replace if one is configured. If enforceSchema=true and the resulting field name doesn't exist in the schema, null is returned.
        Parameters:
        currentField - The current field name
        language - the language code
        Returns:
        The new schema field name, based on pattern and replace, or null if illegal
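        The mapping described above can be sketched as two regex replacements followed by an optional schema check. The pattern values below are assumptions chosen so that "text" and "en" give "text_en"; the Set of field names is a hypothetical stand-in for a real schema lookup, and MappedFieldSketch is not part of Solr.

```java
import java.util.Collections;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MappedFieldSketch {
    // Assumed defaults: capture the whole field name, then append "_{lang}".
    static final Pattern MAP_PATTERN = Pattern.compile("(.*)");
    static final String MAP_REPLACE = "$1_{lang}";
    // Matches the literal "{lang}" placeholder in the intermediate name.
    static final Pattern LANG_PATTERN = Pattern.compile("\\{lang\\}");

    /**
     * Builds the language-specific field name. Returns null when enforceSchema
     * is true and the name is not in schemaFields (a stand-in for the schema).
     */
    static String mappedField(String currentField, String language,
                              boolean enforceSchema, Set<String> schemaFields) {
        String withPlaceholder = MAP_PATTERN.matcher(currentField).replaceFirst(MAP_REPLACE);
        String newName = LANG_PATTERN.matcher(withPlaceholder)
                .replaceFirst(Matcher.quoteReplacement(language));
        if (enforceSchema && !schemaFields.contains(newName)) {
            return null; // the target field does not exist
        }
        return newName;
    }

    public static void main(String[] args) {
        Set<String> schema = Collections.singleton("text_en");
        System.out.println(mappedField("text", "en", true, schema)); // prints text_en
        System.out.println(mappedField("text", "xx", true, schema)); // prints null
    }
}
```

        Matcher.quoteReplacement keeps a language code containing "$" or "\" from being interpreted as a group reference in the replacement string.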
      • isEnabled

        public boolean isEnabled()
        Tells whether this processor is enabled.
        Returns:
        true if enabled, else false
      • setEnabled

        public void setEnabled(boolean enabled)
      • solrDocReader

        protected SolrInputDocumentReader solrDocReader(org.apache.solr.common.SolrInputDocument doc,
                                                        String[] fields)
        Returns a reader that streams String content from fields. This is more memory-efficient than building a full string buffer.
        Parameters:
        doc - the Solr document
        fields - the field names to read
        Returns:
        a reader over the fields
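        To illustrate why streaming saves memory, here is a minimal sketch of a Reader that walks a list of field values one at a time instead of concatenating them first. FieldValuesReader is a hypothetical class, not SolrInputDocumentReader itself; the single-space separator between values is an assumption, and limits such as maxFieldValueChars and maxTotalChars are not modeled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class FieldValuesReader extends Reader {
    private final Iterator<String> values;
    private Reader current;       // reader over the value currently being streamed
    private boolean first = true; // no separator before the first value
    private boolean closed;

    FieldValuesReader(List<String> fieldValues) {
        this.values = fieldValues.iterator();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (closed) {
            throw new IOException("Reader is closed");
        }
        if (len == 0) {
            return 0;
        }
        while (true) {
            if (current == null) {
                if (!values.hasNext()) {
                    return -1; // every value has been consumed
                }
                String value = values.next();
                // Assumed separator: a single space so adjacent values do not fuse.
                current = new StringReader(first ? value : " " + value);
                first = false;
            }
            int n = current.read(cbuf, off, len);
            if (n > 0) {
                return n;
            }
            current = null; // this value is exhausted, move to the next one
        }
    }

    @Override
    public void close() {
        closed = true;
        current = null;
    }

    /** Demo helper only: collects everything the reader produces into a String. */
    static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[16];
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Reader reader = new FieldValuesReader(Arrays.asList("Hello", "world"));
        System.out.println(drain(reader)); // prints Hello world
    }
}
```

        A detector that consumes the reader in small chunks never needs all field values joined in memory at once; the drain helper exists only to make the output visible.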
      • concatFields

        protected String concatFields(org.apache.solr.common.SolrInputDocument doc)
        Concatenates content from the input fields defined in langid.fl. For test purposes only.