Language Detection
Solr can identify languages and map text to language-specific fields during indexing using the langid UpdateRequestProcessor.
Solr supports three implementations of this feature:
- Tika’s language detection feature: https://tika.apache.org/1.28.5/detection.html
- LangDetect language detection: https://github.com/shuyo/language-detection
- OpenNLP language detection: http://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.langdetect
You can see a comparison between the Tika and LangDetect implementations here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, the LangDetect implementation supports more languages with higher performance.
For specific information on each of these language identification implementations, including a list of supported languages for each, see the relevant project websites.
For more information about language analysis in Solr, see Language Analysis.
Module
This is provided via the langid Solr Module, which needs to be enabled before use.
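As a rough sketch of one way to enable the module, the fragment below assumes a Solr 9.x installation in which modules can be listed in solr.xml. The surrounding file contents are omitted, and the exact mechanism should be checked against the Solr Modules documentation for your version.

```xml
<!-- solr.xml: a minimal sketch, assuming Solr 9.x module loading.
     Other settings that a real solr.xml needs are omitted here. -->
<solr>
  <!-- Comma-separated list of modules to load on this node -->
  <str name="modules">langid</str>
</solr>
```

The same effect can typically be achieved at startup with the solr.modules system property, for example bin/solr start -Dsolr.modules=langid.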
Configuring Language Detection
You can configure the langid UpdateRequestProcessor in solrconfig.xml.
All three implementations take the same langid parameters, which are described in the langid Parameters section below; the OpenNLP implementation additionally requires langid.model.
At a minimum, you must specify the fields for language identification and a field for the resulting language code.
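A processor only runs as part of an update request processor chain. The following sketch shows one way to wire a langid processor into such a chain. The chain name langid is a hypothetical choice, and the trailing Log and Run processors are the usual closing steps of a custom chain.

```xml
<!-- solrconfig.xml: the chain name "langid" is a hypothetical example -->
<updateRequestProcessorChain name="langid">
  <!-- Detect the language of the listed fields and store the code in language_s -->
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,subject,text,keywords</str>
      <str name="langid.langField">language_s</str>
    </lst>
  </processor>
  <!-- Standard closing processors: log the update, then execute it -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

An update request would then select the chain with update.chain=langid; alternatively, the chain could be marked with default="true".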
Configuring Tika Language Detection
Here is an example of a minimal Tika langid configuration in solrconfig.xml:
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>
Configuring LangDetect Language Detection
Here is an example of a minimal LangDetect langid configuration in solrconfig.xml:
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>
Configuring OpenNLP Language Detection
Here is an example of a minimal OpenNLP langid configuration in solrconfig.xml:
<processor class="org.apache.solr.update.processor.OpenNLPLangDetectUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.model">langdetect-183.bin</str>
  </lst>
</processor>
OpenNLP-specific Parameters
langid.model
-
Required
Default: none
An OpenNLP language detection model.
The OpenNLP project provides a pre-trained model covering 103 languages on the OpenNLP site’s model download page. Model training instructions are provided on the OpenNLP website.
See Resource Loading for information on where to put the model.
langid Parameters
As previously mentioned, all three implementations of the langid UpdateRequestProcessor take the same parameters.
langid
-
Optional
Default: true
When true, enables language detection.
langid.fl
-
Required
Default: none
A comma- or space-delimited list of fields to be processed by langid.
langid.langField
-
Required
Default: none
Specifies the field for the returned language code.
langid.langsField
-
Optional
Default: none
Specifies the field for a list of returned language codes. If you use langid.map.individual, each detected language will be added to this field.
langid.overwrite
-
Optional
Default: false
Specifies whether the content of the langField and langsField fields will be overwritten if they already contain values.
langid.lcmap
-
Optional
Default: none
A space-separated list specifying colon-delimited language code mappings to apply to the detected languages.
For example, you might use this to map Chinese, Japanese, and Korean to a common cjk code, and map both American and British English to a single en code by using langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en.
This affects both the values put into the langField and langsField fields, as well as the field suffixes when using langid.map, unless overridden by langid.map.lcmap.
langid.threshold
-
Optional
Default: 0.5
Specifies a threshold value between 0 and 1 that the language identification score must reach before langid accepts it.
With longer text fields, a high threshold such as 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results.
langid.allowlist
-
Optional
Default: none
Specifies a list of allowed language identification codes. Use this in combination with langid.map to ensure that you only index documents into fields that are in your schema.
langid.map
-
Optional
Default: false
Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl.
langid.map.fl
-
Optional
Default: none
A comma-separated list of fields for langid.map, used when the fields to map differ from those specified in langid.fl.
langid.map.keepOrig
-
Optional
Default: false
If true, Solr will copy the field during the field name mapping process, leaving the original field in place.
langid.map.individual
-
Optional
Default: false
If true, Solr will detect and map languages for each field individually.
langid.map.individual.fl
-
Optional
Default: none
A comma-separated list of fields for use with langid.map.individual, used when the fields to process individually differ from those specified in langid.fl.
langid.fallback
-
Optional
Default: none
Specifies a language code to use if no language is detected or specified in langid.fallbackFields.
langid.fallbackFields
-
Optional
Default: none
If no language is detected that meets the langid.threshold score, or if the detected language is not on the langid.allowlist, this field specifies language codes to be used as fallback values.
If no appropriate fallback languages are found, Solr will use the language code specified in langid.fallback.
langid.map.lcmap
-
Optional
Default: none
A space-separated list specifying colon-delimited language code mappings to use when mapping field names.
For example, you might use this to make Chinese, Japanese, and Korean language fields use a common *_cjk suffix, and map both American and British English fields to a single *_en by using langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en.
A list defined with this parameter will override any configuration set with langid.lcmap.
langid.map.pattern
-
Optional
Default: <field>_<language>
By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter.
langid.map.replace
-
Optional
Default: <field>_<language>
By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replacement string in this parameter.
langid.enforceSchema
-
Optional
Default: true
If false, the langid processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the update chain.
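To illustrate how several of these parameters combine, here is a sketch of a configuration that maps fields by detected language. The field names, language codes, and threshold are illustrative assumptions, as is the comma-separated form of the allowlist. The example also assumes the schema defines matching fields, for example through dynamic fields such as *_en, *_de, and *_fr.

```xml
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <!-- Fields to inspect, and the field that receives the detected code -->
    <str name="langid.fl">title,text</str>
    <str name="langid.langField">language_s</str>
    <!-- Require a fairly confident detection (illustrative value) -->
    <str name="langid.threshold">0.8</str>
    <!-- Only accept these languages (assumed comma-separated); otherwise use the fallback -->
    <str name="langid.allowlist">en,de,fr</str>
    <str name="langid.fallback">en</str>
    <!-- Rename the fields in langid.fl using the default field_language pattern -->
    <str name="langid.map">true</str>
  </lst>
</processor>
```

Under these assumptions, a document detected as German would get language_s set to de, with title and text mapped to title_de and text_de. A document whose detected language is not on the allowlist, or whose score does not reach the threshold, would use the en fallback.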