Detecting Languages During Indexing

Solr can identify languages and map text to language-specific fields during indexing using the langid UpdateRequestProcessor.

Solr supports two implementations of this feature:

Tika’s language detection feature: http://tika.apache.org/0.10/detection.html
LangDetect language detection: https://github.com/shuyo/language-detection

You can see a comparison between the two implementations here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, the LangDetect implementation supports more languages with higher performance.

For specific information on each of these language identification implementations, including a list of supported languages for each, see the relevant project websites.

For more information about language analysis in Solr, see Language Analysis.

Configuring Language Detection

You can configure the langid UpdateRequestProcessor in solrconfig.xml. Both implementations take the same parameters, which are described in the following section. At a minimum, you must specify the fields for language identification and a field for the resulting language code.

Configuring Tika Language Detection

Here is an example of a minimal Tika langid configuration in solrconfig.xml:

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>

Configuring LangDetect Language Detection

Here is an example of a minimal LangDetect langid configuration in solrconfig.xml:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>

`langid` Parameters

As previously mentioned, both implementations of the langid UpdateRequestProcessor take the same parameters.

Parameter Type Default Required Description

Parameter	Type	Default	Required	Description
langid	Boolean	true	no	Enables and disables language detection.
langid.fl	string	none	yes	A comma- or space-delimited list of fields to be processed by `langid`.
langid.langField	string	none	yes	Specifies the field for the returned language code.
langid.langsField	multivalued string	none	no	Specifies the field for a list of returned language codes. If you use `langid.map.individual`, each detected language will be added to this field.
langid.overwrite	Boolean	false	no	Specifies whether the content of the `langField` and `langsField` fields will be overwritten if they already contain values.
langid.lcmap	string	none	false	A space-separated list specifying colon delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common `cjk` code, and map both American and British English to a single `en` code by using `langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en`. This affects both the values put into the `langField` and `langsField` fields, as well as the field suffixes when using `langid.map`, unless overridden by `langid.map.lcmap`
langid.threshold	float	0.5	no	Specifies a threshold value between 0 and 1 that the language identification score must reach before `langid` accepts it. With longer text fields, a high threshold such at 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results.
langid.whitelist	string	none	no	Specifies a list of allowed language identification codes. Use this in combination with `langid.map` to ensure that you only index documents into fields that are in your schema.
langid.map	Boolean	false	no	Enables field name mapping. If true, Solr will map field names for all fields listed in `langid.fl`.
langid.map.fl	string	none	no	A comma-separated list of fields for `langid.map` that is different than the fields specified in `langid.fl`.
langid.map.keepOrig	Boolean	false	no	If true, Solr will copy the field during the field name mapping process, leaving the original field in place.
langid.map.individual	Boolean	false	no	If true, Solr will detect and map languages for each field individually.
langid.map.individual.fl	string	none	no	A comma-separated list of fields for use with `langid.map.individual` that is different than the fields specified in `langid.fl`.
langid.fallbackFields	string	none	no	If no language is detected that meets the `langid.threshold` score, or if the detected language is not on the `langid.whitelist`, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in `langid.fallback`.
langid.fallback	string	none	no	Specifies a language code to use if no language is detected or specified in `langid.fallbackFields`.
langid.map.lcmap	string	determined by `langid.lcmap`	no	A space-separated list specifying colon delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common `_cjk` suffix, and map both American and British English fields to a single `_en` by using `langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en`.
langid.map.pattern	Java regular expression	none	no	By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter.
langid.map.replace	Java replace	none	no	By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter.
langid.enforceSchema	Boolean	true	no	If false, the `langid` processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain.

langid

Boolean

true

Enables and disables language detection.

langid.fl

string

none

yes

A comma- or space-delimited list of fields to be processed by langid.

langid.langField

string

none

yes

Specifies the field for the returned language code.

langid.langsField

multivalued string

none

Specifies the field for a list of returned language codes. If you use langid.map.individual, each detected language will be added to this field.

langid.overwrite

Boolean

false

Specifies whether the content of the langField and langsField fields will be overwritten if they already contain values.

langid.lcmap

string

none

false

A space-separated list specifying colon delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common cjk code, and map both American and British English to a single en code by using langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. This affects both the values put into the langField and langsField fields, as well as the field suffixes when using langid.map, unless overridden by langid.map.lcmap

langid.threshold

float

0.5

Specifies a threshold value between 0 and 1 that the language identification score must reach before langid accepts it. With longer text fields, a high threshold such at 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results.

langid.whitelist

string

none

Specifies a list of allowed language identification codes. Use this in combination with langid.map to ensure that you only index documents into fields that are in your schema.

langid.map

Boolean

false

Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl.

langid.map.fl

string

none

A comma-separated list of fields for langid.map that is different than the fields specified in langid.fl.

langid.map.keepOrig

Boolean

false

If true, Solr will copy the field during the field name mapping process, leaving the original field in place.

langid.map.individual

Boolean

false

If true, Solr will detect and map languages for each field individually.

langid.map.individual.fl

string

none

A comma-separated list of fields for use with langid.map.individual that is different than the fields specified in langid.fl.

langid.fallbackFields

string

none

If no language is detected that meets the langid.threshold score, or if the detected language is not on the langid.whitelist, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in langid.fallback.

langid.fallback

string

none

Specifies a language code to use if no language is detected or specified in langid.fallbackFields.

langid.map.lcmap

string

determined by langid.lcmap

A space-separated list specifying colon delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common *_cjk suffix, and map both American and British English fields to a single *_en by using langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en.

langid.map.pattern

Java regular expression

none

By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter.

langid.map.replace

Java replace

none

By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter.

langid.enforceSchema

Boolean

true

If false, the langid processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain.

Updating Parts of Documents De-Duplication

Configuring Language Detection

Configuring Tika Language Detection

Configuring LangDetect Language Detection

langid Parameters

`langid` Parameters