Solr can identify languages and map text to language-specific fields during indexing using the langid
UpdateRequestProcessor.
Solr supports two implementations of this feature:
-
Tika’s language detection feature: http://tika.apache.org/0.10/detection.html
-
LangDetect language detection: https://github.com/shuyo/language-detection
You can see a comparison between the two implementations here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, the LangDetect implementation supports more languages with higher performance.
For specific information on each of these language identification implementations, including a list of supported languages for each, see the relevant project websites.
For more information about language analysis in Solr, see Language Analysis.
Configuring Language Detection
You can configure the langid
UpdateRequestProcessor in solrconfig.xml
. Both implementations take the same parameters, which are described in the following section. At a minimum, you must specify the fields for language identification and a field for the resulting language code.
Configuring Tika Language Detection
Here is an example of a minimal Tika langid
configuration in solrconfig.xml
:
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
Configuring LangDetect Language Detection
Here is an example of a minimal LangDetect langid
configuration in solrconfig.xml
:
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
langid
Parameters
As previously mentioned, both implementations of the langid
UpdateRequestProcessor take the same parameters.
Parameter | Type | Default | Required | Description |
---|---|---|---|---|
langid |
Boolean |
true |
no |
Enables and disables language detection. |
langid.fl |
string |
none |
yes |
A comma- or space-delimited list of fields to be processed by |
langid.langField |
string |
none |
yes |
Specifies the field for the returned language code. |
langid.langsField |
multivalued string |
none |
no |
Specifies the field for a list of returned language codes. If you use |
langid.overwrite |
Boolean |
false |
no |
Specifies whether the content of the |
langid.lcmap |
string |
none |
false |
A space-separated list specifying colon delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common |
langid.threshold |
float |
0.5 |
no |
Specifies a threshold value between 0 and 1 that the language identification score must reach before |
langid.whitelist |
string |
none |
no |
Specifies a list of allowed language identification codes. Use this in combination with |
langid.map |
Boolean |
false |
no |
Enables field name mapping. If true, Solr will map field names for all fields listed in |
langid.map.fl |
string |
none |
no |
A comma-separated list of fields for |
langid.map.keepOrig |
Boolean |
false |
no |
If true, Solr will copy the field during the field name mapping process, leaving the original field in place. |
langid.map.individual |
Boolean |
false |
no |
If true, Solr will detect and map languages for each field individually. |
langid.map.individual.fl |
string |
none |
no |
A comma-separated list of fields for use with |
langid.fallbackFields |
string |
none |
no |
If no language is detected that meets the |
langid.fallback |
string |
none |
no |
Specifies a language code to use if no language is detected or specified in |
langid.map.lcmap |
string |
determined by |
no |
A space-separated list specifying colon delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common |
langid.map.pattern |
Java regular expression |
none |
no |
By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter. |
langid.map.replace |
Java replace |
none |
no |
By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter. |
langid.enforceSchema |
Boolean |
true |
no |
If false, the |