Language Analysis
This section contains information about tokenizers and filters related to character set conversion or for use with specific languages.
For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and/or a relatively small set of punctuation characters.
In other languages the tokenization rules are often not so simple. Some European languages may also require special tokenization rules, such as rules for decompounding German words.
For information about language detection at index time, see Language Detection.
KeywordMarkerFilterFactory
Protects words from being modified by stemmers. A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.
A sample Solr protwords.txt with comments can be found in the sample_techproducts_configs configset directory:
With name
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer name="whitespace"/>
<filter name="keywordMarker" protected="protwords.txt" />
<filter name="porterStem" />
</analyzer>
</fieldtype>
With class name (legacy)
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldtype>
KeywordRepeatFilterFactory
Emits each token twice, once with the KEYWORD attribute and once without.
When this filter is placed before a stemmer, the unstemmed token is preserved at the same position as the stemmed one. Queries matching the original exact term then get a better score, while the recall benefit of stemming is retained. Keeping the original token also means that wildcard truncation works as expected.
To configure, add the KeywordRepeatFilterFactory early in the analysis chain.
It is recommended to also include RemoveDuplicatesTokenFilterFactory to avoid duplicates when tokens are not stemmed.
A sample fieldType configuration could look like this:
With name
<fieldtype name="english_stem_preserve_original" class="solr.TextField">
<analyzer>
<tokenizer name="standard"/>
<filter name="keywordRepeat" />
<filter name="porterStem" />
<filter name="removeDuplicates" />
</analyzer>
</fieldtype>
With class name (legacy)
<fieldtype name="english_stem_preserve_original" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeywordRepeatFilterFactory" />
<filter class="solr.PorterStemFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldtype>
Because the same token is added twice, it is also scored twice, so you may have to re-tune your ranking rules.
StemmerOverrideFilterFactory
Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers.
A customized mapping of words to stems, in a tab-separated file, can be specified with the dictionary attribute in the schema.
Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer.
With name
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer name="whitespace"/>
<filter name="stemmerOverride" dictionary="stemdict.txt" />
<filter name="porterStem" />
</analyzer>
</fieldtype>
With class name (legacy)
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldtype>
A sample stemdict.txt file is shown below:
# these must be tab-separated
monkeys	monkey
otters	otter
# some crazy ones that a stemmer would never do
dogs	cat
If you have a checkout of Solr’s source code locally, you can also find this example in Solr’s test resources at solr/core/src/test-files/solr/collection1/conf/stemdict.txt.
Dictionary Compound Word Token Filter
This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.
Compound words are most commonly found in Germanic languages.
Factory class: solr.DictionaryCompoundWordTokenFilterFactory
Arguments:
dictionary
-
Required
Default: none
The path of a file that contains a list of simple words, one per line. Blank lines and lines that begin with “#” are ignored.
See Resource Loading for more information.
minWordSize
-
Optional
Default: 5
Any token shorter than this is not decompounded.
minSubwordSize
-
Optional
Default: 2
Subwords shorter than this are not emitted as tokens.
maxSubwordSize
-
Optional
Default: 15
Subwords longer than this are not emitted as tokens.
onlyLongestMatch
-
Optional
Default: true
If true, only the longest matching subwords will generate new tokens.
Example:
Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="dictionaryCompoundWord" dictionary="germanwords.txt"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2)
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
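The optional size arguments can be set on the same filter. The following is a minimal sketch, assuming the same germanwords.txt dictionary; the values are purely illustrative and are not recommendations:

```xml
<analyzer>
  <tokenizer name="standard"/>
  <!-- illustrative values: tokens shorter than 6 characters are not
       decompounded, and only subwords of 4 to 15 characters are emitted -->
  <filter name="dictionaryCompoundWord"
          dictionary="germanwords.txt"
          minWordSize="6"
          minSubwordSize="4"
          maxSubwordSize="15"
          onlyLongestMatch="true"/>
</analyzer>
```

Raising minSubwordSize above the default means that two- and three-character dictionary entries can no longer produce subword tokens, which may help if a dictionary contains many very short words.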
Unicode Collation
Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search purposes.
Unicode Collation in Solr is fast, because all the work is done at index time.
Rather than specifying an analyzer within <fieldtype … class="solr.TextField">, the solr.CollationField and solr.ICUCollationField field type classes provide this functionality.
solr.ICUCollationField, which is backed by the ICU4J library, provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs solr.CollationField.
To use solr.ICUCollationField, you must enable the analysis-extras Module.
solr.ICUCollationField and solr.CollationField fields can be created in two ways:
- Based upon a system collator associated with a locale.
- Based upon a tailored RuleBasedCollator ruleset.
ICUCollationField Attributes
Using a system collator
locale
-
Required
Default: none
RFC 3066 locale ID.
strength
-
Optional
Default: none
Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU Collation Concepts for more information.
decomposition
-
Optional
Default: none
Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.
Using a Tailored ruleset
custom
-
Required
Default: none
Path to a UTF-8 text file containing rules supported by the ICU RuleBasedCollator.
strength
-
Optional
Default: none
Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU Collation Concepts for more information.
decomposition
-
Optional
Default: none
Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.
Expert options
alternate
-
Optional
Default: none
Valid values are shifted or non-ignorable. Can be used to ignore punctuation or whitespace.
caseLevel
-
Optional
Default: false
If true, in combination with strength="primary", accents are ignored but case is taken into account. See CaseLevel in ICU Collation Concepts for more information.
caseFirst
-
Optional
Default: none
Valid values are lower or upper. Useful to control which is sorted first when case is not ignored.
numeric
-
Optional
Default: false
If true, digits are sorted according to numeric value, e.g., foobar-9 sorts before foobar-10.
variableTop
-
Optional
Default: none
Single character or contraction. Controls what is variable for alternate.
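As a sketch of how the expert options combine with a system collator, the field type below enables numeric sorting and uses alternate="shifted" so that punctuation and whitespace can be ignored. The field type name collatedNumeric and the choice of locale are assumptions made for illustration:

```xml
<!-- hypothetical field type: English collation at primary strength -->
<fieldType name="collatedNumeric" class="solr.ICUCollationField"
           locale="en"
           strength="primary"
           alternate="shifted"
           numeric="true" />
```

With numeric="true", a value such as foobar-9 is expected to sort before foobar-10, as described for the numeric option.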
Sorting Text for a Specific Language
In this example, text is sorted according to the default German rules provided by ICU4J.
Locales are typically defined as a combination of language and country, but you can specify just the language if you want. For example, if you specify "de" as the language, you will get sorting that works well for the German language. If you specify "de" as the language and "CH" as the country, you will get German sorting specifically tailored for Switzerland.
<!-- Define a field type for German collation -->
<fieldType name="collatedGERMAN" class="solr.ICUCollationField"
locale="de"
strength="primary" />
...
<!-- Define a field to store the German collated manufacturer names. -->
<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false" docValues="true"/>
...
<!-- Copy the text to this field. We could create French, English, Spanish versions too,
and sort differently for different users! -->
<copyField source="manu" dest="manuGERMAN"/>
In the example above, we defined the strength as "primary". The strength of the collation determines how strict the sort order will be, but it also depends upon the language. For example, in English, "primary" strength ignores differences in case and accents.
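The language-plus-country form described above can be sketched as follows, assuming the locale ID de_CH for German as used in Switzerland. The field type name is hypothetical:

```xml
<!-- hypothetical field type for Swiss German collation -->
<fieldType name="collatedSWISSGERMAN" class="solr.ICUCollationField"
           locale="de_CH"
           strength="primary" />
```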
Another example:
<fieldType name="polishCaseInsensitive" class="solr.ICUCollationField"
locale="pl_PL"
strength="secondary" />
...
<field name="city" type="text_general" indexed="true" stored="true"/>
...
<field name="city_sort" type="polishCaseInsensitive" indexed="true" stored="false"/>
...
<copyField source="city" dest="city_sort"/>
The type will be used for the fields where the data contains Polish text. The "secondary" strength will ignore case differences, but, unlike "primary" strength, a letter with diacritic(s) will be sorted differently from the same base letter without diacritics.
An example using the "city_sort" field to sort:
q=*:*&fl=city&sort=city_sort+asc
Sorting Text for Multiple Languages
There are two approaches to supporting multiple languages. If there is a small list of languages you wish to support, consider defining collated fields for each language and using copyField.
However, adding a large number of sort fields can increase disk and indexing costs.
An alternative approach is to use the Unicode default collator.
The Unicode default or ROOT locale has rules that are designed to work well for most languages.
To use the default locale, simply define the locale as the empty string.
This Unicode default sort is still significantly more advanced than the standard Solr sort.
<fieldType name="collatedROOT" class="solr.ICUCollationField"
locale=""
strength="primary" />
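A field that sorts with this type could be declared as in the sketch below. The field names title and title_sort are hypothetical, and the attribute choices mirror the German example earlier in this section:

```xml
<!-- hypothetical sort field using the collatedROOT type -->
<field name="title_sort" type="collatedROOT" indexed="false" stored="false" docValues="true"/>
<!-- copy the text to sort on into the collated field -->
<copyField source="title" dest="title_sort"/>
```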
Sorting Text with Custom Rules
You can define your own set of sorting rules. It’s easiest to take existing rules that are close to what you want and customize them.
In the example below, we create a custom rule set for German called DIN 5007-2. This rule set treats umlauts in German differently: it treats ö as equivalent to oe, ä as equivalent to ae, and ü as equivalent to ue. For more information, see the ICU RuleBasedCollator javadocs.
This example shows how to create a custom rule set for solr.ICUCollationField and dump it to a file:
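// imports needed by this snippet (IOUtils is assumed to be Apache Commons IO)
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import java.io.File;
import java.io.FileOutputStream;
import org.apache.commons.io.IOUtils;

// get the default rules for Germany
// these are called DIN 5007-1 sorting
RuleBasedCollator baseCollator = (RuleBasedCollator) Collator.getInstance(new ULocale("de", "DE"));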
// define some tailorings, to make it DIN 5007-2 sorting.
// For example, this makes ö equivalent to oe
String DIN5007_2_tailorings =
"& ae , a\u0308 & AE , A\u0308"+
"& oe , o\u0308 & OE , O\u0308"+
"& ue , u\u0308 & UE , U\u0308";
// concatenate the default rules to the tailorings, and dump it to a String
RuleBasedCollator tailoredCollator = new RuleBasedCollator(baseCollator.getRules() + DIN5007_2_tailorings);
String tailoredRules = tailoredCollator.getRules();
// write these to a file, be sure to use UTF-8 encoding!!!
// try-with-resources closes the stream even if the write fails
try (FileOutputStream os = new FileOutputStream(new File("/solr_home/conf/customRules.dat"))) {
  IOUtils.write(tailoredRules, os, "UTF-8");
}
This rule set can now be used for custom collation in Solr:
<fieldType name="collatedCUSTOM" class="solr.ICUCollationField"
custom="customRules.dat"
strength="primary" />
JDK Collation
As mentioned above, ICU Unicode Collation is better in several ways than JDK Collation, but if you cannot use ICU4J for some reason, you can use solr.CollationField.
The principles of JDK Collation are the same as those of ICU Collation; you just specify language, country and variant arguments instead of the combined locale argument.
JDK Collation Attributes
Using a System collator (see Oracle’s list of locales supported in Java):
language
-
Required
Default: none
The ISO-639 language code.
country
-
Optional
Default: none
The ISO-3166 country code.
variant
-
Optional
Default: none
Vendor or browser-specific code.
strength
-
Optional
Default: none
Valid values are primary, secondary, tertiary, or identical. See Java Collator javadocs for more information.
decomposition
-
Optional
Default: none
Valid values are no, canonical, or full. See Java Collator javadocs for more information.
Using a Tailored ruleset:
custom
-
Required
Default: none
Path to a UTF-8 text file containing rules supported by the JDK RuleBasedCollator.
strength
-
Optional
Default: none
Valid values are primary, secondary, tertiary, or identical. See Java Collator javadocs for more information.
decomposition
-
Optional
Default: none
Valid values are no, canonical, or full. See Java Collator javadocs for more information.
solr.CollationField example:
<fieldType name="collatedGERMAN" class="solr.CollationField"
language="de"
country="DE"
strength="primary" /> <!-- ignore Umlauts and letter case when sorting -->
...
<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false" docValues="true" />
...
<copyField source="manu" dest="manuGERMAN"/>
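A tailored ruleset is configured through the custom attribute instead of language and country. The following is a minimal sketch, assuming a rules file named customRules.dat in the configset; a file generated for the ICU RuleBasedCollator may not be accepted by the JDK implementation:

```xml
<!-- hypothetical field type; customRules.dat must contain rules in the
     syntax understood by the JDK RuleBasedCollator -->
<fieldType name="collatedCUSTOM" class="solr.CollationField"
           custom="customRules.dat"
           strength="primary" />
```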
ASCII & Decimal Folding Filters
ASCII Folding
This filter converts alphabetic, numeric, and symbolic Unicode characters that are outside the 7-bit ASCII range (the "Basic Latin" Unicode block) into their ASCII equivalents, if they exist. Only those characters with reasonable ASCII alternatives are converted.
This can increase recall by causing more matches. On the other hand, it can reduce precision because language-specific character differences may be lost.
Factory class: solr.ASCIIFoldingFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="asciiFolding"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
In: "Björn Ångström"
Tokenizer to Filter: "Björn", "Ångström"
Out: "Bjorn", "Angstrom"
Decimal Digit Folding
This filter converts any character in the Unicode "Decimal Number" general category (Nd) into its equivalent Basic Latin digit (0-9).
This can increase recall by causing more matches. On the other hand, it can reduce precision because language-specific character differences may be lost.
Factory class: solr.DecimalDigitFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="decimalDigit"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DecimalDigitFilterFactory"/>
</analyzer>
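For illustration, the sketch below wraps the filter in a complete field type. The field type name is hypothetical, and the digits in the comment are Arabic-Indic digits, which belong to the Nd category:

```xml
<fieldType name="text_digits_folded" class="solr.TextField">
  <analyzer>
    <tokenizer name="standard"/>
    <!-- folds digits such as the Arabic-Indic "١٢٣" to the Basic Latin "123" -->
    <filter name="decimalDigit"/>
  </analyzer>
</fieldType>
```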
OpenNLP Integration
The Lucene module lucene/analysis/opennlp provides OpenNLP integration via several analysis components: a tokenizer, a part-of-speech tagging filter, a phrase chunking filter, and a lemmatization filter.
In addition to these analysis components, Solr also provides an update request processor to extract named entities.
See also Update Processor Factories That Can Be Loaded as Plugins.
The OpenNLP Tokenizer must be used with all other OpenNLP analysis components, for two reasons. First, the OpenNLP Tokenizer detects and marks the sentence boundaries required by all the OpenNLP filters. Second, since the pre-trained OpenNLP models used by these filters were trained using the corresponding language-specific sentence-detection/tokenization models, the same tokenization using the same models must be used at runtime for optimal performance.
To use the OpenNLP components, you must enable the analysis-extras Module.
OpenNLP Tokenizer
The OpenNLP Tokenizer takes two language-specific binary model files as parameters: a sentence detector model and a tokenizer model. The last token in each sentence is flagged, so that following OpenNLP-based filters can use this information to apply operations to tokens one sentence at a time. See the OpenNLP website for information on downloading pre-trained models.
Factory class: solr.OpenNLPTokenizerFactory
Arguments:
sentenceModel
-
Required
Default: none
The path of a language-specific OpenNLP sentence detection model file. See Resource Loading for more information.
tokenizerModel
-
Required
Default: none
The path of a language-specific OpenNLP tokenization model file. See Resource Loading for more information.
Example:
With name
<analyzer>
<tokenizer name="openNLP"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
</analyzer>
OpenNLP Part-Of-Speech Filter
This filter sets each token’s type attribute to the part of speech (POS) assigned by the configured model. See the OpenNLP website for information on downloading pre-trained models.
Lucene currently does not index token types, so if you want to keep this information, you have to preserve it either in a payload or as a synonym; see the examples below.
Factory class: solr.OpenNLPPOSFilterFactory
Arguments:
posTaggerModel
-
Required
Default: none
The path of a language-specific OpenNLP POS tagger model file. See Resource Loading for more information.
Examples:
The OpenNLP tokenizer will tokenize punctuation, which is useful for following token filters.
Ordinarily you don’t want to include punctuation in your index, so the TypeTokenFilter is included in the examples below, with stop.pos.txt containing the following:
#
$
''
``
,
-LRB-
-RRB-
:
.
Index the POS for each token as a payload:
With name
<analyzer>
<tokenizer name="openNLP"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter name="openNLPPOS" posTaggerModel="en-pos-maxent.bin"/>
<filter name="typeAsPayload"/>
<filter name="type" types="stop.pos.txt"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
<filter class="solr.TypeAsPayloadFilterFactory"/>
<filter class="solr.TypeTokenFilterFactory" types="stop.pos.txt"/>
</analyzer>
Index the POS for each token as a synonym, after prefixing the POS with "@" (see the TypeAsSynonymFilter description):
<analyzer>
<tokenizer name="openNLP"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter name="openNLPPOS" posTaggerModel="en-pos-maxent.bin"/>
<filter name="typeAsSynonym" prefix="@"/>
<filter name="type" types="stop.pos.txt"/>
</analyzer>
Only index nouns - the keep.pos.txt file contains lines NN, NNS, NNP and NNPS:
<analyzer>
<tokenizer name="openNLP"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter name="openNLPPOS" posTaggerModel="en-pos-maxent.bin"/>
<filter name="type" types="keep.pos.txt" useWhitelist="true"/>
</analyzer>
OpenNLP Phrase Chunking Filter
This filter sets each token’s type attribute based on the output of an OpenNLP phrase chunking model. The chunk labels replace the POS tags that previously were in each token’s type attribute. See the OpenNLP website for information on downloading pre-trained models.
Prerequisite: the OpenNLP Tokenizer and the OpenNLP Part-Of-Speech Filter must precede this filter.
Lucene currently does not index token types, so if you want to keep this information, you have to preserve it either in a payload or as a synonym; see the examples below.
Factory class: solr.OpenNLPChunkerFilterFactory
Arguments:
chunkerModel
-
Required
Default: none
The path of a language-specific OpenNLP phrase chunker model file. See Resource Loading for more information.
Examples:
Index the phrase chunk label for each token as a payload:
With name
<analyzer>
<tokenizer name="openNLP"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter name="openNLPPOS" posTaggerModel="en-pos-maxent.bin"/>
<filter name="openNLPChunker" chunkerModel="en-chunker.bin"/>
<filter name="typeAsPayload"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
<filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="en-chunker.bin"/>
<filter class="solr.TypeAsPayloadFilterFactory"/>
</analyzer>
Index the phrase chunk label for each token as a synonym, after prefixing it with "#" (see the TypeAsSynonymFilter description):
<analyzer>
<tokenizer name="openNLP"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter name="openNLPPOS" posTaggerModel="en-pos-maxent.bin"/>
<filter name="openNLPChunker" chunkerModel="en-chunker.bin"/>
<filter name="typeAsSynonym" prefix="#"/>
</analyzer>
OpenNLP Lemmatizer Filter
This filter replaces the text of each token with its lemma. Both a dictionary-based lemmatizer and a model-based lemmatizer are supported. If both are configured, the dictionary-based lemmatizer is tried first, and then the model-based lemmatizer is consulted for out-of-vocabulary tokens. See the OpenNLP website for information on downloading pre-trained models.
Factory class: solr.OpenNLPLemmatizerFilterFactory
Arguments:
Either dictionary or lemmatizerModel must be provided, and both may be provided - see the examples below:
dictionary
-
Optional
Default: none
The path of a lemmatization dictionary file. See Resource Loading for more information. The dictionary file must be encoded as UTF-8, with one entry per line, in the form word[tab]lemma[tab]part-of-speech, e.g., wrote[tab]write[tab]VBD.
lemmatizerModel
-
Optional
Default: none
The path of a language-specific OpenNLP lemmatizer model file. See Resource Loading for more information.
Examples:
Perform dictionary-based lemmatization, and fall back to model-based lemmatization for out-of-vocabulary tokens (see the OpenNLP Part-Of-Speech Filter section above for information about using TypeTokenFilter to avoid indexing punctuation):
With name
<analyzer>
<tokenizer name="openNLP"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter name="openNLPPOS" posTaggerModel="en-pos-maxent.bin"/>
<filter name="openNLPLemmatizer"
dictionary="lemmas.txt"
lemmatizerModel="en-lemmatizer.bin"/>
<filter name="type" types="stop.pos.txt"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
<filter class="solr.OpenNLPLemmatizerFilterFactory"
dictionary="lemmas.txt"
lemmatizerModel="en-lemmatizer.bin"/>
<filter class="solr.TypeTokenFilterFactory" types="stop.pos.txt"/>
</analyzer>
Perform dictionary-based lemmatization only:
<analyzer>
<tokenizer name="openNLP"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter name="openNLPPOS" posTaggerModel="en-pos-maxent.bin"/>
<filter name="openNLPLemmatizer" dictionary="lemmas.txt"/>
<filter name="type" types="stop.pos.txt"/>
</analyzer>
Perform model-based lemmatization only, preserving the original token and emitting the lemma as a synonym (see the KeywordRepeatFilterFactory description):
<analyzer>
<tokenizer name="openNLP"
sentenceModel="en-sent.bin"
tokenizerModel="en-tokenizer.bin"/>
<filter name="openNLPPOS" posTaggerModel="en-pos-maxent.bin"/>
<filter name="keywordRepeat"/>
<filter name="openNLPLemmatizer" lemmatizerModel="en-lemmatizer.bin"/>
<filter name="removeDuplicates"/>
<filter name="type" types="stop.pos.txt"/>
</analyzer>
Language-Specific Factories
These factories are each designed to work with specific languages. Each language covered is described in its own subsection below.
Arabic
Solr provides support for the Light-10 (PDF) stemming algorithm, and Lucene includes an example stopword list.
This algorithm defines both character normalization and stemming, so these are split into two filters to provide more flexibility.
Factory classes: solr.ArabicStemFilterFactory, solr.ArabicNormalizationFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="arabicNormalization"/>
<filter name="arabicStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
Bengali
There are two filters written specifically for dealing with the Bengali language.
They use the Lucene classes org.apache.lucene.analysis.bn.BengaliNormalizationFilter and org.apache.lucene.analysis.bn.BengaliStemFilter.
Factory classes: solr.BengaliStemFilterFactory, solr.BengaliNormalizationFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="bengaliNormalization"/>
<filter name="bengaliStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BengaliNormalizationFilterFactory"/>
<filter class="solr.BengaliStemFilterFactory"/>
</analyzer>
Normalization - মানুষ → মানুস
Stemming - সমস্ত → সমস্
Brazilian Portuguese
This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language.
It uses the Lucene class org.apache.lucene.analysis.br.BrazilianStemmer.
Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.
Factory class: solr.BrazilianStemFilterFactory
Arguments: None
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="brazilianStem"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
</analyzer>
In: "praia praias"
Tokenizer to Filter: "praia", "praias"
Out: "pra", "pra"
Bulgarian
Solr includes a light stemmer for Bulgarian, following this algorithm (PDF), and Lucene includes an example stopword list.
Factory class: solr.BulgarianStemFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="bulgarianStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BulgarianStemFilterFactory"/>
</analyzer>
Catalan
Solr can stem Catalan using the Snowball Porter Stemmer with an argument of language="Catalan".
Solr includes a set of contractions for Catalan, which can be stripped using solr.ElisionFilterFactory.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
-
Required
Default: none
The stemmer language, Catalan in this case.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="elision"
articles="lang/contractions_ca.txt"/>
<filter name="snowballPorter" language="Catalan" />
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_ca.txt"/>
<filter class="solr.SnowballPorterFilterFactory" language="Catalan" />
</analyzer>
In: "llengües llengua"
Tokenizer to Filter: "llengües"(1), "llengua"(2)
Out: "llengu"(1), "llengu"(2)
Traditional Chinese
The default configuration of the ICU Tokenizer is suitable for Traditional Chinese text. It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words.
To use this tokenizer, you must enable the analysis-extras Module.
The Standard Tokenizer can also be used to tokenize Traditional Chinese text. Following the Word Break rules from the Unicode Text Segmentation algorithm, it produces one token per Chinese character. When combined with CJK Bigram Filter, overlapping bigrams of Chinese characters are formed.
CJK Width Filter folds fullwidth ASCII variants into the equivalent Basic Latin forms.
Examples:
With name
<analyzer>
<tokenizer name="icu"/>
<filter name="cjkWidth"/>
<filter name="lowercase"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer name="standard"/>
<filter name="cjkBigram"/>
<filter name="cjkWidth"/>
<filter name="lowercase"/>
</analyzer>
CJK Bigram Filter
Forms bigrams (overlapping 2-character sequences) of CJK characters that are generated from the Standard Tokenizer or the ICU Tokenizer.
By default, all CJK characters produce bigrams, but finer grained control is available by specifying orthographic type arguments han, hiragana, katakana, and hangul.
When set to false, characters of the corresponding type will be passed through as unigrams, and will not be included in any bigrams.
When a CJK character has no adjacent characters to form a bigram, it is output in unigram form.
If you want to always output both unigrams and bigrams, set the outputUnigrams argument to true.
In all cases, all non-CJK input is passed through unmodified.
Arguments:
han
-
Optional
Default: true
If false, Han (Chinese) characters will not form bigrams.
hiragana
-
Optional
Default: true
If false, Hiragana (Japanese) characters will not form bigrams.
katakana
-
Optional
Default: true
If false, Katakana (Japanese) characters will not form bigrams.
hangul
-
Optional
Default: true
If false, Hangul (Korean) characters will not form bigrams.
outputUnigrams
-
Optional
Default: false
If true, in addition to forming bigrams, all characters are also passed through as unigrams.
See the example under Traditional Chinese.
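The per-script arguments can be combined on a single filter. The following is a minimal sketch with illustrative values only, in which Han and Hangul characters form bigrams while Hiragana and Katakana are passed through as unigrams:

```xml
<analyzer>
  <tokenizer name="standard"/>
  <!-- Han and Hangul form bigrams; Hiragana and Katakana pass through as
       unigrams; outputUnigrams also emits unigrams alongside the bigrams -->
  <filter name="cjkBigram"
          han="true"
          hiragana="false"
          katakana="false"
          hangul="true"
          outputUnigrams="true"/>
  <filter name="cjkWidth"/>
  <filter name="lowercase"/>
</analyzer>
```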
Simplified Chinese
For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the HMM Chinese Tokenizer. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this tokenizer, you must enable the analysis-extras Module.
The default configuration of the ICU Tokenizer is also suitable for Simplified Chinese text. It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words. To use this tokenizer, you must enable the analysis-extras Module.
Also useful for Chinese analysis:
CJK Width Filter folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.
Examples:
With name
<analyzer>
<tokenizer name="hmmChinese"/>
<filter name="cjkWidth"/>
<filter name="stop"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter name="porterStem"/>
<filter name="lowercase"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.HMMChineseTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.StopFilterFactory"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer name="icu"/>
<filter name="cjkWidth"/>
<filter name="stop"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter name="lowercase"/>
</analyzer>
HMM Chinese Tokenizer
For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the solr.HMMChineseTokenizerFactory in the analysis-extras module.
This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model.
To use this tokenizer, you must enable the analysis-extras Module.
Factory class: solr.HMMChineseTokenizerFactory
Arguments: None
Examples:
To use the default setup with fallback to English Porter stemmer for English words, use:
<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
Or to configure your own analysis setup, use the solr.HMMChineseTokenizerFactory along with your custom filter setup.
See an example of this in the Simplified Chinese section.
Czech
Solr includes a light stemmer for Czech, following this algorithm, and Lucene includes an example stopword list.
Factory class: solr.CzechStemFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="czechStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CzechStemFilterFactory"/>
</analyzer>
In: "prezidenští, prezidenta, prezidentského"
Tokenizer to Filter: "prezidenští", "prezidenta", "prezidentského"
Out: "preziden", "preziden", "preziden"
Danish
Solr can stem Danish using the Snowball Porter Stemmer with an argument of language="Danish".
Also relevant are the Scandinavian normalization filters.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
-
Required
Default: none
The stemmer language, Danish in this case.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="snowballPorter" language="Danish" />
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Danish" />
</analyzer>
In: "undersøg undersøgelse"
Tokenizer to Filter: "undersøg"(1), "undersøgelse"(2)
Out: "undersøg"(1), "undersøg"(2)
Dutch
Solr can stem Dutch using the Snowball Porter Stemmer with an argument of language="Dutch".
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
-
Required
Default: none
The stemmer language, Dutch in this case.
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="snowballPorter" language="Dutch"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
</analyzer>
In: "kanaal kanalen"
Tokenizer to Filter: "kanaal", "kanalen"
Out: "kanal", "kanal"
Estonian
Solr can stem Estonian using the Snowball Porter Stemmer with an argument of language="Estonian".
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
-
Required
Default: none
The stemmer language, Estonian in this case.
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="snowballPorter" language="Estonian"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Estonian"/>
</analyzer>
In: "Taevani tõustes"
Tokenizer to Filter: "Taevani", "tõustes"
Out: "taevani", "tõus"
Finnish
Solr includes support for stemming Finnish, and Lucene includes an example stopword list.
Factory class: solr.FinnishLightStemFilterFactory
Arguments: None
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="finnishLightStem"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.FinnishLightStemFilterFactory"/>
</analyzer>
In: "kala kalat"
Tokenizer to Filter: "kala", "kalat"
Out: "kala", "kala"
French
Elision Filter
Removes article elisions from a token stream. This filter can be useful for languages such as French, Catalan, Italian, and Irish.
Factory class: solr.ElisionFilterFactory
Arguments:
articles
-
Optional
Default: none
The pathname of a file that contains a list of articles, one per line, to be stripped. Articles are words such as "le", which are commonly abbreviated, such as in l’avion (the plane). This file should include the abbreviated form, which precedes the apostrophe. In this case, simply "l". If no articles attribute is specified, a default set of French articles is used.
ignoreCase
-
Optional
Default: false
If true, the filter ignores the case of words when comparing them to the list of articles.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="elision"
ignoreCase="true"
articles="lang/contractions_fr.txt"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory"
ignoreCase="true"
articles="lang/contractions_fr.txt"/>
</analyzer>
In: "L’histoire d’art"
Tokenizer to Filter: "L’histoire", "d’art"
Out: "histoire", "art"
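As noted in the arguments, the articles attribute is optional. A minimal sketch of relying on the built-in French article set, with no articles file at all, might look like this:

```xml
<analyzer>
  <tokenizer name="standard"/>
  <!-- No articles attribute: the default set of French articles is used -->
  <filter name="elision"/>
</analyzer>
```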
French Light Stem Filter
Solr includes three stemmers for French: one in the solr.SnowballPorterFilterFactory, a lighter stemmer called solr.FrenchLightStemFilterFactory, and an even less aggressive stemmer called solr.FrenchMinimalStemFilterFactory.
Lucene includes an example stopword list.
Factory classes: solr.FrenchLightStemFilterFactory, solr.FrenchMinimalStemFilterFactory
Arguments: None
Examples:
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="elision"
articles="lang/contractions_fr.txt"/>
<filter name="frenchLightStem"/>
</analyzer>
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="elision"
articles="lang/contractions_fr.txt"/>
<filter name="frenchMinimalStem"/>
</analyzer>
In: "le chat, les chats"
Tokenizer to Filter: "le", "chat", "les", "chats"
Out: "le", "chat", "le", "chat"
Galician
Solr includes a stemmer for Galician following this algorithm, and Lucene includes an example stopword list.
Factory class: solr.GalicianStemFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="galicianStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GalicianStemFilterFactory"/>
</analyzer>
In: "felizmente Luzes"
Tokenizer to Filter: "felizmente", "luzes"
Out: "feliz", "luz"
German
Solr includes four stemmers for German: one in the solr.SnowballPorterFilterFactory (with language="German"), a stemmer called solr.GermanStemFilterFactory, a lighter stemmer called solr.GermanLightStemFilterFactory, and an even less aggressive stemmer called solr.GermanMinimalStemFilterFactory.
Lucene includes an example stopword list.
Factory classes: solr.GermanStemFilterFactory, solr.GermanLightStemFilterFactory, solr.GermanMinimalStemFilterFactory
Arguments: None
Examples:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="germanStem"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.GermanStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="germanLightStem"/>
</analyzer>
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="germanMinimalStem"/>
</analyzer>
In: "haus häuser"
Tokenizer to Filter: "haus", "häuser"
Out: "haus", "haus"
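The examples above cover the three German-specific factories but not the Snowball variant. A sketch of that configuration, modelled on the Danish and Dutch examples in this section, could look like the following. The lowercase filter is an assumption added here, and the stems produced may differ from the output shown above.

```xml
<analyzer type="index">
  <tokenizer name="standard"/>
  <filter name="lowercase"/>
  <!-- Snowball stemmer selected through the language argument -->
  <filter name="snowballPorter" language="German"/>
</analyzer>
```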
Greek
This filter converts uppercase letters in the Greek character set to the equivalent lowercase character.
Factory class: solr.GreekLowerCaseFilterFactory
Arguments: None
Use of custom charsets is no longer supported as of Solr 3.1. If you need to index text in these encodings, please use Java’s character set conversion facilities (InputStreamReader, etc.) during I/O, so that Lucene can analyze this text as Unicode instead.
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="greekLowercase"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
</analyzer>
Hindi
Solr includes support for stemming Hindi following this algorithm (PDF), support for common spelling differences through the solr.HindiNormalizationFilterFactory, support for encoding differences through the solr.IndicNormalizationFilterFactory following this algorithm, and Lucene includes an example stopword list.
Factory classes: solr.IndicNormalizationFilterFactory, solr.HindiNormalizationFilterFactory, solr.HindiStemFilterFactory
Arguments: None
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="indicNormalization"/>
<filter name="hindiNormalization"/>
<filter name="hindiStem"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.IndicNormalizationFilterFactory"/>
<filter class="solr.HindiNormalizationFilterFactory"/>
<filter class="solr.HindiStemFilterFactory"/>
</analyzer>
Indonesian
Solr includes support for stemming Indonesian (Bahasa Indonesia) following this algorithm (PDF), and Lucene includes an example stopword list.
Factory class: solr.IndonesianStemFilterFactory
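Arguments:
stemDerivational
-
Optional
Default: true
If false, only inflectional suffixes (particles and possessive pronouns) are stemmed.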
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="indonesianStem" stemDerivational="true" />
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
</analyzer>
In: "sebagai sebagainya"
Tokenizer to Filter: "sebagai", "sebagainya"
Out: "bagai", "bagai"
Italian
Solr includes two stemmers for Italian: one in the solr.SnowballPorterFilterFactory (with language="Italian"), and a lighter stemmer called solr.ItalianLightStemFilterFactory.
Lucene includes an example stopword list.
Factory class: solr.ItalianLightStemFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="elision"
articles="lang/contractions_it.txt"/>
<filter name="italianLightStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_it.txt"/>
<filter class="solr.ItalianLightStemFilterFactory"/>
</analyzer>
In: "propaga propagare propagamento"
Tokenizer to Filter: "propaga", "propagare", "propagamento"
Out: "propag", "propag", "propag"
Irish
Solr can stem Irish using the Snowball Porter Stemmer with an argument of language="Irish".
Solr includes solr.IrishLowerCaseFilterFactory, which can handle Irish-specific constructs.
Solr also includes a set of contractions for Irish which can be stripped using solr.ElisionFilterFactory.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
-
Required
Default: none
The stemmer language, Irish in this case.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="elision"
articles="lang/contractions_ga.txt"/>
<filter name="irishLowercase"/>
<filter name="snowballPorter" language="Irish" />
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_ga.txt"/>
<filter class="solr.IrishLowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Irish" />
</analyzer>
In: "siopadóireacht síceapatacha b’fhearr m’athair"
Tokenizer to Filter: "siopadóireacht", "síceapatacha", "b’fhearr", "m’athair"
Out: "siopadóir", "síceapaite", "fearr", "athair"
Japanese
Solr includes support for analyzing Japanese, via the Lucene Kuromoji morphological analyzer, which includes several analysis components (more details on each below):
- JapaneseIterationMarkCharFilter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
- JapaneseTokenizer tokenizes Japanese using morphological analysis, and annotates each term with part-of-speech, base form (a.k.a. lemma), reading and pronunciation.
- JapaneseBaseFormFilter replaces original terms with their base forms (a.k.a. lemmas).
- JapanesePartOfSpeechStopFilter removes terms that have one of the configured parts-of-speech.
- JapaneseKatakanaStemFilter normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.
Also useful for Japanese analysis, from lucene-analyzers-common:
- CJKWidthFilter folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.
Japanese Iteration Mark CharFilter
Normalizes horizontal Japanese iteration marks (odoriji) to their expanded form. Vertical iteration marks are not supported.
Factory class: JapaneseIterationMarkCharFilterFactory
Arguments:
normalizeKanji
-
Optional
Default: true
Set to false to not normalize kanji iteration marks.
normalizeKana
-
Optional
Default: true
Set to false to not normalize kana iteration marks.
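A short sketch of how these two arguments might be set on the char filter, using the japaneseIterationMark name that appears in the full Japanese example later in this section. The chosen values are illustrative only.

```xml
<analyzer>
  <!-- Expand kanji iteration marks but leave kana iteration marks as they are -->
  <charFilter name="japaneseIterationMark" normalizeKanji="true" normalizeKana="false"/>
  <tokenizer name="japanese" mode="search"/>
</analyzer>
```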
Japanese Tokenizer
Tokenizer for Japanese that uses morphological analysis, and annotates each term with part-of-speech, base form (a.k.a. lemma), reading and pronunciation.
JapaneseTokenizer has a search mode (the default) that does segmentation useful for search: a heuristic is used to segment compound terms into their constituent parts while also keeping the original compound terms as synonyms.
Factory class: solr.JapaneseTokenizerFactory
Arguments:
mode
-
Optional
Default: none
Use search mode to get a noun-decompounding effect useful for search. search mode improves segmentation for search at the expense of part-of-speech accuracy. Valid values for mode are:
- normal: default segmentation
- search: segmentation useful for search (extra compound splitting)
- extended: search mode plus unigramming of unknown words (experimental)
For some applications it might be good to use search mode for indexing and normal mode for queries to increase precision and prevent parts of compounds from being matched and highlighted.
userDictionary
-
Optional
Default: none
Filename for a user dictionary, which allows overriding the statistical model with your own entries for segmentation, part-of-speech tags and readings without a need to specify weights. See lang/userdict_ja.txt for a sample user dictionary file.
userDictionaryEncoding
-
Optional
Default: UTF-8
User dictionary encoding.
discardPunctuation
-
Optional
Default: true
Set to false to keep punctuation, true to discard.
discardCompoundToken
-
Optional
Default: none
Set to false to keep original compound tokens with the search mode, true to discard.
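The suggestion to index in search mode and query in normal mode can be expressed with separate index and query analyzers. This is a sketch under assumptions: the field type name text_ja_split is hypothetical, and the filter chain is a shortened version of the full example at the end of the Japanese section.

```xml
<fieldType name="text_ja_split" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: split compounds so their parts are searchable -->
  <analyzer type="index">
    <tokenizer name="japanese" mode="search"/>
    <filter name="japaneseBaseForm"/>
    <filter name="cjkWidth"/>
    <filter name="lowercase"/>
  </analyzer>
  <!-- Query time: default segmentation, so query compounds are not split -->
  <analyzer type="query">
    <tokenizer name="japanese" mode="normal"/>
    <filter name="japaneseBaseForm"/>
    <filter name="cjkWidth"/>
    <filter name="lowercase"/>
  </analyzer>
</fieldType>
```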
Japanese Base Form Filter
Replaces original terms' text with the corresponding base form (lemma). (JapaneseTokenizer annotates each term with its base form.)
Factory class: JapaneseBaseFormFilterFactory
Arguments: None
Japanese Part Of Speech Stop Filter
Removes terms with one of the configured parts-of-speech. JapaneseTokenizer annotates terms with parts-of-speech.
Factory class: JapanesePartOfSpeechStopFilterFactory
Arguments:
tags
-
Optional
Default: none
Filename for a list of parts-of-speech for which to remove terms. See conf/lang/stoptags_ja.txt in the sample_techproducts_configs configset for an example.
Japanese Katakana Stem Filter
Normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.
solr.CJKWidthFilterFactory should be specified prior to this filter to normalize half-width katakana to full-width.
Factory class: JapaneseKatakanaStemFilterFactory
Arguments:
minimumLength
-
Optional
Default: 4
Terms below this length will not be stemmed. Value must be 2 or more.
CJK Width Filter
Folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.
Factory class: CJKWidthFilterFactory
Arguments: None
Example:
With name
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer>
<!-- Uncomment if you need to handle iteration marks: -->
<!-- <charFilter name="japaneseIterationMark" /> -->
<tokenizer name="japanese" mode="search" userDictionary="lang/userdict_ja.txt"/>
<filter name="japaneseBaseForm"/>
<filter name="japanesePartOfSpeechStop" tags="lang/stoptags_ja.txt"/>
<filter name="cjkWidth"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_ja.txt"/>
<filter name="japaneseKatakanaStem" minimumLength="4"/>
<filter name="lowercase"/>
</analyzer>
</fieldType>
With class name (legacy)
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer>
<!-- Uncomment if you need to handle iteration marks: -->
<!-- <charFilter class="solr.JapaneseIterationMarkCharFilterFactory" /> -->
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
<filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Korean
The Korean (nori) analyzer integrates Lucene’s nori analysis module into Solr. It uses the mecab-ko-dic dictionary to perform morphological analysis of Korean texts.
The dictionary was built with MeCab and defines a format for the features adapted for the Korean language.
Nori also has a user dictionary feature that allows overriding the statistical model with your own entries for segmentation, part-of-speech tags, and readings without a need to specify weights.
Example:
With name
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer name="korean" decompoundMode="discard" outputUnknownUnigrams="false"/>
<filter name="koreanPartOfSpeechStop" />
<filter name="koreanReadingForm" />
<filter name="lowercase" />
</analyzer>
</fieldType>
With class name (legacy)
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false"/>
<filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
<filter class="solr.KoreanReadingFormFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
Korean Tokenizer
Factory class: solr.KoreanTokenizerFactory
SPI name: korean
Arguments:
userDictionary
-
Optional
Default: none
Path to a user-supplied dictionary to add custom nouns or compound terms to the default dictionary.
userDictionaryEncoding
-
Optional
Default: none
Character encoding of the user dictionary.
decompoundMode
-
Optional
Default: discard
Defines how to handle compound tokens. The options are:
- none: No decomposition for tokens.
- discard: Tokens are decomposed and the original form is discarded.
- mixed: Tokens are decomposed and the original form is retained.
outputUnknownUnigrams
-
Optional
Default: false
If true, unigrams will be output for unknown words.
discardPunctuation
-
Optional
Default: true
If true, punctuation will be discarded.
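The following sketch shows the tokenizer arguments used together, keeping both the compound and its parts with decompoundMode="mixed". The field type name text_ko_mixed and the dictionary path lang/userdict_ko.txt are hypothetical; substitute your own.

```xml
<fieldType name="text_ko_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- lang/userdict_ko.txt is a placeholder path for a user-supplied dictionary -->
    <tokenizer name="korean"
               decompoundMode="mixed"
               userDictionary="lang/userdict_ko.txt"
               userDictionaryEncoding="UTF-8"
               discardPunctuation="true"/>
    <filter name="koreanPartOfSpeechStop"/>
    <filter name="koreanReadingForm"/>
    <filter name="lowercase"/>
  </analyzer>
</fieldType>
```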
Hebrew, Lao, Myanmar, Khmer
Lucene provides support, in addition to UAX#29 word break rules, for Hebrew’s use of the double and single quote characters, and for segmenting Lao, Myanmar, and Khmer into syllables with the solr.ICUTokenizerFactory in the analysis-extras module.
To use this tokenizer, you must enable the analysis-extras Module.
See ICUTokenizer for more information.
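A minimal field type using the ICU tokenizer might look like the sketch below. The field type name text_icu is hypothetical, the lowercase filter is an optional addition, and the analysis-extras module must be enabled for the tokenizer to load.

```xml
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Default ICU tokenizer configuration, no extra arguments -->
    <tokenizer name="icu"/>
    <filter name="lowercase"/>
  </analyzer>
</fieldType>
```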
Latvian
Solr includes support for stemming Latvian, and Lucene includes an example stopword list.
Factory class: solr.LatvianStemFilterFactory
Arguments: None
Example:
With name
<fieldType name="text_lvstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="latvianStem"/>
</analyzer>
</fieldType>
With class name (legacy)
<fieldType name="text_lvstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.LatvianStemFilterFactory"/>
</analyzer>
</fieldType>
In: "tirgiem tirgus"
Tokenizer to Filter: "tirgiem", "tirgus"
Out: "tirg", "tirg"
Norwegian
Solr includes two classes for stemming Norwegian, NorwegianLightStemFilterFactory and NorwegianMinimalStemFilterFactory.
Lucene includes an example stopword list.
Another option is to use the Snowball Porter Stemmer with an argument of language="Norwegian".
For normalization, there is a NorwegianNormalizationFilterFactory, which is a variant of the Scandinavian normalization filters but with folding rules tuned for Norwegian.
Norwegian Light Stemmer
The NorwegianLightStemFilterFactory requires a "two-pass" sort for the -dom and -het endings.
This means that in the first pass the word "kristendom" is stemmed to "kristen", and then all the general rules apply so it will be further stemmed to "krist".
The effect of this is that "kristen," "kristendom," "kristendommen," and "kristendommens" will all be stemmed to "krist."
The second pass is to pick up -dom and -het endings. Consider this example:
| One pass: Before | One pass: After | Two passes: Before | Two passes: After |
|---|---|---|---|
| forlegen | forleg | forlegen | forleg |
| forlegenhet | forlegen | forlegenhet | forleg |
| forlegenheten | forlegen | forlegenheten | forleg |
| forlegenhetens | forlegen | forlegenhetens | forleg |
| firkantet | firkant | firkantet | firkant |
| firkantethet | firkantet | firkantethet | firkant |
| firkantetheten | firkantet | firkantetheten | firkant |
Factory class: solr.NorwegianLightStemFilterFactory
Arguments:
variant
-
Optional
Default: nb
The Norwegian language variant to use. Valid values are:
- nb: Bokmål
- nn: Nynorsk
- no: both
Example:
With name
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_no.txt" format="snowball"/>
<filter name="norwegianLightStem"/>
</analyzer>
</fieldType>
With class name (legacy)
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_no.txt" format="snowball"/>
<filter class="solr.NorwegianLightStemFilterFactory"/>
</analyzer>
</fieldType>
In: "Forelskelsen"
Tokenizer to Filter: "forelskelsen"
Out: "forelske"
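The example above relies on the default variant (nb). To stem Nynorsk instead, the variant argument can be set explicitly, as in this sketch. The field type name text_nn is hypothetical, and the stop filter is omitted for brevity.

```xml
<fieldType name="text_nn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
    <!-- variant accepts nb (Bokmål, the default), nn (Nynorsk) or no (both) -->
    <filter name="norwegianLightStem" variant="nn"/>
  </analyzer>
</fieldType>
```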
Norwegian Minimal Stemmer
The NorwegianMinimalStemFilterFactory stems plural forms of Norwegian nouns only.
Factory class: solr.NorwegianMinimalStemFilterFactory
Arguments:
variant
-
Optional
Default: nb
The Norwegian language variant to use. Valid values are:
- nb: Bokmål
- nn: Nynorsk
- no: both
Example:
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_no.txt" format="snowball"/>
<filter name="norwegianMinimalStem"/>
</analyzer>
</fieldType>
In: "Bilens"
Tokenizer to Filter: "bilens"
Out: "bil"
Norwegian Normalization Filter
This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØåÅ and folded variants (ae, oe and aa) by transforming them to æÆøØåÅ.
This is a variant of ScandinavianNormalizationFilter, with folding rules customized for Norwegian.
Factory class: solr.NorwegianNormalizationFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="norwegianNormalization"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NorwegianNormalizationFilterFactory"/>
</analyzer>
In: "blåbærsyltetøy blåbärsyltetöy blaabaersyltetoey blabarsyltetoy"
Tokenizer to Filter: "blåbærsyltetøy", "blåbärsyltetöy", "blaabaersyltetoey", "blabarsyltetoy"
Out: "blåbærsyltetøy", "blåbærsyltetøy", "blåbærsyltetøy", "blabarsyltetoy"
Persian
Persian Filter Factories
Solr includes support for normalizing Persian, and Lucene includes an example stopword list.
Factory class: solr.PersianNormalizationFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="arabicNormalization"/>
<filter name="persianNormalization"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.PersianNormalizationFilterFactory"/>
</analyzer>
Polish
Solr provides support for Polish stemming with the solr.StempelPolishStemFilterFactory, and solr.MorfologikFilterFactory for lemmatization, in the analysis-extras module.
The solr.StempelPolishStemFilterFactory component includes an algorithmic stemmer with tables for Polish.
To use these filters, you must enable the analysis-extras Module.
Factory classes: solr.StempelPolishStemFilterFactory and solr.MorfologikFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="stempelPolishStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StempelPolishStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer name="standard"/>
<filter name="morfologik" dictionary="morfologik/stemming/polish/polish.dict"/>
<filter name="lowercase"/>
</analyzer>
In: "studenta studenci"
Tokenizer to Filter: "studenta", "studenci"
Out: "student", "student"
More information about the Stempel stemmer is available in the Lucene javadocs.
Note that the lowercase filter is applied after the Morfologik stemmer: the Polish dictionary contains proper names, so the original case of a term may be needed to resolve ambiguities (or even to look up the correct lemma at all).
The Morfologik dictionary parameter value is a constant specifying which dictionary to choose.
The dictionary resource must be named path/to/language.dict and have an associated .info metadata file.
See the Morfologik project for details.
If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default.
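Since the dictionary attribute is optional, a configuration that relies on the default Polish dictionary can leave it out. A minimal sketch, keeping the lowercase filter after the lemmatizer as recommended above:

```xml
<analyzer>
  <tokenizer name="standard"/>
  <!-- No dictionary attribute: the default Polish dictionary is loaded -->
  <filter name="morfologik"/>
  <filter name="lowercase"/>
</analyzer>
```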
Portuguese
Solr includes four stemmers for Portuguese: one in the solr.SnowballPorterFilterFactory, an alternative stemmer called solr.PortugueseStemFilterFactory, a lighter stemmer called solr.PortugueseLightStemFilterFactory, and an even less aggressive stemmer called solr.PortugueseMinimalStemFilterFactory.
Lucene includes an example stopword list.
Factory classes: solr.PortugueseStemFilterFactory, solr.PortugueseLightStemFilterFactory, solr.PortugueseMinimalStemFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="portugueseStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="portugueseLightStem"/>
</analyzer>
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="portugueseMinimalStem"/>
</analyzer>
In: "praia praias"
Tokenizer to Filter: "praia", "praias"
Out: "pra", "pra"
Romanian
Solr can stem Romanian using the Snowball Porter Stemmer with an argument of language="Romanian".
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
-
Required
Default: none
The stemmer language, Romanian in this case.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="snowballPorter" language="Romanian" />
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Romanian" />
</analyzer>
Russian
Russian Stem Filter
Solr includes two stemmers for Russian: one in the solr.SnowballPorterFilterFactory (with language="Russian"), and a lighter stemmer called solr.RussianLightStemFilterFactory.
Lucene includes an example stopword list.
Factory class: solr.RussianLightStemFilterFactory
Arguments: None
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="russianLightStem"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RussianLightStemFilterFactory"/>
</analyzer>
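The example above uses the light stemmer. For comparison, a sketch of the Snowball alternative mentioned at the start of this section, following the same pattern as the other Snowball-based languages on this page:

```xml
<analyzer type="index">
  <tokenizer name="standard"/>
  <filter name="lowercase"/>
  <!-- More aggressive alternative to russianLightStem -->
  <filter name="snowballPorter" language="Russian"/>
</analyzer>
```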
Scandinavian
Scandinavian is a language group spanning three closely related languages: Norwegian, Swedish and Danish.
Swedish å, ä, ö are in fact the same letters as Norwegian and Danish å, æ, ø and are thus interchangeable between these languages. They are, however, folded differently when people type them on a keyboard lacking these characters.
In that situation almost all Swedish people use a, a, o instead of å, ä, ö. Norwegians and Danes, on the other hand, usually type aa, ae and oe instead of å, æ and ø. Some do, however, use a, a, o, oo, ao and sometimes permutations of everything above.
There are two filters for helping with normalization between Scandinavian languages: solr.ScandinavianNormalizationFilterFactory, which tries to preserve the special characters (æäöå), and solr.ScandinavianFoldingFilterFactory, which folds them to the broader forms (ø/ö → o, etc.).
See also each language section for other relevant filters.
Scandinavian Normalization Filter
This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
It’s a semantically less destructive solution than ScandinavianFoldingFilter, most useful when a person with a Norwegian or Danish keyboard queries a Swedish index and vice versa.
This filter does not perform the common Swedish folds of å and ä to a, nor ö to o.
Factory class: solr.ScandinavianNormalizationFilterFactory
Arguments: None
Example:
In: "blåbærsyltetøj blåbärsyltetöj blaabaersyltetoej blabarsyltetoj"
Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"
Out: "blåbærsyltetøj", "blåbærsyltetøj", "blåbærsyltetøj", "blabarsyltetoj"
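An analyzer that applies this filter might be configured as in the following sketch, which mirrors the Scandinavian Folding Filter example and assumes the SPI name scandinavianNormalization:

```xml
<analyzer>
  <tokenizer name="standard"/>
  <filter name="lowercase"/>
  <!-- Normalizes to å, æ, ø without folding them to a or o -->
  <filter name="scandinavianNormalization"/>
</analyzer>
```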
Scandinavian Folding Filter
This filter folds Scandinavian characters åÅäæÄÆ → a and öÖøØ → o. It also discriminates against use of double vowels aa, ae, ao, oe and oo, leaving just the first one.
It’s a semantically more destructive solution than ScandinavianNormalizationFilter, but can in addition help with matching raksmorgas as räksmörgås.
Factory class: solr.ScandinavianFoldingFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="scandinavianFolding"/>
</analyzer>
In: "blåbærsyltetøj blåbärsyltetöj blaabaersyltetoej blabarsyltetoj"
Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"
Out: "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj"
Serbian
Serbian Normalization Filter
Solr includes a filter that normalizes Serbian Cyrillic and Latin characters. Note that this filter only works with lowercased input.
For user tips & advice on using this filter, see Serbian Language Support in the Solr Wiki.
Factory class: solr.SerbianNormalizationFilterFactory
Arguments:
haircut
Optional
Default: bald
Select the extent of normalization. Valid values are:
- bald: Cyrillic characters are first converted to Latin; then, Latin characters have their diacritics removed, with the exception of LATIN SMALL LETTER D WITH STROKE (U+0111), which is converted to “dj”
- regular: Only Cyrillic to Latin normalization is applied, preserving the Latin diacritics
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="serbianNormalization" haircut="bald"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SerbianNormalizationFilterFactory" haircut="bald"/>
</analyzer>
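Both examples above use the default haircut="bald". As a sketch of the other documented value, the following analyzer keeps Latin diacritics and only converts Cyrillic to Latin; the chain is otherwise the same as above.

```xml
<analyzer>
  <tokenizer name="standard"/>
  <!-- The Serbian filter only works on lowercased input, so lowercase first -->
  <filter name="lowercase"/>
  <!-- regular: Cyrillic is converted to Latin; Latin diacritics are preserved -->
  <filter name="serbianNormalization" haircut="regular"/>
</analyzer>
```

This setting may suit indexes where users are expected to type diacritics; bald is the more forgiving choice when they often omit them.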
Spanish
Solr includes two stemmers for Spanish: one in the solr.SnowballPorterFilterFactory language="Spanish", and a lighter stemmer called solr.SpanishLightStemFilterFactory.
Lucene includes an example stopword list.
Factory class: solr.SpanishLightStemFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="spanishLightStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SpanishLightStemFilterFactory"/>
</analyzer>
In: "torear toreara torearlo"
Tokenizer to Filter: "torear", "toreara", "torearlo"
Out: "tor", "tor", "tor"
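The example above uses the light stemmer. A sketch of the Snowball alternative mentioned at the start of this section, assuming the same tokenizer and lowercase filter, would be:

```xml
<analyzer>
  <tokenizer name="standard"/>
  <filter name="lowercase"/>
  <!-- Snowball stemmer selected through the language attribute -->
  <filter name="snowballPorter" language="Spanish"/>
</analyzer>
```

The two stemmers use different rules, so the stems produced for the sample input will not necessarily match the light stemmer output shown above.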
Swedish
Swedish Stem Filter
Solr includes three stemmers for Swedish: one in the solr.SnowballPorterFilterFactory language="Swedish", a lighter stemmer called solr.SwedishLightStemFilterFactory, and a minimal stemmer solr.SwedishMinimalStemFilterFactory.
The Light variant is based on simple rules and removes suffixes like -het, -heten, -else, -elser etc., while the Minimal one only tries to normalize singular/plural endings like -er, -ar, -arne etc. See the Lucene javadocs for more information.
The Swedish Light and Minimal stemmers are known to produce many conflicting word stems, significantly hurting search precision. It may be necessary to provide an extensive list of custom stemmer mappings to counteract this, e.g. using the StemmerOverrideFilter.
Lucene includes an example stopword list.
Also relevant are the Scandinavian normalization filters.
Factory class: solr.SwedishStemFilterFactory, solr.SwedishLightStemFilterFactory and solr.SwedishMinimalStemFilterFactory.
Arguments: None
Example (SwedishLightStemFilterFactory):
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="swedishLightStem"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SwedishLightStemFilterFactory"/>
</analyzer>
In: "kloke klokhet klokheten"
Tokenizer to Filter: "kloke", "klokhet", "klokheten"
Out: "klok", "klok", "klok"
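The note about conflicting stems suggests custom stemmer mappings. A minimal sketch of that approach places a stemmer override filter ahead of the light stemmer. The file name stemdict_sv.txt is hypothetical, and the sketch assumes the override filter's short name is stemmerOverride and that it reads a tab-separated word-to-stem file through its dictionary attribute.

```xml
<analyzer>
  <tokenizer name="standard"/>
  <filter name="lowercase"/>
  <!-- stemdict_sv.txt (hypothetical name): one "word<TAB>stem" pair per line -->
  <!-- Tokens found in the file get the listed stem and are marked as keywords -->
  <filter name="stemmerOverride" dictionary="stemdict_sv.txt"/>
  <!-- Keyword-marked tokens are skipped here; all other tokens are stemmed as usual -->
  <filter name="swedishLightStem"/>
</analyzer>
```

The override filter has to come before the stemmer, since it works by protecting the overridden tokens from later stemming.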
Thai
This tokenizer converts sequences of Thai characters into individual Thai words. Unlike European languages, Thai does not use whitespace to delimit words.
Factory class: solr.ThaiTokenizerFactory
Arguments: None
Example:
With name
<analyzer type="index">
<tokenizer name="thai"/>
<filter name="lowercase"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.ThaiTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
Turkish
Solr includes support for stemming Turkish with the solr.SnowballPorterFilterFactory; support for case-insensitive search with the solr.TurkishLowerCaseFilterFactory; support for stripping apostrophes and following suffixes with solr.ApostropheFilterFactory (see Role of Apostrophes in Turkish Information Retrieval); and support for a form of stemming that truncates tokens at a configurable maximum length through the solr.TruncateTokenFilterFactory (see Information Retrieval on Turkish Texts). Lucene also includes an example stopword list.
Factory class: solr.TurkishLowerCaseFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="apostrophe"/>
<filter name="turkishLowercase"/>
<filter name="snowballPorter" language="Turkish"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
</analyzer>
Another example, illustrating diacritics-insensitive search:
<analyzer>
<tokenizer name="standard"/>
<filter name="apostrophe"/>
<filter name="turkishLowercase"/>
<filter name="asciiFolding" preserveOriginal="true"/>
<filter name="keywordRepeat"/>
<filter name="truncate" prefixLength="5"/>
<filter name="removeDuplicates"/>
</analyzer>
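For installations that still use the legacy class-name syntax, the same diacritics-insensitive chain can be sketched as follows. The class names are the standard Solr factories believed to correspond to each short name in the example above.

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ApostropheFilterFactory"/>
  <filter class="solr.TurkishLowerCaseFilterFactory"/>
  <!-- Emits both the original token and its ASCII-folded form -->
  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
  <!-- Emits each token twice, one copy marked as a keyword -->
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <!-- Truncates tokens to at most 5 characters -->
  <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
  <!-- Drops duplicate tokens that share a position -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```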
Ukrainian
Solr provides support for Ukrainian lemmatization with the solr.MorfologikFilterFactory, in the analysis-extras module.
To use this filter, you must enable the analysis-extras Module.
Lucene also includes an example Ukrainian stopword list, in the lucene-analyzers-morfologik jar.
Factory class: solr.MorfologikFilterFactory
Arguments:
dictionary
Required
Default: none
The path to a lemmatizer dictionary. The lucene-analyzers-morfologik jar contains a Ukrainian dictionary at org/apache/lucene/analysis/uk/ukrainian.dict.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="stop" words="org/apache/lucene/analysis/uk/stopwords.txt"/>
<filter name="lowercase"/>
<filter name="morfologik" dictionary="org/apache/lucene/analysis/uk/ukrainian.dict"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/uk/stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.MorfologikFilterFactory" dictionary="org/apache/lucene/analysis/uk/ukrainian.dict"/>
</analyzer>
The Morfologik dictionary parameter value is a constant specifying which dictionary to choose.
The dictionary resource must be named path/to/language.dict and have an associated .info metadata file.
See the Morfologik project for details.
If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default.
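To illustrate the default behavior just described, here is a sketch of an analyzer that omits the dictionary attribute. Per the statement above it would load the Polish dictionary, so it is not suitable for Ukrainian text; the tokenizer and lowercase filter are assumptions carried over from the earlier example.

```xml
<analyzer>
  <tokenizer name="standard"/>
  <filter name="lowercase"/>
  <!-- No dictionary attribute: the default (Polish) dictionary is loaded -->
  <filter name="morfologik"/>
</analyzer>
```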
Analysis Extras Module
Many of the language features listed above are supported by the analysis-extras Solr Module, which needs to be enabled before use.
Additional details on the specific jar files required can be found in the Module’s README.