Filters
Filters examine a stream of tokens and keep them, transform them, or discard them depending on the filter type being used.
About Filters
Like tokenizers, filters consume input and produce a stream of tokens.
Filters also derive from org.apache.lucene.analysis.TokenStream, but unlike tokenizers, a filter's input is another TokenStream.
The job of a filter is usually easier than that of a tokenizer since in most cases a filter looks at each token in the stream sequentially and decides whether to pass it along, replace it, or discard it.
A filter may also do more complex analysis by looking ahead to consider multiple tokens at once, although this is less common. One hypothetical use for such a filter might be to normalize state names that would be tokenized as two words. For example, the single token "california" would be replaced with "CA", while the token pair "rhode" followed by "island" would become the single token "RI".
Because filters consume one TokenStream and produce a new TokenStream, they can be chained one after another indefinitely.
Each filter in the chain in turn processes the tokens produced by its predecessor.
The order in which you specify the filters is therefore significant.
Typically, the most general filtering is done first, and later filtering stages are more specialized.
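The chaining described above can be sketched in Python as a pipeline of generators, where each stage consumes the token stream produced by its predecessor (a simplified illustration of the concept, not Lucene's actual TokenStream API):

```python
def whitespace_tokenizer(text):
    """Produce the initial token stream."""
    for token in text.split():
        yield token

def lowercase_filter(tokens):
    """Transform each token as it passes through."""
    for token in tokens:
        yield token.lower()

def stop_filter(tokens, stopwords):
    """Discard tokens found in the stop list."""
    for token in tokens:
        if token not in stopwords:
            yield token

# Filters chain in order: the output of one stage is the input of the next,
# so the general lowercasing runs before the more specialized stop filtering.
stream = whitespace_tokenizer("The Cat in the Hat")
stream = lowercase_filter(stream)
stream = stop_filter(stream, {"the", "in"})
print(list(stream))  # ['cat', 'hat']
```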
Filter Configuration
Filters are configured with a <filter> element in the schema file as a child of <analyzer>, following the <tokenizer> element.
For example:
With name
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="englishPorter"/>
</analyzer>
</fieldType>
With class name (legacy)
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
This example starts with Solr’s standard tokenizer, which breaks the field’s text into tokens. All the tokens are then set to lowercase, which will facilitate case-insensitive matching at query time.
The last filter in the above example is a stemmer filter that uses the Porter stemming algorithm.
Stemming
A stemmer is basically a set of mapping rules that maps the various forms of a word back to the base, or stem, word from which they derive.
For example, in English the words "hugs", "hugging" and "hugged" are all forms of the stem word "hug". The stemmer will replace all of these terms with "hug", which is what will be indexed. This means that a query for "hug" will match the term "hugged", but not "huge".
Conversely, applying a stemmer to your query terms will allow queries containing non-stemmed terms, like "hugging", to match documents with different variations of the same stem word, such as "hugged". This works because both the indexer and the query will map to the same stem ("hug").
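A toy suffix-stripping stemmer (hypothetical rules, far simpler than the Porter algorithm) illustrates why applying the same stemmer at index and query time makes variant forms match:

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustrative rules only)."""
    for suffix in ("ging", "ged", "ings", "ing", "ed", "s"):
        # Require at least 3 characters of stem so "huge" is not mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

# Index side: all variants collapse to the stem "hug"; "huge" is untouched.
indexed = [toy_stem(w) for w in ["hugs", "hugging", "hugged", "huge"]]
print(indexed)  # ['hug', 'hug', 'hug', 'huge']

# Query side: "hugging" also maps to "hug", so it matches the indexed terms.
print(toy_stem("hugging") in indexed)  # True
```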
Word stemming is, obviously, very language specific. Solr includes several language-specific stemmers created by the Snowball generator that are based on the Porter stemming algorithm. The generic Snowball Porter Stemmer Filter can be used to configure any of these language stemmers. Solr also includes a convenience wrapper for the English Snowball stemmer. There are also several purpose-built stemmers for non-English languages. These stemmers are described in Language Analysis.
Filters with Arguments
Arguments may be passed to filter factories to modify their behavior by setting attributes on the <filter> element.
For example:
With name
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer type="query">
<tokenizer name="pattern" pattern="; " />
<filter name="length" min="2" max="7"/>
</analyzer>
</fieldType>
With class name (legacy)
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="; " />
<filter class="solr.LengthFilterFactory" min="2" max="7"/>
</analyzer>
</fieldType>
The following sections describe the filter factories that are included in this release of Solr.
ASCII Folding Filter
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 128 ASCII characters) to their ASCII equivalents, if one exists. This filter converts characters from the following Unicode blocks:
- Latin Extended-A (PDF)
- Latin Extended-B (PDF)
- Latin Extended-C (PDF)
- Latin Extended-D (PDF)
- IPA Extensions (PDF)
- Phonetic Extensions (PDF)
- General Punctuation (PDF)
- Enclosed Alphanumerics (PDF)
- Dingbats (PDF)
- Supplemental Punctuation (PDF)
Factory class: solr.ASCIIFoldingFilterFactory
Arguments:
preserveOriginal: Optional. Default: false. If true, the original token is preserved: "thé" → "the", "thé"
Example:
With name
<analyzer>
<tokenizer name="whitespace"/>
<filter name="asciiFolding" preserveOriginal="false" />
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
</analyzer>
In: "á" (Unicode character 00E1)
Out: "a" (ASCII character 97)
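The effect can be approximated in Python with Unicode decomposition, stripping combining marks so accented Latin letters fall back to their ASCII base (a rough sketch; Lucene's filter uses an explicit mapping table covering far more characters than decomposition alone handles):

```python
import unicodedata

def ascii_fold(token, preserve_original=False):
    """Decompose, drop combining marks, then keep only ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", token)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    folded = folded.encode("ascii", "ignore").decode("ascii")
    if preserve_original and folded != token:
        return [folded, token]  # emit both, as preserveOriginal="true" would
    return [folded]

print(ascii_fold("á"))                            # ['a']
print(ascii_fold("thé", preserve_original=True))  # ['the', 'thé']
```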
Beider-Morse Filter
Implements the Beider-Morse Phonetic Matching (BMPM) algorithm, which allows identification of similar names, even if they are spelled differently or in different languages. More information about how this works is available in the section Beider-Morse Phonetic Matching.
BeiderMorseFilter changed its behavior in Solr 5.0 due to an update to version 3.04 of the BMPM algorithm. Older versions of Solr implemented BMPM version 3.00 (see http://stevemorse.org/phoneticinfo.htm). Any index built using this filter with earlier versions of Solr will need to be rebuilt.
Factory class: solr.BeiderMorseFilterFactory
Arguments:
nameType: Optional. Default: GENERIC. Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If not processing Ashkenazi or Sephardic names, use GENERIC.

ruleType: Optional. Default: APPROX. Types of rules to apply. Valid values are APPROX or EXACT.

concat: Optional. Default: true. Defines if multiple possible matches should be combined with a pipe (|).

languageSet: Optional. Default: auto. The language set to use. The value auto will allow the filter to identify the language, or a comma-separated list can be supplied.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="beiderMorse" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
Classic Filter
This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from possessives.
Factory class: solr.ClassicFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="classic"/>
<filter name="classic"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
</analyzer>
In: "I.B.M. cat’s can’t"
Tokenizer to Filter: "I.B.M", "cat’s", "can’t"
Out: "IBM", "cat", "can’t"
Common Grams Filter
This filter, for use in index-time analysis, creates word shingles by combining common tokens such as stop words with regular tokens.
This can result in an index with more unique terms, but is useful for creating phrase queries containing common words, such as "the cat", in a way that will typically be much faster than if the combined tokens are not used, because only the term positions of documents containing both terms in sequence have to be considered.
Correct usage requires being paired with the Common Grams Query Filter during query-time analysis.
These filters can also be combined with the Stop Filter, so that searching for "the cat" would match different documents than "a cat", while pathological searches for either "the" or "a" would not match any documents.
Factory class: solr.CommonGramsFilterFactory
Arguments:
words: Required. Default: none. The name of a common word file in .txt format, such as stopwords.txt.

format: Optional. Default: none. If the stopwords list has been formatted for Snowball, you can specify format="snowball" so Solr can read the stopwords file.

ignoreCase: Optional. Default: false. If true, the filter ignores the case of words when comparing them to the common word file.
Example:
With name
<analyzer type="index">
<tokenizer name="whitespace"/>
<filter name="commonGrams" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="query">
<tokenizer name="whitespace"/>
<filter name="commonGramsQuery" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
In: "the cat in the hat"
Tokenizer to Filter(s): "the", "cat", "in", "the", "hat"
(Index) Out: "the"(1), "the_cat"(1), "cat"(2), "cat_in"(2), "in"(3), "in_the"(3), "the"(4), "the_hat"(4), "hat"(5)
(Query) Out: "the_cat"(1), "cat_in"(2), "in_the"(3), "the_hat"(4)
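The index-time behavior above can be sketched as: emit each token as usual, and additionally emit a token_nextToken shingle at the same position whenever either member of the pair is a common word (a simplified model of the filter, using plain (token, position) tuples instead of Lucene position-increment attributes):

```python
def common_grams_index(tokens, common):
    """Emit (token, position) pairs plus shingles for common-word pairs."""
    out = []
    for i, token in enumerate(tokens):
        pos = i + 1  # token positions are 1-based in the examples above
        out.append((token, pos))
        if i + 1 < len(tokens) and (token in common or tokens[i + 1] in common):
            out.append((f"{token}_{tokens[i + 1]}", pos))
    return out

tokens = ["the", "cat", "in", "the", "hat"]
print(common_grams_index(tokens, {"the", "in"}))
# [('the', 1), ('the_cat', 1), ('cat', 2), ('cat_in', 2), ('in', 3),
#  ('in_the', 3), ('the', 4), ('the_hat', 4), ('hat', 5)]
```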
Common Grams Query Filter
This filter is used for the query-time analysis aspect of the Common Grams Filter; see that filter for a description of arguments, example configuration, and sample input/output.
Collation Key Filter
Collation allows sorting of text in a language-sensitive way. It is usually used for sorting, but can also be used with advanced searches. We’ve covered this in much more detail in the section on Unicode Collation.
Daitch-Mokotoff Soundex Filter
Implements the Daitch-Mokotoff Soundex algorithm, which allows identification of similar names, even if they are spelled differently. More information about how this works is available in the section on Phonetic Matching.
Factory class: solr.DaitchMokotoffSoundexFilterFactory
Arguments:
inject: Optional. Default: true. If true, then new phonetic tokens are added to the stream. Otherwise, tokens are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact spelling of the target word may not match.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="daitchMokotoffSoundex" inject="true"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DaitchMokotoffSoundexFilterFactory" inject="true"/>
</analyzer>
Double Metaphone Filter
This filter creates tokens using the DoubleMetaphone encoding algorithm from commons-codec.
For more information, see Phonetic Matching.
Factory class: solr.DoubleMetaphoneFilterFactory
Arguments:
inject: Optional. Default: true. If true, then new phonetic tokens are added to the stream. Otherwise, tokens are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact spelling of the target word may not match.

maxCodeLength: Optional. Default: none. The maximum length of the code to be generated.
Example:
Default behavior for inject (true): keep the original token and add phonetic token(s) at the same position.
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="doubleMetaphone"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory"/>
</analyzer>
In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "Kuczewski"(4), "KSSK"(4), "KXFS"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the token they were derived from (immediately preceding). Note that "Kuczewski" has two encodings, which are added at the same position.
Example:
Discard original token (inject="false").
<analyzer>
<tokenizer name="standard"/>
<filter name="doubleMetaphone" inject="false"/>
</analyzer>
In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "KSSK"(4), "KXFS"(4)
Note that "Kuczewski" has two encodings, which are added at the same position.
Delimited Boost Filter
This filter adds a numeric floating point boost value to tokens, splitting on a delimiter character.
Factory class: solr.DelimitedBoostTokenFilterFactory
Arguments:
delimiter: Optional. Default: | (pipe symbol). The character used to separate the token and the boost.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="delimitedBoost"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DelimitedBoostTokenFilterFactory"/>
</analyzer>
In: "leopard|0.5 panthera uncia|0.9"
Tokenizer to Filter: "leopard|0.5"(1), "panthera"(2), "uncia|0.9"(3)
Out: "leopard"(1)[0.5], "panthera"(2), "uncia"(3)[0.9]
The numeric floating point in square brackets is a float token boost attribute.
Example:
Using a different delimiter (delimiter="/").
<analyzer>
<tokenizer name="standard"/>
<filter name="delimitedBoost" delimiter="/"/>
</analyzer>
In: "leopard/0.5 panthera uncia/0.9"
Tokenizer to Filter: "leopard/0.5"(1), "panthera"(2), "uncia/0.9"(3)
Out: "leopard"(1)[0.5], "panthera"(2), "uncia"(3)[0.9]
N.B.: make sure the delimiter is compatible with the tokenizer you use.
Edge N-Gram Filter
This filter generates edge n-gram tokens of sizes within the given range.
Factory class: solr.EdgeNGramFilterFactory
Arguments:
minGramSize: Required. Default: none. The minimum gram size. Must be > 0.

maxGramSize: Required. Default: none. The maximum gram size. Must be >= minGramSize.

preserveOriginal: Optional. Default: false. If true, keep the original term even if it is shorter than minGramSize or longer than maxGramSize.
Example:
Default behavior.
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="edgeNGram" minGramSize="1" maxGramSize="1"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "f", "s", "a", "t"
Example:
A range of 1 to 4.
<analyzer>
<tokenizer name="standard"/>
<filter name="edgeNGram" minGramSize="1" maxGramSize="4"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
Example:
A range of 4 to 6.
<analyzer>
<tokenizer name="standard"/>
<filter name="edgeNGram" minGramSize="4" maxGramSize="6"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "four", "scor", "score", "twen", "twent", "twenty"
Example:
Preserve original term.
<analyzer>
<tokenizer name="standard"/>
<filter name="edgeNGram" minGramSize="2" maxGramSize="3" preserveOriginal="true"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "fo", "fou", "four", "sc", "sco", "score"
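The examples above follow directly from prefix slicing; a minimal sketch of per-token edge n-gram generation:

```python
def edge_ngrams(token, min_gram, max_gram, preserve_original=False):
    """Emit the leading substrings of sizes min_gram..max_gram."""
    grams = [token[:n] for n in range(min_gram, max_gram + 1) if n <= len(token)]
    if preserve_original and token not in grams:
        grams.append(token)  # keep the full term even if outside the size range
    return grams

print(edge_ngrams("four", 1, 4))   # ['f', 'fo', 'fou', 'four']
print(edge_ngrams("score", 4, 6))  # ['scor', 'score']
print(edge_ngrams("four", 2, 3, preserve_original=True))  # ['fo', 'fou', 'four']
```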
English Minimal Stem Filter
This filter stems plural English words to their singular form.
Factory class: solr.EnglishMinimalStemFilterFactory
Arguments: None
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="englishMinimalStem"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
In: "dogs cats"
Tokenizer to Filter: "dogs", "cats"
Out: "dog", "cat"
English Possessive Filter
This filter removes singular possessives (trailing 's) from words. Note that plural possessives, e.g., the s' in "divers' snorkels", are not removed by this filter.
Factory class: solr.EnglishPossessiveFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="whitespace"/>
<filter name="englishPossessive"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
</analyzer>
In: "Man’s dog bites dogs' man"
Tokenizer to Filter: "Man’s", "dog", "bites", "dogs'", "man"
Out: "Man", "dog", "bites", "dogs'", "man"
Fingerprint Filter
This filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens. This can be useful for clustering/linking use cases.
Factory class: solr.FingerprintFilterFactory
Arguments:
separator: Optional. Default: space character. The character used to separate tokens combined into the single output token.

maxOutputTokenSize: Optional. Default: 1024. The maximum length of the summarized output token. If exceeded, no output token is emitted.
Example:
With name
<analyzer type="index">
<tokenizer name="whitespace"/>
<filter name="fingerprint" separator="_" />
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.FingerprintFilterFactory" separator="_" />
</analyzer>
In: "the quick brown fox jumped over the lazy dog"
Tokenizer to Filter: "the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
Out: "brown_dog_fox_jumped_lazy_over_quick_the"
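The fingerprint computation is simply sort, de-duplicate, and join, with the length cap applied at the end (a sketch of the filter's contract, not the Lucene implementation):

```python
def fingerprint(tokens, separator=" ", max_output_token_size=1024):
    """Concatenate the sorted, de-duplicated tokens into one output token."""
    result = separator.join(sorted(set(tokens)))
    if len(result) > max_output_token_size:
        return []  # over the limit: no output token is emitted
    return [result]

tokens = "the quick brown fox jumped over the lazy dog".split()
print(fingerprint(tokens, separator="_"))
# ['brown_dog_fox_jumped_lazy_over_quick_the']
```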
Flatten Graph Filter
This filter must be included on index-time analyzer specifications that include at least one graph-aware filter, including Synonym Graph Filter and Word Delimiter Graph Filter.
Factory class: solr.FlattenGraphFilterFactory
Arguments: None
See the examples below for Synonym Graph Filter and Word Delimiter Graph Filter.
Hunspell Stem Filter
The Hunspell Stem Filter provides support for several languages.
You must provide the dictionary (.dic) and rules (.aff) files for each language you wish to use with the Hunspell Stem Filter.
You can download those language files here.
Be aware that your results will vary widely based on the quality of the provided dictionary and rules files. For example, some languages have only a minimal word list with no morphological information. On the other hand, for languages that have no stemmer but do have an extensive dictionary file, the Hunspell stemmer may be a good choice.
Factory class: solr.HunspellStemFilterFactory
Arguments:
dictionary: Required. Default: none. The path to a dictionary file.

affix: Required. Default: none. The path to a rules file.

ignoreCase: Optional. Default: false. Controls whether matching is case sensitive or not.

longestOnly: Optional. Default: false. If true, only the longest term is emitted.

strictAffixParsing: Optional. Default: true. Controls whether the affix parsing is strict or not. If true, an error while reading an affix rule causes a ParseException; otherwise it is ignored.
Example:
With name
<analyzer type="index">
<tokenizer name="whitespace"/>
<filter name="hunspellStem"
dictionary="en_GB.dic"
affix="en_GB.aff"
ignoreCase="true"
strictAffixParsing="true" />
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HunspellStemFilterFactory"
dictionary="en_GB.dic"
affix="en_GB.aff"
ignoreCase="true"
strictAffixParsing="true" />
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Hyphenated Words Filter
This filter reconstructs hyphenated words that have been tokenized as two tokens because of a line break or other intervening whitespace in the field text. If a token ends with a hyphen, it is joined with the following token and the hyphen is discarded.
Note that for this filter to work properly, the upstream tokenizer must not remove trailing hyphen characters. This filter is generally only useful at index time.
Factory class: solr.HyphenatedWordsFilterFactory
Arguments: None
Example:
With name
<analyzer type="index">
<tokenizer name="whitespace"/>
<filter name="hyphenatedWords"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
</analyzer>
In: "A hyphen- ated word"
Tokenizer to Filter: "A", "hyphen-", "ated", "word"
Out: "A", "hyphenated", "word"
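The joining rule can be sketched as: whenever a token ends with a hyphen, buffer the fragment and glue the next token onto it (a simplification of HyphenatedWordsFilter that ignores character offsets):

```python
def hyphenated_words(tokens):
    """Rejoin tokens split across a line break at a trailing hyphen."""
    out = []
    pending = ""
    for token in tokens:
        if token.endswith("-"):
            pending += token[:-1]  # buffer the fragment, drop the hyphen
        else:
            out.append(pending + token)
            pending = ""
    if pending:  # a trailing fragment with no continuation is emitted as-is
        out.append(pending)
    return out

print(hyphenated_words(["A", "hyphen-", "ated", "word"]))
# ['A', 'hyphenated', 'word']
```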
ICU Folding Filter
This filter is a custom Unicode normalization form that applies the foldings specified in Unicode TR #30: Character Foldings in addition to the NFKC_Casefold normalization form as described in ICU Normalizer 2 Filter.
This filter is a better substitute for the combined behavior of the ASCII Folding Filter, Lower Case Filter, and ICU Normalizer 2 Filter.
To use this filter, you must add additional .jars to Solr’s classpath (as described in the section Installing Plugins).
See solr/modules/analysis-extras/README.md for instructions on which jars you need to add.
Factory class: solr.ICUFoldingFilterFactory
Arguments:
filter: Optional. Default: none. A Unicode set filter that can be used to, e.g., exclude a set of characters from being processed. See the UnicodeSet javadocs for more information.
Example without a filter:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="icuFolding"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
Example with a filter to exclude Swedish/Finnish characters:
<analyzer>
<tokenizer name="standard"/>
<filter name="icuFolding" filter="[^åäöÅÄÖ]"/>
</analyzer>
For detailed information on this normalization form, see Unicode TR #30: Character Foldings.
ICU Normalizer 2 Filter
This filter normalizes text according to one of five Unicode Normalization Forms as described in Unicode Standard Annex #15:

- NFC (name="nfc" mode="compose"): Normalization Form C, canonical decomposition, followed by canonical composition
- NFD (name="nfc" mode="decompose"): Normalization Form D, canonical decomposition
- NFKC (name="nfkc" mode="compose"): Normalization Form KC, compatibility decomposition, followed by canonical composition
- NFKD (name="nfkc" mode="decompose"): Normalization Form KD, compatibility decomposition
- NFKC_Casefold (name="nfkc_cf" mode="compose"): Normalization Form KC, with additional Unicode case folding

Using the ICU Normalizer 2 Filter is a better-performing substitute for the Lower Case Filter and NFKC normalization.
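Python's unicodedata module and str.casefold give a rough approximation of the NFKC_Casefold form (ICU applies normalization and case folding together; composing them sequentially as below agrees for most text but is not guaranteed identical in every edge case):

```python
import unicodedata

def nfkc_casefold(text):
    """Approximate NFKC_Casefold: compatibility-normalize, then casefold.

    Casefolding can produce non-normalized output, so normalize again at the end.
    """
    return unicodedata.normalize("NFKC",
                                 unicodedata.normalize("NFKC", text).casefold())

print(nfkc_casefold("Ｈéllo"))  # 'héllo' (fullwidth H mapped to h)
print(nfkc_casefold("ﬁne"))     # 'fine'  (the fi ligature decomposed)
```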
Factory class: solr.ICUNormalizer2FilterFactory
Arguments:
form: Required. Default: nfkc_cf. The name of the normalization form. Valid options are nfc, nfd, nfkc, nfkd, or nfkc_cf.

mode: Required. Default: compose. The mode of Unicode character composition and decomposition. Valid options are compose or decompose.

filter: Optional. Default: none. A Unicode set filter that can be used to, e.g., exclude a set of characters from being processed. See the UnicodeSet javadocs for more information.
Example with NFKC_Casefold:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="icuNormalizer2" form="nfkc_cf" mode="compose"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUNormalizer2FilterFactory" form="nfkc_cf" mode="compose"/>
</analyzer>
Example with a filter to exclude Swedish/Finnish characters:
<analyzer>
<tokenizer name="standard"/>
<filter name="icuNormalizer2" form="nfkc_cf" mode="compose" filter="[^åäöÅÄÖ]"/>
</analyzer>
For detailed information about these normalization forms, see Unicode Normalization Forms.
To use this filter, you must add additional .jars to Solr’s classpath (as described in the section Installing Plugins).
See solr/modules/analysis-extras/README.md for instructions on which jars you need to add.
ICU Transform Filter
This filter applies ICU Transforms to text. This filter supports only ICU System Transforms; custom rule sets are not supported.
Factory class: solr.ICUTransformFilterFactory
Arguments:
id: Required. Default: none. The identifier for the ICU System Transform you wish to apply with this filter. For a full list of ICU System Transforms, see http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html.

direction: Optional. Default: forward. The direction of the ICU transform. Valid options are forward and reverse.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="icuTransform" id="Traditional-Simplified"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>
For detailed information about ICU Transforms, see http://userguide.icu-project.org/transforms/general.
To use this filter, you must add additional .jars to Solr’s classpath (as described in the section Installing Plugins).
See solr/modules/analysis-extras/README.md for instructions on which jars you need to add.
Keep Word Filter
This filter discards all tokens except those that are listed in the given word list. This is the inverse of the Stop Words Filter. This filter can be useful for building specialized indices for a constrained set of terms.
Factory class: solr.KeepWordFilterFactory
Arguments:
words: Required. Default: none. Path to a text file containing the list of keep words, one per line. Blank lines and lines that begin with # are ignored. This may be an absolute path, or a simple filename in the Solr conf directory.

format: Optional. Default: none. If the keepwords list has been formatted for Snowball, you can specify format="snowball" so Solr can read the keepwords file.

ignoreCase: Optional. Default: false. If true, then comparisons are done case-insensitively. If this argument is true, then the words file is assumed to contain only lowercase words.
Example:
Where keepwords.txt contains:

happy
funny
silly
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="keepWord" words="keepwords.txt"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "funny"
Example:
Same keepwords.txt, case insensitive:
<analyzer>
<tokenizer name="standard"/>
<filter name="keepWord" words="keepwords.txt" ignoreCase="true"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "Happy", "funny"
Example:
Using LowerCaseFilterFactory before filtering for keep words, with no ignoreCase flag.
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="keepWord" words="keepwords.txt"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Filter to Filter: "happy", "sad", "or", "funny"
Out: "happy", "funny"
KStem Filter
KStem is an alternative to the Porter Stem Filter for developers looking for a less aggressive stemmer. KStem was written by Bob Krovetz, ported to Lucene by Sergio Guzman-Lara (UMASS Amherst). This stemmer is only appropriate for English language text.
Factory class: solr.KStemFilterFactory
Arguments: None
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="kStem"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Length Filter
This filter passes tokens whose length falls within the min/max limit specified. All other tokens are discarded.
Factory class: solr.LengthFilterFactory
Arguments:
min: Required. Default: none. Minimum token length. Tokens shorter than this are discarded.

max: Required. Default: none. Maximum token length. Must be larger than min. Tokens longer than this are discarded.
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="length" min="3" max="7"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="7"/>
</analyzer>
In: "turn right at Albuquerque"
Tokenizer to Filter: "turn", "right", "at", "Albuquerque"
Out: "turn", "right"
Limit Token Count Filter
This filter limits the number of accepted tokens, typically useful for index analysis.
By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which can result in reset() being called prior to incrementToken() returning false. For most TokenStream implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use the consumeAllTokens="true" option.
Factory class: solr.LimitTokenCountFilterFactory
Arguments:
maxTokenCount: Required. Default: none. Maximum token count. After this limit has been reached, tokens are discarded.

consumeAllTokens: Optional. Default: false. Whether to consume (and discard) previous token filters' tokens after the maximum token count has been reached. See the description above.
Example:
With name
<analyzer type="index">
<tokenizer name="whitespace"/>
<filter name="limitTokenCount" maxTokenCount="10"
consumeAllTokens="false" />
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10"
consumeAllTokens="false" />
</analyzer>
In: "1 2 3 4 5 6 7 8 9 10 11 12"
Tokenizer to Filter: "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"
Out: "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"
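The behavior maps naturally onto itertools.islice: truncate after maxTokenCount tokens, optionally draining the rest of the stream first (a sketch of the filter's contract, not the Lucene implementation):

```python
from itertools import islice

def limit_token_count(tokens, max_token_count, consume_all_tokens=False):
    """Yield at most max_token_count tokens from a token iterator."""
    tokens = iter(tokens)
    yield from islice(tokens, max_token_count)
    if consume_all_tokens:
        for _ in tokens:  # exhaust (and discard) the remaining tokens
            pass

stream = iter("1 2 3 4 5 6 7 8 9 10 11 12".split())
print(list(limit_token_count(stream, 10)))
# ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
```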
Limit Token Offset Filter
This filter limits tokens to those before a configured maximum start character offset. This can be useful to limit highlighting, for example.
By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which can result in reset() being called prior to incrementToken() returning false. For most TokenStream implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use the consumeAllTokens="true" option.
Factory class: solr.LimitTokenOffsetFilterFactory
Arguments:
maxStartOffset: Required. Default: none. Maximum token start character offset. After this limit has been reached, tokens are discarded.

consumeAllTokens: Optional. Default: false. Whether to consume (and discard) previous token filters' tokens after the maximum start offset has been reached. See the description above.
Example:
With name
<analyzer>
<tokenizer name="whitespace"/>
<filter name="limitTokenOffset" maxStartOffset="10"
consumeAllTokens="false" />
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LimitTokenOffsetFilterFactory" maxStartOffset="10"
consumeAllTokens="false" />
</analyzer>
In: "0 2 4 6 8 A C E"
Tokenizer to Filter: "0", "2", "4", "6", "8", "A", "C", "E"
Out: "0", "2", "4", "6", "8", "A"
Limit Token Position Filter
This filter limits tokens to those before a configured maximum token position.
By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which can result in reset() being called prior to incrementToken() returning false. For most TokenStream implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use the consumeAllTokens="true" option.
Factory class: solr.LimitTokenPositionFilterFactory
Arguments:
maxTokenPosition
-
Required
Default: none
Maximum token position. After this limit has been reached, tokens are discarded.
consumeAllTokens
-
Optional
Default:
false
Whether to consume (and discard) previous token filters' tokens after the maximum token position has been reached. See description above.
Example:
With name
<analyzer>
<tokenizer name="whitespace"/>
<filter name="limitTokenPosition" maxTokenPosition="3"
consumeAllTokens="false" />
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LimitTokenPositionFilterFactory" maxTokenPosition="3"
consumeAllTokens="false" />
</analyzer>
In: "1 2 3 4 5"
Tokenizer to Filter: "1", "2", "3", "4", "5"
Out: "1", "2", "3"
Lower Case Filter
Converts any uppercase letters in a token to the equivalent lowercase token. All other characters are left unchanged.
Factory class: solr.LowerCaseFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
In: "Down With CamelCase"
Tokenizer to Filter: "Down", "With", "CamelCase"
Out: "down", "with", "camelcase"
Managed Stop Filter
This is a specialized version of the Stop Words Filter Factory that uses a set of stop words that are managed from a REST API.
Arguments:
managed
-
Required
Default: none
The name that should be used for this set of stop words in the managed REST API.
Example:
With this configuration the set of words is named "english" and can be managed via /solr/collection_name/schema/analysis/stopwords/english
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="managedStop" managed="english"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory" managed="english"/>
</analyzer>
See Stop Filter for example input/output.
Managed Synonym Filter
This is a specialized version of the Synonym Filter that uses a mapping on synonyms that is managed from a REST API.
Managed Synonym Filter has been Deprecated
Managed Synonym Filter has been deprecated in favor of Managed Synonym Graph Filter, which is required for multi-term synonym support.
Factory class: solr.ManagedSynonymFilterFactory
For arguments and examples, see the Synonym Graph Filter below.
Managed Synonym Graph Filter
This is a specialized version of the Synonym Graph Filter that uses a mapping on synonyms that is managed from a REST API.
This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a replacement for the Managed Synonym Filter, which produces incorrect graphs for multi-token synonyms.
Although this filter produces correct token graphs, it cannot consume an input token graph correctly.
Arguments:
managed
-
Required
Default: none
The name that should be used for this mapping on synonyms in the managed REST API.
Example:
With this configuration the set of mappings is named "english" and can be managed via /solr/collection_name/schema/analysis/synonyms/english
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="managedSynonymGraph" managed="english"/>
<filter name="flattenGraph"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
<tokenizer name="standard"/>
<filter name="managedSynonymGraph" managed="english"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedSynonymGraphFilterFactory" managed="english"/>
<filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedSynonymGraphFilterFactory" managed="english"/>
</analyzer>
See Synonym Graph Filter below for example input/output.
MinHash Filter
Generates a repeatably random fixed number of hash tokens from all the input tokens in the stream. To do this it first consumes all of the input tokens from its source. This filter would normally be preceded by a Shingle Filter, as shown in the example below.
Each input token is hashed.
It is subsequently "rehashed" hashCount
times by combining with a set of precomputed hashes.
For each of the resulting hashes, the hash space is divided into bucketCount
buckets.
The lowest set of hashSetSize
hashes (usually a set of one) is generated for each bucket.
This filter generates one type of signature or sketch for the input tokens and can be used to compute Jaccard similarity between documents.
Arguments:
hashCount
-
Optional
Default:
1
The number of hashes to use.
bucketCount
-
Optional
Default:
512
The number of buckets to use.
hashSetSize
-
Optional
Default:
1
The size of the set for the lowest hashes from each bucket.
withRotation
-
Optional
Default: see description
If a hash bucket is empty, generate a hash value from the first previous bucket that has a value. The default is
true
if the bucketCount
is greater than 1
and false
otherwise.
The number of hashes generated depends on the options above.
With the default settings for withRotation
, the number of hashes generated is hashCount
x bucketCount
x hashSetSize
⇒ 512, by default.
Example:
With name
<analyzer>
<tokenizer name="icu"/>
<filter name="icuFolding"/>
<filter name="shingle" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
<filter name="minHash" bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
<filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
In: "woof woof woof woof woof"
Tokenizer to Filter: "woof woof woof woof woof"
Out: "℁팽徭聙↝ꇁ홱杯", "℁팽徭聙↝ꇁ홱杯", "℁팽徭聙↝ꇁ홱杯", … a total of 512 times
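The hashing scheme above can be illustrated with a simplified Python sketch for the default hashCount=1 and hashSetSize=1 case (MD5-based and purely illustrative — not byte-compatible with Lucene's MinHashFilter):

```python
import hashlib

def min_hash_signature(tokens, bucket_count=512, with_rotation=True):
    """Keep the lowest 64-bit hash seen in each bucket of the hash space."""
    buckets = [None] * bucket_count
    bucket_width = (1 << 64) // bucket_count
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        b = min(h // bucket_width, bucket_count - 1)
        if buckets[b] is None or h < buckets[b]:
            buckets[b] = h
    if with_rotation and any(b is not None for b in buckets):
        for i in range(bucket_count):  # fill gaps from the first previous non-empty bucket
            j = i
            while buckets[j] is None:
                j = (j - 1) % bucket_count
            buckets[i] = buckets[j]
    return buckets
```

Comparing the fraction of equal positions between two documents' signatures gives an estimate of the Jaccard similarity of their underlying shingle sets.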
N-Gram Filter
Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by gram size.
Factory class: solr.NGramFilterFactory
Arguments:
minGramSize
-
Required
Default: none
The minimum gram size, must be > 0.
maxGramSize
-
Required
Default: none
The maximum gram size, must be >=
minGramSize
. preserveOriginal
-
Optional
Default:
false
If
true
keep the original term even if it is shorter than minGramSize
or longer than maxGramSize
.
Example:
Default behavior.
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="nGram" minGramSize="1" maxGramSize="2"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="2"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "o", "ou", "u", "ur", "r", "s", "sc", "c", "co", "o", "or", "r", "re", "e"
Example:
A range of 1 to 4.
<analyzer>
<tokenizer name="standard"/>
<filter name="nGram" minGramSize="1" maxGramSize="4"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "o", "ou", "our", "u", "ur", "r", "s", "sc", "sco", "scor", "c", "co", "cor", "core", "o", "or", "ore", "r", "re", "e"
Example:
A range of 3 to 5.
<analyzer>
<tokenizer name="standard"/>
<filter name="nGram" minGramSize="3" maxGramSize="5"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"
Example:
Preserve original term.
<analyzer>
<tokenizer name="standard"/>
<filter name="nGram" minGramSize="2" maxGramSize="3" preserveOriginal="true"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "fo", "fou", "ou", "our", "ur", "four", "sc", "sco", "co", "cor", "or", "ore", "re", "score"
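The n-gram generation and ordering described above (by position, then by gram size) can be sketched as a small Python function, modeled on the examples in this section:

```python
def ngram_filter(token, min_gram, max_gram, preserve_original=False):
    """Emit character n-grams ordered by start position, then by gram size."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    if preserve_original and not (min_gram <= len(token) <= max_gram):
        grams.append(token)  # keep the out-of-range original term as well
    return grams

print(ngram_filter("four", 3, 5))  # ['fou', 'four', 'our']
print(ngram_filter("four", 2, 3, preserve_original=True))
# ['fo', 'fou', 'ou', 'our', 'ur', 'four']
```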
Numeric Payload Token Filter
This filter adds a numeric floating point payload value to tokens that match a given type.
Refer to the Javadoc for the org.apache.lucene.analysis.Token
class for more information about token types and payloads.
Factory class: solr.NumericPayloadTokenFilterFactory
Arguments:
payload
-
Required
Default: none
A floating point value that will be added to all matching tokens.
typeMatch
-
Required
Default: none
A token type name string. Tokens with a matching type name will have their payload set to the above floating point value.
Example:
With name
<analyzer>
<tokenizer name="whitespace"/>
<filter name="numericPayload" payload="0.75" typeMatch="word"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NumericPayloadTokenFilterFactory" payload="0.75" typeMatch="word"/>
</analyzer>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0.75], "bang"[0.75], "boom"[0.75]
Pattern Replace Filter
This filter applies a regular expression to each token and, for those that match, substitutes the given replacement string in place of the matched pattern. Tokens which do not match are passed through unchanged.
Factory class: solr.PatternReplaceFilterFactory
Arguments:
pattern
-
Required
Default: none
The regular expression to test against each token, as per
java.util.regex.Pattern
. replacement
-
Required
Default: none
A string to substitute in place of the matched pattern. This string may contain references to capture groups in the regex pattern. See the Javadoc for
java.util.regex.Matcher
. replace
-
Optional
Default:
all
Indicates whether all occurrences of the pattern (
all
) in the token should be replaced, or only the first (first
).
Example:
Simple string replace:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="patternReplace" pattern="cat" replacement="dog"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"/>
</analyzer>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogydog"
Example:
String replacement, first occurrence only:
<analyzer>
<tokenizer name="standard"/>
<filter name="patternReplace" pattern="cat" replacement="dog" replace="first"/>
</analyzer>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogycat"
Example:
More complex pattern with capture group reference in the replacement. Tokens that start with non-numeric characters and end with digits will have an underscore inserted before the numbers. Otherwise the token is passed through.
<analyzer>
<tokenizer name="standard"/>
<filter name="patternReplace" pattern="(\D+)(\d+)$" replacement="$1_$2"/>
</analyzer>
In: "cat foo1234 9987 blah1234foo"
Tokenizer to Filter: "cat", "foo1234", "9987", "blah1234foo"
Out: "cat", "foo_1234", "9987", "blah1234foo"
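The capture-group example above maps directly onto java.util.regex semantics; Python's re module is close enough to reproduce it as a quick check:

```python
import re

# (\D+)(\d+)$ : non-digits followed by trailing digits; $1_$2 inserts an underscore
pattern = re.compile(r"(\D+)(\d+)$")
tokens = ["cat", "foo1234", "9987", "blah1234foo"]

# sub() leaves non-matching tokens untouched, just as the filter passes them through
out = [pattern.sub(r"\1_\2", tok) for tok in tokens]
print(out)  # ['cat', 'foo_1234', '9987', 'blah1234foo']
```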
Phonetic Filter
This filter creates tokens using one of the phonetic encoding algorithms in the org.apache.commons.codec.language
package.
For more information, see the section on Phonetic Matching.
Factory class: solr.PhoneticFilterFactory
Arguments:
encoder
-
Required
Default: none
The name of the encoder to use. The encoder name must be one of the following (case insensitive):
inject
-
Optional
Default:
true
If
true
, new phonetic tokens are added to the stream. Otherwise, tokens are replaced with the phonetic equivalent. Setting this to false
will enable phonetic matching, but the exact spelling of the target word may not match. maxCodeLength
-
Optional
Default: none
The maximum length of the code to be generated by the Metaphone or Double Metaphone encoders.
Example:
Default behavior for DoubleMetaphone encoding.
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="phonetic" encoder="DoubleMetaphone"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "twenty"(4), "TNT"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the token they were derived from (immediately preceding).
Example:
Discard original token.
<analyzer>
<tokenizer name="standard"/>
<filter name="phonetic" encoder="DoubleMetaphone" inject="false"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "TWNT"(4)
Example:
Default Soundex encoder.
<analyzer>
<tokenizer name="standard"/>
<filter name="phonetic" encoder="Soundex"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "F600"(1), "score"(2), "S600"(2), "and"(3), "A530"(3), "twenty"(4), "T530"(4)
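The Soundex codes in the example above can be reproduced with a small sketch of the classic American Soundex algorithm (illustrative only — Solr delegates the real encoding to Apache Commons Codec):

```python
def soundex(word):
    """Classic American Soundex: first letter + up to three digit codes."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()           # the first letter is kept verbatim
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue                   # h and w do not break a run of equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                    # vowels (code "") reset the run
    return (result + "000")[:4]

for w in ["four", "score", "and", "twenty"]:
    print(w, soundex(w))  # F600, S600, A530, T530 — matching the output above
```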
Porter Stem Filter
This filter applies the Porter Stemming Algorithm for English.
The results are similar to using the Snowball Porter Stemmer with the language="English"
argument.
But this stemmer is coded directly in Java and is not based on Snowball.
It does not accept a list of protected words and is only appropriate for English language text.
However, it has been benchmarked as four times faster than the English Snowball stemmer, so it can provide a performance enhancement.
Factory class: solr.PorterStemFilterFactory
Arguments: None
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="porterStem"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Protected Term Filter
This filter enables a form of conditional filtering: it only applies its wrapped filters to terms that are not contained in a protected set.
Factory class: solr.ProtectedTermFilterFactory
Arguments:
protected
-
Required
Default: none
Comma-separated list of files containing protected terms, one per line.
wrappedFilters
-
Required
Default: none
Case-insensitive comma-separated list of
TokenFilterFactory
SPI names (strip the trailing (Token)FilterFactory
from the factory name - see the java.util.ServiceLoader interface
). Each filter name must be unique, so if you need to specify the same filter more than once, you must add case-insensitive unique -id
suffixes to each same-SPI-named filter (note that the -id
suffix is stripped prior to SPI lookup). ignoreCase
-
Optional
Default:
false
Ignore case when testing for protected words. If
true
, the protected list should contain lowercase words.
Example:
All terms except those in protectedTerms.txt
are truncated at 4 characters and lowercased:
With name
<analyzer>
<tokenizer name="whitespace"/>
<filter name="protectedTerm"
ignoreCase="true" protected="protectedTerms.txt"
wrappedFilters="truncate,lowercase"
truncate.prefixLength="4"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ProtectedTermFilterFactory"
ignoreCase="true" protected="protectedTerms.txt"
wrappedFilters="truncate,lowercase"
truncate.prefixLength="4"/>
</analyzer>
Example:
This example includes multiple same-named wrapped filters with unique -id
suffixes.
Note that both the filter SPI names and -id
suffixes are treated case-insensitively.
For all terms except those in protectedTerms.txt
, synonyms are added, terms are reversed, and then synonyms are added for the reversed terms:
<analyzer type="query">
<tokenizer name="whitespace"/>
<filter name="protectedTerm"
ignoreCase="true" protected="protectedTerms.txt"
wrappedFilters="SynonymGraph-fwd,ReverseString,SynonymGraph-rev"
synonymgraph-FWD.synonyms="fwd-syns.txt"
synonymgraph-REV.synonyms="rev-syns.txt"/>
</analyzer>
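The conditional-filtering idea can be sketched as a hypothetical Python helper, using truncate-to-4 plus lowercase as the wrapped filters from the first example (the protected set stands in for protectedTerms.txt):

```python
def protected_term_filter(tokens, protected, wrapped_filters, ignore_case=True):
    """Apply wrapped_filters only to tokens not in the protected set."""
    for tok in tokens:
        key = tok.lower() if ignore_case else tok
        if key not in protected:
            for f in wrapped_filters:
                tok = f(tok)
        yield tok

protected = {"solr"}                    # lowercase entries, since ignoreCase=True
wrapped = [lambda t: t[:4], str.lower]  # truncate.prefixLength=4, then lowercase
print(list(protected_term_filter(["Solr", "Wildly", "Long"], protected, wrapped)))
# ['Solr', 'wild', 'long']
```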
Remove Duplicates Token Filter
The filter removes duplicate tokens in the stream. Tokens are considered to be duplicates ONLY if they have the same text and position values.
Because positions must be the same, this filter might not do what a user expects it to do based on its name. It is a very specialized filter that is only useful in very specific circumstances. It has been so named for brevity, even though it is potentially misleading.
Factory class: solr.RemoveDuplicatesTokenFilterFactory
Arguments: None
Example:
One example of where RemoveDuplicatesTokenFilterFactory
is useful is in situations where a synonym file is being used in conjunction with a stemmer.
In these situations, both the stemmer and the synonym filter can cause completely identical terms with the same positions to end up in the stream, increasing index size with no benefit.
Consider the following entry from a synonyms.txt
file:
Television, Televisions, TV, TVs
When used in the following configuration:
With name
<analyzer type="query">
<tokenizer name="standard"/>
<filter name="synonymGraph" synonyms="synonyms.txt"/>
<filter name="englishMinimalStem"/>
<filter name="removeDuplicates"/>
</analyzer>
With class name (legacy)
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
In: "Watch TV"
Tokenizer to Synonym Filter: "Watch"(1) "TV"(2)
Synonym Filter to Stem Filter: "Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2)
Stem Filter to Remove Dups Filter: "Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2)
Out: "Watch"(1) "Television"(2) "TV"(2)
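The dedup rule — same text and same position — is a one-pass set check, sketched here over the stemmer output from the example above:

```python
def remove_duplicates(tokens):
    """Drop tokens that repeat both text and position; all others pass through."""
    seen = set()
    for text, pos in tokens:
        if (text, pos) not in seen:
            seen.add((text, pos))
            yield text, pos

stemmed = [("Watch", 1), ("Television", 2), ("Television", 2), ("TV", 2), ("TV", 2)]
print(list(remove_duplicates(stemmed)))
# [('Watch', 1), ('Television', 2), ('TV', 2)]
```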
Reversed Wildcard Filter
This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards are not reversed.
Factory class: solr.ReversedWildcardFilterFactory
Arguments:
withOriginal
-
Optional
Default:
true
If
true
, the filter produces both original and reversed tokens at the same positions. Iffalse
, produces only reversed tokens. maxPosAsterisk
-
Optional
Default:
2
The maximum position of the asterisk wildcard ('*') that triggers the reversal of the query term. Terms with asterisks at positions above this value are not reversed.
maxPosQuestion
-
Optional
Default:
1
The maximum position of the question mark wildcard ('?') that triggers the reversal of the query term. To reverse only pure suffix queries (queries with a single leading asterisk), set this to 0 and
maxPosAsterisk
to 1. maxFractionAsterisk
-
Optional
Default:
0.0
An additional parameter that triggers the reversal if asterisk ('*') position is less than this fraction of the query token length.
minTrailing
-
Optional
Default:
2
The minimum number of trailing characters in a query token after the last wildcard character. For good performance this should be set to a value larger than
1
.
Example:
With name
<analyzer type="index">
<tokenizer name="whitespace"/>
<filter name="reversedWildcard" withOriginal="true"
maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
</analyzer>
In: "*foo *bar"
Tokenizer to Filter: "*foo", "*bar"
Out: "oof*", "rab*"
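Conceptually, reversal turns an expensive leading-wildcard scan into a cheap prefix lookup. A simplified sketch of the query-time decision follows (the real filter also emits marker characters and applies the full maxPos*/maxFractionAsterisk/minTrailing heuristics):

```python
def maybe_reverse(term, max_pos_asterisk=2, max_pos_question=1):
    """Reverse a wildcard term when the wildcard sits near the front (0-based find)."""
    star, quest = term.find("*"), term.find("?")
    if (0 <= star < max_pos_asterisk) or (0 <= quest < max_pos_question):
        return term[::-1]
    return term

print(maybe_reverse("*foo"))  # 'oof*'  -> now answerable as a prefix-style lookup
print(maybe_reverse("foo*"))  # 'foo*'  -> wildcard already trailing, unchanged
```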
Shingle Filter
This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.
Factory class: solr.ShingleFilterFactory
Arguments:
minShingleSize
-
Optional
Default:
2
The minimum number of tokens per shingle. Must be higher than or equal to
2
. maxShingleSize
-
Optional
Default:
2
The maximum number of tokens per shingle. Must be higher than or equal to
minShingleSize
. outputUnigrams
-
Optional
Default:
true
If
true
, then each individual token is also included at its original position. outputUnigramsIfNoShingles
-
Optional
Default:
false
If
true
, then individual tokens will be output if no shingles are possible. tokenSeparator
-
Optional
Default: space character
The string to use when joining adjacent tokens to form a shingle.
fillerToken
-
Optional
Default:
_
(underscore). The character used to fill in for removed stop words in order to preserve position increments.
Example:
Default behavior.
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="shingle"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"/>
</analyzer>
In: "To be, or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
Example:
A shingle size of four, do not include original token.
<analyzer>
<tokenizer name="standard"/>
<filter name="shingle" maxShingleSize="4" outputUnigrams="false"/>
</analyzer>
In: "To be, or not to be."
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "not"(4), "to"(5), "be"(6)
Out: "To be"(1), "To be or"(1), "To be or not"(1), "be or"(2), "be or not"(2), "be or not to"(2), "or not"(3), "or not to"(3), "or not to be"(3), "not to"(4), "not to be"(4), "to be"(5)
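Shingle construction is a sliding window over the token list; this sketch reproduces both examples above, tracking each shingle's position (a shingle shares the position of its first token):

```python
def shingle_filter(tokens, min_size=2, max_size=2, output_unigrams=True, sep=" "):
    """Emit (shingle, position) pairs; shingles take their first token's position."""
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append((tok, i + 1))
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append((sep.join(tokens[i:i + size]), i + 1))
    return out

print(shingle_filter(["To", "be", "or", "what"]))
# [('To', 1), ('To be', 1), ('be', 2), ('be or', 2), ('or', 3), ('or what', 3), ('what', 4)]
```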
Snowball Porter Stemmer Filter
This filter factory instantiates a language-specific stemmer generated by Snowball. Snowball is a software package that generates pattern-based word stemmers. This type of stemmer is not as accurate as a table-based stemmer, but is faster and less complex. Table-driven stemmers are labor intensive to create and maintain and so are typically commercial products.
Solr contains Snowball stemmers for Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish. For more information on Snowball, visit http://snowball.tartarus.org/.
StopFilterFactory
, CommonGramsFilterFactory
, and CommonGramsQueryFilterFactory
can optionally read stopwords in Snowball format (specify format="snowball"
in the configuration of those FilterFactories).
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
-
Optional
Default:
English
The name of a language, used to select the appropriate Porter stemmer to use. Case is significant. This string is used to select a package name in the
org.tartarus.snowball.ext
class hierarchy. protected
-
Optional
Default: none
Path to a text file containing a list of protected words, one per line. Protected words will not be stemmed. Blank lines and lines that begin with
#
are ignored. This may be an absolute path, or a simple file name in the Solr conf
directory.
Example:
Default behavior:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="snowballPorter"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory"/>
</analyzer>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flip", "flip"
Example:
French stemmer, English words:
<analyzer>
<tokenizer name="standard"/>
<filter name="snowballPorter" language="French"/>
</analyzer>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flipped", "flipping"
Example:
Spanish stemmer, Spanish words:
<analyzer>
<tokenizer name="standard"/>
<filter name="snowballPorter" language="Spanish"/>
</analyzer>
In: "cante canta"
Tokenizer to Filter: "cante", "canta"
Out: "cant", "cant"
Stop Filter
This filter discards, or stops analysis of, tokens that are on the given stop words list.
A standard stop words list is included in the Solr conf
directory, named stopwords.txt
, which is appropriate for typical English language text.
Factory class: solr.StopFilterFactory
Arguments:
words
-
Optional
Default: none
The path to a file that contains a list of stop words, one per line. Blank lines and lines that begin with
#
are ignored. This may be an absolute path, or a path relative to the Solr conf
directory. format
-
Optional
Default: none
If the stopwords list has been formatted for Snowball, you can specify
format="snowball"
so Solr can read the stopwords file. ignoreCase
-
Optional
Default:
false
Ignore case when testing for stop words. If
true
, the stop list should contain lowercase words.
Example:
Case-sensitive matching, capitalized words not stopped. Token positions skip stopped words.
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="stop" words="stopwords.txt"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "what"(4)
Example:
<analyzer>
<tokenizer name="standard"/>
<filter name="stop" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "what"(4)
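Both examples above can be reproduced with a short sketch that preserves the original token positions, which is how stopped words leave position gaps:

```python
def stop_filter(tokens, stopwords, ignore_case=False):
    """Remove stop words; surviving tokens keep their original 1-based positions."""
    for pos, tok in enumerate(tokens, start=1):
        key = tok.lower() if ignore_case else tok
        if key not in stopwords:
            yield tok, pos

stopwords = {"to", "be", "or"}  # lowercase entries, as if from stopwords.txt
tokens = ["To", "be", "or", "what"]
print(list(stop_filter(tokens, stopwords)))                    # [('To', 1), ('what', 4)]
print(list(stop_filter(tokens, stopwords, ignore_case=True)))  # [('what', 4)]
```

With case-sensitive matching, capitalized "To" survives; with ignoreCase="true" it is stopped as well.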
Suggest Stop Filter
Like Stop Filter, this filter discards, or stops analysis of, tokens that are on the given stop words list.
Suggest Stop Filter differs from Stop Filter in that it will not remove the last token unless it is followed by a token separator.
For example, a query "find the"
would preserve the 'the'
since it was not followed by a space, punctuation, etc., and mark it as a KEYWORD
so that following filters will not change or remove it.
By contrast, a query like "find the popsicle" would remove 'the' as a stopword, since it's followed by a space.
When using one of the analyzing suggesters, you would normally use the ordinary StopFilterFactory
in your index analyzer and then SuggestStopFilter in your query analyzer.
Factory class: solr.SuggestStopFilterFactory
Arguments:
words
-
Optional
Default:
StopAnalyzer#ENGLISH_STOP_WORDS_SET
The name of a stopwords file to parse.
format
-
Optional
Default:
wordset
Defines how the words file will be parsed. If
words
is not specified, then format
must not be specified. The valid values for the format
parameter are:-
wordset
: Supports one word per line (including any intra-word whitespace) and allows whole line comments beginning with the #
character. Blank lines are ignored. -
snowball
: Allows for multiple words specified on each line, and trailing comments may be specified using the vertical line (|
). Blank lines are ignored.
-
ignoreCase
-
Optional
Default:
false
If
true
, matching is case-insensitive.
Example:
With name
<analyzer type="query">
<tokenizer name="whitespace"/>
<filter name="lowercase"/>
<filter name="suggestStop" ignoreCase="true"
words="stopwords.txt" format="wordset"/>
</analyzer>
With class name (legacy)
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
words="stopwords.txt" format="wordset"/>
</analyzer>
In: "The The"
Tokenizer to Filter: "the"(1), "the"(2)
Out: "the"(2)
Synonym Filter
This filter does synonym mapping. Each token is looked up in the list of synonyms and if a match is found, then the synonym is emitted in place of the token. The position values of the new tokens are set such that they all occur at the same position as the original token.
Synonym Filter has been Deprecated
Synonym Filter has been deprecated in favor of Synonym Graph Filter, which is required for multi-term synonym support.
Factory class: solr.SynonymFilterFactory
For arguments and examples, see the Synonym Graph Filter below.
Synonym Graph Filter
This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a replacement for the Synonym Filter, which produces incorrect graphs for multi-token synonyms.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of one another like the Synonym Filter, because the indexer can’t directly consume a graph. To get fully correct positional queries when your synonym replacements are multiple tokens, you should instead apply synonyms using this filter at query time.
Although this filter produces correct token graphs, it cannot consume an input token graph correctly.
Factory class: solr.SynonymGraphFilterFactory
Arguments:
synonyms
-
Required
Default: none
The path to a file that contains a list of synonyms, one per line. In the (default)
solr
format - see the format
argument below for alternatives - blank lines and lines that begin with #
are ignored. This may be a comma-separated list of paths. See Resource Loading for more information. There are two ways to specify synonym mappings:
-
A comma-separated list of words. If the token matches any of the words, then all the words in the list are substituted, which will include the original token.
-
Two comma-separated lists of words with the symbol "=>" between them. If the token matches any word on the left, then the list on the right is substituted. The original token will not be included unless it is also in the list on the right.
-
ignoreCase
-
Optional
Default:
false
If
true
, synonyms will be matched case-insensitively. expand
-
Optional
Default:
true
If
true
, a synonym will be expanded to all equivalent synonyms. If false
, all equivalent synonyms will be reduced to the first in the list. format
-
Optional
Default:
solr
Controls how the synonyms will be parsed. The short names
solr
(for SolrSynonymParser
) and wordnet
(for WordnetSynonymParser
) are supported. You may alternatively supply the name of your own SynonymMap.Builder
subclass. tokenizerFactory
-
Optional
Default:
WhitespaceTokenizerFactory
The name of the tokenizer factory to use when parsing the synonyms file. Arguments with the name prefix
tokenizerFactory.*
will be supplied as init params to the specified tokenizer factory. Any arguments not consumed by the synonym filter factory, including those without the
tokenizerFactory.*
prefix, will also be supplied as init params to the tokenizer factory. If
tokenizerFactory
is specified, then analyzer
may not be, and vice versa. analyzer
-
Optional
Default:
WhitespaceTokenizerFactory
The name of the analyzer class to use when parsing the synonyms file. If
analyzer
is specified, then tokenizerFactory
may not be, and vice versa.
For the following examples, assume a synonyms file named mysynonyms.txt
:
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
Example:
With name
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="synonymGraph" synonyms="mysynonyms.txt"/>
<filter name="flattenGraph"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
<tokenizer name="standard"/>
<filter name="synonymGraph" synonyms="mysynonyms.txt"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
In: "teh small couch"
Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
Example:
In: "teh ginormous, humungous sofa"
Tokenizer to Filter: "teh"(1), "ginormous"(2), "humungous"(3), "sofa"(4)
Out: "the"(1), "large"(2), "large"(3), "couch"(4), "sofa"(4), "divan"(4)
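The two mapping styles in the synonyms file can be sketched with a small parser for the solr synonym format (illustrative; Lucene's SolrSynonymParser additionally handles escaping and analyzes multi-token entries):

```python
def parse_solr_synonyms(lines, expand=True):
    """Build a word -> replacement-list mapping from solr-format synonym rules."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:                       # explicit left => right mapping
            lhs, rhs = line.split("=>", 1)
            outputs = [w.strip() for w in rhs.split(",")]
            for word in lhs.split(","):
                mapping[word.strip()] = outputs
        else:                                  # equivalent synonyms
            words = [w.strip() for w in line.split(",")]
            for word in words:
                mapping[word] = list(words) if expand else [words[0]]
    return mapping

rules = ["couch,sofa,divan", "teh => the", "small => tiny,teeny,weeny"]
syns = parse_solr_synonyms(rules)
print(syns["teh"])    # ['the']
print(syns["couch"])  # ['couch', 'sofa', 'divan']
```

Note how the comma-list rule keeps the original token in its own replacement list, while the `=>` rule replaces it outright, matching the two behaviors described above.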
Weighted Synonyms:
By combining the Delimited Boost Filter with the Synonym Graph Filter, you can achieve weighted synonyms at query time.
For more information, see:
https://sease.io/2020/03/introducing-weighted-synonyms-in-apache-lucene.html
For the following examples, assume a synonyms file named boostedSynonyms.txt
:
leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85
lion => panthera leo|0.9, simba|0.8, kimba|0.75
Example:
With name
<analyzer type="query">
<tokenizer name="standard"/>
<filter name="synonymGraph" synonyms="boostedSynonyms.txt"/>
<filter name="delimitedBoost"/>
</analyzer>
In: "lion"
Tokenizer to Filter: "lion"(1)
Out: "panthera"(1), "leo"(2)[0.9], "simba"(1)[0.8], "kimba"(1)[0.75]
Token Offset Payload Filter
This filter adds the numeric character offsets of the token as a payload value for that token.
Factory class: solr.TokenOffsetPayloadTokenFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="whitespace"/>
<filter name="tokenOffsetPayload"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TokenOffsetPayloadTokenFilterFactory"/>
</analyzer>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0,4], "bang"[5,9], "boom"[10,14]
Trim Filter
This filter trims leading and/or trailing whitespace from tokens. Most tokenizers break tokens at whitespace, so this filter is most often used for special situations.
Factory class: solr.TrimFilterFactory
Arguments: None
Example:
The PatternTokenizerFactory configuration used here splits the input on simple commas; it does not remove whitespace.
With name
<analyzer>
<tokenizer name="pattern" pattern=","/>
<filter name="trim"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
In: "one, two , three ,four "
Tokenizer to Filter: "one", " two ", " three ", "four "
Out: "one", "two", "three", "four"
Type As Payload Filter
This filter adds the token’s type, as an encoded byte sequence, as its payload.
Factory class: solr.TypeAsPayloadTokenFilterFactory
Arguments: None
Example:
With name
<analyzer>
<tokenizer name="whitespace"/>
<filter name="typeAsPayload"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TypeAsPayloadTokenFilterFactory"/>
</analyzer>
In: "Pay Bob’s I.O.U."
Tokenizer to Filter: "Pay", "Bob’s", "I.O.U."
Out: "Pay"[<ALPHANUM>], "Bob’s"[<APOSTROPHE>], "I.O.U."[<ACRONYM>]
Type As Synonym Filter
This filter adds the token’s type as a token at the same position as the token, optionally with a configurable prefix prepended.
Factory class: solr.TypeAsSynonymFilterFactory
Arguments:
prefix
-
Optional
Default: none
The prefix to prepend to the token’s type.
ignore
-
Optional
Default: none
A comma-separated list of types to ignore and not convert to synonyms.
synFlagsMask
-
Optional
Default: see description
A mask (provided as an integer) to control which flags are propagated to the synonyms. The default value is the integer -1, i.e., the mask 0xFFFFFFFF, which propagates all flags as-is.
Examples:
With the example below, each token’s type will be emitted verbatim at the same position:
With name
<analyzer>
<tokenizer name="standard"/>
<filter name="typeAsSynonym"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.TypeAsSynonymFilterFactory"/>
</analyzer>
With the example below, for a token "example.com" with type <URL>
, the token emitted at the same position will be "_type_<URL>":
With name
<analyzer>
<tokenizer name="uax29URLEmail"/>
<filter name="typeAsSynonym" prefix="_type_"/>
</analyzer>
With class name (legacy)
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
<filter class="solr.TypeAsSynonymFilterFactory" prefix="_type_"/>
</analyzer>
Type Token Filter
This filter denies or allows a specified list of token types, assuming the tokens have type metadata associated with them. For example, the UAX29 URL Email Tokenizer emits "<URL>" and "<EMAIL>" typed tokens, as well as other types. This filter would allow you to pull out only e-mail addresses from text as tokens, if you wish.
Factory class: solr.TypeTokenFilterFactory
Arguments:
types
-
Required
Default: none
Defines the path to a file of types to filter.
useWhitelist
-
Optional
Default:
false
If true, the file defined in types should be used as an include list. If false, or undefined, the file defined in types is used as a denylist.
Example:
With name
<analyzer>
<filter name="typeToken" types="stoptypes.txt" useWhitelist="true"/>
</analyzer>
With class name (legacy)
<analyzer>
<filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt" useWhitelist="true"/>
</analyzer>
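Building on the e-mail use case mentioned above, a hedged sketch (the file name emailtypes.txt is hypothetical) that keeps only <EMAIL>-typed tokens produced by the UAX29 URL Email Tokenizer:
<analyzer>
<tokenizer name="uax29URLEmail"/>
<filter name="typeToken" types="emailtypes.txt" useWhitelist="true"/>
</analyzer>
Here emailtypes.txt would contain the single line <EMAIL>.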
Word Delimiter Filter
This filter splits tokens at word delimiters.
Word Delimiter Filter has been Deprecated
Word Delimiter Filter has been deprecated in favor of Word Delimiter Graph Filter, which is required to produce a correct token graph so that e.g., phrase queries can work correctly.
Factory class: solr.WordDelimiterFilterFactory
For a full description, including arguments and examples, see the Word Delimiter Graph Filter below.
Word Delimiter Graph Filter
This filter splits tokens at word delimiters.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of one another like the Word Delimiter Filter, because the indexer can’t directly consume a graph. To get fully correct positional queries when tokens are split, you should instead use this filter at query time.
Note: although this filter produces correct token graphs, it cannot consume an input token graph correctly.
The rules for determining delimiters are as follows:
- A change in case within a word: "CamelCase" → "Camel", "Case". This can be disabled by setting splitOnCaseChange="0".
- A transition from alpha to numeric characters or vice versa: "Gonzo5000" → "Gonzo", "5000"; "4500XL" → "4500", "XL". This can be disabled by setting splitOnNumerics="0".
- Non-alphanumeric characters (discarded): "hot-spot" → "hot", "spot"
- A trailing "'s" is removed: "O’Reilly’s" → "O", "Reilly"
- Any leading or trailing delimiters are discarded: "--hot-spot--" → "hot", "spot"
Factory class: solr.WordDelimiterGraphFilterFactory
Arguments:
generateWordParts
-
Optional
Default:
1
If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" → "Camel", "Case", "hot", "spot"
generateNumberParts
-
Optional
Default:
1
If non-zero, splits numeric strings at delimiters: "1947-32" → "1947", "32"
splitOnCaseChange
-
Optional
Default:
1
If 0, words are not split on camel-case changes: "BugBlaster-XL" → "BugBlaster", "XL". The first example below illustrates the default (non-zero) splitting behavior.
splitOnNumerics
-
Optional
Default:
1
If 0, don’t split words on transitions from alpha to numeric: "FemBot3000" → "Fem", "Bot3000"
catenateWords
-
Optional
Default:
0
If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor’s" → "hotspotsensor"
catenateNumbers
-
Optional
Default:
0
If non-zero, maximal runs of number parts will be joined: "1947-32" → "194732"
catenateAll
-
Optional
Default:
0
If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" → "ZapMaster9000"
preserveOriginal
-
Optional
Default:
0
If non-zero, the original token is preserved: "Zap-Master-9000" → "Zap-Master-9000", "Zap", "Master", "9000"
protected
-
Optional
Default: none
The path to a file that contains a list of protected words that should be passed through without splitting.
stemEnglishPossessive
-
Optional
Default:
1
If 1, strips the possessive 's from each subword.
adjustOffsets
-
Optional
Default:
true
If true, the offsets of partial terms are adjusted.
types
-
Optional
Default: none
The path to a file that contains character => type mappings, which enable customization of this filter’s splitting behavior. Recognized character types:
LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM.
The default for any character without a customized mapping is computed from Unicode character properties. Blank lines and comment lines starting with '#' are ignored. An example file:
# Don't split numbers at '$', '.' or ','
$ => DIGIT
. => DIGIT
\u002C => DIGIT
# Don't split on ZWJ: https://en.wikipedia.org/wiki/Zero-width_joiner
\u200D => ALPHANUM
Example:
Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters.
With name
<analyzer type="index">
<tokenizer name="whitespace"/>
<filter name="wordDelimiterGraph"/>
<filter name="flattenGraph"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
<tokenizer name="whitespace"/>
<filter name="wordDelimiterGraph"/>
</analyzer>
With class name (legacy)
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory"/>
</analyzer>
In: "hot-spot RoboBlaster/9000 100XL"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100XL"
Out: "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"
Example:
Do not split on case changes, and do not generate number parts. Note that by not generating number parts, tokens containing only numeric parts are ultimately discarded.
<analyzer type="query">
<tokenizer name="whitespace"/>
<filter name="wordDelimiterGraph" generateNumberParts="0" splitOnCaseChange="0"/>
</analyzer>
In: "hot-spot RoboBlaster/9000 100-42"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100-42"
Out: "hot", "spot", "RoboBlaster", "9000"
Example:
Concatenate word parts and number parts, but not word and number parts that occur in the same token.
<analyzer type="query">
<tokenizer name="whitespace"/>
<filter name="wordDelimiterGraph" catenateWords="1" catenateNumbers="1"/>
</analyzer>
In: "hot-spot 100+42 XL40"
Tokenizer to Filter: "hot-spot"(1), "100+42"(2), "XL40"(3)
Out: "hot"(1), "spot"(2), "hotspot"(2), "100"(3), "42"(4), "10042"(4), "XL"(5), "40"(6)
Example:
Concatenate all. Word and/or number parts are joined together.
<analyzer type="query">
<tokenizer name="whitespace"/>
<filter name="wordDelimiterGraph" catenateAll="1"/>
</analyzer>
In: "XL-4000/ES"
Tokenizer to Filter: "XL-4000/ES"(1)
Out: "XL"(1), "4000"(2), "ES"(3), "XL4000ES"(3)
Example:
Using a protected words list that contains "AstroBlaster" and "XL-5000" (among others).
<analyzer type="query">
<tokenizer name="whitespace"/>
<filter name="wordDelimiterGraph" protected="protwords.txt"/>
</analyzer>
In: "FooBar AstroBlaster XL-5000 ==ES-34-"
Tokenizer to Filter: "FooBar", "AstroBlaster", "XL-5000", "==ES-34-"
Out: "FooBar", "FooBar", "AstroBlaster", "XL-5000", "ES", "34"
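As a further sketch, enabling preserveOriginal keeps the unsplit token alongside its parts, per the argument description above:
<analyzer type="query">
<tokenizer name="whitespace"/>
<filter name="wordDelimiterGraph" preserveOriginal="1"/>
</analyzer>
With input "Zap-Master-9000", the output would include the original "Zap-Master-9000" as well as "Zap", "Master", and "9000".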