Tokenizers

Tokenizers are responsible for breaking field data into lexical units, or tokens.

Each token is (usually) a sub-sequence of the characters in the text. An analyzer is aware of the field it is configured for, but a tokenizer is not. Tokenizers read from a character stream (a Reader) and produce a sequence of token objects (a TokenStream).

Characters such as whitespace or other delimiters in the input stream may be discarded. They may also be added to or replaced, such as mapping aliases or abbreviations to normalized forms.

A token contains various metadata in addition to its text value, such as the location at which the token occurs in the field. Because a tokenizer may produce tokens that diverge from the input text, you should not assume that the text of the token is the same text that occurs in the field, or that its length is the same as the original text. It’s also possible for more than one token to have the same position or refer to the same offset in the original text. Keep this in mind if you use token metadata for things like highlighting search results in the field text.

About Tokenizers

You configure the tokenizer for a text field type in the schema with a <tokenizer> element, as a child of <analyzer>:

With name

<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
  </analyzer>
</fieldType>

With class name (legacy)

<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The name/class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the org.apache.lucene.analysis.TokenizerFactory. A TokenizerFactory’s create() method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a Reader object that provides the content of the text field.

Arguments may be passed to tokenizer factories by setting attributes on the <tokenizer> element.

With name

<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer name="pattern" pattern="; "/>
  </analyzer>
</fieldType>

With class name (legacy)

<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
  </analyzer>
</fieldType>

When to Use a CharFilter vs. a TokenFilter

There are several pairs of CharFilters and TokenFilters that have related (i.e., MappingCharFilter and ASCIIFoldingFilter) or nearly identical (i.e., PatternReplaceCharFilterFactory and PatternReplaceFilterFactory) functionality, and it may not always be obvious which is the best choice.

The decision about which to use depends largely on which Tokenizer you are using, and whether you need to preprocess the stream of characters.

For example, suppose you have a tokenizer such as StandardTokenizer and although you are pretty happy with how it works overall, you want to customize how some specific characters behave. You could modify the rules and re-build your own tokenizer with JFlex, but it might be easier to simply map some of the characters before tokenization with a CharFilter.

The following sections describe the tokenizer factory classes included in this release of Solr.

Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

  • Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.

  • The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

Note that words are split at hyphens.

The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.

Factory class: solr.StandardTokenizerFactory

Arguments:

maxTokenLength

Optional

Default: 255

Solr ignores tokens that exceed the number of characters specified by maxTokenLength.

Example:

With name

<analyzer>
  <tokenizer name="standard"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

Classic Tokenizer

The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and previous. It does not use the Unicode standard annex UAX#29 word boundary rules that the Standard Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

  • Periods (dots) that are not followed by whitespace are kept as part of the token.

  • Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

  • Recognizes Internet domain names and email addresses and preserves them as a single token.

Factory class: solr.ClassicTokenizerFactory

Arguments:

maxTokenLength

Optional

Default: 255

Solr ignores tokens that exceed the number of characters specified by maxTokenLength.

Example:

With name

<analyzer>
  <tokenizer name="classic"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Out: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

Keyword Tokenizer

This tokenizer treats the entire text field as a single token.

Factory class: solr.KeywordTokenizerFactory

Arguments:

maxTokenLen

Optional

Default: 256

Maximum token length the tokenizer will emit.

Example:

With name

<analyzer>
  <tokenizer name="keyword"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Out: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Letter Tokenizer

This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.

Factory class: solr.LetterTokenizerFactory

Arguments:

maxTokenLen

Optional

Default: 255

Maximum token length the tokenizer will emit.

Example:

With name

<analyzer>
  <tokenizer name="letter"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>

In: "I can’t."

Out: "I", "can", "t"

Lower Case Tokenizer

Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded.

Factory class: solr.LowerCaseTokenizerFactory

Arguments:

maxTokenLen

Optional

Default: 255

Maximum token length the tokenizer will emit.

Example:

With name

<analyzer>
  <tokenizer name="lowercase"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>

In: "I just *LOVE* my iPhone!"

Out: "i", "just", "love", "my", "iphone"

N-Gram Tokenizer

Reads the field text and generates n-gram tokens of sizes in the given range.

Factory class: solr.NGramTokenizerFactory

Arguments:

minGramSize

Optional

Default: 1

The minimum n-gram size, must be > 0.

maxGramSize

Optional

Default: 2

The maximum n-gram size, must be >= minGramSize.

Example:

Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the encoding.

With name

<analyzer>
  <tokenizer name="nGram"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>

In: "hey man"

Out: "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"

Example:

With an n-gram size range of 4 to 5:

With name

<analyzer>
  <tokenizer name="nGram" minGramSize="4" maxGramSize="5"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>

In: "bicycle"

Out: "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"

Edge N-Gram Tokenizer

Reads the field text and generates edge n-gram tokens of sizes in the given range.

Factory class: solr.EdgeNGramTokenizerFactory

Arguments:

minGramSize

Optional

Default: 1

The minimum n-gram size, must be > 0.

maxGramSize

Optional

Default: 1

The maximum n-gram size, must be >= minGramSize.

Example:

Default behavior (min and max default to 1):

With name

<analyzer>
  <tokenizer name="edgeNGram"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>

In: "babaloo"

Out: "b"

Example:

Edge n-gram range of 2 to 5

With name

<analyzer>
  <tokenizer name="edgeNGram" minGramSize="2" maxGramSize="5"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>

In: "babaloo"

Out:"ba", "bab", "baba", "babal"

ICU Tokenizer

This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.

You can customize this tokenizer’s behavior by specifying per-script rule files. To add per-script rules, add a rulefiles argument, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi.

The default configuration for solr.ICUTokenizerFactory provides UAX#29 word break rules tokenization (like solr.StandardTokenizer), but also includes custom tailorings for Hebrew (specializing handling of double and single quotation marks), for syllable tokenization for Khmer, Lao, and Myanmar, and dictionary-based word segmentation for CJK characters.

Factory class: solr.ICUTokenizerFactory

Arguments:

rulefile

Optional

Default: none

A comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.

cjkAsWords

Optional

Default: true

If true, CJK text would undergo dictionary-based segmentation, and all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC. Otherwise, text will be segmented according to UAX#29 defaults.

myanmarAsWords

Optional

Default: true

If true, Myanmar text would undergo dictionary-based segmentation, otherwise it will be tokenized as syllables.

Example:

With name

<analyzer>
  <!-- no customization -->
  <tokenizer name="icu"/>
</analyzer>

With class name (legacy)

<analyzer>
  <!-- no customization -->
  <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>

With name

<analyzer>
  <tokenizer name="icu"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>

To use this tokenizer, you must add additional .jars to Solr’s classpath (as described in the section Installing Plugins). See the solr/modules/analysis-extras/README.md for information on which jars you need to add.

Path Hierarchy Tokenizer

This tokenizer creates synonyms from file path hierarchies.

Factory class: solr.PathHierarchyTokenizerFactory

Arguments:

delimiter

Required

Default: none

You can specify the file path delimiter and replace it with a delimiter you provide. This can be useful for working with backslash delimiters.

replace

Required

Default: none

Specifies the delimiter character Solr uses in the tokenized output.

reverse

Optional

Default: false

If true, switch the tokenizer behavior to build the path hierarchy in "reversed" order. This is typically useful for tokenizing the URLs.

skip

Optional

Default: 0

Number of leftmost (or rightmost, if reverse=true) path elements to drop from each emitted token.

Example:

Default behavior

With name

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer name="pathHierarchy" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>

With class name (legacy)

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>

In: "c:\usr\local\apache"

Out: "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"

Example:

Reverse order

With name

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer name="pathHierarchy" delimiter="." replace="." reverse="true"/>
  </analyzer>
</fieldType>

With class name (legacy)

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="." replace="." reverse="true"/>
  </analyzer>
</fieldType>

In: "www.site.co.uk"

Out: "www.site.co.uk", "site.co.uk", "co.uk", "uk"

Regular Expression Pattern Tokenizer

This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.

See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.

Factory class: solr.PatternTokenizerFactory

Arguments:

pattern

Required

Default: none

The regular expression, as defined by in java.util.regex.Pattern.

group

Optional

Default: -1

Specifies which regex group to extract as the token(s). The value -1 means the regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.

Example:

A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.

With name

<analyzer>
  <tokenizer name="pattern" pattern="\s*,\s*"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>

In: "fee,fie, foe , fum, foo"

Out: "fee", "fie", "foe", "fum", "foo"

Example:

Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.

With name

<analyzer>
  <tokenizer name="pattern" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>

In: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

Out: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"

Example:

Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional semicolon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parenthesis from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.

With name

<analyzer>
  <tokenizer name="pattern" pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/>
</analyzer>

In: "SKU: 1234, Part Number 5678, Part: 126-987"

Out: "1234", "5678", "126-987"

Simplified Regular Expression Pattern Tokenizer

This tokenizer is similar to the PatternTokenizerFactory described above, but uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizerFactory, but the tokenization is quite a bit faster.

Factory class: solr.SimplePatternTokenizerFactory

Arguments:

pattern

Required

Default: none

The regular expression, as defined in the RegExp javadocs, identifying the characters to include in tokens. The matching is greedy such that the longest token matching at a given point is created. Empty tokens are never created.

determinizeWorkLimit

Optional

Default: 10000

The limit on total state count for the determined automaton computed from the regexp.

Example:

To match tokens delimited by simple whitespace characters:

With name

<analyzer>
  <tokenizer name="simplePattern" pattern="[^ \t\r\n]+"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
</analyzer>

Simplified Regular Expression Pattern Splitting Tokenizer

This tokenizer is similar to the SimplePatternTokenizerFactory described above, but uses Lucene RegExp pattern matching to identify sequences of characters that should be used to split tokens. The syntax is more limited than PatternTokenizerFactory, but the tokenization is quite a bit faster.

Factory class: solr.SimplePatternSplitTokenizerFactory

Arguments:

pattern

Required

Default: none

The regular expression, as defined by in the RegExp javadocs, identifying the characters that should split tokens. The matching is greedy such that the longest token separator matching at a given point is matched. Empty tokens are never created.

determinizeWorkLimit

Optional

Default: 10000

The limit on total state count for the determined automaton computed from the regexp.

Example:

To match tokens delimited by simple whitespace characters:

With name

<analyzer>
  <tokenizer name="simplePatternSplit" pattern="[ \t\r\n]+"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>

UAX29 URL Email Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

  • Periods (dots) that are not followed by whitespace are kept as part of the token.

  • Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

  • Recognizes and preserves as single tokens the following:

    • Internet domain names containing top-level domains validated against the white list in the IANA Root Zone Database when the tokenizer was generated

    • email addresses

    • file://, http(s)://, and ftp:// URLs

    • IPv4 and IPv6 addresses

The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.

Factory class: solr.UAX29URLEmailTokenizerFactory

Arguments:

maxTokenLength

Optional

Default: 255

Solr ignores tokens that exceed the number of characters specified by maxTokenLength.

Example:

With name

<analyzer>
  <tokenizer name="uax29URLEmail"/>
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>

Out: "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"

White Space Tokenizer

Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation will be included in the tokens.

Factory class: solr.WhitespaceTokenizerFactory

Arguments:

rule

Optional

Default: java

Specifies how to define whitespace for the purpose of tokenization. Valid values:

maxTokenLen

Optional

Default: 255

Maximum token length the tokenizer will emit.

Example:

With name

<analyzer>
  <tokenizer name="whitespace" rule="java" />
</analyzer>

With class name (legacy)

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
</analyzer>

In: "To be, or what?"

Out: "To", "be,", "or", "what?"

OpenNLP Tokenizer and OpenNLP Filters

See OpenNLP Integration for information about using the OpenNLP Tokenizer, along with information about available OpenNLP token filters.