Spell Checking
The SpellCheck component is designed to provide inline query suggestions based on other, similar, terms.
The basis for these suggestions can be terms in a field in Solr, externally created text files, or fields in other Lucene indexes.
Configuring the SpellCheckComponent
Define Spell Check in solrconfig.xml
The first step is to specify the source of terms in solrconfig.xml
.
There are a number of approaches to spell checking in Solr, discussed below.
IndexBasedSpellChecker
The IndexBasedSpellChecker
uses a Solr index as the basis for a parallel index used for spell checking.
It requires defining a field as the basis for the index terms; a common practice is to copy terms from some fields (such as title
, body
, etc.) to another field created for spell checking.
Here is an example of configuring IndexBasedSpellChecker
in solrconfig.xml
:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.IndexBasedSpellChecker</str>
<!-- required parameters -->
<str name="field">content</str>
<!-- optional parameters for IndexBasedSpellChecker -->
<str name="sourceLocation">./folder/with/index/files</str>
<!-- optional parameters for all spellcheckers -->
<str name="spellcheckIndexDir">./spellcheckerDir</str>
<str name="name">default</str>
<str name="fieldType">content_ft</str>
<str name="queryAnalyzerFieldType">text_general</str>
<str name="distanceMeasure">org.apache.lucene.search.spell.LevenshteinDistance</str>
<str name="comparatorClass">score</str>
<float name="accuracy">0.5</float>
<float name="thresholdTokenFrequency">0.0</float>
<str name="buildOnCommit">true</str>
<str name="buildOnOptimize">false</str>
</lst>
</searchComponent>
The first element defines the searchComponent
to use the solr.SpellCheckComponent
.
The classname
is the specific implementation of the SpellCheckComponent, in this case solr.IndexBasedSpellChecker
.
Defining the classname
is optional; if not defined, it will default to IndexBasedSpellChecker
.
The spellcheckIndexDir
defines the location of the directory that holds the spellcheck index, while the field
defines the source field (defined in the Schema) for spellcheck terms.
When choosing a field for the spellcheck index, it’s best to avoid a heavily processed field to get more accurate results.
If the field has many word variations from processing synonyms and/or stemming, the dictionary will be created with those variations in addition to more valid spelling data.
By default, this spellchecker builds its dictionary from the Solr index.
This can be changed by specifying sourceLocation
- a folder with static Lucene index files to use instead of the Solr index.
The spellchecker can be assigned a descriptive label, name
, - which can be helpful if the search component defines
multiple spellcheckers. With that, a spellcheck query can identify a subset of spellcheckers that should be consulted
(see Spell Check Parameters for more details).
The query analyzer for the field
is used to tokenize the spellcheck query.
If there’s a need to override that behavior, configure a fieldType
and the spellchecker
will use the query analyzer for that field type instead.
queryAnalyzerFieldType
is a field type from Solr’s schema, and works similarly to the fieldType
parameter.
The key difference is that Solr uses field
or fieldType
when it tokenizes the spellcheck query
supplied via spellcheck.q
, and uses queryAnalyzerFieldType
when the query is instead provided via the q
parameter.
The field type specified by this parameter should do minimal transformations. It’s usually a best practice to avoid types that aggressively stem or NGram, for instance, since those types of analysis can throw off spell checking.
Common configuration parameters like distanceMeasure
, comparatorClass
, accuracy
, and thresholdTokenFrequency
provide control over the returned spellcheck suggestions.
If the distanceMeasure
is not specified, Solr will use the Levenshtein metric which is the default metric for other spellchecker implementations as well (except for DirectSolrSpellChecker
).
When comparatorClass
is configured as "score", the suggestions with lower distance (i.e., higher similarity) scores are considered more relevant.
The alternative value is "freq" - this prioritizes suggestions with higher document frequency.
The accuracy
setting defines the threshold for a valid suggestion, and the thresholdTokenFrequency
setting allows
skipping suggestions which have low document frequency in the index.
Finally, buildOnCommit
and buildOnOptimize
define whether to build the spellcheck index at every commit (that is, every time new documents are added to the index)
or at every optimize request.
Both are optional, and can be omitted if you would rather set their values to false
.
DirectSolrSpellChecker
The DirectSolrSpellChecker
uses terms from the Solr index without building a parallel index like the IndexBasedSpellChecker
.
This spellchecker has the benefit of not having to be built regularly, meaning that the terms are always up-to-date with terms in the index.
Here is how this might be configured in solrconfig.xml
:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.DirectSolrSpellChecker</str>
<!-- required parameters -->
<str name="field">name</str>
<!-- optional parameters for DirectSolrSpellChecker -->
<int name="maxEdits">2</int>
<int name="minPrefix">1</int>
<int name="maxInspections">5</int>
<int name="minQueryLength">4</int>
<int name="maxQueryLength">40</int>
<float name="maxQueryFrequency">0.01</float>
<!-- optional parameters for all spellcheckers -->
<str name="name">default</str>
<str name="fieldType">name</str>
<str name="queryAnalyzerFieldType">text_general</str>
<str name="distanceMeasure">internal</str>
<str name="comparatorClass">score</str>
<float name="accuracy">0.5</float>
<float name="thresholdTokenFrequency">0.0</float>
</lst>
</searchComponent>
When choosing a field
to query for this spellchecker, you want one which has relatively little analysis performed on it (particularly analysis such as stemming).
Note that you need to specify a field to use for the suggestions, so like the IndexBasedSpellChecker
, you may want to copy data from fields like title
, body
, etc., to a field dedicated to providing spelling suggestions.
Many of the parameters relate to how this spellchecker should query the index for term suggestions.
The distanceMeasure
defines the metric to use during the spellcheck query - the default value for this spellchecker is "internal",
which corresponds to the Damerau-Levenshtein metric.
Because this spellchecker is querying the main index, you may want to limit how often it queries the index to be sure to avoid any performance conflicts with user queries.
The accuracy
setting defines the threshold for a valid suggestion, while maxEdits
defines the number of changes to the term to allow.
Since most spelling mistakes are only 1 letter off, setting this to 1 will reduce the number of possible suggestions (the default, however, is 2); the value can only be 1 or 2.
minPrefix
defines the minimum number of characters the terms should share.
Setting this to 1 means that the spelling suggestions will all start with the same letter, for example.
The maxInspections
parameter defines the maximum number of possible matches to review before returning results; the default is 5.
minQueryLength
defines how many characters must be in the query before suggestions are provided; the default is 4.
maxQueryLength
enables the spellchecker to skip over very long query terms, which can avoid expensive operations or exceptions.
There is no limit to term length by default.
At first, spellchecker analyses incoming query words by looking them up in the index.
Only query words which are absent from the index, or too rare (equal to or below maxQueryFrequency
) are considered as misspelled and used for finding suggestions.
Words which are more frequent than maxQueryFrequency
bypass spellchecker unchanged.
After suggestions for every misspelled word are found they are filtered for enough frequency with thresholdTokenFrequency
as boundary value.
These parameters (maxQueryFrequency
and thresholdTokenFrequency
) can be a percentage represented as a decimal value below 1 (such as 0.01
for or 1%
) or an absolute value (such as 4
).
When |
FileBasedSpellChecker
The FileBasedSpellChecker
uses an external file as a spelling dictionary.
This can be useful if using Solr as a spelling server, or if spelling suggestions don’t need to be based on actual terms in the index.
In solrconfig.xml
, you would define the searchComponent as so:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.FileBasedSpellChecker</str>
<!-- required parameters -->
<str name="sourceLocation">spellings.txt</str>
<!-- optional parameters for FileBasedSpellChecker -->
<str name="fieldType">text_general</str>
<str name="characterEncoding">UTF-8</str>
<!-- optional parameters for all spellcheckers -->
<str name="spellcheckIndexDir">./spellcheckerDir</str>
<str name="name">file</str>
<str name="queryAnalyzerFieldType">text_general</str>
<str name="distanceMeasure">org.apache.lucene.search.spell.LevenshteinDistance</str>
<str name="comparatorClass">score</str>
<float name="accuracy">0.5</float>
<float name="thresholdTokenFrequency">0.0</float>
<bool name="buildOnCommit">false</bool>
<bool name="buildOnOptimize">false</bool>
</lst>
</searchComponent>
The configuration is very similar to the IndexBasedSpellChecker
, and the differences here are the use of the sourceLocation
to define the location of the file of terms, and the use of characterEncoding
to define the encoding of the terms file.
If the fieldType
parameter is specified and matches a type from the Solr schema, Solr will build the spellcheck index
by first tokenizing each line from the external file using the fieldType
index analyzer, and then adding each token to the index.
If not, Solr will treat each line from the external file as an individual token, and add them to the spellcheck index as is.
In the previous example, name is used to name this specific definition of the spellchecker.
Multiple definitions can co-exist in a single |
WordBreakSolrSpellChecker
WordBreakSolrSpellChecker
offers suggestions by combining adjacent query terms and/or breaking terms into multiple words.
It is a SpellCheckComponent
enhancement, leveraging Lucene’s WordBreakSpellChecker
.
It can detect spelling errors resulting from misplaced whitespace without the use of shingle-based dictionaries and provides collation support for word-break errors, including cases where the user has a mix of single-word spelling errors and word-break errors in the same query.
It also provides shard support.
Here is how it might be configured in solrconfig.xml
:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.WordBreakSolrSpellChecker</str>
<!-- required parameters -->
<str name="field">lowerfilt</str>
<!-- optional parameters for WordBreakSpellChecker -->
<str name="combineWords">true</str>
<str name="breakWords">true</str>
<str name="breakSuggestionTieBreaker">max_freq</str>
<int name="maxChanges">1</int>
<int name="maxCombinedLength">20</int>
<int name="minBreakLength">1</int>
<int name="maxEvaluations">1000</int>
<int name="minSuggestionFreq">1</int>
<!-- optional parameters for all spellcheckers -->
<str name="name">wordbreak</str>
<str name="fieldType">lowerfilt_ft</str>
<str name="queryAnalyzerFieldType">text_general</str>
</lst>
</searchComponent>
Some of the parameters should be familiar from the discussion of the other spellcheckers, such as name
, classname
, and field
.
New for this spellchecker is combineWords
, which defines whether words should be combined in a dictionary search (default is true);
and breakWords
, which defines if words should be broken during a dictionary search (default is true).
maxChanges
is an integer which defines how many times the spellchecker should check collation possibilities against the index.
maxCombinedLength
allows skipping over the suggestions which are too long.
Similarly, minBreakLength
instructs the spellchecker to not break the word into parts that are too short.
maxEvaluations
defines the maximum number of word combinations to evaluate - a higher value might improve
the result quality, while a lower value might improve performance.
minSuggestionFreq
sets the minimum frequency a term must have to be included as part of a suggestion.
Finally, the breakSuggestionTieBreaker
setting ("max_freq" or "sum_freq") instructs Solr to
sort the suggestions by the number of word breaks, and then by the maximum or by the sum of all the component term’s
frequencies, respectively.
The spellchecker can be configured together with a traditional checker (i.e., DirectSolrSpellChecker
).
The results are combined and collations can contain a mix of corrections from both spellcheckers.
Add It to a Request Handler
Queries will be sent to a request handler.
If every request should generate a suggestion, then you would add the following to the requestHandler
that you are using:
<str name="spellcheck">true</str>
One of the possible parameters is the spellcheck.dictionary
to use, and multiples can be defined.
With multiple dictionaries, all specified dictionaries are consulted and results are interleaved.
Collations are created with combinations from the different spellcheckers, with care taken that multiple overlapping corrections do not occur in the same collation.
Here is an example with multiple dictionaries:
<requestHandler name="spellCheckWithWordbreak" class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.dictionary">wordbreak</str>
<str name="spellcheck.count">20</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
Spell Check Parameters
The SpellCheck component accepts the parameters described below.
spellcheck
-
Optional
Default:
false
This parameter turns on SpellCheck suggestions for the request. If
true
, then spelling suggestions will be generated. This is required if spell checking is desired. spellcheck.q
orq
-
Optional
Default: none
This parameter specifies the query to spellcheck.
If
spellcheck.q
is defined, then it is used; otherwise the original input query is used. Thespellcheck.q
parameter is intended to be the original query, minus any extra markup like field names, boosts, and so on. If theq
parameter is specified, then theSpellingQueryConverter
class is used to parse it into tokens; otherwise the WhitespaceTokenizer is used.The choice of which one to use is up to the application. Essentially, if you have a spelling "ready" version in your application, then it is probably better to use
spellcheck.q
. Otherwise, if you just want Solr to do the job, use theq
parameter.
The SpellingQueryConverter class does not deal properly with non-ASCII characters.
In this case, you have either to use spellcheck.q , or implement your own QueryConverter.
|
spellcheck.build
-
Optional
Default:
false
If set to
true
, this parameter creates the dictionary to be used for spell-checking. In a typical search application, you will need to build the dictionary before using the spell check. However, it’s not always necessary to build a dictionary first. For example, you can configure the spellchecker to use a dictionary that already exists.The dictionary will take some time to build, so this parameter should not be sent with every request.
spellcheck.reload
-
Optional
Default:
false
If set to
true
, this parameter reloads the spellchecker. The results depend on the implementation ofSolrSpellChecker.reload()
. In a typical implementation, reloading the spellchecker means reloading the dictionary. spellcheck.count
-
Optional
Default: see description
This parameter specifies the maximum number of suggestions that the spellchecker should return for a term. If this parameter isn’t set, the value defaults to
1
. If the parameter is set but not assigned a number, the value defaults to5
. If the parameter is set to a positive integer, that number becomes the maximum number of suggestions returned by the spellchecker. spellcheck.onlyMorePopular
-
Optional
Default:
false
If
true
, Solr will return suggestions that result in more hits for the query than the existing query. Note that this will return more popular suggestions even when the given query term is present in the index and considered "correct". spellcheck.maxResultsForSuggest
-
Optional
Default: none
If, for example, this is set to
5
and the user’s query returns 5 or fewer results, the spellchecker will report "correctlySpelled=false" and also offer suggestions (and collations if requested). Setting this greater than zero is useful for creating "did-you-mean?" suggestions for queries that return a low number of hits. spellcheck.alternativeTermCount
-
Optional
Default: none
Defines the number of suggestions to return for each query term existing in the index and/or dictionary. Presumably, users will want fewer suggestions for words with docFrequency>0. Also, setting this value enables context-sensitive spell suggestions.
spellcheck.extendedResults
-
Optional
Default:
false
If
true
, this parameter causes to Solr to return additional information about spellcheck results, such as the frequency of each original term in the index (origFreq
) as well as the frequency of each suggestion in the index (frequency
). Note that this result format differs from the non-extended one as the returned suggestion for a word is actually an array of lists, where each list holds the suggested term and its frequency. spellcheck.collate
-
Optional
Default:
false
If
true
, this parameter directs Solr to take the best suggestion for each token (if one exists) and construct a new query from the suggestions.For example, if the input query was "jawa class lording" and the best suggestion for "jawa" was "java" and "lording" was "loading", then the resulting collation would be "java class loading".
The
spellcheck.collate
parameter only returns collations that are guaranteed to result in hits if re-queried, even when applying originalfq
parameters. This is especially helpful when there is more than one correction per query.This only returns a query to be used. It does not actually run the suggested query. spellcheck.maxCollations
-
Optional
Default:
1
The maximum number of collations to return. This parameter is ignored if
spellcheck.collate
is false. spellcheck.maxCollationTries
-
Optional
Default:
0
This parameter specifies the number of collation possibilities for Solr to try before giving up. Lower values ensure better performance. Higher values may be necessary to find a collation that can return results. The default value of
0
is equivalent to not checking collations. This parameter is ignored ifspellcheck.collate
is false. spellcheck.maxCollationEvaluations
-
Optional
Default:
10000
This parameter specifies the maximum number of word correction combinations to rank and evaluate prior to deciding which collation candidates to test against the index. This is a performance safety-net in case a user enters a query with many misspelled words.
spellcheck.collateExtendedResults
-
Optional
Default:
false
If
true
, this parameter returns an expanded response format detailing the collations Solr found. This is ignored ifspellcheck.collate
is false. spellcheck.collateMaxCollectDocs
-
Optional
Default:
0
This parameter specifies the maximum number of documents that should be collected when testing potential collations against the index. A value of
0
indicates that all documents should be collected, resulting in exact hit-counts. Otherwise, an estimation is provided as a performance optimization in cases where exact hit-counts are unnecessary – the higher the value specified, the more precise the estimation.When
spellcheck.collateExtendedResults
isfalse
, the optimization is always used as if1
had been specified. spellcheck.collateParam.*
Prefix-
Optional
Default: none
This parameter prefix can be used to specify any additional parameters that you wish to the Spellchecker to use when internally validating collation queries. For example, even if your regular search results allow for loose matching of one or more query terms via parameters like
q.op=OR
andmm=20%
you can specify override parameters such asspellcheck.collateParam.q.op=AND&spellcheck.collateParam.mm=100%
to require that only collations consisting of words that are all found in at least one document may be returned. spellcheck.dictionary
-
Optional
Default:
default
This parameter causes Solr to use the dictionary named in the parameter’s argument. This parameter can be used to invoke a specific spellchecker on a per-request basis.
spellcheck.accuracy
-
Optional
Default: see description
Specifies an accuracy value to be used by the spell checking implementation to decide whether a result is worthwhile or not. The value is a float between 0 and 1. Defaults to
Float.MIN_VALUE
. spellcheck.<DICT_NAME>.key
-
Optional
Default: none
Specifies a key/value pair for the implementation handling a given dictionary. The value that is passed through is just
key=value
(spellcheck.<DICT_NAME>.
is stripped off).For example, given a dictionary called
foo
,spellcheck.foo.myKey=myValue
would result inmyKey=myValue
being passed through to the implementation handling the dictionaryfoo
.
Spell Check Example
Using Solr’s bin/solr start -e techproducts
example, this query shows the results of a simple request that defines a query using the spellcheck.q
parameter, and forces the collations to require all input terms must match:
http://localhost:8983/solr/techproducts/spell?df=text&spellcheck.q=delll+ultra+sharp&spellcheck=true&spellcheck.collateParam.q.op=AND&wt=xml
Results:
<lst name="spellcheck">
<lst name="suggestions">
<lst name="delll">
<int name="numFound">1</int>
<int name="startOffset">0</int>
<int name="endOffset">5</int>
<int name="origFreq">0</int>
<arr name="suggestion">
<lst>
<str name="word">dell</str>
<int name="freq">1</int>
</lst>
</arr>
</lst>
<lst name="ultra sharp">
<int name="numFound">1</int>
<int name="startOffset">6</int>
<int name="endOffset">17</int>
<int name="origFreq">0</int>
<arr name="suggestion">
<lst>
<str name="word">ultrasharp</str>
<int name="freq">1</int>
</lst>
</arr>
</lst>
</lst>
<bool name="correctlySpelled">false</bool>
<lst name="collations">
<lst name="collation">
<str name="collationQuery">dell ultrasharp</str>
<int name="hits">1</int>
<lst name="misspellingsAndCorrections">
<str name="delll">dell</str>
<str name="ultra sharp">ultrasharp</str>
</lst>
</lst>
</lst>
</lst>
Distributed SpellCheck
The SpellCheckComponent
also supports spellchecking on distributed indexes.
If you are using the SpellCheckComponent on a request handler other than "/select", you must provide the following two parameters:
shards
-
Required
Default: none
Specifies the shards in your distributed indexing configuration. For more information about distributed indexing, see Solr Cluster Types.
shards.qt
-
Required
Default: none
Specifies the request handler Solr uses for requests to shards. This parameter is not required for the
/select
request handler.
For example:
http://localhost:8983/solr/techproducts/spell?spellcheck=true&spellcheck.build=true&spellcheck.q=toyata&shards.qt=/spell&shards=solr-shard1:8983/solr/techproducts,solr-shard2:8983/solr/techproducts
In case of a distributed request to the SpellCheckComponent, the shards are requested for at least five suggestions even if the spellcheck.count
parameter value is less than five.
Once the suggestions are collected, they are ranked by the configured distance measure (Levenshtein distance by default) and then by aggregate frequency.