Suggester
The SuggestComponent in Solr provides users with automatic suggestions for query terms.
You can use this to implement a powerful auto-suggest feature in your search application.
Although it is possible to use the Spell Checking functionality to power autosuggest behavior, Solr has a dedicated SuggestComponent designed for this functionality.
This approach utilizes Lucene’s Suggester implementation and supports all of the lookup implementations available in Lucene.
The main features of this Suggester are:
Lookup implementation pluggability
Term dictionary pluggability, giving you the flexibility to choose the dictionary implementation
Distributed support
The solrconfig.xml
found in Solr’s “techproducts
” example has a Suggester implementation configured already. For more on search components, see the section RequestHandlers and SearchComponents in SolrConfig.
The “techproducts
” example solrconfig.xml
has a suggest
search component and a /suggest
request handler already configured. You can use that as the basis for your configuration, or create it from scratch, as detailed below.
Adding the Suggest Search Component
The first step is to add a search component to solrconfig.xml
and tell it to use the SuggestComponent. Here is some sample code that could be used.
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">cat</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">false</str>
</lst>
</searchComponent>
Suggester Search Component Parameters
The Suggester search component takes several configuration parameters.
The choice of the lookup implementation (lookupImpl
, how terms are found in the suggestion dictionary) and the dictionary implementation (dictionaryImpl
, how terms are stored in the suggestion dictionary) will dictate some of the parameters required.
Below are the main parameters that can be used no matter what lookup or dictionary implementation is used. In the following sections additional parameters are provided for each implementation.
searchComponent name
- Arbitrary name for the search component.
name
- A symbolic name for this suggester. You can refer to this name in the URL parameters and in the SearchHandler configuration. It is possible to have multiples of these in one
solrconfig.xml
file. lookupImpl
- Lookup implementation. There are several possible implementations, described below in the section Lookup Implementations. If not set, the default lookup is
JaspellLookupFactory
. dictionaryImpl
The dictionary implementation to use. There are several possible implementations, described below in the section Dictionary Implementations.
If not set, the default dictionary implementation is
HighFrequencyDictionaryFactory
. However, if asourceLocation
is used, the dictionary implementation will beFileDictionaryFactory
.field
A field from the index to use as the basis of suggestion terms. If
sourceLocation
is empty (meaning any dictionary implementation other thanFileDictionaryFactory
), then terms from this field in the index will be used.To be used as the basis for a suggestion, the field must be stored. You may want to use copyField rules to create a special 'suggest' field comprised of terms from other fields in documents. In any event, you very likely want a minimal amount of analysis on the field, so an additional option is to create a field type in your schema that only uses basic tokenizers or filters. One option for such a field type is shown here:
<fieldType class="solr.TextField" name="textSuggest" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
However, this minimal analysis is not required if you want more analysis to occur on terms. If using the
AnalyzingLookupFactory
as yourlookupImpl
, however, you have the option of defining the field type rules to use for index and query time analysis.sourceLocation
- The path to the dictionary file if using the
FileDictionaryFactory
. If this value is empty then the main index will be used as a source of terms and weights. storeDir
- The location to store the dictionary file.
buildOnCommit
andbuildOnOptimize
If
true
, the lookup data structure will be rebuilt after soft-commit. Iffalse
, the default, then the lookup data will be built only when requested by URL parametersuggest.build=true
. UsebuildOnCommit
to rebuild the dictionary with every soft-commit, orbuildOnOptimize
to build the dictionary only when the index is optimized.Some lookup implementations may take a long time to build, especially with large indexes. In such cases, using
buildOnCommit
orbuildOnOptimize
, particularly with a high frequency of softCommits is not recommended; it’s recommended instead to build the suggester at a lower frequency by manually issuing requests withsuggest.build=true
.buildOnStartup
If
true,
then the lookup data structure will be built when Solr starts or when the core is reloaded. If this parameter is not specified, the suggester will check if the lookup data structure is present on disk and build it if not found.Enabling this to
true
could lead to the core talking longer to load (or reload) as the suggester data structure needs to be built, which can sometimes take a long time. It’s usually preferred to have this setting set tofalse
, the default, and build suggesters manually issuing requests withsuggest.build=true
.
Lookup Implementations
The lookupImpl
parameter defines the algorithms used to look up terms in the suggest index. There are several possible implementations to choose from, and some require additional parameters to be configured.
AnalyzingLookupFactory
A lookup that first analyzes the incoming text and adds the analyzed form to a weighted FST, and then does the same thing at lookup time.
This implementation uses the following additional properties:
suggestAnalyzerFieldType
- The field type to use for the query-time and build-time term suggestion analysis.
exactMatchFirst
- If
true
, the default, exact suggestions are returned first, even if they are prefixes or other strings in the FST have larger weights. preserveSep
- If
true
, the default, then a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g., baseball is different from base ball). preservePositionIncrements
- If
true
, the suggester will preserve position increments. This means that token filters which leave gaps (for example, when StopFilter matches a stopword) the position would be respected when building the suggester. The default isfalse
.
FuzzyLookupFactory
This is a suggester which is an extension of the AnalyzingSuggester but is fuzzy in nature. The similarity is measured by the Levenshtein algorithm.
This implementation uses the following additional properties:
exactMatchFirst
- If
true
, the default, exact suggestions are returned first, even if they are prefixes or other strings in the FST have larger weights. preserveSep
- If
true
, the default, then a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g., baseball is different from base ball). maxSurfaceFormsPerAnalyzedForm
- The maximum number of surface forms to keep for a single analyzed form. When there are too many surface forms we discard the lowest weighted ones.
maxGraphExpansions
- When building the FST ("index-time"), we add each path through the tokenstream graph as an individual entry. This places an upper-bound on how many expansions will be added for a single suggestion. The default is
-1
which means there is no limit. preservePositionIncrements
- If
true
, the suggester will preserve position increments. This means that token filters which leave gaps (for example, when StopFilter matches a stopword) the position would be respected when building the suggester. The default isfalse
. maxEdits
- The maximum number of string edits allowed. The system’s hard limit is
2
. The default is1
. transpositions
- If
true
, the default, transpositions should be treated as a primitive edit operation. nonFuzzyPrefix
- The length of the common non fuzzy prefix match which must match a suggestion. The default is
1
. minFuzzyLength
- The minimum length of query before which any string edits will be allowed. The default is
3
. unicodeAware
- If
true
, themaxEdits
,minFuzzyLength
,transpositions
andnonFuzzyPrefix
parameters will be measured in unicode code points (actual letters) instead of bytes. The default isfalse
.
AnalyzingInfixLookupFactory
Analyzes the input text and then suggests matches based on prefix matches to any tokens in the indexed text. This uses a Lucene index for its dictionary.
This implementation uses the following additional properties.
indexPath
- When using
AnalyzingInfixSuggester
you can provide your own path where the index will get built. The default isanalyzingInfixSuggesterIndexDir
and will be created in your collection’sdata/
directory. minPrefixChars
- Minimum number of leading characters before PrefixQuery is used (default is
4
). Prefixes shorter than this are indexed as character ngrams (increasing index size but making lookups faster). allTermsRequired
- Boolean option for multiple terms. The default is
true
, all terms will be required. highlight
- Highlight suggest terms. Default is
true
.
This implementation supports Context Filtering.
BlendedInfixLookupFactory
An extension of the AnalyzingInfixSuggester
which provides additional functionality to weight prefix matches across the matched documents. It scores higher if a hit is closer to the start of the suggestion.
This implementation uses the following additional properties:
blenderType
- Used to calculate weight coefficient using the position of the first matching word. Available options are:
position_linear
weightFieldValue * (1 - 0.10*position)
: Matches to the start will be given a higher score. This is the default.position_reciprocal
weightFieldValue / (1 + position)
: Matches to the start will be given a higher score. The score of matches positioned far from the start of the suggestion decays faster than linear.position_exponential_reciprocal
weightFieldValue / pow(1 + position,exponent)
: Matches to the start will be given a higher score. The score of matches positioned far from the start of the suggestion decays faster than reciprocal.exponent
- An optional configuration variable for
position_exponential_reciprocal
to control how fast the score will decrease. Default2.0
.
numFactor
- The factor to multiply the number of searched elements from which results will be pruned. Default is
10
. indexPath
- When using
BlendedInfixSuggester
you can provide your own path where the index will get built. The default directory name isblendedInfixSuggesterIndexDir
and will be created in your collection’s data directory. minPrefixChars
- Minimum number of leading characters before PrefixQuery is used (the default is
4
). Prefixes shorter than this are indexed as character ngrams, which increases index size but makes lookups faster.
This implementation supports Context Filtering.
FreeTextLookupFactory
It looks at the last tokens plus the prefix of whatever final token the user is typing, if present, to predict the most likely next token. The number of previous tokens that need to be considered can also be specified. This suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.
This implementation uses the following additional properties:
suggestFreeTextAnalyzerFieldType
- The analyzer used at "query-time" and "build-time" to analyze suggestions. This parameter is required.
ngrams
- The max number of tokens out of which singles will be made the dictionary. The default value is
2
. Increasing this would mean you want more than the previous 2 tokens to be taken into consideration when making the suggestions.
FSTLookupFactory
An automaton-based lookup. This implementation is slower to build, but provides the lowest memory cost. We recommend using this implementation unless you need more sophisticated matching results, in which case you should use the Jaspell implementation.
This implementation uses the following additional properties:
exactMatchFirst
- If
true
, the default, exact suggestions are returned first, even if they are prefixes or other strings in the FST have larger weights. weightBuckets
- The number of separate buckets for weights which the suggester will use while building its dictionary.
TSTLookupFactory
A simple compact ternary trie based lookup.
WFSTLookupFactory
A weighted automaton representation which is an alternative to FSTLookup
for more fine-grained ranking. WFSTLookup
does not use buckets, but instead a shortest path algorithm.
Note that it expects weights to be whole numbers. If weight is missing it’s assumed to be 1.0
. Weights affect the sorting of matching suggestions when spellcheck.onlyMorePopular=true
is selected: weights are treated as "popularity" score, with higher weights preferred over suggestions with lower weights.
JaspellLookupFactory
A more complex lookup based on a ternary trie from the JaSpell project. Use this implementation if you need more sophisticated matching results.
Dictionary Implementations
The dictionary implementations define how terms are stored. There are several options, and multiple dictionaries can be used in a single request if necessary.
DocumentDictionaryFactory
A dictionary with terms, weights, and an optional payload taken from the index.
This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:
weightField
- A field that is stored or a numeric DocValue field. This parameter is optional.
payloadField
- The
payloadField
should be a field that is stored. This parameter is optional. contextField
- Field to be used for context filtering. Note that only some lookup implementations support filtering.
DocumentExpressionDictionaryFactory
This dictionary implementation is the same as the DocumentDictionaryFactory
but allows users to specify an arbitrary expression into the weightExpression
tag.
This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:
payloadField
- The
payloadField
should be a field that is stored. This parameter is optional. weightExpression
- An arbitrary expression used for scoring the suggestions. The fields used must be numeric fields. This parameter is required.
contextField
- Field to be used for context filtering. Note that only some lookup implementations support filtering.
HighFrequencyDictionaryFactory
This dictionary implementation allows adding a threshold to prune out less frequent terms in cases where very common terms may overwhelm other terms.
This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally and for the lookup implementation:
threshold
- A value between zero and one representing the minimum fraction of the total documents where a term should appear in order to be added to the lookup dictionary.
FileDictionaryFactory
This dictionary implementation allows using an external file that contains suggest entries. Weights and payloads can also be used.
If using a dictionary file, it should be a plain text file in UTF-8 encoding. You can use both single terms and phrases in the dictionary file. If adding weights or payloads, those should be separated from terms using the delimiter defined with the fieldDelimiter
property (the default is '\t', the tab representation). If using payloads, the first line in the file must specify a payload.
This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally and for the lookup implementation:
fieldDelimiter
Specifies the delimiter to be used separating the entries, weights and payloads. The default is tab (
\t
).
Multiple Dictionaries
It is possible to include multiple dictionaryImpl
definitions in a single SuggestComponent definition.
To do this, simply define separate suggesters, as in this example:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">cat</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">string</str>
</lst>
<lst name="suggester">
<str name="name">altSuggester</str>
<str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="field">product_name</str>
<str name="weightExpression">((price * 2) + ln(popularity))</str>
<str name="sortField">weight</str>
<str name="sortField">price</str>
<str name="storeDir">suggest_fuzzy_doc_expr_dict</str>
<str name="suggestAnalyzerFieldType">text_en</str>
</lst>
</searchComponent>
When using these Suggesters in a query, you would define multiple suggest.dictionary
parameters in the request, referring to the names given for each Suggester in the search component definition. The response will include the terms in sections for each Suggester. See the Example Usages section below for an example request and response.
Adding the Suggest Request Handler
After adding the search component, a request handler must be added to solrconfig.xml
. This request handler works the same as any other request handler, and allows you to configure default parameters for serving suggestion requests. The request handler definition must incorporate the "suggest" search component defined previously.
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
Suggest Request Handler Parameters
The following parameters allow you to set defaults for the Suggest request handler:
suggest=true
- This parameter should always be
true
, because we always want to run the Suggester for queries submitted to this handler. suggest.dictionary
- The name of the dictionary component configured in the search component. This is a mandatory parameter. It can be set in the request handler, or sent as a parameter at query time.
suggest.q
- The query to use for suggestion lookups.
suggest.count
- Specifies the number of suggestions for Solr to return.
suggest.cfq
- A Context Filter Query used to filter suggestions based on the context field, if supported by the suggester.
suggest.build
- If
true
, it will build the suggester index. This is likely useful only for initial requests; you would probably not want to build the dictionary on every request, particularly in a production system. If you would like to keep your dictionary up to date, you should use thebuildOnCommit
orbuildOnOptimize
parameter for the search component. suggest.reload
- If
true
, it will reload the suggester index. suggest.buildAll
- If
true
, it will build all suggester indexes. suggest.reloadAll
- If
true
, it will reload all suggester indexes.
These properties can also be overridden at query time, or not set in the request handler at all and always sent at query time.
Context Filtering
Context filtering ( |
Example Usages
Get Suggestions with Weights
This is a basic suggestion using a single dictionary and a single Solr core.
Example query:
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=elec
In this example, we’ve simply requested the string 'elec' with the suggest.q
parameter and requested that the suggestion dictionary be built with suggest.build
(note, however, that you would likely not want to build the index on every query - instead you should use buildOnCommit
or buildOnOptimize
if you have regularly changing documents).
Example response:
{
"responseHeader": {
"status": 0,
"QTime": 35
},
"command": "build",
"suggest": {
"mySuggester": {
"elec": {
"numFound": 3,
"suggestions": [
{
"term": "electronics and computer1",
"weight": 2199,
"payload": ""
},
{
"term": "electronics",
"weight": 649,
"payload": ""
},
{
"term": "electronics and stuff2",
"weight": 279,
"payload": ""
}
]
}
}
}
}
Using Multiple Dictionaries
If you have defined multiple dictionaries, you can use them in queries.
Example query:
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.dictionary=altSuggester&suggest.q=elec
In this example we have sent the string 'elec' as the suggest.q
parameter and named two suggest.dictionary
definitions to be used.
Example response:
{
"responseHeader": {
"status": 0,
"QTime": 3
},
"suggest": {
"mySuggester": {
"elec": {
"numFound": 1,
"suggestions": [
{
"term": "electronics and computer1",
"weight": 100,
"payload": ""
}
]
}
},
"altSuggester": {
"elec": {
"numFound": 1,
"suggestions": [
{
"term": "electronics and computer1",
"weight": 10,
"payload": ""
}
]
}
}
}
}
Context Filtering
Context filtering lets you filter suggestions by a separate context field, such as category, department or any other token. The AnalyzingInfixLookupFactory
and BlendedInfixLookupFactory
currently support this feature, when backed by DocumentDictionaryFactory
.
Add contextField
to your suggester configuration. This example will suggest names and allow to filter by category:
Example context filtering suggest query:
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=c&suggest.cfq=memory
The suggester will only bring back suggestions for products tagged with 'cat=memory'.