The MoreLikeThis search component enables users to query for documents similar to a document in their result list.
It does this by using terms from the original document to find similar documents in the index.
There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (for example, when a user clicks a "similar documents" link).
The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results.
The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
How MoreLikeThis Works
MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields (see the mlt.fl parameter, below). For best results, the fields should have stored term vectors in schema.xml. For example:
<field name="cat" ... termVectors="true" />
If term vectors are not stored, MoreLikeThis will generate terms from stored fields. A uniqueKey must also be stored in order for MoreLikeThis to work properly.
The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters. Finally, a query is run with these terms and any other query parameters that have been defined (see the mlt.qf parameter, below), and a new document set is returned.
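The filtering phase described above can be sketched in a few lines of Python. This is a simplified illustration, not Lucene's implementation: the function name, its argument names, the default values, and the tf-idf style weight used for ranking are all assumptions made for the example.

```python
import math
from collections import Counter


def select_terms(source_tokens, doc_freq, num_docs, min_tf=2, min_df=5,
                 max_df=None, min_wl=0, max_wl=0, max_qt=25):
    """Toy version of the term-filtering phase; not Lucene's implementation."""
    scored = []
    for term, tf in Counter(source_tokens).items():
        df = doc_freq.get(term, 0)
        if tf < min_tf or df < min_df:
            continue  # too rare in the source document or in the index
        if max_df is not None and df > max_df:
            continue  # too common in the index to be distinctive
        if (min_wl and len(term) < min_wl) or (max_wl and len(term) > max_wl):
            continue  # outside the allowed word-length range
        # Assumed tf-idf style weight, used only to rank the candidates.
        scored.append((tf * math.log(1 + num_docs / (df + 1)), term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_qt]]


# Hypothetical source document and index statistics.
tokens = "solr search solr index search solr the the the".split()
doc_freq = {"solr": 10, "search": 20, "index": 5, "the": 1000}
print(select_terms(tokens, doc_freq, num_docs=1000, max_df=500))
```

In this toy data, "index" is dropped because it appears only once in the source document, and "the" is dropped because it occurs in more documents than the maximum document frequency allows. The surviving terms are what the generated query would be built from.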
Common Parameters for MoreLikeThis
The table below summarizes the MoreLikeThis parameters supported by Lucene/Solr. These parameters can be used with any of the three possible MoreLikeThis approaches.
- mlt.fl: Specifies the fields to use for similarity. If possible, these should have stored termVectors.
- mlt.mintf: Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the source document.
- mlt.mindf: Specifies the Minimum Document Frequency. Words that do not occur in at least this many documents will be ignored.
- mlt.maxdf: Specifies the Maximum Document Frequency. Words that occur in more than this many documents will be ignored.
- mlt.maxdfpct: Specifies the Maximum Document Frequency using a relative ratio to the number of documents in the index. The argument must be an integer between 0 and 100. For example, 75 means the word will be ignored if it occurs in more than 75 percent of the documents in the index.
- mlt.minwl: Sets the minimum word length below which words will be ignored.
- mlt.maxwl: Sets the maximum word length above which words will be ignored.
- mlt.maxqt: Sets the maximum number of query terms that will be included in any generated query.
- mlt.maxntp: Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support.
- mlt.boost: Specifies if the query will be boosted by the interesting term relevance. It can be either "true" or "false".
- mlt.qf: Query fields and their boosts using the same format as that used by the DisMax Query Parser. These fields must also be specified in mlt.fl.
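As a rough illustration of how the common parameters combine in a request, the sketch below assembles a query string using Solr's parameter names (mlt.fl, mlt.qf, and threshold parameters such as mlt.mintf and mlt.mindf). The helper name, the field names, and the document id are hypothetical, and the check that every mlt.qf field also appears in mlt.fl is a client-side convenience assumed for this example, not something Solr requires you to do before sending the request.

```python
from urllib.parse import urlencode


def mlt_query_string(q, fl, qf=None, **thresholds):
    """Build a query string carrying common MoreLikeThis parameters.

    fl is a list of field names, qf maps field name -> boost, and each
    keyword argument such as mintf=1 becomes an mlt.<name> parameter.
    """
    params = {"q": q, "mlt.fl": ",".join(fl)}
    if qf:
        # Fields given in mlt.qf must also be listed in mlt.fl.
        missing = sorted(set(qf) - set(fl))
        if missing:
            raise ValueError(f"mlt.qf fields not in mlt.fl: {missing}")
        params["mlt.qf"] = " ".join(f"{field}^{boost}" for field, boost in qf.items())
    for name, value in thresholds.items():
        params[f"mlt.{name}"] = value
    return urlencode(params)


# Hypothetical fields and document id.
print(mlt_query_string("id:SP2514N", ["manu", "cat"], qf={"manu": 2.0}, mintf=1, mindf=1))
```

The resulting string can be appended to whichever MoreLikeThis endpoint is in use, since these parameters apply to all three approaches.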
Parameters for the MoreLikeThisComponent
Using MoreLikeThis as a search component returns similar documents for each document in the response set. In addition to the common parameters, the following options are available:
- mlt: If set to true, activates the MoreLikeThis component and enables Solr to return MoreLikeThis results.
- mlt.count: Specifies the number of similar documents to be returned for each result. The default value is 5.
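A request that uses the search component might be assembled as in the sketch below. The base URL, the core name, the /select path, and the field names are assumptions for illustration; the component must already be available on the handler you call.

```python
from urllib.parse import urlencode


def component_request_url(base_url, q, fl, count=5):
    """Build a search URL that turns on MoreLikeThis for every result."""
    params = {
        "q": q,
        "mlt": "true",       # activate the MoreLikeThis component
        "mlt.fl": fl,        # fields used for similarity
        "mlt.count": count,  # similar documents returned per result
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core and fields.
print(component_request_url("http://localhost:8983/solr/techproducts", "apache", "manu,cat", count=3))
```

Because the analysis runs for each document in the response set, keeping mlt.count and the number of returned rows small limits the extra work per request.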
Parameters for the MoreLikeThisHandler
The table below summarizes parameters accessible through the MoreLikeThisHandler. It supports faceting, paging, and filtering using common query parameters, but does not work well with alternate query parsers.
- mlt.match.include: Specifies whether or not the response should include the matched document. If set to false, the response will look like a normal select response.
- mlt.match.offset: Specifies an offset into the main query search results to locate the document on which the MoreLikeThis query should operate. By default, the query operates on the first result for the q parameter.
- mlt.interestingTerms: Controls how the MoreLikeThis component presents the "interesting" terms (the top TF/IDF terms) for the query. Supports three settings. The setting list lists the terms. The setting none lists no terms. The setting details lists the terms along with the boost value used for each term. Unless mlt.boost=true, all terms will have boost=1.0.
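The handler-specific parameters can be combined as in the following sketch. It assumes the MoreLikeThisHandler has been registered under the path /mlt, which depends on your configuration; the base URL, core name, fields, and helper name are likewise hypothetical.

```python
from urllib.parse import urlencode

# The three settings described for the interesting-terms parameter.
INTERESTING_TERMS = {"list", "none", "details"}


def handler_request_url(base_url, q, fl, include_match=True, offset=0,
                        interesting_terms="none"):
    """Build a URL for a MoreLikeThisHandler assumed to live at /mlt."""
    if interesting_terms not in INTERESTING_TERMS:
        raise ValueError(f"unsupported setting: {interesting_terms!r}")
    params = {
        "q": q,
        "mlt.fl": fl,
        "mlt.match.include": str(include_match).lower(),
        "mlt.match.offset": offset,
        "mlt.interestingTerms": interesting_terms,
    }
    return f"{base_url}/mlt?{urlencode(params)}"


# Hypothetical local Solr core; ask for term details and omit the matched document.
print(handler_request_url("http://localhost:8983/solr/techproducts", "id:SP2514N",
                          "manu,cat", include_match=False, interesting_terms="details"))
```

Requesting the details setting is a convenient way to see which terms drove the similarity query while tuning the thresholds.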
MoreLikeThis Query Parser
The mlt query parser provides a mechanism to retrieve documents similar to a given document, like the handler. More information on the usage of the mlt query parser can be found in the section Other Parsers.
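For orientation, a query using the parser might be built as sketched below. The local-parameter form shown here, with the parser name and a qf field followed by a document id, is an assumption to verify against the Other Parsers section; the base URL, core name, field, and id are hypothetical.

```python
from urllib.parse import urlencode


def mlt_parser_url(base_url, doc_id, qf):
    """Build a /select URL whose q parameter invokes the mlt query parser."""
    # Assumed local-params form: {!mlt qf=<field>}<document id>
    q = "{!mlt qf=" + qf + "}" + doc_id
    return f"{base_url}/select?{urlencode({'q': q})}"


# Hypothetical local Solr core, field, and document id.
print(mlt_parser_url("http://localhost:8983/solr/techproducts", "SP2514N", "name"))
```

Since the braces and spaces in the q value are not URL-safe, letting urlencode escape them avoids malformed requests.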