In most search applications, the "top" matching results (sorted by score, or by some other criteria) are displayed to a human user.
In many applications these sorted results are presented to the user in "pages" containing a fixed number of matches, and users don't typically look past the first few pages of results.
Basic Pagination
In Solr, this basic paginated searching is supported using the start and rows parameters, and the performance of this common behaviour can be tuned by utilizing the queryResultCache and adjusting the queryResultWindowSize configuration options based on your expected page sizes.
Basic Pagination Examples
The easiest way to think about simple pagination is to multiply the page number you want (treating the "first" page as page 0) by the number of rows per page, as in the following pseudo-code:
function fetch_solr_page($page_number, $rows_per_page) {
  $start = $page_number * $rows_per_page
  $params = [ q = $some_query, rows = $rows_per_page, start = $start ]
  return fetch_solr($params)
}
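In SolrJ, the same page computation might look like the sketch below. This is a minimal illustration, not code from this guide: solrServer stands for whatever SolrClient instance your application already has (as in the cursor example later on this page), and some_query is a placeholder query string.
// Minimal SolrJ sketch of basic start/rows pagination.
// Assumes org.apache.solr.client.solrj.SolrQuery and QueryResponse are imported,
// and that 'solrServer' and 'some_query' already exist (placeholders for illustration).
int pageNumber = 2;       // the third page, counting the first page as page 0
int rowsPerPage = 10;
SolrQuery q = new SolrQuery(some_query);
q.setStart(pageNumber * rowsPerPage);   // absolute offset into the sorted result list
q.setRows(rowsPerPage);                 // number of results per page
QueryResponse rsp = solrServer.query(q);
// rsp.getResults() now holds the documents for the requested page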
How Basic Pagination is Affected by Index Updates
The start parameter specified in a request to Solr indicates an absolute "offset" in the complete sorted list of matches that the client wants Solr to use as the beginning of the current "page".
If an index modification (such as adding or removing documents) changes the sequence of ordered documents matching a query between two requests for subsequent pages, then the same document may be returned on multiple pages, or documents may be "skipped" as the result set shrinks or grows.
For example, consider an index containing 26 documents like so:
| id | name |
|----|------|
| 1  | A    |
| 2  | B    |
| …  | …    |
| 26 | Z    |
Now consider the following requests and index modifications, interleaved:
- A client requests q=*:*&rows=5&start=0&sort=name asc
  - Documents with the ids 1-5 will be returned to the client
- Document id 3 is deleted
- The client requests "page #2" using q=*:*&rows=5&start=5&sort=name asc
  - Documents 7-11 will be returned
  - Document 6 has been skipped, since it is now the 5th document in the sorted set of all matching results; it would be returned on a new request for "page #1"
- 3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
- The client requests "page #3" using q=*:*&rows=5&start=10&sort=name asc
  - Documents 9-13 will be returned
  - Documents 9, 10, and 11 have now been returned on both page #2 and page #3 since they moved farther back in the list of sorted results
In typical situations these impacts from index changes on paginated searching don’t significantly affect user experience — either because they happen extremely infrequently in fairly static collections, or because the users recognize that the collection of data is constantly evolving and expect to see documents shift up and down in the result sets.
Performance Problems with "Deep Paging"
In some situations, the results of a Solr search are not destined for a simple paginated user interface.
When you wish to fetch a very large number of sorted results from Solr to feed into an external system, using very large values for the start or rows parameters can be very inefficient. Pagination using start and rows not only requires Solr to compute (and sort) in memory all of the matching documents that should be fetched for the current page, but also all of the documents that would have appeared on previous pages.
A request for start=0&rows=1000000 is obviously inefficient because it requires Solr to maintain and sort in memory a set of 1 million documents, but a request for start=999000&rows=1000 is equally inefficient for the same reason: Solr can't compute which matching document is the 999001st result in sorted order without first determining what the first 999000 matching sorted results are.
If the index is distributed, which is common when running in SolrCloud mode, then 1 million documents are retrieved from each shard. For a ten shard index, ten million entries must be retrieved and sorted to figure out the 1000 documents that match those query parameters.
Fetching A Large Number of Sorted Results: Cursors
As an alternative to increasing the "start" parameter to request subsequent pages of sorted results, Solr supports using a "Cursor" to scan through results.
Cursors in Solr are a logical concept that doesn’t involve caching any state information on the server. Instead the sort values of the last document returned to the client are used to compute a "mark" representing a logical point in the ordered space of sort values. That "mark" can be specified in the parameters of subsequent requests to tell Solr where to continue.
Using Cursors
To use a cursor with Solr, specify a cursorMark parameter with the value of "*". You can think of this as being analogous to start=0 as a way to tell Solr "start at the beginning of my sorted results", except that it also informs Solr that you want to use a Cursor.
In addition to returning the top N sorted results (where you can control N using the rows parameter), the Solr response will also include an encoded String named nextCursorMark. You then take the nextCursorMark String value from the response, and pass it back to Solr as the cursorMark parameter for your next request. You can repeat this process until you've fetched as many docs as you want, or until the nextCursorMark returned matches the cursorMark you've already specified, indicating that there are no more results.
Constraints when using Cursors
There are a few important constraints to be aware of when using the cursorMark parameter in a Solr request:
- cursorMark and start are mutually exclusive parameters.
  - Your requests must either not include a start parameter, or it must be specified with a value of "0".
- sort clauses must include the uniqueKey field (either asc or desc).
  - If id is your uniqueKey field, then sort parameters like id asc and name asc, id desc would both work fine, but name asc by itself would not.
- Sorts including Date Math based functions that involve calculations relative to NOW will cause confusing results, since every document will get a new sort value on every subsequent request. This can easily result in cursors that never end, and constantly return the same documents over and over, even if the documents are never updated. In this situation, choose and re-use a fixed value for the NOW request param in all of your cursor requests (see the sketch after this list).
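The SolrJ sketch below ties these constraints together: the sort includes the uniqueKey field, no start parameter is set, and a fixed NOW value is re-used for every request in the cursor so that date-math sorts stay stable. It is an illustration only; the timestamp-based sort, the field names, and the choice of row count are assumptions, not requirements.
// Minimal sketch of a cursor-friendly query (placeholder names, for illustration only).
SolrQuery q = new SolrQuery("*:*");
q.setRows(100);
// The sort must include the uniqueKey field ("id" here) as a tie-breaker.
q.setSort(SolrQuery.SortClause.desc("timestamp"));
q.addSort(SolrQuery.SortClause.asc("id"));
// Do NOT set a 'start' offset; cursorMark and start are mutually exclusive.
// Pin NOW to a fixed value (epoch milliseconds) and re-use it on every request
// in this cursor, so date math relative to NOW doesn't shift between pages.
long fixedNow = System.currentTimeMillis();
q.set("NOW", Long.toString(fixedNow));
q.set(CursorMarkParams.CURSOR_MARK_PARAM, CursorMarkParams.CURSOR_MARK_START);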
Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last document on a page of results. In that situation, the subsequent request using that cursorMark would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will identify a unique point in the sequence of documents.
Cursor Examples
Fetch All Docs
The pseudo-code shown here shows the basic logic involved in fetching all documents matching a query using a cursor:
// when fetching all docs, you might as well use a simple id sort
// unless you really need the docs to come back in a specific order
$params = [ q => $some_query, sort => 'id asc', rows => $r, cursorMark => '*' ]
$done = false
while (not $done) {
  $results = fetch_solr($params)
  // do something with $results
  if ($params[cursorMark] == $results[nextCursorMark]) {
    $done = true
  }
  $params[cursorMark] = $results[nextCursorMark]
}
Using SolrJ, this pseudo-code would be:
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (! done) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solrServer.query(q);
  String nextCursorMark = rsp.getNextCursorMark();
  doCustomProcessingOfResults(rsp);
  if (cursorMark.equals(nextCursorMark)) {
    done = true;
  }
  cursorMark = nextCursorMark;
}
If you wanted to do this by hand using curl, the sequence of requests would look something like this:
$ curl '...&rows=10&sort=id+asc&cursorMark=*'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 docs here ...
]},
"nextCursorMark":"AoEjR0JQ"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEjR0JQ'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEpVkRCREIxQTE2"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpVkRCREIxQTE2'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEmbWF4dG9y"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEmbWF4dG9y'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 2 docs here because we've reached the end.
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpdmlld3Nvbmlj'
{
"response":{"numFound":32,"start":0,"docs":[
// no more docs here, and note that the nextCursorMark
// matches the cursorMark param we used
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}
Fetch First N docs, Based on Post Processing
Since the cursor is stateless from Solr’s perspective, your client code can stop fetching additional results as soon as you have decided you have enough information:
while (! done) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solrServer.query(q);
  String nextCursorMark = rsp.getNextCursorMark();
  boolean hadEnough = doCustomProcessingOfResults(rsp);
  if (hadEnough || cursorMark.equals(nextCursorMark)) {
    done = true;
  }
  cursorMark = nextCursorMark;
}
How Cursors are Affected by Index Updates
Unlike basic pagination, Cursor pagination does not rely on an absolute "offset" into the complete sorted list of matching documents. Instead, the cursorMark specified in a request encapsulates information about the relative position of the last document returned, based on the absolute sort values of that document. This means that the impact of index modifications is much smaller when using a cursor compared to basic pagination. Consider the same example index described when discussing basic pagination:
| id | name |
|----|------|
| 1  | A    |
| 2  | B    |
| …  | …    |
| 26 | Z    |
- A client requests q=*:*&rows=5&start=0&sort=name asc, id asc&cursorMark=*
  - Documents with the ids 1-5 will be returned to the client in order
- Document id 3 is deleted
- The client requests 5 more documents using the nextCursorMark from the previous response
  - Documents 6-10 will be returned; the deletion of a document that's already been returned doesn't affect the relative position of the cursor
- 3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
- The client requests 5 more documents using the nextCursorMark from the previous response
  - Documents 11-15 will be returned; the addition of new documents with sort values already past does not affect the relative position of the cursor
- Document id 1 is updated to change its 'name' to Q
- Document id 17 is updated to change its 'name' to A
- The client requests 5 more documents using the nextCursorMark from the previous response
  - The resulting documents are 16,1,18,19,20 in that order
  - Because the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice
  - Because the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress
In a nutshell: when fetching all results matching a query using cursorMark, the only way index modifications can result in a document being skipped, or returned twice, is if the sort value of the document changes.
One way to ensure that a document will never be returned more than once is to use the uniqueKey field as the primary (and therefore only significant) sort criterion. In this situation, you will be guaranteed that each document is only returned once, no matter how it may be modified during the use of the cursor.
"Tailing" a Cursor
Because Cursor requests are stateless, and the cursorMark values encapsulate the absolute sort values of the last document returned from a search, it's possible to "continue" fetching additional results from a cursor that has already reached its end if new documents are added (or existing documents are updated) to the end of the results.
You can think of this as being similar to using "tail -f" in Unix. The most common example of how this can be useful is when you have a "timestamp" field recording when a document was added or updated in your index. Client applications can continuously poll a cursor using sort=timestamp asc, id asc for documents matching a query, and will always be notified when a document matching the request criteria is added or updated.
Another common example is when you have uniqueKey values that always increase as new documents are created; you can continuously poll a cursor using sort=id asc to be notified about new documents.
The pseudo-code for tailing a cursor is only a slight modification of our earlier example for processing all docs matching a query:
while (true) {
  $doneForNow = false
  while (not $doneForNow) {
    $results = fetch_solr($params)
    // do something with $results
    if ($params[cursorMark] == $results[nextCursorMark]) {
      $doneForNow = true
    }
    $params[cursorMark] = $results[nextCursorMark]
  }
  sleep($some_configured_delay)
}
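In SolrJ, the same tailing loop might look like the sketch below. It is an illustrative sketch only: solrServer, doCustomProcessingOfResults, the timestamp field name, and the poll delay are assumptions, and error handling is omitted.
// Illustrative sketch of "tailing" a cursor with SolrJ (placeholder names).
SolrQuery q = new SolrQuery("*:*");
q.setRows(100);
q.setSort(SolrQuery.SortClause.asc("timestamp"));  // assumed field recording add/update time
q.addSort(SolrQuery.SortClause.asc("id"));         // uniqueKey tie-breaker, required for cursors
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
while (true) {
  boolean doneForNow = false;
  while (! doneForNow) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrServer.query(q);
    doCustomProcessingOfResults(rsp);                // placeholder for application logic
    String nextCursorMark = rsp.getNextCursorMark();
    if (cursorMark.equals(nextCursorMark)) {
      doneForNow = true;                             // caught up; no new docs past the cursor
    }
    cursorMark = nextCursorMark;
  }
  Thread.sleep(someConfiguredDelayMillis);           // poll again after a configured delay
}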
For certain specialized cases, the /export handler may be an option.