Exporting Result Sets
The /export request handler allows a fully sorted result set to be streamed out of Solr using a special rank query parser and response writer.
These have been specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
This feature uses a stream sorting technique that begins to send records within milliseconds and continues to stream results until the entire result set has been sorted and exported.
The cases where this functionality may be useful include: session analysis, distributed merge joins, time series roll-ups, aggregations on high cardinality fields, fully distributed field collapsing, and sort-based stats.
Field Requirements
All the fields being sorted must have docValues set to true.
By default, fields in the field list (fl) must also have docValues.
However, you can include stored-only fields (fields without docValues) by setting the includeStoredFields parameter to true.
For more information, see the section on DocValues.
The /export RequestHandler
The /export request handler, with the appropriate configuration, is one of Solr’s out-of-the-box request handlers; see Implicit Request Handlers for more information.
Note that this request handler’s properties are defined as "invariants", which means they cannot be overridden by other properties passed at another time (such as at query time).
Requesting Results Export
You can use /export to make requests to export the result set of a query.
All queries must include sort and fl parameters, or the query will return an error.
Filter queries are also supported.
An optional parameter batchSize determines the size of the internal buffers for partial results.
The default value is 30000.
Smaller values limit memory use at the cost of degraded performance, while larger values can improve export performance, although the relationship is not linear and larger values do not bring proportionally larger gains.
An optional parameter includeStoredFields (default false) enables exporting fields that only have stored values (no docValues).
When set to true, fields without docValues but with stored values can be included in the field list (fl).
Note that retrieving stored fields may significantly impact export performance compared to docValues fields, as stored fields require additional I/O operations.
If all requested fields are docValues=true then the data will only be read from docValues.
This behavior applies to fields that are also stored=true and does not depend on the value of the includeStoredFields parameter.
The supported response writers are json and javabin.
For backward compatibility, wt=xsort is also accepted as input, but it behaves the same as wt=json.
The default output format is json.
Here is an example of an export request of some indexed log data:
http://localhost:8983/solr/core_name/export?q=my-query&sort=severity+desc,timestamp+desc&fl=severity,timestamp,msg
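The request above can also be assembled programmatically. The following is a minimal sketch in Python that only builds the URL and does not contact Solr; the function name build_export_url and its argument names are hypothetical, while the request parameters (q, sort, fl, batchSize, includeStoredFields, wt) are the ones described in this section.

```python
from urllib.parse import urlencode


def build_export_url(base_url, collection, q, sort, fl,
                     batch_size=None, include_stored_fields=False, wt=None):
    """Build an /export URL (hypothetical helper, not part of Solr)."""
    # The handler returns an error when sort or fl is missing, so fail early.
    if not sort or not fl:
        raise ValueError("/export requires both sort and fl")
    params = {"q": q, "sort": sort, "fl": fl}
    # Optional parameters are only sent when the caller sets them.
    if batch_size is not None:
        params["batchSize"] = batch_size
    if include_stored_fields:
        params["includeStoredFields"] = "true"
    if wt is not None:
        params["wt"] = wt
    # safe="," keeps commas readable; spaces are encoded as "+".
    return f"{base_url}/{collection}/export?{urlencode(params, safe=',')}"


url = build_export_url(
    "http://localhost:8983/solr", "core_name",
    q="my-query",
    sort="severity desc,timestamp desc",
    fl="severity,timestamp,msg",
)
```

Checking for sort and fl on the client side simply surfaces the error before a request is sent; the handler remains the authority on what it accepts.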
Specifying the Sort Criteria
The sort property defines how documents will be sorted in the exported result set.
Results can be sorted by any field that has a field type of int, long, float, double, or string.
The sort fields must be single valued fields and must have docValues enabled.
Export performance degrades as more sort fields are added.
If there is enough physical memory available outside of the JVM to load the sort fields, performance degrades linearly with each additional sort field; otherwise it can degrade more sharply.
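The sort requirements above can be checked on the client before a request is sent. The sketch below is an illustration only and is not how Solr validates requests: the schema dictionary, its keys (type, doc_values, multi_valued), and the function name check_sort_fields are all hypothetical, and the set of sortable types is copied from the list in this section.

```python
# Sortable field types as listed in this section.
SORTABLE_TYPES = {"int", "long", "float", "double", "string"}


def check_sort_fields(sort, schema):
    """Return a list of problems found in a sort spec such as 'a desc,b asc'."""
    problems = []
    for clause in sort.split(","):
        parts = clause.split()
        if len(parts) != 2 or parts[1].lower() not in ("asc", "desc"):
            problems.append(f"malformed sort clause: {clause!r}")
            continue
        name = parts[0]
        field = schema.get(name)
        if field is None:
            problems.append(f"{name}: unknown field")
            continue
        if field["type"] not in SORTABLE_TYPES:
            problems.append(f"{name}: type {field['type']} is not sortable")
        if field.get("multi_valued"):
            problems.append(f"{name}: sort fields must be single-valued")
        if not field.get("doc_values"):
            problems.append(f"{name}: sort fields must have docValues")
    return problems


# Hypothetical, hand-written description of three fields.
example_schema = {
    "severity": {"type": "int", "doc_values": True, "multi_valued": False},
    "tags": {"type": "string", "doc_values": True, "multi_valued": True},
    "msg": {"type": "text", "doc_values": False, "multi_valued": False},
}

problems = check_sort_fields("severity desc,tags asc", example_schema)
```

Here only tags is reported, because it is multi-valued; severity satisfies all three requirements.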
Specifying the Field List
The fl property defines the fields that will be exported with the result set.
Any of the field types that can be sorted (i.e., int, long, float, double, string, date, boolean) can be used in the field list.
The fields can be single or multi-valued.
By default, fields in the field list must have docValues enabled.
However, when the includeStoredFields parameter is set to true, fields with only stored values (no docValues) can also be included.
Note that sort fields still require docValues, regardless of this setting.
Wildcard patterns can be used in the field list (e.g., fl=*_i); they are expanded to the list of fields that match the pattern and can be exported (see Field Requirements).
Returning scores is not supported at this time.
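The interaction between wildcard patterns, docValues, and includeStoredFields can be pictured with a small client-side model. This is a rough sketch of the rules described above, not Solr's implementation: the schema dictionary and the function name expand_field_list are hypothetical, and Python's fnmatch is used as a stand-in for Solr's own pattern matching.

```python
from fnmatch import fnmatchcase


def expand_field_list(fl, schema, include_stored_fields=False):
    """Expand fl patterns to the matching fields that can be exported."""
    exportable = []
    for pattern in fl.split(","):
        pattern = pattern.strip()
        for name, field in schema.items():
            if not fnmatchcase(name, pattern):
                continue
            # docValues fields are always exportable; stored-only fields
            # qualify only when includeStoredFields is enabled.
            ok = field.get("doc_values") or (
                include_stored_fields and field.get("stored"))
            if ok and name not in exportable:
                exportable.append(name)
    return exportable


# Hypothetical fields: one with docValues, one stored-only.
example_fields = {
    "count_i": {"doc_values": True, "stored": False},
    "size_i": {"doc_values": False, "stored": True},
    "msg": {"doc_values": False, "stored": True},
}

default_fields = expand_field_list("*_i", example_fields)
with_stored = expand_field_list("*_i", example_fields,
                                include_stored_fields=True)
```

For simplicity this model also silently drops explicitly named fields that cannot be exported; that is an assumption of the sketch and may not match how the handler treats them.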
Specifying the Local Streaming Expression
The optional expr property defines a stream expression that allows documents to be processed locally before they are exported in the result set.
Expressions have to use a special input() stream that represents original results from the /export handler.
Output from the stream expression then becomes the output from the /export handler.
The &streamLocalOnly=true flag is always set for this streaming expression.
Only stream decorators and evaluators are supported in these expressions - using any of the source expressions except for the pre-defined input() will result in an error.
Using stream expressions with the /export handler may result in dramatic performance improvements due to the local in-memory reduction of the number of documents to be returned.
Here’s an example of using the top decorator to return only the top N results:
http://localhost:8983/solr/core_name/export?q=my-query&sort=timestamp+desc&fl=timestamp,reporter,severity&expr=top(n=2,input(),sort="timestamp+desc")
(Note that the sort spec in the top decorator must match the sort spec in the handler parameter).
Here’s an example of using the unique decorator:
http://localhost:8983/solr/core_name/export?q=my-query&sort=reporter+desc&fl=reporter&expr=unique(input(),over="reporter")
(Note that the over parameter must use one of the fields requested in the fl parameter).
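Because the expr value contains quotes, parentheses, and spaces, it is easiest to let a URL library encode it. The sketch below builds the top example in Python without contacting Solr; the helper names top_expr and build_expr_export_url are hypothetical. Deriving the decorator's sort from the handler's sort is one way to keep the two in agreement, as the note above requires.

```python
from urllib.parse import urlencode


def top_expr(n, sort):
    """Return a top() expression over input() using the given sort spec."""
    return f'top(n={n},input(),sort="{sort}")'


def build_expr_export_url(base_url, collection, q, sort, fl, expr):
    """Build an /export URL that carries a local streaming expression."""
    params = {"q": q, "sort": sort, "fl": fl, "expr": expr}
    # urlencode percent-encodes quotes, parentheses, and "=" inside expr.
    return f"{base_url}/{collection}/export?{urlencode(params)}"


handler_sort = "timestamp desc"
expr_url = build_expr_export_url(
    "http://localhost:8983/solr", "core_name",
    q="my-query",
    sort=handler_sort,
    fl="timestamp,reporter,severity",
    # Reuse the handler sort so the decorator sort cannot drift from it.
    expr=top_expr(2, handler_sort),
)
```

The resulting URL is more heavily percent-encoded than the hand-written example, but it decodes to the same parameter values.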
Comparison with Cursors
The /export handler and cursor-based pagination offer different trade-offs for streaming large result sets.
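| | Export | Cursors |
|---|---|---|
| Advantages | Filter query runs once; consistent view of the index for the whole export; low latency to the first document; sorting and encoding continue while the client consumes results | No constraints on sort criteria beyond including a unique key; most SearchHandler capabilities, such as highlighting; a cursorMark can be persisted and resumed later; distributed queries supported by default |
| Disadvantages | IndexReaders are held open until the export completes; the stream must be consumed in a single session; distributed export requires streaming expressions | Query is re-executed for each page; documents can disappear or be duplicated between pages; large rows values and enough memory are needed for decent throughput; Solr sits idle between batches |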
Details
With cursors, the query is re-executed for each page of results.
In contrast, /export runs the filter query once and the resulting segment-level bitmasks are applied once per segment, after which the documents are simply iterated over.
Additionally, the segments that existed when the stream was opened are held open for the duration of the export, eliminating the disappearing or duplicate document issues that can occur with cursors.
However, this means IndexReaders are kept around for longer periods of time, which delays cleanup of memory and disk resources until the export completes.
The /export handler has significantly lower latency until the first document is returned, because the internal batch size is decoupled from the response message size.
With cursors, you typically need to set the rows parameter to a high value (e.g., 10k-100k, depending on fl and document size) to achieve decent throughput, which in turn requires enough memory (roughly rows * shards * fl-size).
However, this creates a "glugging" effect: when you request a large batch, Solr must build the entire payload and send it over the wire while your client waits (assuming a sharded collection).
Only after receiving and decoding this large payload can the client request the next batch, but in the interim Solr sits idle on this request.
With the /export handler, these steps are decoupled; Solr can continue sorting and decoding/encoding documents while waiting for more demand from the client.
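The request-and-wait pattern of cursors is easiest to see in code. The following minimal sketch shows the client-side loop: each iteration blocks on one full page before the next can be requested. The function iterate_with_cursor and the fetch_page callable are hypothetical; fetch_page is assumed to issue the query with the given cursorMark and rows and to return the parsed JSON response. An in-memory stand-in replaces Solr here so the loop can run on its own.

```python
def iterate_with_cursor(fetch_page, rows):
    """Yield documents page by page until the cursor stops advancing."""
    cursor_mark = "*"  # initial cursorMark value
    while True:
        # The client waits here while the whole page is built and sent.
        page = fetch_page(cursor_mark, rows)
        yield from page["response"]["docs"]
        next_mark = page["nextCursorMark"]
        # An unchanged mark means there are no more results.
        if next_mark == cursor_mark:
            break
        cursor_mark = next_mark


def make_fake_fetch(docs):
    """In-memory stand-in for Solr; the marks are plain offsets."""
    def fetch_page(cursor_mark, rows):
        start = 0 if cursor_mark == "*" else int(cursor_mark)
        chunk = docs[start:start + rows]
        next_mark = str(start + len(chunk)) if chunk else cursor_mark
        return {"response": {"docs": chunk}, "nextCursorMark": next_mark}
    return fetch_page


all_docs = list(iterate_with_cursor(make_fake_fetch([1, 2, 3, 4, 5]), rows=2))
```

With rows=2 and five documents, the loop makes four round trips, the last of which returns an empty page and an unchanged mark.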
The advantage of cursors is flexibility.
Cursors impose no constraints on the sort criteria except that you must include a unique key, which isn’t a real constraint.
Cursors work as part of SearchHandler and can therefore use most of its capabilities, such as highlighting.
A cursorMark can be persisted and resumed later, even across restarts, or never continued if enough results were consumed to satisfy the use-case.
An /export stream must be consumed in a single session.
Cursors also support distributed queries by default while /export does not, although distributed exports can be achieved using streaming expressions, which are built on top of the /export handler.
Distributed Support
See the section Streaming Expressions for distributed support.