Exporting Result Sets

The /export request handler allows a fully sorted result set to be streamed out of Solr using a special rank query parser and response writer. These have been specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.

This feature uses a stream sorting technique that begins to send records within milliseconds and continues to stream results until the entire result set has been sorted and exported.

The cases where this functionality may be useful include: session analysis, distributed merge joins, time series roll-ups, aggregations on high cardinality fields, fully distributed field collapsing, and sort-based stats.

Field Requirements

All the fields being sorted must have docValues set to true. By default, fields in the field list (fl) must also have docValues. However, you can include stored-only fields (fields without docValues) by setting the includeStoredFields parameter to true. For more information, see the section on DocValues.

The /export RequestHandler

The /export request handler, with the appropriate configuration, is one of Solr’s out-of-the-box request handlers. See Implicit Request Handlers for more information.

Note that this request handler’s properties are defined as "invariants", which means they cannot be overridden by other properties passed at another time (such as at query time).

Requesting Results Export

You can use /export to make requests to export the result set of a query.

All queries must include sort and fl parameters, or the query will return an error. Filter queries are also supported.

An optional batchSize parameter sets the size of the internal buffers for partial results. The default value is 30000. Specify a smaller value to limit memory use (at the cost of degraded performance) or a larger value to improve export performance. The relationship is not linear, so larger values do not bring proportionally larger performance gains.

An optional includeStoredFields parameter (default false) enables exporting fields that only have stored values (no docValues). When it is set to true, fields without docValues but with stored values can be included in the field list (fl). Note that retrieving stored fields may significantly reduce export performance compared to docValues fields, because stored fields require additional I/O operations. If every requested field has docValues=true, the data is read only from docValues. This holds even for fields that are also stored=true, and it does not depend on the value of the includeStoredFields parameter.
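The parameter rules above can be summarized in a small client-side helper. This is a minimal sketch, not part of Solr: the function name export_params and its argument names are hypothetical, and it only mirrors the rules stated in this section (sort and fl are mandatory, fq, batchSize and includeStoredFields are optional).

```python
def export_params(q, sort, fl, fq=None, batch_size=None, include_stored_fields=False):
    """Assemble /export query parameters (hypothetical client-side helper)."""
    # Solr returns an error when sort or fl is missing, so fail early here.
    if not sort:
        raise ValueError("/export requires a sort parameter")
    if not fl:
        raise ValueError("/export requires an fl parameter")

    params = {"q": q, "sort": sort, "fl": fl}
    if fq:
        params["fq"] = fq  # filter queries are supported
    if batch_size is not None:
        if batch_size <= 0:
            raise ValueError("batchSize must be a positive integer")
        params["batchSize"] = str(batch_size)  # omitted means Solr's default (30000)
    if include_stored_fields:
        params["includeStoredFields"] = "true"  # omitted means false
    return params
```

For example, export_params("*:*", "id asc", "id,msg", batch_size=10000, include_stored_fields=True) yields a dictionary that can be URL-encoded and appended to the /export path.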

The supported response writers are json and javabin. For backward compatibility, wt=xsort is also accepted as input, but it behaves the same as wt=json. The default output format is json.

Here is an example of an export request of some indexed log data:

http://localhost:8983/solr/core_name/export?q=my-query&sort=severity+desc,timestamp+desc&fl=severity,timestamp,msg
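A request like the one above could be issued from a client as sketched below. The helper names build_export_url and stream_export are hypothetical, the host and core name are taken from the example, and the sketch reads the response as raw bytes rather than assuming a particular response layout.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_export_url(solr_base, core, q, sort, fl):
    """Build an /export URL; fl is a list of field names."""
    # safe="," keeps the commas in sort and fl readable instead of %2C.
    query = urlencode({"q": q, "sort": sort, "fl": ",".join(fl)}, safe=",")
    return f"{solr_base}/{core}/export?{query}"


def stream_export(url, chunk_size=64 * 1024):
    """Yield the response body in chunks as Solr streams it."""
    with urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

Calling build_export_url("http://localhost:8983/solr", "core_name", "my-query", "severity desc,timestamp desc", ["severity", "timestamp", "msg"]) reproduces the URL shown above. Reading in chunks lets the client start processing before the full result set has been exported.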

Specifying the Sort Criteria

The sort property defines how documents will be sorted in the exported result set. Results can be sorted by any field that has a field type of int, long, float, double, or string. The sort fields must be single-valued and must have docValues enabled.

Export performance slows as you add more sort fields. If enough physical memory is available outside of the JVM to load the sort fields, performance degrades linearly with each additional sort field. Otherwise, it can degrade faster.
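The sort constraints above can be checked on the client before sending a request. The sketch below assumes a hypothetical schema dictionary that maps field names to their properties (the keys type, multiValued and docValues are chosen for illustration); it is not a Solr API.

```python
# Field types this section lists as sortable for /export.
SORTABLE_TYPES = {"int", "long", "float", "double", "string"}


def check_sort_spec(sort, schema):
    """Return a list of problems found in a sort spec such as 'a desc,b asc'."""
    problems = []
    for clause in sort.split(","):
        parts = clause.split()
        if len(parts) != 2 or parts[1] not in ("asc", "desc"):
            problems.append(f"malformed sort clause: {clause!r}")
            continue
        name = parts[0]
        props = schema.get(name)
        if props is None:
            problems.append(f"{name}: unknown field")
            continue
        if props["type"] not in SORTABLE_TYPES:
            problems.append(f"{name}: type {props['type']} is not sortable")
        if props.get("multiValued", False):
            problems.append(f"{name}: sort fields must be single-valued")
        if not props.get("docValues", False):
            problems.append(f"{name}: sort fields must have docValues")
    return problems
```

An empty list means the sort spec satisfies the rules described here. Solr itself remains the authority on whether a request is accepted.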

Specifying the Field List

The fl property defines the fields that will be exported with the result set. Any of the field types that can be sorted (i.e., int, long, float, double, string, date, boolean) can be used in the field list. The fields can be single or multi-valued.

By default, fields in the field list must have docValues enabled. However, when the includeStoredFields parameter is set to true, fields with only stored values (no docValues) can also be included. Note that sort fields still require docValues, regardless of this setting.

Wildcard patterns can be used in the field list (e.g., fl=*_i). Each pattern is expanded to the list of fields that match it and can be exported; see Field Requirements.
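The expansion rule can be approximated as follows. This is only an illustration under stated assumptions: the schema dictionary and the function expand_fl are hypothetical, and Python's fnmatch is used as a stand-in for Solr's own pattern matching, which may differ in detail. The sketch models only the filtering of fields that cannot be exported.

```python
from fnmatch import fnmatchcase


def expand_fl(fl, schema, include_stored_fields=False):
    """Expand fl patterns to the exportable fields they match, in order."""

    def exportable(props):
        # docValues fields are always exportable; stored-only fields
        # are exportable only when includeStoredFields is true.
        if props.get("docValues", False):
            return True
        return include_stored_fields and props.get("stored", False)

    selected = []
    for pattern in fl.split(","):
        pattern = pattern.strip()
        for name, props in schema.items():
            if fnmatchcase(name, pattern) and exportable(props) and name not in selected:
                selected.append(name)
    return selected
```

With a schema holding count_i (docValues) and size_i (stored only), expand_fl("*_i", schema) keeps only count_i, while passing include_stored_fields=True also keeps size_i.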

Returning scores is not supported at this time.

Specifying the Local Streaming Expression

The optional expr property defines a stream expression that allows documents to be processed locally before they are exported in the result set.

Expressions have to use a special input() stream that represents original results from the /export handler. Output from the stream expression then becomes the output from the /export handler. The &streamLocalOnly=true flag is always set for this streaming expression.

Only stream decorators and evaluators are supported in these expressions - using any of the source expressions except for the pre-defined input() will result in an error.

Using stream expressions with the /export handler may result in dramatic performance improvements due to the local in-memory reduction of the number of documents to be returned.

Here’s an example of using the top decorator to return only the top N results:

http://localhost:8983/solr/core_name/export?q=my-query&sort=timestamp+desc&fl=timestamp,reporter,severity&expr=top(n=2,input(),sort="timestamp+desc")

Note that the sort spec in the top decorator must match the sort spec in the handler parameter.

Here’s an example of using the unique decorator:

http://localhost:8983/solr/core_name/export?q=my-query&sort=reporter+desc&fl=reporter&expr=unique(input(),over="reporter")

Note that the over parameter must use one of the fields requested in the fl parameter.
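The two constraints noted for these examples (the top sort must match the handler sort, and over must be one of the fl fields) can be enforced when the request is built. The helper names below are hypothetical; the sketch only assembles URLs equivalent to the two examples, with special characters percent-encoded.

```python
from urllib.parse import urlencode


def top_expr(n, sort):
    """Build a top() expression that reuses the handler's sort spec."""
    return f'top(n={n},input(),sort="{sort}")'


def unique_expr(over, fl):
    """Build a unique() expression; over must be one of the fl fields."""
    if over not in fl.split(","):
        raise ValueError(f"over field {over!r} is not in fl={fl!r}")
    return f'unique(input(),over="{over}")'


def export_url_with_expr(base_url, q, sort, fl, expr):
    """base_url is the core URL, e.g. http://localhost:8983/solr/core_name."""
    query = urlencode({"q": q, "sort": sort, "fl": fl, "expr": expr})
    return f"{base_url}/export?{query}"
```

Passing the same sort string to both top_expr and export_url_with_expr keeps the two sort specs identical by construction.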

Comparison with Cursors

The /export handler and cursor-based pagination offer different trade-offs for streaming large result sets.

Advantages of /export:

Query executed once (efficient)
Consistent snapshot (no duplicates or missing docs)
Lower latency to the first document (typically)
Decoupled reader and writer creates smoother flow

Advantages of cursors:

Intrinsic support for sharded collections
Flexible sort criteria
Resumable across requests and restarts
Full SearchHandler features (highlighting, etc.)

Disadvantages of /export:

Requires streaming expressions for distributed queries
Sort criteria can only be fields with docValues; no score
Must be consumed in a single session
A long session may keep old segments from being removed in a timely manner

Disadvantages of cursors:

Query re-executed for each page (inefficient)
Possible duplicates or missing docs with concurrent updates
Higher latency to the first document (typically)
Uneven flow; large batches needed for throughput

Details

With cursors, the query is re-executed for each page of results. In contrast, /export runs the filter query once and the resulting segment-level bitmasks are applied once per segment, after which the documents are simply iterated over. Additionally, the segments that existed when the stream was opened are held open for the duration of the export, eliminating the disappearing or duplicate document issues that can occur with cursors. However, this means IndexReaders are kept around for longer periods of time, which delays cleanup of memory and disk resources until the export completes.

The /export handler has significantly lower latency until the first document is returned, because the internal batch size is decoupled from the response message size. With cursors, you typically need to set the rows parameter to a high value (e.g., 10k-100k, depending on fl and document size) to achieve decent throughput, and you need enough memory to hold each batch (rows * shards * fl-size). This creates a "glugging" effect: when you request a large batch, Solr must build the entire payload and send it over the wire while your client waits (assuming a sharded collection). Only after receiving and decoding this large payload can the client request the next batch, and in the interim Solr sits idle on this request. With the /export handler, these steps are decoupled: Solr can continue sorting and encoding documents while waiting for more demand from the client.
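As a rough worked example of the rows * shards * fl-size estimate, the sketch below multiplies the three factors. The function name and the figures used (50,000 rows, 8 shards, 200 bytes per document) are illustrative assumptions, not measurements.

```python
def cursor_page_memory_bytes(rows, shards, bytes_per_doc):
    """Rough memory needed to assemble one cursor page across all shards."""
    return rows * shards * bytes_per_doc


# Illustrative figures only: 50,000 rows, 8 shards, about 200 bytes per document.
estimate = cursor_page_memory_bytes(rows=50_000, shards=8, bytes_per_doc=200)
print(f"{estimate / 1_000_000:.0f} MB per page")  # prints "80 MB per page"
```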

The advantage of cursors is flexibility. Cursors impose no constraints on the sort criteria except that you must include a unique key, which isn’t a real constraint. Cursors work as part of SearchHandler and can therefore use most or all of its capabilities, such as highlighting. A cursorMark can be persisted and resumed later, even across restarts, or never continued if enough results were consumed to satisfy the use case. An /export stream, by contrast, must be consumed in a single session. Cursors also support distributed queries by default while /export does not, although distributed exports can be achieved using streaming expressions, which are built on top of the /export handler.
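For contrast, a cursor-based client loop might look like the sketch below. It shows the query being re-sent once per page, and that the cursorMark is the only state needed to resume later. The fetch_page callable is a hypothetical stand-in for whatever sends the parameters to a search handler and returns the decoded JSON response; the sort passed in is assumed to include the unique key.

```python
def iterate_cursor(fetch_page, q, sort, rows):
    """Yield documents page by page until the cursor stops advancing."""
    cursor_mark = "*"  # starting value for a new cursor
    while True:
        # The full query is re-executed for every page.
        page = fetch_page({"q": q, "sort": sort, "rows": rows, "cursorMark": cursor_mark})
        yield from page["response"]["docs"]
        next_mark = page["nextCursorMark"]
        if next_mark == cursor_mark:
            break  # an unchanged mark means the result set is exhausted
        cursor_mark = next_mark  # could be persisted here to resume later
```

Because the loop holds no server-side session, a client can stop after any page and continue later from the saved cursorMark, which an /export stream does not allow.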

Distributed Support

See the section Streaming Expressions for distributed support.