DocValues

DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.

Why DocValues?

The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.

For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).

In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.

Enabling DocValues

To use docValues, you only need to enable it for a field that you will use it with. As with all schema design, you need to define a field type and then define fields of that type with docValues enabled. All of these actions are done in schema.xml.

Enabling a field for docValues only requires adding docValues="true" to the field (or field type) definition, as in this example from the schema.xml of Solr’s sample_techproducts_configs config set:

<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues.

DocValues are only available for specific field types. The types chosen determine the underlying Lucene docValue type that will be used. The available Solr field types are:

  • StrField and UUIDField.

    • If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.

    • If the field is multi-valued, Lucene will use the SORTED_SET type.

  • Any Trie* numeric fields, date fields and EnumField.

    • If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.

    • If the field is multi-valued, Lucene will use the SORTED_SET type.

  • Boolean fields

  • Int|Long|Float|Double|Date PointField

    • If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.

    • If the field is multi-valued, Lucene will use the SORTED_NUMERIC type.

These Lucene types are related to how the values are sorted and stored.

There is an additional configuration option available, which is to modify the docValuesFormat used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk. In some cases, however, you may choose to specify an alternative DocValuesFormat implementation. For example, you could choose to keep everything in memory by specifying docValuesFormat="Memory" on a field type:

<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Memory" />

Please note that the docValuesFormat option may change in future releases.

Lucene index back-compatibility is only supported for the default codec. If you choose to customize the docValuesFormat in your schema.xml, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.

Using DocValues

Sorting, Faceting & Functions

If docValues="true" for a field, then DocValues will automatically be used any time the field is used for sorting, faceting or function queries.

Retrieving DocValues During Search

Field values retrieved during search queries are typically returned from stored values. However, non-stored docValues fields will be also returned along with other stored fields when all fields (or pattern matching globs) are specified to be returned (e.g. “fl=*”) for search queries depending on the effective value of the useDocValuesAsStored parameter for each field. For schema versions >= 1.6, the implicit default is useDocValuesAsStored="true". See Field Type Definitions and Properties & Defining Fields for more details.

When useDocValuesAsStored="false", non-stored DocValues fields can still be explicitly requested by name in the fl param, but will not match glob patterns ("*"). Note that returning DocValues along with "regular" stored fields at query time has performance implications that stored fields may not because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that while returning non-stored fields from DocValues, the values of a multi-valued field are returned in sorted order (and not insertion order). If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field as stored (such a change requires re-indexing).

In cases where the query is returning only docValues fields performance may improve since returning stored fields requires disk reads and decompression whereas returning docValues fields in the fl list only requires memory access.

When retrieving fields from their docValues form (using the /export handler, streaming expressions or if the field is requested in the fl parameter), two important differences between regular stored fields and docValues fields must be understood:

  1. Order is not preserved. For simply retrieving stored fields, the insertion order is the return order. For docValues, it is the sorted order.

  2. Multiple identical entries are collapsed into a single value. Thus if I insert values 4, 5, 2, 4, 1, my return will be 1, 2, 4, 5.