DocValues
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
Why DocValues?
The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.
For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).
In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
Enabling DocValues
To use docValues, you only need to enable it for a field that you will use it with. As with all schema design, you need to define a field type and then define fields of that type with docValues enabled. All of these actions are done in schema.xml
.
Enabling a field for docValues only requires adding docValues="true"
to the field (or field type) definition, as in this example from the schema.xml
of Solr’s sample_techproducts_configs
configset:
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
If you have already indexed data into your Solr index, you will need to completely reindex your content after changing your field definitions in schema.xml in order to successfully use docValues.
|
DocValues are only available for specific field types. The types chosen determine the underlying Lucene docValue type that will be used. The available Solr field types are:
StrField
, andUUIDField
:- If the field is single-valued (i.e., multi-valued is false), Lucene will use the
SORTED
type. - If the field is multi-valued, Lucene will use the
SORTED_SET
type. Entries are kept in sorted order and duplicates are removed.
- If the field is single-valued (i.e., multi-valued is false), Lucene will use the
BoolField
:- If the field is single-valued (i.e., multi-valued is false), Lucene will use the
SORTED
type. - If the field is multi-valued, Lucene will use the
SORTED_SET
type. Entries are kept in sorted order and duplicates are removed.
- If the field is single-valued (i.e., multi-valued is false), Lucene will use the
- Any
*PointField
Numeric or Date fields,EnumFieldType
, andCurrencyFieldType
:- If the field is single-valued (i.e., multi-valued is false), Lucene will use the
NUMERIC
type. - If the field is multi-valued, Lucene will use the
SORTED_NUMERIC
type. Entries are kept in sorted order and duplicates are kept.
- If the field is single-valued (i.e., multi-valued is false), Lucene will use the
- Any of the deprecated
Trie*
Numeric or Date fields,EnumField
andCurrencyField
:- If the field is single-valued (i.e., multi-valued is false), Lucene will use the
NUMERIC
type. - If the field is multi-valued, Lucene will use the
SORTED_SET
type. Entries are kept in sorted order and duplicates are removed.
- If the field is single-valued (i.e., multi-valued is false), Lucene will use the
These Lucene types are related to how the values are sorted and stored.
There is an additional configuration option available, which is to modify the docValuesFormat
used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk. In some cases, however, you may choose to specify an alternative DocValuesFormat implementation. For example, you could choose to keep everything in memory by specifying docValuesFormat="Direct"
on a field type:
<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Direct" />
Please note that the docValuesFormat
option may change in future releases.
Lucene index back-compatibility is only supported for the default codec. If you choose to customize the docValuesFormat in your schema.xml , upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.
|
Using DocValues
Sorting, Faceting & Functions
If docValues="true"
for a field, then DocValues will automatically be used any time the field is used for sorting, faceting or function queries.
Retrieving DocValues During Search
Field values retrieved during search queries are typically returned from stored values. However, non-stored docValues fields will be also returned along with other stored fields when all fields (or pattern matching globs) are specified to be returned (e.g., “fl=*”) for search queries depending on the effective value of the useDocValuesAsStored
parameter for each field. For schema versions >= 1.6, the implicit default is useDocValuesAsStored="true"
. See Field Type Definitions and Properties & Defining Fields for more details.
When useDocValuesAsStored="false"
, non-stored DocValues fields can still be explicitly requested by name in the fl param, but will not match glob patterns ("*"
). Note that returning DocValues along with "regular" stored fields at query time has performance implications that stored fields may not because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that while returning non-stored fields from DocValues, the values of a multi-valued field are returned in sorted order rather than insertion order and may have duplicates removed, see above. If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field as stored (such a change requires reindexing).
In cases where the query is returning only docValues fields performance may improve since returning stored fields requires disk reads and decompression whereas returning docValues fields in the fl list only requires memory access.
When retrieving fields from their docValues form (such as when using the /export handler, streaming expressions or if the field is requested in the fl
parameter), two important differences between regular stored fields and docValues fields must be understood:
- Order is not preserved. When retrieving stored fields, the insertion order is the return order. For docValues, it is the sorted order.
- For field types using
SORTED_SET
(see above), multiple identical entries are collapsed into a single value. Thus if values 4, 5, 2, 4, 1 are inserted, the values returned will be 1, 2, 4, 5.