Reindexing

There are several types of changes to Solr configuration that require you to reindex your data.

These changes include editing properties of fields or field types; adding fields, field types, or copy field rules; upgrading Solr; and some system configuration properties.

It’s important to be aware that many changes require reindexing, because there are times when not reindexing can have negative consequences for Solr as a system, or for the ability of your users to find what they are looking for.

There is no process in Solr for programmatically reindexing data. When we say "reindex", we mean, literally, "index it again". However you got the data into the index the first time, you will run that process again. It is strongly recommended that Solr users index their data in a repeatable, consistent way, so that the process can be easily repeated when the need for reindexing arises.

Reindexing is recommended during major upgrades, so in addition to covering what types of configuration changes should trigger a reindex, this section will also cover strategies for reindexing.

Changes that Require Reindex

Schema Changes

All changes to a collection’s schema require reindexing. This is because many of the available options are only applied during the indexing process. Solr simply has no way to implement the desired change without reindexing the data.

To understand the general reason why reindexing is ever required, it’s helpful to understand the relationship between Solr’s schema and the underlying Lucene index. Lucene does not use a schema, it is a Solr-only concept. When you delete a field from Solr’s schema, it does not modify Lucene’s index in any way. When you add a field to Solr’s schema, the field does not exist in Lucene’s index until a document that contains the field is indexed.

This means that there are many types of schema changes that cannot be reflected in the index simply by modifying Solr’s schema. This is different from most database models where schemas are used. With regard to indexing, Solr’s schema acts like a rulebook for indexing documents by telling Lucene how to interpret the data being sent. Once the documents are in Lucene, Solr’s schema has no control over the underlying data structure.

In addition to the types of schema changes described in the following sections, changing the schema version property is equivalent to changing field type properties. This type of change is usually only made during or because of a major upgrade.

Adding or Deleting Fields

If you add or delete a field from Solr’s schema, it’s strongly recommended to reindex.

When you add a field, you generally do so with the intent to use the field in some way. Since documents were indexed before the field was added, the index will not hold any references to the field for earlier documents. If you want to use the new field for faceting, for example, the new field facet will not include any documents that were not indexed with the new field.

There is a slightly different situation when deleting a field. In this case, since simply removing the field from the schema doesn’t change anything about the index, the field will still be in the index until the documents are reindexed. In fact, Lucene may keep a reference to a deleted field forever (see also LUCENE-1761). This may only be an issue for your environment if you try to add a field that has the same name as a deleted field, but it can also be an issue for dynamic field rules that are later removed.

Changing Field and Field Type Properties

Solr has two ways of defining field properties.

The first is to define properties on a field type. These properties are then applied to all fields of that type unless they are explicitly overriden.

The second is an override to a property inherited from the field type defined on the field itself.

If a property has been defined for a field type but the property is not overridden by defining a different value for the property for a field, then changing the property on the field type is equivalent to changing it on the field itself.

Changes to any field/field type property described in Field Type Properties must be reindexed in order for the change to be reflected in all documents. The list of changes that require reindexing includes (but is not limited to):

  • Changing a field from stored to not stored, and vice versa.
  • Changing a field from indexed to not indexed, and vice versa.
  • Changing a field from multi-valued to single-valued, and vice versa.
  • Changing Field Analysis.
  • Changing the type of field, or the class for a field type.
  • Enabling or disabling docValues.

Be sure to reference the Field Type Properties section linked above for the complete list of properties that would require a reindex.

In some cases, it can be possible to change a field/field type property value and it will only apply to documents indexed after the change.

For example, you could change a field from being indexed (indexed="true") to no longer indexed (indexed="false") and over time, as documents are updated, the index will be purged of the fields that shouldn’t be indexed anymore.

You could also change a field from not being stored (stored="false") to being stored (stored="true"). In this case, if you want to use the field immediately, only documents indexed after the change will contain data in the field. However, you would need to ensure that your client is able to handle fields missing from documents that have not yet been reindexed.

It’s important to note this is not possible for all field/field type properties. If you change whether or not docValues are enabled, for example, you absolutely must reindex. This is due to the way docValues have been implemented in Lucene, and how Lucene handles dovValue segments.

Changing any field properties without reindexing is never recommended to ensure consistent behavior, and should only be attempted when you have tested thoroughly and feel confident that you understand the ramifications on your documents and front-end clients.

Changing Field Analysis

Beyond specific field-level properties, analysis chains are also configured on field types, and are applied at index and/or query time.

It’s possible to define separate analysis chains for indexing and query events, or you can define a single chain that is applied to both event types.

If you change the analysis chain that applies to indexing events, it is strongly recommended that you reindex. This is because all of the changes that occur due to the chain configuration are applied to documents as they are being indexed, and only reindexing will allow your changes to take effect on documents.

While reindexing after analyzer changes is not required, be aware that not reindexing can cause unexpected query results in many cases.

For example, if you indexed a number of documents and then decide you’d like to use the LowerCaseTokenizerFactory to ensure all text is converted to lower case, you will have a mix of entries in the field: some in their original case ("iPhone"), and newer documents in all lower-case ("iphone"). If you do not reindex the original set of documents, a query such as "iphone" will not match documents with "iPhone", because the schema rules enforce lower case on the query, but that’s not what is in the index.

The only time you do not have to reindex when changing a field type’s analysis chain is when the changes impact queries only (and you know that you do not need to make corresponding changes to the index analysis).

Solrconfig Changes

Only one parameter change to Solr’s solrconfig.xml requires reindexing. That parameter is the luceneMatchVersion, which controls the compatibility of Solr with Lucene changes. Since this parameter can change the rules for analysis behind the scenes, it’s always recommended to reindex when changing this value. Usually, however, this is only changed in conjunction with a major upgrade.

However, if you make a change to Solr’s Update Request Processors, it’s generally because you want to change something about how update requests (documents) are processed (indexed). In this case, you can decide based on the change if you want to reindex your documents to implement the changes you’ve made.

Similarly, if you change the codecFactory parameter in solrconfig.xml, it is again strongly recommended that you plan to reindex your documents to avoid unintended behavior.

Upgrades

When upgrading between major versions (for example, from a 7.x release to 8.0 or 8.x), a best practice is to always reindex your data. The reason for this is that subtle changes may occur in default field type definitions or the underlying code.

If you have not changed your schema as part of an upgrade from one minor release to another (such as, from 7.x to a later 7.x release), you can often skip reindexing your documents. However, when upgrading to a major release, you should plan to reindex your documents because of the likelihood of changes that break back-compatibility.

Reindexing Strategies

There are a few approaches available to perform the reindex.

The strategies described below ensure that the Lucene index is completely dropped so you can recreate it to accommodate your changes. They allow you to recreate the Lucene index without having Lucene segments lingering with stale data.

Delete All Documents

The best approach is to first delete everything from the index, and then index your data again. You can delete all documents with a "delete-by-query", such as this:

curl -X POST -H 'Content-Type: application/json' --data-binary '{"delete":{"query":"*:*" }}' http://localhost:8983/solr/my_collection/update

It’s important to verify that all documents have been deleted, as that ensures the Lucene index segments have been deleted as well.

To verify that there are no segments in your index, look in the data directory and confirm it is empty. Since the data directory can be customized, see the section Specifying a Location for Index Data with the dataDir Parameter for where to look to find the index files.

Note you will need to verify the indexes have been removed in every shard and every replica on every node of a cluster. It is not sufficient to only query for the number of documents because you may have no documents but still have index segments.

Once the indexes have been cleared, you can start reindexing by re-running the original index process.

Index to Another Collection

In cases where you cannot take a production collection offline to delete all the documents, one option is to use Solr’s collection alias feature.

This option is only available for Solr installations running in SolrCloud mode.

With this approach, you will index your documents into a newly created collection and once everything is completed, create an alias for the collection and point your front-end at the collection alias. Queries will be routed to the new collection seamlessly.

Here is an example of creating an alias that points to a single collection:

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=myData&collections=newCollection

Once the alias is in place and you are satisfied you no longer need the old data, you can delete the old collection with the DELETE command of the Collections API:

http://localhost:8983/solr/admin/collections?action=DELETE&name=oldCollection

Changes that Do Not Require Reindex

The types of changes that do not require or strongly indicate reindexing are changes that do not impact the index.

Creating or modifying request handlers, search components, and other elements of solrconfig.xml don’t require reindexing.

Cluster and core management actions, such as adding nodes, replicas, or new cores, or splitting shards, also don’t require reindexing.