Schemaless Mode | Apache Solr Reference Guide 7.5

Schemaless Mode is a set of Solr features that, when used together, allow users to rapidly construct an effective schema by simply indexing sample data, without having to manually edit the schema.

These Solr features, all controlled via solrconfig.xml, are:

Managed schema: Schema modifications are made at runtime through Solr APIs, which requires the use of a schemaFactory that supports these changes. See the section Schema Factory Definition in SolrConfig for more details.
Field value class guessing: Previously unseen fields are run through a cascading set of value-based parsers, which guess the Java class of field values - parsers for Boolean, Integer, Long, Float, Double, and Date are currently available.
Automatic schema field addition, based on field value class(es): Previously unseen fields are added to the schema, based on field value Java classes, which are mapped to schema field types - see Solr Field Types.

Using the Schemaless Example

The three features of schemaless mode are pre-configured in the _default config set in the Solr distribution. To start an example instance of Solr using these configs, run the following command:

bin/solr start -e schemaless

This will launch a single Solr server, and automatically create a collection (named “gettingstarted”) that contains only three fields in the initial schema: id, _version_, and _text_.

You can use the /schema/fields Schema API to confirm this: curl http://localhost:8983/solr/gettingstarted/schema/fields will output:

{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "fields":[{
      "name":"_text_",
      "type":"text_general",
      "multiValued":true,
      "indexed":true,
      "stored":false},
    {
      "name":"_version_",
      "type":"long",
      "indexed":true,
      "stored":true},
    {
      "name":"id",
      "type":"string",
      "multiValued":false,
      "indexed":true,
      "required":true,
      "stored":true,
      "uniqueKey":true}]}

Configuring Schemaless Mode

As described above, there are three configuration elements that need to be in place to use Solr in schemaless mode. In the _default config set included with Solr these are already configured. If, however, you would like to implement schemaless on your own, you should make the following changes.

Enable Managed Schema

As described in the section Schema Factory Definition in SolrConfig, Managed Schema support is enabled by default, unless your configuration specifies that ClassicIndexSchemaFactory should be used.

You can configure the ManagedIndexSchemaFactory (and control the resource file used, or disable future modifications) by adding an explicit <schemaFactory/> like the one below, please see Schema Factory Definition in SolrConfig for more details on the options available.

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

Enable Field Class Guessing

In Solr, an UpdateRequestProcessorChain defines a chain of plugins that are applied to documents before or while they are indexed.

The field guessing aspect of Solr’s schemaless mode uses a specially-defined UpdateRequestProcessorChain that allows Solr to guess field types. You can also define the default field type classes to use.

To start, you should define it as follows (see the javadoc links below for update processor factory documentation):

  <updateProcessor class="solr.UUIDUpdateProcessorFactory" name="uuid"/>
  <updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory" name="remove-blank"/>
  <updateProcessor class="solr.FieldNameMutatingUpdateProcessorFactory" name="field-name-mutating"> (1)
    <str name="pattern">[^\w-\.]</str>
    <str name="replacement">_</str>
  </updateProcessor>
  <updateProcessor class="solr.ParseBooleanFieldUpdateProcessorFactory" name="parse-boolean"/> (2)
  <updateProcessor class="solr.ParseLongFieldUpdateProcessorFactory" name="parse-long"/>
  <updateProcessor class="solr.ParseDoubleFieldUpdateProcessorFactory" name="parse-double"/>
  <updateProcessor class="solr.ParseDateFieldUpdateProcessorFactory" name="parse-date">
    <arr name="format">
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss</str>
      <str>yyyy-MM-dd'T'HH:mmZ</str>
      <str>yyyy-MM-dd'T'HH:mm</str>
      <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
      <str>yyyy-MM-dd HH:mm:ss.SSS</str>
      <str>yyyy-MM-dd HH:mm:ss,SSS</str>
      <str>yyyy-MM-dd HH:mm:ssZ</str>
      <str>yyyy-MM-dd HH:mm:ss</str>
      <str>yyyy-MM-dd HH:mmZ</str>
      <str>yyyy-MM-dd HH:mm</str>
      <str>yyyy-MM-dd</str>
    </arr>
  </updateProcessor>
  <updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields"> (3)
    <lst name="typeMapping">
      <str name="valueClass">java.lang.String</str> (4)
      <str name="fieldType">text_general</str>
      <lst name="copyField"> (5)
        <str name="dest">*_str</str>
        <int name="maxChars">256</int>
      </lst>
      <!-- Use as default mapping instead of defaultFieldType -->
      <bool name="default">true</bool>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Boolean</str>
      <str name="fieldType">booleans</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.util.Date</str>
      <str name="fieldType">pdates</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Long</str> (6)
      <str name="valueClass">java.lang.Integer</str>
      <str name="fieldType">plongs</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Number</str>
      <str name="fieldType">pdoubles</str>
    </lst>
  </updateProcessor>

  <!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields"> (7)
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

There are many things defined in this chain. Let’s step through a few of them.

1	First, we’re using the FieldNameMutatingUpdateProcessorFactory to lower-case all field names. Note that this and every following `<processor>` element include a `name`. These names will be used in the final chain definition at the end of this example.
2	Next we add several update request processors to parse different field types. Note the ParseDateFieldUpdateProcessorFactory includes a long list of possible date formations that would be parsed into valid Solr dates. If you have a custom date, you could add it to this list (see the link to the Javadocs below to get information on how).
3	Once the fields have been parsed, we define the field types that will be assigned to those fields. You can modify any of these that you would like to change.
4	In this definition, if the parsing step decides the incoming data in a field is a string, we will put this into a field in Solr with the field type `text_general`. This field type by default allows Solr to query on this field.
5	After we’ve added the `text_general` field, we have also defined a copy field rule that will copy all data from the new `text_general` field to a field with the same name suffixed with `_str`. This is done by Solr’s dynamic fields feature. By defining the target of the copy field rule as a dynamic field in this way, you can control the field type used in your schema. The default selection allows Solr to facet, highlight, and sort on these fields.
6	This is another example of a mapping rule. In this case we define that when either of the `Long` or `Integer` field parsers identify a field, they should both map their fields to the `plongs` field type.
7	Finally, we add a chain definition that calls the list of plugins. These plugins are each called by the names we gave to them when we defined them. We can also add other processors to the chain, as shown here. Note we have also given the entire chain a `name` ("add-unknown-fields-to-the-schema"). We’ll use this name in the next section to specify that our update request handler should use this chain definition.

This chain definition will make a number of copy field rules for string fields to be created from corresponding text fields. If your data causes you to end up with a lot of copy field rules, indexing may be slowed down noticeably, and your index size will be larger. To control for these issues, it’s recommended that you review the copy field rules that are created, and remove any which you do not need for faceting, sorting, highlighting, etc.

If you’re interested in more information about the classes used in this chain, here are links to the Javadocs for update processor factories mentioned above:

Set the Default UpdateRequestProcessorChain

Once the UpdateRequestProcessorChain has been defined, you must instruct your UpdateRequestHandlers to use it when working with index updates (i.e., adding, removing, replacing documents).

There are two ways to do this. The update chain shown above has a default=true attribute which will use it for any update handler.

An alternative, more explicit way is to use InitParams to set the defaults on all /update request handlers:

<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</initParams>

After all of these changes have been made, Solr should be restarted or the cores reloaded.

Disabling Automatic Field Guessing

Automatic field creation can be disabled with the update.autoCreateFields property. To do this, you can use bin/solr config with a command such as:

bin/solr config -c mycollection -p 8983 -action set-user-property -property update.autoCreateFields -value false

Examples of Indexed Documents

Once the schemaless mode has been enabled (whether you configured it manually or are using the _default configset), documents that include fields that are not defined in your schema will be indexed, using the guessed field types which are automatically added to the schema.

For example, adding a CSV document will cause unknown fields to be added, with fieldTypes based on values:

curl "http://localhost:8983/solr/gettingstarted/update?commit=true&wt=xml" -H "Content-type:application/csv" -d '
id,Artist,Album,Released,Rating,FromDistributor,Sold
44C,Old Shews,Mead for Walking,1988-08-13,0.01,14,0'

Output indicating success:

<response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">106</int></lst>
</response>

The fields now in the schema (output from curl http://localhost:8983/solr/gettingstarted/schema/fields ):

{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "fields":[{
      "name":"Album",
      "type":"text_general"},
    {
      "name":"Artist",
      "type":"text_general"},
    {
      "name":"FromDistributor",
      "type":"plongs"},
    {
      "name":"Rating",
      "type":"pdoubles"},
    {
      "name":"Released",
      "type":"pdates"},
    {
      "name":"Sold",
      "type":"plongs"},
    {
      "name":"_root_", ...},
    {
      "name":"_text_", ...},
    {
      "name":"_version_", ...},
    {
      "name":"id", ...}
]}

In addition string versions of the text fields are indexed, using copyFields to a *_str dynamic field: (output from curl http://localhost:8983/solr/gettingstarted/schema/copyfields ):

{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "copyFields":[{
      "source":"Artist",
      "dest":"Artist_str",
      "maxChars":256},
    {
      "source":"Album",
      "dest":"Album_str",
      "maxChars":256}]}

You Can Still Be Explicit

Even if you want to use schemaless mode for most fields, you can still use the Schema API to pre-emptively create some fields, with explicit types, before you index documents that use them.

Internally, the Schema API and the Schemaless Update Processors both use the same Managed Schema functionality.

Also, if you do not need the *_str version of a text field, you can simply remove the copyField definition from the auto-generated schema and it will not be re-added since the original field is now defined.

Once a field has been added to the schema, its field type is fixed. As a consequence, adding documents with field value(s) that conflict with the previously guessed field type will fail. For example, after adding the above document, the “Sold” field has the fieldType plongs, but the document below has a non-integral decimal value in this field:

curl "http://localhost:8983/solr/gettingstarted/update?commit=true&wt=xml" -H "Content-type:application/csv" -d '
id,Description,Sold
19F,Cassettes by the pound,4.93'

This document will fail, as shown in this output:

<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">7</int>
  </lst>
  <lst name="error">
    <str name="msg">ERROR: [doc=19F] Error adding field 'Sold'='4.93' msg=For input string: "4.93"</str>
    <int name="code">400</int>
  </lst>
</response>

DocValues Understanding Analyzers, Tokenizers, and Filters