Transforming and Indexing Custom JSON

If you have JSON documents that you would like to index without transforming them into Solr’s structure, you can add them to Solr by including some parameters with the update request.

These parameters provide information on how to split a single JSON file into multiple Solr documents and how to map fields to Solr’s schema. One or more valid JSON documents can be sent to the /update/json/docs path with the configuration params.

Mapping Parameters

These parameters allow you to define how a JSON file should be read for multiple Solr documents.

split

Optional

Default: none

Defines the path at which to split the input JSON into multiple Solr documents and is required if you have multiple documents in a single JSON file. If the entire JSON makes a single Solr document, the path must be “/”.

It is possible to pass multiple split paths by separating them with a pipe (|), for example: split=/|/foo|/foo/bar. If one path is a child of another, they automatically become a child document.

f

Required

Default: none

Provides multivalued mapping to map document field names to Solr field names. The format of the parameter is target-field-name:json-path, as in f=first:/first. The json-path is required. The target-field-name is the Solr document field name, and is optional. If not specified, it is automatically derived from the input JSON. The default target field name is the fully qualified name of the field.

Wildcards can be used here, see Using Wildcards for Field Names below for more information.

mapUniqueKeyOnly

Optional

Default: false

This parameter is particularly convenient when the fields in the input JSON are not available in the schema and schemaless mode is not enabled. This will index all the fields into the default search field (using the df parameter) and only the uniqueKey field is mapped to the corresponding field in the schema. If the input JSON does not have a value for the uniqueKey field then a UUID is generated for the same.

df

Optional

Default: none

If the mapUniqueKeyOnly flag is enabled, the update handler needs a field where the data should be indexed to. This is the same field that other handlers use as a default search field.

srcField

Optional

Default: none

This is the name of the field to which the JSON source will be stored into. This can only be used if split=/ (i.e., you want your JSON input file to be indexed as a single Solr document). Note that atomic updates will cause the field to be out-of-sync with the document.

echo

Optional

Default: false

This is for debugging purpose only. Set it to true if you want the docs to be returned as a response. Nothing will be indexed.

For example, if we have a JSON file that includes two documents, we could define an update request like this:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API User-Managed / Single-Node Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

With this request, we have defined that "exams" contains multiple documents. In addition, we have mapped several fields from the input document to Solr fields.

When the update request is complete, the following two documents will be added to the index:

{
  "first":"John",
  "last":"Doe",
  "marks":90,
  "test":"term1",
  "subject":"Maths",
  "grade":8
}
{
  "first":"John",
  "last":"Doe",
  "marks":86,
  "test":"term1",
  "subject":"Biology",
  "grade":8
}

In the prior example, all of the fields we wanted to use in Solr had the same names as they did in the input JSON. When that is the case, we can simplify the request by only specifying the json-path portion of the f parameter, as in this example:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API User-Managed / Single-Node Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

In this example, we simply named the field paths (such as /exams/test). Solr will automatically attempt to add the content of the field from the JSON input to the index in a field with the same name.

Documents will be rejected during indexing if the fields do not exist in the schema before indexing. So, if you are NOT using schemaless mode, you must pre-create all fields. If you are working in Schemaless Mode, however, fields that don’t exist will be created on the fly with Solr’s best guess for the field type.

Reusing Parameters in Multiple Requests

You can store and re-use parameters with Solr’s Request Parameters API.

Say we wanted to define parameters to split documents at the exams field, and map several other fields. We could make an API request such as:

V1 API

 curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
 "set": {
   "my_params": {
     "split": "/exams",
     "f": ["first:/first","last:/last","grade:/grade","subject:/exams/subject","test:/exams/test"]
 }}}'

V2 API User-Managed / Single-Node Solr

curl http://localhost:8983/api/cores/techproducts/config/params -H 'Content-type:application/json' -d '{
 "set": {
   "my_params": {
     "split": "/exams",
     "f": ["first:/first","last:/last","grade:/grade","subject:/exams/subject","test:/exams/test"]
 }}}'

V2 API SolrCloud

curl http://localhost:8983/api/collections/techproducts/config/params -H 'Content-type:application/json' -d '{
 "set": {
   "my_params": {
     "split": "/exams",
     "f": ["first:/first","last:/last","grade:/grade","subject:/exams/subject","test:/exams/test"]
 }}}'

When we send the documents, we’d use the useParams parameter with the name of the parameter set we defined:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs?useParams=my_params' -H 'Content-type:application/json' -d '{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [{
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'

V2 API User-Managed / Single-Node Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json?useParams=my_params' -H 'Content-type:application/json' -d '{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [{
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json?useParams=my_params' -H 'Content-type:application/json' -d '{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [{
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'

Using Wildcards for Field Names

Instead of specifying all the field names explicitly, it is possible to specify wildcards to map fields automatically.

There are two restrictions: wildcards can only be used at the end of the json-path, and the split path cannot use wildcards.

A single asterisk * maps only to direct children, and a double asterisk ** maps recursively to all descendants. The following are example wildcard path mappings:

f=$FQN:/**: maps all fields to the fully qualified name ($FQN) of the JSON field. The fully qualified name is obtained by concatenating all the keys in the hierarchy with a period (.) as a delimiter. This is the default behavior if no f path mappings are specified.
f=/docs/*: maps all the fields under docs and in the name as given in JSON
f=/docs/**: maps all the fields under docs and its children in the name as given in JSON
f=searchField:/docs/*: maps all fields under /docs to a single field called ‘searchField’
f=searchField:/docs/**: maps all fields under /docs and its children to searchField

With wildcards we can further simplify our previous example as follows:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API User-Managed / Single-Node Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json'\
'?split=/exams'\
'&f=/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json'\
'?split=/exams'\
'&f=/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

Because we want the fields to be indexed with the field names as they are found in the JSON input, the double wildcard in f=/** will map all fields and their descendants to the same fields in Solr.

It is also possible to send all the values to a single field and do a full text search on that. This is a good option to blindly index and query JSON documents without worrying about fields and schema.

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/'\
'&f=txt:/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API User-Managed / Single-Node Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json'\
'?split=/'\
'&f=txt:/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json'\
'?split=/'\
'&f=txt:/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

In the above example, we’ve said all of the fields should be added to a field in Solr named 'txt'. This will add multiple fields to a single field, so whatever field you choose should be multi-valued.

The default behavior is to use the fully qualified name (FQN) of the node. So, if we don’t define any field mappings, like this:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs?split=/exams'\
    -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API User-Managed / Single-Node Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json?split=/exams'\
    -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json?split=/exams'\
    -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test"   : "term1",
      "marks"  : 90},
    {
      "subject": "Biology",
      "test"   : "term1",
      "marks"  : 86}
  ]
}'

The indexed documents would be added to the index with fields that look like this:

{
  "first":"John",
  "last":"Doe",
  "grade":8,
  "exams.subject":"Maths",
  "exams.test":"term1",
  "exams.marks":90},
{
  "first":"John",
  "last":"Doe",
  "grade":8,
  "exams.subject":"Biology",
  "exams.test":"term1",
  "exams.marks":86}

Multiple Documents in a Single Payload

This functionality supports documents in the JSON Lines format (.jsonl), which specifies one document per line.

For example:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs' -H 'Content-type:application/json' -d '
{ "first":"Steve", "last":"Jobs", "grade":1, "subject":"Social Science", "test":"term1", "marks":90}
{ "first":"Steve", "last":"Woz", "grade":1, "subject":"Political Science", "test":"term1", "marks":86}'

V2 API User-Managed / Single-Node Solr

curl 'http://localhost:8983/api/collections/techproducts/update/json' -H 'Content-type:application/json' -d '
{ "first":"Steve", "last":"Jobs", "grade":1, "subject":"Social Science", "test":"term1", "marks":90}
{ "first":"Steve", "last":"Woz", "grade":1, "subject":"Political Science", "test":"term1", "marks":86}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json' -H 'Content-type:application/json' -d '
{ "first":"Steve", "last":"Jobs", "grade":1, "subject":"Social Science", "test":"term1", "marks":90}
{ "first":"Steve", "last":"Woz", "grade":1, "subject":"Political Science", "test":"term1", "marks":86}'

Or even an array of documents, as in this example:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs' -H 'Content-type:application/json' -d '[
{"first":"Steve", "last":"Jobs", "grade":1, "subject":"Computer Science", "test":"term1", "marks":90},
{"first":"Steve", "last":"Woz", "grade":1, "subject":"Calculus", "test":"term1", "marks":86}]'

V2 API User-Managed / Single-Node Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json' -H 'Content-type:application/json' -d '[
{"first":"Steve", "last":"Jobs", "grade":1, "subject":"Computer Science", "test":"term1", "marks":90},
{"first":"Steve", "last":"Woz", "grade":1, "subject":"Calculus", "test":"term1", "marks":86}]'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json' -H 'Content-type:application/json' -d '[
{"first":"Steve", "last":"Jobs", "grade":1, "subject":"Computer Science", "test":"term1", "marks":90},
{"first":"Steve", "last":"Woz", "grade":1, "subject":"Calculus", "test":"term1", "marks":86}]'

Tips for Custom JSON Indexing

Schemaless mode: This handles field creation automatically. The field guessing may not be exactly as you expect, but it works. The best thing to do is to setup a local server in schemaless mode, index a few sample docs and create those fields in your real setup with proper field types before indexing
Pre-created Schema: Post your docs to the /update/json/docs endpoint with echo=true. This gives you the list of field names you need to create. Create the fields before you actually index
No schema, only full-text search: All you need to do is to do full-text search on your JSON. Set the configuration as given in the Setting JSON Defaults section.

Setting JSON Defaults

It is possible to send any JSON to the /update/json/docs endpoint and the default configuration of the component is as follows:

<initParams path="/update/json/docs">
  <lst name="defaults">
    <!-- this ensures that the entire JSON doc will be stored verbatim into one field -->
    <str name="srcField">_src_</str>
    <!-- This means the uniqueKeyField will be extracted from the fields and
         all fields go into the 'df' field. In this config df is already configured to be 'text'
     -->
    <str name="mapUniqueKeyOnly">true</str>
    <!-- The default search field where all the values are indexed to -->
    <str name="df">text</str>
  </lst>
</initParams>

So, if no parameters are passed, the entire JSON file would get indexed to the _src_ field and all the values in the input JSON would go to a field named text. If there is a value for the uniqueKey it is stored and if no value could be obtained from the input JSON, a UUID is created and used as the uniqueKey field value.

Alternately, use the Request Parameters feature to set these parameters, as shown earlier in the section Reusing Parameters in Multiple Requests.

V1 API

 curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
"set": {
  "full_txt": {
    "srcField": "_src_",
    "mapUniqueKeyOnly" : true,
    "df": "text"
}}}'

V2 API User-Managed / Single-Node Solr

 curl http://localhost:8983/api/cores/techproducts/config/params -H 'Content-type:application/json' -d '{
"set": {
  "full_txt": {
    "srcField": "_src_",
    "mapUniqueKeyOnly" : true,
    "df": "text"
}}}'

V2 API SolrCloud

 curl http://localhost:8983/api/collections/techproducts/config/params -H 'Content-type:application/json' -d '{
"set": {
  "full_txt": {
    "srcField": "_src_",
    "mapUniqueKeyOnly" : true,
    "df": "text"
}}}'

To use these parameters, send the parameter useParams=full_txt with each request.