Text to Vector

This module brings the power of Large Language Models (*LLM*s) to Solr.

More specifically, it provides a text-to-vector capability, used on documents or queries, via integrating with popular external services that do this.

The state-of-the-art of such services use an LLM, hence the name of this module. Without this module, vectors must be supplied to Solr for indexing & searching, possibly coordinating with such services.

Main Concepts

From Text to Vector

The aim of encoding text to numerical vectors is to represent text in a way that semantically similar sentences are encoded to vectors close in a vector space.

Often this process is called 'text embedding' as it projects a piece of text into a high-dimensional latent vector space and embeds the text with such vector.

Vector distance metrics (algorithms) can then be used to compute a pairwise similarity, producing a score.

Large Language Models

Specific Large Language Models are able to encode text to a numerical vector.

These models are often called Embedding Models as they encode text to vector embeddings.

For additional information you can refer to this blog post.

Text to Vector Online Services

Training, fine-tuning and operating such Large Language Models is expensive.

Many companies focus on this aspect and let users access APIs to encode the text (at the price of a license fee).

Apache Solr uses LangChain4j to connect to such apis.

This module sends your documents and queries off to some hosted service on the internet. There are cost, privacy, performance, and service availability implications on such a strong dependency that should be diligently examined before employing this module in a serious way.

At the moment a subset of the text vectorisation services supported by LangChain4j is supported by Solr.

Disclaimer: Apache Solr is in no way affiliated to any of these corporations or services.

If you want to add support for additional services or improve the support for the existing ones, feel free to contribute:

Contributing to Solr

Module

This is provided via the llm Solr Module that needs to be enabled before use.

LLM Configuration

You need to register / configure the plugins provided by the LLM module that you want to use. This is done in solrconfig.xml.

Declaration of the knn_text_to_vector query parser.

<queryParser name="knn_text_to_vector" class="org.apache.solr.llm.textvectorisation.search.TextToVectorQParserPlugin"/>

Text to Vector Lifecycle

Models

A model encodes text to a vector.
A model in Solr is a reference to an external API that runs the Large Language Model responsible for text vectorisation.

the Solr vectorisation model specifies the parameters to access the APIs, the model doesn’t run internally in Solr

A model is described by these parameters:

class

Required

Default: none

The model implementation. Accepted values:

dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel.
dev.langchain4j.model.mistralai.MistralAiEmbeddingModel.
dev.langchain4j.model.openai.OpenAiEmbeddingModel.
dev.langchain4j.model.cohere.CohereEmbeddingModel.

name

Required

Default: none

The identifier of your model, this is used by any component that intends to use the model (knn_text_to_vector query parser).

params

Optional

Default: none

Each model class has potentially different params. Many are shared but for the full set of parameters of the model you are interested in please refer to the official documentation of the LangChain4j version included in Solr: Vectorisationm Models in LangChain4j.

Supported Models

Apache Solr uses LangChain4j to support text vectorisation. The models currently supported are:

Hugging Face
MistralAI
OpenAI
Cohere

{
  "class": "dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel",
  "name": "<a-name-for-your-model>",
  "params": {
    "accessToken": "<your-huggingface-api-key>",
    "modelId": "<a-huggingface-vectorisation-model>"
  }
}

{
  "class": "dev.langchain4j.model.mistralai.MistralAiEmbeddingModel",
  "name": "<a-name-for-your-model>",
  "params": {
    "baseUrl": "https://api.mistral.ai/v1",
    "apiKey": "<your-mistralAI-api-key>",
    "modelName": "<a-mistralAI-vectorisation-model>",
    "timeout": 60,
    "logRequests": true,
    "logResponses": true,
    "maxRetries": 5
  }
}

{
  "class": "dev.langchain4j.model.openai.OpenAiEmbeddingModel",
  "name": "<a-name-for-your-model>",
  "params": {
    "baseUrl": "https://api.openai.com/v1",
    "apiKey": "<your-openAI-api-key>",
    "modelName": "<a-openAI-vectorisation-model>",
    "timeout": 60,
    "logRequests": true,
    "logResponses": true,
    "maxRetries": 5
  }
}

{
  "class": "dev.langchain4j.model.cohere.CohereEmbeddingModel",
  "name": "<a-name-for-your-model>",
  "params": {
    "baseUrl": "https://api.cohere.ai/v1/",
    "apiKey": "<your-cohere-api-key>",
    "modelName": "<a-cohere-vectorisation-model>",
    "inputType": "search_document",
    "timeout": 60,
    "logRequests": true,
    "logResponses": true
  }
}

Uploading a Model

To upload the model in a /path/myModel.json file, please run:

curl -XPUT 'http://localhost:8983/solr/techproducts/schema/text-to-vector-model-store' --data-binary "@/path/myModel.json" -H 'Content-type:application/json'

To view all models:

http://localhost:8983/solr/techproducts/schema/text-to-vector-model-store

To delete the currentModel model:

curl -XDELETE 'http://localhost:8983/solr/techproducts/schema/text-to-vector-model-store/currentModel'

To view the model you just uploaded please open the following URL in a browser:

http://localhost:8983/solr/techproducts/schema/text-to-vector-model-store

Example: /path/myModel.json

{
  "class": "dev.langchain4j.model.openai.OpenAiEmbeddingModel",
  "name": "openai-1",
  "params": {
    "baseUrl": "https://api.openai.com/v1",
    "apiKey": "apiKey-openAI",
    "modelName": "text-embedding-3-small",
    "timeout": 60,
    "logRequests": true,
    "logResponses": true,
    "maxRetries": 5
  }
}

Documentation Indexing time

Enriching documents with vectors at indexing time

To vectorise textual fields of your documents at indexing time you need to configure an Update Request Processor Chain that includes at least one TextToVectorUpdateProcessor update request processor (you can include more than one, if you want to vectorise multiple fields):

<updateRequestProcessorChain name="textToVector">
  <processor class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
   <str name="inputField">_text_</str>
   <str name="outputField">vector</str>
   <str name="model">dummy-1</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
 </updateRequestProcessorChain>

The TextToVectorUpdateProcessor update request processor vectorises the content of the 'inputField' for each document processed at indexing time.

The resulting vector is added as a value for the 'outputField'.

To perform the vectorisation it leverages a 'model' you have previously uploaded in the text-to-vector-model-store.

This update processor sends your document field content off to some hosted service on the internet. There are serious performance implications that should be diligently examined before employing this component in production. It will slow down substantially your indexing pipeline so make sure to stress test your solution before going live.

For more details on how to work with update request processors in Apache Solr, please refer to the dedicated page: Update Request Processor

Index first and enrich your documents with vectors on a second pass

Vectorising text using a hosted service may be slow, so, depending on your use case it could be a good idea to index first your documents and then add vectors iteratively.

This can be done in Solr defining two update request processors chains: one that includes all the processors you need, excluded the TextToVectorUpdateProcessor (let’s call it 'no-vectorisation') and one that includes only the TextToVectorUpdateProcessor (let’s call it 'vectorisation').

<updateRequestProcessorChain name="no-vectorisation">
<processor class="solr.processor1">
   ...
  </processor>
...
<processor class="solr.processorN">
   ...
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
 </updateRequestProcessorChain>

<updateRequestProcessorChain name="vectorisation">
<processor class="solr.processor1">
   ...
  </processor>
...
<processor class="solr.processorN">
   ...
  </processor>
<processor class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
   <str name="inputField">_text_</str>
   <str name="outputField">vector</str>
   <str name="model">dummy-1</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
 </updateRequestProcessorChain>

You would index your documents first using the 'no-vectorisation' and when finished, incrementally repeat the indexing targeting the 'vectorisation' chain.

This implies you need to send the documents you want to index to Solr twice and re-run any other update request processor you need, in the second chain. This has data traffic implications(you transfer your documents over the network twice) and processing implications (if you have other update request processors in your chain, those must be repeated the second time as we are literally replacing the indexed documents one by one).

If your use case is compatible with Partial Updates, you can do better:

You still define two chains, but this time the 'vectorisation' one only includes the 'TextToVectorUpdateProcessor' (and the Mandatory Processors )

<updateRequestProcessorChain name="no-vectorisation">
<processor class="solr.processor1">
   ...
  </processor>
...
<processor class="solr.processorN">
   ...
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
 </updateRequestProcessorChain>

<updateRequestProcessorChain name="vectorisation">
<processor class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
   <str name="inputField">_text_</str>
   <str name="outputField">vector</str>
   <str name="model">dummy-1</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
 </updateRequestProcessorChain>

Add to your schema a simple field that will be useful to track the vectorisation and use atomic updates:

<field name="vectorised" type="boolean" indexed="true" stored="false" docValues="true" default="false"/>

In the first pass just index your documents using your reliable and fast 'no-vectorisation' chain.

On the second pass, re-index all your documents using atomic updates and targeting the 'vectorisation' chain:

{"id":"mydoc",
 "vectorised":{"set":true}
}

What will happen is that internally Solr fetches the stored content of the docs to update, all the existing fields are retrieved and a re-indexing happens, targeting the 'vectorisation' chain that will add the vector and set the boolean 'vectorised' field to 'true'.

Faceting or querying on the boolean 'vectorised' field can also give you a quick idea on how many documents have been enriched with vectors.

Running a Text-to-Vector Query

To run a query that vectorises your query text, using a model you previously uploaded is simple:

?q={!knn_text_to_vector model=a-model f=vector topK=10}hello world query

The search results retrieved are the k=10 nearest documents to the vector encoded from the query hello world query, using the model a-model.

For more details on how to work with vector search query parsers in Apache Solr, please refer to the dedicated page: Dense Vector Search