Dense Vector Search
Solr’s Dense Vector Search adds support for indexing and searching dense numerical vectors.
Deep learning can be used to produce a vector representation of both the query and the documents in a corpus of information.
These neural network-based techniques are usually referred to as neural search, an industry derivation from the academic field of Neural information Retrieval.
Important Concepts
Dense Vector Representation
A traditional tokenized inverted index can be considered to model text as a "sparse" vector, in which each term in the corpus corresponds to one vector dimension. In such a model, the number of dimensions is generally quite high (corresponding to the term dictionary cardinality), and the vector for any given document contains mostly zeros (hence it is sparse, as only a handful of terms that exist in the overall index will be present in any given document).
Dense vector representation contrasts with term-based sparse vector representation in that it distills approximate semantic meaning into a fixed (and limited) number of dimensions.
The number of dimensions in this approach is generally much lower than the sparse case, and the vector for any given document is dense, as most of its dimensions are populated by non-zero values.
In contrast to the sparse approach (for which tokenizers are used to generate sparse vectors directly from text input) the task of generating vectors must be handled in application logic external to Apache Solr.
There may be cases where it makes sense to directly search data that natively exists as a vector (e.g., scientific data); but in a text search context, it is likely that users will leverage deep learning models such as BERT to encode textual information as dense vectors, supplying the resulting vectors to Apache Solr explicitly at index and query time.
For additional information you can refer to this blog post.
Dense Retrieval
Given a dense vector v that models the information need, the easiest approach for providing dense vector retrieval would be to calculate the distance (euclidean, dot product, etc.) between v and each vector d that represents a document in the corpus of information.
This approach is quite expensive, so many approximate strategies are currently under active research.
The strategy implemented in Apache Lucene and used by Apache Solr is based on Navigable Small-world graph.
It provides efficient approximate nearest neighbor search for high dimensional vectors.
Index Time
This is the Apache Solr field type designed to support dense vector search:
DenseVectorField
The dense vector field gives the possibility of indexing and searching dense vectors of float elements.
For example:
[1.0, 2.5, 3.7, 4.1]
Here’s how DenseVectorField should be configured in the schema:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
vectorDimension-
Required
Default: none
The dimension of the dense vector to pass in.
Accepted values: Any integer.
similarityFunction-
Optional
Default:
euclideanVector similarity function; used in search to return top K most similar vectors to a target vector.
Accepted values:
euclidean,dot_productorcosine.-
euclidean: Euclidean distance -
dot_product: Dot product
-
| this similarity is intended as an optimized way to perform cosine similarity. In order to use it, all vectors must be of unit length, including both document and query vectors. Using dot product with vectors that are not unit length can result in errors or poor search results. |
-
cosine: Cosine similarity
the cosine similarity scores returned by Solr are normalized like this : (1 + cosine_similarity) / 2.
|
| the preferred way to perform cosine similarity is to normalize all vectors to unit length, and instead use DOT_PRODUCT. You should only use this function if you need to preserve the original vectors and cannot normalize them in advance. |
To use the following advanced parameters that customise the codec format and the hyperparameter of the HNSW algorithm, make sure the Schema Codec Factory, is in use.
Here’s how DenseVectorField can be configured with the advanced hyperparameters:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine" knnAlgorithm="hnsw" hnswMaxConnections="10" hnswBeamWidth="40"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
knnAlgorithm-
Optional
Default:
hnsw(advanced) Specifies the underlying knn algorithm to use
Accepted values:
hnsw.
Please note that the knnAlgorithm accepted values may change in future releases.
vectorEncoding-
Optional
Default:
FLOAT32(advanced) Specifies the underlying encoding of the dense vector elements. This affects memory/disk impact for both the indexed and stored fields (if enabled)
Accepted values:
FLOAT32,BYTE. hnswMaxConnections-
Optional
Default:
16(advanced) This parameter is specific for the
hnswknn algorithm:Controls how many of the nearest neighbor candidates are connected to the new node.
It has the same meaning as
Mfrom the 2018 paper.Accepted values: Any integer.
hnswBeamWidth-
Optional
Default:
100(advanced) This parameter is specific for the
hnswknn algorithm:It is the number of nearest neighbor candidates to track while searching the graph for each newly inserted node.
It has the same meaning as
efConstructionfrom the 2018 paper.Accepted values: Any integer.
DenseVectorField supports the attributes: indexed, stored.
| currently multivalue is not supported |
Here’s how a DenseVectorField should be indexed:
-
JSON
-
XML
-
SolrJ
[{ "id": "1",
"vector": [1.0, 2.5, 3.7, 4.1]
},
{ "id": "2",
"vector": [1.5, 5.5, 6.7, 65.1]
}
]
<add>
<doc>
<field name="id">1</field>
<field name="vector">1.0</field>
<field name="vector">2.5</field>
<field name="vector">3.7</field>
<field name="vector">4.1</field>
</doc>
<doc>
<field name="id">2</field>
<field name="vector">1.5</field>
<field name="vector">5.5</field>
<field name="vector">6.7</field>
<field name="vector">65.1</field>
</doc>
</add>
final SolrClient client = getSolrClient();
final SolrInputDocument d1 = new SolrInputDocument();
d1.setField("id", "1");
d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f));
final SolrInputDocument d2 = new SolrInputDocument();
d2.setField("id", "2");
d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f));
client.add(Arrays.asList(d1, d2));
Query Time
Apache Solr provides three query parsers that work with dense vector fields, that each support different ways of matching documents based on vector similarity: The knn query parser, the vectorSimilarity query parser and the knn_text_to_vector query parser.
All parsers return scores for retrieved documents that are the approximate distance to the target vector (defined by the similarityFunction configured at indexing time) and both support "Pre-Filtering" the document graph to reduce the number of candidate vectors evaluated (without needing to compute their vector similarity distances).
Common parameters for both query parsers are:
f-
Required
Default: none
The
DenseVectorFieldto search in. preFilter-
Optional
Default: Depends on usage, see below.
Specifies an explicit list of Pre-Filter query strings to use.
includeTags-
Optional
Default: none
Indicates that only
fqfilters with the specifiedtagshould be considered for implicit Pre-Filtering. Must not be combined withpreFilter. excludeTags-
Optional
Default: none
Indicates that
fqfilters with the specifiedtagshould be excluded from consideration for implicit Pre-Filtering. Must not be combined withpreFilter.
knn Query Parser
The knn k-nearest neighbors query parser matches k-nearest documents to the target vector.
In addition to the common parameters described above, it takes the following parameters:
topK-
Optional
Default: 10
How many k-nearest results to return.
Here’s an example of a simple knn search:
?q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
The search results retrieved are the k=10 nearest documents to the vector in input [1.0, 2.0, 3.0, 4.0], ranked by the similarityFunction configured at indexing time.
knn_text_to_vector Query Parser
The knn_text_to_vector query parser encode a textual query to a vector using a dedicated Large Language Model(fine tuned for the task of encoding text to vector for sentence similarity) and matches k-nearest neighbours documents to such query vector.
In addition to the parameters in common with the other dense-retrieval query parsers, it takes the following:
model-
Required
Default: none
The model to use to encode the text to a vector. Must reference an existing model loaded into the
/schema/text-to-vector-model-store. topK-
Optional
Default: 10
How many k-nearest results to return.
Here’s an example of a simple knn_text_to_vector search:
?q={!knn_text_to_vector model=a-model f=vector topK=10}hello world query
The search results retrieved are the k=10 nearest documents to the vector encoded from the query hello world query, using the model a-model.
For more details on how to work with vectorise text in Apache Solr, please refer to the dedicated page: Text to Vector
vectorSimilarity Query Parser
The vectorSimilarity vector similarity query parser matches documents whose similarity with the target vector is a above a minimum threshold.
In addition to the common parameters described above, it takes the following parameters:
minReturn-
Required
Default: none
Minimum similarity threshold of nodes in the graph to be returned as matches
minTraverse-
Optional
Default: -Infinity
Minimum similarity of nodes in the graph to continue traversal of their neighbors
Here’s an example of a simple vectorSimilarity search:
?q={!vectorSimilarity f=vector minReturn=0.7}[1.0, 2.0, 3.0, 4.0]
The search results retrieved are all documents whose similarity with the input vector [1.0, 2.0, 3.0, 4.0] is at least 0.7 based on the similarityFunction configured at indexing time
knn Query Parser
You should use the knn query parser when:
-
you search for the top-K closest vectors to a query vector
-
you work directly with vectors (no text encoding is involved)
-
you want to a have a fine-grained control over the way you encode text to vector and prefer to do it outside of Apache Solr
knn_text_to_vector Query Parser
You should use the knn_text_to_vector query parser when:
-
you search for the top-K closest vectors to a query text
-
you work directly with text and want Solr to handle the encoding to vector behind the scenes
-
you are building demos/prototypes
|
Apache Solr uses LangChain4j to interact with Large Language Models. The integration is experimental and we are going to improve our stress-test and benchmarking coverage of this query parser in future iterations: if you care about raw performance you may prefer to encode the text outside of Solr |
vectorSimilarity Query Parser
You should use the vectorSimilarity query parser when:
-
you search for the closest vectors to a query vector within a similarity threshold
-
you work directly with vectors (no text encoding is involved)
-
you want to a have a fine-grained control over the way you encode text to vector and prefer to do it outside of Apache Solr
Graph Pre-Filtering
Pre-Filtering the set of candidate documents considered when walking the graph can be specified either explicitly, or implicitly (based on existing fq params) depending on how and when these dense vector query parsers are used.
Explicit Pre-Filtering
The preFilter parameter can be specified explicitly to reduce the number of candidate documents evaluated for the distance calculation:
?q={!vectorSimilarity f=vector minReturn=0.7 preFilter=inStock:true}[1.0, 2.0, 3.0, 4.0]
In the above example, only documents matching the Pre-Filter inStock:true will be candidates for consideration when evaluating the vectorSimilarity search against the specified vector.
The preFilter parameter may be blank (ex: preFilter="") to indicate that no Pre-Filtering should be performed; or it may be multi-valued — either through repetition, or via duplicated Parameter References.
These two examples are equivalent:
?q={!knn f=vector topK=10 preFilter=category:AAA preFilter=inStock:true}[1.0, 2.0, 3.0, 4.0]
?q={!knn f=vector topK=10 preFilter=$knnPreFilter}[1.0, 2.0, 3.0, 4.0]
&knnPreFilter=category:AAA
&knnPreFilter=inStock:true
Implicit Pre-Filtering
While the preFilter parameter may be explicitly specified on any usage of the knn or vectorSimilarity query parsers, the default Pre-Filtering behavior (when no preFilter parameter is specified) will vary based on how the query parser is used:
-
When used as the main
qparam:fqfilters in the request (that are not Solr Post Filters) will be combined to form an implicit Graph Pre-Filter.-
This default behavior optimizes the number of vector distance calculations considered, eliminating documents that would eventually be excluded by an
fqfilter anyway. -
includeTagsandexcludeTagsmay be used to limit the set offqfilters used in the Pre-Filter.
-
-
When a vector search query parser is used as an
fqparam, or as a subquery clause in a larger query: No implicit Pre-Filter is used.-
includeTagsandexcludeTagsmust not be used in these situations.
-
The example request below shows two usages of vector query parsers that will get no implicit Pre-Filtering from any of the fq parameters, because neither usage is as the main q param:
?q=(color_str:red OR {!vectorSimilarity f=color_vector minReturn=0.7 v="[1.0, 2.0, 3.0, 4.0]"})
&fq={!knn f=title_vector topK=10}[9.0, 8.0, 7.0, 6.0]
&fq=inStock:true
However, the next example shows a basic request where all fq parameters will be used as implicit Pre-Filters on the main knn query:
?q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
&fq=category:AAA
&fq=inStock:true
If we modify the above request to add tags to the fq parameters, we can specify an includeTags option on the knn parser to limit which fq filters are used for Pre-Filtering:
?q={!knn f=vector topK=10 includeTags=for_knn}[1.0, 2.0, 3.0, 4.0]
&fq=category:AAA
&fq={!tag=for_knn}inStock:true
In this example, only the inStock:true filter will be used for Pre-Filtering to find the the topK=10 documents, and the category:AAA filter will be applied independently; possibly resulting in less then 10 total matches.
Some use cases where includeTags and/or excludeTags may be more useful then an explicit preFilter parameters:
-
You have some
fqparameters that are re-used on many requests (even when you don’t use search dense vector fields) that you wish to be used as Pre-Filters when you do search dense vector fields. -
You typically want all
fqparams to be used as graph Pre-Filters on yourknnqueries, but when users "drill down" on Facets, you want thefqparameters you add to be excluded from the Pre-Filtering so that the result set gets smaller; instead of just computing a newtopKset.
Usage in Re-Ranking Query
Both dense vector search query parsers can be used to rerank first pass query results:
&q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}&rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
|
When using The second pass score(deriving from knn) is calculated only if the document This means the second pass The final ranked list of results will have the first pass score(main query Details about using the ReRank Query Parser can be found in the Query Re-Ranking section. |