Solr Glossary

These are common terms used with Solr.

Solr Terms

Where possible, terms are linked to relevant parts of the Solr Reference Guide for more information.

A

Atomic updates: An approach to updating only specific fields of a document, instead of reindexing the entire document.
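
As a minimal sketch of what an atomic update request body can look like: each field to change maps to a modifier such as "set" or "add" and a new value, and fields that are not mentioned are left alone. The helper name atomic_update, the document ID "book-1", and the fields "price" and "tags" are hypothetical and chosen only for illustration.

```python
import json

def atomic_update(doc_id, **modifiers):
    """Build an atomic-update document: each field maps to {modifier: value}."""
    doc = {"id": doc_id}
    for field, (modifier, value) in modifiers.items():
        doc[field] = {modifier: value}
    return doc

# Change the price and add a tag without resending the other fields.
# "book-1", "price", and "tags" are made-up names for this example.
update = atomic_update("book-1", price=("set", 12.5), tags=("add", "sale"))
body = json.dumps([update])
print(body)
```

The resulting JSON would be sent to the collection's update handler; the transport is omitted here because it depends on the installation.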

B

Boolean operators: These control the inclusion or exclusion of keywords in a query by using operators such as AND, OR, and NOT.
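
One way to picture these operators is as set operations over the documents that match each keyword. The sketch below assumes a made-up mapping from keywords to document IDs; it illustrates the logic only and is not how Solr evaluates queries internally.

```python
# Hypothetical postings: each keyword maps to the set of matching document IDs.
postings = {
    "solr": {1, 2, 3},
    "lucene": {2, 3, 4},
    "zookeeper": {3, 5},
}

both = postings["solr"] & postings["lucene"]        # solr AND lucene
either = postings["solr"] | postings["zookeeper"]   # solr OR zookeeper
without = postings["solr"] - postings["zookeeper"]  # solr NOT zookeeper

print(sorted(both), sorted(either), sorted(without))
```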

C

Cluster: In Solr, a cluster is a set of Solr nodes operating in coordination with each other via ZooKeeper, and managed as a unit. A cluster may contain many collections. See also SolrCloud.
Collection: In Solr, one or more Documents grouped together in a single logical index using a single configuration and Schema.

In SolrCloud a collection may be divided up into multiple logical shards, which may in turn be distributed across many nodes.

Single-node installations and user-managed clusters instead use the concept of a Core. "Collection" is most frequently used in the SolrCloud context, but because it represents a "logical index", the term may also be used to refer to individual cores in a user-managed cluster.
Commit: To make document changes permanent in the index. Added documents become searchable after a commit.
Core: An individual Solr instance (represents a logical index). Multiple cores can run on a single node. See also SolrCloud.
Core reload: To re-initialize a Solr core after changes to the schema file, solrconfig.xml or other configuration files.

D

Distributed search: A search in which queries are processed across more than one Shard.
Document: A group of fields and their values. Documents are the basic unit of data in a collection. Documents are assigned to shards using standard hashing, or by specifically assigning a shard within the document ID. Documents are versioned after each write operation.
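
The shard assignment described in the Document entry can be sketched roughly as follows. This is a simplification under stated assumptions: crc32 with a modulo stands in for Solr's actual hash function and hash ranges, the function name pick_shard is hypothetical, and the "key!rest" ID form is treated as routing by the part before the exclamation mark.

```python
import zlib

def pick_shard(doc_id: str, num_shards: int) -> int:
    """Return a shard index for a document ID.

    Assumption: crc32 and modulo stand in for the real hash function and
    hash ranges. An ID of the form "key!rest" is routed by "key" alone, so
    documents that share the key land on the same shard.
    """
    route_key = doc_id.split("!", 1)[0] if "!" in doc_id else doc_id
    return zlib.crc32(route_key.encode("utf-8")) % num_shards

# Two documents with the same (made-up) key prefix are routed together.
print(pick_shard("acme!order-1", 4), pick_shard("acme!order-2", 4))
```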

E

Ensemble: A ZooKeeper term to indicate multiple ZooKeeper instances running simultaneously and in coordination with each other for fault tolerance.

F

Facet: The arrangement of search results into categories based on indexed terms.
Field: The content to be indexed/searched along with metadata defining how the content should be processed by Solr.
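
To illustrate the Facet entry, the sketch below groups a handful of made-up search results by the value of one field and counts each group. The documents, the "category" field, and the facet_counts helper are hypothetical; Solr computes facets from the index rather than from a result list like this.

```python
from collections import Counter

# Hypothetical search results; "category" is the field being faceted on.
results = [
    {"id": "1", "category": "books"},
    {"id": "2", "category": "music"},
    {"id": "3", "category": "books"},
]

def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)

counts = facet_counts(results, "category")
print(counts.most_common())
```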

I

Inverse document frequency (IDF): A measure of the general importance of a term. It is calculated as the total number of Documents in the collection divided by the number of Documents in which a particular word occurs. See http://en.wikipedia.org/wiki/Tf-idf and the Lucene TFIDFSimilarity javadocs for more info on TF-IDF based scoring and Lucene scoring in particular. See also Term frequency.
Inverted index: A way of creating a searchable index that lists every word and the documents that contain those words, similar to an index in the back of a book which lists words and the pages on which they can be found. When performing keyword searches, this method is considered more efficient than the alternative, which would be to create a list of documents paired with every word used in each document. Since users search using terms they expect to be in documents, finding the term before the document saves processing resources and time.
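
A toy version of the two entries above, assuming three made-up documents and naive whitespace tokenization: the index maps each word to the documents containing it, and IDF is computed as the plain ratio given in the definition. Real scoring implementations typically apply logarithmic scaling and smoothing to this ratio, which is left out here.

```python
from collections import defaultdict

# Hypothetical documents keyed by ID.
docs = {
    1: "solr is a search server",
    2: "lucene is a search library",
    3: "zookeeper coordinates solr nodes",
}

def build_inverted_index(documents):
    """Map each word to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)

def idf(index, word, total_docs):
    """Total documents divided by documents containing the word (no log scaling)."""
    return total_docs / len(index[word])

index = build_inverted_index(docs)
print(sorted(index["solr"]))               # IDs of documents containing "solr"
print(idf(index, "solr", len(docs)))       # common word, lower value
print(idf(index, "zookeeper", len(docs)))  # rare word, higher value
```

Looking up a word is a single dictionary access, which is the efficiency argument the Inverted index entry makes: the term is found first, and only then the documents.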

L

Leader: A single Replica for each Shard that takes charge of coordinating index updates (document additions or deletions) to other replicas in the same shard. This is a transient responsibility assigned to a node via an election; if the current Shard Leader goes down, a new node will automatically be elected to take its place. See also SolrCloud.

M

Metadata: Literally, data about data. Metadata is information about a document, such as its title, author, or location.

N

Natural language query: A search that is entered as a user would normally speak or write, as in, "What is aspirin?"
Node: A JVM instance running Solr. Also known as a Solr server.

O

Optimistic concurrency: Also known as "optimistic locking", this is an approach that allows for updates to documents currently in the index while retaining locking or version control.
Overseer: A single node in SolrCloud that is responsible for processing and coordinating actions involving the entire cluster. It keeps track of the state of existing nodes, collections, shards, and replicas, and assigns new replicas to nodes. This is a transient responsibility assigned to a node via an election; if the current Overseer goes down, a new node will be automatically elected to take its place. See also SolrCloud.
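
For the Optimistic concurrency entry in this section, a small in-memory model may help: every write bumps a version number, and a writer that supplies an outdated version is rejected instead of silently overwriting newer data. VersionedStore and VersionConflict are hypothetical names for this sketch and not part of any Solr API; Solr itself tracks the version in the document's _version_ field.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""

class VersionedStore:
    """In-memory stand-in for an index that versions documents on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, fields)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, fields, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if someone else updated the document in between.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(fields))
        return new_version

store = VersionedStore()
v1 = store.put("doc-1", {"title": "draft"})
v2 = store.put("doc-1", {"title": "final"}, expected_version=v1)
try:
    # This writer still holds v1, so its update is refused.
    store.put("doc-1", {"title": "stale edit"}, expected_version=v1)
    conflict = False
except VersionConflict:
    conflict = True
print(v1, v2, conflict)
```

No lock is held between reading and writing; the version check at write time is what detects the conflict.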

Q

Query parser: A component that interprets the terms entered by a user and converts them into a query that Solr can execute.

R

Recall: The ability of a search engine to retrieve all of the possible matches to a user’s query.
Relevance: The appropriateness of a document to the search conducted by the user.
Replica: A Core that acts as a physical copy of a Shard in a SolrCloud Collection.
Replication: A method of copying a leader index from one server to one or more "follower" or "child" servers.
RequestHandler: Logic and configuration parameters that tell Solr how to handle incoming "requests", whether the requests are to return search results, to index documents, or to handle other custom situations.

S

SearchComponent: Logic and configuration parameters used by request handlers to process query requests. Examples of search components include faceting, highlighting, and "more like this" functionality.
Shard: In SolrCloud, a logical partition of a single Collection. Every shard consists of at least one physical Replica, but there may be multiple Replicas distributed across multiple Nodes for fault tolerance. See also SolrCloud.
SolrCloud: Umbrella term for a suite of functionality in Solr which allows managing a Cluster of Solr Nodes for scalability, fault tolerance, and high availability.
Solr Schema (managed-schema.xml or schema.xml): The Solr index Schema defines the fields to be indexed and the type for each field (text, integers, etc.). By default, schema data can be "managed" at run time using the Schema API and is typically kept in a file named managed-schema.xml, which Solr modifies as needed. A collection may instead be configured to use a static Schema, which is only loaded on startup from a human-edited configuration file, typically named schema.xml. See Schema Factory Configuration for details.
SolrConfig (solrconfig.xml): The Apache Solr configuration file. Defines indexing options, RequestHandlers, highlighting, spellchecking and various other configurations. The file, solrconfig.xml, is located in the Solr home conf directory.
Spell Check: The ability to suggest alternative spellings of search terms to a user, as a check against spelling errors causing few or zero results.
Stopwords: Generally, words that have little meaning to a user’s search but which may have been entered as part of a natural language query. Stopwords are generally very short pronouns, conjunctions, and prepositions (such as "the", "with", or "and").
Suggester: Functionality in Solr that provides the ability to suggest possible query terms to users as they type.
Synonyms: Synonyms generally are terms which are near to each other in meaning and may substitute for one another. In a search engine implementation, synonyms may be abbreviations as well as words, or terms that are not consistently hyphenated. Examples of synonyms in this context would be "Inc." and "Incorporated" or "iPod" and "i-pod".
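
The Stopwords and Synonyms entries in this section can be combined in a rough sketch of text analysis: stopwords are dropped and each remaining token is followed by its synonyms. The word lists reuse the examples from the entries, while the analyze function and whitespace tokenization are assumptions made for illustration; in Solr this behavior is configured through the schema rather than written by hand.

```python
STOPWORDS = {"the", "with", "and"}

# Hypothetical synonym table keyed by lowercase token.
SYNONYMS = {
    "inc.": ["incorporated"],
    "i-pod": ["ipod"],
}

def analyze(text):
    """Lowercase, split on whitespace, drop stopwords, then add synonyms."""
    tokens = []
    for token in text.lower().split():
        if token in STOPWORDS:
            continue
        tokens.append(token)
        tokens.extend(SYNONYMS.get(token, []))
    return tokens

print(analyze("The i-Pod with Acme Inc."))
```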

T

Term frequency: The number of times a word occurs in a given document. See http://en.wikipedia.org/wiki/Tf-idf and the Lucene TFIDFSimilarity javadocs for more info on TF-IDF based scoring and Lucene scoring in particular. See also Inverse document frequency (IDF).
Transaction log: An append-only log of write operations maintained by each Replica. This log is required with SolrCloud implementations and is created and managed automatically by Solr.

W

Wildcard: A wildcard allows a substitution of one or more letters of a word to account for possible variations in spelling or tenses.
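
As an illustration, the sketch below matches a made-up term list against patterns in which ? stands for exactly one character and * for any run of characters. It relies on Python's fnmatchcase, which additionally gives square brackets a special meaning, so it approximates wildcard matching in general rather than reproducing Solr's query syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed terms.
terms = ["test", "text", "tests", "tester", "toast"]

def match_terms(pattern, candidates):
    """Return candidates matching a pattern: ? is one character, * is any run."""
    return [term for term in candidates if fnmatchcase(term, pattern)]

print(match_terms("te?t", terms))
print(match_terms("test*", terms))
```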

Z

ZooKeeper: Also known as Apache ZooKeeper. The system used by SolrCloud to keep track of configuration files and node names for a cluster. A ZooKeeper cluster is used as the central configuration store for the cluster, a coordinator for operations requiring distributed synchronization, and the system of record for cluster topology. See also SolrCloud.