Overview of SolrCloud Autoscaling

Autoscaling in Solr aims to provide good defaults so a SolrCloud cluster remains balanced and stable in the face of various cluster change events. This balance is achieved by satisfying a set of rules and sorting preferences to select the target of cluster management operations automatically on cluster events.

A simple example is automatically adding a replica for a SolrCloud collection when a node containing an existing replica goes down.

The goal of autoscaling in SolrCloud is to make cluster management easier, more automatic, and more intelligent. It aims to provide good defaults such that the cluster remains balanced and stable in the face of various events such as a node joining the cluster or leaving the cluster. This is achieved by satisfying a set of rules and sorting preferences that help Solr select the target of cluster management operations.

There are three distinct problems that this feature solves:

  • When to run cluster management tasks? For example, we might want to add a replica when an existing replica is no longer alive.
  • Which cluster management task to run? For example, do we add a new replica or should we move an existing one to a new node?
  • How do we run the cluster management tasks so the cluster remains balanced and stable?

Before we get into the details of how each of these problems are solved, let’s take a quick look at the easiest way to setup autoscaling for your cluster.

Quick Start: Automatically Adding Replicas

Say that we want to create a collection which always requires us to have three replicas available for each shard all the time. We can set the replicationFactor=3 while creating the collection, but what happens if a node containing one or more of the replicas either crashed or was shutdown for maintenance? In such a case, we’d like to create additional replicas to replace the ones that are no longer available to preserve the original number of replicas.

We have an easy way to enable this behavior without needing to understand the autoscaling features in depth. We can create a collection with such behavior by adding an additional parameter autoAddReplicas=true with the CREATE command of the Collection API. For example:

/admin/collections?action=CREATE&name=_name_of_collection_&numShards=1&replicationFactor=3&autoAddReplicas=true

A collection created with autoAddReplicas=true will be monitored by Solr such that if a node containing a replica of this collection goes down, Solr will add new replicas on other nodes after waiting for up to thirty seconds for the node to come back.

You can see the section Autoscaling Automatically Adding Replicas to learn more about how to enable or disable this feature as well as other details.

The selection of the node that will host the new replica is made according to the default cluster preferences that we will learn more about in the next sections.

Cluster Preferences

Cluster preferences allow you to tell Solr how to assess system load on each node. This information is used to guide selection of the node(s) on which cluster management operations will be performed.

In general, when an operation increases replica counts, the least loaded qualified node will be chosen, and when the operation reduces replica counts, the most loaded qualified node will be chosen.

The default cluster preferences are [{minimize:cores},{maximize:freedisk}], which tells Solr to minimize the number of cores on all nodes and if number of cores are equal, maximize the free disk space available. In this case, the least loaded node is the one with the fewest cores or if two nodes have an equal number of cores, the node with the most free disk space.

You can learn more about preferences in the section on Cluster Preferences Specification.

Cluster Policy

A cluster policy is a set of rules that a node, shard, or collection must satisfy before it can be chosen as the target of a cluster management operation. These rules are applied across the cluster regardless of the collection being managed. For example, the rule {"cores":"<10", "node":"#ANY"} means that any node must have less than 10 Solr cores in total, regardless of which collection they belong to.

There are many metrics on which the rule can be based, e.g., system load average, heap usage, free disk space, etc. The full list of supported metrics can be found in the section describing Autoscaling Policy Rule Attributes.

When a node, shard, or collection does not satisfy a policy rule, we call it a violation. By default, cluster management operations will fail if there is even one violation. You can allow operations to succeed in the face of a violation by marking the corresponding rule with "strict":false. When you do this, Solr ensures that cluster management operations minimize the number of violations.

Solr also supports collection-specific policies, which operate in tandem with the cluster policy.

Triggers

Now that we have an idea about how cluster management operations use policies and preferences help Solr keep the cluster balanced and stable, we can talk about when to invoke such operations.

Triggers are used to watch for events such as a node joining or leaving the cluster. When the event happens, the trigger executes a set of actions that compute and execute a plan, i.e., a set of operations to change the cluster so that the policy and preferences are respected.

The autoAddReplicas parameter passed with the CREATE Collection API command in the Quick Start section above automatically creates a trigger that watches for a node going away. When the trigger fires, it executes a set of actions that compute and execute a plan to move all replicas hosted by the lost node to new nodes in the cluster. The target nodes are chosen based on the policy and preferences.

You can learn more about Triggers in the section Autoscaling Triggers.

Trigger Actions

A trigger executes actions that tell Solr what to do in response to the trigger. Solr ships with two actions that are added to every trigger by default. The first is called the ComputePlanAction and the other is ExecutePlanAction. The former computes the cluster management operations necessary to stabilize the cluster and the latter executes them on the cluster.

You can learn more about Trigger Actions in the section Autoscaling Trigger Actions.

Listeners

An Autoscaling listener can be attached to a trigger. Solr calls the listener each time the trigger fires as well as before and after the actions performed by the trigger. Listeners are useful as a call back mechanism to perform tasks such as logging or informing external systems about events. For example, a listener is automatically added by Solr to each trigger to log details of the trigger fire and actions to the .system collection.

You can learn more about Listeners in the section Autoscaling Listeners.

Autoscaling APIs

The autoscaling APIs available at /admin/autoscaling can be used to read and modify each of the components discussed above.

You can learn more about these APIs in the section Autoscaling API.