Aliases
SolrCloud has the ability to query one or more collections via an alternative name. These alternative names for collections are known as aliases, and are useful when you want to:
- Atomically switch to using a newly (re)indexed collection with zero down time (by re-defining the alias)
- Insulate client code from changes in collection names
- Issue a single query against several collections with identical schemas
There are two types of aliases: standard aliases and routed aliases. Within routed aliases, there are two types: category-routed aliases and time-routed aliases. These types are discussed in this section.
It’s possible to send collection update commands to aliases, but only to those that either resolve to a single collection or those that define the routing between multiple collections (Routed Aliases). In other cases update commands are rejected with an error since there is no logic by which to distribute documents among the multiple collections.
Standard Aliases
Standard aliases are created and updated using the CREATEALIAS command.
The current list of collections that are members of an alias can be verified via the CLUSTERSTATUS command.
The full definition of all aliases including metadata about that alias (in the case of routed aliases, see below) can be verified via the LISTALIASES command.
Alternatively, this information is available by checking /aliases.json in ZooKeeper, either with the native ZooKeeper client or in the tree page of the cloud menu in the admin UI.
Aliases may be deleted via the DELETEALIAS command. When deleting an alias, underlying collections are unaffected.
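The alias commands above are Collections API actions issued over HTTP. As a minimal sketch, the helper below only builds the request URLs and sends nothing. It assumes the usual `/admin/collections` endpoint with `action`, `name`, and `collections` parameters; the base URL, alias name, and collection name are placeholders.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust to your own Solr node.
SOLR_BASE = "http://localhost:8983/solr"


def collections_api_url(action, **params):
    """Build a Collections API URL for the given action (no request is sent)."""
    query = urlencode({"action": action, **params})
    return f"{SOLR_BASE}/admin/collections?{query}"


# Point the alias "products" at a freshly reindexed collection.
create = collections_api_url("CREATEALIAS", name="products", collections="products_v2")
# List every alias definition, including routed alias metadata.
listing = collections_api_url("LISTALIASES")
# Remove the alias; the underlying collections are left untouched.
delete = collections_api_url("DELETEALIAS", name="products")

print(create)
print(listing)
print(delete)
```

Re-issuing CREATEALIAS with the same `name` and a different `collections` value is what re-defines the alias for a zero-downtime switch.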
Any alias (standard or routed) that references multiple collections may complicate relevancy. By default, SolrCloud scores documents on a per-shard basis. With multiple collections in an alias this is always a problem, so if you have a use case for which BM25 or TF/IDF relevancy is important you will want to turn on one of the ExactStatsCache implementations. However, for analytical use cases where results are sorted on numeric, date, or alphanumeric field values, rather than relevancy calculations, this is not a problem.
Routed Aliases
To address the update limitations associated with standard aliases and provide additional useful features, the concept of routed aliases has been developed. There are presently two types of routed alias: time routed and category routed. These are described in detail below, but share some common behavior.
When processing an update for a routed alias, Solr initializes its UpdateRequestProcessor chain as usual, but when DistributedUpdateProcessor (DUP) initializes, it detects that the update targets a routed alias and injects RoutedAliasUpdateProcessor (RAUP) in front of itself.
RAUP, in coordination with the Overseer, is the main part of a routed alias, and must immediately precede DUP. It is not possible to configure custom chains with other types of UpdateRequestProcessors between RAUP and DUP.
Ideally, as a user of a routed alias, you needn’t concern yourself with the particulars of the collection naming pattern since both queries and updates may be done via the alias.
When adding data, you should usually direct documents to the alias (e.g., reference the alias name instead of any collection).
The Solr server and CloudSolrClient will direct an update request to the first collection that an alias points to.
Once the server receives the data it will perform the necessary routing.
It’s extremely important with all routed aliases that the route values NOT change. Reindexing a document with a different route value for the same ID produces two distinct documents with the same ID accessible via the alias. All query-time behavior of the routed alias is undefined and not easily predictable once duplicate IDs exist.
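The duplicate-ID hazard can be illustrated with a toy model. This is not Solr code: it only assumes that each collection enforces ID uniqueness within itself, and the field names (`id`, `category`, `title`) are made up for the illustration.

```python
from collections import defaultdict

# Toy model: each collection enforces unique IDs only within itself.
collections = defaultdict(dict)


def index(doc, route_field="category"):
    """Store the document in the collection chosen by its route value."""
    collections[doc[route_field]][doc["id"]] = doc


def query_alias(doc_id):
    """Return every document with this ID across all collections in the alias."""
    return [docs[doc_id] for docs in collections.values() if doc_id in docs]


index({"id": "42", "category": "books", "title": "first version"})
# Reindexing with the SAME route value overwrites the document in place.
index({"id": "42", "category": "books", "title": "second version"})
one_copy = len(query_alias("42"))

# Reindexing with a DIFFERENT route value lands in another collection,
# so the old copy is never overwritten.
index({"id": "42", "category": "music", "title": "third version"})
two_copies = len(query_alias("42"))

print(one_copy, two_copies)  # 1 2
```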
It is a bad idea to use "data driven" mode (aka schemaless-mode) with routed aliases, as duplicate schema mutations might happen concurrently leading to errors.
Time Routed Aliases
Time Routed Aliases (TRAs) are a SolrCloud feature that manages an alias and a time sequential series of collections.
It automatically creates new collections and (optionally) deletes old ones as it routes documents to the correct collection based on its timestamp. This approach allows for indefinite indexing of data without degradation of performance otherwise experienced due to the continuous growth of a single index.
If you need to store a lot of timestamped data in Solr, such as logs or IoT sensor data, then this feature probably makes more sense than creating one sharded hash-routed collection.
How It Works
First you create a time routed alias using the CREATEALIAS command with the desired router settings. Most of the settings are editable at a later time using the ALIASPROP command.
The first collection will be created automatically, along with an alias pointing to it. Each underlying Solr "core" in a collection that is a member of a TRA has a special core property referencing the alias. The name of each collection is comprised of the TRA name and the start timestamp (UTC), with trailing zeros and symbols truncated.
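As a rough sketch of the naming rule, the function below formats a UTC start time and drops trailing zero-valued time components. It assumes the `__TRA__` separator and the `yyyy-MM-dd_HH_mm` layout that appear in the dimensional examples later in this section; the function name is hypothetical.

```python
from datetime import datetime, timezone


def tra_collection_name(alias, start):
    """Sketch of a TRA collection name: alias, '__TRA__', truncated UTC start."""
    stamp = start.astimezone(timezone.utc).strftime("%Y-%m-%d_%H_%M_%S")
    # Drop trailing zero-valued time components (seconds, minutes, hours).
    for _ in range(3):
        if stamp.endswith("_00"):
            stamp = stamp[:-3]
    return f"{alias}__TRA__{stamp}"


half_past = datetime(2019, 7, 1, 0, 30, tzinfo=timezone.utc)
midnight = datetime(2019, 7, 1, tzinfo=timezone.utc)
print(tra_collection_name("myalias", half_past))
print(tra_collection_name("myalias", midnight))
```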
The collections list for a TRA is always reverse sorted, and thus the connection path of the request will route to the lead collection. Using CloudSolrClient is preferable as it can reduce the number of underlying physical HTTP requests by one.
If you know that a particular set of documents to be delivered is going to a particular older collection then you could direct it there from the client side as an optimization but it’s not necessary. CloudSolrClient does not (yet) do this.
RAUP first reads TRA configuration from the alias properties when it is initialized. As it sees each document, it checks for changes to TRA properties, updates its cached configuration if needed, and then determines which collection the document belongs to:
- If RAUP needs to send it to a time segment represented by a collection other than the one that the client chose to communicate with, then it will do so using mechanisms shared with DUP. Once the document is forwarded to the correct collection (i.e., the correct TRA time segment), it skips directly to DUP on the target collection and continues normally, potentially being routed again to the correct shard & replica within the target collection.
- If it belongs in the current collection (which is usually the case if processing events as they occur), the document passes through to DUP. DUP does its normal collection-level processing that may involve routing the document to another shard & replica.
- If the timestamp on the document is more recent than the most recent TRA segment, then a new collection needs to be added at the front of the TRA. RAUP will create this collection, add it to the alias, and then forward the document to the collection it just created. This can happen recursively if more than one collection needs to be created.
- Each time a new collection is added, the oldest collections in the TRA are examined for possible deletion, if that has been configured. All this happens synchronously, potentially adding seconds to the update request and indexing latency. If router.preemptiveCreateMath is configured and the document arrives within this window, then collection creation will occur asynchronously. See Time Routed Alias Parameters for more information.
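The routing decision described above can be sketched as follows. This is a simplified model, not RAUP itself: it assumes a fixed interval length, ignores deletion and preemptive creation, and (as an assumption of the sketch only) rejects documents older than the first segment.

```python
from datetime import datetime, timedelta, timezone


def route(starts, interval, doc_time):
    """Return (target_start, created) for a document timestamp.

    starts: ascending list of collection start times (mutated when new
    collections are needed); interval: fixed segment length (a simplification).
    """
    if doc_time < starts[0]:
        # Assumption of this sketch: such documents are simply rejected.
        raise ValueError("document predates the oldest collection")
    created = []
    # Add segments at the front of the TRA until one covers the document.
    while doc_time >= starts[-1] + interval:
        starts.append(starts[-1] + interval)
        created.append(starts[-1])
    # The owning segment is the latest one whose start is <= doc_time.
    target = max(s for s in starts if s <= doc_time)
    return target, created


day = timedelta(days=1)
t0 = datetime(2019, 7, 1, tzinfo=timezone.utc)
starts = [t0]

# Falls in the current (only) segment: nothing is created.
same_target, same_created = route(starts, day, t0 + timedelta(hours=5))
# Three days ahead: every intermediate segment is created, leaving no gaps.
new_target, new_created = route(starts, day, t0 + timedelta(days=3, hours=1))
# A late-arriving document is routed back to an older segment.
old_target, old_created = route(starts, day, t0 + timedelta(days=1, hours=2))
```

The second call shows why a document far in the future is expensive: each missing segment has to be created before the document can be indexed.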
Any other type of update like a commit or delete is routed by RAUP to all collections. Generally speaking, this is not a performance concern. When Solr receives a delete or commit wherein nothing is deleted or nothing needs to be committed, then it’s pretty cheap.
Limitations & Assumptions
Only time routed aliases are supported. If you instead have some other sequential number, you could fake it as a time (e.g., convert to a timestamp assuming some epoch and increment).
The smallest possible interval is one second. No other routing scheme is supported, although this feature was developed with considerations that it could be extended/improved to other schemes.
The underlying collections form a contiguous sequence without gaps. This will not be suitable when there are large gaps in the underlying data, as Solr will insist that there be a collection for each increment. This is due in part to Solr calculating the end time of each interval collection based on the timestamp of the next collection, since it is otherwise not stored in any way.
Avoid sending updates to the oldest collection if you have also configured that old collections should be automatically deleted. It could lead to exceptions bubbling back to the indexing client.
Category Routed Aliases
Category Routed Aliases (CRAs) are a feature to manage aliases and a set of dependent collections based on the value of a single field.
CRAs automatically create new collections but because the partitioning is on categorical information rather than continuous numerically based values there’s no logic for automatic deletion. This approach allows for simplified indexing of data that must be segregated into collections for cluster management or security reasons.
How It Works
First you create a category routed alias using the CREATEALIAS command with the desired router settings. Most of the settings are editable at a later time using the ALIASPROP command.
The alias will be created with a special placeholder collection which will always be named myAlias__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA__TEMP. The first document indexed into the CRA will create a second collection named myAlias__CRA__foo (for a routed field value of foo). The second document indexed will cause the temporary placeholder collection to be deleted. Thereafter collections will be created whenever a new value for the field is encountered.
To guard against runaway collection creation, options are provided for limiting the total number of categories and for rejecting values that don’t match a regular expression (see Category Routed Alias Parameters for details).
Note that by providing very large or very permissive values for these options you are accepting the risk that garbled data could potentially create thousands of collections and bring your cluster to a grinding halt.
Field values (and thus the collection names) are case sensitive.
As elsewhere in Solr, manipulation and cleaning of the data is expected to be done by external processes before data is sent to Solr, with one exception.
Throughout Solr there are limitations on the allowable characters in collection names. Any characters other than ASCII alphanumeric characters (A-Za-z0-9), hyphen (-) or underscore (_) are replaced with an underscore when calculating the collection name for a category. For a CRA named myAlias the following table shows how collection names would be calculated:
Value | CRA Collection Name |
---|---|
foo | myAlias__CRA__foo |
Foo | myAlias__CRA__Foo |
foo bar | myAlias__CRA__foo_bar |
FOÓB&R | myAlias__CRA__FO_B_R |
中文的东西 | myAlias__CRA_______ |
foo__CRA__bar | Causes 400 Bad Request |
<null> | Causes 400 Bad Request |
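The table can be reproduced with a short sketch of the naming rule. It is an illustration rather than Solr's implementation: `ValueError` stands in for the 400 Bad Request response, and the function name is hypothetical.

```python
import re

CRA_INFIX = "__CRA__"


def cra_collection_name(alias, value):
    """Sketch of the CRA naming rule; ValueError stands in for a 400 response."""
    if value is None:
        raise ValueError("400 Bad Request: route value is missing")
    # Every character outside A-Za-z0-9, '-' and '_' becomes an underscore.
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", value)
    # The infix check runs on the calculated name, after the replacement.
    if CRA_INFIX in safe:
        raise ValueError("400 Bad Request: value produces the CRA infix")
    return f"{alias}{CRA_INFIX}{safe}"


for value in ["foo", "Foo", "foo bar", "FO\u00d3B&R", "中文的东西"]:
    print(repr(value), "->", cra_collection_name("myAlias", value))
```

Note that distinct values can collapse to the same collection name under this rule (for example, any two five-character values made entirely of non-ASCII characters).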
Since collection creation can take upwards of 1-3 seconds, systems inserting data in a CRA should be constructed to handle such pauses whenever a new collection is created. Unlike time routed aliases, there is no way to predict the next value so such pauses are unavoidable.
There is no automated means of removing a category. If a category needs to be removed from a CRA the following procedure is recommended:
- Ensure that no documents with the value corresponding to the category to be removed will be sent, either by stopping indexing or by fixing the incoming data stream.
- Modify the alias definition in ZooKeeper, removing the collection corresponding to the category.
- Delete the collection corresponding to the category. Note that if the collection is not removed from the alias first, this step will fail.
Limitations & Assumptions
CRAs are presently unsuitable for non-English data values due to the limits on collection names. This can be worked around by duplicating the route value to a URL-safe Base64-encoded field and routing on that value instead.
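A minimal sketch of that workaround follows, assuming the encoding is done by the indexing client before documents reach Solr. URL-safe Base64 uses only A-Za-z0-9, hyphen, and underscore apart from its `=` padding, so the padding is stripped here; the function names are hypothetical.

```python
import base64


def route_value_for(text):
    """Encode a value as unpadded URL-safe Base64 (A-Za-z0-9, '-' and '_')."""
    encoded = base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")
    # '=' padding is not a legal collection-name character, so strip it.
    return encoded.rstrip("=")


def original_value(route_value):
    """Reverse the encoding by restoring the padding first."""
    padded = route_value + "=" * (-len(route_value) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


encoded = route_value_for("中文的东西")
print(encoded)
print(original_value(encoded))
```

The encoded value would be stored in a separate field used only for routing, leaving the original field untouched for search and display.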
The check for the CRA infix is independent of the regular expression validation and occurs after the name of the collection to be created has been calculated. It cannot be avoided and is necessary to support future features.
Dimensional Routed Aliases
For cases where the desired segregation of data relates to two fields and combination into a single field during indexing is impractical, or the TRA behavior is desired across multiple categories, Dimensional Routed Aliases (DRAs) may be used. This feature has been designed to handle an arbitrary number and combination of category and time dimensions in any order, but users are cautioned to carefully consider the total number of collections that will result from such configurations. Collection counts in the high hundreds or low thousands begin to pose significant challenges with ZooKeeper.
DRAs are a new feature and presently only 2 dimensions are supported. More dimensions will be supported in the future (see https://issues.apache.org/jira/browse/SOLR-13628 for progress).
How It Works
First you create a dimensional routed alias with the desired router settings for each dimension. See the CREATEALIAS command documentation for details on how to specify the per-dimension configuration. Typical collection names will be of the form (example is for category x time example, with 30 minute intervals):
myalias__CRA__someCategory__TRA__2019-07-01_00_30
Note that the initial collection will be a throwaway placeholder for any DRA containing a category-based dimension. Name generation for each sub-part of a collection name is identical to the corresponding portion of the component dimension type (e.g., a category value generating __CRA__ or __TRA__ would still produce an error).
The prior warning about reindexing documents with a different route value applies to every dimension of a DRA. DRAs are inappropriate for documents where categories or timestamps used in routing will change (this of course applies to other route values in future RA types too).
As with all Routed Aliases, DRAs impose some costs if your data is not well behaved. In addition to the normal caveats of each component dimension, care is needed when sending new categories after the DRA has been running for a while. Ordered dimensions (time) behave slightly differently from unordered (category) dimensions. Ordered dimensions rely on the iteration order of the collections in the alias and therefore cannot tolerate the generation of collection names out of order. The upshot is that when an ordered dimension such as time is a component of a DRA, and the DRA receives a document with a novel category and a time value corresponding to a time slice other than the starting time slice for the time dimension, several collections will need to be created before the document can be indexed. This "new category effect" is identical to the behavior you would get with a TRA if you picked a start date too far in the past.
For example given a Dimensional[time,category] DRA with start time of 2019-07-01T00:00:00Z the pattern of collections created for 4 documents might look like this:
No documents
Aliased collections:
// temp avoids empty alias error conditions
myalias__TRA__2019-07-01__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP
Doc 1
time: 2019-07-01T00:00:00Z
category: someCategory
Aliased collections:
// temp retained to avoid empty alias during race with collection creation
myalias__TRA__2019-07-01__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP
myalias__TRA__2019-07-01__CRA__someCategory
Doc 2
time: 2019-07-02T00:04:00Z
category: otherCategory
Aliased collections:
// temp can now be deleted without risk of having an empty alias.
myalias__TRA__2019-07-01__CRA__someCategory
myalias__TRA__2019-07-01__CRA__otherCategory // 2 collections created in one update
myalias__TRA__2019-07-02__CRA__otherCategory
Doc 3
time: 2019-07-03T00:12:00Z
category: thirdCategory
Aliased collections:
myalias__TRA__2019-07-01__CRA__someCategory
myalias__TRA__2019-07-01__CRA__otherCategory
myalias__TRA__2019-07-02__CRA__otherCategory
myalias__TRA__2019-07-01__CRA__thirdCategory // 3 collections created in one update!
myalias__TRA__2019-07-02__CRA__thirdCategory
myalias__TRA__2019-07-03__CRA__thirdCategory
Doc 4
time: 2019-07-03T00:12:00Z
category: someCategory
Aliased collections:
myalias__TRA__2019-07-01__CRA__someCategory
myalias__TRA__2019-07-01__CRA__otherCategory
myalias__TRA__2019-07-02__CRA__otherCategory
myalias__TRA__2019-07-01__CRA__thirdCategory
myalias__TRA__2019-07-02__CRA__thirdCategory
myalias__TRA__2019-07-03__CRA__thirdCategory
myalias__TRA__2019-07-02__CRA__someCategory // 2 collections created in one update
myalias__TRA__2019-07-03__CRA__someCategory
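The four-document walkthrough above can be replayed with a small simulation. It is only a model of the "new category effect": it assumes one-day time slices (as the collection names suggest), ignores the temporary placeholder collection, and uses a hypothetical `index_doc` helper.

```python
from datetime import date, timedelta

START = date(2019, 7, 1)  # start of the time dimension, as in the example above


def index_doc(existing, day, category, alias="myalias"):
    """Add any missing collections for this category and return the new names."""
    created = []
    current = START
    # The ordered (time) dimension must be filled in from the start slice onward.
    while current <= day:
        name = f"{alias}__TRA__{current.isoformat()}__CRA__{category}"
        if name not in existing:
            existing.add(name)
            created.append(name)
        current += timedelta(days=1)
    return created


existing = set()
counts = [
    len(index_doc(existing, date(2019, 7, 1), "someCategory")),   # Doc 1
    len(index_doc(existing, date(2019, 7, 2), "otherCategory")),  # Doc 2
    len(index_doc(existing, date(2019, 7, 3), "thirdCategory")),  # Doc 3
    len(index_doc(existing, date(2019, 7, 3), "someCategory")),   # Doc 4
]
print(counts)  # [1, 2, 3, 2]
```

The later a new category first appears relative to the start time, the more collections a single update has to create.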
Therefore the sweet spot for DRAs is a data set with a well-standardized set of dimensions that are not changing and where the full set of permutations occurs regularly. If a new category is introduced at a later date and indexing latency is an important SLA feature, there are a couple of strategies to mitigate this effect:
- If the number of extra time slices to be created is not very large, send a single document out of band from regular indexing, and wait for collection creation to complete before allowing the new category to be sent via the SLA-constrained process.
- If the above procedure is likely to create an extreme number of collections, and the earliest possible document in the new category is known, the start time for the time dimension may be adjusted using the ALIASPROP command.
Improvement Possibilities
Routed aliases are a relatively new feature of SolrCloud that can be expected to be improved. Some potential areas for improvement that are not implemented yet are:
- TRAs: Searches with time filters should only go to applicable collections.
- TRAs: Ways to automatically optimize (or reduce the resources of) older collections that aren’t expected to receive more updates, and might have less search demand.
- CRAs: Intrinsic support for non-English text via Base64 encoding.
- CRAs: Supply an initial list of values for cases where these are known beforehand to reduce pauses during indexing.
- DRAs: Support for more than 2 dimensions.
- CloudSolrClient could route documents to the correct collection based on the route value instead of always picking the latest/first.
- Presently only updates are routed and queries are distributed to all collections in the alias, but future features might enable routing of the query to the single appropriate collection based on a special parameter or perhaps a filter on the routed field.
- Collections might be constrained by their size instead of or in addition to time or category value. This might be implemented as another type of routed alias, or possibly as an option on the existing routed aliases.
- Compatibility with CDCR.
- Option for deletion of aliases that also deletes the underlying collections in one step. Routed Aliases may quickly create more collections than expected during initial testing. Removing them after such events is overly tedious.
As always, patches and pull requests are welcome!
Collection Commands and Aliases
Starting with version 8.1 SolrCloud supports using alias names in collection commands where normally a collection name is expected. This works only when the following criteria are satisfied:
- a request parameter followAliases=true is used
- an alias must not refer to more than one collection
- an alias must not refer to a Routed Alias
If all criteria are satisfied then the command will resolve all alias names and operate on the collections the aliases refer to as if it was invoked with the collection names instead. Otherwise the command will not be executed and an exception will be thrown.
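The three criteria can be expressed as a small resolution check. This is a sketch of the rules as stated here, not Solr's implementation: `aliases` and `routed` are stand-ins for cluster state, the third criterion is read as "the alias itself is not a Routed Alias", and multi-level or shadow aliases are not modeled.

```python
def resolve_target(name, aliases, routed, follow_aliases=False):
    """Return the collection a command should act on, or raise if not allowed.

    aliases maps alias name -> list of collections; routed is the set of
    alias names that are Routed Aliases (both hypothetical inputs).
    """
    if name not in aliases:
        return name  # already a plain collection name
    if not follow_aliases:
        raise ValueError("followAliases=true is required to use an alias here")
    if name in routed:
        raise ValueError("alias is a Routed Alias")
    targets = aliases[name]
    if len(targets) != 1:
        raise ValueError("alias refers to more than one collection")
    return targets[0]


aliases = {
    "products": ["products_v2"],
    "everything": ["products_v2", "reviews"],
    "logs": ["logs__TRA__2019-07-01"],
}
routed = {"logs"}

print(resolve_target("products", aliases, routed, follow_aliases=True))
```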
The followAliases=true parameter should be used with care so that the resolved targets are indeed the intended ones.
In case of multi-level aliases or shadow aliases (an alias with the same name as an existing collection but pointing to other collections) the use of this option is strongly discouraged because effects may be difficult to predict correctly.