Cross Datacenter Replication
Solr Cross DC is a simple cross-data-center fail-over solution for Apache Solr.
Overview
Apache Solr CrossDC is a robust fail-over solution for Apache Solr, facilitating seamless replication of Solr updates across multiple data centers. It provides high availability and disaster recovery for your Solr clusters via three key components
- CrossDC Solr Module
-
A suite of Solr plugins responsible for intercepting Solr requests in the source Solr instance, and then sending them to a distributed queue:
-
MirroringUpdateProcessor
plugin intercepts indexing updates, -
MirroringCollectionsHandler
plugin intercepts collection admin requests, -
MirroringConfigSetsHandler
plugin intercepts ConfigSet management requests.
-
- CrossDC Manager
-
A separate application that pulls mirrored requests from the distributed queue and forwards them to a Solr cluster in the backup data center.
- Apache Kafka
-
A distributed queue system that links the Module and Manager.
Setup Procedure
Implementing the Solr CrossDC involves the following steps:
-
Apache Kafka Cluster: Ensure the availability of an Apache Kafka cluster. This acts as the distributed queue interconnecting data centers.
-
CrossDC Solr Module: Install this Solr Module on each node in your Solr cluster (in both primary and backup data centers).
-
Configure solrconfig.xml to reference the new
MirroringUpdateProcessor
and set it up with the Kafka cluster. -
Optionally configure the
solr.xml
to useMirroringCollectionsHandler
andMirroringConfigSetsHandler
if necessary.
-
-
CrossDC Manager: Install this application in the backup data center, then configure it to connect to both the Kafka and backup Solr clusters.
Detailed Configuration & Startup
CrossDC Solr Module
Mirroring Updates
-
Install the
cross-dc
Solr Module. -
Add the new UpdateProcessor in your ConfigSet’s
solrconfig.xml
:
<updateRequestProcessorChain name="mirrorUpdateChain" default="true"> <processor class="solr.MirroringUpdateRequestProcessorFactory"> <str name="bootstrapServers">${solr.crossdc.bootstrapServers:}</str> <str name="topicName">${solr.crossdc.topicName:}</str> </processor> <processor class="solr.LogUpdateProcessorFactory" /> <processor class="solr.RunUpdateProcessorFactory" /> </updateRequestProcessorChain>
-
Add an external version constraint UpdateProcessor to the update chain added to
solrconfig.xml
to accept user-provided update versions. See the documentation for both UpdateRequestProcessorFactories and the DocBasedVersionConstraintsProcessor.
<updateRequestProcessorChain name="mirrorUpdateChain" default="true"> <processor class="solr.MirroringUpdateRequestProcessorFactory"> <str name="bootstrapServers">${solr.crossdc.bootstrapServers:}</str> <str name="topicName">${solr.crossdc.topicName:}</str> </processor> <processor class="solr.DocBasedVersionConstraintsProcessorFactory"> <bool name="ignoreOldUpdates">true</bool> <str name="versionField">my_version_l</str> <str name="deleteVersionParam">del_version</str> </processor> <processor class="solr.LogUpdateProcessorFactory" /> <processor class="solr.RunUpdateProcessorFactory" /> </updateRequestProcessorChain>
-
Start or restart your Solr clusters.
Mirroring Collection Admin Requests
Add the following line to solr.xml
:
<solr>
<str name="collectionsHandler">solr.MirroringCollectionsHandler</str>
…
</solr>
In addition to the general properties that determine distributed queue parameters, this handler supports the following properties:
solr.crossdc.mirrorCollections
-
comma-separated list of collections for which the admin commands will be mirrored. If this list is empty or the property is not set then admin commands for all collections will be mirrored.
Mirroring ConfigSet Admin Requests
Add the following line to solr.xml
:
<solr>
<str name="configSetsHandler">solr.MirroringConfigSetsHandler</str>
…
</solr>
Configuration Properties for the CrossDC Solr Module:
The required configuration properties are:
solr.crossdc.bootstrapServers
-
list of servers used to connect to the Kafka cluster.
solr.crossdc.topicName
-
Kafka topicName used to indicate which Kafka queue the Solr updates will be pushed on. This topic must already exist.
Optional configuration properties:
solr.crossdc.batchSizeBytes
<integer>-
maximum batch size in bytes for the Kafka queue
solr.crossdc.bufferMemoryBytes
<integer>-
memory allocated by the MirroringURP in total for buffering
solr.crossdc.lingerMs
<integer>-
amount of time that the MirroringURP will wait to add to a batch
solr.crossdc.requestTimeoutMS
<integer>-
request timeout for the MirroringURP
solr.crossdc.indexUnmirrorableDocs
<boolean>-
if set to True, updates that are too large for the Kafka queue will still be indexed on the primary.
solr.crossdc.enableDataCompression
<boolean>-
whether to use compression for data sent over the Kafka queue - can be none (default), gzip, snappy, lz4, or zstd
solr.crossdc.numRetries
<integer>-
Setting a value greater than zero will cause the MirroringURP to resend any record whose send fails with a potentially transient error.
solr.crossdc.retryBackoffMs
<integer>-
The amount of time to wait before attempting to retry a failed request to a given topic partition.
solr.crossdc.deliveryTimeoutMS
<integer>-
Updates sent to the Kafka queue will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first
solr.crossdc.maxRequestSizeBytes
<integer>-
The maximum size of a Kafka queue request in bytes - limits the number of requests that will be sent over the queue in a single batch.
solr.crossdc.dlqTopicName
<string>: If not empty then requests that failed processingmaxAttempts
times will be sent to a "dead letter queue" topic in Kafka (must exist if configured). solr.crossdc.mirrorCommits
<boolean>-
If
true
then standalone commit requests will be mirrored, otherwise they will be processed only locally. solr.crossdc.expandDbq
<enum>-
If set to
expand
(default) then Delete-By-Query will be expanded before mirroring into series of Delete-By-Id, which may help with correct processing of out-of-order requests on the consumer side. If set tonone
then Delete-By-Query requests will be mirrored as-is.
CrossDC Manager
-
Start the Manager process using the included start script at
solr/cross-dc-manager/bin/cross-dc-manager
(orcross-dc-manager.cmd
for Windows).-
The Manager can also be run via the docker image. The
cross-dc-manager
script will be found on the$PATH
.
-
-
Configure the CrossDC Manager with Java system properties using the
JAVA_OPTS
environment variable.
API Endpoints
Currently the following endpoints are exposed (on local port configured using port
property, default is 8090):
/metrics
- (GET)-
This endpoint returns JSON-formatted metrics describing various aspects of document processing in Consumer.
/threads
- (GET)-
Returns a plain-text thread dump of the JVM running the Consumer application.
Configuration Properties for the CrossDC Manager:
The required configuration properties are:
solr.crossdc.bootstrapServers
-
list of Kafka bootstrap servers.
solr.crossdc.topicName
-
Kafka topicName used to indicate which Kafka queue the Solr updates will be pushed to. This can be a comma separated list for the Manager if you would like to consume multiple topics.
solr.crossdc.zkConnectString
-
Zookeeper connection string used to connect to Solr.
Optional configuration properties:
solr.crossdc.consumerProcessingThreads
<integer>-
The number of threads used by the manager to concurrently process updates from the Kafka queue.
port
<integer>-
Local port for the API endpoints. Default is
8090
. solr.crossdc.collapseUpdates
<enum>-
When set to
all
then all incoming update requests (up tomaxCollapseRecords
) will be collapsed into a single UpdateRequest, as long as their parameters are identical. When set topartial
(default) then only requests without deletions will be collapsed - requests with any delete ops will be sent individually in order to preserve ordering of updates. When set tonone
the incoming update requests will be sent individually without any collapsing. NOTE: requests of other types than UPDATE are never collapsed. solr.crossdc.maxCollapseRecords
<integer>-
Maximum number of incoming update request to collapse into a single outgoing request. Default is
500
.
Optional configuration properties used when the manager must retry by putting updates back on the Kafka queue:
solr.crossdc.batchSizeBytes
-
maximum batch size in bytes for the Kafka queue
solr.crossdc.bufferMemoryBytes
-
memory allocated by the Manager in total for buffering
solr.crossdc.lingerMs
-
amount of time that the ProManagerducer will wait to add to a batch
solr.crossdc.requestTimeoutMS
-
request timeout for the Manager
solr.crossdc.maxPollIntervalMs
-
the maximum delay between invocations of poll() when using consumer group management.
Central Configuration Option
Manage configuration centrally in Solr’s Zookeeper cluster by placing a properties file called crossdc.properties
in
the root Solr Zookeeper znode, eg, /solr/crossdc.properties
.
The solr.crossdc.bootstrapServers
and solr.crossdc.topicName
properties can be included in this file.
-
For the CrossDC Solr Module, all crossdc configuration properties can be placed here.
-
For the CrossDC Manager application you can also configure all crossdc properties here, however you will need to set the
zkConnectString
as a system property so that the manager knows where to find the file.
Disabling CrossDC via Configuration
To make the Cross DC UpdateProcessor optional in a common solrconfig.xml
, use the enabled attribute.
Setting the solr.crossdc.enabled
system property or Collection Property to false will turn the processor into a NOOP in the chain for either the whole Solr Node (via system property) or Solr Collection (via collection property).
<updateRequestProcessorChain name="mirrorUpdateChain" default="true">
<processor class="solr.MirroringUpdateRequestProcessorFactory">
<bool name="enabled">${solr.crossdc.enabled:true}</bool>
<str name="bootstrapServers">${solr.crossdc.bootstrapServers:}</str>
<str name="topicName">${solr.crossdc.topicName:}</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Limitations
-
When
solr.crossdc.expandDbq
property is set toexpand
(default) then Delete-By-Query converts to a series of Delete-By-Id, which can be much less efficient for queries matching large numbers of documents. Setting this property tonone
results in forwarding a real Delete-By-Query - this reduces the amount of data to mirror but may cause different results due to the potential re-ordering of failed & re-submitted requests between Consumer and the target Solr. -
When
solr.crossdc.collapseUpdates
is set toall
then multiple requests containing a mix of add and delete ops will be collapsed into a single outgoing request. This will cause the original ordering of add / delete ops to be lost (because Solr processing of an update request always processes all add ops first, and only then the delete ops), which may affect the final outcome when some of the ops refer to the same document ids.