Class OverseerStatusCmd

  • All Implemented Interfaces:
    CollApiCmds.CollectionApiCommand

    public class OverseerStatusCmd
    extends Object
    implements CollApiCmds.CollectionApiCommand
    This command returns stats about the Overseer, the cluster state updater and Collection API activity occurring within the current Overseer node. The scope matters: distributed operations occurring on other nodes, such as distributed cluster state updates or Per Replica States updates, are not included in these stats.

    More fundamentally, when Collection API command execution is distributed, this command does not run on the Overseer at all (and not much else runs on the Overseer either, since cluster state updates are distributed as well), so Overseer stats and status cannot be returned and would not be meaningful. Zookeeper based queue metrics are not meaningful either, because the Zookeeper queues are not used in that mode.

    The Stats instance returned by CollectionCommandContext.getOverseerStats() when running in the Overseer is created in Overseer.start() and passed to the cluster state updater, from where it is also propagated to the various Zookeeper queues so that they can register various events. This class is the only place where it is used in the Collection API implementation, and only to return results.

    TODO: create a new command returning node specific Collection API/Config set API/cluster state updates stats such as success and failures?

    The structure of the returned results is as follows:

    • leader: ID of the current overseer leader node
    • overseer_queue_size: count of entries in the /overseer/queue Zookeeper queue/directory
    • overseer_work_queue_size: count of entries in the /overseer/queue-work Zookeeper queue/directory
    • overseer_collection_queue_size: count of entries in the /overseer/collection-queue-work Zookeeper queue/directory
    • overseer_operations: map (of maps) of success and error counts for operations. The operations (keys) tracked in this map are:
      • am_i_leader (Overseer checking it is still the elected Overseer as it processes cluster state update messages)
      • configset_<config set operation>
      • Cluster state change operation names from CollectionParams.CollectionAction (not all of them!) and OverseerAction. The complete list: create, delete, createshard, deleteshard, addreplica, addreplicaprop, deletereplicaprop, balanceshardunique, modifycollection, state, leader, deletecore, addroutingrule, removeroutingrule, updateshardstate, downnode and quit. The last one is unlikely to be observed, since the Overseer exits right away.
      • update_state (when Overseer cluster state updater persists changes in Zookeeper)
      For each key, the value is a map composed of:
      • requests: success count of the given operation
      • errors: error count of the operation
      • More metrics (see below)
    • collection_operations: map (of maps) of success and error counts for collection related operations. The operations (keys) tracked in this map are all operations whose stat name starts with collection_, but the collection_ prefix is stripped from the returned key. Possible keys are therefore:
      • am_i_leader: originating in a stat called collection_am_i_leader representing Overseer checking it is still the elected Overseer as it processes Collection API and Config Set API messages.
      • Collection API operation names from CollectionParams.CollectionAction (the stripped collection_ prefix gets added in OverseerCollectionMessageHandler.getTimerName(String))
      For each key, the value is a map composed of:
      • requests: success count of the given operation
      • errors: error count of the operation
      • recent_failures: an optional entry containing a list of maps. Each map has two entries: request, holding the properties of a failed request (a ZkNodeProps), and response, holding the corresponding response properties (a SolrResponse).
      • More metrics (see below)
    • overseer_queue: metrics on operations done on the Zookeeper queue /overseer/queue (see metrics below).
      The operations that can be done on the queue and that can be keys whose values are a metrics map are:
      • offer
      • peek
      • peek_wait
      • peek_wait_forever
      • peekTopN_wait
      • peekTopN_wait_forever
      • poll
      • remove
      • remove_event
      • take
    • overseer_internal_queue: same as above but for queue /overseer/queue-work
    • collection_queue: same as above but for queue /overseer/collection-queue-work
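
    The way raw stat names map onto the sections above can be sketched as follows. This is a minimal illustration, not the actual implementation: the class OverseerStatKeys and its route method are hypothetical, and the naming of queue stats (queue path, an underscore, then the operation) is an assumption. Only the stripping of the collection_ prefix is stated in the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Solr): routes a raw stat name to a result section. */
final class OverseerStatKeys {
    // Prefix -> section name. The three queue prefixes are an assumption about
    // how queue stats are named; only "collection_" comes from the description.
    private static final Map<String, String> SECTION_BY_PREFIX = new LinkedHashMap<>();
    static {
        SECTION_BY_PREFIX.put("collection_", "collection_operations");
        SECTION_BY_PREFIX.put("/overseer/queue_", "overseer_queue");
        SECTION_BY_PREFIX.put("/overseer/queue-work_", "overseer_internal_queue");
        SECTION_BY_PREFIX.put("/overseer/collection-queue-work_", "collection_queue");
    }

    /** Returns a two-element array: {section name, key as reported in that section}. */
    static String[] route(String rawStatName) {
        for (Map.Entry<String, String> e : SECTION_BY_PREFIX.entrySet()) {
            String prefix = e.getKey();
            if (rawStatName.startsWith(prefix)) {
                // The prefix is stripped from the key that ends up in the response.
                return new String[] {e.getValue(), rawStatName.substring(prefix.length())};
            }
        }
        // Anything without a known prefix is reported under overseer_operations, unchanged.
        return new String[] {"overseer_operations", rawStatName};
    }
}
```

    Under these assumptions, a stat named collection_am_i_leader is reported as am_i_leader under collection_operations, while a stat named am_i_leader stays under overseer_operations. This is why the same key can legitimately appear in both maps with different meanings.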

    Maps returned as values of keys in overseer_operations, collection_operations, overseer_queue, overseer_internal_queue and collection_queue include additional stats. These stats are provided by MetricUtils and describe every execution of each type of operation, whether it failed or succeeded (see calls to Stats.time(String)). The metric keys are:

    • avgRequestsPerSecond
    • 5minRateRequestsPerSecond
    • 15minRateRequestsPerSecond
    • avgTimePerRequest
    • medianRequestTime
    • 75thPcRequestTime
    • 95thPcRequestTime
    • 99thPcRequestTime
    • 999thPcRequestTime
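
    A client consuming one of these per-operation maps might read it as sketched below. This is an illustrative sketch only: OperationStatsReader is a hypothetical class, and it assumes the map has already been deserialized into a java.util.Map with numeric values. Because requests is documented above as the success count, the total number of executions is taken to be requests plus errors.

```java
import java.util.Map;
import java.util.OptionalDouble;

/** Hypothetical client-side reader for one per-operation stats map (not part of Solr). */
final class OperationStatsReader {

    /**
     * Fraction of executions that failed. "requests" counts successes only,
     * so the denominator is requests + errors. Empty when there is nothing
     * to divide by, e.g. for queue maps that carry no requests/errors entries.
     */
    static OptionalDouble errorRatio(Map<String, ?> opStats) {
        long succeeded = asLong(opStats.get("requests"));
        long failed = asLong(opStats.get("errors"));
        long total = succeeded + failed;
        return total == 0
                ? OptionalDouble.empty()
                : OptionalDouble.of((double) failed / total);
    }

    /** Reads a numeric metric such as "95thPcRequestTime"; empty if absent or not a number. */
    static OptionalDouble metric(Map<String, ?> opStats, String key) {
        Object value = opStats.get(key);
        return value instanceof Number
                ? OptionalDouble.of(((Number) value).doubleValue())
                : OptionalDouble.empty();
    }

    // Missing or non-numeric entries are treated as zero.
    private static long asLong(Object value) {
        return value instanceof Number ? ((Number) value).longValue() : 0L;
    }
}
```

    Returning OptionalDouble rather than a default value keeps an absent metric distinguishable from a genuine zero, which matters here because recent_failures is optional and the queue maps do not carry the same entries as the operation maps.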