Managed SolrCloud Rolling Updates

Since v0.2.7

Solr Clouds are complex distributed systems, and thus require a more delicate and informed approach to rolling updates.

If the Managed update strategy is specified in the SolrCloud CRD, then the Solr Operator will take control of deleting SolrCloud pods when they need to be updated.

The operator will find all pods that have not been updated yet and choose the next set of pods to delete for an update, given the following workflow.

Pod Update Workflow

The logic goes as follows:

  1. Find the pods that are out-of-date
  2. Update all out-of-date pods that do not have a started Solr container.
    • This allows for updating a pod that cannot start, even if other pods are not available.
    • This step does not respect the maxPodsUnavailable option, because these pods have not even started the Solr process.
  3. Retrieve the cluster state of the SolrCloud if there are any ready pods.
    • If no pods are ready, then there is no endpoint to retrieve the cluster state from.
  4. Sort the pods in order of safety for being restarted (see Pod Update Sorting Order).
  5. Iterate through the sorted pods, greedily choosing which pods to update (see Pod Update Selection Logic).
    • The maximum number of pods that can be updated is determined by starting with maxPodsUnavailable, then subtracting the number of updated pods that are unavailable as well as the number of not-yet-started, out-of-date pods that were updated in step 2. This check makes sure that any pods taken down during this step do not violate the maxPodsUnavailable constraint.
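
The split in step 2 and the budget arithmetic in step 5 can be sketched as follows. This is a minimal illustration rather than the operator's implementation: the PodStatus type and both function names are hypothetical, and maxPodsUnavailable is assumed to have already been resolved to an integer.

```python
from dataclasses import dataclass


@dataclass
class PodStatus:
    # Hypothetical view of a pod, for illustration only.
    name: str
    out_of_date: bool
    solr_started: bool  # True once the Solr container has started


def split_out_of_date(pods):
    """Steps 1-2: find out-of-date pods and separate those that never started Solr."""
    out_of_date = [p for p in pods if p.out_of_date]
    not_started = [p for p in out_of_date if not p.solr_started]
    started = [p for p in out_of_date if p.solr_started]
    return not_started, started


def update_budget(max_pods_unavailable, updated_unavailable, not_started_updated):
    """Step 5: how many started, out-of-date pods may still be taken down."""
    remaining = max_pods_unavailable - updated_unavailable - not_started_updated
    # Clamping at zero is an assumption of this sketch: a negative budget
    # simply means no further pods can be selected.
    return max(0, remaining)
```

For example, with maxPodsUnavailable resolved to 3, one updated pod that is still unavailable, and one not-yet-started pod updated in step 2, update_budget(3, 1, 1) leaves room for one more pod.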

Pod Update Sorting Order

The pods are sorted by the following criteria, in the given order. If any two pods are equal on a criterion, then the next criterion (in the following order) is used to sort them.

In this context, the pods sorted highest are the first chosen to be updated, and the pods sorted lowest are selected last.

  1. If the pod is the overseer, it will be sorted lowest.
  2. If the pod is not represented in the clusterState, it will be sorted highest.
    • A pod is not in the clusterState if it does not host any replicas and is not the overseer.
  3. Number of leader replicas hosted in the pod, sorted low -> high
  4. Number of active or recovering replicas hosted in the pod, sorted low -> high
  5. Number of total replicas hosted in the pod, sorted low -> high
  6. If the pod is not a liveNode, then it will be sorted lower.
  7. Any pods that are equal on the above criteria will be sorted lexicographically.
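
One way to express this ordering is as a tuple sort key, where the first element of the sorted list is the first candidate for update. The sketch below follows the list as written and is not taken from the operator's source: the PodInfo type and its field names are hypothetical, "low -> high" is read as "fewer replicas are updated earlier", and the lexicographic tie-break is assumed to be ascending by pod name.

```python
from dataclasses import dataclass


@dataclass
class PodInfo:
    # Hypothetical summary of what the cluster state says about one pod.
    name: str
    is_overseer: bool = False
    in_cluster_state: bool = True
    leader_replicas: int = 0
    active_or_recovering_replicas: int = 0
    total_replicas: int = 0
    is_live: bool = True


def update_order_key(pod):
    """Smaller keys sort first, so they are the first candidates for update."""
    return (
        pod.is_overseer,                    # 1. the overseer goes last
        pod.in_cluster_state,               # 2. pods absent from clusterState go first
        pod.leader_replicas,                # 3. fewer leader replicas first
        pod.active_or_recovering_replicas,  # 4. fewer active/recovering replicas first
        pod.total_replicas,                 # 5. fewer total replicas first
        not pod.is_live,                    # 6. non-live pods sort lower (later)
        pod.name,                           # 7. lexicographic tie-break (assumed ascending)
    )


def sort_for_update(pods):
    return sorted(pods, key=update_order_key)
```

Because Python compares tuples element by element, each later criterion is only consulted when all earlier ones are equal, which matches the tie-breaking rule described above.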

Pod Update Selection Logic

Loop over the sorted pods, until the number of pods selected to be updated has reached the maximum. This maximum is calculated by taking the given, or default, maxPodsUnavailable and subtracting the number of updated pods that are unavailable or have yet to be re-created.

  • If the pod is the overseer, then all other pods must be updated and available. Otherwise, the overseer pod cannot be updated.
  • If the pod contains no replicas, the pod is chosen to be updated.
    WARNING: If you use Solr worker nodes for streaming expressions, you will likely want to set maxPodsUnavailable to a value you are comfortable with.
  • If the Solr Node of the pod is not live, the pod is chosen to be updated.
  • If all replicas in the pod are in a down or recovery_failed state, the pod is chosen to be updated.
  • If taking down the replicas hosted in the pod would not violate the given maxShardReplicasUnavailable, then the pod can be updated. Once a pod with replicas has been chosen to be updated, the replicas hosted in that pod are then considered unavailable for the rest of the selection logic.
    • Some replicas in the shard may already be in a non-active state, or may reside on Solr Nodes that are not “live”. The maxShardReplicasUnavailable calculation will take these replicas into account, as a starting point.
    • If a pod contains non-active replicas and is chosen to be updated, then the replicas that are already non-active will not be double-counted in the maxShardReplicasUnavailable calculation.
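
The selection rules above can be sketched as a single greedy pass over the sorted pods. This is an illustrative approximation under stated assumptions, not the operator's code: the Pod type, the select_pods function and all parameter names are hypothetical; replicas are modelled as (shard, state) tuples; shard_unavailable is assumed to hold the per-shard count of replicas that are already non-active or on non-live nodes; and the overseer check is reduced to a boolean supplied by the caller.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Pod:
    # Hypothetical pod model, for illustration only.
    name: str
    is_overseer: bool = False
    is_live: bool = True
    # One (shard, state) tuple per hosted replica, e.g. ("shard1", "active").
    replicas: list = field(default_factory=list)


def select_pods(sorted_pods, budget, max_shard_replicas_unavailable,
                shard_unavailable, others_updated_and_available=False):
    """Greedily pick pod names to update, in the given (already sorted) order."""
    unavailable = Counter(shard_unavailable)  # copy, so the caller's dict is untouched
    chosen = []
    for pod in sorted_pods:
        if len(chosen) >= budget:
            break
        if pod.is_overseer:
            # The overseer is only eligible once every other pod is updated and available.
            if others_updated_and_available and not chosen:
                chosen.append(pod.name)
            continue
        states = [state for _, state in pod.replicas]
        if (not pod.replicas or not pod.is_live
                or all(s in ("down", "recovery_failed") for s in states)):
            # Nothing additional becomes unavailable by restarting this pod.
            chosen.append(pod.name)
            continue
        # Only currently active replicas add to a shard's unavailable count;
        # non-active ones are already included in the starting point.
        newly_down = Counter(shard for shard, state in pod.replicas if state == "active")
        if all(unavailable[shard] + n <= max_shard_replicas_unavailable
               for shard, n in newly_down.items()):
            unavailable.update(newly_down)
            chosen.append(pod.name)
    return chosen
```

With two pods that each host an active replica of the same shard and maxShardReplicasUnavailable set to 1, only the first of the two is selected in a given pass; the second waits until the first is back and its replica is active again.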