Secondary Schedulers and the Coscheduling Plugin

The default Kubernetes scheduler places Pods one at a time, as resources become available. For batch and HPC workloads that require gang scheduling—all Pods of a job scheduled together or none at all—this behavior is insufficient. Secondary schedulers and scheduler plugins extend or work alongside the kube-scheduler to provide gang semantics until they are natively supported. This module explains why secondary schedulers are needed and how they work, focuses on the coscheduling plugin and its PodGroup concept, and then outlines the drawbacks of the coscheduling approach.

Why Secondary Schedulers Are Needed

The kube-scheduler is designed for independent placement: each Pod is evaluated on its own and bound to a node as soon as a suitable one is found. For a single replica or a small Deployment, that is efficient. For a distributed job that needs dozens or hundreds of Pods to run as a unit, it leads to:

  • Resource fragmentation — Some Pods get placed while others remain Pending. The job cannot make progress, but the scheduled Pods still hold resources.

  • Starvation and deadlock — Competing jobs can each get a subset of their Pods scheduled, so no job ever gets the full set it needs.

  • Wasted capacity — Partially scheduled gangs consume resources without producing useful work.

Secondary schedulers (and scheduler plugins) address this by implementing scheduling logic that the default scheduler does not provide. They can run as a separate process that assigns Pods to nodes, or they can extend the kube-scheduler via the Kubernetes Scheduling Framework as a plugin that participates in the normal scheduling cycle. In both cases, the goal is to provide gang scheduling: either all Pods in a gang are placed, or none are, so that batch jobs only start when they can run to completion as a unit.

Gang Scheduling Until Native Support

Gang scheduling means treating a set of Pods as a single unit for scheduling. Until the kube-scheduler supports gang scheduling natively (e.g., via a built-in plugin or API), cluster operators and frameworks rely on:

  • Custom schedulers — A separate scheduler process that selects nodes for Pods that require gang semantics. Pods are typically configured with a custom schedulerName so they are scheduled by this secondary scheduler instead of the default one.

  • Scheduler plugins — Extensions to the kube-scheduler that run in the same process and participate in the scheduling cycle. Plugins can filter nodes, score them, or reserve resources. A coscheduling plugin uses this mechanism to delay placing any Pod in a gang until the scheduler can place the entire gang.
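With the custom-scheduler approach, each Pod opts in via spec.schedulerName. A minimal sketch (the scheduler name my-gang-scheduler and the image are placeholders; the name must match whatever the secondary scheduler registers under):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
spec:
  schedulerName: my-gang-scheduler   # placeholder; must match the secondary scheduler's name
  containers:
    - name: worker
      image: trainer:latest          # hypothetical image
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
```

Pods without a schedulerName keep the default scheduler, so both can coexist in one cluster.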

The coscheduling plugin is a prominent example of the plugin approach: it integrates with the standard scheduler and uses a PodGroup custom resource to define which Pods belong to the same gang. That way, gang scheduling is achieved without replacing the default scheduler, and it can work alongside other plugins and policies.

The Coscheduling Plugin and PodGroup

The coscheduling plugin is a scheduler plugin built on the Kubernetes Scheduling Framework that provides gang scheduling by grouping Pods into a PodGroup and scheduling the group as a whole.

PodGroup

A PodGroup is a custom resource that represents a gang of Pods that must be scheduled together. Typical fields include:

  • MinMember (or minMember) — The minimum number of Pods that must be schedulable before any Pod in the group is scheduled. Often set to the total size of the gang so that all-or-nothing semantics are enforced.

  • Association with Pods — Pods are linked to the PodGroup via a label (e.g., pod-group.scheduling.x-k8s.io: <name>) or an annotation. The coscheduling plugin uses this to identify which Pods belong to the same group.

  • Status — The PodGroup status reflects whether the group has enough schedulable Pods (e.g., Schedulable or Pending) so that controllers or users can see why a gang is not yet running.
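Concretely, a PodGroup and one of its member Pods might look like the following. This is a sketch based on the scheduler-plugins PodGroup CRD; the group name and image are illustrative, and field and label names can differ between plugin versions:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: train-job
spec:
  minMember: 4                   # all 4 Pods must be placeable before any is bound
  scheduleTimeoutSeconds: 60     # how long to wait for the gang; may vary by version
---
apiVersion: v1
kind: Pod
metadata:
  name: train-job-worker-0
  labels:
    pod-group.scheduling.x-k8s.io: train-job   # links this Pod to the PodGroup above
spec:
  containers:
    - name: worker
      image: trainer:latest      # hypothetical image
```

All four worker Pods would carry the same label value, so the plugin can count them as one gang.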

When a Pod that belongs to a PodGroup is considered for scheduling, the plugin does not place it immediately. Instead, it checks whether at least minMember Pods from that PodGroup can all be placed at the same time. Only when that condition is satisfied does the plugin allow the scheduler to bind those Pods. As a result, either the required number of Pods in the group are scheduled together, or none of them are—achieving gang scheduling.

How Coscheduling Works

The flow is roughly:

  1. Job or controller creates a PodGroup — A higher-level controller (e.g., a training operator or a job framework) creates a PodGroup with a name and minMember (e.g., 64 for a 64-Pod job).

  2. Pods are created with PodGroup membership — Each Pod is labeled or annotated with the PodGroup name so the coscheduling plugin can associate it with the group.

  3. Pods enter the scheduling queue — When the kube-scheduler runs the coscheduling plugin, the plugin sees that the Pod belongs to a PodGroup and checks how many Pods in that group are currently pending and how many could be placed.

  4. Gang check — The plugin (conceptually) simulates or checks whether at least minMember Pods of the group can be scheduled. If so, it allows those Pods to proceed; if not, it rejects or holds the attempt so the Pods stay pending and are retried in a later scheduling cycle, once the gang condition can be met.

  5. All-or-nothing binding — Once the plugin allows the gang to proceed, the scheduler binds the Pods. Thus, gang scheduling is achieved: either the required minimum of Pods in the group are placed together, or none are.

Implementations may differ in detail (e.g., which extension points the plugin uses, such as pre-filter, permit, or reserve), but the idea is the same: use PodGroup metadata to delay scheduling until the whole gang can be placed.
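The gang check in the flow above can be sketched in plain Go, independent of the scheduling framework. The types and names here are illustrative, not the plugin's actual API; the placeable slice stands in for per-Pod feasibility results that the real scheduler computes during filtering:

```go
package main

import "fmt"

// PodGroup mirrors the minimal gang information the plugin needs.
type PodGroup struct {
	Name      string
	MinMember int
}

// gangReady reports whether the group may proceed: it counts how many
// pending Pods of the group could each be placed on some node, and only
// returns true when that count reaches MinMember (all-or-nothing).
func gangReady(pg PodGroup, placeable []bool) bool {
	count := 0
	for _, ok := range placeable {
		if ok {
			count++
		}
	}
	return count >= pg.MinMember
}

func main() {
	pg := PodGroup{Name: "train-job", MinMember: 4}

	// Only 3 of 4 Pods fit somewhere: the whole gang stays pending.
	fmt.Println(gangReady(pg, []bool{true, true, true, false})) // false

	// All 4 fit: the plugin lets the scheduler bind the gang.
	fmt.Println(gangReady(pg, []bool{true, true, true, true})) // true
}
```

The real plugin additionally has to handle Pods arriving over time and races during binding, which is exactly the correctness burden discussed under drawbacks below.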

Drawbacks of Coscheduling

Using the coscheduling plugin (or a custom secondary scheduler) for gang scheduling has several drawbacks:

  • Not yet native to kube-scheduler — Coscheduling is implemented as an out-of-tree or custom plugin and a separate PodGroup CRD. You must install and maintain the plugin and the CRD, and upgrade them as the Kubernetes Scheduling Framework evolves. Native support in kube-scheduler would reduce this burden.

  • Coordination and correctness — The plugin must correctly count pending Pods, respect minMember, and avoid races where some Pods are bound and others are not. Edge cases (e.g., PodGroup deleted while Pods are pending, or partial failures) require careful handling.

  • Starvation and priority — If the cluster never has enough free capacity to schedule a large gang, the entire PodGroup stays pending. Without integration with a queue (e.g., Kueue) or priority mechanisms, large jobs can block smaller ones or sit indefinitely.

  • Operator and framework dependency — Job frameworks and operators must create PodGroups and label Pods correctly. Misconfiguration (wrong label, missing PodGroup, or incorrect minMember) leads to Pods that never schedule or that schedule without gang semantics.

  • Resource efficiency vs fairness — Holding back a whole gang until capacity is free can improve utilization by avoiding partially scheduled jobs, but it can also reduce throughput if many small gangs could have run in the capacity a large gang is waiting for. Tuning and queue policies are often needed.

For these reasons, coscheduling is a practical way to get gang scheduling today, but long-term, native scheduler support combined with a queue layer (such as Kueue) for admission and fairness is the direction the ecosystem is moving.

Summary

Secondary schedulers and scheduler plugins address the need for gang scheduling on Kubernetes when the default scheduler does not support it natively. The coscheduling plugin uses a PodGroup custom resource to group Pods and only allows scheduling when a minimum number of Pods in the group can be placed together, providing all-or-nothing semantics. This enables batch and HPC jobs to run as a unit but comes with drawbacks: extra components to maintain, dependency on correct PodGroup and labeling, possible starvation, and the need for integration with queuing and priority. Understanding coscheduling helps you evaluate when to use it and how it fits with queue systems like Kueue and with future native gang scheduling in the kube-scheduler.