DEV Community

🚦 Meet Kueue: Smart Job Queueing for Kubernetes 🧠⚙️

Hey everyone 👋 If you run batch jobs, data pipelines, or any kind of AI and ML training on Kubernetes, you have probably hit this wall. Kubernetes is fantastic at deciding WHERE a pod should run, but it is surprisingly clueless about WHEN a job should start. 😅

You submit ten jobs, the cluster fills up, and the rest just sit there as Pending. No real queue, no priority, no fairness between teams. One noisy team can eat all your expensive nodes while everyone else waits. 🥲

That is exactly the gap Kueue fills, and today I want to walk you through it with a pile of hands-on examples you can run on any cluster, even your homelab. 🏡

👉 Key takeaway up front: Kueue is a job-level manager that holds your jobs in a real queue and only admits them when there is enough quota to actually run them.

🧪 Everything in this guide was tested against Kueue v0.18.1 using the v1beta2 API. I pinned every command and manifest to that version so you do not get surprised by API drift.

📋 What we will cover

  • ✅ Why Kubernetes needs a queue
  • ✅ The building blocks in plain language
  • ✅ Installing Kueue
  • ✅ Setting up quota with a ResourceFlavor, a ClusterQueue, and a LocalQueue
  • ✅ Submitting a Job and watching it get queued and admitted
  • ✅ Priority-based admission
  • ✅ Partial admission and elastic jobs
  • ✅ Multiple resource flavors for x86 and arm
  • ✅ Fair sharing between teams with cohorts
  • ✅ Dedicated quota with a shared fallback
  • ✅ Queueing a plain Pod
  • ✅ Why this matters a lot for GPUs and your cloud bill

🤔 Why Kubernetes needs a queue

Native Kubernetes scheduling is pod-centric. The scheduler looks at one pod at a time and tries to place it. That works great for long-running services.

Batch workloads are different. They have a beginning and an end, they often need a fixed chunk of capacity, and they compete with other teams for the same nodes. Without a queueing layer you get:

  • ✅ Jobs that fail or stay Pending when resources are tight
  • ✅ No quota governance, so one team can starve the others
  • ✅ No admission priority, so a quick experiment can block production training

🧠 What is Kueue

Kueue is a Kubernetes-native job queueing system, maintained as a kubernetes-sigs project. It does not replace the scheduler. It sits in front of it.

🛂 Here is the simple mental model. Think of the Kubernetes scheduler as the runway, and Kueue as the control tower deciding which flight is cleared for takeoff and when. ✈️

When a job arrives, Kueue suspends it, creates a matching Workload object, checks if there is enough quota, and only then lets the pods be created. If there is no room, the job waits politely in the queue instead of failing.

🧩 The building blocks

There are four pieces you need to know, plus one bonus piece for teams.

  • ✅ ResourceFlavor 🍦 Describes a type of resource, usually tied to node labels. For example x86 nodes versus arm nodes, or GPU nodes versus CPU nodes. If you do not need to distinguish node types, you use one empty flavor.
  • ✅ ClusterQueue 🏦 A cluster-scoped object that holds the actual quota. This is where you say how much cpu, memory, or how many GPUs are available. Users do not submit to it directly.
  • ✅ LocalQueue 📥 A namespaced object that points to a ClusterQueue. This is what users actually target with their jobs.
  • ✅ Workload 📦 The internal object Kueue creates for each job to track its admission state. You usually just observe it.
  • ✅ Cohort 👥 (bonus) A group of ClusterQueues that can borrow each other's unused quota. This is the magic behind fair sharing between teams.

🛠️ Step 1: Install Kueue

The simplest method is to apply the released manifests with server-side apply.

kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.18.1/manifests.yaml

The controller runs in the kueue-system namespace. Give it a few seconds and check it is healthy.

kubectl get deploy -n kueue-system

You should see the controller manager become ready.

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
kueue-controller-manager   1/1     1            1           30s

Prefer Helm? Kueue publishes an OCI chart for each release. Just make sure the chart version matches the release you want.

helm install kueue oci://registry.k8s.io/kueue/charts/kueue \
  --version=0.18.1 \
  --namespace kueue-system \
  --create-namespace \
  --wait --timeout 300s

๐Ÿฆ Step 2: Create a ResourceFlavor

Since we are not distinguishing node types in this first demo, an empty flavor is all we need.

# default-flavor.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: "default-flavor"

Apply it.

kubectl apply -f default-flavor.yaml

๐Ÿฆ Step 3: Create a ClusterQueue

Now we define the quota for the whole cluster. Here we allow 9 cpu and 36Gi of memory, all served by our single flavor.

# cluster-queue.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}  # match all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi

Apply it.

kubectl apply -f cluster-queue.yaml

One important detail: the flavor name under spec.resourceGroups must match the ResourceFlavor name from step 2. If they do not match, the ClusterQueue will not become ready. 🔗
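A quick way to catch a mismatched flavor name is to read the ClusterQueue's status right after applying it. The snippet below is only a sketch: cq_active is a helper name I made up, and it assumes the ClusterQueue reports a condition of type Active in its status, which is worth confirming with kubectl describe clusterqueue on your version.

```shell
# cq_active is a hypothetical helper, not part of Kueue or kubectl.
# It prints the status of the ClusterQueue's "Active" condition
# (assumed condition type); "True" means the queue can admit workloads.
cq_active() {
  kubectl get clusterqueue "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Active")].status}'
}

# Usage: cq_active cluster-queue
```

If it prints anything other than True, the condition's message in the describe output usually points at the missing or misspelled flavor.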

📥 Step 4: Create a LocalQueue

Users cannot send work to a ClusterQueue directly. They need a LocalQueue in their namespace that points to it. We will put ours in the default namespace.

# default-user-queue.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"

Apply it.

kubectl apply -f default-user-queue.yaml

Quick tip: you can apply all three of the above at once using the example bundle from the project.

kubectl apply -f https://kueue.sigs.k8s.io/examples/admin/single-clusterqueue-setup.yaml

🚀 Step 5: Submit your first Job

This is the only change your users need to make to an existing Job. Add the kueue.x-k8s.io/queue-name label pointing to the LocalQueue, and make sure each pod declares resource requests.

# sample-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        command: ["/bin/sh"]
        args: ["-c", "sleep 60"]
        resources:
          requests:
            cpu: "1"
            memory: "200Mi"
      restartPolicy: Never

Notice that you do not need to set the job to suspended yourself. Kueue manages suspension for you through a webhook and decides the best moment to start it. 🪄

Create the job.

kubectl create -f sample-job.yaml

🔭 Step 6: Watch the queue work

List your local queues. The short alias queues works too, so kubectl -n default get queues gives the same list.

kubectl -n default get localqueues
NAME         CLUSTERQUEUE    PENDING WORKLOADS
user-queue   cluster-queue   0

Kueue creates a Workload object for your job. Have a look.

kubectl -n default get workloads.kueue.x-k8s.io
NAME               QUEUE        RESERVED IN     ADMITTED   AGE
sample-job-xxxxx   user-queue   cluster-queue   True       3s

Want the full story? Describe the workload. When there is not enough quota, you will see it sit unadmitted with a clear message.

kubectl -n default describe workload sample-job-xxxxx
Status:
  Conditions:
    Message: workload didn't fit
    Reason:  Pending
    Status:  False
    Type:    Admitted

The moment quota frees up, Kueue admits it automatically. If you describe the Job itself, the event timeline tells the whole story.

Events:
  Type    Reason             From                  Message
  ----    ------             ----                  -------
  Normal  Suspended          job-controller        Job suspended
  Normal  CreatedWorkload    kueue-job-controller  Created Workload: default/sample-job-xxxxx
  Normal  Started            kueue-job-controller  Admitted by clusterQueue cluster-queue
  Normal  Resumed            job-controller        Job resumed
  Normal  Completed          job-controller        Job completed

No babysitting required. 🎉

🥇 Example: priority-based admission

Inside a queue, not all jobs are equal. With a WorkloadPriorityClass you can control admission and preemption priority independently from pod priority. Production training jumps the line ahead of throwaway experiments. 🏎️

First create the priority class.

# sample-priority.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: WorkloadPriorityClass
metadata:
  name: sample-priority
value: 10000
description: "Sample priority"

Then point a Job at it with the kueue.x-k8s.io/priority-class label.

# priority-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: sample-priority
spec:
  parallelism: 3
  completions: 3
  suspend: true  # optional: Kueue's webhook suspends the Job for you anyway
  template:
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
      restartPolicy: Never

A higher value means higher priority for queueing and preemption. The neat part is that this priority does not touch the pod priority, so it does not interfere with your normal Kubernetes scheduling. 👌

✂️ Example: partial admission

Sometimes a big job can still make progress with fewer pods. With the kueue.x-k8s.io/job-min-parallelism annotation, Kueue can admit the job at a reduced parallelism instead of leaving it Pending.

# partial-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job-partial-admission
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
  annotations:
    kueue.x-k8s.io/job-min-parallelism: "5"
spec:
  parallelism: 20
  completions: 20
  template:
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["entrypoint-tester", "hello", "world"]
        resources:
          requests:
            cpu: 1
            memory: "200Mi"
      restartPolicy: Never

If only 9 cpu is free, this job is admitted with parallelism 9 instead of waiting for all 20. The completions count stays the same. 🙌
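To make the arithmetic concrete, here is a deliberately simplified model of that decision in plain shell. It is not Kueue's actual algorithm, just the numbers from the manifest above plugged into "as many pods as fit, but never fewer than the minimum". The variable names are mine.

```shell
# Simplified model of partial admission; not Kueue's real implementation.
free_cpu=9          # unused cpu quota in the ClusterQueue (from this example)
cpu_per_pod=1       # resources.requests.cpu of each pod
parallelism=20      # spec.parallelism of the Job
min_parallelism=5   # kueue.x-k8s.io/job-min-parallelism annotation

# How many pods fit into the free quota, capped at what the Job asked for.
fits=$(( free_cpu / cpu_per_pod ))
if [ "$fits" -gt "$parallelism" ]; then
  fits=$parallelism
fi

# Admit at reduced parallelism only if the minimum can be met.
if [ "$fits" -ge "$min_parallelism" ]; then
  result="admitted with parallelism $fits"
else
  result="stays queued"
fi
echo "$result"
```

With these values it prints "admitted with parallelism 9". Drop free_cpu to 4 and the job stays queued, because 4 pods is below the minimum of 5.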

📈 Example: elastic jobs

Elastic jobs let you change a running Job's parallelism without recreating, restarting, or suspending it. This is an alpha feature, so you must enable the ElasticJobsViaWorkloadSlices feature gate and annotate the Job.

# elastic-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-elastic-job
  namespace: default
  annotations:
    kueue.x-k8s.io/elastic-job: "true"
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 3
  completions: 100
  template:
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        command: ["/bin/sh"]
        args: ["-c", "sleep 60"]
        resources:
          requests:
            cpu: "100m"
            memory: "100Mi"
      restartPolicy: Never

When you bump parallelism up, Kueue creates a new admitted Workload for the new pod count and marks the old one as Finished. When you scale down, the extra pods terminate and no new Workload is created. Smooth. 🧘
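If you want to try the scaling itself, one option is to patch spec.parallelism on the running Job. This is a sketch under the assumption that the feature gate is already enabled and the Job above has been admitted; scale_job is a hypothetical helper name, and kubectl edit works just as well.

```shell
# scale_job is a hypothetical helper: it patches spec.parallelism in place
# on a Job in the default namespace, without recreating the Job.
scale_job() {
  kubectl -n default patch job "$1" --type=merge \
    -p "{\"spec\":{\"parallelism\":$2}}"
}

# Usage: scale_job sample-elastic-job 5
```

After scaling up, listing the workloads in the namespace should show the replacement Workload described above.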

🧱 Example: multiple resource flavors

Real clusters often mix node types. Say you have x86 and arm nodes labelled with cpu-arch. You can create one flavor per architecture.

# flavor-x86.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: "x86"
spec:
  nodeLabels:
    cpu-arch: x86

# flavor-arm.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: "arm"
spec:
  nodeLabels:
    cpu-arch: arm

Then reference both in a single ClusterQueue. Here cpu is split across the two architectures, while memory uses the simple default flavor because we do not care which architecture provides it.

# cluster-queue-multi.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}  # match all
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: "x86"
      resources:
      - name: "cpu"
        nominalQuota: 9
    - name: "arm"
      resources:
      - name: "cpu"
        nominalQuota: 12
  - coveredResources: ["memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "memory"
        nominalQuota: 84Gi

The labels in the ResourceFlavor must match the labels on your nodes. If you use the cluster autoscaler, make sure it adds those labels to new nodes too. 🏷️
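On a small cluster or homelab you can add the cpu-arch label by hand. A minimal sketch, where label_arch is a helper name I invented and worker-1 is a placeholder node name:

```shell
# label_arch is a hypothetical helper: it sets the cpu-arch label that the
# x86 and arm ResourceFlavors above select on.
label_arch() {
  kubectl label node "$1" "cpu-arch=$2" --overwrite
}

# Usage (worker-1 and worker-2 are placeholder node names):
#   label_arch worker-1 x86
#   label_arch worker-2 arm
```

You can check the result with kubectl get nodes -L cpu-arch, which adds the label as a column.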

👥 Example: fair sharing between teams

This is where Kueue really shines. Put two ClusterQueues in the same cohort and they can borrow each other's unused quota.

# team-a-cq.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "team-a-cq"
spec:
  namespaceSelector: {}
  cohortName: "team-ab"
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
        borrowingLimit: 6
      - name: "memory"
        nominalQuota: 36Gi
        borrowingLimit: 24Gi

# team-b-cq.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "team-b-cq"
spec:
  namespaceSelector: {}
  cohortName: "team-ab"
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 12
      - name: "memory"
        nominalQuota: 48Gi

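Both queues belong to the cohort team-ab. Team A has its own guaranteed quota, but it can also borrow idle capacity from Team B, up to the borrowingLimit of 6 cpu and 24Gi of memory. When Team B needs its capacity back, Kueue can reclaim it by preempting the borrowing workloads, provided Team B's ClusterQueue has a preemption policy that allows reclaiming within the cohort. Without one, Team B's jobs wait until the borrowed capacity frees up. ⚖️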
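Here is the borrowing math for cpu spelled out in shell. It is only a back-of-the-envelope sketch with the numbers from the two manifests above, and it assumes Team B is completely idle. The variable names are mine.

```shell
# Worked example of Team A's cpu ceiling; values come from the manifests above.
nominal_cpu=9           # team-a-cq nominalQuota
borrowing_limit_cpu=6   # team-a-cq borrowingLimit
team_b_idle_cpu=12      # assumption: all of team-b-cq's nominal quota is unused

# Team A can borrow the smaller of its limit and what is actually idle.
borrowable=$borrowing_limit_cpu
if [ "$team_b_idle_cpu" -lt "$borrowable" ]; then
  borrowable=$team_b_idle_cpu
fi

max_cpu=$(( nominal_cpu + borrowable ))
echo "Team A can use up to $max_cpu cpu"
```

This prints 15 cpu. If Team B were already using 8 of its 12 cpu, only 4 would be idle and Team A's ceiling would drop to 13.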

🎯 Example: dedicated quota with a shared fallback

A ClusterQueue can borrow from the cohort even when it has zero nominal quota for a flavor. This lets you give each team dedicated capacity on one flavor, plus a shared pool to fall back on.

# team-a-cq.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "team-a-cq"
spec:
  namespaceSelector: {}  # match all
  cohortName: "team-ab"
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: "arm"
      resources:
      - name: "cpu"
        nominalQuota: 9
        borrowingLimit: 0
    - name: "x86"
      resources:
      - name: "cpu"
        nominalQuota: 0
  - coveredResources: ["memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "memory"
        nominalQuota: 36Gi

# shared-cq.yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "shared-cq"
spec:
  namespaceSelector: {}  # match all
