DEV Community 2h ago

Kubernetes resource requests and limits explained: scheduling, throttling, and OOMKill

This is part of the Platform engineering with Go series: a growing collection of posts on Kubernetes, Go tooling, and infrastructure automation.

The 3am incident nobody talks about

It's 3am. Your on-call phone goes off. A service is down in production. You log in, check the pods, and see this:

kubectl get pods -n production
NAME                     READY   STATUS      RESTARTS   AGE
api-7d6b9f8c4-xk2pq     0/1     OOMKilled   14         2d
api-7d6b9f8c4-mn9rt     0/1     OOMKilled   14         2d
api-7d6b9f8c4-p8wvz     0/1     OOMKilled   14         2d

You restart the pods. They come back. Five minutes later, they die again.

The problem could be several things: limits set too low for the actual workload, no limits set at all so the pod consumed memory freely until the node ran out, a traffic spike the pod wasn't provisioned for, or yes, a memory leak in the application itself.

In all cases, the result is the same: memory consumption exceeded the allowed ceiling, the OOM killer fired, the pod died, and Kubernetes restarted it into the same situation. Fourteen times.

This is one of the most common production incidents in Kubernetes, and one of the most preventable. But preventing it requires understanding what requests and limits actually do, what happens when memory consumption exceeds them, and how to configure them correctly for your actual workload. That's what this post is about.

No Go knowledge required: this post contains no Go code.

Requests vs limits, two completely different things

Before anything else, the most important concept to internalize: requests and limits are not two versions of the same thing. They serve entirely different purposes and are enforced by entirely different components of Kubernetes.

# A pod spec with both requests and limits set
resources:
  requests:
    cpu: "250m"        # 250 millicores = 0.25 CPU cores
    memory: "256Mi"    # 256 mebibytes
  limits:
    cpu: "500m"        # 500 millicores = 0.5 CPU cores
    memory: "512Mi"    # 512 mebibytes

Here's what each one actually does:

Requests are a promise to the scheduler. When you set requests.memory: 256Mi, you're telling Kubernetes: "I need at least 256Mi of memory reserved for this pod on whatever node it runs on." The scheduler uses this value to decide which node to place the pod on. Once scheduled, that 256Mi is considered reserved on that node, even if the pod is only using 50Mi at the moment.
Limits are a ceiling enforced by the runtime. When you set limits.memory: 512Mi, you're telling Kubernetes: "This pod is never allowed to use more than 512Mi of memory." If it tries to exceed that ceiling, the kernel kills it. No warning. No graceful shutdown. Just gone.

The key insight: requests affect scheduling; limits affect runtime behavior. They are read by different components at different times for entirely different reasons.

How the scheduler uses requests to place pods

The Kubernetes scheduler's job is to find a node for each new pod. It does this by looking at each node's allocatable resources and comparing them against the sum of all pod requests already scheduled there.

A node's allocatable resources are not the same as its total capacity. Some resources are always reserved for the operating system and Kubernetes system components:

kubectl describe node my-node
# Look for the Allocatable section:
Allocatable:
  cpu:    3800m   # 3.8 cores available to pods (out of 4 total)
  memory: 7Gi     # 7Gi available to pods (out of 8Gi total)

The scheduler adds up the requests of all pods already on a node and compares that sum against the allocatable resources. If the remaining capacity is less than the new pod's request, the node is skipped.

Here's the critical subtlety: the scheduler cares about requests, not actual usage.

Node: 4 CPU allocatable
Pod A: requests 1 CPU → actual usage: 0.2 CPU
Pod B: requests 1 CPU → actual usage: 0.1 CPU
Pod C: requests 1 CPU → actual usage: 0.8 CPU

From the scheduler's perspective: 3 out of 4 CPU are "used"
From the kernel's perspective: only 1.1 CPU are actually being consumed

New pod requesting 1.5 CPU → scheduler says NO (only 1 CPU remaining)
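The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration of the request-based fit check, not the real scheduler (which applies many more filters); the function name `fits` and the use of plain millicore integers are assumptions made for this sketch.

```python
def fits(allocatable_millicores, scheduled_requests, new_request):
    """Return True if the new pod's CPU request fits into what is not yet reserved.

    Only requests are considered. Actual usage never enters the calculation.
    """
    reserved = sum(scheduled_requests)
    return new_request <= allocatable_millicores - reserved


# Numbers from the example above, expressed in millicores.
node_allocatable = 4000
existing_requests = [1000, 1000, 1000]  # Pods A, B and C

print(fits(node_allocatable, existing_requests, 1500))  # False: only 1000m unreserved
print(fits(node_allocatable, existing_requests, 1000))  # True: exactly fills the node
```

Notice that the 1.1 CPU of real usage appears nowhere in the function. That is the whole point: a node can be mostly idle and still be "full" as far as scheduling is concerned.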

This creates an important tension: if your requests are set too high relative to actual usage, your nodes look full when they're actually mostly idle, and new pods can't be scheduled. This is called poor bin packing, and it wastes money.

On the other hand, if your requests are too low, too many pods get scheduled onto the same node. When they all start consuming resources simultaneously, the node becomes overloaded, and Kubernetes starts evicting pods to relieve the pressure.

Getting requests right is a balancing act between cost efficiency and stability.

CPU limits and throttling: the silent killer

CPU is what Kubernetes calls a compressible resource. If a pod tries to use more CPU than its limit allows, the Linux kernel doesn't kill it; it throttles it. The pod keeps running, but it gets fewer CPU cycles, so everything it does takes longer.

The throttling mechanism is the quota enforcement built into the Linux CFS (Completely Fair Scheduler).

Understanding CPU cycles and scheduling periods

Before we talk about how throttling works, it helps to understand two concepts that are invisible in day-to-day operations but fundamental to what's happening under the hood.

What is a CPU cycle? Your server's CPU is constantly doing work: executing instructions, processing data, running code. A CPU cycle is the smallest unit of that work. A modern CPU completes billions of cycles per second; that's what the "3.2GHz" on a server spec means: 3.2 billion cycles per second.

Think of CPU cycles like minutes of attention from a very fast worker. Your pod's processes need a certain number of those "minutes" to do their job, handle a request, run a query, process a message. The more cycles your pod gets, the faster it runs. The fewer it gets, the slower it runs.

When Kubernetes talks about CPU in millicores (250m, 500m, 1000m), it's describing what fraction of one CPU core's cycles your pod gets access to:

1000m = 1 full CPU core = 100% of one core's cycles
500m = 0.5 CPU core = 50% of one core's cycles
250m = 0.25 CPU core = 25% of one core's cycles

What is a scheduling period? A CPU doesn't serve one process at a time from start to finish. It slices time into tiny windows and gives each process a turn. This is called time-sharing, and the windows are called scheduling periods.

Think of it like a teacher in a classroom. Instead of helping one student for the entire class, the teacher spends 5 minutes with each student in rotation. Every student gets attention, but no single student monopolizes the teacher's time.

For enforcing CPU limits, the Linux CFS uses periods of 100 milliseconds by default. In every 100ms window, each pod may consume CPU time only up to the share its limit allows. The picture below is simplified to a single core:

100ms scheduling period
│
├── 0ms - 25ms     → Pod A gets its turn (250m limit = 25% of 100ms = 25ms)
├── 25ms - 75ms    → Pod B gets its turn (500m limit = 50% of 100ms = 50ms)
├── 75ms - 100ms   → Pod C gets its turn (250m limit = 25% of 100ms = 25ms)
│
└── (next 100ms period starts)

Each pod's CPU limit determines how many milliseconds of that 100ms window it's allowed to use:

CPU limit of 250m → 25ms of CPU time per 100ms period
CPU limit of 500m → 50ms of CPU time per 100ms period
CPU limit of 1000m (1 full core) → 100ms of CPU time per 100ms period
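The conversion behind that list is a single multiplication. Here is a minimal sketch, assuming the default 100ms period; the names `CFS_PERIOD_MS` and `quota_ms` are made up for this example.

```python
CFS_PERIOD_MS = 100  # default period described above (assumed unchanged on the node)


def quota_ms(limit_millicores, period_ms=CFS_PERIOD_MS):
    """CPU time (in ms) a container may consume per period for a given CPU limit."""
    return limit_millicores * period_ms / 1000


for limit in (250, 500, 1000):
    print(f"{limit}m -> {quota_ms(limit):g}ms per {CFS_PERIOD_MS}ms period")
```

The same formula works for limits above one core: a 2000m limit yields 200ms of CPU time per 100ms period, which a multi-threaded process can spend across several cores at once.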

What happens when a pod hits its limit mid-period?

Here's where throttling kicks in. If a pod uses up its entire allocation before the 100ms period ends, the CFS puts it in a throttled state for the rest of that period; it gets zero CPU cycles until the next period starts, regardless of whether other pods are idle.

Pod with 250m CPU limit → 25ms of allowed CPU time per 100ms period

Period 1 (0ms to 100ms):
├── 0ms:   Pod starts processing a request
├── 25ms:  Pod has used its full 25ms allocation ← throttled here
├── 25ms to 100ms: Pod sits idle, gets zero CPU cycles
└── 100ms: New period starts, pod gets another 25ms

What the user experiences:
├── Request arrives at 0ms
├── Pod processes half the request, then waits 75ms doing nothing
└── Response arrives much later than it should

The pod didn't crash. It didn't log an error. It just stopped making progress for 75ms out of every 100ms, which is why a throttled service feels sluggish rather than broken. Everything works, just much slower than it should.
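To put a number on "much slower", here is a rough model of that timeline. It assumes a single-threaded request that starts exactly at the beginning of a period, on a pod that is doing nothing else; real traffic is messier, so treat the result as an estimate. The function name `wall_clock_ms` is hypothetical.

```python
import math


def wall_clock_ms(cpu_needed_ms, limit_millicores, period_ms=100):
    """Estimate how long a request takes when it needs `cpu_needed_ms` of CPU time.

    Simplified model: the request runs until the quota for the current period
    is used up, waits for the next period, and repeats until it is done.
    """
    if cpu_needed_ms <= 0:
        return 0.0
    quota = limit_millicores * period_ms / 1000
    # Periods in which the quota is fully consumed before the request finishes.
    exhausted_periods = math.ceil(cpu_needed_ms / quota) - 1
    remaining = cpu_needed_ms - exhausted_periods * quota
    return exhausted_periods * period_ms + remaining


print(wall_clock_ms(50, 250))  # 125.0: 25ms of work, 75ms of waiting, 25ms of work
print(wall_clock_ms(50, 500))  # 50.0: fits in one period, no throttling
```

A request that needs 50ms of CPU takes 125ms under a 250m limit, two and a half times longer, without a single error being logged.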

A concrete analogy

Imagine you're writing a report and your manager says you can only use the shared laptop for 15 minutes every hour. You start writing, but at the 15-minute mark your access is cut off, even if you're mid-sentence. You sit and wait for the next hour to start before you can type another word.

That's exactly what the CFS does to a throttled pod. It doesn't care that you were in the middle of something important. When the quota is up, the process waits, and whatever request it was handling has to wait too.

Why this is hard to detect

The reason CPU throttling causes so much confusion is that it's invisible in all the usual places:

# This shows current usage - looks fine
kubectl top pods -n production
NAME                    CPU(cores)   MEMORY(bytes)
api-6d8f9b7c-xk2pq     240m         180Mi

# But the pod might be throttled 80% of the time
# kubectl top shows average usage, not whether that usage caused throttling

A pod using 240m CPU on average can still be heavily throttled if it regularly bursts above its limit within a single 100ms period. The average looks healthy; the latency tells a different story.
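A small simulation makes the gap between average usage and throttling concrete. This is a sketch under simplifying assumptions: demand is given as CPU milliseconds arriving per 100ms period, unfinished work carries over to the next period, and the demand pattern itself is invented for illustration. The name `simulate` is hypothetical.

```python
def simulate(demand_ms_per_period, limit_millicores, period_ms=100):
    """Return (average usage in millicores, fraction of periods throttled)."""
    quota = limit_millicores * period_ms / 1000
    backlog = 0.0      # work that could not run yet because the quota ran out
    used_total = 0.0
    throttled = 0
    for demand in demand_ms_per_period:
        pending = backlog + demand
        used = min(pending, quota)
        if pending > quota:
            throttled += 1  # quota exhausted before the work was finished
        backlog = pending - used
        used_total += used
    periods = len(demand_ms_per_period)
    average_millicores = used_total * 1000 / (periods * period_ms)
    return average_millicores, throttled / periods


# One burst of 120ms of work, then nothing for four periods, under a 500m limit.
average, ratio = simulate([120, 0, 0, 0, 0], limit_millicores=500)
print(f"{average:.0f}m average, throttled in {ratio:.0%} of periods")
# 240m average, throttled in 40% of periods
```

The average sits at less than half of the limit, yet the pod was throttled in two out of five periods. An averaged usage number simply cannot show this.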

This is what makes CPU throttling dangerous: it is completely silent. There's no error. No log line. No Kubernetes event. Your pod just slows down. Requests take longer. Latency increases. Users notice something is wrong, but nothing in your logs explains why.

The only reliable way to detect throttling is through the container_cpu_cfs_throttled_periods_total metric in Prometheus, which counts how many scheduling periods a container was throttled in. If that number is climbing, throttling is happening regardless of what kubectl top shows.
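A climbing counter is easier to judge as a ratio. The sketch below assumes you have two readings of container_cpu_cfs_throttled_periods_total and of its companion counter container_cpu_cfs_periods_total (total enforcement periods) taken some time apart; the sample values and the function name `throttle_ratio` are invented for illustration.

```python
def throttle_ratio(periods_before, throttled_before, periods_after, throttled_after):
    """Fraction of CFS periods in which the container was throttled between two samples."""
    elapsed_periods = periods_after - periods_before
    if elapsed_periods <= 0:
        return 0.0  # no periods elapsed (or the counter was reset): nothing to report
    return (throttled_after - throttled_before) / elapsed_periods


# Hypothetical readings taken one minute apart (600 periods of 100ms each).
print(throttle_ratio(1000, 100, 1600, 580))  # 0.8, i.e. throttled in 80% of periods
```

What counts as "too much" depends on the service, so it makes sense to watch this ratio next to your latency graphs rather than to pick a threshold in isolation.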

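For a rough first hint before you open Prometheus, you can compare current usage from the Kubernetes metrics server against the limit you configured (500m in this example):

# See current CPU usage (compare against the 500m limit by hand)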
kubectl top pods -n production
NAME                    CPU(cores)   MEMORY(bytes)
api-7d6b9f8c4-xk2pq    480m         210Mi
api-7d6b9f8c4-mn9rt    498m         198Mi

If you see pods consistently near their CPU limit, throttling is likely happening, especially if you're seeing latency spikes that don't correlate with error rates.

The CPU limits controversy

There's an ongoing debate in the Kubernetes community about CPU limits. Some platform teams remove CPU limits entirely for latency-sensitive services, allowing pods to burst freely as long as there's spare capacity on the node.

The argument for removing CPU limits:

Eliminates throttling completely
Workloads use spare node capacity efficiently
Latency becomes predictable because pods are never artificially slowed

The argument for keeping CPU limits:

A noisy neighbor pod can consume all spare CPU and starve other pods
Without limits, a bug in one service can degrade the entire node

The right answer depends on your workload. For latency-sensitive APIs, consider removing CPU limits and relying on requests alone. For batch workloads, CPU limits are fine. For anything in between, measure throttling first before deciding.

Memory limits and OOMKill: the dangerous one

Memory is what Kubernetes calls an incompressible resource. Unlike CPU, the kernel cannot throttle memory access: it can't say "you only get 80% of the memory reads you asked for." Memory is binary: either the process has it, or it doesn't.

When a pod's memory usage exceeds its limit, the Linux OOM (Out of Memory) killer terminates one of its processes immediately. No warning. No graceful shutdown. The process is gone.

Kubernetes then sees that a container has exited unexpectedly and restarts it. When that keeps happening, Kubernetes waits longer and longer between restarts, which is where CrashLoopBackOff comes from.

# Detect OOMKill in pod description
kubectl describe pod api-7d6b9f8c4-xk2pq -n production

# Look for this in the output:
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Sun, 21 Jun 2026 02:14:32 +0000
  Finished:     Sun, 21 Jun 2026 02:14:33 +0000

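Exit code 137 is 128 + 9, which means the process was killed by signal 9 (SIGKILL). Combined with Reason: OOMKilled, it tells you the OOM killer sent that signal. If you see this, the container needed more memory than its limit allows: either the limit is too low for your actual workload, or the application is leaking memory.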
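The 137 is not a magic number: by shell convention, exit codes above 128 encode "128 plus the signal number". A minimal sketch of decoding it, using Python's standard signal module on Linux; the helper name `describe_exit_code` is made up for this example.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short human-readable description."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # not a named signal on this platform
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
print(describe_exit_code(143))  # killed by SIGTERM
print(describe_exit_code(1))    # exited with status 1
```

Keep in mind that 137 only tells you the process received SIGKILL. It is the OOMKilled reason in the pod description that identifies the OOM killer as the sender.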

OOMKill at the node level

OOMKill can also happen at the node level, independently of your pod limits. If total memory consumption across all pods on a node approaches the node's total capacity, the Linux kernel's node-level OOM killer activates.

In this case, Kubernetes tries not to wait for the OOM killer. The kubelet has its own eviction mechanism: if available memory on a node drops below a configured threshold, it starts evicting pods proactively. Which pods get evicted first is determined by QoS class, which we'll cover next.

QoS classes: who dies first under pressure

Kubernetes automatically assigns every pod a Quality of Service (QoS) class based on how its requests and limits are configured. You don't set this manually; it's derived. Under node pressure, Kubernetes evicts pods in order from lowest to highest QoS class.

There are three classes:

BestEffort: lowest priority

A pod is BestEffort when it has no requests or limits set at all:

# BestEffort, no resources section at all
resources: {}

BestEffort pods are the first to be evicted under node pressure. They get whatever resources happen to be available, and nothing is guaranteed. Never run production workloads as BestEffort.

Burstable: middle priority

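A pod is Burstable when at least one of its containers has a request or limit set, but the pod doesn't meet the criteria for Guaranteed. The most common case: requests are set, and limits are either not set or higher than requests: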

# Burstable, requests set, limits higher than requests
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Most production workloads should be Burstable. The pod has guaranteed minimum resources (the requests) but can burst above them when capacity is available. Under eviction pressure, Burstable pods are evicted after BestEffort but before Guaranteed.

Guaranteed: highest priority

A pod is Guaranteed when every container in it sets both CPU and memory limits, and each request equals its limit.
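Since the QoS class is derived rather than configured, the three rules can be written down as a small function. This is a simplified sketch, not the kubelet's actual code: it represents each container as a plain dict, compares quantity strings literally (so "500m" and "0.5" would count as different), and the name `qos_class` is made up for this example.

```python
def qos_class(containers):
    """Derive the QoS class from a list of per-container resource settings.

    Each container is a dict like {"requests": {...}, "limits": {...}}.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is set, Kubernetes uses it as the request too.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


burstable = {"requests": {"cpu": "250m", "memory": "256Mi"},
             "limits": {"cpu": "500m", "memory": "512Mi"}}
guaranteed = {"requests": {"cpu": "500m", "memory": "512Mi"},
              "limits": {"cpu": "500m", "memory": "512Mi"}}

print(qos_class([{}]))          # BestEffort
print(qos_class([burstable]))   # Burstable
print(qos_class([guaranteed]))  # Guaranteed
```

To check what Kubernetes actually assigned to a running pod, look at the status.qosClass field of the pod.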
