☸️ Kubernetes Pod Scheduling

Filtering, Scoring & Binding — How Pods Find Their Home

When you create a Pod, the kube-scheduler decides which node to place it on. The scheduler runs a pipeline: first it filters out nodes that can't run the Pod (insufficient resources, untolerated taints, affinity mismatches), then it scores the remaining candidates (less-loaded nodes, balanced allocation, topology spread), and finally binds the Pod to the highest-scoring node. If no node passes filtering, the Pod stays Pending, and if its priority is high enough the scheduler may try to preempt lower-priority Pods to make room.

🔀 Scheduler Pipeline

The three phases every Pod goes through before it runs.

🔍 Filter: Remove unfit nodes
📊 Score: Rank remaining nodes
🔗 Bind: Assign Pod to winner
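
The three phases can be sketched as a small function. This is an illustrative model, not the kube-scheduler's actual code: the Node and Pod classes, their field names, and the cpu_fits and least_allocated_cpu helpers are all hypothetical, and only CPU is considered.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    name: str
    cpu_allocatable_m: int  # allocatable CPU in millicores (hypothetical field)
    cpu_requested_m: int    # sum of requests of Pods already on the node


@dataclass
class Pod:
    name: str
    cpu_request_m: int


FilterFn = Callable[[Pod, Node], bool]
ScoreFn = Callable[[Pod, Node], float]


def schedule(pod: Pod, nodes: List[Node], filters: List[FilterFn],
             scorers: List[Tuple[ScoreFn, int]]) -> Optional[str]:
    # Phase 1: filter. A node survives only if every filter accepts it.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # the Pod would stay Pending

    # Phase 2: score. Each scorer returns 0-100; results are weighted and summed.
    def weighted_total(node: Node) -> float:
        return sum(weight * score(pod, node) for score, weight in scorers)

    # Phase 3: bind. Here "binding" just means returning the winner's name.
    return max(feasible, key=weighted_total).name


def cpu_fits(pod: Pod, node: Node) -> bool:
    return node.cpu_requested_m + pod.cpu_request_m <= node.cpu_allocatable_m


def least_allocated_cpu(pod: Pod, node: Node) -> float:
    free = node.cpu_allocatable_m - node.cpu_requested_m - pod.cpu_request_m
    return 100 * free / node.cpu_allocatable_m


nodes = [Node("node-1", 4000, 3800), Node("node-2", 4000, 1000), Node("node-3", 4000, 2500)]
chosen = schedule(Pod("web", 500), nodes, [cpu_fits], [(least_allocated_cpu, 1)])
print(chosen)  # node-2
```

node-1 is filtered out because its remaining CPU is too small, and node-2 beats node-3 because more of its CPU is still unrequested.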

🔍 Filtering (Predicates)

Nodes that fail any predicate are eliminated. No exceptions.

📦 Resource Fit

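The node must have enough allocatable CPU and memory to satisfy the Pod's requests: the requests of the Pods already assigned to the node, plus the new Pod's requests, must not exceed the node's allocatable amount. Requests reserve capacity, while limits only cap usage at runtime, so limits play no part in this check. A node can be overcommitted on limits but never on requests.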
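
A minimal sketch of the resource-fit check, assuming quantities have already been normalized to plain integers (CPU in millicores, memory in MiB). The function name and the dictionary layout are illustrative, not part of any Kubernetes API.

```python
from typing import Dict, List

Resources = Dict[str, int]  # e.g. {"cpu": 500, "memory": 1024}


def fits(allocatable: Resources, scheduled_requests: List[Resources],
         pod_request: Resources) -> bool:
    """Return True if the Pod's requests fit next to the requests already placed."""
    for resource, wanted in pod_request.items():
        reserved = sum(r.get(resource, 0) for r in scheduled_requests)
        if reserved + wanted > allocatable.get(resource, 0):
            return False
    return True


allocatable = {"cpu": 4000, "memory": 8192}
existing = [{"cpu": 1500, "memory": 2048}, {"cpu": 2000, "memory": 1024}]

print(fits(allocatable, existing, {"cpu": 500, "memory": 4096}))  # True
print(fits(allocatable, existing, {"cpu": 600, "memory": 1024}))  # False
```

Only requests appear in the calculation. Limits are deliberately absent, which is why a node can end up overcommitted on limits.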

🏷️ Taints & Tolerations

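Nodes can have taints (e.g., dedicated=ml:NoSchedule). A Pod must tolerate every NoSchedule and NoExecute taint on a node, or the node is filtered out. A PreferNoSchedule taint is soft: it makes the node less attractive during scoring but doesn't exclude it. This keeps specialized nodes reserved for specific workloads.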
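
The matching rule can be sketched as follows. The Taint and Toleration classes here are simplified stand-ins for the real API objects (tolerationSeconds, for instance, is left out), and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key with operator Exists matches any key
    operator: str = "Equal"   # "Equal" or "Exists"
    value: str = ""
    effect: str = ""          # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def passes_taint_filter(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    # Only hard taints filter a node; PreferNoSchedule is handled during scoring.
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


ml_taint = Taint("dedicated", "ml", "NoSchedule")
ml_toleration = Toleration("dedicated", "Equal", "ml", "NoSchedule")

print(passes_taint_filter([ml_taint], []))               # False
print(passes_taint_filter([ml_taint], [ml_toleration]))  # True
```

Note that a toleration only allows a Pod onto the tainted node. It does not attract the Pod there; that takes node affinity or a node selector.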

📍 Node Affinity

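requiredDuringSchedulingIgnoredDuringExecution rules act as hard filters: the Pod must land on a node whose labels match the selector. preferredDuringSchedulingIgnoredDuringExecution rules are soft and only affect scoring. The "IgnoredDuringExecution" half means a Pod that is already running stays put if the node's labels change later.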
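
A sketch of how a hard node-affinity rule is evaluated against a node's labels, assuming the usual semantics that nodeSelectorTerms are ORed while the matchExpressions inside one term are ANDed. Only four operators are handled (Gt and Lt are omitted), the function names are hypothetical, and the gpu label is made up for the example.

```python
from typing import Dict, List


def expr_matches(labels: Dict[str, str], expr: dict) -> bool:
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not covered by this sketch: {op}")


def node_matches(labels: Dict[str, str], terms: List[dict]) -> bool:
    # Terms are ORed; expressions within a term are ANDed.
    # An empty term matches nothing.
    return any(
        bool(term.get("matchExpressions"))
        and all(expr_matches(labels, e) for e in term["matchExpressions"])
        for term in terms
    )


terms = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In",
     "values": ["us-east-1a", "us-east-1b"]},
    {"key": "gpu", "operator": "Exists"},
]}]

gpu_node = {"topology.kubernetes.io/zone": "us-east-1a", "gpu": "a100"}
cpu_node = {"topology.kubernetes.io/zone": "us-east-1a"}

print(node_matches(gpu_node, terms))  # True
print(node_matches(cpu_node, terms))  # False
```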

🔄 Pod Anti-Affinity

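requiredDuringSchedulingIgnoredDuringExecution anti-affinity prevents co-locating Pods within the topology domain named by topologyKey. For example, with topologyKey: kubernetes.io/hostname, two replicas of the same service can't land on the same node, which protects availability when a node fails.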
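
The anti-affinity check can be sketched like this. It is a simplification: it only evaluates the incoming Pod's own rule, uses plain equality label selectors, and treats a node without the topology label as never conflicting. The data layout and function name are hypothetical.

```python
from typing import Dict, List

NodeLabels = Dict[str, Dict[str, str]]  # node name -> labels


def violates_anti_affinity(candidate: str, nodes: NodeLabels, running_pods: List[dict],
                           selector: Dict[str, str], topology_key: str) -> bool:
    """True if a Pod matching `selector` already runs in the candidate's domain."""
    domain = nodes[candidate].get(topology_key)
    if domain is None:
        return False  # assumption: no topology label means no conflict
    for pod in running_pods:
        same_domain = nodes[pod["node"]].get(topology_key) == domain
        selected = all(pod["labels"].get(k) == v for k, v in selector.items())
        if same_domain and selected:
            return True
    return False


nodes = {
    "node-1": {"kubernetes.io/hostname": "node-1"},
    "node-2": {"kubernetes.io/hostname": "node-2"},
}
running = [{"node": "node-1", "labels": {"app": "web"}}]

for name in nodes:
    blocked = violates_anti_affinity(name, nodes, running, {"app": "web"},
                                     "kubernetes.io/hostname")
    print(name, "filtered out" if blocked else "eligible")
```

Swapping the topology key for a zone label would spread replicas across zones instead of across individual nodes.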

📊 Scoring (Priorities)

Surviving nodes are scored 0–100 on each priority, then weighted and summed.

⚖️ LeastRequestedPriority

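Prefers nodes with the most unrequested resources. Per resource, score = (allocatable - requested) / allocatable × 100, where "requested" is the sum of Pod requests on the node, not live usage; the per-resource scores are then averaged. This spreads load across the cluster. In current Kubernetes releases this behavior is the LeastAllocated scoring strategy of the NodeResourcesFit plugin.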
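
As a worked example of the formula, here is a small sketch that scores one node. It assumes equal weights for CPU and memory and floating-point arithmetic, so the numbers may differ slightly from what the real scheduler computes. The function name is illustrative.

```python
from typing import Dict


def least_allocated_score(allocatable: Dict[str, int], requested: Dict[str, int],
                          pod_request: Dict[str, int]) -> float:
    """Average over resources of (allocatable - requested) / allocatable * 100."""
    scores = []
    for resource, capacity in allocatable.items():
        # Count the incoming Pod as if it were already placed on the node.
        total = requested.get(resource, 0) + pod_request.get(resource, 0)
        if capacity == 0 or total > capacity:
            scores.append(0.0)
        else:
            scores.append((capacity - total) * 100 / capacity)
    return sum(scores) / len(scores)


allocatable = {"cpu": 4000, "memory": 8192}   # millicores, MiB
requested = {"cpu": 1000, "memory": 2048}     # already reserved on the node
pod = {"cpu": 1000, "memory": 2048}

print(least_allocated_score(allocatable, requested, pod))  # 50.0
```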

🎯 BalancedResourceAllocation

Prefers nodes where the requested fractions of CPU and memory are close to each other. Avoids nodes that are 90% committed on CPU but 10% on memory, because the leftover memory there is hard to hand out once the CPU is gone.
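
One simple way to express "balanced" as a number is to penalize the gap between the two requested fractions. The sketch below uses that gap directly; this is an assumed simplification for illustration, and the scheduler's own plugin uses a related but not identical formula.

```python
def balanced_allocation_score(cpu_fraction: float, mem_fraction: float) -> float:
    """Score 0-100; higher when CPU and memory are requested in similar proportion.

    Both arguments are requested / allocatable, expected to lie in [0, 1].
    """
    return (1 - abs(cpu_fraction - mem_fraction)) * 100


lopsided = balanced_allocation_score(0.9, 0.1)   # 90% CPU, 10% memory
even = balanced_allocation_score(0.5, 0.5)       # both at 50%

print(round(lopsided, 1), round(even, 1))  # 20.0 100.0
```

The lopsided node from the text scores far lower than the evenly loaded one, even though both have the same average commitment.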

🧲 InterPodAffinity

Scores higher when preferredDuringSchedulingIgnoredDuringExecution Pod affinity rules are satisfied. Useful for co-locating tightly coupled services (e.g., app + cache).

⚡ Preemption

When no node passes filtering, the scheduler may evict lower-priority Pods.

How It Works

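A Pod's priority comes from its PriorityClass. User-defined classes can go up to 1,000,000,000; higher values are reserved for system-critical classes, and a Pod with no PriorityClass gets priority 0 unless a class is marked globalDefault. When a high-priority Pod can't be scheduled, the scheduler looks for a node where evicting lower-priority Pods would free enough resources. The victims get their termination grace period (terminationGracePeriodSeconds), then are deleted. The scheduler records the chosen node in the pending Pod's status.nominatedNodeName, and the Pod is scheduled once the resources are free, usually but not necessarily on that node.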
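
Victim selection on a single node can be sketched as below. The approach shown (remove every lower-priority Pod, then let back as many as still fit, highest priority first) is a simplified model that looks at CPU only and ignores PodDisruptionBudgets and node ranking. The function name, the tuple layout, and the Pod names are hypothetical.

```python
from typing import List, Optional, Tuple

PodInfo = Tuple[str, int, int]  # (name, priority, cpu request in millicores)


def pick_victims(allocatable_cpu: int, pods: List[PodInfo],
                 preemptor_cpu: int, preemptor_priority: int) -> Optional[List[str]]:
    """Return the Pods to evict so the preemptor fits, or None if it can't fit."""
    untouchable = [p for p in pods if p[1] >= preemptor_priority]
    candidates = sorted((p for p in pods if p[1] < preemptor_priority),
                        key=lambda p: p[1], reverse=True)

    used = sum(p[2] for p in untouchable) + preemptor_cpu
    if used > allocatable_cpu:
        return None  # evicting every lower-priority Pod still isn't enough

    victims = []
    for name, _priority, cpu in candidates:
        if used + cpu <= allocatable_cpu:
            used += cpu           # this Pod can stay
        else:
            victims.append(name)  # this Pod must go
    return victims


pods = [("batch-a", 100, 1000), ("batch-b", 50, 1500), ("db", 1000, 1000)]
print(pick_victims(4000, pods, preemptor_cpu=2000, preemptor_priority=500))
# ['batch-b']
```

With 4000 millicores allocatable, the preemptor and db take 3000, batch-a still fits in the remaining 1000, and only batch-b has to be evicted.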

🌐 Topology Spread Constraints

Distribute Pods evenly across failure domains (zones, racks, nodes).

💡 maxSkew

maxSkew defines the maximum allowed difference in matching Pod count between any topology domain and the least-populated domain. With maxSkew: 1 and 3 zones, scheduling 6 replicas gives 2 per zone. If one zone already has 3 Pods and another has 1, the next Pod can only go where it keeps the skew within bounds, which means the zone with fewer Pods. whenUnsatisfiable: DoNotSchedule makes this a hard constraint; ScheduleAnyway makes it a soft preference.
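
The skew arithmetic can be checked with a short sketch. It models only the DoNotSchedule case, counts Pods per zone in a plain dictionary, and assumes every zone is eligible. The function names and zone names are illustrative.

```python
from typing import Dict, List


def skew_if_placed(counts: Dict[str, int], zone: str) -> int:
    """Skew of `zone` after adding one Pod, relative to the least-populated zone."""
    return counts[zone] + 1 - min(counts.values())


def allowed_zones(counts: Dict[str, int], max_skew: int) -> List[str]:
    return [z for z in counts if skew_if_placed(counts, z) <= max_skew]


def spread(replicas: int, zones: List[str], max_skew: int = 1) -> Dict[str, int]:
    """Place replicas one at a time, always into the emptiest allowed zone."""
    counts = {z: 0 for z in zones}
    for _ in range(replicas):
        target = min(allowed_zones(counts, max_skew), key=lambda z: counts[z])
        counts[target] += 1
    return counts


print(spread(6, ["zone-a", "zone-b", "zone-c"]))
# {'zone-a': 2, 'zone-b': 2, 'zone-c': 2}

print(allowed_zones({"zone-a": 3, "zone-b": 1, "zone-c": 2}, max_skew=1))
# ['zone-b']
```

The second call mirrors the uneven case from the text: only the zone with the fewest Pods keeps the skew at 1 or below.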

📏 Requests vs Limits

📋 Requests

The guaranteed amount of resources. Requests are set per container, and the scheduler uses the Pod's total requests to decide placement. A node's allocatable capacity minus the requests of all Pods on it = free capacity. Pods are guaranteed their requested resources.

🚀 Limits

The maximum a container can use. If a container exceeds its memory limit, it's OOM-killed; exceeding a CPU limit causes throttling instead of a kill. A node can have total limits > capacity (overcommitted), which is fine until Pods actually try to use it all simultaneously.
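
To make the overcommit point concrete, this sketch compares the sum of requests and the sum of limits against a node's allocatable memory. The values are invented for the example and the function name is hypothetical.

```python
from typing import Dict, List


def commitment(allocatable: int, pods: List[Dict[str, int]]) -> Dict[str, float]:
    """Percentage of allocatable capacity claimed by requests and by limits."""
    total_requests = sum(p["request"] for p in pods)
    total_limits = sum(p["limit"] for p in pods)
    return {
        "requests_pct": 100 * total_requests / allocatable,
        "limits_pct": 100 * total_limits / allocatable,
    }


# Three Pods on a node with 8192 MiB allocatable memory.
pods = [{"request": 2048, "limit": 4096}] * 3
print(commitment(8192, pods))
# {'requests_pct': 75.0, 'limits_pct': 150.0}
```

The scheduler is satisfied because requests total 75% of the node, yet the Pods may collectively try to use 150%. If they do, the node comes under memory pressure and some Pods get evicted or OOM-killed.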

⚠️ QoS Classes

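Guaranteed: every container sets both CPU and memory requests and limits, with requests == limits.
Burstable: at least one container sets a request or limit, but the Pod doesn't meet the Guaranteed criteria.
BestEffort: no requests or limits set on any container; first to be evicted under node pressure.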
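
The classification rules can be sketched as a function over a Pod's containers. It assumes quantities are already normalized so that equal amounts compare equal (the strings "500m" and "0.5" would not), and it mirrors the API's defaulting of a missing request to the limit. The function name and input layout are hypothetical.

```python
from typing import Dict, List

Container = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Container]) -> str:
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits": {"cpu": "500m", "memory": "1Gi"}}]))   # Guaranteed
print(qos_class([{"requests": {"cpu": "250m"}}]))                  # Burstable
print(qos_class([{}]))                                             # BestEffort
```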