☸️ Kubernetes Autoscaling

How Kubernetes Dynamically Adjusts Pods and Nodes: From HPA to Karpenter

Kubernetes autoscaling automatically adjusts the number of pods (HPA), the resources each pod requests (VPA), and the size of the cluster itself (Cluster Autoscaler, Karpenter). The goal is simple: match capacity to demand in real time, without over-provisioning that burns cloud budget or under-provisioning that causes latency spikes and OOM kills. The key insight is that metrics drive decisions: CPU utilization, memory pressure, request rates, and custom application signals all feed the autoscaler. Choosing the right metric and the right threshold is most of the work.

HPA Algorithm Visualizer

The Horizontal Pod Autoscaler adjusts replica count based on observed metrics. It compares current utilization with the target for each metric and scales up or down in discrete steps.

Example snapshot (stable state: current replicas = desired replicas, so no scaling is needed):

| Metric | Current | Target | Status |
| --- | --- | --- | --- |
| CPU | 45% | 70% | ✓ OK |
| Memory | 60% | 80% | ✓ OK |
| Replicas | 3 | 3 | Stable |

Metrics-Driven Scaling

What metric should you use? The wrong metric causes oscillation or slow response. Compare common patterns:

| Metric | Strength | Weakness | Rating |
| --- | --- | --- | --- |
| CPU Utilization | Simple, built-in, no extra setup | Bad for I/O-bound workloads (disk-intensive DBs) | ★★★★☆ |
| Memory Utilization | Essential for caches (Redis, ClickHouse) | Leaks cause gradual memory growth and eventual OOM kills | ★★★☆☆ |
| Requests per Second | Aligns directly with user-facing SLAs | Requires Prometheus Adapter, baseline calibration | ★★★★★ |
| Custom Metrics | Queue depth, p99 latency, business KPIs | Complex setup, can introduce oscillations | ★★★☆☆ |

CPU Utilization

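CPU is a resource metric, served by metrics-server through the resource metrics API. The HPA controller queries the metrics API every 15 seconds (configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period) and computes desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). Because of the default tolerance of 0.1, it scales up only when currentMetricValue / targetMetricValue > 1.1 and scales down only when that ratio is < 0.9. Scale-down also uses a stabilization window (default 5 minutes) to prevent thrashing during rapid fluctuations; scale-up has no stabilization window by default.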

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clickhouse-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clickhouse
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
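
The scaling rule described above can be sketched in a few lines of Python. This is a simplified model of the documented formula, not the controller's actual code: it ignores pod readiness, missing metrics, and the minReplicas/maxReplicas clamp. The function name `desired_replicas` and the dictionary layout are illustrative assumptions.

```python
import math

def desired_replicas(current_replicas, metrics, tolerance=0.1):
    """Return the replica count the HPA rule would propose.

    metrics maps a metric name to a (current_value, target_value) pair.
    """
    proposals = []
    for current, target in metrics.values():
        ratio = current / target
        if abs(ratio - 1.0) <= tolerance:
            # Inside the tolerance band: this metric proposes no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # With several metrics, the largest proposal wins.
    return max(proposals)

# Snapshot from the visualizer: CPU 45% vs 70%, memory 60% vs 80%, 3 replicas.
snapshot = {"cpu": (45, 70), "memory": (60, 80)}
print(desired_replicas(3, snapshot))            # CPU alone proposes 2, memory proposes 3
print(desired_replicas(3, {"cpu": (90, 70)}))   # ratio 1.29 is outside the band
```

In the snapshot, CPU alone would allow a scale-down to 2 replicas, but memory still needs 3, so the replica count stays where it is.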

VPA vs HPA: When to Use Which

Vertical Pod Autoscaling adjusts resource requests/limits on existing pods. HPA adds/removes pods. They address different problems.

| Dimension | HPA | VPA | Together (with CA) |
| --- | --- | --- | --- |
| What changes | Replica count | CPU/memory requests | Pods + node count |
| Use case | Traffic spikes, request spikes | Memory leaks, growing caches | All of the above |
| Disruption | Pods removed on scale-down (rate configurable) | Pod eviction & restart | Multiple pod changes |
| Conflicts | VPA in "Auto" mode | HPA on same metric | Requires VPA "Off" mode |
| Best for | API servers, stateless services | Databases, caches, stateful pods | Full-stack auto-scaling |
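⚠️ VPA in "Auto" mode and an HPA that scales on CPU or memory conflict when both target the same workload, because each reacts to the other's changes. Use VPA in "Off" or "Initial" mode when such an HPA is active. In "Off" mode, the recommendations computed by vpa-recommender are published in the VPA object's status and can be reviewed and applied manually; in "Initial" mode, the VPA admission webhook applies them only when a pod is created.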
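
The conflict rule in the warning above could be checked before deployment with a small lint. The sketch below assumes the HPA metrics are available as a list of dictionaries shaped like the `metrics` entries of an autoscaling/v2 manifest; the function name `find_conflicts` and its parameters are hypothetical, not part of any Kubernetes tooling.

```python
def find_conflicts(hpa_metrics, vpa_update_mode, vpa_resources=("cpu", "memory")):
    """Return the resources that both HPA and VPA would try to act on."""
    # In "Off" or "Initial" mode VPA does not evict running pods,
    # so the text above treats these modes as safe next to an HPA.
    if vpa_update_mode in ("Off", "Initial"):
        return []
    hpa_resources = {
        m["resource"]["name"]
        for m in hpa_metrics
        if m.get("type") == "Resource"
    }
    return sorted(hpa_resources & set(vpa_resources))

cpu_hpa = [{"type": "Resource", "resource": {"name": "cpu"}}]
rate_hpa = [{"type": "External", "external": {"metric": {"name": "http_requests_total"}}}]

print(find_conflicts(cpu_hpa, "Auto"))   # both autoscalers react to CPU
print(find_conflicts(cpu_hpa, "Off"))    # VPA only recommends
print(find_conflicts(rate_hpa, "Auto"))  # HPA scales on an external metric
```

An empty list means the pairing follows the guidance above; it does not prove the combination behaves well under load.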

ClickHouse Workload: A Real Autoscaling Story

ClickHouse is stateful, memory-hungry, and I/O-intensive, which makes it a tricky target for autoscaling. Here's a production-grade setup:

Ingress (Load Balancer) → Query Gateway
↓
Compute Layer (HPA): clickhouse-server × 3–30
↓
Storage Layer (Fixed): S3 / MinIO (data never moves)

| Setting | Value | Rationale |
| --- | --- | --- |
| HPA Metric | HTTP requests/sec per pod | Not CPU: ClickHouse CPU spikes are query-bound, not pod-bound |
| Scale-up trigger | > 500 req/s per pod for 30s | |
| Scale-down trigger | < 100 req/s per pod for 5 min | |
| VPA Memory | Initial: 8Gi, Max: 64Gi | ClickHouse buffers are sized to memory; more memory = bigger query buffers |
| Stabilization | 5-minute stabilization window | Prevents scale oscillation during batch query bursts |

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clickhouse-query-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clickhouse-query
  minReplicas: 3
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100   # Double pods in one step max
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1      # Remove max 1 pod per step
        periodSeconds: 60
  metrics:
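  - type: External
    external:
      metric:
        # Assumes the metrics adapter exposes this as a per-second rate
        # (for example rate(http_requests_total[1m])), not the raw counter.
        name: http_requests_total
        selector:
          matchLabels:
            service: clickhouse-query
      target:
        type: AverageValue      # Metric value is divided by the replica count
        averageValue: "500"     # Target: 500 req/s per pod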

Cluster-Level Scaling: Cluster Autoscaler vs Karpenter

HPA adjusts pods. Cluster Autoscaler and Karpenter adjust nodes. They work at different layers of the stack.

| Feature | Cluster Autoscaler | Karpenter |
| --- | --- | --- |
| Provisioner model | Node groups + node pools | Provisioner CRD (declarative) |
| Scale-up speed | ~30–60 seconds (node bootstrap) | ~15–30 seconds (direct launch) |
| Spot instance support | Yes (mixed instance pools) | Yes (native, flexible) |
| Bin-packing | Default bin-packing | More efficient (Grafana metrics) |
| Diversity | Limited to defined node groups | Any instance type on demand |
| Shutdown grace | 10 minutes | Configurable TTL expiry |
| Best for | Predictable, batch workloads | Dynamic, mixed workloads |

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: clickhouse-spot
spec:
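  ttlSecondsAfterEmpty: 120   # Remove a node once it has been empty for 2 min
  limits:
    resources:
      cpu: "256"
      memory: 512Gi
  provider:
    amiFamily: AL2
    # subnetSelector and securityGroupSelector are also needed here;
    # they are omitted because the source does not specify them.
  requirements:
    - key: node.kubernetes.io/instance-type   # Instance types are a requirement, not a provider field
      operator: In
      values: ["c6i.8xlarge", "c6i.16xlarge", "c6i.32xlarge"]
    - key: karpenter.sh/capacity-type          # Spot can be ~70% cheaper than on-demand
      operator: In
      values: ["spot"]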
    - key: node.kubernetes.io/lifecycle
      operator: In
      values: ["spot"]
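
The `limits` block in the Provisioner above caps the total capacity Karpenter may launch. A rough way to reason about that cap is sketched below, using the published vCPU and memory sizes of the three c6i instance types. The helper names are hypothetical, and Karpenter's own accounting may differ in detail, so treat the result as an estimate.

```python
# (vCPU, memory in GiB) per instance type, from the public EC2 c6i specifications.
INSTANCE_SHAPES = {
    "c6i.8xlarge": (32, 64),
    "c6i.16xlarge": (64, 128),
    "c6i.32xlarge": (128, 256),
}

def fits_limits(nodes, cpu_limit=256, memory_limit_gib=512):
    """Check whether a list of instance types stays within the provisioner limits."""
    cpu = sum(INSTANCE_SHAPES[n][0] for n in nodes)
    memory = sum(INSTANCE_SHAPES[n][1] for n in nodes)
    return cpu <= cpu_limit and memory <= memory_limit_gib

def max_nodes(instance_type, cpu_limit=256, memory_limit_gib=512):
    """Largest number of identical nodes that fits under both limits."""
    cpu, memory = INSTANCE_SHAPES[instance_type]
    return min(cpu_limit // cpu, memory_limit_gib // memory)

for name in INSTANCE_SHAPES:
    print(name, max_nodes(name))

print(fits_limits(["c6i.32xlarge", "c6i.16xlarge", "c6i.16xlarge"]))
print(fits_limits(["c6i.32xlarge", "c6i.32xlarge", "c6i.8xlarge"]))
```

With `cpu: "256"` and `memory: 512Gi`, the cap works out to eight c6i.8xlarge, four c6i.16xlarge, or two c6i.32xlarge nodes, which bounds how far a runaway HPA can push node spend.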

Autoscaling Anti-Patterns

These mistakes cause oscillation, OOM kills, or runaway costs:

❌ Scale on CPU with I/O-bound workloads

A ClickHouse pod doing heavy disk scans barely uses CPU but starves for I/O. CPU-based HPA won't respond. Use query rate or queue depth instead.

❌ No stabilization window

With stabilizationWindowSeconds set to 0 for scale-down, the HPA reacts to every spike and trough. This causes "flapping": rapid scale-up followed by scale-down, which churns pods and burns cloud budget.
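
The effect of a scale-down stabilization window can be illustrated with a short simulation. The sketch assumes the window is expressed as a number of recent recommendations (for example, 300 seconds at a 15-second sync period is 20 samples) and that the highest recommendation inside the window is applied, which is how the Kubernetes documentation describes scale-down stabilization. The function names are illustrative.

```python
from collections import deque

def stabilized_scale_down(recommendations, window):
    """Apply the highest recommendation seen in the last `window` samples."""
    recent = deque(maxlen=max(window, 1))
    applied = []
    for rec in recommendations:
        recent.append(rec)
        # Scale-ups pass through at once; scale-downs wait until every
        # sample in the window agrees that fewer replicas are enough.
        applied.append(max(recent))
    return applied

def count_changes(series):
    """Number of times the replica count changes between samples."""
    return sum(1 for a, b in zip(series, series[1:]) if a != b)

bursty = [10, 4, 10, 4, 10, 4]
print(count_changes(stabilized_scale_down(bursty, window=1)))  # no stabilization
print(count_changes(stabilized_scale_down(bursty, window=3)))  # short window
print(stabilized_scale_down([10, 4, 4, 4, 4], window=3))       # delayed scale-down
```

With a window of one sample the replica count changes on every evaluation; with a window of three it holds at 10 through the bursts and drops only after the load has stayed low for the whole window.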

❌ Scale-down to zero on stateful services

ClickHouse pods share data via S3, but query state (processing buffers, connection pools) lives in memory. Scaling to zero discards that warm state, and scaling back up is slow.

❌ Too-wide min/max range without budget controls

maxReplicas: 100 without spend guards can result in a $50K/month bill after a traffic spike. Always set a realistic maxReplicas, limit how fast scale-up can proceed with behavior.scaleUp (stabilizationWindowSeconds and policies), and configure cost alerts.
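
A back-of-the-envelope calculation makes the budget risk concrete. The sketch below estimates the worst-case monthly bill if the workload sits at maxReplicas for a whole month, and inverts the calculation to derive a maxReplicas from a budget. The pods-per-node figure and the $1.00 hourly node price are placeholder assumptions, not real quotes.

```python
import math

HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost_ceiling(max_replicas, pods_per_node, node_hourly_usd):
    """Worst-case monthly node cost if the workload stays at max_replicas."""
    nodes = math.ceil(max_replicas / pods_per_node)
    return nodes * node_hourly_usd * HOURS_PER_MONTH

def max_replicas_for_budget(budget_usd, pods_per_node, node_hourly_usd):
    """Largest replica count whose worst-case cost stays within the budget."""
    nodes = int(budget_usd // (node_hourly_usd * HOURS_PER_MONTH))
    return nodes * pods_per_node

# Hypothetical inputs: 2 pods per node, $1.00 per node-hour.
print(monthly_cost_ceiling(100, pods_per_node=2, node_hourly_usd=1.0))
print(max_replicas_for_budget(10_000, pods_per_node=2, node_hourly_usd=1.0))
```

Running the numbers in this direction, from budget to maxReplicas, tends to produce a tighter and more defensible ceiling than picking a round number.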

✅ Right metric: requests/sec per pod

For stateless API servers and ClickHouse query nodes, requests/sec divided by replica count gives a per-pod utilization that's stable, fast-responding, and directly tied to user experience.
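
The per-pod arithmetic above reduces to a single division. The sketch below assumes a target of 500 req/s per pod and the 3–30 replica range used in the ClickHouse example; the function name is illustrative, and the HPA's tolerance band is left out for clarity.

```python
import math

def replicas_for_rate(total_rps, target_rps_per_pod=500, min_replicas=3, max_replicas=30):
    """Replica count needed so that each pod handles at most the target rate."""
    wanted = math.ceil(total_rps / target_rps_per_pod)
    # Clamp to the configured minReplicas/maxReplicas range.
    return max(min_replicas, min(wanted, max_replicas))

print(replicas_for_rate(4200))    # 8.4 pods' worth of traffic
print(replicas_for_rate(900))     # below the minimum, so the floor applies
print(replicas_for_rate(20000))   # above the maximum, so the cap applies
```

Because the total request rate does not change when pods are added, this metric avoids the feedback loop that per-pod CPU can create, where adding replicas changes the very signal being measured.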

✅ Warm pool for latency-sensitive services

Keep minReplicas: 2 even at 3 AM for services with a latency SLA under 100 ms. Cold starts on new pods add 10–30 seconds of latency. The cost of idle replicas is almost always worth it.