☸️ Kubernetes Autoscaling
How Kubernetes Dynamically Adjusts Pods and Nodes: From HPA to Karpenter
Kubernetes autoscaling automatically adjusts the number of pods (HPA), the resources each pod requests (VPA), and the size of the cluster itself (Cluster Autoscaler, Karpenter). The goal is simple: match capacity to demand in real time, without over-provisioning that burns cloud budget or under-provisioning that causes latency spikes and OOM kills. The key insight: metrics drive decisions. CPU utilization, memory pressure, request rates, and custom application signals all feed the autoscaler. Getting the right metrics and the right threshold is 80% of the work.
HPA Algorithm
The Horizontal Pod Autoscaler adjusts replica count based on observed metrics. It compares the current metric value against the desired target and scales up or down in discrete steps.
Metrics-Driven Scaling
What metric should you use? The wrong metric causes oscillation or slow response. Compare common patterns:
Resource metrics (CPU and memory) are supported by Kubernetes out of the box and served by metrics-server. The HPA controller queries the metrics API every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). It scales up when current / desired > 1.1 and scales down when current / desired < 0.9; inside that 10% tolerance band it leaves the replica count alone. Scale-down also uses a stabilization window (default 5 minutes) to prevent thrashing during rapid fluctuations.
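The scaling rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the controller's actual code: the function name desired_replicas and its parameters are made up for this example, and it ignores details such as pod readiness, missing metrics, and behavior policies. It uses the replica formula from the Kubernetes documentation, ceil(currentReplicas * currentMetric / targetMetric), with the default 10% tolerance.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=100, tolerance=0.1):
    """Return the replica count an HPA-style controller would aim for.

    Hypothetical helper for illustration; all names are assumptions.
    """
    ratio = current_metric / target_metric
    # Inside the tolerance band (0.9 to 1.1 by default) nothing changes.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Round up so the per-pod metric ends at or below the target.
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas / maxReplicas bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 pods at 90% CPU against a 70% target: 4 * 90 / 70 = 5.14, rounded up.
    print(desired_replicas(4, 90, 70, min_replicas=3, max_replicas=30))  # 6
    # 72% against 70% is a ratio of about 1.03, inside the tolerance band.
    print(desired_replicas(4, 72, 70, min_replicas=3, max_replicas=30))  # 4
```

Rounding up is the conservative choice: it accepts a slightly over-provisioned deployment in exchange for never landing above the target utilization.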
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clickhouse-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clickhouse
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

VPA vs HPA: When to Use Which
Vertical Pod Autoscaling adjusts resource requests and limits on existing pods. HPA adds and removes pods. They address different problems.
VPA recommendations are either applied manually or enforced by the vpa-updater together with an admission webhook.
ClickHouse Workload: A Real Autoscaling Story
ClickHouse is stateful, memory-hungry, and I/O-intensive, which makes it a tricky target for autoscaling. Here's a production-grade setup:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clickhouse-query-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clickhouse-query
  minReplicas: 3
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100   # Double pods in one step max
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1   # Remove max 1 pod per step
          periodSeconds: 60
  metrics:
    - type: External
      external:
        metric:
          name: http_requests_total
          selector:
            matchLabels:
              service: clickhouse-query
        target:
          type: AverageValue
          averageValue: "500"

Cluster-Level Scaling: Cluster Autoscaler vs Karpenter
HPA adjusts pods. Cluster Autoscaler and Karpenter adjust nodes. They work at different layers of the stack.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: clickhouse-spot
spec:
  ttlSecondsAfterEmpty: 120   # Remove empty nodes after 2 min idle
  limits:
    resources:
      cpu: "256"
      memory: 512Gi
  provider:
    amiFamily: AL2
    instanceTypes:
      - c6i.8xlarge
      - c6i.16xlarge
      - c6i.32xlarge
    capacityType: spot   # 70% cheaper than on-demand
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/lifecycle
      operator: In
      values: ["spot"]

Autoscaling Anti-Patterns
These mistakes cause oscillation, OOM kills, or runaway costs:
A ClickHouse pod doing heavy disk scans barely uses CPU but starves for I/O. CPU-based HPA won't respond. Use query rate or queue depth instead.
With stabilizationWindowSeconds set to 0, or set too short for the workload, the HPA reacts to every spike and trough. This causes "flapping": rapid scale-up followed by scale-down, which burns cloud budget.
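To make the effect of the window concrete, here is a rough Python sketch of scale-down stabilization: the controller remembers recent replica recommendations and acts on the highest one inside the window, so a brief dip in load cannot trigger an immediate scale-down. The class ScaleDownStabilizer and its method names are hypothetical, and timestamps are plain seconds supplied by the caller rather than a real clock.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keep recent recommendations and return the highest one in the window.

    Hypothetical illustration; not the real controller implementation.
    """

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, recommendation):
        self._history.append((now, recommendation))
        # Forget recommendations that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()
        # Acting on the maximum delays scale-down until the dip persists.
        return max(replicas for _, replicas in self._history)


if __name__ == "__main__":
    stabilizer = ScaleDownStabilizer(window_seconds=300)
    for now, raw in [(0, 10), (100, 4), (200, 5), (350, 4), (600, 4)]:
        print(now, raw, stabilizer.stabilize(now, raw))
```

With a 300-second window, the drop from 10 to 4 recommended replicas only takes full effect once every higher recommendation has aged out; with a window of 0 the raw value would be applied on every sync.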
ClickHouse pods share data via S3, but query state (processing buffers, connection pools) lives in memory. Scaling to zero doesn't "free" memory in the traditional sense, and scaling back up is slow.
maxReplicas: 100 without spend guards can result in a $50K/month bill after a traffic spike. Always cap maxReplicas at a level you can afford, limit the scale-up rate with behavior.scaleUp policies and stabilizationWindowSeconds, and set cost alerts.
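A quick worst-case estimate shows why the cap matters. The Python sketch below multiplies maxReplicas by an assumed per-pod hourly cost. The function names, the $0.68/hour price, and the 730 hours-per-month figure are illustrative assumptions, not real pricing.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def worst_case_monthly_cost(max_replicas, hourly_cost_per_pod):
    """Cost if the HPA sits at maxReplicas for a whole month."""
    return max_replicas * hourly_cost_per_pod * HOURS_PER_MONTH


def max_affordable_replicas(monthly_budget, hourly_cost_per_pod):
    """Largest maxReplicas whose worst-case cost stays within the budget."""
    return int(monthly_budget // (hourly_cost_per_pod * HOURS_PER_MONTH))


if __name__ == "__main__":
    # Hypothetical price of $0.68 per pod-hour.
    print(round(worst_case_monthly_cost(100, 0.68)))  # 49640
    print(max_affordable_replicas(10_000, 0.68))      # 20
```

Working backwards from a budget to maxReplicas, rather than picking a round number, keeps the worst case bounded even if alerts fire late.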
For stateless API servers and ClickHouse query nodes, requests/sec divided by replica count gives a per-pod utilization that's stable, fast-responding, and directly tied to user experience.
Keep minReplicas: 2 even at 3 AM for services with SLA < 100ms. Cold starts on new pods add 10 to 30 seconds of latency. The cost of idle replicas is almost always worth it.
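The requests-per-second approach and the minReplicas floor can be combined in a short calculation. The following Python sketch is an illustration only: replicas_for_load is a hypothetical helper, and the 500 requests/sec per-pod target is borrowed from the averageValue in the ClickHouse HPA manifest earlier in this document.

```python
import math


def replicas_for_load(total_rps, target_rps_per_pod=500,
                      min_replicas=2, max_replicas=30):
    """Replicas needed so each pod serves at most target_rps_per_pod.

    Hypothetical helper; the defaults are assumptions for illustration.
    """
    if target_rps_per_pod <= 0:
        raise ValueError("target_rps_per_pod must be positive")
    wanted = math.ceil(total_rps / target_rps_per_pod)
    # The floor keeps warm capacity at night; the ceiling bounds spend.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(replicas_for_load(4200))    # 9: 4200 / 500 = 8.4, rounded up
    print(replicas_for_load(50))      # 2: the minReplicas floor applies
    print(replicas_for_load(100000))  # 30: capped at maxReplicas
```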