AWS Lambda
Firecracker microVMs, Cold Starts, and the Anatomy of an Execution Environment
Lambda is often described as "just functions," but the substrate underneath is a carefully engineered hypervisor stack. Every invocation runs inside a Firecracker microVM — a minimal KVM-based VMM written in Rust that boots in roughly 125 ms and isolates one customer's code from another's on the same host. Around that microVM, Lambda glues a runtime (the language interpreter), an init phase, a frozen-when-idle execution loop, and a control plane that decides when to reuse, replace, or scale environments. Understanding Lambda performance — cold starts, tail latencies, the cost of VPC ENIs, the trade-off between provisioned concurrency and SnapStart — means understanding what happens between "request arrives at API Gateway" and "your handler returns."
Why Lambda's Design Looks the Way It Does
On EC2, you trust the hypervisor. On a function platform that may pack hundreds of customer functions onto a single host within seconds of a traffic spike, container isolation (cgroups + namespaces) is insufficient against kernel-level attacks. Lambda's answer is Firecracker: a stripped-down VMM that gives every function a real Linux guest kernel inside KVM, with a minimal device model (virtio-net, virtio-block, no PCI, no USB). This keeps the host kernel attack surface small while still booting in <200 ms.
"Serverless" hides the fact that somebody still has to allocate a CPU, copy your code, initialize the runtime, and run your global-scope code. When a request arrives and no warm environment exists, the user pays for that latency. Cold starts dominate p99 for sporadic workloads and are the single most-discussed Lambda performance topic. Provisioned concurrency, SnapStart, init phases, and execution-environment reuse are all tools to amortize or eliminate this cost.
Lambda's hard limits are 6 MB request/response payloads, a 15-minute timeout, 10 GB of /tmp, 10 GB of memory, 250 MB of unzipped code, and 1,024 file descriptors. Every limit is load-bearing: the 6 MB cap forces large payloads through S3 (cheaper, more durable), the 15-minute timeout pushes long jobs to ECS/Fargate, and the /tmp size lets ML workloads cache models. These are not arbitrary; they are the contour lines of the platform's cost and isolation model.
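As a rough illustration of how the 6 MB cap plays out, the sketch below estimates whether a binary blob still fits once it has been base64-encoded into a JSON payload, and otherwise signals that it should travel through S3. The function names and the 1 KB envelope allowance are assumptions made for this example, not part of any AWS API.

```python
import base64

# The 6 MB request/response cap quoted above, expressed in bytes.
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024


def base64_size(raw_len: int) -> int:
    """Length of the padded base64 encoding of raw_len bytes."""
    return 4 * ((raw_len + 2) // 3)


def fits_inline(raw: bytes, envelope_bytes: int = 1024) -> bool:
    """True if the blob fits in the payload after base64 encoding.

    envelope_bytes is a hypothetical allowance for the surrounding JSON.
    """
    return base64_size(len(raw)) + envelope_bytes <= MAX_PAYLOAD_BYTES


def delivery_plan(raw: bytes) -> str:
    """Return "inline", or "s3" meaning: upload the blob and pass its key."""
    return "inline" if fits_inline(raw) else "s3"
```

Under these assumptions a 4 MiB blob encodes to about 5.3 MiB and still fits, while a 5 MiB blob encodes to about 6.7 MiB and has to go through S3.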
Firecracker, MicroVM, and Host Architecture
A Lambda host is an EC2 bare-metal or Nitro instance running the Lambda Worker software.
Firecracker is the load-bearing piece. It was open-sourced in 2018 (originally written for Lambda; later adopted for Fargate too). Each VMM is a single Rust process with seccomp filters limiting it to roughly 50 syscalls. The jailer chroots Firecracker into a per-microVM directory and drops privileges before the VMM ever touches guest code. The guest kernel is a custom Amazon Linux 2 build with module loading disabled, no PCI bus, and only the virtio devices Lambda needs.
Key Numbers
The Cold-Start Lifecycle
A "cold start" is overloaded terminology. There are at least four distinct phases that contribute to it, and each has different mitigation strategies.
The Lambda placement service selects a worker host with available capacity. For VPC-attached functions, it must also pick a Hyperplane ENI. This is fast unless the region is capacity-constrained.
The worker fetches your zip or container image from internal S3 (S3 with internal accelerated endpoints, not public S3). For container images, Lambda uses a custom layered cache with chunk-level deduplication: the image is split into 512 KB chunks, each chunk fetched lazily as it's read. Only chunks the runtime actually touches get downloaded.
Firecracker boots the guest kernel in ~125 ms. The Lambda init process starts, mounts /var/task and /opt (Lambda Layers), and exec's your runtime. The runtime then runs your global-scope code. For Java/Spring this can be seconds; for Node.js this is typically <100 ms; for Python it depends heavily on what you import at module load.
The runtime polls the Lambda runtime API (a local HTTP endpoint whose address is passed in the
AWS_LAMBDA_RUNTIME_API environment variable) for an invocation, deserializes the event, and calls
your handler. Once the handler returns, the runtime posts the response and goes back to polling.
The microVM stays warm; between invocations the runtime process is frozen using the cgroup freezer.
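The poll, handle, respond cycle described above can be sketched as a tiny custom runtime loop. This is a minimal sketch, not how the managed runtimes are implemented: the helper names are hypothetical, the handler takes only the event (real handlers also receive a context object), and the HTTP functions are injectable so the loop can be exercised without a Lambda environment. The endpoint paths and the request-ID header are those of the public Runtime API.

```python
import json
import os
import urllib.request

API_VERSION = "2018-06-01"


def http_get(url):
    """Blocking GET; returns (headers, body)."""
    with urllib.request.urlopen(url) as resp:
        return resp.headers, resp.read()


def http_post(url, body):
    """POST a bytes body and return the HTTP status."""
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status


def handle_one(base, handler, get=http_get, post=http_post):
    """Fetch one invocation, run the handler, report the result or the error."""
    headers, body = get(f"{base}/invocation/next")
    request_id = headers["Lambda-Runtime-Aws-Request-Id"]
    try:
        result = handler(json.loads(body))
        post(f"{base}/invocation/{request_id}/response",
             json.dumps(result).encode("utf-8"))
    except Exception as exc:  # report handler failures instead of crashing
        error = {"errorMessage": str(exc), "errorType": type(exc).__name__}
        post(f"{base}/invocation/{request_id}/error",
             json.dumps(error).encode("utf-8"))
    return request_id


def run_forever(handler):
    """Poll until the environment is frozen or reaped."""
    host = os.environ["AWS_LAMBDA_RUNTIME_API"]
    base = f"http://{host}/{API_VERSION}/runtime"
    while True:
        handle_one(base, handler)
```

The call to `/invocation/next` blocks until work arrives, which is the point at which an idle environment can be frozen.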
The asymmetry between phases drives optimization choices:
| Phase | Typical | Worst case | Mitigation |
|---|---|---|---|
| Placement | 5–30 ms | seconds (capacity) | Provisioned concurrency |
| Code fetch (zip) | 10–100 ms | 500 ms | Smaller deployment, layers |
| Code fetch (image) | 50–300 ms | 2 s on first chunk miss | Image flattening, fewer layers |
| VM boot | ~125 ms | 200 ms | Nothing — this is fixed |
| Runtime init | 20–3000 ms | 10 s+ (Spring, Quarkus JVM) | SnapStart, GraalVM, smaller deps |
| VPC ENI attach | 0 (Hyperplane) | was 5–10 s pre-2019 | Already mitigated |
Memory, CPU, and the Linear Coupling
Lambda has one knob the user controls for performance: memory. CPU is allocated proportionally. AWS documents the anchor points (one full vCPU at 1,769 MB, up to six vCPUs at 10,240 MB); the intermediate values below are approximations drawn from benchmarks:
| Memory (MB) | vCPU equivalent | Typical use case |
|---|---|---|
| 128 | ~0.083 (8.3% of 1 vCPU) | Tiny webhook, cron tick |
| 1,769 | 1.0 vCPU (one full core) | Single-threaded API |
| 3,538 | 2.0 vCPUs | JSON-heavy API, light I/O parallelism |
| 5,308 | 3.0 vCPUs | Image processing, light ML inference |
| 7,077 | 4.0 vCPUs | FFmpeg, heavier image work |
| 10,240 | 6.0 vCPUs (max) | ML inference, large-batch transforms |
The breakpoint at 1,769 MB is the key one: below it, you have less than a full core, and any single-threaded compute is artificially throttled. AWS Lambda Power Tuning (an open-source Step Functions state-machine tool) will sweep memory settings for your function and produce a cost-vs-latency Pareto frontier. The result is often counterintuitive: for CPU-bound code, doubling memory can roughly halve duration, so the faster configuration costs about the same.
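The arithmetic behind such a sweep is simple enough to sketch. The snippet below is an illustrative calculation, not the Power Tuning tool itself: it prices each (memory, duration) measurement at the $0.0000166667/GB-second rate used elsewhere in this article and picks the cheapest setting that meets a latency target. The measurements are the illustrative figures from this section's sample output, and the function names are hypothetical.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # on-demand duration rate used in this article


def cost_per_million(memory_mb: float, duration_ms: float) -> float:
    """Duration cost in USD for one million invocations (request fee excluded)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND * 1_000_000


def cheapest_within(measurements, max_duration_ms):
    """Cheapest memory setting whose measured duration meets the target, or None."""
    eligible = [
        (cost_per_million(memory_mb, duration_ms), memory_mb)
        for memory_mb, duration_ms in measurements
        if duration_ms <= max_duration_ms
    ]
    return min(eligible)[1] if eligible else None


# (memory in MB, measured duration in ms) for a CPU-bound function
MEASUREMENTS = [(128, 6800), (512, 1720), (1024, 860), (1769, 500), (3008, 480)]
```

With these figures, `cheapest_within(MEASUREMENTS, 600)` selects 1769 MB: it meets the target for roughly the same duration cost as the slower settings, while 3008 MB buys only 20 ms for a much higher bill.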
// Lambda Power Tuning typical output for a CPU-bound function
// Memory Duration Cost (per 1M invocations)
// 128 MB 6,800 ms $14.20
// 512 MB 1,720 ms $14.40
// 1024 MB 860 ms $14.40
// 1769 MB 500 ms $14.50 <-- 1 vCPU breakpoint
// 3008 MB 480 ms $24.10 <-- diminishing returns
//
// Sweet spot: 1769 MB for single-threaded work.
Provisioned Concurrency vs SnapStart
For workloads where cold-start latency is unacceptable, Lambda offers two distinct mechanisms:
Provisioned concurrency keeps pre-initialized environments warm. You pay for each provisioned
environment for as long as it is configured (~$0.0000041667/GB-second, on top of normal request
charges). Init code runs ahead of time. Works with any runtime. Good for predictable peak traffic;
combine with Application Auto Scaling on the ProvisionedConcurrencyUtilization CloudWatch metric.
Anti-pattern: using PC to fix a slow init you should have profiled instead.
SnapStart has Lambda run your init code, take a Firecracker microVM snapshot (memory + filesystem),
then encrypt and store it. New invocations restore the snapshot instead of re-running init.
Restoration is ~200 ms vs 5+ seconds for cold Spring Boot. For Java there is no extra charge for
SnapStart itself; Python and .NET add snapshot caching and restore charges. The catch: you must
handle uniqueness post-restore. Random seeds, network connections, and temporary credentials must
be regenerated in each restored environment via the beforeCheckpoint and afterRestore hooks. Java
was first; Python and .NET followed in 2024–2025.
// Java SnapStart: regenerate connections after restore
import java.sql.Connection;
import java.sql.DriverManager;
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

public class DbHandler implements Resource {
    private Connection conn;

    public DbHandler() {
        // Register so the CRaC hooks below run around the snapshot
        Core.getGlobalContext().register(this);
    }

    // Called BEFORE snapshot -- close transient state
    @Override
    public void beforeCheckpoint(Context<? extends Resource> ctx) throws Exception {
        if (conn != null) conn.close();
        conn = null;
    }

    // Called AFTER restore in each new microVM -- recreate
    @Override
    public void afterRestore(Context<? extends Resource> ctx) throws Exception {
        conn = DriverManager.getConnection(System.getenv("DB_URL"));
    }
}
Execution Environment Reuse and Frozen State
Between invocations, Lambda freezes the microVM using the cgroup freezer subsystem. CPU is paused; memory is retained. When the next invocation arrives, the runtime is unfrozen and resumes. This is why module-scoped variables persist across invocations on the same environment — a fact that's both a powerful optimization and a source of subtle bugs.
// Node.js -- module scope persists across warm invocations
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient(); // Created ONCE per env, reused
const cache = new Map(); // Persists across invocations on the same env
exports.handler = async (event) => {
if (cache.has(event.id)) return cache.get(event.id); // Warm cache hit
const item = await ddb.get({ TableName: 'T', Key: { id: event.id } }).promise();
cache.set(event.id, item.Item);
return item.Item;
};
// Caveats:
// - Cache may be on ANY of N parallel envs -- partial hit rate
// - Setting timers (setInterval) leaks across invocations — freeze stops them, but they fire on thaw
// - Pending promises that aren't awaited may complete on a future invocation
The freeze/thaw model has three practical consequences:
- Background work is unsafe. A setTimeout or unawaited promise is paused when your handler returns. It may complete on the next invocation, or never if the environment is reaped.
- File descriptors leak. Open sockets to a database survive freeze. Long-idle sockets get reset by the database; your next invocation gets EPIPE. Put RDS Proxy in front of the database, or use a connection library that health-checks (pings) a connection before reusing it.
- Wall-clock drifts. A frozen environment doesn't see time pass. Code that caches based on Date.now() can hold stale entries longer than expected.
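The "ping before reuse" advice for sockets that survive a freeze can be sketched as a small wrapper kept at module scope. This is a minimal sketch under stated assumptions: `PingedConnection` is a hypothetical helper, and `connect` and `ping` stand in for whatever your database driver provides (for example, opening a connection and running `SELECT 1`).

```python
class PingedConnection:
    """Keeps one connection per execution environment and verifies it before reuse."""

    def __init__(self, connect, ping):
        self._connect = connect  # callable that returns a new connection
        self._ping = ping        # callable that raises if the connection is dead
        self._conn = None

    def get(self):
        """Return a live connection, reconnecting if the cached one was reset."""
        if self._conn is not None:
            try:
                self._ping(self._conn)
                return self._conn
            except Exception:
                # The socket was probably reset while the environment was frozen.
                self._conn = None
        self._conn = self._connect()
        return self._conn
```

Created once at module scope and called at the top of the handler, `get()` costs one round trip on warm invocations in exchange for not failing on a stale socket.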
VPC, Hyperplane ENIs, and the Cold-Start Tax That Was
Pre-2019, attaching a Lambda to a VPC was punitive: Lambda would create an ENI per concurrent execution, and ENI attachment took 5–10 seconds. A burst of 100 cold concurrent invocations meant 100 ENI creations.
AWS replaced this in 2019 with Hyperplane ENIs — shared ENIs in your subnet that Lambda multiplexes across many concurrent function executions using NAT-style port mapping. The result:
- One Hyperplane ENI per (VPC, subnet, security group) combination, regardless of concurrency.
- ENI cold-start tax dropped from seconds to effectively zero.
- Lambda still needs an unused IP in your subnet — if your subnet's CIDR is exhausted, function scaling is capped.
- NAT for outbound: each Hyperplane ENI is a NAT gateway from the function's perspective. Bandwidth is shared across all functions on that ENI.
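To make the subnet-exhaustion point concrete, the sketch below estimates how many addresses a subnet can actually hand out. It assumes the standard VPC behaviour of reserving five addresses per subnet; the function names are hypothetical, and in practice the EC2 API reports the live figure as `AvailableIpAddressCount` for each subnet.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # addresses a VPC keeps back in every subnet


def usable_ips(cidr: str) -> int:
    """Addresses a subnet can hand out after the reserved ones are removed."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def free_ips(cidr: str, in_use: int) -> int:
    """Addresses still available for new ENIs, given a count already in use."""
    return max(usable_ips(cidr) - in_use, 0)
```

A /28 subnet therefore offers 11 usable addresses, which leaves little headroom once other resources share it.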
Function URLs vs API Gateway vs ALB
| | Function URL | API Gateway (REST) | API Gateway (HTTP) | ALB |
|---|---|---|---|---|
| Latency overhead | ~10 ms | 20–50 ms | 10–20 ms | 5–15 ms |
| Cost per million | $0 (just Lambda) | $3.50 | $1.00 | $5.60/LCU-hr |
| Auth options | IAM only | IAM, Cognito, custom, JWT | IAM, JWT, custom | Cognito, OIDC |
| WAF | No (use CloudFront) | Yes | No | Yes |
| Request limit | 6 MB | 10 MB | 6 MB | 1 MB |
| Streaming response | Yes (up to 20 MB) | No | No | No |
| Best for | Internal tools, webhooks | Production APIs, throttling | Cost-sensitive APIs | Existing ALB infra |
Function URLs (released 2022) are the lightest-weight option: a stable HTTPS URL with TLS termination, zero per-request charge beyond Lambda itself, and support for response streaming via the Lambda invoke-with-response-stream API. They lack the ergonomics of API Gateway (no resource models, no request validation, no integrated WAF), but for <10 RPS internal tools they're hard to beat.
Lambda Layers, Container Images, and the Deployment Question
Lambda has two deployment models that look similar but behave differently:
# Zip deployment (default, fastest cold start)
$ zip -r function.zip handler.js node_modules/
$ aws lambda update-function-code --function-name f --zip-file fileb://function.zip
# Limit: 50 MB zipped, 250 MB unzipped (incl. layers)
# Cold start: fastest -- code is fetched as one blob
# Container image deployment (10 GB ceiling, OCI-compatible)
$ docker build -t my-fn .
$ aws ecr get-login-password | docker login --username AWS --password-stdin <acct>.dkr.ecr.<region>.amazonaws.com
$ docker push <acct>.dkr.ecr.<region>.amazonaws.com/my-fn:latest
$ aws lambda update-function-code --function-name f --image-uri <acct>.dkr.ecr.<region>.amazonaws.com/my-fn:latest
# Lambda: chunks the image into 512 KB blocks, fetches lazily on first read
# Cold start: only marginally slower than zip if image is well-layered
Container images solved the 250 MB limit (huge for ML workloads with PyTorch + model weights), but they
require an ECR repo, a VPC endpoint or NAT for ECR if Lambda is in a private subnet, and a Dockerfile
based on an AWS-provided base image (or one with the runtime interface client embedded). The chunked
lazy-load means a 5 GB image with a Python handler that imports only boto3 may only fetch
~50 MB on first call — but if your handler later loads a 4 GB model, that fetch happens during
invocation, on the user's wall clock.
Lambda Layers are a separate mechanism: a zip mounted at /opt on the
microVM. Up to 5 layers per function, 250 MB combined unzipped. Layers are deduplicated across functions
that reference the same layer ARN, so if 100 of your functions share a 50 MB layer, the layer is fetched
once per worker, not once per function. Layers are most useful for cross-function utility code (the
AWS SDK is pre-bundled in the runtime, so you don't need to layer it).
Lambda Extensions API
Extensions are out-of-band sidecars that run inside the same microVM as your function. They register
with the Lambda extensions API at init time and receive lifecycle events: INVOKE,
SHUTDOWN. Common uses: APM agents (Datadog, New Relic), secrets prefetching (AWS Parameters
and Secrets Lambda Extension caches Parameter Store values in-memory), log shippers.
# Extension manifest at /opt/extensions/<name>
# It's a binary or script; Lambda starts it, then your runtime
# Lifecycle:
# 1. Lambda starts extension processes from /opt/extensions/
# 2. Each extension calls POST /2020-01-01/extension/register
# 3. Lambda starts your runtime + handler
# 4. On each INVOKE, extensions get notified (parallel to your handler)
# 5. After your handler returns, extensions may keep working until they request
#    the next event; the function timeout bounds the whole invoke phase
# 6. SHUTDOWN event when env is reaped (extensions get up to 2 seconds to clean up)
# Example: AWS Parameters & Secrets Lambda Extension
# Adds layer arn:aws:lambda:<region>:177933569100:layer:AWS-Parameters-and-Secrets-Lambda-Extension:N
# Your handler hits localhost:2773 instead of SSM/Secrets Manager
# 5-minute in-memory TTL -- huge cold-start win for secrets-heavy functions
Lambda@Edge vs CloudFront Functions
At the CloudFront edge, AWS offers two compute options that sound similar but have very different execution models:
| | Lambda@Edge | CloudFront Functions |
|---|---|---|
| Runtime | Node.js, Python (Lambda subset) | JavaScript (V8-isolate, ECMA 5.1) |
| Cold start | 50–200 ms | <1 ms (isolate, no microVM) |
| Max execution time | 5 s (viewer), 30 s (origin) | 1 ms (viewer events only) |
| Memory | 128 MB – 10 GB | 2 MB |
| Network calls | Yes (any HTTP) | No (pure compute) |
| Triggers | 4 events (viewer/origin, request/response) | 2 events (viewer-request, viewer-response) |
| Cost per million | $0.60 + duration | $0.10 |
| Best for | A/B routing, image resizing, auth | URL rewrites, header manipulation, JWT verify (HMAC) |
CloudFront Functions runs in V8 isolates — the same isolation primitive as Cloudflare Workers, not microVMs. That's why cold start is sub-millisecond but the model is restrictive: no async, no network, 1 ms CPU budget. Lambda@Edge is "real Lambda," replicated across 13+ regional edge caches. Use CloudFront Functions for header manipulation; Lambda@Edge for anything that needs to do real work.
How Lambda Compares to Other Function Platforms
| | AWS Lambda | Cloudflare Workers | Google Cloud Run | AWS Fargate |
|---|---|---|---|---|
| Isolation | Firecracker microVM | V8 isolate | gVisor + container | Firecracker (now) |
| Cold start | 100–1000+ ms | <5 ms | 200–2000 ms | 20–60 s |
| Max duration | 15 min | 30 s CPU (10 min wall) | 60 min | Unlimited |
| Max memory | 10 GB | 128 MB | 32 GB | 120 GB |
| Languages | Many runtimes + custom | JS/TS/Rust (WASM) | Any (container) | Any (container) |
| Concurrency model | 1 invocation per env | Many per isolate | Configurable (1–1000 per instance) | Long-running |
| Network restrictions | None | Limited (subrequests, no raw TCP) | None | None |
The fundamental trade-off: Workers' V8 isolates give you near-zero cold start at the cost of running only sandboxed JavaScript with no native code. Lambda's microVMs give you any language and any binary, but every cold start pays for VM boot + runtime init. Cloud Run sits in between with concurrency >1 per instance, blunting the cold-start cost across many requests.
Tradeoffs & Failure Modes
- Concurrency limits cause throttles, not queueing. Account-level concurrency is 1,000 by default. Hit it and synchronous invocations get TooManyRequestsException; async invocations retry with exponential backoff for up to 6 hours, then go to your DLQ. Always set a reserved concurrency on critical functions to protect them from being starved by other functions in the account.
- Async retry semantics are surprising. Async invocations retry twice on failure (3 attempts total) with delays of ~1 minute and ~2 minutes. After that, events go to a DLQ if configured. For event sources like SNS, the SNS retry policy is layered on top, so you can end up with up to ~20 attempts depending on failures.
- Idempotency is your problem. Lambda may invoke your function more than once for the same event, especially with SQS, Kinesis, and DynamoDB Streams. The function must be idempotent. Use Powertools idempotency utility (DynamoDB-backed) or your own dedupe table.
- Concurrency-per-event-source bottlenecks. SQS standard queues scale Lambda concurrency at +60 per minute, capped at 1,000 per function. A backlog of 100K messages takes ~17 minutes to fully spin up consumers. SQS FIFO and Kinesis are bounded by shard/group count, not Lambda concurrency.
- Init phase has its own timeout. If your global-scope code takes >10 seconds, the init fails and the invocation errors out. Lazy-load instead of eagerly importing everything.
- The 6 MB payload cap is an HTTP body cap, not a JSON-string cap. Base64 encoding for binary in API Gateway eats your budget. For larger payloads, write to S3 and pass a key.
- Hot environments don't load-balance fairly. Lambda routes to "stickiest available warm env first." Module-scoped state on a popular env grows; on a rare env, it's stale or empty. Don't use module scope as a cache for anything where more than a few minutes of staleness matters.
- SnapStart restore is not free of state. A snapshot is a moment in time. UUIDs
generated at init are identical across restored microVMs unless you regenerate them in
afterRestore. Be paranoid about anything that "should be unique."
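The idempotency point above lends itself to a short sketch. The decorator below remembers each event's result under a caller-chosen key and returns the stored result on redelivery. It is a simplified illustration, not the Powertools implementation: `InMemoryDedupeStore` is a hypothetical stand-in for a durable table (a real one would use a conditional write and an expiry), and it does not guard against two environments processing the same event concurrently.

```python
import functools


class InMemoryDedupeStore:
    """Stand-in for a durable dedupe table keyed by event ID."""

    def __init__(self):
        self._results = {}

    def contains(self, key):
        return key in self._results

    def get(self, key):
        return self._results[key]

    def put(self, key, result):
        # First writer wins, mirroring a conditional put.
        self._results.setdefault(key, result)


def idempotent(store, key_of):
    """Run the handler at most once per key; replay the stored result otherwise."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context=None):
            key = key_of(event)
            if store.contains(key):
                return store.get(key)
            result = handler(event, context)
            store.put(key, result)
            return result
        return wrapper
    return decorator
```

Choosing the key is the real design decision: for SQS it would typically be the message ID or a business identifier carried in the body.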
FAQ
Why does my Java Lambda take 8 seconds to cold-start, and how do I fix it?
Spring Boot or Quarkus on the JVM does massive classpath scanning at boot. The fix order: (1) enable SnapStart — for Spring Boot apps it typically drops cold start from 6–10 s to ~300–600 ms; (2) if SnapStart is unavailable, switch to GraalVM native-image — ~50 MB binary, ~150 ms cold start, but you lose reflection unless you register types; (3) as a last resort, use provisioned concurrency. Don't bother increasing memory past 1769 MB unless you've verified the bottleneck is CPU, not classpath I/O.
How do I know which exact warm environment my invocation hit?
Lambda exposes /proc/self/cgroup with a sandbox ID, and AWS_LAMBDA_LOG_STREAM_NAME
contains a stream name that includes the environment ID. Logs are partitioned by stream, so all
invocations on the same env land in the same stream. Cold starts have an INIT_START log
line; warm invocations don't. Use the aws-lambda-powertools tracer to add cold-start
markers to X-Ray traces.
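If you only need a quick marker in your own logs, a module-scope flag is enough, because module scope is initialized once per execution environment. This is a minimal sketch: the handler name and the logged fields are illustrative, and it relies only on the AWS_LAMBDA_LOG_STREAM_NAME variable mentioned above.

```python
import os

_cold = True  # module scope: initialized once per execution environment


def handler(event, context=None):
    """Report whether this invocation initialized the environment."""
    global _cold
    was_cold, _cold = _cold, False
    info = {
        "cold_start": was_cold,
        # Same value for every invocation served by this environment.
        "log_stream": os.environ.get("AWS_LAMBDA_LOG_STREAM_NAME", "unknown"),
    }
    print(info)
    return info
```

Grouping log lines by `log_stream` then shows which invocations shared an environment and which one paid for its init.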
Should I use connection pooling to RDS from Lambda?
Use RDS Proxy. A naive pool of N connections per Lambda environment, multiplied by Y concurrent environments, can exhaust the RDS connection limit (Postgres default is 100). RDS Proxy maintains a shared pool, presents Lambda with cheap virtual connections, and recycles backend connections across Lambda invocations. Aurora Serverless v2 with Data API is the alternative — HTTP, no connection state at all, but slower per-query and SQL-only (no transactions across calls).
What's the actual difference between async and sync invocation?
Sync (InvocationType=RequestResponse): caller waits for the response, gets the function's
return value or error. No retry by Lambda. Used by API Gateway, ALB, function URLs. Async
(InvocationType=Event): Lambda queues the event in an internal queue, returns 202
immediately, retries on failure (twice), DLQ on final failure. Used by S3 events, SNS, EventBridge.
The internal queue is not exposed and has no visible backlog metric — if you need to monitor
queue depth, put SQS in front.
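From the caller's side the difference is a single parameter. The sketch below builds arguments for the Lambda `invoke` API and interprets the response for each mode. The client is passed in (with boto3 it would be `boto3.client("lambda")`), and the helper names are assumptions for illustration.

```python
import json


def build_invoke_args(function_name, payload, asynchronous):
    """Arguments for the Lambda invoke API in sync or async mode."""
    return {
        "FunctionName": function_name,
        "InvocationType": "Event" if asynchronous else "RequestResponse",
        "Payload": json.dumps(payload).encode("utf-8"),
    }


def invoke(client, function_name, payload, asynchronous=False):
    """Sync: return the decoded result. Async: return True if the event was queued."""
    response = client.invoke(**build_invoke_args(function_name, payload, asynchronous))
    if asynchronous:
        # 202 means Lambda accepted the event; the handler has not run yet.
        return response["StatusCode"] == 202
    return json.loads(response["Payload"].read())
```

In async mode the caller never sees the handler's return value or its errors, which is why failure handling has to be configured on the function rather than in the caller.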
How does Lambda's billing actually work for sub-100ms invocations?
Billing rounds duration up to the nearest 1 ms, multiplied by allocated memory in GB. A 50 ms
invocation at 1024 MB costs 0.05 s × 1 GB × $0.0000166667/GB-s = $0.00000083,
plus a flat $0.20 per million requests. The flat per-request fee dominates for very short functions:
at 50 ms / 128 MB, the duration cost is $0.0000001 vs request fee $0.0000002, so the request fee
is 2× the compute. Provisioned concurrency is billed separately: a lower per-GB-second rate for as
long as it is configured, plus a reduced duration rate for invocations served by a provisioned environment.
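The per-invocation arithmetic can be checked with a few lines. This sketch uses the two prices quoted in the answer above and covers only on-demand billing; the function name and the returned field names are illustrative.

```python
import math

PRICE_PER_GB_SECOND = 0.0000166667      # duration rate quoted above
REQUEST_FEE = 0.20 / 1_000_000          # flat fee per invocation


def invocation_cost(memory_mb: float, duration_ms: float) -> dict:
    """Split one invocation's on-demand cost into duration and request parts."""
    billed_ms = math.ceil(duration_ms)  # duration is rounded up to 1 ms
    duration_cost = (memory_mb / 1024) * (billed_ms / 1000) * PRICE_PER_GB_SECOND
    return {
        "billed_ms": billed_ms,
        "duration_cost": duration_cost,
        "request_cost": REQUEST_FEE,
    }
```

For a 50 ms call at 128 MB the request fee comes out at just under twice the duration cost, which is why shaving milliseconds off very short functions barely moves the bill.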
Can I run a binary that's not a Lambda runtime (e.g., FFmpeg) on Lambda?
Yes. Either bundle the binary in your zip (or a Lambda Layer) and exec it from your handler, or build a container image with the binary baked in. Use static binaries (FFmpeg static builds) to avoid library-version mismatches with the Lambda runtime's Amazon Linux 2 base. Watch /tmp size for intermediate files, and remember the 15-minute timeout. For workloads > 15 minutes, AWS Batch or ECS/Fargate is the right answer.
What happens if my handler doesn't return within the timeout?
The invocation is killed with a Task timed out after N.NN seconds error. The microVM
itself isn't necessarily destroyed — Lambda may reuse it for the next invocation if it's
recoverable. But any in-flight work in your handler is gone, side effects (database writes already
committed) remain, and the invocation result is a function error from the caller's perspective. Set
timeouts conservatively but not absurdly high — an over-long timeout means slow failures cost
you money.
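One way to avoid being killed mid-work is to check the remaining time and stop early. The sketch below uses the context object's `get_remaining_time_in_millis()` method from the Python runtime; the 2-second safety margin, the `work` callback, and the shape of the return value are assumptions for illustration.

```python
SAFETY_MARGIN_MS = 2_000  # hypothetical buffer for cleanup and responding


def process_batch(items, context, work, safety_margin_ms=SAFETY_MARGIN_MS):
    """Process items until the timeout gets close, then hand back the rest."""
    done = []
    for index, item in enumerate(items):
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            # Stop cleanly so the caller can requeue what is left.
            return {"processed": done, "remaining": items[index:]}
        done.append(work(item))
    return {"processed": done, "remaining": []}
```

Returning the unprocessed tail lets the caller requeue it, instead of losing track of which side effects already happened.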
How do Lambda Function URLs handle TLS?
AWS terminates TLS at the Lambda service boundary using a managed certificate on
*.lambda-url.<region>.on.aws. You cannot bring your own certificate to a Function
URL; if you need a custom domain, put CloudFront in front (which lets you use ACM certificates and
adds WAF). Function URLs support both IAM auth (signed requests) and NONE (public). For
public webhooks, use NONE + verify the source's signature in your handler.
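Verifying the sender's signature in the handler might look like the sketch below. It assumes the webhook provider signs the raw body with HMAC-SHA256 and sends the hex digest in a header. The header name `x-webhook-signature` and the way the secret is supplied are hypothetical, so check your provider's scheme. The event fields (`headers`, `body`, `isBase64Encoded`) follow the Function URL payload format.

```python
import base64
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def raw_body(event) -> bytes:
    """Recover the exact bytes the sender signed."""
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        return base64.b64decode(body)
    return body.encode("utf-8")


def handler(event, context=None, secret=b"replace-me"):
    """Reject requests whose signature does not match the body."""
    headers = event.get("headers") or {}
    supplied = headers.get("x-webhook-signature", "")
    expected = sign(secret, raw_body(event))
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), supplied.encode()):
        return {"statusCode": 401, "body": "invalid signature"}
    return {"statusCode": 200, "body": "ok"}
```

In a real function the secret would come from configuration or a secrets store rather than a default argument.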