AWS Lambda

Firecracker microVMs, Cold Starts, and the Anatomy of an Execution Environment

Lambda is often described as "just functions," but the substrate underneath is a carefully engineered hypervisor stack. Every invocation runs inside a Firecracker microVM — a minimal KVM-based VMM written in Rust that boots in roughly 125 ms and isolates one customer's code from another's on the same host. Around that microVM, Lambda glues a runtime (the language interpreter), an init phase, a frozen-when-idle execution loop, and a control plane that decides when to reuse, replace, or scale environments. Understanding Lambda performance — cold starts, tail latencies, the cost of VPC ENIs, the trade-off between provisioned concurrency and SnapStart — means understanding what happens between "request arrives at API Gateway" and "your handler returns."

Why Lambda's Design Looks the Way It Does

The multi-tenant isolation problem

On EC2, you trust the hypervisor. On a function platform that may pack hundreds of customer functions onto a single host within seconds of a traffic spike, container isolation (cgroups + namespaces) is insufficient against kernel-level attacks. Lambda's answer is Firecracker: a stripped-down VMM that gives every function a real Linux guest kernel inside KVM, with a minimal device model (virtio-net, virtio-block, no PCI, no USB). This keeps the host kernel attack surface small while still booting in <200 ms.

The cold-start tax

"Serverless" hides the fact that somebody still has to allocate a CPU, copy your code, initialize the runtime, and run your global-scope code. When a request arrives and no warm environment exists, the user pays for that latency. Cold starts dominate p99 for sporadic workloads and are the single most-discussed Lambda performance topic. Provisioned concurrency, SnapStart, init phases, and execution-environment reuse are all tools to amortize or eliminate this cost.

Constraints as features

The hard limits are 6 MB per synchronous request/response, a 15-minute timeout, up to 10 GB of /tmp, 10 GB of memory, 250 MB of unzipped code, and 1,024 file descriptors. Every limit is load-bearing: the 6 MB cap forces large payloads through S3 (cheaper, more durable), the 15-minute timeout pushes long jobs to ECS/Fargate, and the configurable /tmp size lets ML workloads cache models. These are not arbitrary; they are the contour lines of the platform's cost and isolation model.

Firecracker, MicroVM, and Host Architecture

A Lambda host is an EC2 bare-metal or Nitro instance running the Lambda Worker software. Inside the host:

Lambda Worker Host (EC2 bare-metal, Nitro): Linux + KVM + Firecracker jailer + Lambda agent

  Host kernel (Linux, KVM enabled)
    cgroups v2, seccomp-bpf, jailer chroot, virtio backends.
    Talks to Nitro hardware for block + network.

  Customer A, microVM #1
    Firecracker VMM (Rust, ~50 syscalls allowed)
    Guest kernel (Amazon Linux 2 microvm): 5.10.x, no module loading, ~15 MB initramfs
    Lambda runtime (Node.js / Python / Java / ...): long-running process, polls runtime API for events
    Handler code + dependencies: /var/task (read-only) + /tmp (10 GB ephemeral)
    128 MB – 10240 MB RAM, 1–6 vCPUs

  Customer B, microVM #2
    Firecracker VMM (separate process, separate jailer)
    Guest kernel (independent address space)
    No shared page cache with Customer A
    Different language runtime, different code

  Cold start: 100–300 ms VM boot + runtime init

Firecracker is the load-bearing piece. It was open-sourced in 2018 (originally written for Lambda; later adopted for Fargate too). Each VMM is a single Rust process with seccomp filters limiting it to roughly 50 syscalls. The jailer chroots Firecracker into a per-microVM directory and drops privileges before the VMM ever touches guest code. The guest kernel is a custom Amazon Linux 2 build with module loading disabled, no PCI bus, and only the virtio devices Lambda needs.

Key Numbers

Memory range: 128 MB – 10,240 MB
vCPU at max memory: ~6 vCPUs
Max execution time: 15 min
Request / response cap: 6 MB sync, 256 KB async
/tmp ephemeral storage: 512 MB – 10,240 MB
Deployment package (zip): 50 MB zipped, 250 MB unzipped
Container image size: 10 GB
Firecracker boot: ~125 ms

The Cold-Start Lifecycle

A "cold start" is overloaded terminology. There are at least four distinct phases that contribute to it, and each has different mitigation strategies.

1. Worker placement (5–30 ms)

The Lambda placement service selects a worker host with available capacity. For VPC-attached functions, it must also pick a Hyperplane ENI. This is fast unless the region is capacity-constrained.

2. Code download (10–200 ms)

The worker fetches your zip or container image from internal S3 (S3 with internal accelerated endpoints, not public S3). For container images, Lambda uses a custom layered cache with chunk-level deduplication: the image is split into 512 KB chunks, each chunk fetched lazily as it's read. Only chunks the runtime actually touches get downloaded.

3. microVM boot + runtime init (100–500 ms)

Firecracker boots the guest kernel in ~125 ms. The Lambda init process starts, mounts /var/task and /opt (Lambda Layers), and exec's your runtime. The runtime then runs your global-scope code. For Java/Spring this can be seconds; for Node.js this is typically <100 ms; for Python it depends heavily on what you import at module load.

4. Handler execution

The runtime polls the Lambda Runtime API for an invocation, deserializes the event, and calls your handler. The Runtime API is a local HTTP endpoint whose address Lambda passes in the AWS_LAMBDA_RUNTIME_API environment variable (typically 127.0.0.1:9001). Once the handler returns, the runtime posts the response and goes back to polling. The microVM stays warm; between invocations the runtime process is frozen using the cgroup freezer.
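
The polling loop in step 4 can be sketched in a few lines. This is a minimal illustration, not Lambda's actual runtime code: the endpoint paths come from the public Runtime API, while `run_once`, `http_get`, and `http_post` are hypothetical names introduced here, and the handler takes only the event (a real runtime also passes a context object). The transport functions are injectable so the loop can be exercised without a live environment.

```python
import json
import os
import urllib.request

API_VERSION = "2018-06-01"


def runtime_base_url():
    # Lambda passes the Runtime API host:port in this variable.
    # The fallback value is an assumption for running the sketch locally.
    host = os.environ.get("AWS_LAMBDA_RUNTIME_API", "127.0.0.1:9001")
    return f"http://{host}/{API_VERSION}/runtime"


def http_get(url):
    # Blocks until an event is available; returns (request_id, raw_body).
    with urllib.request.urlopen(url) as resp:
        return resp.headers["Lambda-Runtime-Aws-Request-Id"], resp.read()


def http_post(url, body):
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req) as resp:
        resp.read()


def run_once(handler, get=http_get, post=http_post):
    """Process exactly one invocation: fetch, call the handler, report back."""
    base = runtime_base_url()
    request_id, raw = get(f"{base}/invocation/next")
    try:
        result = handler(json.loads(raw))
        post(f"{base}/invocation/{request_id}/response",
             json.dumps(result).encode())
    except Exception as exc:
        # Handler failures are reported to Lambda rather than crashing the loop.
        error = {"errorMessage": str(exc), "errorType": type(exc).__name__}
        post(f"{base}/invocation/{request_id}/error",
             json.dumps(error).encode())
    return request_id
```

A runtime would call `run_once` inside `while True:`. The blocking GET on `invocation/next` is the point at which an idle environment sits frozen between invocations.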

The asymmetry between phases drives optimization choices:

Phase | Typical | Worst case | Mitigation
Placement | 5–30 ms | seconds (capacity) | Provisioned concurrency
Code fetch (zip) | 10–100 ms | 500 ms | Smaller deployment, layers
Code fetch (image) | 50–300 ms | 2 s on first chunk miss | Image flattening, fewer layers
VM boot | ~125 ms | 200 ms | None; this is fixed
Runtime init | 20–3000 ms | 10 s+ (Spring, Quarkus JVM) | SnapStart, GraalVM, smaller deps
VPC ENI attach | 0 (Hyperplane) | was 5–10 s pre-2019 | Already mitigated

Memory, CPU, and the Linear Coupling

Lambda has one knob the user controls for performance: memory. CPU is allocated in proportion to memory. AWS documents that 1,769 MB corresponds to one full vCPU and that the maximum is 6 vCPUs; the intermediate points below follow from linear scaling, which benchmarks broadly confirm:

Memory (MB) | vCPU equivalent | Typical use case
128 | ~0.072 (7.2% of 1 vCPU) | Tiny webhook, cron tick
1,769 | 1.0 vCPU (one full core) | Single-threaded API
3,538 | 2.0 vCPUs | JSON-heavy API, light I/O parallelism
5,308 | 3.0 vCPUs | Image processing, light ML inference
7,077 | 4.0 vCPUs | FFmpeg, heavier image work
10,240 | 6.0 vCPUs (max) | ML inference, large-batch transforms
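
As a rough rule of thumb, the vCPU share for any memory setting can be estimated by linear scaling from the 1,769 MB point. The function below is a sketch under that assumption; `estimated_vcpus` is a name introduced here, and the linear model is only an approximation (it slightly underestimates the top of the range, where AWS quotes up to 6 vCPUs).

```python
FULL_VCPU_MB = 1769   # memory setting that corresponds to one full vCPU
MAX_VCPUS = 6.0       # ceiling quoted for the 10,240 MB maximum


def estimated_vcpus(memory_mb):
    """Estimate the vCPU share, assuming CPU scales linearly with memory."""
    if not 128 <= memory_mb <= 10240:
        raise ValueError("memory must be between 128 and 10240 MB")
    return min(memory_mb / FULL_VCPU_MB, MAX_VCPUS)
```

Anything that returns less than 1.0 means single-threaded code is running on a fraction of a core.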

The breakpoint at 1,769 MB is the key one: below it, you have less than a full core, and any single-threaded compute is throttled accordingly. AWS Lambda Power Tuning (an open-source tool built on Step Functions) will sweep memory settings for your function and produce a cost-vs-latency Pareto frontier. The result is often counterintuitive: for CPU-bound code, doubling memory can roughly halve duration, so the function gets faster at about the same cost.

// Lambda Power Tuning typical output for a CPU-bound function
// Memory  Duration  Cost (per 1M invocations)
// 128 MB  6,800 ms  $14.20
// 512 MB  1,720 ms  $14.40
// 1024 MB   860 ms  $14.40
// 1769 MB   500 ms  $14.50   <-- 1 vCPU breakpoint
// 3008 MB   480 ms  $24.10   <-- diminishing returns
//
// Sweet spot: 1769 MB for single-threaded work.
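
The cost column in a sweep like this is plain arithmetic: GB allocated times seconds billed times the per-GB-second rate. The sketch below recomputes it from the memory and duration figures above and picks the cheapest setting that meets a latency target. The rate is the $0.0000166667/GB-s figure used elsewhere in this article; `cost_per_million` and `cheapest_within` are hypothetical helper names, and request fees are left out.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # on-demand duration rate assumed here


def cost_per_million(memory_mb, duration_ms):
    """Duration cost in USD for one million invocations (no request fee)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND * 1_000_000


def cheapest_within(measurements, max_duration_ms):
    """Return the lowest-cost memory setting whose duration meets the target."""
    candidates = {
        memory: cost_per_million(memory, duration)
        for memory, duration in measurements.items()
        if duration <= max_duration_ms
    }
    if not candidates:
        raise ValueError("no memory setting meets the latency target")
    return min(candidates, key=candidates.get)


# Memory (MB) -> measured duration (ms), taken from the sweep above.
measurements = {128: 6800, 512: 1720, 1024: 860, 1769: 500, 3008: 480}
```

With a 600 ms target, `cheapest_within(measurements, 600)` selects 1769 MB; with no meaningful latency target the 128 MB setting is marginally cheapest, which is why the choice is a trade-off rather than a single right answer.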

Provisioned Concurrency vs SnapStart

For workloads where cold-start latency is unacceptable, Lambda offers two distinct mechanisms:

Provisioned Concurrency

Pre-initialized environments held warm. You pay for the configured capacity whether or not it serves traffic (~$0.0000041667 per GB-second, on top of request and duration charges when invoked). Init code runs ahead of time. Works with any runtime. Good for predictable peak traffic; combine with Application Auto Scaling on the ProvisionedConcurrencyUtilization CloudWatch metric. Anti-pattern: using PC to fix a slow init you should have profiled instead.

SnapStart (Java, Python, .NET)

Lambda runs your init code, takes a Firecracker microVM snapshot (memory + filesystem), then encrypts and caches it. New environments restore the snapshot instead of re-running init. Restoration takes ~200 ms vs 5+ seconds for a cold Spring Boot start. For Java there is no extra charge for SnapStart itself; for Python and .NET, AWS bills for snapshot caching and for each restore. The catch: you must handle uniqueness after restore. Random seeds, network connections, and temporary credentials captured in the snapshot must be regenerated in each new environment via runtime hooks (beforeCheckpoint and afterRestore in Java's CRaC API). Java was first; Python and .NET followed in 2024–2025.

// Java SnapStart: regenerate connections after restore
import java.sql.Connection;
import java.sql.DriverManager;
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

public class DbHandler implements Resource {
  private Connection conn;

  public DbHandler() {
    // Register so the runtime calls the hooks below around the snapshot
    Core.getGlobalContext().register(this);
  }

  // Called BEFORE snapshot -- close transient state
  @Override
  public void beforeCheckpoint(Context<? extends Resource> ctx) throws Exception {
    if (conn != null) conn.close();
    conn = null;
  }

  // Called AFTER restore in each new microVM -- recreate
  @Override
  public void afterRestore(Context<? extends Resource> ctx) throws Exception {
    conn = DriverManager.getConnection(System.getenv("DB_URL"));
  }
}
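
For Python functions the same pattern is available through runtime hooks. The sketch below assumes the `snapshot_restore_py` module that Lambda's SnapStart-capable Python runtimes provide, with a no-op fallback so the file also imports outside Lambda. The `state` dictionary and the function names are illustrative, not part of any AWS API.

```python
import uuid

try:
    # Provided by Lambda's managed Python runtimes that support SnapStart.
    from snapshot_restore_py import (
        register_after_restore,
        register_before_snapshot,
    )
except ImportError:
    # Local fallback (an assumption for testing): registration is a no-op.
    def register_before_snapshot(func, *args, **kwargs):
        return func

    def register_after_restore(func, *args, **kwargs):
        return func

# Values created at init are baked into the snapshot and would otherwise be
# identical in every restored environment.
state = {"instance_id": str(uuid.uuid4()), "conn": None}


def close_connections():
    # Runs before the snapshot: drop anything that cannot survive a restore.
    state["conn"] = None


def refresh_identity():
    # Runs after each restore: regenerate whatever must be unique per env.
    state["instance_id"] = str(uuid.uuid4())


register_before_snapshot(close_connections)
register_after_restore(refresh_identity)


def handler(event, context):
    return {"instance_id": state["instance_id"]}
```

The hooks are registered by plain function call rather than as decorators so that `close_connections` and `refresh_identity` stay directly callable in unit tests.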

Execution Environment Reuse and Frozen State

Between invocations, Lambda freezes the microVM using the cgroup freezer subsystem. CPU is paused; memory is retained. When the next invocation arrives, the runtime is unfrozen and resumes. This is why module-scoped variables persist across invocations on the same environment — a fact that's both a powerful optimization and a source of subtle bugs.

// Node.js -- module scope persists across warm invocations
// Uses AWS SDK v3, the version bundled with the Node.js 18+ runtimes
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, GetCommand } = require('@aws-sdk/lib-dynamodb');
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({})); // Created ONCE per env, reused
const cache = new Map(); // Persists across invocations on the same env

exports.handler = async (event) => {
  if (cache.has(event.id)) return cache.get(event.id); // Warm cache hit
  const item = await ddb.send(new GetCommand({ TableName: 'T', Key: { id: event.id } }));
  cache.set(event.id, item.Item);
  return item.Item;
};
// Caveats:
//  - Cache may be on ANY of N parallel envs -- partial hit rate
//  - Setting timers (setInterval) leaks across invocations — freeze stops them, but they fire on thaw
//  - Pending promises that aren't awaited may complete on a future invocation

The freeze/thaw model has three practical consequences:

  • Background work is unsafe. A setTimeout callback or unawaited promise is paused when your handler returns. It may complete during the next invocation, or never, if the environment is reaped.
  • Idle connections go stale. Open sockets to a database survive the freeze, but the database may reset them while the environment sits idle; your next invocation then gets EPIPE. Use RDS Proxy, or a client library that validates connections before reuse.
  • Timers do not run while frozen. Timer-driven housekeeping (for example, a setInterval that evicts cache entries) makes no progress during a freeze, so entries can be held longer than expected. Check an entry's age against the current time when you read it.
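
One way to keep a module-scope cache honest across freeze and thaw is to decide expiry at read time instead of relying on a background timer. The class below is a minimal sketch of that idea, assuming the wall clock reads correctly once the environment resumes; `TtlCache` and its injectable `clock` parameter are names introduced here for illustration.

```python
import time


class TtlCache:
    """Module-scope cache whose entries expire when read, not via timers."""

    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests can control time
        self._items = {}

    def put(self, key, value):
        self._items[key] = (value, self._clock())

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            # Too old: evict lazily instead of waiting for a timer that may
            # not have run while the environment was frozen.
            del self._items[key]
            return None
        return value
```

Because nothing runs in the background, there is no work left pending when the handler returns, which also sidesteps the first consequence in the list above.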

VPC, Hyperplane ENIs, and the Cold-Start Tax That Was

Pre-2019, attaching a Lambda to a VPC was punitive: Lambda would create an ENI per concurrent execution, and ENI attachment took 5–10 seconds. A burst of 100 cold concurrent invocations meant 100 ENI creations.

AWS replaced this in 2019 with Hyperplane ENIs — shared ENIs in your subnet that Lambda multiplexes across many concurrent function executions using NAT-style port mapping. The result:

  • One Hyperplane ENI per (VPC, subnet, security group) combination, regardless of concurrency.
  • ENI cold-start tax dropped from seconds to effectively zero.
  • Lambda still needs an unused IP in your subnet — if your subnet's CIDR is exhausted, function scaling is capped.
  • NAT for outbound: the Hyperplane ENI translates function traffic into your VPC, and its bandwidth is shared across all functions using that ENI. It does not provide internet access by itself; that still requires a NAT gateway in the VPC.

Function URLs vs API Gateway vs ALB

Feature | Function URL | API Gateway (REST) | API Gateway (HTTP) | ALB
Latency overhead | ~10 ms | 20–50 ms | 10–20 ms | 5–15 ms
Cost per million requests | $0 (just Lambda) | $3.50 | $1.00 | n/a (billed per ALB-hour and LCU-hour)
Auth options | IAM or NONE (public) | IAM, Cognito, custom, JWT | IAM, JWT, custom | Cognito, OIDC
WAF | No (use CloudFront) | Yes | No | Yes
Request limit | 6 MB | 10 MB | 6 MB | 1 MB
Streaming response | Yes (up to 20 MB) | No | No | Yes
Best for | Internal tools, webhooks | Production APIs, throttling | Cost-sensitive APIs | Existing ALB infra

Function URLs (released 2022) are the lightest-weight option: a stable HTTPS URL with TLS termination, zero per-request charge beyond Lambda itself, and support for response streaming via the Lambda invoke-with-response-stream API. They lack the ergonomics of API Gateway (no resource models, no request validation, no integrated WAF), but for <10 RPS internal tools they're hard to beat.

Lambda Layers, Container Images, and the Deployment Question

Lambda has two deployment models that look similar but behave differently:

# Zip deployment (default, fastest cold start)
$ zip -r function.zip handler.js node_modules/
$ aws lambda update-function-code --function-name f --zip-file fileb://function.zip
# Limit: 50 MB zipped, 250 MB unzipped (incl. layers)
# Cold start: fastest -- code is fetched as one blob

# Container image deployment (10 GB ceiling, OCI-compatible)
$ docker build -t my-fn .
$ aws ecr get-login-password | docker login --username AWS --password-stdin <acct>.dkr.ecr.<region>.amazonaws.com
$ docker push <acct>.dkr.ecr.<region>.amazonaws.com/my-fn:latest
$ aws lambda update-function-code --function-name f --image-uri <acct>.dkr.ecr.<region>.amazonaws.com/my-fn:latest
# Lambda: chunks the image into 512 KB blocks, fetches lazily on first read
# Cold start: only marginally slower than zip if image is well-layered

Container images solved the 250 MB limit (huge for ML workloads with PyTorch + model weights), but they require an ECR repository and a Dockerfile based on an AWS-provided base image (or one with the runtime interface client embedded). The chunked lazy-load means a 5 GB image with a Python handler that imports only boto3 may only fetch ~50 MB on first call. If your handler later loads a 4 GB model, however, that fetch happens during invocation, on the user's wall clock.

Lambda Layers are a separate mechanism: a zip mounted at /opt on the microVM. Up to 5 layers per function, 250 MB combined unzipped. Layers are deduplicated across functions that reference the same layer ARN, so if 100 of your functions share a 50 MB layer, the layer is fetched once per worker, not once per function. Layers are most useful for cross-function utility code (the AWS SDK is pre-bundled in the runtime, so you don't need to layer it).

Lambda Extensions API

Extensions are out-of-band sidecars that run inside the same microVM as your function. They register with the Lambda extensions API at init time and receive lifecycle events: INVOKE, SHUTDOWN. Common uses: APM agents (Datadog, New Relic), secrets prefetching (AWS Parameters and Secrets Lambda Extension caches Parameter Store values in-memory), log shippers.

# Extension executable at /opt/extensions/<name>
# It's a binary or script; Lambda starts it before your runtime
# Lifecycle:
#  1. Lambda starts extension processes from /opt/extensions/
#  2. Each extension calls POST /2020-01-01/extension/register
#  3. Lambda starts your runtime + handler
#  4. On each INVOKE, extensions get notified (parallel to your handler)
#  5. After your handler returns, extensions can keep working until they
#     request the next event (bounded by the function timeout); only then
#     does Lambda freeze the env
#  6. SHUTDOWN event when env is reaped; extensions get ~2 seconds to clean up

# Example: AWS Parameters & Secrets Lambda Extension
# Adds layer arn:aws:lambda:<region>:177933569100:layer:AWS-Parameters-and-Secrets-Lambda-Extension:N
# Your handler hits localhost:2773 instead of SSM/Secrets Manager
# 5-minute in-memory TTL -- huge cold-start win for secrets-heavy functions
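
From the handler's side, reading a cached parameter is a local HTTP GET. The sketch below builds that request with the standard library; the path, port variable, and token header follow the extension's documented interface as best understood here, while `build_parameter_request` and `get_parameter` are hypothetical helper names. Check the header and path against the extension version you deploy.

```python
import json
import os
import urllib.parse
import urllib.request


def build_parameter_request(name, with_decryption=False):
    """Build the local GET request for one Parameter Store value."""
    port = os.environ.get("PARAMETERS_SECRETS_EXTENSION_HTTP_PORT", "2773")
    query = urllib.parse.urlencode(
        {"name": name, "withDecryption": str(with_decryption).lower()}
    )
    url = f"http://localhost:{port}/systemsmanager/parameters/get?{query}"
    # The extension expects the function's session token as proof that the
    # call originates inside the execution environment.
    token = os.environ.get("AWS_SESSION_TOKEN", "")
    return urllib.request.Request(
        url, headers={"X-Aws-Parameters-Secrets-Token": token}
    )


def get_parameter(name, with_decryption=False):
    """Fetch a value through the extension's cache (only works in Lambda)."""
    request = build_parameter_request(name, with_decryption)
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["Parameter"]["Value"]
```

Calling `get_parameter` at the start of each invocation is cheap, because repeated reads within the TTL are served from the extension's memory rather than from SSM.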

Lambda@Edge vs CloudFront Functions

At the CloudFront edge, AWS offers two compute options that sound similar but have very different execution models:

Feature | Lambda@Edge | CloudFront Functions
Runtime | Node.js, Python (Lambda subset) | JavaScript (V8-isolate, ECMA 5.1)
Cold start | 50–200 ms | <1 ms (isolate, no microVM)
Max execution time | 5 s (viewer), 30 s (origin) | 1 ms (viewer events only)
Memory | 128 MB – 10 GB | 2 MB
Network calls | Yes (any HTTP) | No (pure compute)
Triggers | 4 events (viewer/origin, request/response) | 2 events (viewer-request, viewer-response)
Cost per million | $0.60 + duration | $0.10
Best for | A/B routing, image resizing, auth | URL rewrites, header manipulation, JWT verify (HMAC)

CloudFront Functions runs in V8 isolates — the same isolation primitive as Cloudflare Workers, not microVMs. That's why cold start is sub-millisecond but the model is restrictive: no async, no network, 1 ms CPU budget. Lambda@Edge is "real Lambda," replicated across 13+ regional edge caches. Use CloudFront Functions for header manipulation; Lambda@Edge for anything that needs to do real work.

How Lambda Compares to Other Function Platforms

Feature | AWS Lambda | Cloudflare Workers | Google Cloud Run | AWS Fargate
Isolation | Firecracker microVM | V8 isolate | gVisor + container | Firecracker (now)
Cold start | 100–1000+ ms | <5 ms | 200–2000 ms | 20–60 s
Max duration | 15 min | 30 s CPU (10 min wall) | 60 min | Unlimited
Max memory | 10 GB | 128 MB | 32 GB | 120 GB
Languages | Many runtimes + custom | JS/TS/Rust (WASM) | Any (container) | Any (container)
Concurrency model | 1 invocation per env | Many per isolate | Configurable (1–1000 per instance) | Long-running
Network restrictions | None | Limited (subrequests, no raw TCP) | None | None

The fundamental trade-off: Workers' V8 isolates give you near-zero cold start at the cost of running only sandboxed JavaScript and WebAssembly, with no native binaries. Lambda's microVMs give you any language and any binary, but every cold start pays for VM boot + runtime init. Cloud Run sits in between with concurrency >1 per instance, blunting the cold-start cost across many requests.

Tradeoffs & Failure Modes

  • Concurrency limits cause throttles, not queueing. Account-level concurrency is 1,000 by default. Hit it and synchronous invocations get TooManyRequestsException; async invocations retry with exponential backoff for up to 6 hours, then go to your DLQ. Always set a reserved concurrency on critical functions to protect them from being starved by other functions in the account.
  • Async retry semantics are surprising. Async invocations retry twice on failure (3 attempts total) with delays of ~1 minute and ~2 minutes. After that, events go to a DLQ if configured. For event sources like SNS, the SNS retry policy is layered on top — you can end up with up to ~20 attempts depending on failures.
  • Idempotency is your problem. Lambda may invoke your function more than once for the same event, especially with SQS, Kinesis, and DynamoDB Streams. The function must be idempotent. Use Powertools idempotency utility (DynamoDB-backed) or your own dedupe table.
  • Concurrency-per-event-source bottlenecks. SQS standard queues scale Lambda concurrency at +60 per minute, capped at 1,000 per function. With a backlog of 100K messages, it therefore takes ~17 minutes (1,000 / 60) just to reach full consumer concurrency. SQS FIFO and Kinesis are bounded by message-group and shard count respectively, not by Lambda concurrency.
  • Init phase has its own timeout. On-demand init is limited to 10 seconds. If your global-scope code takes longer, the init is aborted and re-run during the first invocation under the function's configured timeout, so the caller sees a slow or failed request. Lazy-load instead of eagerly importing everything.
  • The 6 MB payload cap is an HTTP body cap, not a JSON-string cap. Base64 encoding for binary in API Gateway eats your budget. For larger payloads, write to S3 and pass a key.
  • Hot environments don't load-balance fairly. Lambda tends to route to an already-warm environment rather than spreading requests evenly. Module-scoped state on a popular env grows; on a rarely used env, it's stale or empty. Don't use module scope as a cache for anything where staleness beyond a few minutes matters.
  • SnapStart restore is not free of state. A snapshot is a moment in time. UUIDs generated at init are identical across restored microVMs unless you regenerate them in afterRestore. Be paranoid about anything that "should be unique."
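
The idempotency point is the one most worth a concrete shape. The sketch below deduplicates SQS records by `messageId` before processing them. `InMemoryIdempotencyStore` is a hypothetical stand-in for a DynamoDB table written with a conditional put; an in-memory store only dedupes within one environment, so a real deployment needs the shared table (or the Powertools idempotency utility).

```python
import time


class InMemoryIdempotencyStore:
    """Stand-in for a shared table that rejects a second write of a key."""

    def __init__(self):
        self._expires_at = {}

    def claim(self, key, ttl_seconds, now=None):
        """Return True if this caller is the first to claim an unexpired key."""
        now = time.time() if now is None else now
        expires = self._expires_at.get(key)
        if expires is not None and expires > now:
            return False
        self._expires_at[key] = now + ttl_seconds
        return True


store = InMemoryIdempotencyStore()
processed = []


def handler(event, context=None):
    for record in event["Records"]:
        # messageId is stable across redeliveries of the same SQS message.
        if not store.claim(record["messageId"], ttl_seconds=3600):
            continue  # duplicate delivery: skip the side effect
        processed.append(record["body"])
    return {"processed": len(processed)}
```

Claiming before processing means a crash between the claim and the side effect drops the message; production implementations usually record an in-progress state and a completion state separately to cover that window.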

FAQ

Why does my Java Lambda take 8 seconds to cold-start, and how do I fix it?

Spring Boot or Quarkus on the JVM does massive classpath scanning at boot. The fix order: (1) enable SnapStart — for Spring Boot apps it typically drops cold start from 6–10 s to ~300–600 ms; (2) if SnapStart is unavailable, switch to GraalVM native-image — ~50 MB binary, ~150 ms cold start, but you lose reflection unless you register types; (3) as a last resort, use provisioned concurrency. Don't bother increasing memory past 1769 MB unless you've verified the bottleneck is CPU, not classpath I/O.

How do I know which exact warm environment my invocation hit?

Lambda exposes /proc/self/cgroup with a sandbox ID, and AWS_LAMBDA_LOG_STREAM_NAME contains a stream name that includes the environment ID. Logs are partitioned by stream, so all invocations on the same env land in the same stream. Cold starts have an INIT_START log line; warm invocations don't. Use the aws-lambda-powertools tracer to add cold-start markers to X-Ray traces.
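
If you only need to tell cold from warm invocations in your own logs, a module-scope flag is enough. The sketch below relies on the fact that module scope runs once per environment; `_env` and the returned field names are illustrative choices, not a Lambda convention.

```python
import os
import time

# Module scope runs once per execution environment, during the init phase.
_env = {"cold": True, "initialized_at": time.time()}


def handler(event, context=None):
    was_cold = _env["cold"]
    _env["cold"] = False  # every later invocation on this env is warm
    return {
        "cold_start": was_cold,
        "env_age_seconds": round(time.time() - _env["initialized_at"], 3),
        # Set by Lambda; identifies the log stream for this environment.
        "log_stream": os.environ.get("AWS_LAMBDA_LOG_STREAM_NAME", "unknown"),
    }
```

Logging these fields as structured JSON makes it straightforward to group invocations by environment and to measure how long environments live in practice.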

Should I use connection pooling to RDS from Lambda?

Use RDS Proxy. A naive pool of N connections per Lambda environment, multiplied by Y concurrent environments, can exhaust the RDS connection limit (Postgres default is 100). RDS Proxy maintains a shared pool, presents Lambda with cheap virtual connections, and recycles backend connections across Lambda invocations. Aurora Serverless v2 with the Data API is the alternative: it is HTTP-based with no connection state at all, but slower per query, and a transaction that spans several calls has to be managed explicitly through the API's transaction ID.

What's the actual difference between async and sync invocation?

Sync (InvocationType=RequestResponse): caller waits for the response, gets the function's return value or error. No retry by Lambda. Used by API Gateway, ALB, function URLs. Async (InvocationType=Event): Lambda queues the event in an internal queue, returns 202 immediately, retries on failure (twice), DLQ on final failure. Used by S3 events, SNS, EventBridge. The internal queue is not exposed and you cannot inspect its depth directly (the AsyncEventAge metric is only an indirect signal); if you need to monitor queue depth, put SQS in front.

How does Lambda's billing actually work for sub-100ms invocations?

Billing rounds duration up to the nearest 1 ms, multiplied by allocated memory in GB. A 50 ms invocation at 1024 MB costs 0.05 s × 1 GB × $0.0000166667/GB-s = $0.00000083, plus a flat $0.20 per million requests. The flat per-request fee dominates for very short functions: at 50 ms / 128 MB, the duration cost is $0.0000001 vs a request fee of $0.0000002, so the request fee is about 2× the compute. Provisioned concurrency is billed differently: a lower per-GB-second rate for the configured capacity whether or not it is used, plus a reduced duration rate while executing.
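
The arithmetic above is easy to wrap in a helper for quick what-if checks. This sketch uses the two on-demand prices quoted in this answer and rounds duration up to the next millisecond; `invocation_cost` is a hypothetical name, and regional or architecture-specific price differences are ignored.

```python
import math

PRICE_PER_GB_SECOND = 0.0000166667   # duration rate quoted above
PRICE_PER_REQUEST = 0.20 / 1_000_000  # flat fee per invocation


def invocation_cost(duration_ms, memory_mb):
    """Return (duration_cost, request_cost) in USD for one invocation."""
    billed_ms = math.ceil(duration_ms)  # billing rounds up to the next 1 ms
    gb_seconds = (billed_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * PRICE_PER_GB_SECOND, PRICE_PER_REQUEST
```

For example, `invocation_cost(50, 128)` returns a duration cost of roughly $0.0000001 alongside the $0.0000002 request fee, which is the 2× ratio described above.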

Can I run a binary that's not a Lambda runtime (e.g., FFmpeg) on Lambda?

Yes — either bundle the binary in your zip (or a Lambda Layer), and exec it from your handler; or build a container image with the binary baked in. Use static binaries (FFmpeg static builds) to avoid library-version mismatches with the Lambda runtime's Amazon Linux 2 base. Watch /tmp size for intermediate files, and remember the 15-minute timeout. For workloads > 15 minutes, AWS Batch or ECS/Fargate is the right answer.
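
Invoking a bundled binary from a handler usually comes down to a guarded `subprocess.run`. The sketch below shows one way to do it; the `/opt/bin/ffmpeg` path, the `input_path` event field, and the 60-second budget are assumptions chosen for illustration, so adjust them to wherever your layer or image actually places the binary.

```python
import subprocess


def run_binary(argv, timeout_seconds):
    """Run an external program and return its stdout, raising on failure."""
    completed = subprocess.run(
        argv,
        capture_output=True,      # keep stdout/stderr for logging
        text=True,
        timeout=timeout_seconds,  # fail before the Lambda timeout does
        check=True,               # non-zero exit raises CalledProcessError
    )
    return completed.stdout


def handler(event, context=None):
    output_path = "/tmp/out.mp3"  # /tmp is the only writable location
    run_binary(
        ["/opt/bin/ffmpeg", "-y", "-i", event["input_path"], output_path],
        timeout_seconds=60,
    )
    return {"output_path": output_path}
```

Passing the arguments as a list avoids shell interpretation of file names, and the explicit timeout turns a hung transcode into an exception you can log instead of a silent platform timeout.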

What happens if my handler doesn't return within the timeout?

The invocation is killed with a Task timed out after N.NN seconds error. The microVM itself isn't necessarily destroyed — Lambda may reuse it for the next invocation if it's recoverable. But any in-flight work in your handler is gone, side effects (database writes already committed) remain, and the invocation result is a function error from the caller's perspective. Set timeouts conservatively but not absurdly high — an over-long timeout means slow failures cost you money.
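
A handler that processes a batch can avoid being killed mid-item by checking the remaining time on the context object and stopping early. The sketch below uses the `get_remaining_time_in_millis()` method of the Python context object; the 2-second safety margin, the `items` event field, and the `process` function are placeholders for illustration.

```python
SAFETY_MARGIN_MS = 2_000  # stop early enough to return a clean response


def process(item):
    # Placeholder for real per-item work.
    return item * 2


def handler(event, context):
    done = []
    for item in event["items"]:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break  # not enough time left to start another item safely
        done.append(process(item))
    # Report what was not reached so the caller can re-submit it.
    return {"done": done, "remaining": event["items"][len(done):]}
```

Returning the unprocessed remainder turns a hard timeout into a partial success that the caller, a queue, or a Step Functions loop can resume.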

How do Lambda Function URLs handle TLS?

AWS terminates TLS at the Lambda service boundary using a managed certificate on *.lambda-url.<region>.on.aws. You cannot bring your own certificate to a Function URL; if you need a custom domain, put CloudFront in front (which lets you use ACM certificates and adds WAF). Function URLs support both IAM auth (signed requests) and NONE (public). For public webhooks, use NONE + verify the source's signature in your handler.
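
For a public (NONE) Function URL, signature verification in the handler might look like the sketch below. It assumes the sender signs the raw request body with HMAC-SHA256 and sends the hex digest in an `x-signature` header, with the shared secret in a `WEBHOOK_SECRET` environment variable; both names are hypothetical, and real providers each define their own header and signing scheme.

```python
import base64
import hashlib
import hmac
import os


def verify_signature(secret, body, signature_hex):
    """Constant-time check of an HMAC-SHA256 hex digest over the raw body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def handler(event, context=None):
    raw = event.get("body") or ""
    # Function URLs base64-encode binary bodies and flag them in the event.
    body = base64.b64decode(raw) if event.get("isBase64Encoded") else raw.encode()
    signature = event.get("headers", {}).get("x-signature", "")
    secret = os.environ.get("WEBHOOK_SECRET", "").encode()
    if not secret or not verify_signature(secret, body, signature):
        return {"statusCode": 401, "body": "invalid signature"}
    return {"statusCode": 200, "body": "ok"}
```

The comparison uses `hmac.compare_digest` rather than `==` so that response timing does not leak how many leading characters of a forged signature were correct.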