Elasticsearch Internals

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It turns text into inverted indexes, fans queries across shards, scores results with BM25, and exposes a JSON query DSL over HTTP. Underneath the marketing copy, every Elasticsearch cluster is a fleet of Lucene indexes (one per shard) coordinated through a consensus-managed cluster state, with a write-ahead translog on each shard for durability. Understanding that machinery is the difference between a cluster that returns in 50 ms and one that ages into a stop-the-world garbage-collection nightmare.

Covers Elasticsearch 7.x/8.x and the OpenSearch fork. All references to internals reflect the mainline Lucene 8/9 design.

Why Elasticsearch Exists

The Gap

Before Elasticsearch, full-text search meant either a brittle SQL LIKE '%foo%' (no scoring, full table scan) or hand-rolling Lucene in Java with no clustering, no replication, and no HTTP API. Solr existed but its master/slave replication model and XML config made it painful to operate at scale.

The Insight

Lucene's per-segment immutable inverted index is already a great single-node search engine. What was missing was a sharding layer, a JSON-over-HTTP API, automatic node discovery, and a query DSL that non-Java developers could use. Wrap Lucene with those four pieces and you get horizontal search.

The Result

A cluster you can grow from one node to thousands by adding hardware. Index 100 GB/day per node, search billions of documents in tens of milliseconds, and aggregate over hundreds of dimensions. Elasticsearch became the default substrate for log search (ELK), product search, and observability backends.

Cluster Architecture

Key Numbers

Default refresh interval	1 second
Translog fsync interval (async durability)	5 seconds
Default shards (8.x)	1 primary
Recommended max shard size	~50 GB
Default scoring	BM25
Bulk batch sweet spot	5-15 MB
JVM heap ceiling	~31 GB

The Inverted Index — Lucene's Core Data Structure

An inverted index maps terms to the documents containing them. The forward index of a relational database asks "what columns does row 42 have"; the inverted index asks "which documents contain the term elasticsearch". For each term Lucene stores a posting list — a sorted sequence of document IDs along with optional positions, offsets, and payloads.

Term         |  Doc Frequency  |  Postings (docID, freq, [positions])
-------------+-----------------+--------------------------------------
elastic      |  3              |  (5, 1, [4]), (12, 2, [1, 8]), (47, 1, [2])
search       |  4              |  (5, 1, [5]), (12, 1, [9]), (47, 1, [3]), (51, 1, [0])
distributed  |  2              |  (12, 1, [3]), (47, 1, [7])

To answer "elastic search", Lucene intersects the posting lists for both terms, checking that positions are adjacent for phrase matches. Posting lists are stored on disk in a compressed format (PFOR-Delta and Frame-of-Reference encoding) so a list of millions of doc IDs costs a few MB, not hundreds. The terms themselves live in an FST (finite-state transducer), a compact data structure that supports prefix scans and fuzzy matches with bounded edit distance.
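
As a rough illustration of the lookup described above, here is a minimal Python sketch of a positional inverted index and a phrase query. It ignores compression, the FST term dictionary, and segments; the function names and the sample documents are invented for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]}; docs is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id].lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted doc IDs where the phrase terms sit at adjacent positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Step 1: intersect the doc ID sets of every term.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # Step 2: a start position p matches if term i sits at p + i for every i.
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits

# Hypothetical documents, keyed by doc ID.
docs = {
    5: "fast and scalable elastic search",
    12: "search is elastic when distributed",
    47: "distributed elastic search engine",
}
index = build_index(docs)
print(phrase_search(index, "elastic search"))  # doc 12 has both terms, but not adjacent
```

Doc 12 survives the doc ID intersection and is only rejected by the position check, which is why phrase queries cost more than plain term conjunctions.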

Crucially, every Lucene segment is immutable. Once written, postings, term dictionaries, doc values, and stored fields never change. Updates produce a new document with the same ID and mark the old one as deleted in a per-segment liveDocs bitmap. Deletes are tombstones. This immutability is what makes refreshes near-real-time and replication trivial, but it also means storage grows until segments are merged.

Refresh, Flush, Merge — The Write Path

Every indexing request takes a circuitous path through several layers before it is durable and searchable. Understanding this path is essential for tuning throughput.

In-memory buffer — Documents enter an in-memory indexing buffer (sized by indices.memory.index_buffer_size, default 10% of heap). They are not yet searchable.
Translog append — Each operation is also appended to the translog (the durability log). With index.translog.durability set to request (the default), the translog is fsynced before the request is acknowledged; with async, it is fsynced every 5 seconds by default.
Refresh (every 1 second) — The buffer is written out as a new Lucene segment that lives in the filesystem cache (not yet fsynced) and is opened for search. This is what makes Elasticsearch "near real-time": by default there is up to a 1-second delay between indexing and searchability.
Flush (by default when the translog reaches 512 MB) — Segments are fsynced to disk in a Lucene commit and a new translog generation is started. This is the equivalent of a checkpoint.
Merge — A background thread continuously merges small segments into larger ones using a tiered policy (default TieredMergePolicy). Merges reclaim space from deleted documents and reduce the number of segments per shard.

PUT /logs-2025-05/_settings
{
  "index.refresh_interval": "30s",        // bulk indexing? widen this
  "index.translog.durability": "async",   // sacrifice 5s of writes for throughput
  "index.translog.sync_interval": "5s",
  "index.translog.flush_threshold_size": "1gb"
}

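For high-throughput log ingestion, setting refresh_interval to 30s (or even -1 during initial bulk load) can raise indexing throughput substantially; the exact gain depends on hardware and document shape. Per-request durability costs throughput because every request waits on an fsync. The defaults are tuned for low-latency search and safe writes, not maximum throughput.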
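
To make the ordering of buffer, translog, refresh, and flush concrete, here is a toy Python model of one shard's write path. It is a sketch for intuition only: the class and attribute names are invented, and nothing here reflects how Lucene actually lays out data.

```python
class ToyShard:
    """Toy model of one shard's write path; not how Lucene stores data."""

    def __init__(self):
        self.buffer = []      # indexed but not yet searchable
        self.translog = []    # operations not yet covered by a commit
        self.segments = []    # searchable, immutable batches of docs
        self.committed = 0    # number of segments made durable by a flush

    def index(self, doc):
        # A write lands in the buffer and in the translog at the same time.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        """Commit all segments and start a fresh translog."""
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self, term):
        # Only refreshed segments are visible to search.
        return [d for seg in self.segments for d in seg if term in d]

shard = ToyShard()
shard.index("connection refused")
before = shard.search("refused")   # still in the buffer, so not visible
shard.refresh()
after = shard.search("refused")    # visible once a refresh has run
shard.flush()                      # translog is emptied after the commit
```

The gap between `before` and `after` is the refresh interval in miniature; widening it means fewer, larger segments at the price of staler search results.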

Shards, Replicas, and Routing

An index is partitioned into primary shards at creation time. Each primary may have zero or more replica shards on different nodes. Writes go to the primary first and are then replicated; both primaries and replicas serve reads. The number of primaries is fixed once an index is created. To change it you must reindex, or use the Split and Shrink APIs, which come with constraints (for example, the index must be made read-only first).

Documents are routed to a shard by hash:

shard_num = hash(_routing) % number_of_primary_shards

By default _routing is the document _id. You can override it to colocate related documents:

POST /orders/_doc?routing=customer_42
{ "customer_id": "customer_42", "total": 99.50, "items": [...] }

Custom routing turns a fan-out-to-all-shards query into a single-shard query, provided the search passes the same routing value. The downside: uneven routing keys produce hot shards. A common rule for sizing: aim for shards of 10-50 GB each. Smaller shards waste overhead (each shard is a Lucene index with its own files and threads); larger shards make merges and recovery slow.
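
The hot-shard risk is easy to see in a small Python sketch of the routing formula. Note the assumption: zlib.crc32 is used here purely as a stand-in hash, since Elasticsearch uses its own Murmur3-based function internally, and the customer distribution is made up.

```python
import zlib
from collections import Counter

def shard_for(routing, number_of_primary_shards):
    """Pick a shard from a routing value.

    zlib.crc32 is a stand-in hash chosen for this sketch only;
    Elasticsearch uses its own hash function internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

def shard_load(routing_values, number_of_primary_shards):
    """Count how many documents land on each shard."""
    return Counter(shard_for(r, number_of_primary_shards) for r in routing_values)

# Routing by unique document ID spreads documents across shards.
by_id = shard_load([f"order-{i}" for i in range(10_000)], 5)

# Routing by customer, where one hypothetical customer owns 90% of the orders,
# pins at least 9,000 documents to a single shard.
skewed = ["customer_42"] * 9_000 + [f"customer_{i}" for i in range(1_000)]
by_customer = shard_load(skewed, 5)

print("by _id:     ", sorted(by_id.items()))
print("by customer:", sorted(by_customer.items()))
```

Because the same routing value always maps to the same shard, a search that supplies routing=customer_42 only needs to visit that one shard, which is the benefit and the hazard at once.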

Replicas serve two purposes: high availability (a replica is promoted if its primary's node dies) and read scaling. They are kept in sync via the in-sync replica set (ISR-style): the primary acknowledges a write only after every in-sync replica has applied it. A replica that is partitioned away or fails to apply a write is removed from the in-sync set, trading redundancy for availability.

Query DSL — Filter Context vs Query Context

The Elasticsearch Query DSL is a JSON tree of clauses. The single most important distinction is between query context (scoring) and filter context (yes/no, cacheable). Filters are typically much cheaper than scored queries because they skip relevance scoring and their results can be cached.

GET /products/_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "title": "running shoes" } } ],   // SCORED
      "filter": [
        { "term":  { "in_stock": true } },                        // CACHED
        { "range": { "price": { "lte": 200 } } },                 // CACHED
        { "terms": { "category": ["athletic", "footwear"] } }
      ],
      "should": [ { "match": { "brand": "nike" } } ],             // boost only
      "must_not": [ { "term": { "discontinued": true } } ]
    }
  }
}

Anything that does not require relevance scoring should live under filter. The node query cache (shared by all shards on a node) uses an LRU policy and stores per-segment bitsets for frequently used filter clauses; when a cached filter reappears in a later query, its evaluation on that segment is skipped.

match queries run the field's analyzer on the input (so with a stemming analyzer such as english, "Running Shoes" becomes the tokens [run, shoe]); term queries do not. A surprising number of production bugs come from passing user input through term on an analyzed text field: it rarely matches because the indexed terms have been lowercased and stemmed.

The Analyzer Pipeline

Text fields are processed by an analyzer chain at index time and at query time. The chain has three stages:

Character filters — Operate on the raw string. Strip HTML, replace patterns, normalize Unicode (e.g. html_strip, mapping).
Tokenizer — Splits the stream into tokens. Common choices: standard (Unicode word boundaries), whitespace, keyword (no split, treats input as a single token), edge_ngram (for autocomplete), icu_tokenizer for non-Latin scripts.
Token filters — Transform the token stream. Lowercase, stop word removal, stemming (porter_stem, snowball), synonyms, ASCII folding, n-grams.

PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_custom": {
          "char_filter": ["html_strip"],
          "tokenizer":   "standard",
          "filter":      ["lowercase", "english_stop", "english_stemmer", "asciifolding"]
        }
      },
      "filter": {
        "english_stop":     { "type": "stop", "stopwords": "_english_" },
        "english_stemmer":  { "type": "stemmer", "language": "english" }
      }
    }
  },
  "mappings": {
    "properties": { "body": { "type": "text", "analyzer": "english_custom" } }
  }
}

Critical rule: the analyzers used at index time and query time must produce compatible terms, otherwise nothing matches. If a stemmer indexes jumping as jump, a query that bypasses the analyzer and looks up the literal term jumping will never match.
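
The three stages and the matching rule can be mimicked in a few lines of Python. This is a toy under stated assumptions: the stopword list is tiny, and toy_stem is a crude suffix stripper standing in for a real stemmer, so its output will differ from what porter_stem or snowball produce.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "over"}  # tiny illustrative list

def strip_html(text):
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split on runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def toy_stem(token):
    """Token filter: crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Run char filter, tokenizer, then lowercase, stop, and stem filters."""
    tokens = [t.lower() for t in tokenize(strip_html(text))]
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

indexed = analyze("<p>The fox is Jumping over logs</p>")
print(indexed)

# A term-style lookup with the raw word misses; an analyzed lookup hits.
print("jumping" in indexed)
print(analyze("jumping")[0] in indexed)
```

The last two lines are the critical rule in action: the raw word is not among the indexed terms, while the analyzed form is.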

BM25 — The Default Scoring Function

Elasticsearch replaced the classic TF/IDF with Okapi BM25 in version 5.0. BM25 scores the relevance of a document D for a query term q as:

score(D, q) = IDF(q) * (tf(q,D) * (k1 + 1)) / (tf(q,D) + k1 * (1 - b + b * |D| / avgdl))

Where tf(q,D) is the term frequency in the document, |D| is the document length, avgdl is the average document length in the field, and the tuning constants default to k1 = 1.2 and b = 0.75. The key advantage over plain TF/IDF is term-frequency saturation: once a term appears five or ten times, additional occurrences add diminishing returns. BM25 also penalizes long documents, preventing them from outranking short, focused matches simply by having more words.
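
A short Python sketch makes the saturation and length effects visible. The scoring function follows the formula above; the IDF definition is not given in the text, so the form used here, ln(1 + (N - n + 0.5) / (n + 0.5)), is an assumption (it is the variant commonly associated with Lucene), and the corpus numbers are hypothetical.

```python
import math

def idf(doc_count, doc_freq):
    """Assumed IDF form: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf, doc_len, avgdl, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document using the formula above."""
    length_norm = 1 - b + b * doc_len / avgdl
    return idf(doc_count, doc_freq) * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Hypothetical corpus: 1,000 docs, the term occurs in 50 of them, avgdl = 100 tokens.
corpus = dict(avgdl=100, doc_count=1000, doc_freq=50)

# Saturation: each extra occurrence adds less than the one before.
for tf in (1, 2, 5, 10, 50):
    score = bm25_term_score(tf, doc_len=100, **corpus)
    print(f"tf={tf:>2}  score={score:.3f}")

# Length normalization: same tf, longer document, lower score.
short_doc = bm25_term_score(2, doc_len=100, **corpus)
long_doc = bm25_term_score(2, doc_len=400, **corpus)
print(f"short={short_doc:.3f}  long={long_doc:.3f}")
```

However large tf grows, the score stays below IDF * (k1 + 1), which is the ceiling that keeps keyword stuffing from dominating the ranking.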

You can override BM25 per field with the similarity setting, set custom k1/b, or switch to another built-in similarity such as DFR (the classic TF/IDF similarity is no longer available for indices created on 7.0 or later). For most production search workloads the defaults are fine: start by tuning your analyzer chain, not the scoring constants.

Aggregations — Metric, Bucket, Pipeline

Aggregations are the OLAP half of Elasticsearch. They run at query time over matching documents using doc values (a column-oriented on-disk structure built per segment). Three categories:

Metric — sum, avg, min, max, cardinality (HyperLogLog++, approximate, ~3% error), percentiles (t-digest), stats.
Bucket — terms, histogram, date_histogram, range, filters, nested. Each bucket can have sub-aggregations.
Pipeline — Operate on the output of other aggregations. derivative, moving_fn (the replacement for moving_avg, which was removed in 8.0), cumulative_sum, bucket_script.

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": { "field": "created_at", "calendar_interval": "day" },
      "aggs": {
        "revenue":         { "sum": { "field": "total" } },
        "unique_customers":{ "cardinality": { "field": "customer_id" } },
        "p95_basket":      { "percentiles": { "field": "total", "percents": [95] } },
        "rev_growth":      { "derivative": { "buckets_path": "revenue" } }
      }
    }
  }
}

The terms aggregation is the gotcha. Each shard returns its own top buckets and the coordinator merges them, so counts are approximate. With skewed data, a term that sits just outside the cut-off on every shard can be dropped even though it is common globally. The shard_size parameter controls how many buckets each shard returns to the coordinator (default size * 1.5 + 10); raising it improves accuracy at the cost of more work per shard. To enumerate every bucket of a high-cardinality field exactly, a composite aggregation with after-key paging is correct, if slower.
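
The per-shard cut-off problem can be reproduced with a small Python simulation. The merge logic below is a simplification of what the coordinator does, and the three shards are hypothetical data built to trigger the error.

```python
from collections import Counter

def top_terms(shards, size, shard_size):
    """Merge per-shard top-`shard_size` counts and return the global top-`size`."""
    merged = Counter()
    for shard in shards:
        # Each shard only reports its own top buckets to the coordinator.
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)

# "c" is never first on any shard but is the most common term overall (27 docs).
shards = [
    ["a"] * 10 + ["c"] * 9,
    ["b"] * 10 + ["c"] * 9,
    ["d"] * 10 + ["c"] * 9,
]

approx = top_terms(shards, size=1, shard_size=1)  # "c" is cut off on every shard
exact = top_terms(shards, size=1, shard_size=2)   # a larger shard_size recovers it
print(approx, exact)
```

With shard_size=1 the winner is a term with only 10 documents; raising shard_size to 2 is enough here for the true top term to surface.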

Bulk Indexing

Single-document indexing maxes out at a few thousand docs per second per node because of the HTTP/JSON overhead. The _bulk endpoint amortizes that overhead by batching operations:

POST /_bulk
{ "index": { "_index": "logs", "_id": "1" } }
{ "level": "ERROR", "msg": "connection refused", "ts": "2025-05-03T10:00:00Z" }
{ "index": { "_index": "logs", "_id": "2" } }
{ "level": "INFO",  "msg": "started", "ts": "2025-05-03T10:00:01Z" }
{ "delete": { "_index": "logs", "_id": "old-doc" } }

A bulk body is newline-delimited JSON (NDJSON): each action line is followed by a source line for index, create, and update actions, while delete takes no source line. The body must end with a newline. Sweet spot: 5-15 MB per batch, with several concurrent bulk requests per node. Batches that are too large create heap pressure on the coordinator; batches that are too small leave throughput on the table.
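
Size-based batching is straightforward to sketch in Python. The helper below only builds request bodies and does not talk to a cluster; bulk_batches and its parameters are names invented for this example, and the 10 MB default is simply a value inside the range quoted above.

```python
import json

def bulk_batches(index, docs, max_bytes=10 * 1024 * 1024):
    """Yield NDJSON bodies for the _bulk endpoint, each at most max_bytes.

    `docs` is an iterable of (doc_id, source_dict) pairs. A single document
    larger than max_bytes still gets a batch of its own.
    """
    lines, size = [], 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_index": index, "_id": doc_id}})
        body = json.dumps(source)
        # +2 accounts for the newline that follows each of the two lines.
        pair_size = len(action.encode("utf-8")) + len(body.encode("utf-8")) + 2
        if lines and size + pair_size > max_bytes:
            yield "\n".join(lines) + "\n"   # the body must end with a newline
            lines, size = [], 0
        lines += [action, body]
        size += pair_size
    if lines:
        yield "\n".join(lines) + "\n"

# Hypothetical log events, batched with a deliberately small 16 KB limit.
docs = [(str(i), {"level": "INFO", "msg": f"event {i}"}) for i in range(1000)]
batches = list(bulk_batches("logs", docs, max_bytes=16 * 1024))
print(len(batches), "batches; first batch is", len(batches[0].encode("utf-8")), "bytes")
```

Each yielded string can be sent as the body of POST /_bulk with the Content-Type application/x-ndjson.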

For ingestion pipelines (Logstash, Filebeat, Fluent Bit, Vector) the bulk client tunes batch size, concurrency, and back-pressure based on response codes. A 429 Too Many Requests on the bulk queue is the canonical sign that you need more nodes, larger thread pools, or to slow your producer down.
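
A producer-side reaction to 429 can be sketched as exponential backoff with jitter. In this Python example `send` is a hypothetical callable that posts one bulk body and returns the HTTP status code; the retry counts and delays are illustrative defaults, not recommendations. A real client would also need to inspect the per-item statuses inside the bulk response, which this sketch leaves out.

```python
import random
import time

def send_with_backoff(send, body, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call send(body) until it stops returning 429 or retries run out.

    `send` is a hypothetical callable that posts one bulk body and returns
    the HTTP status code; plug in whatever client you use.
    """
    for attempt in range(max_retries + 1):
        status = send(body)
        if status != 429:
            return status
        if attempt < max_retries:
            # Exponential backoff with jitter so producers do not retry in lockstep.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    return 429

# Usage with a fake sender that rejects twice and then accepts.
responses = iter([429, 429, 200])
delays = []
status = send_with_backoff(lambda body: next(responses), "{}\n", sleep=delays.append)
print(status, [round(d, 2) for d in delays])
```

Passing `sleep` as a parameter keeps the function testable without real waiting.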

ILM — Index Lifecycle Management

For time-series data (logs, metrics, traces) you don't want a single ever-growing index. ILM automates rolling indexes through phases:

Hot — Active writes and recent reads. Lives on fast SSD, multiple replicas.
Warm — Read-only, recently aged out. Force-merge to one segment per shard.
Cold — Older data, occasional reads. Lives on cheaper hardware, replicas reduced.
Frozen — Searchable snapshots backed by object storage (S3, GCS). Pay-as-you-search.
Delete — TTL the index entirely.

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" } } },
      "warm":   { "min_age": "7d",  "actions": { "forcemerge": { "max_num_segments": 1 }, "shrink": { "number_of_shards": 1 } } },
      "cold":   { "min_age": "30d", "actions": { "set_priority": { "priority": 50 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

Combined with data streams (a single named write target backed by a sequence of hidden indices, which hides the rollover details), ILM is the standard way to run log clusters at scale. Without it, you end up with a 5 TB index whose oldest segments never get merged because the merge policy is always busy with new writes.
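
As a rough mental model of the policy above, the phase of an index can be read off its age. The Python sketch below hard-codes the min_age values from that policy and assumes age is measured from rollover; it ignores the polling interval and the time the phase actions themselves take, so treat it as an approximation of ILM's behavior rather than a description of its implementation.

```python
from datetime import timedelta

# min_age thresholds copied from the policy above; "hot" has no min_age.
PHASES = [
    ("hot", timedelta(0)),
    ("warm", timedelta(days=7)),
    ("cold", timedelta(days=30)),
    ("delete", timedelta(days=90)),
]

def phase_for(age_since_rollover):
    """Return the last phase whose min_age the index has reached."""
    current = PHASES[0][0]
    for name, min_age in PHASES:
        if age_since_rollover >= min_age:
            current = name
    return current

for days in (1, 7, 45, 90):
    print(days, "days ->", phase_for(timedelta(days=days)))
```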

Tradeoffs and When Not To Use It

Not a system of record — Elasticsearch loses writes on poorly configured translogs, has no foreign keys, no transactions across documents, and its replication is eventually consistent for reads. Treat it as a derived index, not a database.
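Heap-bound — JVM heap should stay below ~31 GB so the JVM keeps using compressed object pointers. Above that threshold pointers double in width, so a slightly larger heap can actually hold fewer objects, and bigger heaps mean longer garbage-collection pauses. A 64 GB box typically runs ES with a 30 GB heap and leaves the remaining ~34 GB for the OS file cache (which Lucene relies on heavily).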
Reindexing is painful — Mapping changes require reindexing into a new index. Aliases hide the swap from clients, but the data movement is not free. Plan your mappings.
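Cardinality explosion — A terms aggregation on a high-cardinality field (user_id, request_id) can exhaust heap on the coordinating node or trip a circuit breaker. Page through the buckets with a composite aggregation instead.
Exact counts at scale — Exact hit counts beyond 10,000 are opt-in via track_total_hits. By default hits.total stops counting at 10,000 and reports the relation gte ("10,000 or more") rather than the exact figure.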

Elasticsearch vs PostgreSQL FTS vs OpenSearch

	Elasticsearch	PostgreSQL FTS	OpenSearch
Storage	Lucene segments	GIN index on tsvector	Lucene segments (forked)
Distribution	Native sharding & replicas	Single node (or Citus)	Native sharding & replicas
Query language	JSON DSL + SQL plugin	SQL (tsquery)	JSON DSL + SQL plugin
Scoring	BM25 (default)	ts_rank / ts_rank_cd (cover density)	BM25 (default)
License	Elastic License v2 / SSPL / AGPLv3 (since 2024)	PostgreSQL License (open)	Apache 2.0
Aggregations	Full (metric / bucket / pipeline)	SQL GROUP BY	Full (metric / bucket / pipeline)
Best for	Logs, search, observability	Embedded FTS within OLTP	Logs, search (open source path)

OpenSearch is AWS's 2021 fork from Elasticsearch 7.10 (the last Apache-licensed version). The core engine, query DSL, and APIs remain compatible; the divergence is in plugins, security, and the new features each project adds. For greenfield projects sensitive to license terms, OpenSearch is the safer pick. For mature investments in the Elastic Stack (Kibana, Beats), the official distribution still leads on velocity.

Frequently Asked Questions

Why is my index "yellow" and not "green"?

Yellow means all primary shards are assigned but at least one replica is unassigned. The most common cause on a single-node dev cluster is that you asked for replicas (default 1) but have only one node, and a replica cannot be placed on the same node as its primary. Set index.number_of_replicas: 0 for dev work. In production, yellow indicates a node is down or the cluster lacks capacity to allocate replicas.

How many shards should an index have?

Aim for shards of 10-50 GB each, and no more than ~20 shards per GB of heap. Too few shards means you cannot parallelize across nodes; too many shards eat heap with metadata, slow cluster state updates, and degrade search latency through per-shard overhead. For an index that will grow to 1 TB across 5 data nodes with 30 GB heap each, 20-40 primary shards (25-50 GB per shard) is a reasonable starting point.
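
The arithmetic behind that recommendation can be written down in a few lines of Python. The two helpers are hypothetical names for this sketch, and the inputs are the rules of thumb quoted above, not hard limits.

```python
import math

def primary_shard_range(total_gb, min_shard_gb=10, max_shard_gb=50):
    """Return (fewest, most) primary shards that keep each shard in the size band."""
    fewest = math.ceil(total_gb / max_shard_gb)
    most = max(fewest, math.floor(total_gb / min_shard_gb))
    return fewest, most

def heap_shard_budget(data_nodes, heap_gb_per_node, shards_per_gb_heap=20):
    """Upper bound on shards for the whole cluster under the shards-per-GB-heap rule."""
    return data_nodes * heap_gb_per_node * shards_per_gb_heap

fewest, most = primary_shard_range(1000)   # roughly 1 TB of primary data
budget = heap_shard_budget(data_nodes=5, heap_gb_per_node=30)
print(f"{fewest}-{most} primaries fit the 10-50 GB band; cluster budget is {budget} shards")
```

Picking 20-40 primaries sits at the large-shard end of the band, which keeps the total shard count (primaries plus replicas, across every index on the cluster) well inside the heap budget.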

What is a "hot" thread and how do I diagnose one?

The GET /_nodes/hot_threads API samples the JVM stacks of every node and returns the threads burning the most CPU. The most common culprits are merges (look for org.apache.lucene.index.IndexWriter), expensive aggregations (look for BucketsAggregator), and regex queries on text fields. Hot threads are the first diagnostic to run when latency spikes and CPU is pegged.

Why do my deletes not free space?

Deleted documents are tombstoned in liveDocs but the underlying postings remain until the segment is merged. With heavy delete workloads you can have segments that are 80% deleted yet still on disk. POST /index/_forcemerge?only_expunge_deletes=true reclaims that space, but it is expensive and should be reserved for indexes that have stopped writing.

How does Elasticsearch compare to a vector database?

Elasticsearch 8.0+ ships native dense vector search (HNSW under the hood) and can serve as a hybrid lexical + vector index. Dedicated vector databases (Pinecone, Weaviate, Milvus, Qdrant) often have better single-purpose tuning, multi-modal models, and richer filtering of vector queries — but Elasticsearch is the right call when you already need BM25 keyword search and want vectors in the same query plan with no extra infrastructure.

What's the deal with the Elasticsearch license change?

In January 2021, Elastic relicensed Elasticsearch and Kibana from Apache 2.0 to a dual SSPL / Elastic License v2. AWS forked the last Apache 2.0 release as OpenSearch. In August 2024 Elastic added AGPL as a third option, restoring an OSI-approved path. The practical impact: OpenSearch and Elasticsearch are functionally similar at the query DSL level but diverge on plugins, snapshot formats over time, and feature pace.

How do I avoid mapping explosion?

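Dynamic mapping creates a field for every new JSON key it sees. Logs with unbounded keys (e.g. per-tenant fields) can grow a mapping to thousands of fields or more once the field limit has been raised, bloating cluster state and heap. Set "dynamic": "strict" on log indexes to reject documents with unmapped fields, or use "dynamic": "false" plus explicit mappings so that unmapped fields stay in _source but are not indexed. Keep index.mapping.total_fields.limit (default 1000) as a guard rather than raising it to silence errors.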