SQLite Virtual Tables

A virtual table is SQLite's extension mechanism for tables backed by user code rather than the built-in b-tree. To SQL it looks identical: SELECT, INSERT, UPDATE, joins all work. Underneath, a module implements the callbacks that produce rows, evaluate constraints, and report cost estimates to the query planner. SQLite ships with several powerful built-in modules — FTS5 for full-text search, R*Tree for spatial indexes, JSON1 for JSON values, dbstat for introspecting the database file, and csv for reading CSV files. This page walks the module API, the planner integration via xBestIndex, and what each shipped module does.

Virtual Table Architecture

Key Numbers

Module callbacks

~25

Bundled modules

FTS5, R*Tree, JSON1+

FTS5 token types

5 built-in

R*Tree dimensions

1-5

Shadow tables

user-invisible

Eponymous vtabs

no DDL needed

JSON1 funcs

~25

Why a Virtual Table API

Specialized indexes

B-trees aren't right for every query type. Full-text search needs inverted indexes; spatial queries need R-trees; vector similarity needs HNSW or IVF. The vtab API lets these live alongside b-tree tables in the same database.

External data sources

CSV files, JSON arrays, REST APIs, /proc filesystems — anything that can be enumerated as rows can be a virtual table. SQL queries against external data with JOINs to regular tables become trivial.

Introspection

dbstat exposes per-page database internals; stmt exposes prepared statement state; the schema is itself a btree readable as sqlite_master. The vtab API turns introspection into queries.

Defining a Virtual Table

CREATE VIRTUAL TABLE invokes the module's xCreate.

-- Create an FTS5 full-text index
CREATE VIRTUAL TABLE docs USING fts5(title, body, tokenize='porter');

-- Insert and query
INSERT INTO docs VALUES('Intro to SQL', 'SELECT FROM WHERE GROUP BY...');
SELECT * FROM docs WHERE docs MATCH 'sql AND select';
                                    -- ^ MATCH is the FTS5 query op

-- Create an R*Tree on bounding boxes
CREATE VIRTUAL TABLE bounds USING rtree(id, minX, maxX, minY, maxY);
INSERT INTO bounds VALUES(1, 0.0, 5.0, 0.0, 5.0);
SELECT id FROM bounds WHERE minX <= 3 AND maxX >= 1 AND minY <= 3 AND maxY >= 1;

Behind the scenes, the FTS5 module creates several "shadow tables" — real b-tree-backed tables holding the inverted index data. They have names like docs_data, docs_idx, docs_content, docs_docsize, docs_config. Direct queries on shadow tables are allowed but typically meaningless to applications; FTS5 itself manages them.

xBestIndex: Cost Negotiation with the Planner

The most important callback for performance.

When SQL hits a virtual table in a SELECT, the planner gathers the predicates and ORDER BY and calls xBestIndex with them as input. The module returns: an estimated cost (number, lower is better), an estimated number of rows, an idxNum (an integer the module assigns meaning to), an idxStr (similar but a string), and a flag indicating whether the results will already be sorted.

int xBestIndex(sqlite3_vtab *vt, sqlite3_index_info *info) {
  // info->aConstraint: list of WHERE constraints
  // info->aOrderBy: requested ORDER BY columns
  // info->estimatedCost: output, lower wins
  // info->estimatedRows: output, used for join planning
  // info->idxNum, info->idxStr: output, opaque to SQLite, passed to xFilter
  // info->orderByConsumed: output, 1 if this access method already sorts

  // FTS5 example: see if there's a MATCH constraint on a fts column
  for (int i = 0; i < info->nConstraint; i++) {
    if (info->aConstraint[i].usable && matches_fts_match(...)) {
      info->estimatedCost = 100.0;     // cheap with index
      info->idxNum = 1;
      info->aConstraintUsage[i].argvIndex = 1;  // tell SQLite to pass this arg to xFilter
      return SQLITE_OK;
    }
  }
  info->estimatedCost = 1000000.0;   // expensive without
  info->idxNum = 0;
  return SQLITE_OK;
}

When SQLite decides to use this access method, it calls xFilter(idxNum, idxStr, argv) on the cursor; the module uses idxNum to dispatch (1=index scan, 0=full scan, etc.) and uses argv for the constraint values.

FTS5: Full-Text Search

Inverted index over text columns with relevance ranking.

CREATE VIRTUAL TABLE docs USING fts5(
  title,
  body,
  tokenize = 'unicode61 remove_diacritics 1',
  content = ''      -- contentless: don't store original text
);

INSERT INTO docs(rowid, title, body) VALUES(1, 'Hello', 'World of SQL');

-- Phrase, prefix, NEAR, AND/OR/NOT
SELECT rowid FROM docs WHERE docs MATCH 'sql NOT mongodb';
SELECT rowid FROM docs WHERE docs MATCH 'NEAR(world sql, 5)';
SELECT rowid, bm25(docs) FROM docs WHERE docs MATCH 'sql' ORDER BY bm25(docs);

Tokenizers split text into tokens (words). Built-in: simple, porter (English stemming), unicode61 (Unicode-aware), ascii, trigram (n-gram, for substring queries). Custom tokenizers can be registered via the FTS5 API.

The on-disk structure is an inverted index: for each unique token, a list of (rowid, column, position) entries. Queries combine these lists with AND/OR/NEAR. BM25 ranking gives relevance scores. Snippets and highlights are accessible via the snippet() and highlight() auxiliary functions.

R*Tree: Spatial Indexing

Bounding-box queries in 1-5 dimensions.

CREATE VIRTUAL TABLE places USING rtree(
  id,
  minLat, maxLat,    -- 1st dimension (latitude)
  minLng, maxLng     -- 2nd dimension (longitude)
);

INSERT INTO places VALUES(1, 37.7, 37.8, -122.5, -122.4);  -- SF area
INSERT INTO places VALUES(2, 40.7, 40.8, -74.1, -74.0);    -- NYC area

-- Find all places overlapping a query box
SELECT id FROM places
WHERE maxLat >= 37.0 AND minLat <= 38.0
  AND maxLng >= -123.0 AND minLng <= -122.0;

The R*-tree is a balanced tree of bounding boxes; each interior node contains the bounding box of its children. A query box descends only into branches whose box overlaps the query — pruning aggressively for sparse data. For 1M random points in 2D, R-tree queries return candidates in O(log N) time, typically 5-10x faster than a btree index on lat or lng alone.

Limitations: bounding boxes can be approximate (a polygon's box overlaps the query but the polygon doesn't). After R-tree returns candidates, application code typically does exact geometry tests. R*Tree handles 1-5 dimensions; for higher-D vector search, use a different module or external library.

JSON1: Querying JSON Values

Built-in functions for SELECT WHERE on JSON columns.

SELECT json_extract(data, '$.user.name') FROM events;
SELECT * FROM events WHERE json_extract(data, '$.user.id') = 42;

-- json_each is a vtab: expand JSON arrays to rows
SELECT value FROM json_each('[1,2,3,4]');
-- 1, 2, 3, 4

SELECT * FROM events, json_each(events.tags) WHERE json_each.value = 'admin';

JSON1 isn't a single virtual table — it's a collection of scalar functions (json_extract, json_set, json_remove, json_array, json_object, etc.) plus two table-valued functions (json_each, json_tree) that act as virtual tables. json_each(arr) expands an array into rows; json_tree(doc) walks a nested JSON structure recursively.

Other Bundled Modules

Module	Use case	Form
dbstat	per-page database introspection	SELECT FROM dbstat
stmt	list of currently prepared statements	SELECT FROM stmt
generate_series	numeric sequence generator	SELECT FROM generate_series(1,100)
geopoly	polygon-based spatial	CREATE VIRTUAL TABLE … USING geopoly
csv	read CSV files as tables	CREATE VIRTUAL TABLE … USING csv
fts5vocab	list FTS5 tokens and stats	CREATE VIRTUAL TABLE … USING fts5vocab
completion	SQL keyword completion	SELECT FROM completion

FAQ

How do virtual tables differ from regular tables?

A virtual table looks identical to SQL — you SELECT, INSERT, UPDATE, DELETE the same way — but the storage is provided by user code (a 'module') rather than the b-tree backend. The module implements callbacks for xCreate, xConnect, xBestIndex, xOpen, xFilter, xNext, xColumn, xRowid, etc. SQLite's query planner consults xBestIndex during planning to ask 'what's the cost of using these constraints?' and routes execution through the module. This is how FTS5 (full-text search), R*Tree (spatial indexes), JSON1 (JSON queries), and CSV (read CSV files as tables) all integrate.

What is xBestIndex doing?

It's the cost estimator for the virtual table. SQLite's planner shows the module a list of constraints (e.g. 'WHERE name = ?' and 'ORDER BY date') and asks: 'if you used these, what would the cost be? what would the row order be?' The module returns a cost estimate plus an idxNum and idxStr that describe its chosen access method to itself. The planner uses this to compare virtual-table plans against regular-table plans during query planning. A poorly-implemented xBestIndex makes the planner pick bad plans.

What is FTS5 actually?

Full-Text Search version 5 — a virtual table module shipped with SQLite that maintains an inverted index over text columns and supports MATCH queries (with phrase, prefix, NEAR operators, BM25 ranking, custom tokenizers). The on-disk storage is several shadow tables (you can see them with .tables in CLI: x_data, x_idx, x_content, x_docsize, x_config) that hold the inverted index, doclist data, and metadata. Queries that use MATCH on the FTS5 vtab use the inverted index; non-MATCH queries fall back to scanning x_content.

What about R*Tree?

R*Tree is a virtual table module for spatial indexing — minimum bounding rectangles in 1D, 2D, 3D, 4D, or 5D. CREATE VIRTUAL TABLE bounds USING rtree(id, minX, maxX, minY, maxY) creates an R-tree index on min/max coordinates. SELECT id FROM bounds WHERE minX <= 5 AND maxX >= 3 AND minY <= 4 AND maxY >= 2 uses the R-tree to return only candidate rows whose rectangles intersect the query rectangle. Used heavily in geospatial systems built on SQLite.

Can I write my own virtual table?

Yes. Implement the sqlite3_module struct with the required callbacks, register it via sqlite3_create_module(), and now CREATE VIRTUAL TABLE foo USING mymod(...) creates instances. Common uses: expose CSV files as tables, expose REST APIs as tables, expose another database engine, expose process /proc filesystems. The csv vtab in SQLite's source is a good template — about 800 lines for a complete CSV reader.

What's dbstat?

A built-in virtual table that exposes the on-disk page-by-page layout of the database itself. SELECT * FROM dbstat returns one row per page: name (which table/index it belongs to), pageno, pagetype (leaf/interior, table/index), ncell, payload bytes, unused bytes, etc. Useful for understanding storage layout, finding pages with lots of fragmentation, and debugging space usage. Compile with -DSQLITE_ENABLE_DBSTAT_VTAB to enable; it's on by default in many distributions.