Vector Search

Marmot provides approximate nearest neighbor (ANN) vector search over packed float32 embeddings stored in normal user tables. It is designed for RAG stores where each row contains ordinary metadata plus an embedding.

The public SQL surface is small:

  • CREATE VECTOR INDEX
  • DROP VECTOR INDEX
  • REINDEX VECTOR
  • vec_match(column, query_vector, k) in WHERE
  • vec_distance(column, query_vector) in ORDER BY

The exact embedding remains in the base table. Marmot builds node-local derived state for fast candidate search: centroids, immutable segment generations, row maps, manifests, and an overlay journal. Exact rerank still reads shortlist vectors from the base table before returning final rows.

Enable It

Vector search is controlled by the vector_index config section.

[vector_index]
enabled = true
data_dir = ""

  • enabled turns on vector DDL, query rewriting, and background vector maintenance.
  • data_dir is reserved for a future explicit vector-state root. Current builds colocate .vecseg files next to the SQLite database.

RAG Table Pattern

Use a normal table with an INTEGER PRIMARY KEY, scalar metadata columns, and one BLOB embedding column.

CREATE DATABASE rag;
USE rag;
 
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY,
    tenant_id   INTEGER NOT NULL,
    source_uri  TEXT NOT NULL,
    chunk_no    INTEGER NOT NULL,
    title       TEXT NOT NULL,
    body        TEXT NOT NULL,
    status      TEXT NOT NULL,
    created_at  INTEGER NOT NULL,
    embedding   BLOB NOT NULL
);
 
CREATE INDEX chunks_tenant_status_idx
    ON chunks(tenant_id, status);
 
CREATE INDEX chunks_source_chunk_idx
    ON chunks(source_uri, chunk_no);

The scalar indexes are optional, but they matter when your vector query also filters by tenant, status, source, owner, timestamp, or ACL columns.

Vector Blob Format

The vector column must contain packed little-endian float32 values. The byte length must be exactly DIM * 4.

Prefer driver parameters. Do not construct large SQL hex literals in application code.

import struct
 
def vec_blob(values: list[float]) -> bytes:
    return struct.pack("<" + "f" * len(values), *values)
 
embedding = get_embedding("Marmot stores vectors in SQLite.")
 
# Use your driver's placeholder convention; the important part is binding
# vec_blob(embedding) as bytes for the BLOB column.
cursor.execute(
    """
    INSERT INTO chunks
        (id, tenant_id, source_uri, chunk_no, title, body, status, created_at, embedding)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """,
    (
        1,
        42,
        "s3://kb/marmot.md",
        0,
        "Marmot vector search",
        "Marmot stores exact embeddings in the base table.",
        "published",
        1730000000,
        vec_blob(embedding),
    ),
)

For Go:

import (
	"encoding/binary"
	"math"
)

// Float32Blob packs a []float32 into the little-endian BLOB layout Marmot expects.
func Float32Blob(v []float32) []byte {
	buf := make([]byte, len(v)*4)
	for i, x := range v {
		binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(x))
	}
	return buf
}

Create An Index

Start with auto tuning unless you have benchmark data that says otherwise.

CREATE VECTOR INDEX chunks_embedding_idx
    ON chunks(embedding)
    DIM 1536
    METRIC cosine;

Equivalent explicit form:

CREATE VECTOR INDEX chunks_embedding_idx
    ON chunks(embedding)
    DIM 1536
    METRIC cosine
    WITH (nlist = 512, nprobe = 32);

For maximum-inner-product search:

CREATE VECTOR INDEX chunks_embedding_dot_idx
    ON chunks(embedding)
    DIM 1536
    METRIC dot
    WITH (max_norm = 80.0);

max_norm is required for dot-product workloads that rely on MIPS-to-L2 augmentation. Set it to a fixed upper bound on your vector norms; vectors whose norm exceeds the bound fail materialization and require reindexing with a larger bound.
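
One hedged way to choose the bound is to measure L2 norms over a representative sample of your embeddings and add headroom. The helper below is illustrative, not part of Marmot.

import math

def suggest_max_norm(sample_embeddings: list[list[float]], headroom: float = 1.10) -> float:
    """Suggest a max_norm bound from sampled embedding norms plus safety headroom."""
    largest = max(math.sqrt(sum(x * x for x in vec)) for vec in sample_embeddings)
    return largest * headroom

# If the largest sampled norm is about 72, this suggests roughly 79.2;
# rounding up to 80.0 matches the WITH (max_norm = 80.0) example above.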

Insert, Update, Delete

Vector CRUD uses ordinary SQL. Marmot CDC captures row changes and updates each local vector overlay after commit.

INSERT INTO chunks
    (id, tenant_id, source_uri, chunk_no, title, body, status, created_at, embedding)
VALUES
    (?, ?, ?, ?, ?, ?, ?, ?, ?);
 
UPDATE chunks
   SET body = ?,
       embedding = ?
 WHERE id = ?;
 
DELETE FROM chunks
 WHERE id = ?;

Committed changes are query-visible immediately through the local overlay. They do not need to wait for a full rebuild.

Search

Use vec_match as a vector predicate and order by vec_distance on the same column.

SELECT id, source_uri, chunk_no, title, body
  FROM chunks
 WHERE vec_match(embedding, ?, 10)
   AND tenant_id = 42
   AND status = 'published'
 ORDER BY vec_distance(embedding, ?)
 LIMIT 10;

Pass the same query-vector blob to both placeholders.
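
A minimal client-side sketch, reusing the illustrative vec_blob and get_embedding helpers from the insert example and binding the same blob to both placeholders:

query_blob = vec_blob(get_embedding("how does marmot store embeddings?"))

cursor.execute(
    """
    SELECT id, source_uri, chunk_no, title, body
      FROM chunks
     WHERE vec_match(embedding, ?, 10)
       AND tenant_id = 42
       AND status = 'published'
     ORDER BY vec_distance(embedding, ?)
     LIMIT 10
    """,
    (query_blob, query_blob),  # same query-vector blob for vec_match and vec_distance
)
rows = cursor.fetchall()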

LIMIT is the final output count. The third argument to vec_match is the candidate and exact-rerank budget. For ordinary top-10 search, keep them equal. For extra recall headroom or selective filters, use a larger vec_match budget while keeping the final LIMIT small:

SELECT id, source_uri, chunk_no, title
  FROM chunks
 WHERE vec_match(embedding, ?, 100)
   AND tenant_id = 42
   AND status = 'published'
 ORDER BY vec_distance(embedding, ?)
 LIMIT 10;

Requirements:

  • Use exactly one vec_match per SELECT.
  • Include ORDER BY vec_distance(...) on the same vector column.
  • Include a positive literal LIMIT.
  • Keep vec_match K and LIMIT aligned for normal top-K search, or make vec_match K larger when you want a deeper candidate/rerank budget.
  • vec_match K must be greater than or equal to LIMIT.
  • Use the same embedding dimensionality that was declared in DIM.

DDL Reference

CREATE VECTOR INDEX <index_name>
    ON <table>(<blob_column>)
    DIM <n>
    METRIC <l2 | cosine | dot>
    [WITH (nlist = <n>, nprobe = <n>, max_norm = <f>)];
 
REINDEX VECTOR <index_name>;
 
DROP VECTOR INDEX <index_name>;

Clause               | Required | Meaning
DIM n                | yes      | External vector dimensionality. A 1536-dimensional embedding must be a 6144-byte blob.
METRIC l2            | yes      | Squared Euclidean distance. Lower is better.
METRIC cosine        | yes      | Cosine distance, 1 - cosine_similarity. Lower is better.
METRIC dot           | yes      | Negative inner product. Higher dot product becomes lower distance.
WITH (nlist = n)     | no       | IVF cluster count. More clusters usually reduce rows scanned per cluster but increase build and maintenance cost.
WITH (nprobe = n)    | no       | Default number of clusters probed per query. Higher values usually improve recall and increase read cost.
WITH (max_norm = f)  | for dot  | Norm cap used by dot-product augmentation.

Current SQL DDL does not expose target_partition_size. Auto tuning uses an internal default target of 512 vectors per partition.

Auto Tuning

When nlist is omitted, Marmot chooses the IVF cluster count from the current indexable row count and the internal target partition size.

Current defaults:

Policy                     | Current behavior
Target partition size      | 512 vectors per IVF partition
Non-empty create nlist     | roughly ceil(rows / 512), clamped by the supported auto range
Empty-table first publish  | starts from a bounded overlay snapshot once enough overlay rows exist
Later growth               | background maintenance can promote to a larger cluster count as rows grow
Auto nprobe                | derived from the target partition size; default behavior probes about 16 target partitions
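
As a rough sanity check of the non-empty create policy above (ignoring the auto-range clamp), the chosen cluster count tracks ceil(rows / 512). A minimal sketch:

import math

TARGET_PARTITION_SIZE = 512  # internal default target partition size

def approx_auto_nlist(rows: int) -> int:
    """Approximate the auto-chosen IVF cluster count, ignoring the clamp."""
    return math.ceil(rows / TARGET_PARTITION_SIZE)

print(approx_auto_nlist(100_000))  # 196, matching nlist in the 100K validation below
print(approx_auto_nlist(500_000))  # 977, matching nlist in the 500K validation below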

Manual override guidance:

Goal                          | Try
Higher recall                 | Increase nprobe first.
Lower query disk reads        | Lower nprobe, or use more clusters with enough training data.
Better large-corpus structure | Let auto nlist grow, or use a larger explicit nlist after benchmarking.
Faster build                  | Use fewer clusters.
Tenant-heavy filtering        | Keep scalar tenant/status indexes and let the planner choose pre-filter when cheaper.

Do not tune from insert throughput alone. A vector workload has separate write, first-publish, settled, and read-QPS milestones.

Session Tuning

Session variables are per connection.

SET @@marmot_vec_nprobe = 64;
SET @@marmot_vec_nprobe = 0;
 
SET @@marmot_vec_force_plan = 'pre';
SET @@marmot_vec_force_plan = 'post';
SET @@marmot_vec_force_plan = 'auto';
 
SET @@marmot_vec_prefilter_cap = 5000;
SET @@marmot_vec_fallback = 'on';
SET @@marmot_vec_use_go_rank = 'on';

Variable                   | Default | Meaning
@@marmot_vec_nprobe        | 0       | 0 means use the index default and budget probing when eligible. A positive value forces a fixed probe count for this connection.
@@marmot_vec_force_plan    | auto    | auto, pre, or post. Use this to compare exact pre-filter vs IVF post-filter.
@@marmot_vec_prefilter_cap | 5000    | Maximum estimated scalar-filter row count before the planner prefers IVF post-filter.
@@marmot_vec_fallback      | on      | Allows exact fallback if a post-filter path returns too few rows.
@@marmot_vec_use_go_rank   | on      | Uses the Go-side segment scan, exact-vector fetch, exact rerank, and final projection path.

Operational variables also exist for retrain checks and chunk sizing, but the default path is the recommended starting point.
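
Because these variables are scoped to a connection, issue the SET statements on the same connection that runs the query. A hedged sketch using the illustrative cursor and query_blob from earlier:

# Compare the exact pre-filter and IVF post-filter plans, then return to auto.
for plan in ("pre", "post", "auto"):
    cursor.execute(f"SET @@marmot_vec_force_plan = '{plan}'")
    cursor.execute(
        """
        SELECT id, title
          FROM chunks
         WHERE vec_match(embedding, ?, 10)
           AND tenant_id = 42
         ORDER BY vec_distance(embedding, ?)
         LIMIT 10
        """,
        (query_blob, query_blob),
    )
    print(plan, len(cursor.fetchall()))

# Raise the probe budget for this connection only; setting 0 restores the index default.
cursor.execute("SET @@marmot_vec_nprobe = 64")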

Lifecycle And Settling

Vector indexes have four practical milestones.

Milestone               | What it means
DDL committed           | The vector index metadata exists. If the table was empty, the index can accept overlay writes immediately.
Overlay visibility      | Inserts, updates, and deletes are visible to vector queries through the local overlay after commit.
First clustered publish | Marmot has published the first immutable .vecseg generation and can scan stable clusters.
Settled                 | Overlay backlog is below merge thresholds, auto cluster count no longer needs promotion, and cluster skew is within the current bounds.

For empty-table create, Marmot does not wait for the whole corpus before the first clustered state. The current auto-tuned bootstrap path waits until enough overlay rows exist for a useful initial structure. With the default target partition size, that floor is 64 target partitions, or 32,768 rows. The first publish is bounded by the bootstrap target and capped at 65,536 rows. If the overlay is below the publish target, Marmot waits briefly for writes to quiesce before publishing.

For non-empty create, CREATE VECTOR INDEX builds the first local generation from existing table rows before the index is ready.

Settling is not a correctness boundary:

  • Queries before settle include stable rows plus overlay rows.
  • Updates and deletes mask old stable rows through overlay tombstones before merge.
  • Background maintenance folds overlay rows into new immutable generations.
  • Promotions grow nlist when auto tuning decides the corpus has outgrown the current cluster count.

Settling affects read cost. Large overlays are correct but slower because queries must merge recent mutations in addition to scanning stable clusters.

Observability

The local vector catalog is stored in __marmot_vector_indexes. Treat it as read-only.

SELECT index_name,
       table_name,
       column_name,
       metric,
       dim,
       nlist,
       nprobe,
       auto_nlist,
       auto_nprobe,
       target_partition_size,
       status
  FROM __marmot_vector_indexes;

Useful log messages include:

  • engine hook bootstrap: bootstrap threshold reached
  • engine hook bootstrap: automatic bootstrap complete
  • maintenance: incremental merge failed
  • maintenance: catch-up rebuild failed
  • VectorIndexManager: local vector index marked dirty

vec-bench reports the measurements that matter for tuning:

  • DDL create time
  • first clustered publish time
  • final settled time
  • insert throughput
  • read QPS
  • recall
  • p50, p95, and p99 latency
  • RSS snapshots
  • stable encoding
  • overlay encoding mix and overlay journal size
  • block metadata rows and file size
  • segment logical bytes/query
  • segment actual read bytes/query
  • segment overread
  • exact rerank rows/query

Runtime Model

Local Files

For chunks_embedding_idx, Marmot stores derived state beside the database:

<db>.chunks_embedding_idx.vecseg/
  manifest/
    current
    gen-00000000000000000001.mf
  segments/
    gen-00000000000000000001.dat
  rowmap/
    gen-00000000000000000001.rmap
  blocks/
    gen-00000000000000000001.blk
  overlay/
    current.log

File                | Purpose
manifest/current    | Atomic pointer to the active generation.
manifest/gen-*.mf   | Manifest with index identity, centroid epochs, encoding metadata, checksums, and file names.
segments/gen-*.dat  | Stable vector payload laid out by cluster.
rowmap/gen-*.rmap   | Rowid to stable-location map.
blocks/gen-*.blk    | Validated per-generation block metadata sidecar for disk-first pruning and scan instrumentation.
overlay/current.log | Append-only local mutation log for rows newer than the stable generation.

Read Path

In post-filter IVF mode:

  1. Marmot materializes the query vector into the internal metric representation.
  2. It selects probe clusters from the active probe centroids.
  3. It scans the selected stable cluster spans with pread.
  4. It merges overlay rows and tombstones newer than the generation cutoff.
  5. It fetches shortlisted exact vectors from the base table.
  6. It reranks exact distances in Go.
  7. It issues a final projection query for only the final top-K rowids.

The approximate scan is used for candidate generation only. Final distance ordering uses exact vectors from the base table for every non-empty ANN candidate set. For safe filtered post-ranking, Marmot can widen the candidate budget and refill until enough rows survive the SQL predicate, the search space is exhausted, or the internal safety cap is reached. Unsupported predicate shapes fall back to the pre-filter path instead of serving stale approximate results.

New generations also write a .blk sidecar. The sidecar is validated against the .dat file on open, is deleted with retired generations, and is available for internal block-pruning experiments and scan metrics. Current default query execution keeps block pruning off unless explicitly enabled internally, because the latest 100K run showed the full selected-cluster PQ scan was faster at this scale.

Write Path

After a successful commit:

  1. Marmot CDC captures the row-level change.
  2. The vector hook materializes new vector values into internal form.
  3. The local overlay journal records inserts, updates, or tombstones. Before the first clustered publish, overlay vectors are temporary prepared float32 bytes; after probe centroids exist, overlay vectors are compact residual int8 bytes.
  4. Queries see those changes immediately from the overlay.
  5. Background maintenance publishes new immutable generations and compacts applied overlay rows.

Overlay snapshots are offset-backed and use a bounded vector cache; large online builds stream bootstrap and catch-up rows through temporary spools instead of retaining every prepared vector in heap. Exact rerank and maintenance always use the base table as the truth source. Lossy stable or overlay encodings are not used to update centroid sums or exact distances.

Stable Encoding

Stable segments use compact encodings:

Encoding      | Used when
Residual PQ8  | Internal dimension is at least 512.
Residual int8 | Internal dimension is less than 512.

Raw float32 stable segment payloads are not written. Residual PQ8 uses generation-local codebooks. Incremental maintenance reuses the generation codec for touched clusters so untouched cluster spans remain byte-copyable. Full rebuild or manual REINDEX VECTOR is the codec refresh path.

PQ8 codebooks are trained deterministically from a bounded sample with multi-start subspace k-means. This spends more build CPU to improve compressed-score quality while keeping read memory low.

Replication

Marmot does not replicate vector segment artifacts.

Replicated:

  • row-level DML CDC for the base table
  • vector-control metadata for CREATE VECTOR INDEX
  • vector-control metadata for DROP VECTOR INDEX
  • vector-control metadata for REINDEX VECTOR

Node-local derived state:

  • .vecseg segment files
  • rowmaps
  • block metadata sidecars
  • manifest files
  • overlay journals
  • centroids
  • PQ codebooks

Each node builds and maintains its own local vector index from the same replicated table rows and vector-control metadata. A replica may publish its local generation at a different wall-clock time, but queries use only locally ready vector indexes. If local vector CDC fails after the SQLite commit, the index is marked dirty so stale ANN results are not served until repaired or rebuilt.

Benchmarking

Use cmd/vec-bench for repeatable vector benchmarking.

./vec-bench \
  --data-dir=/tmp/marmot/benchdata/dbpedia-openai-1536 \
  --db-dir=/tmp/marmot/bench-active \
  --db-name=dbpedia \
  --table=docs \
  --column=embed \
  --index=embed_idx \
  --force-build \
  --insert-n=100000 \
  --warmup=200 \
  --n-queries=500 \
  --query-concurrency=8 \
  --nprobe=24 \
  --settle-timeout=10m \
  --min-recall=0.95 \
  --min-qps=1000 \
  --max-overread=1.05 \
  --profile-dir=/tmp/marmot/bench-active/prof

Always report:

  • dataset and embedding dimension
  • row count
  • nlist and nprobe
  • stable encoding
  • create time
  • first clustered publish time
  • final settled time
  • insert QPS
  • read QPS
  • recall@K
  • latency percentiles
  • RSS after build and after settle
  • segment read bytes/query and overread

Do not report raw insert throughput as query-ready build throughput. Inserts are live immediately, but the clustered generation and settled read structure are separate milestones.

Latest 100K Validation

Latest local validation used a random 100K-row subset of the DBpedia OpenAI 1536d dataset from the 990K-row source corpus with subset seed 42. The query set contained 10,000 queries, K=10, cosine distance, Go-rank, id projection, and read concurrency 8.

Metric                   | Measured value
Rows                     | 100,000
Source rows              | 990,000
nlist                    | 196
nprobe                   | 24
Stable encoding          | Residual PQ8
Payload bytes/vector     | 132
Entry bytes/vector       | 140
Stable .dat size         | 14,003,208 bytes
Block metadata           | 877 blocks, 3,651,568 bytes
Segment read bytes/query | 1,739,024
Segment overread         | 1.00x
Recall@10                | 0.9628
Recall@10-in-100         | 1.0000
Aggregate read QPS       | 1,752
Latency p50 / p95 / p99  | 4.31ms / 5.82ms / 6.48ms
RSS after measurement    | 340 MB

The matching 100K force-build lifecycle run measured DDL create at 203ms, 100,000 inserts in 3.00s, first clustered publish at 15.886s, and final settled state at 1m08.286s.

Latest 500K Validation

Latest larger local validation used a random 500K-row subset of the same DBpedia OpenAI 1536d source corpus with subset seed 42. The read-only measurement used the full 10,000-query set, K=10, cosine distance, Go-rank, id projection, read concurrency 8, and explicit nprobe=48.

Metric                   | Measured value
Rows                     | 500,000
Source rows              | 990,000
nlist                    | 977
nprobe                   | 48
Stable encoding          | Residual PQ8
Payload bytes/vector     | 132
Entry bytes/vector       | 140
Stable .dat size         | 70,015,704 bytes
Block metadata           | 4,401 blocks
Segment read bytes/query | 3,456,555
Segment overread         | 1.00x
Recall@10                | 0.9575
Recall@10-in-100         | 1.0000
Aggregate read QPS       | 1,053
Latency p50 / p95 / p99  | 7.04ms / 12.13ms / 15.28ms
RSS after reopen         | 185 MB
RSS after measurement    | 459 MB

The matching 500K force-build lifecycle run measured 500,000 inserts in 36.118s, first clustered publish at 1m21.011s, final settled state at 5m31.928s, and same-process RSS after settled cleanup at 917 MB. Reopened steady read memory is lower because transient insert, bootstrap, and catch-up buffers are gone.

Crash Safety

Stable generation publish is append-and-swap:

  1. write temp segment and rowmap files
  2. fsync temp files
  3. write temp manifest
  4. fsync manifest
  5. rename manifest into place
  6. swap manifest/current
  7. fsync the manifest directory

The old generation remains valid until manifest/current points to the new generation. The overlay journal remains the source of truth for mutations after the applied cutoff.
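
The same ordering, reduced to a generic runnable sketch (this is not Marmot's implementation; the path and names are illustrative):

import os

def publish_file_atomically(dir_path: str, name: str, payload: bytes) -> None:
    """Write-temp, fsync, rename into place, then fsync the directory."""
    tmp_path = os.path.join(dir_path, name + ".tmp")
    final_path = os.path.join(dir_path, name)

    with open(tmp_path, "wb") as f:      # write and fsync the temp file
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    os.replace(tmp_path, final_path)      # atomic rename into place

    dir_fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                  # make the rename durable
    finally:
        os.close(dir_fd)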

Limitations

  • Exactly one vec_match is supported per SELECT.
  • LIMIT is required and must be a positive literal integer.
  • vec_match K is the candidate/rerank budget and must be greater than or equal to LIMIT.
  • vec_match queries must order by vec_distance on the same vector column.
  • The base table must have an INTEGER PRIMARY KEY.
  • The embedding must be a packed little-endian float32 BLOB.
  • Vector index DDL uses simple table/index/column identifiers; run USE database first.
  • target_partition_size is currently internal and not a SQL DDL option.
  • Initial bootstrap, promotion, and REINDEX VECTOR are CPU-heavy because centroid training happens in-process.
  • Exact rerank reads shortlisted vectors from the base table, so very large vec_match candidate budgets can increase SQLite read cost.
  • Segment files, rowmaps, centroids, PQ codebooks, and overlay journals are local derived state and are not replicated.