Vector Search

Marmot provides approximate nearest neighbor (ANN) vector search over packed float32 embeddings stored in normal user tables. It is designed for RAG stores where each row contains ordinary metadata plus an embedding.

The public SQL surface is small:

  • CREATE VECTOR INDEX
  • DROP VECTOR INDEX
  • REINDEX VECTOR
  • vec_match(column, query_vector, k) in WHERE
  • vec_distance(column, query_vector) in ORDER BY

The exact embedding remains in the base table. Marmot builds node-local derived state for fast candidate search: centroids, immutable segment generations, row maps, manifests, and an overlay journal. Exact rerank still reads shortlist vectors from the base table before returning final rows.

Enable It

Vector search is controlled by the vector_index config section.

[vector_index]
enabled = true
data_dir = ""

  • enabled turns on vector DDL, query rewriting, and background vector maintenance.
  • data_dir is reserved for a future explicit vector-state root. Current builds colocate .vecseg files next to the SQLite database.

RAG Table Pattern

Use a normal table with an INTEGER PRIMARY KEY, scalar metadata columns, and one BLOB embedding column.

CREATE DATABASE rag;
USE rag;
 
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY,
    tenant_id   INTEGER NOT NULL,
    source_uri  TEXT NOT NULL,
    chunk_no    INTEGER NOT NULL,
    title       TEXT NOT NULL,
    body        TEXT NOT NULL,
    status      TEXT NOT NULL,
    created_at  INTEGER NOT NULL,
    embedding   BLOB NOT NULL
);
 
CREATE INDEX chunks_tenant_status_idx
    ON chunks(tenant_id, status);
 
CREATE INDEX chunks_source_chunk_idx
    ON chunks(source_uri, chunk_no);

The scalar indexes are optional, but they matter when your vector query also filters by tenant, status, source, owner, timestamp, or ACL columns.

Vector Blob Format

The vector column must contain packed little-endian float32 values. The byte length must be exactly DIM * 4.

Prefer driver parameters. Do not construct large SQL hex literals in application code.

import struct
 
def vec_blob(values: list[float]) -> bytes:
    return struct.pack("<" + "f" * len(values), *values)
 
embedding = get_embedding("Marmot stores vectors in SQLite.")
 
# Use your driver's placeholder convention; the important part is binding
# vec_blob(embedding) as bytes for the BLOB column.
cursor.execute(
    """
    INSERT INTO chunks
        (id, tenant_id, source_uri, chunk_no, title, body, status, created_at, embedding)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """,
    (
        1,
        42,
        "s3://kb/marmot.md",
        0,
        "Marmot vector search",
        "Marmot stores exact embeddings in the base table.",
        "published",
        1730000000,
        vec_blob(embedding),
    ),
)

For Go:

import (
	"encoding/binary"
	"math"
)

// Float32Blob packs a []float32 into the little-endian BLOB layout Marmot expects.
func Float32Blob(v []float32) []byte {
	buf := make([]byte, len(v)*4)
	for i, x := range v {
		binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(x))
	}
	return buf
}

Create An Index

Start with auto tuning unless you have benchmark data that says otherwise.

CREATE VECTOR INDEX chunks_embedding_idx
    ON chunks(embedding)
    DIM 1536
    METRIC cosine;

Equivalent explicit form:

CREATE VECTOR INDEX chunks_embedding_idx
    ON chunks(embedding)
    DIM 1536
    METRIC cosine
    WITH (nlist = 512, nprobe = 32);

For maximum-inner-product search:

CREATE VECTOR INDEX chunks_embedding_dot_idx
    ON chunks(embedding)
    DIM 1536
    METRIC dot
    WITH (max_norm = 80.0);

max_norm is required for dot-product workloads that rely on MIPS-to-L2 augmentation. Set it to a fixed upper bound on your vector norms; vectors whose norm exceeds the bound fail materialization and require reindexing with a larger bound.
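
One hedged way to choose the bound is to measure L2 norms over a representative sample of your embeddings and add headroom. The helper below is illustrative, not part of Marmot.

import math

def suggest_max_norm(sample_embeddings: list[list[float]], headroom: float = 1.10) -> float:
    """Suggest a max_norm bound from sampled embedding norms plus safety headroom."""
    largest = max(math.sqrt(sum(x * x for x in vec)) for vec in sample_embeddings)
    return largest * headroom

# If the largest sampled norm is about 72, this suggests roughly 79.2;
# rounding up to 80.0 matches the WITH (max_norm = 80.0) example above.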

Insert, Update, Delete

Vector CRUD uses ordinary SQL. Marmot CDC captures row changes and updates each local vector overlay after commit.

INSERT INTO chunks
    (id, tenant_id, source_uri, chunk_no, title, body, status, created_at, embedding)
VALUES
    (?, ?, ?, ?, ?, ?, ?, ?, ?);
 
UPDATE chunks
   SET body = ?,
       embedding = ?
 WHERE id = ?;
 
DELETE FROM chunks
 WHERE id = ?;

Committed changes are query-visible immediately through the local overlay. They do not need to wait for a full rebuild.

Search

Use vec_match as a vector predicate and order by vec_distance on the same column.

SELECT id, source_uri, chunk_no, title, body
  FROM chunks
 WHERE vec_match(embedding, ?, 10)
   AND tenant_id = 42
   AND status = 'published'
 ORDER BY vec_distance(embedding, ?)
 LIMIT 10;

Pass the same query-vector blob to both placeholders.
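
A minimal client-side sketch, reusing the illustrative vec_blob and get_embedding helpers from the insert example and binding the same blob to both placeholders:

query_blob = vec_blob(get_embedding("how does marmot store embeddings?"))

cursor.execute(
    """
    SELECT id, source_uri, chunk_no, title, body
      FROM chunks
     WHERE vec_match(embedding, ?, 10)
       AND tenant_id = 42
       AND status = 'published'
     ORDER BY vec_distance(embedding, ?)
     LIMIT 10
    """,
    (query_blob, query_blob),  # same query-vector blob for vec_match and vec_distance
)
rows = cursor.fetchall()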

LIMIT is the final output count. The third argument to vec_match is the candidate and exact-rerank budget. For ordinary top-10 search, keep them equal. For extra recall headroom or selective filters, use a larger vec_match budget while keeping the final LIMIT small:

SELECT id, source_uri, chunk_no, title
  FROM chunks
 WHERE vec_match(embedding, ?, 100)
   AND tenant_id = 42
   AND status = 'published'
 ORDER BY vec_distance(embedding, ?)
 LIMIT 10;

Requirements:

  • Use exactly one vec_match per SELECT.
  • Include ORDER BY vec_distance(...) on the same vector column.
  • Include a positive literal LIMIT.
  • Keep vec_match K and LIMIT aligned for normal top-K search, or make vec_match K larger when you want a deeper candidate/rerank budget.
  • vec_match K must be greater than or equal to LIMIT.
  • Use the same embedding dimensionality that was declared in DIM.

DDL Reference

CREATE VECTOR INDEX <index_name>
    ON <table>(<blob_column>)
    DIM <n>
    METRIC <l2 | cosine | dot>
    [WITH (nlist = <n>, nprobe = <n>, max_norm = <f>)];
 
REINDEX VECTOR <index_name>;
 
DROP VECTOR INDEX <index_name>;

Clause               | Required | Meaning
DIM n                | yes      | External vector dimensionality. A 1536-dimensional embedding must be a 6144-byte blob.
METRIC l2            | yes      | Squared Euclidean distance. Lower is better.
METRIC cosine        | yes      | Cosine distance, 1 - cosine_similarity. Lower is better.
METRIC dot           | yes      | Negative inner product. Higher dot product becomes lower distance.
WITH (nlist = n)     | no       | IVF cluster count. More clusters usually reduce rows scanned per cluster but increase build and maintenance cost.
WITH (nprobe = n)    | no       | Default number of clusters probed per query. Higher values usually improve recall and increase read cost.
WITH (max_norm = f)  | for dot  | Norm cap used by dot-product augmentation.

Current SQL DDL does not expose target_partition_size. Auto tuning uses an internal default target of 512 vectors per partition.

Auto Tuning

When nlist is omitted, Marmot chooses the IVF cluster count from the current indexable row count and the internal target partition size.

Current defaults:

Policy                     | Current behavior
Target partition size      | 512 vectors per IVF partition
Non-empty create nlist     | roughly ceil(rows / 512), clamped by the supported auto range
Empty-table first publish  | starts from a bounded overlay snapshot once enough overlay rows exist
Later growth               | background maintenance can promote to a larger cluster count as rows grow
Auto nprobe                | derived from the target partition size; default behavior probes about 16 target partitions
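
As a rough sanity check of the non-empty create policy above (ignoring the auto-range clamp), the chosen cluster count tracks ceil(rows / 512). A minimal sketch:

import math

TARGET_PARTITION_SIZE = 512  # internal default target partition size

def approx_auto_nlist(rows: int) -> int:
    """Approximate the auto-chosen IVF cluster count, ignoring the clamp."""
    return math.ceil(rows / TARGET_PARTITION_SIZE)

print(approx_auto_nlist(100_000))  # 196, matching nlist in the 100K validation below
print(approx_auto_nlist(500_000))  # 977, matching nlist in the 500K validation below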

Manual override guidance:

Goal                          | Try
Higher recall                 | Increase nprobe first.
Lower query disk reads        | Lower nprobe, or use more clusters with enough training data.
Better large-corpus structure | Let auto nlist grow, or use a larger explicit nlist after benchmarking.
Faster build                  | Use fewer clusters.
Tenant-heavy filtering        | Keep scalar tenant/status indexes and let the planner choose pre-filter when cheaper.

Do not tune from insert throughput alone. A vector workload has separate write, first-publish, settled, and read-QPS milestones.

Session Tuning

Session variables are per connection.

SET @@marmot_vec_nprobe = 64;
SET @@marmot_vec_nprobe = 0;
 
SET @@marmot_vec_force_plan = 'pre';
SET @@marmot_vec_force_plan = 'post';
SET @@marmot_vec_force_plan = 'auto';
 
SET @@marmot_vec_prefilter_cap = 5000;
SET @@marmot_vec_fallback = 'on';
SET @@marmot_vec_use_go_rank = 'on';

Variable                   | Default | Meaning
@@marmot_vec_nprobe        | 0       | 0 means use the index default and budget probing when eligible. A positive value forces a fixed probe count for this connection.
@@marmot_vec_force_plan    | auto    | auto, pre, or post. Use this to compare exact pre-filter vs IVF post-filter.
@@marmot_vec_prefilter_cap | 5000    | Maximum estimated scalar-filter row count before the planner prefers IVF post-filter.
@@marmot_vec_fallback      | on      | Allows exact fallback if a post-filter path returns too few rows.
@@marmot_vec_use_go_rank   | on      | Uses the Go-side segment scan, exact-vector fetch, exact rerank, and final projection path.

Operational variables also exist for retrain checks and chunk sizing, but the default path is the recommended starting point.
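
Because these variables are scoped to a connection, issue the SET statements on the same connection that runs the query. A hedged sketch using the illustrative cursor and query_blob from earlier:

# Compare the exact pre-filter and IVF post-filter plans, then return to auto.
for plan in ("pre", "post", "auto"):
    cursor.execute(f"SET @@marmot_vec_force_plan = '{plan}'")
    cursor.execute(
        """
        SELECT id, title
          FROM chunks
         WHERE vec_match(embedding, ?, 10)
           AND tenant_id = 42
         ORDER BY vec_distance(embedding, ?)
         LIMIT 10
        """,
        (query_blob, query_blob),
    )
    print(plan, len(cursor.fetchall()))

# Raise the probe budget for this connection only; setting 0 restores the index default.
cursor.execute("SET @@marmot_vec_nprobe = 64")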

Lifecycle And Settling

Vector indexes have four practical milestones.

Milestone               | What it means
DDL committed           | The vector index metadata exists. If the table was empty, the index can accept overlay writes immediately.
Overlay visibility      | Inserts, updates, and deletes are visible to vector queries through the local overlay after commit.
First clustered publish | Marmot has published the first immutable .vecseg generation and can scan stable clusters.
Settled                 | Overlay backlog is below merge thresholds, auto cluster count no longer needs promotion, and cluster skew is within the current bounds.

For empty-table create, Marmot does not wait for the whole corpus before the first clustered state. The current auto-tuned bootstrap path waits until enough overlay rows exist for a useful initial structure. With the default target partition size, that floor is 64 target partitions, or 32,768 rows. The first publish is bounded by the bootstrap target and capped at 65,536 rows. If the overlay is below the publish target, Marmot waits briefly for writes to quiesce before publishing.

For non-empty create, CREATE VECTOR INDEX builds the first local generation from existing table rows before the index is ready.

Settling is not a correctness boundary:

  • Queries before settle include stable rows plus overlay rows.
  • Updates and deletes mask old stable rows through overlay tombstones before merge.
  • Background maintenance folds overlay rows into new immutable generations.
  • Promotions grow nlist when auto tuning decides the corpus has outgrown the current cluster count.

Settling affects read cost. Large overlays are correct but slower because queries must merge recent mutations in addition to scanning stable clusters.

Observability

The local vector catalog is stored in __marmot_vector_indexes. Treat it as read-only.

SELECT index_name,
       table_name,
       column_name,
       metric,
       dim,
       nlist,
       nprobe,
       auto_nlist,
       auto_nprobe,
       target_partition_size,
       status
  FROM __marmot_vector_indexes;

Useful log messages include:

  • engine hook bootstrap: bootstrap threshold reached
  • engine hook bootstrap: automatic bootstrap complete
  • maintenance: incremental merge failed
  • maintenance: catch-up rebuild failed
  • VectorIndexManager: local vector index marked dirty

vec-bench reports the measurements that matter for tuning:

  • DDL create time
  • first clustered publish time
  • final settled time
  • insert throughput
  • read QPS
  • recall
  • p50, p95, and p99 latency
  • RSS snapshots
  • stable encoding
  • overlay encoding mix and overlay journal size
  • block metadata rows and file size
  • segment logical bytes/query
  • segment actual read bytes/query
  • segment overread
  • exact rerank rows/query

Runtime Model

Local Files

For chunks_embedding_idx, Marmot stores derived state beside the database:

<db>.chunks_embedding_idx.vecseg/
  manifest/
    current
    gen-00000000000000000001.mf
  segments/
    gen-00000000000000000001.dat
  rowmap/
    gen-00000000000000000001.rmap
  blocks/
    gen-00000000000000000001.blk
  overlay/
    current.log

File                | Purpose
manifest/current    | Atomic pointer to the active generation.
manifest/gen-*.mf   | Manifest with index identity, centroid epochs, encoding metadata, checksums, and file names.
segments/gen-*.dat  | Stable vector payload laid out by cluster.
rowmap/gen-*.rmap   | Rowid to stable-location map.
blocks/gen-*.blk    | Validated per-generation block metadata sidecar for disk-first pruning and scan instrumentation.
overlay/current.log | Append-only local mutation log for rows newer than the stable generation.

Read Path

In post-filter IVF mode:

  1. Marmot materializes the query vector into the internal metric representation.
  2. It selects probe clusters from the active probe centroids.
  3. It scans the selected stable cluster spans with pread.
  4. It merges overlay rows and tombstones newer than the generation cutoff.
  5. It fetches shortlisted exact vectors from the base table.
  6. It reranks exact distances in Go.
  7. It issues a final projection query for only the final top-K rowids.

The approximate scan is used for candidate generation only. Final distance ordering uses exact vectors from the base table for every non-empty ANN candidate set. For safe filtered post-ranking, Marmot can widen the candidate budget and refill until enough rows survive the SQL predicate, the search space is exhausted, or the internal safety cap is reached. Unsupported predicate shapes fall back to the pre-filter path instead of serving stale approximate results.

New generations also write a .blk sidecar. The sidecar is validated against the .dat file on open, is deleted with retired generations, and is available for internal block-pruning experiments and scan metrics. Current default query execution keeps block pruning off unless explicitly enabled internally, because the latest 100K run showed the full selected-cluster PQ scan was faster at this scale.

Write Path

After a successful commit:

  1. Marmot CDC captures the row-level change.
  2. The vector hook materializes new vector values into internal form.
  3. The local overlay journal records inserts, updates, or tombstones. Before the first clustered publish, overlay vectors are temporary prepared float32 bytes; after probe centroids exist, overlay vectors are compact residual int8 bytes.
  4. Queries see those changes immediately from the overlay.
  5. Background maintenance publishes new immutable generations and compacts applied overlay rows.

Overlay snapshots are offset-backed and use a bounded vector cache; large online builds stream bootstrap and catch-up rows through temporary spools instead of retaining every prepared vector in heap. Exact rerank and maintenance always use the base table as the truth source. Lossy stable or overlay encodings are not used to update centroid sums or exact distances.

Stable Encoding

Stable segments use compact encodings:

Encoding      | Used when
Residual PQ8  | Internal dimension is at least 512.
Residual int8 | Internal dimension is less than 512.

Raw float32 stable segment payloads are not written. Residual PQ8 uses generation-local codebooks. Incremental maintenance reuses the generation codec for touched clusters so untouched cluster spans remain byte-copyable. Full rebuild or manual REINDEX VECTOR is the codec refresh path.

PQ8 codebooks are trained deterministically from a bounded sample with multi-start subspace k-means. This spends more build CPU to improve compressed-score quality while keeping read memory low.

Replication

Marmot does not replicate vector segment artifacts.

Replicated:

  • row-level DML CDC for the base table
  • vector-control metadata for CREATE VECTOR INDEX
  • vector-control metadata for DROP VECTOR INDEX
  • vector-control metadata for REINDEX VECTOR

Node-local derived state:

  • .vecseg segment files
  • rowmaps
  • block metadata sidecars
  • manifest files
  • overlay journals
  • centroids
  • PQ codebooks

Each node builds and maintains its own local vector index from the same replicated table rows and vector-control metadata. A replica may publish its local generation at a different wall-clock time, but queries use only locally ready vector indexes. If local vector CDC fails after the SQLite commit, the index is marked dirty so stale ANN results are not served until repaired or rebuilt.

Benchmarking

Use cmd/vec-bench for repeatable vector benchmarking.

./vec-bench \
  --data-dir=/tmp/marmot/benchdata/dbpedia-openai-1536 \
  --db-dir=/tmp/marmot/bench-active \
  --db-name=dbpedia \
  --table=docs \
  --column=embed \
  --index=embed_idx \
  --force-build \
  --insert-n=100000 \
  --warmup=200 \
  --n-queries=500 \
  --query-concurrency=8 \
  --nprobe=24 \
  --settle-timeout=10m \
  --min-recall=0.95 \
  --min-qps=1000 \
  --max-overread=1.05 \
  --profile-dir=/tmp/marmot/bench-active/prof

Always report:

  • dataset and embedding dimension
  • row count
  • nlist and nprobe
  • stable encoding
  • create time
  • first clustered publish time
  • final settled time
  • insert QPS
  • read QPS
  • recall@K
  • latency percentiles
  • RSS after build and after settle
  • segment read bytes/query and overread

Do not report raw insert throughput as query-ready build throughput. Inserts are live immediately, but the clustered generation and settled read structure are separate milestones.

Latest 100K Validation

Latest local validation used a random 100K-row subset of the DBpedia OpenAI 1536d dataset from the 990K-row source corpus with subset seed 42. The query set contained 10,000 queries, K=10, cosine distance, Go-rank, id projection, and read concurrency 8.

Metric                   | Measured value
Rows                     | 100,000
Source rows              | 990,000
nlist                    | 196
nprobe                   | 24
Stable encoding          | Residual PQ8
Payload bytes/vector     | 132
Entry bytes/vector       | 140
Stable .dat size         | 14,003,208 bytes
Block metadata           | 877 blocks, 3,651,568 bytes
Segment read bytes/query | 1,739,024
Segment overread         | 1.00x
Recall@10                | 0.9628
Recall@10-in-100         | 1.0000
Aggregate read QPS       | 1,752
Latency p50 / p95 / p99  | 4.31ms / 5.82ms / 6.48ms
RSS after measurement    | 340 MB

The matching 100K force-build lifecycle run measured DDL create at 203ms, 100,000 inserts in 3.00s, first clustered publish at 15.886s, and final settled state at 1m08.286s.

Latest 500K Validation

Latest larger local validation used a random 500K-row subset of the same DBpedia OpenAI 1536d source corpus with subset seed 42. The read-only measurement used the full 10,000-query set, K=10, cosine distance, Go-rank, id projection, read concurrency 8, and explicit nprobe=48.

Metric                   | Measured value
Rows                     | 500,000
Source rows              | 990,000
nlist                    | 977
nprobe                   | 48
Stable encoding          | Residual PQ8
Payload bytes/vector     | 132
Entry bytes/vector       | 140
Stable .dat size         | 70,015,704 bytes
Block metadata           | 4,401 blocks
Segment read bytes/query | 3,456,555
Segment overread         | 1.00x
Recall@10                | 0.9575
Recall@10-in-100         | 1.0000
Aggregate read QPS       | 1,053
Latency p50 / p95 / p99  | 7.04ms / 12.13ms / 15.28ms
RSS after reopen         | 185 MB
RSS after measurement    | 459 MB

The matching 500K force-build lifecycle run measured 500,000 inserts in 36.118s, first clustered publish at 1m21.011s, final settled state at 5m31.928s, and same-process RSS after settled cleanup at 917 MB. Reopened steady read memory is lower because transient insert, bootstrap, and catch-up buffers are gone.

Crash Safety

Stable generation publish is append-and-swap:

  1. write temp segment and rowmap files
  2. fsync temp files
  3. write temp manifest
  4. fsync manifest
  5. rename manifest into place
  6. swap manifest/current
  7. fsync the manifest directory

The old generation remains valid until manifest/current points to the new generation. The overlay journal remains the source of truth for mutations after the applied cutoff.
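
The same ordering, reduced to a generic runnable sketch (this is not Marmot's implementation; the path and names are illustrative):

import os

def publish_file_atomically(dir_path: str, name: str, payload: bytes) -> None:
    """Write-temp, fsync, rename into place, then fsync the directory."""
    tmp_path = os.path.join(dir_path, name + ".tmp")
    final_path = os.path.join(dir_path, name)

    with open(tmp_path, "wb") as f:      # write and fsync the temp file
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    os.replace(tmp_path, final_path)      # atomic rename into place

    dir_fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                  # make the rename durable
    finally:
        os.close(dir_fd)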

Limitations

  • Exactly one vec_match is supported per SELECT.
  • LIMIT is required and must be a positive literal integer.
  • vec_match K is the candidate/rerank budget and must be greater than or equal to LIMIT.
  • vec_match queries must order by vec_distance on the same vector column.
  • The base table must have an INTEGER PRIMARY KEY.
  • The embedding must be a packed little-endian float32 BLOB.
  • Vector index DDL uses simple table/index/column identifiers; run USE database first.
  • target_partition_size is currently internal and not a SQL DDL option.
  • Initial bootstrap, promotion, and REINDEX VECTOR are CPU-heavy because centroid training happens in-process.
  • Exact rerank reads shortlisted vectors from the base table, so very large vec_match candidate budgets can increase SQLite read cost.
  • Segment files, rowmaps, centroids, PQ codebooks, and overlay journals are local derived state and are not replicated.