Vector Search
Marmot provides approximate nearest neighbor (ANN) vector search over packed float32 embeddings stored in normal user tables. It is designed for retrieval-augmented generation (RAG) stores where each row contains ordinary metadata plus an embedding.
The public SQL surface is small:
- CREATE VECTOR INDEX
- DROP VECTOR INDEX
- REINDEX VECTOR
- vec_match(column, query_vector, k) in WHERE
- vec_distance(column, query_vector) in ORDER BY
The exact embedding remains in the base table. Marmot builds node-local derived state for fast candidate search: centroids, immutable segment generations, row maps, manifests, and an overlay journal. Exact rerank still reads shortlist vectors from the base table before returning final rows.
Enable It
Vector search is controlled by the vector_index config section.
[vector_index]
enabled = true
data_dir = ""

enabled enables vector DDL, query rewriting, and background vector maintenance. data_dir is reserved for a future explicit vector-state root. Current builds colocate .vecseg files next to the SQLite database.
RAG Table Pattern
Use a normal table with an INTEGER PRIMARY KEY, scalar metadata columns, and one BLOB embedding column.
CREATE DATABASE rag;
USE rag;
CREATE TABLE chunks (
id INTEGER PRIMARY KEY,
tenant_id INTEGER NOT NULL,
source_uri TEXT NOT NULL,
chunk_no INTEGER NOT NULL,
title TEXT NOT NULL,
body TEXT NOT NULL,
status TEXT NOT NULL,
created_at INTEGER NOT NULL,
embedding BLOB NOT NULL
);
CREATE INDEX chunks_tenant_status_idx
ON chunks(tenant_id, status);
CREATE INDEX chunks_source_chunk_idx
ON chunks(source_uri, chunk_no);

The scalar indexes are optional, but they matter when your vector query also filters by tenant, status, source, owner, timestamp, or ACL columns.
Vector Blob Format
The vector column must contain packed little-endian float32 values. The byte length must be exactly DIM * 4.
Prefer driver parameters. Do not construct large SQL hex literals in application code.
import struct

def vec_blob(values: list[float]) -> bytes:
    # Pack as little-endian float32; the result must be exactly DIM * 4 bytes.
    return struct.pack("<" + "f" * len(values), *values)
embedding = get_embedding("Marmot stores vectors in SQLite.")
# Use your driver's placeholder convention; the important part is binding
# vec_blob(embedding) as bytes for the BLOB column.
cursor.execute(
"""
INSERT INTO chunks
(id, tenant_id, source_uri, chunk_no, title, body, status, created_at, embedding)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""",
(
1,
42,
"s3://kb/marmot.md",
0,
"Marmot vector search",
"Marmot stores exact embeddings in the base table.",
"published",
1730000000,
vec_blob(embedding),
),
)

For Go:
func Float32Blob(v []float32) []byte {
buf := make([]byte, len(v)*4)
for i, x := range v {
binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(x))
}
return buf
}

Create An Index
Start with auto tuning unless you have benchmark data that says otherwise.
CREATE VECTOR INDEX chunks_embedding_idx
ON chunks(embedding)
DIM 1536
METRIC cosine;

Equivalent explicit form:
CREATE VECTOR INDEX chunks_embedding_idx
ON chunks(embedding)
DIM 1536
METRIC cosine
WITH (nlist = 512, nprobe = 32);

For maximum-inner-product search:
CREATE VECTOR INDEX chunks_embedding_dot_idx
ON chunks(embedding)
DIM 1536
METRIC dot
WITH (max_norm = 80.0);

max_norm is required for dot workloads, which rely on MIPS-to-L2 augmentation. Set it to a fixed upper bound on your vector norms; vectors whose norm exceeds it fail materialization and must be reindexed with a larger bound.
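One way to pick a max_norm value is to measure the peak L2 norm over a sample of real embeddings and add headroom for future data. A minimal sketch; the helper name and headroom factor are illustrative, not part of Marmot:

```python
import math

def suggest_max_norm(sample_vectors, headroom=1.25):
    """Suggest a max_norm bound for METRIC dot from sampled embeddings.

    Vectors whose L2 norm exceeds the bound fail materialization, so
    pad the observed peak norm with headroom for future data.
    """
    peak = max(math.sqrt(sum(x * x for x in v)) for v in sample_vectors)
    return peak * headroom
```

Re-run the estimate whenever the embedding model changes, since norm distributions differ across models.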
Insert, Update, Delete
Vector CRUD uses ordinary SQL. Marmot CDC captures row changes and updates each local vector overlay after commit.
INSERT INTO chunks
(id, tenant_id, source_uri, chunk_no, title, body, status, created_at, embedding)
VALUES
(?, ?, ?, ?, ?, ?, ?, ?, ?);
UPDATE chunks
SET body = ?,
embedding = ?
WHERE id = ?;
DELETE FROM chunks
WHERE id = ?;

Committed changes are query-visible immediately through the local overlay. They do not need to wait for a full rebuild.
Search
Use vec_match as a vector predicate and order by vec_distance on the same column.
SELECT id, source_uri, chunk_no, title, body
FROM chunks
WHERE vec_match(embedding, ?, 10)
AND tenant_id = 42
AND status = 'published'
ORDER BY vec_distance(embedding, ?)
LIMIT 10;

Pass the same query-vector blob to both placeholders.
LIMIT is the final output count. The third argument to vec_match is the candidate and exact-rerank budget. For ordinary top-10 search, keep them equal. For extra recall headroom or selective filters, use a larger vec_match budget while keeping the final LIMIT small:
SELECT id, source_uri, chunk_no, title
FROM chunks
WHERE vec_match(embedding, ?, 100)
AND tenant_id = 42
AND status = 'published'
ORDER BY vec_distance(embedding, ?)
LIMIT 10;

Requirements:

- Use exactly one vec_match per SELECT.
- Include ORDER BY vec_distance(...) on the same vector column.
- Include a positive literal LIMIT.
- Keep vec_match K and LIMIT aligned for normal top-K search, or make vec_match K larger when you want a deeper candidate/rerank budget. vec_match K must be greater than or equal to LIMIT.
- Use the same embedding dimensionality that was declared in DIM.
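In application code, the identical packed query blob must be bound to both the vec_match and vec_distance placeholders. A minimal sketch of building the parameter tuple for the filtered query above; the helper name is illustrative, and your driver's placeholder convention may differ:

```python
import struct

def search_params(query_vec, tenant_id):
    """Bind the identical query blob to both vector placeholders.

    Parameter order matches:
      WHERE vec_match(embedding, ?, k) AND tenant_id = ?
      ORDER BY vec_distance(embedding, ?)
    """
    blob = struct.pack("<%df" % len(query_vec), *query_vec)
    return (blob, tenant_id, blob)
```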
DDL Reference
CREATE VECTOR INDEX <index_name>
ON <table>(<blob_column>)
DIM <n>
METRIC <l2 | cosine | dot>
[WITH (nlist = <n>, nprobe = <n>, max_norm = <f>)];
REINDEX VECTOR <index_name>;
DROP VECTOR INDEX <index_name>;

| Clause | Required | Meaning |
|---|---|---|
| DIM n | yes | External vector dimensionality. A 1536-dimensional embedding must be a 6144-byte blob. |
| METRIC l2 | yes | Squared Euclidean distance. Lower is better. |
| METRIC cosine | yes | Cosine distance, 1 - cosine_similarity. Lower is better. |
| METRIC dot | yes | Negative inner product. Higher dot product becomes lower distance. |
| WITH (nlist = n) | no | IVF cluster count. More clusters usually reduce rows scanned per cluster but increase build and maintenance cost. |
| WITH (nprobe = n) | no | Default number of clusters probed per query. Higher values usually improve recall and increase read cost. |
| WITH (max_norm = f) | for dot | Norm cap used by dot-product augmentation. |
Current SQL DDL does not expose target_partition_size. Auto tuning uses an internal default target of 512 vectors per partition.
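The three metric conventions above can be expressed directly. A plain-Python reference sketch of the distance semantics (not Marmot's internal kernels):

```python
import math

def l2_distance(a, b):
    # METRIC l2: squared Euclidean distance; lower is better.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    # METRIC cosine: 1 - cosine_similarity; lower is better.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def dot_distance(a, b):
    # METRIC dot: negative inner product, so a higher dot product
    # becomes a lower distance.
    return -sum(x * y for x, y in zip(a, b))
```

All three return values where lower means "closer", which is why ORDER BY vec_distance(...) ascending works uniformly.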
Auto Tuning
When nlist is omitted, Marmot chooses the IVF cluster count from the current indexable row count and the internal target partition size.
Current defaults:
| Policy | Current behavior |
|---|---|
| Target partition size | 512 vectors per IVF partition |
| Non-empty create nlist | roughly ceil(rows / 512), clamped by the supported auto range |
| Empty-table first publish | starts from a bounded overlay snapshot once enough overlay rows exist |
| Later growth | background maintenance can promote to a larger cluster count as rows grow |
| Auto nprobe | derived from the target partition size; default behavior probes about 16 target partitions |
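The non-empty-create policy above can be approximated as ceil(rows / target_partition_size). A sketch; the lo/hi clamp values are illustrative assumptions, since Marmot's supported auto range is internal:

```python
import math

def approx_auto_nlist(rows, target_partition_size=512, lo=1, hi=65536):
    """Approximate the documented auto-nlist policy: about
    ceil(rows / 512), clamped to an assumed supported range."""
    if rows <= 0:
        return lo
    return max(lo, min(hi, math.ceil(rows / target_partition_size)))
```

This matches the validation runs later in this page: 100,000 rows gives nlist 196 and 500,000 rows gives nlist 977.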
Manual override guidance:
| Goal | Try |
|---|---|
| Higher recall | Increase nprobe first. |
| Lower query disk reads | Lower nprobe, or use more clusters with enough training data. |
| Better large-corpus structure | Let auto nlist grow, or use a larger explicit nlist after benchmarking. |
| Faster build | Use fewer clusters. |
| Tenant-heavy filtering | Keep scalar tenant/status indexes and let the planner choose pre-filter when cheaper. |
Do not tune from insert throughput alone. A vector workload has separate write, first-publish, settled, and read-QPS milestones.
Session Tuning
Session variables are per connection.
SET @@marmot_vec_nprobe = 64;
SET @@marmot_vec_nprobe = 0;
SET @@marmot_vec_force_plan = 'pre';
SET @@marmot_vec_force_plan = 'post';
SET @@marmot_vec_force_plan = 'auto';
SET @@marmot_vec_prefilter_cap = 5000;
SET @@marmot_vec_fallback = 'on';
SET @@marmot_vec_use_go_rank = 'on';

| Variable | Default | Meaning |
|---|---|---|
| @@marmot_vec_nprobe | 0 | 0 means use the index default and budget probing when eligible. A positive value forces a fixed probe count for this connection. |
| @@marmot_vec_force_plan | auto | auto, pre, or post. Use this to compare exact pre-filter vs IVF post-filter. |
| @@marmot_vec_prefilter_cap | 5000 | Maximum estimated scalar-filter row count before the planner prefers IVF post-filter. |
| @@marmot_vec_fallback | on | Allows exact fallback if a post-filter path returns too few rows. |
| @@marmot_vec_use_go_rank | on | Uses the Go-side segment scan, exact-vector fetch, exact rerank, and final projection path. |
Operational variables also exist for retrain checks and chunk sizing, but the default path is the recommended starting point.
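To compare the pre-filter and post-filter paths for a specific query, one can time the same statement under each forced plan. A hedged sketch assuming a DB-API-style cursor; compare_plans is an illustrative helper, not a Marmot API:

```python
import time

def compare_plans(cursor, sql, params):
    """Time one vec_match query under each forced plan, then restore auto."""
    timings = {}
    for plan in ("pre", "post"):
        cursor.execute("SET @@marmot_vec_force_plan = '%s'" % plan)
        start = time.perf_counter()
        cursor.execute(sql, params)
        cursor.fetchall()
        timings[plan] = time.perf_counter() - start
    # Restore the default planner choice for the connection.
    cursor.execute("SET @@marmot_vec_force_plan = 'auto'")
    return timings
```

Run it several times per plan and discard warmup iterations before drawing conclusions, since the first probe of each cluster is disk-cold.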
Lifecycle And Settling
Vector indexes have four practical milestones.
| Milestone | What it means |
|---|---|
| DDL committed | The vector index metadata exists. If the table was empty, the index can accept overlay writes immediately. |
| Overlay visibility | Inserts, updates, and deletes are visible to vector queries through the local overlay after commit. |
| First clustered publish | Marmot has published the first immutable .vecseg generation and can scan stable clusters. |
| Settled | Overlay backlog is below merge thresholds, auto cluster count no longer needs promotion, and cluster skew is within the current bounds. |
For empty-table create, Marmot does not wait for the whole corpus before the first clustered state. The current auto-tuned bootstrap path waits until enough overlay rows exist for a useful initial structure. With the default target partition size, that floor is 64 target partitions, or 32,768 rows. The first publish is bounded by the bootstrap target and capped at 65,536 rows. If the overlay is below the publish target, Marmot waits briefly for writes to quiesce before publishing.
For non-empty create, CREATE VECTOR INDEX builds the first local generation from existing table rows before the index is ready.
Settling is not a correctness boundary:
- Queries before settle include stable rows plus overlay rows.
- Updates and deletes mask old stable rows through overlay tombstones before merge.
- Background maintenance folds overlay rows into new immutable generations.
- Promotions grow nlist when auto tuning decides the corpus has outgrown the current cluster count.
Settling affects read cost. Large overlays are correct but slower because queries must merge recent mutations in addition to scanning stable clusters.
Observability
The local vector catalog is stored in __marmot_vector_indexes. Treat it as read-only.
SELECT index_name,
table_name,
column_name,
metric,
dim,
nlist,
nprobe,
auto_nlist,
auto_nprobe,
target_partition_size,
status
FROM __marmot_vector_indexes;

Useful log messages include:

- engine hook bootstrap: bootstrap threshold reached
- engine hook bootstrap: automatic bootstrap complete
- maintenance: incremental merge failed
- maintenance: catch-up rebuild failed
- VectorIndexManager: local vector index marked dirty
vec-bench reports the measurements that matter for tuning:
- DDL create time
- first clustered publish time
- final settled time
- insert throughput
- read QPS
- recall
- p50, p95, and p99 latency
- RSS snapshots
- stable encoding
- overlay encoding mix and overlay journal size
- block metadata rows and file size
- segment logical bytes/query
- segment actual read bytes/query
- segment overread
- exact rerank rows/query
Runtime Model
Local Files
For chunks_embedding_idx, Marmot stores derived state beside the database:
<db>.chunks_embedding_idx.vecseg/
manifest/
current
gen-00000000000000000001.mf
segments/
gen-00000000000000000001.dat
rowmap/
gen-00000000000000000001.rmap
blocks/
gen-00000000000000000001.blk
overlay/
current.log

| File | Purpose |
|---|---|
| manifest/current | Atomic pointer to the active generation. |
| manifest/gen-*.mf | Manifest with index identity, centroid epochs, encoding metadata, checksums, and file names. |
| segments/gen-*.dat | Stable vector payload laid out by cluster. |
| rowmap/gen-*.rmap | Rowid to stable-location map. |
| blocks/gen-*.blk | Validated per-generation block metadata sidecar for disk-first pruning and scan instrumentation. |
| overlay/current.log | Append-only local mutation log for rows newer than the stable generation. |
Read Path
In post-filter IVF mode:
- Marmot materializes the query vector into the internal metric representation.
- It selects probe clusters from the active probe centroids.
- It scans the selected stable cluster spans with pread.
- It merges overlay rows and tombstones newer than the generation cutoff.
- It fetches shortlisted exact vectors from the base table.
- It reranks exact distances in Go.
- It issues a final projection query for only the final top-K rowids.
The approximate scan is used for candidate generation only. Final distance ordering uses exact vectors from the base table for every non-empty ANN candidate set. For safe filtered post-ranking, Marmot can widen the candidate budget and refill until enough rows survive the SQL predicate, the search space is exhausted, or the internal safety cap is reached. Unsupported predicate shapes fall back to the pre-filter path instead of serving stale approximate results.
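The overlay-merge step above can be pictured as a dictionary overlay: mutations newer than the generation cutoff override or mask stable candidates. A simplified sketch; the data shapes are illustrative, not Marmot's internal types:

```python
def merge_overlay(stable_hits, overlay_ops):
    """Merge stable candidates with newer overlay mutations.

    stable_hits: iterable of (rowid, distance) from the stable scan.
    overlay_ops: iterable of (rowid, op, distance), where op is
    "upsert" or "delete"; deletes act as tombstones.
    """
    merged = dict(stable_hits)
    for rowid, op, dist in overlay_ops:
        if op == "delete":
            merged.pop(rowid, None)  # tombstone masks the stable row
        else:
            merged[rowid] = dist     # newer insert or update wins
    return sorted(merged.items(), key=lambda kv: kv[1])
```

The real shortlist then goes through exact rerank against base-table vectors, so approximate distances here only order candidates.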
New generations also write a .blk sidecar. The sidecar is validated against the .dat file on open, is deleted with retired generations, and is available for internal block-pruning experiments and scan metrics. Current default query execution keeps block pruning off unless explicitly enabled internally, because the latest 100K run showed the full selected-cluster PQ scan was faster at this scale.
Write Path
After a successful commit:
- Marmot CDC captures the row-level change.
- The vector hook materializes new vector values into internal form.
- The local overlay journal records inserts, updates, or tombstones. Before the first clustered publish, overlay vectors are temporary prepared float32 bytes; after probe centroids exist, overlay vectors are compact residual int8 bytes.
- Queries see those changes immediately from the overlay.
- Background maintenance publishes new immutable generations and compacts applied overlay rows.
Overlay snapshots are offset-backed and use a bounded vector cache; large online builds stream bootstrap and catch-up rows through temporary spools instead of retaining every prepared vector in heap. Exact rerank and maintenance always use the base table as the truth source. Lossy stable or overlay encodings are not used to update centroid sums or exact distances.
Stable Encoding
Stable segments use compact encodings:
| Encoding | Used when |
|---|---|
| Residual PQ8 | Internal dimension is at least 512. |
| Residual int8 | Internal dimension is less than 512. |
Raw float32 stable segment payloads are not written. Residual PQ8 uses generation-local codebooks. Incremental maintenance reuses the generation codec for touched clusters so untouched cluster spans remain byte-copyable. Full rebuild or manual REINDEX VECTOR is the codec refresh path.
PQ8 codebooks are trained deterministically from a bounded sample with multi-start subspace k-means. This spends more build CPU to improve compressed-score quality while keeping read memory low.
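As intuition for the residual int8 path, a toy quantizer might quantize the vector-minus-centroid residual with a per-vector scale. This is purely illustrative; Marmot's actual on-disk codec is not specified here:

```python
def encode_residual_int8(vec, centroid):
    """Toy residual int8 encoder: quantize (vec - centroid) with a
    per-vector scale so codes fit in [-127, 127]."""
    residual = [v - c for v, c in zip(vec, centroid)]
    peak = max((abs(r) for r in residual), default=0.0) or 1.0
    scale = peak / 127.0
    codes = [round(r / scale) for r in residual]
    return codes, scale

def decode_residual_int8(codes, scale, centroid):
    # Reconstruct an approximation: centroid plus dequantized residual.
    return [cen + code * scale for code, cen in zip(codes, centroid)]
```

Because residuals are small relative to the vector itself, quantizing them loses much less precision than quantizing raw coordinates, which is why IVF-style indexes encode residuals rather than full vectors.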
Replication
Marmot does not replicate vector segment artifacts.
Replicated:
- row-level DML CDC for the base table
- vector-control metadata for CREATE VECTOR INDEX
- vector-control metadata for DROP VECTOR INDEX
- vector-control metadata for REINDEX VECTOR
Node-local derived state:
- .vecseg segment files
- rowmaps
- block metadata sidecars
- manifest files
- overlay journals
- centroids
- PQ codebooks
Each node builds and maintains its own local vector index from the same replicated table rows and vector-control metadata. A replica may publish its local generation at a different wall-clock time, but queries use only locally ready vector indexes. If local vector CDC fails after the SQLite commit, the index is marked dirty so stale ANN results are not served until repaired or rebuilt.
Benchmarking
Use cmd/vec-bench for repeatable vector benchmarking.
./vec-bench \
--data-dir=/tmp/marmot/benchdata/dbpedia-openai-1536 \
--db-dir=/tmp/marmot/bench-active \
--db-name=dbpedia \
--table=docs \
--column=embed \
--index=embed_idx \
--force-build \
--insert-n=100000 \
--warmup=200 \
--n-queries=500 \
--query-concurrency=8 \
--nprobe=24 \
--settle-timeout=10m \
--min-recall=0.95 \
--min-qps=1000 \
--max-overread=1.05 \
--profile-dir=/tmp/marmot/bench-active/prof

Always report:
- dataset and embedding dimension
- row count
- nlist and nprobe
- stable encoding
- create time
- first clustered publish time
- final settled time
- insert QPS
- read QPS
- recall@K
- latency percentiles
- RSS after build and after settle
- segment read bytes/query and overread
Do not report raw insert throughput as query-ready build throughput. Inserts are live immediately, but the clustered generation and settled read structure are separate milestones.
Latest 100K Validation
Latest local validation used a random 100K-row subset of the DBpedia OpenAI 1536d dataset from the 990K-row source corpus with subset seed 42. The query set contained 10,000 queries, K=10, cosine distance, Go-rank, id projection, and read concurrency 8.
| Metric | Measured value |
|---|---|
| Rows | 100,000 |
| Source rows | 990,000 |
| nlist | 196 |
| nprobe | 24 |
| Stable encoding | Residual PQ8 |
| Payload bytes/vector | 132 |
| Entry bytes/vector | 140 |
| Stable .dat size | 14,003,208 bytes |
| Block metadata | 877 blocks, 3,651,568 bytes |
| Segment read bytes/query | 1,739,024 |
| Segment overread | 1.00x |
| Recall@10 | 0.9628 |
| Recall@10-in-100 | 1.0000 |
| Aggregate read QPS | 1,752 |
| Latency p50 / p95 / p99 | 4.31ms / 5.82ms / 6.48ms |
| RSS after measurement | 340 MB |
The matching 100K force-build lifecycle run measured DDL create at 203ms, 100,000 inserts in 3.00s, first clustered publish at 15.886s, and final settled state at 1m08.286s.
Latest 500K Validation
Latest larger local validation used a random 500K-row subset of the same DBpedia OpenAI 1536d source corpus with subset seed 42. The read-only measurement used the full 10,000-query set, K=10, cosine distance, Go-rank, id projection, read concurrency 8, and explicit nprobe=48.
| Metric | Measured value |
|---|---|
| Rows | 500,000 |
| Source rows | 990,000 |
| nlist | 977 |
| nprobe | 48 |
| Stable encoding | Residual PQ8 |
| Payload bytes/vector | 132 |
| Entry bytes/vector | 140 |
| Stable .dat size | 70,015,704 bytes |
| Block metadata | 4,401 blocks |
| Segment read bytes/query | 3,456,555 |
| Segment overread | 1.00x |
| Recall@10 | 0.9575 |
| Recall@10-in-100 | 1.0000 |
| Aggregate read QPS | 1,053 |
| Latency p50 / p95 / p99 | 7.04ms / 12.13ms / 15.28ms |
| RSS after reopen | 185 MB |
| RSS after measurement | 459 MB |
The matching 500K force-build lifecycle run measured 500,000 inserts in 36.118s, first clustered publish at 1m21.011s, final settled state at 5m31.928s, and same-process RSS after settled cleanup at 917 MB. Reopened steady read memory is lower because transient insert, bootstrap, and catch-up buffers are gone.
Crash Safety
Stable generation publish is append-and-swap:
- write temp segment and rowmap files
- fsync temp files
- write temp manifest
- fsync manifest
- rename manifest into place
- swap manifest/current
- fsync the manifest directory
The old generation remains valid until manifest/current points to the new generation. The overlay journal remains the source of truth for mutations after the applied cutoff.
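The publish order above can be sketched with plain file operations. A simplified Python sketch of append-and-swap; paths and names are illustrative:

```python
import os

def publish_manifest(root, gen_name, manifest_bytes):
    """Append-and-swap: fsync the new manifest, rename it into place,
    swap the 'current' pointer, then fsync the directory entry."""
    def write_and_fsync(path, data):
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

    tmp_manifest = os.path.join(root, gen_name + ".mf.tmp")
    write_and_fsync(tmp_manifest, manifest_bytes)
    os.replace(tmp_manifest, os.path.join(root, gen_name + ".mf"))

    tmp_current = os.path.join(root, "current.tmp")
    write_and_fsync(tmp_current, (gen_name + ".mf").encode())
    os.replace(tmp_current, os.path.join(root, "current"))

    dir_fd = os.open(root, os.O_RDONLY)  # durably record both renames
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A crash at any step leaves either the old pointer or the new pointer intact, never a torn one, because the rename of the pointer file is atomic.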
Limitations
- Exactly one vec_match is supported per SELECT.
- LIMIT is required and must be a positive literal integer.
- vec_match K is the candidate/rerank budget and must be greater than or equal to LIMIT.
- vec_match queries must order by vec_distance on the same vector column.
- The base table must have an INTEGER PRIMARY KEY.
- The embedding must be a packed little-endian float32 BLOB.
- Vector index DDL uses simple table/index/column identifiers; run USE database first.
- target_partition_size is currently internal and not a SQL DDL option.
- Initial bootstrap, promotion, and REINDEX VECTOR are CPU-heavy because centroid training happens in-process.
- Exact rerank reads shortlisted vectors from the base table, so very large vec_match candidate budgets can increase SQLite read cost.
- Segment files, rowmaps, centroids, PQ codebooks, and overlay journals are local derived state and are not replicated.