Prometheus Metrics

Marmot exposes Prometheus metrics for monitoring cluster health, replication performance, and query processing. All metrics use the marmot_v2 namespace and include a node_id label for multi-node visibility.

Enabling Metrics

[prometheus]
enabled = true  # Metrics served on gRPC port at /metrics endpoint

Accessing Metrics:

curl http://localhost:8080/metrics
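The endpoint returns the standard Prometheus text exposition format. As a minimal sketch of inspecting that output outside of Prometheus, the Python below filters a scrape down to `marmot_v2_` samples. The `parse_metrics` helper and the sample text are hypothetical and written for this example; they are not part of Marmot, and the parser only handles simple `name{labels} value` lines.

```python
import re

# Matches sample lines such as: name{label="v",other="w"} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text, prefix="marmot_v2_"):
    """Return (name, labels, value) tuples for samples whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical scrape output, invented for this example.
EXAMPLE = """\
# HELP marmot_v2_cluster_nodes Number of nodes in cluster by status
# TYPE marmot_v2_cluster_nodes gauge
marmot_v2_cluster_nodes{node_id="1",status="ALIVE"} 3
marmot_v2_cluster_quorum_available{node_id="1"} 1
go_goroutines 42
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

Against a running node, the same function could be fed the body of the `/metrics` response, for example the text fetched with `urllib.request.urlopen`.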

Cluster Health Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| marmot_v2_cluster_nodes | Gauge | status | Number of nodes in cluster by status (ALIVE, SUSPECT, DEAD, JOINING, REMOVED) |
| marmot_v2_cluster_quorum_available | Gauge | - | Whether quorum is achievable (1=yes, 0=no) |
| marmot_v2_gossip_rounds_total | Counter | - | Total number of gossip rounds executed |
| marmot_v2_gossip_messages_total | Counter | direction | Total gossip messages by direction (sent, received) |
| marmot_v2_gossip_failures_total | Counter | - | Total failed gossip send attempts |
| marmot_v2_node_state_transitions_total | Counter | from, to | Node state transitions (e.g., ALIVE to SUSPECT) |
| marmot_v2_cluster_join_total | Counter | result | Cluster join attempts by result (success, failed) |
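As an illustration of how the two gauges above might be combined, here is a minimal sketch of a health summary for a single node's view of the cluster. The `cluster_health` function and its policy (quorum available, no SUSPECT or DEAD nodes) are assumptions made for this example, not behaviour defined by Marmot.

```python
def cluster_health(samples):
    """Summarise one node's view of the cluster.

    samples is an iterable of (name, labels, value) tuples.
    Hypothetical policy: healthy means quorum is available and
    no node is reported as SUSPECT or DEAD.
    """
    quorum = False
    by_status = {}
    for name, labels, value in samples:
        if name == "marmot_v2_cluster_quorum_available":
            quorum = value == 1
        elif name == "marmot_v2_cluster_nodes":
            by_status[labels.get("status", "UNKNOWN")] = int(value)
    unhealthy = by_status.get("SUSPECT", 0) + by_status.get("DEAD", 0)
    return {
        "quorum": quorum,
        "nodes": by_status,
        "healthy": quorum and unhealthy == 0,
    }


# Hypothetical samples, invented for this example.
EXAMPLE_SAMPLES = [
    ("marmot_v2_cluster_quorum_available", {"node_id": "1"}, 1.0),
    ("marmot_v2_cluster_nodes", {"node_id": "1", "status": "ALIVE"}, 2.0),
    ("marmot_v2_cluster_nodes", {"node_id": "1", "status": "SUSPECT"}, 1.0),
]

print(cluster_health(EXAMPLE_SAMPLES))
```

In this example quorum is still available, but the summary is flagged as not healthy because one node is SUSPECT.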

Transaction Metrics (2PC)

| Metric | Type | Labels | Description |
|---|---|---|---|
| marmot_v2_txn_total | Counter | type, result | Total transactions by type (write, read) and result (success, failed, conflict) |
| marmot_v2_txn_duration_seconds | Histogram | type | Transaction duration in seconds |
| marmot_v2_twophase_prepare_seconds | Histogram | - | 2PC prepare phase duration in seconds |
| marmot_v2_twophase_commit_seconds | Histogram | - | 2PC commit phase duration in seconds |
| marmot_v2_twophase_quorum_acks | Histogram | phase | Number of quorum acknowledgments received per phase |
| marmot_v2_write_conflicts_total | Counter | type, path | Write conflicts by type (mvcc, intent) and detection path (fast, slow) |
| marmot_v2_intent_filter_checks_total | Counter | result | Intent filter checks by result (fast_path, slow_path_miss, slow_path_conflict) |
| marmot_v2_intent_filter_size | Gauge | - | Current number of entries in the Cuckoo filter |
| marmot_v2_intent_filter_false_positives_total | Counter | - | Intent filter false positives (slow path found no conflict) |
| marmot_v2_intent_filter_txn_count | Gauge | - | Number of transactions with active intents in filter |
| marmot_v2_replication_requests_total | Counter | phase, result | Replication requests by phase (prepare, commit, replay) and result |
| marmot_v2_active_transactions | Gauge | - | Number of currently active transactions |

Query Processing Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| marmot_v2_queries_total | Counter | type, result | Total queries by type (select, insert, update, delete, ddl) and result |
| marmot_v2_query_duration_seconds | Histogram | type | Query duration in seconds |
| marmot_v2_rows_affected | Histogram | - | Number of rows affected per write query |
| marmot_v2_rows_returned | Histogram | - | Number of rows returned per read query |
| marmot_v2_mysql_connections | Gauge | - | Number of active MySQL protocol connections |
| marmot_v2_ddl_operations_total | Counter | result | DDL operations by result (success, failed) |
| marmot_v2_ddl_lock_wait_seconds | Histogram | - | Time waiting for DDL lock in seconds |

Anti-Entropy Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| marmot_v2_antientropy_rounds_total | Counter | - | Total anti-entropy rounds executed |
| marmot_v2_antientropy_syncs_total | Counter | type, result | Anti-entropy syncs by type (delta, snapshot) and result |
| marmot_v2_antientropy_duration_seconds | Histogram | - | Anti-entropy round duration in seconds |
| marmot_v2_replication_lag_txns | Gauge | peer | Transaction lag behind peer |
| marmot_v2_delta_sync_txns_total | Counter | - | Total transactions applied via delta sync |

Histogram Buckets

Different metrics use histogram buckets optimized for their expected latency profiles:

Write Transaction Buckets (for distributed writes with network + consensus):

5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s

Read Transaction Buckets (for local SQLite reads):

0.1ms, 0.5ms, 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms

2PC Phase Buckets (for prepare/commit latencies):

1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s

Sync Buckets (for anti-entropy and background sync):

100ms, 500ms, 1s, 2.5s, 5s, 10s, 30s, 60s
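Bucket boundaries matter because quantiles are estimated from cumulative bucket counts rather than from raw observations. The sketch below reproduces the general approach used by PromQL's `histogram_quantile` (find the bucket containing the target rank, then interpolate linearly inside it), applied to the write transaction buckets. The `estimate_quantile` function and the counts are illustrative assumptions, not Marmot code or real measurements.

```python
import math

# Write transaction bucket upper bounds from the list above, in seconds,
# followed by the implicit +Inf bucket.
WRITE_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, math.inf]


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate quantile q (0 < q <= 1) from cumulative bucket counts.

    upper_bounds is ascending and ends with math.inf;
    cumulative_counts[i] is the number of observations <= upper_bounds[i].
    """
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if idx == len(upper_bounds) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return upper_bounds[-2]
    lower = upper_bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative_counts[idx - 1] if idx > 0 else 0
    in_bucket = cumulative_counts[idx] - below
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the bucket.
    return lower + (upper_bounds[idx] - lower) * (rank - below) / in_bucket


# Hypothetical cumulative counts for 200 write transactions.
COUNTS = [0, 10, 60, 150, 180, 195, 199, 200, 200, 200, 200, 200]

print(f"p50 ~ {estimate_quantile(0.50, WRITE_BUCKETS, COUNTS):.4f}s")
print(f"p99 ~ {estimate_quantile(0.99, WRITE_BUCKETS, COUNTS):.4f}s")
```

With these counts the p99 rank lands in the 250ms to 500ms bucket, so the estimate can only be as precise as that bucket is wide. This is why the write, read, 2PC, and sync histograms each use boundaries clustered around their expected latencies.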

Prometheus Scrape Configuration

scrape_configs:
  - job_name: 'marmot'
    static_configs:
      - targets: ['node1:8080', 'node2:8080', 'node3:8080']
    scrape_interval: 15s

Example Queries

Cluster health:

# Number of ALIVE nodes as seen by each node (should equal the cluster size)
sum(marmot_v2_cluster_nodes{status="ALIVE"}) by (node_id)

# Quorum availability across cluster
min(marmot_v2_cluster_quorum_available)

Transaction performance:

# Write transaction p99 latency
histogram_quantile(0.99, rate(marmot_v2_txn_duration_seconds_bucket{type="write"}[5m]))

# Transaction success rate
sum(rate(marmot_v2_txn_total{result="success"}[5m])) / sum(rate(marmot_v2_txn_total[5m]))

2PC performance:

# Prepare phase p95 latency
histogram_quantile(0.95, rate(marmot_v2_twophase_prepare_seconds_bucket[5m]))

# Commit phase p95 latency
histogram_quantile(0.95, rate(marmot_v2_twophase_commit_seconds_bucket[5m]))

Conflict detection:

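# Write conflicts per second by type, averaged over the last minute
# (rate() is always per-second; multiply by 60 for a per-minute figure)
sum(rate(marmot_v2_write_conflicts_total[1m])) by (type)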
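Note that PromQL's `rate()` returns a per-second average over the window, whatever the window length. The following sketch shows the arithmetic on a handful of counter samples; the `per_second_rate` helper and the sample values are hypothetical, and the helper deliberately ignores the counter-reset handling and window extrapolation that the real `rate()` performs.

```python
def per_second_rate(samples):
    """Average per-second increase between the first and last sample.

    samples is a list of (unix_timestamp, counter_value) pairs, oldest first.
    Simplification: no counter-reset handling and no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical marmot_v2_write_conflicts_total samples, one every 15s.
WINDOW = [(0, 100), (15, 103), (30, 109), (45, 112), (60, 118)]

rate = per_second_rate(WINDOW)
print(f"{rate:.2f} conflicts/s")       # what a rate() over this window approximates
print(f"{rate * 60:.1f} conflicts/min")  # per-minute figure, rate * 60
```

To chart conflicts per minute directly in PromQL, either multiply the `rate()` result by 60 or use `increase()` over a one-minute window.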

Replication lag:

# Max replication lag across all peers, reported per node
max(marmot_v2_replication_lag_txns) by (node_id)