Retrieval quality is invisible to code review: you cannot eyeball whether a change made recall better or worse. So 3ngram treats its benchmark as the goal function. A golden-set eval runs in CI on every change, and a pull request that regresses it does not merge — and you do not “fix” the eval to make it pass.

The numbers

These are the recorded gate floors. Floors only ratchet upward, and a CI run that falls below any floor fails the gate and blocks the merge, so every committed eval run clears them. They measure the offline retrieval harness (see How it is measured), not a live-traffic SLA.
| Metric | Floor | What it measures |
| --- | --- | --- |
| recall@5 | 0.9773 | The right memory is in the top 5 results |
| MRR@5 | 0.9697 | How highly the right memory ranks (1.0 = always first) |
| Supersession correctness | 0.9474 | A correction ranks above the memory it superseded, with the old row still in the index |
| Abstention precision | 1.0000 | When the answer genuinely is not stored, the system declines instead of returning a false match |
| Answerable above τ | 0.9545 | Answerable queries clear the calibrated similarity threshold (τ = 0.4663) |
These are the exact floor values from eval/fixtures/floors.json — the same values CI compares against, not rounded summaries. Supersession and abstention are the two that a plain vector store usually gets wrong. Supersession is scored with the superseded rows included in the index, so it proves ranking, not just filtering. Abstention is scored on topics that are verifiably absent, so a high score means the system knows what it does not know.
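To make the two ranking metrics concrete, here is a minimal sketch of how recall@5 and MRR@5 can be computed. It assumes each query has exactly one expected memory, and the case shape (`ranked`, `expected`) is hypothetical; the actual harness in eval/src may structure its data differently.

```javascript
// Sketch only: the case shape { ranked, expected } is an assumption,
// not the harness's real data model.

// Fraction of queries whose expected memory appears in the top k results.
function recallAtK(cases, k) {
  const hits = cases.filter((c) => c.ranked.slice(0, k).includes(c.expected)).length;
  return hits / cases.length;
}

// Mean reciprocal rank, counting only ranks within the top k.
// A query whose expected memory is outside the top k contributes 0.
function mrrAtK(cases, k) {
  const total = cases.reduce((sum, c) => {
    const index = c.ranked.slice(0, k).indexOf(c.expected); // -1 when not in top k
    return sum + (index === -1 ? 0 : 1 / (index + 1));
  }, 0);
  return total / cases.length;
}

// Toy cases, invented for illustration.
const sampleCases = [
  { ranked: ["m1", "m2", "m3"], expected: "m1" },                // rank 1
  { ranked: ["m4", "m5", "m6"], expected: "m5" },                // rank 2
  { ranked: ["m7", "m8", "m9"], expected: "m0" },                // not returned
  { ranked: ["a", "b", "c", "d", "e", "f"], expected: "f" },     // rank 6, outside top 5
];

console.log(recallAtK(sampleCases, 5)); // 0.5
console.log(mrrAtK(sampleCases, 5));    // 0.375
```

In the toy data, two of four queries land in the top 5, so recall@5 is 0.5; the reciprocal ranks are 1, 0.5, 0, and 0, so MRR@5 is 0.375.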

How it is measured

  • Dataset: 98 queries over 158 anonymized production memories, including real supersession chains.
  • Slices: retrieval (recall@5, MRR@5), supersession (successor ranks above its superseded predecessor), and abstention (top-1 similarity below the calibrated threshold τ for absent topics).
  • Embedding model: text-embedding-3-large at 1536 dimensions.
  • Method: deterministic, zero-dependency, no network — exact cosine retrieval over committed cached embeddings. Exact cosine is the retrieval upper bound; approximate-index (HNSW) parity is proven separately by an integration test.
  • Floors: recorded from a stable run and ratcheted up only; they never loosen to let a change through.
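The method described in the list above can be sketched as brute-force cosine retrieval plus two slice checks. This is an illustration under stated assumptions, not the harness itself: the function names, the memory shape (`id`, `vec`), and the three-dimensional toy vectors are all hypothetical, and only the threshold value τ = 0.4663 comes from this page.

```javascript
// Sketch only: names and data shapes are assumptions for illustration.

// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact retrieval: score every memory, sort by descending score, keep the top k.
function topK(queryVec, memories, k) {
  return memories
    .map((m) => ({ id: m.id, score: cosine(queryVec, m.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Abstention slice: decline when the top-1 similarity is below tau.
function shouldAbstain(results, tau) {
  return results.length === 0 || results[0].score < tau;
}

// Supersession slice: the successor must be returned and must rank above
// the predecessor it superseded (which is still present in the index).
function supersessionCorrect(results, successorId, predecessorId) {
  const ids = results.map((r) => r.id);
  const s = ids.indexOf(successorId);
  const p = ids.indexOf(predecessorId);
  return s !== -1 && (p === -1 || s < p);
}

// Toy index: "new" supersedes "old"; both rows stay in the index.
const toyMemories = [
  { id: "old", vec: [1, 0.3, 0] },
  { id: "new", vec: [1, 0.1, 0] },
  { id: "other", vec: [0, 0, 1] },
];
const TAU = 0.4663; // calibrated threshold quoted on this page

const answerable = topK([1, 0, 0], toyMemories, 3);
const absentTopic = topK([0, 1, 0], toyMemories, 3);

console.log(supersessionCorrect(answerable, "new", "old")); // true
console.log(shouldAbstain(answerable, TAU));                // false
console.log(shouldAbstain(absentTopic, TAU));               // true
```

Because every memory is scored, the result is deterministic for fixed embeddings, which is what lets the gate run from committed cached vectors without a network call.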

Reproduce it

The gate is offline and deterministic, so anyone can run it. From the repository root:
    node eval/src/run.mjs --model openai-large-1536
or, through the workspace:
    pnpm --filter @3ngram/eval run gate
This is the exact command CI runs in the eval-gate job. It needs no API key, no database, and no network — the embeddings are committed to the repository. A run that falls below any recorded floor exits non-zero, which is what blocks a regressing change from merging.
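The floor comparison behind that non-zero exit might look like the sketch below. The floor values are the ones listed on this page, but the key names, the function names, and the ratchet rule (raise a floor to the new value when a run beats it) are assumptions; the real layout of eval/fixtures/floors.json and the real update procedure may differ.

```javascript
// Sketch only: key names are hypothetical; values are the floors quoted on this page.
const floors = {
  "recall@5": 0.9773,
  "mrr@5": 0.9697,
  "supersession": 0.9474,
  "abstentionPrecision": 1.0,
  "answerableAboveTau": 0.9545,
};

// Return one entry per metric that is missing or below its floor.
function checkFloors(metrics, floorValues) {
  const failures = [];
  for (const [name, floor] of Object.entries(floorValues)) {
    const value = metrics[name];
    if (typeof value !== "number" || Number.isNaN(value) || value < floor) {
      failures.push({ name, floor, value });
    }
  }
  return failures;
}

// Map the result to a process exit code: 0 passes, 1 blocks the merge.
// A runner could assign this to process.exitCode.
function exitCodeFor(failures) {
  return failures.length === 0 ? 0 : 1;
}

// Assumed ratchet rule: a floor can rise to a better value but never drops.
function ratchet(floorValues, metrics) {
  const next = {};
  for (const [name, floor] of Object.entries(floorValues)) {
    const value = metrics[name];
    next[name] = typeof value === "number" && value > floor ? value : floor;
  }
  return next;
}

// Invented runs for illustration, not recorded results.
const passingRun = { ...floors };
const regressedRun = { ...floors, "recall@5": 0.95 };

console.log(exitCodeFor(checkFloors(passingRun, floors)));   // 0
console.log(exitCodeFor(checkFloors(regressedRun, floors))); // 1
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a run that silently skips a slice should not pass the gate.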
This is the blocking gate. A separate, non-blocking advisory tier runs broader benchmarks (LongMemEval- and MemoryAgentBench-shaped slices) nightly; those never gate a pull request and are not reflected in the floors above.