The numbers
These are the recorded gate floors. They only ratchet upward: a CI run below any floor fails the gate and blocks the merge, so every committed eval run clears them. They measure the offline retrieval harness (see How it is measured), not a live-traffic SLA.

| Metric | Floor | What it measures |
|---|---|---|
| recall@5 | 0.9773 | The right memory is in the top 5 results |
| MRR@5 | 0.9697 | How highly the right memory ranks (1.0 = always first) |
| Supersession correctness | 0.9474 | A correction ranks above the memory it superseded, with the old row still in the index |
| Abstention precision | 1.0000 | When the answer genuinely is not stored, the system declines instead of returning a false match |
| Answerable above τ | 0.9545 | Answerable queries clear the calibrated similarity threshold (τ = 0.4663) |

The floors are recorded in `eval/fixtures/floors.json`: the same values CI compares against, not rounded summaries.
Supersession and abstention are the two that a plain vector store usually gets wrong. Supersession is scored with the superseded rows included in the index, so it proves ranking, not just filtering. Abstention is scored on topics that are verifiably absent, so a high score means the system knows what it does not know.
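To make the two scoring rules concrete, here is a minimal Python sketch of how a supersession check and an abstention score could be computed. It is an illustration under stated assumptions, not the harness's actual code: the function names and the memory ids are hypothetical, and abstention precision is assumed to be the fraction of absent-topic queries whose top-1 similarity stays below τ.

```python
from typing import Sequence

TAU = 0.4663  # calibrated similarity threshold quoted in the table above


def supersession_correct(ranked_ids: Sequence[str], successor: str, predecessor: str) -> bool:
    """True when the correction outranks the memory it superseded.

    Both rows stay in the index, so the predecessor may appear in
    ranked_ids; it only has to come after the successor.
    """
    if successor not in ranked_ids:
        return False
    if predecessor not in ranked_ids:
        return True  # predecessor fell outside the returned window
    return ranked_ids.index(successor) < ranked_ids.index(predecessor)


def abstention_precision(top1_similarities: Sequence[float], tau: float = TAU) -> float:
    """Fraction of absent-topic queries whose best match stays below tau (assumed definition)."""
    if not top1_similarities:
        raise ValueError("need at least one absent-topic query")
    declined = sum(1 for s in top1_similarities if s < tau)
    return declined / len(top1_similarities)


# Hypothetical data: "m2" corrects "m1", and both are still retrievable.
ranking_ok = supersession_correct(["m2", "m7", "m1"], successor="m2", predecessor="m1")
absent_score = abstention_precision([0.31, 0.12, 0.45])
```

Because the predecessor is left in the candidate list, the check only passes when ranking itself puts the correction first, which is the distinction between ranking and filtering drawn above.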
How it is measured
- Dataset: 98 queries over 158 anonymized production memories, including real supersession chains.
- Slices: retrieval (recall@5, MRR@5), supersession (successor ranks above its superseded predecessor), and abstention (top-1 similarity below the calibrated threshold τ for absent topics).
- Embedding model: `text-embedding-3-large` at 1536 dimensions.
- Method: deterministic, zero-dependency, no network: exact cosine retrieval over committed cached embeddings. Exact cosine is the retrieval upper bound; approximate-index (HNSW) parity is proven separately by an integration test.
- Floors: recorded from a stable run and ratcheted up only; they never loosen to let a change through.
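
The retrieval slice described in the list above can be sketched in dependency-free Python as follows. This is a toy illustration, not the project's harness: the two-dimensional vectors, the memory ids, and the function names are all made up, and `k=2` is used only to keep the example small.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 5) -> List[str]:
    """Exact (brute-force) retrieval: score every memory, keep the k best."""
    scored = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [memory_id for memory_id, _ in scored[:k]]


def recall_at_k(results: List[List[str]], expected: List[str]) -> float:
    """Share of queries whose expected memory appears in the returned list."""
    hits = sum(1 for ranked, want in zip(results, expected) if want in ranked)
    return hits / len(expected)


def mrr_at_k(results: List[List[str]], expected: List[str]) -> float:
    """Mean reciprocal rank; a query scores 0 when its memory is not returned."""
    total = 0.0
    for ranked, want in zip(results, expected):
        if want in ranked:
            total += 1.0 / (ranked.index(want) + 1)
    return total / len(expected)


# Toy stand-ins for the committed cached embeddings.
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
queries = [[0.9, 0.1], [0.1, 0.9]]
expected = ["a", "c"]

results = [top_k(q, index, k=2) for q in queries]
recall = recall_at_k(results, expected)
mrr = mrr_at_k(results, expected)
```

In this toy run the first query finds its memory at rank 1 and the second at rank 2, so recall is 1.0 while MRR is 0.75. That gap is why the two metrics are reported separately.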
Reproduce it
The gate is offline and deterministic, so anyone can run it from the repository root. In CI it runs as the `eval-gate` job. It needs no API key, no database, and no network, because the embeddings are committed to the repository. A run that falls below any recorded floor exits non-zero, which is what blocks a regressing change from merging.
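
The floor comparison and the upward-only ratchet could look roughly like the sketch below. It is not the repository's gate script: the metric key names and function names are hypothetical, and the real schema of `eval/fixtures/floors.json` may differ. Only the floor values are taken from the table above.

```python
import sys
from typing import Dict, List

# Floor values copied from the table above; the key names are hypothetical.
FLOORS: Dict[str, float] = {
    "recall_at_5": 0.9773,
    "mrr_at_5": 0.9697,
    "supersession_correctness": 0.9474,
    "abstention_precision": 1.0000,
    "answerable_above_tau": 0.9545,
}


def failures(metrics: Dict[str, float], floors: Dict[str, float]) -> List[str]:
    """Names of metrics that are missing or fall below their floor."""
    return [name for name, floor in floors.items()
            if metrics.get(name, float("-inf")) < floor]


def ratchet(metrics: Dict[str, float], floors: Dict[str, float]) -> Dict[str, float]:
    """Raise a floor when the run beat it; never lower one."""
    return {name: max(floor, metrics.get(name, floor)) for name, floor in floors.items()}


def gate(metrics: Dict[str, float], floors: Dict[str, float] = FLOORS) -> int:
    """Exit code for a CI step: 0 when every floor is cleared, 1 otherwise."""
    failed = failures(metrics, floors)
    for name in failed:
        print(f"FAIL {name}: {metrics.get(name)} < {floors[name]}", file=sys.stderr)
    return 1 if failed else 0


# A run that exactly meets every floor passes; one regression fails the gate.
passing_run = dict(FLOORS)
regressed_run = dict(FLOORS, recall_at_5=0.9700)
passing_code = gate(passing_run)
regressed_code = gate(regressed_run)
```

A CI step would hand the result to `sys.exit(...)`, so any non-zero code fails the job. Using `max` in `ratchet` means that even a weak run cannot loosen a floor.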
This is the blocking gate. A separate, non-blocking advisory tier runs broader
benchmarks (LongMemEval- and MemoryAgentBench-shaped slices) nightly; those never
gate a pull request and are not reflected in the floors above.