
Onenomad Bench Methodology

This document specifies how Onenomad Bench scores a memory system and how a receipt is produced. Everything here is intentionally precise so a result can be independently audited or reproduced.

Core principles

The adapter contract

interface Adapter {
  readonly name: string         // "engram", "mem0", "letta", ...
  readonly version: string      // SemVer of the system being benched

  ingest(items: Array<MemoryItem>): Promise<void>
  query(q: string, opts: { k: number, when?: Date }): Promise<RetrievedItem[]>
  reset(): Promise<void>        // Wipe state for a fresh fixture
}

interface MemoryItem {
  id: string
  content: string
  metadata: Record<string, unknown>
  timestamp: string             // ISO8601
}

interface RetrievedItem {
  id: string
  score: number                 // [0,1] confidence per the adapter
  content: string               // For human verification
}

The adapter author is responsible for translating their system's internal vocabulary to this surface. We do not score "did the adapter get the right idea." We score "for query Q, did the top-K returned IDs include the expected_answer_ids."
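To make the contract concrete, here is a minimal sketch of a toy adapter that satisfies it. KeywordAdapter and isHit are hypothetical names for illustration and are not part of the bench. Two points are assumptions rather than anything the contract states: that opts.when means "ignore items timestamped after this instant", and that a hit requires at least one expected ID in the returned list.

```typescript
// Types restated from the contract above so the sketch compiles on its own.
interface MemoryItem {
  id: string
  content: string
  metadata: Record<string, unknown>
  timestamp: string // ISO8601
}

interface RetrievedItem {
  id: string
  score: number
  content: string
}

interface Adapter {
  readonly name: string
  readonly version: string
  ingest(items: Array<MemoryItem>): Promise<void>
  query(q: string, opts: { k: number, when?: Date }): Promise<RetrievedItem[]>
  reset(): Promise<void>
}

// Hypothetical toy adapter: scores each item by the fraction of query words it contains.
class KeywordAdapter implements Adapter {
  readonly name = "keyword-toy"
  readonly version = "0.0.0"
  private items: MemoryItem[] = []

  async ingest(items: Array<MemoryItem>): Promise<void> {
    this.items.push(...items)
  }

  async query(q: string, opts: { k: number, when?: Date }): Promise<RetrievedItem[]> {
    const words = q.toLowerCase().split(/\s+/).filter((w) => w.length > 0)
    const when = opts.when
    return this.items
      // Assumption: `when` hides items timestamped after the given instant.
      .filter((item) => when === undefined || new Date(item.timestamp).getTime() <= when.getTime())
      .map((item) => {
        const text = item.content.toLowerCase()
        const matched = words.filter((w) => text.includes(w)).length
        const score = words.length === 0 ? 0 : matched / words.length
        return { id: item.id, score, content: item.content }
      })
      .sort((a, b) => b.score - a.score) // highest score first
      .slice(0, opts.k)
  }

  async reset(): Promise<void> {
    this.items = []
  }
}

// Hypothetical hit check (assumption: one expected ID in the top-K is enough).
function isHit(retrieved: RetrievedItem[], expectedAnswerIds: string[]): boolean {
  const expected = new Set(expectedAnswerIds)
  return retrieved.some((r) => expected.has(r.id))
}
```

A runner would call reset(), then ingest() with the fixture items, then query() once per fixture query, and pass only the returned IDs to scoring.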

The receipt schema (v0.0.1)

{
  "receiptId": "uuid-v4",
  "benchVersion": "0.0.1",
  "ranAt": "2026-05-17T15:23:01Z",
  "adapter": { "name": "engram", "version": "2.4.0" },
  "fixture": {
    "id": "longmemeval-temporal-inference-001",
    "sha256": "abcd...",
    "n": 500
  },
  "environment": {
    "node": "22.11.0",
    "platform": "linux/amd64",
    "containerImage": "ghcr.io/onenomad-llc/bench-runner:<sha>",
    "git": { "commit": "<sha>", "dirty": false }
  },
  "scores": {
    "recall_at_5": 0.972,
    "recall_at_10": 0.988,
    "ndcg_at_10": 0.951,
    "latency_p50_ms": 44,
    "latency_p95_ms": 187,
    "ingest_throughput_items_per_sec": 312
  },
  "perQuery": [
    { "queryId": "q-001", "retrieved": ["id-13","id-42",...], "hit": true, "rank": 2 }
  ],
  "signature": {
    "algorithm": "Ed25519",
    "publicKeyFingerprint": "sha256:abcd...",
    "value": "base64url(...)"
  }
}

The signature.value is computed over the canonical JSON form of the receipt with the signature field omitted. Canonicalization means sorted keys, no insignificant whitespace, and no trailing commas; it uses the JCS (RFC 8785) implementation in src/receipt/canonicalize.ts.
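The idea can be sketched as follows. This is a simplified illustration, not the code in src/receipt/canonicalize.ts: the names canonicalize and signingPayload are hypothetical, and the sketch leans on JSON.stringify for strings and numbers (the ECMAScript serialization that RFC 8785 builds on) without having been checked against the RFC's test vectors.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json }

// Simplified canonical form: object keys sorted, no insignificant whitespace.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value) // primitives: null, booleans, numbers, strings
  }
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalize(v)).join(",") + "]"
  }
  // Default sort compares UTF-16 code units, the key order JCS asks for.
  const keys = Object.keys(value).sort()
  const members = keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k] as Json))
  return "{" + members.join(",") + "}"
}

// The string that would be signed: the receipt without its signature field.
function signingPayload(receipt: { [key: string]: Json }): string {
  const unsigned: { [key: string]: Json } = {}
  for (const key of Object.keys(receipt)) {
    if (key !== "signature") unsigned[key] = receipt[key] as Json
  }
  return canonicalize(unsigned)
}
```

Because key order and whitespace are fixed, two receipts with the same content yield the same payload string however they were originally formatted, which is what lets a verifier recompute the signed bytes from a pretty-printed receipt.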

Scoring functions

All implemented as pure functions in src/scoring/.
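As an illustration of what a pure scoring function looks like, here is a sketch of recall@K and nDCG@K over ID lists. The function names, the binary-relevance assumption (an ID is either expected or not), and the log2 discount are assumptions made for this example; the definitions in src/scoring/ are authoritative and may differ. The sketch also assumes retrieved IDs are unique.

```typescript
// Fraction of expected IDs found among the first k retrieved IDs.
function recallAtK(retrievedIds: string[], expectedIds: string[], k: number): number {
  const expected = Array.from(new Set(expectedIds))
  if (expected.length === 0) return 0
  const topK = new Set(retrievedIds.slice(0, k))
  const found = expected.filter((id) => topK.has(id)).length
  return found / expected.length
}

// nDCG@k with binary relevance: gain 1 for an expected ID, 0 otherwise,
// discounted by log2(position + 1) with positions starting at 1.
function ndcgAtK(retrievedIds: string[], expectedIds: string[], k: number): number {
  const expected = new Set(expectedIds)
  let dcg = 0
  retrievedIds.slice(0, k).forEach((id, i) => {
    if (expected.has(id)) dcg += 1 / Math.log2(i + 2)
  })
  // The ideal ranking places every expected ID at the top.
  let idealDcg = 0
  for (let i = 0; i < Math.min(expected.size, k); i++) {
    idealDcg += 1 / Math.log2(i + 2)
  }
  return idealDcg === 0 ? 0 : dcg / idealDcg
}
```

Both functions depend only on their arguments, so the same retrieved list and expected list always produce the same number.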

Reproducibility checklist

A receipt is considered defensible if all of the following hold:

  1. The environment.git.commit is a tagged release of bench.
  2. The environment.git.dirty is false.
  3. The container image hash matches a tagged image at ghcr.io/onenomad-llc/bench-runner.
  4. The fixture.sha256 matches the fixture content at that commit.
  5. The signature verifies against keys/receipt-signing.pub.
  6. Re-running the same adapter version + fixture + environment produces byte-identical scores (modulo timestamp).
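The checks that need only the receipts themselves (items 2, 4 in part, and 6) can be sketched as a comparison between a published receipt and a local re-run. ReceiptSubset and compareRuns are hypothetical names; the tag, container image, and signature checks need access to git, the registry, and the public key, so they are left out of this sketch.

```typescript
// Only the receipt fields this sketch inspects.
interface ReceiptSubset {
  environment: { git: { commit: string, dirty: boolean } }
  fixture: { sha256: string }
  scores: Record<string, number>
}

// Returns a list of failed checks; an empty list means these checks passed.
function compareRuns(original: ReceiptSubset, rerun: ReceiptSubset): string[] {
  const failures: string[] = []
  if (original.environment.git.dirty || rerun.environment.git.dirty) {
    failures.push("environment.git.dirty is true")
  }
  if (original.environment.git.commit !== rerun.environment.git.commit) {
    failures.push("environment.git.commit differs")
  }
  if (original.fixture.sha256 !== rerun.fixture.sha256) {
    failures.push("fixture.sha256 differs")
  }
  // Compare every score key present in either receipt; ranAt is ignored entirely.
  const allKeys = Object.keys(original.scores).concat(Object.keys(rerun.scores))
  const keys = Array.from(new Set(allKeys)).sort()
  for (const key of keys) {
    if (original.scores[key] !== rerun.scores[key]) failures.push("scores." + key + " differs")
  }
  return failures
}
```

Exact equality is used for scores because the checklist asks for identical values, not values within a tolerance.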

Threats to validity

Versioning

Breaking changes to the receipt schema, scoring math, or adapter contract bump the major version. New scoring metrics, new adapters, and new fixtures bump the minor version. Bug fixes bump the patch version. Published receipts pin both the bench version and the adapter version.
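The bump rules can be written down as a small function. This is a sketch under the assumption that the bench follows standard SemVer resets (a major bump zeroes minor and patch, a minor bump zeroes patch); the Change type and bumpVersion are hypothetical names, not part of the repo.

```typescript
// "breaking": receipt schema, scoring math, or adapter contract changed.
// "feature":  new scoring metric, adapter, or fixture.
// "fix":      bug fix.
type Change = "breaking" | "feature" | "fix"

function bumpVersion(version: string, change: Change): string {
  const parts = version.split(".").map((p) => Number(p))
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n) || n < 0)) {
    throw new Error("expected MAJOR.MINOR.PATCH, got " + version)
  }
  const major = parts[0] as number
  const minor = parts[1] as number
  const patch = parts[2] as number
  if (change === "breaking") return (major + 1) + ".0.0"
  if (change === "feature") return major + "." + (minor + 1) + ".0"
  return major + "." + minor + "." + (patch + 1)
}
```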

License

Apache-2.0. Methodology is in-repo and forkable; if you build a competing benchmark with a different methodology, please name it differently.

Source: METHODOLOGY.md in the przm bench repo · Apache-2.0
