
memory-bench-pioneer


SKILL.md Frontmatter
name: memory-bench-pioneer
description: Be one of the first to benchmark your agent's memory — and help shape how AI remembers. Runs a peer-review-grade evaluation suite (LLM-as-judge, nDCG/MAP/MRR with 95% CIs, ablation studies) against your live memory system and submits anonymized results to the ENGRAM/CORTEX research papers. Your data stays private; only aggregate stats leave. Works with agent-memory-ultimate. For the bold few who believe AI memory should be measured, not guessed at.

Memory Bench

Collect, assess, and submit anonymized memory system statistics for the ENGRAM and CORTEX research papers.

Three-Step Pipeline

1. Assess Retrieval Quality

Run the standard test set (30 queries across 4 query types × 3 difficulty levels) with LLM-as-judge scoring:

# Full assessment with GPT-4o-mini judge + ablation (recommended)
python3 scripts/rate.py --queries 30 --judge openai --ablation

# Without OpenAI key: local embedding judge (weaker, marked in output)
python3 scripts/rate.py --queries 30 --judge local --ablation

# Custom test set
python3 scripts/rate.py --testset path/to/queries.json --judge openai

What it measures:

  • RAR (Recall Accuracy Ratio), MRR (Mean Reciprocal Rank)
  • nDCG@5, MAP@5, Precision@5, Hit Rate
  • All metrics include 95% bootstrap confidence intervals (see the sketch after this list)
  • Ablation: runs with AND without spreading activation to isolate its contribution
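For illustration, a percentile bootstrap over per-query scores looks roughly like the sketch below; mrr and bootstrap_ci are illustrative helpers and sample values, not rate.py's internals:

import random

def mrr(ranks):
    # Mean Reciprocal Rank from 1-based ranks of the first relevant result;
    # 0 means "no relevant result retrieved" for that query.
    return sum(1.0 / r for r in ranks if r) / len(ranks)

def bootstrap_ci(values, stat, n=10_000, alpha=0.05):
    # Percentile bootstrap: resample queries with replacement, recompute the
    # statistic, and take the alpha/2 and 1 - alpha/2 quantiles.
    stats = sorted(stat(random.choices(values, k=len(values))) for _ in range(n))
    return stats[int(n * alpha / 2)], stats[int(n * (1 - alpha / 2)) - 1]

ranks = [1, 2, 1, 0, 3, 1, 5, 1, 2, 0] * 3  # illustrative ranks for 30 queries
lo, hi = bootstrap_ci(ranks, mrr)
print(f"MRR = {mrr(ranks):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")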

Judge methods:

  • openai — GPT-4o-mini rates each (query, result) pair on a 1-5 scale. Independent of the retrieval system. ~$0.01 per run.
  • local — Embedding cosine similarity. Weaker, and marked as such in the output. Zero cost.
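A single openai-judge call is conceptually similar to the sketch below (assumes the openai Python package and an OPENAI_API_KEY in the environment; the prompt wording is hypothetical, not rate.py's actual prompt):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_pair(query: str, result: str) -> int:
    # Rate one (query, result) pair on a 1-5 scale; temperature=0 keeps the
    # judge deterministic across runs.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Rate how well the retrieved memory answers the query, from "
                "1 (irrelevant) to 5 (fully answers it). Reply with one digit.")},
            {"role": "user", "content": f"Query: {query}\nResult: {result}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())

Because the judge sees only the query and result text, its scores do not depend on how the retrieval system ranked them.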

Standard test set (scripts/testset.json): 30 queries stratified across semantic/episodic/procedural/strategic types and easy/medium/hard difficulty. No lexical overlap with stored memories. All deployments run the same queries for cross-site comparability.

2. Collect Statistics

python3 scripts/collect.py --contributor GITHUB_USER --days 14 --output /tmp/memory-bench-report.json

Collected (anonymized): Memory counts/types/ages, strength/importance histograms, association graph size, hierarchy levels, consolidation history, retrieval metrics (RAR/MRR/nDCG/MAP with CIs), ablation results, judge method, algorithm version, embedding coverage. Instance ID is a random UUID (not reversible).
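For a rough sense of the report's shape, a minimal sketch follows; field names and values are illustrative placeholders, not collect.py's actual schema:

import json
import uuid

# Illustrative skeleton only -- not the real collect.py output format.
report = {
    "instance_id": str(uuid.uuid4()),  # random UUID, not reversible
    "contributor": "GITHUB_USER",
    "window_days": 14,
    "memory_stats": {
        "count": 1204,
        "by_type": {"semantic": 512, "episodic": 403,
                    "procedural": 201, "strategic": 88},
    },
    "retrieval": {
        "ndcg_at_5": {"mean": 0.71, "ci95": [0.64, 0.78]},
        "mrr": {"mean": 0.66, "ci95": [0.58, 0.74]},
    },
    "judge": "openai",
    "algorithm_version": "abc1234",  # auto-captured from git
}
print(json.dumps(report, indent=2))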

Never collected: Memory content, queries, file paths, usernames, hostnames.

3. Submit as PR

scripts/submit.sh /tmp/memory-bench-report.json GITHUB_USERNAME

Forks the repo, creates a branch, places the report, updates INDEX.json, and opens a PR. Requires the gh CLI.
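The automation is roughly equivalent to the sketch below (the repository name and branch are placeholders; submit.sh is the authoritative implementation):

import subprocess

# Placeholders: "OWNER/memory-bench" and the branch name are hypothetical;
# submit.sh knows the real repository and layout.
subprocess.run(["gh", "repo", "fork", "OWNER/memory-bench", "--clone"], check=True)
for cmd in (
    ["git", "checkout", "-b", "report/GITHUB_USERNAME"],
    # ... copy the report into place and update INDEX.json here ...
    ["git", "add", "-A"],
    ["git", "commit", "-m", "Add anonymized benchmark report"],
    ["git", "push", "-u", "origin", "report/GITHUB_USERNAME"],
    ["gh", "pr", "create", "--fill"],
):
    subprocess.run(cmd, cwd="memory-bench", check=True)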

Validation Protocol

For peer-review-ready data, contributors should:

  1. Run rate.py --ablation --judge openai (minimum N=30 queries)
  2. Collect at least 2 reports from the same instance, ≥7 days apart (longitudinal)
  3. Report the algorithm version (auto-captured from git)

Test Set Format

Custom test sets are JSON arrays:

[
  {
    "id": "T01",
    "query": "...",
    "category": "semantic|episodic|procedural|strategic",
    "difficulty": "easy|medium|hard"
  }
]
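Before running against a custom set, a quick sanity check can catch malformed entries; validate_testset below is a hypothetical helper, not part of the skill's scripts:

import json

CATEGORIES = {"semantic", "episodic", "procedural", "strategic"}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_testset(path):
    # Check required fields and allowed values for every query entry.
    with open(path) as f:
        queries = json.load(f)
    for q in queries:
        assert q["category"] in CATEGORIES, q.get("id")
        assert q["difficulty"] in DIFFICULTIES, q.get("id")
        assert q["query"].strip(), q.get("id")
    return len(queries)

print(validate_testset("path/to/queries.json"), "queries loaded")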

Agent Workflow

When asked to submit benchmarks: run rate.py --ablation --judge openai, then collect.py, review the summary, then run submit.sh and share the PR link.