ml-model-eval-benchmark
// Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.
Stars: 1,933 · Forks: 367 · Updated: March 4, 2026
SKILL.md (read-only)
SKILL.md Frontmatter
name: ml-model-eval-benchmark
description: Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.
ML Model Eval Benchmark
Overview
Produce consistent model ranking outputs from metric-weighted evaluation inputs.
Workflow
- Define metric weights and accepted metric ranges.
- Ingest model metrics for each candidate.
- Compute weighted score and ranking.
- Export leaderboard and promotion recommendation.
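The workflow above can be sketched as a small scoring routine. This is a minimal illustration, not the bundled `scripts/benchmark_models.py`: the metric names, weights, and candidate values are hypothetical examples.

```python
# Hypothetical metric weights; in practice these come from your own
# weighting config. They should sum to 1.0.
WEIGHTS = {"accuracy": 0.5, "f1": 0.3, "latency_score": 0.2}

# Hypothetical candidate metrics, one dict per model.
CANDIDATES = {
    "model-a": {"accuracy": 0.91, "f1": 0.88, "latency_score": 0.70},
    "model-b": {"accuracy": 0.89, "f1": 0.90, "latency_score": 0.85},
}

def weighted_score(metrics: dict) -> float:
    # Every weighted metric must be present for every candidate.
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

def rank(candidates: dict) -> list:
    # Sort by score descending, then by name, so ties break
    # deterministically and reruns produce identical leaderboards.
    return sorted(
        ((name, weighted_score(m)) for name, m in candidates.items()),
        key=lambda item: (-item[1], item[0]),
    )

leaderboard = rank(CANDIDATES)
```

With these example numbers, `model-b` scores 0.885 against `model-a`'s 0.859 and tops the leaderboard; the name-based tie-break is what keeps the ranking deterministic when scores are equal.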
Use Bundled Resources
- Run `scripts/benchmark_models.py` to generate benchmark outputs.
- Read `references/benchmarking-guide.md` for weighting and tie-break guidance.
Guardrails
- Keep metric names and scales consistent across candidates.
- Record weighting assumptions in the output.
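Both guardrails can be enforced mechanically. The sketch below, under the same hypothetical weights as above, rejects candidates whose metric names diverge from the weighted set and embeds the weighting assumptions in the exported output.

```python
import json

# Hypothetical weights, reused as the canonical metric schema.
WEIGHTS = {"accuracy": 0.5, "f1": 0.3, "latency_score": 0.2}

def validate_candidates(candidates: dict) -> None:
    # Guardrail 1: every candidate must report exactly the weighted metrics.
    expected = set(WEIGHTS)
    for name, metrics in candidates.items():
        if set(metrics) != expected:
            raise ValueError(
                f"{name}: metrics {sorted(metrics)} != expected {sorted(expected)}"
            )

def export_leaderboard(leaderboard: list) -> str:
    # Guardrail 2: record the weighting assumptions alongside the ranking
    # so the output is self-describing.
    return json.dumps({"weights": WEIGHTS, "leaderboard": leaderboard})
```

A candidate with a renamed or rescaled metric (e.g. `f1_score` instead of `f1`) fails validation up front instead of silently skewing the weighted score.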