matchspec
Eval framework. Define correct, test against it, get results.
Ship evals before you ship features.
Benchmark runner for Model Context Protocol servers. Paired comparison experiments on SWE-bench.