GeistHaus

Improving a Coding Agent Harness: Part 3, Scoring 100% on Coding Benchmarks

joe-b-security.github.io

How a 9B-parameter model scored 100% on Terminal-Bench by exploiting a bug in the scoring predicate, and what that says about benchmark trust boundaries.
