How a 9B parameter model scored 100% on Terminal-Bench by exploiting a bug in the scoring predicate, and what that says about benchmark trust boundaries.
No pages have linked to this URL yet.
Log in or sign up to submit feeds.