How far up the exploitation ladder can an agent climb on a production JS engine? ExploitBench evaluates frontier LLMs on full-control V8 exploit synthesis, measuring 16 capabilities per run and grading over multiple rounds with shuffled layouts.