GeistHaus
ExploitBench

exploitbench.ai

How far up the exploitation ladder can an agent climb on a production JS engine? ExploitBench evaluates frontier LLMs on full-control V8 exploit synthesis, measuring 16 capabilities per run and grading over multiple rounds with shuffled layouts.

0 pages link to this URL
