We take a deep dive into SWE-bench Verified, a prominent agentic coding benchmark. Although it is one of the best public tests of AI coding agents, it is limited by its focus on simple bug fixes in familiar open-source repositories.
I dug into popular coding benchmarks while building StoryMachine, an experiment in breaking down software tasks into agent-executable units.