New arXiv paper introduces an open benchmark spanning 23 live SaaS apps and 106 tasks; the strongest model completed fewer than 4% of the tasks end-to-end. Code on GitHub.