GeistHaus

Kabir Khandpur

Part of kabirk.com

Kabir Khandpur's website

SWE-bench Multilingual
Introducing a new dataset in the SWE-bench family with 300 curated tasks in 9 programming languages to evaluate LLMs on software engineering tasks.
https://kabirk.com/multilingual
How well can LLMs see?
Creating a small benchmark to test how well multimodal language models can find specific objects in complex Where's Waldo-style illustrations.
https://kabirk.com/wimmelbench