Kabir Khandpur — GeistHaus

May 6, 2025

Introducing a new dataset in the SWE-bench family: 300 curated tasks across 9 programming languages for evaluating LLMs on software engineering.

https://kabirk.com/multilingual

How well can LLMs see?

Nov 21, 2024

Creating a small benchmark to test how well multimodal language models can find specific objects in complex Where's Waldo-style illustrations.

https://kabirk.com/wimmelbench

Jul 21, 2024

Exploring a missing piece of the software observability stack: monitoring business logic.

https://kabirk.com/checker

Jun 18, 2024

Implementing partial builds to reduce the time taken to live-reload a static site by >99%.

https://kabirk.com/rebuilds

Jan 29, 2024

A look into how the UK government raises and spends money.

https://kabirk.com/ukfinances

Jan 4, 2024

https://kabirk.com/sgtransport