Introducing a new dataset in the SWE-bench family: 300 curated tasks across 9 programming languages for evaluating LLMs on software engineering.
https://kabirk.com/multilingual
Creating a small benchmark to test how well multimodal language models can find specific objects in complex Where's Waldo-style illustrations.
https://kabirk.com/wimmelbench
Exploring a missing piece of the software observability stack: monitoring business logic.
https://kabirk.com/checker
Implementing partial builds to cut the live-reload time of a static site by more than 99%.
https://kabirk.com/rebuilds
A look into how the UK government raises and spends money.
https://kabirk.com/ukfinances
https://kabirk.com/sgtransport