GeistHaus

https://kabirk.com/feed.xml

rss
6 posts
Polling state
Status: active
Last polled: May 19, 2026 04:08 UTC
Next poll: May 20, 2026 02:13 UTC
Poll interval: 86400s

Posts

SWE-bench Multilingual
Introducing a new dataset in the SWE-bench family with 300 curated tasks in 9 programming languages to evaluate LLMs on software engineering tasks.
https://kabirk.com/multilingual

How well can LLMs see?
Creating a small benchmark to test how well multimodal language models can find specific objects in complex Where's Waldo-style illustrations.
https://kabirk.com/wimmelbench