Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

The last six months in LLMs in five minutes

Simon Willison’s Weblog Simon Willison May 19, 2026

I put together these annotated slides from my five-minute lightning talk at PyCon US 2026, using the latest iteration of my annotated presentation tool. I presented this lightning …

0 inbound links article en lightning-talks 7, pycon 28, speaking 120, ai 2025, generative-ai 1792, local-llms 157, llms 1758, annotated-talks 31, pelican-riding-a-bicycle 114, coding-agents 203

Stream

sajalchoudhary.net May 14, 2026

Blog posts, photos, and micro updates

0 inbound links website en

Spencer Schneidenbach

schneidenba.ch Nov 21, 2025

Spencer Schneidenbach - AI Architect and Software Engineer

0 inbound links website en

Links

scottwillsey.com May 9, 2026

Sites and other stuff I like and that you should too.

0 inbound links en

The AI App Experience Matters More Than Benchmarks Now

MacStories.net Federico Viticci Nov 28, 2025

I was catching up on different articles after the release of Claude Opus 4.5 earlier this week, and this part from Simon Willison’s blog post about it stood out to me: I’m not saying the new model isn’t an improvement on Sonnet 4.5—but I can’t say with confidence that the challenges I posed it were

1 inbound link article en notes AI, artificial intelligence, featured, LLMs

Cross-Vendor Dynamic Model Fusion - A Framework for Vendor-Agnostic AI Orchestration - Dale Hurley

Dale Hurley Dale Hurley Nov 28, 2025

A comprehensive framework for intelligent orchestration of multiple large language models from different vendors, enabling vendor-agnostic AI systems with optimal cost-performance tradeoffs

0 inbound links article en Blog ai, orchestration, llm, mcp, cross-vendor, dynamic-model-fusion, academic-paper

What to know about Claude Opus 4.5 - TechTalks

TechTalks - Technology solving problems... and creating new ones Ben Dickson Nov 25, 2025

Anthropic responds to OpenAI and Google with Claude Opus 4.5, a model that prioritizes coding dominance, cost-efficiency, and user-controlled reasoning.

0 inbound links article en What is..., Anthropic, Artificial intelligence (AI), Claude LLM, Claude Opus 4.5, large language models, large reasoning models, bendee983

Productivity and AI: it's the tool, not the model

nocodefunctions.com Nocode functions Dec 23, 2025

Every week, a new “SOTA” (State of the Art) model is announced, promising higher reasoning capabilities and (often) lower costs. We could be led to think that we are entering an era of infinite, frictionless productivity. But the reality is messier. While the models are getting smarter, the gap between “intelligence on tap” and “completing a task” is managed by our tools, and right now that tooling interface is becoming a major source of friction. As we will see, this isn’t just a developer’s dilemma in the context of new AI-assisted coding interfaces. It is a preview of the “retooling tax” that every professional domain must soon learn to navigate. The paradox is simple: as models improve, productivity bottlenecks increasingly shift away from intelligence itself and toward the tools that mediate access to it.

The race for better models

This is the popular meme reflecting the merry-go-round of weekly improvements of AI models (source; other versions of this meme do include Anthropic’s Claude, if you wonder).

LLMs become more capable, cheaper, and available on tap, to the point that the new best-performing model can be indistinguishable from the previous one, simply because models are now so smart that the tasks we perform are not complex enough to clearly differentiate between “a great model” and an “even greater model”: both perform equally well on the tests. This is the experience of Simon Willison when testing a preview of Claude Opus 4.5 on November 24, 2025:

It’s clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in the milestone for the alpha [of his coding project]. I switched back to Claude Sonnet 4.5 and… kept on working at the same pace I’d been achieving with the new model. With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected. I’m not saying the new model isn’t an improvement on Sonnet 4.5—but I

0 inbound links en

Minimum Viable Benchmark

Nilenso Blog Atharva Raykar Nov 28, 2025

A few months ago, I was co-facilitating a “Birds of a Feather” session on keeping up with AI progress. This was a group of engineering leaders and ...

0 inbound links article en