Measuring AI Ability to Complete Long Tasks

xAI: The Unpriced Risk in SpaceX's IPO

A report by May 19, 2026

A report concerning xAI's safety practices, governance, and disclosure obligations ahead of the SpaceX IPO.

0 inbound links article en

Apricitas Economics Joseph Politano Feb 10, 2026

US AI-Related Investment Keeps Breaking Records, With Total Software, Computer, & Data Center Spending Now Exceeding $1T Per Year

1 inbound link article en

Where Are the Vibecoded Photoshops?

news.ycombinator.com May 18, 2026

1 inbound link en

2025: The year in LLMs

Simon Willison’s Weblog Simon Willison Dec 31, 2025

This is the third in my annual series reviewing everything that happened in the LLM space over the past 12 months. For previous years see Stuff we figured out about …

1 inbound link article en ai 2024, openai 419, generative-ai 1791, llms 1757, anthropic 282, gemini 185, ai-agents 111, pelican-riding-a-bicycle 113, vibe-coding 91, coding-agents 202, ai-in-china 95, conformance-suites 10

The Bitter Lesson versus The Garbage Can

One Useful Thing Ethan Mollick Jul 28, 2025

Does process matter? We are about to find out.

1 inbound link article en

Tokenmaxxing

Theory Ventures Tomasz Tunguz Apr 1, 2026

I burnt 250M tokens in a day. Tokenmaxxing is the deliberate practice of maximizing AI token consumption through parallelization.

0 inbound links article en [ai, productivity] [tokenmaxxing, AI productivity, AI agents, parallel workflows, METR research, AI token consumption, agent orchestration]

CS 2881 AI Safety

CS 2881 AI Safety Sep 4, 2025

Fall 2025 - Harvard

4 inbound links website en

Reflections on 2025

Samuel Albanie Samuel Albanie Dec 30, 2025

The Compute Theory of Everything, grading the homework of a minor deity, and the acoustic preferences of Atlantic salmon

2 inbound links article en

A guide to local coding models

news.ycombinator.com Dec 21, 2025

1 inbound link en

There Are No New Ideas in AI… Only New Datasets

Token for Token Jack Morris Apr 9, 2025

LLMs were invented in four major developments... all of which were datasets

1 inbound link article en

Peto's Paradox and the Future of AI Agents

Fergus Hamilton Fergus Hamilton Jan 23, 2026

Analysis of AI agent reliability using survival models. Re-examining METR's task data with Weibull distributions reveals insights about long-horizon AI autonomy.

3 inbound links website en

The Galaxy Brains Have ADHD

robdearborn.com Rob Dearborn Dec 1, 2025

0 inbound links en

Grading AI 2027’s 2025 Predictions

AI Futures Project Eli Lifland; Daniel Kokotajlo Feb 12, 2026

How has AI progress compared to AI 2027 thus far?

2 inbound links article en

Against the METR graph

Transformer Nathan Witkin Jan 20, 2026

METR’s benchmark has become a bellwether of AI capability growth, but its design isn’t up to the task, argues Nathan Witkin

2 inbound links article en

AI Futures Model: Dec 2025 Update

AI Futures Project Daniel Kokotajlo; Eli Lifland; Brendan Halstead; Alex Kastner Dec 31, 2025

We've significantly improved our model of AI timelines and takeoff speeds!

7 inbound links article en

The AI Adoption Gap: Preparing the US Government for Advanced AI

Forethought Mar 31, 2025

Advanced AI could unlock an era of enlightened and competent government action. But without smart, active investment, we’ll squander that opportunity and barrel blindly into danger.

0 inbound links article en

Agent Autonomy - Part 2: Going Beyond Algorithms

Hani's Blog هاني الشاطر Dec 25, 2025

Educational demos, marketing materials, and creative work—problems without a mathematical harness. Part 2 of the Agent Autonomy series shows how orchestrated agent evolution can solve subjective problems through skill-based guidance and multiple independent evaluators.

0 inbound links article en Machine Learning, Product Management, Leadership, Software Engineering

From Behavior to Judgment: Designing Evaluation for Agentic Systems » { design@tive } information design

{ Design@Tive } Information Design Itamar Medeiros Apr 21, 2026

Learn how to evaluate agentic AI systems using dual evaluation, LLM-as-a-judge, and hybrid methods that go beyond observability.

0 inbound links article en Artificial Intelligence, Project Management, Talks & Workshops, User Experience; Agentic AI, Artificial Intelligence, Decision Making, Human-Computer Interaction, Human–Agent Centered Design, product developer, Product Management, Usability, User Experience, UX

How to React to a New Frontier Model

Chris Parsons Jan 1, 2026

Gemini 3 is out. The benchmarks are genuinely incredible. But it’s hard to know what to do about it. 41% on HLE. 45% on ARC-AGI-2. These are colossal achievem...

0 inbound links website en tw-jekyll

Are we in a bubble?

Ashwin's blog Dec 4, 2025

Intro/Disclaimer: Trade at your own risk! Just like women think “I’m special” and “This time it’ll be different” when it comes to relationships, men think the same way when it comes to stock trading. Warren Buffett’s advice has remained the same for years: Most people should simply dump their money into an index fund that tracks the market and forget about it rather than actively trade and lose money. I too have recommended that same approach in one of my most-read posts, but given my actions, I think I need to add another oft-repeated piece of advice: “do as I say and not as I do.”

0 inbound links article en posts Thoughts, AI

The Substrate Revolution: When We Mistook the Shadow for the Fire | Blakeist

Blakeist Dec 27, 2025

Here's what's actually shocking about GLP-1s: they work by making the entire behavioral apparatus of weight loss obsolete. AI does the exact same violence to our model of intelligence.

0 inbound links article en

2026: When the System Becomes the Bottleneck | Blakeist

Blakeist Dec 26, 2025

The limiting factor shifts from cognition to governance and economics. The model is no longer the bottleneck. The system is.

0 inbound links article en

AI Slow vs. Fast Takeoff, Defined

David Shapiro’s Substack David Shapiro Sep 17, 2025

Cut through the rhetoric: what people mean by each, why benchmarks keep saturating, and how super-exponential autonomy reframes the next few years

1 inbound link article en

Evaluating Long Context (Reasoning) Ability

nrehiew.github.io Jan 1, 2026

2 inbound links en

Agency without consciousness

mynamelowercase.com Nov 19, 2025

0 inbound links en

JDP Reviews IABIED

minihf.com John David Pressman Sep 18, 2025

0 inbound links en CC ZERO 1.0

Why are there so many rationalist cults?

news.ycombinator.com Aug 12, 2025

1 inbound link en

Verissimo Monthly - May 2025

Verissimo Ventures Binyamin Grobman May 30, 2025

The Unreliability of LLMs & What Lies Ahead

0 inbound links article en

Will AI Produce the Next Great Divergence?

Default Sarosh Nagar; David Eaves Nov 5, 2026

An analysis of AI and institutions.

1 inbound link website en

Towards end-to-end automation of AI research - Nature

Nature Publishing Group UK Chris Lu; Cong Lu; Robert Tjarko Lange; Yutaro Yamada; Shengran Hu; Jakob Foerster; David Ha; Jeff Clune Mar 25, 2026

An artificial intelligence system can produce research papers with minimal human involvement, even passing the first round of peer review for the workshop of a main machine learning conference.

2 inbound links article en Computer science, Mathematics and computing; Computer science, Mathematics and computing, Science, Humanities and Social Sciences, multidisciplinary CC BY 4.0

Turning 20 while the world turns upside-down

parvmahajan.com Dec 21, 2025

1 inbound link en

Estimating AI productivity gains

anthropic.com Nov 25, 2025

Anthropic economic research on productivity gains

4 inbound links website en

Towards end-to-end automation of AI research - Nature

Nature Publishing Group UK Chris Lu; Cong Lu; Robert Tjarko Lange; Yutaro Yamada; Shengran Hu; Jakob Foerster; David Ha; Jeff Clune Mar 25, 2026

An artificial intelligence system can produce research papers with minimal human involvement, even passing the first round of peer review for the workshop of a main machine learning conference.

5 inbound links article en Computer science, Mathematics and computing; Computer science, Mathematics and computing, Science, Humanities and Social Sciences, multidisciplinary CC BY 4.0

My response to AI 2027

vitalik.eth.limo Jul 10, 2025

1 inbound link en

A deep critique of AI 2027’s bad timeline models

Timeline Topography Tales Titotal Jun 19, 2025

Disclaimers:

3 inbound links article en

Ok, AI Can Write Pretty Good Fiction Now

Lift High The Muse Justis Mills Jun 15, 2025

A lot changes in three months

2 inbound links article en

Why I don't think AGI is imminent

dlants.me Mastodon Feb 12, 2026

0 inbound links en

Designing for delegation | Ad Hoc

Ad Hoc Mark Headd Oct 1, 2025

Agentic, delegation-based services could reshape how people access government, cutting administrative burden – if agencies start building the right design patterns now.

1 inbound link website en

Claude Estimates in Human Time

Gorewood Logs Feb 2, 2026

I ask how long something will take. Claude answers in developer-hours. Claude is doing the work.

0 inbound links article en

Hiring For Humans (podcast highlights) | Michał Prządka - Blog

blog.michalprzadka.com Michał Prządka Dec 2, 2025

What are we actually hiring for when AI can ace your interviews?

0 inbound links article en

The Assembly Language of Knowledge Work

vivekhaldar.com Sep 13, 2025

The work most of us are doing right now—the clicking, the tabbing between windows, the copy-pasting, the endless typing interspersed with bursts of genuine cognition—will soon seem as archaic as programming in assembly language—the low-level instruction set for a machine that is about to be automated away. The Atom of Work: The Read-Cognify-Write Loop Break down any task performed by a knowledge worker, and you find the same atomic structure repeating itself:

0 inbound links en

Thoughts by a non-economist on AI and economics

Windows On Theory Boaz Barak Nov 4, 2025

Crossposted on lesswrong Modern humans first emerged about 100,000 years ago. For the next 99,800 years or so, nothing happened. Well, not quite nothing. There were wars, political intrigue, the in…

2 inbound links article en Uncategorized

The state of AI safety in four fake graphs

Windows On Theory Boaz Barak Mar 30, 2026

Here is a quick overview of my intuitions on where we are with AI safety in early 2026: So far, we continue to see exponential improvements in capabilities. This is most visible in the famous “METR…

0 inbound links article en Uncategorized

OpenAI's inflated valuation, as I understand it

taloranderson.com Oct 10, 2025

Blog post: OpenAI's inflated valuation, as I understand it

0 inbound links en

Let's automate our jobs

quanttype.net Feb 25, 2026

Here's what needs to be done to make software engineering automatic

0 inbound links article en

2025 letter

Zhengdong Wang Zhengdong Wang Dec 30, 2025

Zhengdong Wang’s personal website

5 inbound links article en

The Extreme Inefficiency of RL for Frontier Models — Toby Ord

Toby Ord Toby Ord Sep 19, 2025

The new scaling paradigm for AI reduces the amount of information a model could learn per hour of training by a factor of 1,000 to 1,000,000. I explore what this means and its implications for scaling.

1 inbound link article en

Learning Claude Code, a wild 3 weeks, and the looming mental health crisis - SQLGene Training

SQLGene Training Eugene Meidinger Jan 5, 2026

Since most of my audience is data people, I’m pretty confident you can read a graph. Take a guess when I started using Claude Code. Yes, that’s correct. I installed Claude Code on December 14th with my pro plan. On December 15th, I upgraded to the $200/mo MAX plan, and I expect to keep it […]

0 inbound links article en LLMs

2025 letter

Uzpg Jan 1, 2026

0 inbound links article en

The Bubble and the Long Game - Log - nibzard

Nibzard Nikola Balić Mar 10, 2026

What the printing press taught me about AI, FOMO, and the decades-long game of technological diffusion.

0 inbound links article en AI, DIFFUSION, HISTORY, STRATEGY, FOMO

Mathematics in the Library of Babel — Daniel Litt

Daniel Litt Daniel Litt Feb 20, 2026

Mathematics isn't only about saying true things. It's about asking the right questions, being confused, stumbling about, getting distracted, being wrong, recognizing when you're wrong, being stuck. Mostly being stuck. It's about clinging to a giant edifice and feeling it out until you understand som

4 inbound links article en

Productivity and AI: it's the tool, not the model

nocodefunctions.com Nocode functions Dec 23, 2025

Every week, a new “SOTA” (State of the Art) model is announced, promising higher reasoning capabilities and (often) lower costs. We could be led to think that we are entering an era of infinite, frictionless productivity. But the reality is messier. While the models are getting smarter, the gap between “intelligence on tap” and “completing a task” is managed by our tools and right now, that tooling interface is becoming a major source of friction. As we will see, this isn’t just a developer’s dilemma in the context of new AI assisted coding interfaces. It is a preview of the “retooling tax” that every professional domain must soon learn to navigate. The paradox is simple: as models improve, productivity bottlenecks increasingly shift away from intelligence itself and toward the tools that mediate access to it. The race for better models This is the popular meme reflecting the merry-go-round of weekly improvements of AI models: (source. other versions of this meme do include Anthropic’s Claude, if you wonder) LLMs become more capable, cheaper, and available on tap, to the point that the new best performing model can be indistinguishable from the previous one, simply because models are now so smart that the tasks we perform are not complex enough to clearly differentiate between “a great model” and an “even greater model”: both perform equally well on the tests. This is the experience of Simon Willison when testing a preview of Claude Opus 4.5 on November 24, 2025: It’s clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in the milestone for the alpha [of his coding project]. I switched back to Claude Sonnet 4.5 and… kept on working at the same pace I’d been achieving with the new model. With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected. I’m not saying the new model isn’t an improvement on Sonnet 4.5—but I

0 inbound links en

The macroeconomics of agentic AI

Martin Lumiste Martin Lumiste Feb 17, 2026

During the Industrial Revolution, machines displaced most Western agricultural workers, who later went on to cities and earned higher wages. However, some time later, the automobile displaced all horses, and they still haven’t reallocated to new professions. In the coming decades, are we the 19th century peasant, or the horse?

0 inbound links article en

The SaaSpocalypse Won't Kill SaaS — jd:/dev/blog

Julien Danjou Julien Danjou Mar 31, 2026

Wall Street wiped $300 billion from SaaS stocks and declared the model dead. They're right about the wrong thing.

0 inbound links article en ai, startup, saas

AstroCoder

Nolan Koblischke Apr 10, 2025

0 inbound links article en

Agent design patterns

rlancemartin.github.io Rlancemartin Jan 9, 2026

Agent design patterns.

1 inbound link en

Anthropic Economic Index report: Economic primitives

anthropic.com Jan 13, 2026

This report introduces new metrics of AI usage to provide a rich portrait of interactions with Claude in November 2025, just prior to the release of Opus 4.5.

8 inbound links website en

AIs can now often do massive easy-to-verify SWE tasks and I've updated towards shorter timelines — LessWrong

lesswrong.com Ryan Greenblatt Apr 6, 2026

I've recently updated towards substantially shorter AI timelines and much faster progress in some areas. [1] The largest updates I've made are (1) an…

5 inbound links article en

Radical Optionality — Governing Transformative AI Under Uncertainty

Radical Optionality Apr 23, 2026

Radical optionality is about preserving democratic governments’ ability to make good decisions about how to govern transformative AI systems as circumstances evolve.

0 inbound links article en

Claude Can (Sometimes) Prove It

galois.com Mike Dodds Sep 16, 2025

3 inbound links website en

What I am working on

Ankit Maloo Ankit Maloo Mar 19, 2025

Documenting my journey in the world of AI and RL.

0 inbound links website en

Grading AI 2027’s 2025 Predictions

AI Futures Project Eli Lifland; Daniel Kokotajlo Feb 12, 2026

How has AI progress compared to AI 2027 thus far?

1 inbound link article en

We’re all behind The Curve

Transformer Shakeel Hashim; Celia Ford Oct 10, 2025

Transformer Weekly: GAIN AI Act, China’s rare earth crackdown, and AI bubble talk

1 inbound link article en

No, AI Progress is Not Grinding to a Halt

Obsolete Garrison Lovely Aug 21, 2025

A botched GPT-5 launch, selective amnesia, and flawed reasoning are having real consequences

1 inbound link article en

Designing AI resistant technical evaluations

anthropic.com Jan 21, 2026

What we learned from three iterations of a performance engineering take-home that Claude keeps beating.

6 inbound links website en

Interaction Models: A Scalable Approach to Human-AI Collaboration

Thinking Machines Lab Thinking Machines Lab May 11, 2026

Interaction models move beyond turn-based AI interfaces by handling multimodal, real-time collaboration natively across audio, video, and text.

15 inbound links article en thinky thinkingmachines machine learning deep learning ai

Anthropic is donating $20 million to Public First Action

anthropic.com Feb 12, 2026

Donating to a 501(c)(4) focused on AI issues in the public interest

5 inbound links website en

The Productive Half-Life of AI Agents | Wardley Leadership Strategies

wardleyleadershipstrategies.com Dave Hulbert Dec 14, 2025

Track how long AI agents stay useful before a human must step in, and design leadership rituals to extend that window without losing control.

1 inbound link article en ai-and-leadership, leadership, wardley-mapping, autonomy, productivity CC BY-SA 4.0

This is the most misunderstood graph in AI

MIT Technology Review Grace Huckins Feb 5, 2026

To some, METR’s “time horizon plot” indicates that AI utopia—or apocalypse—is close at hand. The truth is more complicated.

2 inbound links article en Artificial intelligence

AI in 2025: gestalt

gleech.org Dec 8, 2025

2 inbound links en Creative Commons

AI vs Human Attention Spans

Zappable Ariel Sep 28, 2025

AI models were initially given human tests (like the LSAT or MCAT) or tests written for AIs like the MMLU. However since they’ve mastered so many of these tests and the tests don’t always carry over to real-world abilities, new measures of progress are needed. One way to rank the difficulty of a task is by how long it would take a human to complete it. In March,

0 inbound links article en

Agentic Engineering: Building Without Writing — Bill de hÓra

Bill de hÓra Bill de hÓra Mar 2, 2026

tars is a personal AI assistant with CLI, Web UI, Email, and Telegram channels, persistent memory, hybrid search, integration with tools I used all the time. About 35 features, 14kloc of python and 600 tests all told. I didn't write any of it. The experience was different enough from traditional de

2 inbound links article en

2025: The year in LLMs

Simon Willison’s Weblog Simon Willison Dec 31, 2025

This is the third in my annual series reviewing everything that happened in the LLM space over the past 12 months. For previous years see Stuff we figured out about …

25 inbound links article en ai 2014, openai 418, generative-ai 1785, llms 1751, anthropic 282, gemini 185, ai-agents 110, pelican-riding-a-bicycle 113, vibe-coding 90, coding-agents 200, ai-in-china 95, conformance-suites 10

AI agents find $4.6M in blockchain smart contract exploits

red.anthropic.com Dec 1, 2025

6 inbound links en

Measuring AI agent autonomy in practice

AnthropicAI Authors Nov 3, 2023

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

3 inbound links website en

A Project Is Not a Bundle of Tasks

Second Thoughts Steve Newman Nov 4, 2025

Current AIs struggle to create a whole that exceeds the sum of its parts

0 inbound links article en