The Phoenix Architecture

Apr 20, 2026 Updated Apr 20, 2026


Production used to be the place where software went to fail. Observability made it the place where software becomes legible.

But it left one loop open.

We use production telemetry to debug incidents, explain behavior, gate rollouts, and decide whether to roll back. We use it to help humans understand reality. Then a person decides what the code change should be.

The Phoenix Architecture closes that loop. Production truth becomes an input to what the system generates next. The key idea is simple: the primary failure mode is not always code breakage. It is evidence decay.

A component can satisfy the spec today and fail it three months from now even if nobody touches the code.

Traffic shape changes. Data distribution shifts. Dependencies slow down. Fallback paths activate more often. Cost envelopes move. Latency ceilings stop holding.

The implementation may be unchanged. The world is what changed.

That is technical drift: when production evidence no longer supports the claim that an implementation satisfies the operational or business constraints attached to its spec.

Not that a service got slower but that the implementation no longer satisfies the latency and cost envelope the requirement promised.

Not that a dashboard got worse but that the evidence that justified this module is no longer valid.

Once you see that, the role of observability changes.

Charity Majors and others have been pushing toward this for years: production as the place where we learn the truth, observability as the ability to ask new questions of live systems, and production as the place where intent has to be validated against reality instead of against our hopes. See “You Had One Job,” “Observability: A Manifesto,” “Honeycomb 10 Year Manifesto: Observability in a World of AI,” and “Your Data is Made Powerful By Context.”

Production truth should not stop at helping humans reason about software. It should participate directly in creating the next version.

In The Phoenix Architecture, production telemetry becomes evidence inside the software creation process itself. It is attached to requirements. It has provenance. It can age. It can drift. And when it drifts, it can invalidate specific parts of the system instead of merely informing a human that something seems off.

A module is not good because it once passed tests. It is good only as long as the evidence still supports the claim that it satisfies the requirement.

The interesting question is no longer just “What is wrong with the system?” It is “Which claims about the system are no longer true?”

That is a much sharper question. It also points to a different architecture.

The first important layer is the spec layer. In Phoenix, requirements are not just behavioral. They include operational and business constraints: latency ceilings, cost envelopes, reliability targets, quality thresholds, tenant-specific promises. Those constraints are part of the requirement, not implementation detail discovered later.

The second important layer is the canonicalization layer. Raw production signals are not enough. They have to be turned into stable evidence statements attached to those requirements. Not screenshots. Not dashboards. Not anecdotes from last week’s incident review. Structured claims: a p95 latency measurement for enterprise traffic at peak, a cost-per-request ratio that has blown past budget, a fallback activation rate that has doubled past its threshold.
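A minimal sketch of what a structured evidence statement could look like, assuming plain Python dataclasses. The names (`Constraint`, `EvidenceClaim`, `supports`), the requirement id, the `telemetry-export` source label, and all the numbers are hypothetical illustrations, not part of any existing Phoenix tooling. The point is only that a claim carries its requirement, its provenance, and its age, so it can stop holding either by drifting out of bounds or by getting too old.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Constraint:
    """An operational constraint that is part of a requirement (hypothetical shape)."""
    requirement_id: str
    metric: str       # e.g. "p95_latency_ms"
    ceiling: float    # the value the requirement promises not to exceed


@dataclass(frozen=True)
class EvidenceClaim:
    """A canonicalized production measurement attached to one constraint."""
    constraint: Constraint
    segment: str           # e.g. "enterprise traffic at peak"
    observed: float
    observed_at: datetime
    source: str            # provenance: where the measurement came from

    def supports(self, now: datetime, max_age: timedelta) -> bool:
        """True only if the claim is both fresh and within the ceiling."""
        fresh = now - self.observed_at <= max_age
        within_bounds = self.observed <= self.constraint.ceiling
        return fresh and within_bounds


# Illustrative values only.
latency = Constraint("REQ-checkout-latency", "p95_latency_ms", ceiling=300.0)
now = datetime(2026, 4, 20, 12, 0)
segment = "enterprise traffic at peak"

healthy = EvidenceClaim(latency, segment, 240.0, now - timedelta(days=1), "telemetry-export")
drifted = EvidenceClaim(latency, segment, 410.0, now - timedelta(days=1), "telemetry-export")
aged = EvidenceClaim(latency, segment, 240.0, now - timedelta(days=120), "telemetry-export")
```

With a 30-day freshness window, only `healthy` still supports the requirement. `drifted` fails on the ceiling, and `aged` fails because nobody has re-measured it.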

This is why context matters so much. If you throw away relationships too early, aggregate too aggressively, or preserve only the questions you already thought to ask, you don’t have evidence. You have artifacts of somebody else’s curiosity.

The third important layer is the implementation graph. Once requirements are connected to modules, services, queries, prompts, dependencies, and contracts, drift can be localized. You no longer have to say “the app is degrading.” You can say: this requirement is drifting, these modules are implicated, and these claims are now stale.

That leads to the most important architectural move: selective invalidation.

When production evidence drifts out of bounds, Phoenix should not just open a ticket or wake up an engineer to start hunting through code. It should invalidate the affected subgraph and the specific evidence claims that no longer hold.

Not the whole system.

Only the part whose justification has expired.

That is what makes regeneration tractable. Without that step, “production should feed software creation” collapses into a vague fantasy about AI reading logs. With it, you get a bounded, governed process.

Canonicalized evidence identifies which requirement is failing, and the implementation graph localizes the affected modules. The invalidation system marks only that subgraph as stale. Then regeneration has a concrete job.
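The flow described above can be sketched as a small graph walk. This is an illustration under stated assumptions: the requirement ids, module names, and the two dictionaries are hypothetical, and the sketch assumes that a module's direct and transitive dependencies are implicated along with it. A real system would likely narrow the subgraph further using the evidence itself.

```python
from collections import deque

# Hypothetical implementation graph: requirement -> modules that justify it,
# and module -> modules it directly depends on.
REQUIREMENT_TO_MODULES = {
    "REQ-search-latency": ["search_api"],
    "REQ-invoice-accuracy": ["billing"],
}
MODULE_DEPENDENCIES = {
    "search_api": ["query_planner", "result_cache"],
    "query_planner": [],
    "result_cache": [],
    "billing": ["ledger"],
    "ledger": [],
}


def invalidate(requirement_id: str) -> set[str]:
    """Return the modules implicated by one drifting requirement."""
    stale: set[str] = set()
    queue = deque(REQUIREMENT_TO_MODULES.get(requirement_id, []))
    while queue:
        module = queue.popleft()
        if module in stale:
            continue  # already marked; avoids revisiting shared dependencies
        stale.add(module)
        queue.extend(MODULE_DEPENDENCIES.get(module, []))
    return stale


stale_modules = invalidate("REQ-search-latency")
print(sorted(stale_modules))
```

Only the search subgraph is marked stale. `billing` and `ledger` keep their justification and are left alone.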

Now the question becomes:

What should be regenerated because of what production just taught us?

Maybe a query planner needs to be rewritten for the actual workload it now sees. Maybe a cache strategy needs to be redesigned because the hit-rate assumptions no longer hold. Maybe a component needs to optimize for tail latency rather than mean latency because that is what the requirement actually cares about in production.

That is not observability as a dashboard. It is observability as an input to software creation.

Compiler input means exactly that.

Compilers do not just transform source. They operate under constraints. They take targets, assumptions, and optimization goals. Phoenix extends that idea upward. Production truth becomes one of the things the system compiles with.

Not because production is magical, but because production is where the promises in the spec are forced to meet reality.

The first generation of observability helped us detect failure.

The second helped us understand complex behavior in running systems.

The next step is to let production truth participate directly in software creation.

If production is where the truth is, why isn’t production truth a first-class input to the build?

https://aicoding.leaflet.pub/3mjx4erlboc2l

The Phoenix Primitives

Apr 13, 2026 Updated Apr 13, 2026

The architecture of a regenerative system is defined entirely by what you can't delete.


Thirty source files organized into layers. Controller, service, repository, data mapper. That's a typical microservice, and every one of those layers exists so a developer can navigate. You put the database logic in one place and the HTTP handling in another because a person needs to hold the shape of the system in their head. The file is a cognitive container. The module boundary reflects what a developer can reason about in a single sitting, not some deeper truth about the problem domain.

When code is regenerated rather than maintained, those constraints lose their force. A regenerative system doesn't need navigation. It needs a specification precise enough to produce correct behavior. Those 30 files might be regenerated from a single behavioral spec, structured completely differently each time. The file layout is an output, not an input.

This goes deeper than files. Branches, pull requests, code review: these are rituals organized around the assumption that a human authored each change and another human needs to verify it. They're valuable rituals. But they're contingent on a model of software production that's shifting underneath us.

The file stops being a primitive. The specification becomes one.

The architecture of a regenerative system is defined entirely by what you can't delete.

That's the claim this essay is built around. Every architecture has primitives: irreducible building blocks you compose everything else from. Ours were shaped by decades of humans writing code by hand. When implementation becomes disposable, a different set of primitives emerges. The interesting question isn't "what can AI generate?" but rather what are the critical artifacts in a system designed to be continuously reborn?

The Primitives

Start with something concrete. A payment processing module in a traditional system is defined by its source code, its test suite, and its git history. Delete those, and the module is gone.

In a regenerative architecture, the same module is defined by four different artifacts. Its behavioral specification: "charge a card, handle declines, emit events with these schemas." Its evaluations: runnable contracts that any implementation must pass. Its context boundary: the API contract and event schemas neighboring services depend on. Its provenance record: "this module was regenerated Tuesday because the fraud detection spec changed."

Delete the implementation, keep those four artifacts, and you can regenerate a working system. Delete any one of the four, and you can't.

That's what makes them primitives. Specification, evaluation, context boundary, and provenance are the minimum set from which everything else can be derived. Together, a specification and its evaluations define what I call a regenerative grain: the natural unit you can safely delete and recreate. A context boundary defines where one grain ends and another begins. Provenance gives you the audit trail that git log used to provide, but at the level of intent rather than diff.
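One way to make the "delete any one of the four, and you can't" rule concrete is a record that refuses to regenerate when a primitive is missing. This is a sketch, not a real tool: the `Grain` class, its field types, and the contract and evaluation names are assumptions made for illustration. It bundles all four primitives into one record for simplicity.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Grain:
    """One regenerative unit, described only by the artifacts that survive deletion."""
    specification: Optional[str]       # behavioral spec text
    evaluations: Optional[list[str]]   # names of runnable contracts
    context_boundary: Optional[str]    # reference to the API contract / event schemas
    provenance: Optional[str]          # why the current implementation exists

    def missing_primitives(self) -> list[str]:
        """Names of primitives that are absent or empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def can_regenerate(self) -> bool:
        return not self.missing_primitives()


payments = Grain(
    specification="charge a card, handle declines, emit events with these schemas",
    evaluations=["declined_card_emits_event"],   # hypothetical evaluation name
    context_boundary="payments-api-contract",    # hypothetical contract reference
    provenance="regenerated Tuesday because the fraud detection spec changed",
)
without_evals = Grain(payments.specification, None,
                      payments.context_boundary, payments.provenance)
```

Note that the implementation does not appear as a field at all. It is the one thing the record can afford to lose.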

Specifications Are Not Documentation

This is the hardest conceptual shift, and most teams get it wrong on the first try.

In a regenerative system, the specification isn't a description of code that exists. It's the source of truth from which code is derived. This inverts the traditional relationship where documentation describes implementation. Here, implementation expresses specification.

Think about an API spec today. It lives in a Swagger file that developers keep "in sync" with the real code. It drifts constantly. Everyone knows it drifts. Nobody fixes it until a customer complains. The spec is decorative: a polite fiction maintained for onboarding and external consumers.

In a regenerative system, that specification is the generative input. If the spec says the endpoint returns a 404 on missing resources, the regenerated code does that. Not because a developer remembered to, but because the evaluation derived from the spec enforces it. The spec can't drift from the implementation because the implementation is derived from the spec on every regeneration cycle. Contract-first development taken to its logical conclusion.
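The 404 example can be sketched as an evaluation that knows nothing about how an implementation is built. Everything here is hypothetical: the `Handler` signature, the two stand-in implementations, and the probe id are assumptions chosen to keep the sketch self-contained.

```python
from typing import Callable, Optional

# An implementation is anything that maps a resource id to (status, body).
Handler = Callable[[str], tuple[int, Optional[dict]]]


def eval_missing_resource_returns_404(handler: Handler) -> bool:
    """Evaluation derived from the clause 'returns a 404 on missing resources'."""
    status, _body = handler("id-that-does-not-exist")
    return status == 404


# Two stand-ins for regenerated implementations with different internals.
def implementation_a(resource_id: str) -> tuple[int, Optional[dict]]:
    store = {"42": {"name": "widget"}}
    if resource_id not in store:
        return 404, None
    return 200, store[resource_id]


def implementation_b(resource_id: str) -> tuple[int, Optional[dict]]:
    rows = [("42", "widget")]
    for key, name in rows:
        if key == resource_id:
            return 200, {"name": name}
    return 200, None  # violates the spec: a missing resource must be a 404


accepted = [impl.__name__ for impl in (implementation_a, implementation_b)
            if eval_missing_resource_returns_404(impl)]
```

The two implementations share no structure, and the evaluation does not care. Only the one that honors the clause is accepted.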

The tradeoff is real: writing good specifications is harder than writing good documentation. Documentation can be vague and still useful. A specification that's vague produces implementations that are wrong in unpredictable ways. The discipline required is closer to writing a contract than writing a README.

Context Boundaries Are the New Architecture

If interiors are disposable, boundaries are everything.

The most consequential design decisions in a regenerative system aren't about how code is structured internally. Internal structure is regenerated and therefore fluid. The decisions that matter are where you draw context boundaries: the integration contracts, event schemas, and shared data formats that multiple regeneration units depend on. These boundaries form what I think of as the conservation layer: the part of the system that resists change because changing it means coordinating across multiple independent regeneration cycles.

Two teams share a protobuf schema for inter-service communication. The internal implementation of each service can be regenerated freely. Different languages, different frameworks, different internal structure every time. But the schema itself is a conservation-layer artifact. Changing it requires coordinating both teams, both regeneration cycles, both evaluation suites. That schema is more architecturally significant than any line of code in either service, because it's the one thing that can't be regenerated in isolation.

This shifts what architectural skill means. It moves from "how do I structure this service internally" to "where do I draw the lines between things that regenerate independently." Get the boundaries wrong and you've coupled regeneration units that should be independent. Now changing one means regenerating three, and you've lost the whole point. Get them right and each unit can be reborn on its own schedule without coordination.
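A rough way to see the cost of a boundary is to count how many units a change drags along with it. The unit names and contract file names below are hypothetical, and the lookup is deliberately naive: it treats a change as either the interior of one unit or a named boundary artifact.

```python
# Hypothetical map of regeneration units to the boundary artifacts they depend on.
UNIT_BOUNDARIES: dict[str, set[str]] = {
    "orders_service": {"order_events.proto"},
    "shipping_service": {"order_events.proto", "carrier_api.json"},
    "notifications_service": {"carrier_api.json"},
}


def units_to_coordinate(changed: str) -> set[str]:
    """Units whose regeneration must be coordinated for a given change.

    `changed` is either a unit name (an interior change) or the name of a
    boundary artifact. An interior change touches one unit; a boundary change
    touches every unit that has baked the contract into its evaluations.
    """
    if changed in UNIT_BOUNDARIES:
        return {changed}
    return {unit for unit, contracts in UNIT_BOUNDARIES.items()
            if changed in contracts}
```

Regenerating the interior of `orders_service` involves one unit. Changing `order_events.proto` involves two, which is the coupling the boundary design is trying to keep small.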

Boundary design requires more upfront thought than internal structure ever did. You can't refactor a context boundary the way you refactor a class. Every service on the other side of that boundary has baked your contract into its evaluations. The conservation layer resists change at the points where change is most expensive. The skill is knowing which boundaries to draw tight and which to leave loose, and that judgment still comes from experience, not from any specification language.

Provenance Replaces Narrative

Traditional version control tells a story: who changed what, when, and (if you're lucky) why. In a regenerative system, most of that story is meaningless. The code was generated, not authored line by line. A diff between two generated implementations tells you almost nothing about what actually changed in the system's behavior or intent.

But the need for traceability doesn't disappear. It shifts.

Provenance tracks which specification version produced which implementation, which evaluation suite validated it, and what triggered the regeneration. This is richer than a commit log because it captures causation, not sequence.

Picture an incident investigation. In a traditional system, you're reading git blame and commit messages, trying to reconstruct what a developer was thinking three weeks ago at 11 PM. In a regenerative system, provenance tells you: "This implementation was generated from specification v2.3, validated against evaluation suite v2.1. Mismatch. The eval suite hadn't been updated for the new spec. Triggered by a dependency update in the auth module." That's a causal chain. It points you at the root cause directly: the evaluation suite was stale, so a spec change slipped through without proper validation.
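The incident example above can be sketched as a record that reports its own mismatch. This assumes, as the example implies but does not state, that an evaluation suite is versioned in lockstep with the specification it validates. The `ProvenanceRecord` class and the `auth-session-impl` identifier are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Which inputs produced an implementation, and what triggered it."""
    implementation_id: str
    spec_version: str
    eval_suite_version: str
    trigger: str

    def stale_evaluations(self) -> bool:
        # Assumption: the suite version should equal the spec version.
        return self.spec_version != self.eval_suite_version

    def explain(self) -> str:
        line = (f"{self.implementation_id}: generated from specification "
                f"v{self.spec_version}, validated against evaluation suite "
                f"v{self.eval_suite_version}; triggered by {self.trigger}.")
        if self.stale_evaluations():
            line += " Mismatch: evaluation suite not updated for this spec."
        return line


record = ProvenanceRecord("auth-session-impl", "2.3", "2.1",
                          "a dependency update in the auth module")
print(record.explain())
```

The record captures causation (what triggered the regeneration and what validated it) rather than a sequence of diffs.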

The gap today is tooling. Domain-specific provenance systems exist: MLflow tracks model lineage, SLSA addresses build provenance in supply chains. But nobody has built the general-purpose provenance layer for regenerative software. Teams building these systems now are stitching together metadata stores, generation logs, and spec version tags. It works. It's also baling wire.

We spent decades refining primitives that optimize for human authorship: readability, navigability, reviewability. Those aren't wrong. They're contingent on a world where humans write and maintain every line. When implementation becomes disposable, the primitives that matter are the ones that survive deletion: the specification, the evaluation, the boundary, the provenance record.

The developers who master this won't be the ones who generate code fastest. They'll be the ones who draw the best boundaries.

https://aicoding.leaflet.pub/3mjfruwwuck2d

The Generative Stack

Apr 7, 2026 Updated Apr 7, 2026

Trying to find the best tool or platform for generative software in 2026 is a mistake that could haunt you for decades.


Every layer of the software toolchain is changing at the same time. New spec formats, new generation strategies, new evaluation frameworks, and new feedback mechanisms show up weekly, each one supposedly the one that matters. Most teams respond by trying to pick winners.

Maybe I'm just old and too cranky to follow the trends day in and day out, but I think that's exactly wrong.

What you actually want is an architecture where independent tools and representations coexist at every phase. Competing, reinforcing, composable. A system that absorbs change without requiring you to place a bet. Because pipelines that lock in today's best options? Those become next quarter's migration projects. I've seen it happen enough times to stop being surprised by it.

That's what I mean by the generative stack.

The Full Pipeline

The generative stack covers every phase between a human's intent and a system's verified behavior. Not just specifications.

Spec inputs (markdown documents, conversations, diagrams, existing code, domain models) feed into structured clauses. Canonicalization structures those clauses into requirements and invariants. Requirements drive evaluations. Evaluations constrain implementation units. Implementation units become actual code. Running code produces feedback (eval results, runtime behavior, performance data) that flows back into the loop, refining every upstream layer.

Each of those phases is its own site of diversity. Each one can absorb independent tools, overlapping representations, several approaches. And the trick is you don't choose between them. You compose them.

[Figure: diagram of the pipeline flow described above.]

At the spec layer, you might combine natural language descriptions, example interactions, and formal constraints. No single artifact captures the full intent. Together, they triangulate it.

At the clause layer, several structuring strategies (e.g. one tool extracting behavioral requirements from conversation, another identifying invariants from existing code) produce a reconciled set of requirements that's more complete because each tool parses intent differently.

At the evaluation layer, the same principle holds. I say evaluations rather than tests deliberately: these are behavioral contracts that outlive any particular implementation. Property-based checks, example-based specs, integration contracts, performance bounds, LLM-as-judge assessments. Each catches failures the others miss. A system verified by one evaluation strategy has blind spots. A system verified by five has fewer. (Not zero. Fewer.)

Diversity at each phase is the mechanism that catches what any single tool misses.

At the generation layer, different models and strategies have different strengths. One might produce cleaner abstractions. Another might handle edge cases more carefully. You don't have to pick one. Generate candidates from several approaches and let your evaluations arbitrate.
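"Let your evaluations arbitrate" can be sketched in a few lines. The candidates below are stand-ins for implementations produced by different generation strategies, and the scoring rule (most evaluations passed, first candidate wins ties) is an assumption made for the sketch, not a recommendation.

```python
from typing import Callable

Candidate = Callable[[list[int]], list[int]]
Evaluation = Callable[[Candidate], bool]


# Stand-ins for implementations produced by different generation strategies.
def candidate_builtin(xs: list[int]) -> list[int]:
    return sorted(xs)


def candidate_dedup(xs: list[int]) -> list[int]:
    return sorted(set(xs))  # silently drops duplicates


def eval_is_ordered(impl: Candidate) -> bool:
    out = impl([3, 1, 2])
    return all(a <= b for a, b in zip(out, out[1:]))


def eval_keeps_duplicates(impl: Candidate) -> bool:
    return impl([2, 1, 2]) == [1, 2, 2]


def arbitrate(candidates: list[Candidate], evaluations: list[Evaluation]) -> Candidate:
    """Pick the candidate that passes the most evaluations (first wins ties)."""
    return max(candidates, key=lambda c: sum(e(c) for e in evaluations))


winner = arbitrate([candidate_dedup, candidate_builtin],
                   [eval_is_ordered, eval_keeps_duplicates])
```

Both candidates produce ordered output, so a single ordering check could not tell them apart. The second evaluation is what exposes the dropped duplicates.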

At the feedback layer, runtime telemetry, eval results, and user behavior data all flow back upstream. Each signal illuminates different gaps. Eval failures point to broken implementations. Performance data reveals implementations that are correct but insufficient. And sometimes user behavior exposes the real problem: the specification itself was wrong. Independent feedback channels make the whole loop self-correcting in ways that any single channel can't.

Every phase is additive. Every phase benefits from redundancy.

Why Redundancy Works Here

In traditional development, maintaining overlapping representations of the same concern was expensive. Two spec documents meant two things to keep in sync. Three test frameworks meant three things to maintain. The cost of reconciliation exceeded the benefit of coverage. We all learned this the hard way.

AI inverts that calculus.

When generation is cheap and reconciliation can be automated, the economics flip. You can afford to maintain a natural language spec and a formal constraint set and a suite of example interactions, not because any one of them is sufficient, but because an AI system can reconcile them faster than a human can maintain a single canonical version. The same applies at every other layer. Run multiple generation strategies and let evaluations pick the winner. Maintain independent evaluation approaches; their disagreements surface ambiguity you'd otherwise miss. Ingest diverse feedback signals and let canonicalization absorb them all.
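The claim that disagreements surface ambiguity can be made concrete with two independent evaluators over the same outputs. The evaluators and sample outputs are hypothetical. The interesting cases are the ones where the verdicts differ.

```python
from typing import Callable

Evaluator = Callable[[str], bool]


def eval_exact_example(output: str) -> bool:
    """Example-based: matches the one recorded interaction."""
    return output == "10.00 USD"


def eval_format_property(output: str) -> bool:
    """Property-based: a two-decimal amount followed by a three-letter code."""
    amount, _, currency = output.partition(" ")
    whole, dot, cents = amount.partition(".")
    return (whole.isdigit() and dot == "." and len(cents) == 2
            and cents.isdigit() and len(currency) == 3 and currency.isupper())


def disagreements(outputs: list[str],
                  evaluators: list[Evaluator]) -> dict[str, dict[str, bool]]:
    """Outputs on which the evaluators do not all return the same verdict."""
    flagged = {}
    for output in outputs:
        verdicts = {e.__name__: e(output) for e in evaluators}
        if len(set(verdicts.values())) > 1:
            flagged[output] = verdicts
    return flagged


flagged = disagreements(["10.00 USD", "10.00 EUR", "ten dollars"],
                        [eval_exact_example, eval_format_property])
```

Both evaluators accept the first output and both reject the last. They split on `10.00 EUR`, which points at a question the spec never answered: is the currency fixed, or only the format?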

The cost of redundancy drops. The value of redundancy stays the same, or increases, because blind spots don't overlap across independent approaches.

Fault-tolerant systems in every other engineering discipline depend on redundancy across independent channels. Flight control systems don't rely on one sensor. Distributed databases don't rely on one replica. The principle is old and boring and correct: if you want resilience, don't depend on a single representation of anything critical.

In a generative stack, every layer is critical. Specs, clauses, requirements, evaluations, generation strategies, feedback loops. A single point of fragility at any layer makes the whole pipeline brittle.

No Layer Is Settling

There probably isn't going to be a single dominant way to specify intent, structure requirements, generate code, evaluate correctness, or close the feedback loop. This has been the observable pattern across every layer of the software toolchain for decades. Build systems, test frameworks, deployment strategies, monitoring approaches; none of them converged to one winner. We keep thinking they will. They keep not doing it.

What's different now is degree, maybe not kind. Every layer between intent and implementation is becoming a site of rapid evolution simultaneously, and they're all getting better at different rates. Take evaluation alone. Two years ago, LLM-as-judge was a research curiosity. Today it's a production evaluation strategy that catches semantic regressions no unit test would surface. It coexists with property-based testing, which catches algebraic violations no LLM judge would notice. Neither replaces the other. Both arrived on different timelines. This is happening at every layer at once.

The stack keeps expanding. An architecture that assumes convergence will need to be rebuilt every time a layer lurches forward.

Designing for Absorption

If you build your generative pipeline around a specific tool at any layer, such as a particular spec format, a particular LLM, or a particular evaluation framework, you've created a coupling that will fight you when that layer evolves. And it will evolve. I don't think I need to convince anyone of this at this point.

The alternative is to design each layer as a composition point. A place where inputs, tools, and representations can be added without replacing what's already there.

At the spec layer, accept multiple input formats without requiring a single canonical source of truth for intent. Canonicalization should structure requirements so new clauses from new sources don't invalidate existing ones. Evaluations work as an additive suite: new strategies strengthen the contract rather than replacing it. Generation stays model-agnostic, able to swap or combine approaches as capabilities shift. Feedback ingests signals from independent sources, routing them to the right upstream layer.
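A composition point can be sketched as a layer whose only mutation is "add". The `Layer` class and the two clause-extraction tools below are hypothetical. The sketch shows that registering a second tool extends the output without changing what the first tool already contributed.

```python
from typing import Callable

Tool = Callable[[str], list[str]]


class Layer:
    """A composition point: tools are added, never swapped in place."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._tools: dict[str, Tool] = {}

    def add(self, tool_name: str, tool: Tool) -> None:
        if tool_name in self._tools:
            raise ValueError(f"{tool_name} is already registered in {self.name}")
        self._tools[tool_name] = tool

    def run(self, value: str) -> list[str]:
        """Union of every tool's output, de-duplicated, in registration order."""
        results: list[str] = []
        for tool in self._tools.values():
            for item in tool(value):
                if item not in results:
                    results.append(item)
        return results


def behavioral_clauses(text: str) -> list[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def invariant_clauses(text: str) -> list[str]:
    return [f"invariant: {s.strip()}" for s in text.split(".") if "never" in s.lower()]


spec_text = "Charge a card. Never charge twice."
clause_layer = Layer("clauses")
clause_layer.add("behavioral", behavioral_clauses)
before = clause_layer.run(spec_text)
clause_layer.add("invariants", invariant_clauses)
after = clause_layer.run(spec_text)
```

Everything in `before` is still present, in the same order, at the front of `after`. The new tool only appended.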

Each layer becomes a site where you can add without subtracting. Where new tools strengthen the pipeline instead of forcing you to rip out what's already working. This is possible because the layers and their interfaces are well known, and because the architectural system around them is well defined and unchanging. The specific tools and techniques inside the layer are encapsulated.

This is what makes regeneration robust. If your regeneration pipeline depends on one tool at each layer, it's as fragile as the weakest tool. If each layer draws from overlapping representations and approaches, regeneration tolerates the failure or obsolescence of any single component.

You're no longer depending on one interpretation being correct at any layer. Which, if you think about it, was always a pretty optimistic assumption.

You don't optimize the generative stack once. You cultivate it. Every phase absorbs the next tool, the next representation, the next approach that hasn't been invented yet. Convergence is a bet against the entire history of software tooling. I wouldn't take that bet.

https://aicoding.leaflet.pub/3miwhqqvwxc2x

The Conversation Is the Commit

Mar 26, 2026 Updated Mar 26, 2026


Here’s a scene every working programmer has lived.

It’s 2 AM. Something is broken in production. You’re staring at a commit from eight months ago. The message says “refactor auth logic.” That’s it. The PR has two approvals—thumbs-up emoji, no comments. The Slack thread where the team debated the approach was in a channel that got archived when the org restructured. The person who wrote it left the company in June.

You don’t just need to understand what this code does. You need to understand why it exists in this form—what alternatives were considered, what constraints shaped it, what tradeoffs were accepted.

That information isn’t missing. It was never captured.

It lived in someone’s head. It passed through ephemeral media. It evaporated.

We’ve always treated code as the primary artifact of software development. Everything else—the reasoning, the tradeoffs, the rejected approaches—is secondary. Informal. Optional. We write it down when we feel like it, in whatever format is convenient, with no expectation that it will survive contact with time.

That was a defensible tradeoff when humans wrote all the code. The alternative—capturing every decision rigorously—was more expensive than the knowledge loss it prevented.

Humans optimized for shipping. Documentation was a tax.

It breaks completely when agents enter the picture.

The Real Artifact

When an agent writes code, something fundamental shifts.

The code is no longer where decisions are made. It’s where decisions show up.

The real work happens in the back-and-forth. You try something, it does something slightly wrong, you tighten the prompt, add a constraint, back it out, try again. That loop is the engineering.

The code is just what falls out of that loop—the way a compiled binary falls out of source.

Once you see it, you can’t unsee it:

The conversation isn’t attached to the commit.

The conversation is the commit.

The code is derived.

Which means the interesting part of any line of code isn’t the line itself. It’s the path that led there. The moment someone said “we can’t do it that way because of X,” or “we tried that last year and it blew up,” or “this only works if we assume Y.”

That’s the thing we’ve been throwing away.

Editing Compiled Binaries

If the conversation is the source and the code is the compiled output, what does it mean to manually edit the code?

You’re editing compiled binaries.

You’re bypassing the process that produced the artifact. You’re breaking the provenance chain. The code no longer traces back to a decision record—it traces back to a person’s fingers on a keyboard, with whatever reasoning they had in their head at the time.

That reasoning will be gone.

And if the conversation is gone, you’re left trying to reverse-engineer intent from what is basically obfuscated output.

Manual edits aren’t always wrong. Sometimes you need to patch a binary.

But you should recognize them for what they are:

An escape hatch, not a methodology.

In a system built on conversation-driven development, every manual edit introduces a new kind of debt:

Something like provenance debt: code that just… appeared, with no way to ask why it’s there.

The Constraint

If the conversation is the source and the code is just the output, then there’s an obvious question:

Why are humans allowed to edit the output at all?

If you care about provenance—about being able to trace every line of code back to a decision—then manual edits are a problem. They break the chain. They introduce code that has no record, no explanation, no origin you can inspect.

So you end up in an uncomfortable place:

You don’t just prefer that agents write the code.

You require it.

Not because agents are better programmers.

But because they are the only way to guarantee that every change passes through a process that can be captured, inspected, and replayed.

Humans don’t stop participating. They just move up a level.

They define intent. They shape constraints. They review outcomes.

But they don’t write the code directly anymore.

Because the moment they do, the system forgets how it got there.

Process Becomes Enforceable

Every software team has rules.

No code without tests.

No merge without review.

At least, in theory.

In practice, those rules bend. Writing the “why” is a tax we stop paying the moment the coffee wears off or someone says “can we just ship this?” in Slack.

Each shortcut makes sense at the time.

Stack enough of them together and you get a system nobody really understands anymore.

When code has to go through an agent, that changes.

You can make it impossible to produce code without tying it back to something—some requirement, some explanation of what it’s supposed to do. You can require tests. You can require that there is a reason, somewhere, that can be pointed to.

And because the agent is doing the work, you actually get it. Every time.

Of course, agents can fake this. They can generate tests that don’t test anything, explanations that don’t explain.

So you need a way to verify the constraints are actually being met.

The difference is: that verification can also be automated.
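A sketch of what an automated gate might check, assuming changes arrive as structured records. The `ProposedChange` shape and requirement ids are hypothetical. Checking for the presence of an `assert` statement with Python's `ast` module is only a crude proxy for "this test tests something"; stronger checks such as mutation testing exist, but the principle is the same.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class ProposedChange:
    requirement_id: str
    rationale: str
    tests: dict[str, str] = field(default_factory=dict)  # test name -> source code


def asserts_something(test_source: str) -> bool:
    """Crude proxy: the test contains at least one assert statement."""
    tree = ast.parse(test_source)
    return any(isinstance(node, ast.Assert) for node in ast.walk(tree))


def gate(change: ProposedChange, known_requirements: set[str]) -> list[str]:
    """Reasons the change is rejected; an empty list means it may proceed."""
    problems = []
    if change.requirement_id not in known_requirements:
        problems.append("not tied to a known requirement")
    if not change.rationale.strip():
        problems.append("no recorded reason")
    if not change.tests:
        problems.append("no tests")
    hollow = [name for name, src in change.tests.items() if not asserts_something(src)]
    if hollow:
        problems.append("tests that assert nothing: " + ", ".join(hollow))
    return problems


good = ProposedChange(
    "REQ-17", "Retry on timeout, as REQ-17 requires",
    {"test_retry": "def test_retry():\n    assert retry_count(timeout=True) == 3\n"},
)
faked = ProposedChange(
    "REQ-17", "",
    {"test_nothing": "def test_nothing():\n    pass\n"},
)
```

`gate(good, {"REQ-17"})` returns an empty list. `faked` is rejected twice over: it has no recorded reason, and its only test asserts nothing.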

The Responsibility Inversion

At first, this sounds backwards.

Letting agents write all your code feels reckless. Taking humans out of the loop feels like giving up control.

But look at what you get if you don’t.

Code appears with no history. Decisions live in someone’s head. The reasoning is gone the moment it matters.

That’s the system we’ve been calling “safe.”

If agents are the only way to guarantee:

that every change has a reason
that the reason is recorded
that the process can be inspected and replayed

then letting humans write code directly isn’t caution.

It’s accepting that your system will drift out of understanding over time.

The safer system is the one that doesn’t allow untraceable changes at all.

Version Control for Intent

Git answers one question well: what changed?

It answers the important questions poorly:

why did it change?
what alternatives were considered?
what assumptions were active?
what would make this wrong?

When every commit is linked to the full conversation that produced it, version control starts to look less like a history of files and more like a history of decisions.

You can trace a piece of code back to the moment someone decided it had to exist. You can see what was tried before that. You can see what assumptions were in play.

You’re not just diffing code anymore. You’re diffing reasoning.

There’s a wrinkle here.

Unlike source code, the mapping from conversation to code isn’t perfectly deterministic. Run the same prompt against a different model, or even the same model at a different time, and you may get a different result.

That doesn’t invalidate the idea. It just means the conversation alone isn’t enough.

The “source” becomes the conversation plus the execution context: the model, the tools, the constraints, the evaluation criteria. A serialized state, not just a transcript.

Which is another way of saying: we don’t just need version control for code. We need version control for the entire process that produces it.
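A minimal sketch of "a serialized state, not just a transcript", assuming the execution context can be written down as plain values. The `GenerationSource` fields follow the list above; the model names, tool name, and sample content are placeholders, not real identifiers.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class GenerationSource:
    """The conversation plus the execution context that produced the code."""
    transcript: tuple[str, ...]
    model: str
    tools: tuple[str, ...]
    constraints: tuple[str, ...]
    evaluation_criteria: tuple[str, ...]

    def serialize(self) -> str:
        # Sorted keys and fixed separators keep the serialization stable.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))

    def fingerprint(self) -> str:
        return hashlib.sha256(self.serialize().encode("utf-8")).hexdigest()


source = GenerationSource(
    transcript=("Add retry on timeout.", "Cap it at three attempts."),
    model="model-a",                      # placeholder model name
    tools=("test-runner",),               # placeholder tool name
    constraints=("no new dependencies",),
    evaluation_criteria=("retries stop after three attempts",),
)
same_words_other_model = replace(source, model="model-b")
```

The two records contain the same conversation but have different fingerprints. A commit can then link to a fingerprint rather than to a transcript alone, which is what makes the non-determinism visible instead of hidden.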

The Substrate Problem

If conversations are the source of truth, they need infrastructure worthy of that role.

Today’s tools—Slack, GitHub, email—are:

ephemeral
fragmented
structurally disconnected from the artifacts they produce

What this looks like when it works is less glamorous than it sounds.

On Monday morning, you don’t start by grepping through a 10,000-line file that hasn’t been touched in three years. You don’t start by guessing.

You start by asking the system: “Why did we stop using the Redis cache here?”

And it can answer. Not with a guess, but by pointing you to the conversation where that decision was made. What broke. What was tried. What didn’t work.

We need a substrate where:

conversations stick around
you can point to them
they’re tied directly to the code they produced
agents operate inside that system, not bolted on the side

Where the whole process is:

replayable
inspectable
auditable

Without that, you’re relying on convention.

And convention is exactly what failed us.
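As a rough illustration of "tied directly to the code they produced," the sketch below indexes decisions by the paths they touched so a "why" question can be answered by lookup. Everything in it (the `Decision` and `DecisionIndex` names, the conversation id, the file path) is made up for the example; a real substrate would need durable storage and much more.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Decision:
    conversation_id: str   # pointer to the stored conversation
    summary: str           # what was decided, in one line
    paths: tuple           # code paths this decision produced or changed


@dataclass
class DecisionIndex:
    _by_path: dict = field(default_factory=dict)

    def record(self, decision: Decision) -> None:
        # Link the decision to every path it touched.
        for path in decision.paths:
            self._by_path.setdefault(path, []).append(decision)

    def why(self, path: str) -> list:
        # Return the decisions behind a path, oldest first.
        return list(self._by_path.get(path, []))


index = DecisionIndex()
index.record(Decision(
    conversation_id="conv-0042",                       # hypothetical id
    summary="Dropped the Redis cache after stale reads",
    paths=("services/lists/cache.py",),                # hypothetical path
))

answers = index.why("services/lists/cache.py")
```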

The Shift

We’ve been treating code as the thing that matters, and everything that produced it as exhaust.

That worked, as long as humans were doing the work and we were willing to forget most of it.

It stops working the moment the process becomes the only place the real decisions are made.

At that point, the code isn’t the source of truth anymore.

It’s just what fell out the bottom.

If you want to understand a system, or trust it, or change it safely, you don’t start with the code.

You start with whatever produced it.

And increasingly, that’s the conversation.

In that world, writing code directly starts to feel less like craftsmanship and more like bypassing the system entirely.

https://aicoding.leaflet.pub/3mhxvpam4z22z

Compile to Architecture

Mar 6, 2026 Updated Mar 6, 2026

For a long time we’ve treated frameworks as the target of software development. But if systems are meant to be regenerated and replaced safely, the real compilation target has to be the architecture itself.

The industry is still trying to generate applications.

A React app. A Django service. A Rails API. A FastAPI backend.

That instinct made sense when writing software was the expensive part. But in a world where code can be generated quickly and cheaply, the real constraint has shifted. The problem is no longer producing code. The problem is replacing it safely.

Regenerative software does not work if the unit of generation is an application. Regeneration only works if the unit of generation is a component that compiles into a system architecture.

Languages and frameworks matter, but they are implementation choices. Architecture is the thing that actually determines whether a system can evolve without breaking itself.

A System That Already Worked

Years ago at Wunderlist we ended up with a system structure that, in hindsight, explains why replacing pieces of the system was easier than it should have been.

The architecture settled into several layers: clients across multiple platforms, a WebSocket actor server acting as a synchronization proxy, a REST domain layer containing business logic, a data service layer in which each service owned its datastore, and a mutation bus that propagated changes through the system.

The frameworks we used inside those services varied over time. But the structure of the system stayed consistent.

Each layer had clear responsibilities. Every datastore had a single owner. Mutations moved through predictable paths instead of appearing unpredictably across the codebase.

Once that structure existed, every new component had to fit within it.

Services could be rewritten. Implementations could change. But the structure of the system remained stable.

That stability is what made replacement safe.

Frameworks Hide the Real Problem

Frameworks are valuable because they provide structure. They define directory layouts, lifecycle hooks, routing conventions, and dependency patterns. Those conventions help teams move faster and reduce accidental complexity.

The problem is that frameworks provide just enough structure to make teams feel like they have architecture.

But frameworks do not answer the questions that actually determine whether a system can evolve safely: how components communicate, who owns data mutation, how behavior is verified, how parts of the system can be replaced, and where the natural boundaries between components should live.

Those are architectural decisions.

If they remain implicit, regeneration becomes dangerous. Two services may share a database table and both work perfectly in production. But neither of them can be replaced safely. The moment one changes, invisible coupling emerges.

The same thing happens with ad-hoc HTTP endpoints. They often look clean at first glance. But they quietly embed assumptions about state, ordering, and behavior that are nowhere captured in the architecture.

Ad-hoc HTTP endpoints are not an architecture. They are simply networked function calls.

The Missing Compilation Target

Most development pipelines still follow the same conceptual structure:



spec → code → framework → deployable application

AI tools accelerate the middle step. They produce code faster. They scaffold projects more efficiently. But the overall structure remains unchanged.

Regenerative systems require a different model:



spec → architecture → regenerable components → implementations

In this model the architecture becomes the compilation target. Code becomes a build artifact produced to satisfy architectural constraints.

Instruction Sets for Systems

A useful analogy comes from the way compilers interact with hardware.

Modern programming languages rarely compile directly to processors. Instead they compile to an instruction set architecture (ISA).

A C program can compile to ARM. So can Rust. So can Swift. The languages differ, but they all target the same architecture.

The ISA defines the rules that make execution possible: how functions are called, how memory is addressed, and how instructions interact with the machine.

Because those rules are stable, implementations can evolve while the underlying machine remains usable.

Regenerative software needs a similar layer.

A system architecture defines the rules every component must satisfy. Without those rules, components are just programs running next to each other. With them, components become replaceable parts of a larger system.

The Constraints That Make Replacement Safe

A regenerative architecture works because it imposes constraints that make replacement possible.

One of those constraints concerns how components communicate. Systems tend to settle around a small number of interaction models—RPC calls, events, actor messages, command buses, or mutation logs. The specific transport is less important than the consistency. When interaction patterns are predictable across the system, replacing a component does not require rediscovering hidden assumptions about how messages flow.

Another constraint concerns data ownership. Regeneration requires clear authority over state. If multiple components mutate the same data directly, none of them can be replaced safely. A regenerative system therefore assigns exclusive mutation authority for each dataset to a single component. Other parts of the system interact with that data through well-defined interfaces. Variations of this rule appear repeatedly in resilient designs—from service-oriented architectures to event-sourced systems and data mesh. Data ownership ends up defining architectural seams far more reliably than file boundaries ever will.
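The single-writer rule is easy to check mechanically once write access is declared somewhere. A minimal sketch, assuming a hypothetical manifest that maps each component to the datasets it mutates directly (the service and dataset names are invented):

```python
from collections import defaultdict


def find_shared_writers(manifest: dict) -> dict:
    """Return every dataset that more than one component writes to.

    `manifest` maps a component name to the datasets it mutates directly.
    """
    writers = defaultdict(set)
    for component, datasets in manifest.items():
        for dataset in datasets:
            writers[dataset].add(component)
    return {
        dataset: sorted(components)
        for dataset, components in writers.items()
        if len(components) > 1
    }


manifest = {
    "tasks-service": ["tasks"],
    "lists-service": ["lists"],
    "reminders-service": ["reminders", "tasks"],  # second writer to "tasks"
}

violations = find_shared_writers(manifest)
```

Here `violations` names `tasks` as a dataset with two writers, which is exactly the situation where neither service can be replaced safely.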

A third constraint determines the natural grain of components. When components grow too large, regeneration becomes risky because deleting or rewriting them introduces widespread uncertainty. But making components too small produces a different problem: coordination overhead. Context scatters across too many fragments, and network chatter begins to dominate behavior.

The right grain usually reveals itself through two forces. One is data ownership boundaries, which define where mutation authority lives. The other is evaluation surfaces—places where behavior can be verified independently. When a unit both owns its mutation authority and can be verified without booting the entire system, a natural seam appears. Architecture’s job is to stabilize those seams.

Finally, regeneration depends on the presence of clear evaluation surfaces. Behavior must be verifiable independently of implementation. Systems accomplish this through mechanisms such as API contract tests, domain invariants, event consistency rules, and property tests that assert system behavior. Without these surfaces, regenerated components cannot be trusted. Verification becomes guesswork.
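One way to picture an evaluation surface is a contract check written against the interface only, then run against whatever implementation currently sits behind it. The `TaskStore` interface and both implementations below are invented for illustration; the contract function never looks inside them.

```python
from typing import Callable, Protocol


class TaskStore(Protocol):
    def add(self, title: str) -> int: ...
    def complete(self, task_id: int) -> None: ...
    def open_titles(self) -> list: ...


def check_task_store_contract(make_store: Callable[[], TaskStore]) -> None:
    """Behavioral contract: holds for any implementation, old or regenerated."""
    store = make_store()
    first = store.add("write spec")
    second = store.add("regenerate")
    assert first != second, "ids must be unique"
    assert store.open_titles() == ["write spec", "regenerate"], "insertion order"
    store.complete(first)
    assert store.open_titles() == ["regenerate"], "completed tasks are hidden"
    store.complete(first)  # completing twice must be a no-op
    assert store.open_titles() == ["regenerate"]


class DictStore:
    def __init__(self):
        self._tasks = {}
        self._next_id = 1

    def add(self, title):
        task_id = self._next_id
        self._next_id += 1
        self._tasks[task_id] = [title, False]
        return task_id

    def complete(self, task_id):
        self._tasks[task_id][1] = True

    def open_titles(self):
        return [title for title, done in self._tasks.values() if not done]


class ListStore:
    def __init__(self):
        self._rows = []

    def add(self, title):
        self._rows.append({"title": title, "done": False})
        return len(self._rows) - 1

    def complete(self, task_id):
        self._rows[task_id]["done"] = True

    def open_titles(self):
        return [row["title"] for row in self._rows if not row["done"]]


# The same contract verifies both; neither implementation is privileged.
for implementation in (DictStore, ListStore):
    check_task_store_contract(implementation)
```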

What Made the Wunderlist System Work

Looking back, the Wunderlist architecture satisfied all of these constraints.

The mutation bus established predictable calling conventions for how changes flowed through the system. Each data service held exclusive authority over its datastore, ensuring that mutation ownership was clear. Services were sized around domain responsibilities, which produced a workable grain for independent evolution. And contract and integration tests provided evaluation surfaces that verified behavior at the boundaries.

None of those properties came from frameworks.

They came from architectural constraints.

Once those constraints existed, implementations became flexible. The system could evolve without collapsing under its own complexity.

Architecture as Runtime

In a regenerative system, architecture begins to resemble a kind of runtime environment—not in the sense that it executes code directly, but in the sense that it actively constrains how components exist within the system.

The architecture defines how components communicate, how state may be mutated, where correctness is evaluated, and how components can be replaced. Components plug into this environment. Their implementations can change, but the structure that governs their interaction remains stable.

Just as many languages can target the same CPU architecture, many implementations can target the same system architecture.

That stability is what makes regeneration possible.

The Shift

For decades we learned an important lesson about machines.

Software does not compile directly to hardware. It compiles to an instruction set that makes hardware replaceable.

Regenerative systems need the same insight.

Software should not compile directly to frameworks.

It should compile to architecture.

Not applications.

Not templates.

Systems.

https://aicoding.leaflet.pub/3mgfsrk75ac2l

The Regenerative Grain

Feb 19, 2026 Updated Feb 19, 2026

In 2014 I gave a talk called Tiny (keep things small enough to understand). In 2026, small means something different. Small means safe to delete. New in the Phoenix Architecture series: The Regenerative Grain.

From Tiny to Deletion Safety

In 2014, I gave a talk called Tiny. The argument was simple: keep things small enough that a human can understand them.

In 2026, that idea has evolved.

Small is no longer primarily about cognitive load. Small is about deletion safety.

When generation is abundant, the scarce resource is no longer typing. It is verification. It is traceability. It is knowing what must remain true when implementations change.

Regeneration only works at the right grain.

From Cognitive Load to Deletion Safety

"Small" used to mean easier to reason about, faster to ship, less likely to explode. Now "small" means safe to delete, cheap to regenerate, and verifiable at a clean boundary.

If deleting a component feels terrifying, the problem isn't courage. It's architecture. If you can't kill it, you don't own it.

It owns you.

This builds directly on earlier arguments in this series. The Deletion Test asked whether you could delete your entire codebase and regenerate it from specification. That was a diagnostic for the system as a whole. The regenerative grain applies the same logic at a finer resolution: individual components, services, and modules. Where the Deletion Test reveals whether your evaluations are strong enough to survive regeneration, the grain question reveals whether your boundaries are drawn in the right place.

Tinyness isn't aesthetic minimalism. It is structural. It determines whether regeneration is safe or reckless.

Finding the Grain

Before you settle on a component boundary, imagine deleting it and recreating it from its specification. Not refactoring it. Not patching it. Deleting it.

If that thought makes you sweat, the grain is wrong.

This diagnostic isn't a law of physics. It's a heuristic. But it's a reliable one. And unlike a one-time design exercise, it's something you run repeatedly when requirements shift, when a service accretes new responsibilities, when a team changes. Boundaries aren't set at the whiteboard. They're discovered iteratively, the way you discover the natural grain of wood by working with it. Force a cut against the grain and it splinters. Find the grain and the material cooperates.

Here's how to test it.

Comprehension. Can a human (or an LLM) understand the invariants, edge cases, and data transformations of this unit in roughly ten minutes? If understanding requires a historical tour of the repository or a walkthrough of hidden behaviors, you're not holding a component. You're holding accumulated sediment. Layers deposited over time that no one can safely excavate.

Isolation. Can you verify this unit's correctness at its boundary without booting half the system? Not necessarily with boundary tests alone (property-based tests and simulations may still matter), but the core behavior should be mechanically assertable without orchestrating the world around it. If your "unit" requires shared infrastructure just to run, you haven't found a seam.

Mutation ownership. Does this unit have exclusive logical write authority over its data? Physical replication doesn't matter. Read models don't matter. What matters is whether more than one place in the system can change the same logical state. Shared writes create hidden coupling. You cannot safely regenerate what you do not exclusively control.

Contracts. Does the unit communicate through a versioned, schema-enforced interface — typed APIs, explicit behavioral contracts? If so, you can replace it without destabilizing neighbors. If it relies on shared libraries or informal knowledge of how the other side works, regeneration becomes a coordination event instead of a replacement. This is the difference between pruning a branch and performing surgery on a root system.

Your gut. Rate your emotional response to deleting this component. If it feels trivial, the grain may be too fine or fragmented to the point of meaninglessness. If it feels like mild inconvenience, you are in the right zone. If it feels like existential dread, you've discovered a taproot.

That last one sounds soft. It isn't. Deletion terror is a technical signal. It usually means hidden invariants, implicit contracts, or shared mutation, which are precisely the conditions that make regeneration unsafe.

A Concrete Example

Consider a feature-flag evaluation service. Its job is simple: accept a user context and a flag key, return a deterministic decision, and log the evaluation.

The correct grain is not the entire user-profile subsystem or the entire logging pipeline. It is the evaluation engine itself.

If the rules are declarative, the invariants (determinism, no hidden side effects) are verified at the boundary, and the service has exclusive authority over flag definitions, then you can delete the implementation, regenerate it from its specification, run the evaluations, and ship.

If, instead, it reaches directly into user databases, embeds logging side effects across the codebase, or allows multiple services to mutate flag rules, deletion becomes terrifying.

The grain was wrong.
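Here is roughly what the well-grained version could look like, as a sketch under assumptions: rules are plain data, evaluation is a pure function that returns a decision for the caller to log, and the invariants are checked at the boundary. The rule shape, the hash-based bucketing, and the flag and plan names are illustrative choices, not a description of any particular flag system.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagRule:
    key: str
    rollout_percent: int                     # 0 to 100
    allowed_plans: frozenset = frozenset()   # empty set means any plan


@dataclass(frozen=True)
class Decision:
    flag_key: str
    user_id: str
    enabled: bool


def bucket(flag_key: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def evaluate(rule: FlagRule, user: dict) -> Decision:
    """Pure function: no I/O here, the caller logs the returned Decision."""
    plan_ok = not rule.allowed_plans or user.get("plan") in rule.allowed_plans
    enabled = plan_ok and bucket(rule.key, user["id"]) < rule.rollout_percent
    return Decision(rule.key, user["id"], enabled)


def check_boundary_invariants(users: list) -> None:
    """Invariants any regenerated evaluator must satisfy."""
    for user in users:
        half = FlagRule("new-editor", 50)
        assert evaluate(FlagRule("new-editor", 0), user).enabled is False
        assert evaluate(FlagRule("new-editor", 100), user).enabled is True
        assert evaluate(half, user) == evaluate(half, user)  # deterministic
        if evaluate(half, user).enabled:
            # Widening a rollout never disables an already-enabled user.
            assert evaluate(FlagRule("new-editor", 75), user).enabled


check_boundary_invariants([{"id": f"user-{n}", "plan": "free"} for n in range(50)])
```

Because the checks live outside the implementation, the body of `evaluate` could be deleted and rewritten and `check_boundary_invariants` would still decide whether the replacement is acceptable.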

Finding the Sweet Spot

Not everything should be tiny.

When the grain is too coarse, the verification surface explodes. Regeneration becomes opaque. Fear accumulates like deadwood. When the grain is too fine, you drown in orchestration and coordination overhead. Meaning fragments across too many seams. Decision fatigue replaces clarity.

The right grain is large enough to contain coherent invariants and small enough to isolate mutation. A unit that fits in working memory, owns its mutations, exposes a versioned contract, and can be verified at its boundary without booting the entire system.

The goal is not maximal fragmentation. It is disposability.

What this doesn't yet address is how grain decisions compose. Five well-grained components don't automatically produce a well-grained system. System-level regeneration requires something beyond individual component boundaries — orchestration contracts, integration specifications, the connective tissue between seams. That's a question this series will return to.

Where This Doesn't Apply Cleanly

There are domains where regeneration must be handled with extreme care: cryptographic primitives, protocol parsers, performance-critical hot paths, regulated or safety-critical components.

In those contexts, deletion safety coexists with formal verification, audits, and traceability requirements. The grain question still applies — you still want to know whether boundaries are drawn well — but the cost of regeneration is higher and must be deliberate.

Regeneration presupposes deterministic specifications and boundary evaluations. Without them, you are not regenerating. You are improvising.

What This Changes on Monday

This is not about adopting a new framework. It is about shifting from writing code that lasts to designing systems whose components can be replaced.

That shift forces you to care about deterministic specifications, boundary evaluations, mutation ownership, and versioned contracts. These are the same things "The Gradient of Trust" argues are more durable than prompts, and that "Immutable Infrastructure, Immutable Code" argues should never be patched in place.

In a world where generation is abundant, the most expensive thing you can own is code you are afraid to change.

Tinyness is no longer about elegance. It is about safe forgetting.

https://aicoding.leaflet.pub/3mfai4nqg6224

The Industrialization of Regenerative Software

Feb 12, 2026 Updated Feb 12, 2026

The “AI software factory” metaphor is seductive.

Factories increase output. They reduce marginal cost. They turn craft into production. For decades, writing code was the bottleneck. Now generation is cheap, and it feels like we’ve industrialized software.

But factories are not optimized for throughput alone. Real factories are optimized for yield.

Throughput is how much you produce. Yield is how much of it survives.

That distinction matters more than most of the AI discourse admits.

A team embraces the factory model.

In two weeks, they’ve generated forty internal services: shared utilities, validation layers, orchestration glue, observability hooks. Everything modular. Everything neat.

Three months later, something subtle shifts.

A small requirement change touches validation. Five different modules contain similar logic. No one remembers which one is canonical. The tests still pass because they verify each implementation, not the invariant across them.

Monitoring disagrees with itself. Metrics drift, but not catastrophically. There’s no outage. Just friction.

Deleting anything feels dangerous. Everything might be referenced somewhere. No artifact looks obviously wrong. Nothing is clearly scrap.

They didn’t move too fast.

They industrialized generation without industrializing forgetting, and nothing in their tools or culture warned them they needed to.

Manufacturing learned this lesson long ago.

A factory that optimizes only for throughput accumulates waste. Defective units pile up. Rework increases. Inventory grows. Costs hide inside buffers. So industrial systems built discipline around scrap: defect rates are measured, processes are retooled, material is reclaimed, disposal is designed into production. Yield becomes the signal that matters.

Software factories talk about output: how many features shipped, how many services generated, how many migrations completed.

They rarely talk about yield: how many components can be safely deleted, how many services can be reimplemented without destabilizing the system, how much of the system’s identity survives regeneration unchanged.
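Yield can be made into a number, at least crudely. The sketch below is one possible way to do it, not an established metric: it counts a component as safely deletable only if it has boundary evaluations, owns its writes, and exposes a versioned contract. Those three criteria and the component names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    has_boundary_evaluations: bool
    owns_its_writes: bool
    has_versioned_contract: bool

    @property
    def safely_deletable(self) -> bool:
        # All three must hold before deletion counts as safe.
        return (
            self.has_boundary_evaluations
            and self.owns_its_writes
            and self.has_versioned_contract
        )


def regeneration_yield(components: list) -> float:
    """Fraction of components that could be deleted and regenerated safely."""
    if not components:
        return 0.0
    return sum(c.safely_deletable for c in components) / len(components)


generated = [
    Component("flag-evaluator", True, True, True),
    Component("validation-a", True, False, True),
    Component("validation-b", False, False, False),
    Component("orchestration-glue", False, True, False),
]

print(f"throughput: {len(generated)} components")
print(f"yield: {regeneration_yield(generated):.0%}")
```

Four components of throughput, one of which survives the deletion question: a 25% yield.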

Throughput without yield is accumulation.

And accumulation has carrying cost. Cost in cognitive load, integration surface, monitoring complexity, coordination overhead. When generation becomes cheap, those costs are easy to ignore until they dominate.

The factory metaphor is only useful if it includes reclamation.

Industrial systems are built to absorb replacement. They assume parts will be discarded and design processes where scrap is expected and controlled. Regenerative software demands the same discipline.

Deletion must be ordinary. Replacement must be bounded. System identity must live outside implementation.

The factory does not produce code. It produces behavior that can be regenerated without fear.

That is yield.

If code is cheap to produce but expensive to understand, then the durable artifact cannot be the code itself. It must be the interfaces that survive language changes, the invariants that survive framework swaps, the evaluations that survive reimplementation, and the data that survives infrastructure shifts.

Without that structure, industrial generation increases risk. Every regeneration adds surface area unless something else constrains it. Every new abstraction feels cheap in isolation. Every optimization leaves residue.

Cheap generation shifts the constraint from production to coherence.

Factories that scale output without scaling coherence do not become efficient. They become brittle. Real industrial systems reduce variance, define tolerances, and make replacement predictable. They are engineered for discard as much as for production.

The software equivalent is not “more code faster.” It is explicit boundaries, durable evaluations, drift detection in production, and compaction cycles that reduce conceptual mass rather than expanding it.

Industrial regeneration means that the cost of replacement is lower than the cost of preservation. When that is true, deletion becomes safe. When deletion becomes safe, accumulation becomes optional.

AI gives us industrial generation immediately. It does not automatically give us industrial regeneration.

You can produce artifacts faster than you can reason about them. You can increase throughput while quietly lowering yield.

A system that cannot safely forget will eventually be constrained by what it remembers.

If we are serious about industrialization, we must industrialize forgetting. That requires evaluation suites that outlive implementations, invariants expressed at stable boundaries, regeneration cadence built into process, and architectural discipline that treats deletion as a first-class event.

Until we do that, every AI-powered factory we celebrate will quietly increase the weight of the systems it claims to accelerate.

Throughput is easy.

Yield is the work.

https://aicoding.leaflet.pub/3men54inhes2d

The Deletion Test

Jan 24, 2026 Updated Jan 24, 2026

Here’s a simple test you can apply to any software system you work on:

Imagine deleting the entire implementation.

Not refactoring it.

Not archiving it.

Not putting it behind a feature flag.

Deleting it.

rm -rf src/

If that thought makes your stomach drop, pay attention. That reaction is telling you something important.

It’s not telling you that you’re reckless.

It’s not telling you that you lack discipline.

It’s telling you that you don’t know what would survive.

Fear Is a Signal

Most engineers experience deletion as existential. Code feels like the thing. It’s what we write, review, version, deploy, and debug. Losing it feels like losing the system itself.

But that fear is not inevitable. It’s contingent.

There are systems where deleting the implementation would be inconvenient but not terrifying. You might lose time. You might burn compute. But you wouldn’t lose understanding.

The difference is not bravery or craftsmanship.

It’s where meaning lives.

What Are You Actually Afraid of Losing?

When people say, “We can’t just throw the code away,” what they usually mean is something more precise:

We don’t know exactly what behavior is required.
We don’t know which failures are unacceptable.
We don’t know what invariants must always hold.
We don’t know how to tell if a new version is correct.
We don’t know which apparent bugs are actually intentional fixes for forgotten edge cases.

Those are not code problems. They are evaluation problems.

Code becomes precious when it is the only place knowledge lives.

Code as a Stand-In for Understanding

For most of software history, treating code as durable was reasonable.

We treated code as permanent because the labor to produce it was the bottleneck. Rewriting was expensive. Re-validation was risky. Implementations accumulated meaning over time. Structure, tests, comments, bug fixes, and tribal knowledge fused into something you learned not to disturb.

That made sense when production was the constraint.

Today, the bottleneck has shifted from production to validation.

Generation is cheap. Confidence is not.

When regeneration is easy, code stops being an asset and starts acting as a cache: a materialized view of understanding that is useful while current, disposable when stale.

The danger appears when the cache becomes the source of truth.
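The cache framing suggests a simple staleness rule: generated code is current only while the specification it was built from is unchanged. A minimal sketch, with hypothetical names and a made-up one-line spec:

```python
import hashlib
from dataclasses import dataclass


def spec_digest(spec: str) -> str:
    """Content hash of a specification."""
    return hashlib.sha256(spec.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class GeneratedArtifact:
    code: str
    built_from: str   # digest of the spec at generation time


def is_stale(artifact: GeneratedArtifact, current_spec: str) -> bool:
    # The spec is the source of truth; the code is only valid against it.
    return artifact.built_from != spec_digest(current_spec)


spec_v1 = "Return the flag decision for a user within 50 ms."
artifact = GeneratedArtifact(code="...generated code...", built_from=spec_digest(spec_v1))

spec_v2 = "Return the flag decision for a user within 20 ms."
needs_regeneration = is_stale(artifact, spec_v2)
```

The direction of the comparison is the point: the artifact is checked against the spec, never the other way around.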

The Deletion Test, Defined

The deletion test is not a recommendation. It’s a diagnostic.

Ask yourself:

If I deleted this codebase and regenerated it from scratch, what would I rely on to decide whether the result was correct?

If the honest answer is “the old code,” then the old code is doing work it shouldn’t be doing.

It’s acting as:

the specification
the test suite
the documentation
the bug database
the definition of correctness

That’s not robustness. It’s entanglement.

Oracles, Not Artifacts

If you delete the code but keep your property-based tests, your contracts, your invariants, and your operational signals, you haven’t actually lost the system.

You’ve lost an artifact.

What matters is whether you still have an oracle: a way to mechanically distinguish “correct” from “incorrect” without referring to history.

Production telemetry can tell you that something changed.

It rarely tells you whether it should have.

Without explicit evaluations, runtime behavior tells you what happened, not whether it was right.
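An oracle does not have to be elaborate. As a small sketch, take a made-up spec, "remove duplicates while keeping first-seen order," and express it as properties that judge any candidate implementation without consulting the old one. The function names and the two candidates are illustrative.

```python
import random
from typing import Callable

# Hand-picked cases run first; random cases widen coverage.
FIXED_CASES = [[], [1, 1], [3, 1, 3, 2]]


def satisfies_properties(items: list, result: list) -> bool:
    no_duplicates = len(result) == len(set(result))
    same_values = set(result) == set(items)
    # Position of each output value's first occurrence in the input.
    positions = [items.index(x) for x in result] if same_values else []
    keeps_first_seen_order = positions == sorted(positions)
    return no_duplicates and same_values and keeps_first_seen_order


def oracle(candidate: Callable, trials: int = 200, seed: int = 0) -> bool:
    """Accept or reject a candidate using properties only, never old code."""
    rng = random.Random(seed)
    random_cases = [
        [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        for _ in range(trials)
    ]
    return all(
        satisfies_properties(case, candidate(list(case)))
        for case in FIXED_CASES + random_cases
    )


def regenerated_ok(items: list) -> list:
    return list(dict.fromkeys(items))


def regenerated_wrong(items: list) -> list:
    return sorted(set(items))   # loses first-seen order


assert oracle(regenerated_ok)
assert not oracle(regenerated_wrong)
```

Neither candidate is compared against a previous implementation. The properties alone separate the correct one from the plausible-looking wrong one.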

Why This Matters Now

This isn’t a tooling fad. It’s an economic shift.

When generating new implementations is cheap, the cost of uncertainty dominates. Keeping code around because you’re afraid to delete it becomes a liability, not a safety measure.

If you don’t know how you would evaluate regenerated code, regeneration is reckless.

If you do know, regeneration is conservative.

The deletion test tells you which world you’re in.

When Deletion Is Boring

Notice the inversion this forces.

The question stops being “How do we write safer code?”

It becomes “What must exist so that code can be replaced safely?”

That question is harder. It requires deciding what truly matters and making it explicit, testable, and durable. It requires relocating rigor out of the implementation and into the system around it.

Most teams don’t do this because they haven’t needed to.

The deletion test makes the cost visible.

The goal is not to delete everything.

The goal is to build systems where deletion is boring.

Because when deleting code is boring, regenerating it is safe. And when regenerating code is safe, code stops being the thing you’re afraid to lose.

If you want software that can survive regeneration, start by asking what would survive deletion.

https://aicoding.leaflet.pub/3md5ftetaes2e

UI Is a Conservation Layer

Jan 21, 2026 Updated Jan 21, 2026

Why the user interface is the last to become regenerative

If you’ve been following this series, you may already be thinking:

“This all makes sense for non-UI code. But surely this can’t apply to interfaces.”

That reaction should feel familiar. It’s the same one I got earlier in this series when I argued for regeneration:

“Regenerating all the code every time is crazy.”

It sounded reasonable then, and it sounds reasonable now. In both cases, the objection comes from the same category error.

Regeneration does not mean indiscriminate churn. It means bounded replacement behind stable interfaces. When you miss the boundary, the idea sounds reckless. When you see the boundary, it becomes conservative.

UI is arguably where that boundary matters most.

The Category Error, Revisited

Regeneration works extremely well for large parts of modern systems:

infrastructure
services
domain logic
state management
non-UI code inside client applications

All of that can change rapidly without confusing users, as long as the system’s human-facing behavior remains coherent.

The mistake is assuming the UI belongs in the same category.

UI is not just another artifact. It is the human-facing boundary of the system. It is where people form habits, build expectations, and decide whether they trust what they’re using.

Treating UI as just another regenerable component ignores the cost of breaking those things.

What UI Actually Represents

Users don’t experience systems as implementations. They experience them as continuity.

They learn where things are. They internalize flows. They stop thinking and start acting.

That’s not cosmetic. That’s learned behavior.

When you change UI, you’re not just changing pixels. You’re invalidating a mental model that took time to form.

That cost is paid by every user, every time.

Backend regeneration optimizes for correctness and cost.

UI stability optimizes for trust and habit.

Those are different objectives, and they should live in different layers.

Pace Layers, Used Precisely

Pace layers are often misunderstood as a ranking of what can change fastest.

They’re not.

They describe what must change slowest in order to stabilize everything else.

In a regenerative system, the correct mapping looks like this:

Regenerable code (infrastructure, services, domain logic, non-UI client code): fast to change, low human-visible cost
UI: slow to change, high human-visible cost
User trust and habit: slowest of all

UI sits under fast technical and cultural pressure, but it must itself move slowly to buffer that pressure from users.

Good UI absorbs volatility. Bad UI transmits it directly.

This is why UI cannot be treated as a fast regeneration layer, even though everything beneath it can.

Where Regeneration Pressure Must Stop

A well-designed regenerative system does something subtle but crucial.

It absorbs change internally and presents continuity externally.

We already accept this idea everywhere else.

Stable APIs protect callers from volatile implementations.
Stable protocols protect clients from transport churn.

UI is the human protocol of the system.

Protocols don’t churn. Implementations do.

If your system regenerates aggressively all the way up to the interface, you haven’t built an adaptive system. You’ve just pushed the cost of change onto your users.

Why AI Makes This More Dangerous

AI makes regeneration cheap, and that’s the problem.

When UI changes are cheap but user relearning is expensive, the system exports its flexibility costs to users. Confusion compounds. Trust erodes slowly. Users adapt, but grudgingly.

Nothing breaks loudly. Metrics drift. Support volume rises later. People stop exploring and start avoiding.

A system that constantly “improves” its interface while exhausting its users is not adaptive. It’s hostile.

AI makes this dramatically easier to get wrong.

What Regenerative Architecture Demands of UI (and Developers)

In a healthy regenerative system:

UI changes are rare, deliberate, and justified
most regeneration happens behind the interface
UI evolution is additive, optional, and reversible
deprecations are slow and visible
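The "additive" rule in that list can be checked the way API compatibility is checked. A rough sketch, assuming the interface can be described as a hypothetical manifest mapping each control to where it lives (the control and location names are invented): new controls pass, removed or moved controls get flagged for deliberate review.

```python
def breaking_ui_changes(old: dict, new: dict) -> list:
    """List changes that invalidate what users have already learned.

    `old` and `new` map a control id to its location in the interface.
    Controls that exist only in `new` are additive and are not reported.
    """
    problems = []
    for control, location in old.items():
        if control not in new:
            problems.append(f"removed: {control}")
        elif new[control] != location:
            problems.append(f"moved: {control} ({location} -> {new[control]})")
    return problems


current = {"search": "top-bar", "settings": "sidebar"}

additive = {**current, "export": "sidebar"}   # adds a control, moves nothing
regenerated = {"search": "sidebar"}           # moves one control, drops another

assert breaking_ui_changes(current, additive) == []
print(breaking_ui_changes(current, regenerated))
```

A non-empty result does not mean the change is forbidden, only that it should be rare, deliberate, and justified rather than a side effect of regeneration.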

UI should optimize for predictability, not novelty.

There’s an uncomfortable implication here.

Developers can no longer treat UI as a thin aesthetic layer on top of “real” systems. UI lives in a slower, higher-leverage layer precisely because its failure modes are human.

That means developers must treat user trust as a first-class dependency.

You can’t mock it.

You can’t regenerate it.

You can’t A/B test it recklessly.

Once broken, it recovers slowly.

This is architectural responsibility, not design polish. It’s about protecting accumulated human understanding while everything underneath continues to evolve.

Regeneration Requires Conservation

Regeneration does not mean everything should change.

Just as regenerative systems protect stable interfaces from volatile implementations, they must protect users from internal churn.

The UI exists to conserve meaning while the system beneath it evolves.

If you regenerate the interface as aggressively as the code, you haven’t built an adaptive system.

You’ve built a forgetting machine.

https://aicoding.leaflet.pub/3mcxo5ojob22c

Provenance Is the New Version Control

Jan 13, 2026 Updated Jan 13, 2026

When code can be thrown away and recreated, the unit of change is no longer lines of code. It’s reasons. Version control has to follow.

Regenerable systems quietly invalidate an assumption that has underpinned software engineering for decades: that the text of the code is the best record of how and why a system came to be. Once an AI can reliably regenerate an implementation from specification, the code itself becomes an artifact of synthesis, not the locus of intent.

By regenerable, I mean: if you delete a component, you can recreate it from stored intent (requirements, constraints, and decisions) with the same behavior and integration guarantees.

In that world, version control doesn’t disappear, but it has to move upstream.

When Diffs Stop Representing Decisions

Traditional version control works because code edits are a reasonable proxy for human decisions. Someone typed this conditional. Someone refactored that loop. A diff is an imperfect but serviceable record of authorship.

AI-assisted generation severs that link.

When an agent reads a specification, reasons about constraints, chooses an approach, and emits code, the resulting text reflects outcomes, not decisions. A diff can show what changed in the artifact, but it cannot explain which requirement demanded the change, which constraint shaped it, or which tradeoff caused one structure to be chosen over another.

This is the sense in which code-first version control becomes a lossy history. Not because diffs are useless (they still matter operationally) but because they no longer represent the causal history of the system. They tell you what happened, not why it happened.

That distinction matters once code is no longer directly authored.

Specifications as Executable Intent

In a regenerable system, specifications are no longer descriptive documents. They are executable inputs.

If a component can be deleted and recreated at will, then whatever information is required to recreate it is, by definition, the source of truth. Specifications stop being explanatory prose and become causal inputs.

The same is true of an agent’s plan.

The plan that matters isn’t free-form thinking. It’s the decision record: chosen strategy, rejected alternatives, and the constraints that forced the choice. Even when the choice is wrong, it’s still the most useful artifact to preserve: it explains why the system looks like this. Treating this as throwaway reasoning discards information that is often more important than the final text.

The plan is not documentation. It is part of the implementation.

A Concrete Example: Email Validation

Consider a small component: a function that validates email addresses.

A specification might state:

The system must accept standard email addresses of the form local@domain.

It must reject inputs without exactly one @.

It must not attempt full RFC compliance.

An agent produces a plan:

Use a simple regular expression.

Do not rely on external libraries.

Explicitly reject whitespace.

Favor readability over completeness.

From this, code is generated.

Now the requirement changes:

The system must accept internationalized domain names (IDN) in the domain portion.

Nothing else changes.

In a code-centric workflow, you inspect the diff and infer intent after the fact. In an intent-centric workflow, a single requirement node changes, the dependent plan nodes change, and the generated code changes as a consequence. The unit of change is not “these lines,” but “this reason.”

You can now answer not just what changed, but why it had to.
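To make the example concrete, here is a minimal sketch of what the generated code might look like before and after the IDN requirement. The function names and the exact patterns are illustrative assumptions, not output from any real generator, and requiring at least one dot in the domain is my assumption rather than part of the stated spec.

```python
import re

# Local part: a deliberately small ASCII set (assumption; the spec only
# asks for "standard" addresses and rules out full RFC compliance).
_LOCAL = r"[A-Za-z0-9._%+-]+"

# v1 (hypothetical): ASCII-only domain labels, at least one dot.
_ASCII_DOMAIN = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"

# v2 (hypothetical): regenerated after the IDN requirement. [^\W_] matches
# any Unicode letter or digit, so labels like "bücher" are accepted.
_IDN_DOMAIN = r"(?:[^\W_]|-)+(?:\.(?:[^\W_]|-)+)+"

EMAIL_V1 = re.compile(f"{_LOCAL}@{_ASCII_DOMAIN}")
EMAIL_V2 = re.compile(f"{_LOCAL}@{_IDN_DOMAIN}")


def is_valid_email_v1(value: str) -> bool:
    # fullmatch rejects whitespace and any second "@", because neither
    # character appears in the allowed sets.
    return EMAIL_V1.fullmatch(value) is not None


def is_valid_email_v2(value: str) -> bool:
    # Same local-part rule; only the domain pattern differs from v1.
    return EMAIL_V2.fullmatch(value) is not None
```

The diff between the two versions is one pattern. On its own, that diff does not say whether the change was a bug fix, a loosening, or an accident. The requirement node does.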

From Files to Intent Graphs

To support this, intent cannot live in a loose collection of documents. It needs structure.

The representation that works is a content-addressed graph. Individual requirements, constraints, plans, decisions, and environmental factors become nodes. Each node has a stable representation and a hash derived from its content. Edges express causality: this plan depends on that requirement; this decision exists because of that constraint.

In practice, each node needs at least: a type, canonical content, explicit dependencies, and evaluation artifacts (tests, constraints, budgets) that make regeneration checkable.

Even in the small example above, the graph is explicit:

A requirement node: “accept standard email addresses”
A constraint node: “no full RFC compliance”
A plan node: “use a regex, reject whitespace”
A generator node: “Claude-class model, email-validator template”

The code sits downstream of all four.

The “version” of the component is the root hash of this graph. Change a requirement and only the downstream nodes change. Regenerate with identical inputs and the root hash remains stable. Identity moves from files to intent.
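A minimal sketch of that graph, assuming SHA-256 over a JSON encoding as the hashing scheme. The Node class, the "component" root node, and the helper name are hypothetical, and evaluation artifacts are left out to keep it short.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str                        # e.g. "requirement", "constraint", "plan"
    content: str                     # canonical text of the node
    deps: tuple["Node", ...] = ()    # nodes this one causally depends on

    def digest(self) -> str:
        # The hash covers the node's own content plus the hashes of its
        # dependencies, so a change upstream changes everything downstream.
        payload = json.dumps(
            {
                "kind": self.kind,
                "content": self.content,
                "deps": sorted(dep.digest() for dep in self.deps),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def email_validator_graph(requirement_text: str) -> dict[str, Node]:
    """Build the four nodes from the example plus a root for the component."""
    requirement = Node("requirement", requirement_text)
    constraint = Node("constraint", "do not attempt full RFC compliance")
    plan = Node("plan", "use a regex, reject whitespace", (requirement, constraint))
    generator = Node("generator", "Claude-class model, email-validator template")
    root = Node("component", "email validator",
                (requirement, constraint, plan, generator))
    return {"requirement": requirement, "constraint": constraint,
            "plan": plan, "generator": generator, "root": root}


before = email_validator_graph("accept standard email addresses")
again = email_validator_graph("accept standard email addresses")
after = email_validator_graph("accept standard email addresses, including IDN domains")
```

Here `before` and `again` share a root hash because their inputs are identical. In `after`, the requirement, plan, and root hashes change, while the constraint and generator nodes keep theirs.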

What’s New and What Isn’t

None of these ideas exist in isolation.

Build systems like Bazel—and increasingly Nix-style systems—use hashed inputs and content-addressed caches to track which inputs produced which outputs. Formal methods have long pursued specifications with mathematical semantics precise enough to analyze and verify.

What’s new is the coupling.

Bazel tracks build causality. Formal specifications describe logical intent. Regenerable systems require generative provenance: a direct, machine-enforced link between intent and implementation. The specification graph doesn’t sit beside the system. It drives it.

Description can drift. Drivers cannot.

Why Traceability Failed and Why It Might Not Now

Industries have attempted requirements traceability for decades, usually through tickets, spreadsheets, and process checklists. It often failed in mainstream software because humans were asked to maintain links that the system itself did not depend on.

Regenerable systems invert the incentives.

If a system can regenerate itself, it must already know what it’s doing. Provenance stops being overhead and becomes infrastructure. The links exist because generation requires them.

This does not describe how today’s AI tools work. Current generators do not emit stable, versionable plans or structured intent graphs. It is an argument about the direction forced by regeneration economics: the cost of re-deriving code keeps falling, while the cost of rediscovering intent does not.

Hard Problems and Failure Modes

This model raises real challenges.

Specifications expressed in natural language require canonicalization. Two nodes may be semantically equivalent but textually different, and we won’t always detect that reliably. Agents will make implicit assumptions that are not explicitly recorded. Non-deterministic generators may produce different code from identical intent graphs.

These are not reasons to abandon the approach. They are design constraints.

The model does not require perfect formalization. It requires tractability—and tractability improves as specifications become more structured, plans become explicit, and generators are forced to surface their decisions. Ambiguity becomes visible rather than hidden in diffs.

Even failure becomes diagnosable at the level that matters: intent.

Versioning What Actually Matters

Git taught us how to version text.

Regenerable systems force us to version intent: the requirements, constraints, and decisions that caused a system to take its current shape. Code still matters, but it becomes an artifact, not the record of authorship.

The tools to do this well don’t fully exist yet. But the pressure is already here. If code can be recreated at will, the question becomes unavoidable:

What, exactly, is worth preserving, and how would you know?

https://aicoding.leaflet.pub/3mcbiyal7jc2y

n=1 Is a Design Constraint (Not a Staffing Model)

Jan 7, 2026 Updated Jan 7, 2026

Single-developer capability isn’t a productivity story. It’s the test that tells you whether your architecture is worth keeping.

Here is a design constraint worth taking seriously: if your system cannot be understood, modified, and regenerated from specification by one competent engineer, it is already too complex.

This is not a statement about staffing. It is a statement about architecture.

Call it n=1 capability. The claim is not that you should run your engineering organization with one person. The claim is that you should design systems where one person could. That’s the test.

Why This Constraint Matters

Systems that pass the n=1 test have specific properties: clear boundaries, externalized meaning, replaceable components, low coordination overhead. These are the properties you actually want. They scale judgment, not headcount.

A system that requires a team is not necessarily bad. But a system that requires a team just to understand it—where no single person can hold its shape in mind—has a problem that will compound. Every new engineer slows down. Every departure creates knowledge gaps. Every change requires negotiation across boundaries that exist only in people’s heads.

n=1 capability is the diagnostic. When one person can ship what used to require a team, it tells you something important about the system, not the individual.

The Myth of the Solo Genius

Software culture has a long history of mythologizing exceptional individuals. We tell stories about lone hackers and 10x developers, and we treat outsized impact as evidence of rare talent.

That framing is comforting. It is also misleading.

Setting aside that I have rarely worked with a “10x developer” who wasn’t also somehow a tax on the team, project, or system, a single developer can only be effective to the extent that the system allows. No amount of skill lets one person safely reason about a sprawling, entangled codebase with unclear boundaries, implicit behavior, and hidden state.

When n=1 development works, it works because the system is shaped to allow it.

The question should not be “how is that person so productive?”

The question is “what properties of the system make this possible?”

The answer will never be talent alone. It will be boundaries, compaction, evaluations, and replaceability. Those are architectural choices.

Where n=1 Fails

Take a typical large system. Thousands of files. Deep dependency graphs. Implicit invariants. Behavior encoded in history. Knowledge distributed across people.

Drop a single developer into that environment and watch what happens. They slow down. They become cautious. They avoid change. They depend on tribal knowledge they do not have.

This isn’t a talent problem. It’s an architectural one.

n=1 development fails when the cost of understanding the system exceeds the capacity of one human mind. That was the normal state of affairs for decades. The team was a coping mechanism for complexity that had outgrown individual cognition.

The Cognitive Load Theory of Architecture

The limiting factor in software has never been typing speed. It has always been cognition.

n=1 development works when the total cognitive load of the system fits within one person’s mental budget. This is not about making systems small. It’s about making them comprehensible. A large system can be n=1 capable if its structure allows a single mind to reason about it in layers, with clean boundaries between concerns.

This requires compaction: eliminating accidental complexity, enforcing boundaries aggressively, designing for replacement rather than accumulation.

Legacy systems too large for a single human to understand might only be that way because they lack boundaries and abound with accidental complexity. In other words, the true essence of a complex legacy system might still be simple and n=1-accessible if not for all the damn code.

Generative AI didn’t create these requirements, but it certainly reveals them. AI makes it easy to generate code. Generation without comprehension is just faster accumulation of debt. Compaction and regeneration make it possible to control what AI produces. Without those disciplines, AI simply accelerates collapse.

Meaning Lives Outside the Code

The systems that enable n=1 development share a common property: meaning is externalized.

Behavior is defined by evaluations, not implementations. Contracts live at interfaces, not in comments or unstructured documentation. Monitoring catches drift before it compounds. Automation makes replacement cheap.

In such systems, AI does not replace engineers. It removes the tax of manual execution, allowing a single person to operate at the level of architecture instead of implementation. The human holds the shape. The machine fills it in.

This only works when the shape is explicit enough to verify. If correctness depends on social knowledge rather than mechanical enforcement, AI code generation is a liability. It can produce output, but no one can tell (at least quickly enough) whether the output is right.

Again, This Is Not Outsourcing

It’s tempting to compare n=1 development to past attempts at labor arbitrage. That analogy fails for an important reason.

Outsourcing tried to scale execution without externalizing system knowledge. It relied on supervision, documentation, and process to compensate for implicit structure. The coordination cost remained; it just moved.

n=1 development works only when the opposite is true. The system must be so well-defined that supervision is unnecessary. Behavior must be enforced mechanically, not socially. Correctness must be observable, not inferred.

n=1 is not cheaper labor. It is cheaper coordination. That’s a different thing entirely.

What n=1 Tells You About Teams

n=1 capability does not mean teams go away.

It means teams are no longer required to compensate for architectural opacity. In well-designed systems, one person can own a component end-to-end. Teams form around interfaces, not codebases. Collaboration happens at boundaries. Coordination cost drops dramatically.

n=1 is a lower bound, not a mandate. When systems are compact and regenerative, adding people becomes a choice, not a necessity. You scale because you want to go faster or cover more ground. Not because the system has become too complex for any one person to hold.

This is the real test: if you can’t get to n=1 in theory, your architecture is already too expensive.

The Canary in the Architecture

n=1 capability is a leading indicator.

If your system cannot be understood, modified, and regenerated by one competent engineer, it is living on borrowed time. That does not mean it is broken today. It means its complexity is compounding faster than your ability to manage it.

AI will not save such systems. It will make their fragility visible faster. Every acceleration in generation speed is also an acceleration in accumulation speed. Accumulated complexity eventually wins.

The systems that thrive will be the ones designed for n=1 from the start. Not because they’ll be run by one person, but because the constraint produces the properties that matter: coherence, replaceability, verifiability.

The Constraint That Produces Quality

When you see n=1 development succeeding, don’t dismiss it as heroics. Don’t write it off as “not real engineering.” Ask what it reveals.

The system’s complexity has been reduced to the point where human judgment, assisted by machines, is sufficient to keep it coherent. That is not the end of software engineering. It is what software engineering looks like when architecture finally matters more than artifacts.

n=1 is not a staffing goal. It is a design goal.

Design for n=1 capability. Not because you want to run lean, but because systems that pass the test are systems worth building.

https://aicoding.leaflet.pub/3mbuc4mohwc2k

Relocating Rigor

Jan 6, 2026 Updated Jan 6, 2026

The Discipline That Looks Like Recklessness

In late 1999, I found myself inside the Extreme Programming movement before most people had heard of it. Kent Beck's white Extreme Programming Explained book had just come out. Ward Cunningham's wiki was where the real conversations happened. The Agile Manifesto wouldn't exist for another couple of years.

From the outside, what we were doing looked reckless.

We threw away long-range plans. We rejected heavyweight processes. We stopped pretending we could predict the shape of a system months in advance. We paired constantly, which looked inefficient. We wrote tests before code, which looked backward. We released continuously, which looked dangerous.

To many observers, this was a removal of constraints. The opposite was true.

XP compressed feedback loops until truth became unavoidable. Tests replaced promises. Continuous integration replaced status reports. Working software replaced narrative. You could no longer hide behind process because the system itself reported your progress, loudly and continuously.

The practices that looked like chaos were actually mechanisms for enforcing honesty. Pair programming meant every line of code had a witness. Test-first meant you couldn't ship wishes. Short iterations meant you couldn't hide. The discipline was more demanding than what came before, not less. It just didn't look like the discipline people were used to seeing.

That experience permanently changed how I think about software.

It also explains why I lost interest once Extreme Programming got absorbed into the broader "Agile" movement and solidified into branding and ceremony. When the name took over, the rigor drained out. The feedback softened. The theater returned. Consultants taught the artifacts without the discipline. I wrote about that years ago in The Curse of a Name.

I'm revisiting this history now because we're watching the same pattern repeat with generative AI, and it's being misunderstood in exactly the same way.

The Pattern

Certain shifts in software history feel like freedom because they remove familiar signals of control. In reality, they relocate rigor closer to where truth lives. They make it harder to fake progress.

This pattern has repeated at least three times in my career.

Dynamic languages displaced static type systems. When Ruby and Python started spreading into production systems, they were widely criticized as undisciplined. No compile-time guarantees. No rigid type constraints. Too easy to write sloppy code.

What actually happened was a shift in where rigor lived. Static promises gave way to runtime truth. Type declarations gave way to executable behavior. Compiler appeasement gave way to test-enforced correctness. In practice, the teams that succeeded doubled down on executable specifications: tests that described behavior precisely enough to function as a de facto type system.

The discipline didn't vanish. It moved into tests, contracts, and feedback loops that reflected how the system actually ran. The type system was still there; you just had to earn it through behavior rather than declaration.

(Yes, I know and appreciate that some great and very popular languages with excellent static type systems have started to displace the dynamic ones.)

Extreme Programming displaced phase-gate development. XP removed plans, design documents, and phase gates. These were the artifacts that made organizations feel safe. In their place it installed mechanisms that were far less forgiving: test-first development, continuous integration, constant peer review, real customer feedback.

It looked chaotic because it removed the appearance of control. What replaced it was operational truth. You knew where you stood because the code told you, not because a project manager updated a Gantt chart.

Continuous deployment displaced release management. No release windows. No stabilization phases. No heroic integration efforts. Another apparent loss of discipline.

In reality, continuous deployment demands far stricter engineering than quarterly releases ever did. You need reversibility. Observability. Automated verification. Fast rollback paths. Continuous deployment isn't about speed; it's about never being surprised. You can't ship continuously without knowing exactly what your system is doing at all times. The rigor becomes continuous as well.

Why Regenerative Software Fits This Pattern

Generative AI appears to remove the ultimate constraint: hand-written code. That makes people nervous, and it should. But the danger isn't probabilistic generation. The danger is quiet failure.

Here's what I mean. When you generate code instead of writing it, you lose the incidental knowledge that comes from typing every character. You lose the friction that forces you to understand. You can produce systems that work without ever knowing why they work.

That's the legitimate fear, and it's a real failure mode. I've seen teams drowning in generated code they don't understand, systems that function but can't be debugged, abstractions that exist because an LLM suggested them rather than because they serve a purpose.

But the answer isn't to reject generation. The answer is to relocate the discipline.

Generative systems only work if invariants are explicit rather than implicit. Interfaces must be real contracts, not incidental boundaries. Evaluation must be ruthless. Failures must be loud and immediate. The engineer's job shifts from typing code to specifying intent and verifying outcomes.

What does this look like in practice? One possibility: You write the tests and the LLM generates implementations. If the tests don't pass, the code doesn't ship.

This is test-first development with a different author for the implementation. The discipline I learned in 1999 turns out to be exactly the discipline that makes AI-assisted development work. The rigor relocated from who writes the code to what the code must satisfy. The tests don't care whether a human or a machine produced the implementation. They care whether it behaves correctly.

The pattern is: probabilistic inside, deterministic at the edges.
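Here is a minimal sketch of that gate, under the assumption that candidates arrive as plain callables. The two candidates below are hand-written stand-ins for generated implementations of a hypothetical whitespace normalizer, and all names are illustrative.

```python
from typing import Callable, Iterable, Optional, Sequence

Implementation = Callable[[str], str]
Check = Callable[[Implementation], bool]


def _passes(check: Check, candidate: Implementation) -> bool:
    try:
        return bool(check(candidate))
    except Exception:
        return False  # a candidate that crashes fails the check


def first_passing(candidates: Iterable[Implementation],
                  checks: Sequence[Check]) -> Optional[Implementation]:
    """Return the first candidate that passes every check, else None."""
    for candidate in candidates:
        if all(_passes(check, candidate) for check in checks):
            return candidate
    return None  # nothing ships


# Human-written checks: the deterministic edge.
checks = [
    lambda impl: impl("  a   b ") == "a b",
    lambda impl: impl("") == "",
    lambda impl: impl("already clean") == "already clean",
]


# Stand-ins for generated candidates: the probabilistic inside.
def candidate_a(text: str) -> str:
    return text.strip()  # trims the ends but leaves inner runs of spaces


def candidate_b(text: str) -> str:
    return " ".join(text.split())


shipped = first_passing([candidate_a, candidate_b], checks)
```

With these checks, `candidate_a` is rejected and `candidate_b` ships. If no candidate passes, `first_passing` returns `None` and nothing ships, which is the behavior the gate exists to guarantee.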

This is harder than it sounds. Specifying intent precisely enough that a machine can generate correct implementations is not easier than writing code. It's a different skill, and in some ways a more demanding one. You have to know what you actually want. You have to be able to recognize when you've gotten it. You can't hide behind activity.

Cheap generation without strict judgment isn't a new paradigm. It's abdication.

I've been experimenting with frameworks that treat evaluation as a first-class system component, not an afterthought. Generation can be flexible. It can even be probabilistic. But evaluation must be rigid. Systems must fail visibly when they drift from intent. The comfort of working code that you don't understand is precisely the comfort you have to refuse.

What This Means for Practice

If you're working with generative AI now, the question to ask yourself is: where did the rigor go?

If you removed hand-written code but didn't add explicit invariants, you lost rigor. If you're generating implementations without rigorous evaluation, you lost rigor. If you're accepting code because it runs rather than because you understand it, you lost rigor.

The engineers who thrive in this environment will be the ones who relocate discipline rather than abandon it. They'll treat generation as a capability that demands more precision in specification, not less. They'll build evaluation systems that are harder to fool than the ones they replaced. They'll refuse the temptation to mistake velocity for progress.

The Throughline

Across decades of software evolution, the same misunderstanding keeps recurring. Constraint removal is mistaken for loss of rigor.

But what actually happens, when things go well, is rigor relocation.

Control doesn't disappear. It moves closer to reality.

XP taught me this. Dynamic languages reinforced it. Continuous deployment reinforced it again. Now generative systems are teaching it to a new generation of engineers, whether they realize it or not.

The lesson is always the same. When something looks like recklessness, look for where the discipline moved. If you can't find it, that's when you should worry. If you can find it, you're probably looking at the future.

If generation gets easier, judgment must get stricter. Otherwise, you're not engineering anymore.

That was the real lesson of Extreme Programming before it got diluted into a brand. It's the same lesson now. And this time, the velocity of change means we don't have years to figure it out.

https://aicoding.leaflet.pub/3mbrvhyye4k2e

The System Is the Asset

Jan 5, 2026 Updated Jan 5, 2026

Why Regeneration Does Not Mean Starting Over

We’ve spent decades talking as if “the system” and “the codebase” were the same thing.

They are not.

A system is defined by its behavior, its interfaces, its data, and its invariants. Code is just one way, the historically dominant way, of expressing those things.

When people hear “throw the code away” and assume “throw the system away,” they are conflating two very different acts. That conflation is the source of most of the resistance to these ideas. So let’s be precise about the distinction.

What Actually Persists

Look at any system that has survived for a long time, not because it was beautiful, but because it worked.

What endured was never the exact implementation, the original language, or the clever abstractions. What endured was stable interfaces, well-understood behavior, data continuity, and a clear sense of what must not break.

The system’s identity lived outside the code.

The code was replaced far more often than people like to admit, sometimes explicitly, sometimes by accretion. The system survived because something else held it together.

In retrospect, this was always true. We just did not have the tools or the economics to act on it deliberately.

That something else is what we should be designing for.

Local Replacement, Not Global Amnesia

No serious architecture advocates “start over every time.” That idea collapses under even casual scrutiny.

What does work, and has worked for a long time, is targeted replacement behind stable boundaries.

This is the same logic that made immutable infrastructure viable. You do not throw away the service; you replace the instance. Identity lives at the service boundary, not the machine.

Applying this to software means the system remains intact. The contracts remain intact. The behavior remains intact. The data remains intact. Only the mechanism changes.

This also means something crucial: you cannot regenerate what you have not yet defined. For legacy systems, the first act is not rewriting. It is extraction.

We already accept this model everywhere else in computing. The question is whether we are ready to accept it for code itself.

Why the Outsourcing Analogy Fails

A common objection goes like this: “We could have rewritten code cheaply for decades. We tried that with outsourcing. It failed.”

That history matters. But it is being misapplied.

The failure mode of large-scale outsourcing was not that code was rewritten. It was that system knowledge lived in mutable code and in human heads. The moment supervision stopped, intent was lost, assumptions drifted, and nobody could tell whether the system was still correct.

That was not a failure of regeneration. It was a failure to externalize system memory.

That memory has to live somewhere durable: machine-readable specifications, comprehensive test suites, explicit contract definitions. In outsourcing, that memory remained implicit and social. In regenerative systems, it must be explicit and executable.

Regeneration without durable system anchors is chaos. Regeneration with them is not.

AI does not change this dynamic. It makes it unavoidable. When code becomes cheap to produce, the question of where system identity lives stops being theoretical.

What This Looks Like in Practice

Consider a payment processing service. What is the system, actually?

It is not the Python or Go or Java that handles the requests. The system is:

The contract: these endpoints accept these inputs and produce these outputs
The invariants: a charge is never duplicated, a refund never exceeds the original amount, ledger entries always balance
The operational envelope: p99 latency under 200ms, availability above 99.95%
The data: transaction records, account states, audit logs

This is why schema evolution becomes the true constraint, not code preservation.

You could rewrite the implementation from scratch tomorrow. If the new code honors those contracts, preserves those invariants, meets those operational requirements, and maintains data continuity, you still have the same system.
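As a rough sketch, the three invariants above can be written as checks over recorded data, independent of whichever implementation produced it. The record types are hypothetical. Two interpretations here are my assumptions: duplicates are detected by an idempotency key, and "never exceeds" applies to the cumulative refunds against a charge.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Charge:
    charge_id: str
    idempotency_key: str
    amount: Decimal


@dataclass(frozen=True)
class Refund:
    charge_id: str
    amount: Decimal


@dataclass(frozen=True)
class LedgerEntry:
    account: str
    amount: Decimal  # signed; a balanced ledger sums to zero


def invariant_violations(charges, refunds, ledger) -> list[str]:
    """Return a description of every violated invariant (empty if none)."""
    violations = []

    # Invariant 1: a charge is never duplicated.
    seen_keys = set()
    for charge in charges:
        if charge.idempotency_key in seen_keys:
            violations.append(f"duplicate charge for key {charge.idempotency_key}")
        seen_keys.add(charge.idempotency_key)

    # Invariant 2: refunds never exceed the original amount.
    original = {charge.charge_id: charge.amount for charge in charges}
    refunded = {}
    for refund in refunds:
        refunded[refund.charge_id] = (
            refunded.get(refund.charge_id, Decimal("0")) + refund.amount
        )
    for charge_id, total in refunded.items():
        if total > original.get(charge_id, Decimal("0")):
            violations.append(f"refunds exceed original amount for {charge_id}")

    # Invariant 3: ledger entries always balance.
    if sum((entry.amount for entry in ledger), Decimal("0")) != 0:
        violations.append("ledger entries do not balance")

    return violations
```

Nothing in these checks mentions Python, Go, or Java on the service side. They can run against the old implementation and the regenerated one alike, and the results are directly comparable.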

The customer does not experience “new code.” They experience the same service, because the service was never the code.

This is what it means to treat the system boundary as the durable artifact.

Making a system safe to regenerate means specifying behavior independently of implementation, making interfaces explicit and enforced, making invariants testable, observing runtime behavior continuously, and surfacing failure modes quickly.

None of that requires preserving code. All of it requires preserving meaning.

Fresh Code Is Not the Risk

The discomfort with “fresh code” is understandable, but misplaced.

What people actually fear is undetected behavior change, performance regressions, security regressions, and silent drift. Those failures are caused by unobserved change, not by newness.

A system with stable contracts, strong evaluations, continuous monitoring, and clear rollback paths can safely tolerate very fresh code. A system without those things is dangerous even if the code is ten years old.

Age is not stability. Visibility is.

Where the Asset Lives

This is the crux of the argument.

The asset is not the code. The asset is the system’s ability to remain coherent while its internals change.

That ability lives in interfaces, invariants, evaluations, and operational discipline. Code is a consumable input to that process.

Treating code as the asset made sense when replacing it was expensive. Treating it that way now creates fragility, not safety.

The distinction between system and implementation is what separates regenerative architectures from reckless ones. It is also the difference between software that decays under change and software that endures because it can change.

https://aicoding.leaflet.pub/3mbp5ukeuzs22

Conceptual Mass and the Compaction Discipline

Jan 2, 2026 Updated Jan 2, 2026

As I mentioned in a previous post, at Wunderlist, we had a rule: any new service had to be "this big", a constraint I'd demonstrate by holding my fingers a few inches apart. The metric wasn't about lines of code. It was about replaceability.

If a service was small enough to rewrite in a day, it couldn't accumulate the kind of complexity that makes systems brittle. That rule was about resisting growth. Not preventing change but resisting mass.

Every software system naturally grows. When change is easy and addition is cheap, structure accumulates unless something pushes back. For most of software history, that counterforce was human effort. Writing code was slow. Adding complexity hurt. Growth had friction.

Generative AI removes that friction.

Without an opposing discipline, AI doesn't just accelerate development. It accelerates bloat. This post is about the discipline that prevents success from turning into system weight.

Accumulation Is the Default Failure Mode

In AI-accelerated systems, expansion is the path of least resistance. Generation is cheap. Preservation is emotionally easy. Deletion requires justification. Think about how many times you've seen commented-out code in a legacy codebase where someone couldn't bring themselves to delete it outright even though it was no longer used. That's the psychology we're dealing with here.

Modern LLM-driven workflows strongly favor addition: new features appear instantly, glue code materializes, abstractions proliferate because the model has seen them before. Edge cases get special handling instead of root-cause fixes. "Temporary" code survives because it works.

None of this requires bad engineers. It barely requires engineers at all.

If you do nothing, your system will grow until it becomes unmanageable. This was true before AI, but the timeline has collapsed. What used to take years of drift now happens in months of "high-velocity" shipping.

Conceptual Mass

Lines of code are a distraction. What actually matters is conceptual mass—the weight of ideas a system asks you to hold in your head.

Conceptual mass is the sum of distinct concepts, invariants, public interfaces, dependencies, and exception paths. It is the number of things a human, or an AI, must understand to make a safe change.

AI is exceptionally good at increasing conceptual mass silently. Every generated abstraction, every "clean" separation of concerns, every helper function adds weight. The code passes the linter. The tests pass. The system gets heavier.

The Compaction Discipline exists to reduce conceptual mass relentlessly.

Compaction Is Not Cleanup

Most teams think about size reduction as hygiene: occasional refactors, technical-debt sprints, cleanup tickets that sit in the backlog. That framing is wrong.

In theory, refactoring can reduce conceptual mass. In practice, it rarely does. Most refactoring reorganizes existing structure without challenging whether that structure should exist at all.

Refactoring is reorganizing the closet.

Compaction is realizing you don't need the closet.

Compaction is not maintenance. It is structural pressure. It is the deliberate, continuous application of force to keep a system's conceptual mass proportional to its purpose.

If your system gets more complex every time it gets more capable, you are losing.

What Compaction Looks Like

Removing code often accompanies compaction, but deletion is incidental. The goal is not fewer lines. The goal is less surface area.

AI loves to hallucinate architecture. It will suggest a Strategy pattern, a Factory, and an Interface for a feature that could be a single if statement.

Expansion is keeping those files because "it's best practice."

Compaction is deleting them because the distinction doesn't pay rent.
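
To make the contrast concrete, here is a rough sketch in Python. The shipping-fee feature, the class names, and the 50 threshold are all hypothetical, invented for illustration. The only point is that both versions have the same observable behavior, and one of them asks you to hold four extra concepts in your head.

```python
from abc import ABC, abstractmethod


# Expanded version: an interface, two strategies, and a factory
# for what is really one decision. All names are hypothetical.
class ShippingStrategy(ABC):
    @abstractmethod
    def fee(self, order_total: float) -> float: ...


class FreeShipping(ShippingStrategy):
    def fee(self, order_total: float) -> float:
        return 0.0


class FlatShipping(ShippingStrategy):
    def fee(self, order_total: float) -> float:
        return 5.0


class ShippingStrategyFactory:
    @staticmethod
    def for_order(order_total: float) -> ShippingStrategy:
        return FreeShipping() if order_total >= 50 else FlatShipping()


def shipping_fee_expanded(order_total: float) -> float:
    return ShippingStrategyFactory.for_order(order_total).fee(order_total)


# Compacted version: the same rule as a single if statement.
def shipping_fee(order_total: float) -> float:
    if order_total >= 50:
        return 0.0
    return 5.0
```

If a second or third shipping rule ever shows up with genuinely different behavior, the abstraction may start to pay rent. Until then it is weight.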

Successful compaction looks like fewer abstractions doing more work. Collapsed layers. Eliminated special cases. Simpler dependency graphs. Clearer boundaries. Smaller interfaces.

Code disappears because it no longer earns its keep. Sometimes the code stays, but the conceptual mass drops, because two ideas become one and the mental model shrinks.

The question is not "can we delete this?" It's "does this concept justify its existence?"

Architecture as Compaction

At Wunderlist, we built what people would now call a microservices architecture, but we thought of it as a deliberately dumb architecture.

The industry focuses too much on "microservices" and not enough on "architecture." That's why microservices get a bad rap. Our system worked because it was simple to the point of boredom.

We organized around nouns, not verbs. Users, lists, tasks, comments, each owned by exactly one service. Operations were almost entirely CRUD. Communication happened through exactly two mechanisms: a standardized REST/JSON convention that every service spoke natively and exclusively, and a message bus that broadcast every mutation. That was it. No service-to-service RPC. No custom protocols. No internal APIs that only two services knew about.
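
A minimal in-memory sketch of that shape, in Python. This is not the Wunderlist code: the class names, the event dictionary layout, and the in-process bus are assumptions made for illustration, and the HTTP layer is left out entirely. It only shows the two ideas from the paragraph above: one service owns one noun and exposes CRUD, and every mutation is broadcast.

```python
from typing import Callable, Dict, List, Optional

Mutation = Dict[str, object]


class MessageBus:
    """In-memory stand-in for a bus that broadcasts every mutation."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Mutation], None]] = []

    def subscribe(self, handler: Callable[[Mutation], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, mutation: Mutation) -> None:
        for handler in self._subscribers:
            handler(mutation)


class ResourceService:
    """Owns exactly one noun, exposes only CRUD, broadcasts every write."""

    def __init__(self, noun: str, bus: MessageBus) -> None:
        self.noun = noun
        self._bus = bus
        self._rows: Dict[int, dict] = {}
        self._next_id = 1

    def create(self, fields: dict) -> dict:
        row = {**fields, "id": self._next_id}
        self._rows[row["id"]] = row
        self._next_id += 1
        self._emit("create", row)
        return dict(row)

    def read(self, row_id: int) -> Optional[dict]:
        row = self._rows.get(row_id)
        return dict(row) if row is not None else None

    def update(self, row_id: int, fields: dict) -> dict:
        row = self._rows[row_id]
        row.update({k: v for k, v in fields.items() if k != "id"})
        self._emit("update", row)
        return dict(row)

    def delete(self, row_id: int) -> None:
        row = self._rows.pop(row_id)
        self._emit("delete", row)

    def _emit(self, op: str, row: dict) -> None:
        # Hypothetical event shape: noun, operation, and a copy of the row.
        self._bus.publish({"noun": self.noun, "op": op, "data": dict(row)})


bus = MessageBus()
seen: List[Mutation] = []
bus.subscribe(seen.append)

tasks = ResourceService("task", bus)
task = tasks.create({"title": "write post"})
tasks.update(task["id"], {"title": "publish post"})
tasks.delete(task["id"])
```

Because every service looks like this, a replacement only has to honor the same four operations and emit the same events. There is nothing else to preserve.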

We didn't choose this approach because we loved distributed systems. We chose it because it enforced replaceability. When a service became too heavy—too much conceptual mass—we didn't refactor it. We deleted it and replaced it with something simpler. Or faster. Or cheaper to run. Because the architecture was dumb, rewriting was cheaper than preserving complexity.

The architecture gave everything exactly one place to go. Duplication was obvious. Special cases had nowhere to hide.

The specifics don't matter. The constraint does. You don't need microservices to do this. You can practice compaction in a monolith by enforcing modular boundaries that are ruthless about dependency direction and ownership. The technology is incidental (though in my own experience, separation by process boundary makes the modularity more explicit). What matters is designing systems where bloat has no natural home.

Optionality

Compaction buys you more than cleanliness. It buys you options.

A compact system is cheaper to regenerate. It fits inside bounded reasoning contexts. It adapts to new languages and frameworks because there's less to port. It is easier to audit. It has a smaller blast radius when it fails.

This is why the most durable legacy systems are often boring. They didn't grow clever. They resisted the urge to solve tomorrow's problems today.

The Discipline, Stated Plainly

Any system that does not actively compress will inevitably bloat. AI does not change this law. It just accelerates it.

We are moving from an era where code seemed like an asset to an era where code is more clearly a liability, and only functionality (and arguably its architecture) is the asset.

The Compaction Discipline is the counterforce: continuous structural pressure to keep conceptual mass proportional to purpose.

Generation is cheap. Compression is leverage.

https://aicoding.leaflet.pub/3mbhnolyzds2d

Immutable Infrastructure, Immutable Code

Dec 30, 2025 Updated Dec 30, 2025

Why "Never Upgrade in Place" Now Applies to Software

In 2013, I wrote about trashing servers and burning code. The argument was simple: systems that mutate while running accumulate state, history, and uncertainty in ways humans can't reason about. When something breaks, nobody knows which change caused it or what the system actually is anymore.

So we stopped patching servers and started replacing them. We built machines that could burn down and rise again, identical in behavior, without human intervention. The server wasn't the thing. The capability to regenerate was the thing.

That was an infrastructure principle, but it has always felt true to me for software. I was CTO of the company behind the popular Wunderlist productivity tool at the time, and I came up with a simple set of rules for choosing the technologies we deployed at work:

anyone can decide to use any new language or framework they want, but
it must work with our build system,
it must work with our deployment system,
they must find at least one other person on the team to work on it with them and support it if necessary, and
(this is the most important part) the code has to be no more than "this big", which I'd say while holding up my hand with my fingers spread apart a few inches.

That last part constrained the code in such a way that the worst thing that could happen with a new language or technology is that it crashed, nobody on call was able to fix it, and it would be trivial to rewrite and replace. And we did that sometimes.

Code could even be treated like cells. As humans, parts of our biological material are dying all the time, yet the system (our body, brain, mind) remains.

So today, if code can be regenerated cheaply, perhaps upgrading code in place is the antipattern.

Infrastructure Figured This Out First

Immutable infrastructure wasn't adopted because it was elegant. It was adopted because mutable systems failed in ways that were hard to diagnose, hard to reproduce, and hard to roll back. Snowflake servers. Configuration drift. Hand-applied fixes. Tribal knowledge baked into machines nobody could recreate.

Replacing machines instead of fixing them solved this not by making systems smarter, but by making them simpler to reason about. Each deployment was a clean slate. Each artifact was knowable.

The key insight was almost more economic than technical: mutation accumulates hidden cost faster than replacement does.

That insight is now true for application code.

Editing Code Is Mutation

When you edit code in place, you're doing the software equivalent of SSHing into a production server and tweaking a config file.

You're assuming you understand the full state of the system. You're assuming the change is local, that history doesn't matter, that side effects are predictable.

Those assumptions were always shaky. They're becoming untenable. As code is generated more rapidly, whether by humans, AI, or both, the mutation rate increases while the understanding rate stays flat or declines.

Every in-place edit is a drift event. AI just makes this visible by compressing the timeline.

Mutable Code Accumulates Entropy

In-place modification has a hidden cost profile. Incremental edits entangle intent with the sequence of changes that produced them. Code gets layered atop code (this is why developers often prefer to use git rebase instead of git merge). Local fixes obscure global behavior. Understanding requires replaying the evolution of the codebase in your head — archaeology instead of engineering.

This is exactly how legacy systems are born. Not through age, but through mutation. A system becomes legacy when understanding it requires historical knowledge that isn't encoded anywhere except the code itself.

The tragedy is that teams recreate this failure mode faster with AI, because mutation feels cheap while understanding quietly becomes expensive. You can generate a thousand lines in seconds. But the moment you start editing those lines, you've created an artifact that can only be understood historically. You've created brittle legacy code in an afternoon.

Replacing code avoids this entirely.

The Phoenix Principle

What made immutable infrastructure work wasn't really about servers. It was about a property: the ability to burn something down and have it rise again, identical in behavior, without human intervention or institutional memory.

That property—call it the phoenix principle—is what makes systems understandable at scale. Not documentation. Not code comments. Not the engineer who remembers why that conditional exists. The ability to regenerate from specification.

Applied to code, this means: if you can't regenerate a component from its specification and evaluation criteria, that component is not well-defined enough to exist.

That's not cruelty. That's feedback. The fire tells you what you actually knew versus what you only thought you knew.

Replace-over-modify systems behave differently. Each regeneration is explicit. Each deployment is intentional. Rollback is trivial. Drift cannot accumulate. The system burns and is reborn, but its identity persists because its behavior is externally defined.
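
One way to picture "regenerate from specification and evaluation criteria" is as an acceptance gate. The sketch below is Python and entirely hypothetical: the component (a slug generator), the three evaluations, and the two candidate implementations are stand-ins I made up. In a real setup the candidates would come from a generator rather than being written by hand.

```python
from typing import Callable, Iterable, List, Optional

Implementation = Callable[[str], str]
Evaluation = Callable[[Implementation], bool]


# The specification, expressed as checks at the component boundary.
def output_is_lowercase(impl: Implementation) -> bool:
    out = impl("Hello World")
    return out == out.lower()


def output_has_no_spaces(impl: Implementation) -> bool:
    return " " not in impl("a b  c")


def is_idempotent(impl: Implementation) -> bool:
    once = impl("Some Title")
    return impl(once) == once


EVALUATIONS: List[Evaluation] = [
    output_is_lowercase,
    output_has_no_spaces,
    is_idempotent,
]


def accept(
    candidates: Iterable[Implementation], evaluations: List[Evaluation]
) -> Optional[Implementation]:
    """Return the first candidate that passes every evaluation, else None."""
    for candidate in candidates:
        if all(check(candidate) for check in evaluations):
            return candidate
    return None


# Two stand-ins for separately generated implementations.
def candidate_a(title: str) -> str:
    return title.replace(" ", "-")  # forgets to lowercase


def candidate_b(title: str) -> str:
    return "-".join(title.lower().split())


chosen = accept([candidate_a, candidate_b], EVALUATIONS)
```

Here `chosen` ends up being `candidate_b`, because `candidate_a` fails the lowercase check. If no candidate passes, the gate returns `None`, which is the feedback: either the generator is not good enough or the specification is not yet something you can regenerate from.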

Why This Works Now

Historically, we avoided full replacement because writing code was expensive, coordination was slow, re-testing everything was painful, and human review was the bottleneck.

AI changes the cost of generation. Testing is automated. Coordination happens through interfaces.

But the deeper shift is this: comprehension became the bottleneck.

The entire history of software engineering has been about making code easier to understand. Style guides, design patterns, clean code, self-documenting functions — all of it assumed that humans would read and reason about implementations. We optimized for readability because reading was mandatory.

Immutable code sidesteps that problem. If a component can be regenerated from spec, understanding its implementation is optional. You need to understand the contract, the interface, the expected behavior. You don't need to understand how it achieves that behavior, because the "how" is transient.

The expensive thing left is defining what you want. Comprehension of implementations becomes a debugging activity, not a maintenance activity.

What Survives Replacement

If code is immutable, something else must carry continuity.

That something is: interfaces, contracts, evaluations, monitoring, and data. These are the stable layers. Code is a transient expression of them.

This mirrors infrastructure perfectly. AMIs mattered less than APIs. Containers mattered less than contracts. Servers mattered less than services.

The thing you cared about was never the machine. It was what the machine did and how you could verify it was doing it correctly.

Software is catching up to the same realization. The code is not the asset. The specification and the evaluation are the asset. Code is just the current rendering.

Objections

"This is wasteful." Mutation is wasteful. It just hides the cost in future debugging, onboarding, and incident response. Replacement is explicit cost with bounded risk.

"We'll lose optimizations." If an optimization matters, encode it as a constraint or invariant. If you can't express it formally, it probably wasn't real value — it was accident.

"What about institutional knowledge?" This is the real anxiety. The code embodies decisions nobody wrote down. But that's precisely the problem immutable code solves. If knowledge only exists in the implementation, it's not knowledge. It's risk. Regeneration forces you to make the implicit explicit, or accept that it wasn't essential.

"This won't work for large systems." Large systems already replace infrastructure constantly. Code is next. The hard part is decomposition, not replacement.

"This breaks developer intuition." So did containers. So did CI. So did version control. So did every advance that traded local convenience for systemic clarity.

The Rule, Updated

The old rule was: never upgrade infrastructure in place.

The new rule is: never upgrade code in place if you can regenerate it instead.

Just like SSHing into a server and tweaking something in production is still possible but clearly undesirable, editing code is now a last resort, a sign that regeneration failed, that your specification was incomplete, that your evaluations weren't sufficient. It's a debugging activity, not a development activity.

The Payoff

Immutable code yields predictable deployments, lower cognitive load, cleaner rollback, easier audits, faster evolution, and smaller blast radius.

But the real payoff is psychological. You stop being afraid of change. You stop tiptoeing around legacy decisions. You stop asking "what will this break?" and start asking "does this pass the evaluation?"

The code becomes a renewable resource instead of a fragile artifact.

Infrastructure taught us that mutability was the enemy of understanding.

AI teaches us the same lesson again — higher up the stack.

If you're still editing AI-generated code in place, you're reliving the worst era of configuration drift, just faster. You're creating legacy systems in days instead of years.

Burn it. Regenerate it. Trust what survives the fire.

https://aicoding.leaflet.pub/3mbaguyrjek2g

Evaluations Are the Real Codebase

Dec 29, 2025 Updated Dec 29, 2025

Why behavior outlives implementations

If deleting your codebase feels terrifying, your evaluations are insufficient. That's not a moral failure. It's a technical one—and in the age of AI-assisted development, it's an increasingly expensive one.

Here is the shift: language models have made code generation cheap. Not free, not perfect, but cheap enough that regenerating a service is often faster than understanding and modifying it. This changes what counts as a durable asset. Code isn't it. Code is now a materialized view of understanding—useful while current, disposable when stale.

The durable asset is the thing that lets you regenerate with confidence: evaluations that encode what the system must do, independent of how any particular implementation does it.

Code is Cache

Traditional software culture treated code as the memory of the system. It encoded intent, explained decisions, and preserved behavior over time. Protecting it was rational because replacing it was expensive.

That expense has collapsed. When a model can produce working code from a description in minutes, the calculus inverts. Keeping code around "just in case" stops being wisdom and starts being hoarding. The implementation is a cache. It's a snapshot of your current understanding, useful for running in production, not precious in itself.

If you delete a codebase and can't confidently regenerate it, that's not a tragedy. It's a diagnosis. The problem wasn't the deletion. The problem was that nothing important lived outside the code. The intent, the constraints, the behavioral requirements were all implicit in the implementation rather than explicit in artifacts that survive the implementation's death.

The Spectrum of Test Durability

Most engineers, asked how they ensure a system works, answer "tests." But tests vary enormously in what they actually protect.

Consider a unit test that verifies a specific function's behavior by calling it with specific inputs and checking specific outputs. This test is coupled to that function's existence, its signature, its language. Rewrite the service in a different language and the test doesn't just fail. It can't run at all. The test's lifetime is bounded by the implementation's lifetime.

This isn't because the test was poorly written. Even exemplary TDD—testing behavior over structure, focusing on public interfaces—produces tests that assume the codebase continues to exist in the same language with the same entry points. That assumption was safe when reimplementation was rare. It's not safe when regeneration is routine.

The alternative is tests specified at a boundary that survives reimplementation:

Invariants are properties that hold regardless of implementation. "Balances never go negative." "Events maintain causal ordering." "Round-trip serialization is lossless." These can be verified against any implementation in any language.

Contracts specify what crosses boundaries between components. If service A sends this shape, service B returns that shape. The contract survives reimplementation of either service.

Property-based tests verify behavioral properties across generated inputs. "Sorting is idempotent." "Encryption and decryption are inverses." These encode what must be true, not how to make it true.

End-to-end behavioral checks verify the system's observable outputs. Given this input, the system produces output in this class. The internal path doesn't matter.

These are durable evaluations. They encode intent at a level of abstraction that outlives any particular implementation. A codebase can be deleted and regenerated; if these evaluations pass, the system still works.
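
As a rough illustration of an invariant checked at a boundary, here is a Python sketch for "balances never go negative." The ledger interface (`deposit`, `withdraw`, `balance`) and both implementations are hypothetical. The check only touches those three calls, so it would apply unchanged to any replacement that exposes them; across languages you would drive the same sequence through an API instead of a class.

```python
import random


class InMemoryLedger:
    """One throwaway implementation behind the three-call boundary."""

    def __init__(self) -> None:
        self._balance = 0

    def deposit(self, amount: int) -> None:
        self._balance += amount

    def withdraw(self, amount: int) -> bool:
        if amount > self._balance:
            return False  # refuse overdrafts
        self._balance -= amount
        return True

    def balance(self) -> int:
        return self._balance


class OverdraftingLedger(InMemoryLedger):
    """A regenerated version that quietly dropped the overdraft check."""

    def withdraw(self, amount: int) -> bool:
        self._balance -= amount
        return True


def balances_never_go_negative(make_ledger, runs=200, steps=50, seed=0) -> bool:
    """Drive random deposits and withdrawals; fail on any negative balance."""
    rng = random.Random(seed)
    for _ in range(runs):
        ledger = make_ledger()
        for _ in range(steps):
            amount = rng.randint(1, 100)
            if rng.random() < 0.5:
                ledger.deposit(amount)
            else:
                ledger.withdraw(amount)
            if ledger.balance() < 0:
                return False
    return True
```

The first implementation passes and the second fails, without the check knowing anything about how either one stores its state.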

The Cost of Durability

This is where most manifestos would stop: having sold the destination, they ignore the hike. But writing durable evaluations is hard. Genuinely hard. Harder than writing the code they specify.

Identifying true invariants requires deep domain understanding. Most systems have implicit invariants that no one has articulated. They're embedded in code that "just works" without anyone knowing exactly why. Extracting these invariants is archaeological work.

Property-based testing requires thinking in universals rather than examples. Instead of "when I call sort([3,1,2]), I get [1,2,3]," you must specify "for all lists, sorting produces a list with the same elements in non-decreasing order." This is a different mental motion than example-based testing, and most engineers haven't practiced it.
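
Here is what that universal statement can look like as a check, sketched in Python. To keep it self-contained I hand-roll the random input generation; a property-testing library such as Hypothesis would do this better, with smarter generators and shrinking of failing cases. The function names, trial count, and value ranges are my own choices.

```python
import random
from collections import Counter


def is_valid_sort(sort_fn, xs) -> bool:
    """The property: same elements, in non-decreasing order."""
    out = sort_fn(list(xs))
    same_elements = Counter(out) == Counter(xs)
    non_decreasing = all(a <= b for a, b in zip(out, out[1:]))
    return same_elements and non_decreasing


def check_sort_property(sort_fn, trials=500, seed=0) -> bool:
    """Try the property on many generated lists, including empty ones."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not is_valid_sort(sort_fn, xs):
            return False
    return True
```

Note that the check never mentions an algorithm. `check_sort_property(sorted)` passes, while a function that returns its input untouched, or one that sorts but drops duplicates, gets rejected by the same property.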

Formal contracts require precision that natural language resists. "The API returns user data" is not a contract. "The API returns a JSON object with fields id (string, non-empty), email (string, valid RFC 5322), and created_at (ISO 8601 timestamp)" is a contract. The gap between these is where bugs hide.
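
A hedged sketch of that contract as an executable check, in Python with only the standard library. Two loud caveats: the email pattern is a deliberately simplified stand-in and does not implement RFC 5322, and `datetime.fromisoformat` accepts only a subset of ISO 8601, with the exact subset depending on the Python version. The function name and error messages are hypothetical.

```python
import re
from datetime import datetime

# Simplified stand-in: real RFC 5322 address syntax is far more permissive.
_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def contract_violations(payload: dict) -> list:
    """Return a list of contract violations; an empty list means it conforms."""
    errors = []

    user_id = payload.get("id")
    if not isinstance(user_id, str) or not user_id:
        errors.append("id must be a non-empty string")

    email = payload.get("email")
    if not isinstance(email, str) or not _EMAIL.match(email):
        errors.append("email must be a string in address format")

    created = payload.get("created_at")
    try:
        if not isinstance(created, str):
            raise ValueError("not a string")
        # Older Python versions reject a trailing "Z", so normalize it first.
        datetime.fromisoformat(created.replace("Z", "+00:00"))
    except ValueError:
        errors.append("created_at must be an ISO 8601 timestamp")

    return errors
```

A check like this runs against the response body, not the code that produced it, so it keeps working when the service behind the API is rewritten.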

The investment is real. But the alternative (keeping code around because you're afraid to delete it, because nothing external specifies what it does) is also an investment. You pay it in cognitive load, in context-window costs, in the compounding complexity of systems that only grow.

The question isn't whether durable evaluations are expensive. The question is whether they're cheaper than the alternative. As regeneration gets cheaper, the answer increasingly favors evaluation.

Why the Boundary Matters

The distinction between ephemeral and durable tests reduces to one question: is the test specified at a boundary that survives reimplementation?

Tests against internal functions, private methods, specific call sequences: these are specified at the implementation boundary. They verify decisions that might change. Their lifetime is coupled to the implementation's lifetime.

Tests against inputs and outputs, observable behavior, interface contracts: these are specified at the system boundary. They verify obligations the system owes the outside world. The implementation can change completely as long as these obligations are met.

This isn't a new idea at all! Information hiding, API design, coupling versus cohesion: the software engineering literature has understood interface boundaries for fifty years. What's new is the economic weight. When regeneration was expensive, careful interface specification was good hygiene. When regeneration is cheap, it's the difference between systems that can evolve and systems that calcify.

A simple check for you to apply: if reimplementing your service in a different language would invalidate your test suite, your tests are specified at the wrong boundary.

Monitoring as Continuous Evaluation

Even rigorous evaluations only verify intent at a point in time. They don't verify that production behavior matches intent continuously.

This matters more as regeneration frequency increases. Each regeneration is an opportunity for drift: subtle changes in behavior that pass all explicit checks but diverge from baseline in ways no one anticipated. Monitoring catches what tests miss.

The relevant signals include standard operational metrics (latency distributions, error rates, throughput) but also business metrics specific to each application: conversion rates, fraud detection accuracy, revenue per transaction, whatever invariants matter in your domain. And for AI-assisted systems, add inference cost per request, token usage patterns, and context window consumption. If a regenerated system passes all tests but doubles your API costs or quietly degrades decision quality, that's a failure your evaluations didn't catch.

Monitoring is not separate from evaluation. It's evaluation that runs continuously against reality rather than periodically against test fixtures.
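
To show what "evaluation that runs continuously" might mean mechanically, here is a small Python sketch. Everything in it is an assumption for illustration: the `Envelope` fields, the idea of evaluating one window of telemetry at a time, and the nearest-rank percentile. A real system would pull these numbers from its observability stack rather than from lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Envelope:
    """Operational limits attached to a spec. Values are placeholders."""

    p95_latency_ms: float
    max_error_rate: float
    max_cost_per_request: float


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile, using integer math for the rank."""
    ordered = sorted(samples)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate_window(
    latencies_ms: List[float],
    errors: int,
    requests: int,
    total_cost: float,
    envelope: Envelope,
) -> List[str]:
    """Compare one window of production telemetry against the envelope."""
    if not latencies_ms or requests <= 0:
        raise ValueError("empty telemetry window")
    violations = []
    if p95(latencies_ms) > envelope.p95_latency_ms:
        violations.append("latency: p95 exceeds the envelope")
    if errors / requests > envelope.max_error_rate:
        violations.append("errors: error rate exceeds the envelope")
    if total_cost / requests > envelope.max_cost_per_request:
        violations.append("cost: cost per request exceeds the envelope")
    return violations
```

The same function is a test when you feed it fixtures and a monitor when you feed it live windows. A regenerated service that passes every durable evaluation but doubles cost per request would show up here as a `cost` violation.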

The Real Codebase

Three tiers of evaluation, three lifetimes:

Ephemeral tests verify implementation decisions. Unit tests, structural assertions, mock-heavy integration tests. Useful during development, disposable when the implementation changes. Write them freely; delete them without guilt.

Durable evaluations verify behavioral intent. Property tests, contract tests, invariants, end-to-end checks. These survive reimplementation because they're specified at boundaries that survive reimplementation. They're expensive to write and worth the expense.

Live evaluations verify production reality. Monitoring, drift detection, anomaly alerts. These run continuously because intent and reality can diverge even when all explicit tests pass.

A system with only ephemeral tests cannot be safely regenerated. You don't know what behavior you're trying to preserve. A system with durable evaluations but no live evaluation will drift without warning. A system with all three can be deleted and rebuilt with confidence.

That confidence is the product. Code is a byproduct.

The real codebase is everything that lets you throw code away without fear: the properties that define correctness, the contracts that specify interfaces, the monitors that detect drift. If that set is empty, no amount of careful implementation will save you. If that set is rich, the implementation is just a detail, regenerable on demand, disposable without loss.

This is the promise of regenerative software. It requires investment in specification that most teams haven't made. It requires honesty about what your tests actually protect. And it requires accepting that the code you wrote yesterday might not exist tomorrow and that this is fine, because the behavior it encoded is preserved in artifacts that outlive it.

https://aicoding.leaflet.pub/3mb526js42k26

The Gradient of Trust

Dec 28, 2025 Updated Dec 28, 2025

Better shapes beat better prompts

If you've been using generative AI regularly for a while, you already know this feeling. There are classes of code you'll happily accept without even reading. A small, pure function. Statically typed inputs and outputs. A well-understood transformation. No I/O. No hidden state. No ambiguity. The AI writes it, you paste it in, and you move on with your life.

And then there's code that touches the network. Code that encodes business rules. Code that depends on unclear invariants, partial documentation, or "everyone knows how this works." That's where things get weird fast. You reread it. You test it. You argue with it. Sometimes you rewrite it entirely.

What's interesting isn't that these two poles exist. It's the gradient between them. Over time, developers build an intuition for where a piece of code sits on that gradient. Some code you trust immediately. Some code you trust only after careful review. Some code you never quite trust, no matter who wrote it.

Once you notice that gradient, an obvious question appears: how do we design systems so that more of the code lives on the side where trust is easy?

Constraints as Trust

One of the silly old jokes I'd tell about a system I worked on years ago was: "If you can make the Haskell system compile, it works." That's not actually true, of course, but it points at something real. A strong type system, purity by default, and explicit handling of effects dramatically shrink the space of possible mistakes. You trust the code not because you've verified it, but because the structure makes it hard to get wrong.

This was always valuable. AI makes it load-bearing.

If a function is small, pure, and tightly specified, it doesn't really matter whether it was written by a senior engineer, a junior engineer, or an LLM. The structure constrains the output. You trust it because trust is rational given the constraints. Conversely, if a component is large, stateful, and ambiguous, it doesn't matter who wrote it. You're paying for that complexity in review time, debugging time, and the nagging feeling that something might be wrong.

Two Strategies

This suggests two complementary approaches to system design.

First, structure systems so that more of the work can be expressed as simple, constrained transformations. Things you'd trust anyone to write without supervision. A data pipeline of pure, typed functions where each stage takes an input type and produces an output type. No hidden state, no ambient dependencies. Most code in such a system is trustworthy by construction, and—crucially—replaceable by construction. You can delete a stage and regenerate it, confident that if the types align, the behavior is probably correct.

Second, design the remaining messy parts so that failure is cheap, contained, observable, and reversible. Not everything can be pure and constrained. Some code genuinely needs to manage state or encode business rules that resist formalization. The goal isn't to eliminate this code but to quarantine it. Push it to the edges. Make it small. Surround it with monitoring. When it fails, the blast radius is limited.
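
The first strategy above can be sketched as a tiny pipeline. This is Python with type hints, and the event format, type names, and stages are all made up for illustration. One caveat: Python does not enforce these annotations at runtime, so the "if the types align" guarantee only holds if you run a static checker such as mypy. A language like Haskell gives you that check for free.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RawEvent:
    line: str


@dataclass(frozen=True)
class ParsedEvent:
    user: str
    amount_cents: int


@dataclass(frozen=True)
class UserTotal:
    user: str
    total_cents: int


def parse(raw: RawEvent) -> ParsedEvent:
    """Stage 1: pure function from RawEvent to ParsedEvent."""
    user, amount = raw.line.split(",")
    return ParsedEvent(user=user.strip(), amount_cents=int(amount))


def total_by_user(events: List[ParsedEvent]) -> List[UserTotal]:
    """Stage 2: pure aggregation, sorted by user for a stable output."""
    totals: Dict[str, int] = {}
    for event in events:
        totals[event.user] = totals.get(event.user, 0) + event.amount_cents
    return [UserTotal(user, total) for user, total in sorted(totals.items())]


def run_pipeline(raws: List[RawEvent]) -> List[UserTotal]:
    """Compose the stages; no hidden state, no ambient dependencies."""
    return total_by_user([parse(raw) for raw in raws])
```

Each stage can be deleted and regenerated on its own. The input and output types pin down most of what a replacement is allowed to do, and a handful of property checks on each stage covers much of the rest.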

Architectural Trust

There's a distinction here worth naming: code trust versus architectural trust.

Code trust asks whether a specific implementation is correct. Architectural trust asks whether the system is shaped so that correctness is easy and failure is survivable. You can have high code trust in a bad architecture. Every function is perfect, but the interactions are a nightmare. You can have high architectural trust with mediocre code. Individual functions might have bugs, but types prevent certain errors, tests catch others, and monitoring detects what slips through.

AI shifts the emphasis from code trust to architectural trust. When code is cheap to generate, the quality of any individual implementation matters less. What matters is whether the system is shaped so that cheap code is good enough.

The Real Leverage

The developers who thrive with AI won't be the ones who write the best prompts. They'll be the ones who design systems where prompts don't need to be perfect, because the system's structure does most of the work, and the AI is just filling in blanks that are hard to fill incorrectly.

When you can generate code freely, the bottleneck shifts to verification. Systems where most code needs careful review become expensive. Systems where most code is trustworthy by construction become cheap. The gradient of trust becomes a cost curve, and the systems that win are the ones where that curve slopes in the right direction.

The real leverage isn't better prompts. It's better shapes.

https://aicoding.leaflet.pub/3mb2qb6odxc2d

Compaction Is a Financial Strategy

Dec 27, 2025 Updated Dec 27, 2025

Why smaller codebases win in the AI era

The cheapest system in the AI era is not the one that never changes. It is the one whose parts can be cheaply regenerated because they are small and decoupled.

This claim sounds like architecture-conference wisdom, the kind of thing consultants say to justify rewrites. But something has shifted. The emergence of AI-assisted development has transformed code size from an aesthetic concern into a direct economic variable. Context windows have budgets. Tokens cost money. Every line of code you keep is a line you pay to process, again and again, every time you ask a model to reason about your system.

Compaction—the deliberate practice of making systems smaller—was always the quiet secret behind sustainable software. AI has simply made the economics impossible to ignore.

The Hidden Cost of Keeping Code

Most organizations dramatically underestimate how expensive it is to keep code. Not to write it—that cost is visible in salaries and sprints. Not to run it—that cost shows up in hosting bills. The hidden expense is keeping it: the ongoing cognitive and computational tax imposed by code that exists.

Consider what happens every time an engineer touches a large codebase. Before they can make a change, they must build a mental model of the relevant subsystems. This takes time—sometimes hours, sometimes days. The more code exists, the longer this ramp takes. Senior engineers hesitate to modify things they don't fully understand. Junior engineers make changes without understanding, introducing subtle bugs. The phrase "I'm not sure what this does, so I won't touch it" represents real operational risk, but it rarely appears in any budget.

AI compounds this problem in a new way. When a model assists with development, it reasons over whatever context you provide. Large codebases exceed context limits constantly, which means every prompt becomes a lossy compression of your actual system. The model sees fragments. It infers relationships. It guesses at conventions. Sometimes it guesses wrong, and you pay for those mistakes in debugging time.

There is a harder version of this objection that deserves acknowledgment: context windows are growing rapidly. Gemini offers two million tokens. Competitors are racing to match or exceed that figure. Why worry about code size when context is becoming effectively unlimited?

The answer is that capacity and quality are different things. Model accuracy tends to degrade as irrelevant context accumulates, regardless of window size. Retrieving relevant information from a massive context is itself a lossy process. More hay does not make the needle easier to find; it makes the search more expensive and less reliable. The constraint is not how much a model can hold but how well it can reason over what it holds. Smaller, cleaner inputs produce better outputs. This remains true whether the window is 100,000 tokens or 100 million.

What Legacy Systems Already Proved

None of this is entirely new. The software industry has been running a decades-long experiment on what makes systems survive, and the results point in a consistent direction.

Large systems fail for a boring reason: humans cannot reason about them. The so-called bus factor—the risk that key knowledge walks out the door when certain people leave—is usually described as a people problem. But it's really a surface-area problem. When only one person understands a system, it's almost always because the system is too big, too implicit, and too entangled for shared comprehension. The knowledge concentrated in that person's head is a symptom of architectural failure, not its cause.

The systems that survived longest tended to share certain characteristics: flat data models, explicit workflows, minimal abstraction, and code that repeated itself rather than hiding behind clever indirection. This last point requires clarification, because it seems to contradict the case for compaction. If repetitive code is good, doesn't that mean more lines, not fewer?

The distinction is between two kinds of complexity. Accidental complexity is bloat—code that exists because of historical accident, defensive layers accumulated over time, abstractions that obscure more than they clarify. Essential complexity is the irreducible difficulty of the problem domain itself. Compaction targets the former, not the latter. A system with explicit, even somewhat repetitive business logic can be smaller than a system with elaborate abstraction hierarchies, because the abstractions themselves consume space and impose cognitive load. The goal is not minimum character count. The goal is minimum semantic complexity: the smallest system that does the job while remaining comprehensible to both humans and machines.

Legacy systems that survived were not the ones built with the most sophisticated architectures. They were the ones that fit in people's heads.

The Politics of Deletion

If compaction is so valuable, why don't more organizations practice it? The technical answer—that deletion is risky and requires deep understanding—is part of the story. But the more important barrier is organizational.

Code has authors. Authors have feelings and careers. Managers who approved code have reputations attached to it. Deleting a system is not just a technical act; it is a political act that implicitly criticizes past decisions. This is why deprecation efforts so often stall. The engineer who proposes removing a subsystem must navigate a minefield of organizational sensitivities while also taking on technical risk. If the deletion goes wrong, they own the outage. If it goes well, the reward is invisible—the absence of problems that would have occurred otherwise.

There is also the Chesterton's Fence problem: code often exists for reasons that are no longer documented or understood. That strange conditional? It handles an edge case that caused a production incident four years ago. The seemingly redundant validation? It compensates for a bug in a third-party library that was never fixed. Deleting such code requires either deep institutional knowledge or the willingness to rediscover these constraints the hard way.

This is why compaction, in practice, is a senior-engineer activity. It takes experienced judgment to distinguish load-bearing complexity from accumulated sediment. The cost of that judgment is real. In the short term, it is often cheaper to work around old code than to remove it. The long-term costs of that choice are diffuse and easy to ignore until they become critical.

AI Sharpens the Imperative

What changes in the AI era is that the costs become measurable in ways they never were before.

Cognitive load was always real, but it was hard to quantify. How do you put a number on "the engineers are confused"? Token costs are different. Every unnecessary line of code increases inference cost: more tokens to load, more ambiguity to resolve, more paths for the model to evaluate. When you remove code, you can see the prompt shrink. You can measure the reduction in tokens billed per request. Deletion has ROI you can put in a spreadsheet.
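To make the spreadsheet claim concrete, here is a minimal sketch of the arithmetic. Everything in it is an assumption for illustration: the four-characters-per-token rule of thumb stands in for a real tokenizer, and the price, prompt volume, and token counts are made-up numbers, not figures from any provider.

```python
import math

# Assumption: roughly 4 characters per token. Real tokenizers vary by
# model and by programming language, so treat this as a coarse estimate.
CHARS_PER_TOKEN = 4


def estimate_tokens(source: str) -> int:
    """Rough token count for a piece of source text."""
    return math.ceil(len(source) / CHARS_PER_TOKEN)


def prompt_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost of sending `tokens` to a model once."""
    return tokens / 1_000_000 * usd_per_million_tokens


def monthly_savings(
    tokens_before: int,
    tokens_after: int,
    prompts_per_month: int,
    usd_per_million_tokens: float,
) -> float:
    """Input-token spend avoided per month after shrinking the context."""
    saved_tokens = tokens_before - tokens_after
    return prompt_cost(saved_tokens, usd_per_million_tokens) * prompts_per_month


# Hypothetical numbers: a 120k-token context compacted to 80k tokens,
# 5,000 prompts a month, 3 USD per million input tokens.
savings = monthly_savings(120_000, 80_000, 5_000, 3.0)
print(f"Estimated monthly savings: ${savings:,.2f}")  # about 600 USD
```

The sketch counts only input tokens. It says nothing about the harder-to-price gains, such as fewer wrong guesses by the model, which is where most of the argument above actually lives.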

The effect goes beyond cost. AI agents—systems that take actions based on model outputs—behave differently in large versus small codebases. In complex environments, agents get stuck in loops. They hallucinate libraries that don't exist. They make changes that break unrelated subsystems because they couldn't see the full dependency graph. Compaction is not just about efficiency; it's about reliability. Smaller systems produce more consistent agent behavior because there's less room for the model to get confused.

This creates a new kind of feedback loop. Teams that maintain compact codebases get more value from AI assistance. That increased value creates resources and motivation for further compaction. Teams with sprawling codebases struggle to use AI effectively, which means they have less capacity to clean things up. The gap between well-maintained and poorly-maintained systems will widen as AI capabilities improve.

The Design of Deletable Systems

Systems that can shrink are systems designed for deletion. This is not the same as systems designed for change, though the two overlap. The critical property is what might be called clear seams: boundaries between components that allow removal without collapse.

Loose coupling is often presented as an architectural virtue in its own right, a marker of good design. In the context of compaction, it's better understood as a prerequisite for deletion. A tightly coupled system cannot shrink because removing any part damages the whole. A loosely coupled system can shed components the way a healthy organization can lose employees—with adjustment, but without crisis.
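One way to see whether a seam is clear is to ask what else would break if a module disappeared. The sketch below assumes you already have a map of which modules import which. The module names are hypothetical, and a real system would have to extract this map from its own build or import graph.

```python
from collections import defaultdict


def dependents_of(deps: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert 'module -> modules it imports' into 'module -> modules importing it'."""
    inbound: dict[str, set[str]] = defaultdict(set)
    for module, imports in deps.items():
        inbound[module]  # make sure every module has an entry, even if empty
        for target in imports:
            inbound[target].add(module)
    return dict(inbound)


def blast_radius(module: str, deps: dict[str, set[str]]) -> set[str]:
    """Every other module that depends on `module`, directly or transitively."""
    inbound = dependents_of(deps)
    seen: set[str] = set()
    stack = [module]
    while stack:
        current = stack.pop()
        for parent in inbound.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    seen.discard(module)  # a cycle can lead back to the module itself
    return seen


# Hypothetical dependency map for illustration only.
deps = {
    "checkout": {"pricing", "legacy_coupons"},
    "pricing": {"tax"},
    "legacy_coupons": {"pricing"},
    "tax": set(),
    "old_report": {"tax"},
}

for name in sorted(deps):
    print(name, "->", sorted(blast_radius(name, deps)))
```

In this toy graph nothing depends on `old_report`, so it can be removed with no adjustment elsewhere, while removing `tax` touches every other module. An empty blast radius is a rough proxy for a clear seam. It does not capture runtime coupling through shared data or configuration.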

Replacement beats refactoring for a related reason. Refactoring preserves historical constraints. It says: this code has problems, but its fundamental structure represents decisions worth keeping. Replacement discards those constraints entirely. When regeneration becomes cheap—when you can describe what you want and get working code quickly—carrying forward old decisions becomes increasingly irrational. The sunk cost fallacy, already a problem in software, becomes even more expensive to indulge.

Deletion is the most underrated operation in software. It eliminates unknown behavior. It collapses state space. It restores comprehensibility. And unlike refactoring, it cannot introduce new bugs in the code it removes; the remaining risk lies in whatever still depended on that code. The systems that endure will not be the ones that grew most carefully. They will be the ones that learned how to remove safely.

Where This Leads

If compaction lowers cost and improves AI reliability, then regeneration replaces maintenance as the dominant strategy for certain kinds of software. This is a more radical shift than it might first appear.

Maintenance assumes preservation. Its central question is: how do we keep this system working while making necessary changes? The answer involves careful modification, extensive testing, and respect for existing structure. Maintenance treats code as an asset to be protected.

Regeneration assumes discard. Its central question is: how do we describe what this system should do so we can rebuild it when needed? The answer involves clear specifications, good tests, and confidence that reconstruction will work. Regeneration treats code as a byproduct of understanding—valuable, but not precious.
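A rough sketch of what "clear specifications, good tests" could look like in practice: the specification lives as data, and any implementation, hand-written or regenerated, is accepted only if it satisfies every case. The shipping-fee rules and function names here are invented for illustration and are not taken from any real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    description: str
    args: tuple
    expected: object


# Hypothetical spec: the durable artifact that outlives any implementation.
SHIPPING_FEE_SPEC = [
    Case("orders under 50 pay a flat fee", (20.0,), 5.0),
    Case("orders of 50 or more ship free", (50.0,), 0.0),
    Case("empty orders pay nothing", (0.0,), 0.0),
]


def failures(impl: Callable, spec: list[Case]) -> list[str]:
    """Return a message for every case the implementation does not satisfy."""
    failed = []
    for case in spec:
        try:
            actual = impl(*case.args)
        except Exception as exc:  # a crash counts as a failed case
            failed.append(f"{case.description}: raised {exc!r}")
            continue
        if actual != case.expected:
            failed.append(
                f"{case.description}: expected {case.expected!r}, got {actual!r}"
            )
    return failed


def shipping_fee_v1(total: float) -> float:
    """Current implementation."""
    if total <= 0:
        return 0.0
    return 0.0 if total >= 50 else 5.0


def shipping_fee_v2(total: float) -> float:
    """A regenerated candidate that silently drops the empty-order rule."""
    return 0.0 if total >= 50 else 5.0


print(failures(shipping_fee_v1, SHIPPING_FEE_SPEC))
print(failures(shipping_fee_v2, SHIPPING_FEE_SPEC))
```

The first candidate passes with an empty failure list; the second is rejected because it charges a fee on an empty order. The confidence that reconstruction will work is only as good as the cases in the spec, which is why the spec, not the code, becomes the thing worth protecting.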

Not all software can or should be regenerable. Critical infrastructure, systems with hard-won safety properties, code that encodes institutional knowledge built up over years—these may always require careful maintenance. But a surprising amount of software is more like scaffolding than like cathedrals. It exists to solve a problem at a moment in time. When the problem or the context changes, the old solution may have less value than a fresh one.

The shift raises uncomfortable questions. If code is disposable, does quality still matter? If understanding lives in prompts and specifications rather than implementations, what happens to the craft of programming? These are genuine uncertainties, not rhetorical flourishes. The definition of quality may be moving from "durability" to "regenerability"—from code that lasts to code that can be reliably reproduced. What that means for how we train engineers, evaluate systems, and think about software as a discipline is not yet clear.

What is clear is that the economics have changed. Compaction was always wise. Now it is also profitable in ways you can measure. The organizations that figure this out first will find themselves with systems that are cheaper to run, easier to understand, and better suited to AI assistance. The ones that don't will be paying a tax on every prompt, forever, for the privilege of carrying code they no longer need.

https://aicoding.leaflet.pub/3may5niwoyk2n

Code Was Never the Asset

Dec 24, 2025 Updated Dec 24, 2025

Why AI makes the hidden economics of software unavoidable


When we talk about software economics in the age of generative AI, it can feel like we’re inventing something entirely new. But the truth is that many of today’s economic pressures simply reveal dynamics that have always existed beneath the surface of software development.

The Myth of Code as Capital

For decades, most engineering cultures treated code like a durable asset. The prevailing mindset was:

Code should be long-lived
“Technical debt” must be paid down
Rewrite is failure
Maintenance is virtue

This made sense when the dominant cost of software was writing it: hiring, training, manual coding, test cycles, and team coordination absorbed most of the budget. But that interpretation created a myth: that code itself is valuable capital.

It isn’t.

Legacy systems, codebases decades old that still run mission-critical functions, are often costly because they are expensive to understand and maintain, not because their lines of code are inherently valuable. By the common industry definition, a legacy system is one that continues to serve a purpose but has become burdensome to evolve because of outdated technology or missing automated tests and documentation.

Even the well-known definition of “legacy code” as code without tests, popularized by Michael Feathers, implies maintenance risk, not long-term capital value.

Systems Already Reveal the Hidden Economics

Look at how successful evolutionary modernization of legacy systems happens in practice. Thoughtworks and other practitioners favor incremental, evolutionary strategies over “big bang” rewrites because they reduce risk and cost.

What do these approaches have in common?

They minimize the amount of old code kept even as the system’s surface area grows
They replace functionality in iterations
They rely on incremental modernization patterns

These are not new economic insights. They’re how humans have long coped with complexity when preserving old code becomes more expensive than replacing it.

Even before AI, many approaches (including my own) to evolutionary architecture were solidified as responses to the high cost of maintaining monolithic legacy codebases. They accept that restating, reshaping, and replacing parts of systems can be cheaper than preserving and patching old ones.

Pace Layers: How Software Already Had Multiple Cost Regimes

As discussed in a previous post, a useful lens for understanding why not all code should be treated the same is pace layering — a model first articulated by Stewart Brand to explain how complex systems adapt and endure.

In Brand’s original framing, different layers of a system evolve at different speeds:

Fast layers innovate and experiment
Slow layers stabilize and provide continuity
The tension between them produces resilience

Applied to software, this predicts that some parts of a system ought to change rapidly, and others slowly — because the economic cost and impact of change differ by layer.

This insight was true long before AI. In practice:

UI frameworks cycle quickly
Core business rules change occasionally
Deep infrastructure and protocol logic rarely change

Traditional engineering already valued some code as more durable because its replacement was expensive. This matches Brand’s argument that “fast learns, slow remembers”. Fast layers respond quickly to shocks while slow layers retain memory and continuity.

AI Reveals What We Already Ignored

Generative AI collapses the cost of producing code, but not the cost of understanding it. Writing is cheap; comprehension is expensive. This exposes a core truth that legacy practitioners already knew instinctively:

Software isn’t valuable merely because it exists and serves a purpose. Its value also depends on whether we can reason about it, evolve it safely, and trust its behavior.

That’s why legacy systems become expensive: the burden of understanding and maintaining code outweighs its utility. Traditional approaches like software archaeology — reverse-engineering undocumented code — are symptomatic of organizations trying to carry cost forward because retiring code was harder than preserving it.

AI accelerates this pressure.

Why This Matters Now

Today, as tools can generate massive amounts of code with little human effort, the economic question is no longer "How do we write code efficiently?" It's "How do we reduce the long-term cost of what we write?"

That question demands we rethink what we treat as persistent. Underneath the shiny façade of AI productivity, the real economic driver will be systems that minimize the cost of comprehension, evaluation, and replacement.

https://aicoding.leaflet.pub/atom
