Codemanship's Blog — GeistHaus

I Am Ralph – CRESS Principles in Practice

codemanship May 17, 2026 Updated May 17, 2026


Since I wrote about my CRESS principles for context engineering – contexts should be Current, Refutable, Empirical, Small & Specific – I’ve been thinking about how that applies to my AI-assisted software development workflow.

You won’t be surprised – if you follow this blog or know me professionally at all – to hear that I drive design and development with tests.

You also won’t be surprised to hear that I work in small steps, solving one problem at a time. (Though you might be surprised at how small a step I mean by a “small step”).

You probably won’t be surprised that I run my tests after every change to the code. And you probably won’t be surprised that I’m in the habit of committing changes when I see the tests pass, or that I often revert changes when tests fail.

Nor will you be very surprised that I review the code after each small change, and not after a whole bunch of changes. I’ll look at the code carefully, perhaps run a linter to check for low-level problems that are easy to miss.

This has been my workflow for nearly 3 decades. And so you probably won’t be surprised to learn that it’s still my workflow in 2026, whether I’m using AI tools or not.

I’ve experimented extensively with automating the parts where I normally judge results and make decisions, and I’ve seen many others trying to do the same.

I went on a journey from me essentially orchestrating every small step, to a single agent, to multiple concurrent agents working without intervention for longer and longer.

And I saw just how impossible long-horizon, fully autonomous agentic workflows are. And I do mean impossible. A single step it might get right 80% of the time. 2 in a row? 10 in a row? 100 in a row? Forget it. It might not fall at the first hurdle, but it will fall soon enough.
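To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes every step succeeds independently with the same probability, which is a simplification, and it takes the 80% figure above as illustrative rather than measured. The function name is hypothetical.

```python
def chance_of_clean_run(per_step_success: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step_success ** steps


# Illustrative only: 0.8 is the per-step figure quoted in the text above.
for steps in (1, 2, 10, 100):
    print(f"{steps:>3} steps in a row: {chance_of_clean_run(0.8, steps):.2e}")
```

Under those assumptions, two steps in a row come off about 64% of the time, ten in a row about 11% of the time, and a hundred in a row essentially never.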

So I walked it back to a single agent – a basic Ralph loop – and then back even further to me essentially being the agent. I am Ralph.

I see more and more people who’ve spent lots of time on the same journey, and they too have reached a stage where they’re making their harnesses simpler and simpler, stripping out everything that they’ve discovered isn’t helping – and, in many instances, probably making things worse. I expect to meet some of them at “I am Ralph” soon.

If I visualise my workflow as a conversation between me and the agent, and between the agent and the model, there’s pretty much a one-to-one mapping between the steps in the process and my interventions.

I asked ChatGPT to try and visualise how this might look in a test-driven workflow, with continuous testing, inspection and refactoring and continuous integration.

What it came up with is close – in spirit, at least. Except I wouldn’t ask Claude to perform a refactoring that my IDE has a shortcut for. If you find yourself asking AI to do something you can do quicker and better, arguably you’ve lost the plot.

Also not mentioned in the diagram is automated code inspection – static analysis and that sort of thing – which I would have used multiple times in this workflow.

And, most importantly, the agent doesn’t decide the next step. I do. Always. But ChatGPT refused to let go of that one.

Note how context is being created fresh for each step, and being flushed after each step. As soon as changes are applied to the code – having been tested first – the context is now stale. New balls, please!

It also means that the agent isn’t dragging context from earlier steps behind it. That keeps context small and task-specific, dramatically reducing the risk of effects like attention dilution, context rot and probability collapse, and improving model predictions.

This kind of workflow is far more token and compute-efficient for the models, too.

http://codemanship.wordpress.com/?p=4483


Calibrating Your Steps – How Small is “Small”?

codemanship May 13, 2026 Updated May 13, 2026


Join me on Saturday May 23rd at 9:50 BST with other self-funding learners to get hands on with the micro-cycles and small steps of Test-Driven Development.

You know how it is when folks agree on something, but in their heads they have very different pictures of what it is they think they’re agreeing on?

I get that a lot when I talk about working in “small steps”. They nod enthusiastically and we all agree that small steps are a good thing.

And then I look at the size of their commits. Or they look at the size of mine. And now we don’t agree. We don’t agree at all.

Aside from being a classic example of where “Don’t tell me, show me” can aid in communication, it’s generally useful to contrast and compare our place in a distribution, and maybe recalibrate our expectations.

To give you an idea, pay close attention to how little code I change before I run my tests and – if they pass – commit those changes, before making the next change in this demonstration of refactoring.

http://codemanship.wordpress.com/?p=4478


CRESS Principles for Context Engineering – S is for Specific

codemanship May 11, 2026 Updated May 11, 2026


One of the common challenges I face as a teacher is getting developers to move forward by putting one sure foot in front of the other, instead of trying to do it in risky leaps and bounds.

One activity in particular where this friction occurs is refactoring. I watch people hack away at swathes of code, making dozens of changes, before I say “Shall we run our tests now?”

More often than not, the tests fail. Every change to the code carries the risk of breaking it, and that’s true whether we make them one at a time or 100 at a time before testing them.

But when we make them one at a time, if we break the software we know exactly which change broke it. Fixing it is usually a doddle. And if we can’t fix it, we can just roll back the change with a simple Ctrl-Z or git reset --hard (if we’re in the habit of committing whenever we see the tests pass after a change).
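That habit of committing on green and rolling back on red can be sketched as a small script. This is only an illustration: the function names are hypothetical, the test command assumes a unittest suite that can be discovered from the project root, and the command runner is passed in so it can be swapped out.

```python
import subprocess
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def commit_or_revert(
    message: str,
    test_cmd: Sequence[str] = ("python", "-m", "unittest"),  # assumed test command
    run: Callable[[Sequence[str]], int] = run_command,
) -> bool:
    """Run the tests; commit the change if they pass, otherwise roll it back."""
    if run(test_cmd) == 0:
        run(("git", "add", "-A"))
        run(("git", "commit", "-m", message))
        return True
    # Discards uncommitted changes to tracked files; untracked files are left alone.
    run(("git", "reset", "--hard"))
    return False
```

Calling commit_or_revert("Rename function") after a single small change either records a new safe point or puts the code back at the last one.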

If we make 100 changes and one or more of them breaks the software – which is now almost certain – then we have a much bigger problem. Was it the first change? Or maybe the last change? Or the 48th change? Into the debugger we go – probably for quite some time. And if we can’t fix it, undoing it is a lot of work lost.

The discipline of refactoring is in reshaping code one single atomic, tested change at a time instead of hacking away at it, making a whole bunch of changes. We rename a function, and run the tests. And if the tests pass, we might commit that before making another change.

No matter the scale of the restructuring we plan to do, we do it one small, easily-reversible change at a time. We put one foot in front of the other.

And each change has one specific objective – make the intent of a function easier to understand, break down a complicated IF block into something simpler, decouple business logic from an API call, and so on. Each change solves one problem. We work in rapid micro-cycles with continuous testing and code review, instead of letting testing and review turn into serious bottlenecks later.

This is beneficial when we’re working with AI coding tools, as multiple large-scale studies show very clearly the negative impact of downstream bottlenecks in development.

And it’s also helpful when we’re using LLMs to generate code for us. The more we ask a model to do, the less likely it is to do it successfully. (See “S is for Small“)

If we move forward in small steps, solving one problem at a time in tight feedback cycles, then our contexts can be about one specific thing – write this specific failing test, write the simplest code to pass that test, review the code that’s changed for this specific smell, do this specific refactoring.

And don’t include any information in the context that isn’t needed for that specific task.

If the task is to move a method from one class to another, we don’t need to give the model a summary of our architecture or of our coding standards or anything else unrelated.

It just needs a single instruction, and the code affected by the refactoring, and perhaps an example – maybe in a reusable context file – that illustrates the mechanics of that refactoring.

And it needs a way to test that the refactoring achieved the goal. So if the goal was to eliminate Feature Envy, then we can test for that smell afterwards in the code that’s changed.
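One rough way to make such a check executable is sketched below. The heuristic is an assumption made for illustration, not a standard definition of Feature Envy: it flags any name whose attributes a method touches more often than it touches its own via self. The function name envied_names and the sample methods are hypothetical.

```python
import ast
from collections import Counter


def envied_names(method_source: str) -> list[str]:
    """Names whose attributes the method uses more often than self's (crude heuristic)."""
    func = ast.parse(method_source).body[0]
    counts = Counter(
        node.value.id
        for node in ast.walk(func)
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name)
    )
    own = counts.pop("self", 0)
    return sorted(name for name, uses in counts.items() if uses > own)


# Hypothetical method before the refactoring: it reaches into `order` three times.
BEFORE = """
def total(self, order):
    return order.price * order.quantity - order.discount + self.shipping
"""

# After moving the calculation onto the order: one call on `order`, one use of self.
AFTER = """
def total(self, order):
    return order.subtotal() + self.shipping
"""

print(envied_names(BEFORE))
print(envied_names(AFTER))
```

Running the check only on the code that changed gives a pass or fail signal for that one refactoring, without pulling anything else into the context.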

This means that – provided the “blast radius” of the change is small – the context for this interaction with the model will be well within effective limits.

Any information included in the context that has no relation to the task at hand will just water down the model’s attention and reduce the probability of successful completion.

I conducted a closed-loop experiment where I asked Claude Opus 4.6 to execute a coding task, and then – with the help of GPT-5.2, arguably the best model if waffle is what you’re after – added more and more irrelevant information to the prompt. The task remained the same, but we buried it under increasing amounts of distractions – including a fictional set of coding standards and an architecture summary.

Each variation was attempted 10 times, so I could measure how many times out of 10 the task was successfully completed.
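A measurement loop of that shape might look something like the sketch below. It is not the harness used in the experiment. The attempt callable is a hypothetical stand-in for "send this prompt to the model in a fresh context, apply the result, and report whether the task's tests passed".

```python
from typing import Callable


def success_rate(attempt: Callable[[str], bool], prompt: str, runs: int = 10) -> float:
    """Fraction of `runs` independent attempts at `prompt` that succeeded."""
    passed = sum(1 for _ in range(runs) if attempt(prompt))
    return passed / runs


def compare_variations(
    attempt: Callable[[str], bool],
    variations: dict[str, str],
    runs: int = 10,
) -> dict[str, float]:
    """Success rate for each labelled prompt variation."""
    return {
        label: success_rate(attempt, prompt, runs)
        for label, prompt in variations.items()
    }
```

The variations dictionary would hold the same task wrapped in increasing amounts of unrelated material, keyed by a label for each level of padding.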

Long story short – the more extraneous or irrelevant information, the worse the model performs in specific tasks.

The experiments I’ve done, backed up by larger independent studies into the effect of context size on model performance, have also forced me to recalibrate what I mean by a “small context”. Forget the maximum advertised context limit for your model. Accuracy degrades rapidly with even just a few hundred tokens.

So for each interaction, contexts need to be fresh, task-specific and contain only the minimum information needed for that task.

http://codemanship.wordpress.com/?p=4463


Slow. The. F**k. Down.

codemanship May 10, 2026 Updated May 10, 2026


“Slow is smooth, and smooth is fast.”

US Navy SEALs training mantra

Another day, another data set telling us what we already knew.

In the latest AI Engineering Report from Faros, the software development telemetry folks, they found from studying 22,000 developers working on more than 4,000 teams what they call an “acceleration whiplash” effect caused by AI code generation.

As with other large-scale studies, more code’s undoubtedly being generated faster. Individual developer “productivity” – which I put in quotation marks because there’s no such thing – is up. Nobody’s contesting that.

But they also saw the same “downstream chaos” that the DORA folks saw in their data, and CircleCI saw in 28 million merge, build & deployment workflows.

Incidents per Pull Request were up 242%. Monthly incidents were up 58%. Bugs were up 54% (and that’s up 9% on 2025, so it’s accelerating).

Work restarts are up 14%. 26% more tasks show no activity for a week or more, languishing in code review purgatory. More work in progress, more work stalled, more work abandoned. As they put it, “beginning is easy and finishing is hard”. And the best available data is telling us that unconstrained AI code generation is making it much harder.

Notably, they found that PRs that made it into production without any review were up 31%. Teams are shipping at “LGTM-speed”.

Hey, it’s this guy again.

This, they found, is accompanied by rising levels of cognitive load caused by bigger change sets requiring oversight – much bigger – and by increased context-switching caused by more work in progress; more plates spinning. And the average outcome of spinning more plates is broken plates (and broken plate-spinners).

Bugs, incidents and rework are rising rapidly, and – surprise, surprise – delivery lead times (by their definition, time from changes being committed to getting into production) are up 480%. That’s more than 5x!

In the 90s, CASE tool vendors often used the marketing strapline “better software, faster”. Perhaps AI coding tool vendors’ strapline should read “worse software, later”.

All of this was predictable, and all of it was predicted – not least by me. Bigger changes hitting downstream bottlenecks faster just causes longer delays. We walked into this ambush with our eyes wide open. It’s the bottlenecks, stupid!

75+ years of software development has taught us important lessons about how we should work. That most teams still haven’t got the memo in 2026 – and many teams who did get the memo seem to have forgotten – is embarrassing for our profession. All it took was a shiny new toy, and software engineering fundamentals went out of the window. It’s not like that hasn’t happened before, but never on this scale or at this speed.

And what are those fundamentals?

Work in small slices, solving one problem at a time
Test, review and merge continuously
The gold isn’t in what we ship, it’s in what we learn from what we ship. A team that ships 2x and gets 0.5x meaningful user feedback is a 0.5x team. The time it takes to get that feedback is the ultimate speed limit, not how fast features can be churned out.
Slow the fuck down! Move forward by putting one sure foot in front of the other

I was in my twenties when I learned that what feels slow in software development can often turn out to be fast when we measure it at the system level, and not at my individual level. I learned to mistrust my gut feeling of being “productive”. In reality, my individual productivity is an illusion – albeit a very seductive one.

We’ve known for decades that in software development, batch size and feedback loops do the heavy lifting. How fast code’s being generated is a tiny little mouse compared to those elephants.

And until teams address the elephant in the room, the “downstream chaos” is just going to get worse as AI code generation becomes an ever-larger part of the process. It’s a good job we didn’t put software in everything, or this could have serious consequences for society!

Here’s your guide to how dev teams can become AI-ready, and it has very little to do with AI, and a lot to do with how you approach development.

http://codemanship.wordpress.com/?p=4432


CRESS Principles for Context Engineering – S is for Small

codemanship May 9, 2026 Updated May 11, 2026


We’ll get to the effective context limits of Large Language Models in due course, but let’s open with a software engineering fundamental.

More than 75 years of people writing computer programs has taught us a few hard lessons, and one of the most important is the size of the steps we take.

In my early days, I could code for hours before compiling and running my program. And when I ran the program, it inevitably didn’t work. Of course.

So I’d spend even more hours trying to figure out why it wasn’t working, and fixing the bugs.

As a young software developer, I believed debugging was Programming Skill #1, because I spent so much of my time doing it.

I later learned that this approach to writing programs was called “code-and-fix” development by those in the know. It’s the equivalent of shooting an entire movie without checking the footage, and then trying to fix all the mistakes in the editing suite.

“Code-and-fix” is very costly, and the end result is less than ideal, to say the least. A lot of the bugs never got fixed, because there just wasn’t time. (And because I was dumb enough to provide estimates that didn’t take debugging into account).

Now, here’s the funny part. I’d code for hours – hundreds of lines of code in one sitting – and then hit “Run”. But how I debugged code was a whole different approach. In debugging mode, I’d focus on one problem at a time, make one change to the code, run it again to see if that fixed it. If it did, I’d move on to the next bug in the (usually long) list.

The breakthrough came when I realised that’s how I should have written the code in the first place – solve one problem at a time, running the program many times an hour to check that the problem was indeed solved before moving on to the next problem.

Don’t shoot the whole movie and then look at the footage. Shoot one short take, and then go to video village and see how it looks. Actor didn’t hit his mark? Let’s go again now. Y’know, while the actor’s still here, along with the crew, and the set.

If I screw up a single change to a single line of code and find out immediately, it’s a quick and easy fix – I know exactly which change broke the code. If I screw up a bunch of changes to a bunch of code and only then find out, I’m going to end up in the debugger.

Having learned to work in small steps, making one change to the software at a time and getting feedback from running and testing the software, I found myself dealing with far fewer bugs, and – counterintuitively, because it feels slower – actually shipping sooner.

Changing code is like walking a tightrope. When we make lots of changes and then test the software, we’re walking a tightrope tied between two mountain peaks – by the time we reach the middle, it’s a long way to safety (working code) and a long way down if we fall.

When we make one change at a time and get feedback from testing as we go, our tightrope is tied to wooden posts a few feet apart and a few feet off the ground. We’re never far from safety, and if we do fall, we can just get back on the rope at the last point of safety with little time or effort wasted.

Importantly, as we progress in small steps from one tested, working version of the code to the next, every one of those posts represents a potential release. We’re never far from software that’s shippable.

This accelerates a more important feedback loop. When we can ship more often, we can get user feedback from working software more often. This enables us to learn what works and what doesn’t faster. And, it turns out, that learning is where the real value tends to be found – not from what we planned to deliver, but what we learned from what we delivered.

Working in smaller, tested steps gives us many more opportunities to steer the ship away from the rocks and towards the docks.

There are secondary systemic benefits for teams of doing the work in smaller slices, too. Larger batches of changes hitting downstream bottlenecks in the development process, like testing, code review and merging to the release branch, make these activities take longer. Our changes spend a lot of time sitting in queues waiting their turn. The more changes in progress – the more cars on the road, if you like – the more time’s spent waiting instead of moving forward.

Faster cars != faster traffic.
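The queueing arithmetic behind that can be sketched with Little's Law, which says that in a stable system the average time an item spends in the system equals the average number of items in progress divided by the average completion rate. The numbers below are made up purely for illustration, and the function name is hypothetical.

```python
def average_lead_time(work_in_progress: float, throughput_per_day: float) -> float:
    """Little's Law: average days in the system = average WIP / average throughput."""
    return work_in_progress / throughput_per_day


# Illustrative figures only: the pipeline finishes 5 changes a day in both cases.
print(average_lead_time(work_in_progress=10, throughput_per_day=5))
print(average_lead_time(work_in_progress=40, throughput_per_day=5))
```

If testing, review and merging still complete the same number of changes per day, quadrupling the work in progress quadruples the average wait, however quickly each change was typed or generated.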

LLMs can generate a lot of code very fast, and the tendency for AI-assisted development to exacerbate these bottlenecks – leading to worse software delivery performance overall – is well-documented.

The impact of batch sizes on delivery lead times and release stability is so big – much, much bigger than AI code generation – that it’s a mystery why more teams don’t pull that lever.

Now for the really fun part; for all the reasons I’ve stated, it serves us well to work in small, tested slices – putting one foot in front of the other – whether we’re using AI or not.

But when we are using AI, it helps us in another important way. These days, LLMs advertise maximum context sizes of as much as 1 million tokens. But they do not remain effective at that order of magnitude, or indeed anywhere near it.

The accuracy of token predictions drops off rapidly with contexts as small as just a few hundred tokens, according to independent studies.

As the amount of text an LLM has to keep track of grows, its performance tends to get worse for a few reasons. One is “attention dilution,” where the model spreads its focus too thinly across too much information, making its predictions less confident and precise.

Another is “probability collapse,” where the model struggles more as a conversation or task becomes longer and less similar to examples it saw during training – like how a chess-playing model can make increasingly poor moves deep into a game. Together, these effects make LLMs less reliable and effective when handling larger contexts.

For these reasons, contexts should be as small as possible – containing only the information the model requires for the job at hand. We’ll explore the importance of being specific in the next post, but suffice to say that when extraneous or irrelevant information’s included, it reduces the chances of getting the outcome we want.

Tools like Claude Code and Cursor will typically “compress” contexts when they get too large – which involves summarising parts of the context, and that’s a lossy process. But if you see them doing that, the context is already way outside of the effective zone. In my own workflows, I very rarely see it happening.

When we work in small, tested steps – tackling one problem at a time – and apply the CRESS principles we’ve covered so far, this tends to keep contexts in the order of a few hundred tokens, comfortably within the limit where models are effective.

When we don’t, we tend to end up spending more time fixing problems, more time doing retakes, and more time with our work sitting in queues. And right now, this is the average picture for the majority of teams, because they’re not slicing the work thinner like they should be. Indeed, some of them are actively moving in the opposite direction and making these problems worse, enthusiastically cheered on by AI vendors who ought to know better.

One final word about context size: another major factor in how much information needs to be included in each interaction with a coding model is the “blast radius” of the code affected.

If our code has low modularity and poor separation of concerns, a single functional change could bring many source files into play, all of which will need to be included in the context.

If our design effectively localises the impact of changes by splitting code up into cohesive and loosely-coupled modules, then a lot less of it needs to be included.
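One way to get a rough feel for blast radius is to count the modules that depend, directly or transitively, on the module being changed. That is only a proxy, chosen here as an assumption, and the module names in the usage note are hypothetical.

```python
from collections import deque


def blast_radius(changed: str, imports: dict[str, set[str]]) -> set[str]:
    """The changed module plus every module that directly or transitively imports it."""
    # Invert "module -> what it imports" into "module -> who imports it".
    dependents: dict[str, set[str]] = {}
    for module, deps in imports.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk outwards from the changed module.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for module in dependents.get(current, ()):
            if module not in seen:
                seen.add(module)
                queue.append(module)
    return seen
```

Given imports = {"checkout": {"basket"}, "basket": {"pricing", "stock"}, "reports": {"pricing"}, "pricing": set(), "stock": set()}, a change to "stock" reaches "basket" and "checkout" as well, while a change to "checkout" touches only itself.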

As with the “small” in “small steps”, “modular” enjoys a wide range of interpretations. What we’re learning with AI coding tools is that what’s really required is – as I saw someone describe it recently – a kind of “radical modularity”. When I looked at code they described as “radically modular”, I just saw modular code as I understand the term. I suppose it’s a bit like how what we call “organic food” in the UK, in France they just call “food”.

LLMs, famously by now, have a bit of a problem with modular design. They’re very good at generating code that they’re pretty bad at modifying later, and the lack of separation of concerns in generated code appears to be one of the main culprits. A program I might have implemented in 100 source files, Claude Code might squeeze into a dozen.

So you really need to keep on top of that, continuously reviewing and refactoring the design to steer yourself clear of a Big Ball of LLM-Unfriendly Mud.

You might be thinking “I’ll just get the LLM to handle that” right now. That would be a mistake. Research shows that models struggle to learn long-range patterns. Matching local patterns is where LLMs are strongest. They can’t do “bigger picture”. Basically, they’re driving in fog, at any scale of model.

Modular design remains very much a “you” thing.

http://codemanship.wordpress.com/?p=4395


CRESS Principles for Context Engineering – E is for Empirical

codemanship May 8, 2026 Updated May 8, 2026


Most commercial LLMs – that is to say, the ones with expensive lawyers – display a disclaimer along the lines of “<LLM> can make mistakes. Check important info.”

They’re not kidding. Every token of text an LLM generates should be considered suspect, and when fidelity matters, we really should check the output thoroughly.

In programming, fidelity matters. I appreciate that’s the kind of heresy that can get you sacked in Silicon Valley these days, where YOLO – mostly driven by FOMO – dominates.

But in banks and retail chains and hospitals and payroll it really does still matter – which is why, on high-risk systems, applying LLM-generated code changes directly is effectively banned in many organisations.

And it’s a two-way street. If we want more trustworthy output from an LLM, we need it to have more trustworthy input – both in training and in inference.

As users, we don’t have any control over the quality of the data an LLM is trained on, but we do have control over the quality of the data we give it in day-to-day use. Here’s another mnemonic: GIGO – Garbage In, Garbage Out.

Whenever possible, we want the context that the model is pattern-matching on to be grounded in observed reality, rather than in the model’s own output.

The code as it really is right now
The real requirements we agreed with the customer or product owner
The real customer acceptance tests
Actual test run results
Actual linter reports
Actual mutation testing results
Actual user feedback

And so on.

The uncomfortable truth is that the moment Claude Opus or GPT-5 or Gemini starts acting on its own output – e.g., its own planning or reasoning or generated code – the context starts drifting from reality. And the further we let the generated context run, the more the models compound their errors – eating their own fiction and producing even wilder flights of fancy. They have no model of the real world to compare it to.

Ditto where context “compression” and LLM-generated summaries are concerned – they’re notoriously unreliable narrators. That architecture.md file that Claude generated for you? Be very skeptical that it’s an accurate picture of the real architecture. Research finds that LLM-generated context files can mislead models.

The practical upshot of this is an information flow where our inputs are wherever possible grounded in observed reality, and LLM output can only become part of that observed reality after it’s been thoroughly tested against it.

And, yes, I’m implying that we shouldn’t rely on LLMs to mark their own homework, because they don’t have access to any kind of real-world model until we give it to them. In short, when an LLM tells you it’s raining, go outside and look.

To use an analogy, LLM waste water needs to be made clean and safe to drink before feeding it back into the LLM in future interactions. This often requires expert intervention, and often requires that the output be rejected outright if it’s too far from acceptable (e.g., if it fails the unit tests).

As with the C in CRESS – contexts should be Current – the implication is that contexts be short-lived, or they start to fill up with generated content that hasn’t been verified and – as the ground shifts beneath the model’s feet with each applied change to the code – it drifts further from the underlying reality.

The E in CRESS also works with the R – contexts should be refutable. In order for model outputs to be fed back into model inputs, they should pass through a quality gate that enables us to know with high confidence if they don’t satisfy our intent.

http://codemanship.wordpress.com/?p=4370


CRESS Principles for Context Engineering – R is for Refutable

codemanship May 7, 2026 Updated May 7, 2026


If speculative ideas can not be tested, they’re not science; they don’t even rise to the level of being wrong.

Wolfgang Pauli

When we interact with a language model, we’re communicating in natural language. And communicating in natural language is a lossy process.

There’s what I intended it to mean, and then there’s the meaning the model interprets, and they’re often not the same thing.

Many bad things have happened in the world because the receiver misinterpreted the intent of the sender. So it’s important to know with high confidence if we’ve grabbed the wrong end of the stick.

The inherent ambiguity of natural languages works against our desire to make our meaning clear.

In real-world communication, a simple technique to uncover misunderstandings is to test interpretations to see if they satisfy the original intent.

Including a test in an instruction given to an LLM serves two useful purposes:

It restricts pattern-matching to patterns that also match the test, and not just the natural language instruction. Coding models are actually trained by pairing code samples with tests of some kind, and more recently test execution has been used as a reward function in reinforcement learning. LLMs are sort of built for tests.
It potentially gives us a direct way to check if the output doesn’t satisfy the intent. If our success criteria are turned into executable tests – e.g. unit tests – then we can run them against the output and get immediate feedback.

Imagine we want our LLM to generate code to add items to an online shopping basket. I regularly see prompts that look something like this.

Please generate a Python function for adding items to a shopping
basket. It should take product and quantity as parameters.

But the devil’s in the detail. What exactly are we expecting to happen when the function adds the item? How will we know if it doesn’t happen the way we intended?

I’ve been providing BDD-style tests in my contexts, along the lines of:

Given an empty basket,
And the customer has selected the product with ID 811 and stock of 3
When the customer adds the product to the basket with quantity 2
Then a new order item is added to the basket with product 811 and quantity 2
And 2 of product 811’s stock are put on hold, leaving available stock of 1

This gives the LLM much more to go on regarding the expected behaviour – the precise intent – of adding an item to the basket.

And it can be directly translated into unit tests:

import unittest


# Product and add_to_basket are the code under test.
class AddToBasket(unittest.TestCase):

    def test_order_item_is_added(self):
        basket = []
        product = Product(id=811, stock=3)

        add_to_basket(basket, product, quantity=2)
        item = basket[0]

        self.assertEqual(item.product, product)
        self.assertEqual(item.quantity, 2)

    def test_stock_put_on_hold(self):
        basket = []
        product = Product(id=811, stock=3)

        add_to_basket(basket, product, quantity=2)

        self.assertEqual(product.hold, 2)
        self.assertEqual(product.available_stock(), 1)

(NB: In my workflow, I’d tackle one test at a time – we’ll cover that in the final two letters in CRESS.)

Provided the executable tests the LLM generates match the intent – and it’s really important to check that they do – any implementation it generates will need to pass them.
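To make that concrete, here is a minimal sketch of one implementation that would satisfy the two tests above. It is only an illustration, not output from any model or code from the experiments described later. The OrderItem class name and the choice of dataclasses are assumptions; the tests only require that basket items expose product and quantity, and that Product exposes hold and available_stock().

```python
from dataclasses import dataclass


@dataclass
class Product:
    id: int
    stock: int
    hold: int = 0  # units reserved in baskets but not yet purchased

    def available_stock(self) -> int:
        # Stock that can still be added to a basket.
        return self.stock - self.hold


@dataclass
class OrderItem:  # hypothetical name; the tests only need .product and .quantity
    product: Product
    quantity: int


def add_to_basket(basket: list, product: Product, quantity: int) -> None:
    # Put the requested units on hold, then record the order item.
    product.hold += quantity
    basket.append(OrderItem(product, quantity))
```

Note what the tests do not pin down: nothing here says what should happen when the quantity exceeds the available stock. That behaviour would need its own example before an implementation could be checked against it.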

If the implementation doesn’t pass the tests, or the tests don’t match the intent, I revert the changes, flush the context (see “C is for Current“) and try again – perhaps adding further clarification to the context, like additional tests, if needed.
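The commit-on-green, revert-on-red part of that loop can be sketched in a few lines. This is a rough illustration under stated assumptions: the project is a git repository, the tests are discoverable by unittest, and the function names (run, commit_or_revert) are made up for this example. It does not check whether the tests match the intent, which still needs a human eye, and it does not touch the LLM context.

```python
import subprocess
import sys


def run(cmd: list) -> bool:
    """Run a command and report whether it exited successfully."""
    return subprocess.run(cmd).returncode == 0


def commit_or_revert(message: str, runner=run) -> bool:
    """Commit the working tree if the tests pass; otherwise discard the changes."""
    if runner([sys.executable, "-m", "unittest", "discover"]):
        runner(["git", "add", "-A"])
        runner(["git", "commit", "-m", message])
        return True
    runner(["git", "reset", "--hard"])  # discard edits to tracked files
    runner(["git", "clean", "-fd"])     # delete untracked files the change added
    return False
```

The revert branch is deliberately destructive, so a sketch like this only makes sense when every previous green state has already been committed.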

Does this really make a difference? It certainly does. I conducted closed-loop experiments where I tasked Claude Code – using Opus 4.6 – to implement a set of features for a small, but non-trivial, system.

I’d written my own reference implementation with tests that used a simple API that didn’t reveal any internal design details. I preserved the API and moved the tests to where Claude couldn’t see them, leaving just my instructions and the API for it to work with.

When Claude had finished, I moved the tests back into the project and ran them, scoring each pass by the % of tests passing.

I didn’t intervene until Claude said it was done. (In real life, I don’t use it this way, of course.)

In one version of the experiment, I provided BDD-style examples in the prompt. In another, I just gave Claude the basic feature descriptions. In both versions, Claude was instructed to generate its own tests from its interpretation of the requirements.

In a single pass, measured by % of tests passing, the difference was big.

Over multiple passes, feeding back test results after each, the difference got even bigger.

With test examples provided, the agent has explicit success criteria to converge on. Without them, it just goes around in circles, literally aimlessly. Poor little Ralph!

One final thought: not all interactions with an AI coding tool will be about adding or changing functionality. What if the task is a refactoring?

Well, hopefully your refactorings have goals – they’re done with intent to improve the design.

In my TDD workflow, at every green light – whenever the tests are passing again – I perform a mini code review on the changes. I might, for example, run a linter over the diff. Let’s say one of my code quality checks – just another kind of test – is for functions or methods that have a cyclomatic complexity > 5.

If the LLM changes a function and makes CC = 6, I now have a failing test. I could revert and feed that back in another pass. But giving an LLM two objectives in the same interaction reduces the odds of either being satisfied, so we could be here all day throwing the dice over and over again.

Or I could ask the LLM to refactor the function, and then run the check again to see if the restructured version is within limits.

However I choose to handle it, importantly I have a clear way to know when it hasn’t worked.

http://codemanship.wordpress.com/?p=4344

May Workshops for Self-Funding Learners – Update

codemanship May 7, 2026 Updated May 7, 2026

Hiya. Just a quick note about the Essential Code Craft training workshops aimed at self-funding learners that are happening this month.

Specification By Example

Tuesday May 12 (evening) has 3 places available. I’m guessing it’ll be sold out by the end of this week.

Saturday May 16 (morning) is half-full at time of writing.

-> Register

Test-Driven Development

Tuesday May 19 (evening) is sold out, but you can add yourself to the waitlist in case anybody drops out and to be among the first to hear about future workshops.

Saturday May 23 (morning) still has plenty of places available.

-> Register

And if you want to keep an eye out for future workshops in June and beyond, bookmark our dedicated web page for self-funding learners on Ticket Tailor.

Upcoming skills we’ll be covering include Modular Design and Refactoring. Y’know? Boring skills that are nevertheless essential.

http://codemanship.wordpress.com/?p=4332

CRESS Principles for Context Engineering – C is for Current

codemanship May 6, 2026 Updated May 6, 2026

Imagine you’re trying to deliver groceries in a busy city using a map that was published in 1971. You’ll find yourself looking for houses, apartment blocks, entire neighbourhoods that didn’t exist when the map was drawn.

This is what it’s like when an AI coding assistant or agent is trying to work on a code base using an out-of-date picture of the structure.

With every change to the code, the map gets a little bit more out-of-date. With every change, the context gets a little bit more misleading.

The thing about context to an LLM is that it can’t distinguish fact from fiction – the real code as it is at this moment from, say, a summary of the code that was generated a bunch of changes ago. To a Large Language Model, it’s all just context.

This is why it’s important to keep the information in the context as current as we can. The implication is that we need to refresh the context after every significant structural change to the code.

This won’t be the only principle that encourages us to keep contexts short-lived, but it’s definitely a key one. That snapshot of the code that your agent fed to the LLM at the start of a conversation is obsolete after any changes have been applied to the real code, and the LLM simply can’t know to ignore it. It gets attention whether we like it or not.

I’m in the habit of flushing the context after changes have been successfully applied, and constructing a brand new one for the next step in the workflow. It’s a whole new conversation with a brand new snapshot of the code.

You probably won’t be surprised to hear that I don’t rely on LLM-generated summaries, either. Putting aside the fact that language models are notoriously unreliable narrators, they can very quickly get out-of-date if we don’t update them after changes are applied. That architecture.md file could easily end up being a 1971 street map.

Research shows us what happens when the information models are trained on gets “stale”: one study found that LLM accuracy dropped significantly when models were trained on out-of-date API documentation, even when updated docs were provided at inference time. As a coding agent modifies our source files, that effect kicks in almost immediately.

Much more effective – when measured in terms of successful task completions (e.g., acceptance tests passed) – is to use techniques like static analysis to build a high-fidelity picture of the relevant code as it is now for each step in the workflow.
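As a small illustration of the idea, the sketch below uses Python's standard ast module to rebuild an outline of a source file from the code as it is right now, rather than from a stored summary. It is an assumption about what such a step might look like, not the tooling used here; the outline function is a made-up name, and it only lists top-level functions, classes and their methods.

```python
import ast


def outline(source: str) -> list:
    """List top-level functions, classes and methods found in the source."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            lines.append(f"def {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, funcs):
                    lines.append(f"    def {item.name}({ast.unparse(item.args)})")
    return lines
```

Because the outline is regenerated from the files on disk at the start of each step, it can't drift out of date the way a saved architecture summary can. Note that ast.unparse requires Python 3.9 or later.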

http://codemanship.wordpress.com/?p=4324

C.R.E.S.S. Principles for Context Engineering

codemanship May 4, 2026 Updated May 11, 2026

Psst. If your boss won’t invest in training you in Specification By Example (BDD, ATDD), I’m running out-of-hours workshops on May 12 and 16 specifically for self-funding learners. £99 + UK VAT.

After more than 3 years of research and experimentation, I feel like I have a good handle on what properties of LLM contexts tend to produce the best results in code generation and modification. You’ll have read about them on my site if you follow this blog.

That was the easy part. The hard part is coming up with a catchy mnemonic to help me remember them.

Introducing C.R.E.S.S. principles for context engineering.

The most effective input contexts – measured by successful coding task completion in the fewest passes – are:

Current: They contain up-to-date information (e.g. not an architecture summary that was generated multiple changes ago).
Refutable: They contain some way of knowing with high confidence when the output doesn’t satisfy the intent (e.g. an acceptance test).
Empirical: They use information taken from observed reality (the actual code, test run results, linter output), not information generated by the model.
Small: They include the minimum necessary information required to satisfy CRESS. No redundant background, irrelevant history, or over-verbose explanations.
Specific: They are narrowly scoped to a single problem or task with no ambiguity in intent.

I’ve applied – and seen other folks applying – these principles for quite some time now, and I’ve tested all of them a fair amount in closed-loop experiments, so I’m quietly confident that they work.

But you shouldn’t take my word for it. Test them for yourself and see what difference they make.

Like the movie director said, “C.R.E.S.S. is more”.

(I’ll get my coat.)

http://codemanship.wordpress.com/?p=4313

Is It Time To Get Back To Fundamentals?

codemanship May 1, 2026 Updated May 1, 2026

I have a friend who built a recording studio in his garden. The building – an adapted garden office – cost £15,000.

Inside, he installed a pre-owned Neve 24-track mixing console with motorised faders in a custom-built desk – total cost: £17,000.

Add to that easily another £15K-20K of high-end gear and studio fittings, and he probably spent about £50,000 on that home studio in all. It took him 3 years in his spare time to build it out.

What does it sound like? I don’t know. I’ve never heard any music come out of it.

I, on the other hand, bought a 2-channel audio interface for £150 + some software and recorded 5 albums and 8 EPs – some getting radio-play on rock/metal stations. I was even Indie Band Of The Week on Metal Express Radio.

And it struck me that, while my hobby is making music, Jeff’s real hobby is building studios.

And that, folks, is the current state of AI-assisted software development. I see folks building some pretty elaborate studios, but I’m not hearing much in the way of finished music coming out of them.

Maybe it’s time to get back to basics and start focusing on the end product again?

Talking of fundamentals, if your boss won’t invest in training you in foundational software development practices like Specification By Example and Test-Driven Development, I’m running out-of-hours workshops in May specifically for self-funding learners. £99 + UK VAT.

http://codemanship.wordpress.com/?p=4308

https://codemanship.wordpress.com/atom

Posts