Choosing the right AI model is now a well-recognized problem. It is still not trivial, but at least there are benchmarks, pricing pages, context-window comparisons, and plenty of public discussion to guide you.
Coding agents are still more of a wild west. Many people treat them as simple wrappers around the model: a chat window with file access and terminal commands. In practice, a coding agent can influence the model’s behavior quite a lot. It controls the environment around the model: which tools are available, what system instructions are added, how much project context is collected, and how the model is expected to interact with your codebase.
That makes choosing a coding agent its own problem. How should you measure its performance and efficiency? Should you choose an agent tightly coupled with a specific model, or one optimized for flexibility? How much do its prompts, tools, and workflows affect speed, cost, and security?
I faced those questions after I cancelled my Cursor subscription. The decision was cost-based; I didn’t have technical complaints about my setup, so I expected to find a replacement quickly. Instead, I ended up testing a bunch of different agents and discovered huge fluctuations in execution time and very different token usage for the same task. I even caught an API key being exposed in real time.
AI dev environment
Here is a quick illustration of the modern AI dev environment that puts the puzzle pieces together. The LLM is the “AI brain”, but a coding agent is what turns the LLM into a coding assistant and interacts with the tools. Today, coding agents usually don’t come as separate apps. Instead, you interact with them via some other interface – a Command Line Interface (CLI) or an Integrated Development Environment (IDE).

Cursor combines all three layers into a single experience – it is a dev interface (which is based on its own version of VS Code, a very popular IDE) that comes with a built-in coding agent and a mix of models that it orchestrates based on your tasks. For me, this meant finding a replacement for all three layers of my AI dev environment.
Since the whole quest started after the spend analysis, the first question I asked was: should I still pay for a big cloud model, or can I replace it with local AI?
Going Local?
I did a pretty simple experiment to see if local AI was good for me. I asked two different models to update my chatbot Virtual Alexandra and add a “conversation” mode alongside the existing “single question” one.
- Task: add a conversation mode to a chatbot based on an existing design doc
- Project size: ~2K lines of code, roughly 50K tokens
I was not trying to benchmark every model on the market. I wanted to answer a practical question: could local AI become part of my real dev environment, or at least serve as a backup?
For the cloud model, I chose GPT-5.5 because it was part of my previous setup and I was satisfied with it. For the local model, I selected Qwen3.6 because it had performed best in my husband’s local AI experiments.
Since I was testing different models, I wanted to keep the same dev interface and agent. To start, I picked VS Code as an IDE and Kilo Code as my coding agent. VS Code was simply familiar already – even Cursor was based on it. Kilo Code lets you bring different models, including local ones, and switch between them easily. It is also available as a plugin for more than one IDE, which helped me keep my options open.
In addition to time, I was also measuring the overall number of tokens and the prompt size – mostly to keep an eye on future costs. To see the prompt size, I used a simple trick – sending “hi” in a new session and checking the tokens.
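The “hi” trick boils down to one division. Here is a minimal Python sketch of it; the function name and both numbers are hypothetical placeholders rather than measurements from this experiment, so substitute the token count your agent reports and the context window of your model.

```python
def prompt_overhead(baseline_tokens: int, context_window: int) -> float:
    """Fraction of the context window already used after a trivial first message."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if baseline_tokens < 0:
        raise ValueError("baseline_tokens must not be negative")
    return baseline_tokens / context_window


# Hypothetical values: tokens reported after sending "hi" in a fresh session,
# and the context window configured for the model.
baseline_tokens = 12_000
context_window = 64_000

share = prompt_overhead(baseline_tokens, context_window)
print(f"Agent prompt uses {share:.1%} of the context window")
```

Whatever remains after this baseline is what the agent has left for your code, your request, and its own output.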

Even this simple experiment produced quite a lot of data:
- The output quality was basically the same. GPT-5.5’s UI design was a bit more polished, but the local model is free, so I can iterate without paying extra.
- The local model was an order of magnitude slower than the cloud one. I expected the result, but not the magnitude of it.
- The cloud model was more token-efficient. But with local AI, token count matters less because you are not charged for it.
- Saying “hi” to your agent takes up to 20% of the context. The agent’s prompts are quite large (and this made me feel better about AI prompts in my own apps; it turns out they are not that big).
- The total number of tokens sent was roughly equivalent to my whole codebase. This shows how the cost for a bigger project can quickly get out of control.
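To see how this scales, the arithmetic can be sketched in a few lines of Python. The prices, the output-token count, and the task volume below are hypothetical placeholders, not any provider’s actual pricing; only the roughly 50K input tokens come from the test above. The sketch also assumes that token usage grows linearly with project size, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price in USD per one million tokens (hypothetical values, not a real price list)."""
    input_per_million: float
    output_per_million: float


def task_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of a single agent task."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


def monthly_cost(cost_per_task: float, tasks_per_day: int, working_days: int = 22) -> float:
    """Project a per-task cost over a month of similar tasks."""
    return cost_per_task * tasks_per_day * working_days


pricing = Pricing(input_per_million=2.0, output_per_million=10.0)  # assumed prices

# ~50K input tokens matches the test project; the output count is a guess.
small = task_cost(input_tokens=50_000, output_tokens=5_000, pricing=pricing)
# A project ten times larger, assuming token usage scales linearly.
large = task_cost(input_tokens=500_000, output_tokens=50_000, pricing=pricing)

print(f"Small project: ${small:.2f} per task, ${monthly_cost(small, 10):.2f} per month")
print(f"Large project: ${large:.2f} per task, ${monthly_cost(large, 10):.2f} per month")
```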
At this point my major concern was the speed of the local model. An increase of up to 10x in execution time felt too high for my workflows. However, when I showed the results to my husband, he challenged them.
Dmitry’s comment
I also tested the local Qwen3.6 vs OpenAI GPT cloud models and I saw a 5x difference, not 10x. I was using the OpenCode agent in the terminal on my MacBook M3 Pro with 36GB of memory.
Well, it’s not a complicated test, so I ran it again with a different agent.
This was totally unexpected – switching the agent reduced the execution time by 2x and used 20% fewer tokens. As it turned out, I wasn’t simply measuring local vs cloud models, but an agent’s effectiveness as well.
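For comparing two runs of the same task, a small helper keeps the arithmetic consistent. This is a sketch with hypothetical agent names and numbers chosen to mirror the 2x and 20% figures; they are not the raw measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One execution of the benchmark task (all values here are illustrative)."""
    agent: str
    minutes: float
    total_tokens: int


def compare(baseline: Run, candidate: Run) -> tuple[float, float]:
    """Return (speedup, token_savings) of candidate relative to baseline.

    speedup       : 2.0 means the candidate finished in half the time
    token_savings : 0.2 means the candidate used 20% fewer tokens
    """
    speedup = baseline.minutes / candidate.minutes
    token_savings = 1 - candidate.total_tokens / baseline.total_tokens
    return speedup, token_savings


first = Run(agent="agent A", minutes=30.0, total_tokens=100_000)   # hypothetical
second = Run(agent="agent B", minutes=15.0, total_tokens=80_000)   # hypothetical

speedup, savings = compare(first, second)
print(f"{second.agent} vs {first.agent}: {speedup:.1f}x faster, {savings:.0%} fewer tokens")
```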
Now it became clear that the model was only one part of the story. The next question was: how much do coding agents change the result?
Benchmarking coding agents
After realizing that agents can significantly affect the speed and cost of development, I decided to expand my tests to more agents in different environments.
There was no way I could test all the agents and IDEs – there are too many of them on the market. So I tested a couple of well-known names like Codex from OpenAI and Copilot from Microsoft, open-source offerings from OpenCode and Kilo, and Pi, a project driven by a single developer that is famous for being fast and lean.
Not all agents support all models or integrate with all IDEs, so I tested a variety of setups with the same codebase and the same prompt.

For smaller projects that use cloud LLMs, coding agents don’t seem to play such a big role:
- A one-minute difference in time can be a measurement error, network fluctuation, or infra queuing issue.
- A 10K token difference may come from the non-determinism typical of LLMs.
- All agents produced good results when backed by a fast capable model.
- IDE vs CLI didn’t play a major role.
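One crude way to separate real differences from noise is to repeat each run a few times and compare the gap between agents with the spread within each agent. The Python sketch below is a rule of thumb rather than a statistical test, and all timings in it are made up for illustration.

```python
from statistics import median


def differs_beyond_noise(runs_a: list[float], runs_b: list[float]) -> bool:
    """Return True when the gap between medians exceeds the largest within-agent spread."""
    if not runs_a or not runs_b:
        raise ValueError("each agent needs at least one run")
    # Spread = difference between the slowest and fastest run of the same agent.
    spread = max(max(runs_a) - min(runs_a), max(runs_b) - min(runs_b))
    return abs(median(runs_a) - median(runs_b)) > spread


# Hypothetical execution times in minutes for two agents on the same task.
cloud_a, cloud_b = [4.0, 5.0, 4.5], [5.0, 5.5, 4.8]
local_a, local_b = [15.0, 16.0, 14.0], [37.0, 35.0, 38.0]

print("cloud difference is real:", differs_beyond_noise(cloud_a, cloud_b))
print("local difference is real:", differs_beyond_noise(local_a, local_b))
```

With a single run per agent the spread is zero and every difference looks real, so the check only becomes useful with at least a few repetitions.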
The local model amplified coding agents’ influence:
- There can be up to a 4x difference in execution speed for the same task, with the same model.
- The total token count can fluctuate as much as speed – up to 3x.
- Smaller prompt size seems to be correlated with faster outcomes in the local setup.
- The leaner open-source agents had better speed and token usage.
- The outcome of the test task was good, with occasional UI polish issues that were easy to fix and iterate on.
If I relied only on my own benchmark, the Pi agent would be a clear winner with OpenCode coming second. But things are not that simple in coding-agent land, and there are more trade-offs to consider.
Beyond benchmarking
While testing the agents, I realized that speed and even the number of tokens do not tell the whole story. Agents need to earn my trust too.
Less tokens, less guardrails
One reason Pi is so fast is that it has a very small prompt. It mostly relies on the LLM already knowing how to behave as a coding assistant, instead of restating a long list of standard coding-agent guidelines. However, some of the missing guidance is about safety.
Here is a somewhat scary story that happened to me. My husband configured the Pi agent to use the Qwen3.6 model on my machine. Then I wanted to test the agent with GPT-5.5 and typed: “Switch from Qwen3.6 to GPT-5.5 model.” Not a great prompt, admittedly, but I had become used to Cursor and ChatGPT asking follow-up questions before doing anything risky.
Pi didn’t ask. Instead, it:
- assumed I wanted to replace the Qwen3.6 configuration;
- assumed I wanted to use an OpenAI API key instead of a subscription;
- found the API key in the .env file in the current project;
- put that key into the default shell configuration file, making it available to any app.
Pi outputs its actions and thinking process directly in the terminal window, which is actually a good thing, so I watched this happen live, with my mouth open. I reverted the changes immediately.
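A cheap safeguard after an incident like this is to check whether any value from the project’s .env file has ended up in a shell startup file. The Python sketch below is an illustration, not part of Pi or any other agent; the list of startup files and the 16-character minimum length are assumptions you may need to adjust, and the .env parsing is deliberately simplistic.

```python
from pathlib import Path


def read_env_values(env_path: Path, min_length: int = 16) -> dict[str, str]:
    """Parse KEY=VALUE lines and keep values long enough to look like secrets."""
    values: dict[str, str] = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        value = value.strip().strip("'\"")  # drop surrounding quotes
        if len(value) >= min_length:        # assumed threshold, skips flags like DEBUG=1
            values[key.strip()] = value
    return values


def find_leaks(env_path: Path, candidate_files: list[Path]) -> list[tuple[Path, str]]:
    """Return (file, variable name) pairs where a .env value appears verbatim."""
    secrets = read_env_values(env_path)
    leaks: list[tuple[Path, str]] = []
    for path in candidate_files:
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name, value in secrets.items():
            if value in text:
                leaks.append((path, name))  # report the name only, never the value
    return leaks


if __name__ == "__main__":
    home = Path.home()
    # Assumed set of shell startup files; extend it for your own shell.
    shell_files = [home / name for name in
                   (".zshrc", ".zprofile", ".bashrc", ".bash_profile", ".profile")]
    env_file = Path(".env")
    if env_file.is_file():
        for path, name in find_leaks(env_file, shell_files):
            print(f"{name} from .env also appears in {path}")
```

The script only reads files and prints variable names, so it is safe to run after every agent session in a project that keeps secrets in .env.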
After that, I looked at the prompts for Pi and OpenCode (two open-source projects where I could easily get access to the prompts), and the difference was obvious. Pi has a much leaner instruction stack: basic tools like read, bash, edit, and write, plus a short set of usage rules. OpenCode, by contrast, includes subagents, todo management, detailed Git and PR safety workflows, web fetching, question prompts, skill loading, and much more behavioral guidance. That makes OpenCode heavier, but also more guarded.
Is Pi a bad agent? No. But this was a very clear example of the speed-versus-safety tradeoff. A lean agent is like a powerful sports car: it goes fast, but you might be missing airbags and other safety features.
Slowing down with subagents
I did multiple test runs with the same model and agent to make sure I captured the data correctly. At some point, the OpenCode agent with the Qwen3.6 model took much longer on my benchmark task: 37 min instead of 15 min. I couldn’t understand why. I reran the test – still slow.
After some joint debugging with my husband and lots of different guesses (did he change some settings on the local model? Was there a new version of the OpenCode agent released?), we finally found the culprit – the main agent had delegated work to a subagent.
I would have expected delegation to make things faster. But when you have a single machine with a local model and a relatively simple task, subagents can significantly slow things down. Even delegation has overhead, and on local hardware that overhead can be too much for the task.
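A toy model makes the intuition concrete. It assumes the work splits evenly, each subagent adds a fixed overhead (for example, re-reading context and reporting back), and the hardware serves a fixed number of model requests at once. These are simplifying assumptions with made-up numbers, not a description of how OpenCode schedules subagents.

```python
import math


def total_minutes(work: float, subagents: int, overhead: float, parallel_slots: int) -> float:
    """Toy estimate of wall-clock time when work is split evenly across subagents.

    work           : minutes the task takes without delegation
    subagents      : number of subagents (0 means no delegation)
    overhead       : extra minutes each subagent spends on setup and reporting back
    parallel_slots : how many model requests the hardware can serve at once
    """
    if subagents < 0 or parallel_slots < 1:
        raise ValueError("subagents must be >= 0 and parallel_slots must be >= 1")
    if subagents == 0:
        return work
    # Subagents beyond the available slots have to wait for the next batch.
    batches = math.ceil(subagents / parallel_slots)
    return batches * (work / subagents + overhead)


work = 15.0      # hypothetical task duration without delegation
overhead = 3.0   # hypothetical per-subagent overhead

print("no delegation:          ", total_minutes(work, 0, overhead, parallel_slots=1))
print("2 subagents, 1 slot:    ", total_minutes(work, 2, overhead, parallel_slots=1))
print("2 subagents, 2 slots:   ", total_minutes(work, 2, overhead, parallel_slots=2))
```

In this model, a machine that serves one request at a time never gains from delegation: the chunks still run one after another, and every subagent adds its overhead on top.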
Flexibility is still hard
A coding agent is a layer between you and the model, and some agents optimize for certain models. I could have tested Codex with the local model, but it would require some configuration acrobatics, and this is clearly not what Codex is built for at the moment. OpenAI would prefer us to use its proprietary models.
Another issue is the dev interface – agents and IDEs can be very opinionated here as well. For example, Pi is pro-terminal; its support in IDEs is the bare minimum. But the opposite also happens – VS Code is heavily integrated with Copilot, to the level that it gets in the way of other agents that run inside VS Code as plugins.
I realized that I prefer pluggable infrastructure where I can easily switch models and agents, but still have minimal IDE features like Git integration or a file viewer.
The features I don’t need yet
I am looking at agents from a single-developer perspective. I need to save time and money, but I don’t have a team dealing with code reviews, SLAs, or production fires. So I am not especially interested in integrations with tools like Jira, Slack, or PagerDuty.
For a team, the evaluation would probably look very different. Collaboration, review workflows, incident response, security controls, and observability would matter a lot more. I strongly suspect that if I were optimizing for those tasks, Copilot and Kilo Code would score pretty high.
For me, the basic stakes are simpler: can I trust the agent, can I control it, can it work with my preferred models, and can it perform a small task without turning it into a time and token sinkhole?
It was becoming clear to me that it would be hard to settle on a single model or agent going forward. With that in mind, I started looking at the final piece of the puzzle – the dev interface that could give me a flexible multi-model and multi-agent setup.
Dev interface: the final layer
I am not a big fan of the CLI – I code infrequently and prefer to have a nice UI and all the essential tools laid out in front of me, which is why Cursor was a good fit. To keep my car metaphor going, I want not just the airbags for safety, but an air conditioner and a nice sound system for comfort, even if it slows things down slightly.
To help narrow down the search for a new IDE, I created a wish list. My ideal IDE should:
- Contain basic tools: a file viewer, an editor with code completions and highlighting, terminal, Git integration, and of course an agent window
- Let me control which agent and model to use and allow for switching them easily
- Show me how many tokens I spent and how long it took to complete a task
- Display a desktop notification when the task is finished
- Work on my MacBook Air
- Be a free tool to avoid adding another subscription fee to my AI monthly budget
It turned out to be surprisingly hard to find a good IDE for me. I tried the big names and some personally trusted vendors. Each had strong differentiating points, but each also had its own issues.
- VS Code is the default choice for many, but for me it felt heavily optimized around Copilot, which made interacting with other agents harder.
- OpenCode Desktop has a nice editing experience and allows easy model switching, but only uses its own agent. Its Git integration is super basic.
- Air from JetBrains is a nice lightweight IDE coming from a trusted vendor, but it’s very new, and at the time I tested it, it only supported a small set of agents, notably, no OpenCode or Kilo Code.
- Many IDEs and coding agents obscure time spent on a task or token counts.
In a way, this is understandable. AI models are still new, coding agents are even newer, and they all evolve very fast. IDEs are established products and they have a harder job than standalone agents: they need to integrate AI without breaking decades of developer workflow expectations. I have no doubt that this layer will also change quickly and adapt to AI-native development styles.
My new AI dev setup
None of the options looked ideal, but given how quickly things change in AI land, any choice I make will likely be temporary.
For now, I settled on the OpenCode Desktop IDE. It is minimalistic, and its Git integration is weak, but it is good for token transparency (and I expect to run more experiments along those lines in the future). The IDE lets me switch models, but not agents, so I am committing to the OpenCode agent for now. On the other hand, both the agent and the IDE are open-source, so I can look not only at the benchmark tests, but at the code itself, which was already useful with prompt debugging.
So, the fastest and the leanest agent was not my first choice. Trust, control, and even the comfort of a good dev environment turned out to be more important.

Model benchmarks are not enough – the same model can behave very differently depending on the coding agent wrapped around it. With a fast cloud model, many agents look “good enough.” With a slower local model, agent overhead becomes very obvious.
Prompt size is not just a cost detail
It is part of the agent’s personality, behavior, and safety guidelines. Fast agents may be fast because they carry fewer guardrails. That can be useful, but you have to be more careful.
Subagents are not automatically better
Delegation can help with complex work, but it can slow down small tasks, especially on local hardware.
The best setup is still personal
The dev environment is not just about personal productivity anymore; it can significantly influence your output and directly affect cost. However, it’s still about you. What you choose is not necessarily the fastest or the cheapest, but what gives you the best balance of speed, cost, control, and trust.










In other words, the chatbot can now explain how it was built based on this very text.

















