https://adandai.wordpress.com/feed/atom

Posts

The secret life of coding agents
Tags: Uncategorized, ai, artificial-intelligence, coding agents, coding-agent, IDE, llm, LLMs

Choosing the right AI model is now a well-recognized problem. It is still not trivial, but at least there are benchmarks, pricing pages, context-window comparisons, and plenty of public discussion to guide you.

Coding agents are still more of a wild west. Many people treat them as simple wrappers around the model: a chat window with file access and terminal commands. In practice, a coding agent can influence the model’s behavior quite a lot. It controls the environment around the model: which tools are available, what system instructions are added, how much project context is collected, and how the model is expected to interact with your codebase.

That makes choosing a coding agent its own problem. How should you measure its performance and efficiency? Should you choose an agent tightly coupled with a specific model, or one optimized for flexibility? How much do its prompts, tools, and workflows affect speed, cost, and security?

I faced those questions after I cancelled my Cursor subscription. The decision was cost-based; I didn’t have technical complaints about my setup, so I expected to find a replacement quickly. Instead, I ended up testing a bunch of different agents and discovered huge fluctuations in execution time and very different token usage for the same task. I even caught an API key being exposed in real time.

AI dev environment

Here is a quick illustration of the modern AI dev environment that puts the puzzle pieces together. The LLM is the “AI brain”, but a coding agent is what turns the LLM into a coding assistant and interacts with the tools. Today, coding agents usually don’t come as separate apps. Instead, you interact with them via some other interface – a Command Line Interface (CLI) or an Integrated Development Environment (IDE).

Cursor combines all three layers into a single experience – it is a dev interface (which is based on its own version of VS Code, a very popular IDE) that comes with a built-in coding agent and a mix of models that it orchestrates based on your tasks. For me, this meant finding a replacement for all three layers of my AI dev environment.

Since the whole quest started after the spend analysis, the first question I asked was: should I still pay for a big cloud model, or can I replace it with local AI?

Going Local?

I did a pretty simple experiment to see if local AI was good for me. I asked two different models to update my chatbot Virtual Alexandra and add a “conversation” mode alongside the existing “single question” one. 

  • Task: add a conversation mode to a chatbot based on an existing design doc
  • Project size: ~2K lines of code, roughly 50K tokens

I was not trying to benchmark every model on the market. I wanted to answer a practical question: could local AI become part of my real dev environment, or at least serve as a backup? 

For the cloud model, I chose GPT-5.5 because it was part of my previous setup and I was satisfied with it. For the local model, I selected Qwen3.6 because it had performed best in my husband’s local AI experiments.

Since I was testing different models, I wanted to keep the same dev interface and agent. To start, I picked VS Code as an IDE and Kilo Code as my coding agent. VS Code was simply familiar already – even Cursor was based on it. Kilo Code lets you bring different models, including local ones, and switch between them easily. It is also available as a plugin for more than one IDE, which helped me keep my options open. 

In addition to time, I was also measuring the overall number of tokens and the prompt size – mostly to keep an eye on future costs. To see the prompt size, I used a simple trick – sending “hi” in a new session and checking the tokens.

| Environment | Agent | Model | Time | Tokens | Prompt |
| --- | --- | --- | --- | --- | --- |
| VS Code | Kilo Code | GPT-5.5 | 3 min | 47K | 11K |
| VS Code | Kilo Code | Qwen3.6 | 28 min | 60K | 13K |

✏ Even this simple experiment produced quite a lot of data:

  • The output quality was basically the same. GPT’s UI design was a bit more polished, but the local model is free, so I can iterate on it without paying extra.
  • The local model was an order of magnitude slower than the cloud one. I expected the result, but not the magnitude of it. 
  • The cloud model was more token-efficient. But with local AI, token count matters less because you are not charged for it.
  • Saying “hi” to your agent takes up to 20% of the context. The agent’s prompts are quite large (and this made me feel better about AI prompts in my own apps; it turns out they are not that big).
  • The total amount of tokens sent was roughly equivalent to my whole codebase. This shows how the cost for a bigger project can quickly get out of control.
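
The ratios behind these observations can be reproduced with a few lines of Python. This is a minimal sketch that uses only the numbers from the table above. The variable and function names are illustrative, and "prompt share" here means the size of the “hi” prompt relative to the total tokens used for the task, not relative to the model's context window.

```python
# Measurements copied from the table above (time in minutes, tokens in thousands).
runs = {
    "GPT-5.5": {"minutes": 3, "total_k": 47, "prompt_k": 11},
    "Qwen3.6": {"minutes": 28, "total_k": 60, "prompt_k": 13},
}

def prompt_share(run):
    """Fraction of the task's total tokens taken by the agent's base prompt."""
    return run["prompt_k"] / run["total_k"]

def slowdown(slow, fast):
    """How many times longer the slow run took compared to the fast one."""
    return slow["minutes"] / fast["minutes"]

for name, run in runs.items():
    print(f"{name}: prompt share {prompt_share(run):.0%}")
print(f"Local vs cloud slowdown: {slowdown(runs['Qwen3.6'], runs['GPT-5.5']):.1f}x")
```

Keeping the raw measurements in one small structure makes it easy to add another row after each new run and see whether the ratios hold.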

At this point my major concern was the speed of the local model. A nearly 10x increase in execution time felt too high for my workflows. However, when I showed the results to my husband, he challenged them.

🧑‍💻Dmitry’s comment 

I also tested the local Qwen3.6 vs OpenAI GPT cloud models and I saw a 5x difference, not 10x. I was using the OpenCode agent in the terminal on my MacBook M3 Pro with 36GB of memory.

Well, it’s not a complicated test, so I ran it again with a different agent.

| Environment | Agent | Model | Time | Tokens | Prompt |
| --- | --- | --- | --- | --- | --- |
| CLI | OpenCode | Qwen3.6 | 15 min | 47K | 8K |

This was totally unexpected – switching the agent roughly halved the execution time (from 28 min to 15 min) and used about 20% fewer tokens. As it turned out, I wasn’t simply measuring local vs cloud models, but an agent’s effectiveness as well.

Now it became clear that the model was only one part of the story. The next question was: how much do coding agents change the result?

Benchmarking coding agents

After realizing that agents can significantly affect the speed and cost of development, I decided to expand my tests to more agents in different environments.

There was no way I could test all the agents and IDEs – there are too many of them on the market. So I tested a couple of well-known names like Codex from OpenAI and Copilot from Microsoft, open-source offerings from OpenCode and Kilo, and Pi, a one-person project that is famous for being fast and lean.

Not all agents support all models or integrate with all IDEs, so I tested a variety of setups with the same codebase and the same prompt.

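Cloud-based testing (GPT-5.5)

| Agent | Environment | Time | Tokens | Prompt |
| --- | --- | --- | --- | --- |
| Codex | VS Code | 4 min | 45K | 13K |
| Kilo Code | VS Code | 3 min | 47K | 11K |
| OpenCode | OpenCode Desktop | 4 min | 37K | 9K |
| Pi | CLI | 3 min | 50K | 1K |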

✏ For smaller projects that use cloud LLMs, coding agents don’t seem to play such a big role:

  • A one-minute difference in time can be a measurement error, network fluctuation, or infra queuing issue. 
  • A 10K token difference may come from the non-determinism typical of LLMs.
  • All agents produced good results when backed by a fast capable model. 
  • IDE vs CLI didn’t play a major role.
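
Local testing (Qwen3.6)

| Agent | Environment | Time | Tokens | Prompt |
| --- | --- | --- | --- | --- |
| Copilot Chat | VS Code | 40 min | 66K | 23K |
| Kilo Code | VS Code | 28 min | 60K | 13K |
| OpenCode | CLI | 15 min | 47K | 8K |
| Pi | CLI | 11 min | 22K | 1K |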

✏ The local model amplified coding agents’ influence:

  • There can be up to a 4x difference in execution time for the same task with the same model (40 min vs 11 min).
  • The total token count can vary almost as much as the time – up to 3x (66K vs 22K).
  • Smaller prompt size seems to be correlated with faster outcomes in the local setup.
  • The leaner open-source agents had better speed and token usage.
  • The outcome of the test task was good, with occasional UI polish issues that were easy to fix and iterate on. 
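
The spread between agents in the local runs can be checked directly from the table. The sketch below is only a restatement of those numbers. The helper `spread` and the dictionary layout are illustrative names, and with four data points the ordering check is an observation, not a statistical result.

```python
# Local (Qwen3.6) results copied from the table above:
# agent -> (minutes, total tokens in K, prompt tokens in K)
local_runs = {
    "Copilot Chat": (40, 66, 23),
    "Kilo Code": (28, 60, 13),
    "OpenCode": (15, 47, 8),
    "Pi": (11, 22, 1),
}

def spread(values):
    """Ratio between the largest and the smallest value."""
    return max(values) / min(values)

time_spread = spread([minutes for minutes, _, _ in local_runs.values()])
token_spread = spread([tokens for _, tokens, _ in local_runs.values()])

# Agents ordered by prompt size and by run time; in this data the two orders match.
by_prompt = sorted(local_runs, key=lambda agent: local_runs[agent][2])
by_time = sorted(local_runs, key=lambda agent: local_runs[agent][0])

print(f"time spread: {time_spread:.1f}x, token spread: {token_spread:.1f}x")
print("same ordering by prompt size and by time:", by_prompt == by_time)
```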

If I simply went by my own benchmark, the Pi agent would be the clear winner, with OpenCode coming second. But things are not that simple in coding-agent land, and there are more trade-offs to consider.

Beyond benchmarking

While testing the agents, I realized that speed and even the number of tokens do not tell the whole story. Agents need to earn my trust too.

Fewer tokens, fewer guardrails

One reason Pi is so fast is that it has a very small prompt. It mostly relies on the LLM already knowing how to behave as a coding assistant, instead of restating a long list of standard coding-agent guidelines. However, some of the missing guidance is about safety.

Here is a somewhat scary story that happened to me. My husband configured the Pi agent to use the Qwen3.6 model on my machine. Then I wanted to test the agent with GPT-5.5 and typed: “Switch from Qwen3.6 to GPT-5.5 model.” Not a great prompt, admittedly, but I had become used to Cursor and ChatGPT asking follow-up questions before doing anything risky.

Pi didn’t ask. Instead, it:

  • assumed I wanted to replace the Qwen3.6 configuration;
  • assumed I wanted to use an OpenAI API key instead of a subscription;
  • found the API key in the .env file in the current project;
  • put that key into the default shell configuration file, making it available to any app.

Pi outputs its actions and thinking process directly in the terminal window, which is actually a good thing, so I watched this happen live, with my mouth open. I reverted the changes immediately.
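
One kind of guardrail that could catch a move like this is a check that refuses to copy project secrets into files outside the project. The sketch below is purely hypothetical and does not describe how Pi, OpenCode, or any other agent works. The functions `parse_env` and `leaks_secret`, the paths, and the sample key are all made up for illustration.

```python
from pathlib import Path

def parse_env(text):
    """Parse KEY=VALUE lines from .env-style text, skipping comments and blanks."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values

def leaks_secret(env_text, target_path, new_content, project_root):
    """Return the names of .env values that would be written outside the project."""
    target = Path(target_path).resolve()
    root = Path(project_root).resolve()
    if target == root or root in target.parents:
        return []  # writes inside the project are not flagged by this check
    secrets = parse_env(env_text)
    return [name for name, value in secrets.items() if value and value in new_content]

# Hypothetical example: an agent plans to append an export line to a shell config file.
env_text = "OPENAI_API_KEY=sk-example-not-a-real-key\n"
planned = 'export OPENAI_API_KEY="sk-example-not-a-real-key"\n'
flagged = leaks_secret(env_text, "/home/user/.zshrc", planned, "/tmp/my-project")
print(flagged)
```

A real agent would need much more than a substring match, but even a check this small turns a silent action into a question for the user.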

After that, I looked at the prompts for Pi and OpenCode (two open-source projects where I could easily get access to the prompts), and the difference was obvious. Pi has a much leaner instruction stack: basic tools like read, bash, edit, and write, plus a short set of usage rules. OpenCode, by contrast, includes subagents, todo management, detailed Git and PR safety workflows, web fetching, question prompts, skill loading, and much more behavioral guidance. That makes OpenCode heavier, but also more guarded.

Is Pi a bad agent? No. But this was a very clear example of the speed-versus-safety tradeoff. A lean agent is like a powerful sports car: it goes fast, but you might be missing airbags and other safety features.

Slowing down with subagents

I did multiple test runs with the same model and agent to make sure I captured the data correctly. At some point, the OpenCode agent with the Qwen3.6 model took much longer on my benchmark task: 37 min instead of 15 min. I couldn’t understand why. I reran the test – still slow. 

After some joint debugging with my husband and lots of different guesses (did he change some settings on the local model? Was there a new version of the OpenCode agent released?), we finally found the culprit – the main agent had delegated work to a subagent.

I expected delegation to make things faster. But when you have a single machine with a local model and a relatively simple task, subagents can significantly slow things down. Even delegation has overhead, and on local hardware that overhead can be too much for the task.

Flexibility is still hard

A coding agent is a layer between you and the model, and some agents optimize for certain models. I could have tested Codex with the local model, but it would require some configuration acrobatics and clearly, this is not what Codex is built for at the moment. OpenAI would prefer us to use its proprietary models.

Another issue is the dev interface – agents and IDEs can be very opinionated here as well. For example, Pi is pro-terminal; its IDE support is the bare minimum. But the opposite also happens – VS Code is so heavily integrated with Copilot that it gets in the way of other agents that run inside VS Code as plugins.

I realized that I prefer pluggable infrastructure where I can easily switch models and agents, but still have minimal IDE features like Git integration or a file viewer.

The features I don’t need yet

I am looking at agents from a single-developer perspective. I need to save time and money, but I don’t have a team dealing with code reviews, SLAs, or production fires. So I am not especially interested in integrations with tools like Jira, Slack, or PagerDuty.

For a team, the evaluation would probably look very different. Collaboration, review workflows, incident response, security controls, and observability would matter a lot more. I strongly suspect that if I were optimizing for those tasks, Copilot and Kilo Code would score pretty high.

For me, the basic stakes are simpler: can I trust the agent, can I control it, can it work with my preferred models, and can it perform a small task without turning it into a time and token sinkhole?

It was becoming clear to me that it would be hard to settle on a single model or agent going forward. With that in mind, I started looking at the final piece of the puzzle – the dev interface that could give me a flexible multi-model and multi-agent setup.

Dev interface: the final layer

I am not a big fan of the CLI – I code infrequently and prefer to have a nice UI and all the essential tools laid out in front of me, which is why Cursor was a good fit. To keep my car metaphor going, I want not just the airbags for safety, but an air conditioner and a nice sound system for comfort, even if it slows things down slightly.

To help narrow down the search for a new IDE, I created a wish list. My ideal IDE should:

  • Contain basic tools: a file viewer, an editor with code completions and highlighting, terminal, Git integration, and of course an agent window
  • Let me control which agent and model to use and allow for switching them easily
  • Show me how many tokens I spent and how long it took to complete a task
  • Display a desktop notification when the task is finished
  • Work on my MacBook Air
  • Be a free tool to avoid adding another subscription fee to my AI monthly budget

It turned out to be surprisingly hard to find a good IDE for me. I tried the big names and some personally trusted vendors. Each had strong differentiating points, but each also had its own issues.

  • VS Code is the default choice for many, but for me it felt heavily optimized around Copilot, which made interacting with other agents harder. 
  • OpenCode Desktop has a nice editing experience and allows easy model switching, but only uses its own agent. Its Git integration is super basic.
  • Air from JetBrains is a nice lightweight IDE coming from a trusted vendor, but it’s very new, and at the time I tested it, it only supported a small set of agents, notably, no OpenCode or Kilo Code.
  • Many IDEs and coding agents obscure time spent on a task or token counts.

In a way, this is understandable. AI models are still new, coding agents are even newer, and they all evolve very fast. IDEs are established products and they have a harder job than standalone agents: they need to integrate AI without breaking decades of developer workflow expectations. I have no doubt that this layer will also change quickly and adapt to AI-native development styles.

My new AI dev setup

None of the options looked ideal, but given how quickly things change in AI land, any choice I make will likely be temporary.

For now, I settled on the OpenCode Desktop IDE. It is minimalistic, and its Git integration is weak, but it is good for token transparency (and I expect to run more experiments along those lines in the future). The IDE lets me switch models, but not agents, so I am committing to the OpenCode agent for now. On the other hand, both the agent and the IDE are open-source, so I can look not only at the benchmark tests, but at the code itself, which has already been useful for prompt debugging.

So, the fastest and the leanest agent was not my first choice. Trust, control, and even the comfort of a good dev environment turned out to be more important.

Learnings

Local AI makes agent overhead visible

Model benchmarks are not enough – the same model can behave very differently depending on the coding agent wrapped around it. With a fast cloud model, many agents look “good enough.” With a slower local model, agent overhead becomes very obvious.

Prompt size is not just a cost detail

It is part of the agent’s personality, behavior, and safety guidelines. Fast agents may be fast because they carry fewer guardrails. That can be useful, but you have to be more careful.

Subagents are not automatically better

Delegation can help with complex work, but it can slow down small tasks, especially on local hardware.

The best setup is still personal

The dev environment is not just about personal productivity anymore; it can significantly influence your output and directly affect cost. However, it’s still about you. What you choose is not necessarily the fastest or the cheapest, but what gives you the best balance of speed, cost, control, and trust.

http://adandai.wordpress.com/?p=233
Is AI an expensive hobby? 
Tags: Uncategorized, ai, ai-cost, artificial-intelligence, chatgpt, llm

My husband and I started experimenting with AI about six months ago. Since we were both on a career break, we were careful about spending. So we set a strict rule – no more than $100 per month on anything AI-related: APIs, tools, subscriptions, all of it.

I wasn’t trying to analyze the costs at the start. I just wanted to learn and stay within that budget. At the same time, the news about companies investing billions in AI – and developers talking about “burning tokens” – was hard to ignore. And I was building public apps, which meant other people could be burning my tokens.

Finally, I got curious: why wasn’t I running into budget issues? And where was the money actually going? 

Below are the answers. It is not a deep financial analysis. It’s a simple breakdown of how much I actually spent experimenting with AI for over six months, and what surprised me.

The grand total

After six months, out of my total allowed $600, I spent slightly over $400. And here is what I spent it on:

| What | How much | Why |
| --- | --- | --- |
| ChatGPT Plus subscription | $162 ($27 per month in EU) | I switched to paid subscription mostly for generating images. The free tier only gives 3 images per day, and even if I simply wanted a nice cover image for a blog post, I needed more iterations than that. |
| Cursor Pro subscription | $126 ($21 per month) | I needed an IDE that would support my vibe-coding-first approach. I tried a free tier, but quickly hit token limits. |
| Render hosting | $34 (~$7 per app per month) | Not technically AI-related, but my apps have web UIs and I needed to host them somewhere for public access. |
| OpenAI platform | $100 (metered) | Usage of the OpenAI APIs and services for my apps. |
| Total | $422 | |

💰Insight

My flat-rate subscription fees were higher than my metered AI usage. I had a somewhat naive assumption that subscriptions imply a volume discount and should be cheaper than paying per API call.

Metered usage vs subscriptions

Of course, how I use ChatGPT or Cursor is very different from how I use OpenAI platform APIs. ChatGPT is for chats and architecture questions; Cursor handles all the code generation for my apps; and OpenAI APIs are for building the core logic of my apps and running them in production.

But all of them use LLMs and charge per token. So I wondered – if I paid for metered usage instead of a flat fee subscription during those 6 months, where would I end up?

I downloaded all the logs from Cursor and ChatGPT and got the overall amount of tokens, images, files processed, tools used, etc. Then I applied the rates from the latest GPT-5.4 model to see if I was under or overpaying.

GPT-5.4 pricing:

  • Input: $2.50 / 1M tokens
  • Output: $15 / 1M tokens
  • Images: $0.17 per image (estimated)
  • Uploaded image processing: $0.015 per image
  • Web search: $0.01 per call

| | Cursor | ChatGPT |
| --- | --- | --- |
| Total input text tokens | 16.5M | 145K |
| Est. price with gpt-5.4 | $37.6 | $0.36 |
| Total output text tokens | 1.5M | 838K |
| Est. price with gpt-5.4 | $23.6 | $12.57 |
| Images | N/A | 60 |
| Est. price with gpt-5.4 | 0 | $10.20 |
| Uploaded images | N/A | 66 |
| Est. price with gpt-5.4 | 0 | $0.99 |
| Uploaded text files | N/A | 1.1M |
| Est. price with gpt-5.4 | 0 | $2.75 |
| Web searches | N/A | 35 |
| Est. price with gpt-5.4 | 0 | $0.35 |
| Total est. price | $61 | $27.22 |
| Subscription price | $126 | $162 |
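
The ChatGPT line items can be recomputed from the listed rates. This is a rough sketch, not the exact calculation behind the table. The names are illustrative, the uploaded text files are assumed to be billed as input tokens, and only the ChatGPT column is covered because its rounded counts are enough to reproduce the per-item estimates.

```python
# Rates listed above for GPT-5.4 (USD); the image rate is the post's own estimate.
INPUT_PER_M = 2.50
OUTPUT_PER_M = 15.00
PER_IMAGE = 0.17
PER_UPLOADED_IMAGE = 0.015
PER_WEB_SEARCH = 0.01

def token_cost(tokens, rate_per_million):
    """Cost of a number of tokens at a per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million

# ChatGPT usage from the table above.
chatgpt = {
    "input text": token_cost(145_000, INPUT_PER_M),
    "output text": token_cost(838_000, OUTPUT_PER_M),
    "images": 60 * PER_IMAGE,
    "uploaded images": 66 * PER_UPLOADED_IMAGE,
    # Assumption: uploaded text is billed like ordinary input tokens.
    "uploaded text files": token_cost(1_100_000, INPUT_PER_M),
    "web searches": 35 * PER_WEB_SEARCH,
}

for item, cost in chatgpt.items():
    print(f"{item}: ${cost:.2f}")
print(f"sum of line items: ${sum(chatgpt.values()):.2f}")
```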

💰Insight 

Ouch! I am overpaying 2x for Cursor and 6x for ChatGPT compared to what I would have paid with metered usage. But these tools don’t offer a pay-per-use option. Also, subscriptions made it easier to work without interruptions or hitting limits. Whether that’s worth the price is still an open question.

Development vs Production cost

Another insight I got right away is that the development cost of AI apps is way higher than running my apps in production. Here is another way to look at the same data:

| What | How much | Type |
| --- | --- | --- |
| ChatGPT Plus subscription | $162 | Development |
| Cursor Pro subscription | $126 | Development |
| Render hosting | $34 | Production |
| OpenAI platform | $100 | Development + Production |
| Total | $422 | |

Both ChatGPT and Cursor subscriptions are pure development cost, Render is pure production cost, but OpenAI is a mixture of both. To split this line into production vs development costs I needed a deep dive into OpenAI API logs and billing data.

After 6 months I ended up with two small apps:

  • A small text adventure game based on The Lord of the Rings (LOTR)
    This one uses OpenAI APIs to generate a different journey depending on the player’s input. All I use here is the Responses API. I played a little bit with images, but quickly abandoned the idea because they were too expensive.
  • Virtual Alexandra (VA)
    A chatbot that answers users’ questions based on my writing and instructions using APIs and retrieval augmented generation (RAG) from OpenAI. I also tried fine-tuning (didn’t work) and did a bunch of data preparation for RAG, including translating several MB of text from Russian into English and vice versa. 

I used different API keys for production and development of both apps, so I could see how much I spent on each of them. Images and fine-tuning are not used in production, but they are still part of the development cost – you pay even if you fail.

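| App | Phase | Item | Cost |
| --- | --- | --- | --- |
| LOTR game | Development | Responses API | $17.99 |
| LOTR game | Development | Images | $1.36 |
| LOTR game | Development | Development total | $19.35 |
| LOTR game | Production | Responses API | $1.49 |
| LOTR game | Production | Production total | $1.49 |
| LOTR game | | LOTR total | $20.84 |
| Virtual Alexandra | Development | Responses API | $29.74 |
| Virtual Alexandra | Development | RAG (file_search) | $1.23 |
| Virtual Alexandra | Development | Data translation (via Completions API) | $8.57 |
| Virtual Alexandra | Development | Fine-tuning | $34.08 |
| Virtual Alexandra | Development | Development total | $73.62 |
| Virtual Alexandra | Production | Responses API | $5.52 |
| Virtual Alexandra | Production | RAG (file_search) | $0.16 |
| Virtual Alexandra | Production | RAG (vector storage) | $0.008 |
| Virtual Alexandra | Production | Production total | $5.69 |
| Virtual Alexandra | | VA total | $79.31 |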

💰Insight 

Most of my AI spend went into development, not production. The production cost is about 10 percent of the total ($41 out of $422).
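
The 10 percent figure follows from adding Render hosting to the two production totals. A minimal sketch of that arithmetic, using only numbers from the tables above (the variable names are illustrative):

```python
# Spend figures from the tables above, in USD.
subscriptions_dev = 162 + 126    # ChatGPT Plus + Cursor Pro
render_prod = 34                 # hosting
openai_dev = 19.35 + 73.62       # LOTR + Virtual Alexandra development
openai_prod = 1.49 + 5.69        # LOTR + Virtual Alexandra production

production = render_prod + openai_prod
development = subscriptions_dev + openai_dev
total = production + development

print(f"production: ${production:.2f} ({production / total:.0%} of ${total:.2f})")
```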

Apparently, my fear of public apps becoming a money sinkhole didn’t materialize. Partially this is because of low usage – I had about a hundred people try my apps. And these are not productivity apps people would use every day; there is no stickiness in them.

But another reason is that I was more careful with what I put in production. I optimized system prompts and discarded expensive features like images. One question that Virtual Alexandra answers costs me $0.088 – less than ten cents. Ten steps in the LOTR game cost $0.037 – about four cents. 
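
To put the per-session costs in perspective, here is a rough sketch of how many sessions a fixed budget would cover. It assumes, unrealistically, that the whole $100 monthly budget goes to production API calls and that every session costs the same. The constant and function names are illustrative.

```python
MONTHLY_BUDGET = 100.0          # USD, the monthly limit mentioned at the start of the post
COST_PER_VA_QUESTION = 0.088    # one Virtual Alexandra answer
COST_PER_LOTR_SESSION = 0.037   # ten steps in the LOTR game

def sessions_per_month(budget, cost_per_session):
    """Whole number of sessions a budget covers at a fixed per-session cost."""
    return int(budget // cost_per_session)

va = sessions_per_month(MONTHLY_BUDGET, COST_PER_VA_QUESTION)
lotr = sessions_per_month(MONTHLY_BUDGET, COST_PER_LOTR_SESSION)
print(f"Virtual Alexandra: {va} questions per month (about {va // 30} per day)")
print(f"LOTR game: {lotr} ten-step sessions per month (about {lotr // 30} per day)")
```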

Local models vs paid APIs

$600 was our family total AI budget, and I spent more than $400. How much did my husband spend? Well, the answer is … $0. Our interests led us into different experiments: I wanted something I could share with other people, while he was happy exploring how he can improve his day-to-day coding and test the power of the local models. But that zero is deceiving. Here is what actually happened.

| | Alexandra | Dmitry |
| --- | --- | --- |
| API + subscription spend | $422 | $0 |
| Hardware | MacBook Air | MacBook Pro |
| Hardware cost | ~$1000 | ~$3000 |
| Dev focus | Small public apps | Local AI dev environment |

💰Insight 

I paid for convenience and speed, while my husband Dmitry paid upfront in hardware and setup effort.

We didn’t buy new laptops; we used what we had in the house. If you happen to have a powerful computer already and want to start your vibe coding journey, you can spend exactly $0. If your computer is from Apple, read Dmitry’s guide on setting up local AI on a Mac to get started.

But if you only have an old MacBook Air, you can still go ahead – it will be way cheaper to start experimenting with paid subscriptions and APIs instead of investing in new hardware upfront.

Learnings

AI is an affordable hobby

All my data comes from small-scale apps and experiments; it doesn’t apply to the corporate, startup, or deep research worlds. But for a casual hobby, just set your budget. Without watching tokens or hours, and without ever hitting a hard limit, I was able to keep it under $100 a month.

At small scale, the true cost is app development, not running in production

Public APIs turned out to be cheaper than I expected. Unless you build a really viral app, you are very unlikely to burn a crazy amount of money. The two apps I created each cost less than ten cents per user session, so even a $100 monthly budget would cover more than a thousand sessions a month.

However, development costs add up quickly. Experimenting with features like fine-tuning and image generation can easily dominate the total spend, even when production usage stays low.

AI subscriptions provided less value than I would like

It’s worth looking at your AI subscriptions – are you maximizing their value? And if not, are you paying for something else, like a good uninterrupted workflow? One of my personal takeaways from this write-up is to dig deeper into this and see how much I really need these services and whether I can replace them with a local AI setup.

http://adandai.wordpress.com/?p=217
A practical guide to running and coding with local AI on a Mac
Tags: Uncategorized, ai, claudecode, llm, localai, opencode

While I was focusing my own experiments on AI as a service and paid APIs, my husband went in a completely different direction.

He’s been running models locally on his Mac, and getting results surprisingly close to what I was seeing with OpenAI, but without the cost.

He put together a guide on how to do this – not just step-by-step instructions, but also clear explanations (the quantization part is especially good) and a deeper look at the architectural trade-offs.

Now we have two working setups for AI experiments at home – one with a MacBook Air and paid AI subscriptions, and another one with a powerful MacBook Pro and local agents. Which one would you choose?

If you are leaning towards local AI, it is definitely worth a read:
👉 https://github.com/dmitryryabkov/local-ai-mac

http://adandai.wordpress.com/?p=213
A very fine untuning
Tags: Uncategorized, ai, artificial-intelligence, fine-tuning, llm, technology

This is a story about a failed attempt, not a successful one. In my previous post, I promised to talk about fine tuning and how it didn’t work for the Virtual Alexandra project, a chatbot based on my own writing and instructions. 

My husband and I called fine tuning results “Drunk Alexandra” – incoherent answers that were initially funny, but quickly became annoying. After weeks of experiments, I reached a simple conclusion: for this particular project, fine tuning was not a good option. It didn’t justify the extra time, cost, or complexity compared to the prompt + RAG system I already had.

This post is a breakdown of what I tried, what failed, and why I no longer think fine tuning is the right tool for building something like Virtual Alexandra.

What is fine tuning and why would you need it?

Let’s start with the definition. Fine tuning is a process where you take an existing AI model and further train it to make it more specialized in a certain field (like medical or legal) or to introduce style and behavioral changes. 

The thing about my current implementation of Virtual Alexandra – a simple RAG system with prompt engineering on top of a base model – is that it still doesn’t really sound like me. We can all guess when a text is written by ChatGPT now – there is a certain choice of words and structure that just points to the author. I can hear ChatGPT’s voice in Virtual Alexandra’s answers, and I would love to make it less pronounced. Fine tuning promised exactly that.

In theory, it should have been on the same level of complexity as using RAG. And at first glance, it really looked simple. I produced a so-called “instruction dataset”. It usually consists of the user’s input and the expected answer from the model. The format can vary depending on the model, but here is an example for OpenAI in JSONL:

{"messages":[{"role":"user","content":"What’s your favorite kind of weather?"},{"role":"assistant","content":"Nice and sunny."}]}

OpenAI’s recommendation is to have at least 50 entries. My dataset included about 1000.
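
For illustration, here is a minimal sketch of how such a dataset could be generated and sanity-checked before uploading. It is not the script used for Virtual Alexandra. The functions `to_jsonl` and `validate_jsonl` are hypothetical helpers, and the checks cover only the basics: each line parses, the last message comes from the assistant, and there are enough entries.

```python
import json

def to_jsonl(pairs):
    """Serialize (question, answer) pairs as one chat-format JSON object per line."""
    lines = []
    for question, answer in pairs:
        record = {
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"

def validate_jsonl(text, min_entries=50):
    """Return a list of problems found in the dataset (empty if none)."""
    problems = []
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < min_entries:
        problems.append(f"only {len(lines)} entries, expected at least {min_entries}")
    for number, line in enumerate(lines, start=1):
        try:
            messages = json.loads(line)["messages"]
        except (json.JSONDecodeError, KeyError, TypeError):
            problems.append(f"line {number}: not a valid record")
            continue
        last = messages[-1] if isinstance(messages, list) and messages else None
        if not isinstance(last, dict) or last.get("role") != "assistant":
            problems.append(f"line {number}: last message is not from the assistant")
    return problems

sample = to_jsonl([("What’s your favorite kind of weather?", "Nice and sunny.")])
print(sample, end="")
print(validate_jsonl(sample, min_entries=1))
```

Running a check like this locally is cheap compared to waiting for a fine tuning job to fail on a malformed file.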

The next step is to either find an online model to fine tune (OpenAI offers a few with a super intuitive UI), or run experiments locally.

But as is typical with AI, “simple” doesn’t mean “good”. So let’s dive into the different issues I had to battle with.

Refined hallucinations

The main problem I encountered after fine tuning, no matter which model I tried, was increased hallucinations. I will start with examples from gpt-4o-mini-2024-07-18 simply because it is the cheapest model OpenAI offers. With my 3 MB of training data containing ~1000 entries the fine tuning job took about 45 minutes.

I was shown some nice graphs with reassuring training metrics – loss was decreasing, accuracy increasing – everything looked exactly the way you’d expect when things are working. However, none of these metrics tell you whether the model will actually produce good answers or start hallucinating.

Things looked less exciting when I ran my own tests. I sent identical requests in parallel to three different models – base gpt-4o-mini, fine-tuned gpt-4o-mini, and gpt-5.2 (the one I use for my “production” version of Virtual Alexandra). All of them had the same prompts, access to the same vector store, etc. The testing was pretty basic – I had a list of 22 questions and I was rating the answers based on my own judgment, because after all it was about my personal data and my own tone. This isn’t a perfectly fair comparison, because gpt-5.2 is a much stronger model, but I used it as a reference for what I could achieve without fine tuning.
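
A side-by-side test like this can be organized with a small harness. The sketch below shows one possible shape and is not the actual test code. `compare_models` and `fake_ask` are hypothetical names, the fine-tuned model id is a placeholder, and the real API call is replaced with a stub so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def compare_models(questions, models, ask):
    """Send every question to every model in parallel.

    `ask(model, question)` is a caller-supplied function that returns the answer text.
    Returns {question: {model: answer}} so answers can be rated side by side.
    """
    jobs = [(question, model) for question in questions for model in models]
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = list(pool.map(lambda job: ask(job[1], job[0]), jobs))
    results = {question: {} for question in questions}
    for (question, model), answer in zip(jobs, answers):
        results[question][model] = answer
    return results

def fake_ask(model, question):
    """Stand-in for a real API call so the sketch runs offline."""
    return f"[{model}] answer to: {question}"

# "ft:gpt-4o-mini" is a placeholder, not a real fine-tuned model id.
models = ["gpt-4o-mini", "ft:gpt-4o-mini", "gpt-5.2"]
results = compare_models(["How old are you?"], models, fake_ask)
for model, answer in results["How old are you?"].items():
    print(model, "->", answer)
```

Passing the `ask` function in as a parameter keeps the prompt, vector store, and other settings in one place, so the only thing that changes between calls is the model name.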

It’s hard to predict when hallucinations will happen. Even the same question can produce either an acceptable result or a hallucination. But here is a very telling illustration.

You can see that gpt-4o-mini couldn’t calculate my age, while gpt-5.2 had no problems with it. The fine-tuned model once produced the same incorrect answer as its base model, but the next time it gave a “drunk” answer – nonsensical sentences mixed with the actual data.

Here are the overall results:

  • gpt-4o-mini – could get some answers wrong, especially if they were related to the current date and calculations.
  • fine-tuned gpt-4o-mini – introduced a 25% hallucination rate on top of the issues of the base model, which in my opinion was unacceptable for a system I was entrusting with my personal data.
  • gpt-5.2 – I didn’t have a single wrong or hallucinated answer during my small test.
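
A rate like the 25% above is just the share of answers that I rated as hallucinated. Here is a minimal sketch of that tally, assuming each answer gets one manual label; the label names and the example list are made up for illustration and are not my real test data.

```python
from collections import Counter
from typing import Dict, List


def rating_summary(ratings: List[str]) -> Dict[str, float]:
    """Return the share of each manual rating label (0.0 to 1.0)."""
    if not ratings:
        return {}
    counts = Counter(ratings)
    return {label: n / len(ratings) for label, n in counts.items()}


# Made-up ratings for illustration only, not the real results.
example_ratings = ["ok", "ok", "hallucination", "ok"]
print(rating_summary(example_ratings))
```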

I tested in Russian as well. The hallucination rate was about the same, however the hallucinations themselves were more “severe” – the model sometimes completely broke grammar and even invented new words.

At this point, hallucinations in answers alone would already make fine tuning questionable. But this wasn’t the only problem.

Hallucinations all the way down

I wrote in the previous post that OpenAI’s RAG system can change the search queries (and I will repeat that this is an implementation choice, not an inherent LLM issue). This may become a real problem with fine-tuned models, because hallucinations can affect the search queries as well.

Here you can see how the fine-tuned model expanded a simple “Tell me about your kids” question. Needless to say, I have no idea who Eric Perez is or where those numbers came from. They were all hallucinated by the model.

Initially, I expected RAG to be a clean layer on top of the model. In reality, it behaves more like a tightly coupled system where changes in one part can leak into everything else.

Big models, big problems

Overall, I wasn’t surprised by the gpt-4o performance, since I saw similar results from the locally hosted models that my husband was running. The reason why we both wanted to try fine tuning with OpenAI was to get access to larger and more powerful models. You can fine tune only a limited set of models with OpenAI and the latest available is gpt-4.1, but this was good enough for testing.

I created a new fine tuning job and … it got stuck in a queue… for 50 (fifty!!!) hours. I tried to restart – once after an hour, then after 2 hours – and after 24 hours I even wrote a rage post on LinkedIn. The fine tuning job itself took only about one hour and fifteen minutes, and fortunately I wasn’t charged for the time spent in a queue or the cancelled jobs.

But what about the result? All the graphs looked good, but… well… RAG stopped working. I don’t mean hallucinated queries – my Virtual Alexandra completely broke down. It could still answer questions about canonical facts included directly in the prompt, like my age or when I got married, but nothing about facts in the vector store.

Let me illustrate the issue. In the screenshot, the original question was “Tell me about your LOTR project.” I chose this question since information about it was only in the vector store, not in the training data or prompt. RAG found several entries, but the system’s reply was… hallucination?… some technical details bleeding into the answer? Other times it would say it didn’t find anything, although it had plenty of search results.

I found a workaround for the RAG issue to test the model a bit more. OpenAI allows you to use the so-called “playground” where you can chat with the fine tuned model via UI. Weirdly enough, RAG access worked there. In the screenshot, you can see how it answered the same question that got a nonsensical answer via API. The exact same setup behaved differently depending on how it was accessed. 

🧑‍💻Dmitry’s comment 

I think you hit some bugs in the OpenAI system. In theory, RAG queries are rewritten not by a fine tuned model, but by a base model. The idea is that this should make search more relevant in multi-step conversations, because the model can condense the long context into a search query. For example, if you asked about traveling to Paris by car and then asked how far Berlin is from Paris, it can search for how long it would take to travel from Berlin to Paris by car. But from your logs it looks like the fine tuned model is somehow getting involved in the RAG query rewrite.

OK, maybe the RAG issue here is just a bug. But what about hallucinations? Well, the same rate of about 25%-30%, more or less. 

So, with a larger model I didn’t see any improvement in the hallucination rate, but I got new and unexpected issues. And with the high cost of training and long wait times for infra access, it was hard to debug or iterate.

Is two better than one?

Weird RAG issues aside, my major problem remained hallucinations. So, I tried a couple of techniques that were supposed to fix it and make the answers more reliable: style rewriting and verification. Both of them involve using two models – one base, out-of-the-box, model and one fine-tuned.

Style rewriting

In this case the base model produces an answer and the fine tuned model rewrites it in the person’s voice. I chose gpt-5.2 as my base and tried both gpt-4o and gpt-4.1 (fine-tuned) for style rewriting. What I found is that they don’t actually rewrite anything – maybe change a word or two, but no real tone change.

There are two theories about why this happens. One is that the base model’s answers are “good enough” and the fine-tuned model simply keeps them the same. This one is not really possible to fix.

Another is that my fine-tuned models were trained on “instruction” data sets, and the task of rewriting is very different from answering questions. If that’s the case, addressing it would require producing a new data set with examples of texts and how I would rewrite them… and honestly, this is not a task I am personally interested in. So, I didn’t pursue the style rewriting option further.

Verification

Here the fine tuned model produces an answer and the base model checks if this answer is correct. It’s pretty easy to see how it can work for math problems, but let’s take a look at yet another nonsensical answer to the “How old are you?” question:

It’s a bad answer, but it technically contains correct data. And in cases like this, defining a useful verification process becomes tricky, especially without turning it into a much more complex system.
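
For reference, the control flow of the verification approach is simple. The sketch below assumes two hypothetical callables: `generate` stands in for the fine-tuned model and `verify` for the base model judging the draft. The hard part described above, deciding what `verify` should actually accept, is exactly what this sketch leaves open.

```python
from typing import Callable


def answer_with_verification(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    fallback: str = "I am not sure about that.",
) -> str:
    """Return the draft answer only if the verifier accepts it."""
    draft = generate(question)      # assumed: call to the fine-tuned model
    if verify(question, draft):     # assumed: base model judging the draft
        return draft
    return fallback                 # hypothetical fallback text
```

An answer that is nonsensical but contains correct data would pass a purely factual `verify`, which is why this simple gate was not enough in my case.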

At this point, I realized that I was trying to put a fine-tuned model to work somewhere, somehow, with extra time and effort, while I already had a working system with no fine tuning at all producing acceptable results.

But did you try this other thing?..

When my husband and I started working on the Virtual Alexandra project, we knew that fine tuning can be expensive and slow, and that it needs careful data preparation. So we didn’t go directly to OpenAI, but did a lot of experiments locally. Dima tried a bunch of different versions of an open source model that supports both English and Russian: Qwen3 4B/8B/14B. At some point he tried Mistral 7B as well, but abandoned it since it doesn’t support Russian.

He played with different parameters and settings – learning rate, epochs, batch size, LoRA rank and scale. The results were always encouraging – the models changed behavior and were kind of getting better, but we were never able to get below a 25% hallucination rate (and usually it was worse than that).

Going to OpenAI was simply the last step in making sure that for this particular project fine tuning is not an option.

Learnings

For Virtual Alexandra, fine tuning was not worth it

I built a working version of Virtual Alexandra with prompt + RAG in a couple of days. After weeks of fine tuning, I ended up with worse results: more hallucinations, broken RAG behavior, and a system that was harder to debug and understand. Add to this the long queues for larger models and the relatively high cost of training. My advice – if you have other implementation options available, use them instead.

RAG is not a clean layer on top of the model

I expected retrieval from the vector store to be stable and independent. In reality, fine tuning of the model or even inconsistencies in the system (like accessing it via UI vs API) can change how retrieval works or whether it is usable at all.

Fine tuning is not available for the latest and greatest models

The latest OpenAI model for fine tuning is from April 2025, so it hasn’t been updated for about a year. Meanwhile Anthropic is simply not offering this functionality as a service at the moment. This is a strong hint that something is not fine with fine tuning.

http://adandai.wordpress.com/?p=191
Turning yourself into a chatbot

Not long ago, I created Virtual Alexandra, a chatbot that answers questions based on my writing and instructions. The final implementation turned out to be so simple that even someone with no AI or coding experience can create their own virtual self. If you are brave enough to look at yourself in this peculiar mirror, this post will provide instructions and describe some unexpected LLM quirks along the way.

At a high level, a personal chatbot required surprisingly few parts. These are the major pieces:

  • OpenAI account – this project works with OpenAI’s backend and relies on its APIs.
  • Knowledge base – your blog posts, essays, forum answers, reviews, etc. 
  • Vector store – a place to store your knowledge base so the AI can search it.
  • System prompt – instructions that define the chatbot’s personality, rules, and behavior.
  • Simple chat interface – a UI where users can ask questions and that handles interactions with the OpenAI backend.

Now let’s look at how those pieces connect.

You are what you write

This project is only possible if you have a writing habit. For example, I could create Virtual Alexandra, but not Virtual Dmitry, since my husband is not fond of writing.

My own writing varies from technical articles to poems with everything in between. Also, most of the texts were on different websites, and I lost access to some of them. I needed to get those texts together in one place and in one format.

I chose JSON, because this way I could keep metadata about the texts – like the original link or the published date in case I need them later.

For example:

{"text": "...", "date": "2010-12-19", "url": "..."}
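
A record in that shape can be produced with a few lines of Python. This is a minimal sketch, not one of my actual formatting scripts; `make_record` is a hypothetical helper, and the field names simply follow the example above.

```python
import json


def make_record(text: str, date: str, url: str) -> str:
    """Serialize one piece of writing as a single JSON object."""
    record = {"text": text, "date": date, "url": url}
    # ensure_ascii=False keeps non-English text (e.g. Russian) readable
    # in the file instead of turning it into \uXXXX escapes.
    return json.dumps(record, ensure_ascii=False)
```

Keeping the date and URL next to the text costs nothing now and saves a second scraping pass if the metadata is needed later.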

I relied heavily on vibe coding to get all those files together – from simple scrapers to get texts out of the lost accounts to formatting scripts and even translations, because I wanted most of the texts to be in both Russian and English.

RAG is all you need

The core technical mechanism behind Virtual Alexandra is RAG (Retrieval-Augmented Generation).

If the term is unfamiliar, here is a quick explanation. To store and search data, RAG systems use a so-called vector store. In the LLM world the system doesn’t rely only on exact keyword matching – like the Find function in a text editor – but searches in semantic space. For example, “cat” and “dog” are located close to each other in this space. When the user asks Virtual Alexandra about my kids, it will find all the files where I wrote about “kids”, “children”, “son”, “daughter”, and so on. This is the retrieval part.

Each retrieved file has a relevance score – how closely it matches the search query. The top files become input for the system to construct an answer from. This is the augmented generation part.

The overall (and slightly simplified) formula is this:

User question + semantic search results + system prompt => LLM answer
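
To make the formula concrete, here is a toy version in Python. It is only a sketch of the general idea: real embeddings have hundreds or thousands of dimensions and come from an embedding model, while the two-number vectors in the usage example are hand-made assumptions, and OpenAI’s actual ranking is not public in this detail.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: close to 1.0 means 'same direction' in semantic space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: List[float], store: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Retrieval: score every stored text against the query, keep the best."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


def build_input(system_prompt: str, question: str, hits: List[Tuple[str, float]]) -> str:
    """Augmentation: combine prompt, search results, and question for the LLM."""
    context = "\n".join(f"- {text}" for text, _score in hits)
    return f"{system_prompt}\n\nSearch results:\n{context}\n\nQuestion: {question}"
```

For example, with a store like `{"My son started school": [0.9, 0.1], "Seattle is rainy": [0.1, 0.9]}` and a query vector `[1.0, 0.0]` standing in for a question about kids, `retrieve` ranks the first text highest, and `build_input` turns that hit into the text the model sees.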

OpenAI offers a vector store as a service with all the relevant APIs and UIs. Creating a vector store was as easy as clicking a button in OpenAI’s UI (https://platform.openai.com/storage/vector_stores/) and then uploading all my texts in the JSON format to it. You can upload files via API as well. In either case, you don’t need to worry about defining the semantic space or indexing your files – this magic is provided by OpenAI.

OpenAI UI for vector store (screenshot)

UI is the easy part

Once you create the vector store and upload the files, you can quickly vibe code the UI in your favorite tool.

I asked Cursor to create a simple web page mimicking modern chatbots. Then I told it to use RAG and gave it the vector store ID along with the API key to my OpenAI account. I also chose the gpt-5.2 model, because it was the latest at that time. I wrote enough about vibe coding in my previous posts and I don’t want to dive into it here. But if you want a refresher, start here.

System prompt is the hard part

Originally, the vibe-coded chatbot had a very simple prompt along the lines of: “You are Virtual Alexandra, answer questions based on file_search results”, where file_search is the name of the RAG tool in OpenAI. 

But the prompt is an essential component – this is where general instructions and, well, personality come from. If you don’t construct it carefully, the chatbot will default to its standard “assistant” mode – this is what most modern AI systems are trained for.

Here are the sections included in my prompt:

  • Canonical facts
    Key events and dates in my life that I wanted to prioritize above all other facts, like my birthday or places where I lived.
    Why should they be in the prompt instead of the vector store? Because with RAG the system cares about how close the words are to each other semantically, not about whether the facts are correct. For example, when the user asks “How old are your kids?” a long story about a birthday party may look more relevant to the model than a short list of dates, even if the latter contains the correct answer. There are multiple ways around this problem, but the simplest is to include key facts directly in the prompt and ask the system to prioritize them over the RAG search results.
  • Response rules
    How long the answers should be, what topics not to discuss, what languages to respond to, etc. For example, I didn’t want this bot to be used for trivia questions like “What is the Earth’s radius?”, so I put a rule to decline those. 
  • Voice
    This is where most personality rules live in my prompt. It has a general summary of my tone as well as specific instructions for different types of questions – personal, professional, etc.
    I asked ChatGPT to summarize my style based on my writings and the general description came out like this:
    “Thinking in public” style: calm, reflective, slightly sarcastic humor. 
    Well, that sounds like me, more or less, so I actually included this line in the prompt.
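
One way to keep those sections manageable is to store them separately and join them into a single prompt. The sketch below is an assumption about how that could look, not my actual prompt: the `##` header style and the `build_system_prompt` helper are hypothetical, and the section bodies are placeholders.

```python
from typing import Dict


def build_system_prompt(sections: Dict[str, str]) -> str:
    """Join named sections into one prompt, in the order they were given."""
    parts = [f"## {title}\n{body.strip()}" for title, body in sections.items()]
    return "\n\n".join(parts)


# Placeholder content; the real prompt text is not reproduced here.
prompt = build_system_prompt({
    "Canonical facts": "<key dates and places; prioritize over search results>",
    "Response rules": "<length, declined topics, languages>",
    "Voice": '"Thinking in public" style: calm, reflective, slightly sarcastic humor.',
})
```

Editing one section at a time also makes it easier to spot instructions that contradict each other.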

Those are the main steps. However, a simple implementation doesn’t mean it was trivial to get there. Once I had all the pieces in place, I started running into all kinds of LLM-related issues, such as weird answers to pretty straightforward questions. I summarized the issues and workarounds in the next section, so you don’t repeat my mistakes.

Learnings

Translate your source files into each target language

In theory, translation is optional. The words that mean the same thing in different languages are located close to each other in the semantic space, for example “Moscow” (English) and “Москва” (Russian). 

In practice, texts written in the same language as the user’s query usually get higher ranking in the search results, so they “win”. At first, I had posts from my ADandAI blog in the vector store in English only. Then I asked the chatbot how Virtual Alexandra was created:

  • When asked in English, the system answered based on my writing in the blog.
  • When asked in Russian, Virtual Alexandra found one of my old sci-fi stories where I played with the idea of a digital copy of a person and then hallucinated an answer based on that story.

Clearly label what’s fiction and what’s fact

My writing is a mixture of personal essays and pure fiction, and you can see from the previous example how it can confuse AI. LLMs treat all retrieved text as potential truth unless explicitly told otherwise.

The fix can be as simple as adding “FICTION. Do not treat as factual.” at the start of the texts. In my case, I added a separate JSON field, since I was using that format.
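
Both variants of the fix fit in one small function. This is a sketch under assumptions: the field name `fiction` is hypothetical (my actual field name is not shown here), and the function applies the text prefix and the JSON field together, although either one may be enough on its own.

```python
FICTION_NOTICE = "FICTION. Do not treat as factual."


def label_record(record: dict, is_fiction: bool) -> dict:
    """Return a copy of the record with an explicit fiction marker."""
    labeled = dict(record)              # leave the original record untouched
    labeled["fiction"] = is_fiction     # hypothetical field name
    if is_fiction:
        # Prefix variant: the notice travels with the text itself.
        labeled["text"] = f"{FICTION_NOTICE}\n{record['text']}"
    return labeled
```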

Verify the RAG search query

Here is a surprising fact: in the OpenAI backend, the system may rewrite your search query before it reaches the vector database and even send multiple queries instead of just one. I only discovered this by reviewing the logs and here is what I found:

Original question:
“What do you think about Seattle?”

Search queries:
“What do you think about Seattle?”
“Seattle opinions Alexandra”
“Seattle review travel weather culture tech scene”

Just to be clear – this is OpenAI’s implementation decision, not an inherent LLM problem. OpenAI also doesn’t let you control the query generation behavior via API calls. However, the model will follow the system prompt instructions, so I added rules like:

  • Exactly ONE search in file_search
  • Query = user’s question verbatim (fix typos only)
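
Since the rules live only in the prompt, it is worth checking the logs to see whether the model follows them. A minimal sketch of such a check, assuming the logged queries are available as a plain list of strings (the real log format differs, and `check_search_queries` is a hypothetical helper):

```python
from typing import List


def check_search_queries(question: str, queries: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the rules were followed."""
    problems = []
    if len(queries) != 1:
        problems.append(f"expected 1 query, got {len(queries)}")
    normalized = question.strip().lower()
    for query in queries:
        # Any difference is flagged, including typo fixes that the rules
        # allow, so flagged queries still need a quick manual look.
        if query.strip().lower() != normalized:
            problems.append(f"rewritten query: {query!r}")
    return problems
```

Run against the Seattle example above, this reports three problems: one for the query count and one for each of the two rewritten queries.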

Check for conflicting instructions in your prompt

I noticed that when I edit my prompt too often I end up putting contradictions into it.

For example, originally I had all the canonical facts in the vector store but then put them in the prompt. Eventually, I ended up with two conflicting instructions hidden in different parts of the prompt:

  • Only use facts from file_search.
  • Use file_search or Canonical Facts.

It doesn’t seem like a lot, but this caused my system to often ignore canonical facts (and get my kids’ ages wrong yet again.)

For me, the best way to find these issues was to upload the full prompt into ChatGPT and ask it to find conflicting instructions.

But what about fine tuning?

Those of you who are more familiar with AI techniques might ask: “Isn’t fine tuning the best way to set your voice instead of prompt instructions?” And you would be right, because in fact fine tuning was the first thing I tried. But unexpectedly that particular road turned out to be a dead end, and this will be the topic of my next post. 

Meanwhile, if you just skimmed this post, you can always ask Virtual Alexandra to summarize its inner workings for you 🙂 In other words, the chatbot can now explain how it was built based on this very text.

Chat with Virtual Alexandra
http://adandai.wordpress.com/?p=182
Welcome Virtual Alexandra

After a holiday break full of travel and family affairs, I am finally back to blogging. Since my silence was longer than I expected, I decided to introduce a new project right away. Say hi to Virtual Alexandra.

Virtual Alexandra is a chatbot that uses OpenAI (gpt-5.2 at the moment) to answer questions based on my own writing and instructions. I have produced quite a lot of texts over the years – personal essays, a family blog, technical articles, sci-fi stories, poetry, Facebook posts, almost three hundred book reviews on Goodreads, this blog – you name it. The chatbot has access to my public texts, even this very post, and so you can ask it (her?) for a book recommendation, a nice vacation spot, or my AI projects. 

But the thing I want to talk about is not tech or how I built this project (this will come in later posts). I want to reflect on how weird and sometimes unsettling the process was. I originally thought of Virtual Alexandra as a continuation of the AI for video gaming theme. While my LOTR-based mini game was about world building and exploration, this one was more like creating an NPC (non-player character) based on myself. Pretty cool, right?

However, as I kept working on this, it became less and less about gaming and more and more about, well, me. When things didn’t go well, my husband and I called this project “Drunk Alexandra” – it would spit my words back at me but give them some crazy meaning, hallucinate facts, or simply produce nonsensical answers, and yet it still sounded like me. It was really like talking to your drunk self – funny at first, but then tiring and kind of uncomfortable.

The experience also made me understand how some people get addicted to ChatGPT conversations and spiral into delusions. When you are talking to the mirror reflecting back your own life experiences, the danger of addiction is even greater. Because when things went well, I sometimes got really surprising answers, insights, and facts that I had forgotten over the years, or just a synthesis of my data that was right on point. I could see love for my husband and kids shining right back to me, as well as my self-doubts and my habit of making fun of almost everything. Sometimes it was hard to stop testing (and was it truly testing at that point?)

Here is one example. I asked Virtual Alexandra:

Where do you want to work next?

The bot answered:

I don’t have a “next company” picked in my notes. What I do see in my writing is a pendulum swing: after Microsoft I was hungry for a startup’s speed and growth, but I’m also very honest about the “romance” tax: no money, no time, and everyone washing dishes like it’s a team-building cult. So if I go back, I’d choose a place where I can lead real programs, ideally with sane boundaries.

That answer wasn’t copied from anywhere. It was synthesized from things I had written at different times. But this is what I would actually tell a recruiter. Should I use this to prepare for my next interview when I am ready to get back to work? Well… this is exactly the type of danger I was talking about before.

Before you chat with Virtual Alexandra yourself, here are some disclaimers and reminders:

  • Virtual Alexandra only uses my public texts that were shared online before, there is no data from emails or personal messages available to this bot. 
  • I didn’t write consistently throughout my life, and it can be noticeable in the answers. For example, I wrote a lot about Seattle, but gave up personal blogging long before I moved to Barcelona. So you will get a lot of facts about Seattle and only a bit about Barcelona. Also, if you know me personally pretty well, you may know more than is available to this chatbot.
  • It’s not an assistant, it is more like a personal facts search engine. Don’t ask it about trivia facts or things you wouldn’t ask me about – I did my best to put guardrails in to reject those questions.
  • It is still an LLM, after all. It can hallucinate and get things wrong. Also, a lot of personal writing normally refers to a certain point in time. If I started a new job, loved it, and wrote about it during the “honeymoon” period – it’s a fact for the system. Even if I later left that job bitterly disappointed, it would still find the happy moment and may quote it while answering the question “What was your favorite job?”
  • LLMs are not good with dates and timeframes overall. And my writing confuses it even more by using words like “recently”, “not long ago”, etc. So while the model has the basic facts about my timeline, it still can make mistakes. For example, it often assumes that my kids are very young – because I wrote about my kids when they were little more than when they grew up.

I will write more about how I created it, what worked and what didn’t in my next post. Meanwhile, enjoy conversations with Virtual Alexandra. She should have sobered up, but if you get any weird responses, please let me know. It’s still being built.

Virtual Alexandra
http://adandai.wordpress.com/?p=175
Vibe code review 

Now that I had the game stabilized and feature-complete, it was time for code review. I decided to do a first pass with AI coding tools and then ask the architect (aka my husband) to give his verdict. 

To make the post easier to follow, I am finally going to share my code. Just keep in mind that all of it was vibe-coded, and I’ll talk about the implications later in the post: https://github.com/aroussina/lotr-steps-game

And of course, you can always play the game itself here: https://lotr-steps-game.onrender.com/

AI reviewing its own work

Vibe coding is known to produce bad quality code. To see how my game’s code turned out, I employed a somewhat popular trick: I asked both Cursor and ChatGPT to review my code and tell me how the overall quality, architecture, and security can be improved.

Cursor and ChatGPT produced slightly different sets of findings, but they both agreed on the following:

  • The game was mostly in the main/server.js file (1700+ lines of code), in one big function, and needed to be split into modules for easier maintenance.
  • There were a lot of “magic numbers” in the code (if step <=10) instead of named constants (if step <=MAX_STEPS) making it hard to update the constraints in the future.
  • The user input (as well as client requests) was fully trusted and passed directly to OpenAI without any safety checks, so it was vulnerable to hacker attacks and prompt injections.
  • There were no rate limits on API requests allowing for an easy DDoS attack.
  • No CORS (Cross-Origin Resource Sharing) policy was set, which isn’t a good security posture.
  • Prompts passed to the LLM were so large that I was wasting money.
  • JSON parsing was brittle and could easily break.
    The last one is the only thing I decided not to fix. Not because I disagree with it, but because I was simply tired after battling with JSON the previous week. So this issue went to my backlog.
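
To show what the rate-limit finding means in practice, here is a small in-memory limiter. It is a Python illustration only: the game itself is JavaScript, this is not the code Cursor generated, and `SlidingWindowLimiter` and its parameters are hypothetical names.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject before calling the paid API
        hits.append(now)
        return True
```

The important design point is the order of operations: the check happens before any request is forwarded to OpenAI, so rejected requests cost nothing.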

It was fun to read, but fixing all those issues was really boring. My workflow was: pick a problem -> ask Cursor to write a detailed execution plan -> execute the plan in phases -> click “Run” and “Accept” for each of Cursor’s prompts. I didn’t want to do one giant refactoring in a single step. As a result, I didn’t face too many issues with the broken code, but it took hours and I was literally playing games on my phone while waiting for the code to generate.

So, a very hard refactoring task turned into something mechanical and yes, boring. While I joke about it, the issue of AI-driven burnout for developers is actually quite real. As for code quality, I couldn’t really say if it was getting better or worse. I got way more files and shorter functions, sure. But did the quality really improve? I didn’t know.

Human reviewing AI’s work

I did another round of code reviews with Cursor and ChatGPT and of course they found some new issues, but they looked minor to me (tighten security even more, make rate limiting smarter, do more refactoring). And now it was time to reach out to a person who actually cares about code quality – my consulting architect, aka my husband Dmitry – and ask him for a code review. He saw some of this code already during my AI adventure, but it was the first time he looked at it after the big refactoring.

🧑‍💻Dmitry’s overall comment 

It is now much easier for me to read. The file structure is decent. I didn’t check the game logic, but I can see how I would be able to understand it to fix issues in the future.

Interestingly enough, the issues Dmitry discovered were not the same as those that came up during the second round of AI reviews. Here are just some of his comments:

  • There is config/prompts.js that only contains the string literal for the system prompt. And there is also builders/promptBuilder.js that has a lot of other big string literals to create the user prompt mixed in with the game logic. This approach is inconsistent, and to make it more consistent, either promptBuilder.js needs to be refactored to separate these constants from the logic, or the only constant defined in prompts.js could be just moved to the builder class itself.
  • It’s good that location and wildcard characters are defined as enums in a separate file, config/constants.js, but these enums are not used in the rest of the code. For example, there are a lot of checks for “Mount Doom” string literal within the code, and if you ever change this to some other value, like “Mordor”, you will need to fix a lot of references instead of just one.
  • There is middleware/rateLimiter.js and it is completely commented out. Why is it even here?
  • The main/server.js is now nice and small, but a lot of complex code simply moved to api/step.js which is still long and complicated. There are also some pretty complex files in the services folder. Overall, the game logic looks more complex than I would expect from such a small game. 
  • utils/inputValidator.js contains a lot of text parsing code for locations and wildcard characters detection. I thought you delegated all that decision making to LLM.
  • There is only one unit test for the whole project: wildcardDetector.test.js. Why just one and why this one?
  • test/integration.test.js has a lot of duplicate code, this definitely can be improved.
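
The enum comment is easier to see with a tiny example. This is a Python illustration of the pattern, not the game’s JavaScript code: the `Location` enum, `FINAL_DESTINATION`, and `reached_goal` are hypothetical names, and treating “Mount Doom” as the final destination is an assumption.

```python
from enum import Enum


class Location(str, Enum):
    MOUNT_DOOM = "Mount Doom"
    # ...other locations would be listed here


# Assumption for this sketch: the adventure ends at Mount Doom.
FINAL_DESTINATION = Location.MOUNT_DOOM


def reached_goal(location: str) -> bool:
    """Compare against the enum instead of repeating the string literal."""
    return location == FINAL_DESTINATION.value
```

If the goal ever changes to “Mordor”, only the enum definition needs to be updated, which is exactly the single point of change Dmitry was asking for.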

Dmitry’s findings were, well, more architectural. He knew not only the right patterns to use, but also what I was trying to achieve with my game and my AI tooling. The only thing where he agreed with AI was that more refactoring of the core game logic was needed. It’s a little bit like someone reviewing your essay for English grammar and fact-checking (AI code review) vs actually reading the essay and highlighting incorrect assumptions or logical breakdowns (human code review).

I debated whether to fix these issues before sharing the code… And ultimately I decided to keep them the way they are for now. This project is about learning AI coding tools as much as it is about learning AI models, so I am preserving the current state of AI tools. It might be interesting to rerun the code review with a new version of ChatGPT in the future. For the record, I used the newly released gpt-5.2 in web UI and gpt-5.1-codex-max in Cursor.

Security audit

Still, I was a bit worried about sharing both the code and the publicly hosted app. So I told ChatGPT that I was going to make the project public on GitHub and asked for an additional security audit. It listed security issues that were mostly the same as in the original code review (CORS, limits, timeouts, etc.), and I had already addressed them.

Next, I asked for specific prompt injection examples that could break my game, so I could try different inputs and see the outcome. There were A LOT of examples. ChatGPT knows how it’s getting attacked.

By the way, I made sure my game now stops the adventure for all of the prompts that ChatGPT gave me. However, there is no 100% protection against prompt injection, and even if attackers don’t get the response they hoped for, it can still produce subtler effects like longer responses or higher latency. AI-hacking techniques are evolving as fast as AIs themselves.

However, for this particular project, the biggest security risk is that someone will start burning my OpenAI tokens. The app itself doesn’t have any personal, financial or other data that hackers would be interested in. And even if someone takes the app offline, it’s just a small prototype game and I can survive this.

So, my ultimate protection layer is not in the app, but in the fact that I only have a limited amount of money on the account and set up spending alerts.

This is a nice segue into cost, which is a huge and complex topic in AI. But this is what my next post will be about – it will come after the Christmas break, because I am planning to travel and enjoy family time away from AI projects. For now, as usual, I am leaving you with some hard-earned takeaways.

Learnings
  • AI doesn’t learn from its own mistakes, even after the code review
    Yes, we all know it, kind of. But it keeps catching me by surprise. You can write detailed instructions, you can tell the AI in every prompt what to do and what not to do, but it will still make certain types of mistakes, over and over again, especially when you start a new chat with an agent and lose your previous context. For example, after Cursor fixed “magic numbers” and added named constants, it didn’t hesitate to use the same magic numbers again in the newly generated code, even when they already had corresponding constants. 
  • What is the future of code reviews?
    It’s less of a learning and more of a philosophical question I am now battling with. As a sole vibe coder of the project, I didn’t care that much about the quality of the code. In fact, it was easier for me to find things in one giant file that I more or less learned already than in a distributed set of files and functions that appeared after a day of my mindless clicking. I don’t know what’s easier to manage for AI – single file or many files. It was definitely easier for my husband to understand the refactored code. But if we want to leave this whole layer to AI tools, how much will code readability matter?… I don’t have answers yet, but these are for sure interesting things to think through.
http://adandai.wordpress.com/?p=157
Performance and streaming and JSON, oh my!
Uncategorized, ai, artificial-intelligence, chatgpt, llm, lord-of-the-rings, lotr

The beauty of the diary blog is that I keep making new discoveries while writing my posts. And so I updated my mini LOTR-themed adventure game once again. With the new version, you don’t need to wait 30 seconds for each step anymore – the content will start showing up in less than one second. Plus I added a simple map. Try the game here: https://lotr-steps-game.onrender.com/

I will explain how the updates came to be, but to do that I need to continue the deep dive that I started in the previous post. Last time, I mostly talked about different LLM quirks and how they can be exacerbated by vibe coding. Now I will discuss things you start to worry about as you get ready for production – application performance and overall stability.

Performance and latency are not the same

I had a problem – my game wasn’t super fast. That’s why on the loading screen I put a message that it takes ~30 seconds to get a response. 

This is definitely a bit too long for normal gameplay. The question is – can it be optimized? My default model was gpt-5-mini, it is generally recommended as a cheaper and faster version, so it sounded like a great fit. That is, until I actually tested different models.

| Model | Average time to finish one step | Text Quality | Comment |
| --- | --- | --- | --- |
| gpt-5-nano | 68 seconds | Medium | This was supposed to be the fastest option. It wasn’t. Plus, the generated text quality dropped and sometimes contained literals like “\n\n”. |
| gpt-5-mini | 36 seconds | Good | This is what AI coding tools recommend by default. The generated text quality was acceptable, but the speed was lacking. |
| gpt-5.1 | 12 seconds | Very Good | This is the default model in ChatGPT, and a clear winner in both quality and speed. |
| gpt-5.1-pro | 6 minutes | Very Good | Huge latency, while the quality of the generated text didn’t improve that much. |

Puzzled, I went to ChatGPT to understand what was happening (and learned that ChatGPT had adopted my LOTR theme). In short – a model can be fast, but that doesn’t mean it has the best infra to run on.

Latency is the time the player spends waiting for a response to arrive, while performance is how efficiently the system actually computes and delivers that response once it starts working. Nano and mini can be faster than gpt-5.1 in theory, but their latency makes them slower in practice.

Clearly OpenAI uses shared cloud infrastructure (pretty standard), but not enough of it to give the advertised benefits to the cheaper models.

My game is just a proof of concept and I honestly don’t think major AAA games will use OpenAI APIs. Even with a 15-second response time, it would be too slow, plus taking a dependency on a third-party live service might be too risky for a major studio. But what if a game installs a small model on the client? Just to see how that would go, I asked my husband to run the game locally on his MacBook Pro, with an Apple M3 Pro chip and 36 GB of memory.

🧑‍💻Dmitry’s experiment 

I used the Llama instruct 3B model because it’s reasonably fast while producing good quality results. I ran it as a separate subprocess first and loaded the model for each API call. This resulted in about 8 seconds per step. Then I tried pre-loading the model in memory by starting it as a server and the average time per step was about 5 seconds.  

Five seconds is not instantaneous, but it’s now in a similar range to loading a new location in an open-world RPG. I was getting more and more convinced that real-time content generation can be used in games.

To stream or not to stream?

Dima and I were discussing whether a five-second delay is acceptable for a major game, when he suddenly did what architects usually do – suggested a completely different approach.

🧑‍💻Dmitry’s suggestion 

Why don’t you use the streaming API? You are generating text, the player can start reading it while the rest is still being generated. It takes longer than 5 seconds to read this text.

That got me very annoyed, and not at my husband. This streaming API suggestion is what I would have loved to get from ChatGPT from the very beginning. Streaming doesn’t speed up computation, but the player experience becomes so much better when they don’t need to stare at the empty text box. And none of the AI tools even mentioned streaming when I asked them for code reviews or architectural advice. 

I vibe-coded the streaming API logic and reran the tests to see how fast different models would start streaming and showing the content. I didn’t run the gpt-5.1-pro tests, because 6 minutes is too big of a delay to mitigate with streaming.

| Model | Average time to finish one step | Average time to first character | Text Quality | Comment |
| --- | --- | --- | --- | --- |
| gpt-5-nano | 76 seconds | 72 seconds | Medium | Looks like the model itself is fast. My requests spent a lot of time in the queue waiting to be processed, but after the streaming started the step got completed in just 4 seconds. It is definitely an infrastructure, not an LLM issue. |
| gpt-5-mini | 22 seconds | 18 seconds | Good | Same problem as with nano, but the response wait time is shorter. Probably, it means that it has more infra allocated. |
| gpt-5.1 | 10 seconds | Less than 1 second | Very Good | This is a clear winner with more or less expected quick start of streaming. This mitigated my problem with the player waiting too long for the game to respond. |
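
A minimal sketch of how these two numbers could be measured for any streamed response. It is an illustration, not the game's code: `measureStream` and `fakeStream` are hypothetical names, and the fake stream only stands in for whatever async iterable of text chunks a streaming client provides.

```javascript
// Sketch: measure time to first chunk and total time for a streamed response.
// `chunks` is assumed to be any async iterable of text pieces.
async function measureStream(chunks, onText = () => {}) {
  const started = Date.now();
  let firstChunkMs = null;
  let text = "";
  for await (const piece of chunks) {
    if (firstChunkMs === null) firstChunkMs = Date.now() - started;
    text += piece;
    onText(piece); // e.g. append to the text box right away
  }
  return { firstChunkMs, totalMs: Date.now() - started, text };
}

// Hypothetical stand-in for a model stream: waits in a "queue", then emits quickly.
async function* fakeStream(queueMs, pieces, gapMs) {
  await new Promise((resolve) => setTimeout(resolve, queueMs));
  for (const piece of pieces) {
    yield piece;
    await new Promise((resolve) => setTimeout(resolve, gapMs));
  }
}

measureStream(fakeStream(50, ["You ", "leave ", "the Shire."], 5)).then((r) => {
  console.log(`first chunk after ${r.firstChunkMs} ms, done after ${r.totalMs} ms`);
});
```

With a long queue time and short gaps between chunks, the two numbers end up close together, which is the pattern the nano and mini rows show.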

At the end of the test runs, I switched to gpt-5.1 and the streaming API, plus updated the UI. And just because I got tired of LLM quirks, I added a map of Middle-earth tracking the character’s locations. It’s nice to work with something deterministic once in a while. The map is not AI-generated; it’s taken from here: https://lord-of-the-rings.org/collections/maps.html.

To JSON or not to JSON?

Now that I had decent UX, it was time to pay attention not just to the speed of the LLM’s responses, but also to their content. In the LOTR game, I need the LLM to generate several distinct things:

  • Description of each step;
  • Three options for the player to choose from based on this description;
  • Decide if the player won or lost the game or if the game should continue;
  • Decode user input and decide if there is a sudden location change or a new character introduced to the game.

Ideally, I would do independent API calls for each item on the list. But the options for the player to choose from are generated based on the step description, so I need to do at least two consecutive API calls, which would increase latency. I already spent a lot of time fixing that issue, and I didn’t want to increase the wait time for the player. 

The solution was to put more instructions into the prompt instead of calling OpenAI APIs more often. This increases the size of the prompt, but interestingly enough, I didn’t notice an increase in latency as my prompt grew, even though it changed from a couple of lines to a complex script with multiple instructions. 

🧑‍💻Dmitry’s comment 

This is expected. Whether the input is small or large, it shouldn’t really affect how long it takes to push it through the model and get the next token, assuming it fits into the context window. Your prompts are still small relative to the context window size.

If the LLM returns multiple values at once, it’s necessary to parse its responses. By default, AI coding tools introduce JSON requirements in the prompt to enforce the output format. 

JSON OUTPUT FORMAT (STRICT):
- Keys: description (string), options (array of 3 strings), status ("ongoing"|"win"|"lose"), endReason (string, empty if ongoing), wildcardCharacter (string|null), currentLocation (string)

This is a somewhat controversial idea and I even read on LinkedIn that it’s The New Cringiest Trend in Software Development for 2025. However, if you try to avoid a JSON schema, Cursor will do very ugly text parsing, which is even more error-prone. Or you’ll need to invent your own schema, and well, it’s not something that an average vibe coder like me would do.

As I discussed in the previous post, LLMs do not necessarily follow the rules spelled out in the prompt – they follow probabilistic token generation rules. Despite all the instructions, sometimes OpenAI returned broken JSON, and I could see it in my logs. 
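
Broken syntax is only one way a reply can go wrong; it can also parse fine and still miss a key. A minimal sketch of checking a parsed reply against the keys listed in the prompt above. `validateStep` is a hypothetical helper, not something from the game's code.

```javascript
// Sketch: check that a parsed response matches the keys listed in the prompt.
// Returns a list of problems; an empty list means the object looks usable.
const STATUSES = ["ongoing", "win", "lose"];

function validateStep(step) {
  if (typeof step !== "object" || step === null) return ["response is not an object"];
  const problems = [];
  if (typeof step.description !== "string") problems.push("description must be a string");
  const options = step.options;
  if (!Array.isArray(options) || options.length !== 3 ||
      !options.every((o) => typeof o === "string")) {
    problems.push("options must be an array of 3 strings");
  }
  if (!STATUSES.includes(step.status)) problems.push("status must be ongoing, win or lose");
  if (typeof step.endReason !== "string") problems.push("endReason must be a string");
  if (step.wildcardCharacter !== null && typeof step.wildcardCharacter !== "string") {
    problems.push("wildcardCharacter must be a string or null");
  }
  if (typeof step.currentLocation !== "string") problems.push("currentLocation must be a string");
  return problems;
}

// A reply with only two options is reported instead of silently breaking the UI
console.log(validateStep({
  description: "You reach Bree.",
  options: ["Enter the inn", "Keep walking"],
  status: "ongoing",
  endReason: "",
  wildcardCharacter: null,
  currentLocation: "Bree",
}));
// prints [ 'options must be an array of 3 strings' ]
```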

When I asked Cursor to fix the JSON parsing issue, it kept reinforcing the prompt, with its favorite all caps messages:

- CRITICAL: JSON must be valid and parseable.

It didn’t really help and I kept getting the same errors. So once again, I asked my husband for advice.

🧑‍💻Dmitry’s comment 

There is no workaround for this at the moment. LLMs are only taking prompts as inputs, you can’t enforce schema any other way. To avoid errors, tools like LangChain simply validate JSON and if it’s incorrect, they return the request to the model and ask it to try again. You can ask ChatGPT to implement the same logic for your app.

After that, it was a matter of giving this prompt to Cursor’s AI agent:

Sometimes OpenAI returns broken JSON. Retry the request and ask OpenAI to fix the issue if this happens.

Cursor updated the code with the retries, but it couldn’t avoid prompt engineering completely. This time in addition to CRITICAL, it said “please”, twice. Maybe being polite helps?

const fixRequest = `CRITICAL: Your previous response contained invalid JSON that could not be parsed. Please fix the JSON syntax and return ONLY valid JSON with no additional text.
The invalid response was:
\`\`\`
${text.substring(0, 1000)}
\`\`\`
Common JSON errors to fix:
- Unescaped newlines, quotes, or special characters in string values
- Missing commas between object properties
- Trailing commas before closing braces/brackets
- Unmatched braces or brackets
- Control characters in strings that need escaping (\\n, \\r, \\t)
Please return the corrected JSON response now, ensuring all string values are properly escaped.`;
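
The validate-and-retry idea itself fits in a few lines. Below is a minimal sketch under stated assumptions: `getJsonWithRetry` is a hypothetical name, and `callModel` stands for any async function that takes a prompt and returns the model's reply as text. It is not the code Cursor generated.

```javascript
// Sketch: parse the model's reply as JSON and ask for a fix if parsing fails.
// `callModel` is a hypothetical async function: (prompt) => reply text.
async function getJsonWithRetry(callModel, prompt, maxAttempts = 3) {
  let request = prompt;
  let lastError = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = await callModel(request);
    try {
      return JSON.parse(text);
    } catch (err) {
      lastError = err;
      // Feed the broken reply back, truncated to keep the follow-up prompt small
      request = `${prompt}\n\nYour previous response was not valid JSON (${err.message}). ` +
        `Return ONLY the corrected JSON.\nPrevious response:\n${text.substring(0, 1000)}`;
    }
  }
  throw new Error(`No valid JSON after ${maxAttempts} attempts: ${lastError.message}`);
}

// Fake model for illustration: the first reply has a trailing comma, the second is valid
const replies = ['{"status": "ongoing",}', '{"status": "ongoing"}'];
const fakeModel = async () => replies.shift();

getJsonWithRetry(fakeModel, "Describe the next step as JSON.")
  .then((step) => console.log(step.status)); // prints "ongoing"
```

Capping the number of attempts matters here: every retry is another paid API call and another wait for the player.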

And then I switched to the streaming API calls… This broke my JSON workflow completely. Streaming delivers the response token by token, so the output isn’t valid JSON until the very last brace arrives. I needed to show the step description to the player before the structure was fully formed. Ugh… After a long session of vibe coding, I managed to get the step description displayed in real time, and even without special symbols like “\n”. But I ended up with lots of the ugly text parsing code that I was so eager to avoid.

function decodeJsonString(str) {
  try {
    // Wrap in quotes and parse as a JSON string to decode escape sequences
    return JSON.parse(`"${str}"`);
  } catch (e) {
    // Fallback for partial or malformed input: decode escapes in a single pass,
    // so an escaped backslash followed by "n" is not mistaken for a newline
    const simple = { n: '\n', r: '\r', t: '\t', b: '\b', f: '\f', '"': '"', '/': '/', '\\': '\\' };
    return str.replace(/\\(u[0-9a-fA-F]{4}|["\\/bfnrt])/g, (match, esc) =>
      esc[0] === 'u' ? String.fromCharCode(parseInt(esc.slice(1), 16)) : simple[esc]);
  }
}

This implementation works, but honestly, I agree now that LLM + JSON = CRINGE. The more I think about it, the more I believe this is a feature request to LLM and AI tool developers. We need some guidance and tools here. JSON might be too complex, but what is a good alternative? It would be great if coding agents would not default to JSON as the only interface to LLMs or if models supported some schema that doesn’t need to be enforced by prompts.
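
For reference, the core of the streaming problem is reading a string value before its closing quote has arrived. A minimal sketch of that idea, assuming the reply follows the JSON format from the prompt and that the first `"description"` in the buffer is the key itself. `extractPartialDescription` is a hypothetical helper, not the game's actual parsing code.

```javascript
// Sketch: pull the (possibly incomplete) "description" value out of a JSON
// reply that is still being streamed. Assumes well-formed escape sequences.
const SIMPLE_ESCAPES = { n: "\n", r: "\r", t: "\t", b: "\b", f: "\f", '"': '"', "/": "/", "\\": "\\" };

function extractPartialDescription(buffer) {
  const start = buffer.match(/"description"\s*:\s*"/);
  if (!start) return ""; // the key has not arrived yet
  let i = start.index + start[0].length;
  let out = "";
  while (i < buffer.length) {
    const ch = buffer[i];
    if (ch === '"') break; // closing quote: the value is complete
    if (ch === "\\") {
      const next = buffer[i + 1];
      if (next === undefined) break; // escape is split across chunks, wait for more
      if (next === "u") {
        const hex = buffer.slice(i + 2, i + 6);
        if (hex.length < 4) break; // incomplete \uXXXX, wait for more
        out += String.fromCharCode(parseInt(hex, 16));
        i += 6;
        continue;
      }
      out += SIMPLE_ESCAPES[next] ?? next;
      i += 2;
      continue;
    }
    out += ch;
    i += 1;
  }
  return out;
}

// Called on the accumulated buffer each time a new chunk arrives
console.log(extractPartialDescription('{"description": "Frodo leaves the Shi'));
// prints: Frodo leaves the Shi
```

Re-scanning the whole buffer on every chunk is wasteful but simple, and for a few paragraphs of text the cost is negligible.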

After the battle with JSON, I realized that my game is more or less functionally complete. Yes, I can always add more features and increase the depth of the world. But I actually achieved what I set out to do. Now comes the time to test the game for real – how will a vibe-coded app survive automated testing and an architect’s code review? Stay tuned – this is what I will talk about in the next post.

That was a lot of performance and JSON drama for one post, so I’ll wrap it up with my usual takeaways.

Learnings
  • Smaller is not necessarily faster
    The fact that OpenAI’s mini and nano models were slower than gpt-5.1 was my biggest surprise. This is why we test! But smaller models are faster if you control the infrastructure, which is what my husband proved with the self-hosted Llama model.
  • Streaming improves time-to-first-token, not total compute time
    This might be obvious, but I would like to remind my readers once again that streaming is a UX trick in my case. It didn’t speed up any execution; it just lets you start reading sooner.
  • We need better AI tools and practices
    Bias towards putting all the logic into the prompts, enforcing JSON schema for LLM responses, ugly text parsing – AI agents did not make my life as easy as I dreamed. Coding agents need to be smarter to make vibe coding viable. Please, please, give us better tools (I said “please” twice to reinforce the prompt!)
http://adandai.wordpress.com/?p=141
Hallucinating within the given constraints
Uncategorized, ai, artificial-intelligence, chatgpt, llm, technology

Despite all the shortcomings I discovered after a day of vibe coding, my little prototype convinced me that using AI for real-time content generation in games is a viable idea. But using a Large Language Model (LLM) as a proto game engine comes with unique challenges. Plus I was vibe coding – basically, an LLM ended up programming another LLM, and sometimes it created real surprises.

Before getting into details, a quick side note. In this blog I assume that readers (if I am lucky to have them) know the basics of AI. I will try to keep things simple but, if you don’t know much about LLMs, I recommend watching Deep Dive into LLMs like ChatGPT by Andrej Karpathy, which is the best intro to generative AI I have found so far.

And of course you can also try the game discussed here (just remember it’s on free hosting and may take a couple minutes to start): https://lotr-steps-game.onrender.com/ It’s a simple LOTR-themed text adventure, where you act as Frodo and must destroy the Ring in 10 steps. The game’s text is generated by the OpenAI LLM.

Now let’s dive into some challenges I encountered. 

Stay where you are

Lord of the Rings’ story moves through many distinct locations – the Shire, Rivendell, Mordor, and so on. Initially, I wanted to let the LLM run free and choose where to move next. But I quickly discovered this led to a poor game experience since the game often got stuck in the same location.

It’s understandable – LLMs generate tokens based on probabilities. If my character is in the Shire, the generated tokens are very likely to be about the Shire, and it becomes a self-reinforcing feedback loop. It was unpredictable when or where the characters would get stuck, but it was happening often, like every other game playthrough.

To fix this issue, I decided to introduce “location management” – basically, setting up a series of locations that the game should move through, which is pretty similar to what happens in most RPGs. Also, I made sure some choices offered to the player are “positive”, i.e. progress the game according to the LOTR story and others are “negative”, clearly not helping Frodo move forward. This led me straight to the next issue.

LLMs are not state machines

I gave Cursor an ordered set of locations from Lord of the Rings and some simple rules:

  • If the player selects a “positive” option, move to the next location.
  • If the player selects a “negative” one, keep the same location or move to the previous one.

I hit two issues with this approach. First, when you vibe-code an app that relies on LLMs, AI agents tend to push all the logic into the LLM prompt instead of using traditional code. So Cursor put the list of locations and the instructions on how to move between them into the prompt as plain text and also introduced complicated text parsing to figure out where the characters are. 

I resolved the first issue by explicitly asking Cursor to put locations into a list and pass them to the LLM, and this is when the second issue became apparent. Even instructions as simple as the ones above were not consistently followed by the LLM. The game would still sometimes keep characters in the same location even when a “positive” option was selected.

This made me pause and think about the situation – what am I trying to achieve and am I using the right tool? It was time to talk to my husband.

🧑‍💻Dmitry’s comment 

You are implementing a state machine, with each game location being a distinct state. But when an LLM generates text it is stateless – it is essentially a gigantic function applied to your input, but it doesn’t have any internal state.

The “state machine” was the magic phrase which made me realize what’s happening (thanks to my computer science teachers from a long time ago). It also explained another issue I was facing – why the game would sometimes run longer than 10 steps. 

What I ended up doing was explicitly telling Cursor to move all the location logic out of prompts and into code. Now the LLM receives the location as input and makes up a story about how the characters got there and why. All the location tracking is now done in code.

- CRITICAL: The step MUST happen at "${currentLocation}". Write your description as if the characters are currently at "${currentLocation}". All events, scenery, and choices must reflect this specific location.

${previousLocation && previousLocation !== currentLocation ? `- CRITICAL: The location changed from "${previousLocation}" to "${currentLocation}". You MUST explain in your description BOTH:

 1. HOW the characters traveled from "${previousLocation}" to "${currentLocation}" - describe the journey, travel, or movement (paths taken, terrain crossed, time passed, companions, encounters)

 2. WHY they went to "${currentLocation}" - explain their motivation, reason, or purpose for going there (what they seek, what they must do, what called them, what danger they flee, etc.)

And this is essentially the only way I can think of for maintaining state with LLMs – the state machine should be outside the LLM and the current/previous/next states must be passed as references.
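
A minimal sketch of what "the state machine outside the LLM" can look like. Everything here is an illustrative assumption rather than the game's real code: the location list is made up, `advance` is a hypothetical helper, and each option is assumed to carry a "positive" or "negative" tag that the code can read. For a negative choice the sketch simply stays put, which is one of the two behaviors the rules above allow.

```javascript
// Sketch: location state lives in code; the LLM only narrates the result.
// The list below is an illustrative assumption, not the game's real data.
const LOCATIONS = ["The Shire", "Bree", "Rivendell", "Moria", "Lothlórien", "Mordor", "Mount Doom"];
const MAX_STEPS = 10;

function advance(state, choice) {
  // choice is "positive" or "negative", decided by code, not by the prompt
  const last = LOCATIONS.length - 1;
  const index = choice === "positive"
    ? Math.min(state.index + 1, last)
    : state.index; // a negative choice keeps the party where it is
  const step = state.step + 1;
  return {
    index,
    step,
    previousLocation: LOCATIONS[state.index],
    currentLocation: LOCATIONS[index],
    finished: step >= MAX_STEPS, // the step limit is enforced here, not by the LLM
  };
}

let state = { index: 0, step: 0 };
state = advance(state, "positive");
console.log(state.previousLocation, "->", state.currentLocation); // The Shire -> Bree
```

The `previousLocation` and `currentLocation` values are what gets interpolated into the prompt, so the model never has to remember where the characters are or count the steps.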

Probability wins over randomness

The next problem happened when I added a “wildcard” character summon feature to the game that lets players bring characters from different universes like Marvel or Star Wars into the LOTR realm. Based on my learnings from the locations, I provided a list of characters that are likely to fit the LOTR story and that most people would recognize. Again, ChatGPT dumped all those characters into the prompt, but I decided to let it be, since I just needed to select one. It’s not really a state machine problem. So I had a prompt like the one below:

- Randomly select a character from the character pool.

- Character pool: Luke Skywalker, Yoda, Darth Vader, Obi-Wan Kenobi, Princess Leia, Han Solo, Harry Potter, Hermione Granger, Ron Weasley, Dumbledore, Mickey Mouse, Donald Duck, Goofy, Minnie Mouse, Pluto, Batman, Superman, Wonder Woman, Aquaman, Iron Man, Captain America, Spider-Man, Thor, Black Widow, Hulk, Joker, Voldemort, Loki, Moriarty, Sherlock Holmes

But somehow every time I reran the game, it chose Hermione Granger. To fix it, Cursor added CRITICAL and MUST statements to the prompt (yep, in all caps), emphasizing randomness, which didn’t help at all. Writing in all caps is one of Cursor’s favorite tricks – I have CRITICAL and IMPORTANT all over my code.

- CRITICAL: Randomly select a character from the character pool. This MUST be a different character each time.

Frustrated, I abandoned Cursor and asked ChatGPT via web browser why this was happening, and got a surprising answer.

Basically, if you give a list of items to an LLM and ask it to pick one randomly, it will tend to pick the most probable one, ignoring your command. ChatGPT also recommended a workaround using prompt engineering:

First, internally generate a random integer from 1 to 30. Use that integer to select the Nth character in the list.

This actually worked, and I finally saw Superman and Sherlock Holmes in the game. I was ready to move on, but my husband raised an interesting question.

🧑‍💻Dmitry’s question 

Are you sure all numbers are treated the same? I remember “42”, for example, being a “special” number. 

42 is the famous answer to the Ultimate Question of Life, the Universe, and Everything (and Douglas Adams is one of my favorite writers), so yeah, numbers can be sticky too. I saw in the logs that certain characters were more likely to get selected (even if it wasn’t just Hermione). The real solution here, again, is to take this logic out of the LLM completely and use code to generate randomness.
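
A minimal sketch of doing the selection in code. The pool is shortened from the list above, and `pickRandom` and the wording of the prompt line are hypothetical, not taken from the game.

```javascript
// Sketch: choose the wildcard character in code and pass only the winner to the prompt.
const CHARACTER_POOL = ["Luke Skywalker", "Yoda", "Hermione Granger", "Sherlock Holmes", "Superman"];

function pickRandom(pool, random = Math.random) {
  // Math.random() returns a value in [0, 1), so the index is always in range
  return pool[Math.floor(random() * pool.length)];
}

const wildcard = pickRandom(CHARACTER_POOL);
const promptLine = `- The wildcard character for this step is "${wildcard}".`;
console.log(promptLine);
```

The `random` parameter exists only so the function can be tested with a fixed value. `Math.random` is not cryptographically secure, which should not matter for picking a guest character.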

As I mentioned in the previous post, I had to fix quite a few issues, so there will be a part two of this deep dive. For now, I’ll just leave you with some learnings.

Learnings

Beware of the bias towards prompt engineering

It’s very tempting for both humans and AI tools to attempt solving everything via prompt engineering when you use LLMs in your app. Even when the problems arise, ChatGPT, Cursor or similar tools try to reinforce the prompt repeatedly without success. Today, you have to explicitly ask the tools to introduce data structures and handle algorithms via code.

Following algorithms is one of LLMs’ weaknesses

To me, this feels similar to the famous “count r in strawberry” example or doing math outside of “thinking” models. Algorithmic instructions contradict the “next best token” strategy LLMs use. My quick conversation with ChatGPT produced a nice explanation that I will just copy verbatim: LLMs are better at “I know how this should look” than “I will execute this step-by-step with absolute discipline.” They often bypass the algorithm and jump to the shape of the answer, because that reduces loss during training.

Know what LLMs are good at

This post focused on the issues, but in such write-ups the good things often get omitted. Here are the things that I didn’t have to fix – they just worked out of the box for my game:

  • Matching the tone and feeling of the imaginary world.
  • Incorporating new and unexpected twists into the game.
  • Explaining how an event happened, even if it was a completely unexpected one.

A lot of my problems were things easily fixed with normal code. But the things that LLMs are good at would be way harder – if not impossible – to implement with traditional development techniques.

http://adandai.wordpress.com/?p=123
Building a game in one day
Uncategorized, ai, artificial-intelligence, chatgpt, technology, writing

My previous post described an unsuccessful attempt at building the LOTR game with Replit. In this one I will walk you through how I prototyped a full game in a day using only prompts, what worked, and what didn’t.

As I mentioned earlier, I decided to reach out to my husband Dmitry. It’s hard not to, when there is a software architect with many years of experience sitting right next to me. He was busy reading books about transformers and fine-tuning self-hosted models, but of course I took priority over that. 

🧑‍💻Dmitry’s advice 

All you need here is a good prompt. Just “prototype” with ChatGPT in the web browser by feeding it the instructions. Make it play this game to see whether it understands the rules. Then you can make an app out of it by sending the same instructions via API calls. This is what prompt engineering is for.

Prompt engineering for the win

Somehow, prompt engineering didn’t sound very serious to me (and I was wrong), but I trust my husband and so I tried. I explained the rules to ChatGPT pretty quickly and played several iterations of the game, making sure it could be won and lost. I had to tweak the rules several times, then incorporate the changes back into the initial prompt and start a new conversation with ChatGPT. My prompt ended up being pretty verbose, but to make it more fun I added pictures and made sure to change the story in my own way.

After I was satisfied with the prompt, I simply told ChatGPT: 

This time, ChatGPT started generating code right away. It chose Node.js + Express backend with a React frontend. For readers less familiar with tech: that’s a popular JavaScript setup for building web apps. However, I couldn’t care less about that – I wasn’t planning to write the code myself. But I could still read the code, and I saw that it was statically generating steps the same way Replit did. 

// Define story steps
const steps = {
  1: {
    description: "Bilbo has left you with a mysterious golden ring. You sit in Bag End, the Shire feels peaceful but heavy with fate.",
    options: ["Take the ring", "Refuse the task", "Talk to friends"],
  },
  2: {
    description: "You know the ring is dangerous. Dark riders search the land. You must decide quickly.",
    options: ["Leave the Shire", "Hide the ring", "Seek Gandalf"],
  },
  // ... continue through step 10
};

My next prompt was the magic key that opened the vibe coding door:

The first version never works

I don’t think repeating the full conversation with ChatGPT is useful here, since everyone can do it themselves, and the results will be slightly different. Eventually, it produced a downloadable version of the app along with installation and running instructions. It also walked me through creating API keys with OpenAI. Quick code inspection confirmed that now the code was calling the proper APIs.

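// Assumed import; the snippet as shown omitted it
import OpenAI from "openai";

// --- OpenAI client ---
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

// Ask the model for the next step of the adventure
const resp = await client.responses.create({
  model: "gpt-5-mini",
  input: [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: stepPrompt }
  ]
});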

A nice downloadable package didn’t mean the app actually worked. Again, same as with Replit, there were countless problems: missing packages, API errors, and other random issues. The big difference was the absence of paywalls – I kept feeding the errors back to ChatGPT so it could fix them, and eventually I was able to run the game. The UI was much simpler than the one I got from Replit, but it did the job:

What’s interesting is that it compressed my verbose prompt to a much smaller one: 

You are running a Lord of the Rings text adventure game. 
Divide the story into 10 steps, from Frodo receiving the ring to the destruction of the ring. 
At each step, describe the situation and give exactly three options plus a 'custom input' choice.

It made me feel bad about my original prompts. But later, when I had to do QA, fix bugs, and improve game logic, the prompt grew so much that now it’s literally two pages (!!!) long. It was fascinating to watch how AI was programming itself via prompt engineering. So it’s definitely no joke.

IDE is still a must, even for vibe coders

Originally, I simply used the ChatGPT web interface and kept downloading or editing files in a text editor, then running the app in the terminal. But copy-pasting got cumbersome pretty fast, as expected. Also, asking ChatGPT to produce a new version of the app sometimes led to totally unexpected results. At some point, my app stopped using React; it now uses vanilla JavaScript and HTML, and I can’t even remember when that happened.

Another surprise came when I asked ChatGPT to add a small feature and it completely changed the UI. I want to emphasize that this is not what you would expect as a PM when you ask a developer to address a minor issue. This certainly looks quite different from the initial version:

I realized that I needed a proper integrated development environment (IDE). I tried to ask ChatGPT for analogs of Replit, but got a long list of meaningless recommendations. So I went old school with Google search and finally settled on Cursor. For context: Cursor is an IDE with AI built in, similar in spirit to Replit. According to multiple Reddit posts, it takes way longer to hit a paywall. In my case, I worked with a free tier for over a month before I finally upgraded to a “Pro” version (again, hitting some limit that I can’t fully understand).

I tried prototyping straight in Cursor, but it didn’t work that well for me, and I was never able to get to the nice UIs like with Replit or ChatGPT. Cursor started throwing questions at me and asked for a ton of permissions for running some weird terminal commands and overall, I just got confused.

So, this is my current setup – I use ChatGPT in the browser to generate the initial files and then edit them via Cursor. “Edit” may be too strong of a word, I am asking an AI agent within Cursor to do the work. This way at least I can see what’s being changed, plus AI within Cursor is a bit more careful about rewriting too much code at once. 

I’ve made it! Or not yet?

All the tooling issues aside, after just one day I had a working prototype without writing a single line of code. It satisfied my original requirements:

  • There was UI and backend.
  • The game created a new journey every time. 
  • It matched the LOTR tone quite well.
  • It incorporated my wild custom inputs into the story. 

And it was super fun to create – I don’t remember enjoying tech so much in a while, maybe since my early teenage years. Here is a computer, and it can do magic. Just say in plain English what you want, and it will do it, even if not from the first try.

The next day, when the excitement wore off a bit, I realized that the whole game felt like AI slop, and I wasn’t ready to show it to anyone else. Here is just a sample of issues I found:

  • The game could run way more than the specified ten steps; I once finished in sixteen.
  • Frodo would get stuck in the Shire and then simply appear at Mount Doom at the last moment.
  • It was prone to prompt injection, a situation where a user can make your app perform unwanted actions or disclose sensitive information.
  • And of course, it would randomly throw errors.

So I decided to make it presentable and actually playable, but it took way longer than I anticipated. Getting it to the state that I felt comfortable sharing took several days. And this is what my next post will be about – a deep dive on prompt engineering and vibe coding when you try to move one step beyond the very first prototype. To wrap things up, here are some learnings from my first vibe coding day.

Learnings
  • Vibe coding is real!
    I didn’t code for many years, and it would take me way longer than one day to produce this app without AI tools, even with all the outlined issues. I am not sure the same is true for my husband, or that I would be able to do this if I really didn’t have any prior coding experience. The errors and questions along the way could be pretty confusing. But it did work, and I really didn’t do a single edit myself.
  • OpenAI’s ChatGPT is my best AI tool so far.
    Even when ChatGPT suggested AI coding tools, using it directly gave me the best results, especially during the initial prototyping phase. It’s not the goal of this blog to produce a comprehensive review of all available tools, and maybe there is some magical IDE that is free and works out of the box. But given the noise on the market from all the competing offerings, ChatGPT is the most straightforward option to start.
  • Prompt engineering is no joke.
    I was expecting way more “coding” when making this game. But this whole game is all about prompts, which became evident when I examined the generated code. In fact, when you work with LLMs via AI agents, it feels like they lean toward prompt engineering instead of using “regular code”. 
http://adandai.wordpress.com/?p=84