Part of thespblog.net

Welcome! This is my corner of the internet ⛺🔥 I'm a guy who likes to read a lot, tries to write, and most recently, obsessed with gen ai. Other interests...

008: Creating CPE
Building CPE: From Chat Tool to Programmable Agent Harness

I've been working on a project called CPE for the past eight months, and I've used it daily since day one. What started as a simple chat-based code editor — inspired by aider back in July 2024, which in AI years is a lifetime ago — has evolved into something quite different: a general-purpose, self-assembling agent harness that power users can configure exactly the way they want.

Today, CPE lets you define your own system prompts, connect any tool via MCP with selective filtering, compose agent skills, manage conversations with forking and cross-model resumption, and — most importantly — run a Code Mode where the AI writes and executes Go programs that call your tools as typed functions. But it took a while to get here.

The Problem

CPE was born before the era of sidebar agents like Cursor's Composer or Windsurf's Cascade, and before the command-line agents that are everywhere today. At the time, most code assistants used only a fraction of the available context window, relying on semantic search to save on tokens — understandable when plans were subscription-based rather than pay-as-you-go.

The thing is, with a little experimentation, I could clearly see that as long as I provided the right context myself, the model generated code that was plausibly something I would write — just much faster. I didn't need a fancy retrieval system. I needed control.

I had a few specific frustrations:

  • Context control. I wanted to decide exactly what the model sees, not hope that some embedding-based retrieval picked the right files.
  • System prompt ownership. I wanted to define my own guardrails — what to do, what not to do, how to verify its own work — without wondering whether my .cursorrules file was being ignored or conflicting with the application's built-in system prompt.
  • Minimalism. I didn't want a VS Code fork, an extension, a Python runtime, or even a TUI. Just a binary I could go install, pipe input into, and read output from. Something that prints to the terminal and gets out of the way.
  • Experimentation. I had ideas I wanted to try. Connecting an LSP to the agent via tool definitions. A string-replace tool that used tree-sitter to verify edits hadn't introduced syntax errors. Wild stuff — some of it worked, some didn't.

The Journey

Phase 1: Pipe In, Get Text Out

CPE started as bare-bones as it gets. Pipe text via stdin, get a response. Copy code into a file, pass it as input, get analysis back. Then I added tools so the LLM could actually modify the local filesystem. Then utility commands: token counting per file, a tree view showing how much each file contributed to the total token budget. Useful, but still pretty simple.

Phase 2: Going All-In on MCP

When the Model Context Protocol came along, I had a realization: the LLM only ever interacts with tools. What matters is the flexibility to design your system prompt in conjunction with the tools you provide for a given task. Everything else is plumbing.

So I made a sharp pivot: CPE became a thin MCP client. I stripped out all the built-in functionality — even file reading and writing — and created separate MCP servers for everything. Need to list files? That's an MCP server. Need to edit text? Another MCP server. Along the way, I'd spin up minimal servers for whatever I needed or adopt existing third-party ones.

This was philosophically clean. In practice, it had problems.

Phase 3: The MCP Problem (and Discovering Code Mode)

As many in the ecosystem have learned, MCP can easily overload the context window. Install a popular MCP server, get excited about the capabilities — and then realize it exposes 30 tools with verbose descriptions that confuse the model and bloat every request.

I tried to make it work. I looked at overriding tool descriptions or replacing schemas, but that felt like fighting against MCP's design rather than working with it. The real issue was deeper: I wanted composability. I wanted the model to chain multiple tool calls, use conditionals, loop over results — all things that the one-tool-call-at-a-time paradigm doesn't support well.

Then I read the Cloudflare blog post on Code Mode and Anthropic's post on programmatic tool calling, and it clicked. Instead of exposing tools as individual actions the model invokes one at a time, what if the model could write a program that calls tools as functions?

Phase 4: Code Mode in Go

The initial prototype tried TypeScript, but I landed on Go for the execution language. The result: CPE exposes MCP tools as typed Go functions — with structs generated from each tool's input and output schema, and the tool description as the function's doc comment. The model generates a complete Go program implementing a Run function, CPE compiles and executes it, and the result comes back.

This opened up everything. The model can:

  • Compose multiple tool calls in a single execution
  • Loop over files, search results, or API responses
  • Branch based on intermediate results
  • Import Go standard library packages for data processing, HTTP requests, file manipulation
  • Run things in parallel using goroutines and errgroups

Instead of a dozen round-trips between the model and the tool server, the model writes one program that does it all. A frontier model can one-shot these compositions in seconds.

Phase 5: Sub-Agents as Go Functions

The final piece fell into place when I realized that CPE itself could be exposed as an MCP server — and since any MCP tool becomes a Go function in Code Mode, sub-agents became just function calls.

The orchestrating agent can spin up a sub-agent to handle a tangentially related task, get back a result, and continue — without that sub-task consuming any of its own context window. The sub-agent has its own context, its own system prompt, its own conversation. And because it's a Go function, you can call it inside a goroutine, inside a loop, conditionally, with templated inputs. The outputs can be parsed, filtered, or aggregated.

This is a fundamentally different model from how most agent frameworks handle delegation, and it's the pattern I'm most excited about exploring further.

Where It Is Today

I'm genuinely happy with where CPE is. Despite the crowded landscape of coding agents — both open-source and proprietary — it occupies a distinct niche: a minimalist, configurable harness where the user controls the system prompt, the tools, and the execution model.

Code Mode has been the standout success. I've used it for far more than software engineering:

  • Debugging macOS storage issues after an upgrade filled my disk in ways I couldn't understand — CPE helped me trace and clean up the culprits.
  • Diagnosing iMessage sync problems by inspecting local databases and logs.
  • Managing my email through a skill that imports a third-party IMAP library to talk to Gmail's servers — searching, labeling, deleting, all through natural language.

Programming and software engineering are increasingly different things. LLMs are already remarkably good at programming — translating intent into code. CPE has become less of a "coding agent" for me and more of a general-purpose computer-use tool.

What's Next

A few things I want to explore:

Compaction

Modern context windows are large, and between Code Mode and sub-agents, I can handle most tasks without hitting the limit. For the remaining 5-10% of cases, I currently use a manual workaround: run a skill that produces a compaction summary, then start a new conversation with that summary as input. It works, but I'd like to make it a first-class feature — or better yet, make it possible through the hooks system described below.

Agent Loop Hooks

I want to add lifecycle hooks to the agent loop: events that fire before and after tool calls. The obvious uses are validation (linting after file edits, running tests after code changes), observability, and security (blocking potentially destructive operations before they execute).

The tricky part is making hooks flexible enough to modify the conversation, not just observe it. If hooks could rewrite the conversation state on the fly, compaction might not need to be a built-in feature at all — it could just be a hook that triggers when context usage crosses a threshold. But mutable hooks interact poorly with conversation persistence and cross-model resumption, so I'm still thinking this through. For now, the models follow instructions well enough that I can put pre/post-tool-call behaviors directly in the system prompt.
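For illustration only, an observe-or-block hook (the simpler, non-mutating kind) could be sketched like this. The `ToolCall`, `Hook`, and `Execute` names are hypothetical and describe one possible shape, not a design CPE has settled on.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ToolCall is a hypothetical description of a pending tool invocation.
type ToolCall struct {
	Name  string
	Input string
}

// Hook is a hypothetical pre-tool-call hook; returning an error blocks the call.
type Hook func(call ToolCall) error

// BlockDestructive rejects shell commands that contain "rm -rf".
func BlockDestructive(call ToolCall) error {
	if call.Name == "shell" && strings.Contains(call.Input, "rm -rf") {
		return errors.New("blocked: destructive command")
	}
	return nil
}

// Execute runs every hook before the tool and stops at the first rejection.
func Execute(call ToolCall, hooks []Hook, run func(ToolCall) string) (string, error) {
	for _, h := range hooks {
		if err := h(call); err != nil {
			return "", err
		}
	}
	return run(call), nil
}

func main() {
	run := func(c ToolCall) string { return "ran " + c.Name }
	hooks := []Hook{BlockDestructive}

	out, err := Execute(ToolCall{Name: "shell", Input: "ls"}, hooks, run)
	fmt.Println(out, err)

	_, err = Execute(ToolCall{Name: "shell", Input: "rm -rf /tmp/x"}, hooks, run)
	fmt.Println(err)
}
```

A hook that rewrites conversation state would need a richer signature than this, which is exactly where the persistence and resumption questions come in.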

An Agent Standard Runtime

This is the most speculative idea, but potentially the most powerful.

Code Mode already runs arbitrary Go code, and my skills already import third-party libraries (the email skill uses an IMAP library to talk to Gmail). The model can discover how to use unfamiliar libraries on its own via go doc, which lets it inspect types, functions, and documentation for any Go package. If its parametric knowledge fails, it can just go look.

But raw third-party libraries are often too low-level. For the email case, it would be far more productive if the model could import a high-level module with functions like SearchEmail, GetThread, DeleteThread, LabelThread — rather than constructing IMAP commands from scratch each time.

I want to build a set of these purpose-built Go modules: an "agent standard runtime" of utilities that Code Mode can import. Structure-aware file editing (e.g., a ReplaceGoFunction that finds a function by name and replaces it, saving tokens on reproducing the original text). High-level wrappers for common services. Helpers for data processing.
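As a sketch of what a ReplaceGoFunction helper might look like, the version below uses the standard `go/parser` and `go/ast` packages to locate a top-level function by name and splice in new source. The signature and behavior are assumptions made for this example; it skips methods and re-parses the result as a basic syntax check.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// ReplaceGoFunction replaces the top-level function called name in src with
// newFunc and returns the updated source. Methods (functions with a
// receiver) are skipped in this sketch.
func ReplaceGoFunction(src, name, newFunc string) (string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, parser.ParseComments)
	if err != nil {
		return "", fmt.Errorf("parse source: %w", err)
	}
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Recv != nil || fn.Name.Name != name {
			continue
		}
		// Include the doc comment, if any, in the replaced span.
		start := fn.Pos()
		if fn.Doc != nil {
			start = fn.Doc.Pos()
		}
		from := fset.Position(start).Offset
		to := fset.Position(fn.End()).Offset
		out := src[:from] + newFunc + src[to:]
		// Re-parse to check that the edit did not introduce a syntax error.
		if _, err := parser.ParseFile(token.NewFileSet(), "out.go", out, 0); err != nil {
			return "", fmt.Errorf("edit produced invalid Go: %w", err)
		}
		return out, nil
	}
	return "", fmt.Errorf("function %q not found", name)
}

func main() {
	src := `package demo

func Greet() string { return "hi" }

func Other() int { return 1 }
`
	out, err := ReplaceGoFunction(src, "Greet", `func Greet() string { return "hello" }`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The caller only supplies the function name and the new body, so the model never has to reproduce the original text to match against.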

This echoes the Voyager paper — the famous work where an LLM in Minecraft created its own tools to make progress. The same pattern applies here: the agent, directed by the user or autonomously, creates utilities, documents them, and reuses them later. Other agents could share these modules. It looks like MCP or Claude Code plugins, but it's just Go modules — usable in Code Mode, in standalone Go scripts, or by any agent that can write Go.

I think this is a powerful pattern for building general-purpose agents, and I'm excited to see where it leads.


If any of this resonates, CPE is on GitHub. It's MIT-licensed, written in Go, and installs with a single go install. I'd love to hear what you think.

https://thespblog.net/008-creating-cpe/
007: Virtual Tool Calling: A Token-Efficient Alternative

When building AI agents, tool calling is the standard way to give models access to external functionality. Most providers offer native tool calling APIs where you define schemas, and the model outputs structured JSON.

But there's another approach: virtual tool calling. Instead of using the provider's tool calling mechanism, you describe tools in plain text and have the model output tool invocations as delimited text (often XML).

What's the difference?

Traditional tool calling:

  • Tools defined as JSON schemas in the API request
  • Model outputs a structured tool_use block with JSON parameters
  • Code containing newlines, quotes, and special characters must be JSON-escaped

Virtual tool calling:

  • Tools described in the system prompt as plain text
  • Model outputs tool invocations as text wrapped in tags like <tool_call>
  • Code can be written as-is without escaping

Try it yourself

Here's the same tool call in both formats. You can run these through a token counter to see the difference.

Traditional tool calling (JSON)

The model outputs a tool_use block with JSON-encoded parameters:

{
  "type": "tool_use",
  "id": "tool_1",
  "name": "execute_go_code",
  "input": {
    "code": "package main\n\nimport (\n\t\"bufio\"\n\t\"context\"\n\t\"fmt\"\n\t\"os\"\n\t\"strings\"\n)\n\nfunc Run(ctx context.Context) error {\n\t// Read cities from file and get weather for each\n\tfile, err := os.Open(\"cities.txt\")\n\tif err != nil {\n\t\treturn fmt.Errorf(\"failed to open file: %w\", err)\n\t}\n\tdefer file.Close()\n\n\tvar results []string\n\tscanner := bufio.NewScanner(file)\n\tfor scanner.Scan() {\n\t\tcity := strings.TrimSpace(scanner.Text())\n\t\tif city == \"\" {\n\t\t\tcontinue\n\t\t}\n\n\t\tweather, err := GetWeather(ctx, GetWeatherInput{\n\t\t\tCity: city,\n\t\t\tUnit: \"fahrenheit\",\n\t\t})\n\t\tif err != nil {\n\t\t\treturn fmt.Errorf(\"failed to get weather for %q: %w\", city, err)\n\t\t}\n\n\t\tresults = append(results, fmt.Sprintf(\"%s: %.0f°F\", city, *weather.Temperature))\n\t}\n\n\tif err := scanner.Err(); err != nil {\n\t\treturn err\n\t}\n\n\tfmt.Println(\"Weather Report\")\n\tfmt.Println(\"==============\")\n\tfmt.Println(strings.Join(results, \"\\n\"))\n\treturn nil\n}\n",
    "executionTimeout": 60
  }
}

Notice how every newline becomes \n, every tab becomes \t, and quotes inside the code need escaping.

Virtual tool calling (XML)

The model outputs plain text with XML delimiters:

<execute_go_code>
<code>
package main

import (
	"bufio"
	"context"
	"fmt"
	"os"
	"strings"
)

func Run(ctx context.Context) error {
	// Read cities from file and get weather for each
	file, err := os.Open("cities.txt")
	if err != nil {
		return fmt.Errorf("failed to open file: %w", err)
	}
	defer file.Close()

	var results []string
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		city := strings.TrimSpace(scanner.Text())
		if city == "" {
			continue
		}

		weather, err := GetWeather(ctx, GetWeatherInput{
			City: city,
			Unit: "fahrenheit",
		})
		if err != nil {
			return fmt.Errorf("failed to get weather for %q: %w", city, err)
		}

		results = append(results, fmt.Sprintf("%s: %.0f°F", city, *weather.Temperature))
	}

	if err := scanner.Err(); err != nil {
		return err
	}

	fmt.Println("Weather Report")
	fmt.Println("==============")
	fmt.Println(strings.Join(results, "\n"))
	return nil
}
</code>
<executionTimeout>60</executionTimeout>
</execute_go_code>

The code appears exactly as written. No escaping needed.
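On the harness side, extracting such a call can be as simple as a few regular expressions. The sketch below is one possible parser for the `execute_go_code` format shown above; the `ToolCall` type and `ParseToolCall` function are names invented for this example, and it assumes the code itself never contains a literal closing `</code>` or `</execute_go_code>` tag.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// ToolCall holds the fields of the execute_go_code call format.
type ToolCall struct {
	Code             string
	ExecutionTimeout int
}

var (
	// (?s) lets . match newlines so multi-line code is captured.
	callRe    = regexp.MustCompile(`(?s)<execute_go_code>(.*?)</execute_go_code>`)
	codeRe    = regexp.MustCompile(`(?s)<code>(.*?)</code>`)
	timeoutRe = regexp.MustCompile(`<executionTimeout>\s*(\d+)\s*</executionTimeout>`)
)

// ParseToolCall extracts the first execute_go_code call from model output.
func ParseToolCall(text string) (ToolCall, error) {
	m := callRe.FindStringSubmatch(text)
	if m == nil {
		return ToolCall{}, fmt.Errorf("no <execute_go_code> block found")
	}
	body := m[1]
	code := codeRe.FindStringSubmatch(body)
	if code == nil {
		return ToolCall{}, fmt.Errorf("missing <code> element")
	}
	// Drop the newlines that sit between the tags and the code itself.
	call := ToolCall{Code: strings.Trim(code[1], "\n")}
	if t := timeoutRe.FindStringSubmatch(body); t != nil {
		n, err := strconv.Atoi(t[1])
		if err != nil {
			return ToolCall{}, fmt.Errorf("bad executionTimeout: %w", err)
		}
		call.ExecutionTimeout = n
	}
	return call, nil
}

func main() {
	text := "I'll run this.\n<execute_go_code>\n<code>\npackage main\n\nfunc Run() {}\n</code>\n" +
		"<executionTimeout>60</executionTimeout>\n</execute_go_code>"
	call, err := ParseToolCall(text)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("timeout=%d\n%s\n", call.ExecutionTimeout, call.Code)
}
```

Unlike a native tool call, nothing validates the output against a schema here, so the parser has to handle missing or malformed tags itself.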

The token efficiency finding

I ran token counts on equivalent conversations using both approaches. The same code, the same tool descriptions, the same task.

Virtual tool calling used ~30% fewer tokens.

The savings come from escaping overhead. When you pass code as a JSON string, every newline becomes \n, every quote needs escaping, every backslash doubles. In virtual tool calling, the code sits in plain text between XML tags.

For short strings, this doesn't matter much. For multi-line code blocks—which is common when using code-as-tool-call patterns—the overhead adds up.

A note on model providers

The actual token savings will vary depending on your model provider. Each provider post-trains their models differently for tool calling, which affects how tool calls get tokenized and generated.

Some providers may have optimized their tokenizers or generation for native tool calling in ways that reduce the escaping overhead. Others may add additional formatting or structure that increases it. The 30% figure comes from testing with Claude—your results with other models may differ.

The only way to know for sure is to measure with your specific provider and use case.

Where the savings come from

The token efficiency gains have two sources:

  1. No escaping. In JSON strings, newlines become \n, tabs become \t, quotes become \", and backslashes double up. Each escape sequence adds tokens. In XML-delimited text, these characters appear as-is.

  2. No JSON syntax overhead. Traditional tool calls require braces, colons, quotes around keys, and structural formatting that have nothing to do with the tool name or arguments. Compare:

{"type":"tool_use","id":"tool_1","name":"execute_go_code","input":{"code":"...","executionTimeout":60}}

<execute_go_code>
<code>...</code>
<executionTimeout>60</executionTimeout>
</execute_go_code>

The JSON version needs quotes around every key, colons, commas, and nested braces. These all cost tokens.

Tradeoffs

Virtual tool calling is better when:

  • Your tool inputs contain code, prose, or other multi-line text
  • You're optimizing for cost and already using XML-style prompts
  • You want simpler output parsing (just regex for tags)

Traditional tool calling is better when:

  • You need provider-specific features (parallel tool calls, tool choice constraints)
  • Your inputs are mostly short strings and numbers
  • You want the validation that comes with typed schemas
  • You're building on frameworks that expect structured tool responses

Implementation note

Most providers don't penalize you for putting tool descriptions in the system prompt instead of the dedicated tools parameter. The tokens count the same either way. The savings come purely from how the model's output gets encoded.

If you're already prompting the model to output in a specific format anyway, virtual tool calling may cost you nothing to try.

Counterpoint: Native Tool Calling May Still Win

Token efficiency isn't everything. There are reasons to prefer native tool calling despite the overhead:

Guaranteed schema compliance. OpenAI's Structured Outputs uses constrained decoding so that, with strict mode enabled, responses conform to the supplied JSON schema. With XML, you rely on the model to produce valid markup, and parsing failures happen.

Models are trained for it. Providers train their models specifically on native tool calling formats. The Berkeley Function Calling Leaderboard distinguishes between native function calling (FC) and prompt-based approaches—FC consistently scores higher. Research like ToolACE (ICLR 2025) shows that models trained on function calling data achieve state-of-the-art results, with 8B parameter models rivaling GPT-4 on tool calling benchmarks.

Provider-specific optimizations. Native tool calling likely uses special tokens and internal representations that the model was trained with. OpenAI's fine-tuning cookbook explicitly supports fine-tuning for function calling when accuracy is critical.

Multi-turn reliability. Native formats handle conversation state and tool result injection in standardized ways. Custom formats must reinvent this, and edge cases accumulate.

The token savings from virtual tool calling are real, but for production systems where a single malformed tool call breaks an entire workflow, the reliability of native tool calling may be worth the cost.

Conclusion

Virtual tool calling isn't universally better. But for agents that generate code or other text-heavy outputs, the token savings are real. Worth measuring for your specific use case.

https://thespblog.net/007-virtual-tool-calling-a-token-efficient-alternative/
006: Strange quirks with Sonnet 3.7

I've discovered two interesting quirks when working with Claude 3.7 Sonnet compared to earlier versions.

Quirk 1: JSON Array Parsing Issues

When working with my agentic coding CLI, Sonnet 3.7 consistently struggles with the get_related_files tool, which expects an array of file paths as input.

Despite the schema clearly specifying that input_files should be an array:

"input_files": {
  "description": "An array of input files...",
  "items": { "type": "string" },
  "type": "array"
}

Sonnet 3.7 consistently outputs a string instead of an array:

"input": {
  "input_files": "internal/agent/agent_instructions.txt"  // WRONG
}

While Sonnet 3.5 has no problem:

"input": {
  "input_files": [
    "internal/agent/agent_instructions.txt"  // CORRECT
  ]
}

Quirk 2: Excessive Proactiveness

Sonnet 3.7 is noticeably more pushy and proactive than Sonnet 3.5. It frequently:

  • Modifies code without explicit requests to do so
  • Executes tools that might be outside the scope of your request
  • Takes initiative in ways that can sometimes overreach

The only effective way to control this behavior is to explicitly instruct it not to be proactive unless specifically told to do so by the user, similar to how Claude Code handles this balance.

Have you noticed any strange quirks with Sonnet 3.7?

Update May 10 2025: Another thing I have noticed is Sonnet 3.7's predilection for summarizing its actions or edits to a codebase despite explicit instructions NOT to do so in the system prompt. While a minor nitpick, it is annoying that this is so baked into the model that it disregards system instructions to avoid the practice. For context, these are the instructions that I give to the model:

... omitted prompt ...
* **DO NOT** summarize your actions before yielding back to the user, the user can see all actions you took, and all of your thought, as such there is no need to summarize what you did
... omitted prompt ...

https://thespblog.net/006-strange-quirks-with-sonnet-37/
005: Why Neural Code Retrieval is Overrated

In recent years, we've seen an explosion of AI-powered coding assistants like GitHub Copilot and Cursor (followed by what seems like a new VSCode fork every other week). These tools rely heavily on neural code retrieval to provide context to large language models. While these tools have shown impressive capabilities, their fundamental approach to context gathering through embedding-based retrieval might be suboptimal. Instead, I argue that we should leverage either language-specific tooling or AST-based parsing combined with targeted heuristics to build more reliable, explainable, and effective coding assistants.

The Current Landscape: Neural Code Retrieval

Current coding assistants typically use a combination of:

  • Embedding models to encode code snippets and find similar pieces
  • Heuristic-based approaches (open files, recently used files)
  • Proximity-based context gathering

While this approach can work, it essentially treats code as unstructured text, ignoring the fact that we have much more powerful tools at our disposal for understanding code structure and relationships.

Three Better Approaches to Context Retrieval

1. AST-Based Parsing with Heuristics

Tools like tree-sitter provide fast, reliable parsing of code into Abstract Syntax Trees (ASTs). While tree-sitter itself can't resolve symbols or determine types (as it's a parser, not a compiler), we can use it alongside smart heuristics to retrieve relevant context:

func processData() {
    result := fetchUserData()
    processResult(result)
}

With this approach, we could:

  1. Use tree-sitter to parse the code and identify function calls (fetchUserData, processResult)
  2. Scan project files for these function definitions, again using tree-sitter to parse and identify function declarations
  3. Follow import statements to scan dependent packages
  4. Build a graph of function calls and definitions

This provides much more precise context than embedding-based similarity search, while remaining relatively language-agnostic.
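The first and last steps can be sketched in a few lines. To keep the example dependency-free it uses Go's standard `go/parser` and `go/ast` packages instead of tree-sitter, so it only handles Go source; the overall shape (parse, collect call sites, map callers to callees) is the same. `CallGraph` is a name made up for this sketch.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
)

// CallGraph maps each top-level function in src to the sorted names of the
// unqualified functions it calls. Qualified calls (pkg.Foo) and method calls
// are skipped; builtins and conversions like len(x) or string(b) are not
// filtered out in this sketch.
func CallGraph(src string) (map[string][]string, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "src.go", src, 0)
	if err != nil {
		return nil, fmt.Errorf("parse source: %w", err)
	}
	graph := map[string][]string{}
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		seen := map[string]bool{}
		callees := []string{}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			if id, ok := call.Fun.(*ast.Ident); ok && !seen[id.Name] {
				seen[id.Name] = true
				callees = append(callees, id.Name)
			}
			return true
		})
		sort.Strings(callees)
		graph[fn.Name.Name] = callees
	}
	return graph, nil
}

func main() {
	src := `package demo

func processData() {
	result := fetchUserData()
	processResult(result)
}

func fetchUserData() string { return "u" }

func processResult(s string) {}
`
	graph, err := CallGraph(src)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("processData -> %v\n", graph["processData"])
}
```

Every edge in the resulting graph corresponds to a call site in the source, which is what makes the retrieved context explainable.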

2. Language Server Protocol (LSP) Integration

For editors like Cursor that are built on VSCode, the Language Server Protocol provides an elegant solution for context retrieval. LSP is a standardized protocol that allows any editor to communicate with language servers that provide rich code intelligence features. This approach offers several advantages:

First, LSP is already widely adopted, with servers available for most popular programming languages. These servers provide capabilities like go-to-definition, find-references, and type information - exactly what we need for context retrieval. Second, since many editors already integrate with LSP, this approach requires minimal additional infrastructure. Third, LSP servers are maintained by the language communities themselves, ensuring high-quality and up-to-date language support.

Using LSP, we can:

  • Get precise symbol definitions and references
  • Retrieve type information and documentation
  • Find all implementations of interfaces
  • Navigate through workspace symbols
  • Access semantic tokens and syntax highlighting

3. Language-Specific Tooling

Another powerful approach is to leverage language-specific tooling that already exists for symbol resolution and type checking. For example:

  • Go's built-in go/types package and go/ast for complete symbol resolution
  • Rust's rust-analyzer for detailed code analysis
  • TypeScript's language service for type information and symbol resolution
  • Java's JDT (Java Development Tools) for full semantic analysis

Let's see how this works in Go:

type UserService interface {
    GetUser(id string) (*User, error)
}

func processUserData(svc UserService) {
    // GetUser returns two values, so both must be received.
    user, err := svc.GetUser("123")
    // Go's type checker can tell us exactly:
    // - The types of user (*User) and err (error)
    // - Where UserService is defined
    // - All implementations of UserService
    _, _ = user, err
}

Using Go's native tooling, we can:

  • Resolve all type information precisely
  • Find interface implementations
  • Track cross-package dependencies
  • Follow symbol definitions across the entire program
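
As a minimal sketch of the interface-implementation lookup, the example below type-checks a single source string with `go/types` and uses `types.Implements` to list the named types that satisfy an interface. `Implementers` is a helper invented for this example, and it assumes the source has no imports, since no importer is configured.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
)

// Implementers type-checks src and returns the names of package-level named
// types whose value or pointer method set satisfies the interface ifaceName.
func Implementers(src, ifaceName string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil, fmt.Errorf("parse source: %w", err)
	}
	// No Importer is configured, so src must not import other packages.
	conf := types.Config{}
	pkg, err := conf.Check("demo", fset, []*ast.File{file}, nil)
	if err != nil {
		return nil, fmt.Errorf("type-check: %w", err)
	}
	obj := pkg.Scope().Lookup(ifaceName)
	if obj == nil {
		return nil, fmt.Errorf("%s not found", ifaceName)
	}
	iface, ok := obj.Type().Underlying().(*types.Interface)
	if !ok {
		return nil, fmt.Errorf("%s is not an interface", ifaceName)
	}
	var out []string
	// Scope.Names returns names in sorted order.
	for _, name := range pkg.Scope().Names() {
		tn, ok := pkg.Scope().Lookup(name).(*types.TypeName)
		if !ok || types.IsInterface(tn.Type()) {
			continue
		}
		t := tn.Type()
		if types.Implements(t, iface) || types.Implements(types.NewPointer(t), iface) {
			out = append(out, name)
		}
	}
	return out, nil
}

func main() {
	src := `package demo

type User struct{ ID string }

type UserService interface {
	GetUser(id string) (*User, error)
}

type dbService struct{}

func (d *dbService) GetUser(id string) (*User, error) { return &User{ID: id}, nil }

type unrelated struct{}
`
	impls, err := Implementers(src, "UserService")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(impls)
}
```

For a real project the same idea would run over whole packages with imports resolved, which needs a configured importer or a loader such as `golang.org/x/tools/go/packages`.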

Why Language-Specific Approaches Win Over Neural Retrieval

Accuracy

Language-specific approaches provide precise symbol resolution instead of relying on similarity-based guessing. When working with code, we can obtain exact type information and guarantee that we find relevant definitions and implementations. This stands in stark contrast to probabilistic matching used in neural approaches, where there's always uncertainty about whether the retrieved context is truly relevant.

Explainability

Every piece of context included through language-aware approaches has a clear reason for its inclusion. We can trace exact paths from where a symbol is used to where it's defined, and the results are deterministic. This makes it easier to debug issues and understand why certain suggestions or completions are being made, unlike the black-box nature of neural retrieval systems.

Feasibility

Implementing language-specific solutions is surprisingly practical. Most popular programming languages already have robust tooling that we can leverage, and supporting the top 10-15 languages would cover the vast majority of use cases. While there is an upfront cost to implement support for each language, this is a one-time investment compared to the ongoing costs of training and maintaining neural models. The engineering effort required is well-defined and builds upon decades of existing work in compiler technology and language tooling.

Context Management and LLM Behavior

One of the most compelling arguments for structured retrieval lies in how it interacts with LLM behavior and context management. While modern LLMs can technically process massive context windows (some handling entire books), this capability comes with significant caveats that directly impact real-world performance.

First, there's the issue of context utilization. Even when an LLM can accept a large context window, it doesn't always effectively utilize all of that context. Embedding-based approaches often try to compensate for retrieval uncertainty by including more "top-k" results, hoping to catch all relevant information. This leads to context bloat without guaranteeing better outcomes.

The cost implications are substantial. Each token in the context window increases the computational cost and latency. When using embeddings with reranking to improve accuracy by including more potential matches, you're essentially paying for the LLM to process a lot of possibly irrelevant code. This affects both the financial cost per request and the time to first token - critical metrics for user-facing tools.

Most importantly, LLMs can actually perform worse when given irrelevant context. It's not just a matter of wasted tokens; irrelevant information can actively distract the model and degrade the quality of its responses. This is where structured retrieval shines: by following actual code relationships through symbol resolution and dependency graphs, every piece of context included is guaranteed to be relevant by construction. We're not guessing at relationships through statistical similarity - we're following the exact links that make the code work.

This deterministic relevance has cascading benefits. We can be more selective about context inclusion without fear of missing critical information, leading to smaller, more focused context windows. This results in faster responses, lower costs, and most importantly, more accurate and reliable outputs from the LLM.

Performance

Language-aware approaches offer significant performance advantages over neural retrieval systems. From a computational perspective, structured approaches eliminate the need to maintain large vector indexes in memory or compute expensive similarity metrics like cosine distance over thousands or millions of vectors. Instead of running neural networks for embedding generation and approximate nearest neighbor (ANN) searches, we can simply traverse ASTs and symbol tables with deterministic algorithms.

This performance advantage manifests in both computational resources and speed. We avoid the high memory footprint required for ANN indexes and the computational overhead of similarity searches. The structured approach can often be faster than embedding-based retrieval since we're doing direct lookups and graph traversals rather than vector similarity computations over large datasets.

Additionally, these approaches are highly cacheable - parsed ASTs and symbol tables can be efficiently stored and reused. When we retrieve context, we get exactly what we need without wasting valuable context window space in our LLMs with potentially irrelevant code snippets. This efficiency becomes particularly important when working with large codebases where precise context retrieval is crucial for generating accurate completions.

A Note on Quick Completions vs Agent Workflows

It's worth acknowledging that not all AI coding features have the same requirements. For quick completions and features like Cursor's Tab autocomplete, an embeddings-based approach combined with smart heuristics (like considering open files and recent edits) might actually be more suitable. These features prioritize speed and don't necessarily need perfect context - they just need to be good enough to help developers write their next line of code quickly.

However, for more complex scenarios like coding agents (think Cursor's composer or Windsurf flows) where an AI is trying to understand and modify significant portions of a codebase, structured retrieval becomes crucial. These agents need precise understanding of code relationships and dependencies to make informed decisions and generate reliable code changes.

Conclusion

Code isn't just text - it's a graph of symbols, types, and dependencies that we can traverse deterministically. When we treat it as plain text and rely on embedding-based similarity to find relevant context, we're throwing away the precise relationships that make code meaningful in the first place. It's like having a map but choosing to navigate by looking at satellite photos and guessing which blurry patches might be roads.

For quick autocomplete features that suggest the next line of code, the fuzzy pattern matching of embedding approaches makes sense - developers can quickly reject incorrect suggestions, and the speed benefits outweigh perfect accuracy. But for coding agents that need to understand and modify entire codebases, this approximation breaks down. An agent can't guess whether it's looking at the right implementation or hope it found all the relevant type definitions - it needs to know.

The tools to get this precise context already exist. Whether we use ASTs to follow function calls, LSP to resolve symbols, or language-specific tooling to trace types, we can gather exactly the context we need. Not only is this more reliable, but it's also computationally cheaper than maintaining giant vector indexes and computing similarity scores. We don't need to approximate code relationships when we can just follow them.

This post was crafted with a little help from Claude 🤖

https://thespblog.net/005-why-neural-code-retrieval-is-overrated/
004: LLM enabled rewriting is the new refactoring

I write code for a living. I've also been following the development of LLMs, especially their coding abilities. It has become apparent that LLMs are now capable of coding non-trivial solutions, as shown by the rise of LLM autocomplete in IDEs like GitHub Copilot, as well as agentic AI developers like Devin. The barrier to writing new lines of code continues to diminish. With LLMs becoming more and more capable (especially powerful open-weight models like Llama 3 70B or DeepSeek-Coder-V2), it is inevitable that we will see a new kind of developer workflow, one that leverages the cheap intelligence at scale that LLMs provide. I want to talk about one such workflow that may exist: rewriting is the new refactoring.

Rewriting has long gotten a bad rap, as it can be an expensive, time-intensive process. Often, you might have to completely stop feature development while rewriting, all in hopes that the rewrite will be worth it. Rewrites are often initiated when something is perceived as sufficiently difficult to achieve with the current codebase, whether that be performance, additional features, etc. As such, rewrites carry high expectations from the start, promising to overshadow the previous codebase in one or more aspects, and for the same reason they are rarely initiated. Unless the rewrite will deliver sufficient value over the previous codebase, it is heavily discouraged. But LLMs change that.

LLMs will make rewrites common. When an LLM can produce thousands of lines of code and test them against countless test scenarios, what prevents a developer from rewriting on every new feature request? Well, there are a couple of things.

One is that the developer has to trust the code written by an LLM. It is a pretty common sentiment that code written by LLMs is not good code, and I agree. In addition, LLMs may introduce vulnerabilities, alter existing code they are not supposed to touch, and cause a whole slew of other problems. But as agentic systems and the models underpinning them become more popular and reliable, these problems will be solved.

The other thing is that LLMs have jagged intelligence. They can often code Twitter clones and Hacker News clones in one shot, because there are so many permutations of these tutorials online! There are plenty of Hacker News clones in every conceivable language. Does this mean LLMs are incapable of solving new and difficult problems? Right now that's true, but with the initial taste of o1 and the newfound interest in scaling compute with search, it might not be for long. Regardless, anyone who's used Claude Sonnet 3.5 (3.6 now?) knows that it is a great coding model, especially given a system that provides the model with relevant context, so it's not true that LLMs are only good for the basic, repetitive code seen online. At least, it feels like Claude can definitely handle some non-trivial amount of complexity.

The last thing I want to share is that rewriting will not only become more common, but occur more frequently at smaller scales. Often, rewriting is associated with rewriting from scratch. But with LLMs, version control, and some well-written tests, we can selectively rewrite entire sections of the codebase as we develop, reducing technical debt. I personally have experienced this as I have developed some of my own tools, rewriting entire sections of a tool within hours as my idea of what I want becomes more clear.

https://thespblog.net/004-llm-enabled-rewriting-is-the-new-refactoring/
003: The varying bandwidth of language

I've always liked to read, much more than I liked to write. Writing is a cumbersome process, refining the text over and over again to asymptotically approach the exact shape of a thought you wish to express (which mind you, in the process of writing can evolve and mutate). All the more reason why as an avid reader, I do appreciate well-written pieces (and why I am trying to write more).

I prefer written language to spoken language for communication. Trying to express complex concepts or even emotions is difficult in a spoken setting: there is no scratchpad to refine your words, you have a limited amount of time before the audience's attention drifts away, and the quality of the spoken words can vary based on your state of mind. It's also very easy to make mistakes and express emotion when you don't want to. Most importantly, the pace of information communicated is bottlenecked by the transmitting person's tongue.

When reading text, your mind is free to process the information at a speed it wants. It can skip forward and backward, it can seek specific phrases or sentences, or even reference other text at will. This process applies to writing as well. Well-written pieces are read a thousand times before they are published. This is also the reason why written language is more dense and generally more high quality than spoken language. The pace of information absorbed is bottlenecked by your mind.

This is the reason why I prefer written language to spoken language.

While there is a certain joy in engaging in conversation, I have felt over and over again, when trying to absorb some important or complex concept, that it would be much easier to process if it were written down.

I have more to complain about regarding the deficiencies of spoken language and the beauty of written language, but I'll stop here. I encourage people to read more. You learn a lot about a lot.

https://thespblog.net/003-the-varying-bandwidth-of-language/
002: Running GSM8K Bench for cents

I recently got some time to try out Modal, which is a serverless platform that offers GPUs. I was searching for an easy and cost-effective way to run inference and fine-tune models (since I am GPU poor). There were a couple of options during my research, such as Replicate and friends, Huggingface Inference Endpoints, etc., but I was drawn to Modal for a couple of reasons.

Actually Serverless

Modal is a completely serverless Python platform. This is different from a scale-to-zero model like Replicate's, where you create and upload a container, select a GPU, and call out to it through an API endpoint which can be rate-limited (if you are calling an endpoint for LLM inference, I encourage you to explore using Go!). In Modal, a native Python function call becomes a remote container call with just an annotation, no serialization/deserialization logic or platform-specific SDK needed. This makes it really easy to adopt Modal, as you can do anything you want as long as the parameters being passed around between functions are supported by cloudpickle. This means I can use Huggingface Pipeline, TGI, VLLM, Ollama, whatever I want, as long as I encapsulate the functionality in a Python function. In addition, the function can scale out across inputs by simply provisioning more containers. This is one of the biggest draws, as I can set a concurrency limit and call .map over some list of function inputs and it will automatically create the needed number of containers, do the work, and return the ordered list of results.
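
To make the fan-out pattern concrete without needing a Modal account, here is a local stand-in built on Python's standard ThreadPoolExecutor. This is not Modal's API: fake_inference and fan_out are hypothetical names, and threads on one machine play the role that containers play on Modal. The only point is the shape of the pattern, which is a concurrency cap, a map over inputs, and results that come back in input order.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_inference(prompt: str) -> str:
    """Hypothetical stand-in for a remote GPU function; it just upper-cases the prompt."""
    return prompt.upper()


def fan_out(inputs: list[str], concurrency_limit: int) -> list[str]:
    """Run fake_inference over all inputs with at most `concurrency_limit` workers."""
    with ThreadPoolExecutor(max_workers=concurrency_limit) as pool:
        # Executor.map yields results in the same order as the inputs,
        # regardless of which worker finishes first
        return list(pool.map(fake_inference, inputs))


results = fan_out(["a", "b", "c", "d"], concurrency_limit=2)
print(results)
```

The benchmark script later in this post uses the same idea, except that each call lands in its own GPU container instead of a local thread.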

Cost Effective

For my use cases, in order for a platform to be cost effective, I just want to pay for the compute I use. Modal allows exactly that: I can run a Python program locally, provision compute in Modal on demand, and once the program exits, all provisioned compute immediately terminates. This is especially useful for experimentation, running evaluations, and offline batched inference to maximize GPU utilization. Later in this post, I'll show you how to run the test split of GSM8K bench against Mistral 7B v0.2, for $0.60 in ~150 seconds using multiple A10G GPUs. If you are not doing something like offline batched inference, you can still scale out from zero and back down to zero, with some minor cost incurred in the form of idle compute, which is configurable. If you want to pay for 5 minutes of idle compute to avoid cold starts, you can.

Some cons

While Modal is great for the most part, and I'm definitely impressed, it does have some minor cons as of the present.

Cost

First, the most glaring con is cost. Here is a simple cost comparison of on-demand GPUs from a couple of different platforms as of the time of this blog post:

Per-hour cost per platform (USD)

GPU                Modal   Runpod   Lambda Labs
T4                 0.59    -        -
L4                 1.05    0.44     -
A10G               1.10    -        -
A100 40 GB         3.73    -        1.29
A100 80 GB PCIe    5.59    1.89     -
H100 PCIe          7.65    3.89     2.49

It may seem odd that I described Modal as cost effective given the cost table above, but in practice, most workloads are bursty, and if you tune your Python programs, you can maximize GPU usage and minimize execution time. Many times during experimentation, such as prompt engineering, you are only running the GPUs for seconds at a time, while the rest of the time the GPU is idling, which is wasted cost. Unless you have sustained, stable traffic, Modal is likely to be far more cost effective.
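
To put a rough number on "bursty", here is a back-of-the-envelope sketch using the A100 40 GB prices from the table. It assumes the serverless platform bills only for busy time and the other platform bills for the full hour regardless of utilization. It also ignores idle timeouts and cold starts, so read the result as an approximation.

```python
def break_even_utilization(serverless_rate: float, always_on_rate: float) -> float:
    """Busy fraction of each hour above which the always-on GPU becomes cheaper."""
    return always_on_rate / serverless_rate


def serverless_cost_per_hour(serverless_rate: float, busy_fraction: float) -> float:
    """Effective hourly cost when only busy time is billed (an assumption)."""
    return serverless_rate * busy_fraction


# A100 40 GB hourly prices taken from the table above
MODAL_RATE = 3.73
LAMBDA_RATE = 1.29

threshold = break_even_utilization(MODAL_RATE, LAMBDA_RATE)
cost_at_10_percent = serverless_cost_per_hour(MODAL_RATE, 0.10)

print(f"break-even busy fraction: {threshold:.1%}")
print(f"serverless cost at 10% utilization: ${cost_at_10_percent:.2f}/hour")
```

Under these assumptions, the GPU would need to be busy roughly a third of every hour before the cheaper always-on instance wins, which interactive experimentation rarely reaches.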

Log retention

Log retention is really low: 24 hours, after which the logs are permanently deleted. This is a minor inconvenience, as realistically we would export logs to some external stack like ELK, Splunk, etc., but during experimentation, it would be nice to have a longer retention time to view results or logs from previous runs.

Benchmarking

I didn't start from scratch. I took the example of running Gemma 7B using VLLM and adapted it to run Mistral 7B Instruct v0.2 against the GSM8K benchmark. I recommend running the example first and playing around with it before following along here. The entire benchmark runs in a single main.py file.

Prerequisites

A part of the script runs locally on your machine and a part runs in Modal. In order to run the script, we need to have some dependencies installed for the local part of the script.

pip install modal datasets transformers jinja2

Make the container image

We want to run on VLLM and use the Mistral 7B Instruct v0.2 model. Here is the first part of the main.py file:

import copy  
import os  
import time  

# Some necessary imports to setup our serverless app
from modal import Image, Stub, enter, gpu, method  

# The directory that the model will be downloaded to from HuggingFace
MODEL_DIR = "/model"
# The huggingface repo id for the model
BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
# Specifying the GPU we want to use to Modal
GPU_CONFIG = gpu.A100(count=1, memory=40)  

# A function that runs on Modal when building our custom container image. We want to cache the model weights so our cold start time is reduced
def download_model_to_folder():  
    from huggingface_hub import snapshot_download  
    from transformers.utils import move_cache  
  
    os.makedirs(MODEL_DIR, exist_ok=True)  
  
    snapshot_download(  
        BASE_MODEL,  
        local_dir=MODEL_DIR,  
    )  
    move_cache()  

# The spec for how we want to define our custom container image
image = (  
    Image.from_registry(  
        "nvidia/cuda:12.2.0-devel-ubuntu22.04", add_python="3.10"  
    )  
    .pip_install(  
        "vllm==0.3.2",  
        "huggingface_hub==0.20.3",  
        "hf-transfer==0.1.5",  
        "torch==2.1.2"  
    )  
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})  
    .run_function(  
        download_model_to_folder,  
        timeout=60 * 20,  
    )  
)  

The next part of main.py defines what we want to run on Modal's infrastructure. We are creating a Modal Function with lifecycle events to run our model on VLLM:

# Define our stub to label our ephemeral apps in Modal
stub = Stub("gsm8k-demo")

# Annotation to let Modal know this needs to run on their infrastructure
@stub.cls(  
    gpu=GPU_CONFIG, # Use the GPU config we defined above
    timeout=60 * 5, # Timeout the function after 5 minutes
    container_idle_timeout=60 * 5, # Shut down an idle container after 5 minutes. For our experiment, this doesn't matter, as the modal CLI will take care of terminating workers as soon as our script finishes.
    concurrency_limit=10, # Don't run more than 10 concurrent functions
    image=image, # Use the container image we defined above
    retries=3 # In case of problems or timeout, retry n number of times
)
class Model:  
    def __init__(self):  
        self.llm = None  
 
    # Run this function once on container startup
    @enter()  
    def load_model(self):  
        from vllm import LLM  

        # Create the LLM engine to do batched inference
        self.llm = LLM(MODEL_DIR, max_model_len=16752, max_context_len_to_capture=16752, kv_cache_dtype="fp8_e5m2")  

    # The function that our script calls for batched inference
    @method()  
    def generate(self, prompts: list[str], stop_seqs: list[str] = None) -> dict:  
        from vllm import SamplingParams, RequestOutput  
        import time  
        
        sampling_params = SamplingParams(  
            temperature=0.75,  
            top_p=1,  
            max_tokens=800,  
            presence_penalty=1.15,  
            stop=stop_seqs,  
        )  
        
        # Get the time before generating  
        start_time = time.time()  
        
        # Generate the outputs from the LLM Engine
        results: list[RequestOutput] = self.llm.generate(prompts, sampling_params, use_tqdm=False)  
        
        # Get the time after generating
        end_time = time.time()  
        execution_time = end_time - start_time  
        
        # Count the number of input tokens
        input_tokens = 0  
        # Count the number of output tokens
        output_tokens = 0  
        for output in results:  
            input_tokens += len(output.prompt_token_ids)  
            output_tokens += len(output.outputs[0].token_ids)
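        # Note: "output" holds only the first completion of the batch as a sample;
        # the entrypoint below reads just the token counts and the timing
        return {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "output": results[0].outputs[0].text,
            "execution_time": execution_time
        }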

For the final part, we will define an entrypoint into our benchmark script:

# This annotation lets modal CLI know to call this function to run the script
@stub.local_entrypoint()  
def main():  
    from datasets import load_dataset  
  
    from transformers import AutoTokenizer  
    
    # Get the start time of the script
    start_time = time.time()  
    
    # Load the benchmark dataset
    dataset = load_dataset("gsm8k", name="main", split="test")  
  
    messages = []  
    
    # Load the model tokenizer so we can use the chat template
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    stop_seqs = [tokenizer.eos_token]  
    
    # Number of few shot examples for input
    k = 8  
    
    # Convert the few shot examples to a dict that will represent chat conversation
    for i in range(k):  
        question = dataset[i]["question"]  
        answer = dataset[i]["answer"]  
        messages.append({  
            "role": "user",  
            "content": question  
        })  
        messages.append({  
            "role": "assistant",  
            "content": answer  
        })  
    
    # Remove the few shot examples from the benchmark
    dataset = dataset[k:]  
    
    # Batch size to pass to the workers. This is the ideal number I found for A100 GPUs after testing a couple of different numbers
    batch_size = 200  
    
    # Instantiate the Modal Function
    model = Model()  
    
    # Convert the conversation dicts to a prompt string using the tokenizer's chat template, and construct the list of batched prompt strings
    inputs = []  
    prompts = []
    for i in range(len(dataset["question"])):  
        question = dataset["question"][i]  
        
        # Copy the few shot dict conversation
        completion = copy.deepcopy(messages)  
        
        # Append the actual word problem we want the LLM to solve
        completion.append({  
            "role": "user",  
            "content": question  
        })  
        
        # Convert the few shot examples to the model's prompt format using the tokenizer's chat template
        prompts.append(tokenizer.apply_chat_template(completion, tokenize=False))  
        
        # When we have accumulated enough prompt strings for a single batch, append it to the list of inputs
        if len(prompts) == batch_size or i == len(dataset["question"])-1:  
            inputs.append((prompts, stop_seqs))  
            prompts = []  
  
    # The special starmap function maps our batched inputs to multiple containers and returns the results in the same order
    responses = list(model.generate.starmap(inputs))  
    
    # Report the stats
    total_GPU_execution_time = 0  
    total_input_tokens = 0  
    total_output_tokens = 0  
    for response in responses:  
        total_GPU_execution_time += response["execution_time"]  
        total_input_tokens += response["input_tokens"]  
        total_output_tokens += response["output_tokens"]  
  
    print(f"total GPU execution time: {total_GPU_execution_time}")  
    print(f"total input tokens: {total_input_tokens}")  
    print(f"total output tokens: {total_output_tokens}")  
    print(f"script execution time: {time.time() - start_time}")

To run the above, just run modal run main.py. Running the above, I get this output:

total GPU execution time: 355.26616978645325
total input tokens: 2101425
total output tokens: 221878
script execution time: 97.08975577354431

You can see that the total GPU execution time is more than the actual script execution time, since we utilized multiple GPUs concurrently to process the benchmark. It cost me $0.74 to run the script. As a comparison, utilizing an inference platform like Together.xyz to call the exact same model with a maximum of 100 concurrent calls took about 52 seconds and cost $0.4646606. Obviously, in this case calling an optimized LLM inference platform is both cheaper and faster; the tradeoff is that you are boxed into their API. As of the time of this blog post, you cannot run a custom model, there is no support for providing images as part of the prompt to a vision language model like Qwen, you are subject to rate limits, there is no support for structured output libraries like outlines, and you cannot batch inputs to the API. With Modal, there are no such limitations, and you have a drastically simplified programming model. On Together.xyz, in order to complete the benchmark as fast as possible, I wrote a Golang program that used a bounded number of goroutines, channels, and wait groups to concurrently send requests to the API and manage the responses, which added significantly to the boilerplate needed to run the benchmark. When experimenting, this additional code can slow development velocity down, or worse, introduce bugs that take away time that could be spent on running more experiments.
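
The gap between GPU time and wall-clock time can be turned into a couple of rough throughput figures. The sketch below only does arithmetic on the numbers printed by the A100 run above. The variable names are mine, and "effective parallelism" is an informal ratio rather than a count of containers, since the wall-clock time also includes local work such as loading the dataset and building prompts.

```python
# Numbers copied from the A100 run output above
gpu_seconds = 355.26616978645325
wall_seconds = 97.08975577354431
input_tokens = 2101425
output_tokens = 221878

# How many GPU-seconds of work were done per second of wall-clock time
effective_parallelism = gpu_seconds / wall_seconds

# Total tokens (prompt + completion) handled per wall-clock second
tokens_per_wall_second = (input_tokens + output_tokens) / wall_seconds

# Generated tokens per second of GPU time, summed across all containers
output_tokens_per_gpu_second = output_tokens / gpu_seconds

print(f"effective parallelism: {effective_parallelism:.2f}x")
print(f"tokens per wall-clock second: {tokens_per_wall_second:.0f}")
print(f"output tokens per GPU second: {output_tokens_per_gpu_second:.0f}")
```

Because every prompt repeats the same eight few-shot examples, the input token count dominates the total, so the tokens-per-second figure mostly reflects prompt processing rather than generation.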

Sidenote

Just by changing the GPU config and batch size, I was able to significantly reduce the script execution time, and make it cheaper! It now costs $0.64. Here is the output of the script:

total GPU execution time: 155.90221166610718
total input tokens: 2101425
total output tokens: 219306
script execution time: 42.53459310531616

Change the following lines of code:

# Specifying the GPU we want to use to Modal, change this to H100
GPU_CONFIG = gpu.H100(count=1)

# Batch size to pass to the workers. Reduce this to utilize more workers concurrently
batch_size = 175

Conclusion

I will definitely be adding Modal as part of my toolbox when experimenting with models and prompt engineering. Due to its serverless nature, running experiments is cheap, and the programming model makes it simple to try out different things. As a general serverless platform, it also lends itself well to offline batch jobs. I don't believe it is cost effective for running fine-tuning jobs, as fine-tuning is a stable workload, and we can see in the cost comparison table above that there are many other options that offer on-demand GPUs at much cheaper prices.

https://thespblog.net/002-running-gsm8k-bench-for-cents/
001: Go into AI with Go
Python in AI

These days, everything from prompting to training to inference is done in Python. There are other languages that also see usage, such as Rust (see HuggingFace tokenizers or Text Generation Inference) or C++ (TensorRT), but these languages augment rather than replace. Interestingly, Python seems to be the primary language used for prompting as well, even though there is nothing unique to the Python language or runtime that provides an advantage for prompt engineering. There are many libraries that are meant to aid in prompt engineering, such as DSPy, LangChain, and Instructor. While the necessity of these libraries is debatable, I would argue that their core functionality can really be built in any language. Prompt engineering, at its core, boils down to sending specially crafted strings to a text generation server. How fast these strings can be sent to the server and a response returned is often the bottleneck.

Prompt Engineering is Language Agnostic

Prompt engineering is programming language agnostic, since the actual inference of the model is essentially behind a REST interface, which can be called from any programming language. If you really wanted to, you could do prompt engineering with just curl and Bash.

Convergence on a single spec

Whether it be a foundation model hosted by one of the giants in the AI field, or one of the GPU clouds offering inference on popular open source models, everyone has converged on a single API spec: the OpenAI spec. This makes it easy to re-use the code scaffolding across providers and models. In addition, there are model inference projects that offer the OpenAI spec for local inference (such as ollama).
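
Here is a minimal sketch of what that re-use looks like, using only the Python standard library. It builds, but does not send, a chat completions request in the OpenAI format, so swapping providers is just a different base URL and key. The helper name build_chat_request, the model names, the example.com host, and the local ollama URL are assumptions for illustration, so check your provider's documentation for the real values.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completions request without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Same scaffolding, two providers: only the base URL, key, and model name change.
# Both base URLs below are assumptions made for this example.
local = build_chat_request("http://localhost:11434/v1", "unused", "mistral", "Say hello.")
hosted = build_chat_request("https://api.example.com/v1", "sk-placeholder", "mistral-7b-instruct", "Say hello.")

print(local.full_url)
print(hosted.full_url)
```

Passing either request to urllib.request.urlopen would send it. I left that out so the sketch runs without a server or an API key.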

Structured Prompting is just JSON (Or YAML or XML)

Some libraries offer structured prompting in one of two ways. They either do constrained decoding, which requires access to the model's logits (and its tokenizer!), or they simply specify a JSON schema in the prompt. The latter is very simple and universal, and most decently sized, powerful models can output valid, schema-following JSON with just some handcrafted prompting. This means that any language with some kind of support for reading JSON can be used to parse the model's output, which is pretty much every language. In addition, language models can also be finetuned to simply output valid JSON when given a schema, as the recent JSON mode from OpenAI, and the inference platforms following it (Together and Anyscale), has shown.
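
The schema-in-the-prompt approach needs very little code. The sketch below is one way it might look, with a made-up schema, a made-up prompt wording, and a simulated model reply in place of a real API call. It checks only that the reply parses and contains the required keys, which is far less than full JSON Schema validation.

```python
import json

# Hypothetical schema for a word-problem answer, used only for this illustration
SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "answer": {"type": "number"},
    },
    "required": ["reasoning", "answer"],
}


def build_prompt(question: str) -> str:
    """Embed the schema in the prompt and ask for JSON only."""
    return (
        "Answer the question. Respond with only a JSON object that follows this JSON schema:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n\n"
        f"Question: {question}"
    )


def parse_reply(reply: str) -> dict:
    """Parse the model's reply and check the required keys; raise ValueError on bad output."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"reply is not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("reply is valid JSON but not an object")
    missing = [key for key in SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


# Simulated model output; a real run would get this string back from the inference API
simulated_reply = '{"reasoning": "3 apples plus 4 apples is 7 apples.", "answer": 7}'
parsed = parse_reply(simulated_reply)
print(parsed["answer"])
```

When parse_reply raises, a common follow-up is to retry the request, optionally feeding the error message back to the model.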

Go for AI (for prompting)

Despite prompt engineering inherently being programming language agnostic, I want to make the case that everyone is using the wrong language for it. Golang is built for this type of workload, where there is minimal compute but lots of IO-bound operations and concurrency juggling to do. I will introduce what Go is, and the specific features of Go that will help with prompt engineering.

Meet Go

For those that don't know, here is a brief introduction. Go is a garbage-collected language with a simple, C-like syntax and an M:N threading model (otherwise known as green threads). The garbage collection means you don't have to worry about manual memory management or memory safety, being a typed, compiled language means it's pretty fast, and best of all, the M:N threading model means that you can run lots of IO-bound concurrent operations with minimal overhead. There is no async function coloring, there is built-in syntactic sugar for best-practice concurrency (channels), and most of the day-to-day functionality for building server apps is part of the standard library. All of this leads me to believe that the prompting language of choice should be Go.

Syntax

While the syntax of a programming language is not directly helpful to prompt engineering, it is helpful if the language is readable and simple to write. Golang syntax is pretty damn simple, modeled after C. One of the core tenets of the language is that code is read way more often than it is written, so the creators made sure that the syntax is basic enough that even a junior programmer who knows a few mainstream languages can pick it up very fast.

Templates

Golang has a built-in text template library similar in functionality to Jinja templates in Python. Templates provide a powerful and flexible way to generate text output, making them particularly useful for prompt engineering. Golang templates can insert values and use control structures such as if conditions and range loops, so prompts can be generated dynamically based on user input or context.

Built-in testing framework

This is one of my favorite things about Golang. The language has a built-in testing framework that supports parallel tests, table tests, and even benchmarking. You can even provide command line arguments to your tests to change the behavior on the fly. With table tests, we can create different versions of prompts we want to test while reusing the evaluation test code. You can also provide flags to run only a subset of tests.

Concurrency

A lot of prompt engineering is rewriting prompts and testing them against some evaluation or benchmark. Ideally, we want the feedback loop to be as fast as possible, which means that we want to run evaluations as fast as possible. With Golang, we can use channels and goroutines to run the evals as fast as our eval endpoint will allow. Goroutines are green threads, scheduled with what is otherwise known as the M:N threading model. This allows for many more "virtual" threads than actual CPU cores. Since goroutines are managed by the Golang runtime, creating, destroying, and context switching between them is extremely cheap. The runtime also handles smart context switching with a work-stealing scheduler and epoll and kqueue integration. This means that Golang is uniquely equipped to handle IO-bound workloads while utilizing all of the cores in a machine.

Fully featured OpenAI client

Of course, without an OpenAI-compatible client, none of the above would really matter. The library supports everything, from chat completions to speech endpoints to image generation endpoints.

Conclusion

Hopefully I've convinced you to try out Golang for your prompt engineering experiments. There are a lot of benefits to using the language for prompt engineering, especially since it can be used for both experiments and server-side production applications.

https://thespblog.net/001-go-into-ai-with-go/