
N=1 (marcua’s blog)


I'm the Co-Founder and CTO of B12. Before that, I was Director of Data at Locu, a startup that was acquired by GoDaddy. I went to grad school at MIT's CSAIL. One time in Jerusalem I ate a whole tub of hummus on my own. I don't regret that day.

Figmimic
A bookmarklet to copy any webpage into Figma as editable layers

I’m excited to share Figmimic, a bookmarklet that lets you copy any website or webapp into Figma in an editable state. The pasted result preserves nesting in Figma’s native format. If you’ve ever wanted to edit a design of an experience that’s already shipped but hasn’t been wireframed before, Figmimic can get you started. Here’s a video of Figmimic in action:

How to use Figmimic: install the bookmarklet, click the bookmarklet to copy any page, and paste the copied result into Figma.

Most product teams have experienced this problem: someone ships a user interface, and then a designer wants to update it in Figma. In a pinch, the designer will copy/paste a screenshot of the existing experience into Figma and then graft vector edits on top of the static image. That’s fine for quickly communicating a design change, but the more you need to edit something in the screenshot, the deeper the hole you dig for yourself. Figmimic skips the screenshot: when you import a screen, it serializes the DOM to Figma’s native file format. You get the full hierarchy and structure of a Figma file, and can click on any element to edit it. It does this with high fidelity to the original design.

While powerful, the underlying library isn’t perfect. I’ve seen it mess up some visibility rules and make some elements not as wide/tall as the original, but it’s been relatively easy to tweak the design to get it back to its original state. I haven’t tried using it to capture flows, interaction states, or animations, so I don’t know how well it would perform in those situations.

Figmimic is less a creation than a discovery. Meredith and I were playing around with Figma’s MCP tooling to figure out how B12’s design process could benefit from AI assistance. One “AI” feature we were excited about was the ability to point Figma’s MCP at any app or element in a browser and bring it into Figma. As we tried to use that feature, the MCP instructed us to paste/inject some code into the browser console in order to give it access to the DOM in that window.

That got us wondering if the code could be injected into any page even without the MCP being set up for it. To our delight, it could! When the MCP doesn’t initiate the code’s injection, the code falls back to copying the screen/element to the clipboard, which is perfect for then pasting into a Figma board. In the same Claude Code session that was directing us to inject the code, we asked Claude to create a bookmarklet for injecting into any page. We then asked it to make a webpage that explained how to install the bookmarklet, and a few edits later, Figmimic was born!
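For readers unfamiliar with bookmarklets, here is a minimal sketch of the general mechanism: a bookmark whose address is a javascript: URL that appends a script tag to whatever page is open. This is not Figmimic’s actual code, and the script URL in the usage note is a made-up placeholder rather than Figma’s real capture script.

```python
import json
from urllib.parse import quote


def build_bookmarklet(script_url: str) -> str:
    """Return a javascript: URL that injects script_url into the current page."""
    # json.dumps produces a double-quoted, escaped string that is also a
    # valid JavaScript string literal.
    js = (
        "(function(){"
        "var s=document.createElement('script');"
        f"s.src={json.dumps(script_url)};"
        "document.head.appendChild(s);"
        "})();"
    )
    # Percent-encode the JavaScript so it survives being stored as a URL.
    return "javascript:" + quote(js, safe="")
```

Calling build_bookmarklet("https://example.com/capture.js") (a hypothetical URL) returns a string you could save as a bookmark’s address; clicking that bookmark would load the script into the page in front of you.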

I’m not sure why Figma chose to hide this feature in their MCP. The feature doesn’t seem to use AI to translate the browser’s DOM into the Figma data model. As Figmimic shows, Figma’s code can be used on its own for a fast copy/paste from the web. There’s a broader discussion in the design/engineering world around the challenge of keeping wireframes and implementations in sync and whether code is the source of truth. In that context, I’m curious why Figma hasn’t exposed this feature more broadly, as it’s useful as a standalone experience. Newer tools will make it easier for designers to prototype new experiences directly in a codebase, but at the moment, there’s still a design/engineering tooling chasm. As long as we expect designers to do their work outside of the code, we should build bridges to help keep design and engineering artifacts in sync.

https://blog.marcua.net/2026/05/05/figmimic
Northeastern presentation to junior engineers in the age of AI
An AI agent revolution is afoot: How can the next generation of software engineers thrive?

Late last year, my grad school colleague Chen-Hsiang (Jones) Yu invited me to give a talk to students at Northeastern University in his Software Engineering classes. I’ve posted the full slides of the talk here. If I could pick only one slide to represent the talk, it’s this one:

A timeline of the professional experiences a junior engineer used to develop over several years (learning the ropes, reflection, task definition, and reviewing plans & code you didn't write). The last two stages are writing up tasks for others to complete and reviewing plans and code you didn't write. Skills that used to take years to hone are now required in your first moments with a coding agent.

That slide speaks most concisely to how several years of a junior engineer’s professional growth are being compressed. Previously a company would wait years before asking new engineers to meaningfully define work that someone else would complete, and even longer to start reviewing plans and code they didn’t write. Those two skills are now being emphasized seconds and minutes into an engineer’s career, which changes how we should think about mentorship and growth for junior engineers.

The talk covers how software engineering is changing with a focus on junior engineer mentorship, and offers some ideas for getting hands-on experience in task definition and code/plan review. I wrote about this in mentoring junior engineers in the age of coding agents. In the last section of the talk, I covered durable skills that I believe will remain valuable even while agents write more of the code. I wrote about that topic in four questions agents can’t answer.

A huge thank you to Jones and his students for the invitation and their great questions. It’s clear many students in the room had a strong sense of the big change happening in the industry they were about to join, and the discussion after the talk reflected their curiosity.

https://blog.marcua.net/2026/04/08/ai-agent-revolution-junior-software-engineers
B12 3.0
A decade of helping customers build their home online

Wow! We’ve been building B12 for a little more than 10 years now. Today, we announced B12 3.0. As the name suggests, 3.0 represents the third major generation of our approach to helping our customers build their web presence:

  • B12 1.0 (2015-2020) was all about having experts do it for you. When we first started B12, we built Orchestra to help assemble teams of designers, copywriters, and project managers in minutes. Using previous generations of AI models, we supercharged a traditional design agency model and provided experts with power tools to make their work more effective.
  • B12 2.0 (2020-2025) introduced AI-scaffolded starting points. As large language and image generation models got increasingly powerful, we moved to a world where you could describe what you wanted to a model. The model would pick a respectable starting point for a website and then fill it in with the copy and imagery that made sense for your business. As customers became more comfortable with AI tooling, we saw a big change in their preferences: In January 2023, 90% of B12’s customers requested expert help after an AI generated their website. By January 2024, 90% of our customers looked at their AI-generated website and decided they could launch on their own.
  • B12 3.0 (2026 and beyond) offers fully customized AI-generated websites. The “AI fills in the blanks” of B12 2.0 was too rigid in the structure, style, and functionality of websites we could offer customers. With 3.0, a coding agent can fully customize every detail of a customer’s project. With B12 3.0, we’ve seen people build online stores, web apps, business tools, blogs, portfolios, and landing pages. Our customers’ freedom to create what they envision is so gratifying.

B12 3.0 was a big gamble. We prototyped it during a hackathon at the end of October 2025, and by February 2026 we were A/B testing it with customers. Could a product we built in 3 months really outperform an experience we spent a decade refining? Would customers put up with the added latency of a coding agent? What bugs or infrastructure quirks lurked?

Our A/B test results put our anxiety to rest: when we randomized reveals for the faster and more established 2.0 against the more flexible and higher-quality upstart 3.0, the purchase rate of B12 3.0 was nearly 2x that of B12 2.0. Despite many opportunities for improvement, customers already see B12 3.0 as our best experience yet.

When we make launch announcements, I typically list the people involved as a sign of my appreciation. I’m breaking with tradition because this release is so big that every B12er made it possible. Our product, customer success, and marketing teams were consumed with it for the past few months. Yes, 3.0 is a momentous thing, but its release was also a momentous process that involved so many people I’m grateful for.

https://blog.marcua.net/2026/03/12/b12-30
Four questions agents can’t answer
Software engineering after agents write the code

You’ve likely read countless words about how coding agents have massively changed software engineering. At the extreme, December 2025 was the turning point and we’re unlikely to write a line of code again. But amidst all this talk of change, it helps to understand what likely won’t.

I’m particularly interested in the questions agents can’t answer, doubly so if they are unlikely to answer them well in the coming years. Here are four such questions that blur the line between product management and software engineering. The questions reflect the fact that as coding agents cover more of the nitty-gritty of code generation, humans will be responsible for higher-level concerns. I think these questions will remain in the human domain for years to come:

  • What should we work on?
  • How much of it are we doing?
  • How do we do it well?
  • What’s getting in our way?

These questions are durable: we’ve encountered them since software engineering became a profession, and we’ll be responsible for them as long as we’re responsible for the products and systems we share with the world. The answers to these questions require judgment based on goals and constraints that are organization-specific, dynamic, and don’t live in some task management or issue tracking system you can integrate with. It’s hard to imagine an agent successfully synthesizing these disparate and context-specific inputs into answers.

What should we work on?

Companies tend to work at the intersection of what their users find valuable, what the business finds valuable, and what the employees/leaders can agree on. While you might use agents in the process of understanding these needs and coming to consensus, agents are unlikely to “internalize” whatever lessons there are to be learned and turn them into a cohesive roadmap any time soon.

Understanding what users need involves qualitative and quantitative skills that rely on information outside of an agent’s context. On the qualitative side, conducting interviews, observing people using your product, and interacting with your support or sales team all help you find insights. On the quantitative side, building funnels and measuring how segments of users move through your experience can help point your qualitative research in a fruitful direction. Beyond the user, learning how to speak with “business people,” either casually or through stakeholder interviews, can help you find problems worth solving. In those conversations, translating terms like conversion and retention helps you identify where your organization thinks the biggest opportunities are. Agents might help you translate, collect, and even synthesize some of this data, but turning those items into a short list of cohesive and ranked jobs to be done is both valuable and not something agents do well today.

A roadmap isn’t just a summary of user and business needs. It’s the result of your team’s interpretation of those needs balanced with the realities of shipping an improvement. “Do users really value versioning functionality? Can we think of a way to implement it in under a month? What if we could address six onboarding issues in that amount of time?” Exercising the soft skills to argue that your team should work on a solution to a particular user need will help you answer something an AI can’t yet: what should the team do next? It involves creativity in identifying a workable solution, back-of-the-envelope estimation skills, and communication skills to both explain the problem/solution and convince others it’s worth spending time on.

How much of it are we doing?

The “what should we work on?” roadmap question typically has you working on a vague area (e.g., “we’ll help users undo changes on their projects”) for a vague amount of time (“about a month”). To get past the vagueness, there’s a whole set of skills to develop around the nitty-gritty of what will be in and out of the project. An important skill to exercise early in any new project is defining success. It sounds like business-speak, but a simple way to think of it is as the answer to the question “what is the need/pain the user/organization feels now?” Once you’ve got the answer to that question, you can then ask “what is the best metric we have for measuring how much need/pain we’re addressing?”

With your idea of success in mind, you have to define scope. Will your undo functionality be snapshot-based or will you allow fine-grained action-by-action undo? Will the undo log persist across sessions or just in memory? What will users be able to undo, and for how long? You’ve got a month to ship this version, so how do you decide what should go into that month? Hopefully you can lean on some of those user interviews from the previous section to inform your opinion.

To really define scope, you have to practice a more challenging skill: saying no. Your scoping will have you saying yes to snapshot-based undo and not fine-grained action-by-action undo, but some team member is going to remind you of the users that asked for the thing you’re not working on. I like to use various versions of the phrase “in the fullness of time, we’d like to implement it all, but this is the first feature of hopefully many in this area that we’re rolling out” as a nice way of acknowledging other good ideas while also identifying that those ideas might come down the line without committing to shipping them now.

It’s hard to imagine agents meaningfully contributing to scoping work. Agents might make it possible to ship more code than you used to, but they don’t define the scope of the project in which you’ll use them. Since you’re ultimately accountable for the product or feature you’re releasing, you’re also responsible for the “that won’t make the cut” discussion.

How do we do it well?

As the scope of the project takes form and you get into execution mode, AI agents start to shine (they can write the code!), but there are many steps that are firmly in your control, and your responsibility, to ensure the work is done well. Before the code is written and while you review early prototypes, you’ll be making architectural decisions. After the code is written, there are many things you can do to ensure it ships with a high level of quality.

Architectural decisions are based not only on the tools at your disposal, but also on practical concerns that the agent doesn’t have access to. How many people will be using the first release? Are you expecting to use existing infrastructure or stand up something new? What’s the best way to organize the code? Where do you expect it to be extensible and where would extensibility be overengineering? What sorts of security assumptions are you making? The answers to questions like these form context you’ll use to steer the agent before it starts, while it’s exploring the space, and as you review its code.

Once the code is written, it’s also on you to ensure its quality. Ideally, you or the agent write automated tests, but at a minimum you need some process for playing with/QAing the software and iterating with the agent until you’ve ironed out a lot of the issues. In a professional setting, you’ll be held accountable for the quality of the work you generate with agents. It’s hard for me to understand how, with the current generation of coding agents, one can take this responsibility on without reviewing the code the agents write. Reading the code and giving the agents feedback is the only way I know to reason about the solution, identify architectural gaps, look for ways to increase testing coverage, and think through edge cases.

What’s getting in our way?

Periodically taking your head out of the terminal to ask “what’s taking longer than we expected it to?” and then identifying the root cause is hard, and it’s impactful when you can answer the question. It’s also the area where agents will have the least context, so it’s likely to stay in the human domain for quite a while.

Figuring out that the team isn’t performing as effectively as it can is a skill in itself. What sort of baseline do you have for what “productive” looks like on this team or similar ones you’ve worked on? Is the team missing deadlines? Is it not getting to a minimum shippable contribution in days to weeks? Is the team generally productive, but one project or portion of a project seems to be dragging for a while?

If you can determine that the team isn’t performing as effectively as it can, identifying why is a new challenge. To start, even talking about progress with your collaborators is often illuminating. You have to be tactful and respectful, but calmly voicing that you think there’s a problem usually results in nods of agreement and looks of relief from your team. If you feel it, your collaborators likely feel it too, and might be able to chime in with their own interpretation of what’s going on. It’s important to separate problems from solutions here, both to ensure you’re all actually talking about the same problem and to ensure that your solutions aren’t getting ahead of the problem.

Once you’ve socialized the fact that there’s a problem, getting the team toward a solution is a third skill to practice. Can the problem be addressed by a change in process? By helping some subset of the team get some experience in a technique or technology they don’t feel as comfortable with? By paying down some tech debt that’s slowing the team down? By speaking with a few more collaborators to get their perspective? Once you identify a few ideas for improvement: is there a mini-project you can test your proposals on and fail fast in learning whether your proposal really does have an impact?

Identifying blockers and unblocking others are the areas in which I’m most skeptical agents will make a meaningful contribution. Agents will eventually have access to data in the form of threads, emails, transcripts, and issue trackers, which are the raw data behind brewing organizational issues. But additional context still lives outside of agents’ reach in the form of tones of voice, facial expressions, and new developments. And let’s say an agent identifies some potential inefficiency: How will it correctly interpret the signal in the context of how this team or team member communicates and operates? How will it socialize/confirm with others that there’s a problem? How will it raise the issue in a way that adapts to your organization’s norms and dynamics? How will it experiment with potential process improvements? As long as humans are bumping into problems in organizations, I suspect we’re going to have to solve those problems for ourselves.

Conclusion

The way we write software is changing under our feet. It makes sense to spend mental energy on learning the new tools and techniques that coding agents require. But it’s equally important to make sure the next generation continues to pick up experience and skills in answering the other questions that drive every software project and team. Beyond prompts and code review, how might we educate, mentor, and create spaces for junior engineers to address the questions we’ll still be answering even as our tools change?

If you’re an engineer who’s just starting your career, don’t wait for the olds to rethink their mentorship plans: start practicing now! Learn about some of the specific challenges that junior engineers face with AI-assisted coding workflows. Interview some users and write up what you learned. Turn that into a prioritized list of opportunities. Scope those opportunities into something shippable. Get an experienced colleague to poke holes in your architectural plan, and then have them give you feedback on your approach to testing. Then take a good hard look at your team, identify an area you think could be improved, ask some colleagues about it, and propose a change to the team’s approach on your next project. The agents aren’t going to!

https://blog.marcua.net/2026/02/25/four-questions-agents-cant-answer
Claude Code and core dumps
Finding the radio stream that hosed our servers

It’s after dinner on the holidays and you’re on call when a production system alarm starts to go off. After doing some initial investigation, you find that the issue is not a common one you or your team has encountered before, and there are few other collaborators around to turn to. Where do you even start? I was in this situation, and found coding agents to be surprisingly helpful for incident response!

I figured I’d document my over-the-holidays on-call experience in case anyone ends up in a similar spot. In short: a rare and nearly decade-old bug in an infrequently used endpoint was filling up our disks with core dump files. Claude Code helped me track down the root cause and iterated on a fix with me. I estimate it saved me hours to days of work given my unfamiliarity with core dumps and the specific API trick to prevent the issue.

Let’s start with the story of how the incident and investigation unfolded, and then cover higher-level takeaways and my initial impressions.

The story

For a few days leading into the incident, we’d gotten production alerts that our web servers’ disks were filling up, which is a pretty exceptional event. Not being sure if this was a one-off, we terminated the servers and spun up new ones (they are stateless/ephemeral, so why not?), but with time saw that this was a recurring issue. Realizing this wasn’t a one-off and finding several servers near full at the same time, I explored one of them, and it turned out that our Docker containers were slowly accumulating gigabyte-sized core dump files. I know core dumps are created when an application aborts, and have previously used strings to look for hints in them that might guide me toward a potential root cause, but didn’t know much about working with these files beyond that.

Before I could meaningfully inspect the files, I had to address my first problem: I needed to get the core dumps out of the production Docker container and onto a development machine with the utilities to further investigate it. I couldn’t copy the files around within the server since its disk was near full, so I turned to Claude. The agent offered a trick to remotely run docker cp to copy the file out of the container, pipe that to gzip, and send the bits over ssh before unzipping them on the development machine. By piping between docker, the host machine, and ssh, the files never had to be moved around on the machine itself, saving precious disk space:

for pid in <LIST OF IDs>; do
  echo "Copying core.$pid..."
  ssh user@machine "docker cp <CONTAINER ID>:/website/core.$pid - | gzip" | gunzip > core.$pid
done

Knowing little about core dump files, I asked Claude Code to try to figure out what the root cause was. With no other details (the core dumps were in a separate directory, so the agent initially didn’t even have access to our codebase), it ran file on the dumps to figure out what type of files they were (tarballs/archives of other files, apparently), untarred them, and started to explore by running commands like gdb and readelf. Through readelf, it was able to determine that the process had received some signal, but needed to dig more to determine which signal it was.
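A note on the tarball surprise: docker cp with - as the destination writes a tar archive to stdout, which is why file reported an archive. As a rough illustration of the kind of read-only check that file and readelf perform on the extracted contents, here is a minimal sketch that inspects an ELF header to see whether a file is marked as a core dump. It assumes the extracted files are standard Linux ELF cores; the function name is made up, and this is not one of the scripts Claude Code wrote.

```python
import struct

ELF_MAGIC = b"\x7fELF"
ET_CORE = 4  # e_type value the ELF format assigns to core files


def is_elf_core(header: bytes) -> bool:
    """Return True if the first bytes of a file look like an ELF core dump."""
    if len(header) < 18 or header[:4] != ELF_MAGIC:
        return False
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    byte_order = {1: "<", 2: ">"}.get(header[5])
    if byte_order is None:
        return False
    # e_type is the 2-byte field right after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type == ET_CORE
```

To use it, open a file in binary mode and pass the first 18 bytes: is_elf_core(f.read(18)). Nothing is written, which matches the “read-only, no side effects” bar used for reviewing the agent’s scripts.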

To find the signal, Claude Code asked for permission to run a script to parse the core dump further:

A screenshot of one of several scripts Claude Code wrote to extract details from the core dump, in this case to identify the specific signal details. Screenshot courtesy of Simon Willison’s excellent claude-code-transcripts.

Not knowing the specifics of the core dump file format, my approval/code review process was to read the code enough to know it contained solely read-only operations with no side effects. In particular, to prevent exfiltration, I was looking to ensure that the scripts weren’t accessing any external services. Basically, I was reviewing code for “is it only going to operate on the files locally and is it not going to modify those files?” After I approved a few of these scripts, Claude concluded that all 8 core dumps arose from a SIGABRT. It then spent ~4 minutes searching for various strings across the dumps, eventually striking gold in finding Sentry profiling data from around the time of the crash. It listed a few views related to this profiling data, and suggested that one of these views might be timing out.

I asked Claude what evidence it could find of the problematic view, and it spent quite a few minutes asking me to approve more scripts that I again reviewed mostly for side effects. Its first suggestion turned out to be the problematic one: a view that helps us parse RSS feeds for customers who embed them on their websites. We’ve had this view since the early years of B12, and it’s been reliably available for nearly a decade now. Ultimately, in a long series of greps, the agent searched for the URL path of the RSS feed view, and came back with evidence that 6 of the 8 core dumps had a very specific URL that was causing trouble for the RSS feed parser. In short, a user set a radio stream URL as their RSS feed. Our feed parser didn’t have timeouts set properly and wasn’t expecting a stream, so the endpoint just streamed the endless sounds of the radio waiting for the “RSS feed” to end. Eventually this timed out with the process receiving a SIGABRT, resulting in the core dumps:
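The cross-dump search can be pictured with a small sketch: scan each dump for URL-looking strings, then count how many dumps each URL shows up in, so a URL present in most of them stands out. This is a rough stand-in for the series of greps described above, not the agent’s actual commands; the function name and pattern are assumptions for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Runs of printable ASCII that start like a URL; a rough stand-in for
# piping `strings` into `grep`.
URL_PATTERN = re.compile(rb"https?://[\x21-\x7e]{4,200}")


def urls_by_dump_count(dump_paths):
    """Count, for each URL, how many of the given files contain it."""
    counts = Counter()
    for path in dump_paths:
        # Read-only access. Loading a whole file is fine for a sketch but
        # memory-hungry for gigabyte-sized dumps.
        data = Path(path).read_bytes()
        # A set ensures a URL counts once per dump, however often it repeats.
        found = {m.group().decode("ascii") for m in URL_PATTERN.finditer(data)}
        counts.update(found)
    return counts
```

Calling urls_by_dump_count(paths).most_common(5) would list the URLs shared by the most dumps, which is the shape of evidence that pointed at the radio stream.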

A redacted screenshot of Claude Code’s conclusion that a particular streaming URL was to blame. Screenshot courtesy of Simon Willison’s excellent claude-code-transcripts, redaction courtesy of Claude.

This type of bug is particularly hard to identify through our normal monitoring tools like Sentry or canonical logs. The request runs for a long time without error, so there’s no report of a stack trace or other issue to give you a hint. And when the request finally does get killed, it comes by way of the web server/proxy timing out and killing the process, so there’s no opportunity for a log line or exception to be written to disk or sent across the network, as the request is no longer running. The slow request reporting we have relies on the request coming to a close in order to determine its timing, so we simply don’t have a record of this type of slow request. In hindsight, we’re missing some recording/observability of the timeout itself, which we now know and can repair.
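One way to picture the missing observability is to record requests when they start rather than when they finish, so that a request that never finishes is still visible. The sketch below is an assumption about what such tracking could look like, not a description of the repair we made; the class and method names are hypothetical.

```python
import threading
import time
from contextlib import contextmanager


class InFlightRequests:
    """Track requests from the moment they start, not when they finish."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._started = {}  # request id -> (path, start time)

    @contextmanager
    def track(self, request_id, path):
        # Register the request before any of its work runs.
        with self._lock:
            self._started[request_id] = (path, self._clock())
        try:
            yield
        finally:
            # Forget the request once it ends, whether it succeeded or raised.
            with self._lock:
                self._started.pop(request_id, None)

    def running_longer_than(self, seconds):
        """Return (path, elapsed) for requests still running past the threshold."""
        now = self._clock()
        with self._lock:
            return [
                (path, now - start)
                for path, start in self._started.values()
                if now - start > seconds
            ]
```

A background thread could call running_longer_than with a threshold below the web server’s timeout and log the results, leaving a record before the process is killed.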

It’s less surprising that Claude Code could implement a fix to this issue, since writing code is its bread and butter. When I prompted it for a solution, the initial implementation was both overly complicated (too many try/excepts and branches for somewhat nonsensical conditions) and wouldn’t work in our particular codebase (it relied on httpx whereas we heavily utilize requests). After manually testing Claude’s simplified and requests-based solution, I realized it wasn’t as simple as adding a timeout to the request, since the problem wasn’t that the connection was timing out, just that it streamed forever. I iterated with Claude on a solution that would read small chunks of data from the URL, a few kilobytes at a time, and check the elapsed time in between those reads. This involved elements of the requests API that I hadn’t previously encountered. After two rounds of feedback, I had a version I was happy with and a fix on production:

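import time

import requests


def _fetch_feed(feed, timeout):
    """Fetch feed with a hard total timeout."""
    start = time.monotonic()
    # stream=True defers the body download so it can be read in chunks below.
    # The timeout passed here bounds connecting and each wait for data,
    # not the total download time.
    response = requests.get(
        feed,
        stream=True,
        timeout=timeout,
        headers={'User-Agent': 'B12 RSS Reader'},
    )
    response.raise_for_status()

    chunks = []
    for chunk in response.iter_content(chunk_size=8192):
        # Enforce the overall deadline between chunks so an endless stream is cut off.
        if time.monotonic() - start > timeout:
            response.close()
            raise requests.exceptions.Timeout('Total timeout exceeded')
        chunks.append(chunk)

    return b''.join(chunks)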
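The core of the fix is independent of requests: check a wall-clock deadline between chunks, because the timeout parameter in requests limits how long to wait for the connection and for each piece of data, not the whole download. The sketch below isolates that idea over any iterable of byte chunks. The names read_with_deadline and TotalTimeout are made up for illustration, and the injectable clock is only there to make the behavior easy to test.

```python
import time


class TotalTimeout(Exception):
    """Raised when reading takes longer than the overall deadline."""


def read_with_deadline(chunks, timeout, clock=time.monotonic):
    """Join chunks from an iterable, giving up once timeout seconds have passed."""
    start = clock()
    collected = []
    for chunk in chunks:
        # Checked once per chunk, so a source that never ends still gets cut off.
        if clock() - start > timeout:
            raise TotalTimeout("Total timeout exceeded")
        collected.append(chunk)
    return b"".join(collected)
```

With a real response, the iterable would be response.iter_content(chunk_size=8192). A source that stalls without sending anything is still left to the per-read timeout, since the deadline is only checked when a chunk arrives.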

The fix didn’t address the user’s underlying reference to a non-RSS feed, so I wanted to figure out which customer this was affecting. I asked Claude Code to help me find the customer, and it came back with SQL to query for the RSS feed integration that the customer set up on their website. I had to modify the SQL to have it run on our data warehouse, as Claude wrote the query for our production database. The query Claude provided was again overly complicated, so I removed a bunch of joins and ran it and…it found the customer! This was still a time-saving query for me: unlike data sets such as our funnel, I haven’t written queries to extract data about integrations in a long while (ever?), so having a starting point was a great help.

Having shipped the fix and sent a message out to our support team to contact the customer, I headed to bed earlier than I would have if I had to investigate the issue solo.

Some takeaways

Since using Claude Code in this incident, we’ve relied on it to pretty good effect during one subsequent incident, an esoteric one-off in how our hosting stack handles wildcard TLS certificates. The goal of this blog post wasn’t to tell you all about core dumps and certificates, though. I’m sharing these stories because they expanded my understanding of where we can utilize agents in our engineering processes. Here are a few broader thoughts to that end:

  • With an agent assisting me, I went from problem to solution-on-production in about 2.5 hours. Without the agent’s help, I’d have spent hours just learning to parse core dumps and might not have connected the dots to that specific URL. Futzing with files inside the Docker container on a full disk would likely take me 1+ hours to figure out, whereas with the agent I had a solution in minutes. I could see spending hours to read about the core dump file format and learn about command-line utilities, and even then I’m not convinced I would have found the right segment of memory or had the patience to look across all of the core dumps to spot the repeating URL. Even if I saw the URL, it wasn’t so obviously a “this is a stream” URL, but somehow the agent pattern-matched the URL as containing streaming content. My guess is after a few hours, I wouldn’t have connected the dots. Even if an oracle told me the problem, writing the working solution (~20 lines of code) would have been easy, but it would have taken maybe an hour or two to understand all of the interactions between the timeout parameter in requests (which I’m familiar with) and the stream kwarg and iter_content method (which I’m not familiar with).
  • I don’t know how things would have turned out if I was left to figure this out without an agentic assistant, but my strong suspicion is that I would have come up with a different and less specific solution. It was late at night during the holidays, and I had evidence I didn’t have as much experience in handling. Taking too long to figure out a root cause would have required manual babysitting servers during the investigation, and my bias would likely have been to apply duct tape. To cut corners, I might have even decided to delete some core dumps so I could docker cp others onto the host machine and then scp them off the machine, trading off less evidence for a faster start to inspecting the files. Forcing myself to think through the evidence in front of me and the lack of clarity on how to transport and meaningfully process the core dumps, I think my “solve the incident now” instinct would be to write a script that runs periodically, checks for core dumps, copies them somewhere like S3 and alerts us, and removes them from the container. This would have been workable in the sense that the servers would no longer have disk usage issues, but it would have kicked the can down the road without any real understanding of the root cause.
  • Despite all of the heavy lifting Claude Code did, I still felt like an active participant in the process. Claude Code doesn’t run on our production servers, so I was the one who explored directories until I found out that Docker volumes were taking up extra disk space, explored the container to find the large files, and then spun up Claude to ask how I might stream the core dumps through ssh and through Docker. When given a SQL query as a starting point, I simplified it and translated it into something that would work on our data warehouse. When shown an initial fix to the code, I iterated twice through a change of library and a change of approach to implementing the timeout. And finally, I communicated with our customer success team to reach out to the customer and packaged up a PR to have a fellow engineer review. So while it saved me time, I still felt very engaged in the problem.
  • Beyond all the anthropomorphizing you could do about Claude gaining an understanding of the problem at hand, my read of the transcript of individual tool calls afterward suggests that it found bits of evidence (a signal, a signal type, a profiling trace, an endpoint, a query parameter) and grepped its way between these bits of evidence to surface and then filter through new bits of evidence. There are several steps along this chain that I didn’t have the knowledge to complete, specifically in identifying that the program received a signal and in parsing the core dump to identify which signal. By virtue of Claude “knowing” how to parse through these blobs, it was able to jump from one string search to another to uncover the next element to parse.
  • While I now have core dumps on my “curious things to learn about” list, the incident was the wrong time to get too curious about them. Prior to this incident, I was a one-trick pony when it came to core dumps: run strings on them and see if you can find any hints. There are MANY strings in a gigabyte-sized dump of memory, so my trick didn’t quite help. I’ve used web search during previous incidents to learn about a file format or better understand how to debug something. But being able to say “hey Claude, can you get anything meaningful from these files while I collect other clues” allowed me to defer my curiosity to a less stressful moment.
  • While it was cool that the coding agent wrote Python programs to read program registers, investigate bytecode, and dig into what was in memory with a makeshift hex editor, I’m not sure those tricks gave it the clues it needed to identify the root cause. Ultimately the presence of some profiling traces that the program was meant to send to Sentry did the trick. That said, the game of whack-a-mole that the agent played is not unlike the investigation process I go through: when you start looking for clues, you often sift through a bunch of noise before getting to something meaningful. In a coding agent’s case, the noise can come in the form of programs it writes in seconds. One leg up we have as humans is that it’s more expensive for us to emit code, so we generate less noise in our investigations. But that also means we stumble across fewer signals (no pun intended) as we investigate.
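
The timeout interaction mentioned in the first bullet is easy to misread, so here is a minimal sketch of it. This is not the fix that shipped (the post doesn't show that code, and the final version went through a change of library). It assumes the well-documented behavior of requests: the timeout argument bounds the connection and each individual read, not the whole download, so a stream that keeps trickling bytes never trips it. The names read_with_deadline, DownloadTooSlow, and DownloadTooLarge are hypothetical, and the byte budget is an extra guard I added for illustration.

```python
import time
from typing import Callable, Iterable


class DownloadTooSlow(Exception):
    """Raised when the body is not fully read before the overall deadline."""


class DownloadTooLarge(Exception):
    """Raised when the body exceeds the byte budget."""


def read_with_deadline(
    chunks: Iterable[bytes],
    max_seconds: float,
    max_bytes: int,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Collect chunks until the iterable ends, enforcing total time and size limits."""
    deadline = clock() + max_seconds
    body = bytearray()
    for chunk in chunks:
        body.extend(chunk)
        if len(body) > max_bytes:
            raise DownloadTooLarge(f"more than {max_bytes} bytes received")
        # The deadline is only checked when a chunk arrives; a per-read
        # timeout on the request is still needed to catch a silent stall.
        if clock() > deadline:
            raise DownloadTooSlow(f"body not finished after {max_seconds} seconds")
    return bytes(body)


# Hypothetical usage with requests (not executed here):
#
#   import requests
#   with requests.get(url, stream=True, timeout=(5, 10)) as response:
#       response.raise_for_status()
#       body = read_with_deadline(
#           response.iter_content(chunk_size=8192),
#           max_seconds=30,
#           max_bytes=5_000_000,
#       )
```

With stream=True, the call returns once headers arrive, and the body is only pulled as iter_content is consumed. That is what makes it possible to give up partway through an endless stream instead of buffering it forever.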

The expansion from “assists with bugs and features” to “assists in incident investigations” makes me feel like “coding agent” is not a descriptive enough term for the types of use cases you can bring one of these agents onto. I hesitate to start giving a single tool like Claude Code multiple categories (e.g., “It’s a coding agent AND a site reliability engineering agent!”). Perhaps something that more broadly encompasses the role of the person calling out to the agent, like “software engineering agent” would help prime people in that role to bring it in to help in more aspects of their day-to-day? Of course that might be misleading in that it’s not a software engineer replacement, so I’m not sure how to best refer to these expanded assistive capabilities. Regardless, if you’re stuck during an incident, consider a coding agent!

https://blog.marcua.net/2026/01/28/claude-code-and-core-dumps
Review: a bookmarklet to generate coding agent-ready code reviews

I’m excited to share Review, a bookmarklet that makes it easier for you to code review AI coding agents. The bookmarklet turns all unresolved comments on a GitHub pull request into a Markdown-formatted text blob that you can paste into your coding agent of choice. Since a video is worth a thousand words, here’s the tool in action with Claude Code:

I’ve been using Claude Code to start all code explorations, bug fixes, and new features I’ve worked on in the past few months. As I used it on larger multi-hundred (and increasingly multi-thousand) line projects, I’ve found the need for a traditional code review interface. Since I do all of my other code reviews through GitHub, I wanted to replicate that experience for reviewing agents as well.

So I made Review! (Well, I asked Claude Code to make Review.) It’s a bookmarklet, so it has access to any pull request you can load in a browser. You leave comments (including multi-line comments and code edit suggestions) on that pull request. When you click the bookmarklet, you see all unresolved comments on the pull request formatted as Markdown references to the line(s) that the comments covered. Code edit suggestions are converted into code blocks that show each line prefixed with - or + for removal or addition, as appropriate. If there are comments you don’t want to send the agent, you can delete them before copying the rest.
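
To make that transformation concrete, here is a rough Python sketch of it. The real bookmarklet is JavaScript running in the browser, and its exact output format isn't reproduced here; the ReviewComment fields, the heading style, and the function names below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

FENCE = "`" * 3  # Markdown code fence, built this way to avoid nesting fences


@dataclass
class ReviewComment:
    """Hypothetical shape of one unresolved pull request comment."""
    path: str
    start_line: int
    end_line: int
    body: str
    original: Optional[List[str]] = None    # lines a suggestion would replace
    suggestion: Optional[List[str]] = None  # proposed replacement lines


def format_comment(comment: ReviewComment) -> str:
    """Render one comment as a Markdown section referencing its line(s)."""
    if comment.start_line == comment.end_line:
        location = f"{comment.path}:{comment.start_line}"
    else:
        location = f"{comment.path}:{comment.start_line}-{comment.end_line}"
    parts = [f"### {location}", "", comment.body]
    if comment.suggestion is not None:
        # Suggestions become a diff: removed lines get "-", added lines get "+".
        removed = [f"- {line}" for line in (comment.original or [])]
        added = [f"+ {line}" for line in comment.suggestion]
        parts += ["", f"{FENCE}diff", *removed, *added, FENCE]
    return "\n".join(parts)


def format_review(comments: List[ReviewComment]) -> str:
    """Join the comments you chose to keep into one pasteable blob."""
    return "\n\n".join(format_comment(comment) for comment in comments)
```

Deleting comments before copying then amounts to filtering the list passed to format_review.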

I’ve read about approaches that involved giving Claude access to GitHub comments via gh or another interface, but wanted a few things I couldn’t get through that approach. First, we do development at B12 on remote machines, and I felt uncomfortable putting GitHub credentials on a remote machine (the gh command doesn’t work via ssh key forwarding, for whatever reason). Second, it’s rare that every comment on a pull request is one I want an AI coding agent to tackle: I might leave a question or explanation for a co-worker, or a peer reviews my code and leaves comments that I then want to turn into more bite-sized coding instructions for the agent. Finally, on larger code reviews, I worry that sending tens of comments won’t result in good outcomes for the agent, and so I like to copy/paste subsets of the comments at a time and review those smaller diffs piece by piece. By being able to review and edit all Markdown comments before I copy/paste, it’s easier to control what gets sent to the LLM.

If you use Review or have suggestions, reach out! I’d be happy to add features if there’s anything that can help your review experience. Review is Apache 2-licensed and the code is available here.

https://blog.marcua.net/2025/12/08/review-bookmarklet-code-review-ai-agents
ayb v0.1.11: Now sporting a web interface!

Quick reminder: ayb makes it easy to create databases, share them with collaborators, and query them from anywhere.

I’m excited to announce v0.1.11 of ayb (full release notes), which now includes a web frontend bundled into the ayb server command line! Bundling a web interface for managing and querying databases is in line with my goal of expanding the set of people that can use ayb. Since a video of the experience says it all, here’s the new ayb in a minute:

It’s amazing what having a frontend wrapper around an API can do for a demo. In introducing a web interface, I haven’t added any new functionality to ayb, and the commands to do everything in the demo have existed in the documentation for a long time. But it’s one thing to explain that you can do something using the terminal, and it’s another to visually walk someone through what you mean with a user interface.

Beyond the web interface, this release includes a public_url to expose your instance behind a load balancer/CDN, a database_details endpoint/command line option to include some details about databases that the interface required, a way to run ayb without setting up a mail server for testing/a personal deployment, and a protected list of usernames to prevent scammy registration. And in a first experience for me, I’ve fully covered the frontend in Playwright tests.

Since I want ayb to be accessible from anywhere, it was important to me that the web frontend relies entirely on the existing ayb API without giving it special access to ayb data. The views that render the interface call the same endpoints that the command line interface does. I also wanted the interface to require no additional build / compilation steps in order to avoid adding to a user’s setup and operational load. To that end, the entire experience is rendered on the backend via views served from the same ayb server that serves the API, and the frontend flows are progressively and smoothly served using htmx. The implementation has minimal vanilla JavaScript, and the component library is all thanks to Franken UI, which also offers a subset of Tailwind. I’m happy with how things turned out: I didn’t have to compromise too much to work within the constraints of no frontend build and all rendering happening on the backend, and the user ends up getting a complete no-setup-required frontend packaged into their binary.
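
For readers who haven't used htmx, here is a small sketch of the backend-rendered pattern described above. It is written in Python for illustration only: ayb itself is not a Python project, and the route, element ids, and function names here are hypothetical. The one htmx-specific fact it relies on is that htmx sends an HX-Request: true header with its requests, which lets a single view return either a full page or just the fragment to swap in.

```python
from html import escape
from typing import List, Mapping


def render_database_list(databases: List[str]) -> str:
    """Render only the fragment that htmx swaps into the page."""
    items = "".join(f"<li>{escape(name)}</li>" for name in databases)
    return f'<ul id="databases">{items}</ul>'


def render_page(content: str) -> str:
    """Wrap a fragment in the full page for a normal browser navigation."""
    return (
        "<!doctype html><html><body>"
        '<button hx-get="/databases" hx-target="#databases" '
        'hx-swap="outerHTML">Refresh</button>'
        f"{content}</body></html>"
    )


def is_htmx_request(headers: Mapping[str, str]) -> bool:
    """HTTP header names are case-insensitive, so normalize before checking."""
    normalized = {key.lower(): value for key, value in headers.items()}
    return normalized.get("hx-request") == "true"


def handle_databases(headers: Mapping[str, str], databases: List[str]) -> str:
    """Same view, two shapes: a fragment for htmx, a full page otherwise."""
    fragment = render_database_list(databases)
    return fragment if is_htmx_request(headers) else render_page(fragment)
```

Because the server does all of the rendering, nothing here needs a bundler or compile step, which is the constraint the post cares about.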

Try it out and let me know how it works for you. Next up, I’m planning on hosting a version of ayb on a server that friends and curious colleagues can use. I got a pretty sweet domain name for it, and am excited to share it when it’s up!

https://blog.marcua.net/2025/09/27/ayb-v0.1.11-web-interface
Mentoring junior engineers in the age of coding agents

A decent amount of digital ink has been spilled about the future of software engineering in light of increasingly powerful coding agents. Will software engineers, especially junior ones, be able to find jobs in the future as coding agents become more powerful? I’m not a futurist, so I’ll make some observations informed by my own increasingly productive use of coding agents like Claude Code and Aider:

  • Today’s best coding agents still very much require a human operator,
  • To effectively use these coding agents in a professional context, you have to make some big changes to how you work, and
  • There are many elements of professional software engineering that coding agents don’t meaningfully enhance.

If you agree with these observations, then you agree that humans will play a critical role in software engineering in a professional context for years to come. And, as Camille Fournier explains, in a world where there are software engineers, we’ll see the spectrum of junior, mid-level, and senior engineers at any healthy company. So rather than worry about the existential risk to our field, let’s focus on continuing to build a healthy career pathway for engineers of all levels of seniority. I’m particularly interested in bolstering the beginning of a software engineer’s career: how should junior engineering mentorship change in light of coding agents?

Since these tools turn portions of the role from coding-heavy experiences to prompting-and-reviewing-heavy ones, it’s important for the mentorship model for junior engineers to more quickly introduce them to writing up high-level task descriptions and performing code review. This is pretty different from the junior engineering mentorship models I’ve experienced, which tend to delay the introduction of more complex task definition and code review skills. To support junior engineers who are learning these new tools, we’ll have to change that.

In this post, I’ll do three things:

  • Explain my (and B12’s) pre-AI mentorship model for junior engineers,
  • Identify task definition and code review as two skills we need to emphasize sooner in the mentorship model, and
  • Talk a little bit about how we might mentor for those skills sooner.

Junior engineering mentorship in the pre-AI era

To understand how we might improve mentorship for the era of coding agents, I’ll outline how we (still) think about the growth of engineers at B12. I make no claims that this is the best way to offer junior engineers mentorship, but the outline can help show where there are gaps for the coding agent era. We introduce junior engineers to the following list of skills/experiences in roughly chronological order, though we very much adapt it to the skills and experiences an engineer already has on joining the company:

  1. When an engineer joins with close to no industry software engineering experience, we give them bite-sized tasks that are planned and specified by a more senior engineer. These can be bugs or mini-features, but importantly the engineer’s focus is on turning a high-level and reasonably detailed description into code while they learn the mechanics of our tools, codebase, and workflow.
  2. As they become familiar with our engineering practices and more comfortable with our codebase, we increase the complexity of their tasks, and over the course of a few projects ramp them up to implementing multi-step technical specifications that a more senior engineer or product manager wrote. As we increase complexity, the engineer hits learning opportunities in either working on a task that’s not well-defined enough or in submitting work for review that’s not refined enough. We appropriately recommend that they write up some details and get feedback before they start an underspecified task, or suggest that they review their own code before submitting it for another person to review.
  3. As an engineer becomes comfortable writing up small tasks for themselves, we increase the scope of the tasks they have to describe until they are writing full multi-page plans for complex features. This work naturally transitions from “write a plan for yourself” to “write a plan that 2+ engineers can work on,” which comes with its own complexities. We’ve also increasingly experimented with spending some time building/exploring a prototype and writing up a less involved plan for production-hardening the prototype afterward. Nonetheless, the goal here is to get an engineer comfortable coming up with a plan for themself and others with decreasing amounts of guidance.
  4. As the scope of the engineer’s tasks grows, the scope of the code they understand grows as well, and we have the engineer review pull requests and technical specifications created by other engineers. This is most straightforward when another engineer is implementing a component of a project that the junior engineer planned, but as they get more experience reviewing, they can review another engineer’s work on projects and parts of the codebase with which they have less familiarity. To ease them into this, we start them off as “shadow reviewers,” where they review a pull request and then another engineer also reviews it. This gives the junior engineer comfort that they aren’t the only reviewer and allows the second, more senior engineer to give feedback on the review the junior engineer left.

The maturity model above is nonrigid, and serves more as a set of skills and levels of comfort we look for as an engineer’s projects increase in complexity and decrease in up-front specification.

Two skills we introduce too late for the AI-assisted era

The model above works well in helping engineers gain experience in thinking critically about their own work and the work of others. In the model, there are two skills that are purposefully introduced near the end of this stage of the junior engineer’s growth, often taking a few years to hone and strengthen: writing up tasks, and reviewing code.

Task definition

Depending on the complexity of the task, it takes years to feel comfortable taking a high-level description of a goal, decomposing it into several subtasks, and describing them in a way that someone other than yourself can complete them. Task decomposition, especially in the face of legacy code and migrations, is a practice that even senior engineers appreciate having some amount of thought partnership and coaching on. Figuring out how to provide enough context in a writeup to enable someone less familiar with the codebase than you to take on a task is yet another skill that takes some time to hone.

Generally, anything more than AI-powered tab completion requires you to pass a meaningful task description and context clues to an agent. The more detailed your description, the more likely it will meet more of your expectations in a few shots. This means that one of the last things we introduce junior engineers to in the traditional mentorship model is the very first thing the engineer needs to tackle when starting off with a coding agent.

One saving grace is that, assuming you have tests and version control, there’s a pretty quick feedback loop on bad task definitions. After a delay, the agent will produce a diff or PR, and if it’s not to your liking, you can either request modifications or start from scratch. So if writing up better and better task descriptions is something that takes practice, bad coding agent outputs are at least self-reinforcing and self-correcting. However, the definition of “not to your liking” leads directly to the next issue: good code/architecture/… review skills are neither self-correcting nor self-reinforcing.

Code review

While we encourage early career engineers to review their own code before submitting a pull request, we take our time on having them review the work of others. This is in part because we want them to have a lot of uninterrupted time in picking up the codebase and the mechanics of professional engineering practices, but also in part because reading someone else’s code (especially in codebases you aren’t as familiar with) is a skill of its own. When we eventually ask junior engineers to code review other engineers’ work, we find that the review initially tends to be shallower and focused on nitpicks, or that they ask questions about design decisions but don’t evaluate answers as critically as they should. Since another engineer is also reviewing the same PR initially, these make for excellent learning opportunities.

Whether or not you code review the output of an LLM feels definitionally the difference between “vibe coding” and “coding with a desire for a lasting and maintainable code base.” A bad task description will lead to a feature that doesn’t work, and you’ll try again. A bad or nonexistent code review will lead to bugs, architectural issues, untested code, and too many abstractions of the wrong kind/too few abstractions of the right kind. A lack of introspection and care with code review is a compounding problem: the more you vibe your way through the work of an AI, the more downstream debt and maintainability issues you’ll be taking on down the line.

Since today’s coding agents take a task definition and produce entire PRs or large portions of them, the junior-senior code review skillset gap is made clear the first time you use these tools. On day 0 of using a coding agent, the subsequent code review spans multiple files, test cases, architectural decisions, and refactors. Junior engineers therefore need experience and advice on performing these reviews way closer to day 0 than year 1 or 2.

What mentors might do about it

Junior engineers who use coding agents now need to be good at defining tasks they won’t be implementing and reviewing code they didn’t write against architectural decisions they at best hinted at. As senior engineers or managers, what can we do to help them pick up those skills more quickly? Here are a few ideas, with the caveat that I’m still working through these tools’ sharp edges for myself, and have had few opportunities to introduce them to junior engineers. This last section is the most speculative and least developed of the post, despite covering the most important topic for helping mentees ramp up while maintaining the health of your codebase. I’d love to hear from and amplify people who have mentored junior engineers in the era of coding agents.

  • Share some prompt/agent session transcripts with the team. Just like we can learn from reading someone else’s code, we can learn quite a bit by reading someone else’s prompts. Reading tips and tricks on how someone interacts with coding agents is helpful, and given how new the field and tools are, it’s likely broadly useful to the whole team. Having some lunch-and-learn sessions or wiki pages where people can share examples of prompts and commits that came from those prompts could help everyone, including junior engineers, improve their task definition experience.
  • Discuss common gotchas. It might be ephemeral, but I’ve seen the coding agents I use fail repeatedly in very specific ways. They repeat the same exact block of code throughout codebases they generate. They correctly cover the control flow in a test, but don’t create every assertion you asked for (or don’t write tests even though you prompted them to). They reference older versions of libraries/APIs given their training cutoff. They change the goal in the face of multiple test failures and declare success prematurely. Knowing these patterns of mistakes makes us all better reviewers, so introduce some way to share these patterns across the team, especially with less experienced reviewers. Perhaps it’s more lunch-and-learns or wiki articles, or perhaps you can name these issues when you spot them so your mentee learns to look out for them over time.
  • Double down on your own code reviews. Code review is controversial both as a quality assurance tool and as a mentorship tool. Since I find it helpful for both, I’m biased toward thinking it continues to be helpful in reviewing the work of an engineer and the agent they paired with. You only interviewed the engineer, so your extra set of eyeballs is even more useful in preventing an agent from introducing unideal code and architecture. Since any feedback on the code is also feedback on the engineer’s own review of the agent’s work, you’re inherently also helping reinforce what they should have caught earlier in their process.
  • Shadow code review sooner. Shadow review is the only tool in our traditional mentorship toolkit that allows a senior engineer to give feedback on how a junior engineer gave feedback. To the extent that you specifically want to strengthen the “giving feedback on code you didn’t write” muscle sooner, performing shadow reviews sooner allows you to observe and give metafeedback earlier in the junior engineer’s tenure. I’m not sure how necessary or beneficial this is if you’re already code reviewing the work of the engineer, but it’s certainly a more direct way to observe and discuss the skill.

Despite the speed at which the world of coding agents is moving, I’m pretty confident that the task definition/code review skills gap will remain relevant. The skills we previously delegated to senior team members are now required on day 0 of using a coding agent. If you’ve thought hard about mentoring junior team members through this change, I’d love to hear from you!

All mistakes are my own. A coding agent was working in the background on my side project, but an LLM hasn’t touched this blog post as of its publication :). Thank you to Hunter Knight for reading an early version of this post.

https://blog.marcua.net/2025/07/21/mentoring-junior-engineers-in-the-age-of-coding-agents
Rich: Enrich your CSVs with new columns

This week a fellow B12er was performing an ad-hoc data analysis. They had a spreadsheet with some data, and wanted to classify the rows in the spreadsheet by a few different criteria along which they would look for trends. For an engineer, this would have been a quick Python script wrapping some classifier (in our case, the OpenAI API), but there wasn’t an engineer available for the project. We looked at some third-party plugins for Google Sheets, but it wasn’t clear what sorts of guarantees they made around data privacy, and we didn’t feel comfortable installing them. So, with some help from OpenAI’s o3, I created Rich, an OpenAI-powered CSV data enricher that’s a fully client-side single-page application.

To start, you give Rich an OpenAI key (that’s then stored in localStorage for convenience), point it at a CSV file, and prompt it to add any number of columns to the CSV. It then calls the OpenAI API row by row and asks it to fill in those extra columns. You get a new CSV with those enriched columns. Here’s Rich in action:

[Animated GIF: a user uploads a CSV with three countries in it and adds population and capital columns, which are automatically filled in for the newly downloaded CSV.]

Most of the code (a self-contained HTML file with no external dependencies other than OpenAI) was emitted by o3 in an hourlong iterative session. At some point the model stopped streaming a portion of the file and I had to manually take over. There was one bug with the initial/simplistic logic for CSV import, but o3 took that feedback and provided a more robust implementation. There’s a lingering optimization opportunity to use something like Structured Outputs/JSON formatting of the responses to require only one call per row rather than one call per CSV cell, but that’s for another day.
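
Here is a rough sketch of the one-call-per-row shape that optimization would take. It is in Python, not the browser JavaScript that Rich actually uses, and it is not Rich's code. The enrich_row callable stands in for a model call that returns every requested column at once (for example, by asking for a JSON object); enrich_csv, fake_enricher, and the sample data are all hypothetical.

```python
import csv
import io
from typing import Callable, Dict, List

# Takes one input row plus the requested column names, and returns a value
# for each new column. In practice this would wrap a single model call.
Enricher = Callable[[Dict[str, str], List[str]], Dict[str, str]]


def enrich_csv(csv_text: str, new_columns: List[str], enrich_row: Enricher) -> str:
    """Return a new CSV with new_columns appended, calling enrich_row once per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + new_columns
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        values = enrich_row(row, new_columns)  # one call covers every new column
        for column in new_columns:
            row[column] = values.get(column, "")  # leave blank if not returned
        writer.writerow(row)
    return output.getvalue()


def fake_enricher(row: Dict[str, str], columns: List[str]) -> Dict[str, str]:
    """Stand-in for the model so the sketch runs without network access."""
    known = {"France": {"capital": "Paris"}, "Japan": {"capital": "Tokyo"}}
    return known.get(row["country"], {})
```

Swapping fake_enricher for a function that prompts a model and parses a structured response keeps the CSV handling unchanged while cutting the number of API calls from rows × columns to rows.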

The amount of boilerplate and nitty-gritty to get started makes this one of those projects I never would have undertaken without an AI assistant. I like that I was able to quickly create a standalone HTML file that respects users’ privacy by executing all of its business logic in the browser. I can point someone at Rich knowing I’ll never see any of the things they do with the data (OpenAI will, of course :)). Rich is Apache 2-licensed and the code is available here. If you use Rich or have suggestions, reach out!

https://blog.marcua.net/2025/05/22/rich-enrich-your-csvs-with-new-columns
ayb v0.1.10: Now with Docker images!

Quick reminder: ayb makes it easy to create databases, share them with collaborators, and query them from a web application or the command line. I’ve been working on it for a few years, and am newly pushing myself to more publicly discuss releases, so here goes!

I’m excited to share v0.1.10 of ayb, which, in addition to a few quality-of-life improvements, introduces Docker images to make it easy to try and deploy ayb.

While the full details on how to use ayb Docker images are in the documentation, here’s a high-level introduction. To pull the latest version of the image:

docker pull ghcr.io/marcua/ayb

You can then create an alias for convenience:

alias ayb="docker run --network host ghcr.io/marcua/ayb ayb"

To run the server, you’ll need to create an ayb.toml configuration file (see Running a server), create a data directory for the databases, and map the configuration and data directory as volumes when running the container. For example:

docker run -v $(pwd)/ayb.toml:/ayb.toml \
          -v $(pwd)/ayb_data:/ayb_data \
          -p 5433:5433 \
          ghcr.io/marcua/ayb \
          ayb server --config /ayb.toml

Then use the client as normal:

ayb client --url http://127.0.0.1:5433 register marcua you@example.com

That’s it! One command to run a server, and one to run a client!

In shipping Docker support, I got to play with two new-to-me features:

  1. So that a new image can be automatically built and pushed every time a new version is tagged, I created a GitHub Action that triggers on any new vX.Y.Z tag, builds the image, pushes it to the GitHub Docker image repository, and tags the image appropriately. The action’s code shows just how much tooling exists to make this simple.
  2. This was my first time using multi-stage builds in Docker (check out FROM ... AS builder in the Dockerfile) to first create an image with the dependencies to build the project, and then to create the second image with just the binaries users will need to run ayb. The first container with all of the build tooling takes up ~2.7GB, whereas the container with the binaries takes up only ~150MB, which makes for a way faster docker pull and is way kinder to users’ machines and bandwidth.

One annoying limitation is that at the moment, only linux-amd64 images are built due to some bugs I encountered in building linux-arm64. Reach out or leave a comment on that PR if you need a linux-arm64 image.

https://blog.marcua.net/2025/02/23/ayb-v0.1.10-docker