Zendesk Engineering - Medium

Bence A. Tóth Mar 16, 2026

What agriculture and car manufacturing can teach us about the use of generative AI in software engineering

It’s not often that an article about AI coding tools takes us back a few centuries. Come with me on a trip down memory lane to revisit a couple of lessons other industries learned long ago.

In the 1840s, the German chemist Justus von Liebig popularized an idea about plant growth, first articulated by agronomist Carl Sprengel in 1828. Plants need many things to grow: water, sunlight, nutrients, good soil. But their growth is not determined by the sum of those inputs. It is constrained by whichever one is in shortest supply.

This idea later became famous through a simple illustration known as Liebig’s barrel: a wooden barrel made of staves of uneven height. No matter how tall most of the staves are, the barrel can only hold water up to the height of the shortest one. This illustrated that the capacity of a system is determined by its tightest constraint.

Agriculture spent millennia bumping into this lesson. Farmers improved tools. They organized labor more efficiently. They expanded cultivated land. They experimented with irrigation, crop rotation, and manure. Yet crop yields repeatedly hit ceilings that seemed difficult to explain.

The invisible limit, in many cases, was nitrogen. Plants need nitrogen to grow, but they cannot directly use the nitrogen that makes up most of our atmosphere. Soil nitrogen is a scarce resource, and when fields were repeatedly cultivated, they often lost the fertility needed to sustain higher yields.

Then came one of the most consequential chemical breakthroughs of the twentieth century. In 1909, Fritz Haber, working with Robert Le Rossignol, demonstrated high‑pressure ammonia synthesis from nitrogen and hydrogen. Carl Bosch at Badische Anilin‑ und Soda‑Fabrik (BASF) then solved the engineering challenges required to scale it industrially. By 1913, the first large‑scale Haber–Bosch plant at Oppau was operating.

Synthetic fertilizer transformed agriculture. Crop yields rose dramatically.

But agriculture did not suddenly become unlimited. Instead, other limits became visible. Water availability. Other nutrients like phosphorus and potassium. Pest pressure. Soil degradation. Storage and transport.

Once nitrogen stopped being the bottleneck, something else took its place. Systems are governed by whichever constraint is currently narrowest.

Let’s fast-forward a century and jump to another continent. When people mention the Toyota Production System, they often frame it as a story about efficiency: eliminating waste, producing faster, optimizing manufacturing.

But Toyota’s real insight was not about speed. It was about flow, and how optimizing for speed alone can make systems worse.

During the 1950s and 1960s, engineers like Taiichi Ohno and leaders like Eiji Toyoda developed a production philosophy that challenged a very common industrial instinct: the belief that every station in a factory should run as fast as possible. That instinct sounds sensible. In practice, it often harms the system.

If one workstation produces faster than the next workstation can absorb, the extra output does not become progress. It becomes inventory. Inventory piles up between steps, consumes space, hides defects, and delays feedback. It creates the illusion of productivity while actually making the system less responsive.

Toyota addressed this with a pull system called kanban (if that term sounds familiar, it’s not a coincidence). Each container of parts carried a kanban card. When a downstream station used the last item, it sent the card upstream as a signal to replenish only that amount. No signal meant no production. Limiting the number of cards in circulation set explicit work‑in‑process limits, shortened queues, and surfaced problems faster.

The goal was not to maximize activity at each step. The goal was to optimize throughput, the flow of work through the entire system.

Agriculture learned that systems move at the pace of their narrowest constraint. Car manufacturing learned that speeding up one part of a system can make the whole thing worse if the rest of the system cannot keep up.

The software industry is now learning both lessons at once.

For decades, the bottleneck really was writing code

For most of software’s history, writing code was genuinely expensive. It required specialized expertise, sustained concentration, and a large amount of manual effort.

The first big gains came from abstraction in the 1950s. Higher-level languages let engineers describe what they wanted without hand-encoding every machine instruction, and compilers automated the translation. That didn’t just make development faster, it made whole classes of work possible for more people.

As the field matured over the following decades, tooling kept removing mechanical friction. Editors, debuggers, and build systems improved, while languages became more expressive and idiomatic. Libraries and frameworks steadily absorbed common patterns so teams didn’t have to reinvent the same solutions over and over.

Then came another leap: in the early 2000s we introduced tools that made codebases easier to navigate and less risky to modify. IDE features like symbol search, automated refactoring, and static analysis reduced the cost of understanding code and increased confidence when changing it.

Coordination costs dropped, too. Version control evolved into a core part of how teams work, making branching, reviewing, merging, and collaboration far less painful than earlier approaches.

Finally, the industry invested heavily in lowering the cost of verification and delivery. Automated tests, continuous integration, deployment automation, and observability tightened feedback loops and reduced the fear of shipping.

Looked at historically, the pattern is clear. Software engineering has spent the past roughly seventy years trying to make it cheaper to produce and change code. And those efforts were not misguided. They were attacking a real constraint. And it worked. Software became dramatically easier to build than it had been in the early decades of computing.

But even after all that progress, writing and changing code still remained the dominant bottleneck in the software industry.

Homage to Margaret Hamilton’s famous photo with the Apollo guidance software’s code, reimagined with an android.

That is, until generative AI tools walked in the door. They don’t just continue the curve a little further, they finally bend it far enough to shift what limits progress.

For the first time in the history of software, producing code stopped being the narrowest constraint.

Engineers can now generate working drafts of migrations, tests, components, complex database queries, API clients, utility functions, documentation, and refactors in seconds. Boilerplate and exploration become increasingly cheap. The cost of trying multiple implementations drops dramatically.

Code becomes abundant. And when one constraint disappears, the next one quickly becomes visible.

The new bottleneck: absorption capacity

Absorption is what turns a diff into dependable value: deciding what to build, fitting it into the system, proving it behaves, and understanding the value it delivers.

Generative AI accelerates the production of implementation. It does not, by itself, accelerate the production of clarity.

Pull requests arrive faster than human-driven review can stay meaningful. Multiple versions of the same idea appear because generation is quick but alignment is not. Code looks correct within a small context while the architecture slowly degrades. Test coverage increases while the trust in it erodes. Experiments multiply while product decisions lag. And the codebase expands faster than shared understanding.

And more code starts to resemble Toyota’s worst kind of waste. If it can’t flow through the rest of the system, it’s not throughput. It’s inventory.

Motion is not progress. Progress is change that the organization can successfully absorb.

How we increase absorption capacity

The solution obviously isn’t to slow teams down. It’s to find new ways to let them move faster.

If generative AI makes production cheaper, leadership leverage shifts toward increasing absorption capacity and optimizing flow. In practice, that means redesigning our systems so that rapid generation gets converted into reliable value.

1) Make problem framing part of the work, not a preamble

When ambiguity can generate code at scale, clarity becomes a first-order production input.

That has an organizational implication: the work of writing problem statements and PRDs can’t live entirely on the product side of the wall anymore. Engineers are downstream of the ambiguity. Ironically, they’re also the ones who can turn a vague prompt into a plausible implementation that gets accepted before anyone notices the question was underspecified, and so ship a fantastic solution to the wrong problem.

Treat crisp problem framing, explicit acceptance criteria, and clear non-goals as a shared deliverable between product and engineering, not a handoff.

2) Lower the cost of confidence

If generation is cheap and verification is expensive, the highest-leverage investment is obvious: make verification cheaper.

This isn’t just “write more tests”. It’s building fast feedback loops that cover both correctness and outcomes. Strong CI signals that are impossible to ignore. Static analysis and type checks that catch entire classes of mistakes immediately. Security and dependency scans that run by default. Observability as a standard output of change, not an afterthought. Feature flags, staged rollouts, guardrail metrics. Product-facing feedback that closes the loop after deploy, and quick experiments that tell you whether customers are better off.

Reduce the cost of answering the only questions that matter: “is this safe to merge and ship?” and “did it actually improve anything for the customer?”

3) Treat architecture and conventions as scaffolding for AI

AI will scale whatever structures you already have. If your system is legible, it accelerates good patterns. If it’s ambiguous, it accelerates ambiguity — fast, and with a very convincing tone of voice.

The practical implication is that architecture can’t just be a diagram people rediscover during incidents. It has to exist as a set of operational constraints in the everyday developer workflow. The model should be able to infer the right shape of change from the codebase itself, and the humans should be able to verify that shape quickly.

Make the right thing easy to discover and hard to accidentally violate: clear service and module boundaries, consistent naming, and a small number of blessed ways to do common tasks. Back that up with templates and examples that are close to production.

Also, document what must remain true. Invariants, contracts, and “we do it this way because…” notes are exactly the context both humans and AI lack when they make locally reasonable changes that cause drift at a broader scale. Keep those decisions lightweight and available to your AI toolchain — short ADRs, code comments where the constraints live, and guardrails enforced in CI — so intent survives longer than a sprint and doesn’t depend on people’s memory. Your conventions keep the system steerable at higher speeds.

4) Measure throughput instead of output

When writing code becomes fast and cheap, output metrics become actively misleading. You may see them increase sharply while real throughput stays flat, or even declines.

Be especially suspicious of vanity metrics like lines of code, commit counts, PRs opened, story points burned, tickets started, or tokens used. In an AI-accelerated workflow, these are mostly measures of activity, and activity is exactly what becomes cheap.

Prefer measures that reflect absorption and flow: lead times from ideation to ready and from ready to shipped, PR queue times, change failure rates, rollback rates, and operational load such as the frequency of pages and incidents.
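As a rough illustration of measuring flow rather than activity, here is a minimal Python sketch that summarizes a batch of shipped changes. The Change record and its field names are assumptions made for this example, not a schema from any particular tool; in practice the timestamps would come from your ticketing and deployment systems.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Change:
    """One shipped change. All field names are illustrative assumptions."""
    ready_at: datetime     # work was ready to be picked up
    shipped_at: datetime   # change reached production
    caused_failure: bool   # change led to an incident or degraded service
    rolled_back: bool      # change was reverted after shipping


def flow_metrics(changes: list[Change]) -> dict[str, float]:
    """Summarize flow (not activity) for a batch of shipped changes."""
    if not changes:
        raise ValueError("need at least one shipped change")
    lead_hours = [
        (c.shipped_at - c.ready_at).total_seconds() / 3600 for c in changes
    ]
    return {
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": sum(c.caused_failure for c in changes) / len(changes),
        "rollback_rate": sum(c.rolled_back for c in changes) / len(changes),
    }
```

Note that nothing here counts lines, commits, or PRs opened: every number describes how changes moved through the system and how they behaved after shipping.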

Generative AI can make teams dramatically more capable. But the advantage doesn’t go to whoever generates the most code. It goes to whoever can turn abundant code into coherent systems and reliable delivery of value.

For many engineering leaders, that is the new challenge: not helping teams produce more code, but building an engineering culture that can absorb more meaningful change.

What happens when code becomes abundant was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/cc3239c4561b

How to lead with compassion without losing yourself

Bence A. Tóth Feb 25, 2026

The first time you realize you’re a safe leader, it feels like a victory.

People tell you the truth sooner. They admit mistakes instead of hiding them. They bring you the messy thing instead of the polished story. The team takes more calculated risks, and the work gets better.

If you care deeply about people — and if you’re reading this, you probably do — you start treating morale like something you must personally keep upright. You listen closely. You make space. You notice who’s quiet and gently draw them in. You hold the awkward silence long enough for inconvenient truths to emerge. You take emotional weather seriously because you know it affects the work.

You’re carrying a little more of everyone, every day. Not in some dramatic, crisis-soaked way. In the quiet, slowly accumulating way. The way that makes you tired in a way sleep won’t cure.

Tea pouring from a teapot into a cracked mug filled to the rim

And one day you realize: your empathy has become a dependency. You didn’t mean to become the load-bearing wall. But here you are.

Here’s the uncomfortable paradox: the more you value psychological safety, the more likely you are to injure yourself trying to provide it if you’re not careful. And if you don’t learn to be compassionate sustainably, you won’t just burn out. You’ll eventually become less safe to be around, because exhaustion makes you less generous, less regulated, and more reactive.

Safety doesn’t require constant emotional availability

In many enterprise environments, leaders inherit an unwritten job description we don’t talk about enough. You’re expected to deliver outcomes, translate strategy, manage stakeholders… and also function as an always-on emotional landing pad.

It often starts innocently: “My door is always open.” Then it turns into emotional overreach.

You begin to interpret every wobble in a teammate’s mood as a problem you should solve immediately. You mistake compassion for absorption.

The model doesn’t scale — across projects, across crises, across the full, complex lives your team carries into work. Psychological safety was never meant to be built on one person’s nervous system.

Caring genuinely without carrying it all

There’s a way of caring that steadies a room. And there’s a way of caring that drains you until you’re no longer steady.

When you say “I will feel what you feel, and I’ll carry it until it stops hurting,” you absorb with your empathy.

When you say “I can understand what you feel, and I can stay myself while I’m with you,” you are being truly compassionate.

This isn’t emotional distance. It’s emotional integrity. It’s the ability to be present without becoming porous — to honor someone’s experience without renting it space in your chest for the rest of the day.

It sounds like: “That makes sense.” It sounds like: “I’m listening.” It sounds like: “Tell me what part is hardest.” It also sounds like: “Let’s name what’s true, and then decide what we do next.”

A leader’s calm in a storm is not indifference. It’s a gift.

And it’s practical: when emotions run high, people’s thinking narrows. Research in affect and cognition consistently finds that stress and threat states reduce cognitive flexibility and working memory. Your steadiness widens the room again. Your calm is contagious.

What’s yours to hold, and what isn’t

Most compassionate leaders don’t burn out because they care.

They burn out because they confuse caring with responsibility.

Support means you help someone navigate reality. You make sense of impact. You remove obstacles you control. You advocate where it’s appropriate. You offer perspective, clarity, and sometimes protection.

Over-responsibility is a different posture entirely. It’s the quiet belief that if someone is distressed, you have failed. That you must make them feel better immediately. That disappointment is danger. That boundaries are betrayal.

Over-responsibility is how good leaders become exhausted leaders.

Here’s the memo I wish more managers were handed on day one:

You are responsible for your behavior and the environment you shape. You are not responsible for everyone’s emotional outcome.

That distinction isn’t cold. It’s ethical.

Because once you take responsibility for someone else’s feelings, you will start managing perception over truth. You’ll avoid clean feedback. You’ll over-promise. You’ll say yes when you mean no. And paradoxically, you’ll become less safe.

Psychological safety isn’t comfort. It’s being able to tell the truth, however hard it may be. This is where leadership gets misunderstood, especially by new managers who desperately want to do right by their people.

A safe team is not a team where nobody feels discomfort. A safe team is a team where discomfort doesn’t come with negative consequences. Where one can disagree without humiliation. Where one can deliver bad news without fear of retaliation. Where mistakes are examined without blame and turned into learnings. Where feedback is direct, and dignity is non-negotiable.

Compassion matters here, but not as cushioning. As conduct.

The most compassionate leaders I’ve known weren’t permissive. They were precise. They could tell you the hardest truths and still leave you with your self-respect.

Boundaries create safety

It can feel counterintuitive to say “no” while trying to build trust. But boundaries are how trust becomes dependable.

Without them, you become available until you suddenly aren’t. You absorb tension until it leaks out sideways — in shorter replies, thinner patience, decisions made with a sigh instead of a spine.

People can’t relax around unpredictability, even when it’s wrapped in kindness. A boundary is not a wall. It’s a line that keeps relationships healthy.

You don’t need a complicated system. You need clarity you can live with.

You can be warm and still be finite. You can say, “I want to give this the attention it deserves, let’s talk tomorrow when I’m fully here.” You can say, “I’m at capacity today; I can do Friday.” You can say, “I’m not the best place for this right now, but I do want to help you move forward.”

When leaders practice limits without apology or sharpness, they normalize the most basic form of psychological safety: the right to have needs. Boundaries are kindness that lasts.

Build a culture that holds when you’re not there

If you’re the only person who can soothe, validate, or translate emotions into something survivable, you haven’t built psychological safety. You’ve promoted yourself to a single point of failure.

And single points of failure will fail on schedule: at the exact moment you most need them not to.

A more durable approach is building a system of mutual accountability and shared ownership as the default way your team operates.

This shows up in small, almost boring practices. Questions asked in the open, so learning becomes communal. Decisions recorded, so clarity doesn’t depend on memory. Mistakes treated as data. Meetings run in a way that makes it easier for quieter people to enter. Norms that make it safe to say, “I don’t understand,” without performing incompetence.

When the team owns and carries it all together, your compassion stops being a personal resource that depletes. It becomes the foundation of the architecture. And culture becomes what survives your bad day.

And you get to be what you were trying to be all along: not the emotional engine of the group, but the steward of its climate.

The long game of compassion is the job

If you’re an empathetic leader, you’ll be tempted again and again to prove yourself by taking on more. More availability. More emotional labor. More invisible absorbing.

But your leadership must not turn into self-erasure.

The most compassionate thing you can offer your team, over the long run, is not endless access to your empathy. It’s your ability to show up again tomorrow as you did today.

Your team doesn’t need you to be a martyr with a calendar booked wall to wall with compassion. They need you able to hear hard things without disappearing into them. They need you present. Regulated. Consistent.

If you run yourself into the ground, the culture you worked so hard to build goes down with you.

How to lead with compassion without losing yourself was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/2834d29387f4

Heartbeats

Luke Stephenson Sep 8, 2025

Heartbeats: How Synthetic Traffic Keeps Us Running

Let me take you on a journey of how we came to use heartbeats in our application design. It’s a happy story of love and no broken hearts along the way.

What are heartbeats?

What my teams have called heartbeats are a form of synthetic traffic generated by the application itself. The deployed application periodically generates heartbeats at a defined schedule.

Heartbeats provide guaranteed regular traffic. In the cases where I’ve used them, they have been low volume, in contrast to application traffic, which can vary from zero to huge throughput depending on the cluster.

ChatGPT generated image from the intro above

Sounds simple, why do I need heartbeats?

Thanks for asking. Let’s go over some of the use cases that led to us introducing heartbeats.

Use case 1 — Escape

Escape is the name of the service we deploy at Zendesk to support transactional publishing to Kafka along with a write to MySQL. If an application team wants to update a database record AND publish to Kafka as a transaction, Escape is the way to do it. Alongside the record update, the application team writes the details of the message(s) that need to be published to Kafka to an additional table, and Escape does the rest. This unburdens developers from having to solve the complex problem of guaranteeing transactional consistency across 2 data stores.

My team is responsible for deploying and managing Escape. As part of that, we want to alert on things like:

High latency between the data being inserted and published to Kafka
The pipeline halting / messages not flowing through the pipeline

Monitoring for the pipeline halting is an interesting case. The application itself can’t be responsible for emitting a metric / triggering an alert that there is a problem with the pipeline being down. The root cause might be that the application is not even running!

Given the application can only emit that it is alive, we can alert when those metrics stop being emitted.

We use Datadog for monitoring, and can use a query like the following to trigger an alert:

sum(last_5m):sum:escape.success{} by {cluster}.as_count() < 1.0

This query will alert if over the last 5 minutes there have been no events handled by Escape successfully. Note that the by {cluster} means that if any database cluster is not seeing events handled, the alert will tell us which cluster is experiencing issues.
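To make the monitor's logic concrete, here is a minimal Python sketch of the same idea: count success events per cluster inside a window and flag the clusters that have none. It is not how Datadog evaluates monitors internally; the function name, the event tuples, and the known_clusters parameter are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def silent_clusters(
    success_events: list[tuple[str, datetime]],
    known_clusters: set[str],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
) -> set[str]:
    """Return the known clusters with no success event inside the window."""
    cutoff = now - window
    counts = Counter(
        cluster for cluster, ts in success_events if cutoff < ts <= now
    )
    # Mirrors "sum over the last 5 minutes, grouped by cluster, is below 1".
    return {cluster for cluster in known_clusters if counts[cluster] < 1}
```

Only clusters that are already known can be flagged, which matches the grouped behaviour of the query: a cluster has to have reported at least once before its silence can be noticed.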

So far, so good. What is the limitation of this approach? What if the customers on that cluster aren’t performing any updates? What if it is a staging database cluster that doesn’t have updates over the weekend when the devs are off skiing and surfing? That would trigger false alarms when everything is healthy.

To work around this, we generate heartbeats. For each cluster we are processing, we periodically insert a request to publish to Kafka. We send the messages to a heartbeats Kafka topic which nothing consumes, but doing so gives us full end to end confidence in the pipeline. Now we have solved our false alarms.

Not only have we solved our initial monitoring concern, we have also built an always-on smoke test for our functionality. If we deploy a version of our code that breaks the flow of data, we will receive an alert for it pretty quickly.
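The heartbeat generation described above can be sketched roughly as follows. This is not Escape's actual code or schema: the outbox table, its columns, and the heartbeats topic name are hypothetical, and sqlite3 stands in for MySQL only so the example is self-contained.

```python
import json
import sqlite3
from datetime import datetime, timezone

HEARTBEAT_TOPIC = "heartbeats"  # hypothetical topic name; nothing consumes it


def create_outbox(conn: sqlite3.Connection) -> None:
    """Create a hypothetical table of pending Kafka publish requests."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY, topic TEXT NOT NULL, "
        "payload TEXT NOT NULL, created_at TEXT NOT NULL)"
    )


def insert_heartbeat(conn: sqlite3.Connection, cluster: str, now: datetime) -> None:
    """Queue one synthetic publish request for the given cluster."""
    payload = json.dumps({"type": "heartbeat", "cluster": cluster})
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO outbox (topic, payload, created_at) VALUES (?, ?, ?)",
            (HEARTBEAT_TOPIC, payload, now.isoformat()),
        )


def emit_heartbeats(connections: dict[str, sqlite3.Connection]) -> None:
    """Insert one heartbeat per cluster; call this once per scheduler tick."""
    now = datetime.now(timezone.utc)
    for cluster, conn in connections.items():
        insert_heartbeat(conn, cluster, now)
```

Because the heartbeat takes the same path as a real publish request, a successful heartbeat exercises the whole pipeline from insert to Kafka, which is what makes it useful as an always-on smoke test.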

One limitation of this approach is that if there is an issue with the ingestion of metrics by Datadog, then an alert will trigger even though everything is healthy. Contrast this with a latency monitor: if metrics aren’t being ingested or there is a backlog of metrics to process, the latency monitor will not fire.

Also note that for a completely new database cluster, the monitor will not fire until there has been at least one successful metric emitted (the grouping by cluster needs to first be aware of all of the clusters). In practice we haven’t found this to be an issue.

Use case 2 — Account Moves

Behind the scenes at Zendesk, the data for a given customer account lives in one of our regions across the globe. We don’t want an account to be tied forever to the datacenter where it was originally created, so we have robust account move tooling which allows us to move an account to a new region with near-zero downtime.

The physical shifting of account data during a move generally has 2 phases, Bulk and Delta.

Bulk takes a snapshot of everything (e.g. mysqldump). Delta then consumes a change stream for the datastore to handle any updates that might have occurred to the account since bulk started.

While reading the change stream, our account move processes need to determine when they have caught up with it.

When there is data to process from the change stream, we can calculate exactly how long ago the event we just read occurred. But what if there is no data to read from the change stream? While the absence of data provides an indication that the process might have caught up, it doesn’t guarantee there isn’t an infrastructure issue preventing the flow of messages.

To gain confidence, we periodically insert messages into an independent table (for MySQL) or collection (for MongoDB). This guarantees a steady stream of updates that we can use to calculate change stream lag reliably.

Thanks to heartbeats, we can say with confidence that the change stream has been processed up to a particular point in time.
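A minimal sketch of how heartbeats make the caught-up check decidable, assuming a heartbeat row is written every heartbeat_interval. The function names and the tolerance parameter are illustrative assumptions, not the account move tooling's real interface.

```python
from datetime import datetime, timedelta
from typing import Optional


def change_stream_lag(last_event_at: datetime, now: datetime) -> timedelta:
    """Lag is the age of the newest event read, real update or heartbeat."""
    return now - last_event_at


def is_caught_up(
    last_event_at: Optional[datetime],
    now: datetime,
    heartbeat_interval: timedelta,
    tolerance: timedelta,
) -> bool:
    """Decide whether the reader has caught up with the change stream.

    With heartbeats written every heartbeat_interval, a healthy, caught-up
    reader always has a recent event to point at. Silence longer than the
    interval plus a tolerance therefore means lag or a broken pipeline,
    never simply a quiet datastore.
    """
    if last_event_at is None:
        return False  # nothing read yet, so nothing can be claimed
    return change_stream_lag(last_event_at, now) <= heartbeat_interval + tolerance
```

Without heartbeats, the None and "old event" cases would be ambiguous, because a quiet account and a stalled stream look identical from the reader's side.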

Summary

This is a small write-up of our use of heartbeats / synthetic events at Zendesk. They provide amazing observability insights for very little overhead. In some instances, we have even managed to obtain continuous testing and monitoring through the use of heartbeats.

Heartbeats are a simple idea / concept so I’m sure many others will be using similar patterns. And if you aren’t, hopefully this has given you some food for thought.

Thanks for reading!

Heartbeats was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/75cd476b7b04

Unleashing the Magic of Zendesk Datastore Management: Your One-Stop Self-Service Hub!

Jill Stephen Jun 30, 2025

You may be new to this series, and if so, welcome! I encourage you to start at the beginning of our datastore journey with the blog post “Unlocking Efficiency: A New Era for Datastore Provisioning”.

Already up to date in our series? MAGICAL — then let’s continue with a quick re-cap.

Where are we?

We have introduced you to a multitude of aspects all pertaining to how we make the provisioning and utilisation of datastores quick, no-fuss and simple — as simple as clicking your fingers or making a wish.

By continuing our theme of MAGIC, we enabled engineers at Zendesk to:

Make a wish (detail what datastore they want and how they want it through a few lines of .yaml)
Ensure the wish is realistic (as unfortunately we are in fact engineers, not genies)
Have the genie grant that wish (via a Kubernetes operator)
Share our wishes (by connecting our applications to our datastores)

Your wish (datastore) has been granted, but why stop the fun there? Wouldn’t you like to check up on that magic whenever you so desire and do additional cool stuff? Of course you would.

Introducing — Zendesk Datastore Magic (otherwise known as Zendesk Datastore Management — or ZDM).

Making the magic visible

Inside the magic lamp there has been a whole lotta magic happening — so as engineers we thought: wouldn’t our fellow engineers like to consume and see this magic anytime they want? We worked with our datastore owners to understand the basic needs for this visualisation:

End-users want to see details on the self-service datastore they have provisioned and own, regardless of type (e.g. Aurora, Dynamo, S3, doesn’t matter — all are visible in one spot)
They want to be able to track key elements such as utilisation and monitoring behaviour
They want to be able to “do-stuff” for their datastore with ease

To summarise the above into actions, we wanted to simplify the ability to perform datastore tasks from within Zendesk Datastore Management. So in an ideal world, our end users’ wishes would be to:

Confirm that their datastore has been provisioned correctly
See datastore health
Check up on backups
Integrate with Remote Incident Console (RIC)
Trigger a manual backup
Restore a backup
Link to monitoring capabilities such as Datadog logs
View estimates of how much the datastore is costing

Your wish is our command!

We knew we needed to reflect the beauty and magic of the wishes granted by the Genie while being easy to consume for us mere mortals just wanting to understand our wishes. In theory, we determined this as two “viewpoints” to satisfy the above requirements:

A list of all self-service datastores
Individual datastore and actions

To make this magic come to life, we developed a web application with a React frontend and a Golang backend. The backend interacts with the Kubernetes API and informers to query multiple Kubernetes clusters and maintain an in-memory state of the world. Why? Well, this allows us to completely reskin the app or make quality of life changes without changing the structure of the backend API.
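The "in-memory state of the world" pattern can be sketched as a cache that is updated by watch events and serves all reads from memory. The real backend is written in Go and uses Kubernetes informers; the Python below only mimics the shape of the pattern, and the Datastore fields, class name, and method names are assumptions made for illustration rather than ZDM's actual model.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Datastore:
    """Illustrative fields only; not ZDM's real data model."""
    cluster: str
    name: str
    kind: str  # e.g. "aurora", "dynamo", "s3"
    team: str


class DatastoreCache:
    """In-memory view kept current by add, update, and delete callbacks."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: dict[tuple[str, str], Datastore] = {}

    def upsert(self, ds: Datastore) -> None:
        """Handle an add or update event from a watch on one cluster."""
        with self._lock:
            self._items[(ds.cluster, ds.name)] = ds

    def delete(self, cluster: str, name: str) -> None:
        """Handle a delete event; unknown keys are ignored."""
        with self._lock:
            self._items.pop((cluster, name), None)

    def query(self, team: Optional[str] = None) -> list[Datastore]:
        """Serve reads from memory, optionally filtered by team."""
        with self._lock:
            items = list(self._items.values())
        return sorted(
            (d for d in items if team is None or d.team == team),
            key=lambda d: (d.cluster, d.name),
        )
```

Because reads never touch the Kubernetes API directly, the frontend can change how it slices and presents the data while the backend keeps exposing the same cached view.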

Introducing — real life Zendesk Datastore Management!

Welcome to our one-stop hub for all things self-service datastores! We brought our vision to life so users can navigate to the homepage and view all self-service datastores and key information straight off the bat:

From this one spot, engineers can:

View all self-service datastores
Filter by items such as team, type of datastore or the service it pertains to
See estimated monthly costs
See the pods the datastore has been deployed to and their health
See a better graphical representation of config, including reader node count

You will notice there is another cool element included here: metadata! We capture important elements defined in a datastore’s service.yml that convey the importance of the datastore, such as:

Is it customer data in this datastore?
Is it the primary source of data?
What kinds of things does it impact if unavailable (e.g. internal reporting, machine learning etc.)

This critical information helps us define the Tier of a datastore — making it clear to us how much trouble we are in if the datastore becomes unavailable, enabling efficient incident management triaging.

Digging deeper into datastore specifics

Users can then burrow further into the realm of their datastore by selecting the specific datastore, bringing them to our second viewpoint:

End users can now see their wishes come to life, and in real time! By embedding key Datadog information about the datastore in this viewpoint, anyone can quickly assess overall health, utilisation, and any spikes or anomalies that need further drill-down, enabling engineers to solve problems quickly and efficiently.

But wait… we didn’t just want to SEE magic right? We also wanted to be able to perform our own spells on our datastores where possible.

Engineers can perform their own magic with the click of a button!

Enabling datastore owners to be able to perform tasks from within the one-stop hub was one of our key requirements. This included the ability to trigger things such as backups, restorations and recovery steps.

We integrated ZDM directly with AWS, so engineers can use datastore functionality via ZDM. They can now carry out practical maintenance tasks on their datastores with the click of a button! That sounds pretty magical to me. The benefits?

Majorly reducing the time to bring up a healthy cluster, especially during an incident or disaster, and
IAM security by default: engineers do not need AWS console access to perform critical maintenance tasks, because this is managed at the application level via an IAM role in ZDM!

This works by ZDM essentially telling AWS to perform a certain task, e.g. make a backup, and displaying the results within ZDM. It also ensures that self-service is aware the backup has been created, which in turn makes the new “backup” cluster visible in ZDM (and usable in other parts of self-service).

Operation ‘keep our engineers happy’ — complete!

ZDM has proven indispensable to our engineers by providing a simple UI to navigate all their self-service datastores and perform necessary actions as and when they need it. By putting the power in the hands of the datastore owners via the self-service way, Zendeskians can feel in control of the datastores they create and manage!

And what else do we want to be in control of for our datastore — COST; next up in our series, we have ‘Optimising Cloud Cost Savings with Tiering and Sharing’. Stay tuned!

Unleashing the Magic of Zendesk Datastore Management: Your One-Stop Self-Service Hub! was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/90bdb8813a13


A Journey to Empowerment: How Zendesk Engineers Transformed Infrastructure Provisioning

Vishnupriya Varadarajan Jun 19, 2025


In the fast-paced world of engineering, the dream of easy infrastructure management and provisioning is a common aspiration. At Zendesk, this sentiment resonates deeply among our engineers. When we talk about infrastructure, we refer to a wide range of tools such as MySQL, S3, DynamoDB, Kafka topics, compute resources, network and routing configurations, security groups, secrets, credentials, configuration settings, dashboards, monitors, and log management.

Challenges with self-service provisioning

In our recent blog post, Unlocking Efficiency: A New Era for Datastore Provisioning, we introduced the concept of self-service infrastructure and explored the challenges engineers faced before this transformation. As we delved into our self-service journey, we recognized that there was no unified method for teams to provision their various self-service infrastructure components. This lack of a standardized process led to a multitude of inconsistencies, making maintenance and upgrades a significant headache.

We hadn’t quite reached our vision of engineers being able to provision their infrastructure with just a click of a button. In this blog post, we will introduce the missing component that connects the service.yml file and the self-service operators, which ultimately makes self-service infrastructure a reality.

Continuing the Genie analogy

In Simplifying Datastore Provisioning with Kubernetes Operators, we discussed the challenges of manual datastore provisioning and how our storage solutions have evolved to meet these challenges. In that article, we suggested a Datastore Genie was required to provision datastores for engineering teams. To borrow that same analogy, we now envisioned a dedicated ‘genie’ for managing each type of infrastructure. Whether it’s provisioning datastores, managing Kafka topics, or handling compute resources, each genie is specialized in its domain, ready to fulfill the wishes of our engineering teams.

But the genie needs assistance; when granting wishes, the genie needs something that:

confirms the wish is valid
routes the wish to the right genie to grant it
monitors the wish to ensure it has been successfully granted.

So let us introduce our genie’s special helpers.

Gatekeeper: Ensuring valid wishes

Before we can call upon our genie to provision our infrastructure, we need to ensure that the wishes being made are legitimate and that the teams making them are authorized. We need a vigilant guardian — a gatekeeper — that validates the requests for infrastructure provisioning coming in from eligible engineering teams.

The gatekeeper’s role? To check for necessary permissions and ensure that the services requesting infrastructure adhere to a unified convention. Without this gatekeeper, we would run the risk of unauthorized or erroneous requests, affecting the integrity of our provisioning workflow.

Orchestrator: Directing the wishes

Once a wish has been validated, the request needs to be routed to the appropriate genie. Please meet our Orchestrator. It’s responsible for directing provisioning requests to the right Kubernetes operator.

By effectively managing these requests, the Orchestrator ensures that each genie can focus on what they do best: streamlining the provisioning process and enhancing overall efficiency.

Watchdog: Monitoring the wish to completion

While all of the above is occurring, the wishes are monitored by the attentive Watchdog, which plays a crucial role in streamlining the deployment process by monitoring the status of the various Custom Resources (CRs).

Wish Alchemist: Transforming wishes into actionable requests

But how do we bring together the capabilities of the Gatekeeper, the Orchestrator and the Watchdog to provide such a seamless service? And how do we ensure our engineers don’t need to think about any of the above?

The next genie helper to come to the party is the Wish Alchemist. It blends the requests together for the genies to act upon. Ultimately, we envision a single product that seamlessly integrates the functions of orchestration, validation of legitimate wishes, and the transformation of those wishes into actionable requests. This unified solution, embodied by the Wish Alchemist, streamlines our infrastructure provisioning process, making it more efficient and user-friendly.

Foundation Interface: Zendesk’s Wish Alchemist

The Foundation Interface serves as the entry point for all infrastructure provisioning at Zendesk. Engineering teams submit a YAML file that outlines the specifications for provisioning each infrastructure component. This YAML file is processed through our deployment pipeline via Spinnaker. During each deployment, Spinnaker triggers a webhook to the Foundation Interface. Foundation Interface validates the YAML, orchestrates the infrastructure provisioning, and sends the provisioning status back to Spinnaker. Spinnaker then advances the deployment process to the next stage upon the successful provisioning of the requested infrastructure.

Here’s a sample YAML configuration that specifies the requirements for provisioning:

an Aurora datastore
an S3 bucket
a Redis database.

version: "1.0"
name: "Ticket Service"
description: "A service for creating and storing customer tickets"
product: "ZenTicket"
team: "Zen Tickets team"

infrastructure:
  aurora:
    - name: "TicketInventory"
      attributes:
        instanceType: "db.t4g.medium"
  S3:
    - name: "TicketInventory"
      attributes:
        lifecycle_policies:
          expire_objects_after_days: 30
  Redis:
    - name: "TicketInventory"
      attributes:
        size: micro
        purpose: cache

Responsibilities of the Foundation Interface API

YAML validation, aka our Gatekeeper

All Kubernetes operators working in the background to provision the requested infrastructure expect the YAML to follow a specific format to ensure:

all components are named, tagged, and tracked uniformly so the infrastructure at Zendesk is consistent and predictable
ease of maintenance and future upgrades.

One of the critical responsibilities of the Foundation Interface API is to validate the YAML configuration submitted by engineering teams to ensure it meets the specification, and provide early feedback.

Below is an example of YAML to provision an Aurora datastore.

version: "1.0"
name: "Ticket Service"
description: "A service for creating and storing customer tickets"
product: "ZenTicket"
team: "Zen Tickets team"

infrastructure:
  aurora:
    - gnome: "Ticket Service"
      attributes:
        instanceType: "db.t4g.medium"

In this case, the Foundation Interface API would return the below error as the field it is expecting, “name”, is misspelled as “gnome”.

Error: Missing name field

The Foundation Interface effectively identifies the absence of the mandatory “name” field in the YAML configuration, ensuring compliance with specifications.

By delivering this early feedback, engineering teams can address the issue proactively and reduce risks and delays later down the deployment pipeline.
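To make the Gatekeeper check concrete, here is a minimal Python sketch of this kind of validation. It is not the Foundation Interface implementation: the function name, the set of required top-level keys, and the location suffix on the error message are assumptions, and the YAML is represented as an already-parsed dictionary to keep the example dependency-free.

```python
from typing import List

# Assumed set of required top-level keys; the real specification is not shown here.
REQUIRED_TOP_LEVEL = ("version", "name", "team")


def validate_service_config(config: dict) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in config:
            errors.append(f"Missing {key} field")
    infrastructure = config.get("infrastructure") or {}
    for kind, components in infrastructure.items():
        for index, component in enumerate(components or []):
            # Every component must carry a "name"; a typo such as "gnome" fails here.
            if "name" not in component:
                errors.append(f"Missing name field in {kind}[{index}]")
    return errors


config = {
    "version": "1.0",
    "name": "Ticket Service",
    "team": "Zen Tickets team",
    "infrastructure": {
        "aurora": [
            {"gnome": "Ticket Service", "attributes": {"instanceType": "db.t4g.medium"}}
        ]
    },
}
errors = validate_service_config(config)
```

Collecting every error in one pass, rather than stopping at the first, lets a team fix all problems in a single round trip.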

Our Orchestrator

Once the validation is successful, the Foundation Interface routes these requests to the appropriate Kubernetes operators for provisioning the infrastructure. Infrastructure operators extend the Kubernetes API through CustomResourceDefinitions (CRDs). Each operator includes controllers that monitor instances of these CRDs, ensuring that the current state of each datastore aligns with the desired state.

The Foundation Interface is responsible for:

converting the YAML files submitted by engineering teams into Custom Resources (CRs)
routing the CRs to the corresponding operator.

The image above illustrates the workflow where:

User submits the service.yaml which contains the definition for our:
– Storage
– Vault Access
– Service Access
– Secret Access
Foundation Interface validates the YAML file.
The validated YAML is separated into CRs for each defined infrastructure component:
– Storage Request CR
– Vault Access CR
– Service Access CR
– Secret Access CR
Each downstream operator processes the CRs and provisions the infrastructure.
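The conversion and routing steps above might look something like the following Python sketch. The API group, the `StorageRequest` kind name, the `spec` layout, and the operator naming scheme are all hypothetical, chosen only to show one parsed service file fanning out into one CR per component.

```python
from typing import Dict, List

API_VERSION = "selfservice.example.com/v1"  # hypothetical API group and version


def to_custom_resources(config: dict) -> List[dict]:
    """Split one parsed service.yml into one CR-shaped dict per component."""
    resources = []
    for section, components in (config.get("infrastructure") or {}).items():
        for component in components:
            resources.append({
                "apiVersion": API_VERSION,
                "kind": "StorageRequest",  # hypothetical kind name
                "metadata": {"name": component["name"].lower()},
                "spec": {
                    "type": section.lower(),
                    "team": config["team"],
                    "attributes": component.get("attributes", {}),
                },
            })
    return resources


def route(resources: List[dict]) -> Dict[str, List[str]]:
    """Group CR names by the operator assumed to own each datastore type."""
    routed: Dict[str, List[str]] = {}
    for resource in resources:
        operator = f"{resource['spec']['type']}-operator"  # hypothetical naming
        routed.setdefault(operator, []).append(resource["metadata"]["name"])
    return routed


config = {
    "name": "Ticket Service",
    "team": "Zen Tickets team",
    "infrastructure": {
        "aurora": [{"name": "TicketInventory", "attributes": {"instanceType": "db.t4g.medium"}}],
        "Redis": [{"name": "TicketInventory", "attributes": {"size": "micro", "purpose": "cache"}}],
    },
}
routed = route(to_custom_resources(config))
```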

Our Watchdog

As an additional benefit, the Foundation Interface plays a crucial role in streamlining the deployment process by monitoring the status of various Custom Resources (CRs).

Its primary functions include:

Continuously assessing the status of different CRs to determine if the underlying infrastructure is ready for application deployment.
Relaying the outcome: downstream operators update the status of each CR to indicate whether provisioning has succeeded or failed, and the Foundation Interface communicates the results back to Spinnaker. If the status indicates success, Spinnaker proceeds with the actual application deployment.

This monitoring capability eliminates the need for engineering teams to follow up on infrastructure deployment.
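One way to picture the Watchdog is as a loop that folds many CR statuses into a single answer for the pipeline. The sketch below is an assumption-laden illustration in Python: the status strings (`Ready`, `Failed`, `Pending`), the function names, and the polling approach are invented for the example and may not match how the Foundation Interface actually observes CRs.

```python
import time
from typing import Callable, Iterable, List


def overall_status(statuses: Iterable[str]) -> str:
    """Fold per-CR statuses into one result: any failure wins, all ready succeeds."""
    statuses = list(statuses)
    if any(status == "Failed" for status in statuses):
        return "Failed"
    if statuses and all(status == "Ready" for status in statuses):
        return "Succeeded"
    return "Pending"


def wait_for_infrastructure(
    read_statuses: Callable[[], List[str]],
    attempts: int = 10,
    delay_seconds: float = 0.0,
) -> str:
    """Poll until the infrastructure succeeds or fails, or give up after N attempts."""
    for _ in range(attempts):
        result = overall_status(read_statuses())
        if result != "Pending":
            return result
        time.sleep(delay_seconds)
    return "TimedOut"


# Simulated polls: two CRs that become ready one after the other.
polls = iter([["Pending", "Pending"], ["Ready", "Pending"], ["Ready", "Ready"]])
result = wait_for_infrastructure(lambda: next(polls))
```

The returned value is what would be reported back to the deployment pipeline, which only proceeds on success.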

End-to-end infrastructure provisioning

The image below illustrates the complete workflow.

Deployment initiation: The deployment process begins when a user triggers a deployment via Spinnaker.
Webhook trigger: Spinnaker sends a webhook request to the Foundation Interface, signaling the start of the deployment process.
YAML validation: Upon receiving the webhook, the Foundation Interface validates the provided YAML configuration to ensure it adheres to the required specifications.
CR parsing and orchestration: After successful validation, the Foundation Interface parses the YAML file into Custom Resources (CRs) for each defined infrastructure component. This structured representation allows for efficient processing and provisioning.
Provisioning by downstream operators: Each downstream operator receives the relevant CRs and proceeds to provision the necessary infrastructure components as specified.
Status monitoring: The Foundation Interface continuously monitors the status of the CRs to assess whether the infrastructure is fully provisioned and ready for deployment.
Status communication: Once the infrastructure status is determined, the Foundation Interface communicates the results back to Spinnaker. If the status indicates that provisioning has succeeded, Spinnaker can then proceed with the actual application deployment.

Conclusion: A new era for Zendesk engineers

By implementing the Foundation Interface service, we not only standardized and streamlined the infrastructure provisioning process but also empowered our engineers to take control of their services. This transformation has led to more consistent, predictable, and efficient infrastructure at Zendesk, enabling teams to focus on what they do best: building innovative solutions for customers — all due to the simple act of allowing infrastructure to be built with a few lines of YAML and a click of a button!

As we continue to evolve and expand this platform, we remain committed to understanding and addressing the needs of our engineering teams. The journey towards efficient infrastructure provisioning is just beginning, and we are excited to see where it takes us.

Stay tuned for our next article — “Unleashing the Magic of Zendesk Datastore Management: Your One-Stop Self-Service Hub!”

A Journey to Empowerment: How Zendesk Engineers Transformed Infrastructure Provisioning was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/377769097cf7


Connecting Applications to Self-Service Datastores

Jeffrey Theobald Jun 8, 2025


A picture of a plug and socket about to be connected.

Are you ready for more self-service datastore adventures? If you haven’t already, have a look at our previous entries in this series:

They’re a fun read.

The story so far

Last time, in Simplifying Datastore Provisioning with Kubernetes Operators, we talked about making datastores easy to provision by just writing a few lines of YAML in a file called service.yml, like this:

version: 1.0
name: "Terrific Tents"
description: "Comfortable Covering for Camping Champions"
product: "Outdoors"
team: "Burke and Wills"

infrastructure:
  aurora:
    - name: "carpentaria"
      attributes:
        instanceType: "db.r6g.large"

So now we have a datastore, but the application we’ve made it for needs to both be able to connect to it and actually use it. We need to deliver credentials to the application somehow.

A diagram showing how all the components of self service fit together. This section, Credential Delivery, comes after Storage Operators and is directly connected to the application itself — How does this fit into the bigger picture?

Getting credentials the old way

The usual way to get credentials for applications was a bit complex. For a new Aurora cluster, the process went something like this:

Our Database Administrators (DBAs) would:
- Log in with their admin credentials
- Manually create a user for the application
- Share the username and password via LastPass
The application development team would then:
- Insert the credentials into a HashiCorp Vault instance
- Add the Vault path for the credentials as an environment variable
- Add a Kubernetes sidecar to their deploy manifest which would fetch the secret from Vault and present it to the application.

It all felt very manual with lots of places where things could go wrong, where credentials could be leaked or lost. The worst part? At least once a year, these credentials had to be rotated using a similar process.

These inefficiencies had built up organically — an example of this being that the DBAs did not have direct access to Vault, so they put the secrets in LastPass.

Since there wasn’t really a reason for things to be this way, it felt less like we should optimize each step, and more that we should approach this from a completely different perspective.

What is the self-service way?

If you’ve read our other blogs, you’ll already know what our design philosophy is. I like to call this “Don’t make me think”. An application developer shouldn’t have to care about Vault, or secrets or chains of trust, or anything like that. We wanted the new way, the self-service way, to be as low-touch as possible. Ideally zero-touch.

We also wanted to remove that troublesome need for manual credential rotation. If credential generation was completely automated, maybe it wouldn’t be so hard to have those credentials regularly rotated as well. If we did that, then everything would be more secure and no human would ever see the credentials that applications were using, minimizing the risk of leaked credentials and removing the need for manual rotation.

So, combining these ideas, we wanted credential delivery to look like this:

The deploy action causes a set of operations, comically hidden in a cloud of magic which result in the application getting credentials. — Deploys should magically cause the application to get credentials. Somehow.

In this magic future, when the developer deploys their app, it magically gets credentials in the file system in a known, consistent place. That’s it. No other work to be done for the application engineers.

So how could we do that?

MAGIC (technically Kubernetes) to the rescue

Zendesk was already using Kubernetes extensively, so we wanted to take advantage of its features. We briefly mentioned above that in the past sidecars were used to load secrets, but we wanted to do better. We wanted to use init containers.

An init container is part of a Kubernetes Pod and runs before your application containers. It is intended to do “start-up”-style tasks such as retrieving your datastore credentials and putting them somewhere the application can read. In our case, this is a short-lived volume that both the application and init container can access.

We set up our init container to expect environment variables which tell the init container which secrets to fetch and from where. So the init container starts up, gets the secrets, and then exits.
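As a rough illustration of that flow, here is a Python sketch of what such an init container could do. The `SECRET_PATHS` variable name, its `name=path` format, and the `fetch_secret` callable are assumptions made for the example; the real container talks to a secret store, which is stubbed here with a dictionary.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Mapping


def run_init(
    env: Mapping[str, str],
    fetch_secret: Callable[[str], str],
    target_dir: Path,
) -> List[Path]:
    """Fetch each requested secret and write it to the shared volume, then return."""
    written = []
    target_dir.mkdir(parents=True, exist_ok=True)
    # SECRET_PATHS is a hypothetical variable, e.g. "aurora=secret/app/aurora,redis=secret/app/redis"
    for entry in filter(None, env.get("SECRET_PATHS", "").split(",")):
        name, _, path = entry.partition("=")
        destination = target_dir / name
        destination.write_text(fetch_secret(path))
        destination.chmod(0o600)  # readable only by the owning user
        written.append(destination)
    return written


# Stand-in for the secret store; values are placeholders, not real credentials.
fake_store = {"secret/wishmaker/aurora": "example-user:example-password"}

with tempfile.TemporaryDirectory() as tmp:
    files = run_init(
        {"SECRET_PATHS": "aurora=secret/wishmaker/aurora"},
        fake_store.__getitem__,
        Path(tmp),
    )
    delivered = {f.name: f.read_text() for f in files}
```

The function exits as soon as the files are written, mirroring an init container that runs to completion before the application starts.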

But this wasn’t good enough. Instead of manually doing some work in Vault, developers would have to manually add the init container to their deploy. In other words, we’d just traded one type of manual work for another. That’s not magic. That’s not self-service; that’s too much thinking.

Kubernetes mutating admission webhooks

What if, instead of having to add the init container to every single application, something else did it for them automatically? You’ve probably already guessed this, but this is what Kubernetes mutating admission webhooks are for.

The name suggests something complex and messy, but don’t worry: the concept is straightforward. Kubernetes webhooks let you inspect Kubernetes resources as they are being created, and in the case of mutating webhooks, you can mutate those resources. So, we added a webhook that looks at each Kubernetes Pod as it is deployed and adds the init container that will get the credentials if they’re needed:

The act of deploying creates a Kubernetes pod which is fed to the mutating admission webhook. This then transforms the Kubernetes pod to add an init container which inserts the credentials into a short lived volume — The mutating webhook adds stuff, so the user doesn’t have to care.

And as if by magic, the application gets secrets with no extra work.
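For readers who have not written one, the core of a mutating webhook is small: receive an AdmissionReview, decide whether to change the object, and answer with a base64-encoded JSON Patch. The Python sketch below follows that response shape from the Kubernetes `admission.k8s.io/v1` API; everything else (the opt-in annotation, container name, image, volume name, and the choice of an in-memory `emptyDir`) is a hypothetical stand-in, and mounting the volume into the application containers is omitted for brevity.

```python
import base64
import json

OPT_IN_ANNOTATION = "example.com/inject-credentials"  # hypothetical annotation

INIT_CONTAINER = {
    "name": "credential-fetcher",                   # hypothetical name
    "image": "example/credential-fetcher:latest",   # hypothetical image
    "volumeMounts": [{"name": "credentials", "mountPath": "/secrets"}],
}
VOLUME = {"name": "credentials", "emptyDir": {"medium": "Memory"}}  # assumed volume type


def build_patch(pod: dict) -> list:
    """Return JSON Patch operations that add the init container, or [] if not needed."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    if annotations.get(OPT_IN_ANNOTATION) != "true":
        return []
    spec = pod.get("spec", {})
    patch = []
    for field, value in (("initContainers", INIT_CONTAINER), ("volumes", VOLUME)):
        if spec.get(field):
            # "/-" appends to an array that already exists.
            patch.append({"op": "add", "path": f"/spec/{field}/-", "value": value})
        else:
            # Otherwise the array itself has to be created.
            patch.append({"op": "add", "path": f"/spec/{field}", "value": [value]})
    return patch


def admission_response(review: dict) -> dict:
    """Build the AdmissionReview reply for one incoming request."""
    request = review["request"]
    response = {"uid": request["uid"], "allowed": True}
    patch = build_patch(request["object"])
    if patch:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


review = {
    "request": {
        "uid": "abc-123",
        "object": {
            "metadata": {"annotations": {OPT_IN_ANNOTATION: "true"}},
            "spec": {"containers": [{"name": "app"}], "volumes": [{"name": "tmp"}]},
        },
    }
}
reply = admission_response(review)
operations = json.loads(base64.b64decode(reply["response"]["patch"]))
```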

But what about secret rotation?

We’ve nearly forgotten to discuss the other requirement with you! We wanted to remove the need for manual rotation entirely by automatically, regularly rotating the secrets.

We realized early on that we shouldn’t try to keep the same user and change the password, as that would lock out the existing application. Instead, when the time came to rotate credentials, we provided an entirely new user and password, or other type of credential.

Sure, we could’ve rotated these new credentials in place on the Kubernetes Pod, but that would have meant that the application needed to regularly scan for these new credentials. We had a lot of applications at Zendesk which are built to read configuration at start-up and never look at it again, making this approach a lot of work.

So instead we took advantage of one of the principles of Kubernetes: expect your Pods to be disrupted.

When we know that the credentials are about to expire, we evict the whole Pod.

To do this, we inject a sidecar container — a secondary container that runs alongside the application. This sidecar calculates the minimum of the Time To Live (TTL) of all the secrets we have delivered and then tells Kubernetes to evict this Pod safely, by respecting Pod Disruption Budgets, when the credentials are close to expiring.

The Pod would then be recreated and the init container would perform its dance once again, providing the application with brand new credentials at start-up. By taking this approach, all applications were automatically compatible with regularly rotated credentials without any extra work!
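The timing logic of such a sidecar can be sketched in a few lines of Python. The function names and the one-hour safety margin are assumptions for illustration; the source only states that the sidecar takes the minimum TTL and evicts the Pod shortly before expiry. The eviction call itself is left out: a real sidecar would request it through the Kubernetes Eviction API, which is what honours Pod Disruption Budgets.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def eviction_time(
    issued_at: datetime,
    ttls: Iterable[timedelta],
    safety_margin: timedelta,
) -> datetime:
    """The Pod must be gone before the shortest-lived secret expires."""
    ttls = list(ttls)
    if not ttls:
        raise ValueError("at least one secret TTL is required")
    shortest = min(ttls)
    # Never schedule an eviction before the credentials were even issued.
    return issued_at + max(shortest - safety_margin, timedelta(0))


def should_evict(
    now: datetime,
    issued_at: datetime,
    ttls: Iterable[timedelta],
    safety_margin: timedelta = timedelta(hours=1),  # assumed margin
) -> bool:
    """True once the current time has reached the eviction deadline."""
    return now >= eviction_time(issued_at, ttls, safety_margin)


issued = datetime(2025, 6, 1, 0, 0, tzinfo=timezone.utc)
ttls = [timedelta(hours=24), timedelta(hours=8), timedelta(hours=72)]
deadline = eviction_time(issued, ttls, safety_margin=timedelta(hours=1))
```

With these inputs the 8-hour secret drives the schedule, so the deadline falls 7 hours after issue regardless of the longer-lived secrets.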

Getting credentials the self-service way

When we mix this magic together, the new self-service way looks like this:

When the application is deployed, our mutating admission webhook inserts:

the init container to fetch the secrets
a sidecar to evict the Pod when the TTL on those secrets expires

The application then runs as normal, and will be restarted before the credentials are rotated. It doesn’t have to rescan the credentials or anything complicated — it’s that easy!

In fact, the application can’t even tell what environment it’s in, so your local dev instance, staging and production all work the same way, which is a breath of fresh air for developers and lets them focus on their business logic rather than debugging the connection to their datastore.

And even though there’s less work for the developer, it’s far more secure. The credentials are never seen by humans and are regularly rotated, so there’s almost no chance that a username and password will get stored on someone’s laptop. It also means our dedicated Security team saves time since all datastore credentials are provisioned and delivered the same way.

So just as we wanted, developers no longer have to care about any of the complexities of accessing their datastore. With self-service authentication, they can focus on the code they care about writing, and not about the finer details of their datastores.

Connecting Applications to Self-Service Datastores was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/0c1853699dfb


Resolving Incidents With The Remote Incident Console

Alistair Forrester Burrowes May 28, 2025


Supporting developers to debug and resolve issues with datastores in the Self-Service ecosystem.

An advanced looking console with the words Remote Incident Console displayed on it. Generated by ChatGPT.

Welcome to the third blog post of our Self-Service Datastore series, where we share our journey towards creating a more efficient and reliable way to manage datastores at Zendesk.

Previous blog posts:

We need reliable, fast, and compliant self-serve methods to provision datastores. Furthermore, we need ways to access those datastores from applications; otherwise, they won’t serve much purpose. The full story of how we provision credentials for Self-Service Datastores and access them from applications is for another day. Today, I want to talk about another key scenario where we need to facilitate access.

Our application

Before we get into the scenario, let’s take a look at our application. This is the service.yaml which lives in our application repository and is the entrypoint into the Self-Service ecosystem.

version: "1.0"
name: "wishmaker"
description: "A service for managing wish requests."
product: "wishmania"
team: "wishers"

infrastructure:
 aurora:
   - name: "wish-inventory"
     attributes:
       instanceType: "db.t4g.medium"
 redis:
   - name: "wish-cache"
     attributes:
       nodeType: "cache.r7g.large"
       clusterModeEnabled: true
       purpose: cache

In short, our application is wishmaker and we have a Self-Service Aurora MySQL wish-inventory and Self-Service Redis wish-cache. Given this setup, what is the scenario? 😬

⏰ 💥 PROD IS DOWN💥 ⏰

Everyone panic!!!!

That might be triggering 😅 … but we all know that sometimes things go wrong and our applications break. Excluding all the communication and collaboration required for the incident, we will need engineers to:

Assess what has gone wrong.
Perform mitigations or fixes.

We often need engineers to assess what is happening and perform fixes in the database directly. How can we support engineers to directly access the database in the Self-Service ecosystem?

A diagram showing the different components in the Self-Service ecosystem and showing that we are focusing on RIC.

Remote Incident Console (RIC)

> RIC has entered the chat

We need some way for developers to easily get a console that has access to their Self-Service Datastores. This needs to be quick and easy, and for that we developed the Remote Incident Console (RIC). How does it work?

$ ric --partition pod73 --service wishmaker --incident inc-2025-02-18-a

Woah! Slow down!

Ok what does this command do? It says:

ric - I want a RIC console.
--partition pod73 - We need to connect to the application running on the pod73 Zendesk Partition (a regional slice of all of Zendesk, with an underlying Kubernetes cluster).
--service wishmaker - The application (and more importantly, the owner of the datastores I need to access) is wishmaker.
--incident inc-2025-02-18-a - This investigation is related to the inc-2025-02-18-a incident. The incident id is used for a variety of purposes.

What do you get?

This creates a Kubernetes pod in the application namespace (wishmaker in this case) and automatically runs kubectl exec -it /bin/bash to give the user a console.

🗒  Requesting new OKTA token
🎲 Initiating RIC
RIC Session requested with the following details
Team:                    wishers
Partition:               pod73
Environment:             production
User:                    alistair.burrowes
Service:                 wishmaker
Incident ID:             inc-2025-02-18-a

🕕  Session creating ... (10s)
🚀 Taking you to your Session
🕞 Your RIC Session will be valid until [ 2025-02-19 03:10:20 +1100 ]

ric-console@wishmaker-ricpod-tccx4:/app$

RIC console

Now that we are in and have a console, we want to connect to our Self-Service Aurora MySQL database known as wish-inventory. How do we do that?

Well, we can use our handy aurora command, e.g.:

ric-console@wishmaker-ricpod-tccx4:/app$ aurora wish-inventory

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 10001
Server version: 8.0.28-zendesk_proxy-1.0 Source distribution

Copyright (c) 2000, 2024, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

And we are in! We can start diagnosing what is wrong and performing resolutions.

But this doesn’t just work for Aurora MySQL; what if we need to access Redis? You guessed it: redis wish-cache. We have simple and fast access to all our Self-Service Datastores.

How does it work?

aurora wish-inventory translates to something like mysql --port 10001 so it is simply using the MySQL CLI to connect on a given port. We do not need to pass any credentials.

No credentials?

This is important! If we are passing real credentials from the RIC console, then it will be easy for a RIC user to extract those credentials. Once they have those, they can attempt to bypass RIC. ⚠️️ DANGEROUS ⚠️

There must be credentials somewhere?

Yes! The credentials exist in proxy containers which will forward the request onto the real Self-Service Datastores.

An architecture diagram showing how datastore queries are routed via proxy containers in the Kubernetes Pod. Those proxy containers audit log the queries, then forward them on to the real datastores.

Cool. What is Splunk?

Splunk is a log aggregation tool. Another critical aspect of RIC is we need to audit log commands sent to the datastores. This way if someone uses it for a nefarious purpose (or makes a mistake!) we know exactly what happened, when, and by whom.
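The proxy idea (credentials live in the proxy, every command is logged before it is forwarded) can be sketched in Python as below. This is a conceptual illustration only: the `AuditingProxy` class, the log record fields, and the `forward` callable are hypothetical, and a real proxy would speak the datastore's wire protocol rather than pass strings around.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List


class AuditingProxy:
    """Holds the real credentials so the console user never sees them."""

    def __init__(
        self,
        user: str,
        incident_id: str,
        credentials: Dict[str, str],
        forward: Callable[[Dict[str, str], str], str],
        audit_sink: Callable[[str], None],
    ) -> None:
        self._user = user
        self._incident_id = incident_id
        self._credentials = credentials
        self._forward = forward
        self._audit_sink = audit_sink

    def execute(self, statement: str) -> str:
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user": self._user,
            "incident": self._incident_id,
            "statement": statement,
        }
        # Log first, so a command that fails or hangs is still recorded.
        self._audit_sink(json.dumps(record))
        return self._forward(self._credentials, statement)


audit_log: List[str] = []


def fake_datastore(credentials: Dict[str, str], statement: str) -> str:
    """Stand-in for the real datastore connection."""
    return f"ran as {credentials['username']}: {statement}"


proxy = AuditingProxy(
    user="alistair.burrowes",
    incident_id="inc-2025-02-18-a",
    credentials={"username": "ric-proxy", "password": "placeholder"},
    forward=fake_datastore,
    audit_sink=audit_log.append,
)
output = proxy.execute("SELECT COUNT(*) FROM wishes")
```

Note that the audit record carries who, when, and what, but never the credentials themselves.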

Can anyone get a RIC console?

No, users must be:

A valid Zendesk Okta user.
In a scrum team granted access to use RIC with the application.

Can users bring their code?

Yup, applications can opt into using a custom console container, which can bring along the application code. This is really handy if you want to, for example, run a Rails console.

Is RIC just for incidents?

While designed primarily for incidents, there are many other reasons why you might want a console with access to your Self-Service Datastores. RIC is used for a variety of purposes, such as running backfills, manipulating data in staging for testing, and others.

Conclusion

We need a way for users to quickly get access to their Self-Service Datastores in an incident. This needs to be secure, where only teams explicitly allowed can establish a console. It needs to not expose credentials and ensure all commands sent to the datastores are audit-logged.

Using RIC, we can achieve all these requirements. It is fast, secure, and highly configurable to support any use case.

Stay tuned for more articles about Self-Service!

Resolving Incidents With The Remote Incident Console was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/a03cf8f809db


Less is More: Improving job execution by ditching the job executor

Tim Cuthbertson Mar 14, 2025


A brutally simple and effective implementation for long-running account move jobs at Zendesk.

This article outlines some architectural changes we’ve been able to make to radically simplify the execution model of long-running jobs.

By leveraging client behaviour, the resulting system improves overall functionality while removing the many complexities of distributed job execution.

Dall-e impression of a server who’s ready to move some data!

Background: Account moves at Zendesk

Our account move tooling is a valuable capability which helps our customers as well as ourselves. It was first built to scale out of our single Rackspace deployment into multiple datacenters, and was crucial again many years later to migrate out of those datacenters into AWS. It’s still in routine use for balancing capacity and other metrics across our various datacenters. And more recently, this tooling has been instrumental in migrating the services from acquired companies into our shared infrastructure.

The account move tooling involves a central orchestrator plus a number of data movers. The orchestrator manages the overall account move lifecycle, coordinating the various systems as the move progresses. The job of actually moving bytes around is typically done by data movers — there’s one for each of our supported datastores.

Welcome aboard! Please place your belongings within the Zendesk infrastructure

So when we acquire a company which uses a datastore we don’t have a mover for, we have a challenge.

The first and easiest solution is “can we stop doing that?”. If there’s a supported datastore which would also work well, we’ll migrate to that.

If that’s not an option and the data needs to be moved, then that typically calls for a new data mover. This is a lot of work so we undertake it reluctantly, but having data indefinitely stuck outside the core Zendesk infrastructure makes things complex, and locks the acquired product out of many organisational benefits.

So when it comes to integrating acquisitions into our shared infrastructure, the complexity of building a data mover has a big impact.

Data movers are job execution servers

The details of how data gets moved are important and interesting, but today we’re focusing on job management, because data movers implement jobs. What makes something a job? At its core, I think of a job as something which is:

Long running (if it were short it could just be a request), and
Monitored for completion (if you’re not waiting for its completion, you could just fire off an event or notification and walk away)

In addition to these typical properties, our data mover jobs are usually continuous. They replicate data from a source system into a destination system, keeping up with new changes as they arrive. So we keep them running until the overall account move is complete.

Typical job system API

If we have jobs, we’ll probably need a system for running jobs. For these jobs our orchestrator is the client which asks for jobs to run, and each data mover is the server which implements those jobs.

Most systems for long-running jobs (including our initial implementation) have an API that looks a bit like this:

StartJob(config) -> jobId
GetStatus(jobId) -> status
StopJob(jobId)

The API is simple, but it must support a number of requirements to be a suitable executor for data movement jobs.

Durability

No job left behind! If a client creates a job, the server must not forget about it after a crash or restart.

Fault tolerance

Jobs can run for a long time, and Kubernetes containers don’t last forever. If a container crashes or is replaced, the job needs to keep making progress by having another container take over.

Resumption

Interruptions should not cause the job to start over. A job should pick up from (close to) where it left off.

Uniqueness

We don’t want two instances to execute the same job at the same time.

Dangling jobs

If the client forgets about a job for whatever reason, we don’t want to be executing that job forever as it’s wasteful and may even cause issues if the client doesn’t expect that work to be happening. We want to detect dangling jobs and stop them.

Job execution architecture

Given the above API and requirements, the obvious architecture involves a database and a lock API. The lock API might reuse the same underlying datastore, or it might be a separate system like Consul or etcd.

Jobs are persisted in the database when they’re created (durability), and each job’s current state is periodically persisted (for resumption). Before a process executes a job, it first acquires the lock for that job (uniqueness). If there are incomplete jobs in the datastore without an active lock, those are eligible for a worker to pick up (fault tolerance).
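As a rough illustration of that architecture, here is a minimal in-memory sketch. `JobStore`, `LockService` and `pick_up_job` are hypothetical names invented for this example; a real system would back them with a database and a lock service such as Consul or etcd rather than Python dictionaries.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    config: dict
    state: dict = field(default_factory=dict)  # last persisted progress
    done: bool = False


class JobStore:
    """Stand-in for the job database (hypothetical, in-memory)."""

    def __init__(self):
        self.jobs = {}

    def create(self, job_id, config):
        self.jobs[job_id] = Job(job_id, config)  # durability
        return job_id

    def save_state(self, job_id, state):
        self.jobs[job_id].state = dict(state)  # resumption

    def incomplete(self):
        return [job for job in self.jobs.values() if not job.done]


class LockService:
    """Stand-in for a Consul/etcd-style lock API (hypothetical, in-memory)."""

    def __init__(self):
        self._owners = {}
        self._mutex = threading.Lock()

    def try_acquire(self, job_id, owner):
        with self._mutex:
            if job_id in self._owners:
                return False
            self._owners[job_id] = owner
            return True

    def release(self, job_id, owner):
        with self._mutex:
            if self._owners.get(job_id) == owner:
                del self._owners[job_id]


def pick_up_job(store, locks, worker):
    """Return the first incomplete job this worker manages to lock, or None."""
    for job in store.incomplete():
        if locks.try_acquire(job.job_id, worker):  # uniqueness
            return job
    return None  # nothing unlocked is waiting for a worker
```

If one worker holds the lock for a job, `pick_up_job` returns `None` for everyone else. Once the lock is released (for example because the container went away), another worker can pick the job up and continue from the saved state.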

Let’s tie it all together with 3 server instances, a job database and a lock service. Here’s the sequence of steps for an example job, including resumption on another server instance:

Job done?

Well, we’re getting there. We still have a few problems to address.

Dangling jobs: This one’s not too hard if we don’t mind jobs running for a little while after the client has gone away. We decided to only execute an inactive job when the client requests its status. If the client stops calling GetStatus, the current container will still keep running the job until the container terminates, but after that the job won’t be executed again.

Duplicate jobs: If a client creates a job but an error prevents it from processing the response, that would create an immediately-dangling job. We won’t waste resources on it forever, but we might be performing that job for at least a few hours. When it comes to moving data, having two jobs moving the same data can also result in write conflicts and move failures.

Insight: idempotency keys

There’s a common and extremely effective approach to preventing duplicate jobs, called an idempotency key. These are commonplace in payment APIs like Stripe and Square, since people are big fans of only paying once per purchase.

The idea when applied to jobs is that the client generates a unique key for each job it intends to create, and sends that as part of the StartJob request. If the server sees two requests with the same idempotency key, it knows that the client is referring to the same job. So the client is free to call StartJob 10 times, and the server knows to only start it once.

This is an elegant allocation of responsibilities because it’s trivial for the server and client to implement their respective parts, and together it results in a robust solution against duplicate jobs.
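To make the idea concrete, here is a minimal sketch of the server side. The `JobServer` class and its `start_job` method are hypothetical stand-ins for the StartJob call, and the in-memory dictionary stands in for whatever storage a real server would use.

```python
import uuid


class JobServer:
    """Hypothetical server that starts each idempotency key at most once."""

    def __init__(self):
        self._jobs_by_key = {}
        self.started = 0  # how many jobs were actually started

    def start_job(self, idempotency_key, config):
        existing = self._jobs_by_key.get(idempotency_key)
        if existing is not None:
            return existing  # repeat request: same job, nothing new started
        job_id = f"job-{len(self._jobs_by_key) + 1}"
        self._jobs_by_key[idempotency_key] = job_id
        self.started += 1
        return job_id


server = JobServer()
key = str(uuid.uuid4())  # generated once by the client, reused on every retry
job_ids = {server.start_job(key, {"account": 42}) for _ in range(10)}
```

Ten calls with the same key yield a single job ID, and only one job is started.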

But that’s not the only thing the client can do for us — it turns out we can solve many problems by leveraging properties that the client can easily provide.

A brutally simple interface

Earlier, I said that if a job is short, it could just be a request. Well, what if jobs were just requests? There are two obvious problems with this:

The client wants to know the status of the job while it’s running.
Requests are fragile — you can’t rely on a single request living long enough to complete a job.

The first problem (knowing a job’s status) can be solved with streaming responses. We use gRPC, but streaming HTTP would work fine too. The server can emit a new state whenever it likes, and the client receives it immediately. This is simpler and more responsive than having the client poll for a job’s status.

And as for the fragile nature of connections, our jobs already need to be resumable. So if the connection ends, the client can make a new long-lived RunJob request (with the same idempotency key & configuration), and the server can resume the job from its latest state.
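The interaction can be sketched with a Python generator standing in for the streaming response. This is only an illustration under assumptions: `run_job`, `client_run`, `total_batches`, `resume_from` and `drop_after` are invented names, and a real implementation would use a gRPC server-streaming call rather than an in-process generator.

```python
import itertools


def run_job(config, resume_from=0):
    """Hypothetical streaming RunJob handler, modelled as a generator.

    Each yielded dict stands in for one message on the response stream.
    """
    total = config["total_batches"]
    for done in range(resume_from + 1, total + 1):
        # Stand-in for moving one batch of data.
        yield {"status": "running", "batches_done": done}
    yield {"status": "complete", "batches_done": total}


def client_run(config, drop_after):
    """Drive the job to completion, reconnecting whenever the stream ends early.

    `drop_after` simulates a fragile connection that dies after that many
    messages; it must be at least 1 so that every attempt makes progress.
    """
    progress, attempts = 0, 0
    while True:
        attempts += 1
        stream = run_job(config, resume_from=progress)
        for update in itertools.islice(stream, drop_after):
            progress = update["batches_done"]
            if update["status"] == "complete":
                return progress, attempts
        stream.close()  # connection dropped: the server side stops working
```

With five batches and a connection that drops after every two messages, the client needs three requests, and no batch is repeated.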

With this setup, here’s how a job execution would look, including resumption of the job on a different server instance:

That’s right: we’ve removed the server’s lock API and job store.

Astute readers may suspect I’ve cheated by shifting these responsibilities to the client, whose infrastructure is not shown in these diagrams. Keep reading and you’ll see this is an intentional benefit, rather than dodgy accounting.

This is beautiful 🌅

I’m not usually one to get emotional about sequence diagrams, but it’s striking how well this modest API restructure meets our requirements. Let me count the ways:

No such thing as a dangling job

In this model, work only happens while the client is actively waiting. The client demonstrates this by holding the connection open. When the connection drops, the work stops.

This has a nice parallel with structured concurrency, which I’m a big fan of. Structured concurrency prevents runaway fibers by preventing an asynchronous child task from outliving its parent. Forcing the client to actively wait by holding open a request creates a similar protection against runaway jobs.
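A small asyncio sketch shows the same shape: the work is a child task whose lifetime is bounded by the request. The names (`move_data`, `handle_run_job`, `connection_closed`) are hypothetical, and an `asyncio.Event` stands in for the real signal that the client has disconnected.

```python
import asyncio


async def move_data(progress):
    """Stand-in for a continuous data mover loop; runs until cancelled."""
    while True:
        progress.append(len(progress))
        await asyncio.sleep(0)  # yield control, as real I/O would


async def handle_run_job(connection_closed, progress):
    """Hypothetical request handler: the work is scoped to the connection."""
    work = asyncio.create_task(move_data(progress))
    try:
        await connection_closed.wait()  # the client is holding the request open
    finally:
        work.cancel()  # connection gone: the job cannot outlive it
        await asyncio.gather(work, return_exceptions=True)
    return work.cancelled()


async def demo():
    connection_closed = asyncio.Event()
    progress = []
    handler = asyncio.create_task(handle_run_job(connection_closed, progress))
    await asyncio.sleep(0.01)  # the client waits for a while...
    connection_closed.set()  # ...then disconnects
    stopped = await handler
    count_at_disconnect = len(progress)
    await asyncio.sleep(0.01)  # give any runaway work a chance to show itself
    return stopped, count_at_disconnect, len(progress)
```

Running `asyncio.run(demo())` shows that progress is made while the connection is open and that nothing more happens once it closes.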

Job assignment

The client only performs one request per job at a time. We originally relied on a distributed lock to make sure only one process executes a given job at a time. But if work only happens while the client has an active request, and the client only has one active request for that job, we don’t need explicit assignment: we simply do the work on the instance that receives the request.

Errors and retries

Account move jobs are long-running and can be both expensive and important. The previous system was fragile by default: any error would fail the job unless the data mover implemented robust error retries (including logic for how long to back off and when to give up).

With this interface, any error will by default cause the request to fail. But the client can already handle failed requests. We can make the client as smart as we want around when to retry and when to give up, keeping server implementations simple.

In fact, the client now defers to a human operator if a move is classified as important. Instead of giving up after too many errors, the client stops retrying and waits for a human operator to abort or resume the move. Again, this requires no specific support from servers.
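A client-side retry policy along these lines might look like the sketch below. Everything here is an assumption made for illustration: the `NeedsOperator` exception, the `max_errors` threshold, and the exponential backoff schedule are not taken from our orchestrator.

```python
import time


class NeedsOperator(Exception):
    """Raised when an important move should wait for a human decision."""


def run_with_retries(attempt, max_errors=3, important=False,
                     base_delay=0.0, sleep=time.sleep):
    """Call `attempt()` until it succeeds, backing off between failures.

    `attempt` stands in for one RunJob request. All names and thresholds
    are illustrative assumptions.
    """
    errors = 0
    while True:
        try:
            return attempt()
        except ConnectionError as exc:
            errors += 1
            if errors >= max_errors:
                if important:
                    # Stop retrying and hand the decision to a person.
                    raise NeedsOperator("paused: abort or resume?") from exc
                raise  # unimportant move: give up
            sleep(base_delay * 2 ** (errors - 1))  # exponential backoff
```

The server needs no knowledge of any of this. It only ever sees requests that either arrive or don’t.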

Workload balancing

This is a bit of a stretch goal, because it’s hard. Ideally with 10 instances and 100 jobs, we’d like each instance to be running 10 jobs. There are simple tricks to get a bit of balance, like sleeping briefly before picking up an unowned job. If your sleep time is proportional to how many jobs you’re already running, idle instances will pick up work more often than busy instances.

But when the number of connections becomes a reliable proxy for the number of jobs, balancing suddenly becomes trivial, because that’s what load balancers do: Istio’s default behaviour sends traffic to the instance with the fewest active requests. This won’t actively rebalance work when jobs finish, but aside from that we get optimal balancing, for free.
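The effect is easy to see in a toy simulation. Note that this is an idealised least-requests policy written for illustration, not Istio’s actual algorithm, and the `mover-N` instance names are made up.

```python
def least_requests(active):
    """Pick the instance with the fewest active requests (ties: first listed)."""
    return min(active, key=active.get)


# Ten instances, one hundred long-lived RunJob connections arriving in turn.
active = {f"mover-{i}": 0 for i in range(10)}
for _ in range(100):
    active[least_requests(active)] += 1
```

Because each job is exactly one open connection, every instance ends up holding ten of them without any coordination between the servers.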

Storing state

This is where we maybe took the notion of leaning on the client a little too far — we give the state to the client for it to store.

As part of our streaming response, we have a persist_state field of opaque bytes. Upon receiving this, the client will store it somewhere. When beginning each request, the client includes the most recently stored state as the persisted_state field in the RunJob request.
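Here is a minimal sketch of that round trip. The JSON encoding, the `offset` field and the function names are assumptions for illustration; the point is only that the client stores and returns the bytes without ever looking inside them.

```python
import json


def server_step(persisted_state: bytes) -> bytes:
    """Hypothetical server step: decode state, do one unit of work, re-encode.

    The returned bytes play the role of the persist_state field.
    """
    state = json.loads(persisted_state) if persisted_state else {"offset": 0}
    state["offset"] += 100  # stand-in for copying 100 rows
    return json.dumps(state).encode()


class Client:
    """Hypothetical client that stores the bytes without interpreting them."""

    def __init__(self):
        self.stored = b""  # nothing persisted before the first request

    def run_once(self):
        # Send back the most recent state, then store whatever comes back.
        self.stored = server_step(self.stored)
```

After two calls to `run_once`, the stored state decodes to an offset of 200, even though the server kept nothing between requests.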

This means the server can be 100% stateless, which is an odd property for a service that moves persistent data. But that data belongs to the service we’re moving, so it’s not an appropriate place for our own job store.

For us, this was worthwhile because we have far more servers than clients (just one client for the foreseeable future), and the benefit of a fully stateless server was worth the effort to have the client persist the state.

You could definitely adopt the rest of the ideas in this article without giving the client control over your state. And obviously don’t do this unless you trust the client. We choose to trust the client enough to sabotage its own data (e.g. by sending us falsified state which may cause us to skip parts of the transfer). But we don’t expose any authorization-sensitive data in the state which might allow the client to interfere with a datastore it doesn’t own.

Surprisingly, the desire for stateless data movers was the original motivation for this whole design, because a stateless system is an obvious thought when trying to reduce complexity. In retrospect, removing the state store was probably the least important benefit — writing to a database isn’t all that hard if you don’t have to worry about all the distributed coordination challenges.

Why (and when) does this work?

Of course, all of this only works when the client has handy behaviour like “not forgetting about jobs” and “only making one request per job at a time”. That… sounds like a job executor?

Well yeah, the account move orchestrator is a glorified job executor: most of its work involves running various internal jobs and storing their states. The approach described here doesn’t remove the need for a job executor, but it means we can utilize a single job executor at the outermost layer of the system. We’re not directly integrating with that job system; we’re simply structuring our interface to leverage the useful properties it provides.

This is obviously nice for simplifying our existing data movers, relieving them from managing jobs (and any complexities / bugs that come with that). But more important are the data movers which haven’t been written yet. Now when we need to implement a data mover for a datastore used by an acquired company, the bulk of that work is simply moving the data, rather than first assembling a robust job execution system.

Hooray for coupling? 🔗

It can be tempting to want to build a system that’s modular, decoupled, independent and all the other feel-good adjectives that nobody’s supposed to dislike.

And indeed, Conway’s Law suggests that if you carve out a “data mover” as its own system and team, it’s natural to want to build that as a standalone system, exactly as we did. But there are tremendous efficiencies to be gained in lightweight coupling. And this really is very lightweight coupling: we’ve simply decided on a specific contract between client and server which results in the most robust, lowest complexity system overall.

Epilogue: “Why don’t you just use [my-favourite-job-system]?”

Without knowing the details, maybe we could have! For our needs which span multiple different programming languages, there was no obvious fit which had what we needed out of the box. I’m sure we could have made various approaches work, with additional code to integrate or augment missing features. But what’s better than writing a bunch of code? Not doing that!

Thanks for reading!

I hope you’ve found this alternative approach to a job system API interesting. It’s not necessarily the right approach for every job-like system. The point is that if you think about the broader context of a given system and how it’s used, you can sometimes get away with a solution which is orders of magnitude less complex, and that’s rather beautiful. Good luck!

Less is More: Improving job execution by ditching the job executor was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/d00eff680de4


Leading while learning: why great managers don’t have all the answers

Bence A. Tóth Feb 10, 2025


Leading while learning

Why great managers don’t have all the answers

I used to think leaders had to have all the right answers. Now I know that the best ones have all the right questions.

Early in my career, I admired managers who seemed to have everything figured out. They spoke with certainty, made quick decisions, and never hesitated. I assumed that was what leadership required: knowing more, proving more, and never showing doubt.

Then, I became a manager myself. And I quickly realized something unsettling. I didn’t have all the answers. I had experience, sure, but every new challenge exposed gaps in my knowledge. For a while, I worried that this made me a weak leader. Until I realized something surprising.

The best leaders around me weren’t the ones with perfect answers. They were the ones asking the best questions. The ones who admitted they were always learning, who invited different perspectives, and who built environments where their teams could thrive by thinking together.

Leadership isn’t about having all the answers. It’s about creating an environment where the best answers can emerge.

That means embracing a growth mindset, and not just for yourself, but also for your team. It means leading while learning, showing vulnerability, and fostering a culture where curiosity and experimentation aren’t just encouraged, but expected.

Neon sign on the wall that reads “You are what you listen to” — Photo by Arno Senoner on Unsplash

Here are three key principles that reshaped my approach to leadership.

1. Show vulnerability and model learning

One of the most common myths of leadership is that confidence means certainty. In my experience, people are far more motivated by authenticity and transparency.

I remember the first time I became the manager of a team where I had no domain knowledge whatsoever. Before that, I had led teams in which I used to be the most senior engineer, and my influence stemmed from my expertise and the respect I had already built over time. But now, I found myself in an entirely different situation — I had no foundation of expertise, no established credibility, and no real capacity to catch up on the technical depth my team possessed.

The first few weeks were rough. I sat in meetings, listening to discussions that were a blur of unfamiliar terminology and deeply technical debates, sometimes entirely beyond me. I was trying to keep up, but the truth was, I felt completely lost.

What I did have was trust in my team. Over time I realized my new role wasn’t to have all the answers, and my team did not expect that from me, either. My role was to create an environment where the best ideas could surface. Instead of pretending to know, I made it a point to ask thoughtful questions, to encourage discussions, and to let my team take ownership of their expertise. The shift was powerful. Conversations became more open, collaboration strengthened, and over time, my role evolved into being the person who guided decisions.

To help your team embrace learning, you will have to model it yourself first.

You have to be very open about the areas where you’re lacking and growing. When you are honest and forthcoming about your own growth, it normalizes continuous learning for everyone. It shifts the culture from one of perfectionism to one of progress.

2. Create a safe environment where your team can fail

Failure isn’t just an inevitable part of innovation; it’s a necessary one. Yet many teams operate under an unspoken rule: failure is tolerated as long as it happens rarely and quietly.

I was a member of many teams in the past where mistakes were met with blame rather than learning. In such cultures, people stop taking risks. They hesitate to admit uncertainty, and creativity suffers.

A shift in culture happens when failure is no longer seen as something to be avoided at all costs, but as an integral part of innovation and progress. When leaders openly acknowledge mistakes, reflect on what went wrong, and share their learnings, it creates a ripple effect. Teams start feeling safer to experiment, knowing that setbacks are opportunities for growth rather than reasons for blame. Over time, this openness fosters an environment where new ideas can emerge more freely, collaboration strengthens, and decisions become better informed and more effective.

When people feel safe to acknowledge failures, teams improve faster and make better choices.

Research backs this up. Harvard professor Amy Edmondson set out to study the relationship between error-making and teamwork in hospitals. She initially expected to find that high-performing teams made fewer mistakes. But what she discovered was the opposite: the best teams — those scoring highest on a team diagnostic survey — actually reported more errors, not fewer. The reason? They weren’t making more mistakes; they were simply more willing to admit them. Psychological safety, it turns out, isn’t just a nice-to-have, it’s a prerequisite for growth and innovation.

So, how do you create this kind of culture?

Make learning from setbacks a core part of your team’s culture, not just an occasional reflection. When teams treat failures as opportunities to improve, they build resilience, and innovation flourishes.
Recognize and analyze failures thoughtfully — was it driven by calculated risk-taking, or due to negligence or lack of preparation?
Shift post-mortem conversations from blame to growth. Instead of asking, “Why did this happen?” ask, “What did we learn?” and “How can we do better next time?”
Reinforce that progress is a result of trial, error, and iteration, and not flawless execution.

A team that isn’t afraid of failure is well positioned for long-term success.

3. Balance your confidence with curiosity

Don’t get me wrong, confidence is important in leadership. But the strongest teams aren’t led by those with the most knowledge — they’re led by those with the most curiosity.

Years ago, I worked under a leader who always made decisions quickly. Even when he wasn’t sure, he’d confidently decide and move forward. At first, it seemed impressive. But over time, cracks began to show. Solutions weren’t fully thought through, and other perspectives weren’t considered. The team stopped questioning decisions because they assumed their leader had everything already figured out.

Compare that to another leader I later worked with. Instead of rushing to conclusions, she often asked, “What options do we have? What are the trade-offs?” She invited diverse viewpoints, encouraged dissent, and coached rather than dictated. The result? A team that thought together, owned their decisions, and trusted one another.

Leadership is a balancing act between listening and leading, between gathering insights and moving forward with confidence.

Striking the right balance is crucial. If your team is stuck in indecision, you need to help them break free and move forward. Moving slightly in the wrong direction is often better than remaining perpetually stuck in one place. Leadership isn’t just about asking questions; it’s also about knowing when to push for action.

So, how do you balance confidence with curiosity?

Replace directives with open-ended questions. Instead of “Here’s what we should do,” try, “What do we think is the best approach?”
Encourage dissent. Ask, “What might I be missing?” or “Who sees this differently?”
Adopt a coach mindset instead of a fixer mindset. Instead of jumping in to provide solutions, help your team develop the skills and confidence to solve problems on their own, and help each other when needed.

Curiosity keeps both leaders and teams adaptive, innovative, and open to better solutions. It is your job as a manager to encourage thoughtful discussions, and create space for team members to take ownership of challenges. By doing this, you empower them to grow, make better decisions, and ultimately become more resilient and independent as a team.

A team of people forming a huddle with their hands — Photo by Arno Senoner on Unsplash

Leadership is a journey, not a destination

Remember: great leadership isn’t about reaching a point where you have all the answers. It’s about continuously evolving, learning, and adapting. The best leaders don’t operate from a place of certainty, but from a mindset of curiosity and growth.

Your influence as a leader comes not from knowing everything, but from creating an environment where learning is encouraged, failure is treated as a stepping stone, and innovation thrives.

Forget the pursuit of proving yourself. True leadership isn’t about validation, it’s about transformation. The most impactful leaders are not those who stand alone at the summit, but those who lift up everyone around them.

Leading while learning: why great managers don’t have all the answers was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/f297cc383d01


Simplifying Datastore Provisioning with Kubernetes Operators

Bruno Marques Dec 2, 2024


Introduction

Welcome to the second blog post of our Self-Service Datastore series, where we share our journey towards creating a more efficient and reliable way to manage datastores at Zendesk. In today’s dynamic application development landscape, the ability to swiftly provision datastores is crucial for maintaining agility and delivering exceptional user experiences.

Provisioning encompasses all steps involved in requesting a datastore: configuring it to meet company standards, ensuring security and compliance, and managing access credentials. In this article, we’ll explore how we have simplified this process using Kubernetes storage operators, enabling our engineering teams to obtain datastores with ease and minimal complexity. Our approach has eliminated the need for those teams to master every detail of our datastore offerings or navigate the intricacies of third-party provisioning tools.

The pain of manual datastore provisioning

Consider a scenario where your Zendesk engineering team needs a new Aurora MySQL cluster. The process begins with submitting a request ticket, followed by a waiting period for approval. Once approved, you must configure the datastore to meet stringent security compliance requirements — hardening it against vulnerabilities, determining the number of replicas, managing access controls, and ensuring alignment with company policies. This tedious process can take days or even weeks, diverting valuable resources away from actual product development.

Now, imagine a smart entity — let’s call it the Datastore Genie. This entity expertly manages all security compliance needs for various datastores, understanding the sensible defaults for your Aurora MySQL databases. It allows you to configure only essential parameters, such as instance type, without overwhelming you with unnecessary details.

The Datastore Genie knows how to grant your wishes:

It knows precisely where in the cloud environment (AWS account, region, VPC) the datastore should be provisioned and which applications require access, configuring networking and security groups appropriately.
It is aware of the engine versions widely supported within Zendesk, ensuring that applications run on the most reliable configurations.
It automates user creation and rotation, freeing database administrators to tackle real database issues rather than getting bogged down in provisioning tasks.
Best of all, this entity grants your wish in mere seconds or minutes, drastically reducing wait times and frustration.

The Kubernetes operator: your Datastore Genie

This is precisely what our Kubernetes storage operators provide. The operator pattern encapsulates the functionality of our Datastore Genie, automating the complex tasks of provisioning and managing datastores.

Kubernetes operators extend the Kubernetes API through CustomResourceDefinitions (CRDs), enabling us to define new resource types that represent our datastores. Each operator includes controllers that monitor instances of these CRDs, ensuring that the current state of each datastore aligns with the desired state.
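The core of a controller is a reconcile step that compares desired and current state and works out what to change. The sketch below is a deliberately simplified, hypothetical model written in plain Python; it does not use the Kubernetes API, and the action names are invented for illustration.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to move `actual` toward `desired`.

    Both arguments map a datastore name to its attributes. This is a
    simplified, hypothetical model of a controller's reconcile step.
    """
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    return actions


desired = {"WishInventory": {"instanceType": "db.t4g.medium"}}
first_pass = reconcile(desired, {})  # nothing exists yet: create it
second_pass = reconcile(desired, desired)  # already converged: nothing to do
```

Running the same step repeatedly is safe: once the current state matches the desired state, the list of actions is empty.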

At Zendesk, applications are deployed across multiple partitions. Each Zendesk partition is associated with its own Kubernetes cluster, and storage operators are deployed to each of these clusters with configurations that guarantee datastores are provisioned in the correct AWS regions. Deployed in its own Kubernetes namespace, each storage operator watches for custom resources (CRs) of specific storage types across the entire Kubernetes cluster. It then communicates with the proper AWS account and region, as well as credential storage, to ensure efficient and secure provisioning of the datastore and its credentials.

From the application’s perspective, there is just one configuration file in the root of their GitHub repository, serving as the specification source for provisioning each datastore based on the intended deployment location.

Here’s a sample YAML configuration that specifies the requirements for provisioning an Aurora datastore:

version: "1.0"
name: "WishMaker"
description: "A service for managing wish requests."
product: "Wishmania"
team: "Genie folks"

infrastructure:
 aurora:
   - name: "WishInventory"
     attributes:
       instanceType: "db.t4g.medium"

In this example, the configuration outlines a minimal setup for the Aurora datastore, including its name and instance type. During the deployment process, this file is converted into a custom resource (CR) which is an instance of a CRD that the Aurora operator monitors to manage provisioning.
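To give a feel for that conversion, here is a hypothetical sketch that turns the already-parsed configuration into CR-shaped dictionaries. The `apiVersion`, `kind` and annotation keys are invented for illustration and are not the real schema used by our operators.

```python
def to_custom_resources(app_config: dict) -> list:
    """Turn the parsed application config into one CR-shaped dict per datastore.

    The apiVersion, kind, and annotation keys are invented for illustration.
    """
    resources = []
    for kind, entries in app_config.get("infrastructure", {}).items():
        for entry in entries:
            resources.append({
                "apiVersion": "example.invalid/v1",  # hypothetical group/version
                "kind": kind.capitalize(),  # e.g. "aurora" -> "Aurora"
                "metadata": {
                    "name": entry["name"].lower(),
                    "annotations": {
                        "team": app_config["team"],
                        "product": app_config["product"],
                    },
                },
                "spec": dict(entry.get("attributes", {})),
            })
    return resources


# The sample configuration from above, as it would look after YAML parsing.
app_config = {
    "name": "WishMaker",
    "product": "Wishmania",
    "team": "Genie folks",
    "infrastructure": {
        "aurora": [
            {"name": "WishInventory",
             "attributes": {"instanceType": "db.t4g.medium"}},
        ],
    },
}
custom_resources = to_custom_resources(app_config)
```

The single `aurora` entry becomes one resource whose `spec` carries only the attributes the team chose to set. Everything else is left to the operator’s defaults.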

The pact between the team and the Genie

To successfully provision a datastore, the engineering team merely needs to present its wishes to the Genie by:

Writing a few lines of YAML specifying the datastore requirements.
Triggering the application deployment process.

As simple as rubbing a genie’s lamp!

In return, the Datastore Genie will grant these wishes by:

Handling all the complex provisioning and configuration processes.
Continuously ensuring that the datastore remains aligned with the desired state, converging every few hours.
Ensuring that the credentials and endpoints are ready to be injected into the application container.

This collaborative approach allows engineers to easily acquire a datastore that meets their needs so they can focus on what truly matters: creating amazing products, backed up by fully configured and compliant datastores.

The magic within

Zendesk datastores are provisioned as part of the deployment pipeline. During each deployment, the provisioning stage calls the Foundation Interface API (as shown in the diagram below), which converts the YAML file into the relevant datastore CR for the Storage Operator to process. Foundation Interface then monitors the status of that CR to determine whether the database is ready. Only once it gets the green light is the application deployment stage triggered, which starts up the application containers. The application can then read the credentials and datastore endpoints from the container filesystem to connect to the datastore. Because provisioning runs on every deployment, the infrastructure is consistently kept up to date, reducing the likelihood of any component becoming a snowflake that’s difficult to manage.
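The gating step can be pictured as a simple wait loop. This is a sketch under assumptions: `wait_until_ready`, the `"Ready"` and `"Provisioning"` status strings, and the timeout values are all hypothetical, and `get_status` stands in for however the CR status is actually read.

```python
import time


def wait_until_ready(get_status, timeout=5.0, interval=0.0, sleep=time.sleep):
    """Poll `get_status()` until it reports "Ready" or the timeout expires.

    Returns True if the datastore became ready in time, False otherwise.
    The status strings are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "Ready":
            return True  # green light: the deployment stage may start
        sleep(interval)
    return False


statuses = iter(["Provisioning", "Provisioning", "Ready"])
became_ready = wait_until_ready(lambda: next(statuses))
```

A pipeline built this way would start the application containers only when the function returns `True`, and fail the deployment otherwise.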

The enchanted toolkit: our storage operators

At Zendesk, we have implemented a select group of Kubernetes storage operators to streamline our datastore provisioning and management. By concentrating on a smaller set of datastores — one for each type — we enhance our operational efficiency and simplify management. Our operators include:

Aurora MySQL: A fully managed relational database service that provides high performance and availability.
DynamoDB: A fully managed NoSQL database service designed for high scalability and low latency.
S3: Amazon Simple Storage Service (S3) for scalable object storage with high durability and availability.
ElastiCache: An in-memory data structure store used for caching and real-time analytics.
SQS: Amazon Simple Queue Service (SQS) for reliable message queuing between distributed systems.

Additionally, we have implemented controllers that allow these datastores to be shared across multiple applications. This capability not only reduces costs but also promotes efficient resource utilization within our infrastructure.

Conclusion

By leveraging Kubernetes operators, we have created a self-service, efficient, and reliable framework that significantly reduces the burden on our engineering teams. The investment in building these operators pays off handsomely, not just in terms of time savings but also in the enhanced security and compliance that they bring to our data management processes.

The simplicity of the YAML configuration allows engineers to quickly provision the necessary datastores without getting bogged down in the complexities of security and access management. With the storage operators automatically handling the intricacies of provisioning, configuring, and rotating credentials, our teams can focus on what truly matters: delivering exceptional products to our customers.

Moreover, as we scale, the operators exhibit remarkable adaptability, seamlessly managing datastores across multiple AWS regions and ensuring that each deployment is both efficient and compliant with Zendesk’s standards. This scalability, combined with the robust architecture of our operators, underlines their value not only to engineering teams but to the entire organization.

See the magic continue as we introduce the next component of our self-service storage ecosystem, which ensures that applications consistently utilize fresh credentials that are transparently rotated to meet security and compliance requirements without human intervention. Stay tuned for more insights in our upcoming post!

Simplifying Datastore Provisioning with Kubernetes Operators was originally published in Zendesk Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://medium.com/p/a403d9ecd99c

Extensions

https://zendesk.engineering/feed
