Vaidik Kapoor — GeistHaus

2024 — Year In Review

Vaidik Kapoor Dec 29, 2024


Wrapping up another year, I’m continuing a tradition I’ve tried to keep for the last three years. I admit I’m often restless, rarely satisfied with where I am, which can be both a driving force and a source of impatience. That's why I really look forward to these reviews: they let me reflect, celebrate what's been going well, thank those who have supported me, and plan for what’s next. And this year, I am ahead of schedule: I started writing this before the year's end.

So, here it goes.

It's been a crazy year, in a good way. In short:

I had a great consulting gig and helped a growing business move forward.
We relocated to London.
I took a full-time role at an early-stage startup as their Head of Engineering.
Been doing a lot of uncomfortable things that are helping me grow.
I read a lot more than I have managed to in the past. It was hard but felt really good.

Work Work Work

A lot changed on the work front this year.

A New Consulting Gig

As 2023 ended, my work at Three Ways Consulting took a new turn with a project that needed more time and energy than I expected. I started working with an amazing startup in online streaming, focused on helping people in India feel closer to their roots by creating great content in local languages and celebrating local cultures. At first, it’s easy to compare them to platforms like Netflix or Amazon Prime, but they’re different. Their work is more than just entertainment—it’s about helping people feel connected to their culture and making local languages and traditions a bigger part of everyday life.

I started by helping the team with organizational and cultural challenges, then moved on to improving how they worked on engineering projects and speeding up their release cycles (guess what was the biggest lever here). Along the way, I learned Flutter for the first time while working hands-on with the team, transitioned to web engineering, and eventually refocused on building the organization before wrapping up my time with them. After six months, I had to step away to focus on the next big chapter in my life (more on that soon). The outcomes at the end are worth celebrating:

A more cohesive team that is now more outcome focussed and works well together, playing several roles to ship things faster, and continuously learning as a team to improve their processes.
A bigger engineering team more suited for the business needs - helped fill key roles in engineering to increase their execution bandwidth.
A much faster and more efficient release process, which helped the team release continuously across all their major mobile platforms (iOS app and multiple Android apps built in Flutter). When I joined, their iOS app was a year behind their main Android app, and the other two Android apps were usually a couple of months behind. All their mobile apps on all platforms are now fully covered by automated tests, are at feature parity with each other, and are released together several times a week with little to no manual work.
A better sense of build-vs-buy strategy - the team used to build a lot more software than they could manage and spent a lot of time focussing on problems not core or strategic to the business. They now buy non-essential software and integrate instead, essentially moving faster towards what matters to their customers and the business.
A more confident engineering leadership that was struggling to marry the team, their culture and processes with the needs of the business. They now move with a lot of confidence in terms of taking decisions, managing the team, managing their roadmap, investing in tech, overall execution and engineering craftsmanship.
And to my surprise, some of my coaching around hiring and building a team unintentionally spread to other functions in the business. I didn’t plan for this, but it was amazing to see!

Hello, Gaia!

Consulting after Blinkit was a lot of fun. I got a chance to meet some amazing founders and teams, learned a lot from them, challenged myself to learn new business domains (B2B SaaS, DevTools, gaming, social e-commerce, online streaming and influencer marketing), and learned some new tech. It was so much fun. But I had been feeling the itch to go back to committing and building something more dedicatedly. Consulting did not really give me the space for that. It's a great lifestyle business or a way to retire but I am far from done.

Sonam and I had been thinking about moving to London for a while. We really wanted to make it happen, but I needed to find the right opportunity—and consulting wasn’t the answer.

Last year, I received my UK Global Talent Visa, which opened up avenues for me to work at startups. I love startups. Startups are where I feel most at home. So, I began searching for roles in London. It was a long and exhausting journey, especially during a tough time for tech. After months of searching and facing many rejections, I finally found the right team.

I joined Gaia as their Head of Engineering. At Gaia, we are trying to make IVF work for families across the world. IVF is an incredibly stressful journey for families - mentally, emotionally, biologically, and financially. We try to take away a lot of their financial and emotional stress today. But that's just the beginning. The healthcare industry at large is broken across the US and the UK. We are trying to fix a part of it to make IVF more approachable to families who want to have a baby.

Why Gaia?

Why not? I had a lot of fun working in consumer tech and B2B SaaS. But work has always been about selling something. Gaia felt different—a chance to genuinely help families who need support and make a real difference. It’s also a huge opportunity.

The team is incredibly talented and mission driven. Everyone at Gaia is truly connected to the mission. During my interview process, I had the chance to meet the leadership team several times, and they impressed me with their talent and kindness. The turning point for me was a lunch with Nader and Alexia, followed by a weekend chat with Nader at Gail’s in Paddington. We talked openly about work and life, and I really appreciated his honesty and clarity. It was clear that we were on the same page about building a startup and a team.

Joining a startup is always a gamble: you sign up for one thing and often end up doing something entirely different (that’s certainly been my experience, like at Blinkit). So it doesn't matter what the team is doing in that very moment. It is always about the team and the founders. After that candid conversation at Gail’s with Nader, it was clear that he is a founder I can take this gamble with. The rest is hardly in anyone's control.

The tech side of Gaia is also a fresh challenge for me. It's a unique business, not about high traffic or volumes of transactions. On a good day, we get a few hundred visitors. But each sale results in a sizeable amount of revenue for the business (for context, a treatment cycle of IVF in the US costs $20,000). So each visitor and lead is incredibly valuable. Gaia also operates at the intersection of niche healthcare, financing, and insurance, in a heavily regulated space. That makes it a complex but fascinating domain to navigate and innovate in. There’s so much to learn here, and that’s what excites me.

What an incredible ride it has been!

I just completed my first 6 months at Gaia and it has been such a humbling experience. A lot of it has been uncomfortable. But there is growth in discomfort. Here is what the last 6 months have been about:

I’ve been learning a lot about IVF, healthcare, financing (loans), insurance, and regulations. I wouldn’t call myself an expert in any of these areas—they’re incredibly complex, and there’s always so much more to understand.
Since I joined, we launched in the US, which has been a huge learning experience. I’ve been diving deep into the US healthcare system and figuring out how to bring Gaia to the US market.
It hasn’t been all smooth sailing. I inherited a small team that became even smaller after I joined, but we’ve since rebuilt and grown stronger. I am incredibly proud of the team we have now and everything we have achieved in such a short span of time. It’s been a mix of hard work and a lot of fun.
- We inherited a lot of technical debt and we have been paying it down aggressively.
- We encountered a really interesting framework called Frappe (built by a team in India!) and used it to rebuild our core platform that manages the healthcare and financial journey of IVF patients. We rebuilt it from scratch in record time (6 weeks) and better (well... time will tell).
- We rebuilt a part of our data platform to increase trust in and reliability of our data, and to make it more accessible to internal users for reporting and analysis.
- On the public-facing side, we revamped our tech stack with better tools, enabling our growth teams—marketing, sales, product, and design—to experiment and iterate much faster in the US market.
Building a team in London has been a new challenge. I’ve been able to draw on my experience of building teams in India, but it’s been a learning curve to adapt to hiring here—everything from sourcing to pitching to evaluating candidates.
I’ve also been thinking a lot about how we can use AI to transform the IVF experience for patients and their families, as well as the industry as a whole.
And through it all, I’ve been writing a ton of code and staying hands-on with building the business. It’s been an incredible journey so far!

I couldn’t have asked for more—it’s all in my hands now. It’s time to put into practice the same advice I used to give while consulting, returning to the way I worked at Blinkit but applying it to Gaia. Looking forward to an exciting 2025!

What about Three Ways?

A full-time engineering leadership role takes a lot of mental and physical energy. Trying to focus on two major things at once wasn’t fair to either. So, I made the decision to step away from Three Ways. I’ve wrapped up my consulting work and no longer take on new projects. I still occasionally support my previous clients when they need advice, and I’m grateful that they continue to reach out to me. I’ve learned a lot during this time, met some fantastic teams, and I’m glad I gave consulting a try to figure out what I’d want to do in the long run.

Moving to London

London is a beautiful city and has been really welcoming to us. The move has been so much better than we imagined it would be. People are really nice, too nice perhaps. As one of my friends recently said, people in the UK are so nice that one gets compelled to be nice to them in return. Not that I am complaining but it is definitely something I am not used to.

We have some friends in London so that has been incredibly helpful in the sense that we are not alone here. And we have been stepping out of our comfort zone and been making new friends here, which has been a good experience.

There are so many things that we love about London and the UK:

Clean air: For those who haven’t lived in Delhi or Gurgaon, it’s hard to understand how refreshing it is to breathe fresh air.
Endless things to do: The city truly has something for everyone—from amazing food and lively pubs to beautiful parks and cultural experiences like theatres and concerts. We’ve barely scratched the surface.
Public transport: Being able to commute via public transport has been a pleasant change.
Sonam's work: Sonam is still working remotely with Watsi, and moving to this part of the world has made a big difference time-zone-wise, especially since most of her team is in the US.
Django’s happiness: The weather here is perfect for Django. He loves going for walks and spending time in the parks, which are plentiful near where we live and all over the city. In India, we had to specifically look for pet-friendly places, but in London, most restaurants, pubs, and cafes assume pets are welcome. People here are always showing him love, and he enjoys the attention. Public transport is also super convenient for him—he loves traveling on the bus.
Traveling with Django: We’ve been taking Django on trips. While we used to travel with him by car in India, here we take the train. So far, we’ve visited Cambridge, Rye, and York, with Django along for the ride every time.
A great place for Indians: We already have some friends in London, which has been a huge help. Plus, with so many people visiting London for work or just passing through, it never feels too far from home.
Christmas magic: London becomes truly magical during Christmas. The festive atmosphere is something special to experience.

Things that have not been as nice:

Finding an apartment: It’s genuinely tough, especially if you're looking for a pet-friendly one. There aren’t many available. If you need help, we’ve become experts at it—feel free to reach out for advice.
Indian food: It’s been disappointing. Given how many Indians live here, we expected the food to be much better.
Food and grocery delivery: It’s really expensive. I do miss Zomato and Blinkit sometimes, but we’ve managed without them.
Adjustment period: While not exactly a problem, settling into life here has had its challenges—things like navigating the NHS, pet care, and other everyday details. But this is just part of moving to a new country, and it’s our first time doing so.
Schengen visa: Getting a Schengen visa from London is unnecessarily complicated.

Learning

I wanted to pick up a few new things to get a better perspective on tech, business and life in general. Some have worked well, while others didn’t quite land the way I hoped:

Rust - has been on my radar for a while. I finally picked up rustlings (which is great) and got midway. I eventually lost focus because of existing work and personal commitments. It was an interesting experience. But I also realised that it's much harder for me to learn something unless there is a clear goal or value attached. With Rust, it is still not clear to me how it will be useful in what I do. I am not saying that Rust is not useful - I get its value. I am just not clear about how I will benefit from learning it. I either need that clarity or a ton of time to get through it.
Flutter for mobile development - played around with Flutter in my consulting gig. It was also the first time I got into mobile development. Interestingly, my entry point was to help the team building mobile apps write tests to fully automate their release process. It never ceases to amaze me how valuable quality processes can be on different dimensions, even learning a new business domain or a new tech stack. Also, played with Tramline for automating mobile apps release processes.
Frappe - has been a new addition to my toolbox. And I am so happy about this encounter. People who know me well know that I really don't like work that is undifferentiated, and that I also care about quality at the same time. I have worked on internal tooling on several occasions in my previous roles but usually struggled to give the tech stack and the products the focus they deserve to build good internal systems. Every piece of work (such as building backend APIs, a React frontend, etc.) in the process of building good internal products/platforms means nothing if the sum of all the parts is not good. I really don't care about building APIs or frontend components. I want a working product that is reliable and usable. Interestingly, Frappe allows exactly that. It is a vertical framework that comes with an opinionated backend and frontend that work really well in tandem. In our rewrite of an internal platform at Gaia, we hardly built any REST APIs or frontend components. It allowed us to focus on what matters the most - the core business logic. It was super fun and I can't recommend Frappe enough to teams who are working on similar projects.
Beyond "Hello, World!" with LLMs - it's criminal not to be exploring LLMs for what we do today. I have been playing around with models from OpenAI and with Claude, and with frameworks like LangChain, LlamaIndex and Chainlit, to do POCs around consumer experience and internal process automation at Gaia. There is a long way to go, and I need to cover it faster in 2025.

Reading

This is a new section in my year-end review because reading has been an ongoing struggle for me for many years. I finally made progress, and I’m genuinely proud of how far I’ve come.

I finished several books and really read them properly:

A Philosophy of Software Design
Our Iceberg is Melting
Clear Thinking
The Culture Map
Read The Five Dysfunctions of a Team again
Partly finished Trillion Dollar Coach and The New Leader's 100-Day Action Plan

What worked this year? I think I got a lot more serious about learning, largely thanks to Farnam Street. I’ve known for a while that I need to learn much more, and voracious reading is a key part of that. This year, I kept reminding and pushing myself, almost as if my life depended on it. I also changed my approach—rather than just reading, I started studying, like an academic. I wanted to read even more, but the move to London made it harder to keep up.

Travel

Another year of travel, though perhaps a bit less than in 2023, as our move to London took up much of our time.

Seychelles

Sonam and I celebrated five wonderful years of marriage in Seychelles—a true hidden gem! The islands boast breathtaking beaches, stunning sunsets, scenic drives, and beautiful walking trails. We explored three islands in the archipelago: Mahé, Praslin, and La Digue.

We visited during the off-season, so it wasn’t too crowded. The beaches had just the right balance—not too quiet, not too busy—and it made me realize that I can really enjoy a beach vacation if it’s the right kind of beach with the right atmosphere. Naturally, we spent a lot of time relaxing on the sand, reading, and sipping drinks.

We also did scuba diving. It was our second time diving. We have probably never experienced anything so beautiful before. It was like watching an underwater sea video on an OLED television, but in real life. It was a 10x better experience as compared to our first dive. We definitely want to do more of it if we get the opportunity.

Food has taken us to places before. So on the way to Seychelles, we found ourselves an opportunity to do a quick detour for food. We had our connecting flight to Seychelles from Mumbai. So we planned it in such a way so that we could take a half-day pitstop in Mumbai. We experienced some amazing stuff in Mumbai - bhajiye, vada paav, bun maska and parsi food. We also got to see a ton of charm and character of Mumbai. This short food trip excited us to plan a longer visit to Mumbai. Hopefully soon.

Kenya and Oxford

Sonam’s work at Watsi took her to incredible places this year. She visited Nairobi to connect with local healthcare partner organizations and meet families whose lives have been transformed by the remarkable work Watsi and their partners are doing. Every time she returns from these trips, she brings back heartfelt stories and unique perspectives about people and their struggles.

After Kenya, she attended the Skoll World Forum in Oxford—a gathering of philanthropists, leading nonprofits, and social entrepreneurs from around the world. It’s an annual event where they tackle some of the world’s most pressing issues, share ideas, learn, and collaborate to drive change.

What stands out to me about Watsi is their unwavering focus on trust and transparency. Seeing it firsthand through Sonam’s work and hearing about the team’s dedication, I’m constantly reminded of just how genuine and impactful they are. They’re so good at what they do—changing lives in the most thoughtful and sustainable way.

Obviously, I donate to Watsi and support their efforts to help people who really need it. Organizations like Watsi need all the support they can get to continue making a difference. With Watsi, you can either contribute directly to the healthcare needs of a patient whose story resonates with you or join their Universal Fund, where your monthly donations support their broader mission. As we enter the new year, I encourage you to take a step toward helping those in need. Donate to Watsi and make a difference.

Donate to Watsi

Meet a patient or Join the Universal Fund to donate monthly

Road tripping around Inverness

I was visiting London for some work this summer and was crashing with Apoorv. We decided to visit Inverness and do a road trip around it for 2 days, primarily focussing on the amazing landscapes nearby and whiskey!

We did a part of the North Coast 500 route. We started from Inverness, took a pitstop at The Singleton Distillery for trying a whiskey flight, then took another stop for lunch in the beautiful lake town of Ullapool, stopped in the middle of nowhere in rural Scotland for a coffee and then ended up grabbing a few drinks and having a good time with some super friendly locals, stopped in a couple of more towns on the way for food and beer, and finally ended our road trip in Inverness.

Interestingly, we hadn't planned to visit the Isle of Skye but had enough time on the last day to wander around. Our wandering led us to just touch Skye and make it back in time to catch our flight. The changing landscapes during the road trip were serene. But this was just a teaser. We gotta plan a proper trip to Skye.

Bhimtal

We did a road trip to the quaint town of Bhimtal in Uttarakhand with Sonam's family. It was a great time for the entire family to get together and spend some quality time before we relocated to London. Also, a fantastic getaway from Gurgaon and its pollution. All you need in a place like Bhimtal is a cosy cottage and your people. We spent days and evenings talking, playing board games, listening to music and walking around scenic trails.

Cambridge

Cambridge was our first trip in the UK since we moved. We travelled with Apoorv and Django. It was also Django's first train ride ever. And Sonam got to visit another university town in the UK.

What a beautiful place. While most people do a day trip to Cambridge, we stayed there for 2 nights. There are tons of pubs and restaurants to visit, beautiful lanes to walk and of course the college campus to see (we couldn't get in, next time perhaps). A trip to Cambridge is incomplete if you haven't gone punting in their canals. Another interesting adjacent experience is sitting at a pub on the canal and watching self-punters fall off their punts as they try to manoeuvre under the bridge.

Sonam's verdict: Cambridge is prettier than Oxford and has more things to do.

Rye

A cute little town with one high street that is probably just half a kilometre long, some cute cafes and restaurants and pretty winding streets. There is not much to do in Rye, which is great. You eat good food, drink good beer, shop and just stroll around. It's a great place to slow down and relax, and that's what we did.

Vietnam

Two of my friends from college, Anky and Muddy, are getting married in 2025. So we did a stag trip to Vietnam. There were 12 of us so it was a large group. It was great to be able to pull this trip together given how busy everyone can get adulting. We are lucky that we are still able to do this once every few years.

We visited Da Nang, Hoi An and Ho Chi Minh. Highlights of the trip - Hoi An was beautiful, Ho Chi Minh was chaotic and fun, karting in F1 format was exhilarating and tiring, Vietnamese food was amazing, and cheap massage everywhere was relaxing. It was super fun!

York

It is our first Christmas in the UK. So this holiday season, we travelled to York for Christmas and Boxing Day. We stayed there for 3 nights and slowed down quite a bit. Apoorv travelled with us too (our 4th wheel). We found ourselves a cosy Airbnb and stayed in on Christmas, but we still saw a fair bit of York and learned a bit about the Vikings and their influence on the city. We didn't expect York to be a walled city (to be honest, we didn't do much research). It has character. And of course it has a lot of significance for Harry Potter fans, as you can visit Diagon Alley (the Shambles). We also continued our hot chocolate and beer rampage.

Retrospective on the year

Things that went well this year:

Moved to London: Started a new life and career. While we do miss our friends back home, we’ve begun building new connections here.
Reading: Finally overcame my mental block about reading and made significant progress this year.
Cooking: Cooked much more since moving to London. I can now confidently make chicken biryani and yogurt chicken curry, though I’m excited to expand my repertoire.
Walking Django: Establishing a regular routine of walking Django has been a grounding ritual, no matter how busy or tired I am.
Side projects: Made great progress on side projects in the first half of the year. Although I couldn’t keep up after moving, I’m determined to revisit them in 2025.
Networking: Built a solid tech network in London, starting from scratch.
Angel investing: Took my first steps into angel investing and am proud to have invested in 3 companies this year.

Things that didn't go so well and could be better in the new year:

Running: I couldn’t get back to running, partly due to my knee and largely because I lacked the drive. It’s a mental game, and I need to push myself toward a healthier lifestyle.
Writing: Haven’t been consistent with writing on this blog, which I want to prioritize in the new year.
Learning: I intended to dive into Finance and Economics but didn’t follow through. This year, I want to make steady progress in learning something new.
Confidence and imposter syndrome: Starting my new role at Gaia shook my confidence at first. I worried about questions like: will I be able to manage and build a team in London, will I be able to learn this new domain, will I be able to work with people from different cultural backgrounds, will I be able to do this at all? Imposter syndrome hit hard. I’m in a much better place now, but I know I still need to figure out how to manage those feelings better in the future.

Gratitude

I come back every year with my gratitude for Sonam. One might think that is obvious because we are partners. But as obvious as it might be, it is easy to overlook all the great things your partner does for you. I am able to do everything today because of her support - whether that is looking after Django while she is working from home and I am in the office, or pushing me to be healthy. But most importantly, she is my therapist in moments when I really need one. When I was struggling during my initial time at Gaia and was doubting myself, she told me that I was worrying about it more than I should and that even if it didn't work out, it would be fine. Not everything is meant to work out and we can always go back to India if I don't like working in London. She made me feel safe and provided that safety net. Above all, she gave me the confidence that while I might be struggling in that moment, I can do this. That helped me operate differently. And I got through it. Thank you so much, my love ❤️

I am lucky and blessed to have a few friends who I can count on. One of them, Apoorv, is in London. He is one of my oldest and closest friends from college. He helped us a ton with our move. We practically turned his apartment into a dormitory when we moved. Living in a home and not an Airbnb made our move feel so easy. And of course, I am grateful for all the good times we have been having here with him since we moved. Django has a new uncle who can walk him around 😆

Our friends back home have been incredibly helpful as well. In times of need, I can just pick up the phone and call Akshat, Ankur, Jacob and Konark. We don't talk frequently but I know that they are around whenever I need them. They are always there to hear me out, counsel and offer valuable advice.

Looking forward to 2025

Continue my reading habit. In fact, double down on it. I am not going to put a number to it though.
Get fit - start gym again.
Build build build. Excited about being in a full-time role at a startup again. Get to the deep end of applied AI and solve consumer problems with it.
Learning - I didn't get much done around the things I planned to do in 2024. I want to do those and probably more.
- Finance and economics - from last year.
- Getting into maths has been on my mind for a while. I would like to start something, probably Category Theory.
Probably take on a hobby. Thinking about whiskey tasting and rowing.
Travel to mainland Europe often.
Blog more frequently. I have been terrible in 2024.

That's all, folks. Can't wait for 2025 and what we make of it in London 🎄



2023 — Year In Review

Vaidik Kapoor Jan 7, 2024

It's the end of another year. I am back with my 2-year-old tradition of writing a personal Year In Review. 2023 was a busy year, filled with work, travel, lots of food and learning new things that I had never thought of doing before. It was a fulfilling


Work and Three Ways Consulting

I started my consulting practice, Three Ways Consulting, in 2022, which has been going pretty well. It was hard in the beginning, but I feel I have a pretty good handle on how things are going. Getting back to working with smaller teams and helping them build stuff has been pretty fun, just like the old days.

The volume of work coming in now is pretty good compared to when I started. I get to say no more often than yes to incoming leads, either because the company doesn't fit what I look for, which is growth-stage tech startups, or because I don't have the bandwidth to take on more work. I also got to learn a ton about new business and problem domains, thanks to clients from a variety of sectors like B2B SaaS, e-commerce, healthcare, marketplace, fintech, entertainment, etc.

Building a startup is not easy. I have been a part of two myself, and I know how difficult it is. Working closely with the founders and especially their CTOs has been a reminder of that and a humbling experience. At the same time, I see founders repeat mistakes I have seen several times before, because of which they slow down and waste time, incur opportunity cost, burn out and have their teams burn out - essentially always ending up in war mode. Work should not feel like a war all the time. Even working hard should be fun. A big goal of my consulting is to turn work from feeling like a war into fun. Leaders have a big role to play in making it happen. I wrote something about what it takes for a CTO to lead product engineering teams.

Another interesting pattern of problems I have consistently seen across all startups I have consulted with or talked to is the problem of quality, slow releases and absence of Continuous Delivery practices. I guess that's where I come in. Startups dealing with such problems end up finding me. The problem of Continuous Delivery and quality often starts with understanding how software changes are tested before releases, but it is so much more than that once you get over that hurdle. Interestingly, most teams don't even get over the first problem - automating the process of testing software changes to speed up releases. A big part of my consulting work revolves around these problems. I thought writing about how to approach solving some of these problems in a series of articles could help the teams struggling with similar problems. I have started a series of articles called Engineering Transformations, where I talk about some of these problems and how tech leadership in startups should approach them. The first article in the series is on Adopting Automated Testing as a Practice. Give it a read, and let me know what you think.

Writing Code

I wrote a lot of code for the first time in a long time and learned quite a bit as well. I have written a lot more code in the last two years than I did at Blinkit, where my job largely involved managing teams.

It was so refreshing! There is something about it that takes me down the rabbit hole. I also find it therapeutic, like gardening. In the process, I learned a ton of new things.

I started with TypeScript by building a simple tool to simplify working with multiple Google Tag Manager accounts, which is something that I have wanted to build since my time at Blinkit. It is almost done but not release-ready. It is such a pleasure to work in TypeScript. With proper use of types and the IntelliSense experience in VS Code, I have yet to come across a programming language that is so nice to work with.

My experiences with TypeScript got me thinking that it was time to try Type Hints in Python properly (yes, I have been lagging behind for a Pythonista). Python is my first love as a programming language. I built a couple of small projects in Python 3 with Type Hints. While it was a great experience, the TypeScript experience won over the Python experience for me. I thought I'd never say it, but it is true.

I got to work a ton with several other technologies like FastAPI, Nest.js, Ruby on Rails, RSpec, Temporal, GitHub Actions and CircleCI.

Learning New Things

I had a chance to learn some interesting things in 2023.

I learned how to build a business website in Webflow, a low-code, no-code website builder. I have been wanting to do this for a while because I understand how important marketing is and what kind of a role a marketing website plays in it. I have just grown to dislike the idea of writing code for it myself or having anyone worth their salt do it. I'm not saying that marketing is not important. However, the work of building a marketing website is tedious and distracts from the core, i.e. building what creates value for the business. Low-code tools can help reduce some of the unnecessary heavy lifting that goes into doing non-core stuff. Doing it hands-on myself with Webflow was mostly about turning the theory into practice. There are more such tools I would like to add to my toolbox in 2024.

I try to make my travels as worthwhile as possible. If the destinations are relevant for work as well, I try to set up some meetings. But often, I don't have a good starting point beyond the people I already know, which can be very few. We were in London, New York, Chicago and San Francisco this year. I wanted to network with people in tech in these cities. I didn't know how to get started. Somebody recommended that I should try cold messaging on Twitter. I had never thought that I would do it, mainly because I don't think I am effective in approaching people on social media. For the first time, I tried cold messaging on Twitter, LinkedIn and email. I approached it structurally like I was solving a problem. I built lists of interesting people (tech leaders, founders, VCs, etc.) in these cities (who I found using this cool tool called Apollo.io) and reached out to them on different forums. I tried different messages on different channels. Many people responded, and we connected. It was great! A mental block was lifted off. I can do this again.

The world has been all about Gen AI and LLMs this year. I am happy that I got a chance to move beyond just using Gen AI tools and build something. I built a small finance assistant to help me categorize my expenses at the end of every month for personal accounting. I wrote about my experiments with LLM in a blog post where I shared my journey, learnings and challenges. There is a lot more to learn in Gen AI, and I look forward to more experiments in 2024.

Finally, I will leave you with something light in this section. I didn't know I had it in me to be excited for a Halloween party. Thanks to our friends Nupur and Atin for coming up with the idea and hosting the party. What did I learn? YouTube is an extremely resourceful thing for learning random things, like colouring my face white for Halloween makeup. It was so much fun 👻

Travel

If you know Sonam and me, you know how much we love to travel. This year was full of travel. Sonam and I each got to travel solo as well.

Cambodia

The year started with Sonam’s impactful work trip to Cambodia in January 2023. She leads marketing at Watsi (YC’s first investment in a non-profit), and it’s fascinating for me to see her work connecting data and storytelling for fundraising and donor stewardship in the non-profit sector. She spent her time there with Watsi’s medical partner located in the capital city of Phnom Penh and also met many patients whose lives have been transformed by Watsi’s work. It sounds simple, but it does get really overwhelming seeing the problems first-hand, so I, in turn, helped her chill out a bit by putting together a list of places to eat in Phnom Penh. I bet she’d also recommend doing the Siem Reap Food Tour and hiking the temple walk in the majestic Angkor Wat city.

Pench National Park

In 2021, we travelled to Ranthambore for a jungle safari. It was a great experience. Since then, we have been wanting to do it again. In February, we went for another jungle safari to the beautiful jungles of Pench National Park and Tiger Reserve on our wedding anniversary. For those of you who don't know this, the jungles around Pench are the setting for Mowgli's story in The Jungle Book. So, we had to do it 😆

We stayed at Jamtara Wilderness Camp. It is perfectly located near the Jamtara Gate in Pench and is probably the only accommodation in that part of the national park, making it a peaceful place to enjoy the jungle experience. It is practically inside the jungle. Tigers often cross the camp late at night. It wasn't scary at all.

Jamtara Wilderness Camp is owned by the family of the Late Kailash Sankhala, who is also known as the father of Project Tiger. Naturally, everything at Jamtara Wilderness Camp is done with a lot of intention to allow you to be one with nature and get the best of wildlife experiences you can. Everything was so thoughtfully done (for example, they give blankets and hot water bottles for the morning jungle safari to help you stay warm). The tents are perfectly located on the property. All the amenities are provided sustainably, which means you don't get everything in the accommodation besides a comfortable bed, stunning views, fresh air, the quiet of the jungle at night and peace. Did I mention the food? Oh yes, they cooked such amazing food in the middle of nowhere. I think I had one of the best biryanis of my life, better than what I have ever had in Delhi. If you are visiting Pench, we highly recommend Jamtara Wilderness Camp.

Our safari experience was amazing, probably one of the best we have had so far. The jungle is beautiful. The national park is well managed by the authorities, who ensure that the tracks don't get too crowded and the environment is not disturbed beyond a limit. In my past safari experiences, it always got too crowded. That wasn't the case in Pench. Also, we were super lucky this time. We had two tiger sightings. The second one was so good - that's all that we could have asked for on this trip.

While I lived in Madhya Pradesh for many years during my childhood, I never had the opportunity to experience the wildlife there. It was our first time experiencing the wildlife of Madhya Pradesh. But there are so many wildlife experiences that Madhya Pradesh has to offer - Kanha, Bandhavgarh, Panna, Pench, Sanjay Dubri, and Satpura. We had such a wonderful experience at Pench that we want to go back to Madhya Pradesh again and try more intense safari experiences.

Dubai

In March, Sonam went to Dubai not just for a break but to experience working from a different city. It’s something we’re lucky to be able to do. I am grateful to have friends like Divya and Harsh, who opened their homes and are always welcoming. A highlight of her trip, and my personal favourite, was when our friend Divya introduced Sonam to Tres Leches at Magnolia Bakery. Our life has changed since then: we keep hunting down the (next!) best one around the world.

I think the most important highlight of her Dubai trip for me was that Divya introduced Sonam to Tres Leches at Magnolia Bakery. Sonam liked it so much that she went there several times when she was in Dubai. She even got it for me...all the way from Dubai! How cool is that!

Tres Leches Bonus: You can find Tres Leches at Magnolia Bakery in Bangalore and Hyderabad as well. If you live in these cities or are planning to travel, please go and try this amazing dessert. We are lucky to have friends like Ashish Dubey, who always sign up to bring food to us from other cities, like bringing us Tres Leches from Bangalore. So lucky to have such friends!

Leh, Ladakh

While Sonam was travelling to Dubai, I wanted to get out of the city for a few days as well before starting a new project. I wanted to go to the hills, but I did not have the time to drive or take the bus. I decided to go to Leh, Ladakh. The connectivity to Leh from Delhi is amazing. There is a one-hour flight to Leh from New Delhi. And I love Ladakh. I have been there before. I absolutely loved it, and I think it is one of the most unique places in the world. I can keep going there over and over again.

The visit to Ladakh was a short one this time. I was there for only three nights, out of which I spent the first night to acclimatize. I stayed at this gem of a place called Dolkhar in Leh city. It is a beautiful luxury resort with only seven or eight cottages built in Ladakhi architecture. Everything about this place is so intentional.

They only serve vegan food, which was a catch, and I was nervous about it. But if the food is that good, who cares if they serve non-vegetarian food or not? The quality of their food and drinks matches the standards of the best restaurants in Delhi, and is maybe even better than most. I was so surprised that they are doing this in Leh. Everything they cooked had organic ingredients that came from carefully selected sources. The food they serve is an experience. I tried their Chef's Tasting Menu. What amazing food!

I had always known that Snow Leopard sighting is a thing in Ladakh. I had always imagined how cool it would be to see a Snow Leopard. While this was not the plan, I was lucky that the nice folks at Dolkhar arranged it for me and made it possible for me to try. Snow Leopard sighting is very difficult. One can be right in front of you, and you might not see it at all. It is a game of extreme patience. People track Snow Leopards over weeks and stay in the national park itself to increase their chances of seeing one. I was doing this for one day. So, my chances of actually seeing one were incredibly slim. I went for it still, especially after our recent experiences of wildlife safaris in Ranthambore and Pench. The wait and the thrill of the chase are an incredible feeling.

A safari in the mountains is nothing like a safari in the jungles. The safari actually happens on the roads across mountains. Spotters are deployed at different locations in a region trying to spot snow leopards. They communicate over the radio to different convoys if they see something. And then the cars rush to those spots, pretty much like wildlife safaris but on actual roads.

I didn't see any Snow Leopards, but I was not sad at all. I didn't expect much. What I didn't expect and got to see was a pack of wolves. Besides the wolves, I also got to see herds of Ibex. Apparently, spotting a pack of wolves is a rarer event than spotting snow leopards. So it wasn't nothing. Besides that, driving around in those mountains is always a great experience.

Washington DC, San Francisco and The Bay Area

It had been more than ten years since I visited the US. We were trying to visit San Francisco last year but couldn't due to visa issues. San Francisco and the Bay Area are quite familiar to Sonam and me since we have been there several times for work. However, we had never been there together. So we decided to travel there together and stay with our friend Ravit in Cole Valley. I don't think we would have ever stayed in Cole Valley if Abbie and Ravit didn't live there.

Sonam had to attend her offsite in Washington DC. She was meeting her team for the first time. I, on the other hand, didn't have any major objective other than to hang out a bit in the Bay Area, enjoy the city life, network with people in tech, catch up on the Silicon Valley tech vibes, eat great food and travel a bit. We achieved all those objectives 💪

We made a road trip to San Diego to meet Abbie. We took the Pacific Coast Highway (Highway 1), one of the most scenic highways in the US. It was a beautiful drive! We saw the majestic Sequoias in Big Sur and stayed in a rustic cabin there for a night. We spent the next two nights at Abbie's mom's place in Coronado, a small island next to San Diego, which also happens to be a major US Naval base and the training centre for US Navy SEALs. It is also where many scenes of Top Gun: Maverick were shot. I had never seen an aircraft carrier in my life. In San Diego, we saw three at the same time! Interestingly, the US Navy and the Navy SEALs were conducting training exercises when we were visiting. Exciting!

California is known for its amazing Mexican food. Both on the way to San Diego and on the way back, Sonam found us some legit Mexican food joints. One was located at such a place where eating felt like being in a scene of the deserts of Breaking Bad. We didn't know that we would enjoy Mexican food so much.

When we got back to San Francisco, we had enough time to network and absorb the city. We spent most of our time working from cafes, trying new food, meeting friends, networking with folks in tech and attending events. San Francisco and the Bay Area feel more like an extension of Bangalore in terms of just how many people we know there. I literally bumped into friends from India whom I have not met in years. Attending tech events in San Francisco almost after a decade was so funny. It felt like an episode of Silicon Valley 🤣

In the remaining time we had in San Francisco, we decided to get out to Sausalito for a night. Cute place. They have nice ice creams and great roads to walk around.

Andamans

It was the first time for me in Andamans and the second time for Sonam. We were there for her birthday. It is such a beautiful place. Every Indian must visit Andaman. Beautiful, untouched white sand beaches. Beautiful tropical forests. Mangroves. History. Andaman & Nicobar Islands have so much to offer. It's unfortunate that most islands in the archipelago are inaccessible to the common public. We spent countless hours seeing these islands via satellite images. If it's so beautiful from up there, what would it be like if we got to visit? It's literally unfair.

We were in Andamans for three nights. We visited in August, which is the off-season, so it wasn't crowded at all. The weather was humid, but it was bearable. We stayed at Sanctuary in Wandoor village, which is south of Port Blair. It is rustic, raw and bare. Don't expect the comfort of a hotel. Instead, expect the experience of staying in an intentionally created personal space of a nature and adventure lover, which is probably something you can't expect. It is a different experience. It is not for everyone. If you like something different, go stay at Sanctuary. Don't complain. Enjoy the experience. The host of Sanctuary likes to stay away from the web, so I am not going to mention his name. But we learned so much about these islands, the life there, the good parts and the bad, from our conversations with him.

London, Chicago and New York

We spent the last two weeks of September and the first week of October travelling to London, Chicago and New York. This was a long trip. I had to visit London and Chicago to attend and speak at tech conferences. A trip to London to visit our friend Apoorv had been overdue since last year. Sonam and I had been wanting to see London City as well. And since we had to travel to Chicago as well, we thought it wouldn't be worth the time and effort to visit Chicago and come straight back. We had been wanting to visit New York for a long time, so we added that to our list as well.

I had to attend conferences like SREDay London, DevOps Days London and DevOps World Chicago (I spoke here). Conferences have usually been a good learning experience for me.

A lot of what we did during this trip was inspired by a bunch of shows that we watched on Netflix, like Somebody Feed Phil and Chef's Table. Coincidentally, some of our friends from India were travelling to London around the same time. So, the trip was even more special.

London is a vibrant city which has so much to offer. It's hard to summarize our nearly ten days in London. In short, it is a beautiful city with a lot of character, nice people, great food and fun things to do. The weather was perfect when we were visiting so we could enjoy the outdoors a lot.

Key experiences include the London pub crawl in the centre on a Friday night with our friends, The Lion King theatre show, lunch at Rovi and the steak experience at Blacklock Shoreditch. Our Lion King theatre experience was out of this world. We had not experienced anything like this before. The Lion King is nostalgic. The show was magical. Also, a weekend trip to Borough Market is a must. If you go there, try the toastie at Kappacasein Dairy.

Chicago is beautiful. It has amazing architecture and great food. We stayed in Chicago for four nights but had to extend it to five nights due to sudden rains in New York, because of which our flight got rescheduled. Key experiences include eating all kinds of Polish dogs, especially the ones at Jim's Original, modern Southern cuisine at Virtue (great food!), the original Chicago deep dish pizza at Pequod's Pizza and my favourite Detroit-style pizza at Paulie Gee's (to die for!). The Chicago architecture tour is a great way to see the city. Besides that, we also did a walking Pedway tour, an interesting way to see the lesser-known parts of Chicago. We had some good experiences in Chicago, but overall, we didn't feel that the city was safe to venture out in at night without understanding where not to be and what not to do. We wish that was not the case.

I used to think about New York as just another artificial city that only had tall buildings, no culture and no greenery. I was so ignorant and wrong. I was never really excited to visit New York until last year after our friend Ben, who is a New Yorker, told me how wrong I was about New York. I started paying more attention. We watched New York in a bunch of documentaries on Netflix, and it became clear that we had to visit. We stayed in New York for four nights, one night in Manhattan and three nights in Brooklyn. This city never sleeps. Even at 2 AM, the subway was crowded. We were travelling. What was everyone else doing at that time? This city is active. There is so much going on. It's a fun city that has so much to offer.

Besides doing touristy things, we had a major objective - try as many different types of pizzas as possible. We went to six different pizza places. Here is our list of favourites (in this order): Di Fara Pizza, L'Industrie Pizzeria, Lombardi's Pizza, Luigi's, NY Pizza Suprema and Fini Pizza. And that's not even all there is on offer. We didn't get to go to many other pizza joints like Totonno's, Scarr's and Lucali, to name a few. Besides pizza, which was the highlight, some of our key experiences include visiting speakeasy bars, trying Southern and Korean fried chicken, cycling in Central Park (what a beautiful park) and shopping on Fifth Avenue. I have to say, I love New York City. It is truly one of the best cities in the world.

London

I got the opportunity to revisit London in November to speak at the London edition of the DevOps World conference. This time, I travelled alone and decided to extend this trip to explore the tech scene in London and spend some time networking. I attended several small and big tech events and met interesting people working on some exciting things. While in London, I was continuing my consulting work in India, so during this trip, I spent most of my time either meeting people or working from different cafes in the city. It was great!

I spent almost three weeks in London. So it's not that I did nothing besides networking. I had some great food experiences again. We couldn't make a trip to Cafe Leto to try their Tres Leches cake (the same cake from Magnolia Bakery I talked about earlier in this post) in our first trip. I had to make up for the last time. So I went there three times on this trip and got as many friends along to try their Tres Leches cake. Magnolia or Cafe Leto - which one is better, you ask? I think they are both really good in their own ways. I can't take a stance yet.

London was lit because of Christmas. The city was beautifully decorated. Given the amazing theatre experience during our last visit, I went for another theatre show. It was The Mousetrap this time, which was running for the 70th consecutive year. Can you believe that? It was a conventional style of theatre, like how you would imagine a theatre show to be. And it was quite entertaining. Another great experience was the Christmas vibes and cosy food at Maltby Street Market.

Retrospecting on the year

Things that went well and I am happy about:

I wrote quite a lot on my blog and otherwise. Writing helps me get clarity. I am also happy that I didn't leave any posts incomplete.
I used to enjoy writing reviews for places I visited. I am happy that I wrote a lot of reviews on Google and Zomato this year.
I worked on a couple of side projects where I am progressing seriously every month if not every week. It feels good to build and cultivate something.
I injured my knee in 2022 because of running. I was advised to get regular physiotherapy for strengthening. I have been pretty regular this year, and my knee is getting better.
I attended and spoke at several tech conferences - API World 2023, DevOps World Chicago, DevOps Days London, DevOps Summit Canada and DevOps World London.
I got the Global Talent Visa for the UK, which allows me to move to the UK to work, start a business or study. I don't know what I will do with it. But it's good to be able to travel to London like a boss 😎

Things that didn't go so well:

Didn't cook much this year.
I didn't meet my goal of reading five books. I only finished two books.
I wanted to learn a new topic like finance or economics. I just started one in December, but the progress has been pretty slow.
While I have been regular with my physiotherapy, my knee hasn't fully healed. I have not been able to do any workouts, especially running (which I love). I have just started with upper body workouts this week. I am hoping that I can do more next year.
I developed some muscular problems with my right wrist, most likely due to long hours with the keyboard. That prevented me from even doing upper-body workouts.
I have a tendency to get attached to work, which is not a bad thing. But while consulting, it is important to draw the boundaries to be able to work across clients, ruthlessly prioritise problems that matter and not lose your sanity in all of this. I have not lost my sanity, but I haven't been able to draw the boundaries clearly. Gotta work on that.

Gratitude

My wife, Sonam, is my support system. Since I quit Blinkit, she has supported me in doing whatever I have wanted to do. I have gone through ups and downs. But she has always been around to support me and provide me the energy and mind space to keep trying things. She doesn't say it, but she sees what I need and just makes things happen for me. We don't explicitly talk about all those direct and indirect things she does that make living worth enjoying, especially the indirect things I think I wouldn't get the value of if she was not around. I only think about such things retrospectively and feel incredibly thankful for having her in my life. Also, I don't think anybody can keep up with my love for food the way she does (I can't think of many people in the world who would sign up for eating pizzas at six different places as the primary objective of their trip to a city). I love you ❤️

I want to thank our friends who made it possible for us to enjoy our travel experiences. Apoorv Parijat, thanks for being the roof over our heads in London twice and for all your time. Without you, London wouldn't have been half as fun. But the comfort of a home is invaluable. Abbie Strabbala and Ravit Srivastav, thanks for being such wonderful hosts in San Francisco and Coronado. We got to see the West Coast in a different way because of you guys. Again, your cosy space in San Francisco made the experience so different. We could hang around in San Francisco and live the city life like a local. That, and with all your recommendations, was a different way to experience San Francisco compared to all our past visits. Your trip to India is still due.

Konark Modi is an old friend whom I was fortunate to meet back when I was in college because of open-source communities. Konark messaged me out of the blue after a long time, and we caught up. And since then, he has been such a guiding light in my professional life. He has pushed me out of my comfort zone and given me valuable perspectives about navigating professional life. This old relationship came out to help me when I least expected it. Konark, I am eternally thankful for how you have helped me in 2023.

Looking forward to 2024

While 2023 was great, there is a lot more I wanted to do. Somewhere, I lacked the energy to keep pushing and procrastinated. I want to change that and be more committed to results for myself. I keep saying this to the teams I coach - discipline is the key to growth. The same applies to my own life as well. There is much to learn and improve.

Here are some specifics I am going to push for in 2024:

Learn a non-technical subject like economics. I have started a course in finance. I want to finish this by the end of February.
Commit to reading more often. There is no point in putting numbers yet because I do that every year. Something needs to change. I am working on it to figure out why I am not getting better at reading.
Start workouts again, slowly and not injure myself.
Cook more often. Didn't do enough of it in 2023.
I haven't walked Django enough in 2023. I want to get back to it.
Learn a new programming language because learning is fun.
Keep building. Building something valuable is probably one of the most important skills a person can have. So gotta keep building.

That's all, folks. I'm looking forward to an exciting 2024 and getting a lot more done this year! 🚀

Extensions

Experiments with LLMs

Vaidik Kapoor Dec 27, 2023

LLMs are all the rage right now. How big of a hype cycle it is is hard for me to comment on. But one thing I am confident about is that this is an innovation that is a clear inflexion point. The world from now is going to be so

I have been a user of LLM-based tools for a while now. ChatGPT, Perplexity and GitHub Copilot are regular tools in my toolbox. Besides these, I try new AI tools that come my way every now and then. I have been reading (as much as I could) about the developments in the applied AI world. I have discussed ideas with peers and in my consulting. However, I have been lagging behind in building something myself with LLMs. I wanted to make sure that I ended the year by taking the first step in this super exciting space.

I am a big believer in experiential learning. As always, I start learning something with an objective to build something useful with it. So, I did the same with this as well. I wanted to build a tool that would categorize all my monthly expenses from my credit card and bank statements and tag them as business expenses for accounting.

Trying out different models and frameworks

Experiments with Langchain and OpenAI

Langchain is a model-agnostic framework for building applications using LLMs. It works with most of the LLM backends like OpenAI, Cohere, Anthropic, etc. I have been wanting to try Langchain for a while. Also, I prefer starting with a framework that does most of the job and allows me to switch between options quickly instead of getting into the specifics of the APIs of every LLM provider. So that's what I did.

I decided to use Langchain and OpenAI to run some experiments to first learn how the framework works and what I can do with it. The first hurdle wasn't even technical. I was not able to add credits to my OpenAI account using my Indian credit card. My free credits had expired. So I couldn't do much. I was itching to move forward quickly so that's where I stopped with OpenAI and started looking for other options.

Experiments with Cohere Classification API

I started exploring Cohere, which provides free credits. I tried Cohere with Langchain. While fiddling with Cohere, I discovered that it also has an API for classification, which seemed relevant to what I was trying to do - classify financial transactions into categories. Langchain does not support classification, so this is where I moved out of Langchain to Cohere's Python SDK. While the "hello world" of classification seemed promising, it didn't really work. I tried to provide a large enough dataset of my previously categorized transactions, fiddled around with the format in which I was providing my transactions, etc. But the results were very off. I even tried to classify a slice of exactly the same data that I had provided to Cohere's Classification API as the learning dataset. Even that wasn't getting classified correctly. I couldn't make sense of it.

I left my experiments with Cohere's Classification API, understanding that classification works better on language constructs (sentences, phrases, etc.). What I was providing were names of transactions (for example, Kayak.com, Zomato, MTA*NYCT PAYGO, etc.). Some of these can be quite cryptic, depending on what the merchant is called in the payment provider's system. Many of the transactions are going to be just new names every month. Names don't mean anything unless backed by a large dataset. Hence, their classification is a tough problem.

But I don't think that I understand classification fully yet. Why Cohere was not able to correctly classify exactly the same transactions that I had provided as the learning dataset is beyond me. I have to dive deep into this to understand how classification algorithms really work.
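One way to frame the sanity check described above is to score a classifier against the very examples it was given and compare it with a trivial exact-match lookup, which gets every seen name right by construction. The sketch below is only an illustration under assumptions: `classify` is a hypothetical callable standing in for the real API call (it is not Cohere's SDK), and the category labels attached to the merchant names are made up.

```python
from typing import Callable


def training_accuracy(
    examples: list[tuple[str, str]],
    classify: Callable[[str], str],
) -> float:
    """Fraction of the training examples that the classifier labels correctly."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for name, label in examples if classify(name) == label)
    return correct / len(examples)


def make_lookup_classifier(
    examples: list[tuple[str, str]],
    default: str = "Uncategorized",
) -> Callable[[str], str]:
    """Baseline: exact-match lookup on normalized merchant names."""
    table = {name.strip().lower(): label for name, label in examples}
    return lambda name: table.get(name.strip().lower(), default)


# Hypothetical labelled examples; the categories are assumptions.
examples = [
    ("Kayak.com", "Travel"),
    ("Zomato", "Food"),
    ("MTA*NYCT PAYGO", "Transport"),
]
baseline = make_lookup_classifier(examples)
baseline_score = training_accuracy(examples, baseline)
```

If a model scores well below the lookup baseline on data it has already seen, the problem probably lies in how the examples are being fed to it rather than in the difficulty of the names themselves.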

Back to Langchain and OpenAI

After trying Cohere's Classification API, I was back to Langchain and OpenAI as that seemed like the only reasonable option for getting started. I figured out a way to add credits to my account somehow.

I started working with GPT-3.5-based models. In fact, I started with the exact models in Langchain's docs (why not). My transactions dataset was in CSV format. I wanted to categorize each transaction and get the result back in CSV format with an additional "CATEGORY" column so that I could use the result in Google Sheets.

Interestingly, Langchain has a concept called Output Parsers, which you can use to parse outputs into specific formats like lists, JSON, etc. It also works with data structures from the Python ecosystem, like Pandas DataFrames, Enums, Pydantic models, etc. This was interesting, and I was wondering how Langchain does it. It turns out it just does really smart Prompt Engineering to instruct the model to give the results in the expected format as a text response and then uses its own parsers to parse it.

This was an interesting learning experience because I could do a lot with this as a prompt engineering pattern. For example, Langchain does not support a CSV output parser. I could have worked with Pydantic models as well, but it just seemed like a lot of work. I wanted to use CSV and Pandas because that's what I am most comfortable with. So, I started instructing the model to give its results in a format that would be easy for me to parse.
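The pattern described above, where formatting instructions go into the prompt and a plain parser handles the text that comes back, can be sketched roughly as follows. This is a minimal illustration, not Langchain's actual implementation: the function names, the column names and the category list are all assumptions, and the standard library's `csv` module stands in for Pandas to keep the sketch self-contained.

```python
import csv
import io

# Hypothetical categories, purely for illustration.
CATEGORIES = ["Travel", "Food", "Transport", "Other"]
EXPECTED_HEADER = ["DESCRIPTION", "AMOUNT", "CATEGORY"]

FORMAT_INSTRUCTIONS = (
    "Respond with CSV only, no commentary and no code fences. "
    "The first line must be the header: DESCRIPTION,AMOUNT,CATEGORY. "
    "CATEGORY must be one of: {categories}."
)


def build_prompt(transactions_csv: str) -> str:
    """Combine the task, the format instructions and the input rows."""
    instructions = FORMAT_INSTRUCTIONS.format(categories=", ".join(CATEGORIES))
    return (
        "Categorize each transaction below.\n"
        f"{instructions}\n\n"
        f"Transactions:\n{transactions_csv.strip()}\n"
    )


def parse_response(text: str) -> list[dict[str, str]]:
    """Parse the model's text reply, rejecting rows that break the format."""
    reader = csv.DictReader(io.StringIO(text.strip()))
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        # A short row leaves CATEGORY as None, which also fails this check.
        if row["CATEGORY"] not in CATEGORIES:
            raise ValueError(f"unknown or missing category in row: {row}")
        rows.append(row)
    return rows
```

The prompt from `build_prompt` would be sent to whichever model is in use, and the raw text reply passed to `parse_response`. Validating the header and the category values up front means a malformed reply fails loudly instead of quietly corrupting the spreadsheet.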

As soon as prompting for getting the right output format was solved, I moved to doing something more meaningful. Very quickly, I ran into the limiting context window problem of the model I was using because of my large dataset for context (400+ transactions) and input dataset to categorize (100-130 transactions). That got me to explore other OpenAI models with larger context windows.

Even with the models with the largest context windows available for public use, I was still struggling with the context window problem. So, the next obvious step was to break the transactions to categorize into smaller chunks. Again, some chunks would invariably not output the entire result, mostly ending abruptly with incomplete lines of output. I tried to debug it, but Langchain exposes limited output from backend LLM providers like OpenAI. So, there wasn't enough information to debug what was going on. Something like Langsmith could have helped with debugging the interactions with OpenAI, but that's available in closed beta only. I faced a roadblock again.

From here on, the only viable option I had was to work with OpenAI's APIs directly instead of using Langchain.

An important learning in this process was that getting the desired results from low-level LLM APIs usually involves multiple steps and interactions. Because of limitations like limited context windows, you can't expect one API call to do everything. You then wrap those interactions in a higher-level function to build an API for your own application. A case in point is breaking the input into chunks, getting results per chunk and stitching them together. That's just one example. Multi-step interactions can take several forms. Read on.
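The chunk-and-stitch idea can be sketched as a small wrapper. This is an illustration under assumptions: `categorize_chunk` is a hypothetical callable standing in for whatever sends one chunk to the model and returns one output line per input line, and the chunk size of 50 simply mirrors the count used in my prompts.

```python
from typing import Callable, Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def categorize_all(
    transactions: list[str],
    categorize_chunk: Callable[[list[str]], list[str]],
    chunk_size: int = 50,
) -> list[str]:
    """Categorize chunk by chunk and stitch the results back together."""
    results: list[str] = []
    for chunk in chunked(transactions, chunk_size):
        output = categorize_chunk(chunk)
        # Fail loudly if the model dropped or invented lines for this chunk.
        if len(output) != len(chunk):
            raise ValueError(f"expected {len(chunk)} results, got {len(output)}")
        results.extend(output)
    return results


def fake_categorize_chunk(chunk: list[str]) -> list[str]:
    """Stand-in for a real model call, used only to exercise the wrapper."""
    return [f"{line}, Category: UNKNOWN" for line in chunk]
```

The caller only sees `categorize_all`; the chunking, the per-chunk check and the stitching stay hidden behind it.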

OpenAI Assistants FTW!

While trying to use OpenAI APIs directly, I remembered that OpenAI recently released Custom GPTs and OpenAI Assistants. Both of them do something similar - give ChatGPT custom instructions to build your own GPTs (or Assistants) and then interact with them using a user interface (in the case of Custom GPTs) or APIs (in the case of Assistants). They can also do things like retrieval from your proprietary datasets. This seemed relevant and something I had been wanting to play with. I decided to try out OpenAI Assistants. I still had to chunk my inputs, but I found that I wasn't getting the truncated incomplete outputs I was getting with Langchain. I am not quite sure why the outputs were truncated in the case of Langchain, which also uses OpenAI's public APIs. Maybe Assistants uses another API under the hood that works slightly differently and is more reliable than the public APIs? I can't say for sure.

Now that I was not struggling with truncated outputs, it was time to tune my results and make them more accurate. This was an interesting journey of learning Prompt Engineering. I have documented some of my personal learning experiences about prompt engineering in the rest of this article.

Things I learned about Prompt Engineering

Name your input and intermediate results to make subsequent prompting easier

I was working with multiple datasets - the historically categorized transactions and the transactions to categorize. In fact, the end-to-end process involved transforming the inputs into another form and working with that.

Initially, I would refer to different datasets in my prompts as "initial dataset", "categorized dataset", "output in the previous message", etc. I often ended up confusing the assistant and got incorrect results. Here is one of my bad prompts:

Categorize all the transactions in the output of the previous message.

Search for transactions in the provided dataset. If you don't find a transaction in the dataset I provided, search for it by PLACE in your own dataset. Then use this information to categorize the transaction.

Bad Prompt

Instead, explicitly name your datasets and inputs to make it easy to refer to them. For example, I now use prompts like this:

The attached file is a set of historical transactions to use as reference for categorization.

Each transaction is in the following format: {DATE}, Time: {TIME}, Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}, Category: {CATEGORY}

Let's call this data "historical transactions".

Improved Prompt

Another prompt:

Here is the file with the transactions which are not categorized. The file is in CSV format. Load this file to work with it. Format every transaction in it into the following format: Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}

Let's call this "current transactions".

First Prompt

And then, I will use these references in my future prompts:

Categorize all the transactions in the "current transactions" by looking up similar transactions in "historical transactions".

Second Prompt
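If you drive this from code, the naming convention is easy to keep consistent by generating the prompts. The sketch below is only an illustration: it does the formatting step locally instead of asking the model to do it, and the CSV column names and helper functions are assumptions, not part of my actual setup.

```python
import csv
import io


def format_transaction(row: dict[str, str]) -> str:
    """Render one CSV row in the line format used in the prompts."""
    return f"Place: {row['PLACE']}, Amount: {row['AMOUNT']}, CR/DR: {row['CR/DR']}"


def name_dataset(name: str, lines: list[str]) -> str:
    """Attach an explicit name to a block of data so later prompts can refer to it."""
    body = "\n".join(lines)
    return body + "\n\nLet's call this \"" + name + "\"."


def lookup_prompt(current: str, historical: str) -> str:
    """Build the follow-up prompt that refers to both datasets by name."""
    return (
        f'Categorize all the transactions in the "{current}" '
        f'by looking up similar transactions in "{historical}".'
    )


# Made-up sample data with hypothetical column names.
sample_csv = """DATE,TIME,PLACE,AMOUNT,CR/DR
2023-11-02,13:05,Zomato,450.00,DR
2023-11-03,09:40,Kayak.com,12000.00,DR
"""

lines = [format_transaction(row) for row in csv.DictReader(io.StringIO(sample_csv))]
first_prompt = name_dataset("current transactions", lines)
second_prompt = lookup_prompt("current transactions", "historical transactions")
```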

Treat the model as someone who has no context and can do one very specific thing well

Treat the model as if you have hired someone who has no idea about your job, work and goals. It's like hiring someone to do a one-off job on Fiverr. This person can work really well if you are very specific, but when given complex tasks, this person might get confused because they don't have the context of the problem and clarity about your expectations of them.

I wanted my assistant to categorize all my transactions according to transactions that I have already categorized in the previous months. If something is not present in my previously categorized transactions, I wanted it to use its own knowledge (the internet) to categorize the transactions. Here is the prompt that I used:

Categorize all the transactions in the input dataset.

Search for transactions by PLACE. If you don't find a transaction
in the historical dataset I provided, search for it in your own dataset.

Example Prompt

This turned out to be too complex, and the result was never accurate. It would either use my historical transactions or its own knowledge, but not both.

What's a better way? Break it down. Here is an improved version.

First prompt:

Categorize each transaction in "formatted current transactions" by looking up similar transactions by PLACE in "historical transactions". If you don't find a matching transaction in "historical transactions", categorize it as PENDING.

The output should be in the following format: Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}, Category: {CATEGORY}

First Prompt

Second follow-up prompt:

Great. Now, categorize all the PENDING transactions with your own dataset and knowledge base. Preserve categories for transactions that have been already categorized and are not marked PENDING. The output should be in the same format as the previous message with all 50 transactions.

Second Prompt

The results were more accurate and more predictable.
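When the two passes are run from code, the first-pass output has to be parsed to find out what is still PENDING. Here is a rough sketch of that step, assuming the model follows the requested line format exactly. The regular expression and the function names are my own illustration, and lines that don't match are silently dropped here, which real code would probably want to report.

```python
import re

# Mirrors the output format requested in the prompts above.
LINE_PATTERN = re.compile(
    r"Place: (?P<place>.+), Amount: (?P<amount>.+?), "
    r"CR/DR: (?P<cr_dr>CR|DR), Category: (?P<category>.+)"
)


def parse_line(line: str):
    """Return the fields of one output line, or None if it does not match."""
    match = LINE_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


def split_pending(lines: list[str]):
    """Separate transactions that are already categorized from PENDING ones."""
    categorized, pending = [], []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # chatter such as "Here are the results:" is ignored
        if parsed["category"] == "PENDING":
            pending.append(parsed)
        else:
            categorized.append(parsed)
    return categorized, pending


# Made-up first-pass output used to exercise the parser.
first_pass_output = [
    "Here are the results:",
    "Place: Zomato, Amount: 450.00, CR/DR: DR, Category: FOOD",
    "Place: MTA*NYCT PAYGO, Amount: 2.90, CR/DR: DR, Category: PENDING",
]

categorized, pending = split_pending(first_pass_output)
```

Knowing how many transactions are still PENDING also tells you whether the second prompt is needed at all.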

Be as specific as possible for accuracy in results

Since I was trying to categorize transactions, I expected that the count of categorized transactions would be the same as the count of transactions in the input. Interestingly, I saw that either the model would decide to output only the first 5 transactions to show me a sample of its work and get my approval, or it would show all the transactions but, erratically, some would be missing.

To make sure that the model does its job correctly and you are not going in loops to get accurate results, make your expectations of the results explicit and precise. Here is an example:

Categorize all the transactions...

The output should be in the following format: Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}, Category: {CATEGORY}

Also, the output should have all 50 transactions.

Example Prompt

Notice that I specify the expected output format and the exact number of transactions I expect in the output. This reduced both the random disappearance of transactions from the output and the model's habit of showing only the top 5 or top 10 transactions.

Cross-question the model's responses to reason and improve your prompts

You might start with a set of prompts that you think might work. But the result might not be what you expected. In that case, you can cross-question the model to understand why the output is not what you expected. This can help you improve your own understanding of how the model is processing your input and instructions. Once you have a better understanding, you can improve your prompts. This is similar in spirit to Chain of Thought prompting, where the model is asked to spell out its reasoning. Consider this example:

Categorize each transaction in "formatted current transactions" by looking up similar transactions by PLACE in "historical transactions". If you don't find a matching transaction in the historical transactions, categorize it as PENDING.

The output should be in the following format: Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}, Category: {CATEGORY}

The output should also have all the 50 transactions.

We will call this dataset "first pass".

First Prompt

In my case, the response had all the transactions categorized as PENDING. This was unexpected as I knew that my input had some transactions that were present in my "historical transactions" dataset. So, my follow-up prompt was to reason about it. Here was my next prompt:

You have marked everything as PENDING. Didn't you find any transaction in the historical dataset?

Second Prompt

The model responded by saying this:

Apologies for the confusion. It seems that I mistakenly used a placeholder mapping for historical transactions instead of the actual historical transactions dataset. Let me correct this and categorize the transactions again using the correct historical transactions dataset.

...

Response to Second Prompt

Now, this helped me understand that my First Prompt was ambiguous. Using this kind of reasoning, I was able to iterate further on my prompts to make them more accurate. If I were doing this through code, I could also add automated checks on intermediate steps, with a fallback prompt asking the model to improve the results.
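A rough sketch of what such an automated check with a fallback prompt might look like. Everything here is hypothetical: `ask` stands in for a function that sends a prompt within the same conversation and returns the output lines, and the two checks are just the failures I ran into (missing transactions, everything marked PENDING).

```python
from typing import Callable


def check_output(lines: list[str], expected_count: int) -> list[str]:
    """Return a list of human-readable problems found in the model output."""
    problems = []
    if len(lines) != expected_count:
        problems.append(f"expected {expected_count} transactions, got {len(lines)}")
    if lines and all(line.endswith("Category: PENDING") for line in lines):
        problems.append("every transaction is marked PENDING")
    return problems


def run_with_fallback(
    ask: Callable[[str], list[str]],
    prompt: str,
    expected_count: int,
    max_attempts: int = 3,
) -> list[str]:
    """Send the prompt, then send corrective follow-ups until the checks pass."""
    lines = ask(prompt)
    for _ in range(max_attempts - 1):
        problems = check_output(lines, expected_count)
        if not problems:
            return lines
        follow_up = (
            "The previous output has problems: " + "; ".join(problems)
            + f". Please correct it and output all {expected_count} transactions."
        )
        lines = ask(follow_up)
    if check_output(lines, expected_count):
        raise ValueError("model output still failed the checks after all attempts")
    return lines
```

The follow-up prompt names the specific problem, which is the automated version of the cross-questioning above.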

What's next?

This has been an exciting learning experience. I am a little better at understanding how to build LLM-based applications. There are still some areas that can be improved in this assistant to make the output more deterministic and accurate. I will keep working on that.

The next set of things I am excited to learn:

Using Retrieval in OpenAI's Assistants to further fine-tune my assistant
Exploring RAG by implementing a vector database myself and going down the rabbit hole of learning all about vector embeddings, cosine similarity and retrieval
Running open-source models instead of using OpenAI's models and testing for accuracy using an automated (or semi-automated) process

Update: OpenAI's Prompt Engineering Guide

It turns out that OpenAI has written a useful guide on Prompt Engineering to get accurate results, which overlaps quite a bit with what I have learned. While it would have been helpful to read this before I did everything myself, the engineer in me is happy that I was able to figure out most of this on my own 😄

Extensions

Engineering Transformations - Adopting Automated Testing As A Practice

Vaidik Kapoor Dec 17, 2023

Leadership, in a way, is about finding and using levers that help the business take step jumps in making progress towards the vision of the company. Engineering leadership is similar but limited to engineering and product development. As engineering leaders, one of our major responsibilities is to help our teams take step jumps in how we innovate quickly, ship high-quality software at break-neck speed and operate it in production as well. I call such step jumps Engineering Transformation.

I use the term transformation at the risk of sounding jargony (digital transformation, IT transformation, etc.) as such terms are often used by consultants and I am one right now. But that's just due to the lack of a better term in my vocabulary. I don't intend to make this article about expensive consulting that nobody likes (and perhaps does not deliver what most businesses need). I intend to talk about leadership, that too, at startups and growth-stage companies.

Bringing about transformations is not easy. To bring about a transformation and uplevel an organization, you need to intervene at many levels - skills, culture, mindsets, discipline, to name a few. The journey of transformation is not easy because the challenges are partly technical and mostly sociological.

It is unique for every organization. And it is worthwhile to discuss the challenges and techniques to bring about these transformations. I hope to discuss some of those transformation experiences I have led through a series of articles. And I am starting with one such transformation in this article.

Adopting Automated Testing as a practice

It's 2023. While, as an industry, we have been talking about Automated Testing and Continuous Integration for almost two decades, most teams continue to struggle with quality, Automated Testing, Continuous Integration and Continuous Delivery. So, as the first case study, I want to discuss the challenges of adopting the culture of automated testing, which helps improve quality and speed of execution, and eventually practising Continuous Integration and Continuous Delivery. In this article, I will try to cover patterns from my personal experiences of helping teams adopt automated testing as a regular practice in their teams.

This article is not to sell automated testing and shifting left on quality. That's a different topic and has been written about many times by many accomplished people in the industry. Books have been written on the topic. So, I am assuming that the value of automated testing as a practice is known to the reader. I am only going to talk about the struggles of adopting automated testing as a practice, which any team would have faced only if they wanted to adopt the practice, meaning that they understand the value of it at some level. And hopefully, this article will help them attempt it again.

Failure Modes

Teams that try to adopt the practice of writing automated tests usually try it several times and fail. They fail to adopt it as a practice and get value out of it. Let's unpack that first.

What does it mean to realise the value of automated testing?

Better user experience - Automated tests help prevent regressions, meaning that known user flows of the application are tested on every change. This helps ensure that developers are testing all the ways that their applications can be used by their users, thus ensuring users don't face functional problems while using the application. Since the tests are automated, they can be run over and over again.
Confident releases - Because teams can test every change automatically, they can ship changes with a high degree of confidence. No more late-night deployments. No more gated PR merges by senior engineers only. No more batching a large set of changes and then testing them manually. Manual testing is expensive. The cost of it introduces bias in deciding which scenarios to manually test. With automated testing (theoretically), you can test all the known ways a user uses the application. Hence, automated testing also helps remove bias in the testing process.
Continuous and on-demand releases - confident releases lead to the ability to release frequently and on-demand whenever the team wants. Continuous delivery of small batches of work leads to customers getting the latest work frequently. A positive side-effect of this is that engineering teams also get to work on the latest and well-tested version of the code base, which lowers the chances of rework and integration hell (merge conflicts).
Shipping new changes and system maintenance become easier - confidence in releases opens the doors to make more frequent releases. More frequent releases mean faster value delivery to customers, faster learning from customer feedback and faster iterations. So, existing features can be improved faster, and new features can be built faster. Other work, like system maintenance, also happens faster. For example, most teams struggle to update their dependencies to newer versions because they are not sure if the application will work after upgrades. Security vulnerabilities don't get patched on time, but automated tests make it possible to confidently patch software without breaking customer experience.

There are several failure modes for a team trying to adopt automated testing:

Juniors working on automated testing - change starts from the top. This is also a side effect of the leaders of the organization not understanding the value of the transformation itself, and hence, they don't lead the change themselves. If the leaders want to see a change in their organization, they should drive it themselves (being hands-on with writing the code is not necessary). Otherwise, transformations and organisational changes are doomed to fail and should be suspended. Why suspend? As a leader, when you commit something to your teams but don't make it happen, your commitments stop meaning anything and become just words.
Not focussed - organization-wide changes are hard. They require focus, especially if you are also learning the skill related to the change. In the case of automated testing, it could be learning how to write tests, what kind of tests to write, how to architect the test infrastructure, etc. Learning requires focus. Driving a big change, especially getting everyone else in the organization to learn a new thing and change their behaviours, requires focus on training, coaching, resolving anxiety and providing any kind of support to see through the change. Not being focused also means no clear, achievable goals and milestones. Without that, the value of automated testing will not be realised.
Too few resources for the change - if you have no tests, bringing test coverage to a level where it provides high confidence during releases is a big change. If enough resources are not provided to write automated tests, reaching "meaningful test coverage", and with it the value of your automated testing initiative, might take forever.
Not taking everyone along - you will likely start with a few people to get to "meaningful test coverage". But then what after that? How do you scale the practice to other teams and make it a "practice"?
Poor feedback loops - not running tests frequently enough. The process of running tests is automated using a pipeline, but the pipeline is run only on-demand when a developer wants to run it. Developers then choose when to run tests, which adds extra effort to getting valuable feedback from your automated tests. Tests should be run while developing locally, on every push to a remote repository and before merging the PR. The latter two are the controls that ensure that tests are always run before a change gets to production. Forcing everyone to run tests locally might not be feasible depending on the size of your org (you can't monitor everyone, or maybe you can?). But, if possible, absolutely get everyone to run tests locally. At the very least, run tests before merging the PR. Poor feedback loops lead to some important failure modes:
- Tests are run on-demand when someone wants to test something instead of running them automatically in a CI loop. Delays feedback, which leads to rework in the later stages of building software. Adds extra decision work for the team. It could even lead to changes not getting tested before being released to production because developers can skip running tests (because that's just extra work).
- Tests are flaky because of environment-related issues like sharing of the same environment for manual testing, poor test data sanity due to previous test runs, and lack of automation to setup the infrastructure for the application, to name a few. Flaky tests reduce confidence in the test suite. Hence, teams start to skip running them. Think about it - if you use a product that does not often work, especially when you need it the most, you will stop using it and find an alternative. In this case, it could just be not running tests.
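One common way to avoid flakiness from shared or leftover test data is to give every test its own fresh data store. The sketch below illustrates the idea with Python's standard `unittest` and an in-memory SQLite database. The `users` table and the helper functions are a hypothetical stand-in for a real application, and a real setup would likely need more than this (for example, isolated environments per pipeline run).

```python
import sqlite3
import unittest


def add_user(conn: sqlite3.Connection, name: str) -> None:
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def count_users(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


class UserTests(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory database per test: nothing survives from
        # earlier tests, earlier runs, or someone's manual testing.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    def tearDown(self):
        self.conn.close()

    def test_starts_empty(self):
        self.assertEqual(count_users(self.conn), 0)

    def test_add_user(self):
        add_user(self.conn, "asha")
        self.assertEqual(count_users(self.conn), 1)


def run_suite() -> unittest.TestResult:
    """Run the tests programmatically, as a pipeline step might."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UserTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because each test builds and discards its own data, the two tests pass in any order and on any number of repeated runs.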

Strategies for Success

Transformations are difficult. They are less about the technology and more about the people and what they must do to collectively be transformed. Understandably, the strategies of success are related to the failure modes.

Change starts from the top - changes are hard. They don't just require technical skills but also organizational buy-in. Leaders are better positioned in an organization to drive change. When they lead the change themselves, they motivate everyone. They are also better suited to start, build momentum and then monitor progress in later stages because they now understand the problem space better to iterate and monitor progress without being hands-on. Leaders should lead the change themselves, particularly if the org has been failing to drive the change bottom up, and especially with difficult changes like adopting automated testing.
- If the senior leaders cannot be involved in the transformation initiative, they should be involved enough to understand the problem space and monitor execution and progress. But if they are not involved hands-on, the next available layer of seniors should be involved hands-on in the initiative. This could be senior engineers, tech leads or engineering managers. If they don't know how to write tests, then they should learn it by doing it and then learn what it takes to maintain a good test suite.
Start by learning - if you are starting a transformation about anything, make sure that there is "enough" knowledge in the organisation to drive the change. I focus on "enough" because knowledge levels are contextual. For example, the knowledge for automated testing and the skills to build the infrastructure for it look very different at Google scale than for a startup. So "enough" is contextual. However, the fundamentals are universal. In the case of Automated Testing, it is important to understand the different kinds of automated tests, what value they bring, their implementation complexity, and their associated costs. Most teams often start with Unit Tests. However, if you have an existing large enough codebase, unit tests alone are not going to provide enough value. You are going to need Integration Tests that test user behaviours.
- If you have the required knowledge and skills on the team, you can skip this. However, knowing where you stand in terms of the right technical knowledge is important to plan the strategy and its execution.
- If you haven't driven a transformation like this before and don't have the required knowledge and skills, then start by learning. Follow SMEs, read literature, and talk to experts who have done this before. Don't underestimate the fact that you probably don't know enough about a domain that you are trying to venture into. Most teams that try to adopt automated testing as a practice struggle with this - not knowing what tests to write that provide value in the current context of the team.
Create focus - solving any tough problem in business requires focus. Most teams are able to (or not) create focus on innovation while trying to solve a product problem but they struggle to create focus on driving transformations. Create a focus to learn, experiment and learn by doing, analyse your current context and then build a plan to execute. In the case of automated testing, first make it a clear goal (OKR, objective, deliverable, SMART goals - whatever language works for you). Then set aside people or a team for it. There are two ways to go about doing this:
- Dedicate a team that just does this (like working on a new problem or product) and drives the change centrally. This works well when the team also needs to acquire the knowledge to drive the change, learn the skills, bring clarity to clearly define the objective, and execute themselves to get to the first few milestones before scaling to other parts of the organisation. It also works well when a certain amount of work is needed in the beginning to lay the foundation and build momentum. I think this structure works the best for most teams. In the case of automated testing, this will look like learning how to write automated tests, write the first set of tests to set examples, build the infrastructure like CI pipelines and test data infrastructure, achieve "meaningful test coverage" to confidently release on-demand, etc.
- Set up a loose team (or a working group) that meets at a certain cadence (say weekly or bi-weekly). They support and drive the change across the org. This works well when the org already has the knowledge and needs to incrementally drive the change but still deliver results in a finite time period. In the case of automated testing, this might look like multiple teams running different parts of the overall project (like one team solves the test data infrastructure, another one documents behaviours to test, another one learns how to write tests in the existing tech stack, another one teaches engineers how to write tests by pairing with them, etc.).
Provide adequate resources to meet timelines - initiatives and projects with one or two people usually suffer. Two people are not a team, but sometimes that's all you have. Transformations may require learning, architecture work, technical execution, evangelism, coaching, support, documentation, continuous assessment of progress, and more. With an org-wide behavioural change, it is already an uphill battle. Be aware that if enough people and the right people (i.e. seniors but also people with the right motivation) are not tasked to work on the initiative, the initiative might be doomed to fail. So, as a leader, it is also important to motivate the team to work on the initiative. In the case of automated testing, it is a practice for everyone in the team. A change like that requires the leaders of the org to become experts in the practice themselves so that they can coach others, motivate them and also hold them accountable. Beyond the point of building the thought leadership on the topic, there is going to be just pure execution work as well, like writing a ton of automated test cases. Having some juniors to be a part of the initiative to help execute is necessary to move fast in the direction of the leaders of the initiative.
Make the transformation goals finite and time-limited - vague goals are most likely to fail. Anything like "we will increase test coverage to 90%" is a bad goal. 90% coverage of what? In what time frame? Make the goals clear, finite and time-limited. "Write automated test cases to cover 60% of user behaviours with 100% coverage of critical flows in 3 months so that we can confidently release on-demand at any time of the day" is probably a clearer goal. This will drive execution towards writing test cases for features and user flows that matter to the business instead of just arbitrary test cases. The next goal could be to "add automated tests for every bugfix and hotfix without fail as a practice in the next 3 months, increasing test coverage from X% to Y%". The next one after that could be that "every feature must be released with automated tests increasing test coverage from Y% to Z%".
Make meaningful progress to prove value - transformation initiatives are about improving how work happens. Often, teams get stuck in the loop of incremental improvement that does not provide value in a reasonable time period or, worse, will make it impossible to catch up with the speed of everything else happening. In the case of automated testing, I have seen teams commit to adding a "few tests every sprint". If a few tests are added every sprint, but those few tests are negligible compared to all the other code changes, it's mathematically impossible to catch up. In a legacy code base, it is already a very difficult goal. To match up to that debt of automated tests, momentum needs to be built. To build momentum, everyone developing software has to write automated tests. For everyone to commit to learning and writing tests, they have to first truly believe in doing that because change is hard and motivation is probably the biggest factor that is going to turn that into a reality. To get everyone to believe in being a part of the change, one of the best ways is to show them the value of the change. To show them value, work towards a "finite and time-limited goal" of writing enough automated tests that provide value. I have already defined value earlier in this article. To summarize, enough automated tests should give the entire team the confidence to ship changes without breaking core customer experience on-demand multiple times in a day and should provide them feedback quickly to course correct development in earlier stages. To be able to do that, work towards achieving "meaningful test coverage" - write enough test cases that test core business and user flows and enable the entire team to release on-demand multiple times in a day.
The right way is the only way to work; stick by it and make it easy to follow it - if you expect a change in the organization, then you have to make sure that everyone is following the change you expect them to follow, even if it means manually catching them. If you want your team to run tests before shipping changes to production, make sure that they do it even if you have to check with your teams manually. Too much work? Of course, it is. So, automate it. Make sure that tests are run automatically on every change before changes are released to production, and then make sure that there are no backdoors (like SSH or skipping the release pipeline) to release changes, and now you don't have to do a check manually. Ensure that the right way (running tests and deploying via a pipeline) is the only way to release changes (no backdoors). And then, make it easy to follow the right way. Run tests automatically on every push to a remote server so that developers don't even have to think about it. Test runs should be fast. Otherwise, nobody would like to wait for the test suite and will try to find ways to skip the pipeline. In my opinion, 10 minutes should be the upper limit of time for tests to run. Ideally, they should be much faster. When the pipeline or the test automation setup breaks, stop all the work and fix it. Otherwise, your developers will find another way to release changes. If it can be done once, it can be done again. High-performing engineering teams treat pipeline issues the same way they treat production incidents.
Evangelise, coach and support - a central team driving the change and doing a lot of work hands-on themselves is a great starting point to build the momentum and make meaningful progress to prove value, but what's next? As I said, in the case of automated testing, every engineer has to write tests to keep up with the pace of software changes. That requires bringing the entire team along. They might need training, coaching and continuous support, especially when they need to learn with regular feature delivery work. It is not easy to do both things at the same time. So, empathise with their situation and extend help to make the transition easy. Some strategies that could work to scale execution - start with one team and go team-by-team to allow each team to go in-depth of the topic, or start with all the teams together (more or less) but with relatively easier tasks and allow them to build expertise over (a finite) time. Some tactics that could work to extend help: organise a boot camp to bring teams up to speed, office hours to resolve queries and doubts, prefer synchronous discussions over asynchronous back-and-forth, drive discussions to understand the thought process over (say) just doing PR reviews, pair program on a few tasks, dedicate bandwidth of the central initiative team to provide proactive and reactive support to teams adopting the practice.
Build systems for continuous progress - while supporting teams is important, you have to build systems for continuous progress, almost as if progress is guaranteed. Making progress in transformation initiatives is hard. There will be opposing forces. For example, in the case of automated testing, the opposing force is delivering new features. So, what kind of systems can ensure continuous progress? As leaders, hold all teams accountable for the change and their related tasks (like writing tests every sprint to increase the coverage). Remember I said that change starts from the top so that leadership can be in the weeds and see through changes? This is where the organization leaders can hold their team leads and managers accountable for meeting their teams' objectives for automated testing. Every week, leaders can monitor how many tests each team is adding and whether the defined sprint objectives for improving automated testing as a practice are being met. Track the progress of test coverage against a finite, well-defined set of tasks that need to be completed. Weekly monitoring is a good starting point but can be slow (long feedback loop). Consider automating whatever you can to simplify monitoring over shorter periods of time. For example, one approach that has worked really well for me in the past is enforcing that new tests are added (or existing tests are modified) for every bugfix (detected by a consistent branch naming strategy for bugfixes, like bugfix/name-of-the-branch). You can practically enforce this via automation in the CI pipeline so that a pull request cannot be merged if a new test case is not added or an existing test case is not modified. Tactics like these may work well to scale accountability for transformation in day-to-day work without senior people keeping an eye out and micromanaging. Read more about the philosophy of using DevOps for scaling engineering management.
Monitor progress - if you can't measure it, you can't improve it. Set up metrics to track outputs and outcomes. Tracking does not have to be fully automated. Full automation (AKA perfection) may get in the way of making progress. Just start with weekly tracking of progress. That might be enough to start, and then automate depending on which feedback loops need to be shorter. Now let's talk about some things that could be worth tracking periodically - the number of automated tests, coverage of business-critical flows in the product, regressions caught before escaping to production, on-demand releases in a day, number of reported bugs over a week, number of hotfixes in a week, how many engineers can confidently write automated tests. Then, work towards improving these metrics. When you see success stories (my favourite is when potential regressions are caught before code changes are released), celebrate those success stories publicly and widely to build confidence and momentum.
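The bugfix-branch rule mentioned above could be enforced with a small check in the CI pipeline, along these lines. This is only a minimal sketch: the bugfix/ prefix comes from the text, while the tests/ directory convention, the function names and the way the branch name and changed files reach the script are all assumptions made for illustration.

```python
from typing import Iterable

BUGFIX_PREFIX = "bugfix/"   # branch naming convention from the text
TEST_DIR_PREFIX = "tests/"  # assumption: test files live under tests/


def touches_tests(changed_files: Iterable[str]) -> bool:
    """Return True if at least one changed file is a test file."""
    return any(path.startswith(TEST_DIR_PREFIX) for path in changed_files)


def can_merge(branch: str, changed_files: Iterable[str]) -> bool:
    """Block bugfix branches that neither add nor modify a test."""
    if not branch.startswith(BUGFIX_PREFIX):
        return True  # the rule only applies to bugfix branches
    return touches_tests(changed_files)


# Hypothetical inputs; a real pipeline would read the branch name from the
# CI environment and the file list from the diff against the target branch.
print(can_merge("bugfix/login-timeout", ["app/auth.py"]))
print(can_merge("bugfix/login-timeout", ["app/auth.py", "tests/test_auth.py"]))
```

A failing check would then be wired up as a required status on pull requests, so the rule holds without anyone reviewing it by hand.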

Too much to do?

Yes, it is. I will repeat myself - change is not easy. While I discussed automated testing in this article, this applies to any organizational change. This is my (perhaps incomplete) model of driving changes. It feels like a lot because it is new. You run your teams with some said or unsaid rules already, but they don't feel like a lot because you are comfortable with them. The idea is to do it enough to feel comfortable so it does not feel like too much.

Getting a team to work in sync requires commonly agreed-upon principles and core values and sticking by those values. It's not easy, but I don't think there is a better way either. It feels better when you do more of it.

What's next?

I am interested in discovering and building generic models for transformation that work well and scale for growing startups. I have been meaning to write about transformations for a while. I might not have all the answers, but I can share my experiences that have worked for me in the past. I have more ideas that I want to explore on similar lines. And I want to write about them to test my own models. If you have any topic that you would like me to cover, please feel free to reach out to me on Twitter.


The extent of GitOps

Vaidik Kapoor Nov 26, 2023

I have often thought about what extent of Infrastructure as Code and GitOps is relevant for a company. And I am probably not the only one. Many people in the DevOps, platform engineering and SRE space would also have thought of this question. What should be managed the GitOps way?

While the ideal answer is everything is managed via code and operated via the GitOps way, the right answer is that it depends on the situation of your company. The right answer is rooted in the cost (manpower cost, infrastructure cost and most importantly opportunity cost) of moving existing infrastructure and operations to the new way vs the benefits of such changes. Usually, the cost is high.

Revisiting GitOps at DevOps Days London

While attending DevOps Days London, this question came up again in an open spaces discussion on GitOps with Kubernetes. Specifically, the interesting part was the discussion around how some teams have automated the process to commit the number of pod replicas for a deployment/replicaset when an auto-scaling event happens for a deployment in Kubernetes. I have thought about this before while implementing GitOps so I had some opinions on the topic, but it was also interesting to see how others see and solve this challenge, and their extent of implementing GitOps. In the discussion, we tried to go deeper into this but only scratched the surface due to lack of time and other discussion items.

So, through this article, I am going to put my thoughts together on the extent of implementing GitOps and hope to learn what the community thinks about it.

Why GitOps?

First question - why GitOps? I am not going to cover this in detail. In short, GitOps enables teams to practice Continuous Delivery for the entire stack, i.e. from applications to configuration to infrastructure. Why is that important? Because a product or service is built up of all those things.

How does GitOps help with CI/CD? Git is a cornerstone tool for practising CI/CD because of its many benefits:

Benefits of a Git-based workflow:
- Enables collaboration
- Workflows similar to shipping software changes
- Code reviews
- Continuous integration using automation and pipelines
- Audit log and change history help understand what changed
- Free backups work as a great Disaster Recovery mechanism
- A single source of truth helps ensure that engineers can easily reconcile their code and the system
- Additional benefits of Infrastructure as Code (IaC) provide an easy repeatable setup of infrastructure

Hence, bringing in Git to manage the infrastructure and its operations declaratively enables teams to follow similar release workflows for managing infrastructure operations as for applications.

To what limit should we adopt GitOps?

Like some of the teams in the open space discussion at DevOps Days London, one of the teams I used to manage in the past had also gone down the path of replicating infrastructure state (like the replica count of a Kubernetes deployment or the number of instances in an auto-scaling group) into a Git repository.

Git was primarily built for humans to collaborate on shared codebases. Sure, it can be used to store any changes (like a database), but you can use anything for anything by that logic (like bash for building a web service, or S3 as an application database). The benefit of bringing infrastructure configuration into a code repository is that (platform and product) engineers can collaboratively build the infrastructure the same way they have collaborated on shared codebases to build software. So in my opinion, that's the main value of GitOps - collaboration between teams.

But soon after we started on the path to replicating the infrastructure state into a Git repository, we realised that replicating the infrastructure state into a Git repository is complex and perhaps not worth it. Some of the complexities we faced when we were trying this approach (one of these was also discussed at DevOps Days London):

- Automating the PR workflow - the biggest complexity was in automating the part where a pull request is created when the infrastructure state changes (like in the event of auto-scaling) and then also merged automatically. Since machines and humans are collaborating on the same repository, frequent PR merges can be a frustrating experience for humans. Race conditions in merging PRs can lead to unnecessary rebasing. More importantly, it is probably not so straightforward to guarantee that the commits in the Git history reflect the exact order of the infrastructure state change events because race conditions can happen in backfilling the Git repository with infrastructure state changes triggered by the infrastructure itself.
- Complications of CI - if you are running additional checks and verification (like tests or validations), the above problem gets more complex because the exact duration of CI runs is not deterministic, making race conditions in a super fast-changing system (due to automated PRs) highly likely.
- The Git history is also not of much use anyway once it holds a ton of commits that only sync infrastructure state. When I say "a ton of commits" due to infrastructure state sync or reaction to an infrastructure state change event, that's not a random assumption for argument. I actually mean it. In a well-tuned infrastructure with thoughtfully designed auto-scaling policies, auto-scaling will happen many many times throughout the day to keep infrastructure costs in check as the demand on the system varies, and that's okay! But then replicating those infrastructure changes in the Git repo makes the Git repo less useful to humans.

Given such complexities, we started evaluating our objectives all over again and realised that it is pointless to put the entire infrastructure state, especially the state that changes due to external conditions, in a Git repository just because we follow GitOps. That's not the point of GitOps, not in my opinion anyway.

The point of GitOps is to enable practices like CI/CD for infrastructure development and infrastructure configuration management and enable seamless collaboration among humans.

Configuration vs State?

I have mentioned infrastructure configuration vs state several times. To be on the same page, let's get the definitions right for the scope of this article.

What's Infrastructure Configuration?

Infrastructure Configuration is a declarative way to specify in code how you want your infrastructure to finally end up, i.e. its state.

For example, Kubernetes manifests are a type of configuration. A Kubernetes deployment manifest specifies how the infrastructure for running an application should be provisioned and how much of it should be provisioned (i.e. pod replicas). A Kubernetes HPA manifest specifies how the infrastructure should scale up or down on the basis of rules. It does not specify the exact number of pod replicas at a given point.

What's Infrastructure State?

Infrastructure State is the final state of the infrastructure, in compliance with how we declared the infrastructure to be i.e. Infrastructure Configuration.

So, while an HPA manifest specifies the min/max pod replica count and the rules for scaling up and down, the actual count of pods is the state of infrastructure in compliance with the HPA configuration. The same logic applies to the count of instances in an EC2 Auto-Scaling Group. The same logic applies in the case of Vertical Pod Autoscaler.
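The distinction can be sketched in a few lines. The dataclass below stands in for the configuration that would live in Git, and the function derives the state from that configuration plus a live metric. The class and field names are hypothetical, and the scaling rule is a simplified version of the formula in the Kubernetes HPA documentation (desired replicas = ceil(current replicas × current metric / target metric)), ignoring tolerance and stabilization windows.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HpaConfig:
    """Configuration: declared by humans, reviewed and versioned in Git."""
    min_replicas: int
    max_replicas: int
    target_cpu_percent: float


def desired_replicas(config: HpaConfig, current_replicas: int,
                     current_cpu_percent: float) -> int:
    """State: derived at runtime from the configuration and live metrics."""
    raw = math.ceil(current_replicas * current_cpu_percent / config.target_cpu_percent)
    # Clamp to the bounds declared in the configuration.
    return max(config.min_replicas, min(config.max_replicas, raw))


config = HpaConfig(min_replicas=2, max_replicas=10, target_cpu_percent=50.0)
# The same configuration yields a different state as demand changes.
for cpu in (25.0, 50.0, 100.0, 400.0):
    print(cpu, desired_replicas(config, current_replicas=4, current_cpu_percent=cpu))
```

Only `HpaConfig` changes through a pull request. The replica count changes whenever the metric does, which is why it belongs to the platform and not to the repository.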

Does knowing the infrastructure state help?

If infrastructure is to be treated "as cattle and not pets", then knowing the exact infrastructure state (and this is extremely relevant in the case of stateless ephemeral infrastructure like EC2 instances, Kubernetes pods, etc.) is going against that philosophy. Even if we were to challenge that wisdom, I have yet to come across how knowing the exact state of the system can be helpful.

The state changes triggered by the infrastructure platform itself, like adding new pods, are not really modified by humans directly. So even if new pods in the cluster are synced back into the Git repository, that code is not going to be changed by any human. If we don't add the exact pods into the Git repo but just modify the replica count, that doesn't help much either in any scenario related to changing the infrastructure configuration. If the baseline configuration is changed, Kubernetes will replace all the running pods with new pods while maintaining the replica count at that time. So we don't really need to know this while changing infrastructure configuration. While I have taken only one example, I will extrapolate that to say there isn't much need to sync infrastructure state back into the Git repo from the perspective of future configuration changes, not for managing the infrastructure and not for collaboration between developers and platform engineers.

If we were to look at syncing changes back into the repo from the perspective of debugging an ongoing issue in production, the infrastructure state can be helpful. Being able to visualise what all changed in the system overall, either because of changes caused by humans (i.e. change in code and configuration) or external factors like an increase in demand leading to an increase in the number of pods, can be extremely valuable.

For example, at high traffic when a lot of pods get spawned, the application database could start rejecting connections because every pod starts a connection pool and beyond a certain number of running pods, the database cannot accept more connections. Now, the information about the increase in the number of pods over time while debugging such an issue could be handy in reasoning about why the database is not accepting new connections all of a sudden.
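A back-of-the-envelope check shows how quickly this can happen. The numbers below are hypothetical (a pool of 10 connections per pod and a database limit of 200 connections), and the sketch assumes the worst case where every pod fills its pool.

```python
def max_pods_before_exhaustion(db_max_connections: int, pool_size_per_pod: int) -> int:
    """Largest pod count whose combined pools still fit within the database limit."""
    return db_max_connections // pool_size_per_pod


def connections_needed(pods: int, pool_size_per_pod: int) -> int:
    """Connections required if every pod fills its pool."""
    return pods * pool_size_per_pod


DB_MAX_CONNECTIONS = 200  # hypothetical database limit
POOL_SIZE_PER_POD = 10    # hypothetical pool size per pod

for pods in (10, 20, 21, 40):
    needed = connections_needed(pods, POOL_SIZE_PER_POD)
    status = "ok" if needed <= DB_MAX_CONNECTIONS else "connections rejected"
    print(f"{pods} pods need {needed} connections: {status}")
```

With these numbers the limit is crossed at the 21st pod, so a timeline of the pod count around the incident points straight at the cause.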

Another example is catching unexpected side effects of changes, like a change in HPA policy leading to sudden and unexpected shutdown of pods, which leads to an increase in latency and error rates. Again, when the latency and error rates spike, it would be nice to know that there was a deployment around that time but also what else happened in the infrastructure, like a sudden reduction in the number of pods. This is an important piece of information to build a view of what might have happened that led to service degradation in production.

Events matter

The knowledge of infrastructure state changing over time can be valuable in debugging problems in production. The exact infrastructure state at any time is probably not as useful because it is impossible to read through the entire infrastructure state. But the incremental change in infrastructure state and what led to that change is probably far easier to consume.

Changes in infrastructure happen due to events that are either triggered by humans or are system-triggered. A deployment triggered by a change in a Git repository (a pull request getting merged) is an event triggered due to human actions. Changes triggered by the control plane of an infrastructure as a result of an infrastructure configuration and changes in some other external conditions lead to an event (like auto-scaling as a result of an increase or decrease in traffic) triggered by the infrastructure itself. Both kinds of events are important to debugging the system when something goes wrong.

In modern Cloud and Cloud Native infrastructure, there will be a lot of events triggered by the infrastructure itself. When such changes are replicated in the Git repository as well, the Git history will contain a lot more changes from the infrastructure than from the engineers on the team. So while knowing how the infrastructure state has changed over time can be extremely valuable while resolving production issues, that much information dumped in one Git repository and accessible mostly via the repository's Git history is going to be cognitively heavy for engineers to consume and put to use effectively. When things go bad, you don't want to be reading a Git history where most changes are from a bot about syncing state.

So it does not make much sense for all the events to make their way into a Git repository. But all the events triggered by humans (i.e. engineers making changes) should be in a Git repository for all the earlier discussed benefits of using Git for managing changes. However, there isn't much value in replicating events triggered by the infrastructure itself. In fact, it is more harmful to the developer experience.

What's a better way to manage state and events?

What is a better way to manage infrastructure state and events if not a Git repository? The infrastructure state itself is less important for engineering teams. It is primarily important for infrastructure tooling for operations like reconciliation. So if you are using Terraform, then the Terraform State file is used and managed by the tooling in the Terraform ecosystem. If it is Kubernetes, then the state is managed in etcd. And it is okay for such a state to be there.

What about events? If events (both human-triggered and infrastructure-triggered) are helpful for debugging production issues, then what's a better way to manage them for human consumption?

While debugging issues in production, system telemetry is the first thing we look at - infrastructure metrics, application metrics, APM profiles, logs, etc. In our hypothetical (but realistic) scenario of latency and error rate spikes, the on-call engineer would first jump on the observability platform to see what's happening.

If the relevant events could be visualized along with those metrics, the events that led to relevant infrastructure state changes become a lot easier to consume. Overlaid on top of the latency and error rate spike data, they become a lot more useful than they would be in the change history of a Git repository.

The screenshot above from a post on BenNadel.com does justice to the point I am trying to explain. The chart above is used to visualize the Request Load Time metric of the server. The light red vertical bar is an event when the server re-initialized itself, leading to a sudden spike in request load time. If this happens over and over again, a pattern can be established, and the next steps can be taken to gracefully re-initialize the server without affecting the Request Load Time metric.
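The idea of co-locating events with telemetry can be sketched without any particular vendor. The snippet below keeps human-triggered and system-triggered events in one timeline and pulls out the ones that happened shortly before a metric spike. The `Event` type, the `events_near` helper, the 10-minute window and the sample data are all hypothetical, chosen only to illustrate the lookup an observability platform does when it overlays events on a chart.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str       # "human" (e.g. a merged PR) or "system" (e.g. auto-scaling)
    description: str


def events_near(events: List[Event], spike_at: datetime,
                window: timedelta = timedelta(minutes=10)) -> List[Event]:
    """Return events that happened within `window` before the spike, oldest first."""
    start = spike_at - window
    nearby = [e for e in events if start <= e.timestamp <= spike_at]
    return sorted(nearby, key=lambda e: e.timestamp)


# Hypothetical data for illustration only.
events = [
    Event(datetime(2023, 11, 1, 9, 0), "human", "deploy: HPA policy change merged"),
    Event(datetime(2023, 11, 1, 11, 55), "system", "scale down: 12 -> 4 pods"),
    Event(datetime(2023, 11, 1, 11, 58), "human", "deploy: api v2 rollout"),
]
spike = datetime(2023, 11, 1, 12, 0)
for event in events_near(events, spike):
    print(event.timestamp.isoformat(), event.source, event.description)
```

Both kinds of events show up side by side for the on-call engineer, without the system-triggered ones ever becoming commits.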

Conclusion

Most Observability vendors have some kind of support for Events. Datadog supports it by the name Events. Grafana calls the feature Annotations. New Relic calls it Events. Chronosphere calls it Events as well.

This makes managing and consuming events (and infrastructure state) a lot more realistic according to practical use cases. The complexity of replicating the infrastructure state in a Git repository is not worth it. It does not improve the user experience in any way, just makes it worse. However, co-locating events with telemetry data on your observability platform makes them more usable and valuable.

This is how I have settled my thinking around infrastructure changes, state changes and events, and deriving maximum value out of them. What's your strategy to manage infrastructure changes and state changes?


The Evolving Job of a Startup CTO - Part 1

Vaidik Kapoor Jul 31, 2023

As a tech consultant and advisor, I am usually hired to help solve a burning problem that the company cannot solve internally. While working with startups specifically, I have observed that the founders (or the tech leaders) are aware of the presence of one or many problems, but they often only talk about the symptoms. Naturally, if they understood the problem or the root cause, they would have solved it themselves. But sometimes, while executing fast, they cannot make the time to retrospect, investigate, and fail to see and clearly articulate these problems. I thought it might be a good idea to write about some common symptoms I have observed and their potential root causes.

This topic is particularly interesting to me not only because this is what I do as a tech advisor and consultant but also because there is so much leverage in solving these problems. Tech is where most of the heavy lifting happens in most startups. So an attempt to unbundle this seems like a way to build clarity for myself and help other startups in the process.

There is a lot of ground to cover, which is impossible to get done in one article. So I intend to write on this topic in subsequent articles in this series. Through this article, I want to talk with the founders and especially the CTO, if there is one.

Let's go with the first thing I usually hear in conversations with my clients.

We hired more engineers, but we are not shipping fast enough (or worse, have slowed down)

It is one of the most counterintuitive things that startup founders struggle with. Founders, naturally, want to make sure that their business grows faster - serving more customers, generating more revenue, expanding in new markets, raising money to grow faster, etc. But to be able to do all that, they need people. That's when they hire so that they can step back.

But interesting things happen when they step back, and what happens can sort of depend on the composition of the founders (if more than one) from the perspective of their technical experience. There are two possible founder compositions:

The non-technical founders - the founder(s) do not have a software engineering background.
At least one of the founders is technical - one (or more) founders have worked full-time as a software engineer in the recent past, meaning that they can still take another job as a software engineer if they want.

The non-technical founders

When founders with no tech background start a new company, they usually have a founding engineer on the team. Work in such a situation happens by sitting at a common table, where the founders usually decide what needs to be built with inputs from the founding engineer on ideas and (most importantly) feasibility. Work gets decided and prioritised on a daily basis. It goes into execution when the founding engineer takes over. They write the code, put it out on a shared environment (like a staging environment) where their work can be tested, the entire team tests the changes, the code is deployed, and then the founders are out again to get some users to use what they have built.

At least one of the founders is technical

This is not very different from the previous scenario. The major difference is that the technical founder takes on the role of the founding engineer, and hence, the technical founder is mostly writing the code. Depending on the situation, they are probably accompanied by one or more founding engineers in building the software.

Now, let's explore where things start to go south.

Growing

So now the startup is growing. The startup is scaling "something".

Maybe there is a product that a few users use, and the company has raised a seed round. So they are scaling to build more features to solve problems for a wider audience and work towards achieving PMF.

Or, maybe there is PMF, and now the company is scaling to onboard more customers (i.e. scaling sales, improving the onboarding experience, improving support, improving quality, optimizing margins, etc.).

Each of these scenarios would most likely lead to hiring more people, some specialists and some generalists. For the scope of this article, we are concerned about hiring more engineers. Engineers are essential to be able to do most of the above-listed things, i.e. building more features, improving onboarding experience at scale (think automation and product experience), improving support (quality issues, missing features, automating support, etc.), optimizing margins (reducing tech cost?), etc.

We hired more engineers, but we are not faster.

The founders hired engineers to work on more things simultaneously and grow faster. Besides product and engineering, other functions also need attention, like sales, support, customer success, HR, etc. The founders must spend time setting up these functions as well. So the founders hired even more people in engineering and product and perhaps have stepped back a little from day-to-day execution in product development.

But, things are not going as the founders had planned. Everything seems to have slowed down. Here are some of the common symptoms I hear:

Product feature releases take more time and often miss their deadline.
Small changes take painfully long to get done.
Product execution is not up to the mark. New features are not properly baked, leading to a poor first-time user experience. Often features need rework before the release.
Quality issues in the product have started to creep up, leading to frustrated customers.
Sales and product managers are not able to meet the commitments they make to customers.
The catch-all - important things don't get done fast enough without pressure, and the founders don't understand why.

These are only the symptoms. Founders must identify the root causes and clearly articulate the problems leading to these symptoms.

Side note: If all this sounds familiar, we should chat. This is the kind of stuff I love talking about, learning about and solving. Working together or not, I'd love to have a conversation.

The Root Causes

From my experience of solving these problems in different contexts, I'd say that most root causes are common across companies. But at the same time, there could be nuances in some businesses where the specifics might differ, or these recommendations might need some tuning. So please digest with consideration what I am about to discuss next.

Product feature releases take more time and often miss their deadline.

Faster is always better. Every team must strive to be faster. But if they are not getting faster, they should at least not slow down. Speed is existential for every company and even more so for startups.

In the initial days, the execution was much faster with a small team. So why does execution slow down with a bigger team? Here are some reasons that I have experienced first-hand.

Lack of clear direction and focus

If everything is important, then nothing gets done. After all, only so much can be achieved with finite resources (law of nature). Tech leaders have to give their teams clarity on what needs to be done with the finite amount of resources (time, manpower and money) to achieve a definite goal without creating wastage (think "task done" but value not delivered). To achieve goals with finite resources (including time), planning has to be done at some cadence (for low cognitive overhead and discipline). I will discuss this in the next section.

Even after stepping back from hands-on execution, founders must make sure that their teams have clarity to execute well. New information is collected on an ongoing basis. So founders must continuously engage with their teams to have conversations and provide them clarity (written if it helps).

Poor planning or no planning at all

Since resources are finite, work must be planned so that (ideally) every task done always delivers some value to customers or the business. Writing code is only "work done" and does not necessarily mean value is delivered to anyone. For example, the backend team deployed the API for a feature, but the frontend work or integration with the front end remains. This is a classic case of a task done, but the feature the customer will use (the value) is not delivered.

Planning is a big topic, and a ton is written about it (Agile, Scrum, Kanban, Extreme Programming, etc.). I will abstract it into a few simple rules that I like to follow:

Plan Do Check Act - stick to a Plan Do Check Act cycle. When we think of moving fast, frameworks like Scrum (synonymous with Agile) and Kanban come to mind. If implemented poorly, they can lead to poorer behaviours in the team (more on this later). Those frameworks are great. Learn them. But until you fully understand them, I'd suggest tech leaders stick to a simpler Plan-Do-Check-Act cycle and do it at a regular and well-defined cadence. For most web products, a cycle of 2 weeks (also popularly known as a sprint) makes sense. When information changes frequently in the early days, a 1-week cycle might also make sense. For hardware product companies, a different kind of cadence will make sense.
Plan with clarity - this is related to the previous section. Defining what needs to be done and then planning needs clarity of what must be done to solve the customer's and your business's problems. So if your Plan-Do-Check-Act cycle is 2 weeks and starts on Monday, make sure the plan is in place at least on Friday. To put a plan in place so that your teams can execute without your day-to-day involvement, make sure that, as a leader, you strive to articulate the necessary clarity for yourself first and then use it while planning the upcoming cycle. Following this structure will also provide leaders with a structure to have ongoing conversations with their team, provide them clarity and help them learn about the rapidly changing product and business context.
Release at a cadence - moving fast and being agile is not just about the raw execution speed. It is also about releasing frequently, learning fast and reducing waste. What you have not released and not got anyone to use yet is not useful because there is no feedback to learn about the usefulness of what you have built. So building further on top of it could mean you are not heading in the right direction. I love how Intercom has articulated this in their article Shipping is your company's heartbeat.

Moving fast and being agile is about being smart about what you choose to build and when. It is also about how much investment to make behind a feature or an idea. New ideas could require significant investments consuming months of a team's work. So it is important to define what is absolutely necessary to be done and then validate the next steps.

A great forcing function to ensure that your team is releasing fast and regularly so that you can learn from customer feedback is to force yourself to plan features in your Plan-Do-Check-Act cycles to release at least at the end of every cycle. If there is an idea that seems to take longer if it goes into execution, force yourselves to cut it down to a smaller scope so that it can be released at the end of a cycle. However, that small-scoped idea must still be valuable to customers if released (this could be a private release to a few customers by feature flagging in production or even shown in a demo to customers).
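The private release by feature flagging mentioned above can be as simple as an allowlist check. The following is a minimal sketch under stated assumptions: the in-memory `FLAGS` store, the flag name, the customer IDs and the function names are all hypothetical, and a real setup would typically load flags from configuration or a feature-flag service.

```python
from typing import Dict, Set

# Hypothetical in-memory flag store mapping a flag to the customers who can see it.
FLAGS: Dict[str, Set[str]] = {
    "new-checkout": {"customer-17", "customer-42"},
}


def is_enabled(flag: str, customer_id: str) -> bool:
    """Return True only for customers explicitly allowed to see the feature."""
    return customer_id in FLAGS.get(flag, set())


def checkout_flow(customer_id: str) -> str:
    """Serve the new flow to the private-release group and the old flow to everyone else."""
    return "new checkout" if is_enabled("new-checkout", customer_id) else "current checkout"


print(checkout_flow("customer-42"))
print(checkout_flow("customer-1"))
```

The small-scoped feature ships to production at the end of the cycle, but only the chosen customers see it, which is enough to start collecting feedback.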

Founders must own the Plan-Do-Check-Act cycle, which is a way to get minimally but critically involved in execution. It will allow them to converse with their teams on an ongoing basis to provide them with new information and context about customer requirements and the changing needs of the business, and ensure that valuable work is prioritised and planned for their teams to execute. This will further help ensure that their teams are working on the most important things and continuously delivering value to customers. Continuous delivery will enable the founders to learn fast from real customer feedback and do timely course correction.

Inability to ship fast with high confidence

Lack of confidence is rooted in high risk in doing something. It is the fear of breaking things in production. Nobody likes to break things and cause trouble. In the context of shipping fast and releasing software frequently, the risk is the software breaking, i.e. introducing bugs as we change the software. To ensure we don't ship broken software, two things are essential:

getting the requirements right (partly covered in planning, and I will cover the remaining part in the next section)
a good quality assurance (i.e. testing) process.

In the early days of building a product, things moved fast because the codebase and the software were not as big yet. Quality assurance happened via a manual Regression Testing process in which everyone on the team, including the founders, was involved hands-on in testing the changes. When the codebase grows, manually testing every change is not scalable, efficient or effective:

Humans are bad at doing repetitive manual labour with high accuracy. Testing is a repetitive process which requires high accuracy.
Manpower is costly. So humans try to optimize their testing efforts by being selective. But humans are also individually and uniquely biased. So every human looks at the process of testing differently, which makes manual testing less deterministic and difficult to scale (there are ways to deal with this, but the cost angle of human labour always introduces the need for judgement). This makes manual testing ineffective and hard to scale.

I am not saying manual testing should not be done. Manual testing is extremely important for Exploratory Testing. Exploratory Testing is the process of discovering unknown behaviours (side effects) and user-experience related issues in the software. It is partly the practice of intentionally breaking the software before customers discover those broken experience issues. It should be a cross-functional effort, at least involving engineers, designers and product managers, but anyone in the company can get involved in this.

Coming back to the inability to ship fast, Regression Testing is a part of the QA process that is repetitive. If it is done manually, it will lead to:

the slow pace of execution and poor quality of releases
quality issues in production, which will lead to low morale and self-esteem of the team, and low confidence, which in turn will make the cycle of shipping slower

Founders must ensure that as the product and the codebase grow, strategic engineering investments are made so that at least a part of the Regression Testing process is automated. The right level of test automation is not "100% test coverage". The right level of test automation is what allows us to release with high confidence. So plain 100% unit test coverage does not help. Invest strategically in the right kind of test cases. Functional integration tests are a good starting point.
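To make "functional integration test" a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `normalize_phone` rule, the in-memory `UserStore` and the `LoginService` are stand-ins I made up, not code from any team I have worked with. The point is that the test drives a whole customer-facing flow (login) across several pieces at once, instead of checking one function in isolation.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


def normalize_phone(raw: str) -> str:
    """Keep digits only and drop leading zeros (a hypothetical rule)."""
    digits = re.sub(r"\D", "", raw)
    return digits.lstrip("0")


@dataclass
class UserStore:
    """In-memory stand-in for a real user database."""
    users: Dict[str, str] = field(default_factory=dict)

    def add(self, phone: str, name: str) -> None:
        self.users[normalize_phone(phone)] = name

    def find(self, phone: str) -> Optional[str]:
        return self.users.get(normalize_phone(phone))


@dataclass
class LoginService:
    """The entry point a customer-facing login screen would call."""
    store: UserStore

    def login(self, phone: str) -> str:
        name = self.store.find(phone)
        if name is None:
            raise LookupError("unknown phone number")
        return f"welcome {name}"


def test_login_accepts_formatting_variants() -> None:
    # Exercise the flow the way a customer would hit it,
    # rather than testing normalize_phone on its own.
    service = LoginService(UserStore())
    service.store.add("98765 43210", "Asha")
    for variant in ("9876543210", "09876543210", "98765-43210"):
        assert service.login(variant) == "welcome Asha"


def test_login_rejects_unknown_number() -> None:
    service = LoginService(UserStore())
    try:
        service.login("12345")
    except LookupError:
        return
    raise AssertionError("expected LookupError for an unknown number")


if __name__ == "__main__":
    test_login_accepts_formatting_variants()
    test_login_rejects_unknown_number()
    print("all checks passed")
```

A handful of tests like these, covering the flows customers depend on the most, will usually buy more release confidence than chasing a coverage number. In a real codebase, the in-memory store would be replaced by a test database or a test instance of the service.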

Here are some references on testing strategy:

Lack of feedback on work at the right time

As founders (and leaders), an important part of our jobs is to provide timely feedback. I am not talking about feedback for personal growth (yes, that is also important). I am talking about feedback on work. Is the solution developed to solve customer problems accurately? Will it create some other problems? Does the solution fit well in the product and the larger scheme of things?

Sooner or later, these things get surfaced to the founders. At that time, the right thing (most of the time) is to intervene to get it fixed and ship the right solution to the customers. But delayed intervention leads to rework, which means wasted time and effort. If the teams repeatedly fail to get the requirements right, it will lead to rework, which is one of the root causes of the inability to ship fast.

Continuous conversation with teams (as highlighted in the "Lack of clear direction and focus" section) can reduce the occurrences of these. But I'll reiterate that new information comes in really fast and often cannot be continuously conveyed to teams. Founders, by virtue of their position and their vantage point, have the leverage to consume and process a lot more information. Their judgement is a lot more reliable in the company. Their continuous feedback on work being done is extremely important to avoid rework. Even with experienced product managers in place, founders' feedback is essential to help teams meet their goals in shipping the right solutions. Usually, nobody in a startup understands their business domain better than the founders.

Plan-Do-Check-Act cycles provide the "Check" phase (end of cycle) as the minimal intervention that founders have at their disposal to review work and provide feedback before the work is shipped. But the end of a cycle is already delayed. What can founders do to provide faster feedback? Here are a few ideas:

Introduce a mid-cycle review of work or at least "key work". Use judgement to define key work.
Founders probably don't have the time to review everything. So again, use your judgement to set clear expectations with your product, design or engineering leads to get "key work" reviewed as early as possible, ideally before starting software development. Shift Left.
Make yourselves available so your teams can approach you for your input.

Founders must make sure that they provide regular feedback to their teams on work, by reviewing solutions early for "key work / initiatives". Delaying feedback leads to rework and possibly poor customer experience, which leads to wasted effort, inability to ship fast and frustrated customers.

Small changes take painfully long to get done.

I was recently talking to two non-technical founders. They have a fairly high number of customers using their platform. So, the product seems valuable to a large enough audience. They want to expand and grow. Their challenge is that everything in product engineering moves too slowly. One of them said: "Even a simple fix on phone number validation in the login screen to handle an additional zero prefix has taken us more than a month to get done. We just cannot move at this speed." They are right. They will die if they move at such speed. It is also mind-boggling to know why something so simple would take so long. That's probably an extreme example, which could be rooted in a cultural problem. But there could be other slightly more complex yet simple improvements that take a lot of time to get done. Here are some of the reasons that I have noticed so far from my experience:

Cultural problems
- Lack of a sense of urgency or the intention to solve customer problems
- Lack of customer centricity
Technical problems
- Inability to ship fast with high confidence (already covered previously)

I have covered the inability to ship fast already, so I will not discuss it again. Let's talk about Cultural Problems briefly.

Lack of customer centricity is the absence of empathy for customers and of the habit of solving their problems through their lens. If the customer struggles to log in because of a zero prefix added by an auto-complete on their phone, it is a simple problem that a team can solve. But they don't solve it because they are not close to their customers, and they are not made aware of the emotions that their customers feel when the product experience breaks. In this case, the customers are cart vendors from a low-income group using the product to make a living. Their level of education in tech is also fairly low. So as builders of the product, it is the team's job to make it easy for them to use the product.

Continuous conversations (as covered in the "Lack of clear direction and focus" section) can solve this problem to an extent if customer experience issues are also talked about in those conversations. Another idea is to get your product managers, designers and engineers to speak to customers directly and get a first-hand experience of what the customers love and what they would like to see get improved. To make it a part of the culture, systemise it. Make it a habit.

Lack of urgency or the intention to solve customer problems is a different beast. The first thing is to identify if it is an individual problem (a junior or a leader) or is it a systemic problem. Individuals can be coached if you, the founders, have the time and resources. If you don't have the time and resources, it is best to take the hard call and part ways. If it is a systemic team-wide problem, it needs a cultural intervention. At a high level, this comes down to doing the following:

Coaching team leaders on customer centricity, what it means and helping them build a sense of urgency.
Reiterating to the entire company the urgency for solving customer problems and being customer-centric.

To do both of these things, different kinds of management tactics (communication, business review, work review, customer support tour duty, etc.) can be used. I will not go in-depth about this because I am not an expert at this, nor is this a small topic to cover. But enough books have been written about management, goal setting, communication and customer-centricity.

Founders must find ways to embed customer centricity in their teams and culture, explicitly state their expectation of how customer problems must be addressed by teams on a daily basis, and hold their teams accountable for solving customer problems with utmost urgency.

Product execution is not up to the mark. New features...

...are not properly baked, leading to a poor first-time user experience. Often features need rework before the release.

Founders often say this. Product engineering teams release features that either do not cover all the cases of the problems they had set out to solve or feature implementations have a poor user experience (think of things like validations, references to other entities in the software, etc.).

The first one is unarguably the bigger concern, leading to unimpressed customers who do not become promoters of your product, or worse, become frustrated and stop trusting your company's ability to solve their problems.

Missed user experience issues, in my opinion, might not be a very big problem as long as your team can do a fast follow-up to release fixes. However, they still lead to unplanned work (productivity killer) later in terms of support requests.

Why do these things happen? In my experience, teams struggle to execute product features well because of the following reasons:

Lack of domain experience and context
Lack of agile and product management practices
Lack of customer centricity and the desire to solve customer problems (already covered previously)

Lack of domain experience and context

Over time, this has become one of my favourite topics that I talk to teams about when I coach them. Domain experience is so underrated. It is something that you can hire for, but even if you don't have it, it can be learned if you focus on it and approach it from first principles. Let's look at an example to understand it and the side effects of a team lacking it.

In one of my engagements, I worked with the engineering team at an e-commerce marketplace company to help them improve their test automation setup. This team owned the product catalogue service, which controls what product assortment is made discoverable to customers in a location. While working on test automation, we discussed a particular scenario that had to be tested, which led us to discuss some gory details of their architecture. Initially, the product catalogue would hold the mappings of products to merchants. So a customer at a certain location could be serviced only by the merchants who could service that location, allowing the customer to discover products available with these merchants. Over time, they added "bigger" merchants who could service a larger geolocation. But they wanted different pricing in different areas. So... they created the concept of a backend merchant and a frontend merchant. The backend merchant is, well, the backend, only responsible for holding the inventory and for order fulfilment and logistics. A frontend merchant was mapped to a backend merchant so the frontend merchant would show the backend merchant's inventory. The frontend merchant controlled the pricing of products. Many frontend merchants could be mapped to one backend merchant.

Over time, this led to so many complications in the architecture that any new person's head would spin. It was too hard to follow because of entities created out of thin air that did not reflect the reality of the real world (AKA the domain). A good architecture is easy to follow because it represents the domain clearly. In this case, there should have been only one type of merchant (like before). A separate service should store location-wise pricing overrides for a merchant. This would have made it much easier to follow what was happening in the system.
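The alternative I am suggesting (one merchant type plus a separate store of location-wise pricing overrides) could look roughly like the Python sketch below. This is my illustration of the idea, not the company's actual design: the class names, the `price_for` method and the choice to store prices as integers in minor units are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Merchant:
    """The only merchant entity: it holds inventory and a base price list."""
    merchant_id: str
    serviceable_areas: Set[str]
    base_prices: Dict[str, int]  # product_id -> price in minor units


@dataclass
class PricingOverrides:
    """Hypothetical separate store of location-wise price overrides."""
    overrides: Dict[Tuple[str, str, str], int] = field(default_factory=dict)

    def set_override(
        self, merchant_id: str, product_id: str, area: str, price: int
    ) -> None:
        self.overrides[(merchant_id, product_id, area)] = price

    def price_for(self, merchant: Merchant, product_id: str, area: str) -> int:
        """Return the area-specific price, falling back to the base price."""
        if area not in merchant.serviceable_areas:
            raise ValueError(f"{merchant.merchant_id} does not service {area}")
        key = (merchant.merchant_id, product_id, area)
        return self.overrides.get(key, merchant.base_prices[product_id])


# One "bigger" merchant servicing two areas, with a different price in one.
big_mart = Merchant("big-mart", {"north", "south"}, {"milk-1l": 6000})
pricing = PricingOverrides()
pricing.set_override("big-mart", "milk-1l", "south", 6400)

north_price = pricing.price_for(big_mart, "milk-1l", "north")  # base price
south_price = pricing.price_for(big_mart, "milk-1l", "south")  # override
```

Here a new engineer only has to learn two concepts that exist in the real world: a merchant, and a price that can vary by area. There is no frontend or backend merchant to map in their head.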

In another engagement with a company that builds a DevOps tool as a SaaS service, the founders faced difficulties in having their engineering team support their existing customers and onboard new customers. When the team received concerns or feedback from their customers, they would either not act with urgency or, worse, propose incorrect solutions, leaving the customers frustrated. It seemed obvious to me that when engineers build products for other engineers, they can communicate well and "they get it". They can easily understand what their customers want. And that is true to quite an extent. But the product engineers in this team had no background or exposure to DevOps at all. They also, unfortunately, had no inclination to learn about DevOps. So they regularly struggled to get to the root of the problems their customers brought to them. They did not understand the domain well, and unfortunately, they did not even want to learn.

Understanding that you need to learn about the domain is also being customer-centric. A great way to learn about the domain of the business is to have frequent customer conversations.

Founders must systemise how their team continuously learns more about the business domain and becomes an expert in its domain. There are several ways to do this, but getting the team to have regular customer conversations is one of the most definite and powerful ways to make it happen.

Lack of agile and product management practices

This section introduces jargon (which I have avoided so far). I will try my best to break it down to reduce its obscurity. A lot of really smart folks I have been lucky to work with (who are not product managers by their job function) have principles and disciplines that they stick to, principles that are rooted in logic and First Principles to maximise return on investment of time and money. Instead of getting into the specifics of product management practices, I would prefer to stick to these foundational principles for product management:

Develop an effective strategy
Set and stick to clear priorities
Set measurable outcomes to determine success
Support product engineering teams

When a product engineering team is tasked to do something, the company is making an investment to build something that they can sell to customers and make more money than what was put in to build it. This depends on the fact that you know what to build that customers really want and would pay you for.

When it comes to knowing what customers want, let's just agree that nobody knows enough about what customers want unless we systematically make an effort to learn. We build with the best information available to us, but we have to constantly be on the lookout for new information that helps us either be sure of our current understanding or correct it when we are not.

At the minimum, we want to learn from our customers which problems we can solve for them, and when we attempt to solve those problems, have we actually solved those problems well enough to our customer's expectations? In practice, this looks like:

Learning about unsolved problems
- Doing frequent customer interviews, surveys or creating any touch point that gets you to interact with customers directly or indirectly to understand the challenges in whatever job they are trying to do.
Getting feedback
- Build something, and then demo it to customers to see if the new feature or solution solves their problem and does that well enough. We don't know if we have done our job well unless the customer says it and is ready to pay for it. So show early and show regularly before it is too late.
- Discover user experience issues with existing solutions (including your product) by regularly reviewing support tickets, analytics, and bugs. Talk to customers for feedback. Our systems already have a lot of information (hopefully), and we can use that to build a better understanding of how our customers use the product and where they struggle.

Obviously, do all of this regularly, with discipline, without failing. Demo after every major release. Review support tickets and bugs regularly. Talk to customers regularly. If you follow the Plan-Do-Check-Act cycle, you already have a cadence in your company. Tie things to it to maintain discipline. For example, review support tickets and bugs once every two weeks (at least).

Most of this might sound obvious. But doing this is so important to build the clarity to decide where to invest to get maximum returns on your investment (engineering time).

Clarity guides strategy. Strategy guides priorities. Measuring and using metrics for success helps stay objective.

Now that we know learning from customers regularly is important and we can receive new information that can change direction, we must support product engineering teams to execute. To support product engineering teams, it is important to plan work so they can ship in small iterations frequently. This enables you to demo to customers regularly and learn from their feedback. We have already discussed some of this earlier in this article. We are now inching into the Agile territory.

How frequent is frequent enough? Ideally, every day or even better if multiple times a day. That is probably how you operated in the early days of your startup. That ideally should not have changed. If it has, then you need to fix it. A good starting point for shipping frequently is at least once in 2 weeks (sprints or Plan-Do-Check-Act cycles). We are now inching into the Scrum territory. Why 2 weeks? 2 weeks strike a good balance between time, flexibility, team efficiency, focus and customer collaboration. But this is the widely followed norm for software teams. Different teams (like ones working on ops or hardware) can have different reasons to choose a different duration for sprint cycles.

But we already follow Scrum, and we are still slow.

You may say that "We already follow Scrum, and we are still slow". I didn't write all this to tell you to follow Scrum. But I did write all this to tell you to work towards being more agile. Scrum is one of the tools to help you be more agile.

Agile does not mean Scrum. Scrum does not mean agile. Agile means agile. What does that mean? Ship small, ship frequently, learn fast from feedback and course correct often. Scrum is only one of the ways to do it. You can very well be agile without Scrum. So if Scrum is not working for you, you have somewhere lost the essence of why we do Scrum, which is to be agile. So learn Scrum to do Scrum well, or don't do Scrum at all. Done badly, it can be counterproductive and introduce more inefficiencies.

When you (genuinely) learn Scrum, you understand the different techniques and systems that help your teams be more agile. So follow Scrum or use First Principles to create your own processes - whatever works for you. But strive to be more agile in your execution.

Founders must institute the discipline in their teams to regularly learn from customers (directly or indirectly using data) and understand which problems they can solve for their customers. To learn regularly and frequently, you must ship frequently as well. A good starting point for founders to make this happen is questioning if their teams and execution are truly agile. If that leads to an unsure feeling, work with your teams (or team leads if you have them) to understand where execution is slow in the process of product management and building software. Use first principles to solve your bottlenecks. If that doesn't help (and it doesn't always help if you have not had the experience of building software yourself), learn about Agile, Scrum and Extreme Programming. If you don't have the time to do it yourself (a scarce resource for founders), reach out to someone (an experienced engineering leader) who can help you understand this.

If you are looking for an experienced tech leader to discuss your execution problems, this is what I do, and I'd be happy to chat.

Software quality issues have started to creep up, leading to frustrated customers.

This is another big one that I see so many teams struggle with. Shipping new features is easy. Prioritising bugs and the work that improves quality is not straightforward because it is hard to understand its impact (besides noisy customer frustration). There are two sub-problems while dealing with quality:

Fixing known reported bugs timely
Ensuring that you don't introduce new bugs (bugs in new features or breaking existing features) with every release

Fixing known reported bugs timely

Often in the early days, bugs are reported and fixed as they come. Essentially, every team and every engineer is on-call. Every reported bug (especially the ones reported by the founders) is a high priority. So someone leaves their current work (interrupted) and jumps on to it, which is not bad in the early days, but such interruptions impact our ability to focus and deliver planned work. So not prioritising some bugs if they are not severe is absolutely fine because the cost of interruptions is high. But someone has to take that call to prioritise incoming bugs. To prioritise something, it is important to have a process to track it and prioritise it. If you prioritise not fixing it now, maybe you want to fix it later or use that bug as information in guiding new product development to avoid similar mistakes.

Unplanned work, like fixing bugs, is hard to deal with. If you are getting a lot of them, the only way to sanely deal with them is to have some bandwidth carved out so that the execution of planned work is not impacted.

Whatever you choose to do, you must track bugs (and collect information to recreate them to help developers fix them fast) so you can prioritise them to solve now or later. An additional benefit of tracking bugs is to be able to report quality metrics (release frequency vs reported bugs over a period of time). This helps you understand the quality aspect of your software development process (the system, which is what we will talk about in the next section).

Another useful feedback loop for quality is to build some kind of cadence with the Customer Support (or equivalent) team to get a first-hand report on customer issues. Customer Support teams are the ones that customers reach out to (or yell at). They not only have an idea about bugs but also the customer impact of those bugs. Using their insights into prioritising bugs can be really helpful. But more importantly, if you choose to invest time in improving the quality of your product, its impact should indirectly show up in how frequently customers are reaching out to your Customer Support team and what they think of the product experience.

Ensuring that you don't introduce new bugs with every release

Shipping on-demand every day as and when the team feels that they are ready to ship is not enough, not without the explicit guardrails of quality anyways. If shipping every day leads to bugs (in what you shipped, and worse, in unrelated areas of the product), you have a misplaced sense of moving fast and being agile. All you are doing is frustrating customers, making progress but taking two steps back, creating more work for yourself to do later in terms of engineering and getting back the confidence of frustrated customers.

Move fast but at least don't break everything?

If you don't track bugs, you cannot track a trend of quality over time. Reporting bugs on a Slack channel is not enough. That is just communication. Unstructured bug reports are initially helpful. But as you scale, you need some amount of structured information to at least be able to import them in a spreadsheet, create filters, and generate metrics (number of bugs over a 4-week rolling window, number of valid bugs vs reported bugs, number of SEV 1 bugs over a rolling window, etc.). A useful way to think about quality in your software delivery process is by comparing the number of new bugs vs the number of releases over a rolling window. If the release frequency is increasing, but the number of new bugs with releases is growing faster, you have a problem - a false sense of moving fast.
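As a rough illustration of the "new bugs vs releases over a rolling window" comparison, here is a small Python sketch. It assumes you can export the dates on which bugs were reported and releases were made (for example, from a spreadsheet); the function names, the 28-day window and the sample dates are all made up for this example.

```python
from datetime import date, timedelta
from typing import Iterable, Optional


def count_in_window(
    days: Iterable[date], window_end: date, window_days: int = 28
) -> int:
    """Count dates in the window (window_end - window_days, window_end]."""
    window_start = window_end - timedelta(days=window_days)
    return sum(1 for d in days if window_start < d <= window_end)


def bugs_per_release(
    bug_dates: Iterable[date],
    release_dates: Iterable[date],
    window_end: date,
    window_days: int = 28,
) -> Optional[float]:
    """New bugs per release in the window, or None if nothing was released."""
    releases = count_in_window(release_dates, window_end, window_days)
    if releases == 0:
        return None
    return count_in_window(bug_dates, window_end, window_days) / releases


# Made-up data: releases double from one window to the next,
# but reported bugs grow four times.
releases = [date(2024, 2, 5), date(2024, 2, 19),
            date(2024, 3, 4), date(2024, 3, 11),
            date(2024, 3, 18), date(2024, 3, 25)]
bugs = [date(2024, 2, 6), date(2024, 2, 20),
        date(2024, 3, 5), date(2024, 3, 5), date(2024, 3, 12),
        date(2024, 3, 13), date(2024, 3, 19), date(2024, 3, 20),
        date(2024, 3, 26), date(2024, 3, 27)]

previous = bugs_per_release(bugs, releases, window_end=date(2024, 2, 29))
current = bugs_per_release(bugs, releases, window_end=date(2024, 3, 28))
print(previous, current)  # prints: 1.0 2.0
```

In this made-up data, the team ships twice as often in the second window, yet every release brings twice as many bugs as before. That is the false sense of moving fast that this metric is meant to surface.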

Anyways, exact metrics are not as important as the ability to get metrics when you need them. So, track bugs.

Besides tracking bugs, how do we ensure we don't introduce more bugs? It is through proper risk management (non-technical founders should understand this). Lack of risk management in the software delivery process makes the problem exponentially worse at a larger scale. So the sooner you nip it in the bud, the easier it is to scale your team and software delivery process.

What is risk management in software delivery? I love the term safety nets - a way to make software delivery safer. Let's look at some tactics that are safety nets for your engineering team to ship software safely:

Automated quality assurance, Continuous Integration & Continuous Delivery. We have covered this in the "Inability to ship fast with high confidence" section.
Reviewing changes, i.e. pull requests. Have someone else in engineering review new changes for logical bugs, architectural problems, reliability concerns, and security. Systemise this as a process.
Limit the blast radius of changes. You can ensure that not every change impacts every customer immediately. You can release changes to customers slowly. One of the easiest ways to achieve this is by using Feature Flags to limit the exposure of new features or changes. When Feature Flags are insufficient because of engineering complexity, you might have to look at more complex engineering investments like Canary Releases or Blue-Green Deployments.
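To show how a Feature Flag can limit the blast radius, here is a minimal percentage-rollout sketch in Python. It is an assumption-laden illustration rather than how any particular feature flag product works: `rollout_bucket`, `is_enabled` and the allowlist parameter are names I made up, and real tools add targeting rules, kill switches and dashboards on top of this idea.

```python
import hashlib
from typing import FrozenSet


def rollout_bucket(flag: str, customer_id: str) -> int:
    """Map a customer to a stable bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag}:{customer_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(
    flag: str,
    customer_id: str,
    rollout_percent: int,
    allowlist: FrozenSet[str] = frozenset(),
) -> bool:
    """Decide whether this customer should see the flagged change."""
    if customer_id in allowlist:
        # Hand-picked customers (for example, friendly early adopters).
        return True
    return rollout_bucket(flag, customer_id) < rollout_percent


# Start with a private release to one customer, then widen gradually.
early_adopters = frozenset({"customer-42"})
private_only = is_enabled("new-checkout", "customer-42", 0, early_adopters)
everyone = is_enabled("new-checkout", "customer-7", 100)
```

Because the bucket is derived from a hash of the flag and the customer, a customer who sees the change at 10% still sees it at 50%, and setting the percentage back to 0 turns the change off for everyone outside the allowlist without a new deployment.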

All these topics are fairly complex in themselves. Also, there is a lot more to risk management in software engineering (database migrations, database query optimization, security, reliability, performance, etc.), but I am being selective with these for the problem we are discussing, i.e. reducing leakage of bugs to production. I can't cover all of these in this article. Perhaps another one. But if you research these on Google, you will find a ton of wisdom.

Sales and product managers are not able to meet the commitments they make to customers.

Poor planning (already covered previously)
Inability to ship fast with high confidence (already covered previously)
The sales team makes commitments on behalf of the product and engineering team

Briefly discussing the sales team making commitments on behalf of product engineering, I'd just say that it is impossible for someone to make a near-accurate commitment on the deadline of something they will not do themselves. It is not just about the sales team but anyone in the company.

Founders must discourage (or rather, outright stop) their non-tech teams from making commitments on behalf of their tech team. Situations where a commitment must be made immediately will not completely go away (for example, a critical strategic customer, a strategic time-bound partnership decision, etc. Critical opportunities for survival can be time-bound sometimes.), but they must be very few. Deal with them carefully and also work towards reducing such occurrences as much as possible.

The Catch-All - important things don't get done fast enough without pressure, and the founders don't understand why.

I have literally heard this in every job I have ever had and every engagement I have ever taken. I can count on one hand the number of founders who have not had this problem (good for them). As a founder, you understand the urgency of certain things, and you want them done fast for various reasons. But they don't happen. There can be a number of reasons for this. I have tried my best to list them down (but I am sure there are more reasons, and I'd love to learn about them):

Poor planning (already covered previously)
Lack of a sense of urgency (already covered previously)
Lack of customer centricity (already covered previously)
Lack of clear direction and focus (already covered previously)
Lack of context and domain experience (already covered previously)

There is probably a lot more to this. I have only listed the most common problems I have come across with the companies I have worked with and the founders I have interacted with.

Easier said than done

It's easy to say that the founders should do this and that. I fully empathise with their situation and the difficulties of being a founder. It's hard. They surely have a lot on their plate. A lot of it could also feel like jargon. But every time I have attacked these root causes, I have seen teams improve. I also understand that I could be ignorant of the realities of the life of a founder (which is also probably true). But then, building a startup is not easy. My intention behind this article was to help provide a framework rooted in the industry's knowledge and my personal experiences to make the job slightly easier.

I have used the word "systematic" a lot. The reason is that we are discussing The Evolving Job of a Startup CTO (or founder) managing technical teams. When you step away from daily execution, you are not doing everything yourself. When you were involved hands-on, you used first principles or just followed your gut. You can't scale that. So how do you get your teams to execute better, move fast, be quality focussed and meet commitments? You build systems and culture that get your team to operate in a way that helps you succeed as a business without you building the product hands-on at the micro level all the time.

If you are a startup founder needing help streamlining product and engineering, I'd love to chat and see if I can help. If you'd like me to double-click on any of these topics, let me know.

2022 — Year In Review

Vaidik Kapoor Jan 13, 2023

Back again with another personal Year In Review. I am late as always with all my writing commitments, but I do hope to keep up.

2022 was an interesting year in so many ways. Besides the world bouncing back from COVID, a bunch of interesting things happened to me, and a lot of them were firsts. I started the year by ending my long stint of 6.5 years at Blinkit. I was excited about the change and about taking a break and not doing anything at all for a while. It was an interesting year.

The Break

It has been a year since I left Blinkit after my long 6.5 years there. I took a much-needed break to not do anything for a while and to focus on some life goals without worrying about a job. It was an interesting experience. My one key takeaway from this break is that no break is long enough to sleep enough and wake up late, watch Netflix, hang out with friends and just do nothing. In fact, it is addictive, especially after an intense work tenure at a growth-stage company. Friends from Blinkit who quit and decided to take a break would agree 👊

The break started with three of my close friends getting married. What great fun it was raging at all those weddings, with the entire gang coming together after a long time. It was a perfect start!

Break Means Travelling

One of the things that Sonam and I really look forward to is traveling and seeing new places. Not having a job allows you to do that for extended periods of time. So we had to make the most of the opportunity.

Bahrain - Right after our friends' weddings, we headed over to Europe with a stopover in Bahrain to visit my brother Vaibhav and his wife Isha. Our visit to Bahrain had been due for a long time, so we decided to club it with our little Euro trip to Belgium and the Netherlands. It was our first time in Bahrain. Cool place with enough ways to spend money, shop, eat and have fun. Driving around in most Gulf countries is a great experience, so we did that. Speaking of driving, one of the major things we did there was to watch the first Formula 1 race of the season. Bahrain was lit. All you could see everywhere was F1. That's the biggest event the country hosts. So that made our visit even better. As for the race experience, it was our first live F1 race, and we couldn't have asked for anything more. The race was full of action, with a surprisingly great performance from Charles Leclerc and Ferrari, who won the race and defeated the regulars. Full value for money!

Euro Trip (Belgium and Netherlands) - we headed over to Belgium and the Netherlands after Bahrain. We started in the city of Antwerp, where we spent 2 nights, and then drove to Bruges with a stopover in Ghent for a few hours. Bruges feels like it has come straight out of a fairy tale book. Cute little charming city with lots to offer. Bruges is where we ended our time in Belgium. We then drove to Keukenhof for a day trip and then stayed two nights in The Hague. The Hague was a nice stay where we decided to just slow down and relax a bit instead of hopping from place to place for sightseeing. After the much-needed slowdown, we drove down to Delft for a day trip and then stopped in Alkmaar for a night, where we hoped to see the Cheese Market but missed it because we didn't do our research well. They do that on Tuesday every week. Also, there isn't really much to do there besides the cheese market, and it can easily be a day trip from Amsterdam. But the trip was still worth it because we bought some of the craziest cheese there. Our next stop was Giethoorn, another town that feels like it has come straight out of a fairy tale book. They have no roads there, only walking tracks and waterways to get around by boat. Imagine that! We stayed there for a night and then drove to Amsterdam, which was our last destination. We stayed in Amsterdam for 4 nights, had endless. Can't stress enough how beautiful Amsterdam is. And the city has so much to offer. I think our highlight in Amsterdam was the discovery of Gyro.

Goa - Goa just always works out. We traveled to Goa to celebrate my birthday. But this time around, we decided to stay in South Goa (actually south of South Goa) at this stunning property called Cabo Serai. Head over to this beautiful place to stay away from the rest of Goa and have a beautiful secluded beach with backwaters to yourself.

Landour - I had no idea what Landour has to offer. I think it is the ideal hill station destination for me - adequately developed, well connected and yet secluded. It is so charming. It was also a special trip because we went there to celebrate Sonam's birthday and also took Django along. It was our first trip where we saw the town along with Django and took him almost wherever we went, including cafes. Since it was a super busy weekend in Mussoorie, we just stayed in Landour and didn't even care to go to Mussoorie.

Maya's Crest, Kasauli - this is a place we have been trying to go to for a few years. We finally got the chance to do this. This boutique stay has been one of our most unique travel experiences. Besides the fact that the house is beautiful and is situated at just the perfect spot for the most stunning views and hill experiences, the host Usha Hooda is one of the most interesting and inspiring people we have met in our lives. When you stay at Maya's Crest, expect to be hosted by Usha. She ensures that you have a great stay, oversees that you are well fed, and tells you thrilling life stories that keep you hooked. It's not a hotel, it's her house. So you get treated like you are at someone's house. Mastaan (her pet dog) makes it even more homely.

We hoped to travel more. There were two other trips that were meant to happen but got cancelled due to logistical issues and family emergencies. Hopefully, 2023 brings more travel opportunities for us ✈️

Exploring Startup Ideas

After taking the much needed travel and doing-nothing break, I started looking for what should be next for me work wise. There was an itch to startup and build my own company. I joined ODF 14 to pursue that goal. I also explored a couple of ideas seriously.

My first idea was something I felt very passionate about and worked a lot on during my time at Blinkit. But as I explored the idea more, I realised that the space has become overly crowded and it didn't make sense to enter it now.

The second idea, I think, was a solid one. I even built a small POC and spoke to potential customers. But I gave up chasing it, largely due to a lack of clarity about what exactly I want to do in life, which was holding me back from committing to take the leap. This one was a downer but I still learned a lot of things in the process, particularly how to interview customers.

Also, I bought the first laptop in my life with my own money. While I have always known, I really felt it this time - MacBooks are expensive. Buying that shit hurts 😂

Started my own tech consulting practice

When the exploration of startup ideas was not becoming concrete, I started thinking about what else I could do. Some friends doing their own startups reached out asking me to join them. But I was not ready to take a job or even join as a cofounder yet, so I started spending time with them for free. After a while, I realised that advising startups and growth stage companies can be a decent interim way to work and meet interesting people. I had thought of consulting several times in the past. It felt like a good time to give it a try. So I started my consulting practice called Three Ways Consulting.

It has been a great experience doing this. Starting your consulting practice is also building a business so it comes with its own set of challenges. I felt that it would be easy to start consulting with my network. But I did not account for the cold start in consulting and faced a lot of issues initially like getting leads, spending enough time to nurture them, structuring the engagement, pricing, etc.

I had to spend a lot of time reviving my old network and connecting with people I hadn't met in years. I think the key thing that I learned was starting anything is not easy. In case of consulting, you have to like meeting people and networking. You have to build relations and even create value for free to establish authority and get your potential customers to like you. This definitely takes a lot of effort. Besides that, there is definitely a lot of context switching when you work with multiple clients. And context switching is hard.

But hard things aside, I have enough clients that give me a decent amount of business. I like working with them. They are doing interesting things in their companies, solving interesting challenges and I am happy to be a little part of their journey. Another thing that I also really liked was getting back to writing some code from time to time for my clients. So fulfilling 🐱‍💻

Retrospecting on the break and the year

I had taken the break to be able to focus on a few things in life which I otherwise had not been able to. So while there was no professional goal during my break, I had some personal goals.

🚀 Some of the highlights are:

Had a long streak of cooking regularly. I feel slightly more confident in my cooking to impress Sonam (well, sometimes)
Came across Farnam Street and learned about "learning"
Spent time learning Statistics and Data Analysis in Python
Spent more time with Django at home, on walks and playing in dog parks
Spent more time with my Bua
Spoke at multiple tech conferences and events (see some of them here)
Wrote a few blog posts (see some of them here)
Rebuilt this website in Ghost
Regularly provided mentoring sessions to engineers on Plato (although I don't like their business model, anyone up for a collab to build an alternative?)

😔 Things that didn't go so well:

No running, no other health focused initiatives - I wanted to regularly run during my break and I was not able to do it because of the knee injury caused due to running too much (200kms in 17 days) and not taking care of myself. Not being able to run or do any other physical activity generally demotivated me and didn't give me enough strength to keep working on improving my health. In all of this, I learned that ageing is real and you can't brute force your way into everything. It seems to be getting better now and I hope to get back to it soon.
Wish I had read more books - I say this every year. I managed to read only 3 books. This is one of those things that I struggle with. I wonder what the real reason is. Maybe it's not a focus problem but a lack of desire to read. I have to think about this more deeply to find the real reason and break out of the cycle of not meeting this commitment every year.
Left a few projects and blog posts mid way - struggled to find the time to complete a couple of my side projects and complete a couple of blog posts. I hope to complete them in 2023.

People I want to thank

My wife Sonam for being an absolute rock and supporting me through the time when I didn't know what to do after leaving Blinkit. Thanks for guiding me, giving me the comfort to feel that it's okay to not do anything, patiently listening to me and helping me figure things out. I was often very lost but you were always there to make me feel safe. Thanks a lot for that and for a ton of other things that you do. I love you 😘

Our friends Nupur Gupta and Akshat Bansal for being our saviours at the toughest times and for being second parents to Django. Thanks for being there. We are lucky to call you guys friends ❤️ And Django is lucky to have Zorro and Bucky as his buddies 🐕

Our friends Radhika Joshi and Major Dheeraj Bisht for being killer hosts again and making it a fun and memorable new year's eve celebration in Patna (of all the places in the world) 🎉🥂🥃 That Champaran Mutton is to die for!

Heading into 2023

The beginning of 2022 was exciting but soon turned into a lot of unknowns. With a lot of those unknowns resolved, I feel a lot more excited about 2023. Here are a few things that I might do in 2023:

Complete some of those side projects from 2022
Figure out a way to run sustainably by training for it
Cook more often
Read 5 books (last year was 10 books)
Learn a non-technical subject like economics
Attend an international tech conference IRL
Build some product on the side, for the fun of it. Carrying forward from last year. I would like to at least finish one of the projects I started.

That's all folks. Looking forward to an exciting 2023! 🚀


Attending DevOpsDays India 2022

Vaidik Kapoor Nov 21, 2022

Thank god the pandemic is over. I have been attending conferences virtually. But I have been missing the in-person interactions, open spaces and BoF sessions, and being able to meet old friends - all things that make attending an IRL conference a different feeling altogether. I finally broke my streak of not attending IRL conferences by making it to DevOps Days India 2022 which happened in Bengaluru. It was a 2 day event. This post is about my experience of attending the conference.

The conference

I want to appreciate the effort that the volunteers put into organising the conference, which I am sure was not an easy task after so many years. Having organised smaller events myself, I know it is not easy to put together a conference.

For me, the highlight of the event was meeting the DevOps community and old friends in-person after many years. I personally spent most of the time hanging out in the corridors of the venue meeting friends and other attendees - talking shop, learning about all the cool stuff they have been working on and exploring opportunities to work together. Just this part in itself was valuable enough for me to make the trip from Gurgaon to Bangalore.

However, I expected more from the conference in terms of content. I had hoped to learn more from the talks than I did. I felt that either the topics were not relevant or fresh enough, or the speakers did not focus enough on business value and so failed to communicate their ideas. For example, topics like Google SRE practices have been covered so many times. Other topics were introductory in nature.

The talks in themselves were not as bad. But they probably don't cater to an audience of different experience levels. An event like DevOps Days should cater to a wider range of audience. It would be nice to see multiple tracks like beginner, intermediate and expert in subsequent events so that beginners as well as practitioners at all levels get to learn something.

What was everyone talking about

Like I said, the most valuable part for me was the conversations I got to have with people, discussing a range of topics from DevOps and cloud-native technologies to the pricing of SaaS dev tools like GitHub and GitLab. Some of the conversations I would like to highlight are:

Difficulty of adopting Kubernetes as a common theme - Kubernetes is great. But it's also hard. The time to value can be extremely high and is often not considered enough by engineering leadership and platform engineering teams, so much so that it might not even be the right solution for many teams. There can be other viable solutions for different businesses at different times - HashiCorp Nomad, ECS or not doing containers at all.
The best of teams have technical debt in their platforms - we assume that the best of businesses have the best of tech but that might not necessarily be true. While working for a client, I made the mistake of assuming that a certain really large (unicorn) fintech company in India will have all their infrastructure problems solved. After speaking to engineers on their team, I learned that's not the case. The experience of deploying and operating applications on their internal Kubernetes platform was far from ideal, fairly complex and often broken. I had a similar experience when I spoke to an engineer working at a really large global HR tech company. I assumed that they would have a rock solid Kubernetes platform already deployed whereas the reality is that they have been struggling to break their monolith into microservices and migrate to Kubernetes. The best of businesses have interesting technical challenges but also an equal amount of technical debt, and sometimes fairly basic in nature.
Challenges of testing changes in microservices - microservices are great as long as you know what you are doing, and most of us don't know what we are doing. I have spoken several times on this topic and my belief just keeps getting stronger as I talk to more people about it. In a conversation with some friends at other companies, the topic came up automatically while discussing Kubernetes and microservices. Engineers experienced in working with microservices often bring up the topic of doing microservices the "right way" and how most teams struggle with testing changes in microservices. Theoretically, a change in a microservice should be independently deployable but the reality is usually different. In reality, changes in a microservice can break the application if not tested properly, which defeats the purpose of microservices. Testing methodologies like contract testing (using tools like Pact.io) can be helpful in theory but are hard for most developers to understand in reality. Small teams built by carefully hiring experienced and smart developers can deal with the complexities of microservices. But microservices are often deployed in large environments with large teams working on them - and their experiences of adopting microservices are different from "ideal teams". In such situations, testing all microservices as a whole using business facing end-to-end API tests is one of the best bets for continuous delivery. But here we are - treating our microservices backend as a Distributed Monolith.
DevRel is an upcoming role - it was great to see so many folks who are now devrels at companies building something in the space of devtools, cloud-native technologies and infrastructure. The role is still fairly new and the folks in the role are fairly young as well. They have great energy and charisma to engage with the beginner community and seem to be using it fairly well to their advantage. But at the same time, they need more technical experience to engage with the larger community and connect with more experienced audiences as well. In a conversation with a friend active in this space, we discussed how to help accelerate growth of devrels. The idea that I found most valuable was to have young devrels do a lot of customer focussed work like demos, solution engineering, webinars and customer support. The customer facing exposure can accelerate the learning and getting the relevant experience to connect with more mature audiences.
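Since contract testing came up in the conversations above, here is a rough sketch of the core idea for anyone who finds it hard to picture. It is hand-rolled and deliberately not the Pact API; the contract, the field names and the two provider functions are all hypothetical. The consumer writes down the fields it relies on, and the provider's build checks its own responses against that list before deploying.

```python
from typing import Any

# Hypothetical contract: the fields (and types) a consumer service relies on
# in the provider's "order" response. Everything here is made up for illustration.
ORDER_CONTRACT: dict[str, type] = {
    "order_id": str,
    "status": str,
    "total_paise": int,
}


def contract_violations(contract: dict[str, type], response: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the contract holds."""
    problems = []
    for field, expected in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            problems.append(f"wrong type for {field}: expected {expected.__name__}, got {actual}")
    return problems


def provider_v1() -> dict[str, Any]:
    # Stand-in for the provider's current response. Extra fields are fine.
    return {"order_id": "A-1", "status": "PAID", "total_paise": 49900, "notes": ""}


def provider_v2() -> dict[str, Any]:
    # Stand-in for a proposed change that renames a field the consumer uses.
    return {"order_id": "A-1", "status": "PAID", "total": 49900}


if __name__ == "__main__":
    print(contract_violations(ORDER_CONTRACT, provider_v1()))
    print(contract_violations(ORDER_CONTRACT, provider_v2()))
```

The point is only that the provider can catch the breaking rename in its own pipeline without spinning up the consumer. Real tools add a lot on top of this (sharing contracts between teams, versioning, request matching), which is part of why they feel heavy.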

Conclusion

It was great to be attending conferences IRL again - what a fresh change from regular work and an amazing opportunity to meet new people. Attending DevOps Days has pumped me up to get more active in the community again. Looking forward to attending more community events in the near future.


Team Goals Over Individual Goals

Vaidik Kapoor Jul 16, 2022

In a conversation with a senior engineer at a multi billion dollar successful Silicon Valley tech company, the likes of which we cite in case studies and conversations, I observed a major anti-pattern in how work happens on the team to meet business goals.

I don’t get it. I have come across enough people in tech who work in companies that expect engineers to have individual OKRs, which is fine. But they don’t have OKRs at team level. Team is merely a construct for grouping by function. But the team does not really work together.
— Vaidik Kapoor (@vaidikkapoor) July 14, 2022

We were discussing a challenge this person was facing on their team. The team was under these circumstances:

the team is a platform team
mostly firefighting because they are understaffed (3 engineers)
super experienced team with only staff level engineers
the team does not have a dedicated engineering manager and the director is sort of managing the team with multiple other priorities
responsible for a critical project (let's call it Project X) of de-risking a company-wide reliability risk; the project is operationally intensive, requires coordination and alignment of other teams for success

They had recently hired 2 junior engineers after struggling to hire for a few quarters. So this person was looking for advice on how their team should approach their work now, since their long running "bandwidth" problem seems to be getting better. While discussing possible options, we arrived at the point of managing firefighting. Here is how the conversation went (reworded):

Vaidik: So here is how you should think about work now. Firefighting is going to continue until you guys hire more developers because your team is heavily understaffed and you have to support other teams. Support work is not going to stop really. So budget some bandwidth for that. Then, obviously try to finish Project X first because it is already delayed and it is probably best for your team's morale and reputation to just get it over the line. If there is any bandwidth left, then perhaps use it for one of the new initiatives we discussed.

The Person: Yeah that makes sense. Project X is something that we really don't like talking about any more because everyone keeps asking us about it. And it's delayed big time. Honestly, the person on our team who is responsible for Project X is struggling with it. When we realised this, we decided to help him out with some of the tasks. We would do long calls for several days together to figure out ways to speed up progress. But then, he would get distracted by volunteering to do other work, like writing postmortems for incidents that he was involved in but not primarily responsible for, and that additional work was totally avoidable.

Vaidik: Interesting. What happened then? Did this not get discussed any time? Maybe in 1-on-1s or team meetings?

I am assuming at this time that there is nobody to give regular feedback to that individual because there is no manager.

The Person: Not really.

Vaidik: Hmm. Do you guys do retrospectives as a team? Do you do any kind of planning on a regular basis? How does your colleague get the feedback to correct this behaviour?

The Person: No. We don't do that. And we just decided to help our colleague.

There are many companies out there where teams are nothing more than a functional grouping construct for management to get a certain kind of work done. But they hardly work like a team.

Here is how work happens in such settings:

Goals are assigned to a "resource" on the team
Of course, these are organisations that value "developer autonomy". So developers can also volunteer for a goal or suggest what they would like to work on.
The team meets and talks, obviously, or nothing will get done. But for the most part, every individual has their own goals to chase.
Some goals are more challenging than others, obviously because the goals are decided to solve business problems. Some problems can be more hairy than others.

What's the impact of this?

Challenging projects have higher risk of not getting done on time or not meeting acceptable level of quality because there is just one person responsible for getting them done. So progress will be usually slower as compared to what the team collectively is capable of doing.
When critical projects don't get done, the team as a group takes the blame because the team is more visible to the organisation than the individual.
Developers on the team think in the box of their own capabilities and aspirations instead of working as a team to solve business problems. No bad intentions here. Their company has put them in this framework. This is just how things happen.
Feedback is delayed. The team does not really work like a team that cares about the goals they want to achieve together. So there are no frequent plan-do-check cycles. There is no time to retrospect on how the team is progressing against their goals and what is not working - a strategy, tactic or even a behaviour or practice.
People are nice. When they see someone struggle, they try to go out of their way to help. But that is a matter of chance. What if everyone is struggling? How does that change?

I recommended that he think from the perspective of the team and what is impacting it. If the team's external reputation is hurting the team, then they should work as a team to protect it and solve everything that comes in the way. There is nothing wrong with individual OKRs, but business priorities should be a concern for the entire team and not just an individual.

For the person I was talking to, it's a challenging situation to navigate, especially in absence of a dedicated manager and when the precedent has been set. The only way out is to raise this concern with the director with the intention to make changes in how the team works and take one step at a time.

How companies become successful despite following anti-patterns of how to run teams and organisations — a digression

Having witnessed both failed and successful companies from the outside, knowing how a few of them work internally, and further overlaying the best practices I have learnt from leaders in our industry, I often end up being confused about how some companies end up being successful even when they don't follow the best practices. It is quite frustrating honestly because things just don't add up. It's hard to make sense.

I guess there are a lot of factors that make a company successful:

Problem Space, Market & Timing - the business is about something that is bound to grow, despite having poor systems, ways-of-working and organisational practices. Many examples come to my mind, but to name a few problem spaces - e-commerce, payment wallets, edtech. Basically, any new big wave is resistant to bad practices to a certain extent if the company gets something right. If the company does the job of putting a useful product out there and there is enough demand for it (organic or due to external factors like demonetisation, regulations, a pandemic, etc.), the business still takes off and makes money.
Founders - their intelligence, drive and determination sometimes outperform the entire company. These companies are essentially founder run companies. Employees are mostly cogs in the machine.

In both the cases above, imagine what such companies could have done if they had a better organisation that harnesses the collective intelligence of their workforce and enables the employees to do more. But I understand that this is like saying "what if they had more money" or "better engineers". The realities of how businesses are built are often far from ideal. No company gets everything right. A lot of successful companies get a few things right - but maybe not how to best run their teams.

While there is no one proven way to build a successful business and the importance of good practices can always be debated when you have many successful businesses that do the exact opposite and follow many anti-patterns, I would still argue that organisations that follow good organisational practices get to better outcomes faster with happy employees.

The quest to help build great teams and organisations should go on for all leaders, despite these contradictory cases.


10x engineers or 10x impact?

Vaidik Kapoor May 23, 2022

Hiring 10x engineers is hard for most companies. It’s a tough battle out there for talent, sometimes an endless chase when every company on the planet (including FAANGs) is fighting for the same talent. In times of such tough competition in the hiring market where 10x engineers are high in demand and extremely short in supply, hiring 10x engineers can become a game of luck. So how should most companies approach building their team?

Why do we need 10x engineers

For 10x impact. That’s what we really care about. That 10x impact is the goal. Hiring is a way to get to that goal. Unfortunately, many companies who manage to hire 10x engineers often fail to retain them. They get their 10x impact in the short term but get stuck in the loop of chasing 10x talent over and over again because they have a leaky bucket of employee retention. And they don’t get lucky every time.

Eventually, failing to retain and hire 10x engineers leads to making compromises like adding more heads in the team to work on more things in parallel, hiding incompetencies by putting too much engineering process, and squeezing valuable time of CTOs because they have to jump back in to solve problems (CTOs should absolutely solve problems and get hands-on if necessary but should have the time and mind space to spend on other business problems as well).

What do 10x engineers need

10x engineers need a sense of purpose and an environment to thrive in to be able to create that 10x impact. Their sense of purpose drives them to do everything they do. Even without organisational support, they make things happen. But working without organisational support can go on only for so long as it can be taxing. Failure on leadership’s part to make it easy for your most talented engineers to function will lead to them burning out, eventually leading to their exit.

So to hire 10x engineers and to make them successful, you need to create a supportive environment where they can easily work and not feel dragged when they are trying to get anything done.

Create 10x impact

If 10x impact comes from having 10x engineers on the team, you want to turn that luck of hiring 10x engineers into a deterministic process that delivers results. But that’s still solving the problem by hiring, and that’s not a fast process (more on this later). So how do we create 10x impact today?

The answer, unfortunately, does not involve getting results immediately. At best, you can improve results today and prepare for the long term. It’s not everything we desire but it is better to be certain of the future than leaving things to chance.

Things that come naturally to 10x engineers can be taught to other engineers by giving them the right environment, guidance and opportunities.

10x engineers demand (directly or indirectly) a great culture and work environment. A great environment and culture, even though demanded by 10x engineers, is good for everyone in the team. It creates the time and space for engineers to learn the skills to solve tough problems. So a baseline requirement to create 10x impact is having a supportive environment where it is easy for engineers of any skill level to learn and solve problems.

The first step to creating an environment where high impact work can happen is, well, acknowledging that there is high impact work to be done. 10x engineers can naturally identify the kind of work that they must do that leads to high impact. In their absence, identification of such opportunities still needs to happen. CTO or whoever is the best person in the team today should prioritise identifying such projects that can lead to high impact.

The second step is to create structures that make it possible to systematically work on these high impact projects. Given the time and clear mandate to solve high impact problems, any engineer can execute. They will probably take longer but they can get there. That is a much better situation than not having anything happen and just waiting for 10x engineers to join your team. Although, making such structures and investments requires alignment with the rest of the organisation, which can be achieved if you can prove that your high impact projects remove big bottlenecks.

There are several benefits to this approach. The immediate benefit is that you have made progress instead of just waiting for new engineers to join. The long term benefit is that you have systematically started providing exposure to your existing engineers to train them on identifying and solving high-impact problems. Many companies that have abundance of 10x engineers did not always have 10x engineers. In fact, they systematically made it possible for engineers, who wanted to create an impact, to do high impact work and prime themselves to do more of it in the future.

An approach that I particularly like is to set an explicit expectation for senior engineers to identify and deliver high impact projects that act as force multipliers for their teams. This kind of work can include things like driving adoption of new practices, improving ways of working, adopting new technologies, fixing teething system issues that come in the way of shipping, etc. Setting explicit expectations upfront helps engineers identify their purpose in the organisation and work towards that expectation.

Building a 10x team

So create an environment for 10x impact work to happen. Acknowledge that there is high impact work that needs to be done, and then create the time and space for engineers to do that kind of work, even if they are not 10x engineers. With time, they will get there. Even if they become 5x engineers (if that’s a thing), that’s a much better place.

All said and done, you should still keep looking for those 10x engineers. That should never really stop. But being able to create a healthy environment will help you grow your engineers to be able to continuously do high impact work. This will also help your 10x hires be successful when they join. In fact, this can also become the reason why 10x engineers will want to join your team.


2021 — Year In Review

Vaidik Kapoor Jan 4, 2022


I have been thinking about writing a personal year in review for the last two years but somehow never really managed to get past the laziness and make it happen. Even though 2022 has already started, it’s not too late to get this done, especially when I have been trying to write more actively (more on this later). So here is how 2021 went for me.

Can’t go on without saying that 2021 was a tough year for everyone. My heart goes out to those who were affected by the wave of COVID. Time will help everyone move on hopefully and I truly hope that the pandemic comes to an end. Just when we believed that things were getting back to normal, we are at the brink of another wave hitting us in India while some other countries are already seeing a massive surge in cases.

For me and my family, we took all the precautions, but honestly we were super lucky to have been spared by the first two waves. Thankfully, nobody in the family was infected with COVID.

The silver lining in all this was more time with Sonam (my wife), my friends and Django (my dog who was a great addition to life in 2020). It was also a year of long-hair :D

Work-wise there were ups and downs, but overall great experiences of tackling new kinds of problems (from org building, to roadmapping, to compliance) and solving them systematically. There is not much to complain about, really.

Health & Running

Working from home and avoiding all the wasteful travel time to get to the office did create some time for me to focus on things that would always sit on my list. Health became a priority.

I started eating healthy home cooked food and started running. I could run a few KMs when I was in school but never really tried long-distance running properly. I discovered that I can run long distances continuously and I really enjoy it as well.

Here is a short summary of my running streak:

Ran my first 5K in December 2020
Ran my first 10K in January 2021
Best 5K in 25:26 min. Average pace of 5:05 / KM
Best 10K in 56:06 min. Average pace of 5:30 / KM
Ran first 100KM-in-a-month in January 2021
Ran another 100KM in March 2021
Ran first 200KM-in-a-month in April 2021. Covered last 100KM in 7 days.
Ran a total distance of 670KM in 2021.
I was active on 102 days in 2021. Not sure how this happened but I trust Strava :D

That’s a good distance and a decent speed, and a great achievement coming from a place of doing nothing. Unfortunately, my running streak ended in April with the second wave of COVID. Since then, I have been trying to get back but haven’t been as regular. I’d like to get better at running this year but, more importantly, be regular.

Lucky to be able to travel

Both Sonam and I love travelling. Before COVID, we would be out for getaways almost every other month. Naturally all of that reduced because of COVID. But I am happy that we were able to do some interesting things (here are some ideas for you) even under constraints of traveling responsibly with masks and practicing social distancing:

Khem Villas, Ranthambore — we travelled to Ranthambore to celebrate our wedding anniversary. Ranthambore was a great trip. The highway is amazing. We stayed at Khem Villas. It was a decent experience but got limiting really soon, given that they only serve vegetarian food in a buffet and there is no option to order a la carte. Ranthambore, however, is beautiful and the thrill of a safari to spot a tiger is just amazing. We were unlucky that we didn’t get to spot a tiger, but it was a great experience nonetheless. We are going back there and to other places for a safari experience!

Amrit Bhawan, Haridwar — this time we travelled to celebrate my birthday. India was just out of the second wave so we were scared and had to be really careful, but we also needed to get out. So this was a staycation. We stayed in a neighbourhood in Haridwar which was far from Har Ki Pauri, so we were away from the crowd. This place used to be the holiday home of a business family and is still owned by them, but has now been converted into a BnB. The most amazing thing here is that their garden opens directly onto the river. Super peaceful! Also, great food but only vegetarian. This was a good break!

Here is a travel hack for a staycation: carry your Google Chromecast! :D

Fort Ahilya Heritage Hotel, Maheshwar — hidden gem! We were here for Sonam’s birthday. There is nothing to do in Maheshwar by itself except enjoying the ghats, boating and enjoying this property. But the ghats are beautiful, boating is a serene experience, food at the hotel is limited but delicious, and Maheshwar is not overcrowded or touristy in an irritating way. You can venture out to nearby places if you have the time, but we just decided to stay in Maheshwar and avoid crowded areas.

Euro Trip (Paris, Prague, Mallorca, Barcelona) — a bachelors’ trip for two of our friends, Jitender and Praveen, who are getting married (not to each other). Highlights of the trip: drinks, food (special mention for Raclette, thanks Pragun for the experience), parties and scuba diving. But most importantly, it was amazing to be traveling with my friends after a really long time. Discovered my love for Paris and Barcelona, two beautiful cities. I wish we had more time to really see Mallorca. It needs another trip.

Reading Books

I continue to be terrible at reading books. Super slow. I wish to be better at it. Here are some of the things I read or attempted to read:

Measure What Matters, by John Doerr — we practice OKRs at Grofers and I was part of setting that practice up but I never really read one of the most cited books on OKR. Managed to finish it.
The Manager’s Path, by Camille Fournier — read it again actually. I felt the need to go through some parts of it again. Ended up reading the whole thing again to go over some of the nuances of technical management.
Flow, by Mihaly Csikszentmihalyi — started reading it. Seems like a good work on achieving happiness and flow but I have not really been able to stay at it.

Not good enough. It would be great to get to one book a month as a goal. Let’s see how 2022 goes!

Community Presence

I was a lot more involved with communities this year, even though in some of the conferences I was mostly just there as a participant. Here are some of the events where I could participate in a meaningful way:

Talk reviewer and mentor at PyCon India 2021 — I was a volunteer for the talk selection process to give feedback on the submitted proposals and give feedback to speakers on their presentation.
Speaker at DevOps Enterprise Summit (DOES), Las Vegas 2021 — The DOES ecosystem means a lot to me. This community of DevOps practitioners has taught me so much and led me to knowledge and ideas that have been super helpful at work. From the books that have come out of this community to the kind of real stories that they share, there is so much to learn here. I attended DOES for the first time in 2020. This was my third DOES and I was a speaker this year. It felt like quite an achievement to be able to present the Kubernetes story at Blinkit (formerly Grofers) in this community. You can watch the full recording here:

Speaker at Agile India 2021 — Another great conference this year was Agile India. I attended Agile India for the first time in 2020 although I have been trying to do this for a few years. But I was lucky that I got accepted as a speaker to speak about the challenges we faced and solved for in deploying microservices at Blinkit (formerly Grofers). You can watch the full recording here:

Speaker at Sequoia EPD Studio — EPD Studio is a closed community for tech leaders in Sequoia portfolio companies. Thanks to Jacob Singh for inviting me to speak. I spoke about “Pragmatic DevOps — an appeal for tech leaders to stop chasing best practices and be careful about their tech decisions”. It was a 30-minute talk followed by a Birds-of-a-Feather session with some really good discussions. Unfortunately, the recording is not public so I cannot share it here. But this is something I hope to speak about at some DevOps conference this year.

Apart from these conferences, there were tons of panel discussions on IT modernisation and DevOps that I was invited to and attended. It was a first and a mixed experience.

But overall, I am proud that I was able to share some of the learnings we had at Blinkit with the community. And I am also proud that I pushed myself to get out more in front of an audience and share my experiences.

Blogs this year

I was generally more active this year with blogging as well (relative to last year). Here are some of the posts that I wrote:

Learnings From Two Years of Kubernetes in Production — this post got a lot of attention from the community for the practical wisdom compressed into it.
Managing key-values in Consul using ConsulKV CRD
Technology Vendor Risk — Check the Product Strategy and Roadmap

It’s not a lot. But when compared to 2020, when I wrote just one blog post, it’s a decent improvement.

There are unfortunately two blog posts that are 80% done and I haven’t been able to work on them to get them published. I would want to change this in 2022.

Things I might do in 2022

Simple things that make life better and more enjoyable:

Be regular with running, try to hit a weight mark consistently
Read 10 books
Keep up the writing, speaking and community game
Spend more time training our dog Django. Would be great to have the confidence to walk with Django without the leash
Spend more time writing code to keep myself updated with what’s new and not get rusty
Learn financial modelling for tech management
Learn value stream mapping
Build some product on the side, for the fun of it

Bye-bye Blinkit!

One of the big changes that I am making this year is that I have decided to move on from Blinkit. Having spent more than 6 years at a startup and having gone through the grind of building and then re-building and then re-building again was a humbling experience to learn what it takes to build products, technology, teams and a business.

Blinkit is now building quick commerce retail for India which is super exciting. The growth in the past few months has been insane and I am confident that it is only going to get bigger from here.

But for me, I think my time has come to an end. I need a break for a while to explore what I should be doing next.

I am lucky to have been part of this journey and even more lucky to have met all the awesome people. A large part of what I got to do at Blinkit wouldn’t have been possible without the support of my peer group and my teams. You are all A-players. Keep kicking ass!

I have truly learned a lot from this journey. And I wish the fine folks at Blinkit good luck!

Life is hard to imagine without Blinkit. But I am also equally excited to see what comes next.

There is a lot that didn’t go well for the world in 2021. That aside, I feel I was able to make the most of it, both personally and professionally. I look forward to a better 2022 for everyone.

Wishing you and your loved ones a very happy new year!

Stay safe, stay healthy!


Managing key-values in Consul using ConsulKV CRD

Vaidik Kapoor Jun 10, 2021


We have been deploying applications on Kubernetes for over two years. We mostly followed a lift-and-shift approach while migrating to Kubernetes. We looked for everything that Ansible used to do for us and tried to replicate it in Kubernetes. At first, everything seemed to work. But over time we realized that a simple lift-and-shift approach can complicate things, and that adopting a more Kubernetes-native approach from the beginning can help avoid rework later. We will talk about one such challenge in this blog post: our Consul KV based approach to configuration management, which we chose over using ConfigMap.

We have often been asked if we write our own CRDs, operators and controllers. In this post, we also attempt to demonstrate how we use these patterns.

ConfigMaps might not be enough for your needs

In our legacy Ansible based setup, our playbooks would orchestrate the following high-level tasks on an EC2 instance to deploy an application:

Pull the source code or the artifact/binary
Generate the configuration file using an Ansible template. Configuration was managed using host and group vars in Ansible. Secrets were managed in HashiCorp Vault and pulled at runtime by Ansible. This enabled a clean way of managing environment-specific configuration and secrets without repetition.
Start the application’s process

Everything was committed in Git repositories. We liked this way of working.

The most common answer to configuration management in the Kubernetes community is using ConfigMap and Secret and storing all configuration using environment variables. We explored this approach and it would have made a lot of sense if we were starting from scratch. But it’s hard to get legacy applications to follow all the modern practices. For us, migrating from configuration files to environment variables for all the complex configurations would have been a costly migration.

The cost of migration aside, an important workflow that we feel is missing in the ConfigMap and Secret approach is the ability to reload pods when a ConfigMap or a Secret is updated. Rotation of secrets is also completely missing as a feature.

Consul for Configuration Management

We started looking at other options that helped us reduce the complexity and the cost of migration. In our configuration management setup on Kubernetes, we wanted to follow similar high-level semantics as we did in our Ansible setup:

Keep configuration files templated using some kind of templating language
Keep environment configuration committed in code as well and repeat the configuration code as little as possible.

Since we already had Consul in our stack, we decided to use:

consul-template for templating configuration files
Consul’s KV feature for storing configuration that consul-template would use while rendering templates.
To keep Consul’s key-values in git, we decided to use git2consul.

This worked well initially. However over time, it started adding complications in our developer workflows and CI pipelines. The issues were rooted in the entire deployment workflow, in that it was not declarative any more. It was these two steps at a high-level:

Run git2consul to sync key-values in git to Consul. This would involve additional logic for environment specific handling.
Run kubectl apply to apply the manifests

Specific challenges in this approach were:

Lack of support for namespacing in git2consul. This made provisioning new on-demand non-production environments hard. Conflicts were inevitable as namespacing could not be enforced.
Git2consul has an unintuitive execution model. You have to tell git2consul the git repository where the key-values are committed. Git2consul then clones it and syncs the key-values to Consul. This is unnecessarily slow, especially for developers as they already have their git repositories cloned. We ran git2consul in the background to parallelize and speed up. But then we started seeing typical async workflow related issues (debugging was hard, observing for completion was even harder, etc.).
To avoid conflicts to an extent (and we couldn’t go beyond a certain extent), we had to centralize all the configuration in a repository. So while our microservices were in their own repositories, their respective configurations were centralized for coordination, making developer experience unnecessarily complicated.
We had to hack namespacing in consul for provisioning new environments by keeping all our key-values at a key as source that will be cloned by a script while creating a new environment. This was a hacky orchestration which made our CI pipelines slow and hard to reason about. Debugging configuration management related issues was getting hard. But most importantly, developers could not test configuration changes from code as new environments were seeded in with configuration already there in Consul.

Our learning from this experience was that rushing to move applications to Kubernetes without thinking through the entire software delivery architecture made problems worse later.

Simple Reliable Workflows Using CRD

To make the development, continuous integration and deployment process consistent and easy to reason about in developer workflows and in automation, we had to make every detail of deploying an application consistent with how any other object is deployed on Kubernetes — make it declarative, idempotent, eventually consistent.

Essentially, we want our applications to be deployed with just one command: “kubectl apply .”. This as a principle is important for us to be able to rely on any other cloud-native tooling like kustomize and skaffold.

We realized that we can replicate git2consul’s functionality in a more Kubernetes-native way. We started working on a CRD and after a few iterations arrived at the following API:

This ConsulKV object will create the following key-value paths in Consul:

Where “kubernetes/” is a configurable prefix to ensure that all the keys managed by ConsulKV CRD are kept isolated from other users of Consul in our infrastructure.
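As a rough illustration of that layout, here is a minimal Python sketch. It assumes keys end up at `<prefix>/<Kubernetes namespace>/<object name>/<key>`, which is inferred from the prefix and the namespacing described in this post and is not taken from the actual consul-kv-crd code. The function name `consul_paths` and the sample keys and values are hypothetical.

```python
from typing import Dict


def consul_paths(prefix: str, namespace: str, name: str,
                 data: Dict[str, str]) -> Dict[str, str]:
    """Map the key-values of one ConsulKV-like object to full Consul paths.

    Assumed layout: <prefix>/<namespace>/<name>/<key>.
    """
    # Strip stray slashes so "kubernetes/" and "kubernetes" behave the same.
    base = "/".join(part.strip("/") for part in (prefix, namespace, name))
    return {f"{base}/{key.lstrip('/')}": value for key, value in data.items()}


# Hypothetical object: the keys and values below are illustrative only.
paths = consul_paths(
    prefix="kubernetes/",
    namespace="grofers-namespace",
    name="grofers-dot-com",
    data={"database/host": "db.internal", "feature/enabled": "true"},
)
for path, value in sorted(paths.items()):
    print(f"{path} = {value}")
```

Because the namespace and the object name are both part of the path, two objects with the same keys in different namespaces (or with different names in one namespace) cannot overwrite each other.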

What are the benefits of this:

Simple to use — Doesn’t this look a lot like ConfigMap? The design of ConsulKV CRD was intentionally kept similar to ConfigMap to keep the developer experience as simple as possible. Developers don’t need to install anything other than kubectl. No more git2consul.
Logical namespacing in Consul — logical namespaces in Consul are created on the basis of Kubernetes namespace and name of the object. Kubernetes namespace based namespacing is helpful for setting up the same application in multiple namespaces (say, for development or CI pipelines). Object name based namespace is helpful to ensure that two different objects in the same namespace don’t end up with conflicting keys in Consul.
Works with cloud-native tools like Skaffold and Kustomize — for example, in Skaffold’s dev mode, developers can change key-values in ConsulKV manifests and just expect the values in Consul to be directly changed without any manual intervention, making for a much better developer workflow. Achieving GitOps with tools like ArgoCD is now possible. Compatibility achieved by using CRDs with cloud-native tools is a great benefit that simplifies a lot of things and makes everything just work.
Achieve DRY with kustomize — common KVs across environments don’t have to be repeated. Since it’s just another Kubernetes object, you can use kustomize to override KVs specific to those environments.
Simpler access control for Consul — developers don’t need to write to Consul directly anymore. You can depend on access control features in Kubernetes to expose Consul to your developers in a secure way.

Of course this comes with the consideration of how to namespace actual usage of paths in your applications. There are possibly two ways you are using Consul:

Consul client libraries in applications — in this case, you will have to make your application configurable so that it prefixes the keys in all API calls to account for the Kubernetes namespace, which can be passed via an environment variable.
envconsul or consul-template — both support a “prefix” option that can be used to prefix all the paths being read with <namespace>/<consul-kv-object-name>. In the above example, it will be “grofers-namespace/grofers-dot-com”
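For the client-library case, the prefixing can be a small helper that every lookup goes through. The sketch below is only an illustration: the environment variable name `KUBERNETES_NAMESPACE`, the `root` default of `kubernetes` and the function name `consul_key` are assumptions, not part of consul-kv-crd. In Kubernetes, the namespace could be injected into such a variable using the Downward API.

```python
import os
from typing import Mapping, Optional


def consul_key(key: str, object_name: str,
               env: Optional[Mapping[str, str]] = None,
               root: str = "kubernetes") -> str:
    """Build the namespaced Consul key an application should read.

    The namespace comes from an environment variable (hypothetical name),
    so the same code works unchanged in every namespace it is deployed to.
    """
    env = os.environ if env is None else env
    namespace = env.get("KUBERNETES_NAMESPACE", "default")
    parts = [root, namespace, object_name, key]
    # Skip empty parts so root="" yields "<namespace>/<object>/<key>".
    return "/".join(p.strip("/") for p in parts if p)


# The same lookup resolves to different paths in different namespaces.
prod = consul_key("database/host", "grofers-dot-com",
                  env={"KUBERNETES_NAMESPACE": "grofers-namespace"})
ci = consul_key("database/host", "grofers-dot-com",
                env={"KUBERNETES_NAMESPACE": "ci-run-42"})
print(prod)
print(ci)
```

Passing `env` explicitly keeps the helper easy to test. In an application you would normally omit it and let the helper read `os.environ`.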

consul-kv-crd is written in Python and built using Zalando’s kopf framework for building operators which makes it super easy for teams who speak Python to adopt a more Kubernetes way of doing operations.

The implementation itself is quite simple. And we hope to make it open-source some day.
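The actual implementation is not shown in this post. As a hedged illustration of the declarative, idempotent idea only, the core of such a controller could be a function that compares the desired key-values with what is currently stored and computes the writes and deletes needed. In this sketch a plain dict stands in for Consul, and the names `plan_sync` and `apply_plan` are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_sync(desired: Dict[str, str],
              actual: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (keys to write, keys to delete) to move `actual` to `desired`.

    `actual` should contain only the keys under this object's own prefix,
    so that unrelated keys in the store are never deleted.
    """
    to_put = {k: v for k, v in desired.items() if actual.get(k) != v}
    to_delete = sorted(k for k in actual if k not in desired)
    return to_put, to_delete


def apply_plan(store: Dict[str, str], to_put: Dict[str, str],
               to_delete: List[str]) -> None:
    """Apply a plan to the stand-in store (a dict instead of Consul)."""
    store.update(to_put)
    for key in to_delete:
        store.pop(key, None)


# Stand-in for the keys currently stored under one object's prefix.
store = {"app/db_host": "old-db", "app/stale": "1"}
desired = {"app/db_host": "new-db", "app/timeout": "30"}

first_plan = plan_sync(desired, store)
apply_plan(store, *first_plan)

# Running the sync again finds nothing to do: the operation is idempotent.
second_plan = plan_sync(desired, store)
print(first_plan)
print(second_plan)
```

The point of the design is that the manifest describes the end state, and applying it once or many times leaves the store in the same condition.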

What did we really learn?

When we started moving to Kubernetes, a lot still had to be figured out, even by the community, about all aspects of operating applications in production. The process was painful but taught us a lot about semantics and designing processes on Kubernetes:

Stick to the standard way every time possible. In our case, the best thing would have been to just use ConfigMap and Secret to avoid all the complexity we encountered later. Our challenge was dealing with legacy.
If you cannot stick to the standard, then use solutions that are easy to observe and reason about. In our case, the git2consul model was just not the right solution. It was slow and there were no easy ways to speed it up. No hard feelings for git2consul, but we feel it was just not built for our needs (and perhaps these needs are common for other teams as well).
Try to be as declarative as possible. Use kubectl as your tooling to do everything. Best is to use “kubectl apply” to tell the cluster the eventual state you desire and walk backwards to let applications eventually stabilize.
If being declarative is not possible or too costly in the short term and you absolutely must use an imperative process, prioritize speed and observability. Avoid async tasks in imperative workflows that could be hard to monitor from things happening inside Kubernetes (for example, should we wait for git2consul to finish before attempting to start the process?).

Giving developers a single pane to manage everything

Now our configuration sits right next to all other Kubernetes manifests in our application source code repositories. Developers can work with Consul the way they work with all other objects in Kubernetes.

This is a pattern that we are trying to follow with as many developer workflows as possible. An example of this is legend, a CRD for standardized Grafana monitoring dashboards for applications deployed on Kubernetes.

What’s next?

consul-kv-crd is now being used in every new application getting deployed at Grofers. While it has simplified our legacy git2consul based setup, we are excited about moving to consul-template based hot-reloading of configuration changes when KVs update in Consul and rotation of secrets using leases in Vault.

We have not open-sourced consul-kv-crd yet. We are curious to learn about what the community thinks about this approach and we would be happy to open-source it if we gather enough interest from the community.

Interested in cloud-native technologies? Did you know we are hiring?

If this kind of work interests you, apply to work with us through our careers site or feel free to directly reach out to me on LinkedIn, Twitter or just drop me an email.

Vaidik Kapoor is the VP Engineering (DevOps & Security) at Grofers.



Technology Vendor Risk — Check the Product Strategy and Roadmap

Vaidik Kapoor Jun 6, 2021


Every company today most likely has vendors in their technology landscape. There are ways and parameters that technology teams use to assess risk while selecting vendors. In this post, I am going to talk about one such risk to consider while selecting a (software or SaaS) vendor — the product strategy and roadmap.

Why is the product strategy important?

It’s important that the vendors you are evaluating satisfy your requirements. Their current features should meet all your current business requirements. And while you are at it, also consider whether the product’s current features can satisfy your short-term to mid-term requirements. And of course the vendor should provide all that while offering the service at a low or acceptable cost.

But what about the long term? Can they be a long-term partner for your business? You must explore the answer to this question. Many parameters relevant to your business could help you find the answer; one important parameter is the product strategy.

The product strategy can help you understand a few things: what kind of customers your vendor is going after and will prioritize in the future, and whether you fit in that plan; what kind of needs they are trying to solve; what geographical markets they are trying to be prominent in, and whether they will ever cover the regulatory needs of your region; and so on. The most important thing it will reveal is whether you are the type of company that the vendor is looking to make money from. If not, the relevance of the service’s features and pricing for your company will reduce with time.

So knowing where your vendor is trying to go as a business and hence as a product is important for you to evaluate if the vendor can continue to be relevant for you.

What is the best way to understand this:

Mostly, the product teams or founders (for relatively new vendors) of the vendor will be able to share this with you. Sales teams might not have this information or might not be able to communicate it well. So getting some time with the product team or the founders directly might be the best way to understand what the long-term goal is and which kinds of customers or markets are important for the vendor. Some established vendors might not be open to letting you speak with their product team directly. Whether this is possible can depend on the size of your commercial agreement or on whether the vendor is a new kid on the block.
How much the vendor shares with you will depend on how important you are going to be for them as a customer. This does not just mean dollar value. It can also be that the vendor wants to get you onboard as a strategically important customer that they might want to advertise to get into a segment of the market. Try to establish with the vendor how it can benefit from having your company as a customer.

Why is the product roadmap important?

It is also important that the vendor keeps shipping relevant features continuously so that the vendor can support your company’s short term and evolving needs.

If there is a time-critical or business-critical need that you want the service to solve for you, getting the vendor to agree to deliver those features on time is a good starting point. To really protect your interests, try to get these commitments factored into the agreement before signing the contract. Having some kind of legal commitment can safeguard your interests.

What if the service provided by the vendor is not business critical? You will really have to assess the criticality of the vendor and what it does for you and decide accordingly how diligently you need to understand the vendor. But even if the vendor is not providing any critical service, generally being inquisitive about the features that they are planning to ship in the next couple of months or quarters can expose interesting things about your vendor and how it can assist or be detrimental to the business problem you are trying to solve with this vendor.

What is the best way to understand this:

Asking the sales team might not be enough. Sales teams are infamous for making commitments without checking with their product teams. So again, try talking to the product team about what kind of features they plan to release in the next few months and quarters. Again, how much information you can get depends on how important you are for the vendor.
If you are not able to get time with the product team, get something in writing (even an email is fine, but of course a legal agreement may be the best, although relatively hard, option) from the sales team so that you can hold them accountable in the future and even escalate to the vendor’s management if commitments are not met.

Importance of frequency of releases

Every vendor will have their own process for releasing features to customers. Some might not have a defined process for releases, and that might be okay. What is important is whether they consistently ship features. A defined process, like a bi-weekly, monthly or half-yearly release cadence, gives you confidence that the vendor will always be shipping something and tells you how to factor that schedule into your company’s roadmap.

In absence of significant features being shipped regularly, the product would get stagnant and might not be a good partner for your evolving business.

What is the best way to understand this:

SaaS products typically have a mechanism for announcing new features. Look that up and see their past releases. Were they frequent enough? Were they significant features solving the core use-case or insignificant features targeted at a set of customers unlike you?
If the vendor does not have a way of communicating what’s new in their product, I’d be concerned at this point. This could be okay for a startup. You can have them make up for this by setting up some kind of recurring office hours until they come up with a more streamlined way to inform you and your team about product updates.
Ask the vendor’s sales or product team about their release cadence and what the past few releases were like. Sales teams should typically be able to put something like this together if you ask them.

Retrospecting on real-world experiences

I have had first hand experience of being bitten by this because we overlooked the importance of product strategy and roadmap. I’ll try to share some of those incidents (without naming names) to try to paint a real picture:

At my current employer, we decided to switch our vendor for Process X from Vendor A to Vendor B because Vendor B had all the features that we needed for our process’ requirements today while Vendor A had some important features missing. Vendor B was also cheaper. What we didn’t factor in was that Vendor A had certain features that we would have found ourselves using 6–9 months in the future. These features were missing in Vendor B. Also, Vendor B got acquired soon after and stopped shipping significant features. They didn’t ship anything useful for almost one and a half years, stagnating the maturity of our process and leaving us in the middle of nowhere. In retrospect, staying with Vendor A would have been better even when a couple of features were missing, because those features were eventually added in Vendor A.
Another instance at my current employer was with a vendor that would help us monitor uptime of different applications. We onboarded this vendor many years back and it was extremely relevant then. But the monitoring space has evolved so much over the years (think synthetic monitoring, SLOs, real-user monitoring, etc.). Not knowing where the vendor was heading and what they were going to be building in the long term left us with a vendor that would not be able to help us evolve with time. The vendor kept making UX improvements without really being able to solve for more complicated monitoring requirements. That might be fine for the vision of the vendor and what they want to do with their business. But for us, this vendor was not a good long-term partner. Thankfully we had used this vendor in limited places, so swapping it out was easy and that’s what we eventually did. However, selecting vendors and onboarding them is a time-consuming process, no matter how easy it is, and you might not want to do that over and over again for the same problem.

How do you de-risk technology and vendor decisions in your company? I’m curious to learn about this and happy to share my experiences with specific risks that you are curious to learn about.


Learnings From Two Years of Kubernetes in Production

Vaidik Kapoor Nov 3, 2020


Almost two years back, we took the decision to leave behind our Ansible based configuration management setup for deploying applications on EC2 and move towards containerisation and orchestration of applications using Kubernetes. We have migrated most of our infrastructure to Kubernetes. It was a big undertaking and had its own challenges — from technical challenges of running a hybrid infrastructure until most of the migration is done to training the entire team on a completely new paradigm of operations to name a few.

In this post, we would like to reflect on our experience and share our learning from this journey with you, to help you make better decisions and increase your chances of success.

Be clear about your reason to migrate to Kubernetes

Serverless and containers are all very nice. If you are starting a new business and building everything from scratch, by all means deploy your applications using containers and orchestrate them with Kubernetes, provided you have the bandwidth (or maybe not, read on) and the technical skills to configure and operate Kubernetes as well as deploy applications on it.

Even if you offload operating Kubernetes to a managed Kubernetes service such as EKS, GKE or AKS, deploying and operating applications on Kubernetes properly still has a learning curve. Your development team should be up for the challenge. A lot of benefits can only be realised if your team follows the DevOps philosophy. If you have central sysadmin teams writing manifests for applications developed by other teams, we personally see less benefit in Kubernetes from a DevOps perspective. Of course, there are numerous other benefits you might choose Kubernetes for, such as cost, faster experimentation, faster auto-scaling and resilience.

If you are already deploying on cloud VMs or perhaps another PaaS, why are you really considering migrating to Kubernetes from your existing infrastructure? Are you confident that Kubernetes is the only way to solve your problems? You must be clear about your motivations as migrating an existing infrastructure to Kubernetes is a big undertaking.

We made some mistakes on this front. Our primary reason to migrate to Kubernetes was to build a continuous integration infrastructure that could help us rapidly re-architect our microservices, in which a lot of architecture debt had crept in over the years. Most new features required touching multiple code bases, so developing and testing all of them together would slow us down. We felt the need to provision an integrated environment for every developer and every change, to speed up development and testing cycles without coordinating who gets the “shared stage environment”.


Great! What did it take us to build all of this? It took us almost 1.5 years. So was it worth it?

It took us almost 1.5 years to stabilize this complex CI setup by building additional tooling and telemetry and redoing how every application is deployed. For the sake of dev/prod parity, we had to deploy all these microservices to production as well; otherwise the drift between the infrastructure and deployment setups would have made the applications hard for developers to reason about and made ops a nightmare for them.

We have mixed feelings about this topic. In retrospect, we think we made our continuous integration problem worse, because the complexity of pushing all microservices to production for dev/prod parity made achieving faster CI builds a lot more difficult. Before Kubernetes, we were using Ansible with HashiCorp Consul and Vault for infrastructure provisioning, configuration management and deployments. Was it slow? Yes, absolutely. But we think we could have introduced service discovery with Consul and optimized Ansible deployments a bit to get close enough to our goal in a reasonably short time.

Should we have migrated to Kubernetes? Yes, absolutely. There are several benefits of using Kubernetes: service discovery, better cost management, resilience, governance and abstraction over cloud infrastructure, to name a few. We are reaping all these benefits today. But they were not our primary goal when we started, and the self-imposed pressure and pain of delivering the way we did was perhaps unnecessary.

One big learning for us was that we could have taken a different path of less resistance to adopting Kubernetes. We were so bought into Kubernetes as the only solution that we didn’t even care to evaluate other options.

We will see in this blog post that migrating to and operating on Kubernetes is not the same as deploying on cloud VMs or bare metal. There is a learning curve for your cloud engineering and development teams. It might be worth it for your team to go through it. But the question is whether you need to do that now. You must try to answer that clearly.

Out-of-the-box Kubernetes is almost never enough for anyone

A lot of people mistake Kubernetes for a PaaS solution. It is not one; it is a platform for building PaaS solutions. OpenShift is one such example.

Out-of-the-box Kubernetes is never enough, for almost anyone. It’s a great playground to learn and explore. But you are most likely going to need more infrastructural components on top, tied together well into a solution for applications, to make it meaningful for your developers. This bundle of Kubernetes with additional infrastructural components and policies is often called an Internal Kubernetes Platform. This is an extremely useful paradigm, and there are several ways to extend Kubernetes.

Metrics, logs, service discovery, distributed tracing, configuration and secret management, CI/CD, local development experience and auto-scaling on custom metrics are all things you have to take care of and make decisions about. These are only some of the things that we are calling out. There are definitely more decisions to make and more infrastructure to set up. An important area is how your developers are going to work with Kubernetes resources and manifests; more on this later in this blog post.

Here are some of our decisions and rationale.

Metrics

We settled on Prometheus. Prometheus is almost the de facto metrics infrastructure today. CNCF and Kubernetes love it very much. It works really well within the Grafana ecosystem. And we love Grafana! Our only problem was that we were using InfluxDB. We have decided to migrate away from InfluxDB and totally commit to Prometheus.

Logs

Logs have always been a big problem for us. We have struggled to create a stable logging platform using ELK. We find ELK full of features that are not realistically used by our team. Those features come at a cost. Also, we think there are inherent challenges in using Elasticsearch for logs, making it an expensive solution for logs. We settled on Loki by Grafana. It’s simple. It has the features our team needs. It’s extremely cost-effective. But most importantly, it has a superior UX owing to its query language being very similar to PromQL. Also, it works well with Grafana, which brings the entire metrics monitoring and logging experience together in one user interface.

Configuration and Secret Management

You will find that most articles use ConfigMap and Secret objects in Kubernetes. Our learning is that they can get you started, but we found them barely enough for our use cases. Using ConfigMaps with existing services comes at a certain cost. ConfigMaps can be exposed to pods in a few specific ways, and environment variables are the most common. If you have a ton of legacy microservices that read configuration from files rendered by a configuration management tool such as Puppet, Chef or Ansible, you will have to redo configuration handling in all your code bases to read from environment variables instead. We didn’t find enough reason to do this where it made sense. Also, a change in a configuration or secret means that you will have to redeploy your deployment by patching it. This would mean additional imperative orchestration of kubectl commands.

To avoid all this, we decided to use Consul, Vault and Consul Template for configuration management. We run Consul Template as an init container today and plan to run it as a sidecar in pods so that it can watch for configuration changes in Consul, refresh expiring secrets from Vault and gracefully reload application processes.

CI/CD

We were using Jenkins before migrating to Kubernetes. After migrating to Kubernetes, we decided to stick to Jenkins. Our experience so far has been that Jenkins is not the best solution for working with cloud-native infrastructure. We found ourselves doing a lot of plumbing using Python, Bash, Docker and scripted/declarative Jenkins pipelines to make it all work. Building and maintaining these tools and pipelines started to feel expensive. We are right now exploring Tekton and Argo Workflows as our new CI/CD platform. There are more options you can explore in the CI/CD landscape such as Jenkins X, Screwdriver, Keptn, etc.

Development Experience

There are a number of ways to use Kubernetes in your development workflow. We mostly narrowed it down to two options: Telepresence.io and Skaffold. Skaffold is capable of watching your local changes and constantly deploying them to your Kubernetes cluster. Telepresence, on the other hand, allows you to run a service locally while setting up a transparent network proxy with the Kubernetes cluster, so that your local service can communicate with other services in Kubernetes as if it were deployed in the cluster. The choice is a matter of personal opinion and preference, and it has been hard to decide on one tool. We are mostly experimenting with Telepresence right now, but we have not ruled out the possibility of Skaffold being a better tool for us. Only time will tell what we decide to use, or perhaps we will use both. There are other solutions as well, such as Draft, that deserve a mention.

Distributed Tracing

We are not doing distributed tracing just yet. However, we plan to invest in that area soon. As with logging, we want distributed tracing to be available next to metrics and logging in Grafana, to deliver a more integrated observability experience to our development teams.

Application Packaging, Deployment and Tooling

An important aspect of Kubernetes is to think about how developers are going to interact with the cluster and deploy their workloads. We wanted to keep things simple and easy to scale. We are converging towards Kustomize, Skaffold along with a few home-grown CRDs as the way for developers to deploy and manage applications. Having said that, any team is free to use whatever tools they would like to use to interact with the cluster as long as they are open-source and built on open standards.

Operating a Kubernetes cluster is hard

We mostly operate out of the Singapore region on AWS. At the time we started our journey with Kubernetes, EKS was not available as a service in the Singapore region. So we had to set up our own Kubernetes cluster on EC2 using kops.

Setting up a basic cluster is perhaps not that difficult. We were able to get our first cluster up and running within a week. Most issues appear when you start deploying your workloads. From tuning the cluster autoscaler to provisioning resources at the right time to configuring the network correctly for the right performance, you have to do the research and configure it all yourself. Defaults don’t work most of the time (or at least they didn’t work for us back then) for production.

Our learning

You still have to think about upgrades

Kubernetes is so complex that even if you are using a managed service, upgrades are not going to be straightforward.

Even when using a managed Kubernetes service, invest early in an infrastructure-as-code setup to make disaster recovery and upgrades less painful in the future and to recover fast when disaster strikes.

You can make an attempt to push towards GitOps if you will. If you can’t do that, reducing manual steps to a bare minimum is a great start. We use a combination of eksctl, terraform and our cluster configuration manifests (including manifests for platform services) to set up what we call the “Grofers Kubernetes Platform”. To make the setup and deployment process simpler and repeatable, we have built an automated pipeline to set up new clusters and deploy changes to existing ones.

Resource Requests and Limits

After we started migrating, we observed a lot of performance and functional issues in our cluster due to incorrect configuration. One of the effects of that was adding a lot of buffers in resource requests and limits to eliminate resource constraints as a possibility for performance degradation.

One of the first observations was pod evictions due to memory constraints on nodes. The reason was that resource limits were disproportionately high compared to resource requests. With a surge in traffic, increased memory consumption could saturate memory on nodes, which in turn led to pod evictions.

Our learning

This does not apply in case of non-production environments (such as development, staging and CI). These environments don’t get any spikes in traffic. Theoretically, you can run infinite containers if you set CPU requests to zero and set a high enough CPU limit for your containers. If your containers start utilizing a lot of CPU, they will get throttled. You can do the same with memory requests and limits as well. However, the behaviour on reaching memory limits is different from that of CPU. If you use more than the set memory limit, your containers get OOM killed and they restart. If your memory limit is abnormally high (let’s say higher than the node’s capacity), you can keep using memory, but eventually the kubelet will start evicting pods when the node runs out of available memory.

In non-production environments, we safely overcommit resources as much as possible by keeping resource requests extremely low and limits extremely high. The limiting factor in this case is memory: no matter how low the memory request and how high the memory limit, pod eviction is a function of the sum of memory used by all containers scheduled on a node.
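The relationship between requests, limits and actual usage described above can be sketched in a few lines of Python. This is a simplified model and not how the scheduler or kubelet is implemented: the names `Container` and `node_memory_report` are made up for illustration, and real eviction is triggered by kubelet thresholds on available memory rather than by the exact comparison used here.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass
class Container:
    name: str
    memory_request: int  # bytes the scheduler reserves on the node
    memory_limit: int    # bytes above which the container is OOM killed
    memory_used: int     # bytes currently in use


def node_memory_report(allocatable: int, containers: list[Container]) -> dict:
    """Summarise memory overcommit on one node (simplified model)."""
    requested = sum(c.memory_request for c in containers)
    limits = sum(c.memory_limit for c in containers)
    used = sum(c.memory_used for c in containers)
    return {
        # Scheduling only looks at requests, so tiny requests always fit.
        "schedulable": requested <= allocatable,
        # How far the sum of limits exceeds what the node can provide.
        "overcommit_ratio": limits / allocatable,
        # Per-container limit breaches lead to OOM kills.
        "oom_killed": [c.name for c in containers if c.memory_used > c.memory_limit],
        # Eviction depends on total usage, not on requests or limits.
        "eviction_risk": used >= allocatable,
    }


if __name__ == "__main__":
    node = 4096 * MIB
    pods = [Container(f"svc-{i}", 64 * MIB, 4096 * MIB, 1500 * MIB) for i in range(3)]
    print(node_memory_report(node, pods))
```

In this example all three containers are schedulable and none breaches its own limit, yet the node is at risk of evictions because their combined usage exceeds what the node can provide.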

Security & Governance

Kubernetes is meant to unlock the cloud platform for developers, make them more independent and push the DevOps culture. Opening up the platform to developers, reducing intervention by cloud engineering teams (or sysadmins) and making development teams independent should be one of the important goals.

Sometimes this independence can pose severe risks. For example, using the LoadBalancer type service in EKS provisions a public-facing ELB by default. Adding a certain annotation ensures that an internal ELB is provisioned instead. We made some of these mistakes early on.

Open Policy Agent

Deploying Open Policy Agent to build the right controls helped automate the entire change management process and build the right safety nets for our developers. With Open Policy Agent, we can prevent scenarios like the one just mentioned. It is possible to block service objects from being created unless the right annotation is present, so that developers don’t accidentally create public ELBs.
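To make the rule concrete, here is a rough Python sketch of the check such a policy performs. Open Policy Agent policies are written in Rego, so this is only an illustration of the logic, not our actual policy. The function name `service_violations` is hypothetical, and the annotation key and the expected value "true" are assumptions based on the AWS load balancer annotation commonly used for internal ELBs.

```python
# Assumed annotation key for requesting an internal ELB on AWS.
INTERNAL_ELB_ANNOTATION = "service.beta.kubernetes.io/aws-load-balancer-internal"


def service_violations(service: dict) -> list[str]:
    """Return policy violations for a Service manifest given as a dict."""
    spec = service.get("spec", {})
    if spec.get("type") != "LoadBalancer":
        # Only LoadBalancer services provision an ELB.
        return []

    metadata = service.get("metadata", {})
    annotations = metadata.get("annotations") or {}
    if annotations.get(INTERNAL_ELB_ANNOTATION) == "true":
        return []

    name = metadata.get("name", "<unnamed>")
    return [
        f"Service {name!r} is of type LoadBalancer but is missing the "
        f"{INTERNAL_ELB_ANNOTATION} annotation, so it would create a public ELB"
    ]


if __name__ == "__main__":
    public_service = {
        "kind": "Service",
        "metadata": {"name": "api"},
        "spec": {"type": "LoadBalancer"},
    }
    for violation in service_violations(public_service):
        print(violation)
```

An admission controller would reject the request whenever the returned list is non-empty.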

Cost

We saw massive cost benefits after our migration. However, not all the benefits came immediately.

Better Resource Capacity Utilisation

This was the most obvious one. Our infrastructure today has far less compute, memory and storage provisioned than we had before. Apart from better capacity utilisation due to better packing of containers/processes, we were able to better utilise our shared services such as processes for observability (metrics, logs) than before.

However, initially we wasted an enormous amount of resources while we were migrating. Because we were unable to tune our self-managed Kubernetes cluster the right way, we ran into a ton of performance issues, and we ended up requesting a lot of resources in our pods as a buffer, more like insurance, to reduce the chances of outages or performance issues due to lack of compute or memory.

High infrastructure cost due to large resource buffers was a big problem. We were not really able to realise the capacity utilisation benefits of Kubernetes that we should have. Only after migrating to EKS, and observing the stability it brought, did we become confident enough to take the necessary steps to correct resource requests and bring down resource wastage drastically.

Spot

Using spot instances with Kubernetes is a lot easier than using spot instances with vanilla VMs. With VMs, you can either manage spot instances yourself, which brings some complexity in ensuring proper uptime for your applications, or use a service like SpotInst. The same applies to Kubernetes, but the resource efficiency brought in by Kubernetes can leave you enough room to keep some buffer, so that even if a few instances in your cluster get interrupted, the containers scheduled on them can be quickly rescheduled elsewhere. There are a few options for efficiently managing spot interruptions.

Spot instances helped us get massive savings. Today, our entire stage Kubernetes cluster runs on spot instances and 99% of our production Kubernetes cluster is covered by reserved instances, savings plan and spot instances.

The next step of optimisation for us is how we can run our entire production cluster on spot instances. More on this topic in another blog post.

ELB Consolidation

We used Ingress to consolidate ELBs in our stage environment and drastically reduce the fixed costs of ELBs. To prevent this from becoming a cause of dev/prod disparity in code, we decided to implement a controller that would mutate LoadBalancer type services into NodePort type services along with an ingress object in our stage cluster.
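The mutation such a controller performs might look roughly like the Python sketch below. It is not our controller's code. It works on plain dicts instead of talking to the Kubernetes API, the function name and the hostname argument are hypothetical, and the Ingress is written against the `networking.k8s.io/v1` schema, which is an assumption about the API version in use.

```python
import copy


def to_nodeport_with_ingress(service: dict, host: str) -> tuple[dict, dict]:
    """Turn a LoadBalancer Service into a NodePort Service plus an Ingress."""
    if service["spec"].get("type") != "LoadBalancer":
        raise ValueError("only LoadBalancer services are mutated")

    # Leave the developer's manifest untouched and mutate a copy.
    mutated = copy.deepcopy(service)
    mutated["spec"]["type"] = "NodePort"

    name = service["metadata"]["name"]
    namespace = service["metadata"].get("namespace", "default")
    # Route to the first declared service port (a simplification).
    port = service["spec"]["ports"][0]["port"]

    ingress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {"name": name, "port": {"number": port}}
                                },
                            }
                        ]
                    },
                }
            ]
        },
    }
    return mutated, ingress
```

Because the mutation happens in the cluster, application manifests stay identical across stage and production, which is what keeps this cost optimisation from leaking into code.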

Migration to Nginx ingress was relatively simple for us and didn’t require a lot of changes because of our controller approach. More savings can come if we use ingress in production as well. It’s not a simple change. Several considerations go into configuring ingress for production the right way, and it needs to be looked at from the perspective of security and API management as well. This is an area we intend to work on in the near future.

Increased Cross-AZ Data Transfer

While we saved a lot on infrastructure spend, there is one area of infrastructure where costs increased: cross-AZ data transfer.

Pods can be provisioned on any node. Even if you control how pods are spread in your cluster, there is no easy way to control how services discover each other in a way that a pod of one service talks to the pod of another service in the same AZ to reduce cross-AZ data transfer.
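A back-of-the-envelope calculation shows why this adds up. The sketch below assumes that each request is routed to a destination pod chosen uniformly at random, which is a simplification of how service load balancing behaves, and the function name is made up for illustration.

```python
from fractions import Fraction


def cross_az_fraction(pods_per_az: dict[str, int]) -> dict[str, Fraction]:
    """For a caller in each AZ, the share of requests that leave that AZ.

    pods_per_az maps an AZ name to the number of destination pods in it.
    Assumes uniform random selection among all destination pods.
    """
    total = sum(pods_per_az.values())
    if total == 0:
        raise ValueError("the destination service has no pods")
    return {az: Fraction(total - count, total) for az, count in pods_per_az.items()}


if __name__ == "__main__":
    # Destination pods spread evenly across three AZs.
    for az, share in cross_az_fraction({"a": 2, "b": 2, "c": 2}).items():
        print(az, share)
```

With pods spread evenly across three AZs, two thirds of requests from any caller cross an AZ boundary under this assumption, no matter how many pods there are.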

After a lot of research and conversations with peers in other companies, we learned that something like this can be achieved by introducing a service mesh to control how traffic from a pod is routed to the destination pod. We were not ready to take on the complexity of operating a service mesh ourselves just to save on the cost of cross-AZ data transfer.

CRDs, Operators & Controllers — A Step Towards Simplified Ops & A More Integrated Experience

Every organisation has its own workflows and operational challenges. We have ours too.

In our two-year journey with Kubernetes, we learned that Kubernetes is great, but it’s better when you use its features such as controllers, operators and CRDs to simplify daily operations and provide a more integrated experience to your developers.

We have started investing in a bunch of controllers and CRDs. For instance, LoadBalancer service type to ingress conversion is a controller operation. Similarly, we use controllers to automatically create CNAME records in our DNS provider whenever a new service is deployed. These are a few examples. We have 5 other separate use-cases where we are relying on our internal controller to simplify daily operations and reduce toil.

We have also built a few CRDs. One of them is widely used today to generate monitoring dashboards on Grafana from a declarative specification of what each dashboard should contain. This makes it possible for developers to check in their monitoring dashboards next to their application code base and deploy everything using the same workflow (kubectl apply -f .).

We are seeing massive benefits from controllers and CRDs. As we work closely with our cloud vendor AWS to simplify cluster infrastructure operations, we free ourselves up to focus more on building “the Grofers Kubernetes platform”, which is architected to support our development teams in the best way possible.

Interested in infrastructure engineering? Did you know we are hiring?

If this kind of work interests you, feel free to directly reach out to me on LinkedIn, Twitter or just drop me an email.

Vaidik Kapoor is the VP Engineering (DevOps) at Grofers.

Thanks for reading Lambda.

Say hello on Twitter or follow us on LinkedIn.

We’re hiring!


Yet Another OKR Template

Vaidik Kapoor Oct 11, 2020


There are more than a dozen OKR templates and tracking tools out there. We started adopting OKRs at Grofers a couple of years back and went through our cycles of adopting the methodology. While adopting a new way of working, having the right tool can accelerate the transformation.

We explored a lot of different tools and ended up settling on 15five. We liked its simplicity and its features that encouraged using OKRs for goal setting, along with some additional features for employee engagement and performance management. Unfortunately, we had to discontinue 15five due to cost cutting during the COVID pandemic.

We had grown to like 15five quite a lot. Discontinuing it left a gap in how we worked. We needed a cheap, possibly free, replacement for 15five.

I had personally explored a lot of solutions and worked on deploying OKRs within the technology function at Grofers. I had explored free Google Sheet templates in that process. They didn’t work for our needs back then. With our experience of deploying OKRs and having used 15five, I thought it was worthwhile to give Google Sheets another try and make it work for our way of doing OKRs. One of the reasons for not exploring anything else was cost but another one was that the entire company is comfortable with Google Sheets. It was not difficult to introduce this change.

So here is another Google Sheet template for tracking OKRs — an opinionated and simplified one that I created for our needs at Grofers. I’m sharing this so that if you are looking for a free template, you can give it a try.

Here are the features supported by this template:

Add Objectives, Key Results, Key Result targets.
Support for specifying OKR owners and parent OKR.
Automatic progress calculation at objective and key result level — updating your key result will show total progress automatically.
Support for KR types and auto-formatting of KR values on the basis of KR types (number, currency, percentage, done/not done).
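The automatic progress calculation can be illustrated with a small Python sketch. The template itself does this with Google Apps Script, which is not reproduced here. The class and function names, the clamping to the 0 to 1 range and the unweighted average across key results are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class KeyResult:
    name: str
    start: float    # value at the beginning of the cycle
    target: float   # value that counts as 100% progress
    current: float  # latest reported value

    def progress(self) -> float:
        """Progress from start to target, clamped to the range 0..1."""
        span = self.target - self.start
        if span == 0:
            return 1.0
        return max(0.0, min(1.0, (self.current - self.start) / span))


def objective_progress(key_results: list[KeyResult]) -> float:
    """Unweighted average of key result progress (an assumption)."""
    if not key_results:
        return 0.0
    return sum(kr.progress() for kr in key_results) / len(key_results)


if __name__ == "__main__":
    krs = [
        KeyResult("Weekly active users", start=0, target=100, current=50),
        # A done/not done key result is modelled as 0 or 1.
        KeyResult("Launch onboarding flow", start=0, target=1, current=1),
        # A key result where lower is better works because span is negative.
        KeyResult("Average response time (hours)", start=10, target=5, current=8),
    ]
    print(f"{objective_progress(krs):.0%}")
```

Modelling done/not done key results as 0 or 1 lets every KR type share the same formula, and only the display formatting differs by type.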

It was not straightforward to make this. I had to write a bunch of Google Apps Script code to make all of it work, which was also fun to do, and I learned that the Google Apps platform is incredibly powerful for hackers. Tools like Airtable or Coda can most likely make all of this a lot simpler. However, Google Sheets makes it far more accessible to teams that don’t use or don’t want to use another paid software for collaboration.

Give it a try. If you need features or help with the sheet, feel free to reach out to me.
