nablag

Part of nablag.com

Personal Blog. ∇ + Blag

The repercussions of missing an Ampersand in C++ & Rust
rust, c++
TL;DR

There’s a funny typo in C++ that makes a function copy its argument instead of “referencing” it. Rust is nice because it provides defaults that protect you from some of these “dumb” mistakes [1]. In this example, I’ll go over how “move by default” can prevent us from introducing this subtle behavior.

Motivation

I originally hesitated to write this because I thought the topic was too “obvious”, but I did it anyways after watching this presentation discussing migrating from C++14 to C++20. I was specifically inspired by a performance bug due to a typo. This mistake is the “value param” vs “reference param” mix-up, where your function copies a value instead of taking it by reference because an ampersand (&) was missing… Here’s a minimal version of the difference:

// You're the mythical 100x developer
void BusinessLogic(const Data& d) {
  // ...
}

// You're trash that deserves to be replaced by AI
void BusinessLogicThatCopies(const Data d) {
  // ...
}

This simple typo is easy to miss, and the penalty won’t matter for people who aren’t performance sensitive (although if you aren’t strongly affected by stuff like this, you probably don’t need to be using C++). One could argue that this example is a one-off and no competent C++ developer would make this mistake, but I’ve even seen it in Google codebases (interpret that as you will). There are plenty of linters and tools to detect issues like this (ex: clang-tidy can scan for unnecessary value params), but evidently these issues go unnoticed until a customer complains about them or someone actually bothers to profile the code. Having to be vigilant about such a minor behavior is exhausting, and maybe we should design our languages to guide us toward sensible defaults.

Rust Defaults

I like Rust because it provides a handful of C++ patterns by default [2]. Next to the headline benefits in the Rust hype marketing, like “memory safety” and “fearless concurrency”, this one is quite small, but I like the improved ergonomics nonetheless. Adopting performance-oriented defaults removes a lot of the weird “gotchas” early on in the C++ learning curve, as well as the toil of setting up the proper tooling. For brevity, I’ll just focus on the concept of C++ ownership (std::move) and reference parameters.

  1. By default, “pass by value” in Rust moves objects (unless the object implements the Copy trait) instead of copying them

Rust’s “pass by value” behavior differs from C++’s, which copies the object. This has some subtle implications, and it’s hard to grasp why this is nice without visualizing the C++ code. So let’s start with the toy example from the beginning. In the C++ snippet below, passing our expensive-to-copy struct “by value” will result in the struct being copied:

void BusinessLogic(const Data d) {
  d.DoThing();
}

Data expensive_to_copy = Data{...};
BusinessLogic(expensive_to_copy);

Copying is desirable for types that are cheaper to copy than to reference (like int, bool, float, etc.), but for larger objects/heap allocated objects, it will slow down our code. If you only need to read the contents of the object, an improvement might be to “pass by reference” like so:

// note the `const` + `&`
void BusinessLogic(const Data& d) {
  d.DoThing();
}

Data expensive_to_copy = Data{...};
BusinessLogic(expensive_to_copy);

Again, if we repeat that typo from the beginning (const Data d), only the linter will point out our mistake.

There are some cases where you want your function to “take ownership” of the parameter you’re passing to it, so you might employ a “move” using something like this:

std::unique_ptr<Owner> FactoryFunction(Data&& d) {
  // ...
  // `d` is a named rvalue reference, so it must be moved again to hand it on
  return std::make_unique<Owner>(std::move(d));
}

Data expensive_to_copy = Data{...};
auto data_owner = FactoryFunction(std::move(expensive_to_copy));

However, moving in this context adds extra restrictions to the original object. After an object has been “moved from”, you’re not supposed to use it again, or else you’re introducing potential bugs. For example, even though the snippet below compiles, it is bad and linters will complain about it.

Data expensive_to_copy = Data{...};
auto data_owner = FactoryFunction(std::move(expensive_to_copy));
expensive_to_copy.DoThing();  // Linter will complain about using expensive_to_copy after it has been moved from.
// The compiler won't say anything. The maintainer needs to watch out for accidental uses

In Rust, calling a function uses the “optimal” version for either case (reference or move) by default; moreover, the compiler (not the linter) will point out any improper “use after move”.

struct Data {
  // Vec cannot implement the Copy trait
  data: Vec<i32>,
}

// Equivalent to "passing by const-ref" in C++
fn BusinessLogic(d: &Data) {
  d.DoThing();
}

// Equivalent to "move" in C++
fn FactoryFunction(d: Data) -> Owner {
  let owner = Owner { data: d };
  // ...
  owner
}

Rust prevents us from accidentally writing sub-optimal versions of the C++ function (BusinessLogic(const Data d))… with the caveat that this choice propagates throughout the language, which can be unintuitive or confusing.
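
Here’s a small, compilable sketch of the same idea. The snake_case names, the `Owner` struct, and the `demo` function are made up for illustration; the point is only that the borrow costs nothing and that using the value after the move is rejected by the compiler rather than by a linter.

```rust
struct Data {
    // Heap-allocated contents, so `Data` cannot be `Copy`.
    data: Vec<i32>,
}

// Hypothetical owner type, standing in for the C++ `Owner`.
struct Owner {
    data: Data,
}

// Borrow: the caller keeps ownership and nothing is copied.
fn business_logic(d: &Data) -> usize {
    d.data.len()
}

// Move: ownership of `d` transfers into the returned `Owner`.
fn factory_function(d: Data) -> Owner {
    Owner { data: d }
}

fn demo() -> (usize, usize) {
    let expensive_to_copy = Data { data: vec![1, 2, 3] };

    // Passing a reference leaves `expensive_to_copy` usable afterwards.
    let seen_by_borrow = business_logic(&expensive_to_copy);

    // Passing by value moves it; no copy of the Vec is made.
    let owner = factory_function(expensive_to_copy);

    // Uncommenting the next line fails to compile with
    // error[E0382]: borrow of moved value: `expensive_to_copy`
    // business_logic(&expensive_to_copy);

    (seen_by_borrow, owner.data.data.len())
}
```

Calling `demo()` returns the length seen through the borrow and the length held by the new owner; both are 3 because the same Vec was handed over, not duplicated.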

Revisiting the bug with Rust

Now that we have established context with the fake example, let’s see how Rust could’ve prevented the presentation’s problem in a practical instance.

#1 Vec::retain

The Rust library function for removing elements of a vector (Vec::retain) doesn’t give us an option to use a closure that takes its argument by value. Even if we wanted to write a closure that copies the elements of our vector, the compiler notices the type mismatch and rejects it.

If we were to take the approximation of the C++ code in the presentation (below)

std::vector<Request> LoadRequests;

void OnComplete(int id) {
  // If you're confused why it removes elements from the vector like this, it's
  // used as a part of the "erase-remove" pattern. The context in the video is that
  // they were using C++14, so they couldn't use the C++20 std::erase_if
  const auto DeleteRange = std::ranges::remove_if(LoadRequests, [id](const Request r){
     return r.id == id;
  });
  LoadRequests.erase(DeleteRange.begin(), DeleteRange.end());
}

and convert it to the idiomatic Rust expression (below), then try to pass in a closure that has a typo (Request instead of &Request), the compiler will throw an error saying “type mismatch in closure arguments”.

let mut LoadRequests : Vec<Request> = ...

fn OnCompleteDefault(id: i32, load_requests: &mut Vec<Request>) {
  load_requests.retain(|r: Request| r.id != id); // Throws a compiler error
} 

This is technically an example of a type system preventing us from making dumb mistakes rather than “moving by default” preventing these mistakes. Since “pass by value” is a move by default, the type system is able to come in and recognize the error.
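
For comparison, here is a sketch of the version that does compile. The `Request` struct with a single `id` field and the snake_case function name are assumptions for illustration; `Vec::retain` itself always passes the closure a `&Request`.

```rust
// Hypothetical request type; only the `id` field matters here.
#[derive(Debug, PartialEq)]
struct Request {
    id: i32,
}

fn on_complete(id: i32, load_requests: &mut Vec<Request>) {
    // `retain` calls the closure with `&Request`, so there is no way to
    // ask for a by-value copy: `|r: Request|` is rejected as a type mismatch.
    load_requests.retain(|r: &Request| r.id != id);
}
```

The type annotation on `r` is optional; leaving it off lets the compiler infer `&Request`, which is the usual way to write it.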

One could also argue that comparing a C++14 pattern to a newer standard library isn’t fair. So in order to convince the audience, I’ll try to write an unidiomatic, suboptimal expression to see how far Rust goes to stop us from doing something dumb.

#2 Weird Hypothetical Implementations

Suppose we hate using methods like Vec::retain because we’re trying to prove a point to strangers on the internet; let’s try coding up a reasonable implementation ourselves. Let’s start with the original “correct” version of the code, and add in random & typos to see what the compiler does. The “correct” version:

fn OnCompleteWeird(id: i32, LoadRequests: Vec<Request>) -> Vec<Request>{
  let mut filtered: Vec<Request> = Vec::new();
  for lr in LoadRequests.into_iter() {
    if lr.id != id {
        filtered.push(lr);
    }
  }
  filtered
}

Let’s try changing the parameter from the existing move (Vec<Request>) to a borrow (&Vec<Request>). Trying to compile this variation results in an error because filtered.push expects another move, but instead it gets a reference (expected Request, found &Request). We can follow the compiler’s recommendation to explicitly copy the element by using .clone(), but that copy doesn’t happen by accident (unless someone blindly follows what the compiler suggests).

Iterating with iter() fails to compile in the same way, because iter() returns references to the Request data, and we cannot turn a reference into a copy unless we explicitly .clone() the underlying data. Again, I can’t imagine a scenario where someone would do this when there are existing library functions that do this more efficiently… but it’s nice to know that it’s still hard to make the mistake.
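
A sketch of what that borrowed variation might look like once it is made to compile. The `Request` struct and the function name are assumptions for illustration, and the filter keeps every request whose id differs from the completed one, mirroring the `retain` call earlier.

```rust
// Hypothetical request type; `Clone` has to be opted into explicitly.
#[derive(Clone, Debug, PartialEq)]
struct Request {
    id: i32,
}

fn on_complete_borrowed(id: i32, load_requests: &Vec<Request>) -> Vec<Request> {
    let mut filtered: Vec<Request> = Vec::new();
    // `iter()` on a borrowed Vec yields `&Request`, never `Request`.
    for lr in load_requests.iter() {
        if lr.id != id {
            // filtered.push(lr);
            // ^ error[E0308]: mismatched types (expected `Request`, found `&Request`)
            filtered.push(lr.clone()); // the copy has to be spelled out
        }
    }
    filtered
}
```

If `Request` did not derive `Clone`, even the `.clone()` line would be rejected, so the accidental copy has two gates to get through.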

C++ can do this too

In C++’s defense, it offers several ways to prevent copying, support automatically moving objects, etc.: deleting copy constructors and copy assignment operators, making copy constructors explicit and move constructors implicit, leveraging copy elision, and so on. However, these methods can be annoying to use because you need to worry about rules like the “rule of 3/5/0”, and you might be restricted to a specific version of C++.

In fact, a Rust struct that doesn’t derive/implement the Copy trait is similar to a C++ struct that makes its copy constructor explicit. Similarly, a Rust struct that doesn’t derive/implement the Clone trait is similar to a C++ struct that deletes its copy constructor and copy assignment operator. In a way, the exclusion of these traits is another protective default.
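
To make the three tiers concrete, here is a small sketch. All of the type and function names are invented for illustration; the behavior of `Clone` and `Copy` is standard Rust.

```rust
// Neither Clone nor Copy: values can only be moved or borrowed.
// Roughly the "deleted copy constructor" case.
struct MoveOnly {
    data: Vec<i32>,
}

// Clone only: copying is possible, but must be written out as `.clone()`.
// Roughly the "explicit copy constructor" case.
#[derive(Clone)]
struct ExplicitCopy {
    data: Vec<i32>,
}

// Clone + Copy: passing by value silently copies, like a C++ value type.
#[derive(Clone, Copy)]
struct ImplicitCopy {
    id: i32,
}

fn take_move_only(m: MoveOnly) -> usize {
    m.data.len()
}

fn take_explicit(e: ExplicitCopy) -> usize {
    e.data.len()
}

fn take_implicit(i: ImplicitCopy) -> i32 {
    i.id
}

fn demo() -> (usize, usize, usize, i32, i32) {
    let m = MoveOnly { data: vec![1] };
    let a = take_move_only(m); // moved; `m.clone()` would not compile

    let e = ExplicitCopy { data: vec![1, 2] };
    let b = take_explicit(e.clone()); // deep copy, visible at the call site
    let c = take_explicit(e); // move, no copy

    let i = ImplicitCopy { id: 7 };
    let d = take_implicit(i); // implicit bitwise copy
    let f = take_implicit(i); // `i` is still usable afterwards

    (a, b, c, d, f)
}
```

Note that `Copy` can only be derived when every field is itself `Copy`, which is why the struct holding a Vec cannot opt in.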

Conclusion

As a disclaimer, I’m not a fan of several aspects of Rust, but I do think some of its language defaults are good for performant programs. More importantly, these defaults reduce the mental burden of double checking minor C++ traps, and let me trust the compiler to do this for me.

Appendix

Pass by value, reference, pointer

This abseil source can probably explain this better than I can: abseil.io/tips/234

Copy/Clone/Drop Traits
  • https://doc.rust-lang.org/std/marker/trait.Copy.html
  • https://doc.rust-lang.org/std/clone/trait.Clone.html

  1. Although these defaults can feel frustrating and cause other parts of the language to feel clunky to me

  2. Granted, the repercussions of these defaults also result in (in my opinion) verbose language constructs like iter, into_iter, iter_mut

tedkim97.github.io/rust_cpp_reference_mistakes
Architecture of a Joke
programming, software, rust, architecture, cloud
My April Fool’s Joke: Ad Supported, Subscription DNS

My april fools joke (briefly) made it to the front page of hackernews.

on the front page after one hour

The joke itself is small, but what I think is more fun and interesting are the constraints I had for this project, and how there are a couple of clever decisions that affect the architectural, implementation, and operational choices.

I’m going to write down as many details as I remember + the details I wrote down after the event had concluded.

Spoiling (explaining) the Joke

The core of the joke was that this was another tech product that, while technically impressive, serves a business need that people don’t really want or need. I.e., it’s a “bad” product.

There are also small jokes making fun of memes in the programming community like:

  • Rust is “blazingly fast”
  • Rewriting a product in a different language is a marketable benefit…?
  • Companies constantly make product promises that they inevitably break
  • Companies building products just to try to sell your data
  • Companies breaking specs to make money
The Inspiration

This idea was inspired because my workplace semi-regularly pushed “hackathons”/”call for product proposals” within my org. Being on a DNS team, there’s not a lot of room for “out of the box” DNS products… I would regularly joke that we should do “LLM DNS”, or “DNS with Advertisements”.

The History

As a fun little fact, DNS hijacking for advertisements isn’t even anything new. There was another blog link talking about this and I bookmarked it (but I somehow lost it). I’ll update this page if I ever do re-find it.

The Requirements

If you want a quick summary of the requirements, they are:

  1. Relatively small implementation time. I don’t want to spend hours implementing a complete DNS resolver (this is where the clever part in the architecture comes in)
  2. Performant. Ideally this service should be able to handle large loads and minimize latency to impress my fellow nerds (also in case someone decided to spam the resolver).
    • “Performant” is intentionally vague which I’ll expand upon later.
  3. Seamless operations. I don’t want to carefully monitor + extinguish fires in case the resolver breaks for any reason. April Fools’ was on a business day this year and I couldn’t be distracted from work to find my personal computer/ssh/whatever and fix it.
  4. Reasonable spend on services. Excluding my time, I am not going to spend hundreds of USD on a 1 day joke for clout among internet strangers.
How The Architecture Saves Developer Time

The problem with making a joke like this is that making a DNS resolver from scratch is a lot of work. There are a billion reasons for this, but mainly:

  1. The DNS spec has a lot of nuance and detail
    • This means there’s a lot of testing that you need to do
  2. There’s a lot of caching you need to add in

While I appreciate RFCs, I don’t think spending hours carefully reading, testing and following a spec is a fun april fools joke.

While thinking about whether or not I wanted to go through with the idea of this joke, I realized there’s a more devious way of getting full DNS resolver functionality. Instead of programming a resolver from scratch, it’s easier to develop a program that receives DNS queries, forwards them to an actual resolver (like BIND9), captures and edits the response, and returns it to the user.

This way, you can guarantee that the DNS resolution behavior is at least as correct to the spec as can be… before you ruin it and add a bunch of advertisements. Moreover, we can profit by reselling open source projects, just like the big companies :).
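
A minimal sketch of that forward-and-edit loop, assuming plain UDP and a single upstream resolver. This is not the project’s actual code: `handle_query`, `serve`, the 4096-byte buffers, and the one-socket-per-query approach are all assumptions chosen to keep the example short.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

/// Core of the interceptor: let something else resolve the query, then
/// edit the answer before it goes back to the client.
fn handle_query<R, E>(query: &[u8], resolve: R, edit: E) -> io::Result<Vec<u8>>
where
    R: FnOnce(&[u8]) -> io::Result<Vec<u8>>,
    E: FnOnce(&mut Vec<u8>),
{
    let mut response = resolve(query)?;
    edit(&mut response);
    Ok(response)
}

/// Hypothetical UDP front end: listens on `listen` and forwards every
/// datagram to the real resolver at `upstream`.
fn serve(listen: SocketAddr, upstream: SocketAddr) -> io::Result<()> {
    let client_side = UdpSocket::bind(listen)?;
    let mut buf = [0u8; 4096];
    loop {
        let (len, client) = client_side.recv_from(&mut buf)?;
        let reply = handle_query(
            &buf[..len],
            |query| {
                // One short-lived socket per query keeps the sketch simple.
                let upstream_side = UdpSocket::bind("0.0.0.0:0")?;
                upstream_side.connect(upstream)?;
                upstream_side.send(query)?;
                let mut answer = vec![0u8; 4096];
                let n = upstream_side.recv(&mut answer)?;
                answer.truncate(n);
                Ok(answer)
            },
            |_response| {
                // Ad injection would go here.
            },
        )?;
        client_side.send_to(&reply, client)?;
    }
}
```

Keeping the resolve and edit steps as closures means the pipeline can be exercised without a network. A real deployment would also want read timeouts and concurrency, since this loop blocks on every upstream round trip.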

Theoretical Architecture of the joke


The actual execution differed from this design because of last-minute testing (expanded upon later).

Implementation Details

In the end, what I needed to program was an interceptor that just injected ads. To meet my performance goals, I didn’t want the program to bottleneck the performance of the underlying DNS resolver. In essence, I wanted the throughput of the advertisement interceptor + resolver to match the throughput of the plain resolver.

I planned to do this by:

  1. Minimizing copying of buffers
  2. Reducing serialization/deserialization between the wire-encoded format & logical structs

All to get as much compute throughput as possible!

The above was completely overkill and unnecessary and made testing + debugging a huge pain. This was unnecessary because the resolvers I was testing with ended up being more of a bottleneck (more on this later).
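
To illustrate what editing the wire format directly can look like, here is a sketch that appends a TXT record to a response without parsing it into structs. It is an assumption-heavy example, not the project’s implementation: it assumes the ad goes in the additional section, that the question name starts at offset 12 (so a compression pointer can reuse it), and the function name `append_txt_ad` is made up. The field offsets and record layout follow RFC 1035.

```rust
/// Length of the fixed DNS header (RFC 1035 section 4.1.1).
const HEADER_LEN: usize = 12;

/// Appends one TXT record to the additional section of a wire-encoded
/// response, editing the buffer in place. Returns false and leaves the
/// buffer untouched when the edit is not possible.
fn append_txt_ad(response: &mut Vec<u8>, ad: &str, ttl: u32) -> bool {
    // A single TXT character-string holds at most 255 bytes.
    if response.len() < HEADER_LEN || ad.len() > 255 {
        return false;
    }
    // QDCOUNT lives in bytes 4..6; we need a question name to point at.
    let qdcount = u16::from_be_bytes([response[4], response[5]]);
    if qdcount == 0 {
        return false;
    }
    // ARCOUNT lives in bytes 10..12; bump it by one.
    let arcount = u16::from_be_bytes([response[10], response[11]]);
    let new_arcount = match arcount.checked_add(1) {
        Some(n) => n,
        None => return false,
    };
    response[10..12].copy_from_slice(&new_arcount.to_be_bytes());

    // NAME: compression pointer to offset 12, the first question name.
    response.extend_from_slice(&[0xC0, 0x0C]);
    response.extend_from_slice(&16u16.to_be_bytes()); // TYPE = TXT
    response.extend_from_slice(&1u16.to_be_bytes()); // CLASS = IN
    response.extend_from_slice(&ttl.to_be_bytes()); // TTL
    // RDLENGTH = one length byte + the text itself.
    response.extend_from_slice(&(ad.len() as u16 + 1).to_be_bytes());
    response.push(ad.len() as u8);
    response.extend_from_slice(ad.as_bytes());
    true
}
```

The only allocation is the Vec growing to fit the new record. The sketch ignores message size limits and any records that need to stay last in the message, which is exactly the sort of detail that makes this approach painful to test.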

What about TCP & DNSSEC?

I didn’t feel like dealing with TCP forwarding, so I just made my resolver reject all TCP requests and return a response saying “TCP REQUIRES A SUBSCRIPTION”. In retrospect it would’ve been funny if I added a rick-roll in the message.

I could’ve handled DNSSEC. Since the underlying mechanism for resolving queries is a real DNS product, I could just forward the details. However, I was worried about message sizes overflowing + didn’t want to deal with the headache, so I rejected all DNSSEC queries and returned a joke.

The Execution/Operations

Initially I was going to set up port forwarding on my internet router and forward traffic to my desktop. However, I wasn’t sure if that was going to fit my goal of “seamless” operations. I was nervous about what my ISP would do when they noticed a sudden influx of packets coming into my residential IP; moreover, I wasn’t sure if I could even use port 53 of my router for DNS. At the time I had roommates, and they were definitely not going to be happy if our apartment’s Wi-Fi wasn’t working because of some random shenanigans I decided to do.

As a result I decided to host this service in the cloud. I chose GCP because I was more familiar with that. Moreover, GCP VMs have a lot of monitoring out of the box, so I could track how many packets were coming in/out and use that as a proxy for how much traffic my joke was receiving.

I deployed this in ONE region in the Midwest. I guessed that most of the hackernews audience was based in the US, so I wanted reasonable latency to the general US population. Depending on my audience, maybe picking a region on the West Coast would have been a better idea.

Which DNS resolver did I end up using?

Originally I planned to use Bind9 as the true DNS resolver. However, I was cheap and thought I could save money by using an external resolver that I didn’t have to host. My goal was to forward queries to DNS resolvers on the public internet like Cloudflare’s 1.1.1.1 or Google’s 8.8.8.8.

While perf testing with dnsperf, I discovered that these resolvers rate limit you pretty quickly (duh). Then I had an idea to use my VM’s DNS resolver (https://cloud.google.com/compute/docs/internal-dns) to resolve my queries, which ended up increasing my throughput ~5x.

  • Of course, I would NOT recommend this solution for ANY production scenario
The Cost

For roughly 24 hours of operation, the total bill ended up being $8.67 which included the IP Address + the VM. I don’t remember which model of the N2 I ended up choosing (n2-standard-16, n2-standard-32, n2-standard-64) but it ended up being completely overkill for the task.

How many people used the service?

I did get a decent amount of traffic - I even took screenshots and did some back of the napkin math. Also note that not all of the traffic I was receiving was from people resolving DNS traffic. Some of that traffic is just random crawlers/bots on the internet.

As I mentioned earlier, the amount of compute I provisioned for this service was completely overkill.

MISC Notes

Anycast-ing

It would’ve been cute if I were able to have anycast support. Anycast would help meet the performance goals for international audiences. However, I wasn’t able to find a way to support anycast with 1 IP address and multiple machines with GCP (without using an additional load balancer product) [1]. Moreover, I wasn’t going to fork over the cash to support such a niche functionality. There was a comment saying that reddit.com took 11.4 seconds to resolve. I guess that this person was far away from the server.

Monitoring

I had this idea of tracking all the unique IP addresses that used my DNS resolver; however as the deadline got closer, I decided to cut that part out because I didn’t have enough time.

The Joke Execution

Ironically one of the toughest decisions for this project was to decide on how subtle to make the humor.

I prefer subtle, dry humor, but I wasn’t confident the audience (technology enthusiasts/software engineers) would understand the joke if it wasn’t over the top. One hackernews comment pointed out “Little over the top. Sometimes subtle is better/more entertaining” which I completely agree with.

Conclusion

Overall, it was a fun project and I’m glad to see there were people who liked it and thought it was funny. See you next year!

  1. Granted, I put in only 30 minutes of research

tedkim97.github.io/architecture_of_a_joke
My April Fool’s Joke made it to the front page (of hackernews)
april fool's, DNS, networking

My april fools joke made it to the front page of hacker news: https://news.ycombinator.com/item?id=39895453!

on the front page after one hour

on the front page after four hours

The april fool’s joke was that I was a pretend-SAAS company that was “innovating” DNS by providing an advertisement-supported DNS resolver, where we would randomly inject ads for the users. I.e., when a user made a DNS query, advertisements would pop up like so:

dig @35.223.197.204 hackernews.com

...
    # Actual DNS record
    hackernews.com.  46 IN A 13.249.141.39    
...
    # Advertisement
    hackernews.com.  7200 IN TXT "Meet hot, lonely DNS records in your area tonight"

For the unfamiliar, DNS is a very technical, foundational part of the public internet and networking: essentially, something that no one would ever ask for ads in. If you can’t tell… this is a VERY niche joke.

I initially had an idea for this joke back in early 2023 - I had a vague idea that “making DNS with advertisements would be so bad it’s hilarious”. I ended up spending a decent chunk of time on the project, and I was a bit nervous that it would fade into obscurity, but I’m glad it got some attention and people found it funny.

The time investment

It was surprisingly stressful to work on this project… because trying to be funny can be stressful. When it comes to jokes I constantly go through this seesaw of “it’s funny” to “it’s cringe”. Moreover, I was also worried about making sure the jokes wouldn’t offend anyone (not that this is a particularly touchy topic).

Explaining the joke/satire

I’m going to do the thing you shouldn’t do… explaining the joke. The audience for this is incredibly niche, so I think it’s fair to make the joke more accessible for bigger audiences.

I sorta wanted to make a bunch of tiny jokes about trends you see in the software industry, something that might not be picked up by all audiences.

On one level, the jokes in the DNS records are obvious:

  • “Meet hot, lonely DNS records in your area tonight”
  • “I make $100,000 USD every month! Buy my course to find out how!”
  • “This response is sponsored by Raid Shadow Legends”

Other jokes are more “programmer” niche:

  • Devs constantly migrating to Rust or bragging that their “blazingly” fast software was written in Rust
  • The enshittification of products over time.
  • Providing a free service, but also providing a terrible, paid subscription on top of it (think Youtube Premium that still has ads…)
  • Collecting & selling user data
  • The stochastic entrepreneurial process of inserting advertisements into anything to try to make a big business out of it

Some of the jokes are just me complaining about how boring ads are:

  • “This response is sponsored by $MEAL_DELIVERY_KIT! Use this coupon to get $2 off your next order!”
  • “Watch the action-packed, romantic, comedy of the century $MOVIE in theaters near this summer”
  • “Welcome to the metaverse! Come here to buy real-estate in the metaverse”

The final funny bit I like is:

  • Why would anyone waste their time putting so much effort into an april fool’s joke?
Delivery

The delivery of the joke (besides setting up an actual DNS resolver that anyone can use), is intentionally overt.

One comment pointed out “Little over the top. Sometimes subtle is better/more entertaining” - which I completely agree with (I didn’t address it because I was “playing the joke” in the comments). I think I would have liked a more subtle tone to the overall joke - to make the satire more subtle.

However, I consciously made the delivery very blatant because I was nervous that people wouldn’t get the joke… When I showed initial versions of the marketing material to friends, some of them didn’t get that it was a joke. They certainly weren’t stupid and they were engineers, but the specificity can make it difficult to see the humor. Sarcasm isn’t clear when you don’t have tone to help emphasize the point.

So on the last day I decided to add emojis, more obvious jokes that it’s a cash grab, etc. to prevent ANYONE from POSSIBLY misconstruing this as a serious product.

Technical breakdown of how I programmed/deployed/architected this joke coming soon.

tedkim97.github.io/dns_with_ads_aprilfools_joke
Failed Branchless Optimization - are we actually optimizing what we think we are?
programmingsoftwarerustarchitectureassembly
TLDR

This code change makes your code (slightly) faster - but the reason is really simple/dumb.

let mut flags: u16 = 0;
// original
if value.field1 {
    flags |= dns_header_masks::QR;
}
// faster
flags |= dns_header_masks::QR * (value.field1 as u16);

Context (you can skip this section)

In DNS land, all communication is done as “messages” in a specific wire format originally defined in RFC1035#4.1. I’m not going to dive into the RFC, but the relevant part is the “header” section. Every DNS message includes a “header” that specifies the remaining sections of the message for parsing, in addition to including query information like response codes, query parameters, etc.

The header format has changed a little, but the current best practice (RFC6895#2) defines the wire-encoded format like this:

                               1  1  1  1  1  1
 0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA| Z|AD|CD|   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Even though the wire format exists, most of the DNS libraries I’ve seen include a logical representation of a “message”, i.e. a struct that would look something like this:

struct DnsHeader {
    id: u16,
    qr: bool,
    opcode: u8,
    aa: bool,
    tc: bool,
    rd: bool,
    ra: bool,
    ad: bool,
    cd: bool,
    rcode: u8,
    qdcount: u16,
    ancount: u16,
    nscount: u16,
    arcount: u16,
}
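
To make the mapping between the diagram and the struct concrete, here is a minimal sketch of decoding the flag word (the second 16 bits of the header) into booleans. `HeaderFlags` and `decode_flags` are hypothetical names made up for this illustration, not taken from any DNS library; the bit positions are read off the diagram above, where bit 0 is the most significant bit.

```rust
/// Hypothetical, trimmed-down view of the flag word (bytes 2..4 of the header).
#[derive(Debug, PartialEq)]
pub struct HeaderFlags {
    pub qr: bool,
    pub opcode: u8,
    pub aa: bool,
    pub tc: bool,
    pub rd: bool,
    pub ra: bool,
    pub ad: bool,
    pub cd: bool,
    pub rcode: u8,
}

/// Decode the big-endian flag word of a 12-byte DNS header.
pub fn decode_flags(header: &[u8; 12]) -> HeaderFlags {
    // DNS uses network byte order, so the first byte holds the high bits.
    let word = u16::from_be_bytes([header[2], header[3]]);
    HeaderFlags {
        qr: (word & 0x8000) != 0,
        opcode: ((word >> 11) & 0x0F) as u8,
        aa: (word & 0x0400) != 0,
        tc: (word & 0x0200) != 0,
        rd: (word & 0x0100) != 0,
        ra: (word & 0x0080) != 0,
        // 0x0040 is the reserved Z bit and is skipped in this sketch.
        ad: (word & 0x0020) != 0,
        cd: (word & 0x0010) != 0,
        rcode: (word & 0x000F) as u8,
    }
}
```

For example, a flag word of 0x8580 decodes to a response (QR) with AA, RD and RA set, opcode 0 and rcode 0.
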

What’s the point?

History aside, what’s the point?

Code that interacts with logical representations (i.e. structs) needs to convert to the wire format (i.e. bytes) when sending the data over the socket. In the RFC we have logical bools that represent a message. The QR (query/response), AA (authoritative answer), TC (truncation), RD (recursion desired), RA (recursion available), AD (authentic data), and CD (checking disabled) flags are all booleans/1 bit.

This means when we convert the logical code to the wire format, we often employ an if statement that leads to simple branches like this:

let mut header_flags = 0u16;
if header.authoritative {
	header_flags |= 0b0000_0100_0000_0000;
}
if header.recursion_desired {
	header_flags |= 0b0000_0001_0000_0000;
}

Don’t take my word for it, we can see some real life examples on some open-source repos.

This dns-parser library will apply a Bitwise OR to the value that will be written to the wire - it looks something like this:

pub fn convert(&self, ...) {
	// other conversions/checks
	let mut flags = 0u16;
	// ...
	if self.authoritative {
		flags |= 0b0000_0100_0000_0000;
	}
	// continue
}

Hickory-dns (formerly trust-dns) header code will also apply a bitwise OR depending on the logical value. It would boil down to something like this:

let mut flags = 0u16;
// ...
// I would describe this as more of a ternary-like operator..?
// Note: no semicolons inside the braces, so each arm evaluates to a u16.
flags |= if self.authoritative {
	0b0000_0100_0000_0000
} else {
	0b0000_0000_0000_0000
};

This is a branch - and where there is a branch, there’s room for a branchless optimization (sometimes…).
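
As a sanity check before optimizing anything, the two styles can be compared side by side. The sketch below is only an illustration: the `dns_header_masks` values are my own assumption derived from the header diagram (the real module in the benchmark repo may differ), and `flags_branched`, `flags_branchless` and `implementations_agree` are hypothetical helper names.

```rust
/// Hypothetical mask constants read off the header diagram (bit 0 is the MSB).
pub mod dns_header_masks {
    pub const QR: u16 = 0b1000_0000_0000_0000;
    pub const AA: u16 = 0b0000_0100_0000_0000;
    pub const RD: u16 = 0b0000_0001_0000_0000;
}

/// The `if`-based style seen in the libraries above.
pub fn flags_branched(qr: bool, aa: bool, rd: bool) -> u16 {
    let mut flags = 0u16;
    if qr {
        flags |= dns_header_masks::QR;
    }
    if aa {
        flags |= dns_header_masks::AA;
    }
    if rd {
        flags |= dns_header_masks::RD;
    }
    flags
}

/// The multiplication-based style: `bool as u16` is 0 or 1,
/// so each product is either 0 or the full mask.
pub fn flags_branchless(qr: bool, aa: bool, rd: bool) -> u16 {
    (dns_header_masks::QR * (qr as u16))
        | (dns_header_masks::AA * (aa as u16))
        | (dns_header_masks::RD * (rd as u16))
}

/// Exhaustively compares both versions over all 8 input combinations.
pub fn implementations_agree() -> bool {
    (0u8..8).all(|bits| {
        let (qr, aa, rd) = ((bits & 1) != 0, (bits & 2) != 0, (bits & 4) != 0);
        flags_branched(qr, aa, rd) == flags_branchless(qr, aa, rd)
    })
}
```

Because the inputs are plain booleans, the whole input space is tiny and an exhaustive comparison is cheap, which makes it a reasonable guard to keep around while experimenting with either form.
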

Branchless Optimizations & Pre-Optimizations

The reason why I emphasize branches is because branches can potentially be slower when branch mispredictions occur, and employing branchless patterns can potentially speed up code. The reason why I emphasize potentially is because it’s highly dependent on the compiler, CPU architecture, how well branches are predicted, etc.

TL;DR what is branch prediction? Modern CPUs are complicated, and there are a lot of optimizations that make your code go fast. One of those optimizations is branch prediction1. Essentially the processor will try to guess the outcome of a branch and perform the subsequent computation ahead of time. Predicting the outcome of a branch is a way of improving throughput on modern CPUs. If the branch is guessed correctly, great! If it was a wrong guess, your CPU will need to throw away the results, backtrack, and compute the correct branch. This is expensive, and the more frequently your CPU guesses wrong, the bigger a penalty your code pays (in practice CPUs are quite good at this). I wouldn’t consider myself an expert on branch prediction, so I recommend these resources:

  • https://danluu.com/branch-prediction/

  • https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array

  • https://en.algorithmica.org/hpc/pipelining/ (this chapter in general)

  • https://blog.cloudflare.com/branch-predictor
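
If you want to poke at branch prediction yourself, the classic experiment from the StackOverflow question above can be sketched in a few lines. This is a rough illustration under my own assumptions: `pseudo_random_bytes` and `sum_large` are hypothetical helpers, and the generator is a simple LCG used only to avoid an external crate.

```rust
/// Tiny deterministic pseudo-random byte generator (an LCG), used only so
/// this sketch does not need an external crate.
pub fn pseudo_random_bytes(n: usize, mut state: u64) -> Vec<u8> {
    (0..n)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Use the top byte, which has the best statistical quality.
            (state >> 56) as u8
        })
        .collect()
}

/// Sums the elements >= 128. The `if` is the branch the predictor must guess:
/// on sorted data the outcome flips only once, on shuffled data it is
/// close to a coin toss.
pub fn sum_large(data: &[u8]) -> u64 {
    let mut sum = 0u64;
    for &x in data {
        if x >= 128 {
            sum += x as u64;
        }
    }
    sum
}
```

Calling `sum_large` on the raw data and on a sorted copy gives the same result, so any timing difference (measured with something like `std::time::Instant`) comes from how the work is executed. Be aware that an optimized build may turn this loop into branchless or vectorized code, in which case the difference can disappear entirely.
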

For instance, here is one of the simplest branched/branchless comparisons:

// branched
return (a > b) ? a : b;
// branchless
return (a > b) * a + (a <= b) * b;

The tradeoff with the branchless version of the code is that you’re always computing more (in this case we’re performing additional arithmetic operations), while the branched version doesn’t pay for that additional compute. Moreover, if the CPU correctly guesses the branch 100% of the time, the branch will be “free” compared to the branchless code. However, depending on the language, compiler, and CPU architecture, the branchless optimized assembly may have more or fewer instructions than the non-optimized version.
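
The comparison above is written C-style; a Rust translation might look like the sketch below (`max_branched` and `max_branchless` are names I made up for illustration). Note that “branched” here describes the source code only: the compiler is free to emit a conditional move for either version.

```rust
/// Source-level branch: pick one of the two values.
pub fn max_branched(a: i64, b: i64) -> i64 {
    if a > b {
        a
    } else {
        b
    }
}

/// Source-level branchless: exactly one of the two comparisons is true,
/// so one product is the value itself and the other is 0.
pub fn max_branchless(a: i64, b: i64) -> i64 {
    ((a > b) as i64) * a + ((a <= b) as i64) * b
}
```

Both functions return the same value for every input; the branchless one always performs two comparisons, two multiplications and an addition, which is the “computing more” side of the tradeoff.
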

Another tradeoff you’ll get is that employing branchless logic really runs against “semantic” coding principles. If I saw a random branchless algo, I wouldn’t be able to tell if an idiot or a genius wrote it. That’s a non-trivial argument that the branchless version hurts readability, thereby hurting maintainability.

Additionally, you should never assume your “branchless optimization” actually reduces branches. Smart compilers are able to compile if/else statements into conditional moves (the cmov instruction), avoiding branches and the vulnerability to branch prediction failures. IN FACT, I expected this to be the case for Rust. The Rust compiler (and/or LLVM) should know that it can just use a conditional move (cmov). Spoiler alert, the Rust compiler is pretty good about that here, but I think you should still read on.

Finally, there’s also the meta that branchless programming usually isn’t where the biggest wins are going to be for performance. Odds are, there are other places to hunt for performance gains such as database/query optimizations, software architecture, or messy memory usage.

Benchmarking & Investigating Microbenchmark

I wrote three implementations of the function that look like this:

pub fn branchless(header: &DnsHeader, bytes: &mut [u8]) {
    // ...
    let mut flags: u16 = 0;
    flags |= dns_header_masks::QR * (header.qr as u16);
    flags |= u16::from(header.opcode) << 11;
    flags |= dns_header_masks::AA * (header.aa as u16);
    flags |= dns_header_masks::TC * (header.tc as u16);
    // ...
}

pub fn branched_1(header: &DnsHeader, bytes: &mut [u8]) {
    // ...
    let mut flags: u16 = 0;
    if header.qr {
        flags |= dns_header_masks::QR;
    }
    flags |= u16::from(header.opcode) << 11;
    if header.aa {
        flags |= dns_header_masks::AA;
    }
    if header.tc {
        flags |= dns_header_masks::TC;
    }
    // ...
}

pub fn branched_2(header: &DnsHeader, bytes: &mut [u8]) {
    // ...
    let mut flags: u16 = 0;
    flags |= u16::from(header.opcode) << 11;

    flags |= if header.qr {
        dns_header_masks::QR
    } else {
        0u16
    };
    flags |= if header.aa {
        dns_header_masks::AA
    } else {
        0u16
    };
    flags |= if header.tc {
        dns_header_masks::TC
    } else {
        0u16
    };
    // ... 
}

I modeled branched_1 to be like tailhook/dns-parser snippets, and branched_2 to be like the hickory-dns lib. I did the basics of avoiding the compiler optimizing away the important bits here. Repo is here for reproduction.
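
For readers unfamiliar with what “avoiding the compiler optimizing away the important bits” means in practice, here is a minimal sketch using `std::hint::black_box` from the standard library. It is not the harness from the repo: `pack_flags` is a hypothetical stand-in for the function under test, and `time_calls` is a made-up helper that uses a plain `Instant` timer instead of `cargo bench`.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the function under test.
pub fn pack_flags(qr: bool, aa: bool) -> u16 {
    (0x8000 * (qr as u16)) | (0x0400 * (aa as u16))
}

/// Times `calls` invocations and returns the elapsed time plus a checksum
/// of all results, so the caller can confirm the work really happened.
pub fn time_calls(calls: u32) -> (Duration, u32) {
    let start = Instant::now();
    let mut checksum = 0u32;
    for i in 0..calls {
        // black_box on the inputs discourages constant folding...
        let qr = black_box(i % 2 == 0);
        let aa = black_box(i % 3 == 0);
        // ...and on the output discourages removing the call as dead code.
        checksum = checksum.wrapping_add(black_box(pack_flags(qr, aa)) as u32);
    }
    (start.elapsed(), checksum)
}
```

`black_box` is documented as a best-effort hint rather than a guarantee, so it is still worth looking at the generated assembly before trusting a microbenchmark.
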

When I actually ran the benchmarks, I was genuinely surprised (and annoyed) that the branchless version was slightly faster. Out of 10K calls, we saved a whole 18 microseconds (18,000 ns) per iteration compared to the slowest implementation…! Increasing the number of trials maintained this speed difference - I’m going to skip the stat test for my sanity.

cargo bench
...
test header_conversion::tests::bench_wire_format_query_header_branched_1        ... bench:      55,772 ns/iter (+/- 735)
test header_conversion::tests::bench_wire_format_query_header_branched_2        ... bench:      61,655 ns/iter (+/- 487)
test header_conversion::tests::bench_wire_format_query_header_branchless        ... bench:      43,204 ns/iter (+/- 962)
...

Profiling w/ perf

However, was this speedup actually because we’ve magically reduced our branch mis-prediction rate? Or is there something else at play?

To see if we’ve actually improved performance, we need to profile the code. I used perf stat with a non-benchmarked version of the code (it semi-randomly generates structs, marks the output as non-optimizable, etc). Note that the creation of these structs takes up a decent chunk of the perf runtime here, so the absolute times are mostly noise. Here are the results I got:

Branchless
 Performance counter stats for './target/release/random_bench branchless 5000000' (1000 runs):

            241.31 msec cpu-clock                 #    0.999 CPUs utilized            ( +-  0.22% )
            # ... 
       393,247,063      branches                  # 1629.638 M/sec                    ( +-  0.03% )  (72.93%)
            24,491      faults                    #    0.101 M/sec                    ( +-  0.00% )
       393,160,350      branches                  # 1629.278 M/sec                    ( +-  0.02% )  (72.30%)
           708,438      branch-misses             #    0.18% of all branches          ( +-  4.95% )  (70.70%)

          0.241632 +- 0.000522 seconds time elapsed  ( +-  0.22% )

Branched 1
 Performance counter stats for './target/release/random_bench branched1 5000000' (1000 runs):

            248.24 msec cpu-clock                 #    0.999 CPUs utilized            ( +-  0.14% )
            # ... 
       394,276,819      branches                  # 1588.313 M/sec                    ( +-  0.03% )  (72.36%)
            24,491      faults                    #    0.099 M/sec                    ( +-  0.00% )
       396,680,480      branches                  # 1597.995 M/sec                    ( +-  0.02% )  (72.43%)
           665,171      branch-misses             #    0.17% of all branches          ( +-  0.04% )  (71.35%)

          0.248560 +- 0.000359 seconds time elapsed  ( +-  0.14% )

Branched 2
 Performance counter stats for './target/release/random_bench branched2 5000000' (1000 runs):

            249.35 msec cpu-clock                 #    0.999 CPUs utilized            ( +-  0.15% )
            # ... 
       393,823,034      branches                  # 1579.386 M/sec                    ( +-  0.03% )  (72.24%)
            24,491      faults                    #    0.098 M/sec                    ( +-  0.00% )
       396,491,903      branches                  # 1590.089 M/sec                    ( +-  0.02% )  (72.39%)
           668,578      branch-misses             #    0.17% of all branches          ( +-  0.29% )  (71.39%)

          0.249680 +- 0.000386 seconds time elapsed  ( +-  0.15% )

So the branchless version still has the same branch miss rate as the other implementations AND the same order of magnitude of branches (I believe the branches are coming from the rand crate I pulled in). So the speed up can’t be from the “branchless optimization” we thought we did.

Maybe there’s some caching at play? No, our perf stats don’t show any better cache locality usage:

Performance counter stats for './target/release/random_bench branchless 5000000' (1000 runs):

           241.31 msec cpu-clock                 #    0.999 CPUs utilized            ( +-  0.22% )
       10,132,540      cache-references          #   41.990 M/sec                    ( +-  0.78% )  (69.99%)
          423,594      cache-misses              #    4.181 % of all cache refs      ( +-  0.49% )  (70.45%)
      970,739,141      cycles                    #    4.023 GHz                      ( +-  0.20% )  (71.34%)
    2,044,234,986      instructions              #    2.11  insn per cycle           ( +-  0.04% )  (72.30%)
# ...

Performance counter stats for './target/release/random_bench branched1 5000000' (1000 runs):

           248.24 msec cpu-clock                 #    0.999 CPUs utilized            ( +-  0.14% )
       10,255,956      cache-references          #   41.315 M/sec                    ( +-  0.56% )  (70.73%)
          481,062      cache-misses              #    4.691 % of all cache refs      ( +-  0.24% )  (70.79%)
    1,000,298,005      cycles                    #    4.030 GHz                      ( +-  0.14% )  (70.91%)
    2,071,827,255      instructions              #    2.07  insn per cycle           ( +-  0.03% )  (71.44%)
# ...

Performance counter stats for './target/release/random_bench branched2 5000000' (1000 runs):

           249.35 msec cpu-clock                 #    0.999 CPUs utilized            ( +-  0.15% )
       10,254,911      cache-references          #   41.126 M/sec                    ( +-  0.43% )  (70.80%)
          481,199      cache-misses              #    4.692 % of all cache refs      ( +-  0.24% )  (70.86%)
    1,003,748,791      cycles                    #    4.025 GHz                      ( +-  0.15% )  (70.95%)
    2,117,250,613      instructions              #    2.11  insn per cycle           ( +-  0.03% )  (71.37%)

What is the difference here? The difference we should pay attention to is the number of instructions run. branched_2 has the most instructions and is the slowest, while branchless has the fewest instructions (and is the fastest). Assuming every instruction has the same cost and the CPU is running at a fixed rate, fewer instructions to do the same computation is faster.
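
The relationship between the counters can be checked with a little arithmetic: CPU time is roughly cycles divided by clock frequency, and cycles are roughly instructions divided by instructions-per-cycle. The helpers below (`cpu_time_ms` and `insn_per_cycle`, both hypothetical names) are a small sketch that plugs in the perf numbers printed above.

```rust
/// CPU time in milliseconds from a cycle count and a clock rate in GHz.
pub fn cpu_time_ms(cycles: u64, ghz: f64) -> f64 {
    cycles as f64 / (ghz * 1e9) * 1e3
}

/// Instructions retired per cycle.
pub fn insn_per_cycle(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}
```

Plugging in the branchless run (970,739,141 cycles at 4.023 GHz) gives roughly 241.3 ms, which lines up with the reported cpu-clock, and 2,044,234,986 instructions over those cycles gives roughly 2.11 instructions per cycle. Since the instructions-per-cycle figures are nearly the same across the three runs, the run with the fewest instructions ends up with the fewest cycles.
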

We can confirm this by analyzing the assembly.

Analyzing the assembly

What’s the underlying implementation of the three snippets here? For that we need to actually look at the assembly the code is generating. Using the crate cargo-show-asm (NOT the older cargo-asm crate, which is considered unmaintained and wasted a bunch of my time) we can get the generated assembly with the cargo asm subcommand.

cargo asm header_util::header_conversion::convert_to_wire_format_branched_1
cargo asm header_util::header_conversion::convert_to_wire_format_branched_2
cargo asm header_util::header_conversion::convert_to_wire_format_branchless

I’ll just save the assembly in the repo because it takes too much page space here. If we count the number of instructions, each version of the function has:

  • Branched1 has 134 instructions
  • Branched2 has 143 instructions
  • Branchless has 126 instructions

The reason why the branchless version is faster than the others is because the machine code just has fewer instructions - meaning we have faster code. This also lines up neatly with the relative speed ordering of all three functions.

Conclusion + Other thoughts

Three thoughts:

  1. If you make 10 million QPS, you could save up to 0.018 seconds every second..! In a year you’re saving 157 hours worth of time… Jokes aside, I want to emphasize how pointless of an exercise this is. The header is only 12 bytes out of potentially hundreds of bytes in a DNS message. There’s a lot of other work that needs to occur (parsing, for example), and the speed increase is probably not worth the amount of time necessary to actually confirm/profile this behavior.

  2. I was honestly very surprised this “optimization” (aggressive air quotes)… “worked”. I was expecting these three versions to all have the same speed (to the nanosecond). Mainly because I was expecting the branched versions of the code to use cmov instructions (which they do), but also I expected the final representations of two branched versions to be identical.
    • I was actually more interested in analyzing how much randomness in those header bits affected the speed, and how manipulating the prediction rate affected conversion. I was sort of hoping to do an analysis like the one here and make a cool visualization. I think the problem is a bit more interesting here because the branches are being predicted one after another; depending on how the CPU behaves, we could observe the entire instruction sequence having to be restarted because of a pipeline flush.
  3. I do think language matters here. There are certain optimizations that a C++ compiler can make that the Rust compiler wouldn’t (I heard this is due to Rust’s aggressive safety checks). I’m curious how the C++ version with clang would behave, but that’s only if I can finally decide on a build system…
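
The back-of-the-envelope numbers in the first point can be reproduced with a few lines. This sketch assumes, as stated earlier, that one benchmark iteration is 10K calls, and it uses a 365-day year; `ns_saved_per_call` and `hours_saved_per_year` are hypothetical helper names.

```rust
/// Nanoseconds saved per call, given two ns/iter results where one
/// iteration consists of `calls_per_iter` calls.
pub fn ns_saved_per_call(slow_ns_per_iter: f64, fast_ns_per_iter: f64, calls_per_iter: f64) -> f64 {
    (slow_ns_per_iter - fast_ns_per_iter) / calls_per_iter
}

/// Hours of CPU time saved per (365-day) year at a sustained query rate.
pub fn hours_saved_per_year(ns_per_call: f64, qps: f64) -> f64 {
    // Seconds saved during each second of wall-clock time.
    let saved_seconds_per_second = ns_per_call * 1e-9 * qps;
    // Saved fraction times the 8,760 hours in a year.
    saved_seconds_per_second * 365.0 * 24.0
}
```

With the benchmark results above (61,655 vs 43,204 ns/iter) the saving is about 1.8 ns per call; at 10 million QPS that is about 0.018 seconds per second, or roughly 157 hours over a year.
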

Appendix

What OS/kernel?

  • uname -r (5.4.0-150-generic) && lsb_release -a (UBUNTU 18.04.5 LTS)

Rust Version

  • cargo 1.78.0-nightly & rustc 1.78.0-nightly

What about the debug versions (unoptimized) of the code?

  • The debug version of the code DOES have branches, and running perf on them did show the branched versions having ~0.15-0.20% more branch misses than the branchless version. However, I didn’t think it was worth doing an analysis here because people are going to complain that I analyzed a debug build.

Does the result change if you use a struct method (fn convert_to_wire_format_method_branchless(&self, bytes: &mut [u8])) rather than a function like convert_to_wire_format_branchless(header: &DnsHeader, bytes: &mut [u8])?

  • No significant difference in runtime

Does the assembly change with unsafe rust?

  • I wrapped the inside body of each function with a giant unsafe {} (and changed the function signature to unsafe) and the assembly didn’t change. That result makes sense given what they say about the unsafe in the rust book: https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html

What about the unoptimized versions of the code?

  • The debug versions of these functions all performed roughly the same (and a lot slower than the optimized version). You can use cargo bench with the debug binaries by including this within Cargo.toml:
[profile.bench]
opt-level = 0

Did you target your CPU architecture?

  • I ran these benchmarks and perf tests by including this in my CLI call (I read on the internet somewhere that this was enough):
    RUSTFLAGS='-C target-cpu=native' cargo bench
    
  1. Branch predictions are also related to a set of CPU vulnerabilities 

tedkim97.github.io/rust_simple_branchless_optimization
My Traumatic AWS Bill
programmingsoftwareopinionmicroservicessoftwarearchitecturecloudsecurity

My AWS bill seems a bit larger than the usual (pictured below):

my anonymized, surprise AWS bill around 13000 dollars

my anonymized, AWS bill


My personal AWS usage is usually $0 per month, so a charge around ~$13,XXX was a bit surprising. Unlike other stories where surprise AWS charges come from configuration or architectural mistakes (here is one I can remember off the top of my head), these charges were a result of a security mistake (I made).

Credential Stuffing

I seem to have been a victim of a “credential stuffing” attack. I made my personal account when I was in undergrad to get more experience with the “cloud technologies” trend. Unfortunately one of the password variations I picked was compromised in other password leaks.

As a result, a malicious user hijacked my AWS account and ran heavy amounts of Amazon SageMaker (their machine learning platform) in all the default available regions1.

A breakdown of my AWS bill

Amazon Responses & Resolution

Luckily for me, Amazon was incredibly helpful during the whole process and issued a one-time forgiveness for the bill.

Amazon detected this malicious behavior fairly quickly and alerted me that there was suspicious activity on my AWS account. Unfortunately I was on vacation, so I was sloppy and inattentive about auditing my AWS resources - I only checked the services I used (like EC2), and only in US-EAST-1.

After responding to the first email and getting a 3rd reminder email I realized something was very wrong and checked the billing.

It was a very uncomfortable 2 weeks knowing that I might be on the hook for a lot of money that I didn’t use. However, AWS support helped me revoke all keys, permission groups, and resources in all of the major regions - as well as talking to the billing department to give me an exception for this mistake.

Takeaways

My initial reaction to this situation was that a beginner should avoid making cloud service accounts until they actually need one. However, I realized that this isn’t the correct takeaway. Mucking around and experimenting with actual cloud resources (AWS, GCloud, Azure, etc) in an interactive approach is one of the most effective ways to have ideas “stick”.

The actual lesson is that when one makes an account they should have these considerations in mind:

  • Enabling 2FA
  • Setting up cost and resource monitoring and alerts
  • Setting up proper users and permissions
  • Disabling certain regions
  • Rotating passwords (although I find this personally annoying)

This lesson also serves as a reminder of how scarily effective cloud services can be - in a span of a few hours someone can easily spin up hundreds of VMs in all parts of the world and create (or destroy) a lot of value.

  1. What these trained models are used for - I’m not sure 

tedkim97.github.io/traumatic_aws_bill
Testing: A Tragedy
humoropinionmicroservicestestingdeploymentsenvironments
Context

This was a “story” I wrote to illustrate a point about testing within complex software architectures in a different blog post. The story got too long, too cringe-worthy, too difficult to make “good” - and was distracting from the actual purpose of my original post. As a result, I decided to cut my losses and quarantine it to a different post.

The gist of the story is that testing comes with a lot of little & large frustrations that are too difficult and erratic to describe, you just need to “feel” them. Hopefully, the intended audience will know the pain that I’m talking about.

“Testing: A Tragedy”

As a story telling exercise, let’s pretend we’re a developer working in a trendy “service-oriented” architecture.

We’re jazzed because we just wrote a clean implementation for $FEATURE. Wanting to follow good SWE practices, we plan to test our changes to check the correctness of our implementation. We think, “this should be easy”, as all we need to do is pull the latest images for each service and make sure their configurations are correct.

We spend ~15-20 minutes parsing through configurations, dockerfiles, scripts and double checking to make sure everything is correct. Some of these services have a “dependency” order - meaning we need to make sure Redis boots before Service1 and Kafka boots before Service2 and Service3. We diligently consult our notes and start everything in the right order. Our feature may involve a different team’s service, so we spend 30 minutes downloading their repos, following their docs1, and praying that they didn’t skip a tiny but critical detail in their documentation.

While we were focused on setting up the new repo, our Kafka container failed to build and now something is error-ing about schemas. In the back of our minds we have to worry about all of the possible reasons why this random failure occurred. Regardless, we’re annoyed as we have to manually kill containers, and hope a reboot will fix the error.

With a nag in the back of our mind that something is misconfigured in our local setup, we start testing. After spending another hour diligently testing and documenting our tests for reproducibility, we are confident enough to make a pull request. After 1-2 days asking for reviews & approvals we merge our changes and push to DEV - reproducing our tests from earlier. Testing in DEV goes great and then we push to INT!

Tragically, we discover an error that occurs 25% of the time in INT. The issue is that our changes didn’t account for all of the authorization flows, and failed to properly handle the resulting errors. We lament our foolishness for missing something so obvious! “$SENIOR_DEV” wouldn’t have made that mistake.

With this new bug in mind, we write some more code to fix our issue. Instead of trying to reproduce the error in our local environment, we end up having to modify our local versions or directly inject the behavior in our code. Of course that’s not the “correct” way of testing out fixes, but we’re already 1 day over our sprint value. We restart the 45 minute process of setting up the local environment. We spend another hour diligently testing and documenting our tests for reproducibility, and are ready to make another BUGFIX pull request. After an urgent day asking for reviews & approvals, we merge our changes and push to DEV and INT.

After pushing to STG, we discover an error in STG that occurs 5% of the time. Processes are mysteriously failing and hanging for cryptic reasons… Even $SENIOR_DEV is confused. We spend 2-4 hours investigating and eventually discover that the Payments service sends requests to cancel, causing $EXTERNAL_SERVICE to delete database entries before we’ve read from them.

We write more code to fix our issue. We’re clever, and we reproduce the error by setting breakpoints/timers in our local service, and make queries to delete entries from our database. The process is a little annoying because of having to constantly juggle different terminals and IDEs. However, our local testing goes great! We create a bugfix PR, ask for reviews, wait 1-2 days, and merge the branch and push to DEV, INT, STG again.

Everything in DEV, INT, STG goes great, so the team decides to push to PROD. And then something goes wrong…

Burned by our experiences of implementing and deploying our $FEATURE, we start getting lazier and lazier with our testing. We begin to haphazardly test and merge our changes in - relying on the integration and end-to-end tests to catch any bugs (but we know they won’t catch all of them). Our ability to develop tickets starts slowing down as we start saying things like:

  • “I’m waiting for a meeting with $SENIOR_DEV to discuss any cases I missed in my pull request”
  • “I’ve tested my branch with mocks, so I’ll do more testing once my merge is pushed to DEV & INT”
  • “oh I can’t test XYZ because it’s really sensitive to environments, so I need to deploy my changes to INT for more reproducible/accurate conditions testing”.

As developers we’ve accepted the fallout of software complexity, completing tickets, making bugs, and fixing bugs.

The end

  1. docs is shorthand for “documentation” - a magical scroll that only exists in myths 

tedkim97.github.io/testing_a_tragic_short_story
Better Testing with Environment Composability
programmingopinionmicroservicessoftwarearchitecturetestingdeploymentsenvironments
Introduction

The motivation for this blog post is pain1.

Software development can be painful because testing well & thoroughly is tedious and exhausting (there’s a reason why QA is usually a full-time job). If you find testing easy, (1) you might not work with complex software or (2) you are the source of bugs.

Regardless of whether a company has scaled up their development & testing processes, the developer still needs to ensure the correctness of their code.

In service oriented architecture, this tends to suck for developers (you), because testing complexity scales with dependencies (other internal services, external APIs, databases, queues, etc). There are painful moments where the developer has to deal with cumbersome setup processes, or rely on mocks, or write even more code just to do basic testing.

Despite the increased number of headaches, there are ways that we can leverage service architecture and development environments to break away from this slog, and test “better”.

Short Story: “Testing: A Tragedy”

Originally this section was dedicated to a story. The gist of the story is that testing comes with a lot of little & large frustrations that are too difficult and erratic to describe, you just need to “feel” them.

I wrote it to illustrate the testing experience within large service architecture to try to provide context for this post. I also wrote it because it was difficult to communicate the little & large developer frustrations - unless you’ve already run into these frustrations professionally. The story’s length and tone were distracting from the actual purpose of this post (it is too long, too boring, and too difficult to make good), so I decided to cut it out and quarantine it to a different post. Feel free to read it, but it’s incredibly boring and disorganized.

Reviewing “Testing: A Tragedy”

Aside from my poor creative writing skills, this story is boring, frustrating, and repetitive because testing can be boring, frustrating, and repetitive.

Development for a well-meaning (but relatively green) developer is incredibly boring or frustrating. In my initial drafts of this post, I struggled to “objectively” explain the pains of testing, but every explanation was clunky because it ignored the “human” experience. My hope is that the audience remembers the “micro-frustrations” and inconsistencies when trying to test their own software. While seemingly trivial, aspects like setting up, introducing outside dependencies, and testing locally add unnecessary friction and stress that’s outside of the actual coding and testing process.

an annotated excel plot charting my increasing sadness over time

I also wanted to point out that the development cycle for these changes is incredibly wonky. Software bugs have slipped through because the environments couldn’t catch the “gotchas” of production. In a monolithic architecture, the development, testing, and deployment of our code would look something like this:

a flowchart depicting the development process described in the short story - and in a monolithic architecture. The developer needs to merge and push changes

Every time a developer found a mistake, they would need to restart from scratch and progressively moving their changes through the environments. This is fine because environments are designed to test and verify behavior as well as catch bugs - progressively higher environments also model more realistic behavior. However, depending on the frequency and processes around deployments (git flow releases vs CI+CD, permissions required to deploy, etc) - this can significantly slow developer output. A one day delay between deploying a branch to INT or DEV adds an extra day where a developer is stalled from working on a feature.

Finally, we have the idea that a senior engineer could have kept (obvious or obscure) bugs from reaching production, if they carefully reasoned through their PR or personally tested the feature branch. Relying on the senior to personally catch all your mistakes (outside of code review) is not a scalable practice. Seniors have higher priorities and won’t be able to thoroughly review and ensure everything is correct - the more popular or important the product is, the busier the senior gets. At a certain point, a junior developer should be able to figure out, find, or predict bugs through their own testing.

Moreover, the senior engineer is only human and also capable of pushing bugs to prod themselves…

a meme depicting a junior vs senior engineer. Of course there's more complexity to this, but nuance is too hard to fit in a crappy meme

Work Smarter Not Harder (The Solution)

The issue is that it’s difficult to reliably and realistically test software because emulating behaviors and bugs heavily relies on the environment.

I’ve had issues with bugs that couldn’t be reproduced “locally” because the service’s dependencies couldn’t be mocked. Or situations where request flows are locked by dependencies. For example, some dependency (like authentication, payments, data) can lock you out of testing certain flows or situations, or is unable to behave the way you want it to. Or maybe a difference in dependency version changes some underlying behavior - subtly changing the correctness of your implementation.

It feels like it’s impossible to get remotely close to “production ready” code without merging and deploying changes into higher environments 2. The result is to compromise and merge something that the developer understands is reasonably imperfect (with the expectation for more amendments to be made). However, this is a sloppy and risky habit that was necessary from the “monolith” days - when our environments were single hunks of immutable code where everything needed to be in the right place at once.

random figure depicting monoliths

However, we’re using microservices now (cue sarcastic fanfare)! We can actually leverage some of the benefits of service architectures. One of them being that developers can now treat the environments themselves as composable components that we can mix and match to satisfy unique testing situations faster!

What does “treating our environments as composable components” mean? It means that instead of running local dependencies or mocks, we source our dependencies directly from other environments (DEV, INT, STG, etc).

random figure depicting monoliths

In this approach, we have access to realistic, predictable behavior of services in higher environments, without having to actually deploy changes to these higher environments.

The process of testing & deploying environment by environment only to have to restart locally is over! This frees us from (some of) the repetition of testing. Developers can predict and discover bugs “early on” without having to make embarrassing PRs over and over again (“haha I missed a bug so i need to redeploy laugh in pain”).

Compared to our previous workflow, it now looks something more like this:

a flowchart depicting the development process with composable environments as suggested above

We won’t be able to catch all of the bugs and not everything can be hooked up like this, but it’s a start.

The Execution

Moreover the execution of this development pattern is easy. It just requires changing some configurations within your local services like this.

From a configuration like this:

{
  "timeout": 1000,
  "upstream_service_1": "http://****:8001",
  "upstream_service_2": "http://****:8002",
  "upstream_service_3": "http://****:8003",
  "feature_toggle_1": true,
  "feature_toggle_2": false
}

To a configuration like this:

{
  "timeout": 1000,
  "upstream_service_1": "http://INT_UPSTREAMSERVICE1",
  "upstream_service_2": "http://****:8002",
  "upstream_service_3": "http://INT_UPSTREAMSERVICE3",
  "feature_toggle_1": true,
  "feature_toggle_2": false
}

I will note it won’t always go as easily as described. Sometimes the environment is going to require a special authorization, or some SSH Tunnel, cloud credentials, etc.
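To make the swap less manual, here’s a minimal sketch (in Go, not taken from any real project) of layering an environment-specific override on top of the local configuration. The overlay function and the idea of keeping overrides in a separate JSON document are my own assumptions; the keys and values are the ones from the configurations above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// overlay decodes the local config, then decodes the override on top of it.
// Keys present in the override win; every other key keeps its local value.
func overlay(local, override []byte) (map[string]any, error) {
	merged := map[string]any{}
	if err := json.Unmarshal(local, &merged); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(override, &merged); err != nil {
		return nil, err
	}
	return merged, nil
}

func main() {
	local := []byte(`{
		"timeout": 1000,
		"upstream_service_1": "http://****:8001",
		"upstream_service_2": "http://****:8002",
		"upstream_service_3": "http://****:8003"
	}`)
	// Hypothetical override: only the services we want to source from INT.
	fromINT := []byte(`{
		"upstream_service_1": "http://INT_UPSTREAMSERVICE1",
		"upstream_service_3": "http://INT_UPSTREAMSERVICE3"
	}`)

	cfg, err := overlay(local, fromINT)
	if err != nil {
		panic(err)
	}
	for _, key := range []string{"upstream_service_1", "upstream_service_2", "upstream_service_3"} {
		fmt.Println(key, "=>", cfg[key])
	}
}
```

Keys missing from the override keep their local values, so upstream_service_2 still points at the local instance while services 1 and 3 come from INT.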

Implementing development patterns like this can let you discover unexpected behavior early on, inject service behavior, mix environments, or even perform forbidden techniques.

Behavior Injection

With these patterns we can easily modify, augment, or change behavior through pass through/middleman services. If we need to account for new responses, headers, etc from dependent services in higher environments - we could easily sub in the information we need.

example of behavior injection

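// Hypothetical code for what this could look like
// (ReverseProxy from net/http/httputil is probably the better tool)
package main

import (
	"io"
	"log"
	"net/http"
)

const originalServiceAddr = "http://..." // placeholder for the original service

var client = &http.Client{}

// forward replays the request upstream, lets modify adjust the response, and copies it back
func forward(w http.ResponseWriter, req *http.Request, modify func(*http.Response)) {
	out, err := http.NewRequestWithContext(req.Context(), req.Method, originalServiceAddr, req.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	out.Header = req.Header.Clone()
	resp, err := client.Do(out)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	modify(resp)
	for k, v := range resp.Header {
		w.Header()[k] = v
	}
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}

func modify1(w http.ResponseWriter, req *http.Request) {
	// Do the things you need to
	forward(w, req, func(resp *http.Response) {
		// Do the responses
	})
}

func modify2(w http.ResponseWriter, req *http.Request) {
	// Do the things you need to
	forward(w, req, func(resp *http.Response) {
		// Do the responses
	})
}

func main() {
	http.HandleFunc("/passthrough1", modify1)
	http.HandleFunc("/passthrough2", modify2)
	log.Fatal(http.ListenAndServe(":8080", nil))
}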

“Frankensteining” Environments

If for some reason we need random things from random environments, we could mix environments as well (like mixing a local service with a STG database and DEV authorization service). By mixing several environments we can grab specific behaviors from specific environments, at the huge risk of not getting realistic tests.

random figure depicting monoliths

The Forbidden Technique: Using PROD

A really degenerate technique someone can use is connecting a local service directly to a production dependency. As a disclaimer, you should not really do this. You should never be writing to production for any reason, and only use READ-ONLY operations. However, reading from production can help you effectively investigate production behavior as well as guard your service from “production-level” wonkiness.
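If you do go down this road, one way to make “READ-ONLY” harder to violate by accident is to put the guard in the HTTP client itself. The sketch below is hypothetical: readOnlyTransport, isReadOnly, and the prod.invalid URL are names I made up, and it assumes the production API only mutates state on methods other than GET/HEAD/OPTIONS (which is not guaranteed for every service).

```go
package main

import (
	"fmt"
	"net/http"
)

// isReadOnly reports whether an HTTP method is conventionally side-effect free.
// An empty method counts as GET, matching net/http's client behavior.
func isReadOnly(method string) bool {
	switch method {
	case "", http.MethodGet, http.MethodHead, http.MethodOptions:
		return true
	}
	return false
}

// readOnlyTransport refuses any non-read request before it leaves the process.
type readOnlyTransport struct {
	next http.RoundTripper
}

func (t readOnlyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if !isReadOnly(req.Method) {
		if req.Body != nil {
			req.Body.Close() // RoundTrip is expected to close the body, even on error
		}
		return nil, fmt.Errorf("refusing %s %s: this client is read-only", req.Method, req.URL)
	}
	return t.next.RoundTrip(req)
}

// newReadOnlyClient wraps the default transport with the guard.
func newReadOnlyClient() *http.Client {
	return &http.Client{Transport: readOnlyTransport{next: http.DefaultTransport}}
}

func main() {
	client := newReadOnlyClient()
	// Placeholder URL; the POST is rejected before any network I/O happens.
	_, err := client.Post("http://prod.invalid/orders", "application/json", nil)
	fmt.Println(err)
}
```

This only guards the client you build this way. It does nothing about a “GET” endpoint that secretly writes, so read-only credentials on the production side are still the safer bet.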

random figure depicting monoliths

Conclusion & Drawbacks

This testing methodology is “better” in that there are fewer frustrations and we can observe/tackle certain classes of bugs earlier on. However, it’s important to acknowledge all of the ways this method falls short.

Manually testing this way is not a remotely scalable or stable testing methodology. Ideally, there should be processes that automatically test changes every step of the way. We also will probably need some unit tests, or mocks, or E2E.

It’s also really important to note that we cannot use this method to drive development. If we were to develop our code around this method (rather than developing around specifications) - we’re essentially writing junk code for rapidly changing/unfinished software.

  1. Not really, I’m just being melodramatic. 

  2. I will note that this problem can be sidestepped if your team can deploy feature branches into higher environments. There are places that do this, but also a ton of places that do not… 

tedkim97.github.io/microservice_environment_composability
Temporary Microservice Pattern
programmingsoftwareopinionmicroservicessoftwarearchitecture
Introduction

I’ve been preoccupied with miscellaneous things in life, making it hard to write blog posts in the past few months, but I’ve worked with a pattern and couldn’t help but write about it.

The pattern occurs when one service (let’s call it DownstreamMS) is blocked by another (UpstreamMS), and our UpstreamMS team cannot help us because they have their own issues (not enough manpower, different priorities, system design disagreements, etc).

problem scenario

microservice diagram of a potential blocked situation


There’s a handful of potential solutions, but one pattern is to introduce a temporary “middleman” service (TemporaryMS) that implements the necessary functionality. The idea being that this temporary service will be maintained and deployed until the upstream service implements these features themselves.

a possible solution

we can avoid this blocker by employing a middleman/temporary microservice that implements the features necessary


I haven’t seen any books/documents outlining this pattern (granted, maybe I haven’t read enough or don’t know the right name for this), but I thought I would outline the contexts and some pros and cons from my experience.

Context & Other Solutions

For a while, I struggled to come up with a general, yet relatable scenario to describe this “situation” without leaking work details. Then I remembered this video:


In the context of this video, our senior engineer could remove this blocker by implementing and deploying an ISO timestamp-converter-service that exists outside of Omegastar - and take it down once Omegastar “gets their shit together”.

Satire aside, implementing “minor” behavior changes between two different APIs can be surprisingly difficult because of a handful of small headaches. For instance:

1. Mismatch in priorities or manpower

I’m sure we all have moments where we would like X team to implement Y feature or fix Z problem. Unfortunately that team has a lot of other critical priorities and not enough people, meaning they can’t get to your tickets until $DEMORALIZING_AMOUNT_OF_TIME.

2. Subtle bugs/edge cases from either party

Moreover, small bugs or edge cases can pop up that complicate the development cycle. These issues aren’t the end of the world, but they might be problematic enough to trigger a version rollback or cause the changes to be scrapped entirely.

3. Coordinating the Deployment of These Changes

I imagine there are a couple of situations where teams accidentally deploy incompatible service versions at the same time like so1:

incompatible deployments

Your team deploys their changes too early


incompatible deployments

The other team deploys their changes too early


The headache is more likely to occur the further away the development team is from CI/CD (meaning teams need to manually cut releases or make release packages), and scales with the number of testing and production environments your team uses.

Possible Solutions

Individually, these aren’t serious issues because there are common solutions to each of these problems. For example:

1. Solution to the manpower/priority problem (D.I.Y)

If the UpstreamMS team is too busy, a member of DownstreamMS can program the changes on UpstreamMS and have their team approve it2.

2. Solution to the implementation problem (Comprehensive Testing)

We can avoid running into surprising bugs or edge-cases by making well-defined unit tests and integration tests.

3. Solutions to coordinating deployments

To avoid any surprise incompatibilities, we can extend our services by adding more API endpoints and versions (rather than changing existing ones), or implementing fallback behavior that depends on service version (via service registry).
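“Adding rather than changing” could look something like this minimal sketch. The /v1/user and /v2/user routes, the payload types, and the sample values are hypothetical; the point is only that v1 keeps its exact shape while v2 carries the new field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// userV1 is the response shape existing callers already depend on.
type userV1 struct {
	Name string `json:"name"`
}

// userV2 adds a field without touching the v1 shape.
type userV2 struct {
	Name      string `json:"name"`
	CreatedAt string `json:"created_at"`
}

// newMux registers both versions side by side.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/user", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(userV1{Name: "ada"})
	})
	mux.HandleFunc("/v2/user", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(userV2{Name: "ada", CreatedAt: "1970-01-01T00:00:00Z"})
	})
	return mux
}

// get runs a request against the mux in-process and returns the response body.
func get(mux *http.ServeMux, path string) string {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String()
}

func main() {
	mux := newMux()
	fmt.Print(get(mux, "/v1/user")) // old callers see the same payload as before
	fmt.Print(get(mux, "/v2/user")) // new callers opt in to the extra field
}
```

Either team can now deploy first: a downstream that still calls /v1/user keeps working whether or not /v2/user has shipped.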

Pitfalls

However, these solutions aren’t “no brainers” and come with their own set of problems.

1. Volunteering is going to be a lot of effort

The volunteering programmer will need to learn a new codebase as well as understand its components to come up with a good implementation.

There’s a “catch-22” moment where if UpstreamMS team is already strapped for time, they probably won’t be able to consult or guide the volunteer on the “correct” approach.

If they are willing to sacrifice their time to give you the necessary details to start, you might be tying up a bigger portion of their team’s time anyways.

Your pull request might have large waiting periods, where the upstream team is too busy to thoroughly review and critique your PR. Furthermore, if your PR doesn’t match up to their vision of the implementation (or is complete spaghetti), the other team will have to take even more time giving you the proper information to fix your PR.

2. Thorough Testing Is Hard

It’s really easy to fall into your own biases and come up with tests that

  • don’t test everything
  • badly test everything

After all, finding edge and corner cases is much easier in leetcode than in real life. Unless you have a business expert on hand, a lot of effort will be spent on refining tests as different behaviors pop up.

3. Creating additional complexity

Extending services and implementing fallback behavior will increase service robustness, but will also increase the complexity. Complexity isn’t the plague, but its acceptability is determined by the team.

4. Disrupting the development cycle

As a culmination of the previous three points, all of these factors can disrupt a team’s development cycle, which can make your (or someone else’s) sprint incredibly wonky. For some teams that wonkiness is acceptable or expected, but for others it might harm their reviews.

Introducing a New Microservice

Creating a new microservice lets us circumvent some of the problems we see in the previous sections.

A volunteer from DownstreamMS won’t need to learn the internals of UpstreamMS, potentially speeding up development of the minor feature.

While we can’t do anything about testing quality, both parties don’t need to worry about deployment coordination. By decoupling the feature from UpstreamMS into its own service, the teams’ respective APIs don’t need to be in sync. As a result we can change the TemporaryMS microservice without affecting the UpstreamMS - meaning we can enter the integration testing phase much faster.

decoupling deployments

Visualization of the middleman service easing coordination concerns


Because TemporaryMS is meant to be temporary, we can justify its existence until everything is in place.

Employing the middleman pattern lets us introduce crazier dependencies because we aren’t changing the scope of the upstream service. Although I wouldn’t recommend it, we could supplement the data from our UpstreamMS with information from OtherMS as easily as this:

incompatible deployments

A new service is not a free lunch

Despite the pros in the previous section, making and deploying a new service is not “free”. Depending on the tooling and documentation, deploying a new microservice can be:

  • mindless (someone else can do it or there’s plenty of tooling to support developers)
  • medium effort endeavor (need to make configuration files and work with deployment tools)
  • complete misery (your environments are configured esoterically and it’s a point-and-click adventure game of finding the right people/documents that can help)

Not to mention that compute is not cheap - the DownstreamMS needs to use this service enough to justify provisioning and eating the cost of compute3.

Ownership and maintenance are also concerns. Who will be responsible for maintaining the service? Will this TemporaryMS actually be temporary, or will it become a piece of “legacy technology”? Just because someone commented //this is a temporary solution, TODO: make better one doesn’t mean it will be retired in a timely manner.

Conclusion, Reflections, and Rambling

This pattern is neat and demonstrates the flexibility of microservice architectures, but I think this solution feels a bit inelegant.

We’ve added complexity to the system’s topology4 as well as the overhead of a new service for the benefit of being able to work faster. Obviously I would love if the simplest, easiest, purest solution could be done out of thin air, but other priorities just prevent this from happening.

Accepting the Microservice Philosophy

My first reaction to this project was lamenting how I was creating such a “small” service. One principle of programming is to reuse code by looking for the right “abstraction”. Rather than trying to hardcode for individual cases, we want code that is flexible and can handle generalizations. Deploying single-use code to do one niche task feels a bit sloppy, BUT is a sacrifice that lets us streamline other aspects of the development process.

Is this a practical solution?

My biggest fear is that someone reads this article and thinks, “there’s no way this would ever be applicable to real software engineering”. To convince the reader (and keep details vague), there was a situation where our services needed an overhaul of some Upstream Business Logic, but the earliest the Upstream Team could make the change was three years later. Obviously this was unacceptable for our product launch, so we created a temporary, middleman service to handle the conversion of the business logic.

In this context, this change would be a part of a big software migration/rewrite, and the skeptic might point out that the company is forcing an unnecessary rewrite/migration. Joel Spolsky has a great essay on legacy software and rewrites, and he outlines that rewrites can be killer moves because, although legacy software is hard to understand and complicated, it works and is well tested.

My philosophy is that we need to take a more holistic approach to understanding software. On the surface level, if the code works, it works. BUT if the code is making other aspects of development, integration, maintenance, or deployment difficult, maybe the code has become “broken”. I can’t imagine a world where spinning up a WindowsXP VM in order to support a legacy backend, or forking over tons of money for legacy hardware (like mainframes) or software, is an “okay” pattern. In the context I was working in, this piece of business logic was incredibly brittle and sometimes did not work. Even if I wanted to, I didn’t have the necessary access or authority to make changes on this software. Other aspects of the business weren’t able to extend the functionality, and were forced to reuse the same bits and pieces over and over again - something I would warrant as “broken”.

Anyways Happy March!

  1. This depends on your company’s deployment practices 

  2. dependent on company/dev culture 

  3. This could be a situation where AWS Lambdas could be useful - but I have no hard evidence/metrics 

  4. Wow what a fancy word 

tedkim97.github.io/temporary_middleman_microservices
Advice (and opinion) for Taking CS229
mldata-scisoftwarestatisticsdeeplearningadvicestatprogramming
Intro

Recently, I completed Stanford’s CS229: Machine Learning as a non-degree option1. It was a huge time sink and a bit difficult to juggle with a full time job, but I have some opinions and advice on how to effectively learn and pass the class.

Opinion

In my opinion, CS229 is much more organized and beginner friendly than other ML courses I’ve taken (at UChicago). For instance, students get incredibly detailed lecture notes that mirror the actual lectures, a fairly large TA staff, starter code with clear instructions for problem sets, and clear guidelines for the expected outputs of algorithms (which reduces the burden and confusion when trying to implement algorithms from scratch). This makes the firehose of knowledge you need to learn/understand more like a garden hose - difficult, but more manageable.

However, at a whopping $6000 per course, I wouldn’t recommend it over the same course taught through a MOOC like Coursera (especially with a massive discount) unless you were trying to get a certificate or credits for the class.

Advice 0. Prepare to commit some time

Whether you intended to or not - this course ends up being a huge time commitment. I’ve spent many weekends and after-work hours reading/watching the course material, attending office hours, and completing the psets.

1. Don’t be scared if you don’t “get” it the first time

I’ve realized that you have a lifetime (or at least the next couple of years) to really master these concepts and ideas. A lot of the “learning” I’ve had in school didn’t really “click” until I ran into these ideas later in work, hobbies, or different courses.

2. Get a comfortable understanding of the fundamentals

The prerequisites for an ML class are (1) Computer Science, (2) Probability Theory, (3) Multivariable Calculus, and (4) Linear Algebra.

I think it’s fine if you don’t have a strong foundation of knowledge for all of these fields (but I would always recommend it). If you aren’t comfortable with coding and linear algebra, you might have a really bad time.

3. Use a debugger (or a Jupyter notebook)

A big portion of your time in this class will be spent coding the homework or projects. Coding anything involves discovering bugs and solving those bugs. It is really easy to fall into the trap of not learning a debugger or Jupyter and relying on print() debugging2 - but in reality, using these tools will save you much more time.

4. Optimize your numpy code

It’s good practice to leverage numpy functions as they tend to be more performant. It also has the side effect of making your code cleaner and easier to understand.

5. Don’t pre-optimize your numpy

On the other hand, don’t shoot yourself in the foot by being overzealous with your numpy operations. If you’re just learning how the algorithm works, it’s okay to start with simple looping with numpy arrays. If you don’t have a solid grasp of the algorithm and of the functions you’re using to compose it, debugging is going to become exponentially more difficult.

Optimizations come after it works.

6. Visualize your understanding

Things get harder to learn the more abstract they become. A really useful learning method I’ve picked up is to visualize my understanding of the topic by drawing it. This has the added benefit of making it very clear to a peer or TA what you’re talking about.

7. Boost your fundamentals with resources!

The University Of Waterloo has this excellent PDF of various matrix operations and their results! Use this if you’re stuck on a proof or derivative.

Cookbook

There’s also this cheatsheet provided (but I never really ended up using it)

8. Start the Homework early

Start the homework early and start the coding implementations just as early. Since the class has been taught so many times and lecture notes have been refined, you don’t need to wait for the lecture to start the homework. You could just read the lecture notes or watch past lectures.

9. Pick a project topic you’re interested in

If you’re going to spend a bunch of time (1) reading background literature, (2) debugging code, (3) conducting a lot of research, you might as well create something that you’re interested in!

Here’s a link to the repo dump of my final project with the paper!

  1. … and got an “A” (no flex). To be fair, I have prior experience taking courses related to Machine Learning, so I had the advantage of understanding some of the fundamentals of the course 

  2. I’m guilty of this sometimes as well 

tedkim97.github.io/cs229_advice
Purchasing USB-C Pro Micros
keyboardstypingarduinocustomdesign

I haven’t been posting much recently because I’m taking cs229: machine learning at Stanford as a non-degree option. The class is a lot of work (but I would recommend, especially the lecture notes), so I had to (temporarily) cut time from my other endeavors such as (1) Work (2) CS229 (3) Social Activities/Hobbies (4) Blogging about Hobbies. Since work and the class are non-negotiable and maintaining a “healthy” mental state involves (3), I decided to cut down on high-effort blogging temporarily1.

I’ve purchased a few Pro Micro USB-C devices from Aliexpress - essentially these are just slightly modified versions of the traditional Pro Micro (which has a micro-B connector). This is exciting for a couple of reasons:

  • The Pro Micro did not really have a cheap clone with USB-C until recently (August/October 2021? - I will track down a more precise date at a later time)
  • I’m happier with my prototypes

For some context, an old passion of mine2 is mechanical keyboards - with my intensity ranging from just buying mass-manufactured keyboards to “DIYing” with different form factors, switches, keycaps, flashing with custom QMK configurations, group-buys, etc3. While I don’t have the excess of keyboards that some people show on geekhack and /r/mechanicalkeyboards, I am embarrassed by the fact that I have more keyboards than I would realistically use. As a result, I’ve limited myself to building/buying components for a keyboard that was designed & built by me.

Part of this DIY experience is having a convenient abstraction for the keyboard’s microcontroller unit (the ATMEGA32U4). The Pro Micro provides all the important components (that I don’t entirely understand) that I would otherwise need to add to a barebones circuit board (such as a timing crystal, pulldown resistors, etc).

I was unhappy that the Pro Micro (a popular DIY part for custom boards) had a micro-B connector (just because I didn’t like it aesthetically & because I preferred USB-C). Other enthusiasts seemed to agree, because there are Pro Micro variants like the “Elite-C” designed specifically for keyboards (more exposed IO pins, better connector mounting to PCB, etc). While I wanted one, it was incredibly expensive (~$18) compared to a regular Pro Micro (~$5-6). The higher price comes from the more complicated design and the USB-C connector, but that didn’t justify a whole 10 bucks more per MC (especially when I didn’t need most of these pins). In my mind - I just accepted that I had to live with this tradeoff.

Beware - it seems that a batch of these Pro Micros has some pin-misalignment issues described here in this reddit post.

Here are some pictures for reference:

top-down view

horizontal view

  1. This break has given me more time to jot down topics/ideas for other posts down the line. 

  2. Since playing Starcraft II: Wings of Liberty (2010) 

  3. but I would never buy artisan keycaps - they can be cute or cool, but I think they’re incredibly dumb functionally and aesthetically. 

tedkim97.github.io/promicro_usb_c