iainschmitt.com — GeistHaus

May 16, 2026

The Great Token Wager

In the first half of 2026 the AI infrastructure buildout has been firmly top of mind.[1] The November 2022 launch of ChatGPT brought large language models into the spotlight, and with each passing year you could feel the diffusion into nearly every conversation about the future of work, education, and culture. To feed the insatiable appetite for generative AI, the industry is making capital investments without recent precedent. From a recent column in The Economist, emphasis mine:

This year the five firms [Amazon, Google, Meta, Microsoft and Oracle] will spend $800bn filling warehouses with computers to run artificial-intelligence models... at around 40% of their revenues this year, the cloud giants’ capital expenditures will surpass those of the oil industry during the shale boom in the 2010s and the telecoms industry during the dotcom bubble in the 1990s.

Many are asking if this is a speculative bubble that will collapse like the dotcom boom did, but this time with a technology sector that is a larger share of the economy and a much larger share of the S&P 500. Leaving aside the argument that bubbles are actually good because they compress technological progress to build otherwise impossible things, this investment reflects the sincere belief of these firms that generative AI will bring about dramatic economic changes. From Microsoft CEO Satya Nadella's November 2025 interview with Dwarkesh Patel:

In some sense this goes back again to, essentially, what’s the economic growth picture going to really look like? What’s the firm going to look like? What’s productivity going to look like? ...what took 70 years, maybe 150 years for the Industrial Revolution, may happen in 20 years, 25 years. I would love to compress what happened in 200 years of the Industrial Revolution into a 20-year period, if we’re lucky.

If this narrative is correct, it implies some disturbing vulnerabilities. If the current era is a second industrial revolution, the GPUs used to run AI models are as important as basic electrical transmission. And nearly all of those GPUs are produced in semiconductor fabrication plants (fabs) run by the Taiwan Semiconductor Manufacturing Company (TSMC). The only problem with this arrangement is that Taiwan is about 110 miles off the coast of a China that claims the de facto independent island as its territory. America's CPU resilience is in a mediocre but acceptable place; unlike its rivals, who contract out all manufacturing to TSMC, Intel owns many advanced semiconductor fabs in America. But Intel has struggled to compete in the latest generation of semiconductor manufacturing technology, and its GPU production is a rounding error compared to Nvidia. And while TSMC now has a fab in Arizona, as of 2026 the facility is still dependent on packaging and final assembly based in Taiwan.

The Polymarket odds for a Chinese invasion of Taiwan by the end of 2027 stand at 17% as of 16 May 2026. The fragile fab equipment wouldn't make it through a war in one piece, and if Taiwan were to fall, America would likely sabotage TSMC lest it fall into Chinese hands. It's worth noting that the AI investment era did not create this vulnerability; had the island been captured in 2021, iPhone production would have stopped completely, given that Apple contracts all of its CPU production to TSMC. But when taking future economic growth into account, the stakes have been raised: that $800bn in 2026 capex only pencils out if you can actually acquire GPUs for these new data-centres.

But there are many outcomes short of a Chinese invasion that the industry needs to be prepared for. AI model use is priced by the token; it varies between tokenisers, but a text token is roughly a syllable. For risks both existential and less-than, option contracts on AI token prices could provide some means of hedging and a stronger price signal than today's prediction markets. But everyone I've talked to on Wall Street who's in a position to know says that synthetically creating options against a basket of token prices can't be done without baking in some other asset prices. And writing literal options doesn't seem to have the blessing of regulators yet. While the marginal cost of a lot of CPU compute falls with each incremental user, the cost of using AI models scales linearly with tokens. This means companies built around roughly the existing token economics will be more sensitive to price changes than they are for CPU compute. Some firms that rely on LLMs to serve customers will be able to tolerate a 10% increase in frontier model pricing, but not all will. While technological improvements would suggest that token prices will go down over time, demand has not looked very price sensitive in the last two years. And if you are choosing between more model calls or hiring an employee, why would it be?

Putting this all together, the technology industry has placed an incredibly large and difficult-to-hedge bet that A) the political status quo will be maintained in Taiwan and B) generative AI models will be a load-bearing part of the economy in the near future. While ordinary software engineers have little say in semiconductor manufacturing or geopolitics, there are things the profession can do to be less exposed to the great token wager.

References

Nellis, S. and Cherney, M. 2026. TSMC plans to open chip packaging plant in Arizona by 2029, executive says. Reuters. (April 22, 2026). Retrieved May 17, 2026 from https://www.reuters.com/world/asia-pacific/tsmc-plans-open-chip-packaging-plant-arizona-by-2029-executive-says-2026-04-22
Nadella, S. and Patel, D. 2025. Satya Nadella — How Microsoft is preparing for AGI. Dwarkesh Podcast (Nov. 12, 2025). Retrieved May 17, 2026 from https://www.dwarkesh.com/p/satya-nadella-2
Polymarket. 2026. Will China invade Taiwan by December 31, 2027? Polymarket. (March 17, 2026). Retrieved May 17, 2026 from https://polymarket.com/event/will-china-invade-taiwan-by-december-31-2027
The Economist. 2026. Big tech is sacrificing its cashflows to prop up the AI boom. The Economist (May 13, 2026). Retrieved May 17, 2026 from https://www.economist.com/business/2026/05/13/big-tech-is-sacrificing-its-cashflows-to-prop-up-the-ai-boom
TSMC. n.d. TSMC Arizona. TSMC. Retrieved May 17, 2026 from https://www.tsmc.com/static/abouttsmcaz/index.htm

[1] Small editorial note: this post assumes less prior knowledge and explains more details than most others; I wanted this to be understandable to people who do not read Stratechery every day.

https://iainschmitt.com/post/tokens-and-taiwan

RSS Scraper Development Notes

May 13, 2026

Artemis.bm is a strange website and I wish that more industries had a counterpart: the website tracks the insurance-linked security (ILS) and catastrophe bond industries. ILSs are an alternative to traditional reinsurance, which is the insurance that insurance companies themselves purchase to protect against tail risks. One bad hurricane could result in a lot of claims to State Farm or Allstate, so reinsurance is what protects retail insurers from this type of risk. There are only so many reinsurers to go around, and not all catastrophic risk is something that reinsurers can confidently assess; insurance-linked securities in general and catastrophe bonds in particular help to fill in the gap. Catastrophe bonds pay above the risk-free rate, but if the insured event occurs, the payout comes out of the bond principal. For example, the Louisiana Citizens Property Insurance Corporation is issuing $150 million in named storm catastrophe bonds. The terms of those bonds are that if a NOAA named hurricane or tropical storm causes more than $540 million in losses to Louisiana Citizens in the next 3 hurricane seasons, investors will start to take losses.

Artemis.bm is well written with frequent updates, and full articles are available in the website's RSS feed. I don't really understand why this isn't all paywalled, because most people for whom this content is relevant work for a handful of firms who wouldn't mind paying $50/month to subscribe.

However, as hard as this is to imagine, content about ILSs and the reinsurance market can be a little bit dry, especially for someone who works in a completely unrelated field. Back when I subscribed to their RSS feed, I didn't read it often enough to stay subscribed. I hadn't yet done a project with LLM calls inside of a service, and RSS feed summaries were something I would use.

The repository for the RSS scraper and summary service is here, and this is what I use to keep on top of Artemis.bm and several other RSS feeds that I don't have time to read in full. I call each original, scraped feed a 'source' feed, each of which has a 'derived' feed storing in-progress and completed LLM summaries for source items. Breaking out the derived feed makes it possible to handle asynchronous batch requests to the Anthropic or Gemini APIs, because batch job status is checked in the process of checking source feeds for updates. Also, for both synchronous and asynchronous requests I wanted to avoid a situation where I run out of credits from duplicated requests sent off in rapid succession, so the service checks against the derived feed before making a model call.

I thought it would be passé to use a relational database to store feeds, so the derived feeds as defined below are stored in Tigris object storage. While the derived feeds are represented in JSON, these are otherwise very similar to an RSS feed within the application.

type DerivedItem =
    { Guid: String
      Included: Boolean
      Item: RssItem
      Result: String option }

type DerivedBatch =
    { Id: String
      ProcessingStatus: ProcessingStatus
      BatchItems: DerivedItem array }

type DerivedFeed =
    { SourceUrl: String
      Batches: DerivedBatch array }

With JSON files in object storage as the only persistence for the application, If-Match headers are used for concurrency control. This doesn't do much to prevent duplicating requests, but it makes it safer to run the scrape process ad-hoc without interfering with an in-progress scrape on my VPC.
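As a rough illustration of the If-Match pattern, here is a minimal sketch in Java rather than the service's F#. It uses an in-memory stand-in for a stored object, so `StoredObject`, `putIfMatch`, and the `v0`-style ETags are all hypothetical names and not the Tigris API. The only behaviour borrowed from HTTP is that a write carrying a stale ETag is rejected with 412 Precondition Failed.

```java
class ConditionalPutSketch {
    /** Hypothetical in-memory stand-in for one JSON object in object storage. */
    static final class StoredObject {
        private String body = "{}";
        private long version = 0;

        /** The ETag changes every time the body is replaced. */
        synchronized String etag() {
            return "v" + version;
        }

        synchronized String body() {
            return body;
        }

        /** Mimics PUT with an If-Match header: applied only when the ETag still matches. */
        synchronized int putIfMatch(String ifMatch, String newBody) {
            if (!etag().equals(ifMatch)) {
                return 412; // Precondition Failed: someone else wrote first
            }
            body = newBody;
            version++;
            return 200;
        }
    }

    /** Two writers start from the same ETag; returns the status codes they see. */
    static int[] demo() {
        StoredObject feed = new StoredObject();
        String etagSeenByBoth = feed.etag();
        int first = feed.putIfMatch(etagSeenByBoth, "{\"Batches\":[\"a\"]}");
        int second = feed.putIfMatch(etagSeenByBoth, "{\"Batches\":[\"b\"]}"); // stale ETag
        // The losing writer re-reads the current ETag and tries again.
        int retry = feed.putIfMatch(feed.etag(), "{\"Batches\":[\"a\",\"b\"]}");
        return new int[] {first, second, retry};
    }

    public static void main(String[] args) {
        int[] codes = demo();
        System.out.println(codes[0] + " " + codes[1] + " " + codes[2]);
    }
}
```

The losing writer has to re-read and merge before retrying, which is why this guards against lost updates to the JSON file but not against two scrapes each deciding to call the model for the same item.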

Once the derived feeds associated with a 'sink' feed reach a configurable count of items, they are published to an RSS-client-accessible sink feed. For example, both my Artemis.bm and Substack blog sink feeds publish in batches of 5, so whenever there are 5 fresh derived feed items that aren't yet in the sink feed, the next sink feed item includes all of these fresh summaries. This requires the DerivedItemReference in a SinkItem to keep track of which derived feed items have already been published.

type SinkFeed =
    { Title: string
      Link: string
      PubDate: String
      Description: string
      Items: SinkItem array }

and SinkItem =
    { Item: RssItem
      DerivedItemReferences: DerivedItemReference array }

and DerivedItemReference =
    { Title: String
      Guid: String option
      Link: String option }
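The publishing rule described above can be sketched as a small selection function. This is a Java illustration under stated assumptions, not the service's F# code: items are identified purely by GUID strings, `nextBatch` is a hypothetical name, and it is an assumption that exactly the first `batchSize` fresh items are published once the threshold is met.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SinkBatchSketch {
    /**
     * Returns the GUIDs to include in the next sink item, or an empty list
     * when fewer than batchSize summaries are fresh (not yet published).
     */
    static List<String> nextBatch(List<String> completedGuids,
                                  Set<String> publishedGuids,
                                  int batchSize) {
        List<String> fresh = new ArrayList<>();
        for (String guid : completedGuids) {
            if (!publishedGuids.contains(guid)) {
                fresh.add(guid);
            }
        }
        if (fresh.size() < batchSize) {
            return Collections.emptyList(); // not enough yet, publish nothing
        }
        return new ArrayList<>(fresh.subList(0, batchSize));
    }

    /** Seven completed summaries, one already published, batches of five. */
    static String demo(int batchSize) {
        List<String> completed = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        Set<String> published = new HashSet<>(Arrays.asList("b"));
        return String.join(",", nextBatch(completed, published, batchSize));
    }

    public static void main(String[] args) {
        System.out.println(demo(5));
    }
}
```

The set of published GUIDs plays the role that the DerivedItemReference entries play in the sink feed: it is the record of what has already gone out.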

Initially I tried to use XML for the derived and source feeds; I figured if I was doing an RSS project I couldn't really get away from XML. But the ambiguity of representing lists as <Items> <Item /> <Item /> </Items> or just <Item /> <Item />, and much better serialisation and deserialisation in JSON, made me abandon XML. Even sink feeds are represented internally as JSON, but the server that publishes the sink feeds uses the F# Giraffe view engine to convert the JSON to XML. But because the schema of a given source RSS feed shouldn't change much, XML type providers were very helpful in source feed deserialisation. For instance, type ArtemisRss = XmlProvider<"Schema/artemis.rss"> was used to define the Artemis source feed type from a file in the repository, so handling source feeds could be done in a type-safe way that was friendly to autocomplete.

And despite the work that went into making batch LLM summary calls, I've had issues with both Anthropic and Gemini batch requests. Admittedly, the current batch handling logic doesn't handle failures very gracefully; I don't have a timeout or similar handling, so that may well be my issue. This isn't a showstopper because this type of work isn't very token intensive. I've spent $1.99 on Claude credits and $0.21 on Gemini, and both of those numbers would be lower were it not for some accidental duplicate LLM calls during development. The service uses Haiku 4.5 from Anthropic and Gemini 3.1 Flash Lite because text summaries don't require a crazy powerful model; right now Haiku 4.5 costs $1/MTok on input and $5/MTok on output. Gemini is far cheaper at $0.25/MTok on input and $0.75/MTok on output.
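To make the per-million-token prices concrete, here is a small worked calculation in Java. The prices are the ones quoted above, reading the two Gemini figures as input and output respectively. The 2,000 input tokens and 200 output tokens per article are made-up numbers for illustration, not measurements from the service.

```java
class SummaryCostSketch {
    /** Cost in USD of one call, given token counts and prices per million tokens. */
    static double costUsd(long inputTokens, long outputTokens,
                          double inputUsdPerMTok, double outputUsdPerMTok) {
        return (inputTokens * inputUsdPerMTok + outputTokens * outputUsdPerMTok) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Hypothetical article: 2,000 tokens in, 200 tokens of summary out.
        double haiku = costUsd(2_000, 200, 1.00, 5.00);
        double gemini = costUsd(2_000, 200, 0.25, 0.75);
        System.out.printf("Haiku 4.5: $%.5f per article%n", haiku);
        System.out.printf("Gemini 3.1 Flash Lite: $%.5f per article%n", gemini);
    }
}
```

Under those assumed token counts a Haiku summary costs about $0.003 and a Gemini summary about $0.00065, so even a few hundred articles stay within the range of the totals quoted above.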

https://iainschmitt.com/post/rss-scraper-development-notes

A few thoughts from JavaOne 2026

Apr 12, 2026

Last month SPS sent me and a staff engineer to JavaOne, the main Java language conference held at the Oracle campus in Redwood City, California. It was a great experience and I got a lot out of it. It is worth asking what the value of conferences is when there is no shortage of high-quality content online about advancements in the language, JVM, and important libraries. But as Byrne Hobart points out, by clustering a lot of people in a small location for a couple of days you end up creating a miniature industry-specialised city with all the implied nonlinear benefits. And while most (or at least many) of the JavaOne talks will end up on YouTube, the hallway chatter and side conversations certainly won't be.

There is also something to be said for the curation value of conferences. Regardless of quality, anyone can publish a blog post or video about some OpenJDK 26 performance improvement. But because there are only so many slots to present at conferences like JavaOne, each talk had to have an answer for "why this presentation and not another one?". This sorts the wheat from the chaff.

In no particular order, here are a few takeaways from the conference:

Ron Pressler's "Principles of Memory Management" was excellent. The SPS staff engineer I went with explained that a lot of the talk was a rebuttal to JVM memory management criticism from various camps. One of the arguments made about generational GC was that the CPU cost of garbage collection is proportional to the product of the live set size and the allocation rate. The young generation in generational GC has a high allocation rate but a small live set and vice versa for the old generation, so GC is less CPU expensive by using these offsetting factors across each generation. Pressler also pointed out that if a program uses 100% of CPU it doesn't really matter how much memory it is using because no other programs can get scheduled by the OS. So if you can pay a price in memory to shorten the amount of time a program is hogging CPU, it is often a good trade.
John Rose's "How the JVM Optimizes Generic Code" is the best technical talk that I have ever seen in person. It isn't surprising that a JVM Senior Architect had something interesting to say about the JVM, but I was surprised by how good Rose's stage presence was - he is engaging and funny. His slides are available here, and his presentation used Quick Sort to measure how polymorphism decreases performance on code that uses generics. In C++ the compiler prepares specialised, static implementations for each required type while Java generics are more dynamic. While Java is roughly as performant as C++ on int[] Quick Sort, using reflection to support both Integer[] and int[] introduces only a slight performance penalty. But when standard generics are used on both Long[] and Integer[] there is a substantial performance penalty and an even higher one if the reflective version is used, as the number and cost of code paths increase substantially. Once three types are handled at runtime then performance falls off a cliff. I'm probably not doing this talk justice, but it was fantastic.
Project Leyden is an OpenJDK project to "improve the startup time, time to peak performance, and footprint of Java programs". While true ahead-of-time (AOT) compilation is still in progress, the finished Leyden JEPs allow for capturing an AOT cache on a running application in order to do some class loading & linking ahead of time and capture profiles that can help the JIT compiler optimise faster. Netflix is using this in production for services with long startup times. Leyden requires the JVM you capture the cache from to be running on the same hardware and be on the same minor version as the JVM where the cache is used. To set the JIT up for success, you want the source JVM to operate under similar conditions as the target JVM. So to avoid a circular dependency, Netflix captures AOT caches from canary production deployments.
I talked with several presenters from Oracle and Netflix between talks; everyone was friendly, open to sharing expertise, and willing to answer questions. I asked a couple of people 'how do I better understand JVM internals' and the answers I got were reading the Garbage Collection Handbook, chapters 4 and 5 of the runtime spec, running the bytecode interpreter under a debugger, and reading everything that Aleksey Shipilëv has ever written about the JVM.
Project Babylon is an OpenJDK project to run Java in more exotic places than a JVM running on a CPU, such as on a GPU or an FPGA. With respect to GPUs you can inline CUDA C in existing Java programs, but it looks and sounds like a painful mess. In order to represent Java programs in a more target-agnostic way you need something easier to manipulate than an AST but something less tied to the JVM than compiled bytecode, and that is where Babylon code models come in; these roughly analogise to MLIR from the LLVM project. "Reflecting on HAT: A Project Babylon Case Study" included a neat demo of using Project Babylon to run a Conway's Game of Life implementation in Java on a GPU. I will be jealous of the first 'I used Babylon to program an FPGA in Java' Hacker News article, à la this Haskeller who solved Advent of Code problems on an FPGA; that post won't come from me because I wouldn't even know how to begin with all that.
Taking conference notes in notebooks rather than on a laptop seems like an anachronism, but because you can only write so fast you don't fall into 'transcription mode' when listening to a presenter, and you're also less distractible. Frontier models are good at OCR, so converting the notes to text really isn't much effort. And high-quality materials make a world of difference; I exclusively wrote in Maruman Mnemosyne steno notebooks with a Morning Glory Pro Mach rollerball pen.

https://iainschmitt.com/post/a-few-thoughts-from-javaone-2026

February 2026 Puro Notes

Feb 27, 2026

Last July I wrote the following in Kafka in One File:

It all made me think that surely someone has created a high-quality, open-source "SQLite of Kafka" because that is exactly what I want. Given that SQLite fit my needs well as a database we're not talking about all that much data. But I find event streams interesting to work with given that they have useful qualities of a database, a write-ahead log, and a message queue. Much to my surprise, I haven't really found anything that fits this bill, let alone some actively maintained, well-used open-source project.

I've been sporadically working on Puro, my attempt at an 'SQLite of Kafka'. Puro is a Kotlin program with event-stream-like semantics stored on a local filesystem, rather than being distributed like Kafka, Kinesis, or equivalents. A Kafka broker stores each partition of each topic that it serves as a set of segment files, with one active segment file receiving new records from producers. In contrast, Puro has no daemon or broker; consumers and producers use file locking to control access to the active segment. Readers acquire shared locks to read existing records on the active segment; producers acquire an exclusive lock on the region of the file after the end of the existing bytes before writing new records. There aren't partitions in Puro, and all topics are placed onto the same segment. Because everything runs on a single filesystem, consumer groups offer no benefits that can't be reproduced by a consumer thread handing off messages to specific worker threads. Log compaction isn't very practical without a daemon, but iterating over stale segments would allow something roughly equivalent.
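The locking scheme can be sketched with the region locks on java.nio's FileChannel. This is a minimal Java illustration under stated assumptions, not Puro's Kotlin code: `append`, `readExisting`, and the temp-file demo are hypothetical, records are raw bytes with no framing, and the producer simply locks everything from the current end of the file onwards.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class SegmentLockSketch {
    /** Producer: exclusive lock on the region after the existing bytes, then append. */
    static void append(Path segment, byte[] record) throws IOException {
        try (FileChannel channel = FileChannel.open(segment, StandardOpenOption.WRITE)) {
            long start = channel.size();
            try (FileLock lock = channel.lock(start, Long.MAX_VALUE - start, false)) {
                // Re-read the size under the lock in case another producer appended first.
                long position = channel.size();
                ByteBuffer buffer = ByteBuffer.wrap(record);
                while (buffer.hasRemaining()) {
                    position += channel.write(buffer, position);
                }
            }
        }
    }

    /** Consumer: shared lock over the bytes that already exist, then read them. */
    static byte[] readExisting(Path segment) throws IOException {
        try (FileChannel channel = FileChannel.open(segment, StandardOpenOption.READ)) {
            long end = channel.size();
            if (end == 0) {
                return new byte[0];
            }
            try (FileLock lock = channel.lock(0, end, true)) {
                ByteBuffer buffer = ByteBuffer.allocate((int) end);
                long position = 0;
                while (buffer.hasRemaining()) {
                    int read = channel.read(buffer, position);
                    if (read < 0) {
                        break;
                    }
                    position += read;
                }
                return buffer.array();
            }
        }
    }

    /** Appends two records to a temporary segment and reads them back. */
    static String demo() {
        try {
            Path segment = Files.createTempFile("segment", ".log");
            try {
                append(segment, "hello".getBytes(StandardCharsets.UTF_8));
                append(segment, "world".getBytes(StandardCharsets.UTF_8));
                return new String(readExisting(segment), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(segment);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the shared lock covers only the existing bytes and the exclusive lock covers only what comes after them, readers of old records and a writer of new ones do not block each other. These file locks are held on behalf of the whole JVM, so they coordinate separate processes rather than threads inside one process.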

There is still a lot of work to do: I haven't completed the work on repairing failed writes or started the work on rollover to a new active segment. Once everything is 'working' there will be no shortage of performance fixes. Despite this being a JVM program, I'm trying to allocate as little as I can on the heap; I've internalised the arguments by Casey Muratori and the TigerBeetle team to this end.

Working with binary serialisation is the main challenge of the program, and one that I don't have a ton of experience with. When each record is just a series of bytes, detection and recovery of bad writes is a challenge. And if a producer writes to a segment immediately after an incomplete write, consumers won't be able to tell where the bad write stops and the good write begins. At first, I had some convoluted logic for consumers to detect write failures and zero out their corrupted records. Consumers are using filesystem APIs to listen to reads; by tying the deserialised messages to certain offsets the consumer can piece together when bad writes started. But a consumer that started consuming after the bad write wouldn't have the required history to piece this together. A consumer consuming from the beginning of the segment wouldn't have an issue here, but I wanted consumers to be able to start from the latest record. There is an arguably greater challenge of consumers not being able to tell the difference between a healthy write and an unhealthy write in the absence of any producer locks.

The solution came from a principal engineer at work, who suggested the last action a producer makes during a write is flipping a signal bit. The way I implemented this was with special write-block start and write-block end messages; a consumer that encounters a low signal bit will relinquish the read lock for a delay, and will wait until the bit is high. The write-block end message shows the length of a successful write, allowing producers to check signal bits. If a producer encounters a low signal bit, it will zero out the corrupted message before writing its own messages. Given that producers are responsible for detecting bad writes they will have to iterate down the entire length of the active segment to make sure the existing segment is sound before producing messages themselves, but it feels more appropriate to make the producers responsible for this.
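The signal-bit idea can be sketched in a few lines of Java against an in-memory buffer. The record layout here (one signal byte, a four-byte length, then the payload) is an assumption made for illustration and is not Puro's write-block start and end message format; `write`, `read`, and `repair` are hypothetical names, and locking is left out entirely.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SignalBitSketch {
    static final byte LOW = 0;   // write in progress or failed
    static final byte HIGH = 1;  // write completed
    static final int HEADER_BYTES = 1 + Integer.BYTES;

    /** Writes one record; flipping the signal byte is the producer's last action. */
    static int write(ByteBuffer segment, int offset, byte[] payload, boolean crashBeforeFlip) {
        segment.put(offset, LOW);
        segment.putInt(offset + 1, payload.length);
        for (int i = 0; i < payload.length; i++) {
            segment.put(offset + HEADER_BYTES + i, payload[i]);
        }
        if (!crashBeforeFlip) {
            segment.put(offset, HIGH);
        }
        return offset + HEADER_BYTES + payload.length;
    }

    /** Consumer read: null means the signal is low, so back off and retry later. */
    static String read(ByteBuffer segment, int offset) {
        if (segment.get(offset) != HIGH) {
            return null;
        }
        byte[] payload = new byte[segment.getInt(offset + 1)];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = segment.get(offset + HEADER_BYTES + i);
        }
        return new String(payload, StandardCharsets.UTF_8);
    }

    /** Producer repair: walk the completed records, then zero out whatever follows. */
    static int repair(ByteBuffer segment, int writtenEnd) {
        int offset = 0;
        while (offset + HEADER_BYTES <= writtenEnd && segment.get(offset) == HIGH) {
            offset += HEADER_BYTES + segment.getInt(offset + 1);
        }
        for (int i = offset; i < writtenEnd; i++) {
            segment.put(i, (byte) 0);
        }
        return offset; // the next producer writes from here
    }

    /** One good write, one failed write, a repair, then a replacement write. */
    static String demo() {
        ByteBuffer segment = ByteBuffer.allocate(64);
        int afterGood = write(segment, 0, "ok".getBytes(StandardCharsets.UTF_8), false);
        int afterBad = write(segment, afterGood, "bad".getBytes(StandardCharsets.UTF_8), true);
        String good = read(segment, 0);
        String bad = read(segment, afterGood);
        int sound = repair(segment, afterBad);
        write(segment, sound, "new".getBytes(StandardCharsets.UTF_8), false);
        return good + "|" + bad + "|" + sound + "|" + read(segment, sound);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The repair step is the part that forces a producer to walk the whole segment: it can only find the first low signal byte by following the lengths of every completed record before it.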

The most annoying part of the project is working with the Java ByteBuffer class, which is hard to avoid when working with the Java NIO APIs. A ByteBuffer instance stores the position of the next byte to read or write to, so just reading an instance and iterating through the contents mutates state. My most common bug is forgetting to rewind a buffer, leading the program to think I have fewer bytes than I actually do in a data structure. I almost want some means to automatically rewind any buffer once it leaves the scope of a function, but I'm not sure if it could be done cleanly in Kotlin. It was a really confusing day when I assumed that public ByteBuffer put(int index, byte[] src) advanced the buffer position just like public abstract ByteBuffer put(byte b) did before seeing 'The position of this buffer is unchanged' in the relevant JavaDocs. I suppose you don't need the buffer to keep track of its own position if you know the length of src at the outset. That one really threw me for a loop.
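The difference between the two put overloads, and the forgotten-rewind bug, can be reproduced in a few lines of Java. The class and method names are hypothetical. Reading through duplicate() is shown as one possible way to avoid disturbing the caller's position; it is a suggestion, not something Puro is known to do. The absolute bulk put(int, byte[]) overload needs Java 13 or later.

```java
import java.nio.ByteBuffer;

class BufferPositionSketch {
    /** Relative put(byte): writes at the current position, then advances it. */
    static int positionAfterRelativePut() {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.put((byte) 1);
        return buffer.position();
    }

    /** Absolute bulk put(int, byte[]): writes at the index, position unchanged. */
    static int positionAfterAbsoluteBulkPut() {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.put(4, new byte[] {1, 2, 3});
        return buffer.position();
    }

    /** Draining a buffer directly leaves nothing remaining until it is rewound. */
    static int remainingAfterDirectRead() {
        ByteBuffer buffer = ByteBuffer.wrap(new byte[] {10, 20, 30});
        while (buffer.hasRemaining()) {
            buffer.get();
        }
        return buffer.remaining();
    }

    /** Sums the bytes through a duplicate, which has its own position and limit. */
    static int sumViaDuplicate(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate(); // shares the bytes, not the position
        int sum = 0;
        while (view.hasRemaining()) {
            sum += view.get();
        }
        return sum;
    }

    /** The caller's buffer still reports all of its bytes after a duplicate read. */
    static int remainingAfterDuplicateRead() {
        ByteBuffer buffer = ByteBuffer.wrap(new byte[] {10, 20, 30});
        sumViaDuplicate(buffer);
        return buffer.remaining();
    }

    public static void main(String[] args) {
        System.out.println("relative put position: " + positionAfterRelativePut());
        System.out.println("absolute bulk put position: " + positionAfterAbsoluteBulkPut());
        System.out.println("remaining after direct read: " + remainingAfterDirectRead());
        System.out.println("remaining after duplicate read: " + remainingAfterDuplicateRead());
    }
}
```

A duplicate costs a small object allocation per read, which may matter for a program trying to stay off the heap; calling rewind() or using the absolute get(int) overloads avoids that at the price of more bookkeeping.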

References

Apache Kafka 4.2.X Documentation: Implementation
Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty. 2021. Kafka Internals. In Kafka: The Definitive Guide (2nd ed.). O'Reilly Media, Sebastopol, CA. ISBN 978-1-492-04307-2.
Software Unscripted Episode #78
TigerStyle: TigerBeetle Style Guide
William R. Stevens and Stephen A. Rago. 2013. Advanced I/O. In Advanced Programming in the UNIX Environment (3rd ed.). Addison-Wesley Professional, Upper Saddle River, NJ, USA.

https://iainschmitt.com/post/february-2026-puro-notes

Review of The Shenzhen Experiment

Jan 31, 2026

For all the attention paid to America's semiconductor reliance on Taiwan and the military consequences of selling advanced Nvidia GPUs to China, I've found it unusual how comparatively little effort has been placed in coming up with an alternative to the Shenzhen manufacturing ecosystem. Before reading The Shenzhen Experiment I didn't know much about this region in south China other than that it is responsible for a staggering amount of the world's electronics manufacturing output. This Christmas I received Juan Du's The Shenzhen Experiment: The Story of China's Instant City; while the book didn't explain as much as I had hoped about electronics manufacturing, it is nonetheless excellent.

A Twisting Road to Market Liberalisation

In the late 1970s, Guangdong province officials wanted greater policy flexibility from the central government to attract exporting industries and foreign direct investment. Their wish was granted on July 19th, 1979 when the party Central Committee authorised what would become Shenzhen's special economic zone (SEZ). This was a mostly rural strip of land a little smaller than Philadelphia, home to about 358,000 people at the time. After Mao Zedong's death, many party leaders were eager to roll back Mao-era economic centralisation to increase the country's economic growth.[1] This culminated in the rise of Deng Xiaoping, who is more responsible than any other Chinese leader for the impressive economic growth of the country starting in the early 1980s. The political and economic autonomy granted to the SEZ made it something of a laboratory of liberalising economic reforms: experiments that worked in Shenzhen were applied to the rest of the nation.

Because of this, "China's first" is a phrase that comes up a lot in The Shenzhen Experiment. This includes:

First post-communist at-will employment in 1980
First Sino-foreign joint venture in 1981
First city to open up construction market through competitive bidding in 1982
First city to abolish food ration coupons in 1984
First city to offer legal status to rural migrant workers in 1984
First public land auction in 1987, later resulting in a constitutional amendment legalising land transfers
First international land auction in 1992

These led to the early 1980s Shenzhen construction boom: in 1985 over 10 million m² of floor area was under construction, and as of 1987 there were 62 towers taller than 100 m in the SEZ. Despite providing political cover for the SEZ and championing liberalisation in general, Deng Xiaoping only visited the city twice in his lifetime. In his first visit to the city during the height of the construction boom in 1984, he said the following:

This time I went to Shenzhen to take a look; the impression it gave me was one of widespread prosperity and growth. The construction speed of Shenzhen is very fast, building one floor every few days and a tall building in no time. The construction crews there are even from the inland provinces. One reason for the high efficiency is the contract system and fairness in administering reward and punishment. Shenzhen's Shekou Industrial Zone is even faster. The reason is they were given a bit of power. They can make their own decisions on expenditure under five million US dollars.

However, many in the party were sceptical about both liberalisation in general and the SEZ's success in particular. While an impressive amount of construction was going on, not much growth was coming from exports and foreign investment; it came instead from other provinces and SOEs making one-time infrastructure investments in the city. In 1985 vice premier Yao Yilin remarked "It is impossible for the SEZ's economic development to rely on the country's long-term 'blood transfusion'; now, the 'needle' must be unplugged decisively", and outside observers from Hong Kong started to notice that the majority of the SEZ's revenue came from infrastructure and speculative real estate development. Worse yet to the central government, importers were a substantial part of the non-construction economy.[2] All of this made a convenient talking point for those who saw SEZ reforms as betraying the basic principles of the Party. The political blowback was swift, and led to the suspension of 804 infrastructure projects, hotel occupancy dropping 20%, and widespread unemployment in the construction sector in the mid-1980s.

This may have been the end of the Shenzhen experiment if it were not for changes in the foreign-exchange market. The Hong Kong dollar weakened against both the Japanese Yen and the Taiwanese Dollar, so Hong Kong looked north of the border for manufactured goods in 1986. Shenzhen's manufacturers were moving up the value chain, aided both by the market reforms and investment in infrastructure, so by 1987 the city was meeting the ambitious production and revenue targets set by the central government. Industrial production kept outpacing construction as an economic driver of the SEZ, leading to the 1988 conversion of state-owned enterprises into shareholding cooperatives and constitutional changes to recognise private enterprise.

Unplanned and Contingent Success

While Shenzhen ended up being an economic powerhouse and a runaway success, the Shenzhen of today is not something that Guangdong provincial officials were trying to accomplish in the late 1970s. This is because emigration was a daunting problem for the region, emphasis mine:

The most severe problem facing the local government was the abandonment of village communes and farming fields in the "Great Escape to Hong Kong," a phrase coined to describe the successive waves of hundreds of thousands of people illegally crossing the China-Hong Kong border. A recently declassified Guangdong government internal report reveals that between 1954 and 1980 there were 565,000 officially recorded crossing attempts.

The state of the Chinese economy at this time created disparities across the border that are hard to overstate, and these drove so many to risk the crossing:

The stark contrast in economic wealth between Chinese Guangdong Province and colonial Hong Kong next door loomed clearly. In 1977, the annual income of a village farmer in Hong Kong was one hundred times that of a farmer undertaking the same work on the other side of the border, while the disparity between factory workers in Guangdong and Hong Kong was even greater.

Hong Kong's immigration policies varied in the 20th century, but between 1974 and 1980, Hong Kong had a 'Touch Base' policy, which allowed for any illegal immigrant to get a residency permit provided they both had relatives in the territory and could get to the city centre without being apprehended.[3] Cross-border farming certificates were granted to Hong Kong residents with ancestral farmland in Guangdong, making the border rather porous in some areas and facilitating many escapes to the south.

In 1980, city officials projected a population of 500,000 by the end of the millennium, showing that stopping the bleeding was their primary concern. But by 2000 the city had well over 6 million people, and it grew into a metropolis of 20 million in the 2010s. In most of the rest of the country the state tightly controlled rural migration and employment through the hukou system under stiff penalties, but in the 80s and 90s many firms in the SEZ would break the law to hire informal labor. City officials would look the other way and lie to the central government about how many unregistered rural migrants lived in the city.[4]

The unregistered migrants were half of Shenzhen's population by the 1990s, but without an urban hukou status they couldn't receive state housing and other welfare benefits. Shenzhen's 'urban villages' that existed before the establishment of the SEZ helped fill the gap and are a major focus of the book; in describing them Du's architecture and urban planning background becomes more obvious.

Residential land use changes in many urban villages went something like this:

Growing sanlaiyibu export manufacturing firms drove lots of illegal construction to house workers
Officials attempted to formalise the existing construction, expand what was legally allowed, and curb future illegal construction with fines and regulations
Villagers treat fines as the cost of doing business, so illegal construction continues

It is a good demonstration of what laissez-faire economists might diplomatically describe as 'evasive entrepreneurship', both in policymaking and the marketplace.

A story that Du really effectively tells is that Shenzhen's success is much more than an 'if you build it, they will come' with respect to market liberalisation. Being across the border from Hong Kong really mattered for the development of the SEZ, and despite this the project very nearly didn't succeed. Many wealthy Hong Kong emigrants remained in touch with kin and invested in Shenzhen before other foreigners were willing to. It is hard to see how private firms could have gotten off the ground absent this outside capital. The Shenzhen Experiment shows many other place-specific factors behind Shenzhen's success, and these help explain why it worked out so well while China's 17 other SEZs haven't been remotely as successful.

Varying Village Trajectories

While every one of Shenzhen's 300 urban villages was impacted by the explosion of growth, the post-1979 fortunes of these villages varied substantially. One village that rode the wave of prosperity was Huanggang.5 In 1979 the local economy was dire enough that Shangwei nearly split off as a separate village, but the commune remained intact through the boom of the 1980s. The commune invested proceeds from land sales to the city into the Shapuwei Industrial Zone, and the villagers themselves converted their homes into residential towers to collect rents. In 1992 the villagers received urban hukou status and formed the Shenzhen Huanggang Real Estate Holdings Company with each villager as a company shareholder, and the redevelopment of the village continued apace with the central business district encroaching into Huanggang. The village had a ten thousand yuan deficit in 1980 but as of 2002 its holdings reached 450 million yuan and paid villager-shareholders a twenty thousand yuan yearly dividend.

But east of Huanggang about 15 km are the Baishizhou villages, which are among the poorest in Shenzhen. The area around today's Baishizhou was the Shahe collective farm, much of which was settled by peasants ordered to move from rural Guangdong province in the late 50s and early 60s. While land claims in Huanggang were unambiguous, it wasn't clear where collective Baishizhou land ended and the state-run Shahe farm began, which prevented the village from starting a shareholding collective once they were granted urban hukou status. Unlike Huanggang, the village wasn't compensated by the city for land transfers during urbanisation owing to this ambiguity. For many Baishizhou residents, their legal status amounted to "rural migrants squatting on urban land" rather than collecting dividend checks from village-owned real estate investments.

As an aside, Baishizhou seems more interesting than Huanggang and other wealthier parts of Shenzhen. It remains an affordable place to rent, making it a common first neighborhood to live in for those getting their start in Shenzhen, and the large migrant population means you can find regional food from every corner of China in the village. The introduction to the book describes Du accidentally wandering into a bustling night market, but the book's most absurd Baishizhou detail is American expat Joe Finkenbinder's Bionic Brewery, opened in the village in 2014 because rents were too high everywhere else in Shenzhen.6

Conclusion

I don't think of myself as a particularly fast reader, but I finished The Shenzhen Experiment in three sittings. There is a lot worth saying about the book that is left out here, and overall it does a great job at both the high-level history and telling individual stories about the evolution of Shenzhen. It was the best book I read in 2025, and I'd recommend it to anyone remotely interested in economic development or urban planning.

Mao-era rule of China was an abject tragedy, as outlined in Frank Dikötter's "The Tragedy of Liberation", "Mao’s Great Famine", and "The Cultural Revolution", all published by Bloomsbury Press↩
Yuxi Liu's 'Structure and Interpretation of the Chinese Economy' explains how important getting foreign currency into the country was at this stage of China's development, so a large import sector was doubly bad news. As an aside, if you name an essay 'Structure and Interpretation of $topic' as a nod to Abelson and Sussman's seminal functional programming text, it better be a good explanation of $topic. Luckily the essay lives up to the name.↩
I originally ended this paragraph with "Cross-border farming certificates were granted to Hong Kong residents with ancestral farmland in Guangdong, making the border rather porous in some areas and facilitating many escapes to the south.", but when someone on Reddit raised some questions I checked back in the text and realised that pages 210-211 described a border that Hong Kongers could cross without much issue but that this didn't extend to the Guangdong side of the border.↩
I'm playing a little fast and loose here in this paragraph in that the SEZ was only one part of Shenzhen city in the 20th century. In general Chinese cities seem to be larger geographic areas than in the West; today the city is around the size of Orange County, California. A map on page 58 makes the demarcation clearer.↩
Unlike most other Shenzhen villages, Huanggang villagers defied imperial orders in the late 17th century to abandon the village lest their resources be seized by rebel forces.↩
Having the confidence to start gentrifying a neighbourhood in China is quite American, and I mean that in a profoundly positive way.↩

https://iainschmitt.com/post/shenzhen-experiment-review

Vibe code at your peril

Dec 19, 2025

At every software company that I have worked at, if something is going wrong the customer has a named person who is responsible for getting the problem fixed. If a piece of the company's software could be to blame, there is a living, breathing developer on-call 24 hours a day, seven days a week. This on-call developer is responsible for answering questions like 'why is this service not working?' and 'have you made any changes that would explain what we're seeing here?'. Being on-call doesn't require terminal expertise in every facet of the software, but if they are doing their job correctly the engineer can give plausible answers to these questions.

If you are on-call, this inevitably requires you to understand the code that you are responsible for.

No, the on-call engineer has not written every line of the software they are being asked about. Maybe their team hasn't even written all of it. But they need to have a good mental model of the types of things their services are and are not supposed to do, and why.

If no one actually expects you to understand or take responsibility for the software you develop, then by all means don't waste your time reading what your large language model of choice plops onto the screen. But as Patrick McKenzie explains, "Most software is boring one-off applications in corporations, under-girding every imaginable facet of the global economy". This is software worth paying for because almost no one should have to think about whether an invoice was stored correctly in a database, whether a Slack message will actually reach its intended target, or whether the airline you're flying on has double-booked your seat in their reservation system.

For all of the oxygen that "agentic development" has sucked out of the room, developers cannot take responsibility for that which they do not understand. The act of writing code by hand forces the developer to think through what could go wrong with the changes they are putting in place and what other alternative options exist to solve the problem at hand. A lot of writing code is really an internal monologue to convince yourself that what you have written will work! This monologue isn't something you experience when reviewing someone else's code, and for this reason reviewing code can be harder than writing it yourself. If you're reviewing a person's work you should be sure that they have walked through this process before asking for review; chances are they've thought to themselves 'if this code triggers a 3:00 AM page to a teammate, will they be able to understand what is going on?'. What falls out of this is that if you want a living, breathing person to be responsible for a piece of software then its rate of change is gated by human understanding.

Don't get me wrong, Claude is a great resource and allows me to ask an experienced engineer all manner of questions at a fraction of the opportunity and interpersonal cost. But writing code by hand helps the developer immensely in understanding the problems they are trusted to solve. It takes longer to write that new feature by hand, but this saves precious minutes when tracking down customer-facing bugs, or precious months by preventing wild goose chases down development dead ends. Ultimately if you are responsible for a piece of software that is load-bearing to your customers, you should act like it.

https://iainschmitt.com/post/vibe-code-at-your-peril

On Automation

Nov 30, 2025

At the founding of the country, 9 out of 10 Americans lived on a farm, but as of 2022, fewer than 2% of Americans made their living farming. Explaining this transition to an intelligent observer in the late 18th century would be difficult. While they might not use the words 'mass unemployment', that would definitely be on their minds. But more than anything, they wouldn't believe that such a small number of people could feed the whole nation. In the 18th century, humanity was just beginning to escape the Malthusian trap, where any agricultural surplus would be consumed away by a growing population that was at the very edge of subsistence. There's a reason the Church of England's 1662 Book of Common Prayer has three separate prayers for rain, fair weather, and plenty; it took nearly all of society to feed all of society, and just barely at that.

The resulting story of economic development, initially in the West before diffusing, is one of technological advances increasing farm productivity. Industrial farming equipment, artificial fertilisers, and pesticides dramatically increased how much a single farmer could produce. A bad wheat harvest today likely won't be noticed by much of American society. But with far fewer farmers needed to feed the nation, why didn't we face mass unemployment?

An excerpt from Brad DeLong's Slouching Towards Utopia gives some clues:

Four percent of Americans had flush toilets at home in 1870; 20 percent had them in 1920, 71 percent in 1950, and 96 percent in 1970. No American had a landline telephone in 1880; 28 percent had one in 1914, 62 percent in 1950, and 87 percent in 1970. Eighteen percent of Americans had electric power in 1913; 94 percent had it by 1950.

In nations that reached escape velocity from the Malthusian trap, expectations rose for living standards. If a nation could routinely produce enough food to sustain itself, it could start to bring more and more consumer goods to the masses. But manufacturing these goods required the human labor freed by newly-mechanised agriculture. Before kinks had been worked out, it was beyond the capability of the late 19th century American economy to provide automobiles, indoor plumbing, home telephones, or electricity to anyone beyond a privileged few. But the labor story of mechanised agriculture repeated itself: at first manufacturing was highly labor-intensive, but moving up the learning curve made labor more productive and brought manufactured consumer goods within the reach of more people.

There's no law of nature that requires economic development to take this path. You could certainly set up a society that maintains a 19th century standard of living where people work single-digit hours per week, but very few societies make this choice. Humans seem to have no intrinsic limitations to our wants; our expectations expand to fill our ability to meet them. But the goods and services at the very frontier of what an economy is able to produce will be more labor-intensive, so each shift in the technological frontier changes which industry is the newly labor-intensive one.

Many rich countries struggled with the transition to more service dominated economies. International trade complicates this story, but only somewhat. In any country where manufacturing is done, higher manufacturing productivity means fewer manufacturing jobs, but these gains drive higher demand from rich-world consumers for services at the frontier of what the economy can deliver. Put another way, as hard as it would be to explain to someone from the late 18th century how few people work on farms in 2025, it would be even harder to explain to them how almost 11 million Americans were employed in an industry that the federal Bureau of Labor Statistics describes as 'Professional, Scientific, and Technical Services'. Or over 23 million in 'Healthcare and Social Assistance; Private', let alone nearly 3 million in 'Information'.

This is ultimately what makes me skeptical that this decade's dizzying advancements in artificial intelligence will result in mass unemployment. It will certainly be an adjustment, and likely a painful one for many industries. But if our past record is anything to go off of, insatiable human wants will adjust for the productivity savings, and we will move on to other labor-intensive desires.

References

Bradford DeLong. 2022. Slouching Towards Utopia: An Economic History of the Twentieth Century. Basic Books, New York, NY.
The Book of Common Prayer 1662: Statutory Services.
U.S. Department of Agriculture, Economic Research Service. Ag and Food Statistics: Charting the Essentials - Ag and Food Sectors and the Economy.
U.S. Department of Agriculture, National Agricultural Statistics Service. History of Agricultural Statistics.
U.S. Bureau of Labor Statistics. Employment by Major Industry Sector.

https://iainschmitt.com/post/on-automation

My Rough and Incomplete Backend Developer Skill Tree

Oct 19, 2025

A few weeks back, an associate software engineer was asking me for advice on the types of side projects that would build relevant skills in his backend role. In the process I talked through some of the books that I found the most helpful in getting me to where I am now, and it forms something of a backend 'skill tree'. By 'backend' I mean server side software that writes to some persistent data store, and as Patrick McKenzie wrote in his seminal essay "Don't Call Yourself A Programmer, And Other Career Advice", this constitutes an awful lot of software engineering jobs.

The Books

Regardless of whether you're starting inside or outside of a computer science program, the first few steps are basically the following.

Learn the basics of either Python or Type/JavaScript
Get comfortable with Git, Bash, and SQL1
Get a foundation in data structures and algorithms

With the possible exception of SQL this is true pretty much across all of software engineering; if someone is dead set on working on embedded systems maybe they'd go straight to C, but this is how I'd recommend anyone get their start in software engineering. After taking care of these building blocks, these are the books that I would read as part of the backend 'skill tree'.

Web Development with Node and Express in JavaScript or Flask Web Development for Python. Both of these books teach how to build server side web applications that handle HTTP requests, template HTML, and write to a database. The Express and Flask frameworks are relatively simple, allowing the reader to focus on things that will be transferable to other systems. It can be much more intimidating to pick up something like Java's Spring framework right out of the gate even if you already know the language. Spring or .NET's ASP.NET Core are very powerful and have a lot of features that make enterprise development easier, but they aren't the first server side framework one should learn.2 At the end of reading either book the reader should be able to stand up a simple web application, be it a personal website or something that makes REST calls with a client for interactivity.
CompTIA Network+ Certification Exam Guide chapters 1 and 6-12. While it may not come up every day, if you write server-side software you need to have a good mental model for exactly what happens when you type 'www.google.com' into a browser address bar. Otherwise, you will be at the mercy of what you do not understand. This is a book meant for the CompTIA Network+ certification exam, the likes of which are much more important in network engineering as compared to software engineering. While more advanced cert exams will be more specific to given network vendor equipment, this book is a great overview breaking down the OSI model, TCP/IP, routing, DNS, and other important network basics.
SQL Anti-patterns. Not nearly enough books on computing walk through an example of an understandable mistake before explaining the right way to do things - but that is all that this book does. This book has 25 short, pretty self-contained chapters. As it says on the tin, each of these works through a common relational database mistake. After getting one's feet wet with SQL in a project or two this could both correct some bad habits and help the reader recognise bad SQL when they see it down the line.
Domain Modeling Made Functional. Domain modeling is the process of turning the capabilities and requirements of a system into a tractable model, often something like a UML diagram. The model is meant to be understandable to domain experts such that someone working in e-commerce operations could look at a domain model for their company and say 'this looks right, but it's missing the part with volume-based shipping discounts'. As I wrote in a review, the book isn't as detailed as other domain modeling books that I've read, but it makes up for it by being much more readable. The 'functional' in Domain Modeling Made Functional makes this book somewhat unique, as the language used in the book is F#. But no prior knowledge of the language is required and much of the book transfers over well to other languages.
Data Intensive Applications. This book is about the general problems that you face in applications where I/O is a more meaningful bottleneck than CPU performance. Many server-side applications now have more persistence than a relational database, be they search engines, event streams, or dedicated caches. Data Intensive Applications does an incredible job at teaching the details of on-disk storage, distributed persistence challenges, and batch vs. stream processing that are broadly applicable across a variety of data persistence technologies. Chapters 5 through 9 are the best explanation of distributed systems that I've ever read. It can be dense material, but I've never read better prose from a technical book.
Database Internals. The Data Intensive Applications book gave an introduction to database internals that left me more curious. I haven't made it all the way through the book, but it is a good read after working with relational databases for a couple of years especially because I never had the chance to take a database class at school. This was where I finally understood how write-ahead logs were used to make transactions more durable without sacrificing performance.
Little Book of Semaphores. Semaphores are a way to coordinate concurrent threads or processes, and while concurrency isn't something that comes up every day it is important to understand how these problems are solved. The book is also generally fun to work through, which to be honest is the real reason I have it on this list. As I wrote in a post, the book probably works better in classroom settings as it can sometimes be hard to tell if the reader's solutions match those in the solution manual. But I've had good results with asking Claude 'I am trying to learn this in greater detail, please ask me questions to probe my understanding rather than just telling me if my solution is equivalent'. To get any value out of this book the reader really does need to work through the problems.
Operating Systems: Three Easy Pieces. I only made it through about 1/3rd of Tanenbaum's 'Modern Operating Systems' and while I got a lot out of it, Three Easy Pieces is a more appropriate first book on operating systems. Modern cloud infrastructure does a lot to try to abstract away the responsibilities of the OS, but as implied by 'The Cloud Is Just Someone Else’s Computer', some OS somewhere is still doing roughly the same thing to serve your production applications as what takes place when running locally. As with networking, you don't want to be at the mercy of what you don't understand about operating systems.

Aside on Languages

It is easy to learn too many languages and frameworks, which wastes valuable time re-learning how to do something you already know rather than learning something truly new. I'm not 100% sure on this, but there is a case to be made that you only need to pick up four languages:

One of the aforementioned big interpreted languages: Python or Type/JavaScript
A statically typed, garbage-collected language: Java, C#, or Go
A language with manual memory management: Probably C. Maybe you can include Rust in this category, and Zig would be a decent choice after its 1.0 release
A functional language: I am biased to F#. Haskell is a great language but comes with a steep learning curve, and Scala can sometimes face mixed OO/FP paradigm issues

C# and Go are great languages, but they are ultimately too similar to Java to justify me learning them. Someone who learned C# first should say the equivalent. The wrinkle in this list is for the interpreted languages. Almost all web applications are in Type/JavaScript and the language works well server side, but JS has its quirks and Python's plotting, analytics, and ML libraries make the language worth learning. Maybe you just can't get away without learning both.

This isn't an original observation; I got that trio from a Vicki Boykis blog post. As Boykis points out you don't need to reach absolute expertise in all three, but they are crucially important in any backend job.↩
I'm being a little hard on ASP.NET here, as compared to Spring it is easier to learn incrementally.↩

https://iainschmitt.com/post/backend-developer-skill-tree

Uncomfortably Functional Kotlin

Sep 29, 2025

SPS hosts an informal, internal technology conference every year. This was where I presented work on a stock exchange simulation project a few weeks back. The project was mainly an excuse for learning a couple of technologies that I thought would be fun to use. One of said technologies was Kotlin, which was my server-side language of choice.

Kotlin: better in every way

Kotlin is a JVM language that was first released 16 years after Java. That is a lot of time to make something better, but Kotlin was worth the wait. The following are just a few reasons why it is a joy to work with:

Completely interoperable with Java: you don't need to leave behind three decades of dependable Java libraries
Nullable types: the compiler enforces null-safety
Abbreviated class syntax: you can describe an entire constructor in a class signature, cutting down on boilerplate
Flow control expressions: both if and when statements are expressions that evaluate to their results
Top-level functions: very nice to have these, there is a reason they have been in C# for a while now
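
A short sketch can make a few of the items above concrete. The Order type and the functions around it are hypothetical names made up for illustration; only the language features themselves come from the list.

```kotlin
// Abbreviated class syntax: the primary constructor and its properties
// are declared in the class signature itself.
data class Order(val ticker: String, val quantity: Int, val limitPrice: Int?)

// Top-level function: no enclosing class is required.
// `when` is an expression, so its result is the function's return value.
fun describe(order: Order): String =
    when {
        order.quantity <= 0 -> "invalid"
        order.limitPrice == null -> "market order for ${order.ticker}"
        else -> "limit order for ${order.ticker} at ${order.limitPrice}"
    }

// Nullable types: limitPrice is Int?, so the compiler rejects arithmetic on it
// until the null case is handled. The elvis operator supplies a fallback.
fun priceOrZero(order: Order): Int = order.limitPrice ?: 0

fun main() {
    val market = Order("ACME", 10, null)
    val limit = Order("ACME", 10, 42)
    // `if` is an expression as well.
    val label = if (priceOrZero(limit) > 0) "priced" else "unpriced"
    println(describe(market))
    println(describe(limit))
    println(label)
}
```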

As compared to the Java I write every day at work, Kotlin is better in every way. My functional programming bias certainly comes into play here, but any programming language that is well-used today but didn't exist in the 90s must offer something meaningful to displace alternatives. Kotlin is no exception. What is rather surprising is how far you can take the Java interop: using the 'Convert Java File to Kotlin File' command in IntelliJ I converted one of my team's controller classes to Kotlin in about two minutes during a demo of the language. I assumed you couldn't run Java and Kotlin side-by-side in the same Maven module, but I didn't see any issues in doing so; the Kotlinised endpoints worked without issue. There is a learning curve coming from Java because Kotlin has more syntax. But this is made worthwhile because the additional syntax allows you to be more concise.

While there are many functional programming features in Kotlin, there isn't language-level support of Either and Option. This isn't much of a surprise given the Java interop and nullable types, but I was impressed with the Arrow functional programming library's implementation of Either and Option. In the exchange simulation I used these extensively given how familiar I am working with them in F#, and the library has something equivalent to computation expressions to work with these types. F# computation expression let! assignments are evaluated by calling the bind function of the expression's monad type, and Arrow has used Kotlin's type-safe builders to accomplish the same thing.

In the snippet below, the either expression will short circuit during the val y assignment because maybeY is a Left type representing a failure rather than a Right wrapping the intended Int. Otherwise, if maybeY were a Right, a would have been a Right type wrapping the sum of x and y.

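// Imports assume Arrow 1.2 or later; earlier releases expose the either builder from a different package.
import arrow.core.Either
import arrow.core.raise.either
import arrow.core.right

fun arrowEitherDemonstration() {
    val maybeX: Either<Nothing, Int> = 1.right()
    val maybeY: Either<String, Int> = Either.Left("left failed")

    // bind() unwraps a Right, or exits the either block early with the Left.
    val a = either {
        val x = maybeX.bind()
        val y = maybeY.bind() // short circuits here; x + y is never evaluated
        x + y
    }
    // Prints "fold failed" because a is a Left.
    a.fold({println("fold failed")}, { println(it) })
}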

It isn't terribly clear to me how the Arrow authors enable this short-circuit behavior; the library seems to be using every Kotlin trick in the book to make this syntax work. F#'s computation expression syntax isn't quite as elegant (especially for defining new computation expressions), but it is more straightforward and whenever you see a let!, do!, or similar you know exactly what that means in F#. But all in all, Arrow brings a lot of what makes F# fun into Kotlin, and I don't really miss F#'s partial application and function signature type inference when working in Kotlin.
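
One plausible mechanism for the short circuit is a builder whose bind throws a private exception that the builder itself catches. The sketch below is a toy built on that assumption, with its own Either type and hypothetical names (EitherScope, ShortCircuit); it is not Arrow's code and may differ from what the library actually does.

```kotlin
// A toy Either, declared here so the sketch has no dependencies.
sealed class Either<out L, out R> {
    data class Left<out L>(val value: L) : Either<L, Nothing>()
    data class Right<out R>(val value: R) : Either<Nothing, R>()
}

// Carries the Left value out of the builder block.
class ShortCircuit(val left: Any?) : RuntimeException()

// Receiver of the builder lambda, so bind() is only callable inside either { }.
class EitherScope<L> {
    fun <R> Either<L, R>.bind(): R = when (this) {
        is Either.Right -> this.value
        is Either.Left -> throw ShortCircuit(this.value)
    }
}

// Runs the block; a thrown ShortCircuit becomes a Left, a normal return becomes a Right.
@Suppress("UNCHECKED_CAST")
fun <L, R> either(block: EitherScope<L>.() -> R): Either<L, R> =
    try {
        Either.Right(EitherScope<L>().block())
    } catch (e: ShortCircuit) {
        Either.Left(e.left as L)
    }

fun main() {
    val maybeX: Either<String, Int> = Either.Right(1)
    val maybeY: Either<String, Int> = Either.Left("left failed")

    val a = either<String, Int> {
        val x = maybeX.bind()
        val y = maybeY.bind() // throws ShortCircuit, so x + y never runs
        x + y
    }
    println(a)
}
```

A toy like this gets nested builders wrong, since an inner block can catch an exception meant for an outer one, which hints at why a production library needs more machinery.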

But as far as Arrow can take you, there are still real and frustrating language-level limitations to going down the functional programming rabbit hole in Kotlin.

Hitting the Language Wall

In Haskell, every single side effect producing function must be monadically abstracted. If you try to log to standard output or read a file in an Int returning function, your program will not build: logging and I/O are side effects rather than pure functions. To log or to carry out I/O the function will need to return a Writer or IO monad type instead. This takes getting used to but it allows you to read a Haskell function type signature and immediately tell if the function is pure.
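
For contrast, here is a minimal Kotlin sketch with two hypothetical functions. Both have the same type as far as the compiler is concerned, so nothing in the signature says which one writes to standard output.

```kotlin
// Pure: the result depends only on the arguments.
fun pureAdd(x: Int, y: Int): Int = x + y

// Impure: same signature, but it logs as a side effect.
fun loggingAdd(x: Int, y: Int): Int {
    println("adding $x and $y") // invisible in the type
    return x + y
}

fun main() {
    // Both references fit the same function type.
    val f: (Int, Int) -> Int = ::pureAdd
    val g: (Int, Int) -> Int = ::loggingAdd
    println(f(1, 2))
    println(g(1, 2))
}
```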

There's a Reddit post from r/fsharp titled "Is it worth using the IO monad in F#?" that I'm reminded of whenever I try to crowbar this behaviour into another language. The top comment says:

I'd strongly advise against trying to write Haskell in F#. It's not idiomatic, it's slow and people do not expect it.

This is, unfortunately, quite defensible in F# and even more so in Kotlin. I also refuse to accept it: bringing the best aspects of Haskell into other languages that I know and like is too appealing. Luckily, Vermeulen, Bjarnason, and Chiusano's 2021 book Functional Programming in Kotlin was written with exactly this idea in mind. Chapter 13, titled "External effects and I/O", isn't an easy read but is rather thought-provoking. That chapter alone makes it worth buying the book, and it starts off with a naive IO monad implementation, similar to the following:

interface IO<A> {
    companion object {
        fun <A> unit(a: () -> A) = object : IO<A> {
            override fun run(): A = a()
        }
        operator fun <A> invoke(a: () -> A) = unit(a)
    }

    fun run(): A

    fun <B> map(f: (A) -> B): IO<B> =
        object : IO<B> {
            override fun run(): B = f(this@IO.run())
        }

    fun <B> flatMap(f: (A) -> IO<B>): IO<B> =
        object : IO<B> {
            override fun run(): B = f(this@IO.run()).run()
        }
}

This IO implementation would probably work for most use cases, but flatMap ends up nesting IO#run calls in a way that will force a stack overflow if called enough times. This can be fixed by replacing stack frames with objects on the heap, which was done in the book by baking the control flow into a sealed class hierarchy:

sealed class IO<A> {
    companion object {
        fun <A> unit(a: A): IO<A> = LiftF { a }
    }

    fun <B> bind(f: (A) -> IO<B>): IO<B> = Bind(this, f)
    fun <B> map(f: (A) -> B): IO<B> = bind { a -> Pure(f(a)) }
    fun <B, C> map2(ma: IO<A>, mb: IO<B>, f: (A, B) -> C): IO<C> =
        ma.bind { a -> mb.bind { b -> LiftF { f(a, b) } } }
}

data class Pure<A>(val a: A) : IO<A>()
data class LiftF<A>(val thunk: () -> A) : IO<A>()
data class Bind<A, B>(
    val m: IO<A>,
    val continuation: (A) -> IO<B>
) : IO<B>()

The next step is a tail-recursive call that operates over the Pure, LiftF, and Bind. Working around JVM type erasure makes this a little awkward, but it works1:

@Suppress("UNCHECKED_CAST")
tailrec fun <A> run(io: IO<A>): A =
    when (io) {
        is Pure -> io.a
        is LiftF -> io.thunk()
        is Bind<*, *> -> {
            val outerM = io.m as IO<A>
            val outerContinuation = io.continuation as (A) -> IO<A>
            val nextIO = when (outerM) {
                is Pure -> outerContinuation(outerM.a)
                is LiftF -> outerContinuation(outerM.thunk())
                is Bind<*, *> -> {
                    val innerContinuation = outerM.continuation as (A) -> IO<A>
                    val innerM = outerM.m as IO<A>
                    innerM.bind { a: A -> innerContinuation(a).bind(outerContinuation) }
                }
            }
            run(nextIO)
        }
    }
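
As a quick check of the interpreter above, the sketch below repeats the same definitions, trimmed to what the demo needs, and runs a chain of 100,000 binds. The countTo helper is a hypothetical name added for the demonstration; the chain it builds is left-nested, which is the shape the inner Bind branch re-associates.

```kotlin
sealed class IO<A> {
    fun <B> bind(f: (A) -> IO<B>): IO<B> = Bind(this, f)
}

data class Pure<A>(val a: A) : IO<A>()
data class LiftF<A>(val thunk: () -> A) : IO<A>()
data class Bind<A, B>(val m: IO<A>, val continuation: (A) -> IO<B>) : IO<B>()

@Suppress("UNCHECKED_CAST")
tailrec fun <A> run(io: IO<A>): A =
    when (io) {
        is Pure -> io.a
        is LiftF -> io.thunk()
        is Bind<*, *> -> {
            val outerM = io.m as IO<A>
            val outerContinuation = io.continuation as (A) -> IO<A>
            val nextIO = when (outerM) {
                is Pure -> outerContinuation(outerM.a)
                is LiftF -> outerContinuation(outerM.thunk())
                is Bind<*, *> -> {
                    // Re-associate to the right so the next step is found
                    // by looping rather than by nesting calls.
                    val innerContinuation = outerM.continuation as (A) -> IO<A>
                    val innerM = outerM.m as IO<A>
                    innerM.bind { a: A -> innerContinuation(a).bind(outerContinuation) }
                }
            }
            run(nextIO)
        }
    }

// Hypothetical helper: chains n binds, each adding 1 to the running total.
fun countTo(n: Int): IO<Int> {
    var io: IO<Int> = Pure(0)
    repeat(n) { io = io.bind { total -> LiftF { total + 1 } } }
    return io
}

fun main() {
    println(run(countTo(100_000)))
}
```

Because run is tailrec, the compiler turns the recursive call into a loop, and the pending work lives in Bind objects on the heap instead of in stack frames.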

This is a trampoline2, and after it is introduced in chapter 13 the authors point out that the trampoline can be adapted to create an Async monad. They then show that if you define the trampoline for an abstract type constructor, you end up defining the very useful Free monad. But I am relatively sure that this requires higher-kinded type support in Arrow that was removed from the library since publication of the book. Arrow used to have its own IO monad implementation as well as Semigroup and Monoid interfaces, but the library has since walked back from functional maximalism. One reason for this may be that you hit something of a wall if you want to add anything on top of the IO monad.

One way to show this is to walk through an incredibly basic Haskell application that both carries out IO and writes logs. The snippet below serves a single GET endpoint which generates a random number, responds with "Hello World", and logs to standard output. The Writer monad uses the tell function to accumulate logs, and runWriterT returns the Text result of businessLogic alongside the [String] logs created in the process, both wrapped in IO. These are assigned to result and logs in the endpoint respectively.

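{-# LANGUAGE OverloadedStrings #-}

-- Control.Monad.Writer.CPS, cited in the references, should work here as well
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Writer (WriterT, runWriterT, tell)
import Data.Text (Text)
import qualified Data.Text.Lazy as TL
import System.Random (randomRIO)
import Web.Scotty (get, scotty, text)

-- Accumulates log lines in the Writer layer while doing IO underneath
businessLogic :: WriterT [String] IO Text
businessLogic = do
    tell ["processing"]
    randomNum <- liftIO $ randomRIO (1, 100 :: Int)
    tell ["generated random number: " ++ show randomNum]
    tell ["done"]
    return "Hello World"

main :: IO ()
main = scotty 3000 $ do
    get "/" $ do
        -- runWriterT yields the result paired with the collected logs
        (result, logs) <- liftIO $ runWriterT businessLogic
        liftIO $ print logs
        text (TL.fromStrict result)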

This is possible because WriterT is a monad transformer, which allows for layering multiple monads together. In this case the WriterT [String] IO Text is a combination of the IO monad and the Writer monad. Monad transformers are also made possible by higher-kinded types that are supported by Haskell and Scala, but not Kotlin. I mention this to demonstrate that even in this very simple application, IO is not enough. Many applications will also require State and Reader and while it may be possible to define some IOWriter in Kotlin, it increasingly feels like you're hitting a wall. Kotlin simply wasn't meant to do this.
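To make the 'some IOWriter' idea concrete, here is a minimal sketch of what a hand-rolled combination might look like. Everything in it (IOWriter, unsafeRun, tell, liftIO, and the nextRandom parameter) is a hypothetical name of my own and not part of Arrow or the book. It is also not stack-safe the way the trampoline is, and each further effect such as State or Reader would need another bespoke class like this one, which is rather the point.

```kotlin
// Hypothetical stand-in for WriterT [String] IO: a deferred computation that
// returns its result together with the log lines written along the way.
class IOWriter<A>(val unsafeRun: () -> Pair<A, List<String>>) {
    // Sequence two steps and concatenate their logs. Not stack-safe.
    fun <B> bind(f: (A) -> IOWriter<B>): IOWriter<B> = IOWriter {
        val (a, firstLogs) = unsafeRun()
        val (b, secondLogs) = f(a).unsafeRun()
        b to (firstLogs + secondLogs)
    }

    companion object {
        // Wrap a plain value with no logs and no side effects.
        fun <A> pure(a: A): IOWriter<A> = IOWriter { a to emptyList<String>() }

        // Record a single log line.
        fun tell(line: String): IOWriter<Unit> = IOWriter { Unit to listOf(line) }

        // Defer a side-effecting thunk without logging anything.
        fun <A> liftIO(thunk: () -> A): IOWriter<A> = IOWriter { thunk() to emptyList<String>() }
    }
}

// Mirrors businessLogic from the Haskell snippet. The random source is passed in
// as a parameter (an assumption of this sketch) so that it can be stubbed.
fun businessLogic(nextRandom: () -> Int): IOWriter<String> =
    IOWriter.tell("processing").bind {
        IOWriter.liftIO(nextRandom).bind { n ->
            IOWriter.tell("generated random number: $n").bind {
                IOWriter.tell("done").bind { IOWriter.pure("Hello World") }
            }
        }
    }
```

Calling businessLogic { kotlin.random.Random.nextInt(1, 101) }.unsafeRun() would hand back the greeting paired with the three log lines, much like runWriterT does, but without do-notation the nesting gets deep quickly.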

We use much more Kotlin than Scala at SPS and I have only good things to say about the language, so I don't regret picking it up. But it seems like you can only get about 85% of the way to 'full monad', which is a disappointment.

References

Patrick McKenzie. 2025. Developing In Stockfighter With No Trading Experience.
Brian Nigito. 2017. How to Build an Exchange
Rachel Wonnacott. 2025. How to Build an Exchange. At Manifest 2025. Berkeley, CA.
The Arrow Authors. 2017-2025. Arrow. GitHub repository.
Microsoft. 2023. F# Language Reference: Computation Expression
Kotlin Foundation. 2025. Type-safe Builders.
Reddit. 2017 r/fsharp: Is it worth using the IO monad in F#?
Marco Vermeulen, Rúnar Bjarnason, and Paul Chiusano. 2021. Functional Programming in Kotlin. Manning Publications, USA. ISBN: 9781617297168.
Marco Vermeulen, Rúnar Bjarnason, and Paul Chiusano. 2011-2025. Functional Programming in Kotlin. GitHub repository
Rúnar Bjarnason. 2012. Stackless Scala With Free Monads
Andy Gill. 2001. MTL Library: Control.Monad.Writer.CPS

It may be possible that the eager function call in innerM.bind could force a stack overflow but I haven't proven this↩
I'd recommend Rúnar Bjarnason's Scala paper, which looks like something of a precursor to Chapter 13.↩

https://iainschmitt.com/post/uncomfortably-functional-kotlin

T-Mobile's Fiber Ambitions

Sep 18, 2025

US Internet (USI) is a regional ISP based out of Minnetonka, a western suburb of Minneapolis, Minnesota. Its primary offering is fiber optic broadband, and its service area is about 2/3rds of Minneapolis and a smattering of her western suburbs. At most about half a million people are in the USI service area, so the company isn't a household name anywhere outside of Minnesota. While few ISPs are well loved, anecdotally USI has a good reputation; the CEO is known to answer customer support questions in Reddit DMs. Regional fiber ISPs in the Midwest don't often attract the attention of the EU's competition authority, but on June 13th the Competition Director-General Olivier Guersent approved the sale of US Internet to T-Mobile US (TMUS) and the investment bank KKR holdings. This is one of the fun things about the EU: because T-Mobile's parent company is based in Germany, they have to let competition authorities know about the purchase of a small American ISP months before any of said ISP's customers hear about it. The general public in Minnesota got news of the sale in early August 2025,1 but USI isn't the only regional ISP newly under TMUS ownership: Metronet and Lumos were purchased in 2024 and 2025 respectively. T-Mobile isn't new to the broadband game given that they already have a fixed-wireless access (FWA) home internet service, but the 'T-Mobile Fiber' brand that these ISPs are being rolled into is a new terrestrial broadband offering.

T-Mobile is, first and foremost, a mobile network operator (MNO). So why are they getting into the terrestrial broadband game? This is ultimately pretty predictable given the state of the American mobile network market in 2025. At this point, almost everyone who wants cellular broadband can get it. Thanks to the Cambrian explosion of mobile virtual network operators (MVNOs), this is even true for downmarket segments. It isn't 2010 anymore when vanishingly few American consumers had a smartphone, and many didn't even have a cellphone at all. Given slow population growth and high smartphone adoption, AT&T, Verizon, and T-Mobile cannot grow the absolute size of the market very much. Getting higher revenue requires increasing average revenue per account (ARPA) and playing 'defence' to maintain your existing customers. T-Mobile's ISP strategy plays to both of these.2

The ARPA impact of buying an ISP is pretty clear: when T-Mobile buys a regional ISP, existing T-Mobile wireless customers who were on the incoming ISP will now shift their fiber spend to T-Mobile, driving ARPA higher.3 The churn-fighting part of this is more interesting. Because cellphones fit into pockets, customers can switch carriers without new equipment being installed at their home or business. With the advent of eSIMs, it can mean switching without ever leaving the home. But terrestrial broadband can be higher friction, requiring burying a new cable to the customer and installing new equipment in the most involved circumstances. For the vast majority of consumers who think as little as they possibly can about their internet service, remembering that switching off of T-Mobile will require changing their home internet may convince them to stay at T-Mobile rather than leave for Verizon or AT&T. So if you're a mobile network, it's in your interest to also be an ISP. This sounds a little conspiratorial, but it is worth pointing out that there is very little overlap in the metropolitan areas where Verizon, AT&T, and T-Mobile offer fiber broadband even though all three carriers serve wireless customers in most of America. This is to the benefit of all three companies, because there is more friction in switching services if your new mobile carrier and ISP are different companies.

What I am pretty confident in is that this is not some insidious effort to jack up prices after consolidating players in non-cellular broadband. It would be wrong to say that T-Mobile's entry won't increase market concentration given their existing FWA offering. But FWA doesn't look to be that large of a player as compared to terrestrial broadband: in 2Q4 of 2025 Verizon had over twice as many wireline broadband customers as FWA customers. Craig Moffett also explained in a 2023 Stratechery interview that capacity constraints for MNOs may render fixed wireless broadband something of an industry afterthought. And with the comparatively small ISPs that T-Mobile has bought up, they wouldn't have enough market share to dictate pricing terms. This all only makes sense in the context of protecting their core business.

I will admit that the Twin Cities broadband market is in worse shape than I had thought prior to writing this post. I was able to find a residential block in Golden Valley that was 15 minutes from downtown Minneapolis and 10 minutes from General Mills headquarters, yet according to the FCC National Broadband Map the only non-satellite broadband options were T-Mobile FWA, copper from CenturyLink, and DOCSIS over cable from Xfinity. Having DOCSIS as the fastest option strikes me as bizarre for such a centrally located suburb.

References

Ben Thompson. An Interview with Craig Moffett About Charter vs. Disney and the Path Dependency of the Communications Industry. 14 September 2023.

FCC National Broadband Map. United States Federal Communications Commission.

J.D. Duggan. Minneapolis/St. Paul Business Journal. T-Mobile to acquire Twin Cities fiber provider U.S. Internet, expanding home internet footprint. 7 August 2025.

Official Journal of the European Union. 7 May 2025. Non-opposition to a notified concentration, Case M.11985

T-Mobile. T‑Mobile and KKR Announce Joint Venture to Acquire Metronet and Offer Leading Fiber Solution to More U.S. Consumers. 24 July 2024.

T-Mobile. T‑Mobile and EQT Close Joint Venture to Acquire Lumos and Expand Fiber Internet Access. 1 April 2025.

Verizon. Form 10-Q, 2Q 2025. 25 July 2025.

I must have been on the wrong email list, because I got the 'Exciting News! US Internet is now a part of the T-Mobile Fiber family' early this September.↩
As Ben Thompson often notes, it is also usually cheaper to retain existing customers than pay marketing and sales costs to acquire new ones. I can only imagine that the same is true for large mobile network operators. While MVNOs may have a different customer acquisition profile because of their smaller size, the sheer scale of traditional MNO marketing spend makes this plausible.↩
This almost goes without saying, but newly acquired fiber customers who aren't yet T-Mobile wireless customers are probably much easier for the company to convert given that it necessarily has the contact information for and a billing relationship with said customers.↩
I don't understand why Verizon calls it '2Q' and not 'Q2', but I will respect the Verizon style guide just this once↩

https://iainschmitt.com/post/t-mobile-fiber-ambitions

August Event Stream Notes

Aug 24, 2025

Kafka Consumer Group Offset Durability

A few weeks back I read Taylor Troesh's "How/Why to Sweep Async Tasks Under a Postgres Table". Not only does it show off how elegant the Postgres NPM package is, more importantly it shows some good patterns for using PostgreSQL in place of an event stream or message queue. Because I'm sympathetic to arguments that Kafka can often overcomplicate an application, I was receptive to his post. Here Troesh wrote: "In my experience, transaction guarantees supersede everything else", which reminded me of my least favourite aspect of Kafka consumer groups.

For the uninitiated, Kafka consumer groups are best explained by an example. In event streams, topics and partitions serve analogous roles to tables and shards in relational databases, and Kafka consumer groups allow for multiple consumers to coordinate the consumption of events from a given topic.1 Let's say that a payment processor is placing all transaction attempts into a card_tx_attempts Kafka topic that has four partitions. There might be many different services consuming from card_tx_attempts, including a service that records possibly fraudulent transactions for further investigation. If every instance of a fraud analysis service was consuming from card_tx_attempts as part of a fraud_analysis_service consumer group, the Kafka broker will guarantee two things:

Every partition in the card_tx_attempts topic will have one and only one fraud_analysis_service consumer
As many fraud_analysis_service consumers will be active as possible

For example, if fraud_analysis_service starts with one consumer then that single consumer will be assigned to all four card_tx_attempts partitions. If an additional fraud analysis service consumer is added to fraud_analysis_service then a partition rebalance will occur: the broker will take two partitions from the first consumer and assign them to the new consumer, meaning each consumer will end up with two assigned partitions. If an additional two consumers are added then each card_tx_attempts partition will have one dedicated consumer, but any further consumers will be idle given that each partition can only be assigned to one consumer in the group.
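The arithmetic in that example can be sketched in a few lines of Kotlin. This is a toy round-robin model of my own and not Kafka's actual assignor logic, which has several strategies and tries to minimise partition movement; assignPartitions and the consumer names are hypothetical.

```kotlin
// Toy model: deal partitions out to consumers round-robin. Real Kafka assignors
// are more sophisticated; this only illustrates the counts in the example.
fun assignPartitions(partitionCount: Int, consumers: List<String>): Map<String, List<Int>> {
    // Every consumer appears in the result, even if it ends up with no partitions.
    val assignment = consumers.associateWith { mutableListOf<Int>() }
    if (consumers.isEmpty()) return assignment
    for (partition in 0 until partitionCount) {
        // Each partition goes to exactly one consumer in the group.
        val owner = consumers[partition % consumers.size]
        assignment.getValue(owner).add(partition)
    }
    return assignment
}
```

With four partitions, one consumer gets all four, two consumers get two each, four consumers get one each, and a fifth consumer comes back with an empty list, which is the idle case.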

Each time a batch of records is fetched and processed by a consumer in a group, the progress of the consumer group is committed and recorded in the __consumer_offsets topic. This means that when consumer groups are restarted they can pick up at the record offset where they left off.

However, Troesh's post reminded me how disappointing the consumer group offset tracking can be during transitions, and this prompted me to email Troesh with a subject line of 'Validating your Kafka scepticism' earlier this month. If a new consumer joins a running consumer group and triggers a partition rebalance, the default Kafka behaviour does absolutely nothing to save progress inside of an event poll. If the consumer is polling 1000 events at a time and a rebalance occurs while it's processing the 999th event, you have a problem. As far as the broker is concerned, none of those events were actually consumed by that consumer group; the consumer couldn't commit its progress before losing access to the partition. This is, notably, something that PostgreSQL does not remotely struggle with when used as Troesh showed in his async tasks post.
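To make the failure mode concrete, here is a toy model of commit-after-batch behaviour. It does not use the Kafka client at all; ToyPartition, processBatch, and revokeAfter are hypothetical names, and the only thing being modelled is that the committed offset moves after the whole batch is done.

```kotlin
// Toy model of one partition and its committed consumer group offset.
class ToyPartition(val size: Int) {
    var committedOffset: Int = 0 // next offset the group will read from
}

// Processes one batch starting at the committed offset and returns the offsets handled.
// If revokeAfter is non-null, the consumer loses the partition after handling that
// many events and never gets the chance to commit.
fun processBatch(partition: ToyPartition, batchSize: Int, revokeAfter: Int? = null): List<Int> {
    val start = partition.committedOffset
    val end = minOf(start + batchSize, partition.size)
    val handled = mutableListOf<Int>()
    for (offset in start until end) {
        if (revokeAfter != null && handled.size == revokeAfter) {
            return handled // rebalance: partition revoked, nothing committed
        }
        handled.add(offset)
    }
    partition.committedOffset = end // commit only once the whole batch is done
    return handled
}
```

If the first consumer is revoked after handling 998 events of a 1000-event poll, the committed offset is still 0, so whichever consumer picks the partition up next handles those same 998 events again.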

To be fair to Kafka, there is an onPartitionsRevoked in ConsumerRebalanceListener that can define a callback that runs before the consumer is dropped from a partition, but this requires you to manually keep track of the events that you have processed. It also doesn't prevent duplicate event processing if the original consumer exits from a runtime error. Kafka Transactions are even less helpful. While Kafka producers support transactions, ConsumerConfig provides no such configuration because Kafka transactions are not designed for consumers. As stated in the official documentation:

Kafka transactions are a bit different from transactions in other messaging systems. In Kafka, the consumer and producer are separate, and it is only the producer which is transactional. It is however able to make transactional updates to the consumer's position (confusingly called the 'committed offset'), and it is this which gives the overall exactly-once behavior.

This may not even have been that bad an oversight for the original Kafka use case at LinkedIn, but the great irony is that append-only write-ahead logs are the exact structure that relational databases use to make performant transactional guarantees. There doesn't seem to be a good way to get real durability from consumer group offset progress, and these durability issues have been solved problems for decades in relational databases. I don't think it would be impossible to fix this, but I don't see how anyone can look at this behaviour and conclude that Kafka was designed with all of this in mind.

Further Thoughts on 'Kafka in One File'

In the last couple of weeks I've talked through some details about the idea from 'Kafka in One File' with a few people. The first thing I came to realise when talking with a principal engineer at SPS was that consumer groups wouldn't be necessary for this stream. With the producer, event stream, and consumer all on one host, it is more appropriate to push the responsibility of coordinating consuming threads to the consuming process; most of what makes consumer group assignments tricky anyway is maintaining consistency across distributed brokers.

This would mean a single consumer thread would act as the consumer group coordinator, with events passed to various worker threads. The consumer would also need to store its equivalent of __consumer_offsets somewhere, either on the stream itself or in a dedicated key-value store.

The other thing I came to a conclusion on was how to best carry out the concurrency control side of things. It turns out that full-file locking is relatively portable between operating systems and runtimes: the Unix fcntl system call and the Win32 equivalent2 behave similarly and are how file locking is accomplished in Java's FileChannel#lock, .NET FileStream constructors, and FileExt::lock_exclusive in the Rust fs2 crate. My original idea was to follow the pattern from SQLite: write the clients in Rust and call them from other languages using a foreign function interface, but this portable file locking would make it more viable to start prototyping in Kotlin first. This is especially true if I take pains to use Arrow result types and other Rust-like idioms.3 If I ever saw a need for using the stream in F# I'd likely go down the Rust FFI path, but it sounds like an F# client calling FileStream in the right way would cooperate with a Kotlin process via fcntl and LockFileEx on Unix and Win32 respectively.
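As a rough idea of what the Kotlin side could look like, here is a minimal sketch of an append guarded by a whole-file exclusive lock through FileChannel#lock. The function name appendUnderLock and the choice to return the starting byte position are my own assumptions. One caveat from the FileChannel documentation is that these locks are held on behalf of the whole JVM, so they coordinate separate processes but not threads inside one process.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Appends payload to the file while holding an exclusive lock on the whole file.
// Returns the byte position at which the payload starts.
fun appendUnderLock(path: Path, payload: ByteArray): Long =
    FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE).use { channel ->
        val lock = channel.lock() // blocks until the exclusive lock is granted
        try {
            val start = channel.size() // read the end of file only after locking
            channel.position(start)
            val buffer = ByteBuffer.wrap(payload)
            while (buffer.hasRemaining()) {
                channel.write(buffer)
            }
            channel.force(true) // flush content and metadata before releasing
            start
        } finally {
            lock.release()
        }
    }
```

Reading the file size only after the lock is granted is what keeps two cooperating producers from computing the same append position.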

I'm not sure if I'll actually give this a shot, but it would be a nice reprieve from yet another side project that is some flavour of a REST API. If this stream is horribly non-performant, my guess would be that forgoing index files would be the issue, but it would be nice to keep this to one file.

References

Apache Kafka Documentation, Section 4.7
Java Apache Kafka Client ConsumerConfig Documentation
Java Apache Kafka Client ConsumerRebalanceListener Documentation
Java Apache Kafka Client ProducerConfig Documentation
Java NIO FileChannel Documentation
Microsoft Learn Win32 LockFileEx Documentation
Microsoft Learn .NET FileStream Constructors Documentation
NPM Postgres Package
Rust fs2 Crate Documentation: FileExt Trait Documentation
Taylor Troesh. 2024. "How/Why to Sweep Async Tasks Under a Postgres Table". https://taylor.town/pg-task
William R. Stevens and Stephen A. Rago. 2013. Advanced I/O. In Advanced Programming in the UNIX Environment (3rd ed.). Addison-Wesley Professional, Upper Saddle River, NJ, USA.

A consumer group can also be assigned multiple topics↩
The Win32 equivalent is LockFileEx in fileapi.h but as best as I can tell this isn't a system call↩
Besides, this would give me an excuse to try writing Kotlin in a way that avoids heap allocation using object pools and other techniques, but I don't know exactly how well that would play with trying to make it as functional as possible.↩

https://iainschmitt.com/post/august-event-stream-notes

Kafka in One File

Jul 12, 2025

On a piece of software whose lack of existence confuses me.

A Love Letter to SQLite

When Redis's original BSD open-source licence was changed, Machine Learning Engineer Vicki Boykis mourned the occasion with "I love Redis with a loyalty that I reserve for close friends and family and the first true day of spring, because Redis is software made for me, the developer." While I have certainly worked with Redis far less than Boykis, this is exactly how I feel about SQLite. There may be more impactful software projects, but there is no software that I love so unreservedly as SQLite.

Despite how objectively great it is, most 'how to get started with relational databases' blogs and books don't use SQLite: when I was learning the basics of SQL I had to download a .pkg with MySQL with this clunky editor; it's probably still downloaded on my Mac, untouched since that holiday break I was using it.1 If you're setting up a new database for a project you'll have to provision compute, make sure it is available over the network, and secure it accordingly.

But with SQLite, you create a file and then run sqlite3 myNewDatabase.sqlite to set up your new tables. That's it. There's no daemon, no extra compute, no managed service. The daemonless, single file setup means you can check your database into version control or swap out between test and production with a single-line change. While it isn't as performant as PostgreSQL, the performance ceiling may be higher than you realise, emphasis mine:

The SQLite website (https://sqlite.org/) uses SQLite itself, of course, and as of this writing (2015) it handles about 400K to 500K HTTP requests per day, about 15-20% of which are dynamic pages touching the database. Dynamic content uses about 200 SQL statements per webpage. This setup runs on a single VM that shares a physical server with 23 others and yet still keeps the load average below 0.1 most of the time.

Event Stream Woes & Motivation

A few weeks back I had the idea of standing up Apache Kafka behind a reverse proxy for use in a side project. I have plenty of compute in my home server rack, but it is all hidden behind a reverse proxy on a Digital Ocean VPS to avoid exposing my private IP address. The idea was to open the broker up to any IP address and secure it with mTLS. At first Nginx streams didn't work on the VPS, so I moved over to HAProxy, but I had this annoying issue where the certificate presented by the Kafka domain name was for an unrelated application on Nginx on the same DMZ server. Given that the reverse proxy went through a port-forwarding rule straight to port 9093 where Kafka was listening, I don't know why this was happening. I am a half decent Linux system administrator and this was a big part of my last job, but in the end I decided to give up and started running the broker directly on the VPS. I have considerably less CPU, RAM, and disk to work with, but at least mTLS works. It all reminded me of what it is like to set up a database from scratch and made me grateful that we are using a Kafka managed service at work.

It all made me think that surely someone has created a high-quality, open-source "SQLite of Kafka" because that is exactly what I want. Given that SQLite fit my needs well as a database we're not talking about all that much data. But I find event streams interesting to work with given that they have useful qualities of a database, a write-ahead log, and a message queue. Much to my surprise, I haven't really found anything that fits this bill, let alone some actively maintained, well-used open-source project.

This breaks my mental model for open source software and the types of projects that get built. I can't be the first person to want this, and there are a lot of software engineers out there who are more experienced and talented than I. There are no barriers to entry to making this, and there are non-zero rewards to reputational capital (and more importantly intrinsic rewards) for those who make a robust, respected solution. My best explanation for why this doesn't already exist is that there aren't many situations where a problem needs more than an append-only log, but an event stream over the network isn't appropriate. It is, frankly, a somewhat contrived problem. But it is certainly a simpler problem than a full relational database; chapter 3 of Travis Jeffery's excellent 'Distributed Services with Go' walks through writing a memory-mapped append-only log and based off of reading the chapter it seems a decent append-only log can be written in an afternoon. All in all, it is somewhat tempting to give this a whirl.

How This Might Work

Any project worthy of the name must satisfy the following

Like SQLite, constitute just a file format and a binary for interacting with the file format
Event byte arrays for message keys and values, with consumers responsible for deserialisation
Consumer message handlers run after events are fetched
Support for multiple topics
Support for consumer groups

A minimal log message would have to include the key, the partition, and the actual message payload itself. From what I have read, Kafka stores these as length-prefixed values, so something like

[4-byte total length] [2-byte key length] [key] [2-byte partition length] [partition] [value]

With the value size calculated from the total length left over after the key, partition, and their respective length bytes. However, because all messages are going onto one file, the topic would also need to be included. And just as Kafka stores consumer group offsets in __consumer_offsets, something similar could be done for consumer group offset persistence here. I don't see multi-partition support as a hard requirement; they could be used by different consumers in the same consumer group, but this would only be useful when the event production rate is faster than the single-partition event consumption rate.
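A minimal sketch of encoding and decoding that layout with java.nio.ByteBuffer might look like the following. It makes a few assumptions that the layout above doesn't settle: the 4-byte total length counts everything after itself, all integers are big-endian, and the partition is an opaque byte array. LogRecord, encode, and decode are hypothetical names, and the topic field is left out but would presumably be framed the same way as the key.

```kotlin
import java.nio.ByteBuffer

// One record in the assumed layout:
// [4-byte body length][2-byte key length][key][2-byte partition length][partition][value]
class LogRecord(val key: ByteArray, val partition: ByteArray, val value: ByteArray)

fun encode(record: LogRecord): ByteArray {
    require(record.key.size <= 0xFFFF && record.partition.size <= 0xFFFF) {
        "key and partition must each fit in a 2-byte length prefix"
    }
    // Assumption: the total length covers everything after the 4-byte prefix.
    val bodyLength = 2 + record.key.size + 2 + record.partition.size + record.value.size
    val buffer = ByteBuffer.allocate(4 + bodyLength) // big-endian by default
    buffer.putInt(bodyLength)
    buffer.putShort(record.key.size.toShort())
    buffer.put(record.key)
    buffer.putShort(record.partition.size.toShort())
    buffer.put(record.partition)
    buffer.put(record.value)
    return buffer.array()
}

fun decode(bytes: ByteArray): LogRecord {
    val buffer = ByteBuffer.wrap(bytes)
    val bodyLength = buffer.getInt()
    // Mask with 0xFFFF so that lengths above 32767 are read as unsigned.
    val key = ByteArray(buffer.getShort().toInt() and 0xFFFF).also { buffer.get(it) }
    val partition = ByteArray(buffer.getShort().toInt() and 0xFFFF).also { buffer.get(it) }
    // The value has no prefix of its own: it is whatever remains of the body.
    val value = ByteArray(bodyLength - 2 - key.size - 2 - partition.size).also { buffer.get(it) }
    return LogRecord(key, partition, value)
}
```

The value carries no length prefix of its own, which saves a few bytes per record but means the total length has to be trusted when reading.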

Kafka's daemon allows it to handle obsolete event deletion, topic compaction, and many other background tasks even without active consumers or producers, but 'Kafka in one file' would have to intersperse all of these background tasks during normal operation just as SQLite does. I have three big outstanding questions about what this would look like

Kafka uses periodic heartbeats from consumers to determine if they are still alive. Offset commits prove consumer health when there are new events to be consumed, but otherwise how can consumer liveness be proven?
Can obsolete record deletion be carried out without locking all other producers and consumers? It looks like POSIX OSes may handle interleaved file appends, but appends should be simpler than the deletes themselves
Kafka works by separate index and store files. While events are stored in the store file, the index file maps event offsets to their location in the store file. The index file is memory mapped, and is this something that can even be supported?
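On the index question, one way to see what the index file buys is to sketch what happens without it: finding where each event starts means walking every length prefix from the beginning of the file. The sketch below assumes each record is a 4-byte big-endian body length followed by that many bytes, and scanRecordPositions is a hypothetical name. It rebuilds an in-memory list of positions rather than memory mapping anything.

```kotlin
import java.nio.ByteBuffer

// Returns the byte position at which each record starts, where list index n is
// event offset n. Assumes a 4-byte big-endian body length precedes each record body.
fun scanRecordPositions(log: ByteArray): List<Int> {
    val positions = mutableListOf<Int>()
    val buffer = ByteBuffer.wrap(log)
    while (buffer.remaining() >= 4) {
        val start = buffer.position()
        val bodyLength = buffer.getInt()
        // Stop at a truncated or corrupt tail rather than reading past the end.
        if (bodyLength < 0 || bodyLength > buffer.remaining()) break
        positions.add(start)
        buffer.position(buffer.position() + bodyLength) // skip over the body
    }
    return positions
}
```

A single-file design could plausibly run a scan like this when a client opens the file and keep the result in memory, at the cost of a start-up time that grows with the size of the log.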

Just as most uses of SQLite call the C libraries through FFI, it would seem appropriate to do the same here with a similarly capable system programming language, and it seems like this would introduce more than a few interesting concurrency challenges.

References

SQLite. Appropriate Uses For SQLite
Travis Jeffery. 2016. How Kafka's Storage Internals Work.
Travis Jeffery. 2021. Distributed Services with Go. Pragmatic Bookshelf, Raleigh, NC.
Vicki Boykis. 2024. Redis is forked.

The reasons for not using SQLite to teach relational databases aren't terrible in that you want to teach people the platforms they will be using at larger enterprises, but early on those benefits are, I think, swamped by how lightweight SQLite is↩

https://iainschmitt.com/post/kafka-in-one-file

The English Civil Wars and the Republic That Failed

Jun 19, 2025

In May 1660, Sir Thomas Fairfax crossed the English Channel with eleven other men to formally invite King Charles II to reclaim the throne. Fifteen years earlier, Fairfax had commanded the army that defeated and deposed Charles's father. What made a successful revolutionary general personally restore the monarchy he'd helped destroy?

Fairfax’s story shows how revolutions fail when their leaders sideline institutions on their path to consolidating power. American civic education criminally ignores this story: two decades of chaos that killed hundreds of thousands of people and devastated the lives of many more taught the liberal tradition invaluable lessons in how not to organize society. The story isn't all negative: as Britain was grappling with the aftermath of the Protestant Reformation, the events in this era put it on the path to religious toleration.

Constitutional Breakdown

Even before both sides started trying to kill each other, King Charles I and his Parliaments never got along. While Parliament only represented the wealthiest property-owning men in the kingdom, settled law was that taxes could not be raised without the consent of Parliament: any King without buy-in from wealthy subjects faced an uphill climb to getting anything accomplished.

When the King called his first Parliament in 1625,1 the House of Commons broke with over a century of tradition and didn't grant him the authority to collect import tariffs for life. Undeterred, the King simply collected the duties anyway until 1641. Unable to make much further progress with Parliament, the King decided to demand extra-Parliamentary forced loans to fund the crown. When five knights challenged the loans in court with a writ of habeas corpus, Charles fired the Chief Justice rather than risk an adverse ruling.

The plight of the five knights meant that Parliament conditioned any funding to the crown on the 1628 Petition of Right. This document forbade taxation without the consent of Parliament, imprisonment without due process, or quartering soldiers in civilian homes without consent. The King's assent to the Petition of Right did not prevent further disagreements with Parliament, and he decided to rule without them. From 1629 to 1640 the King did his best to stay out of military engagements that would require funding from Parliament and the strings that would be attached to it, so he used every trick in the book to raise money without Parliament in the 1630s:

Monopolies: the King sold patents for companies to have a monopoly on a given good, relying on a loophole in the 1624 Statute of Monopolies
Ship money: in wartime the King could raise money to build the navy from coastal counties, but King Charles I raised it from every county during peacetime
Wardship: if a gentry landlord was a tenant-in-chief and died without an heir, the King was entitled to the rents of the estate and had to be bought out before a successor could take ownership
Royal forests: fining people living in the suburbs of London and anywhere else that was technically royal forest at some point in the misty past

It's little wonder that when the King called a Parliament in 1640 it would be less than two years before both sides would meet on the battlefield.

How Republics Fail

The time had come for a political revolution that would restore the rule of law and check the power of an increasingly absolutist King. But unfortunately for the history of liberalism, Parliament and later republican governments were all too willing to run roughshod over the rule of law.

In 1638 when war broke out with Scotland, the Earl of Strafford was a trusted royal advisor who was involved with the military response. He was convinced that hardliners in Parliament were collaborating with the Scottish rebels.2 In response, the Commons prosecuted him for high treason. Extra seating was built in Westminster for the whole public to see how treacherous Strafford was, but representing himself in the impeachment proceedings he embarrassed the Commons by showing all of England how flimsy their case was. Parliament's star witness claimed that Strafford was trying to use Irish soldiers to suppress English subjects, but every other piece of testimony showed that Strafford was arguing for Irish soldiers to be used for the war in Scotland.

Parliament changed tactics and passed a bill of attainder against Strafford. This was an act that could declare someone guilty of a crime without a trial or jury. Missteps by the King later that year meant that he no longer had the political capital to both save his friend and ever expect anything from another Parliament. Royal assent of the bill meant that Strafford was publicly executed on May 12th, 1641 in front of a crowd of 100,000 people.

After the second English Civil War, the hardliners Oliver Cromwell and Henry Ireton used similarly shameful tactics to gain power. When the King lost the second English Civil War in 1648, many moderates in Parliament wanted to put him back on the throne under the Treaty of Newport. While Parliament in this day was hardly a representative body, it was reasonably representative of the public's view of the king: roughly half of England and Wales supported him over Parliament, with strong majorities in many parts of the kingdom. But rather than trying to convince their fellow MPs to draw a tougher line with the king, the army marched on London with Cromwell and Ireton's backing. Allegedly without their knowledge, Colonel Thomas Pride had a list of royalist and moderate MPs to physically bar from Parliament: these rightfully elected MPs were prevented from taking their seats for eleven years. This is the exact opposite of institutional legitimacy. This was military rule by another name, and a legislature handpicked by the army is not a legislature in any way that matters.

This much more pliant Parliament made it possible for Cromwell and his military allies to advance their agenda, but in doing so they weakened both the rule of law and Parliament's legitimacy. These were the very institutions that Fairfax and Cromwell fought a war to protect, and republican government could not survive without them.

Historians name this Parliament the 'rump' Parliament on account of its diminished size, and in January 1649 it put the King on trial. It wasn't a lawful act because it never got approval from the House of Lords, but the Commons went along with it anyway. Adding to the legal dubiousness is that the lower house could not act as a court of law according to any constitution in English history; when Strafford was put on trial, the Commons led the prosecution with the Lords serving as judges. In 1649 the upper house was essentially empty, and the Lords played no meaningful role in the King's prosecution. The Commons too was a husk of its former self; not only from its purged members, but those so disgusted that they stayed away in protest. None other than Fairfax's wife shouted ‘Not half! Not a quarter of the people of England. Oliver Cromwell is a traitor!’ during the proceedings; envoys from the royal family tried to get Fairfax to intervene to no avail.

It was a trial in name only. No due process, no lawful court, and no public confidence. This was no foundation for a lasting republic.

After executing the King on January 30th, the rump Parliament had neither a constitution for republican Britain nor much of a plan for putting one into place. By this point, Fairfax had become disillusioned with politics and the place that the revolution had taken Britain. But the chaos would continue. Cromwell forcibly dissolved the rump Parliament on April 20th of 1653, and the subsequent 'Barebone's Parliament,' convened under a new constitution that July, would last less than a year.3 By December the three realms were onto their next constitution, the Instrument of Government. This was the first Protectorate constitution and appointed Oliver Cromwell as Lord Protector and head of state for life.

The Protectorate was the most stable republican government in the interregnum, but it too faced similar problems to its predecessors. Cromwell dissolved Parliament early in January 1655 by exploiting a loophole in the Instrument of Government: the constitution required Parliament to sit for five months before dissolution, but didn't specify whether these were calendar months or lunar months. Cromwell counted five lunar months, allowing him to end the session weeks earlier than intended. In a modern democracy, dissolving the legislature using a lunar-month loophole would end your career. But in republican Britain, it was politics as usual.

Oliver Cromwell was a formidable statesman, and as of 1657 Britain was on a stable footing: the American republic followed a similar trajectory between Cornwallis's surrender and the ratification of the constitution. But the undoing of the republic was that no one else could balance the interests of Parliament, the army, and the people, and the Protectorate's relative stability ultimately said more about the man leading it than anything else. Before his death on September 3rd, 1658, Cromwell named his eldest surviving son Richard as his successor. Richard's plans to shrink the size of the army led to him being deposed less than a year later, and the only alternative was the equally unpopular old rump Parliament. The rump voted to dismiss army leadership, and after a two-week stand-off in October 1659 the army forcibly dissolved the rump.

At this point, the country went into open revolt. That December General George Monck raised an army in Scotland and marched south not to restore the current Parliament but rather to establish order and call for free, general elections for the first time in 20 years. When he heard of Monck's plan, the aged one-time rebel Fairfax raised troops in York in support. Enough was enough: the time for republican political experimentation was over. The half-hearted military junta in London was no match for the political capital and military power of Monck and his allies, and on May 1st, 1660, the newly elected Parliament declared that "the government is, and ought to be, by King, Lords, and Commons" to the surprise of absolutely no one. Fairfax and the rest of the Parliamentary delegation visited the exiled court of King Charles II in the Netherlands, and on the King's return to London the one-time hotbed of revolution celebrated for three days straight.

The lessons are clear. A republic lacking a commitment to the rule of law and institutions with broad legitimacy cannot last, and neither can a constitution built around a once-in-a-generation political talent. When a durable republic ends up with sub-par leadership, it has both the legitimacy and the checks and balances to muddle through. And durable self-government requires stability: otherwise too many of the people needed to make a republican government work will take their ball and go home. There probably was a republican constitution that Fairfax and Monck could have lived with, but those leading the revolutionary charge proved unable or unwilling to put it into practice.

It is worth pointing out the seeds of the American Constitution planted in this era. The Petition of Right established third and fifth amendment protections as well as the Article I taxation authority of the legislature. While civilian control of the executive branch is established in Article II, the executive cannot dissolve Congress and cannot remove judges at will. The Constitution explicitly bans bills of attainder not only in Congress but by state governments as well.

Religious Toleration

There are also more optimistic lessons to be taken from the conflict. It was an important milestone on the path to religious toleration, but this wasn't imposed by leading enlightenment philosophers but rather reached as an exhausted consensus after decades of tumult. The erosion of legitimacy was still a great driver, but towards more constructive ends.

In this day a strong majority of England, Wales, and Scotland was Protestant, and Catholics were subject to recusancy fines and other forms of persecution. While King Charles I was a devout Protestant, his marriage to the Catholic Queen Henrietta Maria from France fomented conspiracies that he was on the cusp of reimposing Catholicism on the state Church. In England, what would become Anglican doctrine was hotly debated inside and outside of the clergy: the puritans of the era argued that the existing episcopal hierarchy of bishops, emphasis on sacraments in worship, and ornate churches reeked of residual Catholicism that had to be purged. This was a broad social movement where Calvinism was the dominant theology, and their opponents included anti-Calvinists of different stripes. Some were bishops who wanted to preserve their standing, but others simply wanted to preserve the tradition and sacraments that had been a part of their worship for as long as they could remember.

As the king's unconstitutional taxation was facing opposition throughout England, his heavy-handed approach to the state Church ruffled at least as many feathers. As supreme governor of the Church, the King appointed many ceremonialist and anti-Calvinist bishops. Few members of the clergy were as staunch supporters of the King as William Laud, whose loyalty was rewarded by his appointment as Archbishop of Canterbury, the highest office in the Church of England. Laud was staunchly anti-Puritan and wasn't shy about using the power of the state to harass his theological enemies. Extrajudicial punishment out of the Star Chamber4 was the order of the day, most notably against three men in 1637 who criticised the episcopal hierarchy in print. As punishment they had their ears cut off, were fined £5,000, and were imprisoned for life. The shocked crowds treated them as martyrs, draping them in garlands as they made their way from the pillory. Puritans were not inclined to turn the other cheek when they had the power to do so: a 1643 act of Parliament commanded the destruction of altars, icons, and statues. Countless religious artefacts, both Catholic and Protestant, were destroyed in this time by several campaigns of iconoclasm.

Once armed conflict was coming to a close in the 1650s, Parliament started stumbling in the direction of religious liberty. The state Church was purged of many royalists and Parliament tried to ban the celebration of Christmas,5 but outside the abolition of bishops there was less puritan doctrine enforced on the Church than might have been expected. Debates about the direction of the state Church may have raged in the 1630s and 1640s, but both political elites and the general public had grown weary. While the late King was concerned that religious uniformity was necessary to keep his realms intact, religious toleration worked reasonably well in republican England. Church morality courts were abolished, and mandatory church attendance was repealed. The relaxed sanctions against independent congregations meant a proliferation of new sects.6 From Britain in Revolution:

All who professed faith in God by Jesus Christ were to be free to practise their religion, so long as they did not abuse it to the civil injury of others, or disturb the peace, or invoke it to justify licentiousness. This liberty was not to extend to 'popery or prelacy', but there was a widespread de facto toleration of Anglican [pre-war ceremonial] worship throughout the 1650s

Perhaps most surprisingly, Oliver Cromwell offered assurances that recusancy laws wouldn't be enforced on London's Jewish community, which had been underground to the extent it existed at all since the expulsion of all English Jews in 1290. This started a small Jewish migration to London in the 1650s, paving the way to outright legalisation of their status in 1664.

This is not what many radical puritans wanted out of revolutionary Britain. But most Protestants in all three realms weren't keen on either Archbishop Laud or doctrinaire Calvinists using the levers of state power to oppress dissenting Protestant views. Persecution of Catholics continued, but their treatment in practice was more lenient than the letter of the law during the interregnum. 'Live and let live', at least for some sects, emerged from exhaustion more than enlightenment.

Only scratching the surface

In this era, there were very few irredeemably bad or unimpeachably good characters. There were, and still are, debates about what the wars were fought for, and many important figures switched sides as the events of those decades took on a life of their own. This makes for a fascinating era of history, and there are many aspects of the story that didn't make it into this post: the origin of the Quakers, communities who believed Acts 4:32 commanded proto-socialist collective ownership, or how the Marquis of Montrose entered Scotland with two men but strung together enough victories to briefly conquer the country.7

While the American and French revolutions are more widely discussed, there are few better starting points for modern constitutionalism than the English Civil Wars. Since 1640, we can see how liberal democracies live and die on how well they learn these lessons from the past.

When people romanticize strong leaders or try to shortcut constitutional process, they’re forgetting why England’s republic failed and how America's could too.

References

Austin Woolrych. 2004. Britain in Revolution: 1625-1660. Oxford University Press, Oxford, UK.
Diane Purkiss. 2006. The English Civil War: A People's History. HarperCollins, London, UK.
Mike Duncan. 2013-2014. Revolutions: Season 1 [Audio podcast].
Jonathan Healey. 2023. The Blazing World: A New History of Revolutionary England, 1603-1689. Bloomsbury, London, UK.

In this era some sittings of Parliament were given names, and this one's is the 'Useless Parliament'↩
How true Strafford's claim is subject to debate, but the current consensus is that it is at least plausible. This helps explain the motivation for Parliament's later action against him.↩
The official name was the 'Nominated Assembly', but it was better known as 'Barebone's Parliament' for one of its members, Praise-God Barebone↩
While this is a figure of speech in modern times, this was a literal part of the English court in the 17th century↩
Before the Victorian era, English Christmas celebrations involved raucous debauchery and pre-Reformation traditions that were seen by the puritans of the day as a noxious combination of paganism and Catholicism. Parliament tried to ban both Christmas decoration and celebrations, but enforcement was spotty on account of the riots that would often erupt when local authorities tried to clamp down on the practice: the old Christmas celebrations had broad appeal with the masses.↩
If you've ever wondered why there are so many protestant denominations, 1650s England is part of the story↩
If you've got this far, you owe it to yourself to listen to Mike Duncan's excellent podcast series on the wars of three kingdoms.↩

https://iainschmitt.com/post/english-civil-wars

The next big asset class?

May 22, 2025

The next big asset class?

One anecdote that Byrne Hobart is fond of is that when Warren Buffett was getting his start in equity investing most Americans believed that the stock market was a place to bid on livestock. In Hobart's telling, one lesson that Americans took away from the 1929 crash was that equities were scammy nonsense best left entirely alone; Buffett came into the equities investing business late enough to have a fresh perspective but early enough that there were still many proverbial $100 bills left on the sidewalk. This is in part why Hobart argues that the Warren Buffett of the mid-21st century will be someone who gets their start in a new asset class: while there are still plenty of opportunities in traditional asset classes, getting an opportunity to shoot the fish-in-a-barrel before anyone else is a hard head start to beat.

A boring and good answer to "what asset class is to today what equities were in the early 50's" is 'Cryptocurrencies', but today they may be prominent enough of an asset class to be out of this consideration set. The more interesting answer is prediction markets, which are in something of a regulatory grey area: many jurisdictions treat prediction markets more like gambling and less like an above-board financial instrument.1 But regulatory burdens aren't the only roadblock for prediction markets to become a grand new asset class. The existing prediction markets will need to make some changes before they reach maturity, and breaking down the relevant players in stock exchanges helps bring this point home. I'm leaving out a few for brevity, but the relevant ingredients in a stock exchange are the investors, brokers, market makers, clearinghouses, regulators, investment banks, and of course the exchanges themselves. A given equity is listed on an exchange at a given price, and orders for said equity are placed by investors, intermediated by brokers and market makers. All of the same players are in place for equities options, but margin maintenance becomes more important to cap theoretically unlimited risk.

The most important real-money prediction market platforms are Polymarket and Kalshi. For the uninitiated, the Wikipedia table of contents for both markets tells you a lot. Polymarket has a 'Legal Issues' section and cannot legally operate in the United States, while Kalshi has 'Regulatory History' - not exactly confidence inspiring, but better than the former. While Kalshi has obtained regulatory approval from the CFTC, it has still faced the aforementioned regulatory scrutiny for its markets on political questions. Given that I know a lot more about Kalshi, I'm going to give more focus to it.2

Clearing houses and market makers will be more sophisticated in highly liquid centralised exchanges like the NYSE, NASDAQ, or LSE, but in over-the-counter marketplaces these roles are carried out directly by broker-dealers: "A Tegus call with a former OTC Markets Group employee notes that there are only about 300 people who actively make markets in these [OTC] stocks".3 This pales in comparison to highly liquid formal markets with a panoply of market makers and the sophisticated firms and institutions that enable them. But even compared to OTC markets, Kalshi is a step down in sophistication. The 'market' in 'prediction market' naturally means that the platform takes the role of the exchange, but Kalshi wears many other hats as well. The first of these is the broker hat. Traders don't have a 'Kalshi broker' because Kalshi is the broker that traders are using to place orders onto the exchange. While Kalshi makes much of its revenue from brokerage transaction volumes, it isn't all that difficult to imagine something more similar to centralised exchanges, where third-party brokerages submit orders and compensate Kalshi for enabling order submission.

'Kalshi as market maker' is trickier. Kalshi acts as a market maker providing liquidity for most of its markets, allowing for market orders to be processed and not relying solely on investor orders to populate an order book. Prediction markets face the same liquidity challenges as OTC markets, as they cannot count on the swarm of firms making markets on centralised exchanges. Manifold, a fake-money prediction market, uses a variant of a constant-product market maker based on Uniswap,4 and making markets in prediction markets has been an area of research for a long time.5 I'm not entirely sure which approach Kalshi has taken for making markets, but Manifold's market maker would be prone to being gamed if used in real-money markets.6 Recently Kalshi has started a market maker program, meaning that it isn't the only entity offering liquidity, but before you spend your weekend trying to get a Kalshi market-making PoC off the ground it seems like there are some high barriers to entry: "This [market maker] program is highly selective and requires participants to meet stringent criteria."7 This is a relatively new program and it's not clear which, if any, third-party market makers are working with Kalshi. I can't think of many reasons why the Jane Streets and Citadels of the world couldn't serve as prediction market makers. The task is made harder by prices that converge to zero or one at the end of a market resolution timeline, but options have similar challenges. Nothing other than low trading volumes seems to be particularly unique to prediction markets as compared to other assets from the market maker's perspective.
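To make the constant-product idea concrete, here is a minimal sketch of a plain Uniswap-style market maker for a binary market. It is an illustration under stated assumptions rather than Manifold's or Kalshi's actual mechanism: Manifold uses a variant, and the names `Pool`, `probabilityYes`, and `buyYes` are hypothetical. The pool holds YES and NO shares, the product of the two is held constant across trades, and the implied probability of YES rises as YES shares become scarcer in the pool.

```typescript
// Illustrative constant-product market maker for a binary (YES/NO) market.
// Assumption: the plain invariant yes * no = k, not any platform's real formula.
interface Pool {
  yes: number; // YES shares held by the market maker
  no: number; // NO shares held by the market maker
}

// Implied probability of YES: scarcer YES shares in the pool mean a higher price.
function probabilityYes(pool: Pool): number {
  return pool.no / (pool.yes + pool.no);
}

// Spend `amount` of currency on YES. The amount is minted into equal numbers of
// YES and NO shares, and the trader receives as many YES shares as keep yes * no constant.
function buyYes(pool: Pool, amount: number): { pool: Pool; shares: number } {
  const k = pool.yes * pool.no;
  const no = pool.no + amount;
  const yes = k / no;
  const shares = pool.yes + amount - yes;
  return { pool: { yes, no }, shares };
}

const startingPool: Pool = { yes: 100, no: 100 };
const trade = buyYes(startingPool, 100);

console.log(probabilityYes(startingPool)); // 0.5
console.log(trade.shares); // 150
console.log(probabilityYes(trade.pool)); // 0.8
```

In this sketch a 100-unit purchase moves the implied probability from 50% to 80%, and the trader pays an average of about 0.67 per share. A deeper pool would move less for the same trade, which is one way to see why liquidity matters so much to how these markets behave.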

The question of 'which entities can list and resolve markets' strikes me as the thorniest, so let's close with 'Kalshi as investment bank'. Kalshi is the only entity that publishes new questions for exchange, so it carries out the equivalent of the IPO process. While prediction markets can be modeled as binary options, I'd argue that this is something rather different from entities that write options against other securities, as these options require someone else to have brought a security to market for there to even be something to trade. Third parties listing and resolving markets raises several questions. Who ensures that the prediction is well-defined and resolvable? Who is responsible for settling disputes? Who do you sue if any of this goes wrong? Insurance marketplaces have had to handle principal-agent problems like these for longer than the modern financial system, and I imagine that Kalshi wouldn't mind offloading some of this hassle to dedicated entities that bring questions to market. But this calls into question who has the incentive to bring questions to market if not Kalshi. The marketplace makes money off of transaction fees so its incentives are obvious, but unless a third party had a forecasting stake in the outcome (like an ice cream retailer's interest in weather forecasting) they might need a piece of the action to incentivize market listing.

The repository Git history shows that probably-not-next-asset-class.md was the old name for this file, because in the process of writing this post I convinced myself that the non-regulatory barriers to prediction market adoption are solvable. Whether or not someone sees enough of an opportunity in fixing them remains to be seen, but a future where prediction markets are the next asset class looks to be one where prediction markets look like more of an NYSE or NASDAQ with third parties increasingly involved as brokers, market makers, and investment banks. The gains from specialisation that equities exchanges enjoy would then be shared by prediction markets, and a virtuous cycle between liquidity providers and traders could be established.

"Statement of Chairman Rostin Behnam Regarding CFTC Order to Prohibit Kalshi Political Control Derivatives Contracts"↩
And, unlike Polymarket, none of its executives have been detained by the United States federal government at least as far as I'm aware↩
The Diff "OTC Markets Group: Monopoly, Jr."↩
Maniswap: Manifold Market Maker ↩
Robin Hanson. 2007. Logarithmic Market Scoring Rules for Modular Combinatorial Information Aggregation. Journal of Prediction Markets 1, 1 (February 2007), 3-15.↩
I can only imagine that Jane Street's market making approach uses more sophisticated trading algorithms and risk management than what the likes of Manifold or Kalshi are capable of↩
Kalshi Support "Market Maker Program"↩

https://iainschmitt.com/post/the-next-asset-class

Software firms and subjective judgment

Apr 26, 2025

Software Firms and Subjective Judgement

If the CEO of LG wants to add a new button to the remote on one of their TV models, this will require prototyping the new remote, designing changes to the PCB and plastic casing, getting on the same page with their manufacturing contractor, and updating support documentation. I've certainly left out a few steps, but executives at a CRM SaaS company could ask for a 'Send Profile' button mid-morning and expect to see it in front of customers before lunch. This dictates a lot of what it means to work in the SaaS industry. If you are writing line-of-business applications with web-browser clients, you aren't constrained by having one shot to get your program right before it is written onto a game cartridge, and you aren't constrained by hardware specifications for 2000's PC programs. You have a much freer hand to pursue 'what drives business value to customers', which is incredibly nebulous.

Do you actually need that "Send Profile" button? Is there work scheduled for the next quarter that will automatically share profiles across customer profiles? What else would the engineer tasked with adding this button be doing, and is that more valuable? Is this customer on the verge of cancelling? Is this needed today? These are not always easy questions to answer, and they often can't be massaged into an Excel model to min/max a decision, especially given the relevant interpersonal dimensions. If you want to run your SaaS company into the ground, say 'yes' to every single customer feature request. At best you will make a confusing Frankenstein’s monster of an application that does nothing particularly well (other than confusing your customers). At worst your best engineers will become demoralised and consider other options as they see a piece of software crawling with bugs losing its battle against entropy.

The power to ship in less time than it takes to brew a pot of coffee can be terrible in the wrong hands, but it underscores that a lot of B2B SaaS companies are paid for their judgment on relatively subjective decisions. This can also be about how you model the business problem in the first place: if you are writing an expense reimbursement application, should you make an assumption that every expense is attributable to one budgetable category? That doesn't sound unreasonable. But sometimes that model won't be accurate. Let's say one of your customers sends an engineer to a conference. This engineer is presenting at one of the major sessions, but the customer has found out that some important sales prospects will be in attendance, so the engineer will also need to introduce themselves and perhaps do some wining-and-dining. If this hypothetical employee expenses their Delta ticket to "Sales Travel" rather than "Engineering: Other" because they've procrastinated on their expenses long enough that the prospect became a customer, then that's not the end of the world, but why not have fractional expense categories? That may be required for some customers, but any time a one-to-one restriction is relaxed you've made the problem more complicated. How should rounding be handled? Are there accounting consequences for doing so? If an expense is split between sales and engineering, should there be an approver from both teams?
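The rounding question alone shows how quickly the relaxed model grows teeth. Below is a minimal sketch of one possible answer, assuming amounts are stored as whole cents and leftover cents go to the categories with the largest fractional remainders. The names `splitExpense`, `WeightedCategory`, and `Allocation` are hypothetical, the weights are assumed to be non-negative, and nothing here reflects how any particular expense product works.

```typescript
// Hypothetical sketch: split one expense across categories by weight, in whole cents.
interface WeightedCategory {
  category: string;
  weight: number; // assumed non-negative
}

interface Allocation {
  category: string;
  cents: number;
}

function splitExpense(totalCents: number, parts: WeightedCategory[]): Allocation[] {
  const weightSum = parts.reduce((sum, p) => sum + p.weight, 0);
  if (!Number.isInteger(totalCents) || totalCents < 0 || weightSum <= 0) {
    throw new Error("totalCents must be a non-negative integer and weights must sum to more than zero");
  }

  // Round every share down first, remembering the fractional remainder.
  const shares = parts.map((p) => {
    const exact = (totalCents * p.weight) / weightSum;
    const cents = Math.floor(exact);
    return { category: p.category, cents, remainder: exact - cents };
  });

  // Hand out the leftover cents, one each, to the largest remainders.
  let leftover = totalCents - shares.reduce((sum, s) => sum + s.cents, 0);
  const byRemainder = shares.slice().sort((a, b) => b.remainder - a.remainder);
  for (const share of byRemainder) {
    if (leftover <= 0) break;
    share.cents += 1;
    leftover -= 1;
  }

  return shares.map((s) => ({ category: s.category, cents: s.cents }));
}

// A $100.00 ticket split two-to-one between the two categories from the example.
const ticketSplit = splitExpense(10000, [
  { category: "Sales Travel", weight: 2 },
  { category: "Engineering: Other", weight: 1 },
]);
console.log(ticketSplit.map((a) => `${a.category}=${a.cents}`).join(", "));
// Sales Travel=6667, Engineering: Other=3333
```

Even this small sketch embeds judgment calls: which category receives the odd cent, whether weights or fixed amounts are the right input, and what happens on ties. None of those questions exist under the one-category model.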

The contrast between B2B SaaS and open-source HTTP servers is telling: it isn't that there's no room for different approaches on domain model and feature sets, but the choices are more constrained. Apache and Nginx have made meaningfully different technical choices, but the minimal feature set is specified in a handful of RFCs. These programs and the projects backing them have a narrow, well-defined mandate for the job they are supposed to do, requiring fewer (but by no means zero) quasi-subjective decisions about where to add incremental value. Spolsky's Strategy Letter V explains how open source software is a complement to commercial software, but 'what type of judgement calls did the authors have to make' is a factor in determining which category of software a problem will end up in.

https://iainschmitt.com/post/software-firms-and-subjective-judgement

Prism Proxy Bug

Apr 12, 2025

Prism Proxy Bug: STOP-2386

A few disclaimers to start - nothing in this blog post discloses proprietary information from either SPS or our customers, and I did get company permission prior to writing this. This post and everything else on this website is written in my personal capacity and not on behalf of SPS or any previous employers; all opinions expressed here are my own. With that out of the way, my development team at work uses the @stoplight/prism-cli reverse proxy to enforce OpenAPI specifications on HTTP ingress for some of our REST API applications. When working properly, this requires less defensive programming from API authors and automates interface enforcement between services. Earlier this year, one of our Prism proxies crashed while processing a relatively large request that included a special character, ü.

/Users/iain/code/prism/node_modules/split2/index.js:44
      push(this, this.mapper(list[i]))
                      ^
sourcemap-register.js:1
SyntaxError: Expected ',' or '}' after property value in JSON at position 8193
    at Transform.parse [as mapper] (<anonymous>)
    at Transform.transform [as _transform] (/Users/iain/code/prism/node_modules/split2/index.js:44:23)
    at Transform._write (node:internal/streams/transform:175:8)
    at writeOrBuffer (node:internal/streams/writable:447:12)
    at _write (node:internal/streams/writable:389:10)
    at Transform.Writable.write (node:internal/streams/writable:393:10)
    at Socket.ondata (node:internal/streams/readable:817:22)
    at Socket.emit (node:events:514:28)
    at Socket.emit (node:domain:488:12)
    at addChunk (node:internal/streams/readable:376:12)

The split2 NPM package is used in Prism proxy to concatenate chunks of a readable stream together, and as shown in the stack trace the exception was thrown in that package. The stop-2386-bug-demonstration branch of my @stoplight/prism-cli fork has a cli:stop-2386 NPM script that I used to reproduce the bug without using any proprietary information from SPS or SPS customers. cli:stop-2386 uses the most current version of Prism proxy as of time of writing, version 5.12.0; although I used Node 20.9.0 while preparing this explanation, I have yet to find a Node version where I couldn't reproduce the error. The proxy can be configured to run as either a single process or in multiprocess mode, where the HTTP server and logger are run in separate processes of the same Node cluster. The cli:stop-2386 script runs using a multiprocess configuration in a manner equivalent to our production configuration.

Node event handling makes this unclear from the stack trace, but the error itself is raised when incoming HTTP request bodies are logged to standard output. To demonstrate the bug, a legal JSON object badInput is logged on startup for both multi-process and single-process configurations of the reverse proxy, but using this as a request body to the reverse proxy would also reproduce the issue.

const createMultiProcessPrism: CreatePrism = async (options) => {
  if (cluster.isMaster) {
    cluster.setupMaster({ silent: true });

    signale.await({ prefix: chalk.bgWhiteBright.black("[CLI]"), message: "Starting Prism…" });

    const worker = cluster.fork();

    if (worker.process.stdout) {
      pipeOutputToSignale(worker.process.stdout);
    }

    return;
  } else {
    const logInstance = createLogger("CLI", { ...cliSpecificLoggerOptions, level: options.verboseLevel });

    // Forcing the error
    logInstance.info({ badInput }, "Request received");
    return createPrismServerWithLogger(options, logInstance).catch((e: Error) => {
      logInstance.fatal(e.message);
      cluster.worker.kill();
      throw e;
    });
  }
};

const createSingleProcessPrism: CreatePrism = (options) => {
  signale.await({ prefix: chalk.bgWhiteBright.black("[CLI]"), message: "Starting Prism…" });

  const logStream = new PassThrough();
  const logInstance = createLogger("CLI", { ...cliSpecificLoggerOptions, level: options.verboseLevel }, logStream);
  pipeOutputToSignale(logStream);

  // Attempt to force the error
  logInstance.info({ badInput }, "Request received");
  return createPrismServerWithLogger(options, logInstance).catch((e: Error) => {
    logInstance.fatal(e.message);
    throw e;
  });
};

The transform function of split2 is invoked twice because when badInput is logged during cli:stop-2386 it is broken into two integer buffers with 8162 and 5069 elements respectively; it seems that this is the literal representation of a readable stream. When the two chunks are concatenated, an exception is thrown because a missing character prevents the result from being parsed as JSON. While I am no expert in Node readable streams, it appears that a pipe call in pipeOutputToSignale of the cluster's master process is transforming the worker process standard output when the error occurs. Much to my surprise, one of the buffers was already missing a character before it reached split2. Pipe.callbackTrampoline is the very first function called while piping to the master process, and the incoming chunk is passed as args[0]: the first badInput chunk ends with {"id":"2cd9545e58 and the second chunk starts with ,"values":["8015751025"]}.1 The missing closing quote after "2cd9545e58 breaks the JSON parsing, but earlier in the first chunk the special character seems to be represented correctly as {"id":"56c8","values":["be0bümmmmmmmmm"]},.2

When running the proxy as a single process the badInput log is passed as a single chunk and no exceptions are thrown. When the ü in badInput is replaced with a u, the end of the first chunk is {"id":"2cd9545e58", so parsing succeeds.3 Now is a good moment to discuss how badInput was made: it appears that a quote, comma, or curly brace must be the final character of a chunk to force the error. I have gone as far as building a Node runtime with debugging symbols to understand the error, but I didn't get very far in understanding where the character is dropped or why this only happens for special characters. This is a strange issue that I have spun my wheels on a lot, so I wanted to write about it for a larger audience to get some input on the root cause from people who have more experience with Node readable streams. While SmartBear has prepared a PR to address the issue, it calls jsonrepair on the concatenated result rather than addressing the fact that the worker process writes valid JSON to standard output but the resulting readable stream chunks in the master process are broken.
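For readers less familiar with newline-delimited JSON logging, the failure mode can be sketched without Prism or split2 at all. The `LineParser` class below is a hypothetical stand-in for a line splitter with a JSON.parse mapper, not split2's implementation, and it says nothing about where the character actually goes missing. It only shows that rejoining intact chunks is harmless, while a single character lost at a chunk boundary is enough to produce a SyntaxError like the one in the stack trace.

```typescript
// Hypothetical stand-in for a newline splitter with a JSON.parse mapper.
// This is not split2's code, only the same general idea.
class LineParser {
  private pending = "";
  readonly parsed: unknown[] = [];

  // Accept one chunk, parse every complete line, and keep the unfinished tail.
  write(chunk: string): void {
    const lines = (this.pending + chunk).split("\n");
    this.pending = lines.pop() ?? "";
    for (const line of lines) {
      if (line.length > 0) this.parsed.push(JSON.parse(line));
    }
  }
}

const logLine = '{"id":"2cd9545e58","values":["8015751025"]}\n';

// A line split at an arbitrary point still parses once the chunks are rejoined.
const intact = new LineParser();
intact.write(logLine.slice(0, 17));
intact.write(logLine.slice(17));
console.log(intact.parsed.length); // 1

// The same line with the closing quote missing at the chunk boundary,
// matching the chunk contents described above.
const broken = new LineParser();
broken.write('{"id":"2cd9545e58');
let parseError = "";
try {
  broken.write(',"values":["8015751025"]}\n');
} catch (e) {
  parseError = e instanceof SyntaxError ? "SyntaxError" : "other";
}
console.log(parseError); // SyntaxError
```

This is also why repairing the concatenated string treats the symptom: the splitter is doing what it should with the bytes it receives, and the open question is why those bytes differ from what the worker wrote.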

Hex literals as viewed with the Visual Studio Code Hex Editor:

\x7b\x22\x69\x64\x22\x3a\x22\x32\x63\x64\x39\x35\x34\x35\x65\x35\x38
\x2c\x22\x76\x61\x6c\x75\x65\x73\x22\x3a\x5b\x22\x38\x30\x31\x35\x37\x35\x31\x30\x32\x35\x22\x5d\x7d

↩

Hex literal:

\x7b\x22\x69\x64\x22\x3a\x22\x35\x36\x63\x38\x22\x2c\x22\x76\x61\x6c\x75\x65\x73\x22\x3a\x5b\x22\x62\x65\x30\x62\xc3\xbc\x6d\x6d\x6d\x6d\x6d\x6d\x6d\x6d\x6d\x22\x5d\x7d

↩

Hex literals:

\x7b\x22\x69\x64\x22\x3a\x22\x32\x63\x64\x39\x35\x34\x35\x65\x35\x38\x22
\x2c\x22\x76\x61\x6c\x75\x65\x73\x22\x3a\x5b\x22\x38\x30\x31\x35\x37\x35\x31\x30\x32\x35\x22\x5d\x7d

↩

https://iainschmitt.com/post/prism-proxy-bug

First Thoughts on The Little Book of Semaphores and Rust

Mar 17, 2025

In my review of Domain Modeling Made Functional, I made the following comment about Rust:

This has made me realise that I would much rather learn a new language by diving into a problem area that it is well equipped to work in: rather than just learn Rust, I'd rather do a deep dive in concurrency and learn Rust in the process.

This was what I briefly thought about as a principal developer at work talked me out of learning C. He argued that if I wanted to learn C for systems programming experience I'd be better served by Rust, but if C interoperability and experience with manual memory management were a bigger priority then Zig would be the right choice. While I was ultimately convinced, this was disappointing. I haven't finished either Stevens & Rago's Advanced Programming in the UNIX Environment or Operating Systems: Three Easy Pieces by Arpaci-Dusseau & Arpaci-Dusseau, but especially for someone who never took an operating systems class, both works motivate the reader to pick up C programming.1 It's fun to see a function signature and then look in man pages to answer questions not covered in the text. If I remember correctly, I told the engineer I was talking to 'I want to spend more time in man pages and less time building data-transfer-objects in yet another CRUD application'.

The Little Book of Semaphores came from a list of Dan Luu's recommended programming books. 2 This wasn't my first time trying to learn concurrency; I got a decent amount out of the first six chapters of Goetz et al.'s Java Concurrency in Practice, if my O'Reilly online reading history is to be trusted. 3 But the Goetz book can be dry and is naturally focused on Java, while Downey's book is both language-agnostic and available for free. I figured it was as good a place as any to start the ball rolling on learning more about concurrency using Rust.

But Allen Downey's The Little Book of Semaphores is different from most concurrency books in that it spends all of ten pages on the background of semaphores before getting into synchronisation problems. Initially I thought there was a typo in the PDF when there was an entire blank page after the first problem statement; both problems and solutions are written in a Python-like pseudocode. Downey doesn't explain much about the format of the desired solution. My solution to the ballroom dance queue problem in 3.8 initially used a literal queue data structure guarded by a mutex, but figuring this out forced me to more explicitly learn a one-thread-per-actor model for concurrent problems. The first three chapters have been a joy to go through and my 'Learning Rust' repository is where I've put my solutions. The problems are rewarding, and you spend far more time writing solutions than reading text. Given that all of the content is language-agnostic and requires very few special language features, pretty much any language that supports shared-memory concurrency would be appropriate for writing solutions. The Little Book of Semaphores is one of the few technical books I've read that comes across as a better self-teaching aid than a textbook. How vaguely the problems are set up makes me unsure how well it would work to lecture from, but the combination of loosely specified problems and an 'LLM-as-code-reviewer' made it great for my purposes.

As far as the language is concerned, it's a pretty good one. It has a mix of language features that I really like, including Option<T> in lieu of null pointers, ML-style pattern matching, and language-level support for Result<T, E>. The semaphore exercises haven't required any crazy lifetime annotations, and compiler errors strictly about variable RAII patterns are pretty clear. It's something of a miracle to me that the RAII memory management works as well as it does: so far I haven't faced a situation where I was confused about why the compiler thought a piece of dynamically allocated memory wasn't available. It's something of a cliché to say 'the Rust compiler is so nice to work with' but it's a cliché for a reason. But for the times that I didn't understand the compiler message, I tried to explain my problem or misunderstanding as clearly as possible to Claude with the explicit instructions to 'Provide minimal code examples: I want to understand this concept, don't hesitate to ask me questions or probe to build my understanding'. The README of my 'Learning Rust' repository has a log of these questions and answers like the following:

Question
When using handles.iter().for_each(|handle| handle.join().unwrap()); in place of the for loop, the build error rustc: cannot move out of *handle which is behind a shared reference was provided. Why is the iterator different than the for loop? I would have thought the ownership was clear?
Answer
The problem is that handles.iter() provides shared &JoinHandle<()> references to the handles but does not grant ownership of them.
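A minimal sketch of that difference, assuming two hypothetical helpers (spawn_workers and join_all are illustrative names, not code from the repository). A for loop over a Vec calls into_iter() implicitly, so it receives owned handles, while iter() only lends them out; JoinHandle::join takes self by value, so it needs the owned version.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical helper: spawns `n` worker threads that each return their index squared.
fn spawn_workers(n: u32) -> Vec<JoinHandle<u32>> {
    (0..n).map(|i| thread::spawn(move || i * i)).collect()
}

/// Hypothetical helper: joins every handle by value.
/// `into_iter()` yields owned `JoinHandle<u32>` items, which is what
/// `join(self)` requires because it consumes the handle.
fn join_all(handles: Vec<JoinHandle<u32>>) -> Vec<u32> {
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}

// This version is rejected by the compiler: `iter()` yields `&JoinHandle<u32>`,
// and `join` cannot move the handle out from behind a shared reference.
//
// fn join_all_by_ref(handles: Vec<JoinHandle<u32>>) {
//     handles.iter().for_each(|handle| { handle.join().unwrap(); });
// }
```

Calling join_all(spawn_workers(4)) joins the handles in the order they were spawned, so the results come back in index order regardless of which thread finishes first.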

It is easy to see the value of LLMs as search engines that can interpolate between queries, but this is pretty good evidence in favour of Byrne Hobart's argument that 'AI Ruins Education the way Pulleys Ruin Powerlifting'. 4 Being as specific as you possibly can in writing about a topic is a great way to push your understanding; you're better off learning from an engaging professor, but LLMs can sometimes give you something close.

Because the same semaphore needs to be shared across threads I ended up using Rust's atomic reference counting pointer, Arc<T>, in every solution so far. McNamara's Rust in Action describes this smart pointer as "Rust's ambassador. It can share values across threads, guaranteeing that these will not interfere with each other". While the semaphore is acquired and released by different threads, the semaphore state is handled by concurrency primitives within the semaphore struct. I expected a little more of a fight from the Rust compiler, but the same ceremony is required for Semaphore.acquire() and adding an element to a collection contained in a mutex. Speaking of the semaphores themselves, I was a little surprised to learn that Rust doesn't have them in the standard library so I just used Sean Chen's implementation of them. 5
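To make the Arc<T> point concrete, here is a minimal sketch. The Semaphore below is a deliberately small stand-in built on Mutex and Condvar, not the implementation used in my solutions, and wait_for_workers is a hypothetical function written only for this illustration. Because acquire and release take &self and keep their state behind the struct's own primitives, an Arc<Semaphore> is all the threads need to share.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Illustrative counting semaphore; an assumption for this sketch,
/// not the third-party implementation referenced in the post.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore {
            permits: Mutex::new(permits),
            available: Condvar::new(),
        }
    }

    /// Blocks until a permit is available, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiting thread.
    fn release(&self) {
        let mut permits = self.permits.lock().unwrap();
        *permits += 1;
        self.available.notify_one();
    }
}

/// Each worker signals the shared semaphore once; the calling thread
/// acquires once per worker, so it only proceeds after every signal.
/// Returns the number of permits left over at the end.
fn wait_for_workers(workers: usize) -> usize {
    let done = Arc::new(Semaphore::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each thread gets its own reference-counted pointer to the same semaphore
            let done = Arc::clone(&done);
            thread::spawn(move || done.release())
        })
        .collect();

    for _ in 0..workers {
        done.acquire();
    }
    for handle in handles {
        handle.join().unwrap();
    }
    // Bind the value first so the lock guard is dropped before `done` is
    let remaining = *done.permits.lock().unwrap();
    remaining
}
```

The semaphore is released on one thread and acquired on another, yet nothing outside the struct needs a lock of its own.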

The Rust learning curve is made more tolerable because the things that are hard have a good reason for being so. But for one problem, I had a tough enough time figuring out how to modify a collection in a way that Rust's compiler would tolerate that I started writing the solution in F#. This was the ballroom dancer queue matching problem between leaders and followers, where I didn't use a one-thread-per-actor model. Both threads in my solution were started by problem_3_8_thread, with leaders and followers each having a dedicated queue; the rest of the solution is in the F# part of the repository. Using a mutable collection this way isn't very good F# style, and one could argue that explicit ref cells would be less bad than using Queue<T> this way because they would at least make the mutability more explicit.

let problem_3_8_thread
    (internal_sem: Semaphore)
    (external_sem: Semaphore)
    (dancer_list: Queue<String>)
    (label: String)
    =
    Thread(fun () ->
        while true do
            if dancer_list.Count <> 0 then
                Console.WriteLine $"{label} thread waiting"
                toggleSem internal_sem Release
                toggleSem external_sem Wait
                Console.WriteLine $"Dancer: {dancer_list.Dequeue()}")

While it is a much, much better idea to write a solution with an implicit queue formed by a single thread for each dancer, I also knew that the same solution had to be possible in Rust. I eventually wrote the following. One of the issues I faced was that method calls on a mutex-guarded item are handled differently than things like incrementing an integer, but that didn't bother me as much as how mutexes are released in Rust. The Arc<T> usage in dancer_list is to allow sharing across threads, and Mutex<T> allows for mutability; using a LinkedList<String> directly as was done in the F# example wouldn't satisfy Rust's safety guarantees, nor should it. I wanted to be able to add entries to the dancer_list from the main thread after a dancer thread was initialised, so I wasn't surprised by needing Arc<Mutex<T>>. I was surprised that std::sync::Mutex didn't provide a function to unlock a mutex. Rather than unlocking a mutex you're supposed to let its guard be dropped when it falls out of scope, as shown below. This is the first time that I've had to use scopes in this way. I'm sure that there is a good reason that there isn't such a function on Mutex<T>, either because this prevents bugs or because it's better to use RAII rather than work around it, but aesthetically I absolutely despise this. It looks like the parking_lot crate provides an unlockable mutex, but I don't remember getting very far with the crate and decided to stick with the standard library.6

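use std::collections::LinkedList;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
// `Semaphore` is not in the standard library; this assumes the third-party
// implementation mentioned above is in scope.

fn problem_3_8_thread(
    internal_turnstile: Arc<Semaphore>,
    external_turnstile: Arc<Semaphore>,
    dancer_list: Arc<Mutex<LinkedList<String>>>,
    label: String,
) -> JoinHandle<()> {
    thread::spawn(move || {
        loop {
            {
                // The guard is dropped, and the mutex unlocked, at the end of this scope
                let dancer_list_data = dancer_list.lock().unwrap();
                if dancer_list_data.is_empty() {
                    break;
                }
            }
            println!("{label} thread waiting");
            // Signal this queue's turnstile, then wait on the other queue's
            internal_turnstile.release();
            external_turnstile.acquire();

            {
                let mut dancer_list_data = dancer_list.lock().unwrap();
                if let Some(dancer) = dancer_list_data.pop_front() {
                    println!("{dancer} danced");
                }
            }
        }
    })
}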
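For what it's worth, the standard library does offer one way around the extra braces: dropping the guard explicitly. This is a minimal sketch rather than what my solution does, and pop_dancer is a hypothetical helper that isn't in the repository. The drop call is std::mem::drop from the prelude, and releasing the guard is what unlocks the mutex.

```rust
use std::collections::LinkedList;
use std::sync::Mutex;

/// Hypothetical helper: pops the front dancer and reports how many remain,
/// releasing the lock with an explicit `drop` instead of an inner scope.
fn pop_dancer(dancer_list: &Mutex<LinkedList<String>>) -> (Option<String>, usize) {
    let mut guard = dancer_list.lock().unwrap();
    let maybe_dancer = guard.pop_front();
    // Dropping the guard unlocks the mutex here, before the function returns
    drop(guard);

    // The lock can be taken again straight away. If the first guard were still
    // alive, this second `lock()` on the same thread would deadlock or panic.
    let remaining = dancer_list.lock().unwrap().len();
    (maybe_dancer, remaining)
}
```

The behaviour is the same as the scoped version; it is only a question of which reads better, and the compiler will reject any later use of guard after the drop.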

These are relatively small issues in the grand scheme of things, and jumping straight into concurrency with Rust means dealing with the language's most distinctive features right off the bat. The tooling situation is very good, which is what you should expect from a post-2000 language that wasn't built for interop with anything else. While the Rust standard library reference 7 is better than its F# equivalent, I haven't found something like the language reference. 8 Having the Rust book available for free as a GitBook is an acceptable substitute. 9

I really thought about providing Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau's names as 'Professors Arpaci-Dusseau' as is done for the plural 'attorneys general'; this would have been more fun but less clear.↩
Luu, D. 2016. Programming book recommendations and anti-recommendations.↩
Goetz, B. et al. 2006. Java Concurrency in Practice. Addison-Wesley Professional, Upper Saddle River, NJ.↩
Hobart, B. 2024. AI Ruins Education the way Pulleys Ruin Powerlifting.↩
Chen, S. 2020. Implementing Synchronization Primitives in Rust: Semaphores.↩
Docs.rs: parking_lot Crate ↩
Rust Library Reference ↩
F# Language Reference ↩
Klabnik, S. and Nichols, C. 2022. The Rust Programming Language. No Starch Press, San Francisco, CA.↩

https://iainschmitt.com/post/first-thoughts-on-lbs-and-rust

This blog's new syntax highlighting

Mar 2, 2025

This blog is written from scratch with my custom static site generator, and up until a few days ago I was using Prism for client-side syntax highlighting. 1 This added about 34.3 kB of JavaScript to support F#, TypeScript, and JavaScript syntax highlighting. In the grand scheme of things this is a small bundle, but it always bothered me because syntax highlighting was the only part of the website that needed any JavaScript. Client-side syntax highlighting also required a manual change whenever a new language was used in a code segment: to keep the JavaScript bundle as small as possible, the syntax highlighting rules were limited to the languages that I was actively using. That's not something I would want to slow me down as I'm learning and writing more about Kotlin and C this year, and that's not even counting the XML later on in this post. The advantage of server-side syntax highlighting is that you can support a comical number of languages without any impact on bundle size.

I started by trying to port Prism to F# and then run it in the same way that Prism can be used server-side. But Prism is a pretty old piece of JavaScript, first made public in 2012, so it relies on dynamic typing exactly as much as you'd expect. 2 The matchGrammar function below takes a syntax highlighting grammar (a collection of regular expressions to match language features) and applies it to the text to be highlighted. Server-side Prism is called like const html = Prism.highlight('const code = "var data = 1"', Prism.languages.javascript, "javascript"), meaning that the tokenize function is called such that tokenList originally has a single element that matches text. Type modeling of grammar isn't impossible, but it is a pretty permissive type that includes circular references and properties that can be either arrays or strings.3 The whole idea behind adapting Prism was that the small size of the library would make a rewrite relatively short and still allow for using the existing language grammars, but that wasn't happening. For a little while I even tried to run JavaScript from .NET inside of the application; this is supposed to be possible but it's a pretty janky setup and I wasn't able to get it to work. 4

/**
 * @param {string} text
 * @param {LinkedList<string | Token>} tokenList
 * @param {any} grammar
 * @param {LinkedListNode<string | Token>} startNode
 * @param {number} startPos
 * @param {RematchOptions} [rematch]
 * @returns {void}
 * @private
 *
 * @typedef RematchOptions
 * @property {string} cause
 * @property {number} reach
 */
function matchGrammar(text, tokenList, grammar, startNode, startPos, rematch) {
  //...
}

I then had the realisation that I didn't need to run the syntax highlighting at runtime. The way that I am making static pages is by taking a markdown page at application startup and using Markdig to parse it to HTML. 5 On page load, Prism would run in a script tag to apply the highlighting. However, placing the HTML from a syntax highlighter directly into the pages before application startup would allow me to avoid any changes on the .NET side. A Node script running at build time would load the posts, apply syntax highlighting, and then place them into the application's WebRoot static file directory. I needed to place the syntax highlighted files into the correct WebRoot for both debug and release builds and wait for those directories to exist before running the syntax highlighter. This was accomplished with a few MSBuild targets in the portfolio-website.fsproj file as shown below. It took me longer than it should have to get the output directories correct, but I don't really want to invest that much time in understanding MSBuild.

  <PropertyGroup>
    <SyntaxHighlighterOutputDir Condition="'$(Configuration)' == 'Release'">out/WebRoot/markdown</SyntaxHighlighterOutputDir>
    <SyntaxHighlighterOutputDir Condition="'$(Configuration)' != 'Release'">$(MSBuildProjectDirectory)/bin/$(Configuration)/$(TargetFramework)/WebRoot/markdown</SyntaxHighlighterOutputDir>
  </PropertyGroup>

  <Target Name="InstallNodePackages" BeforeTargets="PrepareForBuild">
    <Message Text="[MSBuild] Installing Syntax Highlighter Node Packages" Importance="high" />
    <Exec Command="npm --prefix $(ProjectDir)syntax-highlighting install" />
  </Target>

  <Target Name="EnsureOutputDirectoryExists" BeforeTargets="RunSyntaxHighlighter">
    <MakeDir Directories="$(SyntaxHighlighterOutputDir)" />
    <Message Text="[MSBuild] Ensuring output directory exists: $(SyntaxHighlighterOutputDir)" Importance="high" />
  </Target>

  <Target Name="RunSyntaxHighlighter" DependsOnTargets="InstallNodePackages;EnsureOutputDirectoryExists" BeforeTargets="PrepareForBuild">
    <Message Text="[MSBuild] Running Syntax Highlighter, Migrating Posts" Importance="high" />
    <Exec Command="node $(ProjectDir)syntax-highlighting/index.js posts $(SyntaxHighlighterOutputDir)" />
  </Target>

The new syntax highlighting script isn't what anyone would call elegant. Client-side Prism would find all <code> elements and apply the appropriate highlighting, but here I have to use some regular expressions to find the code blocks and language definitions. It's pretty fragile; I faced some initial issues with the triple backtick counting in the code block below. The string replace() to modify the style should absolutely be replaced with proper HTML parsing, but that's a problem for later on. While I was going to all of this effort, I decided to switch the syntax highlighting library to Shiki, which uses inline HTML styling rather than a stylesheet, cutting out a stylesheet that I had to send. 6

// Slice from the opening backticks up to, but not including, the closing backticks
const sliceIncludingLang = currentFileContents.slice(backticks[index], backticks[index + 1]);
const firstNewlineIndex = sliceIncludingLang.indexOf("\n");

if (firstNewlineIndex === -1) throw Error("[Syntax Highlighting]: Newline not found when expected");

// The language name sits between the opening backticks and the first newline
const language = sliceIncludingLang.match(/\`\`\`(\w+)\n/)[1];
// Everything after the first newline is the code to highlight
const sliceWithoutLang = sliceIncludingLang.slice(firstNewlineIndex + 1);

const defaultSyntaxHighlighting = await codeToHtml(sliceWithoutLang, { lang: language, theme: "catppuccin-mocha" });
// Swap the theme's default inline style for the blog's own background and padding
const adjustedSyntaxHighlighting = defaultSyntaxHighlighting.replace(
  "background-color:#1e1e2e;color:#cdd6f4",
  "background-color:#181825;color:#cdd6f4;padding:1em;border-radius:0.3em;overflow:auto",
);
// The text to be replaced includes the three closing backticks
const sliceIncludingLangAndClosingBackticks = currentFileContents.slice(backticks[index], backticks[index + 1] + 3);
replacementPair.push([sliceIncludingLangAndClosingBackticks, adjustedSyntaxHighlighting]);

One thing I noticed when working on the syntax highlighting was that my means to bypass connecting to Apache ZooKeeper during development wasn't working. My production containers connect to Apache ZooKeeper as part of service discovery for my reverse proxy, but I don't want to handle this during development. 7


module AppZooKeeper
//...
let configureZookeeper (zkConnectString: string) (hostAddress: string) (hostPort: string) =
    task {
        match zkConnectString with
        | "-1" -> ()
        | _ ->
            let zooKeeper = getZooKeeper zkConnectString
            let! targetListStat = zooKeeper.existsAsync TARGETS_ZNODE_PATH
            let currentTargetZnodePath = getCurrentTargetZnodePath hostAddress hostPort

            if (isNull targetListStat) then zooKeeper.createAsync (TARGETS_ZNODE_PATH, null, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
                    |> ignore
            //...

However, whenever I use "-1" as a ZooKeeper connect string during development, the following logs are created in the directory where I run the application.

[2025-03-02 03:47:14.695 GMT 	ERROR 	DynamicHostProvider 	Failed resolving Host=-1]
Exc level 0: System.Net.Sockets.SocketException: nodename nor servname provided, or not known
   at System.Net.Dns.GetHostEntryOrAddressesCore(String hostName, Boolean justAddresses, AddressFamily addressFamily, Nullable`1 startingTimestamp)
   at System.Net.Dns.<>c.<GetHostEntryOrAddressesCoreAsync>b__33_0(Object s, Int64 startingTimestamp)
   at System.Net.Dns.<>c__DisplayClass39_0`1.<RunAsync>b__0(Task <p0>, Object <p1>)
   at System.Threading.Tasks.ContinuationResultTaskFromTask`1.InnerInvoke()
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)

Prism NPM Package ↩
Lea Verou's "Introducing Prism"↩
I didn't even realise that a @types/prismjs package existed until writing this post, which certainly would have helped↩
"Running JavaScript inside a .NET app with JavaScriptEngineSwitcher "↩
Markdig NuGet Package ↩
Shiki NPM Package ↩
Prior post "My needlessly complicated ZooKeeper-enabled reverse proxy"↩

https://iainschmitt.com/post/new-syntax-highlighting

LLM Chat Interfaces Will Change

Jan 23, 2025

Grant Sanderson is the mind behind 3Blue1Brown, a math YouTube channel featuring exceptional visualisation and animation created with his own animation library, Manim. 1 In October 2024 he published a walkthrough video of animating a Lorenz attractor in Manim. 2 At 13:25 Sanderson's shared screen shows his desktop ChatGPT app, and this image has stuck with me for months:

This is several weeks of Sanderson's ChatGPT queries. Like those of most people routinely using large language models (LLMs), Sanderson's queries fall on a spectrum between search-engine-style lookups ('Translate Sentences') and queries that require extensive back-and-forth ('Adjust LSP-pyslp').3 This makes designing LLM chat interfaces somewhat confusing. Are you designing something more like a traditional search engine, or something more like an instant messenger application? A lot of queries are one-and-done, but sometimes a one-off question requires a longer, interactive follow-up. This has UX implications: if all of a user's queries to an LLM chat interface are one-off questions, then there's no reason to store the ~25 most recent questions in the sidebar, for the same reason that existing search engines don't do this.

This confusion extends to using the LLM chat interfaces. My first two ChatGPT conversations were about software engineering and economics, which I gave the labels 'Technology' and 'Economics'. Almost all of my ChatGPT conversations are in one of these two buckets, and I would make similar categories like 'Social' for questions outside my first two conversations. But my 'Technology' chat got so long that both loading the page and submitting queries were taking an increasing amount of time. The page load is a fixable problem: the interface could fetch only the most recent messages and load older ones dynamically if I scrolled upwards. But given that most LLM chat interfaces resend the entire conversation history with each query to keep context within a conversation, it's not hard to see why the queries themselves were taking longer too. This prompted me to make a 'Technology 2' chat. Claude gives you a shorter leash, and if a single conversation gets too long it will show a 'Tip: Long chats cause you to reach your usage limits faster' above the text input.

While ChatGPT's meteoric rise into what Ben Thompson has described as an 'accidental consumer tech company' can make us forget this, there are a lot of internet users who haven't given LLM chat interfaces a try.4 These users seem to get what they need out of search engines and go on with their lives. They think about their computing devices about as much as the average person thinks about electricity transmission: not at all, except for the rare cases when things aren't working. But what helped get Google from commonly known to ubiquitously used was an obsession with removing as much friction as possible from a process that was already pretty frictionless: type what you want to find in this box, and we'll show you a list of 10 blue links. In contrast, LLM chat interface developers for now have to pick a point on the continuum between search queries and full conversations, and that introduces friction for a user who isn't sure whether to open a new chat, work with an existing chat, or just go back to Google where they know what to expect. If you work with LLM chats every day this might seem somewhat ridiculous to suggest, but if OpenAI and Anthropic are going to become consumer tech giants they will need to craft user interfaces designed for the next 500 million customers, consumers who are far less enthusiastic about tracking frontier model development.

OpenAI, Anthropic, Google DeepMind, and their competitors are incredibly capable organisations full of talented engineers. I'd hazard a bet that their UX designers and frontend engineers have been thinking about this for longer than these chat interfaces have been widely available to the public. It's not an impossible task to identify and sort one-shot messages and longer conversations, and the interface that figures this out will have a better chance of attracting and retaining new users.

3Blue1Brown. 2025. Manim. [v1.7.2]. Software repository. GitHub.↩
Sanderson, G. 2024. How I animate 3Blue1Brown | A Manim demo with Ben Sparks. 3Blue1Brown YouTube Channel.↩
Regarding the "Learning Meme Ideas" chat: I'm dying to know a) what prompted this b) what is in that conversation and c) what Sanderson did with that information↩
Thompson, B. 2023. The Accidental Consumer Tech Company; ChatGPT, Meta, and Product-Market Fit; Aggregation and APIs. Stratechery (March 28, 2023).↩

https://iainschmitt.com/post/llm-chat-interfaces-will-change

The Strangest Carrier: Southern Linc

Jan 5, 2025

Any firm's contingency plans or redundant infrastructure say a lot about the business that they are in and the tradeoffs they face. As an example, if your company's HR software vendor was inaccessible for a day because of an AWS availability zone failure, you probably have bigger problems that day. Besides the fact that AWS customers are far likelier to bring an outage on themselves, the impact to a company of HR software being inaccessible for a work day is not an "everything grinds to a halt" situation. Some offer letters will be sent out late and any pending changes in benefits won't be acted on, but it isn't the end of the world. Lattice is the first company that came to mind here, and if Lattice engineering leadership is trying to prioritise between a) lift-and-shift capabilities to another public cloud vendor and b) new features that will differentiate the product in the marketplace, it's pretty clear that they should go for the latter.

My former employer, RELEX Solutions, is in a different situation. Their Forecasting & Replenishment solution is a retail supply chain SaaS offering that takes in retailer data about products, locations, sales, and other similar information. This data is used to calculate order proposals to make sure that retailers have enough items to satisfy demand but not so many as to introduce spoilage or other costs that come with surplus inventory. This means that if RELEX is down, orders aren't going out. In contrast to Lattice, RELEX Forecasting & Replenishment is a load-bearing part of its customers' businesses, and this really shaped what Site Reliability Engineering work was like there. Even if a primary RELEX data centre is unavailable, RELEX customers are still relying on the service to route trucks. Because of this, we had the ability to fail over to warm reserve data centres. I quite enjoyed this level of criticality: our services were not matters of life and death, but customers really did rely on RELEX to be able to run their business. I had a certain amount of pride in this.

While warm reserve data centres are a pretty significant contingency plan, firms responsible for critical services in the physical world have a much higher bar to clear. Rather than keeping a handful of data centres available, they generally have more ground to defend and higher stakes when things go south. Any major airport employs a small army of air traffic controllers, security personnel, and firefighters to respond to any number of aviation disasters. Hospitals plan and train for any number of mass casualty events from natural disasters to dangerous epidemics. What I hadn't thought of until very recently was 'do utility companies have a backup plan if wireless networks are unavailable?', and this takes us to one of the more impressive company contingency plans that I've heard of.

Southern Company is a gas and utility company best known for its subsidiaries Georgia Power, Alabama Power, and Mississippi Power, serving over 8 million customers.1 As the regulated electricity monopoly in its service areas, it is responsible for transmission from the power plant to the service drop into homes and businesses. If Southern Company were completely reliant on the big three wireless carriers they would be taking a great risk during hurricane season: communicating with crews restoring power to customers would be hampered by interruptions in someone else's network. This is apparently too great a risk for Southern Company, because they operate an independent LTE network covering over 122,000 square miles, a network which almost completely overlaps with their utility service area.

One of the main tenets of telecommunications unit economics is that building networks has tremendous fixed costs, but the marginal cost of a new customer is minuscule. Southern Company built their LTE network at great expense, necessitating winning spectrum auctions, constructing towers, and setting up a backhaul network connecting cell sites to the network core. Once all was said and done, they were left with a network that was designed for the needs of one company's critical infrastructure but almost certainly has excess capacity to spare. Unsurprisingly from this framing, Southern Company has done exactly what would make economic sense and sells this excess wireless capacity. Southern Linc is a wireless carrier that is much less famous than its corporate siblings underneath the Southern Company umbrella. Even if you live in Southern Linc's service area you've likely never heard of this carrier, given that its other offerings include "mission-critical push-to-talk" and fleet truck tracking: not exactly a concern for consumers or even most businesses. 2 The "Southern Linc at a Glance" page reads: "for more than 25 years, Southern Linc has been the wireless network provider built from the ground up for utilities, government and business".3

While there is nothing I'd rather do than compare Southern Linc's subscriber growth and financial performance against the big three, heartbreakingly the Southern Company quarterly reports do not break down this information. As a disclaimer I am a CPA in exactly zero jurisdictions, but it looks like this is spelled out in the Financial Accounting Standards Board rules, which require any company segment responsible for more than 10% of parent revenue, P&L, or assets to be reported separately.4 Unsurprisingly, Southern Linc does not meet this 10% criterion. What we can see is that Southern Linc falls into the "Other Revenues" bucket, which brought in $283 million in Q3 of 2024 for $820 million year-to-date. As a small consolation prize, in that same 10Q the company reported a $50 million contribution to year-to-date revenue by what it described as "unregulated sales at Georgia Power". I'm sure "unregulated sales" are one of the interesting asterisks in the regulated utility monopoly model in most of the United States. 5

Given that Southern Linc is comparatively more obscure than the big three, the information available about it has more of a PR flavour than would be ideal, but much of it is still useful. According to a 2022 interview with telecommunications trade press Urgent Communications, the then Southern Linc CEO Tami Barron explained that the carrier had redundant network cores in Atlanta and Birmingham with two generators and 14 days of fuel on-site. As far as cell sites are concerned, “We have cell-site redundancy—from a last-mile [standpoint]—at probably 80% of our cell sites. Interestingly, at every cell site, clearly we have battery backup to about eight hours. But we also have on-site fuel-cell [sic] or [power] generation to the tune of five days.” Barron also explained that much of Southern Linc's backhaul is on fibre owned by Southern Company. I'm less surprised to hear about a utility company owning its own fibre than its own LTE spectrum, but in what seems to be a Southern Company tradition they resell their excess fibre capacity through their Southern Telecom subsidiary. 6

Southern Company isn't the only electrical utility company operating a private LTE network; Xcel Energy operates such a network across 8 states by leasing spectrum owned by Anterix. 7 But Southern Company is the only utility that I am aware of which operates a wireless carrier selling to external customers on its own network. The utilities industry is unusual in how critical it is, the technical capabilities it requires, and the geographic range that it covers. Few other industries would justify such an investment, but the fact that Southern Linc is so unusual likely implies that there isn't an overwhelming case for such expansive investments in wireless communications for most utilities. If I were to speculate, the carrier's existence can be credited to historical path dependence and factors unique to the deep south.

Southern Company 2023 Annual Report ↩
Southern Linc Services ↩
Southern Linc at a Glance ↩
Deloitte Accounting Research Tool 'On the Radar: Segment Reporting'↩
Southern Company 2024 Q3 10Q ↩
Southern Telecom Landing Page Note: 'Our regional focus on the South-eastern U.S. and commitment to connecting "non-NFL" cities enables you to be everywhere your customers want to be.' this distinction between "NFL" and "non-NFL" cities is a pretty useful one, but it is one that I've never seen anyone other than me use, so it's very funny to see this on marketing copy for a fibre optic network company↩
Anterix 2024 Q3 10Q Note: the same accounting rules that don't require Southern Company to break out Southern Linc financials dictate that Xcel Energy is a large enough customer to Anterix that the Xcel revenues have to be explicitly broken out↩

https://iainschmitt.com/post/the-strangest-carrier-southern-linc

Understanding monads?

Dec 26, 2024

Understanding monads?

The Node reverse proxy explained in my last post sprinkled in a little bit of functional TypeScript through Ramda. 1 This naturally gave me an excuse to talk about the project at the functional programming group at work, but some of the feedback that I got was that Ramda takes advantage of the permissive nature of JavaScript types, so fp-ts is a better functional library for TypeScript.2 Professor Frisby's Mostly Adequate Guide to Functional Programming is linked by the fp-ts docs in lieu of a more comprehensive tutorial for the library, and it is a wonderful resource for functional programming in JavaScript. 3 Chapters 8 and 9 explain functors and monads respectively, and after reading them I told a developer friend of mine 'I recently made a breakthrough in understanding monads' before sending him some notes explaining why. Said notes are available here, but the next morning I stared at the relevant type signatures long enough to understand that nearly everything about my explanation was wrong. This post is an effort to get it right.

Some say that once you understand monads you lose the ability to explain them. 4 This can't bode well for this explanation, but let's give it a shot. A better question than 'what are monads and why would you use them' is 'what are the practical differences between monads and functors'. Functors by themselves are pretty powerful: the Mostly Adequate guide explains them well, the gist being 'A Functor is a type that implements map and obeys some laws', with said laws shown below:

// identity
map(id) === id;

// composition
compose(map(f), map(g)) === map(compose(f, g));

const compLaw1 = compose(map(append(" sum")), map(append(" romanus")));
const compLaw2 = map(compose(append(" sum"), append(" romanus")));
compLaw1(Container.of("civis")); // Container("civis romanus sum")
compLaw2(Container.of("civis")); // Container("civis romanus sum")
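Both laws can be checked by hand with a small stand-in for the guide's helpers. This is a minimal sketch rather than the guide's own code: the Container class, identity, compose2, and appendText below are simplified assumptions written for this illustration.

```typescript
// Minimal stand-in for the guide's Container; an assumption, not the guide's source
class Container<A> {
  readonly value: A;
  private constructor(value: A) {
    this.value = value;
  }
  static of<A>(value: A): Container<A> {
    return new Container(value);
  }
  map<B>(f: (a: A) => B): Container<B> {
    return Container.of(f(this.value));
  }
}

const identity = <A>(a: A): A => a;
// Right-to-left composition of two unary functions
const compose2 = <A, B, C>(f: (b: B) => C, g: (a: A) => B) => (a: A): C => f(g(a));
// Appends a suffix to the end of a string
const appendText = (suffix: string) => (text: string): string => text + suffix;

const civis = Container.of("civis");

// Identity law: mapping the identity function changes nothing
const identityLawHolds = civis.map(identity).value === civis.value;

// Composition law: mapping twice equals mapping the composed function once
const mappedTwice = civis.map(appendText(" romanus")).map(appendText(" sum"));
const mappedOnce = civis.map(compose2(appendText(" sum"), appendText(" romanus")));
const compositionLawHolds = mappedTwice.value === mappedOnce.value;
```

Because compose2 applies its right-hand argument first, the suffix that should land first in the string has to be the rightmost argument.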

The rest of this post will assume familiarity with the Either, Option, and IO types, the last of which is implemented in fp-ts/IO. Note that all three of these types are both monads and functors, as all monads are functors but not vice versa. The characteristic function of functors is map, which calls a function on the value contained within the functor. The IO type specific map function has been imported as mapIO below:

import { pipe } from "fp-ts/function";
import { IO, chain as chainIO, map as mapIO, of as ofIO } from "fp-ts/IO";

const effect: IO<string> = ofIO("myString");

const singleMap: (fa: IO<string>) => IO<string> = mapIO((input: string) => input.toUpperCase());
const singleMapFunction: IO<string> = singleMap(effect);
const singleMapApplied: string = singleMapFunction(); // = "MYSTRING"

In the segment above, the argument to mapIO is a function that accepts a string and returns another string. The return value is itself a function, and it accepts an IO<string> and returns another IO<string>. This can be useful to sequentially apply multiple successive functions on the functor's value. While this can look messy in languages without a pipeline operator, the fp-ts pipe function can clean this up, making doubleMapMessy and doubleMapPipe equivalent in the following.

// The innermost call runs first, so toUpperCase is applied before repeat, matching the pipe below
const doubleMapMessy: IO<string> = mapIO((input: string) => input.repeat(2))(
  mapIO((input: string) => input.toUpperCase())(effect),
);

const doubleMapPipe: IO<string> = pipe(
  effect,
  mapIO((input: string) => input.toUpperCase()),
  mapIO((input: string) => input.repeat(2)),
);
const doubleMapPipeApplied: string = doubleMapPipe(); // = "MYSTRINGMYSTRING"

While all the functions provided to mapIO thus far have had the type string => string, note the type signature of mapIO:

export declare const map: <A, B>(f: (a: A) => B) => (fa: IO<A>) => IO<B>;

This means that we can use map with string => void functions, which would be appropriate for logging to standard output. To make things clearer, I've added some type information in the segment below.

const mapWithSideEffect: IO<void> = pipe(
  effect,
  // IO<string>
  mapIO((input: string) => input.toUpperCase()),
  //(f: (a: A) => B) => (fa: IO<A>) => IO<B>; A: string, B: string
  mapIO((input: string) => input.repeat(2)),
  //(f: (a: A) => B) => (fa: IO<A>) => IO<B>; A: string, B: string
  mapIO((input: string) => console.log(input)),
  //(f: (a: A) => B) => (fa: IO<A>) => IO<B>; A: string, B: void
);

//Logs "MYSTRINGMYSTRING"
mapWithSideEffect();

Given all of this, why bother with monads? While we've only seen examples from the IO monad, everything shown here would allow you to convert Option<string> to Option<number> while deserialising a property that may not exist, or from Result<UserValidationError, User> to Result<BalanceValidationError, Balance>.

A problem arises when we need to combine different IO operations. Let's say we have a logFilePath in myConfig.json. It is entirely reasonable for an application to read the log file path from the configuration file before writing to the log, but map functions aren't meant to handle this. Strangely, the segment below will compile even though the type hint for pureMapWriteToConfig is incorrect: it should be IO<() => void> instead as shown by the related comment. Because of this, mapWriteToConfig() doesn't actually log anything, which shouldn't have surprised us in the first place.

import { pipe } from "fp-ts/function";
import { IO, chain as chainIO, map as mapIO, of as ofIO } from "fp-ts/IO";

import { readFileSync, writeFileSync } from "node:fs";
interface Config {
  logFilePath: string;
}

const getFileJson =
  (fileName: string): IO<Config> =>
  () =>
    JSON.parse(readFileSync(fileName, "utf-8")) as Config;

const pureMapWriteToConfig = (configFileName: string, log: string): IO<void> =>
  pipe(
    getFileJson(configFileName),
    mapIO((config: Config) => {
      console.log(`Map config: ${JSON.stringify(config)}`);
      return config;
    }),
    mapIO((config: Config) => () => writeFileSync(config.logFilePath, log)),
    // mapIO<Config, () => void>(f: (a: Config) => () => void): (fa: IO<Config>) => IO<() => void>
  );

const mapWriteToConfig = pureMapWriteToConfig("mapConfig.json", "myMapLog");
mapWriteToConfig();

/*
$ wc -l mapLog.txt
  0 mapLog.txt
*/
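One likely explanation for why this compiles is that TypeScript lets a function type with a non-void return type be assigned wherever a void-returning function type is expected, so an IO<() => void> is accepted where IO<void> is annotated. The sketch below reproduces this without fp-ts; the local IO alias and the counter standing in for writeFileSync are assumptions made for illustration.

```typescript
// Local stand-in with the same shape as the fp-ts IO type: a zero-argument function
type IO<A> = () => A;

let writeCount = 0;
// Hypothetical stand-in for the writeFileSync call
const fakeWrite: IO<void> = () => {
  writeCount += 1;
};

// What mapping a Config => IO<void> function over an IO produces: an IO of an IO
const nestedEffect: IO<IO<void>> = () => fakeWrite;

// Accepted by the compiler: a void return type in the target accepts any return type in the source
const mislabelledEffect: IO<void> = nestedEffect;

mislabelledEffect(); // returns fakeWrite without calling it
const writesAfterOuterCall = writeCount;

nestedEffect()(); // running the inner IO is what performs the write
const writesAfterInnerCall = writeCount;
```

Calling the outer function only hands back the inner one, which matches the empty log file above.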

Once we have the config file represented as an IO<Config> instance, we need to read it and do an IO operation for the log file. That means we need some function that can accept a (config: Config) => IO<void> as well as the incoming IO<Config>. This is known by a few different names, including bind and flatMap, but fp-ts calls this chain. This is the characteristic function that separates monads from non-monad functors:

export declare const chain: <A, B>(f: (a: A) => IO<B>) => (ma: IO<A>) => IO<B>;

The chain function is imported as chainIO, so chainWriteToConfig logs successfully when called.

const pureChainWriteToConfig = (configFileName: string, log: string): IO<void> =>
  pipe(
    getFileJson(configFileName),
    mapIO((config: Config) => {
      console.log(`Chain config: ${JSON.stringify(config)}`);
      return config;
    }),
    chainIO((config: Config) => () => writeFileSync(config.logFilePath, log)),
    //chainIO<Config, void>(f: (a: Config) => IO<void>): (ma: IO<Config>) => IO<void>
  );

const chainWriteToConfig = pureChainWriteToConfig("chainConfig.json", "myChainLog");
chainWriteToConfig();

/*
$ cat chainLog.txt
myChainLog
*/

Chapter 9 of the Mostly Adequate guide explains monads somewhat differently. They introduce join as the flattening of Monad<Monad<T>> into Monad<T>, making chain equivalent to a map followed by a join. The examples they provided use function composition rather than pipes, but this doesn't change much.
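The map-then-join reading can be sketched for IO in a few lines. This is an illustration under assumptions, not the fp-ts or Mostly Adequate implementation: the IO alias, mapIO, joinIO, and chainIO below are local definitions, and readPath and writeLog are hypothetical stand-ins for the file operations.

```typescript
// Local stand-in with the same shape as the fp-ts IO type
type IO<A> = () => A;

const mapIO = <A, B>(f: (a: A) => B) => (fa: IO<A>): IO<B> => () => f(fa());
// join flattens one level of nesting by running the outer IO, then the inner one
const joinIO = <A>(mma: IO<IO<A>>): IO<A> => () => mma()();
// chain expressed as a map followed by a join
const chainIO = <A, B>(f: (a: A) => IO<B>) => (ma: IO<A>): IO<B> => joinIO(mapIO(f)(ma));

const written: string[] = [];
// Hypothetical stand-ins for reading the config and writing the log
const readPath: IO<string> = () => "chainLog.txt";
const writeLog = (path: string): IO<void> => () => {
  written.push(path);
};

const program: IO<void> = chainIO(writeLog)(readPath);
const beforeRun = written.length; // nothing has run yet, the IO is only a description
program(); // runs readPath, then the IO returned by writeLog
```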

A few takeaways

The pureMapWriteToConfig type inference failure is unsettling and has made me lose some confidence in how the TypeScript compiler handles fp-ts.
It is now more obvious to me why Haskell's higher-kinded types matter. While one has to import type-specific map and chain functions in fp-ts, it is my understanding that this isn't necessary in Haskell, allowing for the <$> and >>= operators to map and bind respectively.
It is now more obvious to me why you can get so far without understanding monads in impure functional programming languages. I'd argue that functors are more generally applicable than monads, but operations with multiple IO calls are more common than operations with multiple Either or Option instances, so in pure languages you're forced to reckon with this sooner.
The only references to functional purity in this post were the function names pureMapWriteToConfig and pureChainWriteToConfig. While many discussions of monads will reference carrying out IO or mutable state without breaking functional purity, this isn't all that helpful in explaining what monads do given a) functors enable similar behaviour and b) chain/bind can be useful in situations where no side effects are carried out.

Ramda NPM Package ↩
fp-ts NPM Package. Note: fp-ts will appear in code segments in this article, consistent with the package documentation↩
Mostly Adequate Guide to Functional Programming. Note: it appears that Brian Lonsdorf started the project, but there are many other contributors to the current repository↩
Unfortunately I have forgotten where I first read this; it isn't my original quip↩

https://iainschmitt.com/post/understanding-monads

My needlessly complicated ZooKeeper-enabled reverse proxy

Dec 18, 2024

My needlessly complicated ZooKeeper-enabled reverse proxy

Self-Hosting and Apache ZooKeeper Background

After moving my personal website to my home DMZ network over a public cloud reverse proxy, I have increased my self-hosted footprint. Given how well cheap VPSs had performed for me in the past, I knew that some spare small-form-factor Lenovo desktops with more than one core and literally an order of magnitude more memory would do just fine for the same applications I had in the public cloud. Increasing this hardware footprint meant that I had an excuse to revisit an old friend that was a big part of my past role at RELEX: Apache ZooKeeper.

ZooKeeper is a key-value store for managing the state of distributed applications. The project started at Yahoo!, where many engineering teams working on distributed applications ended up duplicating effort in solving the same problems while introducing the same failure modes to their applications. What became ZooKeeper was a common solution to allow engineers to focus more on business logic and less on reading distributed systems academic research. The service is generally run as an ensemble of 2N - 1 servers where N ≥ 2, and two motivations for this configuration are fault tolerance and master election. Only a single master node in the ZooKeeper ensemble is capable of writing to the data store, and only once a write is recognised by a majority of the cluster members including the master is the write committed. This allows the data store to remain operational provided that a majority of the cluster is available. On ensemble startup, or if the master ZooKeeper server becomes unavailable, the ZooKeeper cluster elects a new master. An odd number of nodes prevents a 50/50 deadlocked master election in these cases. 1
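The majority requirement is easy to put into numbers. The helper names below are made up for this sketch and are not part of any ZooKeeper client:

```typescript
// Smallest number of servers that constitutes a majority of the ensemble
const quorumSize = (ensembleSize: number): number => Math.floor(ensembleSize / 2) + 1;
// Servers that can be lost while a majority is still available
const tolerableFailures = (ensembleSize: number): number => ensembleSize - quorumSize(ensembleSize);

for (const size of [3, 4, 5]) {
  console.log(`${size} servers: quorum of ${quorumSize(size)}, tolerates ${tolerableFailures(size)} failure(s)`);
}
```

A four server ensemble tolerates one failure, the same as a three server ensemble, so the extra even-numbered server adds no fault tolerance.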

Reverse Proxy Server

While ZooKeeper is normally used for more interesting things, I decided to use it for service discovery and load balancing over the replicas serving the test.iainschmitt.com static website. On startup, every replica writes to a /targets/$hostName znode, a znode being a node in the ZooKeeper data store. ZooKeeper supports both nodes that will persist until explicitly deleted as well as ephemeral ones that will be deleted once the client that created them in the first place is disconnected from the ZooKeeper ensemble. By using ephemeral lifetimes for replica znodes, unreachable target replicas would be removed from consideration by the reverse proxy.

When an uncached request for a particular URL reaches the reverse proxy, it lists the children of /targets to determine potential reverse proxy targets. The value stored in the /targets/$hostName znode is the count of cumulative requests to that target, so the target with the fewest requests is selected and its count incremented if the connection succeeds. If the first attempted target fails to respond, the next least commonly used is attempted. The request cache is cleared whenever a new replica comes online, which would most commonly happen during an application update.
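The selection rule can be sketched without ZooKeeper at all. This is an in-memory illustration and not the proxy's actual code: the Target shape, byLeastUsed, pickTarget, and the tryTarget callback are all hypothetical, and in the real proxy the counts come from the /targets znodes.

```typescript
interface Target {
  hostName: string;
  requestCount: number;
}

// Targets ordered from least to most used, so that failover can walk the list in order
const byLeastUsed = (targets: Target[]): Target[] =>
  [...targets].sort((a, b) => a.requestCount - b.requestCount);

// tryTarget is a hypothetical stand-in for the proxied request; it returns true on success
const pickTarget = (targets: Target[], tryTarget: (t: Target) => boolean): Target | undefined => {
  for (const candidate of byLeastUsed(targets)) {
    if (tryTarget(candidate)) {
      // Only a successful connection increments the cumulative count
      return { ...candidate, requestCount: candidate.requestCount + 1 };
    }
  }
  return undefined; // every target failed to respond
};

const replicas: Target[] = [
  { hostName: "replica-a", requestCount: 12 },
  { hostName: "replica-b", requestCount: 7 },
];
// replica-b has fewer requests but is simulated as unreachable, so replica-a is chosen
const chosen = pickTarget(replicas, (t) => t.hostName !== "replica-b");
```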

By setting things up this way there weren't many changes that needed to happen with the portfolio website itself; almost all the new code was for the reverse proxy, which I wrote using Node. There would have been better choices as far as ZooKeeper support goes; while the zookeeper NPM package is actively maintained it falls back on more Promise<any[]> type definitions than preferable, but that may have to do with the native C libraries that the client is built with.2 I can't say I've done a comprehensive side-by-side comparison, but the official ZooKeeper client written by the project team looks to be a lot more complete.

Having an 'outer' request made to the reverse proxy as well as an 'inner' request made by the reverse proxy to the target is something that I didn't do correctly at first as shown by this toy example:

import { createServer, IncomingMessage, ServerResponse } from "node:http";
import http from "node:http";

createServer((req: IncomingMessage, res: ServerResponse) => {
  const options = {
    hostname: "127.0.0.1",
    port: 4000,
    method: "GET",
    path: req.url,
  };

  const proxyReq = http.request(options, (proxyRes) => {
    proxyRes.on("data", (chunk) => {
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end(chunk);
    });

    proxyRes.on("end", () => {
      proxyReq.end();
      res.end();
    });
  });

  proxyReq.on("error", (e) => {
    res.writeHead(500);
    res.end(e);
  });
}).listen(5001);

When the previous server was run, the inner proxyRes handler for 'data' events was never called so it didn't function as an actual reverse proxy. I must have skipped this line in the Node docs the first time: 3

In the example req.end() was called. With http.request() one must always call req.end() to signify the end of the request - even if there is no data being written to the request body.

After calling end, a readable stream of the response from the target server must be handled using event listeners. This readable stream can also emit multiple 'data' events, so a working reverse proxy looks something like the following:

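import http, { createServer, IncomingMessage, ServerResponse } from "node:http";

createServer((outerReq: IncomingMessage, outerRes: ServerResponse) => {
  const proxyReq = http.request({
    hostname: "localhost",
    port: 4000,
    method: "GET",
    path: outerReq.url,
  });

  // Nothing is sent to the target until end() is called
  proxyReq.end();
  proxyReq.on("response", (proxyRes) => {
    // Relay the target's status and headers rather than the incoming request headers
    outerRes.writeHead(proxyRes.statusCode || 200, proxyRes.headers);
    const chunks: Buffer[] = [];

    // A single response can arrive as several 'data' events
    proxyRes.on("data", (chunk: Buffer) => {
      chunks.push(chunk);
    });

    proxyRes.on("end", () => {
      outerRes.end(Buffer.concat(chunks));
    });
  });

  proxyReq.on("error", (e) => {
    console.error(`Proxy request failed: ${e.message}`);
    // Tell the outer client that the inner request failed
    if (!outerRes.headersSent) outerRes.writeHead(502);
    outerRes.end();
  });
}).listen(5000);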

I hadn't placed much focus into the Node networking APIs before, and until reading the Node chapter of JavaScript: The Definitive Guide I didn't have that great of a handle on it. 4 This was a book that I already had a great deal of respect for given its comprehensive detail, so I wasn't surprised that the Node chapter also was quite well written. The way that readable and writable streams work is relatively intuitive, but the Node documentation would be better if it included TypeScript types and was clearer about which events can be emitted by which readable streams. It seems strange that strings are used to represent arbitrary events. While there are readable stream method type signatures like on(event: "data", listener: (chunk: any) => void): this there is also a permissive on(event: string | symbol, listener: (...args: any[]) => void): this to support custom EventEmitter instances. 5

Event emitting is a pretty unique flow control primitive which I haven't seen an equivalent of in other languages. Because of this, one early mistake that I made with them was trying to catch errors by wrapping the entire createServer in a try/catch, but this of course does nothing to handle 'error' events. A more embarrassing moment was when my reverse proxy was failing a local load test, at which point I captured a flame graph that pointed out something that I should have seen from reading my ZooKeeper enabled reverse proxy more carefully: I was reading from ZooKeeper regardless of whether I had a response cached. After fixing this, the load test passed. This reminds me of what it is like to overuse interactive debuggers: oftentimes they'll tell you exactly what you would have figured out if you simply read the code more methodically; a lot of debugging ultimately boils down to reading comprehension.
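The try/catch mistake can be reproduced in isolation with a plain EventEmitter. This is a simplified sketch, with a bare emitter named upstream standing in for the request object rather than the proxy's real code:

```typescript
import { EventEmitter } from "node:events";

const upstream = new EventEmitter();
const outcomes: string[] = [];

try {
  // Registering a listener does not throw, so the catch block never runs
  upstream.on("error", (err: Error) => {
    outcomes.push(`listener: ${err.message}`);
  });
} catch {
  outcomes.push("catch block");
}

// Emitted later, after the try block has already exited
upstream.emit("error", new Error("connection refused"));
console.log(outcomes);
```

The error only ever reaches the registered listener; the try/catch has long since finished by the time the event fires.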

Shortcomings and Future Development

This ZooKeeper enabled reverse proxy currently serves test.iainschmitt.com across two replicas. This is naturally ridiculous overkill and there are dozens of out-of-the-box solutions to do this better. But given that this isn't for production use, that attitude is no fun. With that said, one less obvious way that this is an absurd solution is that ZooKeeper was designed for distributed system workloads with more reads than writes, but right now the /targets znode is queried before updating the cumulative request count of the chosen target server, making for a 1:1 ratio between reads and writes. 6 Right now I'm operating a single reverse proxy server, a single ZooKeeper in the ensemble, and both target server replicas on the same physical host, but that's just a little bit of system administration away from being fixed.

Apache ZooKeeper provides a few conveniences for notifying clients about changes in the data store. ZooKeeper watches are one of these features, and I put them to use for clearing the reverse proxy cache when a new replica becomes available. A target server will write the current date and time to /cacheAge during startup, and the reverse proxy calls the function below to clear the request cache accordingly. Because watches only last for a single change notification, in the code below I have reset the watch every time it is triggered but there really has to be a more elegant and error-resilient way to do this.

export const cacheResetWatch = async (client: ZooKeeper, path: string, cache: NodeCache) => {
  if ((await getMaybeZnode(client, path)).isSome()) {
    client.aw_get(
      path,
      (_type, _state, _path) => {
        console.log("Clearing cache");
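        // Assuming cache is a node-cache instance: flushAll() empties it, while close() only stops its expiry timer
        cache.flushAll();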
        cacheResetWatch(client, path, cache);
      },
      (_rc, _error, _stat, _data) => {},
    );
  }
};

The caching logic in general needs work; the cache TTL is 120 seconds and if the current target server git commit was recorded in ZooKeeper rather than the timestamp of the last replica restart then the cache could be cleared only when the content of the target server has actually changed.

Since leaving RELEX, I've heard many complaints like the following about my favourite fault-tolerant key-value store, such as on episode #116 of the Ship It! Dev Ops podcast: 7

The worst outage I ever had is I was at Elastic, an engineering all hands in Berlin. It was a great place. I loved it. So all the SREs were there. And we did this to ourselves. Let me just preface this by saying… Because we relied on something that you should never rely on, and it’s called Zookeeper.

Half of the gray in this beard is from Zookeeper. So many things that you know, and probably love, and also hate… You probably love it if you don’t have to actually do the operations for Zookeeper, and if you’re on operations with Zookeeper, you absolutely hate Zookeeper. Zookeeper is the bane of your infrastructure, necessary as it may be.

As someone who was on the operations side of ZooKeeper, I have to disagree. But given that I went to the effort to shoehorn it into a static site server, of course I disagree.

Benjamin Reed and Flavio Junqueira. ZooKeeper. O'Reilly Media, Sebastopol, CA.↩
zookeeper NPM package ↩
Node http.request Documentation ↩
David Flanagan. JavaScript: The Definitive Guide (7th. ed). O'Reilly Media, Sebastopol, CA.↩
node:stream TypeScript types ↩
Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. "ZooKeeper: Wait-free Coordination for Internet-Scale Systems", in Usenix ATC, June 2010.↩
Ship It! Episode #116 ↩

https://iainschmitt.com/post/my-needlessly-complicated-reverse-proxy

Reinventing the wheel to go back in time

Nov 29, 2024

Reinventing the wheel to go back in time

Back in the 90s, you had to deal with any number of problems that are foreign to the software engineers of today: the idea that a bug in the Linux TCP/IP libraries could explain some undesired behaviour in your code would have been far more credible in 1994 than it is in 2024. In the early 90s, ICMP ping packets larger than the maximum IPv4 packet size - 'ping of death' packets - would routinely crash IP network connected devices. While it may be a common refrain that the quality of consumer facing software has worsened over time, the engineers of 1994 did not have access to PostgreSQL, the Apache Web Server, or a Linux kernel that could operate on more than one thread. 1 For open source projects as widely used as any of the three previously mentioned, the low-hanging fruit of common bugs gets picked early, with the medium and higher-hanging fruit not far behind. Today if you wanted to find a novel bug in the Linux TCP/IP libraries, you would have your work cut out for you, and it would probably require some pretty interesting torture testing. But if you did, you'd have something to hang your hat on. The barriers to entry for becoming a kernel contributor aren't zero - you have to be a skilled systems programmer and know a lot about operating systems, but that's a lot closer to what an economist would describe as a perfectly competitive marketplace for talent than most examples you'll come across. While I don't understand PostgreSQL and Apache development as well as Linux, I'd be surprised if this would look any different for these projects.

Linux, PostgreSQL, and Apache have given us a great alternative to re-inventing the wheel: relying on the decades of development and bug fixing that have made those tools modern day miracles. The CPU running your Linux public cloud workloads cycles through hundreds of concurrently running processes without skipping a beat. Years of work has gone into making the interplay between page stealing and the write-ahead log of PostgreSQL work correctly on the databases of planet-scale applications. Apache (or, for that matter, Nginx) allows you to practically take it as a given that you are receiving proper HTTP requests. Any novice who starts tinkering with operating systems, databases, or web servers gains an appreciation for how anyone making a living in computing depends on the efforts of an invisible army of engineers who cared about their craft and put decades of work and expertise into making modern miracles. If you think that you should roll your own operating system, database, or HTTP server for use in production then you are almost certainly wrong.

A generation of software engineers got their start in programming thanks, in part, to people asking the wrong question when an exception is thrown, not to single out one StackOverflow contributor in particular. Learning programming in the internet age is a process of gradually accepting that it is nearly always your code that is wrong, at which point you check StackOverflow, see a post from someone who both saw and misinterpreted the same error message as you, then read a hopefully not too condescending answer. Almost all the tools you need are in front of you, and you simply need to use them correctly: it's never a fun exercise to figure out just how little of 'your' code is really your code between operating system libraries, managed runtimes, and third party libraries.

The farther you go in software engineering, the more likely it is that you'll face problems that the software engineers of the 90's and 00's would recognise. I could be misremembering some specifics, but at a previous company that ran a considerable amount of their compute in a private cloud I recall several weeks tracing intermittent bad gateway errors requiring several packet captures to determine that the root cause was a hardware failure on a network switch. As another example, take the Cloudflare engineers who figured out that in high volume egress workloads, TCP connections from odd-numbered ports would have higher latency than even-numbered ports. 2 In these situations "everything else is fine, I just made an error in my program" stops being a useful heuristic.

To better prepare for situations like these, it is a good idea to get some experience re-inventing the wheel. By building, breaking, and fixing toy examples of the technologies you use in production, you are re-enacting part of the gradual process that made these technologies as good as they are today. By understanding how the rough edges of the tools were sanded off, you will be more prepared when an HTTP server, database, or operating system behaves in an unintuitive way. When running your Worldle clone on a database you wrote yourself you can't take a shortcut and assume the database is infallible; you'll have to troubleshoot it alongside your application. It's of course a terrible idea to subject your company or customers to erroneous 403 responses because of a bug in your nftables clone, but for your hobby project? You'll want to make sure your servers are otherwise hardened, but fixing your broken software will teach you more than what you'll learn by reading and following tutorials. 3

https://iainschmitt.com/post/reinventing-the-wheel-to-go-back-in-time

Mobile Network Operators and Capex Tradeoffs

Nov 10, 2024

Mobile Network Operators and Capex Tradeoffs

A discussion of why being a wireless carrier is such a tough business, and what lessons about the industry from both sides of the Atlantic can teach us about electrical utilities.

Wireless Carrier Challenges

It can't be fun to be a wireless carrier. Given that it's corporate earnings season in the US, we can see Alphabet (Google's parent company) and Meta Platforms had 2024 Q3 net incomes of $26 and $15 billion while Verizon and T-Mobile brought in a measly $3.4 and $3.1 billion this quarter. This would take some work to explain to someone in 1970: the companies that built continent-scale networks for wireless customers to access an incredible wealth of services and publicly available information are less profitable than the services enabled by said networks. Ben Thompson's aggregation theory explains the Google and Meta side of the equation: 1

The value chain for any given consumer market is divided into three parts: suppliers, distributors, and consumers/users. The best way to make outsize profits in any of these markets is to either gain a horizontal monopoly in one of the three parts or to integrate two of the parts such that you have a competitive advantage in delivering a vertical solution. In the pre-Internet era the latter depended on controlling distribution. The fundamental disruption of the Internet has been to turn this dynamic on its head...the Internet has made distribution (of digital goods) free, neutralizing the advantage that pre-Internet distributors leveraged to integrate with suppliers. Secondly, the Internet has made transaction costs zero, making it viable for a distributor to integrate forward with end users/consumers at scale.

Meta Platforms jealously guards the exact number of servers at their disposal to enable a significant fraction of humanity to have an account on Facebook, Instagram, or WhatsApp, but the answer is "at least one million".2 This enables incredible economies of scale when it comes to purchasing hardware, operating data centres, and writing purpose-built software to administer their services. For both Google and Meta, this is a large part of what enables the zero transaction costs at the heart of their aggregator business model.

Carriers/mobile network operators (MNOs) do not have this luxury. For example, the LTE wireless standard is optimised for performance at a 3.1 mile cell radius, which requires cell towers roughly 5.3 miles apart from each other in a hexagonal grid.3 While Google and Meta can spread workloads across planet-scale fleets of servers and network infrastructure, carriers need to have cell sites wherever they intend to serve customers: the modern consumer expects and modern life demands cell coverage in all but the most remote corners of America. As of December 31st of 2023, this meant that T-Mobile operated 128,000 cell sites in order to reach 98% of Americans.4 Building cellular networks that reach more than 90% of American households means providing service in sparsely populated places where the unit economics of building towers is far worse, but it's my understanding that American MNOs have a limited ability to increase prices in these harder to serve areas.
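The tower spacing figure follows from hexagon geometry: if the cell radius is measured from the tower to a corner of its hexagonal cell (an assumption of this sketch), neighbouring towers sit the square root of three times the radius apart. The function name below is made up for illustration.

```typescript
// Distance between neighbouring towers in a hexagonal grid, where cellRadius is
// measured from the tower to a corner of its hexagonal cell
const siteSpacing = (cellRadius: number): number => Math.sqrt(3) * cellRadius;

const spacingMiles = siteSpacing(3.1);
console.log(spacingMiles.toFixed(2)); // "5.37"
```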

When the aggregators want to increase their network throughput, they can do so without really impacting one another. But not every piece of the radio spectrum can facilitate efficient wireless communication, and wireless communication standards require exclusive access to a defined band of radio spectrum to work correctly. These spectrum bands are generally auctioned by the relevant telecommunications regulator in a jurisdiction, which is the FCC in the United States. The FCC has a rather archaic website for showing this, but a search for T-Mobile's FCC Registration Number shows that they own cellular licence KNKN557 for the A channel block near Myrtle Beach, South Carolina.5 Once a carrier has won spectrum at an FCC auction, there are hard engineering limits for the network traffic they can squeeze out of it. Ultimately all radio communication works by encoding a digital signal into a radio wave. Even in ideal conditions there are physical limits to the digital information density that can be crammed into a radio wave at a given frequency, but wireless communication protocols require message redundancy and retry mechanisms given the unreliable nature of wireless communications. As a result, there are tradeoffs between throughput and fidelity:

Despite adaptive modulation and coding schemes, it is always possible that some of the transmitted data packets are not received correctly. In fact, it is even desirable that not all packets are received correctly, as this would indicate that the modulation and coding scheme is too conservative and hence capacity on the air interface is wasted. In practice [in LTE], the air interface is best utilized if about 10% of the packets have to be retransmitted because they have not been received correctly.

from Martin Sauter's From GSM to LTE-Advanced Pro and 5G.6
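
Both limits can be put into rough numbers. The sketch below assumes the Shannon-Hartley formula for the ceiling on an ideal noisy channel, and a simplified retransmission model in which every packet fails independently with the same probability. The bandwidth, signal-to-noise ratio, and the two link rates are illustrative values rather than measurements from any carrier, and the function names are hypothetical.

```typescript
// Shannon-Hartley ceiling: bits per second for a given bandwidth and signal-to-noise ratio.
function shannonCapacityBps(bandwidthHz: number, snrDb: number): number {
  const snrLinear = Math.pow(10, snrDb / 10);
  return bandwidthHz * (Math.log(1 + snrLinear) / Math.LN2);
}

// Simplified model: if a fraction of packets must be resent, only the
// remaining share of airtime carries new data.
function goodputBps(rawRateBps: number, retransmissionRate: number): number {
  return rawRateBps * (1 - retransmissionRate);
}

// Illustrative numbers only: a 20 MHz channel at a 20 dB signal-to-noise ratio.
const ceilingBps = shannonCapacityBps(20e6, 20);

// A cautious coding scheme that never retries versus a faster one that resends 10% of packets.
const cautiousBps = goodputBps(100e6, 0);
const aggressiveBps = goodputBps(120e6, 0.1);

console.log((ceilingBps / 1e6).toFixed(1)); // 133.2
console.log(aggressiveBps > cautiousBps);    // true
```

With these made-up rates the faster scheme still delivers more data despite its retries, which is the tradeoff the quote describes.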

Altogether, MNOs are constrained by cell tower geography, spectrum availability, and the laws of physics. These three constraints mean carriers face very nonzero transaction costs, giving them daunting unit economics as compared to consumer aggregators like Meta and Google. But it gets worse. The American MNO market is an oligopoly with limited differentiation between the big three carriers. As of this quarter, Verizon and AT&T have 116 and 114 million connections to their consumer offerings while T-Mobile has 127 million connections but doesn't break them down between consumer and enterprise accounts. The three networks are more or less fighting to a draw.

At least in the US, carriers were once more differentiated: before the iPhone they had exclusive agreements with handset vendors. The carriers had more leverage over the likes of Nokia or Motorola by owning the customer touchpoint. But the advent of smartphones meant that the primary touchpoint moved from the carrier to the handset maker. Around the world Apple demonstrated that customers would switch carriers to get the iPhone, meaning that carriers would have to offer the handset to remain competitive, and they would do so on Apple's terms.7

Despite all of this, cellular networks enable an incredible amount of modern life. In all but the most remote corners of the country, you can depend on having signal. Not only does this require the unglamorous work of blanketing a continent with cell towers, communication from a connected device to a cell tower requires rapid digital signal modulation, radio transmission by the device, reception by the tower, demodulation, and error correction including possible retries. In the last few months I've made an effort to learn more about LTE mobile broadband; wireless communication between a single handset and a tower is a hard enough technical problem, but enabling hundreds or thousands of connected devices to communicate with a cell tower requires sophisticated techniques like orthogonal frequency division multiple access (OFDMA). Without going into too much detail, it's impossible to learn about OFDMA and not come away with respect for the engineering that makes it all possible and genuine surprise that this complicated process works literally hundreds of times per second on commodity consumer hardware.

Lessons from Europe and the Electric Utility Parallels

A lot of what is spelled out in the last section was stated more succinctly in the annual SEC filings of the big three carriers. Verizon's 10-K explains "The telecommunications industry is highly competitive" while T-Mobile wrote nearly the same with "The wireless communications services industry is highly competitive". AT&T had the decency to mix it up with "We have multiple wireless competitors in each of our service areas and compete for customers". I promise the reader that the SEC filings of major telecommunications companies are more interesting than they sound, but it's not surprising to see such anodyne language in the annual reports given how unassuming the industry is.

For more drama, we turn to the European Telecommunications Network Operators' Association (ETNO) 2024 "State of Digital Communications" report: 8

There is no end in sight for the slide in the financial performance of European telecoms operators. European telecoms operators are among the largest European-owned entities in the digital value chain, and their continued financial weakness makes them less able to develop skills and services in Europe, and makes them prey to takeover and break-up by entities whose values may not be aligned with a European vision for strategic autonomy

By 'financial weakness' they mean that European MNO revenue growth lags behind both European economic growth and the revenue growth for American carriers. The Stoxx Europe 600 Telecommunications index lags behind European equities as well as global telecommunications indices. While there are many factors at play, average MNO monthly revenue per user in Europe was €15 as compared with €42.50 in America. These lower prices have come at a cost: earlier in the report it's explained that only 17.1% of all mobile connections in Europe are over 5G as opposed to 48.7% in the US. Europe's mobile downlink speeds also trail America's, at 64 Mbps against 97 Mbps. While American consumers pay more, their carriers invest twice as much in capital as compared to their European counterparts.

Rather than seeing low mobile broadband pricing as a sign of the European common market working well for consumers, European policymakers are concerned that prices are too low to support modern, performant cellular networks. Earlier in the year, former European Central Bank president and Italian Prime Minister Mario Draghi authored a report for the European Union on improving the competitiveness of the bloc's economy. The chapter on the European broadband market struck a similar chord to the ETNO report: 9

Lower prices in Europe have undoubtedly benefitted citizens and businesses but, over time, they have also reduced the industry profitability and, as a consequence, investment levels in Europe, including EU companies’ innovation in new technologies beyond basic connectivity.

Some of the cited drivers of lower profitability included ex-ante regulation of telecommunications pricing (as opposed to ex-post regulatory action in the US when responding to malfeasance), the market operating on a country-by-country basis rather than bloc-wide, as well as:

Spectrum auctions to assign mobile frequencies have not been harmonised across member states and have been purely designed to command high prices (for 3G, 4G and 5G) over the past 25 years, with limited consideration for investment commitments, service quality or innovation.

This wouldn't be as much of a concern if it were not for the cost required to build out 5G; while they have considerably higher throughput, 5G networks require new network hardware in cell sites. This is a substantial capital investment, and the money to make it happen has to come from somewhere. The Draghi report has come to the conclusion that increasing European consumer access to 5G networks today and 6G networks tomorrow will require charging European consumers more, but it remains to be seen if this will be politically feasible.

When reading the ETNO and Draghi reports, I saw a lot of parallels between the state of the European telecommunications market and the role of regulated electrical utilities in America. In many parts of America, state governments grant one company a monopoly as the electrical utility. The idea is that duplicate electrical transmission infrastructure would be inefficient, and in lieu of competition between firms keeping prices in check, state governments set electricity prices that maintain relatively small profits for the utility company.

This is made more interesting by policy initiatives to encourage residential and industrial electrification in an effort to reduce carbon emissions. There is political pressure to keep electricity prices low, as electricity rates set by the government have many tax-like qualities. However, electrifying residential and industrial uses of power that currently rely on fossil fuels requires investments in transmission infrastructure to support higher loads. It's worth noting that electrical utilities have a business incentive to do this, as more electrification means moving dollars from natural gas companies to power companies. But policymakers with ambitious carbon reduction targets may have more aggressive timelines than what would otherwise make economic sense for the utility, which requires bargaining.

Rate setters have a healthy skepticism of the utilities they regulate - no power company would tell a rate commission that their company would survive just fine with a slightly lower price of electricity. But just as for Europe's 5G buildout, the capital investment required for both regular operations and any further electrification has to come from somewhere, and utilities need to balance their books one way or another. For both European MNOs and American power companies there is a tradeoff between prices paid by consumers and investments in infrastructure. American policymakers committed to abundant, low carbon electricity would do well to heed Europe's warning on the consequences of ignoring this tradeoff.

https://iainschmitt.com/post/mnos-and-capex-tradeoffs

November Grab Bag

Nov 8, 2024

There are a couple of topics that were cut from drafts of the previous two posts. I liked the sections too much to shelve, so I'm posting them as a grab bag here.

Software Unscripted and Enterprise Haskell

There aren't all that many good software engineering podcasts because the medium isn't a great match for the message: source code in plain text is ultimately, well, text, and being able to re-read sentences that you didn't understand the first time helps a lot when learning a concept in computing. But podcasts move at the pace of their presenters, and along with the general expectation that podcast episodes are self-contained, it's hard to get a lot of depth on a topic. This means most quality software engineering podcasts are valuable primarily for their topic curation: a JS Party podcast taught me about Preact 1 and a Talk Python to Me episode led me to MIT's 6.824 and Kleppmann's Designing Data-Intensive Applications.2 This is still valuable, as it can help prioritise what topics are worth learning and which written resources to reference as you're learning them.

One software engineering podcast that stands head-and-shoulders above the rest is Software Unscripted by Richard Feldman, a high-profile Elm programmer and the creator of the Roc functional programming language. Richard and his guests get into a level of detail that I haven't seen in podcast form. One of the things that enables this is that Richard expects a lot from his listeners: he would rather have you stop the podcast to refresh your memory on automatic reference counting or effect systems than break up the conversation with an explanation. Not every episode is tied to progress on the Roc language, but a lot of the episodes seem to be Richard recording a conversation with someone he needed to talk to anyway to make progress on the Roc language.

Something else that sets the tone for many of the conversations on Software Unscripted is that Feldman used to work for a company that used Haskell as its primary backend language. It's a powerful tool in the right hands, but Haskell is a gutsy choice to make in enterprise settings. If you have the right developer talent then the steep language learning curve can be overcome, but the dearth of third-party libraries and tooling that come with a smaller language are considerable disadvantages. The Java and Microsoft.NET platforms have libraries for seemingly everything under the sun, and while there's a lot to be said for stick-and-rudder programming with nothing but Vim and your terminal emulator - the way that UNIX was written, after all - having an out-of-the-box debugger sure helps solve bugs on a late Friday afternoon. Languages can and should exist for non-enterprise reasons, but someone who is a good enough engineer to use Haskell is also a good enough engineer to write your CRUD app in Java. This isn't to say that there aren't any advantages to writing production programs in Haskell, but seldom will using the language be the best use of anyone's 'innovation tokens'.3

F# and .NET Naming Conventions

Until listening to "F# in Production with Scott Wlaschin",4 I was skeptical of production use of Haskell in particular and functional programming in general. The episode didn't go into as much depth as some,5 but Wlaschin positioned F# as a better fit for enterprise software engineering than Haskell:

“If you look at the Haskell books, like printing ‘Hello World’ is like in Chapter 7 of these books … you can’t do it in Haskell until you understand I/O, and you can’t understand I/O until you understand monads”

“And so [F#] is nice because it’s not as pure as something like Haskell but it’s very programmatic, and so you can piggyback on the massive .NET libraries”

“And I have seen … one person re-writes a whole chunk of code in Haskell or Clojure or whatever and then they go away and nobody has really bought into it”

Wlaschin's pitch for F# was so good that I read his 2018 book Domain Modeling Made Functional and wrote a short review on this website.6

F# is an ML-family functional language that compiles to Microsoft's Common Intermediate Language (CIL), the bytecode for the Common Language Runtime (CLR) virtual machine. C# is far and away the most commonly used CIL language, but the .NET runtime APIs are shared between both languages: for example, the System.IO namespace contains I/O related .NET libraries included in the runtime. Interop between the languages is supported, allowing the F# programmer to put to use a whole wealth of packages already written for C#. In my experience working with the language, I've had to use several packages written entirely in C# and while you have to call the functions slightly differently, things have almost always just worked.

My biggest knock on the .NET Platform is the naming scheme, and complaining about this is too much fun to pass up. My favourite example of this is that “What is .NET? What's C# and F#? What's the .NET Ecosystem?” is an eighteen-minute video by Microsoft VP Scott Hanselman. At one point he says "it's not the best name, but it's the name that we have".7 After all, ".NET" is indistinguishable from a common top-level domain when spoken, and starting a sentence with ".NET" makes it look like you've broken punctuation rules. The Microsoft Writing Style Guide allows for starting a sentence with the platform name, but notes that "Microsoft.NET" is an acceptable alternative. Before seeing it in the style guide I had literally never seen this longer form, but the full "Microsoft.NET" apparently should always be used on the first mention of the platform, as is done in this post.8 This goes without saying but "Common Intermediate Language" and "Common Language Runtime" could not sound more generic without serious effort. Luckily the second characters in "C#" and "F#" are not actually musical sharp symbols, which would give the languages the unfortunate honour of having a non-ASCII character in their name. But not to let them off the hook, searching for "C#" or "F#" on Wikipedia redirects the reader to the pages for their respective Latin letters: 'for technical reasons, "C#" and ":C" redirect here'.9 More than a few times I've had to spell out "sharp" in place of the number sign when referencing either language.

Moving this Website to F#

Most advice about developer blogs is to use a static site generator by the likes of Hugo, Gatsby, or 11ty rather than trying to create something yourself. While it isn't bad advice, I've never been very good at following it. Last year I picked up Ethan Brown's book on Node and Express - one of the more engaging language/framework learning books I've come across - and I realised that I had some gaps in my knowledge of how bog-standard MVC applications worked.10 At the time this website was some not-very-well-written React application serving static text, so I re-wrote it to be a Node MVC application. Then, as is still the case, the website was little more than a group of routes to serve Markdown files rendered as HTML, with a little bit of templating and CSS to cut down on boilerplate and to make things look nice. When doing that first migration to Express, I thought that I would be writing more, so I added a /post/:markdownFileName route to serve arbitrary Markdown files. While I didn't end up writing much of anything other than the landing and resume pages on the old website, I kept using it as a testbed for various things like Nest or different logging libraries.

A little before picking up F# I finally took the very common advice for software engineers to do public-facing writing, so I wanted to preserve the ability to serve arbitrary Markdown now that I was finally putting it to use. The main server-side framework I used was Giraffe, which is mostly a set of functional bindings on top of C#'s ASP.NET.11 The Node website used Handlebars templates, but I moved these over to Razor pages without much issue when I couldn't find a .NET Handlebars engine that worked well. As is called out by the Giraffe docs, F#'s eager evaluation of functions means that routes are only evaluated the first time that they are requested, and the same initial result is then reused for future requests. Giraffe's warbler functions need to be used for accessing dynamic resources, but eager evaluation for static content probably has some performance benefits by forgoing unnecessary re-renders from Markdown to HTML. However, as shown below I was following the lessons from Wlaschin's Domain Modeling Made Functional by using the type system to only build /post/:markdownPath paths for existing Markdown files. This meant that I'd need to restart the application whenever I'd write a new post.

module AppHandlers

//...
module MarkdownPath =
    let create path =
        match path with
        | path when (File.Exists path) && (Path.GetExtension path = ".md") -> Some(MarkdownPath path)
        | _ -> None

    let toString (MarkdownPath path) = path
//...

let createRouteHandler markdownPath =
    match MarkdownPath.toString markdownPath |> Path.GetFileName with
    | "landing.md" -> route "/" >=> markdownFileHandler LeftHeaderMarkdown markdownPath "Iain Schmitt"
    | "resume.md" ->
        route "/resume"
        >=> markdownFileHandler LeftHeaderMarkdown markdownPath "Iain Schmitt's Resume"
    | _ ->
        route $"/post/{Path.GetFileNameWithoutExtension(MarkdownPath.toString markdownPath)}"
        >=> markdownFileHandler PostMarkdown markdownPath "Iain Schmitt"

let appRoutes: list<HttpHandler> =
    Directory.GetFiles markdownRoot
    |> Array.choose MarkdownPath.create
    |> Array.map createRouteHandler
    |> Array.toList
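
The idea of only constructing routes from validated paths can also be sketched outside of F#. The TypeScript below is a loose analogue rather than a port of this website's code: the branded type, the injected fileExists check, and the file names are all assumptions made for illustration, and no Giraffe or ASP.NET behaviour is modelled.

```typescript
// Branded string: the only way to obtain one is through createMarkdownPath.
type MarkdownPath = string & { readonly __brand: "MarkdownPath" };

// fileExists is injected so the sketch does not depend on a particular file system API.
function createMarkdownPath(
  path: string,
  fileExists: (candidate: string) => boolean
): MarkdownPath | undefined {
  const hasMarkdownExtension = path.slice(-3) === ".md";
  return hasMarkdownExtension && fileExists(path) ? (path as MarkdownPath) : undefined;
}

// Routes can only be derived from validated paths, mirroring the three cases in createRouteHandler.
function routeFor(markdownPath: MarkdownPath): string {
  const fileName = markdownPath.split("/").pop() || markdownPath;
  if (fileName === "landing.md") return "/";
  if (fileName === "resume.md") return "/resume";
  return "/post/" + fileName.slice(0, -3);
}

// Rough equivalent of Array.choose followed by Array.map: drop invalid candidates, then build routes.
function buildRoutes(candidates: string[], fileExists: (candidate: string) => boolean): string[] {
  const routes: string[] = [];
  for (const candidate of candidates) {
    const validated = createMarkdownPath(candidate, fileExists);
    if (validated !== undefined) routes.push(routeFor(validated));
  }
  return routes;
}

// Hypothetical directory listing standing in for Directory.GetFiles.
const filesOnDisk = ["markdown/landing.md", "markdown/resume.md", "markdown/grab-bag.md", "markdown/notes.txt"];
const routes = buildRoutes(
  filesOnDisk.concat(["markdown/missing.md"]),
  (candidate) => filesOnDisk.indexOf(candidate) !== -1
);
console.log(routes); // the three routes: "/", "/resume", "/post/grab-bag"
```

Because the route list is computed once from whatever files exist at that moment, a file added later has no route until the list is rebuilt, which matches the restart requirement described above.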

While I'm sure I could get some headless CMS working to my liking, Markdown is just plaintext, so I decided to essentially use GitHub Actions CI/CD as a CMS; the deploy pipeline would build and deploy the application with all Markdown files in WebRoot/markdown/ after merge commits. GitHub Actions worked well in another F# project,12 but I decided to make things more interesting by self-hosting the website. While the domain records for this website are served on a Digital Ocean VM, Nginx on said VM runs a reverse proxy to my home router. A port forwarding rule that only applies to the Digital Ocean IP address forwards traffic to a server on a DMZ VLAN, which hosts the application. One of these days I'll implement some caching on the Digital Ocean Nginx and set up another reverse proxy VM in a different geography, but that's almost entirely for show given what little bandwidth this website needs. At this point I don't have any website analytics outside of Nginx logs, but some minor F# middleware would fix that to give me some visibility on page views.

As far as the actual content goes, I was surprised that Markdown syntax includes footnotes, but they work pretty well in both the browser and in the Obsidian Markdown editor that I use to write posts. One fun initial footnote issue was '↩' rendering as an emoji on iOS until I used the correct escape characters, and if I use a sufficiently long URL sans any link text then the page will grow larger than the mobile viewport. I'd be surprised if there wasn't some CSS fix for breaking those URLs, but it's a good idea to have link text anyway. And speaking of CSS, my biggest current annoyance is that I'm still doing code syntax highlighting on the client using Prism.13 As Josh Comeau wrote in his excellent post about building his new blog, there's a tradeoff between bundle size and supporting more languages in a client-side syntax highlighter.14 Given that the only thing on this website that needs JavaScript is the syntax highlighting, I would much rather do this entirely on the server. However, I haven't found a good replacement syntax highlighter for use with my Markdown processor, Markdig; while I've seen the Markdig.Prism package, it requires serving the Prism script on the pages themselves thus defeating the purpose of migrating entirely.15 At the very least Prism is less than 2000 lines long, so maybe it's time for a full rewrite in C# or F#.

https://iainschmitt.com/post/november-2024-grab-bag

October 2024 GDPLE Development

Oct 27, 2024

GDPLE is my Wordle-inspired US state economy guessing game that I wrote last year, but this October I did some work on both the frontend and backend that I thought warranted a blog post.

Re-writing the GDPLE Backend in F#

After reading Domain Modeling Made Functional I decided to re-write the GDPLE backend in F#. The player is shown a breakdown of a US state's GDP by sector and has five attempts to guess the state correctly. Initially, the backend was written in TypeScript and Node using NestJS, which is based off of Express but includes some useful features for dependency injection and decorators to mark endpoints. While it strikes a good balance between something like Spring and vanilla Express, NestJS was overkill for the GDPLE backend, and you can end up writing lots of boilerplate when using the library.

The prior NestJS controller class is shown below, and the new F# backend has the same endpoints. The frontend hits POST /puzzle_session to get a UUID for the player's attempt on that day's puzzle that is written to local storage. That UUID is included in POST /guess to submit a guess and GET /answer/:id if the player has exhausted their guesses before reaching the correct answer. The GET /economy provides the GDP breakdown of the mystery US state for the frontend to create a treemap visualisation of the mystery state economy.

@Controller()
export class AppController {
  constructor(private readonly appService: AppService) {}

  @Get("/economy")
  async getTargetStateEconomy(): Promise<StateEconomy> { ... }

  @Post("/puzzle_session")
  async postPuzzleSession(): Promise<IPuzzleSession> { ... }

  @Post("/guess")
  async postGuess(@Body() body: GuessSubmissionRequest):
      Promise<GuessSubmissionResponse> { ... }

  @Get("/answer/:id")
  async getPuzzleAnswer(@Param() params: PuzzleAnswerRequest):
      Promise<PuzzleAnswerResponse> { ... }
}

F# is a joy to write in, but because I was pretty directly re-writing TypeScript into F#, there are a couple of TypeScript language features that I missed. The US state economy data was represented in a JSON file, and being able to directly import the file with import stateRecordList from "./UsStates" was convenient. TypeScript union types made working with hierarchical economic data very succinct as shown by the getTotalGdp function below.

export interface NonLeafEconomyNode {
  gdpCategory: string;
  children: Array<NonLeafEconomyNode | LeafEconomyNode>;
}

export interface LeafEconomyNode {
  gdpCategory: string;
  gdp: number;
}

getTotalGdp(economy: NonLeafEconomyNode | LeafEconomyNode): number {
  if ("children" in economy) {
    return economy.children
        .map((node) => this.getTotalGdp(node))
        .reduce((prev, cur) => prev + cur, 0);
  } else {
    return economy.gdp;
  }
}

The function takes advantage of TypeScript's permissiveness and if ("children" in economy) isn't the best way to distinguish between the two node types. It is possible to do something equivalent with more type safety in F# by replacing the direct JSON import with a JSON type provider class and using discriminated unions as shown below, but it ended up being unwieldy by introducing four separate types for the economy data. Because only the leaf nodes of the economy tree structure have GDP data, calculating total GDP means distinguishing between a leaf node that must return a GDP value and a non-leaf node where the GDP sum needs to be recursively called over all child nodes. I am sure there is a way to use option types more creatively and coerce the type provider into using them, but at first pass I wasn't able to make it work, and I wanted to get something up-and-running relatively quickly. Type providers are such a great F# language feature, so I hope I'll soon find the time to get these working properly.

let getTotalGdp (economyNode: StateEconomies.Root) =
    let rec loop (node: Node) =
        match node with
        | Leaf leaf -> leaf.Gdp
        | StateEconomy se -> se.Children |> Array.map loop |> Array.sum
        | OuterChild oc -> oc.Children |> Array.map loop |> Array.sum
        | MiddleChild mc -> mc.Children |> Array.map loop |> Array.sum

    loop economyNode.StateEconomy |> Math.Round |> Convert.ToInt64
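
For comparison, TypeScript can get closer to the discriminated-union version by adding an explicit tag to each node type. This is a sketch rather than the GDPLE code: the kind field and the sample figures are assumptions, and nothing here implies the original JSON file carries such a tag.

```typescript
// Tagged union: the kind field tells the compiler which shape it is looking at.
type EconomyNode =
  | { kind: "leaf"; gdpCategory: string; gdp: number }
  | { kind: "branch"; gdpCategory: string; children: EconomyNode[] };

// Only leaves carry GDP, so branches sum recursively over their children.
function getTotalGdp(node: EconomyNode): number {
  switch (node.kind) {
    case "leaf":
      return node.gdp;
    case "branch":
      return node.children.reduce((sum, child) => sum + getTotalGdp(child), 0);
  }
}

// Made-up sample data, not taken from the real state economy file.
const sampleEconomy: EconomyNode = {
  kind: "branch",
  gdpCategory: "All industry total",
  children: [
    { kind: "leaf", gdpCategory: "Agriculture", gdp: 10 },
    {
      kind: "branch",
      gdpCategory: "Manufacturing",
      children: [
        { kind: "leaf", gdpCategory: "Durable goods", gdp: 20 },
        { kind: "leaf", gdpCategory: "Nondurable goods", gdp: 12.5 },
      ],
    },
  ],
};

console.log(getTotalGdp(sampleEconomy)); // 42.5
```

Switching on the tag lets the compiler narrow each case, so accessing gdp on a branch or children on a leaf becomes a type error rather than a runtime surprise.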

Another aspect of the backend that I want to return to pertains to the database. While I ended up using F# Dapper, I originally wanted to use either an SQL type provider or SqlHydra for better database related type checking while editing, but these required referencing an OS-specific .dll and I didn't want the hassle while moving between macOS and Linux while writing the new backend. When I tried to decrement the id column of every row in the target_states table (represented as puzzleAnswerTable in Dapper), I faced a compiler error as shown below because I wasn't able to use the puzzleAnswer field in such a self-referential way inside the update statement. This meant I had to use the raw query instead, with even less type safety than what F# Dapper provides:

let deleteObsoletePuzzleAnswers (dbConnection: DbConnection) =

    // More elegant, but wouldn't compile
    let updatePuzzleAnswer (puzzleAnswer: PuzzleAnswer) =
        {id=puzzleAnswer.id-obsoletePuzzleAnswerCount; name=puzzleAnswer.name; gdp=puzzleAnswer.gdp}
    // error FS0039: The value or constructor 'puzzleAnswer' is not defined.
    update {
        for puzzleAnswer in puzzleAnswerTable do
            set (updatePuzzleAnswer puzzleAnswer)
        } |> ignore

    // Cruder, but compiled
    dbConnection.Execute(
    $"UPDATE target_states SET id = id - {obsoletePuzzleAnswerCount}, updatedAt = CURRENT_TIMESTAMP"
    )
    |> ignore

CI and Frontend Woes

While working on the F# re-write I found out that GitHub allows for self-hosted CI/CD runners, and I wish I had found out about these sooner. All the CI/CD I had done in the past had been on GitLab CI/CD runners or Azure DevOps pipelines and I mistakenly thought that GitHub's equivalent - GitHub Actions - was a paid offering. This means that in the past, deploying one of my personal projects onto my VPS (virtual private server) meant either using scp or git pull before manually restarting some systemd services. Every once in a while I'd forget to restart NGINX, or I'd mess up file permissions by accidentally deploying as root, and while this was a great way to stay on top of my system administration skills, it could get annoying and introduced more friction while trying to push out updates. My GitHub Actions workflows for the backend were straightforward to implement but not very sophisticated: I'm running the actions directly on the VPS hosting the backend. But the CI/CD Actions made my life much easier while addressing the bugs on the F# backend. Instead of having to SSH onto the VPS for every fix, I could push out small fixes over lunch.

The CI story was more interesting for the frontend: I kept running out of memory while running vite build, which wasn't terribly surprising given the 1 GB of memory on the VPS:

<--- Last few GCs --->

[1044134:0x67be140] 48423 ms: Scavenge (reduce) 380.7 (391.3) -> 380.0 (391.6) MB, 1.79 / 0.00 ms (average mu = 0.210, current mu = 0.079) allocation failure;

[1044134:0x67be140] 49129 ms: Mark-Compact (reduce) 381.1 (391.6) -> 379.2 (391.6) MB, 697.59 / 0.05 ms (average mu = 0.262, current mu = 0.313) allocation failure; scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

The JavaScript bundle was relatively small at 445.79 kB, or 145.92 kB when gzip compressed. Luckily I was using Preact as a drop-in replacement for React, which gzips down to 9.96 kB, but my problems seemed to be with bundling Material UI. There were over 10,000 'module level directive' warnings associated with the @mui NPM package:

$ npx vite build &> /tmp/build
$ cat /tmp/build | grep -c 'node_modules/@mui.*Module level directives cause errors when bundled'
10924

There has to be some way to reduce Node's memory footprint or work around these Material UI issues, but I wasn't able to do so and never liked the component library that much anyway. When an actual professional frontend engineer recommended I look into React Aria I had an excuse to replace the library entirely. It's a very simple frontend and I only ended up using the Button, Modal, and Dialog React Aria components. The most complicated part of the frontend is the autocomplete text input for submitting a US State name as a guess, and if I was using React then the ComboBox element would have worked great for this input. But I ended up using an Autocomplete element from the Mantine component library because of a possible bug in Preact.

Possible Preact Bug

I set up another repository with a side-by-side comparison between Preact and React to troubleshoot the bug. Both applications render only a ComboBox with a single ListBoxItem underneath it as a selectable option. When the Preact application is run, the following error appears in the developer console:

Uncaught TypeError: Cannot set property previousSibling of #<Node> which has only a getter
    at $681cc3c98f569e39$export$b34a105447964f9f.appendChild (Document.ts:119:13)
    at Object.insertBefore (portals.js:49:22)

The relevant section of Document.ts is shown below; it is part of React Aria and is used to build a variety of different components.

export class BaseNode<T> {
  //...
  appendChild(child: ElementNode<T>) {
    //...
    if (this.lastChild) {
      this.lastChild.nextSibling = child;
      child.index = this.lastChild.index + 1;
      child.previousSibling = this.lastChild;
    } else {
      child.previousSibling = null; // Line 119 of Document.ts: error thrown here
      child.index = 0;
    }
    //...
  }
  //...
}

The React application in that repository has no issue with rendering ComboBox. No Preact errors are thrown if ListBoxItem elements are left out, which of course defeats the purpose of having the ComboBox in the first place. This means that the problem is with Preact and ListBoxItem in particular. With respect to the error itself, in the code segment above the child.constructor.name is equal to "HTMLUnknownElement" for Preact. The rest of this section is more speculative - I know very little about Preact internals. However, in comparing two stack frames up from BaseNode#appendChild, Preact is calling the method on a lower level of the virtual DOM hierarchy than React, and I don't think that this lower level exists.
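
The error message itself can be reproduced without either framework. In the minimal sketch below, two hypothetical classes stand in for a node whose previousSibling is getter-only (as it is on browser DOM nodes) and for a node with a plain writable field (as BaseNode#appendChild appears to expect). None of this is React Aria or Preact code.

```typescript
// Stand-in for a browser DOM node: previousSibling is exposed through a getter only.
class GetterOnlyNode {
  get previousSibling(): GetterOnlyNode | null {
    return null;
  }
}

// Stand-in for a virtual collection node: previousSibling is an ordinary writable field.
class WritableNode {
  previousSibling: WritableNode | null = null;
}

// Same shape as the else branch in appendChild above.
function markAsFirstChild(child: { previousSibling: unknown }): void {
  "use strict"; // in strict mode a write to a getter-only property throws instead of being ignored
  child.previousSibling = null;
}

function tryMarkAsFirstChild(child: object): string {
  try {
    markAsFirstChild(child as { previousSibling: unknown });
    return "ok";
  } catch (error) {
    return error instanceof TypeError ? "TypeError" : "unexpected error";
  }
}

console.log(tryMarkAsFirstChild(new WritableNode()));   // ok
console.log(tryMarkAsFirstChild(new GetterOnlyNode())); // TypeError
```

This is consistent with appendChild being handed a real DOM element where it expected one of React Aria's own node objects, though it does not explain why Preact passes one.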

The code segment below is taken two stack frames above BaseNode#appendChild. For Preact, parentVNode.type and parentVNode._dom have values of "item" and undefined respectively, and the _dom property of a Preact VNode is 'The [first (for Fragments)] DOM child of a VNode'.1 While the function call looks a little different for React, child.constructor and child.node.type are [[FunctionLocation]] Document.ts:227 and "item". While React is inserting an "item" virtual DOM element into the DOM, Preact looks to be inserting the first child of an "item" virtual DOM element, which is undefined.

Because the text inside of the ListBoxItem is the lowest level in the component hierarchy, I assume that the "item" DOM element is the content between the ListBoxItem tags, which is 'Aardvark' in my bug demonstration repository. Line 119 of Document.ts is only reached once in both the Preact and React applications, so this isn't a matter of React not yet reaching the lowest level of the component tree, and the ListBoxItem definition is const ListBoxItem = createLeafComponent('item', function (props, forwardedRef, item) {...}) which likely explains the "item" in both function calls.

// preact: children.js:343
function insert(parentVNode, oldDom, parentDom) {
  //...
  parentDom.insertBefore(parentVNode._dom, oldDom || null);
  //...
}

// react: react-dom-development.js:11069
function appendChildToContainer(container, child) {
  //...
  parentNode.insertBefore(child, container);
  //...
}

There's a good chance that this isn't a true bug and is rather a tradeoff in Portal components that the Preact maintainers had to make, but in the coming days I'll post an issue in the GitHub repository for the project. If it's not a bug then I'll be curious to see what I got wrong here.

Frontend Re-write Impact

I would have preferred to use React Aria for the autocomplete over importing a second component library, but my VPS can now build and deploy the frontend with GitHub Actions without running out of memory. To my surprise, this frontend rewrite barely impacted the JavaScript bundle size. In comparing the repository before the F# rewrite at ee865c8 with the current most recent commit fa98105 as of writing, the bundle size shrank by less than 5% down to 423 kB. When building on an Apple Silicon M1 processor, the peak memory during the build also went down a modest amount from 47.56 MB to 45.04 MB. However, the build time went from 9.62 to 3.42 seconds, and the number of modules transformed dropped from 12,473 to 2,716.2

Instead of rewriting the frontend to build on one of the cheapest servers offered by DigitalOcean, I could have moved the action runners to a homelab server and done any number of things to deploy it onto the VPS - something I plan to do anyway. But what's the fun in that?

Link to relevant comment in Preact ↩
Values taken using the macOS time command, averaged over three measurements ↩

https://iainschmitt.com/post/october-2024-gdple-development

Review of Domain Modeling Made Functional

Sep 8, 2024

Domain Modeling Made Functional: Tackle Software Complexity with Domain-Driven Design and F# is a 2018 book by Scott Wlaschin.1 As it says on the tin, the book's goal is to show the reader how to implement domain modeling using functional programming. The phrase "domain-driven design" itself was coined by Eric Evans' 2003 book of the same name, in which all code examples are written in Java,2 and most other books and talks about DDD use object-oriented languages. While code examples for this book are provided in F#, no prior knowledge of the language is assumed. Part 1 of the book is a brief overview of domain-driven design terms, and Part 2 walks through how the type system and chained function calls can be used for 'modeling in the small'. Part 3 wraps up the book by fully implementing the e-commerce bounded context introduced in Part 1.

The book's discussion of domain modeling topics agnostic to programming paradigm was good but could have gone into more detail. It covers the basics of entities, value objects, aggregates, and bounded contexts well, if not as in-depth as Domain Driven Design. While the Evans book suffers from a low ratio of code to prose, Domain Modeling Made Functional never goes more than a few paragraphs without showing some code, making Wlaschin's arguments more concrete and the book more readable. One limitation of his approach is his reliance on a single, relatively straightforward business domain throughout the book, while Evans offers more varied modeling challenges across multiple domains in a single chapter. Wlaschin's e-commerce business domain includes order validation, pricing, and an acknowledgement email,3 and this business logic isn't as complicated as the syndicated loan system in chapters 8 and 10 of Domain Driven Design, where the solution must keep track of a lender's share of an incoming loan payment. The book would be improved by a chapter walking the reader through multiple thorny domain modeling cases.

What makes the book shine is its progressive disclosure of language features to an F# beginner in a way that really sells the language. Domain modeling doesn't require too many language features, so Wlaschin spends as little time as possible explaining F# syntax. By introducing the type system and pattern matching early on, the book can quickly explain the advantages that F# brings over imperative languages. An example of baking the domain rules into the type system is shown below, where the String50 type and module are set up to make null strings, empty strings, and strings longer than 50 characters unrepresentable.

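open System

// The private constructor means a String50 can only be built through String50.create
type String50 = private String50 of string

module String50 =
    // Unwrap the underlying string
    let value (String50 str) = str

    // Reject null, empty, and over-long input instead of constructing an invalid value
    let create fieldName str : Result<String50, string> =
        if String.IsNullOrEmpty str then
            Error(fieldName + " must be non-empty")
        elif str.Length > 50 then
            Error(fieldName + " must not be more than 50 chars")
        else
            Ok(String50 str)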

Concerning pattern matching, the following example was used in the book with the ShoppingCart discriminated union.

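// Supporting types are filled in here so the snippet compiles; CartItem is a placeholder
type CartItem = string
type ActiveCartData = { UnpaidItems: CartItem list }
type PaidCartData = { PaidItems: CartItem list; Payment: float }

type ShoppingCart =
    | EmptyCart
    | ActiveCart of ActiveCartData
    | PaidCart of PaidCartData

let addItem cart item =
    match cart with
    | EmptyCart -> ActiveCart { UnpaidItems = [ item ] }
    | ActiveCart { UnpaidItems = existingItems } ->
        ActiveCart { UnpaidItems = item :: existingItems }
    | PaidCart _ -> cart // a paid cart is returned unchanged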

Wlaschin also gives an effective explanation of the Either monad to bring error handling into the primary control flow rather than the 'hidden' control flow of exception handling. Rather than explain that the Result type is an implementation of the Either monad - or even explain what monads are - he explains the type before using it in the rest of the domain for error handling. Aiding his explanation are diagrams similar to those from his 'Railway Oriented Programming' talk shown below.4

For readers coming to functional programming for the first time, this is an appropriately gentle introduction to monads, and he doesn't even use the 'm-word' until he's fully explained Result: "The m-word has a reputation for being scary, but in fact we’ve already created and used one in this very chapter!". I'm almost embarrassed to admit that this was how I learned that the Java Optional class was itself a monad. Little did I know that nearly every day I was using an FP concept that had confused me for years! Something similar is done in Michał Płachta's Grokking Functional Programming, where the Optional, Either, and IO monads are demonstrated without using the word 'monad' anywhere in the book.5
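
To make the two-track idea a bit more concrete, here is a minimal TypeScript sketch of Result-style chaining. It is not code from the book: the Result shape, the bind helper, and the parseQuantity, checkStock, and priceOrder steps are all hypothetical names that I made up for illustration, as are the stock limit and unit price.

```typescript
// A Result is either a success carrying a value or a failure carrying a message.
type Result<T> = { kind: "ok"; value: T } | { kind: "error"; error: string };

// bind runs the next step only when the previous one succeeded;
// an error is passed along untouched.
function bind<A, B>(r: Result<A>, next: (a: A) => Result<B>): Result<B> {
  return r.kind === "ok" ? next(r.value) : r;
}

// Hypothetical validation steps (assumptions, not the book's workflow).
function parseQuantity(raw: string): Result<number> {
  const n = Number(raw);
  return Number.isInteger(n) && n > 0
    ? { kind: "ok", value: n }
    : { kind: "error", error: "quantity must be a positive integer" };
}

function checkStock(qty: number): Result<number> {
  return qty <= 10
    ? { kind: "ok", value: qty }
    : { kind: "error", error: "only 10 units in stock" };
}

function priceOrder(qty: number): Result<number> {
  return { kind: "ok", value: qty * 250 }; // invented unit price, in cents
}

// The whole workflow is the three steps chained together.
function placeOrder(raw: string): Result<number> {
  return bind(bind(parseQuantity(raw), checkStock), priceOrder);
}
```

Once any step returns an error, bind skips the remaining steps and carries the error through to the end, which is the 'failure track' in the railway diagrams.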

This leaves a lot about F# that a novice to the language will have to pick up - there isn't a discussion of the built-in .NET types, a dedicated chapter on working with collections, or other such topics. For questions like 'how do I do a map over an array' the excellent F# Language Reference6 and F# Library Reference provide quick answers.7 Domain Modeling Made Functional isn't sold as a way to learn F#, and I didn't intend to use it as one at the outset, but I ended up reading the entire book cover-to-cover and it greatly motivated me to write F# in the process. Reading the book, using the language documentation, and doing a few old Advent of Code problems was a faster and more painless way to learn the basics of a new language than what I had done in the past by reading a language-learning book first. This has made me realise that I would much rather learn a new language by diving into a problem area that it is well equipped to work in: rather than just learn Rust, I'd rather do a deep dive in concurrency and learn Rust in the process. Rather than just learn C, I'd rather write a ray tracer, something I've tried unsuccessfully a few times since first seeing the spectacular ray tracer in 99 lines of C++.8 In the future, I'll be on the lookout for 'Learn concept X through language Y' books.

One criticism of the book is that it stresses practicality and making functional programming look normal to a fault. While some material in the persistence chapter helped tie database options to Result types, much of the content on incorporating relational and document databases was unnecessary; this type of detail was left out of the Evans book for a reason. Wlaschin's chosen dependency injection method was to pass dependencies as function parameters, explaining that the Reader and Free monads would be omitted given the introductory nature of the book. The Reader monad was something that I immediately read up on after finishing the book, as I'm sceptical of passing every dependency as a function parameter in large applications. Given how good his Either monad explanation was, I'm sure he would have hit the mark for Reader and Free as well.
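
For a rough sense of the difference, below is a small TypeScript sketch contrasting the two styles. None of it comes from the book: the Deps interface, the Reader alias, the flatMap helper, and the pricing functions are hypothetical names, and the product code, price, and tax rate are invented.

```typescript
// Hypothetical dependencies for a pricing step.
interface Deps {
  getPrice: (productCode: string) => number;
  taxRate: number;
}

// Style 1: the dependency is passed as an explicit function parameter.
function lineTotal(getPrice: Deps["getPrice"], productCode: string, quantity: number): number {
  return getPrice(productCode) * quantity;
}

// Style 2: a Reader is a computation that still needs an environment to run.
type Reader<R, A> = (env: R) => A;

// flatMap sequences two Readers and hands the same environment to both.
function flatMap<R, A, B>(ra: Reader<R, A>, f: (a: A) => Reader<R, B>): Reader<R, B> {
  return (env) => f(ra(env))(env);
}

const lineTotalR = (productCode: string, quantity: number): Reader<Deps, number> =>
  (deps) => deps.getPrice(productCode) * quantity;

const addTaxR = (amount: number): Reader<Deps, number> =>
  (deps) => amount * (1 + deps.taxRate);

// No dependency appears in this signature; it is supplied once when the Reader is run.
const totalWithTaxR = (productCode: string, quantity: number): Reader<Deps, number> =>
  flatMap(lineTotalR(productCode, quantity), addTaxR);

// Made-up environment for demonstration.
const demoDeps: Deps = {
  getPrice: (code) => (code === "W1234" ? 10 : 0),
  taxRate: 0.25,
};

const total = totalWithTaxR("W1234", 2)(demoDeps);
```

With parameter passing, every caller of lineTotal has to have getPrice in hand and thread it through. With the Reader version, intermediate functions only compose Readers, and the environment is provided once at the edge of the application.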

The second to last sentence in Domain Modeling Made Functional is "In this book I aimed to convince you that functional programming and domain modeling are a great match", and the book does accomplish this. Wlaschin is one of the best technical authors I've read, making the 310 pages in the book fly by. While it is missing some advanced DDD and FP concepts, it ended up being an excellent introduction to F# that I would recommend to anyone interested in learning the language.

https://iainschmitt.com/post/ddmf-review