
https://yehudacohen.substack.com/feed

Introducing TypeKro
A control plane aware framework for orchestrating Kubernetes resources in TypeScript

I've not yet met a developer who likes YAML much. Nonetheless, the Kubernetes ecosystem is made up of engineers who will spend hours debugging YAML indentation errors and typos. For the longest time I found this to be a curious mystery: tools like Pulumi and cdk8s have been around for a while now, providing a statically typed experience for building Kubernetes workloads. These tools could shorten the engineering feedback loop with IntelliSense, compile-time feedback, and expressive control flow statements like loops and conditionals. Yet engineers eschew them in favor of oh-so-painful YAML.

And many of these Kubernetes engineers complain about this YAML incessantly too.

So if nobody likes YAML, and there are alternative tools like Pulumi and cdk8s, why do Kubernetes engineers still use Helm charts and kustomizations with ArgoCD or Flux to manage and distribute their Kubernetes source code? This stands in stark contrast to AWS engineers, who have long since abandoned the pain of writing CloudFormation templates in favor of Pulumi or Terraform or the AWS CDK.

Without further ado then, let’s discuss my hypothesis as to why everyone is still reluctantly using YAML for Kubernetes development. More excitingly still, I want to introduce you to TypeKro, a new framework that I’ve built to provide a really good user experience for Kubernetes developers and to address the shortcomings of Pulumi and cdk8s.

If you aren’t interested in a long diatribe about the state of Kubernetes and just want to see TypeKro, scroll way down or just go to the linked website or GitHub repository [where all stars are deeply appreciated].

The Shortcomings of Pulumi, cdk8s, and the Existing Kubernetes Ecosystem

Pulumi and cdk8s are two primary options that have been available to Kubernetes engineers who want to define their infrastructure using programming languages rather than YAML.

Both of these approaches have impedance mismatches with the natural declarative paradigm adopted by the Kubernetes CI/CD ecosystem. Each of these tools has a different philosophy, so we’ll address the impedance mismatches of each independently.

Why not Pulumi?

Unlike the RESTful APIs provided by many SaaS, PaaS, or IaaS vendors, the Kubernetes control plane doesn't really succeed or fail at a single resource. The Kubernetes control plane has no interest in dependencies between resources. This is in opposition to the philosophy of Pulumi (and that of Terraform), which cares deeply about applying resources in topological order when deploying. Pulumi builds a dependency graph from your definitions, and deploys them in order.

This is in opposition to a Kubernetes cluster, where you blast all your YAML configuration at the cluster, and let it try and try and try again to reconcile the state until Kubernetes' reconciliation loop successfully aligns your cluster's actual runtime configuration with the desired state as declared in your YAML configuration.
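To make the contrast concrete, here is a toy TypeScript model of the "apply everything and keep retrying" approach. It is only a sketch: the Manifest shape and the reconcile function are made up for illustration, and real controllers reconcile continuously rather than in discrete rounds. The requires field stands in for whatever the cluster needs to exist before a resource can settle; the caller never sorts anything.

```typescript
// Toy model only: a manifest settles once everything it requires already exists.
interface Manifest {
  name: string;
  requires: string[];
}

// Submit every manifest each round, in whatever order it was given, and retry
// until all of them have settled (or give up after maxRounds).
function reconcile(
  manifests: Manifest[],
  maxRounds = 10,
): { rounds: number; applied: string[] } {
  const applied = new Set<string>();
  const order: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    for (const manifest of manifests) {
      if (applied.has(manifest.name)) continue;
      const satisfied = manifest.requires.every((name) => applied.has(name));
      if (satisfied) {
        applied.add(manifest.name);
        order.push(manifest.name);
      }
    }
    if (applied.size === manifests.length) {
      return { rounds: round, applied: order };
    }
  }
  throw new Error(`State did not converge after ${maxRounds} rounds`);
}

// Deliberately submitted in the "wrong" order.
const reconcileResult = reconcile([
  { name: 'pod', requires: ['namespace', 'configmap'] },
  { name: 'configmap', requires: ['namespace'] },
  { name: 'namespace', requires: [] },
]);
console.log(reconcileResult.rounds, reconcileResult.applied);
```

The manifests converge after a few retries even though nothing was ordered ahead of time, which is the behavior a dependency-ordered tool has to work against.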

This impedance mismatch between the operational model of Pulumi and the operational model of the Kubernetes control plane renders Pulumi non-ideal for Kubernetes development.

After all, why should Kubernetes infrastructure be held back by the need to define a dependency graph up front, and why should we wait for a resource to become ready prior to deploying the resources that depend on it?

But directed acyclic graphs and dependency ordering are fundamental to Pulumi’s deployment model, and if Pulumi were to deploy dependent resources without waiting for the resources they depend on to become ready, Pulumi users would forever remain ignorant of whether their deployment succeeded.

Why not cdk8s?

The approach cdk8s takes is closer to the operational model of Kubernetes. Rather than waiting to deploy each Kubernetes resource and watching the control plane for stability before deploying dependent resources, cdk8s generates YAML that tools like ArgoCD can then deploy and monitor for stability.

This renders cdk8s useful in the generation of YAML, but insufficient to perform safe deployments to your Kubernetes cluster. Instead, you must supplement cdk8s with a second tool like Argo CD to deploy and monitor the YAML it generates.

If a cdk8s deployment fails, an engineer must then correlate the errors reported by Argo with the source CDK code before synthesizing the YAML again.

Because of its operational model, the Kubernetes ecosystem demands deep integration with its native CI/CD tools like Argo and Flux. Kubernetes engineers elect to use YAML with clunky tools like Kustomize and Helm because that is their best approach to reaping the benefits of the Kubernetes ecosystem without layers of abstraction between their configuration and the resources deployed to the cluster.

Yoke

While developing this project, I discovered Yoke, a new tool by David Desmarais-Michaud, which lets you define flights: compositions written in a typesafe language and compiled to WASM. Yoke then provides a CLI to deploy these to your Kubernetes cluster. I haven’t played with Yoke yet, but it looks cool to me, like a version of cdk8s with proper support for integrated GitOps and release versioning. Nonetheless, as far as I can see, Yoke doesn’t yet provide the ability to handle complex orchestration scenarios like the kinds we are going to discuss in the next section.

EDIT: Apparently it does, but this wasn’t clear to me when I wrote this or built this tool. https://github.com/yokecd/examples/tree/main/demos/dynamic-mode Either way, I prefer the aesthetic developer experience of TypeKro to Yoke, although it seems like David and I were bothered by very similar things.

The Growing Need for Better Kubernetes Orchestration Abstractions

These tooling challenges have become even more pressing as Kubernetes has evolved far beyond its original scope. Since around 2017, the Kubernetes community has extended the platform far beyond its original intent as a container orchestration platform.

With the introduction of Kubernetes operators, the Kubernetes control plane evolved from a container platform focused on managing Deployments and ReplicaSets to managing everything needed for your Kubernetes workload. From DNS records with ExternalDNS, to provisioning your AWS dependencies using the AWS Service Operator (and more recently using ACK), to provisioning your Azure resources using the Azure Service Operator, to the Cluster API project that emerged to operate other Kubernetes clusters, to Crossplane that extends the Kubernetes control plane to let you orchestrate all of your platform engineering components, even those outside of your Kubernetes cluster.

It should come as no surprise then that as the Kubernetes control plane has evolved from a container platform to a universal control plane, orchestration primitives have become vital to building complex workloads. After all, the Kubernetes control plane must be able to consume data about resources that it creates, even when it is creating resources outside of the boundary of the Kubernetes cluster.

Take an Amazon RDS database created with a Kubernetes operator, for instance. Deployments that depend on this database must be able to discover its connection string even though it is not known when the deployment manifest is applied.

The Kubernetes community has innovated to address these orchestration challenges.

Kubernetes Resource Orchestration with Crossplane

Perhaps the most well-known approach to orchestrating complex Kubernetes dependencies is the one taken by Crossplane.

For the uninitiated, Crossplane extends the Kubernetes API itself with custom resource definitions (CRDs) that represent cloud resources. You define composite resource definitions (XRDs) that describe the schema of your infrastructure abstractions, then create compositions that template the underlying managed resources.

Here's a simplified example of what a Crossplane composition looks like for creating a database with its required networking:

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: database-with-vpc
spec:
  compositeTypeRef:
    apiVersion: example.com/v1alpha1
    kind: XDatabase
  resources:
  - name: vpc
    base:
      apiVersion: ec2.aws.crossplane.io/v1beta1
      kind: VPC
      spec:
        forProvider:
          cidrBlock: "10.0.0.0/16"
          region: us-west-2
  - name: subnet
    base:
      apiVersion: ec2.aws.crossplane.io/v1beta1
      kind: Subnet
      spec:
        forProvider:
          availabilityZone: us-west-2a
          cidrBlock: "10.0.1.0/24"
          region: us-west-2
          vpcIdSelector:
            matchControllerRef: true
  - name: rds-instance
    base:
      apiVersion: rds.aws.crossplane.io/v1alpha1
      kind: RDSInstance
      spec:
        forProvider:
          dbInstanceClass: db.t3.micro
          engine: postgres
          dbSubnetGroupNameSelector:
            matchControllerRef: true

To understand how this works, you need to grasp Crossplane's approach to dependencies and cloud provider integration. The forProvider field contains the actual configuration that gets passed to the AWS API - essentially a direct mapping of the cloud provider's resource schema. The vpcIdSelector with matchControllerRef: true tells Crossplane to automatically populate the VPC ID field by finding another resource in the same composition that can provide it.

This selector-based dependency model is clever in theory - resources automatically wire themselves together through Kubernetes' controller reference system. But in practice, it can result in debugging nightmares when selectors don't match or when the dependency chain breaks.

Firstly, the YAML complexity becomes even more pronounced as you extend the Kubernetes YAML DSL with Crossplane’s own DSL. That simple example above already feels unintuitive to me, and real-world compositions often span hundreds of lines with cryptic field paths like spec.forProvider.vpcSecurityGroupIds[0]. Debugging failures requires correlating errors across multiple managed resources, often with minimal context about which part of your composition is actually failing.

Secondly, while Crossplane introduced composition functions to address some of these limitations, these functions are often not sufficient for complex orchestration needs. If, for example, you want to conditionally create resources based on input parameters, you'll need to write a composition function - essentially a containerized program that transforms your composite resource into managed resources. This means your "declarative" infrastructure now includes imperative code running in your cluster.

The fundamental issue is that Crossplane has taken the Kubernetes paradigm of "everything is YAML" and applied it to problems that don't naturally fit that model. Complex infrastructure orchestration often requires conditional logic, loops, and data transformations that are painful to express in YAML templating, even with composition functions.

Kubernetes Resource Orchestration with KRO

In December of 2024, however, Amazon released KRO, a decoupled resource orchestrator for Kubernetes.

While KRO does not yet have a stable release and it has only been available for a short while, the KRO project marks a rare collaboration between the cloud giants, with its recent backing from Azure and GCP, all of whom seem to have bought into its philosophy. Its simplicity seems to have really hit home, and not just with me, but with Kubernetes engineers everywhere.

This orchestrator allows users to register ResourceGraphDefinitions with the Kubernetes control plane. These RGDs are essentially factories that describe the schema of the composition, and the relationships between the resources within the composition.

A simple example (taken from the KRO docs) is the following DeploymentService definition:

apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: deploymentservice
spec:
  schema:
    apiVersion: v1alpha1
    kind: DeploymentService
    spec:
      name: string
  resources:
    - id: deployment
      template:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: ${schema.spec.name}
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: deployment
          template:
            metadata:
              labels:
                app: deployment
            spec:
              containers:
                - name: ${schema.spec.name}-deployment
                  image: nginx
                  ports:
                    - containerPort: 80
    - id: service
      template:
        apiVersion: v1
        kind: Service
        metadata:
          name: ${schema.spec.name}
        spec:
          selector:
            app: deployment
          ports:
            - protocol: TCP
              port: 80
              targetPort: 80

With the above RGD registered with KRO, it becomes possible to create DeploymentService instances as follows:

apiVersion: kro.run/v1alpha1
kind: DeploymentService
metadata:
  name: my-app
spec:
  name: web-app

Upon the successful registration of the my-app DeploymentService instance, the KRO operator will watch for changes to the placeholder fields and dynamically ensure that the service and deployment resources remain synchronized.
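As a rough mental model of what those placeholder fields do, the sketch below substitutes ${schema.spec.<field>} placeholders with values from an instance spec. This is a deliberate simplification and not how KRO is implemented: KRO evaluates CEL expressions inside the cluster, while resolvePlaceholders is a hypothetical helper that only handles the one placeholder pattern used in the example above.

```typescript
// Simplified stand-in for KRO's CEL evaluation: only handles ${schema.spec.<field>}.
function resolvePlaceholders(
  template: string,
  spec: Record<string, string | number>,
): string {
  return template.replace(
    /\$\{schema\.spec\.(\w+)\}/g,
    (placeholder: string, field: string) => {
      const value = spec[field];
      // Leave unknown fields untouched so the gap stays visible.
      return value === undefined ? placeholder : String(value);
    },
  );
}

// The spec of the my-app instance from the example above.
const instanceSpec = { name: 'web-app' };

console.log(resolvePlaceholders('${schema.spec.name}', instanceSpec)); // web-app
console.log(resolvePlaceholders('${schema.spec.name}-deployment', instanceSpec)); // web-app-deployment
```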

Yet even KRO, promising as it is, still requires engineers to write and maintain YAML configurations. Additionally, like Crossplane, it does not natively support complex control flow constructs like loops. The fundamental question remains: how can we bring the benefits of modern programming languages to Kubernetes development while preserving the declarative, GitOps-friendly workflows that make the ecosystem so powerful?

Enter TypeKro

I’ve pondered the Kubernetes developer experience and the problem of Kubernetes runtime dependency graph resolution for a while now. As such, when I first discovered KRO, I became very excited. You see, KRO represented a stateful API I could submit my resource dependency graph to, and it would take care of the continuous reconciliation of that resource graph’s dependencies. To solve Kubernetes YAML hell, I could build an expressive SDK to allow engineers to define their resource graphs as traditional software factories, using static types and TypeScript generics. Engineers could then use these factories to provision instances, passing statically typed input variables to these factories. The factories compile to KRO ResourceGraphDefinitions, and the instances compile to KRO instance custom resources.

This is not a small undertaking, and while I’ve been mulling over this idea for a while, I never had the time to put fingers to keys and hammer out a good solution.

Until the Kiro hackathon, that is. With Kiro’s release (Amazon’s new agentic IDE), I found a reason to execute this vision, and a workflow that would let me build something substantial and high quality provided I did the right work up front. Here is what I have been doing before and after work and during weekends for the last 6 weeks, leading to many a late night and very little sleep.

A Quick Intro

TypeKro is a framework aimed at compressing the development feedback loop for Kubernetes resource orchestration. At a high level, TypeKro allows you to build resource graphs, create factories from these resource graphs, and use these factories to deploy instances by passing in statically typed input. Using magical proxies, TypeKro lets you access fields on other Kubernetes resources that haven’t been created yet as though they already exist, providing a seamless developer experience. Behind the scenes, the library replaces these simple types with reference types that are resolved at runtime.

Resource graphs provide two factories that users can choose from if they want to deploy instances:

Users can use a kro factory, which deploys a KRO ResourceGraphDefinition during its instantiation. This is the recommended approach for deploying production workloads. In this deployment model the Kubernetes control plane is responsible for orchestrating the deployment of the Kubernetes resources that are created with this factory.

The same resource graphs are also deployable using a direct factory. This factory mode lets you deploy to Kubernetes clusters where the KRO operator is not installed. It works similarly to Pulumi’s model and resolves dependencies within the JavaScript runtime. This mode is usable if you need to quickly test resource graphs and don’t have access to a cluster with KRO installed. It is also necessary if you want to use this framework to deploy the KRO controller itself.

The TypeKro Magic Proxy Experience

At the heart of TypeKro’s developer experience is what I call the magic proxy. It is this ES6 proxy that allows you to access fields which will only exist after deployment as though they are already available, enabling the declarative Kubernetes user experience of YAML with the added static type checking performed by the TypeScript compiler.

You can access properties on TypeScript objects and have it just work, even though the deployment object hasn't been created in a Kubernetes cluster yet. So, how, you might ask, does this magic work without breaking TypeScript's static type safety?

When you define a resource using a TypeKro factory, you aren't getting a plain JavaScript object back. Instead, you receive a special proxy object that wraps your resource definition. This proxy looks and feels exactly like the real resource to TypeScript and your editor, so you get all the benefits of autocomplete and type-checking.

However, when you access a property on this proxy, something special happens. For example, when your code accesses deployment.metadata.labels, the proxy intercepts this request. Instead of returning a value of type T (which at deployment time has not been resolved by the Kubernetes control plane yet), it generates a special reference object, a KubernetesRef<T>. This object is essentially a structured piece of data that says, "I am a reference to the field that Kubernetes will resolve in the future".

This unified way of defining resource relationships is what makes TypeKro so expressive and versatile: the abstract graph of references, which just looks like a plain JavaScript object at development time, can be interpreted in two different ways depending on your chosen deployment strategy.

Option 1: The 'kro' Factory:

When you choose to leverage the 'kro' factory type, your graph of resources and its KubernetesRef objects are serialized into a KRO ResourceGraphDefinition manifest. The TypeKro engine processes every reference, converting an object like deployment.metadata.labels into a Common Expression Language (CEL) string: ${deployment.metadata.labels}. This YAML is then applied to the cluster, and the in-cluster KRO operator becomes responsible for resolving these expressions at runtime and reconciling the resources.
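The idea of a proxy that records the path you walk, and is later rendered as a CEL-style placeholder, can be sketched in a few lines of TypeScript. This is an illustration under assumptions and not TypeKro's actual implementation: magicProxy, toCel, RefInfo, and DeploymentShape are hypothetical names, and the real KubernetesRef<T> carries more information than a resource id and a field path.

```typescript
// Hypothetical shape; TypeKro's real KubernetesRef<T> is richer than this.
interface RefInfo {
  resourceId: string;
  fieldPath: string;
}

// Private key the proxy answers to when asked what it refers to.
const REF_INFO = Symbol('refInfo');

// Returns a proxy typed as T. Every property access yields another proxy that
// remembers the path walked so far instead of returning a real value.
function magicProxy<T>(resourceId: string, path: string[] = []): T {
  const info: RefInfo = { resourceId, fieldPath: path.join('.') };
  const handler: ProxyHandler<object> = {
    get(_target, prop) {
      if (prop === REF_INFO) return info;
      if (typeof prop === 'symbol') return undefined;
      return magicProxy<unknown>(resourceId, [...path, prop]);
    },
  };
  return new Proxy({}, handler) as unknown as T;
}

// Render a proxied field access as a CEL-style placeholder string.
function toCel(ref: unknown): string {
  const info = (ref as { [REF_INFO]: RefInfo })[REF_INFO];
  const suffix = info.fieldPath ? '.' + info.fieldPath : '';
  return '${' + info.resourceId + suffix + '}';
}

// The compiler sees ordinary field types; the runtime sees recorded paths.
interface DeploymentShape {
  metadata: { labels: Record<string, string> };
  status: { readyReplicas: number };
}

const deploymentRef = magicProxy<DeploymentShape>('deployment');
console.log(toCel(deploymentRef.metadata.labels)); // ${deployment.metadata.labels}
console.log(toCel(deploymentRef.status.readyReplicas)); // ${deployment.status.readyReplicas}
```

The cast to T is what lets the editor offer autocomplete on fields that do not exist yet, while the get trap keeps enough information to emit a reference later.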

You can optionally elect to wait for Kubernetes reconciliation and receive the resolved Kubernetes values through a simple JavaScript Promise once they become available.

Option 2: The 'direct' Factory:

When you use the 'direct' factory for local development, that same graph is interpreted differently. Instead of generating CEL expressions to be processed in the cluster, the DirectDeploymentEngine uses a DependencyResolver to inspect the very same KubernetesRef objects and build a directed acyclic graph from them.

This graph is then topologically sorted to produce a step-by-step deployment plan, ensuring that independent resources are processed before the resources that depend on them. The engine deploys the resources to the cluster in that order, waiting for unresolved values to be hydrated as Kubernetes processes each resource.
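For readers who want to see the ordering step spelled out, here is a minimal topological sort (Kahn's algorithm) in TypeScript. It is a generic sketch and not the code inside TypeKro's DependencyResolver: the DependencyGraph type and deploymentOrder function are hypothetical, and the graph is written out by hand rather than derived from reference objects.

```typescript
// Maps each resource id to the ids it depends on.
type DependencyGraph = Record<string, string[]>;

function deploymentOrder(graph: DependencyGraph): string[] {
  const pending = new Map<string, Set<string>>();
  for (const id of Object.keys(graph)) {
    for (const dep of graph[id]) {
      if (!Object.prototype.hasOwnProperty.call(graph, dep)) {
        throw new Error(`Unknown dependency "${dep}" for "${id}"`);
      }
    }
    pending.set(id, new Set(graph[id]));
  }

  const order: string[] = [];
  while (pending.size > 0) {
    // Resources whose dependencies have all been deployed are ready now.
    const ready = Array.from(pending.keys())
      .filter((id) => pending.get(id)!.size === 0)
      .sort();
    if (ready.length === 0) throw new Error('Dependency cycle detected');
    for (const id of ready) {
      order.push(id);
      pending.delete(id);
    }
    // Mark the freshly deployed resources as satisfied for everyone still waiting.
    pending.forEach((deps) => ready.forEach((id) => deps.delete(id)));
  }
  return order;
}

console.log(
  deploymentOrder({
    service: ['deployment'],
    deployment: ['configMap'],
    configMap: [],
  }),
); // [ 'configMap', 'deployment', 'service' ]
```

A cycle leaves no resource ready, which is why the sketch throws instead of looping forever.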

An Example to Showcase the Developer Experience

“Show me the code!” I hear you demand. Well, okay then:

import { type } from 'arktype';
import { kubernetesComposition, Cel } from 'typekro';
import { Deployment, Service } from 'typekro/simple';

const webapp = kubernetesComposition(
  {
    name: 'webapp',
    apiVersion: 'example.com/v1',
    kind: 'WebApp',
    spec: type({ replicas: 'number' }),
    status: type({ ready: 'boolean' })
  },
  (spec) => {
    const deployment = Deployment({
      name: 'webapp',
      image: 'nginx',
      replicas: spec.replicas
    });
    
    const service = Service({
      name: 'webapp-service',
      selector: { app: 'webapp' },
      ports: [{ port: 80 }]
    });

    return {
      ready: Cel.expr<boolean>(deployment.status.readyReplicas, ' > 0')
    };  
  }
);

await webapp.factory('direct').deploy({ replicas: 3 });

First, we define our component's interface using ArkType schemas. This provides both compile-time TypeScript validation and runtime schema validation. Next, we create the resource graph using the two-parameter kubernetesComposition API provided by TypeKro. The first parameter describes the resource graph and its input and output types. The second is a function that lets you assemble the resources in your composition.

Because TypeKro wants to provide you versatility in your workflow, the same resource graph can be deployed using completely different strategies.

Direct Deployment provides immediate, client-side deployment similar to tools like Pulumi:

const directFactory = webapp.factory('direct');
await directFactory.deploy({
  replicas: 2,
});

KRO Deployment leverages the Kubernetes Resource Orchestrator for dependency resolution and runtime intelligence managed by the Kubernetes control plane:

const kroFactory = await webapp.factory('kro');
await kroFactory.deploy({
  replicas: 3,
});

YAML Generation produces deterministic YAML output that can be used in GitOps workflows:

const yaml = kroFactory.toYaml();
console.log('Generated ResourceGraphDefinition:', yaml);

How TypeKro works: RefOrValue Type Architecture

The entire TypeKro architecture rests on this type union:

type RefOrValue<T> = T | KubernetesRef<T> | CelExpression<T>

The RefOrValue<T> type union is the foundational contract that enables every composition function in TypeKro to work seamlessly with static values, schema references, and complex expressions without the developer needing to think about the distinction.

Every parameter in every TypeKro factory function accepts RefOrValue<T>. Whether you're passing a static string like "my-app", a schema reference like schema.spec.name, or a CEL expression like Cel.template("prefix-%s", schema.spec.name), the composition function handles it transparently.

We tell the compiler to view any RefOrValue<T> as its base type T, so that the static type system sees the natural types developers expect (string, number, etc.), while the runtime system can handle the complexity of reference resolution and expression evaluation. This enables the seamless developer experience while preserving the power of declarative resource orchestration.

The implications of this design become apparent when you consider how it enables TypeKro's versatility. The same function call that accepts schema.spec.name generates a CEL expression ${schema.spec.name} for KRO deployment but resolves as an actual string value for direct deployment without changing user code.
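A stripped-down sketch of that dual interpretation might look like the following. Everything here is an assumption made for illustration: the KubernetesRef and CelExpression shapes are simplified stand-ins for TypeKro's real types, and forKro, forDirect, and LiveState are hypothetical helpers rather than part of the library.

```typescript
// Hypothetical, simplified shapes; the real TypeKro types carry more detail.
interface KubernetesRef<T> {
  kind: 'ref';
  resourceId: string;
  fieldPath: string;
  _type?: T; // phantom field so T is part of the type
}
interface CelExpression<T> {
  kind: 'cel';
  expression: string;
  _type?: T;
}
type RefOrValue<T> = T | KubernetesRef<T> | CelExpression<T>;

function isRef(value: unknown): value is KubernetesRef<unknown> {
  return typeof value === 'object' && value !== null
    && (value as { kind?: unknown }).kind === 'ref';
}
function isCel(value: unknown): value is CelExpression<unknown> {
  return typeof value === 'object' && value !== null
    && (value as { kind?: unknown }).kind === 'cel';
}

// KRO-style interpretation: anything unresolved becomes a CEL placeholder.
function forKro(value: RefOrValue<string>): string {
  if (isRef(value)) return '${' + value.resourceId + '.' + value.fieldPath + '}';
  if (isCel(value)) return '${' + value.expression + '}';
  return value;
}

// Direct-style interpretation: look references up in state read from the cluster.
type LiveState = Record<string, Record<string, string>>;
function forDirect(value: RefOrValue<string>, live: LiveState): string {
  if (isRef(value)) {
    const resolved = live[value.resourceId]?.[value.fieldPath];
    if (resolved === undefined) {
      throw new Error(`${value.resourceId}.${value.fieldPath} is not resolved yet`);
    }
    return resolved;
  }
  if (isCel(value)) throw new Error('CEL evaluation is out of scope for this sketch');
  return value;
}

const nameRef: RefOrValue<string> = { kind: 'ref', resourceId: 'schema', fieldPath: 'spec.name' };
console.log(forKro(nameRef)); // ${schema.spec.name}
console.log(forDirect(nameRef, { schema: { 'spec.name': 'my-app' } })); // my-app
console.log(forKro('my-app')); // my-app
```

The calling code passes the same value in both cases; only the interpreter at the end of the pipeline changes.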

The $ Prefix: Known vs Unknown Value Resolution

TypeKro’s DSL design philosophy is straightforward: known values should resolve statically, unknown values should resolve to references.

Known values resolve statically because TypeKro can determine them at execution time:

const deployment = simpleDeployment({
  name: 'my-app',           // Known: literal string
  replicas: 3,              // Known: literal number
  image: 'nginx:latest'     // Known: literal string
});

Unknown values become references because they won't exist until runtime:

const deployment = simpleDeployment({
  name: schema.spec.name,   // Unknown: becomes KubernetesRef<string>
  replicas: schema.spec.replicas,  // Unknown: becomes KubernetesRef<number>
});

// Status fields are always unknown - they don't exist until after deployment
const statusUrl = deployment.status.loadBalancer.ingress[0].ip; // Unknown: becomes KubernetesRef<string>

Notice how schema and status references don't require optional chaining (?.) - TypeKro uses a type modifier to enhance bare-bones Kubernetes client resource types to treat status fields that are accessed as non-optional within the graph definition since they're guaranteed to be references that will resolve at runtime.
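A type modifier of this kind can be approximated with a recursive mapped type. The sketch below is an assumption about the general technique, not TypeKro's actual definition: DeepRequired, RawStatus, and GraphStatus are hypothetical names, and RawStatus only imitates the optional-everywhere style of the generated client types.

```typescript
// Recursively strip optionality so nested fields can be accessed without ?.
type DeepRequired<T> = T extends (infer U)[]
  ? DeepRequired<U>[]
  : T extends object
    ? { [K in keyof T]-?: DeepRequired<NonNullable<T[K]>> }
    : T;

// Imitates a client-style status type where every field is optional.
interface RawStatus {
  loadBalancer?: {
    ingress?: { ip?: string }[];
  };
}

// Inside a graph definition, the same shape with every field required.
type GraphStatus = DeepRequired<RawStatus>;

const sampleStatus: GraphStatus = {
  loadBalancer: { ingress: [{ ip: '10.0.0.1' }] },
};

// Compiles without optional chaining because nothing is optional anymore.
const firstIp: string = sampleStatus.loadBalancer.ingress[0].ip;
console.log(firstIp); // 10.0.0.1
```

With the raw type, the same access would need ?. at every step; the modifier moves that burden from the author of the graph to the type system.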

This works perfectly for schema references and status fields - they're clearly unknown values that must become references.

The challenge arises when you want to reference a field on a resource you just defined:

const configMap = simpleConfigMap({
  data: { apiUrl: 'https://api.example.com' }
});

const deployment = simpleDeployment({
  env: {
    // Known: resolves statically to 'https://api.example.com'
    API_URL: configMap.data.apiUrl,
    // Unknown: use this line instead to read whatever's in the cluster
    // API_URL: configMap.data.$apiUrl,
  }
});

The $ prefix is how you explicitly opt out of static resolution for values that TypeKro could otherwise resolve immediately. It forces "unknown" semantics on values that would otherwise be treated as "known."
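One way to picture the $ prefix is as a second branch in a proxy's get trap. The following is a minimal sketch under assumptions, not TypeKro's implementation: FieldRef and withDollarRefs are hypothetical names, and the sketch only covers a flat map of string values such as a ConfigMap's data.

```typescript
// Hypothetical reference shape used only for this sketch.
interface FieldRef {
  kind: 'ref';
  resourceId: string;
  fieldPath: string;
}

// Wrap known values so that `name` returns the literal and `$name` returns a
// reference to the same field as it will exist in the cluster.
function withDollarRefs(
  resourceId: string,
  basePath: string,
  known: Record<string, string>,
): Record<string, string | FieldRef> {
  const handler: ProxyHandler<Record<string, string>> = {
    get(target, prop) {
      if (typeof prop !== 'string') return undefined;
      if (prop.charAt(0) === '$') {
        const ref: FieldRef = {
          kind: 'ref',
          resourceId,
          fieldPath: `${basePath}.${prop.slice(1)}`,
        };
        return ref;
      }
      return target[prop];
    },
  };
  return new Proxy(known, handler) as Record<string, string | FieldRef>;
}

const configMapData = withDollarRefs('configMap', 'data', {
  apiUrl: 'https://api.example.com',
});

console.log(configMapData.apiUrl); // https://api.example.com
console.log(configMapData.$apiUrl); // { kind: 'ref', resourceId: 'configMap', fieldPath: 'data.apiUrl' }
```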

Deployment Strategy Architecture

When I first started building TypeKro, I just built the KRO deployment mode. KRO was the whole point.

But there was an obvious chicken-and-egg problem: how do you deploy KRO itself with a tool that requires KRO to be installed? So I built direct deployment mode to bootstrap KRO clusters.

Once I had direct deployment working, I kept finding other legitimate use cases. Teams that wanted GitOps workflows. Local development where spinning up KRO was overkill. Testing scenarios where I needed immediate feedback. Teams that wanted to mix TypeKro resources with existing YAML files using yamlFile() and yamlDirectory() factories. Teams that needed Helm chart integration with helmRelease() factories to consume third-party applications.

The bootstrap composition became particularly useful: you can deploy a complete runtime environment with Flux CD and KRO using direct mode, then switch to KRO mode for your application workloads, and use helm factories to consume entire third-party ecosystems like monitoring stacks or databases alongside your custom resources.

So now the same resource graph works across deployment modes - direct for bootstrapping and development, KRO for production orchestration, YAML generation for GitOps workflows. Each mode optimized for its use case instead of forcing everyone into the same pattern.

CRD Bootstrap Timing Intelligence

One pain point I kept hitting was CRD timing errors. You deploy a custom resource and get "CRD not found" because the CustomResourceDefinition hasn't been established yet. Most tools make you manually sequence CRD deployment or add retry logic.

I built automatic CRD establishment detection into the direct deployment engine. When TypeKro encounters a custom resource, it checks if it's a built-in Kubernetes resource. If not, it finds the corresponding CRD and waits for Established: True before deploying the instance.

This happens transparently. Your deployments work without "CRD not found" errors, even when deploying CRDs and their instances in the same resource graph.
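The core of that check is small. A CustomResourceDefinition reports an Established condition in its status, so a sketch of the wait could look like the code below. The CrdLike type, isEstablished, and waitForEstablished are hypothetical names for illustration, and fetchCrd is a caller-supplied function standing in for whatever client call reads the CRD from the cluster; none of this is TypeKro's actual code.

```typescript
interface CrdCondition {
  type: string;
  status: 'True' | 'False' | 'Unknown';
}

// Only the parts of a CustomResourceDefinition this sketch needs.
interface CrdLike {
  metadata: { name: string };
  status?: { conditions?: CrdCondition[] };
}

// A CRD is usable once it reports the Established condition as True.
function isEstablished(crd: CrdLike): boolean {
  const conditions = crd.status?.conditions ?? [];
  return conditions.some((c) => c.type === 'Established' && c.status === 'True');
}

// Poll until the CRD is established or the attempts run out.
async function waitForEstablished(
  fetchCrd: () => Promise<CrdLike>,
  attempts = 30,
  delayMs = 1000,
): Promise<void> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    if (isEstablished(await fetchCrd())) return;
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error('CRD was not established in time');
}

const pendingCrd: CrdLike = { metadata: { name: 'widgets.example.com' } };
const readyCrd: CrdLike = {
  metadata: { name: 'widgets.example.com' },
  status: { conditions: [{ type: 'Established', status: 'True' }] },
};
console.log(isEstablished(pendingCrd)); // false
console.log(isEstablished(readyCrd)); // true
```

A deployment engine would call waitForEstablished before creating any instance of the custom resource, and skip the wait entirely for built-in kinds.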

Raw Kubernetes Client Integration

TypeKro uses raw @kubernetes/client-node types to ensure full compatibility with the Kubernetes ecosystem. No custom abstractions or simplified wrappers that break integration with existing tooling.

But raw Kubernetes types are verbose and complex. So TypeKro wraps them in simple factory functions like simpleDeployment() and simpleService() that expose the most common configuration patterns while preserving access to the full API surface underneath.

This approach gives you both accessibility for common use cases and full power when you need it, without sacrificing compatibility with kubectl, client-go, or other Kubernetes tools.

Deployment Event Streaming

One of the painful issues with current deployment approaches is that it’s difficult to see what’s happening in the Kubernetes control plane while you’re deploying. To further shorten the feedback loop, so you don’t have to keep kubectl watching your Kubernetes events for state changes, I built an integration into TypeKro that lets you stream control plane events that are relevant to your deployment from the cluster into your TypeKro logs. All you need to do is set TYPEKRO_LOG_LEVEL to debug in your environment and you can see what’s going on in your Kubernetes cluster as the control plane attempts to reconcile the state of your Kubernetes resources.

Resource Readiness Evaluation

TypeKro ships with a bunch of resource types that are enhanced with a readiness evaluator. The TypeKro runtime polls the Kubernetes control plane for the status of deployed resources. It then evaluates those statuses against the readiness evaluator to determine whether each resource has stabilized in the Kubernetes control plane. If you want to ensure your deployments are successful, you can pass a wait flag that tells TypeKro to ensure resources are ready before letting you know your deployment is complete.
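To give a feel for what a readiness evaluator does, here is a small sketch for a Deployment. The names ReadinessResult, ReadinessEvaluator, DeploymentLike, and deploymentReadiness are hypothetical, and the rule used (ready replicas must reach the desired count) is one reasonable choice rather than the rule TypeKro necessarily applies.

```typescript
interface ReadinessResult {
  ready: boolean;
  reason: string;
}

// An evaluator inspects the latest observed resource and reports readiness.
type ReadinessEvaluator<R> = (resource: R) => ReadinessResult;

// Only the fields of a Deployment this sketch looks at.
interface DeploymentLike {
  spec?: { replicas?: number };
  status?: { readyReplicas?: number };
}

const deploymentReadiness: ReadinessEvaluator<DeploymentLike> = (deployment) => {
  // Kubernetes defaults a Deployment to one replica when none is specified.
  const desired = deployment.spec?.replicas ?? 1;
  const readyCount = deployment.status?.readyReplicas ?? 0;
  return readyCount >= desired
    ? { ready: true, reason: `${readyCount}/${desired} replicas ready` }
    : { ready: false, reason: `waiting: ${readyCount}/${desired} replicas ready` };
};

console.log(deploymentReadiness({ spec: { replicas: 3 }, status: { readyReplicas: 1 } }));
console.log(deploymentReadiness({ spec: { replicas: 3 }, status: { readyReplicas: 3 } }));
```

A polling loop would call the evaluator with each fresh status it reads and stop once ready is true.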

State Management in TypeKro

TypeKro is not responsible for state management; it relies on the Kubernetes control plane to describe a program’s current state. But not being responsible for state management does not equate to not supporting state management. I wanted TypeKro to work seamlessly with any infrastructure-as-code tool that you already use, and so it is extensible.

I have been really enjoying Sam Goodwin's alchemy, a new lightweight infrastructure-as-code library for TypeScript that focuses on simplicity and a direct-to-API approach.

I built an integration so I could manage Kubernetes resources as part of my alchemy stacks. When you pass an alchemy Scope into your kro factory constructor options, it will register your KRO resource graph definition and instances as alchemy resources within the provided scope.

If you pass an alchemy Scope to your direct factory options, each Kubernetes resource created with your factory will be individually registered with that alchemy scope.

Passing an alchemyScope as an input will also allow you to consume fields on other alchemy resources in your scope and enable other alchemy resources to depend upon the properties of the Kubernetes resources you are deploying.

So long and thanks for all the fish

This has been a long ride through the technical architecture of TypeKro, so congratulations if you made it this far.

If you're someone who likes Kubernetes but dislikes YAML, I do think TypeKro provides a differentiated experience that you won't get anywhere else. But this is just the beginning.

The real challenge lies ahead: extending TypeKro to handle the complex dependency workflows that current tooling struggles with. Crossplane and 3rd party resources with their intricate composition dependencies. Cloud controllers that need to coordinate AWS, Azure, and GCP resources with Kubernetes workloads. Multi-cluster deployments where resources span infrastructure boundaries.

These scenarios break most existing tools because they require dependency graphs that cross platform boundaries, runtime state coordination between different control planes, and orchestration patterns that go beyond simple resource creation. The type system and deployment architecture we've built positions TypeKro to tackle these problems.

The Kubernetes ecosystem is vast, and it has a large surface area. While I've spent time covering support for many commonly used Kubernetes tools, I cannot cover the whole surface area of the ecosystem myself. I'm releasing TypeKro as an Apache 2.0 licensed open-source project, so I hope you'll come build with me.

The future of infrastructure orchestration isn't just about replacing YAML with TypeScript. It's about building systems that can handle the complexity of modern multi-cloud, multi-cluster deployments while preserving the developer experience that makes you productive. That's the challenge I'm excited to tackle next.

Please give it a try and share your thoughts!

https://yehudacohen.substack.com/p/introducing-typekro
Developing with Kiro: Amazon's New Agentic IDE
My experiences testing Kiro with TanStack Start + React, Spring Boot + Angular, Open Source Dev Tools, and Internal Dev Tools

One of the coolest things about being part of AWS's Community Builders program is that we occasionally get early access to new products. With today's public preview, available at kiro.dev, I am excited that you will have the opportunity to try out this new development experience. Like other agentic IDEs that I've worked with, Kiro still feels early, but after extensive testing there is no question in my mind that it has already multiplied my productivity. I'd like to introduce you to this new development tool, help you understand how it differs from other VS Code forks and agentic IDEs, describe my experiences developing with Kiro, and share some tips on how to get the most out of the IDE.


I'll be upfront about my experience: Kiro has genuinely changed how I approach development work, but we haven't yet arrived at the AGI magic bullet that renders software engineers obsolete. What I found interesting is how it forced me to think differently about the development process itself. Instead of jumping straight into code, I found myself spending more time articulating what I actually wanted to build and making high-level architectural choices.

Kiro shows code diff view with chat for building features.

During my testing, I built a complete TanStack Start portfolio website from scratch in a few hours without writing a single line of code. I contributed substantial pull requests to open source projects like Alchemy.run, with Kiro generating roughly 80% of the implementation. I used it to help onboard to complex Spring Boot + Angular projects and develop internal command-line tools. But each of these experiences also taught me about Kiro's limitations—when it gets stuck in loops, when it needs more guidance, and when traditional development approaches are still faster.

The reality is that Kiro requires a different kind of project management. You're not just writing code anymore; you're steering an AI that can get overwhelmed by complexity, sometimes prefers workarounds over root cause analysis, and occasionally needs to be told explicitly not to move on until issues are actually fixed. It's powerful, but it's not hands-off.

In this review, I'll walk you through my real-world experiences using Kiro across multiple technology stacks and project types. I'll share what worked, what didn't, and the strategies I developed for getting the most out of it. Whether you're considering adopting Kiro or just curious about where development tooling is heading, I want to give you an honest look at both the promise and the reality of working with Amazon's entry into the agentic IDE space.

Is Kiro Really Different?

I've been using various AI-powered development tools since GitHub Copilot was released a couple of years ago. Each tool has carved out its own niche in the development workflow: Copilot excels at enhancing your typing speed with intelligent code completion, while Cursor excels at debugging and at helping you implement discrete tasks, and has recently been pushing further into agentic territory.

In fact, many developers are already trying to use Cursor for complex, multi-step workflows by having it update and refer to requirements files, essentially trying to create their own spec-driven development process. The problem is that these tools weren't really designed for that kind of autonomous execution. They're great at the task you're currently working on, but they tend to drift when you need them to follow a complex plan over multiple sessions.

Kiro feels different because it was built from the ground up to handle this kind of work. Instead of trying to hack together a spec-driven workflow, Kiro already knows how to help you build and adhere to a specification and develop without deviating too far from the plan. You can tell it how to implement something, create complex plans, and then let it run largely by itself, filling in the implementation details.

The shift is subtle but significant. With Cursor, I'm constantly course-correcting and re-explaining context. With Kiro, I spend more time upfront articulating what I want to build, but then I can step back and let it execute. It's the difference between being a hands-on manager who needs to check every detail versus setting clear expectations and trusting the process.

But I want to be clear about something: this isn't magic, and for now, there are still times when your coding skills will be required. Sometimes Kiro is simply unable to solve your bugs for you. At those points you need to solve them yourself or guide Kiro very explicitly on the approach to take. You need to develop a sense for when to let it run and when to intervene or take over control.

The comparison that keeps coming to mind is the evolution from text editors to IDEs. We didn't abandon text editors because they were bad—we moved to IDEs because they let us think at a higher level of abstraction. Kiro feels like it might be the next step in that evolution, though it's still early days and the product does have some rough edges.

Real-World Experiences

Building with TanStack Start + React

With that introduction to Kiro out of the way, let me walk you through my actual experiences using it across different technology stacks. My first real test of Kiro for web development was building a portfolio website from scratch using TanStack Start and React. I wanted to see if it could handle a complete project end-to-end with new tooling that is probably not in its training data. I've wanted a portfolio website for a while but haven't built it because it means frontend development, and I don't love frontend work. At least I didn't until now.

The process started with me describing what I wanted: a clean, professional portfolio site with project showcases, a blog section, and contact information. I mentioned I wanted to use TanStack Start for the framework and shadcn for the component library, but beyond that, I was pretty vague about the implementation details.

I was quite impressed by what happened next. Kiro didn't just start writing code—it created a requirements document, a design document, and a detailed task list. This spec-driven approach meant I could review and refine what we were building before any code was written. I made a few tweaks to the requirements (added some specific sections I wanted, adjusted the visual direction), and then Kiro got to work.

The execution was largely autonomous. I had to set up a couple of MCP servers first—one for DuckDuckGo to help it search for documentation when it got stuck, and another for GitHub access. But once that was configured, Kiro handled the heavy lifting. It scaffolded the project structure, implemented the routing, created the components, and even handled the CSS styling.

What was particularly clever was how it handled content generation. Rather than me having to write all the portfolio content myself, I simply pointed Kiro to use the MCP server integrations to scrape the data it needed about me from a couple of links. It pulled information about my projects, background, and experience automatically, then structured it appropriately for the site.

Here's where it got interesting: I know very little about modern CSS, especially the kind of complex layouts and animations you see in professional websites. But Kiro managed to create something that looked genuinely polished. When I pointed out rendering issues or asked for visual tweaks, it understood what I meant and made the corrections without me having to explain the technical implementation.

The whole process took about four hours, and I wrote exactly zero lines of code. More importantly, when I wanted to make changes later—adding new sections, tweaking the design, or fixing bugs—I could describe what I wanted in plain English, and Kiro would implement it correctly.

That said, it wasn't completely hands-off. I had to interrupt the process a few times when Kiro started going down the wrong path or when it needed clarification about specific requirements. The key was learning to be specific about what I wanted upfront and not being afraid to course-correct when needed.

Spring Boot + Angular

The portfolio website showed Kiro's strength in building from scratch. My experience with Spring Boot + Angular projects has been more limited so far, but it has shown me another side of the tool's capabilities, particularly around project onboarding and generating insights about existing codebases.

When I pointed Kiro at a complex Spring Boot + Angular project I needed to contribute to, it didn't just read through the code. It generated comprehensive steering documentation that broke down the project's architecture, identified key patterns and conventions, and even highlighted potential areas for improvement. This kind of project analysis would normally take me hours of manual exploration, but Kiro delivered it in minutes.

The onboarding assistance was genuinely helpful. Instead of spending time trying to understand how different modules connected or what the data flow looked like, I could ask Kiro specific questions about the codebase and get detailed, contextual answers. It understood not just what the code did, but why certain architectural decisions were made and how they fit into the broader system design.

However, this is also where I encountered some of Kiro's limitations. When development environment issues arose (things like dependency conflicts, configuration problems, or credential setup), Kiro would sometimes get stuck in loops. It would try the same approaches repeatedly rather than stepping back to diagnose the root cause. In these situations, I found I could use Kiro's insights to understand the problem myself and then either fix it manually or guide Kiro very explicitly through the solution.

What's interesting is that even when Kiro couldn't solve environment issues directly, its analysis of the codebase was still valuable enough that I could resolve problems independently using the context it provided. It's like having a very knowledgeable colleague who can explain the system architecture but might need help with the practical setup details.

I'm planning to do more extensive testing with Spring Boot projects specifically, but so far the pattern seems to be that Kiro excels at understanding and explaining complex systems, even when it struggles with the operational aspects of getting them running.

Open Source Development: Contributing to alchemy

One of my most challenging tests of Kiro was contributing to alchemy, an open source Infrastructure-as-Code library that lets you manage cloud resources using pure TypeScript. Unlike tools like Terraform or Pulumi, Alchemy runs entirely in JavaScript runtimes with zero dependencies, making infrastructure management feel more like regular application development. I am a huge fan of its approach, and while it's still early days for the project, it has picked up really good traction.

The pull request I ended up submitting (https://github.com/sam-goodwin/alchemy/pull/657) was a substantial 8000 lines of code, and Kiro generated roughly 80% of that. But getting there required a completely different approach than the portfolio website project. This is where I really learned about the importance of spec-driven development with Kiro.

My initial spec for the changes I wanted to make was not detailed or clear enough for Kiro to be effective, and it wasn't a great experience. Kiro wouldn't follow the project's contribution guidelines, struggled with the async nature of resource management, and kept implementing solutions that didn't fit Alchemy's patterns. I spent hours going in circles, with Kiro repeatedly making the same mistakes. When it ran into a problem it couldn't root-cause on its own, it would create new files to debug the problem, then fail to debug those new files too.

The breakthrough came when I stepped back and created a new, much more detailed specification. Instead of trying to explain everything in chat, I had Kiro write out clear requirements, acceptance criteria, and implementation tasks. I iterated through those tasks with Kiro, spending time fleshing them out to leave no room for ambiguity. I even included details about how to implement the tasks. The result was a spec of discrete, manageable pieces that Kiro could tackle one at a time.

This spec-driven approach was a game-changer. Suddenly, Kiro could work across multiple sessions without losing context. When our conversation would expire (which happens frequently with complex projects), I could start a new session, point Kiro to the spec, and it would pick up exactly where we left off. The specification became our shared source of truth. When I changed my mind about anything, it went straight into the spec, which it turns out is the secret to getting Kiro to listen to you.

The whole process took about a week of aggressive prompting and iteration, but the end result was a high quality pull request adding substantial functionality to a really awesome project. More importantly, I learned that Kiro's real strength isn't in quick, one-off tasks—it's in sustained development work where you can invest time upfront in creating clear specifications and then let it execute consistently over multiple sessions.

Internal Dev Tool

My experience using Kiro for internal development tools has been surprisingly rewarding, particularly when building command-line interfaces and developer experience utilities. This is where Kiro really shines in understanding niche use cases and collaborating meaningfully on user experience refinement.

I've used Kiro to build several internal CLI tools that would have been labor-intensive to create manually. What impressed me most was how it understood not just the technical requirements, but the developer experience considerations that make command-line tools actually pleasant to use. It would suggest helpful error messages, implement sensible defaults, and even add progress indicators without me having to explicitly request them.

The collaboration genuinely feels like pair programming with someone who understands exactly what you're trying to achieve. When I described a workflow problem our team was facing, Kiro didn't just implement a solution. It asked clarifying questions about edge cases, helped me evaluate user experience decisions, and helped refine the tool until it felt really intuitive to use.

The development speed for these internal tools has been remarkable. Kiro's understanding of the business domain has led to me being able to build experiences for my developers that I simply would not have had the time to build from scratch.

Getting the Most Out of Kiro: Practical Tips and Best Practices

After testing Kiro across different project types, I've developed some strategies that have led to a really good development workflow. These are hard-learned lessons taken from doing real work, with plenty of mistakes along the way.

Embrace Spec-Driven Development

Kiro shines when you invest time upfront in creating specifications that are detailed enough for autonomous execution.

Start by having Kiro help write the spec itself. Describe what you want to build, then ask Kiro to create requirements documents, design documents, and implementation task lists. The key: iterate until they're specific enough that there's no room for ambiguity. What feels like over-specification to me is often exactly the right level of detail for Kiro.

When you change your mind about anything, update the spec immediately. This becomes your shared source of truth across multiple sessions—it's the secret to getting Kiro to listen to you consistently.

Master Project Management for AI Development

Working with Kiro requires different project management than traditional development. You're steering an AI that can get overwhelmed by complexity and sometimes prefers workarounds over root cause analysis.

Key rules:

  1. Explicitly tell Kiro not to move on until issues are truly fixed

  2. Insist on root cause analysis rather than accepting quick fixes

  3. Keep specs and task lists updated as living documents

  4. Pause periodically to ask Kiro to align the spec with project state

  5. Let Kiro do project management and suggest plan improvements

Generate Steering Documentation Early

One of Kiro's features is its ability to generate steering documentation for existing projects. This is not something you have to do, but it’s something that you always should do. As soon as you start working on any project—especially one that's poorly documented—ask Kiro to analyze the codebase and create steering documentation.

This documentation captures the product vision, technical architecture, and development patterns in human-readable form. It's not just useful for you; it becomes a resource that Kiro uses to understand project conventions and maintain consistency throughout development. I've found this saves hours of onboarding time for me and helps Kiro make better decisions from the start.

Leverage MCP Server Integration

One of Kiro's most powerful features is its ability to integrate with external tools and data sources through MCP servers. Add MCP servers when you need to interact with data or tools that the language model doesn't know how to access directly.

For development work, I've found the DuckDuckGo MCP server essential for documentation searches when working with newer frameworks. The GitHub MCP server is invaluable for understanding project context and contribution guidelines. If you're working with project management tools like Jira or need to access specific APIs, setting up the appropriate MCP servers can dramatically expand Kiro's capabilities.

The setup is usually straightforward, and the productivity gains are immediate. Don't hesitate to configure these integrations early in your project—they're not optional extras, they're essential tools for complex development work.

Use Agent Hooks for Automation

Kiro's agent hooks feature lets you automate repetitive development tasks by triggering agent executions based on specific events. For example, you can set up hooks to automatically update and run tests when you save code files, ensure translation strings are updated across languages when you modify them, or run code quality checks when you commit changes.

I've found agent hooks particularly valuable for maintaining code quality and consistency across team projects. Instead of remembering to run various checks and updates manually, you can configure Kiro to handle these tasks automatically, ensuring nothing falls through the cracks.

Recognize When to Step In

Learning when to let Kiro run versus when to take control is crucial. Kiro excels at sustained development work where you can set clear specifications and let it execute over multiple sessions. It's particularly strong at understanding business domain requirements and translating them into working code.

However, there are times when traditional development approaches are still faster. If you're doing quick debugging, making small tweaks to existing code, or working on problems where you already know the exact solution, it might be quicker to just write the code yourself.

The key is developing a sense for when Kiro is struggling. If you find yourself going in circles, step back and either provide more specific guidance or handle the problem manually. Don't let perfect be the enemy of good—sometimes the most productive approach is a hybrid where Kiro handles the bulk of the implementation and you step in for specific challenges.

Common Pitfalls and How to Avoid Them

The biggest mistake I made initially was treating Kiro like Cursor. It's not—it's a fundamentally different approach that requires different strategies.

Don't expect Kiro to work well with vague requirements. The more specific you can be about what you want, the better the results. This includes being specific about coding patterns, architectural decisions, and even implementation approaches.

You will need to change the way that you have been developing and adjust to Kiro's learning curve. You need to develop a feel for how to work with specs effectively.

Finally, remember that Kiro is still early. There is still occasional unexpected behavior and bugginess, and times when it simply can't solve problems that seem straightforward. The key is maintaining realistic expectations while taking advantage of its genuine strengths in sustained, specification-driven development work.

The Head Fake: This blog post was written with Kiro

Before we talk about the future, I want to share something that perfectly illustrates the spec-driven development approach we've been discussing throughout this post: I used Kiro to write this very blog post.

In preparing my reflections about my experiences with Kiro, I didn't just start typing. Instead, I worked with Kiro to create a comprehensive specification that included requirements documents, design documents, and detailed implementation tasks.

Kiro created the requirements that captured what the blog post needed to accomplish—introducing Kiro to developers, sharing authentic experiences across different technology stacks, and providing practical tips. Then I iterated with Kiro to develop a design document that outlined the narrative structure, writing style, and content approach based on analysis of my existing blog posts.

Most importantly, Kiro broke down the blog writing into discrete, manageable tasks. Each section had clear objectives, specific requirements to address, and detailed guidance about what to include. This plan served as a detailed enough spec that Kiro could use to execute each section autonomously while maintaining consistency with the overall vision.

You're reading the result. This long (hopefully substantial) blog post was written collaboratively, with Kiro handling the bulk of the content generation while I focused on steering the direction, providing authentic details from my experiences, and refining the voice and tone.

This meta-example demonstrates my shift from writing to steering. Instead of spending hours crafting each paragraph, I spent time articulating what I wanted to communicate, and then let Kiro handle the translation from intent to polished prose. The specification became our shared source of truth, enabling us to work together effectively across multiple sessions.

All I needed to do was sprinkle in my authentic voice by adding sentences like this one!

The Future of Development, Productivity, and Creativity

Looking ahead, I think we're witnessing the early stages of a fundamental shift in how software gets built. The evolution from writing code to steering development isn't just a productivity improvement—it represents a qualitative change in the role of software engineers.

As context windows grow and models become more capable, I expect software engineering will gradually become more about conversations regarding requirements and user experience. The technical implementation details that consume so much of our mental energy today will increasingly be handled by AI systems that understand not just syntax, but software architecture, user experience principles, and business domain requirements.

This doesn't mean engineers become obsolete—quite the opposite. As the AI handles more of the mechanical aspects of coding, engineers can focus on the parts of software development that require human judgment: understanding user needs, making architectural trade-offs, ensuring quality and maintainability, and solving complex business problems.

We're still a long way from language models that can one-shot optimal design and solutions for complex systems. There's still significant value in understanding good software architecture, appropriate technology choices, and effective development practices. But the day-to-day work is already shifting from writing code to steering AI systems and ensuring they produce quality results.

Kiro feels like an early glimpse of this future. It's not perfect: it still stumbles and has limitations and requires a learning curve. But it offers a preview of what development might look like when AI systems can truly understand and execute complex, multi-step projects.

The key insight I've gained from my testing is that success with tools like Kiro isn't about becoming a better prompter; it's about becoming better at articulating what you actually want to build. The clearer you can be about requirements, user experience, and architectural decisions, the more effectively these tools can translate your vision into working software.

Whether Kiro specifically becomes the dominant platform in this space remains to be seen. The agentic IDE landscape is evolving rapidly, and there's plenty of room for innovation and improvement. But the fundamental shift it represents toward vibe-coding for complex and significant systems is real.

I don't think the spec-driven development paradigm will remain restricted to building software. This blog post already proves how it can help with writing. Over the next few years, I expect that engineers will build tools for many different industries enabling spec-driven creativity, and spec-driven workflow execution.

For developers considering whether to invest time learning Kiro, my recommendation is straightforward: if you work on complex, multi-step projects where you can benefit from sustained AI assistance over multiple sessions, it's worth the investment. You can visit Kiro's website to get started!


https://yehudacohen.substack.com/p/developing-with-kiro-amazons-new
A Tale of ECS Service Stability
How ECS's new(ish) version consistency feature affected the stability of an old service
An Unexpected Evening Investigation

I was just about to log off for the day when I noticed a customer's SRE team chat suddenly erupting with activity. It was around 6pm my time, approaching midnight for their engineers in Europe. I hadn't received any direct notifications, but seeing the rapid-fire messages in the channel, I decided to hop in and see what was happening.

First things first, I pulled up the Datadog dashboard.

As it turned out, their production environment was experiencing something peculiar - their API Gateway traffic had completely flatlined. Not degraded, not sluggish, but completely silent. For a system that typically hums with constant activity, this digital silence was both unusual and concerning.


Diving into the Mystery

My first step was to check the enterprise firewall service that sits in front of their API Gateway - the crucial component responsible for traffic filtering and security. This service runs as a task in Amazon ECS, and surprisingly, the ECS console showed zero running instances.

What made this particularly intriguing was that there hadn't been any deployments to this service in several weeks. The service was designed with auto-scaling and self-healing capabilities specifically to prevent this type of situation. Yet somehow, it was completely down with no obvious explanation.

The Puzzling Behavior

The logs revealed what appeared to be a straightforward issue:

Task stopped at: YYYY-MM-DDThh:mm:ss.555Z CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref [ECR_REPO].dkr.ecr.us-west-2.amazonaws.com/[SERVICE]/[IMAGE]:[TAG]@sha256:[DIGEST]: [ECR_REPO].dkr.ecr.us-west-2.amazonaws.com/[SERVICE]/[IMAGE]:[TAG]@sha256:[DIGEST]: not found

ECS couldn't find the container image it needed to launch the task. But here's where things got really interesting.

I decided to try pulling the images myself, and what I discovered was perplexing: I could successfully pull the associated tag, and I could see the new SHA digest associated with that tag. However, I couldn't pull the specific SHA digest that ECS was trying to use. It was as if ECS had cached an old version of the image digest and was stubbornly refusing to look at the current tag.

This behavior contradicted everything I'd previously understood about how ECS handles container images. In my experience, ECS services had always pulled whatever image was currently associated with a tag, without getting hung up on SHA digests from previous versions. This fundamental change in behavior was what led me to reach out to AWS Support.

A Helping Hand from AWS Support

After explaining the puzzling behavior I was observing, AWS Support provided invaluable assistance. While I had correctly identified the root cause of the issue, AWS Support helped me attribute the cause to a specific documented change in ECS behavior - pointing me to documentation about the Software Version Consistency feature introduced in July 2024.

This was one of those moments where past experience created a blind spot. I had actually seen this feature announcement when it was released, but I hadn't fully grasped how dramatically it would change the behavior of ECS service stability in certain scenarios. What I had assumed was a bug or misconfiguration was actually a deliberate design change to improve security.

The Technical Puzzle Pieces

The Software Version Consistency feature fundamentally altered how ECS handles container images. Instead of resolving image tags at runtime for each task, ECS now:

  1. Resolves a container image tag to its digest when the first task of a deployment starts

  2. Stores this digest in the ECS control plane

  3. Uses this exact digest for all subsequent tasks in that deployment

Here's where our customer's situation created the perfect storm:

  1. Their ECR repository had a lifecycle policy that removed older, unused images

  2. The specific image digest referenced by the firewall service had been removed by this lifecycle policy

  3. When tasks needed to restart, ECS tried to pull the exact image digest it had on record

  4. That specific digest no longer existed in ECR, resulting in launch failures

Before this feature, ECS would have simply pulled whatever image was currently associated with the tag - a behavior I had internalized over years of working with these services. The change to using immutable digests created a new failure mode that hadn't been accounted for in our operational practices.
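
The sequence above can be reproduced with a small toy model. This is a sketch under stated assumptions, not how ECS or ECR are implemented: `ToyRegistry` and `ToyDeployment` are hypothetical names, the digests are placeholders, and `expire` merely stands in for a lifecycle policy deleting an image.

```typescript
// Toy stand-in for an image registry such as ECR.
class ToyRegistry {
  private tagToDigest = new Map<string, string>();
  private storedDigests = new Set<string>();

  // Pushing moves the tag to the new digest; the old digest stays stored.
  push(tag: string, digest: string): void {
    this.storedDigests.add(digest);
    this.tagToDigest.set(tag, digest);
  }

  // Stands in for a lifecycle policy removing an older image.
  expire(digest: string): void {
    this.storedDigests.delete(digest);
  }

  resolveTag(tag: string): string {
    const digest = this.tagToDigest.get(tag);
    if (digest === undefined) {
      throw new Error(`tag not found: ${tag}`);
    }
    return digest;
  }

  pullByDigest(digest: string): string {
    if (!this.storedDigests.has(digest)) {
      throw new Error(`CannotPullContainerError: ${digest}: not found`);
    }
    return digest;
  }
}

// Toy deployment: resolves the tag once at creation and keeps the digest.
class ToyDeployment {
  private registry: ToyRegistry;
  readonly tag: string;
  readonly pinnedDigest: string;

  constructor(registry: ToyRegistry, tag: string) {
    this.registry = registry;
    this.tag = tag;
    this.pinnedDigest = registry.resolveTag(tag);
  }

  // Version consistency: every task launch pulls the pinned digest.
  launchPinned(): string {
    return this.registry.pullByDigest(this.pinnedDigest);
  }

  // Earlier behavior: the tag is resolved again at each task launch.
  launchByTag(): string {
    return this.registry.pullByDigest(this.registry.resolveTag(this.tag));
  }
}

function attempt(launch: () => string): string {
  try {
    return launch();
  } catch (error) {
    return (error as Error).message;
  }
}

const registry = new ToyRegistry();
registry.push("stable", "sha256:aaa");
const deployment = new ToyDeployment(registry, "stable"); // pins sha256:aaa
registry.push("stable", "sha256:bbb"); // the tag moves to a newer image
registry.expire("sha256:aaa"); // the lifecycle policy removes the old one

const pinnedResult = attempt(() => deployment.launchPinned());
const byTagResult = attempt(() => deployment.launchByTag());
console.log(pinnedResult); // CannotPullContainerError: sha256:aaa: not found
console.log(byTagResult); // sha256:bbb
```

The pinned launch fails even though the tag itself is still perfectly pullable, which is exactly the mismatch I saw when pulling the images by hand.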

Given that so many of the services we work with have a unique tag associated with each image and task definition, this was my first exposure to this failure mode.

Philosophical Takeaways

While a short outage in the grand scheme of things, this incident was a good reminder that cloud architecture requires constant adaptation. Features designed to improve security can sometimes create unexpected ripple effects in our operational practices.

The Software Version Consistency feature itself is a valuable security improvement - it ensures workloads use precisely the container images they were designed to use. The challenge comes in adapting our operational practices to work effectively with these evolving platform capabilities.

Practical Takeaways and Recommendations

My practical recommendations for situations like this:

  1. Try hard to use a different tag for each image: When an ECS task definition references an image tag, it is safest when that tag uniquely identifies an immutable Docker image.

  2. When images must share a tag, keep version consistency, but add safety nets: Keeping the ECS version consistency feature enabled, rather than disabling it, lets us audit any change in the image ECS runs. Teams should nonetheless implement automated recovery mechanisms that detect this specific failure pattern and force a new deployment. This ensures each deployment represents a distinct immutable Docker image, while still allowing recovery from failure.

  3. Improve alerting granularity: Proactively escalating alerts on launch failures of mission-critical microservices is essential. If these launch failures have known causes, including those causes in the alert messages helps SRE teams diagnose the problem quickly.

  4. Align lifecycle policies with deployment strategies: Teams should align their ECR lifecycle policies to ensure they don't conflict with their ECS deployment patterns.

  5. Develop deeper awareness of platform changes: One area I think is worth exploring is how language models might help us consume updates from cloud providers or software vendors and determine whether they are relevant to us. For teams that are constantly focused on innovating, these tools can help cut through the noise to find information pertinent to their systems.
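
The automated recovery from the second recommendation could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the mechanism used for this customer: `recover_service` and `is_missing_image_failure` are hypothetical names, the client is expected to behave like a boto3 ECS client (`list_tasks`, `describe_tasks`, `update_service`), and matching on the text `CannotPullContainerError` in `stoppedReason` is an assumption about how the failure surfaces. A stand-in client is used for the dry run so that nothing touches AWS.

```python
def is_missing_image_failure(stopped_reason):
    """Heuristic: does a stopped-task reason look like a failed image pull?"""
    return "cannotpullcontainererror" in (stopped_reason or "").lower()


def recover_service(ecs_client, cluster, service):
    """Force a new deployment if stopped tasks failed to pull their image.

    Returns True when a new deployment was forced.
    """
    listed = ecs_client.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="STOPPED"
    )
    task_arns = listed.get("taskArns", [])
    if not task_arns:
        return False

    described = ecs_client.describe_tasks(cluster=cluster, tasks=task_arns)
    reasons = [t.get("stoppedReason", "") for t in described.get("tasks", [])]
    if not any(is_missing_image_failure(r) for r in reasons):
        return False

    # A new deployment makes ECS resolve the tag to a digest again.
    ecs_client.update_service(
        cluster=cluster, service=service, forceNewDeployment=True
    )
    return True


class FakeEcsClient:
    """Stand-in for a boto3 ECS client, used only for this dry run."""

    def __init__(self, stopped_reason):
        self.stopped_reason = stopped_reason
        self.update_calls = []

    def list_tasks(self, **kwargs):
        return {"taskArns": ["arn:aws:ecs:example:task/1"]}

    def describe_tasks(self, **kwargs):
        return {"tasks": [{"stoppedReason": self.stopped_reason}]}

    def update_service(self, **kwargs):
        self.update_calls.append(kwargs)
        return {}


failing = FakeEcsClient("CannotPullContainerError: image manifest not found")
forced = recover_service(failing, cluster="prod", service="firewall")

healthy = FakeEcsClient("Scaling activity initiated by deployment")
not_forced = recover_service(healthy, cluster="prod", service="firewall")
print(forced, not_forced)
```

In practice something like this would run on a schedule or in response to a task state change event, and it should alert as well as recover, so that the audit value of version consistency is not lost.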

Looking Forward

For this customer, the solution isn't to disable the safety feature, but rather to enhance their recovery processes to account for new potential failure modes. It's about finding the balance between embracing security improvements and maintaining resilience in the face of unexpected changes.

For the rest of us, this incident offers a valuable reminder about the continuous evolution of cloud platforms. What works perfectly today might behave differently tomorrow, not because of a flaw in our implementation, but because the underlying platform continues to improve and evolve.


https://yehudacohen.substack.com/p/a-tale-of-ecs-service-stability
Gambling with language models
One clueless investor's attempt at beating the stock market with ModernBert

I began this blog post between two other re:Invent attendees en route to Las Vegas for AWS re:Invent. It's been a while since my last blog post, but I am excited for this one, because I am approaching Vegas with a bundle of optimism. I am prepared to gamble. Blackjack? Slot machines? Poker? Not this year. This year, I join the ranks of the great r/wallstreetbets legends preparing to pick stocks. These stocks, I might add, belong to companies I have never heard of in industries I know little about, and I have no stock trading experience to speak of.

Nonetheless, over the last year or so, in my scant free time, I have slowly begun outsourcing my gambling strategy to machines in Amazon Web Services' data centers.

It is re:Invent, and so I will be utilizing AWS's new and existing machine learning services in the experimentation process. I will also be pulling $5,000 out of my investments and placing it into stocks selected by my machine-learning-driven stock picker. I’ll then revisit this blog post and my portfolio next re:Invent to see how I did.

By the time I publish this, re:Invent has been over for almost a month. To keep things current, I've adjusted some minor portions of my approach to leverage some new innovations by AWS, Google, and in the open source language model world. The broad strokes, however, remain the same.

Without further ado, let's put Amazon Bedrock, Amazon SageMaker Studio, PyTorch, Hugging Face Transformers, and some open-source models to the test.

Background

There are many approaches to picking stocks, but I enter the realm of finance absent any pre-conceived notions (or idea) of how things should be done. As such, rather than relying on my own reasoning skills, I rely heavily on the reasoning capabilities of language models, and their ability to extract knowledge, understand text, and predict quantitative outcomes.

I should also point out one last major change I made to this blog post over the last two weeks. Following an exciting new release by the Answer.AI and LightOn teams, I switched to fine-tuning ModernBert, a long-context, SoTA encoder-only model, to predict stock prices. This type of model was exactly what I had searched for, to no avail, when first starting the project.

By the end of this blog post, a fine-tuned version of ModernBert-large becomes the brains of my fledgling stock-picking operation.

The remainder of this blog post is dedicated to providing a detailed description of my process, along with accompanying code snippets to explain the various steps I undertook in this experiment. It also explains the anticipated results, and my hypothesis that even a model that predicts stock price with a substantial error rate might enable me to do better than randomly picking stocks or purchasing an index fund.


Methodology

At 1000 feet, my methodology was to synthesize a broad range of information about publicly traded companies, process it, and then train a model to predict stock growth for each of these companies. I then leveraged the most recent company information to predict future growth for each of these stocks using the trained model, and ranked the stocks in order of best predicted performance.

It's important to note that I am not relying on precise predictions of future stock prices for any given company. After all, even if I had access to a structured data corpus of a company's internals, stock prices are impacted by events that have not yet come to pass. As such, rather than attempting to look into a crystal ball, I'm aiming to create a system that, by looking at a rich context of financial reports, macroeconomic indicators, and more, can produce a relative ranking of which stocks should perform better than others.

While still within the realm of divination, my predictions are murkier, akin to reading tea leaves rather than being crystal-ball clear. The idea is that, even if the model’s absolute predictions are inaccurate, the relative rankings might provide an edge in selecting a portfolio that outperforms the market.
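To make the distinction between absolute error and relative ranking concrete, here is a small sketch with made-up numbers (not results from my model). The predictions below are all far too pessimistic, so the root mean squared error is large, yet they order the four hypothetical stocks exactly as the actual returns do.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def ranks(values):
    """Rank 1 = largest value. Assumes no ties, which holds for this toy data."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(actual, predicted):
    """Spearman rank correlation for untied data."""
    n = len(actual)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(actual), ranks(predicted)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical one-year returns (as fractions) and equally hypothetical predictions.
actual = [0.30, 0.10, -0.05, -0.20]
predicted = [0.02, -0.10, -0.22, -0.51]

print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"Spearman rank correlation: {spearman(actual, predicted):.1f}")
```

A model like this would be useless for pricing any single stock, but a portfolio built from its top-ranked names would still hold the best performers.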

As an input to my model, I processed data to generate what I call "contextual snapshots." These snapshots act as rich, time-bound report cards for each company at a given fiscal quarter. Each snapshot is a combination of financial statements, analyst reports, macroeconomic factors and more, distilled into a report that the model can learn from. Think of it like a time capsule for each company, encapsulating a wealth of information for each historical quarter and year.

My hypothesis is that, after being exposed to these capsules a model might identify patterns that an untrained eye might miss within these contextual snapshots.

With the high-level methodology out of the way, let’s dive into the details of each phase: the data preparation, the model training, and finally, the stock ranking.

Data Preparation

A cloud-based experimentation environment

Prior to training a model, I needed to gather, clean, and format the necessary data. Like many a data scientist, I depend heavily on Jupyter notebooks for interactive feedback. Unfortunately, my laptop and network don't have the horsepower necessary to handle the scraping, text processing, and model training requirements without impacting my day-to-day work.

Enter Amazon SageMaker Studio notebooks. SageMaker Studio notebooks are cloud-based and allow you to stop and start Jupyter notebook instances as needed. I started with a t3.medium instance for data fetching and light processing, which was more than adequate for those initial tasks. Though not GPU-enabled, this instance type provided enough power for data download and text manipulation.

One of the things I like about SageMaker Studio notebooks is that it's super easy to swap out the compute for something with a little more horsepower if you need to later on. (I knew I'd need something a little more beefy for the model training portion.)

I will add that since starting this project, AWS has released a preview of SageMaker Unified Studio, a new product that aims to more seamlessly integrate AWS's individual data-engineering components into an end-to-end data engineering platform. Check it out if you're looking to do something production-ready.

Gathering Financial Data

My first step was to download the current holdings of the Russell 3000 index, to create a list of the stocks I would be focusing on.

Then, I used the sec-cik-mapper library to map company ticker symbols to their unique SEC CIK identifiers (a critical step in accessing filings).

After consulting with my personal finance expert, Claude 3.5 Chat, I discovered that the SEC's EDGAR database, along with the Federal Reserve Economic Data (FRED) and Bureau of Economic Analysis (BEA) APIs, were accessible and free sources I could use to gather the microeconomic data for these companies and the macroeconomic data surrounding the time periods in question. I also discovered that I could use the finagg tool to easily access the data from these APIs. So I started forming historical profiles for each of these companies from all of these data sources.

Additionally, I downloaded historical financial filings, specifically 10-K (annual) and 10-Q (quarterly) filings, which provide detailed insights into each company's financial health. From these reports, I extracted two critical sections: Risk Factors, and Management's Discussion and Analysis (MD&A). These are often long and dense, and might contain information necessary to contextualize a company's current state. They might also contain intent signals, embedded in the writing style and choice of words, that machine learning models pick up but human readers miss.

To parse these documents, I leveraged an approach I discovered in this gist. (I might add that if you like any of the open source code or tools I depend on, you can show your support by starring them on GitHub!)

import re
from bs4 import BeautifulSoup
import pandas as pd
from io import StringIO

# Form10KExtracts, clean_string, and get_text_only are helpers defined elsewhere
# in the project source.
def parse_10_k(text, hash_digest) -> Form10KExtracts | None:
    # Regex to find item headings such as '>Item 1A.' or '>ITEM&#160;7.'
    sections_regex = re.compile(r'(>(Item|ITEM)(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})')
    matches = sections_regex.finditer(text)
    matches_list = [(x.group(), x.start(), x.end()) for x in matches]

    # Without any item headings there is nothing to extract
    if not matches_list:
        return None

    sections_df = pd.DataFrame(matches_list, columns=['item', 'start', 'end'])
    # Normalize headings to keys like 'item1a': drop entities, whitespace, '.' and '>'
    sections_df['item'] = (
        sections_df['item']
        .str.replace(r'&#160;|&nbsp;|\s|\.|>', '', regex=True)
        .str.lower()
    )
    sections_df.sort_values('start', ascending=True, inplace=True)
    # Headings also appear in the table of contents; keep the last occurrence of each
    deduped = sections_df.drop_duplicates(subset=['item'], keep='last').set_index('item')

    # Bail out if any heading that bounds the two sections is missing
    if not {'item1a', 'item1b', 'item7', 'item7a'}.issubset(deduped.index):
        return None

    # Each section runs from its own heading to the start of the next heading
    starts = deduped['start']
    risk_factors = clean_string(get_text_only(text[starts.loc['item1a']:starts.loc['item1b']]))
    md_and_a = clean_string(get_text_only(text[starts.loc['item7']:starts.loc['item7a']]))
    return Form10KExtracts(risk_factors, md_and_a, hash_digest)
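
To see what the heading search does on its own, here is a standard-library-only sketch run against a tiny made-up HTML fragment. It is not the project's parser: the `sample` text is invented, and tags are stripped with a crude regex where the real code would use a proper HTML parser.

```python
import re

# Same idea as the heading regex above, with the item number as the only capture group.
ITEM_HEADING = re.compile(r'>(?:Item|ITEM)(?:\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.?')

sample = (
    "<p>Item 1A. Risk Factors</p><p>Competition is intense.</p>"
    "<p>Item&#160;1B. Unresolved Staff Comments</p><p>None.</p>"
)

# Map each normalized item label to the offset where its heading starts.
# Later matches overwrite earlier ones, which mirrors keeping the last occurrence.
starts = {}
for match in ITEM_HEADING.finditer(sample):
    starts["item" + match.group(1).lower()] = match.start()

# A section runs from its own heading to the start of the next heading.
risk_factors_html = sample[starts["item1a"]:starts["item1b"]].lstrip(">")

# Crude tag stripping; the pattern also swallows a tag left unterminated by the slice.
risk_factors_text = re.sub(r"<[^>]*(?:>|$)", " ", risk_factors_html)
risk_factors_text = " ".join(risk_factors_text.split())
print(risk_factors_text)
```

Keeping the last occurrence matters on real filings, where every item heading typically shows up once in the table of contents and again at the start of the section itself.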

Prompt Engineering for Financial Analysis

Because the extracted sections from the reports tend to be long and verbose, it would have been infeasible to use them directly as input to my model. Instead, I used large language models to create summaries of these sections. For my initial summarization runs I used Claude 3.5 Haiku over Amazon Bedrock. This was an awesome experience when I first started out, as the API is much more reliable than many others.

Then, a couple of months ago (I think around the same time that Bedrock batch inference went GA), Bedrock started imposing some pretty onerous usage quotas. I didn't want to rearchitect my process to be event-driven, since I wanted a more synchronous experience for my experimentation. As such, when I processed the most recent batch of documents in preparation for pulling the trigger on this experiment and publishing this blog post, I pointed to Google's Gemini 2.0 experimental API, which did well enough.

Prior to switching off of Bedrock, I tried the new Nova Micro model. I found that while it was able to understand a lot, it wasn't as good as Claude or Gemini at following instructions, and while its quota was higher than the Haiku quota, it was still very limited. You might still see some remnants of this experimentation in the source code. (I will paste a link at the end of the blog post in case you're interested.)

Here’s one example where Nova Micro wasn't able to reliably give me meaningful output in my requested format:

Each company's us-gaap filings include sometimes hundreds of keys describing its financial situation. Processing them all is a fool's errand, and each company uses different keys. So I asked Gemini to select the ones that would be most useful for a financial analyst:

# GeminiPrompt and get_gemini_response are helpers defined elsewhere in the project source.
keys = ['AccountsPayableAndAccruedLiabilitiesCurrent', 'AccountsReceivableNetCurrent', 'AccruedIncomeTaxesCurrent', 'AccumulatedDepreciationDepletionAndAmortizationPropertyPlantAndEquipment', ...]
# Render the keys as a bullet list, one key per line
key_string = '\n* ' + '\n* '.join(keys)
context_information = [f"In this company's us-gaap report, the following keys are available: {key_string}"]
prompt = GeminiPrompt(
    task="Select 20 keys from the us-gaap keys provided in the context to enable a financial analyst to quickly make a snap-judgment of the company's performance.",
    context_information=context_information,
    instructions=["Be thorough in your analysis and ensure the keys you select are categorical, distinct, non-overlapping and represent key financial indicators like profits, revenues, assets, liabilities, and investments in company growth. Ensure the keys summarize the company's financial health effectively and are present in the context information."],
    response_format_instructions=["Respond with only a markdown code block containing a json list of strings representing the selected keys: eg. ```json\n[\nKey1,\nKey2\n,...,Key10\n]\n```", "Verify the presence of each of the keys in the provided context before responding", "Preserve the original PascalCase of the keys eg. 'AssetsCurrent' rather than 'Assets Current' and 'CostOfGoodsSold' instead of 'Cost Of Goods Sold'", "Order the keys in the order of importance"]
)
response = get_gemini_response(prompt, output_format="json")

While Gemini and Claude 3.5 Haiku do a decent job of this, I found it difficult to get Nova Micro to respond in a consistent format.
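Whichever model answers, it helps to check the reply mechanically before trusting it. The sketch below is one way that check might look; `extract_json_block` and `validate_selected_keys` are hypothetical helpers, not part of my source code, and the response text is made up to show a model "fixing" a key's casing.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable


def extract_json_block(response_text):
    """Return the parsed JSON found inside the first fenced json block."""
    pattern = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response_text)
    if match is None:
        raise ValueError("no fenced json block in response")
    return json.loads(match.group(1))


def validate_selected_keys(selected, available, expected_count):
    """Split the model's keys into ones that exist in the filing and ones it invented."""
    available_set = set(available)
    valid = [k for k in selected if k in available_set]
    invented = [k for k in selected if k not in available_set]
    complete = len(valid) == expected_count and not invented
    return valid, invented, complete


# Hypothetical filing keys and a hypothetical model reply.
available = ["AssetsCurrent", "Liabilities", "NetIncomeLoss"]
response_text = FENCE + 'json\n["AssetsCurrent", "Net Income Loss"]\n' + FENCE

selected = extract_json_block(response_text)
valid, invented, complete = validate_selected_keys(selected, available, expected_count=2)
print(valid, invented, complete)
```

When `complete` comes back false, the cheapest remedy is usually to retry the prompt, optionally feeding the invented keys back as a correction.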

Similarly, I used prompts to summarize the risk factors and the management discussion and analysis (MD&A) section, keeping the same careful eye on structured prompts that minimize hallucinations and ensure correct formatting.

Note that I also didn't want the model to favor companies based on identifying information within my snapshots, so I instructed the model to redact company-specific information:

redaction_instructions = "Any time the company's name, or year of the report, or any revealing product name would appear in the returned markdown document, redact it using the tag [REDACTED] so that a reader would not know which company is described. Also ensure summaries and quotes do not include names of individuals associated with the company."

def summarize_risk_factors(risk_factors: str):
    context_information = [f"The following text was scraped from the risk factors section of a company's 10-K report: ```\n{risk_factors}\n```\n "]
    task = "Return a three paragraph summary of the most important information for a financial analyst to understand the company's risk factors. Also include a set of up to ten quotes from the text that support the summary."
    response_format_instructions = [f"The summary should be formatted as a markdown file with a 'Risk Factors' heading with two sections: Summary and Substantiating Quotes. {redaction_instructions}", "Respond only with a markdown code block containing markdown content within starting '```markdown\n' and ending: '\n```'"]
    prompt = GeminiPrompt(
        task=task,
        context_information=context_information,
        response_format_instructions=response_format_instructions
    )
    response = get_gemini_response(prompt, output_format="markdown")
    return response

Aside from the Risk Factors and MD&A sections, I spent a bit of time trying to summarize the huge Disclosures sections as well, but it was taking too long and was outside of my token budget (probably because companies like to bury their important disclosures in mountains of irrelevant text in the hope that nobody will notice).

Creating the Contextual Snapshot

The culmination of the data preparation process was the creation of these "contextual snapshots". Each snapshot included:

  1. Company Summary: A concise description of the company, its sector, location, and approximate market cap.

  2. Historical Trends: A CSV-formatted table of historical financial metrics, including company-specific data and macroeconomic indicators, like CPI, interest rates, and unemployment, over the past two years. This trend data was designed to highlight patterns over time.

  3. 10-K Summaries: Summaries of the risk factors, management discussion and analysis, and disclosures from the most recent 10-K report, distilled using LLMs via Amazon Bedrock.

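from dataclasses import dataclass
from typing import Dict, Optional

# Company, Trend, and Projection are defined elsewhere in the project source.
@dataclass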
class ContextualSnapshot:
    year: int
    q: int
    company: Company
    historical_trends: Dict[str, Trend]
    future_projection: Projection
    most_recent_10k_file: Optional[str] = None

    def to_anonymous_report(self):
        company_summary = self._get_company_summary()
        historical_trends = self._get_historical_trends()
        file_10k_extracts_summary = self._get_most_recent_10k_summary()

        return f"""
# Company Summary
{company_summary}

# Historical Trends
{historical_trends}

# Most Recent 10-K Summary
{file_10k_extracts_summary}
"""

I saved each snapshot report to a markdown file, to be used as the primary input for the machine learning model in the subsequent phase. I also associated a set of labels describing the percentage change in the stock price over the subsequent quarter, six months, and year alongside each point-in-time snapshot.

Because I was worried that a point-in-time sample for each company would not take into account financial announcements, or would represent premature excitement from earnings reports, I took stock price samples using both the pre- and post-earnings stock prices for each of these time periods.
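As a rough illustration of how such labels might be laid out, the sketch below computes six percentage changes per snapshot: a pre-earnings and a post-earnings value for each of the three horizons. The function names, the prices, and the exact ordering of the labels are assumptions made for the example, and measuring every change from a single snapshot price is a simplification.

```python
HORIZONS = ("next_quarter", "next_six_months", "next_year")


def pct_change(start_price, end_price):
    """Percentage change from start_price to end_price."""
    return (end_price - start_price) * 100.0 / start_price


def build_labels(snapshot_price, future_prices):
    """Return [pre, post] percentage changes for each horizon, in HORIZONS order.

    future_prices maps a horizon to (pre_earnings_price, post_earnings_price).
    """
    labels = []
    for horizon in HORIZONS:
        pre_price, post_price = future_prices[horizon]
        labels.append(pct_change(snapshot_price, pre_price))
        labels.append(pct_change(snapshot_price, post_price))
    return labels


# Hypothetical prices for a single snapshot.
labels = build_labels(
    snapshot_price=100,
    future_prices={
        "next_quarter": (104, 110),
        "next_six_months": (95, 98),
        "next_year": (120, 125),
    },
)
print(labels)
```

Training against both samples per horizon means a single unusual trading day around an earnings call has less influence on what the model learns.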

This data preparation phase was a little onerous, but it seems like a worthwhile use case for testing the efficacy of applying modern language models to long-context documents that mix structured and unstructured data.

Model Training

With the data prepared, it was time to turn to training a model. For this phase I switched my SageMaker Studio instance over to a more powerful g6.xlarge GPU instance with an NVIDIA L4 GPU.

Choosing the Right Model Architecture

When it came to selecting a model, I knew I needed something that could handle the long context provided by the contextual snapshots. Older encoder-only models with limited context windows and less pre-training would be insufficient, and newer encoder-decoder and decoder-only models were not ideal for regression problems, where the aim is to predict numeric values from contextual snapshot data.

I was therefore very happy when, only a couple of weeks ago, the Answer.AI and LightOn teams released ModernBert, a modern encoder-only model that supports long context lengths.

Encoder-only models, like ModernBert, excel at understanding the context of long text sequences and converting them into meaningful embeddings. They use attention mechanisms to relate words within a large window. With its ability to process extensive texts, ModernBert was perfectly suited to handle the detailed contextual snapshots I'd created.

So, I swapped out my original Llama-based fine-tuning for a ModernBert-large version.

Fine-tuning with QLoRA

Fine-tuning a transformer model on data with long context lengths requires significant compute. To fine-tune this model efficiently, I used QLoRA (Quantized Low-Rank Adaptation), a technique that combines quantization with low-rank adapters. By quantizing the frozen model weights to 4 bits and injecting a small number of trainable parameters (adapters), we can drastically reduce memory consumption and train models faster.

Here's a brief snippet of my QLoRA code to show you how that works:

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    TrainerCallback,
    BitsAndBytesConfig,
    EarlyStoppingCallback,
    ModernBertForSequenceClassification
)
from peft import LoraConfig, get_peft_model
import torch

# ModelConfig is the project's own settings object (base_model, num_labels, ...).
def create_model(config: ModelConfig):
    # Load the frozen base weights in 4-bit NF4; computation runs in bfloat16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    # problem_type="regression" makes the classification head train with a
    # mean squared error loss on num_labels continuous targets
    model = ModernBertForSequenceClassification.from_pretrained(
        config.base_model,
        num_labels=config.num_labels,
        problem_type="regression",
        quantization_config=bnb_config
    )

    # Attach small trainable low-rank adapters to the modules named Wqkv and Wo
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["Wqkv", "Wo"],
        lora_dropout=0.1,
        bias="none",
        task_type="SEQ_CLS"
    )

    model = get_peft_model(model, lora_config)
    return model
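
To get a feel for why the adapters are cheap, here is the arithmetic for a single projection matrix. The 1024 by 3072 shape is an assumption chosen for illustration (roughly what a fused query/key/value projection looks like at a hidden size of 1024), and `lora_adapter_params` is a hypothetical helper rather than anything from PEFT.

```python
def lora_adapter_params(in_features, out_features, rank):
    """Trainable parameters added by one LoRA adapter.

    The adapter is two small matrices, A (rank x in_features) and
    B (out_features x rank), trained next to the frozen weight.
    """
    return rank * in_features + out_features * rank


# Assumed shape for illustration; check the real model config before relying on it.
in_features = 1024
out_features = 3072
rank = 16  # matches r=16 in the LoraConfig above

frozen_params = in_features * out_features
adapter_params = lora_adapter_params(in_features, out_features, rank)

print(f"frozen weights:  {frozen_params:,}")
print(f"adapter weights: {adapter_params:,}")
print(f"trainable share: {adapter_params / frozen_params:.2%}")
```

The frozen weights are the part stored in 4 bits, so the optimizer only has to keep state for the small adapter matrices (plus the classification head).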

Rolling Window Training Strategy

Given the temporal nature of financial data, I employed a rolling window approach to simulate real-world conditions. I'd train the model on five years of historical data and validate it on the subsequent year, moving the window forward by one year with each split. The goal was to keep the model relevant to current market dynamics while never validating on data older than what it was trained on.

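import logging
from typing import List, Tuple

import pandas as pd

logger = logging.getLogger(__name__)

def rolling_window_split(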
    df: pd.DataFrame,
    train_window_years: int = 5,
    validation_years: int = 1,
    min_train_years: int = 3,
    stride: int = 1
) -> List[Tuple[pd.DataFrame, pd.DataFrame]]:

    years = sorted(df['year'].unique())
    splits = []

    # Calculate the total window size
    total_window = train_window_years + validation_years

    # Generate splits
    for start_idx in range(0, len(years) - total_window + 1, stride):
        train_start = years[start_idx]
        train_end = years[start_idx + train_window_years - 1]
        val_start = years[start_idx + train_window_years]
        val_end = years[start_idx + total_window - 1]

        train_df = df[
            (df['year'] >= train_start) &
            (df['year'] <= train_end)
        ]

        val_df = df[
            (df['year'] >= val_start) &
            (df['year'] <= val_end)
        ]

        if len(train_df['year'].unique()) >= min_train_years:
            splits.append((train_df, val_df))

            logger.info(f"""
            Created split:
            Training: {train_start}-{train_end} ({len(train_df)} samples)
            Validation: {val_start}-{val_end} ({len(val_df)} samples)
            """)

    if not splits:
        logger.warning("No valid splits were created with the given parameters")
    else:
        logger.info(f"Created {len(splits)} total splits")

    return splits
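
The windowing logic is easier to see on a plain list of years. The sketch below applies the same index arithmetic as the function above without pandas; `rolling_year_windows` is a hypothetical helper and the 2015 to 2023 range is made up for the example.

```python
def rolling_year_windows(years, train_window_years=5, validation_years=1, stride=1):
    """Return (train_years, validation_years) pairs, oldest window first."""
    years = sorted(set(years))
    total_window = train_window_years + validation_years
    windows = []
    for start in range(0, len(years) - total_window + 1, stride):
        train = years[start:start + train_window_years]
        validation = years[start + train_window_years:start + total_window]
        windows.append((train, validation))
    return windows


# Hypothetical range of fiscal years.
windows = rolling_year_windows(range(2015, 2024))
for train, validation in windows:
    print(f"train {train[0]}-{train[-1]} -> validate {validation[0]}")
```

Nine years of data with a five-year training window and a one-year validation window yields four splits, and the validation year always comes strictly after the training years.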

Prediction Phase

By the end of this training process, I had a model capable of predicting company performance with some level of accuracy (a root mean squared error of around 20-30%), so I used the trained model to generate a set of predictions for each stock in my dataset.

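import pandas as pd
import torch
from torch.utils.data import DataLoader

# Method of the project's predictor class; get_latest_files, DynamicTextDataset,
# and collate_fn are helpers defined elsewhere in the project source.
def predict(self, df: pd.DataFrame, batch_size: int = 3) -> pd.DataFrame: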
    df = get_latest_files(df)
    dataset = DynamicTextDataset(df, self.tokenizer, self.config.max_length)
    dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate_fn)

    predictions = []
    metadata_list = []
    self.model.eval()
    with torch.no_grad():
        for batch in dataloader:
            outputs = self.model(
                input_ids=batch['input_ids'].to(self.device),
                attention_mask=batch['attention_mask'].to(self.device)
            )
            predictions.extend(outputs.logits.cpu().numpy())
            metadata_list.extend(batch['metadata'])

    results_df = pd.DataFrame(metadata_list)
    results_df['predictions'] = predictions

    return results_df

But predicting a company's stock price to within a 20-30% margin of error is not useful in itself. After all, a company's stock price depends on far more than the small sum of information available within a contextual snapshot.

Ranking Stocks

I wanted to understand which companies the model would give an edge to compared with the other companies. As such I built a simple ranker.

Ranking Algorithm

I used the simplest ranking approach I could. For each of the next quarter, next six months, and next year, I averaged the model's pre-earnings and post-earnings predictions. I then ranked the companies by predicted growth over each time period. Lastly, I combined the rankings, weighting the year rank at 0.5, the six-month rank at 0.3, and the quarter rank at 0.2.

Here's the code for my ranker if you're interested:

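from enum import Enum
from typing import Dict, Optional

import numpy as np
import pandas as pd

# Reconstructed from how it is used below; the member values are an assumption.
class RankCombinationMethod(Enum):
    AVERAGE = "average"
    MINIMUM = "minimum"
    WEIGHTED_AVERAGE = "weighted_average"

class FinancialRanking: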
    def __init__(self, weights: Optional[Dict[str, float]] = None):
        """
        Initializes the FinancialRanking class.
        
        Args:
            weights (Dict[str, float]): Weights for different prediction periods
                Default: {"next_quarter": 0.2, "next_six_months": 0.3, "next_year": 0.5}
        """
        self.weights = weights or {"next_quarter": 0.2, "next_six_months": 0.3, "next_year": 0.5}
        if not np.isclose(sum(self.weights.values()), 1.0):
            raise ValueError("Weights must sum to 1")

    def _combine_ranks(self, row: pd.Series, method: RankCombinationMethod) -> float:
        """
        Combines per-period ranks into a single score.
        
        Args:
            row (pd.Series): Row containing period ranks
            method (RankCombinationMethod): Method to combine ranks
            
        Returns:
            float: Combined rank score
        """
        ranks = [
            row['next_quarter_rank'],
            row['next_six_months_rank'],
            row['next_year_rank']
        ]
        
        if method == RankCombinationMethod.AVERAGE:
            return np.mean(ranks)
        elif method == RankCombinationMethod.MINIMUM:
            return np.min(ranks)
        elif method == RankCombinationMethod.WEIGHTED_AVERAGE:
            return (
                ranks[0] * self.weights['next_quarter'] +
                ranks[1] * self.weights['next_six_months'] +
                ranks[2] * self.weights['next_year']
            )
        else:
            raise ValueError(f"Unsupported rank combination method: {method}")

    def rank_stocks(self, df: pd.DataFrame, combination_method: RankCombinationMethod = RankCombinationMethod.WEIGHTED_AVERAGE) -> pd.DataFrame:
        """
        Ranks stocks based on predictions using the specified combination method.
        
        Args:
            df (pd.DataFrame): DataFrame with predictions column containing arrays of predictions
            combination_method (RankCombinationMethod): Method to combine period ranks
            
        Returns:
            pd.DataFrame: DataFrame with added ranking columns and sorted by final rank
        """
        # Convert predictions to numpy arrays if they aren't already
        df = df.copy()
        df['predictions'] = df['predictions'].apply(lambda x: np.array(x) if not isinstance(x, np.ndarray) else x)
        
        # Create ranking dataframe for each prediction period
        prediction_df = pd.DataFrame({
            'ticker': df.index,
            'next_quarter': df['predictions'].apply(lambda x: np.mean(x[:2])),
            'next_six_months': df['predictions'].apply(lambda x: np.mean(x[2:4])),
            'next_year': df['predictions'].apply(lambda x: np.mean(x[4:]))
        })
        
        # Calculate ranks for each period (higher predictions get lower ranks)
        prediction_df['next_quarter_rank'] = prediction_df['next_quarter'].rank(ascending=False)
        prediction_df['next_six_months_rank'] = prediction_df['next_six_months'].rank(ascending=False)
        prediction_df['next_year_rank'] = prediction_df['next_year'].rank(ascending=False)
        
        # Calculate combined rank
        prediction_df['combined_score'] = prediction_df.apply(
            lambda row: self._combine_ranks(row, combination_method), 
            axis=1
        )
        
        # Add ranks back to original dataframe
        df_result = pd.concat([
            df,
            prediction_df[['next_quarter_rank', 'next_six_months_rank', 'next_year_rank', 'combined_score']]
        ], axis=1)
        
        # Sort by combined score and add final rank
        df_result = df_result.sort_values('combined_score')
        df_result['rank'] = range(1, len(df_result) + 1)
        
        return df_result
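
Here is the weighted-average combination worked through by hand on three hypothetical tickers, using only the standard library. The growth figures are invented, and unlike pandas' `rank`, this sketch makes no attempt to handle ties.

```python
WEIGHTS = {"next_quarter": 0.2, "next_six_months": 0.3, "next_year": 0.5}

# Hypothetical predicted growth (percent) per period.
predicted_growth = {
    "AAA": {"next_quarter": 5.0, "next_six_months": 8.0, "next_year": 12.0},
    "BBB": {"next_quarter": 9.0, "next_six_months": 6.0, "next_year": 10.0},
    "CCC": {"next_quarter": 1.0, "next_six_months": 9.0, "next_year": 20.0},
}


def rank_descending(values_by_ticker):
    """Rank 1 goes to the highest value. Ties are not handled in this sketch."""
    ordered = sorted(values_by_ticker, key=values_by_ticker.get, reverse=True)
    return {ticker: position for position, ticker in enumerate(ordered, start=1)}


# Rank the tickers separately within each period.
period_ranks = {
    period: rank_descending({t: g[period] for t, g in predicted_growth.items()})
    for period in WEIGHTS
}

# Combine the per-period ranks; a lower combined score is better.
combined_score = {
    ticker: sum(WEIGHTS[p] * period_ranks[p][ticker] for p in WEIGHTS)
    for ticker in predicted_growth
}

final_order = sorted(combined_score, key=combined_score.get)
print(final_order)
```

BBB has the best next-quarter prediction, yet it finishes last because the one-year rank carries half of the weight.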

The Initial Results

Some Important Caveats

In this experiment, I didn't deal with many edge cases where my data extraction failed, where I wasn't able to find mappings between company CIKs and tickers, or where there were transient network failures. I also stopped early. The full list of 3000 stocks from the Russell 3000 was too many for me, so when I had enough training examples, I just stopped there. As such, I only ran training and inference on a subset of the Russell 3000. Nonetheless, I processed larger companies first, so many of the more stable companies are well represented within these findings.

A second caveat: while I am a reasonably proficient dabbler in machine learning, I am hardly an expert.

A third caveat: I am not recommending that you invest in this approach. Hedge funds are working full-time trying to outdo my strategies, and I expect there are probably far better ways of achieving high yield than this. If you choose to copy my trades and lose your money because of it, don't come crying to me.

Lastly, this is less of a trading project and more about having fun. So I hope you found this to be as fun a read as I found it an experiment.

With caveats out of the way, here are the rankings:

The Rankings

Here are the top 20 and bottom 20 companies, as ranked by the fine-tuned model, out of my sample of 608 snapshots:

The Top 10

Why did the model like these companies' recent snapshots? I don't know. After all, I still know very little about finance. I will say that sampling them and feeding them to a non-fine-tuned Claude didn’t yield a ton of excitement from Claude. Additionally, most of these stocks have underperformed for at least the last few months. I will nonetheless be entrusting my money’s fate to the hands of this model, and we’ll see whether it was right to like them so much.

Whether or not there is enough signal in the model to help beat the market is something we will only find out a year from now.

The Bottom 20 stocks

I’m not so confident in this model that I’m shorting these stocks, but here is what my model disliked the most.

Conclusion and Next Steps

This whole experience highlights the power of combining the ability to buy GPU time on demand from AWS, a crazy idea, and all the new AI/ML tools, both open source and paid, that are on the market today. While predicting stock prices precisely is unrealistic, leveraging the knowledge extraction and reasoning capabilities of large language models, combined with a focus on relative stock performance, has the potential to give us a meaningful edge over simply picking stocks at random.

Just how much of an edge? Well, that's where the real fun begins. I’m investing $5,000 in the top stocks chosen by my model.

I created a Public account to track the performance of my stocks, which you can follow along with here. This account takes the top 20 stocks from the model and purchases them in inverse proportion to their combined score, a score that reflects the strength of their rank across the three time periods, weighted toward the end of the year. I placed orders at Public for the corresponding amounts, rounded to the nearest dollar, which should be filled at the start of trading tomorrow.
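
To make the allocation rule concrete, here is a minimal sketch of one way it could work. The tickers and scores are made up for illustration, and it assumes that a lower combined score means a stronger rank (consistent with the ascending sort in the ranking code above), so each stock's weight is taken as 1 / score. It is not necessarily the exact calculation used for the real orders.

```python
def allocate(scores, budget=5000):
    """Split a budget across tickers in inverse proportion to their score."""
    # Assumption: scores are positive and lower is better, so weight = 1 / score.
    inverse = {ticker: 1.0 / score for ticker, score in scores.items()}
    total = sum(inverse.values())
    # Round each order to the nearest dollar.
    return {ticker: round(budget * weight / total) for ticker, weight in inverse.items()}


# Hypothetical tickers and combined scores, purely for illustration.
example_scores = {"AAA": 1.0, "BBB": 2.0, "CCC": 4.0}
orders = allocate(example_scores)
print(orders)
```

Because each order is rounded independently, the total can drift from the budget by a dollar or so when there are many positions.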

Follow along over at: https://public.com/@yehudac

A year from now (if I remember), I'll share the results here.

There are many domains other than finance where similar approaches might prove even more valuable. If you’d like to read the code for this project or try running it for yourself, you can visit my GitHub repository here.

Acknowledgements

Before I leave you, I'd like to make a few acknowledgements:

I'd like to thank the AWS Community Builders program for providing the compute credits that have enabled me to undertake this project. I made extensive use of SageMaker Studio Notebooks, Bedrock, and GPUs over the course of this project.

I'd also like to thank the authors of the open source tools used in this project. Head over to their GitHub profiles and give them some stars, because stars like them make this kind of tinkering possible.

I'd like to thank Jeremy Howard for his fast.ai course, of which I completed several lessons when first exploring machine learning. I'd also like to thank him and his team for continuing to release powerful open source tools like ModernBERT, which I and many other tinkerers deeply appreciate.

Lastly, I’d like to thank my artificial buddies Claude 3.5, ChatGPT, and Google Gemini 2.0 Flash Experimental, who have helped me with some ad-hoc questions along the way and have also helped me review this blog post.

https://yehudacohen.substack.com/p/gambling-with-language-models
Extensions
Initial thoughts about HashiCorp license changes
The consequences of HashiCorp changing its licenses from MPL to BSL

A bit of a different blog post today. As of a few hours ago, Terraform's license has been changed from an open source license (MPL) to BSL. This change is not exclusive to Terraform. It applies to all HashiCorp’s open source products.

The announcement from HashiCorp today definitely threw me for a loop. I've contributed to the Terraform AWS provider and spent significant time with Vagrant back in college, and with Terraform, Packer, and Vault more recently. I have never done significant work with Nomad, Consul, Waypoint, or Boundary.

Why did HashiCorp change the license?

It is tremendously difficult to monetize open source dev tools, but I figured HashiCorp had nailed down the recipe, even though my own team typically uses Terraform without Terraform Cloud and Vault without paying for the managed offerings.

We don’t use managed offerings because we like to control where things run and don't feel like the managed offerings add enough value to us. For teams less intimate with DevOps practices and infrastructure, managed offerings definitely reduce the barrier to using these products.

Upon hearing about this change in licensing, I recalled that HashiCorp is a public company and I did a little bit of digging. Since its IPO in December 2021, the company has lost 67% of its value. The company has not turned a profit, and its $137.98M revenue this quarter was eclipsed by its $179.01M of operating expenses. The company’s growth story doesn’t look promising either. Over the last year, HashiCorp’s cash reserves have fallen over 5%. Its assets during this time have only increased 0.29%.

The promise, and therefore value, of HashiCorp’s Terraform Cloud has decreased significantly since its IPO. Ironically, I suspect this decrease in value is an indirect consequence of its popularity.

As Terraform has become more popular, more resources and guides are available to enable people to operate independently. Some of these resources include direct competitors to Terraform Cloud itself. Spacelift.io, for instance, offers a direct competitor to Terraform Cloud. Consequently, the promise, and therefore value of HashiCorp’s managed solutions has decreased.

These changes in licensing feel like a desperate response to business pressures and are a frantic attempt to make more money off of existing products.

The move will probably yield some dividends too. By preventing third parties like Spacelift.io from leveraging new versions of terraform, new business will probably naturally shift over to Terraform Cloud. I’m betting Spacelift.io is reeling right now.

What consequences will the license change have in the marketplace?

Terraform's dominance in the marketplace makes it the de facto cloud infrastructure provisioning choice for enterprises. The product has a ton of inertia, and enterprise organizational guidance and DevOps engineers are therefore unlikely to switch to competitors like Pulumi. Terraform’s new license restrictions still allow these enterprises to use Terraform to manage their own infrastructure.

CloudOps platforms that compete directly with Terraform Cloud, like Spacelift.io, on the other hand, have been more severely kneecapped. These will need to take drastic action to recover from this blow.

A managed solution that is more moderately impacted by this license change is AWS Proton. Proton allows users to run terraform templates on demand within AWS. In this case, where the product isn't a direct competitor to Terraform Cloud, I expect Amazon and HashiCorp to arrive at some kind of symbiotic arrangement.

Other HashiCorp products like Vault and Vagrant, for instance, are more standalone in nature, and I don't think people are trying to use them to compete with HashiCorp managed offerings to the same extent. As such, the changes in license are even less meaningful.

If I were a betting man, I’d make the following five predictions:

  1. Until Terraform is forked, system integrators and cloud solution providers will continue to use terraform as usual. The widespread industry adoption of terraform makes it a safe bet. Nobody ever got fired for buying IBM. Because the licenses remain permissive enough for these companies and because the product has enough inertia, little will change.

  2. Platforms like Spacelift.io will push alternatives like Pulumi hard, and will consider forking terraform. This change directly attacks their product and responding effectively is existential.

  3. Other infrastructure-as-code vendors like Pulumi might use this licensing move as an opportunity for differentiation. Knowing Pulumi’s style, I would be unsurprised if they added support for HashiCorp Configuration Language.

  4. Terraform providers will probably retain fully open source licenses. If they don't, they will promptly be forked by the Open Source community.

  5. Smaller and more innovative companies will start adopting fully open source infrastructure as code platforms. Tools like Pulumi will probably see a decent boost from this. Developers exploring new approaches to infrastructure as code will begin to explore alternatives.

What is the long term prognosis for HashiCorp?

I expect that this change in licensing will enable a small boost to HashiCorp in the short term, but I am not buying any stock soon. It seems to me like the current business model is unsustainable. The HashiCorp team needs to cut costs significantly or find a new source of revenue.

Because the products bringing in 85%+ of revenue to HashiCorp are Terraform Cloud and HCP Vault, these products probably consume most of the company’s budget. Given their complexity, I expect that supporting products like Consul and Nomad is probably reasonably expensive too. Resources will probably be diverted away from these products.

I’m not sure what HashiCorp is doing to innovate, but I don’t think these license changes will change the company’s fortune significantly in the long term. The company will need to pull another hit product out of its hat if it is to succeed.

Will I continue to use Terraform and Vault?

For personal use and our app modernization projects, I gradually moved away from Terraform almost 2 years ago. These days, I prefer to use Cloud Provider-native secrets management solutions to Vault, and Pulumi to Terraform as my IaC solution of choice.

With that said, my team is a team of system integrators. It is our job to meet customers where they are, and if a customer’s engineers are trained on terraform, I don’t think these license changes are substantial enough to warrant a change in recommendations. Terraform remains a mature product, and I expect that I will continue to use it for years to come.

At least until infrastructure-from-code frameworks like Eventual Cloud and getAmpt go mainstream or Pulumi gets the love it deserves.

https://yehudacohen.substack.com/p/initial-thoughts-about-hashicorp
Extensions
Exploring the Emerging Cloud Development Tooling Landscape
In-depth thoughts about the past, present, and future of cloud development from my (opinionated) perspective

I’ve been building infrastructure with code and cloud applications since 2016. Over the years, some of what I have built has been deployed to servers, some of it has been deployed to container platforms, and some has been deployed to function-as-a-service platforms. The way I and my team build has evolved a lot and I anticipate that it will continue to evolve over the next few years. This is an (opinionated) exploration of how we got here and where I believe we’re going.

While I expect the salient points of this article to apply across cloud platforms, I offer the following disclaimer: While I do have experience building on GCP, Azure, and Kubernetes, I have not fully explored the many options available for these platforms. As such, I will focus on AWS in my discussion.

This is a very long blog post, so if the subject matter interests you, sit tight and get comfortable.

The Beginning

When I started building on AWS as a Software Engineer at Audible, I, and everyone else I came into contact with, developed via click-ops. We opened the AWS console and manually configured DynamoDB tables, Aurora Clusters, SNS topics, SQS queues, Kinesis Streams, S3 Buckets, Firehoses, EMR clusters, and Lambda Functions.

When we wanted to replicate the configuration to production or set up a similar set of infrastructure, we would attempt to recreate our clicks a second time in a new AWS account.

I quickly became sick of repetitive clicking around in the AWS console and the resulting misconfigurations that led to production issues. I looked for alternatives.

Infrastructure-as-Code

I discovered CloudFormation and introduced it to our team in 2016.

The only opinion I had about infrastructure-as-code back then was that defining YAML CloudFormation templates to automate my AWS configurations beat the crap out of ClickOps every time.

In November of that year, SAM enabled me to start building serverless applications with more expressive YAML. It felt painless and I started becoming excited by how expressive my infrastructure as code templates became. At least until my configuration files got longer and the resulting YAML became unwieldy. I got fed up with CloudFormation YAML’s verbosity and difficulty to debug.

The occasional cryptic failures during my CloudFormation deployments with unclear remediation paths (e.g. UPDATE_ROLLBACK_FAILED) also played a part in my dissatisfaction with the experience.

Developing Serverless Java Applications

In December or January, I found the aws-serverless-java-container package on GitHub while exploring ways to bundle a Spring-based Java application into a Lambda function. The idea was simple: we had applications with very irregular traffic and could reduce the operational footprint by moving our entire Spring-based application inside a Lambda function. We could then transparently reverse-proxy all traffic to that Lambda function from API Gateway.

I could use a lightweight SAM template to deploy a full-blown Java application that scaled elastically to meet demand. This beat the developer experience of Elastic Beanstalk or self-managed EC2 runtimes[1] by miles.

Running full Spring applications inside a Lambda function was not without pain, however. Cold starts were slow (especially due to Spring context builds), and AWS Lambda’s code size and timeout limitations reared their respective heads too frequently. With provisioned concurrency not yet available, I used CloudWatch event rules to keep Lambda functions warm. The p95 and p99 latencies remained high.
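
For readers unfamiliar with the keep-warm trick, here is a minimal sketch of the handler side of it. It is written in Python rather than Java for brevity, and the function names and return values are hypothetical. It relies on the fact that scheduled CloudWatch Events rules deliver an event whose source is aws.events and whose detail-type is Scheduled Event.

```python
def is_warmup_event(event):
    """Return True for the periodic pings sent by a scheduled CloudWatch Events rule."""
    return (
        isinstance(event, dict)
        and event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )


def handler(event, context):
    # Short-circuit warm-up pings: they keep an execution environment alive
    # without running any application logic.
    if is_warmup_event(event):
        return {"warmed": True}
    # Placeholder for the real request handling (hypothetical).
    return {"statusCode": 200, "body": "handled request"}
```

A ping like this only keeps one execution environment warm per concurrent invocation, which is one reason tail latencies can stay high under bursty traffic.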

I was not enamored with these drawbacks, but the approach had me thinking about how cloud development experiences should feel.

Terraform and a new Infrastructure-as-code obsession

By mid-2017, I was obsessed with the power of infrastructure-as-code and started exploring alternatives to my CloudFormation YAML spaghetti. I quickly discovered terraform and became obsessed. So much so, that I made the decision to quit my day job as an SDE in a failed attempt to sell reusable infrastructure-as-code modules. I found my way into AWS professional services where terraform nonetheless became my new best friend.

Terraform was far more expressive than CloudFormation and had IDE highlighting. Some of the things I loved about it that were not true for CloudFormation back then were:

  1. Support for modules.

  2. A state manipulation API to import existing resources or remove a resource from a terraform stack.

  3. Support for non-AWS providers and resources.

  4. The ability to deploy to multiple AWS accounts and regions in a single stack, enabling an easier AWS multi-account development experience.

  5. Superior documentation, community, and a visible roadmap.

Using terraform to build serverless workloads, however, was painful.

After working with SAM, creating serverless APIs in terraform felt clunky. For one, zipping and deploying lambda functions on the fly wasn’t something terraform was suited for[2]. For another, configurations for API Gateway integrations seemed verbose and there was no standard library of supported modules I could use to hide this verbosity.

Bridging the serverless Infrastructure-as-Code gap

Having started my career building monolithic RESTful APIs, I wanted the same ease of development with the benefits of the lower operational costs of AWS lambda-based workloads.

I explored the Serverless Framework and found it to be easier to configure and maintain than SAM or Terraform. It also had an interesting plugin ecosystem. The developer experience still felt fragmented, however.

Chalice

After some exploration, I discovered Chalice, an AWS open-source framework for building event-driven serverless applications with Python.[3]

Developing applications with chalice is similar in user experience to building applications in Flask.

A Hello World app in chalice looks as follows:

from chalice import Chalice
app = Chalice(app_name='helloworld')

@app.route('/')
def index():
    return {'hello': 'world'}

This chalice deployment generates an AWS API Gateway endpoint and lambda function which handles an incoming request to /, invokes the index() function, and returns a 200 response with {'hello': 'world'}.

In addition to API routes, Chalice can also configure function subscriptions to SQS queues, S3 bucket events, Kinesis streams, EventBridge events and schedules, SNS, and DynamoDB streams.

Having a framework infer infrastructure configuration from application code immediately felt intuitive to me. Over five years later, it still feels good.

The pattern of inferring required infrastructure configurations from application code has recently become known as infrastructure-from-code[4]. We will explore infrastructure-from-code in greater detail after fully exploring the CDK and Pulumi’s approach to infrastructure-as-code.
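
To illustrate the idea of inferring infrastructure from application code, here is a toy sketch. ToyApp and its methods are invented for this example and are not how Chalice is actually implemented. The point is only that a decorator can record enough information about a handler for a framework to derive the resources it needs.

```python
class ToyApp:
    """Toy stand-in for a framework such as Chalice (not its real implementation)."""

    def __init__(self, app_name):
        self.app_name = app_name
        self.routes = {}

    def route(self, path):
        """Decorator that records which function handles which path."""
        def register(func):
            self.routes[path] = func
            return func
        return register

    def infer_resources(self):
        """Derive the cloud resources implied by the registered handlers."""
        resources = []
        for path, func in self.routes.items():
            resources.append(
                {"type": "lambda_function", "name": f"{self.app_name}-{func.__name__}"}
            )
            resources.append(
                {"type": "api_route", "path": path, "target": func.__name__}
            )
        return resources


app = ToyApp("helloworld")


@app.route("/")
def index():
    return {"hello": "world"}


resources = app.infer_resources()
print(resources)
```

The application code never mentions a function or an API route resource explicitly. Both are derived from the decorated handler.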

Expressive cloud infrastructure declaration

It was probably early in 2018 when I started questioning Terraform’s HCL language as a sufficiently expressive way of configuring cloud infrastructure. I found I repeated myself a lot and the declarative language felt suboptimal. IntelliSense for Terraform in my editor (VS Code) was also not up to par.

Control flow constructs like loops and conditionals (which were frequently necessary for my stacks) felt like second-class citizens in HCL[5], and configuration that could be expressed easily in regular programming languages like Python or TypeScript felt cumbersome and sometimes infeasible to build with terraform.
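
As a small illustration of the kind of logic that is trivial in a general-purpose language, here is a sketch that builds per-environment bucket definitions with an ordinary loop and conditional. The resource shape and names are hypothetical plain dictionaries, not the API of any particular tool.

```python
def build_buckets(environments, enable_logging_in=("prod",)):
    """Build one bucket definition per environment using plain control flow."""
    buckets = []
    for env in environments:
        # Only production buckets are versioned in this made-up example.
        bucket = {"name": f"app-data-{env}", "versioning": env == "prod"}
        if env in enable_logging_in:
            bucket["logging_target"] = f"app-logs-{env}"
        buckets.append(bucket)
    return buckets


buckets = build_buckets(["dev", "staging", "prod"])
print(buckets)
```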

Back then, I believed[6] the best way to resolve these terraform pain points was to amend the HCL language and improve IDE tooling. I watched the release notes for HCL very closely back then, and the language spec has definitely improved somewhat. As has IDE support.

Pulumi

Toward the end of 2018, another infrastructure-as-code project arrived on my radar. Another engineer I spoke to in the infrastructure-as-code space had become obsessed with the early beta-releases of Pulumi.

Pulumi offers an alternative to Terraform to enable users to configure code in an imperative paradigm using their programming language of choice. The Pulumi deployment engine is then responsible for ensuring that the deployed infrastructure matches the imperatively defined infrastructure. In this way, some parts of Pulumi operate imperatively and others declaratively.

I saw value in this model but was unwilling to experiment. After all, Pulumi was new and had low adoption and little support. I did take notice though. More on this later.

The AWS CDK

In July 2019, AWS announced the AWS CDK. At the time, anyone serious about infrastructure-as-code was using Terraform. Managing infrastructure in YAML or JSON CloudFormation templates didn’t scale and AWS looked to the CDK as its response.

The initial release announcement blog for the CDK showed 30 lines of infrastructure code defining an API that posted messages to an SQS queue that was consumed by an ECS Fargate service in a VPC. I tried it out, and the user experience and abstractions blew my mind.

I was almost convinced to abandon terraform in favor of the CDK wherever possible, but three factors kept me using Terraform:

  1. Terraform has a huge array of non-AWS providers that you can use to combine AWS and non-AWS resources into the same stacks.

  2. Deploying infrastructure across multiple regions and AWS accounts was easier with Terraform than with the CDK.

  3. Terraform has a state API as an escape hatch to easily remediate drift. I was attached to the ability to be able to manipulate state files programmatically.

Experimenting with the CDK

In early-2021, I hired a new Principal Engineer who was (and still is) a very strong proponent of the CDK. He introduced the CDK to our team at Foresight and his interest reignited my own. My discussions with him regarding the tradeoffs of terraform and the CDK coincided with the announcement of the preview for the CDK for Terraform. My aversion to CloudFormation was no longer a good excuse to avoid the CDK.

I needed to build a small Fargate / SageMaker-based prototype for a customer. It was the perfect opportunity to try out the Terraform CDK for something non-trivial. My experience building with the Terraform CDK felt intuitive, with a couple of painful exceptions. Back then, the token system was a mess. I was forced to litter my code with nasty workarounds that I found buried within GitHub issues and, failing that, the source code itself.

Issues with the token system have since been fixed, but cdktf support for the terraform state cli is still poor. Refactoring terraform stacks still requires users to know about the generated terraform code rather than using the cdktf cli.

Problems aside, I immensely enjoyed the expressiveness of a full-fledged and mature programming language when defining infrastructure.

Reflections on infrastructure-as-code tools

To make a long story short, I continued to experiment in this space and have developed some opinions:

I currently favor Pulumi

When building greenfield systems, I favor Pulumi. With a state management API, an automation API that enables pulumi triggers from code, and its own deployment engine, Pulumi feels more flexible and powerful than either the AWS or Terraform CDK.

In electing to not reuse the Terraform or CloudFormation deployment engines, Pulumi is able to support asynchronous processing in a construct using its Output.apply(callback) pattern. This is impossible with the AWS and Terraform CDKs as synthesis creates CloudFormation or Terraform prior to creating infrastructure.

Pulumi is not without its drawbacks, but it feels like the most flexible infrastructure-as-code option currently. Some drawbacks are:

  1. The community is small (albeit friendly and helpful) and the documentation feels fragmented. It can be very difficult to find the information you need.

  2. The small community leads to a meager construct ecosystem, especially when compared with that of the AWS CDK. Pulumi has attempted to remedy this by building CDK construct interoperability. This interoperability is unfortunately buggy and relies on the AWS Cloud Control API which does not yet fully support all necessary resources.

  3. Pulumi Output<T> types expose a callback interface that initially feels unintuitive for the uninitiated. Language-native promise support allowing the use of async / await keywords would improve developer experience tremendously.

Domain-Specific Languages vs Programming Languages

Defining infrastructure using existing programming languages like Python and TypeScript provides a superior experience to domain-specific languages like HCL or CloudFormation YAML. IntelliSense, autocomplete, testing frameworks, package ecosystems, and years of refinement have ensured that these languages are flexible, expressive, composable, and a relative pleasure to work with.

Dynamic vs Static Typing

Statically typed languages with mature type systems like TypeScript make infrastructure-as-code, and cloud development in general, a lot easier than dynamically typed languages like Python. This is especially true because the work is so dependent on modeling relationships and passing data between different cloud services.

Ensuring developers can detect invalid relationships statically using an IDE linter rather than dynamically at run-time is hugely impactful. While Python’s type annotations are helpful in this regard, I find that Python developers frequently pass complex data types as dictionaries or lists or tuples rather than modeling out these types ahead of time. Additionally, Python’s typing ecosystem does not feel as mature as TypeScript’s to me. See this Hacker News thread for a conversation comparing the two.

The future of infrastructure-as-code

It is interesting to me that the AWS CDK, the Terraform CDK, and Pulumi all elected to target a large number of programming languages rather than specializing in a single popular statically typed language like TypeScript.

I think that a TypeScript native implementation of an infrastructure-as-code tool would let designers focus enough to build a much better user experience.

Output<T> types are one example of a sub-optimal developer experience in Pulumi. An intuitive async-await API built on language-native promises would significantly simplify code.

Consider the following Pulumi code allowing traffic from an application security group to a database security group in a separate AWS account, for example:

import * as aws from "@pulumi/aws";

// vpcId and dbAccountId are assumed to be defined earlier in the program.
const appSecurityGroup = new aws.ec2.SecurityGroup(
  "InfraFunctionSecurityGroup",
  {
    description: "InfraFunctionSecurityGroup",
    vpcId: vpcId.value,
  }
);
// Look up the database security group id stored in SSM Parameter Store.
const dbSecurityGroupId = aws.ssm.getParameterOutput({
  name: "/db/security-group-id",
});
// Build the cross-account reference inside an apply() callback.
const dbSecurityGroupRef = dbSecurityGroupId.value.apply(
  (value) => `${dbAccountId}/${value}`
);
const appSecurityGroupId = appSecurityGroup.id.apply((id) => `${id}`);
const allowAppToDb = new aws.ec2.SecurityGroupRule("allowAppToDb", {
  type: "egress",
  fromPort: 5432,
  toPort: 5432,
  protocol: "tcp",
  sourceSecurityGroupId: dbSecurityGroupRef,
  description: "Allow outgoing connections from app to db",
  securityGroupId: appSecurityGroupId,
});

A more succinct Promise-based TypeScript-specific declaration that leverages native promises and async-await semantics might resemble the following:

const appSecurityGroup = aws.ec2.securityGroup(
  "appSecurityGroup",
  { vpcId: vpcId.value }
);
const dbSecurityGroupId = aws.ssm.getStringParameter({
  name: "/db/security-group-id",
});
const allowAppToDb = await aws.ec2.securityGroupRule("allowAppToDb", {
    type: "egress",
    fromPort: 5432,
    toPort: 5432,
    protocol: "tcp",
    sourceSecurityGroupId: `${dbAccountId}/${await dbSecurityGroupId}`,
    description: "Allow outgoing connections from app to db",
    securityGroupId: await appSecurityGroup.id,
});

Of the two, the second example is more concise and intuitive. When describing more complex relationships and dependencies, the benefits of conciseness compound. Language-native promise-based infrastructure-as-code configuration flattens code and renders more complex cases requiring the nesting of entire resource declarations within callbacks trivial.
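
The difference between the two styles can also be shown with a small runnable toy. The sketch below uses Python's asyncio rather than TypeScript. ToyOutput is invented for illustration and only loosely mimics a callback-based output type (it is not Pulumi's implementation). The lookup function and account id are hypothetical.

```python
import asyncio


class ToyOutput:
    """Toy value that resolves later and is transformed through callbacks."""

    def __init__(self, awaitable):
        self._task = asyncio.ensure_future(awaitable)

    def apply(self, func):
        """Return a new ToyOutput holding func(resolved value)."""
        async def chained():
            return func(await self._task)
        return ToyOutput(chained())

    def __await__(self):
        return self._task.__await__()


async def lookup(value):
    """Pretend to fetch a value from a remote service."""
    await asyncio.sleep(0)
    return value


async def main():
    account = "123456789012"  # hypothetical account id
    # Callback style: the string is built inside a function passed to apply().
    group_id = ToyOutput(lookup("sg-db"))
    ref_callback = await group_id.apply(lambda v: f"{account}/{v}")
    # Await style: the value is used inline, with no callback.
    ref_await = f"{account}/{await lookup('sg-db')}"
    return ref_callback, ref_await


result = asyncio.run(main())
print(result)
```

Both paths produce the same string. The await version reads top to bottom, while the callback version moves the interesting logic into a nested function, and that nesting grows with every additional dependency.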

Cloud dev tool designers Sam Goodwin and Dax Raad discuss a compelling solution like this on Twitter[7]:

The next infrastructure-as-code tool I plan on adopting will provide a language-native, statically typed async API to define infrastructure, and will execute lazily.

The gap left by Infrastructure-as-code tools

The composability and flexibility of programming language-based IaC tools like Pulumi and the CDK are integral to cloud development. These tools enable us to create higher-level and reusable infrastructure-as-code constructs and share them using language-native package ecosystems. They are not sufficient, however, and achieving an optimal cloud development experience requires improved abstractions.

Like monolith developers, cloud developers must understand how to build software. Unlike monolith developers they must also understand:

  • The purpose of a huge array of cloud services and how to configure them.

  • How to keep their codebase comprised of both infrastructure and application code from growing complex and disorganized.

  • How to test their solutions without waiting 15 minutes between iteration cycles.

  • Perhaps most importantly, how to structure infrastructure code, application code, and release pipelines to keep blast-radiuses contained and to ensure that systems remain stable during releases.[8]

Even with the above expertise, developing cloud workloads with infrastructure-as-code tools and utility scripts leveraging AWS SDKs feels slow and error-prone. It is a far cry from the magical developer experiences provided by application frameworks like Ruby-on-Rails or Spring Boot or Django.

Cloud-development frameworks and Infrastructure-from-code

With a high barrier to entry, the tedium of testing, and the lack of conventions in cloud development, it was inevitable that old frameworks and tools would evolve and new frameworks and tools would emerge to target cloud development pain points and bridge the gap unfilled by infrastructure-as-code.

Cloud-focused enhancements to popular application frameworks

Many AWS services are used to support communication and persistence for business applications. This is true even of applications deployed outside AWS. These applications are frequently built with traditional application frameworks like Spring Boot in the Java world or Rails in the Ruby world.

Libraries and cloud-focused extensions to these frameworks enable first-class support for some key AWS services. For example, Ruby on Rails has been extended to support AWS’s DynamoDB and S3 with Dynamoid and Active Storage’s S3 support respectively. Similarly, Spring Cloud AWS brings DynamoDB, S3, SES, SNS, SQS, Parameter Store, Secrets Manager, and CloudWatch metrics support to Spring.

Usually, but not always, these libraries and extensions focus on supporting AWS’s data-plane operations rather than control-plane operations[9].

Occasionally, as is the case with the popular JavaScript and TypeScript library dynamoose, these libraries perform control-plane type operations like creating database tables. In these cases, I recommend disabling this functionality.

Libraries and extensions intended to live in existing applications are ill-suited to managing the operational lifecycle of the AWS resources they create. Instead, these tools should be used in conjunction with a dedicated infrastructure-as-code tool to manage the life cycle of an application’s dependent cloud resources.[10]

Frameworks for serverless development

Early on, it became obvious that regular infrastructure-as-code tools did not provide an ideal developer experience for serverless workloads. One reason for this is that serverless workloads frequently consist of lots of cloud resources including functions, event triggers, IAM roles and policies, and more. With each integration requiring a significant amount of configuration, serverless developers were spending more time configuring infrastructure than writing code.

Tools like the Serverless Framework and SAM emerged to enable more concise configurations for these types of applications. Tools like Chalice and Zappa aimed to unify the development experience further, by moving this configuration inside your code using decorators.

All of these frameworks provide developer experiences that enable users to focus on defining and binding event triggers and business logic rather than the underlying cloud resources.

Almost every serverless stack, however, depends on extra cloud resources that belong external to the application code. Persistent resources like DynamoDB tables, S3 buckets, and SQS queues are common examples11.

To address extending serverless applications beyond event triggers, the serverless framework and SAM both support extension via CloudFormation YAML. The Chalice CLI supports a —-merge-template argument that allows you to merge separate CloudFormation files containing these resources into your chalice deployment bundle12.

My painful associations with CloudFormation led me to write a set of python scripts to extract infrastructure dependencies from terraform stacks, and inject them into our chalice configuration. I performed a similar trick to extract resources (like API Gateway endpoints) created by Chalice and inject them into dependent terraform stacks. This approach has served us well at Foresight despite all the upfront python needle-and-thread work required.
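The scripts themselves are not shown here, but the core of the trick is small. What follows is a minimal sketch of the idea, in TypeScript rather than Python to match the other examples in this post. It assumes the JSON shape printed by terraform output -json and the stages/environment_variables layout of a Chalice config.json. The helper name injectTerraformOutputs and all sample values are hypothetical.

```typescript
// One entry in the JSON printed by `terraform output -json`.
interface TerraformOutput {
  value: unknown;
  type?: unknown;
  sensitive?: boolean;
}

// The subset of a Chalice .chalice/config.json that this sketch touches.
interface StageConfig {
  environment_variables?: Record<string, string>;
}
interface ChaliceConfig {
  version: string;
  app_name: string;
  stages: Record<string, StageConfig>;
}

// Hypothetical helper: copy non-sensitive Terraform outputs into the
// environment variables of one Chalice stage. The input is not mutated.
function injectTerraformOutputs(
  config: ChaliceConfig,
  stage: string,
  outputs: Record<string, TerraformOutput>
): ChaliceConfig {
  const existing: StageConfig = config.stages[stage] || {};
  const variables: Record<string, string> = { ...(existing.environment_variables || {}) };

  Object.keys(outputs).forEach((outputName) => {
    const output = outputs[outputName];
    if (output.sensitive) {
      return; // keep secrets out of config.json
    }
    // Environment variable values are strings, so serialize anything else.
    variables[outputName.toUpperCase()] =
      typeof output.value === "string" ? output.value : JSON.stringify(output.value);
  });

  return {
    ...config,
    stages: { ...config.stages, [stage]: { ...existing, environment_variables: variables } },
  };
}

// Example with made-up outputs: one plain value and one sensitive value.
const updatedConfig = injectTerraformOutputs(
  { version: "2.0", app_name: "my-app", stages: { dev: {} } },
  "dev",
  {
    app_table_name: { value: "my-table", type: "string", sensitive: false },
    db_password: { value: "example-secret", type: "string", sensitive: true },
  }
);
console.log(JSON.stringify(updatedConfig.stages["dev"]));
```

The reverse direction (feeding resources created by Chalice into Terraform) follows the same pattern with the roles of the two files swapped.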

Frameworks as Infrastructure-as-code Constructs

The cleanest model for managing infrastructure dependencies for a Chalice application, however, emerged in January 2021. Amazon released a CDK construct for a Chalice App. Suddenly, passing infrastructure configuration into a Chalice application, and extracting infrastructure configuration created by the Chalice application became trivial. Here’s a code snippet from the above blog post that illustrates this simplicity:

class ChaliceApp(cdk.Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        self.dynamodb_table = self._create_ddb_table()
        self.chalice = Chalice(
            self, 'ChaliceApp', source_dir=RUNTIME_SOURCE_DIR,
            stage_config={
                'environment_variables': {
                    'APP_TABLE_NAME': self.dynamodb_table.table_name
                }
            }
        )
        self.dynamodb_table.grant_read_write_data(
            self.chalice.get_role('DefaultRole')
        )

A DynamoDB table is created with a CDK construct. A chalice app is deployed with its own construct which in turn creates a new IAM role. Permissions on the table are then modified to enable the chalice role to read and write data.

This CDK-chalice integration blended the best of an expressive infrastructure-as-code tool and a serverless application framework. To the best of my knowledge, the Chalice-CDK integration was the first framework to offer this deployment model. It was not the last, however. I am aware of two other frameworks that use this deployment model.

The first of these frameworks is SST, and the second is Eventual. Like Chalice, Eventual is deployed as a construct. SST, on the other hand, exposes constructs to deploy applications built with popular frameworks to the cloud.

Each of these two frameworks has more to offer than their integrations with general-purpose programming language-based infrastructure-as-code tools, however. We will therefore devote some time to each.

SST

SST is a super-set of the CDK focused on providing primitives necessary to build serverless applications on AWS seamlessly. Of the frameworks I’ve worked with, SST represents the best current developer experience for building greenfield serverless applications.

Deploying web applications with SST

SST exposes constructs for building static, Next.js, Remix, Astro, and SolidStart websites. An example of a Next.js site with a custom domain, taken from the SST docs, follows:

import * as acm from "aws-cdk-lib/aws-certificatemanager";
import * as route53 from "aws-cdk-lib/aws-route53";
import * as route53Targets from "aws-cdk-lib/aws-route53-targets";

// Look up hosted zone
const hostedZone = route53.HostedZone.fromLookup(stack, "HostedZone", {
  domainName: "my-app.com",
});

// Create a certificate with alternate domain names
const certificate = new acm.DnsValidatedCertificate(stack, "Certificate", {
  domainName: "foo.my-app.com",
  hostedZone,
  region: "us-east-1",
  subjectAlternativeNames: ["bar.my-app.com"],
});

// Create site
const site = new NextjsSite(stack, "Site", {
  path: "my-next-app/",
  customDomain: {
    domainName: "foo.my-app.com",
    alternateNames: ["bar.my-app.com"],
    cdk: {
      hostedZone,
      certificate,
    },
  },
});

// Create A and AAAA records for the alternate domain names
const recordProps = {
  recordName: "bar.my-app.com",
  zone: hostedZone,
  target: route53.RecordTarget.fromAlias(
    new route53Targets.CloudFrontTarget(site.cdk.distribution)
  ),
};
new route53.ARecord(stack, "AlternateARecord", recordProps);
new route53.AaaaRecord(stack, "AlternateAAAARecord", recordProps);

Note how certificates are passed into the NextjsSite construct, a path to the application code is specified, and a route53 alias record points to the CloudFront distribution returned from the NextjsSite construct.

While these frontend frameworks are not cloud-development frameworks per se, they are frequently deployed to cloud resources including CloudFront distributions, S3 buckets, and Lambda functions.

Expressing application code and infrastructure dependencies in a single code base without complex workflows or spaghetti configuration to deploy changes feels incredibly powerful.

What the SST adds to the CDK’s developer experience

It has only been over the last couple of months that we’ve started building workloads using the SST at Foresight. I have heard only good feedback about the developer experience from our team.

Aside from the set of expressive higher-level constructs the SST provides, the SST also differentiates itself by laser-focusing on the entire serverless developer experience.

It provides a project bootstrapping mechanism that lets you create a best-practice mono-repo. It offers support for intuitive testing. It offers mechanisms for managing database migrations and secrets, along with a set of excellent local development tools including IDE support, a local lambda debugging experience, and a visual console to aid your development experience.

The SST team’s focus on developer experience has quickly made it one of my favorite cloud development experiences. If you haven’t tried it yet, I’d recommend giving it a try.

Eventual (and the future of Infrastructure-from-code)

Eventual is the brain-child of Sam Goodwin and Sam Sussman, two former Amazon Alexa engineers. This cloud development framework is still in a closed beta that you can request access to here. Even in beta, this framework offers a working glimpse of what I hope cloud development becomes.

Eventual provides a few simple but very powerful abstractions. The Eventual compiler enables this level of abstraction by running Eventual Service code and having primitives self-register to produce an AppSpec. Eventual then uses esbuild to tree-shake and bundle distributed components. During the bundling process, the eventual compiler performs some limited transformations on the source code using SWC libraries.

The produced AppSpec is consumed by infrastructure-as-code constructs to understand what infrastructure they must create.
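To make "primitives self-register" a little more concrete, here is a heavily simplified sketch of the pattern. It is not Eventual's implementation: the names defineWorkflow, defineActivity, defineEvent, the AppSpec fields, and planResources are all hypothetical, and the real compiler does far more (bundling, tree-shaking, source transformations).

```typescript
// Hypothetical, simplified model of an AppSpec. Not Eventual's real format.
interface AppSpec {
  workflows: string[];
  activities: string[];
  events: string[];
}

// Module-level registry that every primitive writes to when it is created.
const registry: AppSpec = { workflows: [], activities: [], events: [] };

function defineWorkflow<I, O>(id: string, handler: (input: I) => O): (input: I) => O {
  registry.workflows.push(id);
  return handler;
}

function defineActivity<I, O>(id: string, handler: (input: I) => O): (input: I) => O {
  registry.activities.push(id);
  return handler;
}

function defineEvent(id: string): { id: string } {
  registry.events.push(id);
  return { id };
}

// "Service code": simply evaluating these declarations fills the registry.
const doWork = defineActivity("work", (work: string) => work.length);
const workDone = defineEvent("WorkDone");
const myWorkflow = defineWorkflow("myWorkflow", (items: string[]) => items.map(doWork));

// What an infrastructure-as-code construct might derive from the spec.
function planResources(spec: AppSpec): string[] {
  return spec.workflows
    .map((id) => "workflow-orchestrator:" + id)
    .concat(spec.activities.map((id) => "activity-function:" + id))
    .concat(spec.events.map((id) => "event-bus-rule:" + id));
}

console.log(planResources(registry), workDone.id, myWorkflow(["a", "bcd"]));
```

The point of the pattern is that the service author never writes the spec by hand. Loading the module is enough for the tooling to learn which components exist.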

The extent to which Eventual goes to both introspect and bundle service code is no parlor trick. It enables a development experience akin to programming a single computer, but a deployment experience that enables the reliability and scale of battle-hardened cloud services.

An example Eventual Service

To illustrate the power of Eventual, I will excerpt (with permission) a couple of very simple examples from the examples provided in the as-of-yet non-public GitHub repository. Expect some of these interfaces to change before a stable version is released.

An Eventual Service is defined as a TypeScript file without much fanfare:

import { event, activity, workflow, api, HttpResponse } from "@eventual/core";
api.post("/work", async (request) => {
  const items: string[] = await request.json();
  const { executionId } = await myWorkflow.startExecution({
    input: items,
  });
  return new HttpResponse(JSON.stringify({ executionId }), {
    status: 200,
  });
});
export const myWorkflow = workflow("myWorkflow", async (items: string[]) => {
  const results = await Promise.all(items.map(doWork));
  await workDone.publishEvents({
    outputs: results,
  });
  return results;
});
export const doWork = activity("work", async (work: string) => {
  console.log("Doing Work", work);
  return work.length;
});
export interface WorkDoneEvent {
  outputs: number[];
}
export const workDone = event<WorkDoneEvent>("WorkDone");

This service lets you post an array of strings to an API Gateway endpoint at route /work. The API starts a workflow that processes this array of strings and immediately returns an HTTP response including the workflow’s ID to the caller. The workflow then iterates over each string and computes its length. Once the entire array has been processed, the workflow publishes a WorkDoneEvent with a computed array of string lengths.

This service does not seem all that impressive at first. After realizing that the workflow execution occurs in a different serverless runtime13 than the API, you might be a little more impressed. Upon realizing that each string’s length is calculated in parallel in a separate serverless runtime, you might be floored. Especially after you realize that each workflow execution and activity result is durable with exactly-once execution guarantees. I know I was floored.

This simple and monolithic-looking development experience describes a distributed serverless application that can be deployed to AWS using an API gateway, AWS Lambda event handler, a durable workflow built on top of lambda, SQS, and DynamoDB, along with an EventBridge EventBus.

Defining a similar workflow using AWS Step Functions and other equivalent AWS infrastructure is a lot more difficult. You need to configure a lambda function construct to doWork, an event bus to publish the resulting event, a step function with correctly configured ASL, and a second lambda function with an API gateway integration to trigger the step function. You then need to wire up the environment variables in the API handler function with the step function id and create a step functions client with the AWS SDK before writing the code to trigger a workflow execution. This is in addition to the complex set of permissions that must be defined with IAM roles and policies.

Eventual wires everything up for you with a few elegant primitives including workflow, activity, event, and api, along with its own cloud application compiler.

Not used in this example, but often very important in workflow design is Eventual’s signal primitive that enables workflows to pause and wait for additional information from API handlers and event handlers.

A strong focus on statically-typed integrations

Another feature provided by Eventual is support for type-safe, tRPC-like commands that are validated with zod schemas. An Eventual command might be defined as follows:

export const privateAccess = api.use(cors).use(authorized);
export const listPipelines = privateAccess.command(
  "listPipelines",
  {
    input: z.object({
      beforeTime: z.date({ coerce: true }).optional(),
    }),
  },
  async ({ beforeTime }, { user }) => {
    const query = Pipeline.query.byOwner({
      ownerId: user.username,
    });

    const pipelines = await (beforeTime
      ? query.gte({
          createTime: beforeTime?.toISOString(),
        })
      : query
    ).go();

    return {
      items: pipelines.data,
      nextToken: pipelines.cursor,
    };
  }
);

This example describes an RPC-like interface with included middleware to enable CORS and authorization for a command named listPipelines. The command definition, which was sent to me by Sam Goodwin (one of the creators of Eventual), takes three arguments. The first is the name of the command, the second is a zod schema against which the command's input is validated, and the third is the function that runs when the command is invoked.

This command can be invoked from a TypeScript frontend like a react project for instance via the Eventual ServiceClient documented here. An invocation might resemble the following:

import { useService } from '@/useService'
import { useUser } from '@/useUser'
import { useEffect, useState } from 'react'

export default function PipelineCountComponent () {
    const [pipelineCount, setPipelineCount] = useState(0);
    const { session } = useUser({ redirectTo: "/login"})
    const myEventualService = useService(session)

    useEffect(() => {
        // Effect callbacks must not be async, so define and call an async helper
        const loadPipelines = async () => {
            const pipelines = await myEventualService.listPipelines(
                {
                    beforeTime: new Date("2022-12-01")
                }
            )
            // listPipelines returns { items, nextToken }
            setPipelineCount(pipelines.items.length)
        }
        loadPipelines()
    }, [setPipelineCount, session, myEventualService])
    return <div>You have {pipelineCount} pipelines</div>
}

Note how the Eventual ServiceClient enables you to use a statically typed abstraction to listPipelines. It takes care of serializing and deserializing the data as it traverses the network, mirroring the developer experience of a local library-method invocation.

These type definitions enable your compiler to alert you of any mismatches between your input and the expected format for the listPipelines function, and your IntelliSense to let you know what methods and input formats are available on myEventualService.
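One way to picture how a client can inherit its types from the service is with a mapped type. The sketch below is my own illustration, not the Eventual ServiceClient: ClientOf, createClient, and the in-memory transport are hypothetical, and the calls are synchronous to keep the example short, whereas a real client would return promises.

```typescript
// Hypothetical server-side command definitions (plain functions here).
interface Pipeline {
  id: string;
  createTime: string;
}

const commands = {
  listPipelines: (input: { beforeTime?: string }): { items: Pipeline[] } => ({
    items: [{ id: "p1", createTime: "2022-11-30T00:00:00.000Z" }].filter(
      (p) => input.beforeTime === undefined || p.createTime < input.beforeTime
    ),
  }),
};

// A transport moves a command name and a JSON body "over the network".
type Transport = (command: string, body: string) => string;

// Client type derived from the service: same method names, inputs, and outputs.
type ClientOf<S> = {
  [K in keyof S]: S[K] extends (input: infer I) => infer O ? (input: I) => O : never;
};

function createClient<S extends Record<string, (input: any) => any>>(
  commandNames: (keyof S & string)[],
  transport: Transport
): ClientOf<S> {
  const client: Record<string, (input: unknown) => unknown> = {};
  commandNames.forEach((commandName) => {
    // Serialize the input, send it, and deserialize the response.
    client[commandName] = (input) => JSON.parse(transport(commandName, JSON.stringify(input)));
  });
  return client as unknown as ClientOf<S>;
}

// Stand-in for the network: look the command up and call it directly.
const inMemoryTransport: Transport = (command, body) => {
  const handler = (commands as Record<string, (input: any) => unknown>)[command];
  if (!handler) throw new Error("Unknown command " + command);
  return JSON.stringify(handler(JSON.parse(body)));
};

const typedClient = createClient<typeof commands>(["listPipelines"], inMemoryTransport);
const listed = typedClient.listPipelines({ beforeTime: "2022-12-01T00:00:00.000Z" });
console.log(listed.items.length);
// typedClient.listPipelines({ before: 5 }); // rejected by the compiler
```

Only the type of commands crosses the client/server boundary at compile time. At run time the client knows nothing but command names and JSON.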

The Zod schema integration also enables Eventual to generate OpenAPI specs for API endpoints defined as commands. This way you can be sure that your OpenAPI spec accurately describes your API contracts.

Eventual does not stop at API validation, however. It extends its type-safe development experience to event-driven systems and enables validation of event structures with zod. In the case of event-driven integrations, Eventual can generate and publish JSON Schema for your event types.
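The reason a single schema can drive both runtime validation and published documentation is that a schema is just data. Here is a minimal sketch of that idea using a deliberately tiny hand-rolled schema format in place of zod. ObjectSchema, validate, and toJsonSchema are hypothetical names for illustration, and this is not how Eventual or zod implement it.

```typescript
// A tiny stand-in for a schema library such as zod.
type FieldType = "string" | "number";
interface ObjectSchema {
  fields: Record<string, { type: FieldType; optional?: boolean }>;
}

// The slice of JSON Schema that this sketch emits.
interface JsonSchema {
  type: "object";
  properties: Record<string, { type: FieldType }>;
  required: string[];
}

// Runtime validation: returns a list of problems (empty means valid).
function validate(schema: ObjectSchema, value: Record<string, unknown>): string[] {
  const problems: string[] = [];
  Object.keys(schema.fields).forEach((key) => {
    const field = schema.fields[key];
    const actual = value[key];
    if (actual === undefined) {
      if (!field.optional) problems.push(key + " is required");
    } else if (typeof actual !== field.type) {
      problems.push(key + " must be a " + field.type);
    }
  });
  return problems;
}

// Documentation: derive a JSON Schema document from the same definition.
function toJsonSchema(schema: ObjectSchema): JsonSchema {
  const properties: Record<string, { type: FieldType }> = {};
  const required: string[] = [];
  Object.keys(schema.fields).forEach((key) => {
    properties[key] = { type: schema.fields[key].type };
    if (!schema.fields[key].optional) required.push(key);
  });
  return { type: "object", properties, required };
}

// One definition, two uses.
const executionStartedSchema: ObjectSchema = {
  fields: { executionId: { type: "string" }, itemCount: { type: "number", optional: true } },
};
console.log(validate(executionStartedSchema, { executionId: 42 }));
console.log(JSON.stringify(toJsonSchema(executionStartedSchema)));
```

Because the published schema is generated from the definition that performs validation, the two cannot drift apart.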

This focus on type safety is a huge deal in a cloud development framework because it abstracts over the network layer entirely. Starting at the user’s browser, and ending in backend upstream workflows, your compiler can alert you of broken integrations as you are developing locally. Crucially this speeds up the cloud development feedback loop and helps detect many integration errors. Detecting these errors early lowers the probability of them going unnoticed until a production release.

Deploying an Eventual Service

Like deploying a Remix app with SST or a Chalice app with the CDK, an Eventual Service is defined as a construct. Support for a Pulumi construct is in progress14 too. For instance:

const service = new Service(stack, "Service", {
  name: "my-service",
  entry: path.resolve("services", "my-service.ts"),
});

What if you need your service to know about a DynamoDB table you created in your CDK stack? Just pass in an environment variable like so:

const service = new Service(stack, "Service", {
  name: "my-service",
  entry: path.resolve("services", "my-service.ts"),
  environment: {
    TABLE_ARN: table.tableArn,
  },
});

But Eventual deploys many different isolated runtimes15. What if you only want to set the environment on one of them? Well, you can use a type-only import to import the types from your Eventual service definition into your CDK stack. You can then override the properties of individual command, subscription, and activity handlers, like this:

import type * as MyService from "@my/service";
const service = new Service<typeof MyService>(stack, "Service", {
  entry: path.resolve("services", "functions", "my-service.ts"),
  commands: {
    myCommand: {
      // set environment variables only on the myCommand function
      environment: {
        TABLE_ARN: table.tableArn,
      },
    },
  },
});
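The type-only import is what lets the compiler reject overrides for handlers that do not exist. A rough sketch of how such a constraint can be expressed with mapped types follows. The Command, HandlerProps, CommandNames, and ServiceProps types here are my own hypothetical stand-ins and not Eventual's actual type definitions, and the table ARN is a made-up value.

```typescript
// Hypothetical stand-ins: a "command" is tagged so the type system can find it.
interface Command {
  kind: "command";
  handle: (input: string) => string;
}
interface HandlerProps {
  environment?: Record<string, string>;
}

// Keep only the keys of a service module whose exports are commands.
type CommandNames<S> = { [K in keyof S]: S[K] extends Command ? K : never }[keyof S];

// Props that allow overrides only for commands that actually exist.
interface ServiceProps<S> {
  environment?: Record<string, string>;
  commands?: Partial<Record<CommandNames<S> & string, HandlerProps>>;
}

// A pretend service module with one command and one unrelated export.
const myCommand: Command = { kind: "command", handle: (input) => input.toUpperCase() };
const myService = {
  myCommand,
  helper: (n: number) => n + 1, // not a command, so it cannot be overridden
};

const serviceProps: ServiceProps<typeof myService> = {
  commands: {
    myCommand: {
      environment: { TABLE_ARN: "arn:aws:dynamodb:us-east-1:123456789012:table/example" },
    },
    // helper: {}, // rejected by the compiler: 'helper' is not a command
  },
};

console.log(Object.keys(serviceProps.commands || {}));
```

Since only types are imported, none of the service's runtime code ends up in the infrastructure stack.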

In the same way that the Eventual Service construct accepts configuration from dependencies in your infrastructure-as-code stack, it exports the resources it creates as attributes. You can then pass these attributes as inputs to dependent infrastructure components.

What’s missing from Eventual

Eventual is a really good start but it is not yet feature complete. Here’s a list of new features that many of the workloads I build rely on but don’t seem to be easily achievable with Eventual yet:

  1. Support for exposing WebSockets endpoints

  2. Consuming externally defined events like S3 invocations, Amazon-native EventBridge events, and events published to SNS topics

  3. Support for batched processing of streams and queues like Kinesis and DynamoDB streams, and SQS queues

  4. Auth middleware that supports OIDC

  5. Support for VPC configuration and private APIs that are inaccessible over the internet

  6. Long-running activities that last longer than current lambda run-time limits of 15 minutes

  7. The ability to provide base-runtime configurations like docker images or pre-built AMIs with preinstalled software for API handlers, commands, activities, and Event handlers

  8. Support for transparently mounting POSIX-compliant file systems like FSx for Lustre and EFS to enable dealing with data in large files so you can persist files between workflow activities

  9. Web-socket notifications for state changes during workflow executions and published events.

  10. An activity of type notebook, that lets you specify running a specified Jupyter notebook as a step in a workflow.

The good news is that the Eventual team is listening closely.

Concluding thoughts on Eventual

I am a huge fan of Eventual and its approach to building distributed cloud applications for three reasons:

  1. Eventual focuses intently on providing a statically typed experience that integrates with your TypeScript toolchain.

  2. Eventual integrates natively with and delegates the lifecycle of infrastructure to your infrastructure-as-code tool.

  3. Eventual exposes an incredibly powerful set of abstractions that are non-trivial to build yourself.

Eventual’s most impressive abstraction to me represents a product in-and-of itself. Its workflow abstraction allows you to describe durable and distributed workflows using native TypeScript semantics. The experience is not unlike that of Temporal and similar workflow frameworks.

Eventual provides a far superior deployment experience to Temporal, however. It tree-shakes activities before deploying them and the other distributed components necessary to run these workflows on your behalf. What’s more, the lifecycle of these components is delegated to your preferred infrastructure-as-code engine with a single construct.

Powerful primitives and laser focus on developer experience make Eventual a formidable framework capable of expressing both micro-service architectures and event-driven architectures.

Other Emerging Infrastructure-from-code Frameworks

There has been a lot of talk about infrastructure-from-code frameworks, and concluding this article without addressing other contenders in this landscape would not do this topic justice. Several great resources are available about this topic and I recommend consuming them if you are interested in delving further into this space:

  1. Cloud Application Infrastructure from Code by Asher Sterkin16

  2. The Unfulfilled Potential of Serverless by Jeremy Daly, the CEO of Ampt.

  3. The Self-Provisioning Runtime by Swyx (Shawn Wang). *Note that Swyx backs the Ampt project.

  4. The Current State of Infrastructure-from-code by Allen Helton

  5. State of Infrastructure-from-Code 2023 by Ala Shiban

Here are some brief initial thoughts about other infrastructure-from-code tools I have surveyed.

Winglang

Winglang’s thesis is that existing programming languages are not sufficient to describe cloud applications. It introduces a distinction between code that is executed after deployment and code that is executed during deployment. Code that is executed during deployment is called Pre-flight, while code that is executed after deployment as a part of a cloud application’s run-time is called In-flight. In Winglang’s model, control-plane operations are generally preflight, and data-plane operations are generally in-flight.

I would not discount Winglang, as it is backed by CDK veterans who are experts at infrastructure-as-code, but to me, having in-flight code deployed by a construct using an infrastructure-as-code tool, as Eventual does, provides a better separation of concerns and user experience than Winglang’s proposed new-language approach.

A new language and development tooling ecosystem also seems unnecessary to me but I’ll be watching Winglang to see how the ecosystem changes. I might be proven wrong yet.

Ampt

Ampt is the successor to Serverless Cloud, and its developer experience looks very smooth at first glance. I’ve added my name to the waitlist, but have not yet managed to get my hands on the private beta. Jeremy Daly’s vision does appear well thought out and cohesive, and some highly competent distributed systems and dev-tools engineers have thrown their support behind this project.

The recent Sessions with SAM & Friends episode by Eric Johnson featuring Jeremy Daly gave a great demo of the Ampt control plane’s simple console UI. Some highlights were live updating of application configuration that propagates to Ampt environments in near-realtime, and an easy-to-use file-management UI that lets you read and modify assets used by your app.

The icing on top and Ampt’s key differentiator to me is the high-performance local development experience that transparently syncs changes from the local development environment to your connected Ampt backend. Also especially powerful is the ability to drop its libraries into your favorite full-stack application frameworks like Astro and Remix.

See the conversation here for a demo:

With a recently updated documentation repository, Ampt looks primed for a beta release and I’m hoping to get my hands on it soon. There are, however, four attributes that make me more tentative about Ampt than Eventual at the moment:

  1. I have not seen any evidence of existing support for durable and distributed workflows in Ampt. This is a non-trivial integration that sets Eventual apart for me.

  2. Early documentation indicates that Ampt is meant to be deployed separately from your infrastructure-as-code stack. This makes me think that integrating with existing infrastructure or external resources will not be seamless.17

  3. The underlying infrastructure created by an Ampt project remains a mystery to me despite having read its current documentation repo and seeing a demo. This is in opposition to the Eventual team’s easily navigable and well-organized generated AWS infrastructure that is easily grok-able as you plan and execute your infrastructure-as-code scripts. For AWS control freaks like myself, I can see Eventual’s transparency giving it an edge over Ampt.

  4. Eventual’s focus on an end-to-end type-safe tRPC-inspired development experience feels absent in Ampt.

Caveats aside, my initial impressions of Ampt are very positive and I am looking forward to eventually getting my hands on the beta.

Nitric

Nitric provides similar abstractions to Ampt, with more transparency into generated infrastructure but with less integration into application frameworks and slower deployments. Because Nitric is deployed with Pulumi, you can use Pulumi’s tools to visualize your infrastructure.

Because Nitric cannot be deployed from an existing Pulumi infrastructure stack as a construct, integration into the rest of your infrastructure-as-code ecosystem requires more effort than Eventual. This is done via an external Nitric configuration file which is an imperfect developer experience for me.

Unlike Eventual or Ampt, Nitric has been publicly available since early December. You can get your hands on it immediately without joining any programs, and get a feel for this emerging paradigm. As with Eventual and Ampt, I am excited to see how Nitric evolves to find its place in the infrastructure-from-code ecosystem.

Klotho and Encore

Klotho and Encore let you define annotations as comments in your code that follow a defined spec. Engines convert your annotated code to infrastructure and code dependencies that are necessary to generate a distributed cloud application.

I do not like comment-based annotations from an aesthetics standpoint, and their abstractions feel slightly weaker than those of Ampt or Eventual. As such, I have not investigated further.

Shuttle

Shuttle is a Rust-based framework for building cloud applications. It reminds me of Chalice and is very web-app focused. Right now it seems to offer no support for long-running jobs or event-driven systems. I have not tried it out, but it seems to have a far smaller scope than most frameworks that bill themselves as infrastructure-from-code frameworks.

Modal

I really like the ideas underpinning Modal by Erik Bernhardsson that he introduced in December. You develop locally and deploy to Modal’s managed infrastructure. The development experience lets you simply designate sections of code and specify the type of infrastructure you want to run them on using Python decorators. Modal takes care of orchestrating these runtimes18 for you.

I also want to call out Modal’s native support for asynchronous Python APIs, which is welcome in the Python ecosystem where performance and asynchronous programming are often neglected.

Modal represents an impressive feat of engineering, along with a fantastic developer experience. Especially for data-science-related tasks. Modal’s main drawback currently is its inability to deploy to infrastructure building blocks that you know and understand, like AWS compute and storage solutions.

I will continue to watch Modal closely and wait to understand its platform reliability better. I am also curious to see how workloads built on Modal are securely integrated into existing cloud environments.

Concluding Thoughts

I am very optimistic about the trends we are seeing in cloud-development tooling. Emerging frameworks seem to be language-specific, and focused on identifying the right abstractions necessary to build distributed cloud applications.

My expectation is that we will not be able to dispense with infrastructure-as-code in favor of infrastructure-from-code. The two paradigms are going to evolve independently of one another, and infrastructure-as-code tools will deploy infrastructure-from-code frameworks going forward.

Infrastructure-as-code languages will likely move away from DSLs like HCL to statically typed, tried-and-tested programming languages like TypeScript. I also expect these to eventually become lazy and provide good async programming models.

Meanwhile, applications and business logic will live in purpose-built frameworks. They will integrate natively with your infrastructure-as-code engine mirroring the integration models of the Chalice construct for the CDK, SST constructs for deploying modern web frameworks, and Eventual’s native infrastructure-as-code construct deployment model.

For those with a background in building Spring applications in Java, having code that generates a set of static dependencies into a context, and business logic that dynamically consumes these dependencies during application run-time is a familiar concept. We are moving to a world where these static contexts are distributed and built by infrastructure-as-code tools. Our business logic will still be defined simply and concisely, built within infrastructure-from-code frameworks, and integrate seamlessly with this distributed infrastructure context.
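As a rough illustration of that split, the sketch below treats environment variables as the static context handed over by the infrastructure-as-code stack and keeps the business logic ignorant of how the resources were created. The names InfrastructureContext, loadContext, describeWrite, and WORK_QUEUE_URL are hypothetical. APP_TABLE_NAME mirrors the variable from the Chalice example earlier.

```typescript
// Static context: names and addresses produced by the infrastructure-as-code stack.
interface InfrastructureContext {
  tableName: string;
  queueUrl: string;
}

// Build the context once at startup and fail fast if the stack did not provide a value.
function loadContext(env: Record<string, string | undefined>): InfrastructureContext {
  const required = (key: string): string => {
    const value = env[key];
    if (!value) {
      throw new Error("Missing environment variable " + key);
    }
    return value;
  };
  return {
    tableName: required("APP_TABLE_NAME"),
    queueUrl: required("WORK_QUEUE_URL"),
  };
}

// Business logic consumes the context and never creates infrastructure itself.
function describeWrite(context: InfrastructureContext, itemId: string): string {
  return "put " + itemId + " into " + context.tableName;
}

// Made-up values standing in for what a deployed stack would inject.
const exampleContext = loadContext({
  APP_TABLE_NAME: "my-table",
  WORK_QUEUE_URL: "https://sqs.us-east-1.amazonaws.com/123456789012/example",
});
console.log(describeWrite(exampleContext, "item-1"));
```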

With better abstractions on the horizon and a continued focus on refining development-iteration time, the future of cloud development tooling looks bright.


1

I talk about runtimes a lot in this blog post. In this context when I use the term runtime, I am referring to a set of logically isolated compute resources dedicated to a specific task. A runtime might be an EC2 instance, Fargate task, lambda function, modal container, CodeBuild job, or any other location you might want to run your code.

2

See this old GitHub issue for more info.

3

As an aside, 6 months prior to AWS launching Chalice, Rich Jones created Zappa. Zappa is a framework with an incredibly similar API to Chalice and is, in some ways, more configurable. At the time, knowing little about either framework, I elected to use Chalice because of its friendlier documentation.

4

As far as I am aware, Jeremy Daly’s team coined this term to describe the Serverless Cloud framework.

5

Hashicorp Configuration Language or HCL is the domain-specific language used to define terraform configuration.

6

At the time, I strongly believed that infrastructure configuration needed to be declarative. After all, rerunning an infrastructure deployment stack must ensure existing infrastructure conforms with the desired configuration. It should not create an entirely new set of infrastructure.

7

It is probably no coincidence that Dax and Sam work on some of my favorite dev tools SST and Eventual. They are deeply focused on developer experience and seem hyper-aware of many pain points that I experience.

8

Anecdotally, a lot of the time my team and I spend on existing cloud development and SRE team setups is targeted at reducing deployment blast radiuses and simplifying deployment processes.

9

If you are unfamiliar with how AWS architectural guidance divides services between data-plane and control-plane operations, I highly recommend the linked section in the AWS whitepaper, AWS Fault Isolation Boundaries.

10

Infrastructure is often mutable and stateful. One must understand potential blast radiuses during deployment times, and infrastructure-as-code tools let you create plans and map out deployment blast radiuses ahead of time. This lets cloud developers enact safe deployment processes and practices.

11

Some mostly serverless applications might also require server-dependent components like an ElasticSearch cluster or NAT gateways and network firewalls to keep traffic from traversing the internet. These resources certainly don’t belong in a serverless framework.

12

Zappa has no built-in answer to its infrastructure-as-code gap that I am aware of. Any infrastructure that the Zappa app depends on must be created outside of the Zappa app.

13

See footnote 1 for what I mean when I say runtime.

14

Sam Goodwin has an open pull request for Pulumi support. This is visible on the Github repository that is currently only open to members of the beta program.

15

See footnote 1 for what I mean when I say runtime.

16

Just a short note to say that I find Asher Sterkin’s writing about infrastructure-from-code to be especially comprehensive.

17

Another potentially large issue with this decision is that Ampt documentation indicates that it manages the control planes for persistent resources with its storage and data APIs. I think that these components should not be created or managed by the application source code as they are stateful and the lifecycle should be delegated to an infrastructure-as-code tool.

18

See footnote 1 for what I mean when I say runtime.

https://yehudacohen.substack.com/p/exploring-the-emerging-cloud-development
Extensions
Organizing and Integrating Distributed Processes with AWS
Part 2 of 2: A comprehensive guide to communication in distributed systems with AWS

Previously, we identified and explored a set of components that enable us to communicate from a source to a destination using push-based and pull-based communication models. We explored application servers, load balancers, message queues, and data streams.

This second installment is designed to be consumable independently of the first and focuses on identifying a second set of components including message topics, event buses, and workflow engines. It explores how these components can help us operate and scale increasingly complex integrations. We examine how these building blocks can be used to achieve reliable, observable, flexible, and performant distributed processes.

Hopefully all without boring you senseless.


We will examine three types of components that are more complex than communicating from a source to a destination:

  1. Message fan-out

  2. Rule-based conditional routing of messages across a set of consumers

  3. Orchestrating and choreographing complex workflows

    1. Event-driven choreography

    2. Engine-driven orchestration

Note that these topics are exclusively related to asynchronous, pull-based processes. We will therefore not cover service meshes or other components that are designed solely with push-based distributed service integrations in mind.

two people drawing on whiteboard
Photo by Kaleidico on Unsplash

Tackling messaging complexity in human systems

Sending a message from a source to a destination can be thought of as passing a note in middle school. (I haven’t been in middle school for a while, so things may have changed. Either way, dinosaur that I am, I will proceed with this analogy.)

You have a note that you want to send to your buddy. If your buddy’s desk is next to yours, you just pass it along. If you’ve got good aim and you know where your buddy is sitting, you lob it over when the teacher is not looking. This method is both less reliable and less secure. If your note isn’t very private or you trust your neighbor not to look, you might write your buddy’s name on the cover and pass it off to your neighbor for delivery. This method also has reliability and security implications.

But let’s say you wanted to invite everyone in the class to your birthday party on Sunday. You might write a note and have everyone toss it around the class. Of course, your buddy would probably get mad at you if the note makes its way to everyone in the class except them.

A better approach might be to have a class bulletin board. You put your invitation up on the bulletin board, and your buddies check the bulletin board to see if any new events are posted for the class. This way, if your buddy doesn’t see the birthday invitation, they have only themselves to blame.

This approach fails in several scenarios. What if you want your buddies to attend your party, but your buddies are not very diligent about checking the bulletin board? What if you want to selectively invite a subset of the class?

And thus emerged the practice of personalized party invitations.

You make a list of everyone you want to invite to your party. You create a separate invitation for each person on the list. You deliver these invitations to everyone you want to invite. You check off when you’ve delivered the invitation and you check off RSVPs you receive.

The notepad approach is manageable for a small party with a short guest list. With increasing party size, however, comes the need to use more complex tools, and your notepad quickly evolves along with the scale of your problems:

Your invitation list becomes longer. You start needing to keep track of gifts you receive. You need to collect RSVP counts for catering reasons. When delivering all the invitations in person is no longer feasible, you need to record and keep track of mail addresses. You need to send out reminders when you haven’t heard back.

For my bar-mitzvah and my sisters’ bat-mitzvahs, my father kept spreadsheets. My wife and I used the same spreadsheets for our wedding.

Message complexity in human systems and message complexity in distributed systems are very similar.

Tackling messaging complexity in distributed processes

Message Fan-out

Akin to the scenario of wanting to broadcast an invitation to everyone in your class, systems frequently need to broadcast information to a set of consumers. As such, engineers have built messaging primitives to enable pub/sub or publish/subscribe capabilities. In AWS, the simplest publish primitive is provided by Simple Notification Service or SNS. SNS enables you to create message topics to which you can publish messages.

Each message topic represents a bulletin board-like construct to which publishers can deliver messages. Messages can be posted to this bulletin board, and different consumers can subscribe to these messages. Subscribers can be Lambda functions, SQS queues, email addresses, or text messages. For an exhaustive list of subscribers see AWS documentation.

A single SNS topic allows up to 12.5 million subscriptions. As such, if you need to take a single message and fan it out to a lot of consumers, SNS is a pretty good bet. RabbitMQ and ActiveMQ allow you to define limitless subscriptions per topic, but since you’re in control of the infrastructure, you bear the responsibility of scaling it to meet the demand.

AWS does provide a managed Amazon MQ solution that lets you operate RabbitMQ and ActiveMQ clusters, but you need to interpret metrics and determine scaling policies yourself. This will usually add more operational complexity to your solutions than architecting with SNS limitations in mind.

Rule-based conditional routing of messages across a set of consumers

Frequently, when you fan out messages in distributed systems, you might not want to fan all messages out equally.

Assume, for instance, that you have a topic called ProductPurchaseNotifications in a web store. Now, assume you are selling both physical and digital products from the same storefront. You might have an upstream system responsible for ordering new inventory which is very interested in physical product purchases but is not interested in digital products at all. You might also have a royalty platform that is only interested in digital products. You’ll also probably have an email system that is indiscriminate and wishes to receive all product purchase notifications.

It is, of course, possible to deliver all product purchase notifications to all consumers and let them discard what they are not interested in, but this isn’t a very efficient solution. It results in a more congested network and resource waste.

Many pub/sub systems have therefore implemented filtering capabilities in the subscription layer. In fact, as one of the pre:Invent announcements in November 2022, Amazon added support for payload-based message filtering in SNS. Before this, more rudimentary metadata-based message filtering was already available, but as AWS Serverless Hero Yan Cui noted at the time on Twitter, this is a big deal.

When Yan mentions EventBridge, he is referring to Amazon’s other service capable of enabling implementation of pub / sub patterns.
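
To make the ProductPurchaseNotifications example above a little more concrete, here is a rough sketch of what per-subscriber filter policies might look like. The `productType` field, its values, and the subscriber names are assumptions made for illustration. The `matches` helper is a deliberately simplified stand-in that only handles exact string matching; it is not how SNS itself evaluates policies.

```python
# Hypothetical filter policies for the three subscribers. The "productType"
# key and its values are assumptions made for illustration.
INVENTORY_POLICY = {"productType": ["physical"]}
ROYALTY_POLICY = {"productType": ["digital"]}
EMAIL_POLICY = {}  # in this sketch, an empty policy accepts every message


def matches(policy, message):
    """Simplified matcher: for every key in the policy, the message's value
    must be one of the allowed values. Exact matching only."""
    return all(message.get(key) in allowed for key, allowed in policy.items())


def fan_out(message, subscriptions):
    """Return the names of the subscribers that would receive the message."""
    return [name for name, policy in subscriptions.items() if matches(policy, message)]


SUBSCRIPTIONS = {
    "inventory": INVENTORY_POLICY,
    "royalties": ROYALTY_POLICY,
    "email": EMAIL_POLICY,
}

physical_purchase = {"productId": "sku-123", "productType": "physical", "quantity": 2}
digital_purchase = {"productId": "ebook-9", "productType": "digital", "quantity": 1}

print(fan_out(physical_purchase, SUBSCRIPTIONS))  # ['inventory', 'email']
print(fan_out(digital_purchase, SUBSCRIPTIONS))   # ['royalties', 'email']
```

In SNS the policy is attached to each subscription rather than evaluated in your own code, so a message that does not match is never delivered to that subscriber in the first place.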

EventBridge vs SNS

Unfortunately, in the world of AWS, there is no neat separation of responsibilities between services. That is to say that EventBridge and SNS definitely have some level of overlap.

Like SNS, EventBridge enables you to create an endpoint to which producers can publish messages. EventBridge calls this message topic-like construct an Event Bus. These constructs are similar but not identical.

SNS Merits

We mentioned earlier that you can create 12.5 million subscriptions for each SNS topic. In EventBridge, instead of creating subscriptions you create rules. Each rule can deliver to up to 5 destinations, and you can have up to 300 rules associated with an EventBus. Even in the case that you maximize your rule configurations, an EventBridge bus will never be able to achieve 1/1000th of the fan-out of an SNS topic.
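
The arithmetic behind that comparison is quick to check. This sketch uses only the limits quoted in this post; AWS quotas change over time, so treat the constants as a snapshot rather than current values.

```python
# Limits as quoted in this post; AWS quotas can change over time.
SNS_SUBSCRIPTIONS_PER_TOPIC = 12_500_000
EVENTBRIDGE_RULES_PER_BUS = 300
EVENTBRIDGE_TARGETS_PER_RULE = 5

# Best case for EventBridge: every rule uses all of its targets.
max_eventbridge_targets = EVENTBRIDGE_RULES_PER_BUS * EVENTBRIDGE_TARGETS_PER_RULE
ratio = SNS_SUBSCRIPTIONS_PER_TOPIC / max_eventbridge_targets

print(max_eventbridge_targets)  # 1500
print(round(ratio))             # 8333
```

So a fully loaded bus reaches 1,500 destinations, which is several thousand times fewer than a single SNS topic can reach.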

Luc van Donkersgoed also measured SNS as four times as fast as EventBridge when triggering a Lambda function.

Lastly, as Yan mentions further in the same Twitter thread, unlike EventBridge, SNS also has the ability to guarantee ordered delivery of events.

EventBridge Merits

But SNS is not superior to EventBridge in all cases. In fact, the increased latency of EventBridge is likely due to some of the capabilities that EventBridge offers that are not offered by SNS.

Yan provides a couple of examples in another tweet in the above thread:

You see, in addition to enabling pub/sub architectures, EventBridge aims to provide all the necessary functionality for choreographing event driven processes. Later on in this blog post we will dive deeply into what choreographing event driven processes entails, but for now let’s just understand how EventBridge supports a few features that help with this.

As Yan notes, in distributed workflows it might be necessary to archive and replay events in the case of system failure. EventBridge therefore offers these as options even though SNS does not. It also lets you target a far more diverse set of upstream systems than SNS does.

I do want to mention one more important benefit of EventBridge not mentioned by Yan. If your upstream inventory system has an API requiring a productId and quantity as an input and your email system also requires the purchaser userId, EventBridge lets you use the same source message to invoke both upstream services.

When defining an EventBridge rule, engineers can use the rule configuration to map and transform the input message into the exact format expected by each specific consumer. This property enables consumers to expose simple, minimal endpoints that need not be aware of producer message formats.

While SNS message producers must publish messages in the same format that consumers expect to receive those messages, EventBridge’s input transformation enables integration to be handled without code changes in either the consumer or producer. (This assumes, of course, that the producer provides all of the data required by the consumer.)
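
As a rough illustration of the idea, here is a small sketch in which one source event is reshaped into two different consumer payloads. The event shape, field names, and the `transform` helper are all assumptions; in EventBridge the mapping lives in the rule's target configuration and is applied by the service, not by code like this.

```python
def resolve_path(event, path):
    """Resolve a simple path such as "$.detail.productId" against a dict.
    Only dotted key access is supported in this sketch."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    value = event
    for key in path[2:].split("."):
        value = value[key]
    return value


def transform(event, input_paths):
    """Build the payload a consumer expects from named paths into the event."""
    return {name: resolve_path(event, path) for name, path in input_paths.items()}


# A hypothetical purchase event; the field names are assumptions.
purchase_event = {
    "source": "webstore",
    "detail": {"productId": "sku-123", "quantity": 2, "userId": "user-42"},
}

# Each consumer declares only the fields it needs.
INVENTORY_INPUT = {"productId": "$.detail.productId", "quantity": "$.detail.quantity"}
EMAIL_INPUT = {
    "productId": "$.detail.productId",
    "quantity": "$.detail.quantity",
    "userId": "$.detail.userId",
}

print(transform(purchase_event, INVENTORY_INPUT))
print(transform(purchase_event, EMAIL_INPUT))
```

Neither the producer nor either consumer changes here. Only the per-consumer mapping differs, which is the property the paragraph above describes.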

EventBridge + SNS

In the rare cases where you require the input transformation of EventBridge and the fan-out of SNS, the two can be used in conjunction. You can send a message to an EventBridge bus, and use a rule to send it to an SNS topic.

Unfortunately, this does not help with latency or delivery order limitations of EventBridge, but it does help alleviate the fan-out limitations.

Other Pub/Sub Services

Outside the scope of this blog post are two other managed services capable of enabling pub/sub workloads on AWS that I want to touch on quickly.

AWS IoT

AWS IoT Core provides MQTT topics. While these topics are designed for IoT devices to send data to AWS and then process this data with a rules engine, it is possible to (ab)use these topics for general service communication and pub/sub.

AWS IoT Core Topic Rules, like EventBridge rules, can target a wide variety of destinations. They also support input transformations using a SQL-like syntax.

I have not investigated the drawbacks to (ab)using AWS IoT as an EventBridge alternative outside of the IoT space, but I intend to do this at some point.

AWS MSK

In MSK, AWS provides a managed solution for Apache Kafka. Kafka is capable of pub/sub but is mostly intended as a data streaming platform, and it is probably overkill if all you are looking for is a pub/sub solution. We will therefore leave it out of the scope of this discussion.

Orchestrating and choreographing complex workflows

At this point, we have identified some powerful components AWS provides us to help manage service integrations in distributed processes.

Event Driven Choreography

Using EventBridge and SNS we can choreograph incredibly complex processes. Process choreography is like dance choreography. Like a dancer in a dance, every service in a process waits for its cue before performing its role and then exiting the stage. Each service only needs to understand the events that it reacts to, and no service need know about any other services in the workflow. Services are therefore said to be loosely coupled to each other.

We achieve this loose coupling by having each service publish important state changes it performs to a central event bus. Each service need, therefore, only know about the state changes it performs. It need not know about any consuming service. When consuming events, integrators can look to the central event bus to define inputs for event consumers and need not be aware of event producers.
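
A toy, in-memory sketch of this pattern might look like the following. The `EventBus` class, the event names, and the handlers are all hypothetical, and a real bus such as EventBridge is asynchronous and durable in ways this is not. The point is only that each handler knows about event types and never about another service.

```python
from collections import defaultdict


class EventBus:
    """A toy in-memory event bus, standing in for a central bus."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.log = []  # every event type published, in order

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, detail):
        self.log.append(event_type)
        # Iterate over a copy so handlers may subscribe while we dispatch.
        for handler in list(self._handlers[event_type]):
            handler(detail)


bus = EventBus()


def reserve_inventory(detail):
    # Reacts to a purchase, then announces its own state change.
    bus.publish("InventoryReserved", {"productId": detail["productId"]})


def send_receipt(detail):
    # Also reacts to a purchase, unaware that reserve_inventory exists.
    bus.publish("ReceiptSent", {"userId": detail["userId"]})


bus.subscribe("ProductPurchased", reserve_inventory)
bus.subscribe("ProductPurchased", send_receipt)

bus.publish("ProductPurchased", {"productId": "sku-123", "userId": "user-42"})
print(bus.log)  # ['ProductPurchased', 'InventoryReserved', 'ReceiptSent']
```

Adding a third consumer takes one more `subscribe` call and touches no existing code, which is both the appeal of this style and the reason it can grow unnoticed.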

Dangers of Event Driven Choreography

Without very careful planning, however, choreographed processes can pose difficult operational challenges, especially as the dependencies in the processes expand in number and complexity.

Here are some examples of the type of operational challenges I am referring to:

  1. You notice some incorrect data in the output of your system. Your system consists of lots of producers and lots of consumers. How do you work out the source of the data issue?

  2. A process which was working before is no longer working. How do you trace which services have failed?

  3. How do you discover that a process that was working is no longer working? Without the right monitors in place and without a central actor making sure that each step in the process is invoked and succeeds, you might only discover that your system is broken when it is too late to recover gracefully.

  4. A producer wishes to deprecate a message. How do you ensure that the event is no longer in use?

  5. How do you keep track of all the different kinds of messages that are available to consume? How do you keep track of the structure of these new messages and enable integrators to effectively integrate?

To help make sense of these choreographed and dynamic processes, distributed tracing solutions like Zipkin, Jaeger, and AWS X-Ray have emerged. EventBridge integrates with AWS X-Ray and X-Ray dynamically provides visualizations that stitch together steps in the process so that it is easier to visualize these processes.

Distributed tracing solutions alongside rigorous monitoring and alerting implemented throughout event consumers and producers can operate together to form a rigorous observability program and help address the first three challenges mentioned.

With regard to the remaining two challenges, EventBridge offers a schema registry with a schema discovery solution to enable the dynamic cataloging of events that are present in your system.

To me, aside from the challenges involved in implementing a comprehensive observability program, there is another great hazard present when implementing choreographed processes. Perhaps the greatest danger of choreographed processes is that they tend to become very complex very quickly.

Because of the prospect that consumers can start consuming events with very little friction, the overall distributed process grows organically. This property is powerful in that it enables choreographed processes to be extended easily. It is also dangerous in that organic growth of processes is chaotic. Without engineers constantly planning and centrally managing processes and ensuring processes remain as simple as possible, entropy will prevail.

I cannot profess to be a follower of Tony Robbins, nor can I vouch for the efficacy of his advice, but if he has ever said something true it is these six words:

The more complex a system is, the harder it is to secure, scale, extend safely, or fix if it breaks.

Consider the event planner who publishes invitations to a set of invitees and needs to collect RSVPs.

If 50% of the invitation list doesn’t RSVP in time, when the time comes to submit catering numbers the planner risks over-catering or under-catering. This is a consequence of not planning adequately up-front. If this concern was top of mind at the outset, the event planner could have sent RSVP reminders ahead of time, and reduced the catering uncertainty.

If the caterer asks for special dietary requirements prior to the party, the event planner might need to scramble to collect this information. Had this been top of mind at the outset, the event planner could have collected this information on the RSVP card.

Compared with many distributed processes, party planning is simple, and yet even in this process it quickly becomes apparent that without adequate planning, surprises often make event planners’ lives more difficult.

I am not claiming that it is impossible to build a reliable, minimalist, and simple choreographed large distributed system. I am saying that I have never seen one.

Engine Driven Orchestration

Workflow engine orchestration is an alternative to event driven choreography that helps alleviate some of the challenges posed by choreographed systems.

Workflow engines operate by requiring engineers to build their processes explicitly in a central system. This system then becomes responsible for invoking each step of the distributed process with the appropriate input at the appropriate time and collecting any output for use in subsequent steps.

The workflow engine resembles a conductor of an orchestra who stands in front of the musicians providing input as to when to and how to perform their roles. In the end, the goal is to produce wonderful music. The workflow engine tells a set of services when to and how to execute the different steps that make up the process, with the goal of ensuring the process completes successfully and completely.

Forcing engineers to explicitly map out the process up front has two main benefits:

  1. It forces engineers to pay attention to the entire process from beginning to end and gives them an opportunity to consider edge cases at the process level.

  2. It gives engineers the opportunity to simplify complexity as it is introduced into the system.

Having a central engine conducting the process and saving the inputs and outputs of every step has other great benefits:

  1. It becomes easy to understand the status of every process execution in real time.

  2. Bugs become easy to find by inspecting the centrally defined process and the input and output of each step.

  3. In case of a service outage or an unexpected step failure, the engine can raise an alarm making it far easier to implement an observability program.
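
The benefits listed above can be sketched with a tiny engine that runs steps in order and records what went in and out of each one. This is a minimal illustration under stated assumptions, not how Step Functions or any real workflow engine is implemented; `run_workflow` and the step functions are hypothetical names.

```python
def run_workflow(steps, initial_input):
    """Run steps in order, recording each step's input, output, and status.
    Stops at the first failure so the caller can raise an alarm."""
    history = []
    data = initial_input
    for name, step in steps:
        record = {"step": name, "input": data}
        try:
            data = step(data)
            record.update(status="succeeded", output=data)
        except Exception as error:
            record.update(status="failed", error=str(error))
            history.append(record)
            return "FAILED", history
        history.append(record)
    return "SUCCEEDED", history


# Hypothetical steps for a purchase process.
def charge_card(order):
    return {**order, "charged": True}


def reserve_inventory(order):
    if order["quantity"] > 5:
        raise ValueError("insufficient stock")
    return {**order, "reserved": True}


STEPS = [("ChargeCard", charge_card), ("ReserveInventory", reserve_inventory)]

status, history = run_workflow(STEPS, {"productId": "sku-123", "quantity": 2})
print(status, [record["step"] for record in history])

status, history = run_workflow(STEPS, {"productId": "sku-123", "quantity": 9})
print(status, history[-1]["step"], history[-1]["error"])
```

Because the engine owns the history, the question "which step failed, and with what input?" has one place to look, instead of being scattered across every service's logs.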

For some additional reading material about monitoring and managing distributed workflows, I recommend reading the work of Bernd Ruecker, a cofounder and the CTO at Camunda (a very popular workflow platform), starting with this excellent essay that first started me down this rabbit hole several years ago.

When you wish to build distributed processes on AWS, I recommend using AWS Step Functions. Step Functions lets you define workflows in a visual editor or by using Amazon States Language, a DSL that enables you to define expressive workflows that build state machines.

An example step functions state machine

Step Functions provides the capability to invoke AWS services as steps, make decisions, perform steps in parallel, iterate over a list and perform a step for every item, and perform data transformations between steps using intrinsic functions. The Step Functions console also provides a great debugging experience wherein you can view execution graphs and step inputs and outputs, and click through to CloudWatch logs for Lambda functions.
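
For a feel of what a definition looks like, here is a minimal state machine written in Amazon States Language and held in a Python dict. The state names, the `productType` field, and the Lambda ARNs are placeholders I made up for illustration, and the `referenced_states` helper is just a cheap local sanity check, not part of any AWS tooling.

```python
import json

# A minimal state machine in Amazon States Language. The state names and
# the Lambda ARNs are placeholders, not real resources.
STATE_MACHINE = {
    "Comment": "Route a purchase by product type",
    "StartAt": "CheckProductType",
    "States": {
        "CheckProductType": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.productType",
                    "StringEquals": "physical",
                    "Next": "OrderInventory",
                }
            ],
            "Default": "RecordRoyalty",
        },
        "OrderInventory": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:order-inventory",
            "End": True,
        },
        "RecordRoyalty": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:record-royalty",
            "End": True,
        },
    },
}


def referenced_states(definition):
    """Collect every state name the definition can transition to."""
    names = {definition["StartAt"]}
    for state in definition["States"].values():
        if "Next" in state:
            names.add(state["Next"])
        if "Default" in state:
            names.add(state["Default"])
        for choice in state.get("Choices", []):
            names.add(choice["Next"])
    return names


# Sanity check: no transition should point at a state that does not exist.
missing = referenced_states(STATE_MACHINE) - set(STATE_MACHINE["States"])
print(missing)  # set()

# The JSON document is what you would hand to Step Functions.
definition_json = json.dumps(STATE_MACHINE, indent=2)
```

The whole process is visible in one document, which is the main point of the orchestration approach: the branching lives in the definition rather than being implied by who subscribes to what.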

If you prefer to define your workflows using a programming language, there is an awesome CDK construct you can use to build Step Functions. If you want to write workflows using variables instead of JSONPath to reference data that is passed between steps, explore Temporal or the older SWF w/ Flow framework.

Or wait a few months. There’s a very exciting project I’ve been watching which provides an exceptional developer experience for building workflows. More on that soon.

Conclusion

This post should provide you with a starting point and enable you to start building effective distributed processes on AWS. Using building blocks like SNS topics, EventBridge buses, and workflow engines, it becomes far easier to manage and operate reliable distributed processes.

I’ll leave you with one last word regarding the orchestration vs choreography debate. My perspective is that both orchestration and choreography have their place. When defining processes that must be reliable and are mission critical I tend to prefer orchestration. When integrating with other systems, I tend to prefer event driven choreography. Our team at Foresight loves both Step Functions and EventBridge, and we use the two together to provide flexibility when we need flexibility and form and structure when we want to mandate form and structure.

Finally, AWS announced EventBridge Pipes at re:Invent this year. You will notice we didn’t cover them in this blog post. This is because they are used to integrate a single source with a single destination and belong in part one of this guide, which I hope to update soon.

Thank you for reading, and please let me know your thoughts as always!

Thanks for reading Fun With The Cloud! Subscribe for free to receive new posts and support my work.

https://yehudacohen.substack.com/p/organizing-and-integrating-distributed
Extensions
Re: re:Invent 2022
Reflections on my first AWS re:Invent, both experiential and technical
Show full content

Every year when re:Invent rolls around, I morph into an avid and eager armchair AWS fan. This year, I got up off my butt and flew with my team to the heart of sin-city, leaving my superwoman-wife behind to take care of our three very little children. With my wife’s ardent support, a pit of guilt in my stomach, and the frustration of two cancelled flights, I finally boarded an airplane to Las Vegas at Newark airport early Monday morning.

Absent any flamboyant gamblers, rowdy bachelors and bachelorettes, swooning honeymooners (or elopers), the cabin was absolutely packed with introverted looking, soft-spoken engineers. Oh, and lots of extraverted-looking, slightly over-friendly sales-people. These sales-people were immediately identifiable by their snappy suits or loud cheery phone conversations. I sat sandwiched between one engineer and one salesperson.

Landing in Vegas several hours later revealed that it was not just my flight that was overrun by nerds. The airport, the MGM Grand, and, as I later came to learn, the entire Las Vegas Strip were overflowing with re:Invent conference organizers, staff, and attendees.

I’m not going to bore you with the details of my frustrations with hotel reservations or flight cancellations, but suffice it to say that I eventually got into my 17th floor hotel room, and showered a much needed shower, even if it did fall as a pitiful trickle without any substantial pressure.

I donned my AWS Community Builder T-shirt for the first time ever, and headed toward the MGM Grand conference center to pick up my re:Invent badge. And to see how many people would see the AWS logo and come to me lost in search of directions. The walk from my hotel room to the conference center was almost ten minutes. All in the MGM Grand. I was warned in advance there would be a lot of walking. But ten minutes to the conference center wasn’t very bad.

Most people had already collected their badges, and the line was short and moved quickly. I collected my badge and headed to the dining room for lunch.

What can I say about the Re:Invent dining room at the MGM Grand? It is immense in its size. White table cloths and wait-staff circle maybe a hundred or more circular tables. It’s not terribly fancy, but it is terribly big. I made a beeline for the special meals section and they scanned my badge and brought out a kosher meal a few minutes later. The food was standard microwaved kosher fare, but better than most airplane food I’ve eaten. I was impressed at the extent the AWS team went to accommodate us.

I met up with a couple of awesome engineers who are building something really awesome that I’m hoping to get my hands on soon.

We headed to the Venetian together on one of the re:Invent shuttles. I didn’t know this beforehand, but the Venetian is where the party is at. And by the party I mean the expo, the swag, the big chess board, and the Datadog-branded slide that slides down to the entrance between the escalators. It’s also where most of the people are.

There was a lot of walking over the four days of re:Invent. Walking to sessions, walking to lunch, walking to meetings, walking to meet people, and walking to events and parties. By the end my feet were sore.

At the end of the conference, the highlights were:

  1. Getting to meet and network with a lot of absolutely awesome people, some for the first time, and others in person for the first time. Including the people behind Functionless, Pulumi, Winglang, and some of my favorite AWS services.

  2. Getting to see some of our team in person.

  3. Three of the sessions I attended were awesome.

    1. Deliver Great Experiences with QUIC included a history of the design decisions that led to HTTP/3 by Jim Roskind:

    2. How Stable Diffusion was built as a joint presentation between AWS and the Stability team:

    3. Chalk talk NET 304 with some really interesting insights into the design of AWS networking. For a recorded session by the same presenters see:

  4. Some super cool stuff at the expo, and discovering several interesting tools and services that I think will excite our customers.

It wouldn’t be a complete re:Invent if we didn’t talk about announcements from AWS this year. Amazon did okay. Highlights were:

  1. A Serverless OpenSearch offering which was #1 on my Re:Invent Wishlist. It comes with the caveat that there is no scale to zero for now. Without obscene pricing, it seems like this is still going to be better than a cluster I have to scale myself.

  2. A huge number of improvements to EFS that I didn’t ask for, including elastic throughput, improved latency, and 1-day IA lifecycle policies.

  3. Lambda SnapStart that will become more and more useful as it supports languages other than Java.

  4. Amazon Verified Permissions

  5. ECS Service Connect

  6. Amazon VPC Lattice, that will become super awesome when it starts supporting TCP / UDP services

  7. Possibly Amazon DataZone. We’ll see after I have a chance to play with it.

  8. Step Functions distributed map state.

  9. EventBridge Pipes looks pretty awesome. I haven’t gotten to play with it yet sadly.

  10. Cross-account, cross-region CloudWatch, which might not yet be a good enough reason to not buy Datadog for many companies. AWS definitely seems to be taking steps to reclaim this revenue.

It was an incredibly different experience being at re:Invent in person. Watching from an armchair, re:Invent is all about the announcements.

Being there in person, the keynotes and announcements felt like a very small part of the experience. Being at re:Invent is all about meeting people who are interested in the cloud, getting a chance to explore what other people are doing on AWS, and a lot of walking. I came away from the experience with some new and renewed relationships, a whole host of swag socks, and a pair of very sore legs.

https://yehudacohen.substack.com/p/re-reinvent-2022
Extensions
The top 6 items on my Re:Invent 2022 Wishlist
What would make AWS even better: The Top 6

This is the final post in my count-down of items on my 2022 re:Invent wishlist. You can find the previous post here.

It is November 10th tonight and Re:Invent is arriving pretty darn soon. Because of the time-crunch, and frankly because I have more interesting things to write about than missing AWS features, I am finishing my Re:Invent wishlist here and now. In one blockbuster BuzzFeed style list.

So without further ado, here are the top 6 items I want most this Re:Invent:

#6 - Cognito as a SAML Identity Provider

In a typical authentication workflow with AWS Cognito, an application manages its users in Cognito user pools. If an organization wishes to allow all of its users to access the application, they establish a SAML or OIDC connection with Cognito whereby users can log in with their organization’s identity provider.

Cognito does not currently support a reverse workflow. Let’s say I have built a website with Cognito and have a bunch of users loaded up into my user pools. Now, I want to be able to enable a set of these users to access my Tableau analytics dashboard using their regular application login credentials.

Would it not be awesome if you could establish a SAML trust between Tableau and Cognito?

Imagine if a user who attempts to access Tableau could automatically be redirected to their regular application login screen. Imagine if your application’s user pool users could use their application credentials to gain access to your Jira boards and see your internal roadmaps.

This re:Invent, AWS, please would you transform Cognito user pools into powerful directories that are capable of serving as an identity provider to any number of third party apps that support SAML authentication and application assignments.

#5 - Scale to Zero for “Serverless” Resources

Until AWS released Serverless Neptune a couple of weeks ago, I was pretty bullish about this wishlist item.

Only a couple of years ago, I was giddy with excitement at the announcement of Aurora Serverless. Today, I am a customer of Aurora Serverless, but much of my initial excitement has waned.

Perhaps I had unrealistic expectations. After all, I have been using Lambda since 2016, DynamoDB since 2015, and S3 since college in 2014. DynamoDB, S3, and Lambda, effectively offered me the ability to experiment as much as I wanted for free without the fear of a suddenly large AWS bill I couldn’t afford. I’d never built a VPC back then.

In my head Aurora Serverless would be just like DynamoDB, except I could model data relationally and do joins too, and connect it to ActiveRecord or Hibernate or SQLAlchemy. Plus they’d announced this thing called the Data API, where I could perform SQL queries over HTTP and didn’t need to worry about connection pooling.

Since then, I’ve used Aurora Serverless in my workloads, but don’t use it to experiment much because it hasn’t lived up to its promise. Even while idle, it charges you money, and the VPC dependencies make it feel a lot more cumbersome than DynamoDB.

Skip a couple of years, and while Aurora Serverless has improved tremendously across a set of dimensions, it feels even less serverless. Aurora Serverless V2 isn’t highly available out of the gate. You have to deploy a cluster and then serverless instances in multiple availability zones in your VPC. It kind of just feels like an autoscaling RDS cluster without the serverless promises of scaling to zero. The Data API has also gone by the wayside.

AWS Neptune’s serverless solution was recently released with a minimum of 2.5 NCUs. Once more, autoscaling, but over $289 / month to run. Its operational model feels very similar to Aurora Serverless V2.

Although AWS calls these two services serverless, they only seem to meet two of the four serverless criteria AWS describes in its serverless FAQs:

Q: What makes a service or application serverless?

We founded the concept of serverless on the following tenets: no server management, pay-for-value services, continuous scaling, and built-in fault tolerance. When adopting a serverless service or building a serverless architecture, these ideals are fundamental to serverless strategy.

With the new Aurora Serverless and Neptune Serverless offerings, the tenets of no server management and continuous scaling hold true. The tenets of pay-for-value and built-in fault tolerance do not apply the same way.

I would love to see these two “Serverless” offerings scale down to zero and handle configuration of high availability out of the box.

#4 - Availability Zone Equality

One of the most frustrating things to experience when building on AWS is what I call the availability zone equality problem. It goes like this:

  1. You plan your network architecture and create a VPC and subnets, usually in randomly selected availability zones.

  2. Some time later, you elect to deploy AWS Workspaces in your VPC and discover that your chosen subnets invariably do not support AWS Workspaces because they’re in the wrong AZ. You see, it turns out that AWS Workspaces only supports a subset of availability zones in each region.

Why is Workspaces only available in a subset of AZs? Who can say.

AWS Workspaces, however, isn’t the only culprit of the availability zone equality problem. Amazon Nimble Studio also documents this behavior.

Worse still, some limitations are not documented. Byron Wolfman of Hashicorp writes about the absence of Nitro EC2 instances in availability zone use1-az3. EKS sometimes errors out with:

Cannot create cluster 'example-cluster' because region-1d, the targeted Availability Zone, does not currently have sufficient capacity to support the cluster. Retry and choose from these Availability Zones: region-1a, region-1b, region-1c

I’ve seen OpenSearch and Amazon RDS clusters sometimes fail with similar errors too.

This re:Invent, if AWS shows progress toward making all services and instance types available across all availability zones in supported regions, I will be a happy man.

#3 - AWS Multi-account Console

While administering an AWS environment securely is made a lot easier with multiple accounts, multi-account strategies are not without pain-points. One of the pain-points of building multi-account workloads is that it’s impossible to be logged into multiple AWS accounts at the same time.

AWS’s current solution to this dilemma is AWS SSO which lets you switch between AWS accounts or roles in your organization easily. While an improvement over the old role switching experience, I still find this a frustrating experience.

A common use case is trying to debug an ECS task that’s using FireLens to ship all logs to CloudWatch in a log archive AWS account. I have to either open separate Chrome profiles or keep switching between accounts whenever I want to read the logs and debug the ECS task configuration.

Another use case I often encounter is trying to debug network configuration issues that span across multiple accounts.

Item #3 on my re:Invent wishlist for this year is a single console experience. This experience should allow me to view resource configurations across accounts and regions in a centralized place, and search and filter these resources to help identify misconfigurations more quickly.

#2 - ECS Stateful Sets

There is currently no native integration that enables you to mount EBS volumes to ECS tasks to enable persistent volumes to support an application. For more information on why this would be awesome, see this GitHub issue, which has been open since January of 2019.

Stateful services with low latency requirements are one of the few use-cases where I recommend that customers use EKS instead of ECS.

#1 - Serverless Open Search

I am a huge fan of OpenSearch. It is a real pain-point, however, to operate an OpenSearch cluster. You run out of shards on your nodes, or your performance starts suffering, or you run out of hard-drive space. Or your data isn’t refreshing quickly enough. Or your AWS bill skyrockets. Or your single node falls over because you didn’t configure things according to best practices.

It requires time and domain knowledge to understand how to size clusters and allocate resources for OpenSearch. A serverless solution you can use to search and index documents, to help remove some of the domain knowledge necessary to operate the infrastructure reliably and cost-effectively would be absolutely awesome.

And please, AWS, no minimum of 2.5 Open Search Capacity Units for billing purposes.

One last thing…

I’m going to Re:Invent for the first time this year. If you’ll be there too and would like to connect, my Twitter DMs are open. Alternatively feel free to reach out on LinkedIn.

Thanks for reading Fun With The Cloud! Subscribe for free to receive new posts and support my work.

https://yehudacohen.substack.com/p/the-top-6-items-on-my-reinvent-2022
VPC @ Edge is an important and missing AWS Feature
What would make AWS even better: #7 in my countdown from 10

This is item #7 in my count-down of items on my 2022 re:invent wishlist. You can find item #8 here.

A customer wants to deploy a globally accessible website with minimal latency, caching of resources at the edge, and server side rendering at the edge.

AWS: “Not a problem. Just use CloudFront with CloudFront Functions and Lambda@Edge.”


A customer wants to deploy the same solution, but privately without allowing traffic to enter or leave their VPC. Maybe for a customer with compliance requirements, or maybe for a development environment.

AWS: “Do you absolutely require server side rendering at the edge? You’re probably going to have to make some tradeoffs here.”

“What about my origin?” asks the customer, “does that have to be exposed over the internet?”

AWS: “Why yes, yes it does.”

AWS doesn’t actually say these things, but people have been asking about VPC support from CloudFront in various forms for many many years.

I’m not optimistic we’ll be getting support for this feature because the question seems to get pushed aside with questions like:

  1. If you’re already inside a VPC why do you need CloudFront? Just hit the origin directly.

  2. Why don’t you just add a WAF in front of your CloudFront distribution and whitelist it by IP address?

But CloudFront is a lot more than a cache. It is a reverse proxy solution that is highly configurable with Lambda@Edge and CloudFront Functions, and it can group together disparate backends to provide a unified experience.

Here is what I want from CloudFront that I don’t yet have:

  1. Let me place VPC endpoints in each target VPC and associate these with my CloudFront distribution origin configurations. Let me configure security group rules for each of these VPC endpoints.

    1. Let these origin domain names be resolved using the DHCP options of the VPC so that I can use private domain names.

    2. If you want to make me really happy, let me target internal network load balancers with Global Accelerator too.

  2. Let me restrict access to CloudFront by source VPC endpoint or source VPC using a resource policy, similar to the way API Gateway does.

I am a huge fan of using CloudFront distributions as reverse proxies to minimize the surface area for attack. I run into VPC issues every time I want to deploy internal-only development accounts. Plus, it would be really nice not to have to worry about rotating shared secrets regularly to protect public load balancer origins.

Thanks for reading Fun With The Cloud! That’s it for this week, but #6 in my Re:Invent 2022 wishlist is coming soon!

https://yehudacohen.substack.com/p/vpc-edge-is-an-important-and-missing
A comprehensive guide to communication in distributed systems with AWS
Part 1 of 2: How push and pull-based communication architectures are used with synchronous and asynchronous services

Since starting my career in software engineering around eight years ago, I have been a little obsessed with distributed and event-driven systems. This blog post is the first of two in an attempt to formalize some of the fundamentals I have learned along the way. Although the concepts I discuss herein apply generally, I use AWS communication services to help illustrate these concepts.

There is a lot of content I want to discuss here, and so I have broken it up into two parts.

This first section is entitled: How push and pull-based communication architectures are used with synchronous and asynchronous services.

This first part addresses the following:

  1. Understanding and visualizing lambda invocation approaches

  2. Push-based, asynchronous APIs

  3. Comparing pull-based and push-based communication models

  4. Comparing pull-based communication vehicles

    1. Queues

    2. Streams

The second section is still in progress and is entitled Organizing and integrating services and I will try to publish it over the next couple of weeks. The second section addresses:

  1. Delivering a single message to multiple destinations

  2. Rule-based delivery of messages to multiple consumers

  3. Orchestrating and choreographing complex workflows

    1. Event-driven choreography

    2. Engine-driven orchestration


Without further ado then:

How push and pull-based communication architectures are used with synchronous and asynchronous services

A couple of months ago, I came across a very interesting and all-around excellent analysis published by Luc van Donkersgoed: Serverless Messaging: Latency Compared.

In this blog post, Luc experiments with lots of different mechanisms of triggering a lambda function. He measures the average time taken from the time a message is sent until a lambda function is triggered across a sample of these mechanisms.

Message latency is, however, but one of the factors to consider when selecting a method of communication for your integration.

In Luc’s analysis, a column of the results table is dedicated to indicating whether an AWS service integration with Lambda is pull-based or push-based. Later on in the analysis, Luc explains the difference between pull-based and push-based lambda integrations:

In our Lambda-to-Lambda setup a push-based messaging system will invoke our Consumer Lambda Function for us.

and:

In a pull-based system like SQS or Kinesis, a message is placed onto a queue or stream, from where it can be retrieved by a consuming system.

Queues and streams are the two primary pull-based transport vehicles for most distributed systems. SQS allows you to create managed queues, and Kinesis allows you to create managed streams.

There is nonetheless a subtle but important distinction between a push-based service integration and a push-based communication paradigm.

You see, while a pull-based service integration necessitates a pull-based communication paradigm, the converse is not true. A push-based service integration can provide an interface that encompasses two steps:

  1. Placing a message into a pull-based communication vehicle.

  2. Performing the work with an idle worker process that pulls the message off the queue.

Luc therefore includes another column in his results table indicating synchronous or asynchronous invocation. An asynchronously invoked lambda function abstracts over its message queue integration, but the workers still pull the task from a queue internal to the lambda service.

There are therefore three ways to invoke a lambda function:

  1. A synchronously invoked lambda where the lambda is invoked in a push-based model, and where the response is returned synchronously.

  2. An event-source mapping, where the lambda is configured to listen to a pull-based event source.

  3. An asynchronously invoked lambda function that abstracts over its message queue integration. In this model, workers still pull the task from a queue, but the queue is internal to the lambda service.
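The three models can be sketched with nothing but the standard library. This is a toy simulation, not the Lambda API: `handler`, `invoke_sync`, `poll_event_source`, and `AsyncService` are hypothetical names invented for illustration, and an in-memory `queue.Queue` stands in for both the external event source and the queue internal to the service.

```python
import queue
import threading

def handler(event):
    """Stand-in for a function body (hypothetical)."""
    return f"processed {event}"

# 1. Synchronous, push-based: the caller invokes the handler and waits.
def invoke_sync(event):
    return handler(event)

# 2. Event-source mapping: a poller pulls from a queue the producer owns.
def poll_event_source(source, results):
    while True:
        event = source.get()
        if event is None:  # sentinel: stop polling
            break
        results.append(handler(event))

# 3. Asynchronous invoke: the caller pushes, but the service enqueues
#    internally and a worker pulls from that hidden queue.
class AsyncService:
    def __init__(self):
        self._internal = queue.Queue()
        self.results = []
        self._worker = threading.Thread(
            target=poll_event_source, args=(self._internal, self.results)
        )
        self._worker.start()

    def invoke_async(self, event):
        self._internal.put(event)  # returns before the work is done
        return "accepted"

    def shutdown(self):
        self._internal.put(None)
        self._worker.join()

sync_result = invoke_sync("a")

external_queue = queue.Queue()
esm_results = []
poller = threading.Thread(target=poll_event_source, args=(external_queue, esm_results))
poller.start()
external_queue.put("b")
external_queue.put(None)
poller.join()

service = AsyncService()
ack = service.invoke_async("c")
service.shutdown()
async_results = service.results
```

Notice that models 2 and 3 share the same polling worker. The only difference is who owns the queue and whether the producer can see it.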

Different AWS services integrate with Lambda differently per the following grid:

Understanding and visualizing lambda invocation approaches

To further illustrate the distinction, I have drawn three diagrams:

Diagram 1: Pull-based Service Integration w/ Pull-based messaging

Note that in Diagram 1, a message store is external to the service. The service is configured to pull messages from the external message store. This model is an approximation of how the Event Source Mapping Invocation model within AWS Lambda operates. External event sources that Lambda will poll for work include:

  1. SQS queues (FIFO and regular)

  2. Kinesis streams

  3. DynamoDB streams

  4. Amazon MQ queues

  5. Kafka streams

This is incredibly similar in structure to the way that asynchronous lambda invocations work from a messaging standpoint. In Diagram 2, shown below, messages are pushed to a message store, and workers pull from the message store. This process is almost identical to the flow in Diagram 1. The difference is that the message queue is local to the Lambda Service rather than external. The Lambda service abstracts the message passing process away from the event producer.

Diagram 2: Push-based service integration w/ Pull-based messaging

The final integration model for Lambda is the one that is probably most familiar to most people. This is true because it is arguably the communication model that maps most intuitively to interactive experiences. We send requests directly to an application worker that is waiting for us to send it work. It completes the work, and responds with the result. This application worker is therefore appropriately named an application server. A typical push-based client-server messaging model with an API layer and a load balancer is illustrated below in Diagram 3:

Diagram 3: Push-based service integration w/ Push-based messaging
Push-based asynchronous APIs

It is true that when AWS’ Lambda functions are invoked asynchronously, a message queue intermediary is always used. This is a design decision made within the Lambda service, but not all asynchronous services require a message queue or pull-based messaging.

Webhook endpoints, for example, represent non-blocking asynchronous communication patterns that are nonetheless push-based. As soon as a webhook is invoked, a new thread can be spawned to handle the event. This is illustrated in Diagram 4:

In these scenarios, clients do not need to remain blocked while waiting for a notification from the server. This is in spite of the fact that no pull-based transport vehicles were used in the communication process.

This API represents an asynchronous, but push-based API implementation absent the use of a message queue.
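A minimal sketch of such a webhook, assuming a thread per event as described above. The names `webhook` and `handle_event`, the payload shape, and the 202 status code are illustrative choices rather than any particular framework's API.

```python
import threading

processed = []

def handle_event(payload):
    """Hypothetical event handler that does the slow work."""
    processed.append(payload["id"])

def webhook(payload):
    """Push-based and asynchronous: hand the work to a new thread and
    acknowledge right away, with no queue in between."""
    worker = threading.Thread(target=handle_event, args=(payload,))
    worker.start()
    # 202 Accepted; the thread is returned only so this demo can join it.
    return 202, worker

status, worker = webhook({"id": "evt-1"})
worker.join()
```

The caller gets its acknowledgement as soon as the thread starts. If the process dies before the thread finishes, though, the event is gone, which hints at why a queue so often ends up behind endpoints like this.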

For reasons that we will discuss, although services of this nature are possible, it is incredibly common to write messages received at a webhook endpoint to a message queue. This change in architecture is illustrated in Diagram 5:

Notice how the load balancer behind the webhook API has been replaced with a message queue. Also notice that this second webhook API messaging architecture resembles the architecture used in asynchronous lambda function invocation.

Comparing pull-based and push-based communication models

At this point, you should clearly understand the concepts of push-based and pull-based communication. Let’s understand some of the tradeoffs you make when you select a communication model across a set of dimensions.

Simplicity

Push-based communication models are simpler than pull-based communication models. Effectively, only two components need to be present: a client and a server. The server listens for requests, the client sends a request to a server, the server handles the request and sends the response back to the client.

In the case of a pull-based communication model, a third component is introduced: a transport vehicle, usually in the form of a message queue or a stream.

A producer places a message in a queue or a stream. A worker polls the queue or the stream for tasks it can perform. This creates indirection and the request path changes.

The transport vehicle is not the only complexity introduced, however. After the worker performs the work, the result may need to be relayed back to the caller which might require the caller to listen to event streams.

On top of the actual communication complexity added, adding a message queue makes tracing requests as they flow through our system more difficult. We often need to be able to understand where a request is along its journey. With requests being queued, consumed, and chained together this becomes more and more difficult. This usually results in a more urgent need to implement monitoring solutions, along with increased operational and implementation complexity.

Complexity is something I like to avoid wherever possible, and if simplicity were the only dimension in question the decision would be simple: no pull-based communication, just push-based communication.

Latency

Latency is a difficult dimension to unpack, because different system conditions, implementations, and resource congestion can result in pull-based or push-based communication models winning the race.

That said, assuming underutilized compute, memory, and network resources, I would expect additional network and disk hops along with slower polling intervals to make pull-based communication slightly slower than client-server push-based communication. Even so, the impact of these hops should be small in a well-designed distributed system. It is still very much possible to achieve realtime and interactive experiences using pull-based request models.

Looking at Luc’s analysis table linked above, which compares lambda triggers from a latency perspective, unfortunately yields little insight into this problem. One cannot generalize about performance in pull-based architectures from AWS Lambda triggers, because each integration is service specific. Each integration abstracts over enough functionality that it is impossible to know how much of its latency is introduced by the presence of a queue or stream.

Reliability

When I started my first job as a Software Engineer after college, I remember hearing the constant refrain that message queues are more reliable than synchronous APIs. In my youthful naivety, I doubted the validity of this philosophy. After all, to my mind, an extra hop in the network represented an extra failure scenario.

The services I had used at the time seemed pretty reliable, and I didn’t understand why an invocation writing a message into a queue had any more chance of succeeding than calling the service directly.

I have since become older and less naive. Not all services are equally reliable.

The bottleneck in push-based communication is usually the performance of the application processing the input from the TCP backlog queue. The more dependencies an application has and the more complex the work performed, the slower the service will be able to process the incoming requests. This might be due to memory or CPU or disk or network bottlenecks. As each server is overwhelmed, users will experience higher and higher latency, or the application might crash entirely.

When you are dealing with very simple services with few dependencies that execute quickly with little resource consumption, this will usually not be an issue. Requests get handled quickly and cycle through the system quickly. If you are seeing huge request volumes, you simply scale your backend horizontally and use a load balancer to distribute load across the available servers. If a server’s resource utilization starts to take strain, simply add more capacity.

A synchronous service that places one or more messages or events in a queue or stream is a simple service with very few dependencies. No data transformations occur. Data is persisted. Very few services are as simple as a queue or stream interface, and so the chance of a message not making it to a queue or stream is very, very small.

The more resource intensive a service or the more dependencies it has, or the more time a request takes to process, the higher we can expect the failure rate to be. Most service processes will therefore be far less reliable than writing to a message queue or stream.

One function of pull-based communication vehicles is to decouple systems. This is an often repeated mantra in marketing materials used to sell message queues and streams.

The advantages of decoupling systems are sometimes described as primarily impacting service flexibility. The theory goes that a producer can be wholly unaware of a consumer. A consumer can be wholly unaware of its producer. Each service need only be aware of the message queue in order to operate. Indeed, this is the interpretation given in the Apache ActiveMQ documentation linked above:

The senders and consumers of messages are completely independent and know nothing of each other. This allows you to create flexible, loosely coupled systems…

…Using a message bus to de-couple disparate systems can allow the system to grow and adapt more easily. It also allows more flexibility to add new systems or retire old ones since they don't have brittle dependencies on each other.

I disagree with this characterization entirely. Flexibility of systems is achieved through clearly defined API / Endpoint contracts, not through the use of message queues. Even in queue based systems, it is important to define contracts. If a producer suddenly starts submitting messages with a changed structure, or a consumer suddenly starts requiring a field that used to be optional, inserting a message queue is not going to fix broken integrations.
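To make the point concrete, here is a minimal sketch of a contract check that a consumer might run on each message, whether it arrives over a queue or an HTTP request. The field names and the `validate` helper are hypothetical, chosen only for illustration.

```python
# Hypothetical contract: the fields a consumer requires and their types.
REQUIRED_FIELDS = {"order_id": str, "amount_cents": int}

def validate(message):
    """Return a list of contract violations for a decoded message."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in message:
            errors.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors

ok = validate({"order_id": "o-1", "amount_cents": 1250})
# The producer renamed a field; the queue delivers the message anyway.
broken = validate({"order_id": "o-2", "amount": 12.5})
```

The queue happily carries both messages. Only the contract check notices that the second one is broken.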

If you have well-defined API interfaces and service contracts, the API implementation can easily be replaced or modified. This is true for pull-based and push-based communication models alike.

My understanding of the benefits of using pull-based communication models to achieve decoupling of services is different. It pertains far more to reliability and fault-tolerance than system flexibility. AWS SQS marketing copy provides this great description:

SQS lets you decouple application components so that they run and fail independently, increasing the overall fault tolerance of the system.

Pull-based communications let you design more reliable systems because if a consumer believes it is too busy to pick up a new request, it can simply leave it in the queue to be picked up by another consumer or to wait until it has capacity to service the request.

The more resource intensive a service, the more dependencies a service has, or the more time a request takes to process, the more using a pull-based communication vehicle reduces the risk of system failure.

I like to use the following analogy to describe the reliability benefits gained by pull-based over push-based messaging:

Tom the juggler is performing in front of an audience. Tom is requesting that audience members give him objects to incorporate into his juggling act. Should Tom request that the audience toss their items to him, or should he request that they place it on a table for him to pick up?

Think of placing objects on the table as placing objects in a queue. Think of tossing objects to the juggler as using a regular synchronous, push-based application server. I, for one, hope that nobody in the audience tosses a chainsaw to the juggler.

Durability

While durability sounds like a similar metric to reliability, there is a subtle but important difference. Reliability describes how likely it is that the system fails. Durability describes how resilient the service is in failure scenarios. This is a key component of fault-tolerance.

If a producer places a message into a message queue or a record into a stream and a consumer is experiencing an outage at the time, there is no harm caused. When the consumer is repaired, it can pick up where it left off and process a backlog of messages or records.

Message queues and streams both typically retain messages and records in failure scenarios. We will discuss how different failure scenarios are typically handled with each of streams and queues in a comparison later on.

In push-based communication models, when requests fail, the client needs to become aware of these failures. Once a client is aware of these failures, simply retrying might result in additional failures, so protocols need to be established to determine when and how to retry requests. These protocols might or might not be reliable, and they might need to change over time. They will certainly result in significant complexity in client code.
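One common shape for such a protocol is retrying with exponential backoff. The sketch below is an illustration of that idea only, not a recommendation of specific values: `TransientError`, `call_with_retries`, and the delay parameters are all hypothetical.

```python
import time

class TransientError(Exception):
    """Hypothetical error type for failures worth retrying."""

def call_with_retries(request, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Retry `request` with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Double the wait after every failed attempt.
            sleep(base_delay * 2 ** (attempt - 1))

attempts = []

def flaky_request():
    """Fails twice, then succeeds (simulated server under strain)."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("server busy")
    return "ok"

delays = []
# Record the delays instead of sleeping so the demo runs instantly.
result = call_with_retries(flaky_request, sleep=delays.append)
```

Even this small version forces decisions about which errors are retryable, how long to wait, and when to give up, and every client of the service has to make them.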

Or, the consumer can simply introduce a pull-based communication vehicle to decouple it from the producer. In this way, even if the consumer fails, no further input from the producer is required to recover. The recovered consumer can simply go back in time and process the requests that are backlogged in its pull-based communication vehicle awaiting processing.

Cost

Relative to most services, message queuing and streaming services are usually very cheap. As such, to understand how pull-based communication vehicles affect system cost, we must investigate how resource utilization changes as an integration model changes from push-based to pull-based.

We mentioned above that a consumer can decide when to pull a request off a queue, but a server cannot decide when to process a request from a user. This means that if we want to ensure that a push-based service is stable, we must over-provision capacity to handle the requests. After all, who knows when or how big the next request is going to be. Even if a system is idle, push-based message models require enough capacity to handle a sudden influx of large requests.

In the case of pull-based request models, because a consumer can elect when to process a message it is possible to achieve far greater utilization of resources. After all, if a sudden influx of large requests comes in after some idle time, they will be queued, and more capacity can be added to the system to process them, or the backlog will slowly be drained during other idle periods.
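A toy calculation illustrates the difference. The workload numbers below are made up, and the model is deliberately crude: each unit of capacity handles one unit of work per time slot.

```python
# Hypothetical workload: units of work arriving in each time slot.
arrivals = [0, 0, 10, 0, 0, 0, 2, 0]

# Push-based: every request must be served in the slot it arrives,
# so capacity has to cover the worst slot.
push_capacity = max(arrivals)
push_utilization = sum(arrivals) / (push_capacity * len(arrivals))

# Pull-based: a smaller, fixed fleet drains a backlog at its own pace.
def drain(arrivals, capacity):
    """Return (work left at the end, largest backlog seen)."""
    backlog, peak_backlog = 0, 0
    for arriving in arrivals:
        backlog += arriving
        peak_backlog = max(peak_backlog, backlog)
        backlog -= min(backlog, capacity)
    return backlog, peak_backlog

pull_capacity = 2
remaining, peak_backlog = drain(arrivals, pull_capacity)
pull_utilization = sum(arrivals) / (pull_capacity * len(arrivals))

print(f"push: capacity {push_capacity}, utilization {push_utilization:.0%}")
print(f"pull: capacity {pull_capacity}, utilization {pull_utilization:.0%}")
```

In this made-up scenario the pull-based fleet is a fifth of the size and still finishes all of the work. The price is that some of that work waits in the backlog for a few slots.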

For services where influxes of large requests are common, pull-based strategies enable the maximization of allocated resources.

A Summary

Processes which are long running, or are resource intensive, or have many dependencies become far easier to implement reliably, durably, and cost effectively leveraging streams or queues.

Although pull-based communication vehicles introduce another hop in a request’s journey, they do so with specific goals in mind. I, for one, do not know if it is possible to achieve reliability, durability, and cost-effectiveness for the classes of services that we have identified without using a stream, queue, or analogous pull-based replacement.

Comparing pull-based communication vehicles

Streaming and queueing systems both typically provide pull-based communication vehicles for passing data. The similarities end there. Queues and streams are each designed to solve a distinct set of challenges.

Queues

Queues can be thought of as ordered backlogs comprised of work that must be performed. Each message in a queue represents a segment of work intended for a single consumer or worker. Which consumer, you ask? Why whichever one is able to get to the work first.

When a worker pulls a message off of a queue, the message becomes invisible, or locked so that other workers do not work on the same message concurrently.

If a worker fails to process a message, or in some implementations if a worker times out, the message becomes available to other workers for processing. If multiple workers fail to process the same message, queuing systems usually support a mechanism for quarantining that message, moving it to a dead-letter queue so that it does not clog up the workers. This dead-letter queue can deliver messages to error handling processes.
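These mechanics can be sketched with a toy in-memory queue. This is not how SQS or any real broker is implemented: `ToyQueue`, its method names, and the logical clock passed in as `now` are all assumptions made for illustration.

```python
from collections import deque

class ToyQueue:
    def __init__(self, visibility_timeout=30, max_receives=2):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.visible = deque()   # messages any worker may receive
        self.in_flight = {}      # message -> time it becomes visible again
        self.receives = {}       # message -> times it has been received
        self.dead_letters = []   # quarantined messages

    def send(self, message):
        self.visible.append(message)
        self.receives[message] = 0

    def _expire(self, now):
        # Return timed-out messages to the queue, or quarantine them.
        for message, deadline in list(self.in_flight.items()):
            if now >= deadline:
                del self.in_flight[message]
                if self.receives[message] >= self.max_receives:
                    self.dead_letters.append(message)
                else:
                    self.visible.append(message)

    def receive(self, now):
        self._expire(now)
        if not self.visible:
            return None
        message = self.visible.popleft()
        self.receives[message] += 1
        self.in_flight[message] = now + self.visibility_timeout
        return message

    def delete(self, message):
        # A worker calls this after processing the message successfully.
        self.in_flight.pop(message, None)

q = ToyQueue(visibility_timeout=30, max_receives=2)
q.send("good")
q.send("poison")

first = q.receive(now=0)     # "good" is locked by worker A
second = q.receive(now=0)    # worker B gets "poison", not "good"
q.delete(first)              # worker A succeeds
retry = q.receive(now=31)    # "poison" timed out and is visible again
nothing = q.receive(now=62)  # second timeout: moved to the dead letters
```

The message that keeps timing out is retried once and then set aside, so it cannot hold up the rest of the backlog.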

AWS has two services that provide queues: Amazon SQS, and Amazon MQ, which provides managed solutions for RabbitMQ and ActiveMQ.

Streams

Streams maintain a persistent, ordered view of data records. Unlike queues, when consumers read records from a stream, they remain in place and visible to other consumers.

In stream-based architectures, consumers are responsible for keeping track of what messages in the stream they have processed and maintain a cursor to the most recently processed message. This property enables multiple types of consumers to process the same stream in parallel.

Because streams are not simply a backlog of work, they offer no locks or dead-letter messages, and when consumers have processed records they do not delete them from the stream. This leaves consumers with the responsibility of implementing fault-tolerance themselves. In stark contrast to queues, streams do not support visibility timeouts or message locking.
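A minimal sketch of that model, assuming an append-only list as the stream and an integer offset as each consumer's cursor. `ToyStream` and `Consumer` are hypothetical names; real services such as Kinesis or Kafka add shards, partitions, and durable checkpoints on top of this idea.

```python
class ToyStream:
    """Append-only log; reading never removes records."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read(self, offset, limit=10):
        return self.records[offset:offset + limit]

class Consumer:
    """Each consumer tracks its own cursor into the stream."""

    def __init__(self, stream):
        self.stream = stream
        self.cursor = 0
        self.seen = []

    def poll(self, limit=10):
        batch = self.stream.read(self.cursor, limit)
        self.seen.extend(batch)
        self.cursor += len(batch)  # advance only after handling the batch
        return batch

stream = ToyStream()
for record in ["r0", "r1", "r2"]:
    stream.append(record)

analytics = Consumer(stream)
auditing = Consumer(stream)

analytics.poll(limit=2)  # reads r0, r1
auditing.poll()          # reads r0, r1, r2 independently
stream.append("r3")
analytics.poll()         # resumes at its own cursor: r2, r3
```

Nothing in the stream changes when a consumer reads. If a consumer crashes mid-batch, recovery is entirely a matter of where it chose to save its cursor.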

One of the other key differences between queues and streams is that messages in a queue are intended to be processed independently. One message never needs to be processed alongside a second message. Streams on the other hand index every item and data is often processed in time-windows. Multiple records within a window are often aggregated and processed to provide real insights over sequenced data.
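As an illustration of window-based processing, here is a small sketch of a tumbling (fixed, non-overlapping) window aggregation. The records and the ten-second window are made-up values, and `tumbling_window_sums` is a hypothetical helper rather than any framework's API.

```python
from collections import defaultdict

# Hypothetical records: (timestamp_in_seconds, value)
records = [(1, 5), (4, 3), (12, 7), (15, 1), (27, 4)]

def tumbling_window_sums(records, window_seconds):
    """Sum values per fixed, non-overlapping time window."""
    sums = defaultdict(int)
    for timestamp, value in records:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sums)

window_sums = tumbling_window_sums(records, window_seconds=10)
print(window_sums)
```

A queue worker that sees one message at a time could not compute these sums without keeping state elsewhere. A stream consumer reading a range of records gets them almost for free.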

This property of streams makes them quintessential components of many realtime analytics architectures, and streams therefore provide APIs to query messages in bulk. Stream processing frameworks like Apache Spark have been built to enable complex aggregations over data within streamed records using familiar data query languages like SQL.

Because streams let you pull huge volumes of data at a time, and because retrieving that data does not perform any distributed locking, you can often improve processing performance tremendously by switching from a queue worker to a stream consumer when processing huge volumes of data.

Several Amazon services provide streams. Amazon provides its own Kinesis Data Streams, and Amazon MSK provides a managed Kafka solution. DynamoDB record changes can also be configured to reactively emit to DynamoDB Streams.

A Summary of Part 1

Streams and queues are both very useful components when attempting to define pull-based architectures, but they target different use-cases. Streams were created to enable consumers to process sequenced data in bulk without overwhelming any consumer, while queues exist to distribute work across a series of parallel workers without overwhelming any worker.

In AWS, you will notice that Lambda polling is available for SQS, Amazon MQ, DynamoDB Streams, Kinesis streams, and Amazon MSK (managed Kafka) streams.

Queues and streams are very powerful components that empower your architecture when designing pull-based solutions.

If your service calls are short and simple CRUD (Create, Read, Update, and Delete) operations, you are better served by push-based communication services like AWS API Gateway, CloudFront, and AWS Elastic Load Balancers.

This isn’t the end of the story, however. We also have other very powerful abstractions that let us compose, fan out, and use rule-engines to orchestrate and choreograph complex distributed processes. Part two of this blog series is coming soon and is focused on Organizing and Integrating Services.

Thanks for reading Fun With The Cloud! Subscribe for free to receive new posts and support my work.

https://yehudacohen.substack.com/p/a-comprehensive-guide-to-communication
Wanted: a serverless solution to exchange IAM access for network access to VPC resources
What would make AWS even better: #8 in my countdown from 10

This is item #8 in my count-down of items on my 2022 re:invent wishlist. You can find item #9 here.

The Current State of CloudShell

I was very excited when AWS released CloudShell at the end of 2020. It seemed like a very obvious missing piece of the AWS puzzle. Why should I need to spin up an entire EC2 instance or Workspace when all I want is to run a command against my AWS environment? Not to mention that GCP and Azure both had cloud shell options at the time.


Since then, I’m disappointed to say, I’ve only used CloudShell a handful of times. Maybe, I need to get the scheduled scaling configuration for an ECS service, and don’t feel like logging in again. Maybe I’m fighting a fire from someone else’s computer. Having a little shell which I can start in a new tab right in my browser is useful. Having the AWS CLI preconfigured with my current logged in identity is even more useful.

With this said, CloudShell is somehow not useful enough for me to actually use frequently.

The lesser reason for this is that it (understandably) spins up more slowly than my local computer. The greater reason is that CloudShell doesn’t have access to my network resources. The vast majority of the time that I pull up a terminal outside of developing in my IDE, I need access to my network resources.

Perhaps, I want to telnet to a port and see if it’s listening. Maybe, I want to connect to a database and view my data. I might want to test DNS resolution within a VPC. Or maybe an API Gateway endpoint is configured to reject traffic that does not originate in my VPC. For these situations, and any other situation where I need access to my network resources, CloudShell does not help me.

How I access Network Resources

I usually interact with my private network resources by connecting to them over a Client VPN connection from my MacBook.

This workflow replaces the one I used before AWS Client VPN was released: an EC2 bastion instance and SSH tunnels to network resources. Back then, I remember entire operations processes built around rotating SSH keys and authorizing keys to access the bastion instance. I knew even less what I was doing then than I do now, but I do not miss those days.

A desired user experience

In its most distilled form, I am looking for a simple mechanism to exchange my IAM credentials for an ephemeral, serverless shell session with access to VPC resources. Ephemeral, because I don’t want to keep paying $$$ after I terminate the session. Serverless, because I don’t want to worry about OS configurations or managing servers.

A design based on existing AWS services

If you are familiar with SSM Session Manager, you might have noticed that a mechanism to attain a shell session with access to VPC resources already exists, but not in a serverless variety.

In case you’re not familiar with SSM Session Manager, here is a brief overview of how it works. Highlights below:

  1. Ensure that the SSM agent is running on an EC2 instance. Most of the prebuilt Amazon Linux AMIs will take care of this for you.

  2. Grant your EC2 instance role access to the set of API actions necessary to make Session Manager work.

  3. Install the Session Manager plugin for the AWS CLI.

  4. Make a request to establish a shell session with a target instance.

  5. If your principal (role, user) has the appropriate permissions, a bidirectional communication channel is established between you and the server via Session Manager.

Better than simply being able to gain shell access, Session Manager allows you to tunnel traffic between a port on a client computer and a port on an SSM-managed instance. With this functionality, it becomes possible to run SSH over Session Manager and access remote resources over an SSH tunnel like you would with a bastion host.
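As a rough sketch of that port-forwarding workflow, the snippet below builds the AWS CLI invocation for a Session Manager tunnel. It assumes the AWS CLI and the Session Manager plugin are installed; the function names are mine, and the instance ID and ports in the usage note are placeholders.

```python
import json
import subprocess


def build_port_forward_command(instance_id, remote_port, local_port):
    """Build the AWS CLI call that forwards a local port to a port on an
    SSM-managed instance via the AWS-StartPortForwardingSession document."""
    parameters = {
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", instance_id,
        "--document-name", "AWS-StartPortForwardingSession",
        "--parameters", json.dumps(parameters),
    ]


def open_tunnel(instance_id, remote_port, local_port):
    """Run the command. This blocks until the session is closed."""
    return subprocess.run(
        build_port_forward_command(instance_id, remote_port, local_port),
        check=True,
    )
```

Calling open_tunnel("i-0123456789abcdef0", 22, 2222) would, under those assumptions, expose the instance’s SSH port on local port 2222 for as long as the session stays open.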

With these primitives in place, extending CloudShell so that a session can connect to your VPC resources should be achievable. An instinctive design for this feature looks something like the following:

  1. The AWS console should let you create a set of launch profiles, each specifying a set of subnets and security groups.

  2. For each profile, a network interface for the launch profile should be deployed in the corresponding subnet, and placed in the specified security groups.

  3. When a CloudShell session is created, you should be able to associate it with a launch profile.

  4. The created shell session should route traffic via the network interface of the associated launch profile. This should be achieved in the same way that AWS Lambda and ECS Fargate gain VPC access.

  5. You should be able to connect to the CloudShell session via a Session Manager-like interface, but instead of targeting an instance, you target a CloudShell session id. In this way, you can tunnel traffic from your local machine through CloudShell to gain access to your network resources without deploying an EC2 instance.

A More Powerful GCP Identity Aware Proxy

One last aside, if you will. CloudShell + VPC launch profiles effectively results in a more powerful version of GCP’s Identity-Aware Proxy.

GCP’s IAP lets you access HTTPS resources in your VPCs via a proxy instead of via VPN. The IAP ensures that a user is authenticated and authorized to access the HTTPS resource before proxying traffic to endpoints in your private network.

CloudShell + VPC launch profiles with the above implementation lets you do the same for all network resources, not just HTTPS resources.

Takeaways

With VPC access and the ability to tunnel through CloudShell sessions in place, I imagine I would use CloudShell a whole lot, both via the CLI and the console. Being able to troubleshoot and connect to AWS VPC network resources without deploying a client VPN would save money and time in many instances.

More importantly, the ability to bind network sessions to IAM authentication and control access centrally through IAM policies would be secure, and convenient to enforce and administer.

And so, if you are an AWS genie, item #8 on my wishlist is the ability to attain ephemeral access to my AWS network resources over IAM. I don’t really care about the implementation details, but based on the AWS technology I have worked with, enabling CloudShell to interact with my network resources would be a great start. Providing an IAM-authenticated mechanism to tunnel traffic originating on my MacBook through these CloudShell sessions to my VPC resources would be game-changing.

https://yehudacohen.substack.com/p/wanted-a-serverless-solution-to-exchange
Can we get security groups rule references that extend across Cloud WAN, transit gateway, and cross-region VPC peers?
What would make AWS even better? - #9 in the countdown from 10

This is item #9 in my count-down of items on my 2022 re:invent wishlist. You can find item #10 here.

Back when I started building on AWS, I used to think of network security in terms of private and public subnets and nothing more. I was a software engineer after all, and what did I care for configuring firewalls or routes? Networking was something that needed to get out of the way and just let my code work. And now it was my problem.

Back in college, we learned some theory about the TCP handshake protocol and subnet masks, but I didn’t fully internalize the concepts. Years later, I had forgotten most of the meager theory I once knew.

So, I did what I usually do when I don’t know what I’m doing: I consulted the great sage, Google, and found an article called Practical VPC Design which I read carefully and followed to a tee.

At the time, I would use two security groups on any network interfaces I deployed:

  • A public security group that I would attach to internet-facing load balancers. This security group allowed ingress TCP traffic from 0.0.0.0/0 on application ports (usually 443 and 80). It then allowed egress TCP traffic on all ports to the CIDR range of the VPC.

  • A private security group that allowed ingress from itself and the public security group, and egress traffic to the internet.

I believe that my present state of embarrassment at admitting this fact is appropriate. After seeing Security Hub findings highlighting open ports, I started taking this more seriously, but I never had a clear strategy for deciding when to create a new security group or which kinds of network interfaces to place in each.

This changed around two and a half years ago, when I met someone over the internet named Evan Spaeder. He is now, and was then, the CEO of Foresight Technologies. Back then, he was the company’s only employee.

Evan was and still is a great network engineer. While my bread-and-butter was software, his was enterprise architecture and networking. This is relevant because he said something off-handed to me that changed everything I thought I knew about using security groups. I’ll paraphrase, because I don’t remember the exact words he used:

“AWS security groups are powerful and easy: just give every service or component its own security group. That way you can create a zero trust network by whitelisting any network paths you want to use.”

This approach resembled something I was already familiar with: authorization rules in a service mesh. Except while authorization policies in a service mesh were enforced by the sidecar, the zero-trust network model that Evan described was enforced at the network layer. Since this conversation, security groups have brought a huge paradigm shift to how I think about network security. These days I use security rules to build zero-trust networks wherever possible.
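To make the one-security-group-per-service idea concrete, here is a minimal sketch of an ingress rule that names another security group as its source instead of a CIDR range. The helper name and the group IDs are placeholders of mine; the dictionary follows the request shape of the EC2 AuthorizeSecurityGroupIngress API.

```python
def allow_from_security_group(dest_group_id, source_group_id, port):
    """Build keyword arguments for EC2 AuthorizeSecurityGroupIngress that
    allow TCP traffic on one port from members of another security group."""
    return {
        "GroupId": dest_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                # Reference the source by security group, not by CIDR.
                "UserIdGroupPairs": [{"GroupId": source_group_id}],
            }
        ],
    }


# Placeholder group IDs: the API service may reach the database on 3306.
rule = allow_from_security_group(
    "sg-0123456789abcdef0", "sg-0fedcba9876543210", 3306
)

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(**rule)
```

Every network path you want to permit becomes one such rule, and anything not listed is denied by default.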

I’ll briefly digress to say that Evan and I developed a mutual respect and trust and our companies merged in January of last year (2021) after we had already been working together for several months. Since merging, our joint team has grown from a headcount of 4 to 24 with no sign of slowing down. Along the way, I have learned a great deal from Evan about enterprise architecture and networking.

Returning to the subject at hand, you will notice that I said I use security rules to build zero-trust networks wherever possible. Building zero-trust networks with security rules is not always possible.

The first time I ran into a limitation regarding referencing a security group from another security group was also the first time I built a transit gateway. It was at some point in 2020, I don’t quite remember when. It was also the first time that I tried to put into practice learnings from the then newly released AWS whitepaper Building a Scalable and Secure Multi-VPC AWS Network Infrastructure.

By this point, I was accustomed to referencing security groups across multiple accounts across a VPC peering connection and was surprised when my attempt to reference a security group in another security group’s rule did not work over a transit gateway.

I shouldn’t have been surprised. The very white-paper I was basing my network architecture on contains this excerpt:

Security groups referencing works with intra-Region VPC peering. It does not currently work with Transit Gateway. Within your Landing Zone setup, VPC Peering can be used in combination with the hub and spoke model enabled by Transit Gateway.

I hadn’t read it closely enough. So I started configuring hybrid network topologies that combine a transit gateway with VPC peering. In these models, traffic between VPCs is routed through peering connections, while traffic destined for other transit gateway attachments is routed through the transit gateway. In this way, I could define zero-trust security group rules that spanned accounts and VPCs by referencing other security groups across a VPC peering connection.

This network topology has created weird asymmetric routing issues when traffic is sent through a VPC peer from a network load balancer that is configured to preserve the client IP address. In these cases, traffic is dropped as it egresses the VPC via the transit gateway instead of the VPC peering connection it arrived on. For the most part, however, the setup works well.

The zero-trust network model breaks down when you start trying to reference security groups across an inter-region VPC peer. You see, a security group is a regional construct. When you attempt to reference a security group from a security group rule that crosses an inter-region VPC peer, your request will fail. The security group rule in us-east-1 cannot see a security group in us-east-2.

The effect of this is that if a service in us-east-2 needs to receive traffic from a source in us-east-1, an ingress rule must be added to its security group that allows traffic from the CIDR blocks of every subnet that the source in us-east-1 is deployed to. After all, with the exception of NLB ENIs, no network interface in an AWS VPC is guaranteed a static private IP address.

After adding the requisite rules to establish connectivity, all other services deployed to the same subnet as the source have network access to the destination and zero-trust is broken.
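A small illustration of why a CIDR-based rule breaks zero trust, using made-up addresses: the rule cannot tell the intended source apart from any other network interface in the same subnet.

```python
from ipaddress import ip_address, ip_network

# Hypothetical subnet in us-east-1 that hosts the source service.
source_subnet = ip_network("10.1.4.0/24")

# Hypothetical private IPs of two network interfaces in that subnet.
intended_source = ip_address("10.1.4.17")
unrelated_neighbor = ip_address("10.1.4.200")


def cidr_rule_allows(rule_cidr, source_ip):
    """A CIDR-based ingress rule matches any source address in the range."""
    return source_ip in rule_cidr


print(cidr_rule_allows(source_subnet, intended_source))     # True
print(cidr_rule_allows(source_subnet, unrelated_neighbor))  # True
```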

Recently, Evan and I were at the AWS Summit in New York where we attended a session on AWS Cloud WAN. Cloud WAN has the same limitation, and security group rules cannot reference other security groups across Cloud WAN attachments.

While there, we met and had a great conversation with an exceptional solutions architect at AWS. We extensively discussed limitations around security groups in particular with her.

The architect in question seemed unfamiliar with the methodology of using security groups to form a service mesh. She clearly understood the methodology’s merits, however, and spent time discussing some of the underlying factors that make the prospect tricky.

My takeaways from the conversation were that the use of a security group rule that references a source security group was initially designed to operate within a single VPC. AWS has extended this concept to apply to an intra-region VPC peering connection, but has not yet managed to apply it to transit gateways, cross-region VPC peering connections, or Cloud WAN because of the technical challenges involved.

It is my hope that this Re:Invent changes that, and we can begin to build zero trust networks that span the globe on AWS.

https://yehudacohen.substack.com/p/countdown-from-10-what-would-make
What would make AWS even better...
#10 on my Re:Invent 2022 Wishlist countdown

I spend a good amount of time using AWS. For the most part I love the power that the platform puts into your hands.

With that said, it could be better.

“Over 200 services not enough for you? You’re never going to be happy,” I hear you cry.

To that I reply, that’s probably true. Also, my ticket to re:Invent was expensive this year, so I feel like AWS should come prepared.

“Prepared with what?”

Well funny you should ask. You see, I have prepared a wishlist that I have whittled down to ten items. So if you are an engineer or product manager at AWS and you dream of being a genie, here is how you make my AWS dreams come true this November.

Without further ado, I present the first entry in my blog-series countdown:

#10: Long running lambda functions

I really really love lambda functions. They let me do most of what I want so long as I can make it happen in 15 minutes. Unfortunately, many of the things I want to do take longer than 15 minutes.

[Image: clear hourglass with brown frame. Photo by Kenny Eliason on Unsplash]

“What could take longer than 15 minutes?” you might ask. And that question might be because you’re used to writing web-apps with request / response times well under a second. Or because you really like batching and parallelizing computational work.

If you are in the latter group, I can only say that I don’t share your tastes. Maybe it’s just because I have too much to do, but I like to optimize as late in the game as I possibly can. If you are in the former group, here is an incomplete list of where some of us more backend-focused engineers spend our time.

  • Database cloning tasks

  • Database seeding tasks

  • Web crawling tasks

  • Media transcoding tasks

  • ML model building tasks

  • Infrastructure provisioning tasks

  • Data analytics tasks

  • Data synchronization tasks

But if you’re a know-it-all like me, you might be thinking to yourself, “What about Fargate? You can do all of this with Fargate!” Or perhaps, “I don’t need lambda to transcode my video! I have Elemental MediaConvert for that!” Or, “I don’t need my lambda function to build my ML model! I have SageMaker for that!” Or one of any number of other valid points.

Here’s the thing though: AWS Lambda integrations with AWS services are really, really good! Especially Step Functions. If I invoke the ECS RunTask API from Step Functions (something I do quite a lot, I might add), it’s rarely as simple as passing JSON directly.

I end up using environment variable overrides, command overrides, or a command override plus some API gymnastics in my task to hydrate references. I also have to handle returning response or error data back to my state machine.
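For a sense of the plumbing involved, here is a minimal sketch of building the overrides argument for the ECS RunTask API so that a JSON payload reaches the container. The helper name, the TASK_INPUT variable, and the container name in the usage note are assumptions of mine.

```python
import json


def build_run_task_overrides(container_name, payload, command=None):
    """Build the `overrides` argument for the ECS RunTask API, passing a
    JSON payload to the container through an environment variable."""
    container_override = {
        "name": container_name,
        "environment": [
            # TASK_INPUT is a made-up variable name the task would read.
            {"name": "TASK_INPUT", "value": json.dumps(payload)},
        ],
    }
    if command is not None:
        # Optionally replace the container's default command as well.
        container_override["command"] = list(command)
    return {"containerOverrides": [container_override]}
```

Something like build_run_task_overrides("worker", {"tenant": "acme"}) could then be passed as the overrides parameter of a RunTask call, and the task still has to parse the variable and report its result back on its own.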

This experience results in monotonous work that needs to be accurate. Not my cup of tea. (I imagine good tooling could help with this pain-point. I’m watching the functionless framework closely here.)

But frameworks will not be able to solve performance issues. You see, lambdas start up within a couple of seconds, while it can take a minute or two for Step Functions to kick off a Fargate task.

Observability is also easier with Lambda functions. You can click straight through from Step Functions to Lambda execution logs, while stopped Fargate tasks only linger for an hour. If you want to click through, you’d better do so quickly. I concede that you can hook up EventBridge to log stopped task details. But then you need to build your own user experience on top of what AWS gives you. I don’t have time for that.

An even larger benefit of this proposal is that there is a whole bunch of great tooling built to deploy Lambda functions: Serverless Framework, SAM, Chalice, etc. I can’t use that with Fargate or Step Functions Activities. Why shouldn’t you be able to deploy long-running tasks to AWS with these great frameworks?

All in all, there are lots of reasons to make Lambda functions last longer. But maybe the biggest and best reason is that the competition is already doing this. That’s right, long-running functions are one of the few things Azure does better than AWS. If you provision a function app on Azure with the right plan, you can run it all the time without needing to time out functions.

Is there a good reason that we need to time out Lambda functions at 15 minutes? Probably. Maybe tying this feature to provisioned concurrency is necessary to ensure that the Lambda runtime is stable from a work scheduling standpoint. Either way, AWS engineers are smart, and implementing this feature would make this Lambda user’s AWS wishes a reality.

Thanks for reading Fun With The Cloud! That’s it for this week, but #9 in my Re:Invent 2022 wishlist is coming soon!

https://yehudacohen.substack.com/p/what-would-make-aws-even-better
How an Azure migration kicked my butt
A less than epic tale of moving a hybrid-tenant application from AWS to Azure

Only a year and a half ago, I led the effort to build a backend to handle a hybrid-tenant solution for one of our customers on AWS.

The idea was simple: bin-pack application micro-containers on Amazon EC2 instances, and schedule them with ECS. One application container per tenant. Clone a prepared seed database to bootstrap the application and place it in a shared database cluster to save on infrastructure costs. Maintain an inventory of tenant state in a DynamoDB table. Inject an email approval workflow in the middle. Orchestrate all steps and gain visibility into tenant deployments with Step Functions.

It seemed simple, and it was relatively simple. There were a couple of hoops to jump through in terms of how we would fit tons of containers onto an EC2 instance with only a few network interfaces, but that’s a story for another time.

So we defined multi-tenant infrastructure code templates and deployed them. We defined single-tenant infrastructure code templates and created a command line deployment CLI for long running tasks and dockerized it. We defined a set of Step Function state machines to handle the data integrations, decisions, and inventory management for these deployments.

When we built this pipeline we came up with a plan, executed it, and it worked without any major wrenches.

A couple of months ago, we got an interesting request. Certain tenants were dead-set on dedicated hybrid-tenant environments in Azure and GCP, and were willing to pay to make it happen. The team and I are not entirely unfamiliar with Azure and GCP environments, but we hardly have the depth and breadth of experience there that we do with AWS.

The remainder of this blog post is dedicated to five primary goals:

  1. Detailing the plan to deploy this application to Azure.

  2. Describing the roadblocks that we ran into along the way arising from Azure limitations.

  3. Describing how we adapted to said roadblocks.

  4. Describing what changes I would make if I were to start from scratch.

  5. Reflecting on the merits of my general experience building non-trivial Azure infrastructure.

The Plan

The plan is simple but there are still a few kinks to be worked out. Azure environments will consist of a vnet with one subnet for the databases, and another for each tenant application.

We still need to decide where to host the container without ECS available to us.

AKS is always an option but I like solutions that just work and Kubernetes always feels like extra work to manage. Especially if you want out of the box monitoring and dashboards, and logs.

Azure Container Apps was still in beta at the time and did not yet have Terraform support. At a glance, Azure Service Fabric looks kind of similar to Kubernetes, and I’m looking for something that doesn’t require me to manage underlying VMs. Azure Container Instances doesn’t autoscale or manage SSL certificates.

One of our engineers discovers Azure App Service which looks very close to what we are looking for. It gives us a very similar user experience to AWS App Runner and should be super simple to get up and running with.

The Execution

An engineer on my team at Foresight Technologies writes some Terraform scripts, spins up a vnet, and deploys Azure Database for MySQL servers. He then deploys an App Service along with a Regional VNet Integration to allow it to reach the database, pushes an image to ACR, and seeds the database. The App Service runs fine, seems stable, and is super responsive.

Well, that was easy. All that remains is consolidating some of the code, adding some monitoring configuration, and extending the existing deployment pipeline to allow deployments to Azure environments. (Or so we think, as you will discover as you read on.)

It is with a single successful tenant deployment that our engineer takes paternity leave. It is also at this point that I pick up the project and the fun truly begins.

Extending the Existing Pipeline

Unfortunately, some organizational constraints make extending the existing pipeline a little tricky. You see, the customer requires that no data from any of the tenant databases leave Azure, and the deployment pipeline performs operations on the data inside the database. It is therefore insufficient to simply set up a site-to-site VPN to Azure. We need to run database operations from within Azure.

In AWS we ran these operations on the ECS cluster using the RunTask api action from Step Functions. In Azure we have to decide where to run these operations.

A Bad Idea

My plan is to run database operations inside Azure Container Instances. After all, you can set up container instances that don’t self heal. So the idea is: run a container instance let it perform the work, and then die.

As I begin working on executing this plan, I realize that it’s doomed to failure. You see, while ECS allows you to specify dynamic container overrides like docker command strings when you run a task, Azure Container Instances has no way of passing these container overrides. This means I need a separate container instance for every deployment pipeline execution.

Investigating further, I discover that Azure has a quota limiting the number of Container Instances I can deploy. If I have many deployment pipeline executions, I do not want to need to garbage collect every ephemeral container instance that is spawned.

Azure Container Instances was clearly not designed for this, and my current attempt was a poor abuse of the service.

After additional investigation, short of spawning an AKS cluster, I can find no way to perform an event driven ephemeral task with transient docker containers in Azure. In fact most event driven patterns in Azure run with Azure Functions.

A Viable Solution

There are several system dependencies, but the best option I can think of is creating an Azure Function capable of handling events from the deployment pipeline. I will, of course, write an Azure function interface for the command line we built.

Because these functions are long-lived and rely on Linux system dependencies, I deploy a docker-backed Azure Functions application on a more expensive App Service plan to ensure the runtime doesn’t die after ten minutes. I build an adapter that uses Azure Service Bus messages to pass requests to Azure Functions and the Step Functions API to send output and errors back to the deployment pipeline.
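A minimal sketch of what the reporting half of such an adapter could look like, assuming the state machine uses a Step Functions callback task and includes its task token in each request message. The message shape, the function name, and the run_command callable are assumptions of mine; sfn_client stands in for a boto3 Step Functions client.

```python
import json


def handle_pipeline_message(message_body, run_command, sfn_client):
    """Process one request message and report the result to Step Functions.

    Assumes the message body is JSON with a `taskToken` (from a Step
    Functions callback task) and a `command` for `run_command` to execute.
    """
    request = json.loads(message_body)
    token = request["taskToken"]
    try:
        result = run_command(request["command"])
    except Exception as exc:  # report any failure back to the pipeline
        sfn_client.send_task_failure(
            taskToken=token,
            error=type(exc).__name__,
            cause=str(exc),
        )
        return False
    # Step Functions expects the task output as a JSON string.
    sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

Passing the client in as an argument keeps the handler easy to exercise locally with a stub before wiring it to a Service Bus trigger.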

This was a little more work than I wanted, but apparently running ephemeral docker tasks is not as common a task as I expected.

I work out a few more kinks in the Step Functions workflow and am elated when the first tenant is deployed through the deployment pipeline. Everything just works. I am just about to email the customer saying that we are done, when I decide to deploy one more tenant just in case.

Tenant Networking

I watch the state machine’s execution graph light up green step after step, until the final service deployment step. And then it goes red. The deployment fails. Only one resource fails to create: the network connection between the new App Service and the existing subnet.

You see, as is abundantly clear from the Azure documentation:

The integration subnet can be used by only one App Service plan.

I am trying to use the same integration subnet with another App Service plan.

My instinct is to share an App Service plan between multiple apps. The documentation goes on to seemingly endorse this idea:

You can have only one regional virtual network integration per App Service plan. Multiple apps in the same App Service plan can use the same virtual network.

Should be a simple fix but I search the App Service plan documentation to understand the implications of running all my App Service tenants on the same plan.

It turns out that an App Service plan represents a set of physical infrastructure that runs the App Service. And when it scales, a warning is evident:

In this way, the App Service plan is the scale unit of the App Service apps. If the plan is configured to run five VM instances, then all apps in the plan run on all five instances. If the plan is configured for autoscaling, then all apps in the plan are scaled out together based on the autoscale settings.

If I’m planning on deploying hundreds of Azure tenants, I’m going to need some very expensive computers I’m not sure I or my customer can afford.

Changing the Network Model

There is only one option left to me: change the networking model and deploy a separate subnet for every App Service. I do a little math, and if I want to pack /28 CIDR blocks into a /16 block, I can fit up to 4096 in. After checking, I find that Azure has a hard limit of 3000 subnets per vnet. I’m very lucky that the number of tenants to be deployed to Azure is in the order of hundreds rather than thousands.
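The arithmetic is easy to check with Python’s ipaddress module. The 10.0.0.0/16 range below is a stand-in of mine, since any /16 gives the same count.

```python
from ipaddress import ip_network

# Hypothetical vnet address space; any /16 gives the same count.
vnet_range = ip_network("10.0.0.0/16")

# Each extra prefix bit doubles the number of blocks: 2 ** (28 - 16).
blocks_by_formula = 2 ** (28 - vnet_range.prefixlen)
blocks_by_enumeration = sum(1 for _ in vnet_range.subnets(new_prefix=28))

print(blocks_by_formula, blocks_by_enumeration)  # 4096 4096

# The subnet limit quoted above is the tighter constraint.
AZURE_SUBNET_LIMIT = 3000
usable_subnets = min(blocks_by_formula, AZURE_SUBNET_LIMIT)
print(usable_subnets)  # 3000
```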

Deployment pipeline users have no idea about networking, and manually keeping track of this network space is going to be painful.

As such, the deployment pipeline needs to take care of managing CIDR block allocation for the Azure subnets. I write an AWS Lambda function to achieve this. The logic is simple enough.

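import logging
from ipaddress import ip_network

# network_client is assumed to be an azure.mgmt.network.NetworkManagementClient
# that was created elsewhere with valid credentials.


def get_next_available_subnet_cidr(
    vnet_name="dev", cidr_prefix=28,
    resource_group_name="dev"
):
    """Return the first free block of size cidr_prefix in the vnet."""
    vnet = network_client.virtual_networks.get(
        resource_group_name=resource_group_name,
        virtual_network_name=vnet_name
    )

    # Every block of the requested size within the vnet's address space.
    vnet_ranges = map(ip_network, vnet.address_space.address_prefixes)
    candidate_subnets = (subnet
        for vnet_range in vnet_ranges
        for subnet in vnet_range.subnets(new_prefix=cidr_prefix)
    )
    # Materialize as a list: a bare map() iterator would be exhausted after
    # the first candidate, so later candidates would never be checked.
    cidrs_in_use = [ip_network(s.address_prefix) for s in vnet.subnets]
    for candidate_cidr in candidate_subnets:
        overlaps = False
        for in_use_cidr in cidrs_in_use:
            logging.debug(f"Candidate: {candidate_cidr}, In use: {in_use_cidr}, overlaps: {candidate_cidr.overlaps(in_use_cidr)}")
            if candidate_cidr.overlaps(in_use_cidr):
                overlaps = True
                break
        if overlaps:
            continue
        return str(candidate_cidr)
    raise ValueError("No available CIDR ranges in vnet")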

I run the lambda function locally and test it in several cases. It works better and faster than expected. I am able to find the next available subnet in several test cases.
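Local test cases of this kind can be sketched without touching Azure by feeding the same allocation logic plain lists of prefixes. The function below is a stand-alone restatement written for illustration, not the deployed code, and the address ranges are made up.

```python
from ipaddress import ip_network


def next_free_cidr(vnet_prefixes, used_prefixes, cidr_prefix=28):
    """Return the first block of size `cidr_prefix` inside the vnet ranges
    that overlaps none of the prefixes already in use."""
    in_use = [ip_network(p) for p in used_prefixes]
    for vnet_range in map(ip_network, vnet_prefixes):
        for candidate in vnet_range.subnets(new_prefix=cidr_prefix):
            if not any(candidate.overlaps(used) for used in in_use):
                return str(candidate)
    raise ValueError("No available CIDR ranges in vnet")


# Empty vnet: the very first block is free.
print(next_free_cidr(["10.0.0.0/16"], []))  # 10.0.0.0/28
# Two blocks taken: the third one is returned.
print(next_free_cidr(["10.0.0.0/16"], ["10.0.0.0/28", "10.0.0.16/28"]))  # 10.0.0.32/28
# A larger /24 in use hides sixteen /28 candidates at once.
print(next_free_cidr(["10.0.0.0/16"], ["10.0.0.0/24", "10.0.1.0/28"]))  # 10.0.1.16/28
```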

I make changes to the Step Functions State Machine DSL, and try to deploy the state machine along with the new lambda function.

The Final Obstacle

The Lambda function deployment process has been working smoothly for the last year, but it explodes now. The AWS API informs me that I may not have an unzipped deployment package of over 250MB.

I’ve only added one dependency though. The pip package azure-mgmt-network lets me list the subnets in my Azure tenant so that I can find the next available one. It turns out that this package, along with its dependencies, pushes me over the 250MB limit.

I look at my Pipfile and I suppose that there are a few dependencies that might be large, but I am surprised that the azure-mgmt-network package pushed me over the edge.

No problem, I’ve dealt with large dependencies before. I try to create a Lambda layer for this dependency, but this too fails. Upon investigation, I discover that the unzipped size of the azure-mgmt-network package and its dependencies is 265MB. This package alone is over the Lambda layer size limit.

This stuns me, and apparently I’m not alone. This GitHub issue, open for over 2 years now, has over 50 reactions. It turns out the reason is that the pip package includes all historical versions of the package and the package-info references functionality from some assortment of these.

I surrender in resignation, and knowing that I already have an Azure Functions application, I move the subnet CIDR determination to Azure Functions. I send a message to Azure Service Bus from the state machine, and we’re off to the races.

Our second, third, and fourth tenants create successfully. I am very glad to put this implementation behind us.

Reflections on my Azure experience

Looking back on this deployment experience, I found myself preferring the development experience on AWS.

For one, I do think that the breadth of features AWS offers makes it easy to find something that fits most use cases I encounter. More importantly, I rarely find myself faced with constraints that I don’t understand the rationale behind. Why should only one App Service plan be able to be associated with a Subnet?

None of these are egregious oversights to my mind. They are clearly documented, and had my team and I been more vigilant in our planning, we would have been able to plan around them.

There is, however, no excuse to my mind for a network management client package that adds 265MB to a Python deployment package. This oversight veers more into the negligence category. For this ticket to be open for over 2 years without any sign of imminent resolution is not something I understand.

There are some magical things about Azure development, I admit. No need to worry about availability zones and NAT Gateway routes. If the magic comes with some of these constraints, however, I will gladly remain an AWS muggle.

I will probably need to do more Azure work like this in the future, but when I do, we’ll be using AKS. For, with all the complexity Kubernetes brings, I rarely run into tooling or platform limitations that are overly difficult to design around.

https://yehudacohen.substack.com/p/how-azure-kicked-my-butt
A quick overview of AWS principals, identity-based policies, and resource-based policies
Permissions in multi-account AWS environments

One of the more frequent hurdles I watch my team run into when they first learn AWS is that AWS has two primary ways to assign the same privileges to some of its resources. In its documentation, AWS describes the difference between identity-based policies which affect IAM Principals, and resource-based policies that affect AWS resources.

The permissions model associated with identity-based policies is often referred to as RBAC (Role-Based Access Control).

Thanks for reading Rain Clouds! Subscribe for free to receive new posts and support my work.

Consider an S3 bucket as a quick example. I can either define an identity-based policy and attach it to an IAM principal, such as a role or user directly or via a group:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadExampleBucketAndObjects",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::example-bucket/*",
                "arn:aws:s3:::example-bucket"
            ]
        }
    ]
}

When you attach the above policy to a principal in your account, that principal is able to access objects in the S3 bucket named example-bucket, provided that bucket exists in your AWS account.

Note the above caveat: this policy grants access if the bucket exists in your AWS account. If the bucket is in another AWS account, this policy alone is not enough to grant access. I’ll get back to this later.

There is, however, another way to grant the same permissions. Let’s assume you have the following list of AWS principal arns, one a user, and one a role, that you wish to grant read access to the s3 bucket:

[
  "arn:aws:iam::123456789012:user/malfoy",
  "arn:aws:iam::123456789012:role/deatheater"
]

An alternative strategy to granting access to the bucket and objects in question is to create an s3 bucket policy, or a resource-based policy attached to the s3 bucket itself. This resource-based policy has an extra Principal key in each statement of its json document that distinguishes it from an identity-based policy.

A resource-based S3 bucket policy that provides equivalent permissions to the above identity-based policy might look like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MalfoyAndDeatheaterReadBucketAndObjects",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789012:user/malfoy",
                    "arn:aws:iam::123456789012:role/deatheater"
                ]
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::example-bucket/*",
                "arn:aws:s3:::example-bucket"
            ]
        }
    ]
}
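
If you manage more than a handful of buckets, generating this document is less error-prone than hand-editing the JSON. Here is a minimal sketch in Python. The helper name build_bucket_read_policy is my own invention for illustration (it is not part of any AWS SDK), and the ARNs are the example values from above.

```python
import json


def build_bucket_read_policy(bucket_name, principal_arns, sid="ReadBucketAndObjects"):
    """Return a resource-based (bucket) policy granting read access.

    Hypothetical helper for illustration; it only assembles the JSON document.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": sid,
                "Effect": "Allow",
                # The Principal key is what marks this as a resource-based policy.
                "Principal": {"AWS": list(principal_arns)},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",  # objects (s3:GetObject)
                    f"arn:aws:s3:::{bucket_name}",    # the bucket (s3:ListBucket)
                ],
            }
        ],
    }


policy = build_bucket_read_policy(
    "example-bucket",
    [
        "arn:aws:iam::123456789012:user/malfoy",
        "arn:aws:iam::123456789012:role/deatheater",
    ],
)
policy_json = json.dumps(policy, indent=4)
```

The resulting string is what you would hand to the S3 API, for example through boto3's put_bucket_policy(Bucket=..., Policy=policy_json).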

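You might notice that the principal ARNs both share the AWS account number 123456789012. Because the bucket policy names these principals explicitly, it applies to them whether or not example-bucket lives in the same AWS account as the principals: 123456789012. (If the bucket lives in a different account, each principal also needs an identity-based policy in its own account that allows the same actions.)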

This highlights a key difference between resource-based policies and identity-based policies.

Identity-based policies cannot expose your resources in your AWS account to principals that exist outside of your AWS account.

The contrapositive is true too:

If you want to grant a principal outside of your AWS account access to your AWS account, you must use a resource-based policy.

This is true even for service principals like lambda.amazonaws.com. You will notice that whenever you define a Lambda execution role, or an EC2 instance profile, or an ECS task role, or any other role that is assumed by an AWS service, the role is created in your AWS account. The “Assume Role Policy” is effectively a resource-based policy, that is attached to your role. This resource-based policy shares your role with third parties outside of your organization.
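
To make that concrete, here is roughly what the trust policy on a Lambda execution role looks like, assembled as a Python dict. The helper name build_trust_policy is hypothetical; the sts:AssumeRole action and the lambda.amazonaws.com service principal are the standard values.

```python
import json


def build_trust_policy(service_principal):
    """Assemble an assume-role (trust) policy for an AWS service principal.

    Hypothetical helper for illustration only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Like a bucket policy, a trust policy names a Principal,
                # which is why it behaves like a resource-based policy
                # attached to the role.
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy_json = json.dumps(build_trust_policy("lambda.amazonaws.com"), indent=4)
```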

But, I hear you ask, what happens if you want to give my AWS account’s IAM administrator the ability to determine who in my organization should have access to your s3 bucket objects? Do you have to keep on modifying the principals in your resource policy so that I can allow different users and roles in my aws account to access your s3 bucket? That sounds rather inconvenient.

That’s where the special IAM AWS account principal comes into play. You can delegate s3 object access to my AWS account by specifying an iam account arn in the Principal block of your resource-based policy as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ShareAccountReadBucketAndObjects",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789012:root"
                ]
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::example-bucket/*",
                "arn:aws:s3:::example-bucket"
            ]
        }
    ]
}

This policy does not let all Principals in my AWS account access the objects in your s3 bucket. Instead it delegates the capability to allocate permissions to your S3 bucket to my IAM administrator. After this resource-based policy is defined on your bucket, I would be able to create an identity-based policy identical to the one above:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadExampleBucketAndObjects",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::example-bucket/*",
                "arn:aws:s3:::example-bucket"
            ]
        }
    ]
}

Now, even though example-bucket lives in your AWS account, when my IAM administrator attaches this identity-based policy to user malfoy and role deatheater, those principals will be able to access objects inside your bucket.
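
The interplay between the two policy types can be summarized as a small decision function. This is a deliberately simplified model, not AWS's actual evaluation engine: it assumes there are no explicit Deny statements, service control policies, permission boundaries, or session policies in play, and the function and parameter names are my own.

```python
def is_access_allowed(principal_account, resource_account,
                      identity_policy_allows, resource_policy_allows):
    """Simplified model of IAM policy evaluation for a single request.

    Assumption: only Allow statements exist. Explicit denies, SCPs,
    permission boundaries, and session policies are ignored.
    """
    if principal_account == resource_account:
        # Same account: an Allow from either policy type is sufficient.
        return identity_policy_allows or resource_policy_allows
    # Cross-account: the resource owner must allow the request in a
    # resource-based policy AND the principal's own account must allow
    # it in an identity-based policy.
    return identity_policy_allows and resource_policy_allows


# malfoy (account 123456789012) reading a bucket owned by another account
# (999999999999 is a made-up account id) once both policies are in place.
shared = is_access_allowed("123456789012", "999999999999", True, True)
# The same request with only the identity-based policy attached.
identity_only = is_access_allowed("123456789012", "999999999999", True, False)
```

In this model, shared is True and identity_only is False, which is the caveat from earlier: an identity-based policy alone cannot reach into someone else's account.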

Note that no S3 bucket policy was necessary in the first example, where your own IAM administrator granted access to your own bucket via an identity-based policy. There is, however, an instance in which you must define a resource-based policy to allow your own IAM administrator to grant access to resources inside your AWS account.

You can only define KMS key permissions with identity-based policies if there is an explicit resource-based key policy that grants your AWS account the ability to delegate these permissions. By default, a policy resembling the below is added to all KMS keys created in the console to allow identity-based policies:

{
  "Sid": "Enable IAM policies",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::123456789012:root"
   },
  "Action": "kms:*",
  "Resource": "*"
}

What to use?

I will leave you with some prescriptive guidance that I generally use to recommend when to use resource-based policies vs identity-based policies to my team at Foresight.

Always use identity-based policies unless you need to grant permissions cross account or make a resource public. This makes it easier to understand what a particular principal can do without worrying that additional privileges have been granted to them by resource-policies inadvertently.

I might be persuaded to expand this blanket rule. Do you have a use-case or counter argument as to why resource-based policies might be better? I’d be interested in hearing your thoughts.


https://yehudacohen.substack.com/p/a-quick-overview-of-aws-principals
Extensions
Monitoring Shard Allocation In OpenSearch and ElasticSearch
Filling in a critical gap in the AWS OpenSearch managed Cloudwatch Metrics
Show full content

AWS’s managed OpenSearch Service (formerly ElasticSearch Service) is one of the AWS managed services which requires quite a lot of custom tuning to get to work well for you. It’s also a service where the cost of not knowing what you are doing is very expensive.

On a related note, knowing what you are doing while operating an OpenSearch domain is not always intuitive. This is because each OpenSearch domain’s health is very dependent on its data and access patterns. A cluster’s health cannot simply be determined by the health of the software and hardware that is running. The very same domain that might perform some tasks quickly and without any hiccups can fail to index a single new document.

I’ll give you an example.

You’re using OpenSearch Service as a log search engine and ingest your logs using a typical ELK or EKK solution. Every day, a new index is created in your domain (possibly a new index per environment, or per service, per environment depending on your implementation). As each index is created, by default 5 primary shards are allocated to that index (this number is configurable, but each index must be allocated at least 1 shard). You’ll also want to allocate a replica shard for each primary shard.

Assuming you have not tuned your index configuration at all, every day, dev, uat, and production environments each create a new index in your OpenSearch domain. Ten shards for each environment (5 primary and 5 replica) are allocated per day. If you have two data nodes which each allow the default 1000 shards per node, you can store 66 days’ worth of logs on your cluster: (2 nodes * 1000 shards/node) / (3 environments * 10 shards/day) = 66.7, rounded down.
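
If you want to plug in your own numbers, the arithmetic is easy to script. The function below is just a sketch of the calculation above; its name and parameters are mine.

```python
def days_until_shards_exhausted(data_nodes, max_shards_per_node,
                                indices_per_day, primaries_per_index,
                                replicas_per_primary):
    """Whole days of daily index creation that fit under the cluster shard limit."""
    cluster_limit = data_nodes * max_shards_per_node
    shards_per_index = primaries_per_index * (1 + replicas_per_primary)
    shards_per_day = indices_per_day * shards_per_index
    return cluster_limit // shards_per_day


days = days_until_shards_exhausted(
    data_nodes=2, max_shards_per_node=1000,
    indices_per_day=3, primaries_per_index=5, replicas_per_primary=1,
)
print(days)  # 66
```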

The result of such a configuration is that even if you only log 1 MB worth of data per day, writes to your cluster will fail on day 67 if you haven’t cleaned up your older logs. What about reads? Well you’ll probably get sub-100ms response times on those. What about your OpenSearch domain metrics? Well most everything will look healthy. Everything but your write request error response codes, that is.

You might be able to monitor error responses for your OpenSearch domain and be alerted that something is wrong after the fact, but by then it’s too late. All your available shards are already allocated and some writes have already failed. Failures like this will take the form of a 400 error with the following error message:

{
  "type": "illegal_argument_exception",
  "reason": "Validation Failed: 1: this action would add [10] total shards, but this cluster currently has [5992]/[6000] maximum shards open;"
}

Now this is not a blog post about sizing OpenSearch clusters. I’ll write that one another time (AWS actually has some reasonably thorough documentation on that). It’s about augmenting your OpenSearch domain monitoring suite to capture shard allocation. That way, when your shards are 90% used, you can scale out (if you have lots of $$$ to throw at the problem), or do some tuning to ensure your cluster is more correctly sized before things go wrong.

If none of the built-in domain monitors help with this, let’s build our own.

We can grab the number of shards currently allocated in the cluster by making a request to:

GET _cluster/stats?filter_path=indices.shards.total

We can grab the shard limit of our domain by making a request to:

GET _cluster/settings?include_defaults=true&flat_settings=true&pretty

From the response we can grab the `cluster.max_shards_per_node` property. It appears under `persistent` or `transient` if you have overridden it, and under `defaults` otherwise. Multiplying it by the number of data nodes (reported as `number_of_data_nodes` by GET _cluster/health) gives the maximum number of shards the cluster will accept.

Dividing the used shards by the maximum shards gives us a shard allocation value between zero and one.

Performing this calculation every hour and publishing the result as a CloudWatch metric allows us to proactively warn administrators before OpenSearch shards become fully allocated.

Below is a small Chalice project to publish a Serverless scheduled task to help proactively identify over-allocation of shards in an OpenSearch domain. I’ve deployed this on several occasions for our customers at Foresight Technologies and it has saved me a couple of times.

from os import environ

import boto3
import requests
from chalice import Chalice
from requests_aws4auth import AWS4Auth

app = Chalice(app_name='es-monitor')

# Cluster setting that caps the number of open shards per data node
SHARD_LIMIT_SETTING = 'cluster.max_shards_per_node'


def es_get(path, endpoint=None):
    # ES_ENDPOINT must hold the full domain URL, including https://
    endpoint = endpoint or environ['ES_ENDPOINT']

    # Sign the request with SigV4 using the function's role credentials
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(
        credentials.access_key,
        credentials.secret_key,
        environ['AWS_DEFAULT_REGION'],
        'es',
        session_token=credentials.token
    )

    response = requests.get(
        url=f'{endpoint}/{path}',
        headers={'Content-Type': 'application/json'},
        auth=awsauth
    )
    response.raise_for_status()
    return response.json()


def get_used_es_shards():
    used_shards = es_get('_cluster/stats?filter_path=indices.shards.total')['indices']['shards']['total']

    # The limit is only listed under transient/persistent when overridden;
    # otherwise it is reported under defaults.
    settings = es_get('_cluster/settings?include_defaults=true&flat_settings=true')
    shards_per_node = next(
        int(settings[scope][SHARD_LIMIT_SETTING])
        for scope in ('transient', 'persistent', 'defaults')
        if SHARD_LIMIT_SETTING in settings.get(scope, {})
    )

    # The cluster-wide limit is the per-node limit times the data node count
    data_nodes = es_get('_cluster/health')['number_of_data_nodes']
    return used_shards / (shards_per_node * data_nodes)


def emit_metric(metric_namespace, metric_name, metric_value):
    cloudwatch = boto3.client('cloudwatch')
    # MetricData is a list of datums; 'None' is the unit for unitless values
    cloudwatch.put_metric_data(
        Namespace=metric_namespace,
        MetricData=[{
            'MetricName': metric_name,
            'Unit': 'None',
            'Value': metric_value
        }]
    )


@app.schedule('rate(1 hour)')
def track_elasticsearch_shard_allocation(event):
    emit_metric(
        "Elasticsearch/Reliability",
        'ElasticsearchShardAllocation',
        get_used_es_shards()
    )

This is a very cheap solution to run. It’s essentially free, when you compare it against the AWS bill that comes alongside running a production ready OpenSearch domain.
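
The metric only earns its keep if something alarms on it. Here is a sketch of the arguments for a CloudWatch alarm that fires at the 90% mark mentioned earlier. The alarm name, the helper name, and the SNS topic ARN are placeholders of mine; the namespace and metric name match the ones emitted above.

```python
def shard_alarm_kwargs(sns_topic_arn, threshold=0.9):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": "opensearch-shard-allocation-high",  # hypothetical name
        "Namespace": "Elasticsearch/Reliability",
        "MetricName": "ElasticsearchShardAllocation",
        "Statistic": "Maximum",
        "Period": 3600,  # seconds; matches the hourly schedule
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Treat a missing datapoint as a breach: a silent monitor is
        # also something an administrator should hear about.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder topic ARN for illustration
alarm_kwargs = shard_alarm_kwargs("arn:aws:sns:us-east-1:123456789012:ops-alerts")
```

Creating the alarm is then a single call: boto3.client('cloudwatch').put_metric_alarm(**alarm_kwargs).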

I don’t have anything to add here, other than to stress that I’ve seen this type of failure on multiple occasions. Restoring logs from secondary data sinks is far more painful than preventing the problem in the first place.

Thank you for making it this far, Reader. I do hope that you learned something along the way.

https://yehudacohen.substack.com/p/monitoring-shard-allocation-in-opensearch
Extensions
An OpenSearch Service Ghost Story
When Managed Services Fail Silently...
Show full content

ElasticSearch has long been one of my favorite products, but I wear no rose colored glasses. Operating an ElasticSearch cluster is expensive and requires work, and even operating a cluster with AWS’s managed OpenSearch Service is no picnic. Especially if you don’t want to break the bank.

I usually run production clusters with 3 dedicated master nodes and an even number of data nodes scheduled across two availability zones. The cluster in question is no different. What was different was the nature of this failure. Some application nodes were experiencing connection timeouts while trying to communicate with the cluster endpoint. Others were working just fine.

This after just yesterday all processes were healthy and no configuration changes. This is confirmed by CloudTrail. A quick look, and CloudWatch metrics report healthy. The nature of the requests is identical between nodes.

Maybe a networking issue that has always existed and lay dormant. OpenSearch nodes and application nodes are deployed to the same subnets. Routes exist. OpenSearch nodes are in one security group, application nodes in a second. Security group rules look good. No relevant NACLs exist.

Network interfaces for all nodes, healthy and unhealthy, have the expected security groups attached.

I connect to the client VPN endpoint and try to telnet to the OpenSearch domain. Nothing. Okay, so this is on the OpenSearch side not the application side.

I nslookup the endpoint and all four IP addresses yielded are in my private subnets. I try to telnet to each of them. Two connect, two fail. One in each subnet connects. One in each subnet fails. I validate that failing application nodes are trying to connect to the same two network interfaces which are timing out for me.
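
For what it’s worth, the nslookup-then-telnet dance is easy to script so it can be rerun quickly. This is a sketch using only the Python standard library; the function names are mine, and it checks nothing beyond whether a TCP connection opens.

```python
import socket


def resolve_ipv4(hostname, port):
    """Return the distinct IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, port,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def can_connect(ip, port, timeout=3.0):
    """True if a TCP connection to ip:port opens within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts and refused connections
        return False


def probe_endpoint(hostname, port=443, timeout=3.0):
    """Map each resolved address to whether it accepts TCP connections."""
    return {ip: can_connect(ip, port, timeout)
            for ip in resolve_ipv4(hostname, port)}
```

Calling probe_endpoint with the domain’s VPC endpoint hostname from inside the VPN would have shown the same two-up, two-down split in one step.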

Production customers are getting grumpy, as is our client.

With no other resource to debug at hand, I try to apply a software update to kick off a blue-green deployment and replace all nodes. In parallel, I reach out to AWS support to radio silence. This update sits in the pending state for a few hours. I replace data instances with a different machine type to force a blue-green deployment.

This second blue-green deployment immediately begins to apply. When the dust settles, all new nodes work just fine. I can connect to all of them, as can application nodes. It is as though the problem never existed.

What happened? With no access to the underlying infrastructure, I probably will never know. Do I need to worry about something like this in the future? How would I automate a response? No idea.

If this was your OpenSearch cluster, Reader, how would you respond?

https://yehudacohen.substack.com/p/an-opensearch-service-ghost-story
Extensions
Exposing dockerfile content to an EFS volume in Fargate 1.4.0
Semantic docker volume and Fargate EFS mount differences...
Show full content

There is a page of Amazon ECS documentation that I tried to consume without success a couple of weeks ago. Possibly because of an AWS introduced regression in the Fargate platform. Possibly because my use-case is subtly different from the use case AWS tested. I reread the documentation several times, but ultimately failed to find a flaw in my configuration that explained why I was unable to elicit the desired behavior from Amazon ECS.

The use case in question was a deployment of OpenEMR to ECS with a stateful EFS volume mounted to the /var/www/localhost/htdocs/openemr/sites directory within the deployed container. Without this configuration, the state of my OpenEMR cluster would not persist beyond the lifecycle of each executing task, rendering my OpenEMR storage volatile and highly unsuitable for a production workload.

Running Fargate tasks is pretty routine at this point, so I was somewhat surprised[1] to find a fatal error in the task logs that prevented the standard OpenEMR container from starting:

2021-06-01T22:34:24.675-04:00	PHP Warning: require_once(/var/www/localhost/htdocs/openemr/sites/default/sqlconf.php): failed to open stream: No such file or directory in Command line code on line 1

2021-06-01T22:34:24.675-04:00	PHP Fatal error: require_once(): Failed opening required '/var/www/localhost/htdocs/openemr/sites/default/sqlconf.php' (include_path='.:/usr/share/php7') in Command line code on line 1

That’s not how this docker thing is supposed to work! If someone builds an image it’s meant to include all the dependencies necessary to get it to run. Why then is the file /var/www/localhost/htdocs/openemr/sites/default/sqlconf.php not included?

There’s no way a widely distributed docker image is this broken though. Especially after a quick google search yields nobody with the same problem as me. For kicks, though, I roll back to an older version of the docker image. And, still no discernible change.

You might have noticed the reason that the file was not available to read.[2] I notice that the file in question lives under the volume’s mount point and has been masked by the (empty) EFS volume mounted over it.

Sure enough, upon removing the EFS mount point in the task definition the container starts as expected. Now, I have a working OpenEMR installation with no persistent volume attached. Not very useful, in and of itself, but at least I can be sure that my diagnosis of the problem is correct.

I am initially taken aback by this file masking behavior. In OpenEMR’s example docker-compose.yml configuration file, there is a volumes configuration specifying the same mount point as my EFS mount point. Why does EFS not work if the docker-compose.yml configuration does?

The first unlikely possibility that comes to me is that the docker-compose.yml is new and might just be broken. I quickly spin it up locally and rule that out as a possibility. Everything works as expected.

Reluctantly, because I can already sense the rabbit hole here, I spawn several google searches and start reading through documentation. If this blog were a movie, a montage would probably start playing now showing me frantically reading AWS docs at like 50x speed[3]. It wouldn’t be a very interesting montage[4].

Here is some of what I uncovered down the rabbit hole:

  1. If you start a container with docker and that docker container references a volume that does not yet exist, docker will create this volume for you.[5]

  2. Prior to mounting this new volume to the mount point, docker will first populate the new volume with the content found within the mounted directory.[6]

  3. By default when you specify a docker volume in a docker-compose file, if the volume does not yet exist, docker-compose honors the docker specification and will pre-populate the volume with the content of the mounted directory prior to mounting the volume.[7]

In the case of OpenEMR’s docker-compose file, when docker-compose up is invoked, a volume is created, and the /var/www/localhost/htdocs/openemr/sites/default/sqlconf.php file is copied into the created volume as per the above specifications.

Apparently, when I specify a task definition in ECS and specify a volume and mount points, upon the initial task execution, Fargate does not copy the container content to the volume.

I speculate that this behavior is because Fargate creates the volume prior to launching the task definition. Pre-creating this volume prior to launching the task means that no data will be copied into the volume before it is mounted to the Fargate task. Because OpenEMR relies on content within a mounted directory to launch, and because Fargate does not ever copy this content to a mounted EFS directory, using the standard OpenEMR docker image with a volume at /var/www/localhost/htdocs/openemr/sites fails.

I do not know whether my hypothesis is correct, but this github issue seems to mirror my experience at least. I am not the only person confused as to how to get stuff to run on Fargate 1.4.0.

Note: After writing and publishing this, u/brunokktro points out that this behavior is most likely due to the docker engine runtime of Fargate being swapped out with containerd for Fargate 1.4.0. Thanks for the insight u/brunokktro.

Good news is AWS has seemingly issued a fix. So I follow the instructions in the linked documentation:

  1. Create a Dockerfile… The VOLUME directive should specify an absolute path.

FROM openemr/openemr:6.0.0
VOLUME "/var/www/localhost/htdocs/openemr/sites"

  2. In the task definition volumes section, define a volume.

volumes: [
        {
          name: 'site',
          efsVolumeConfiguration: [{
            fileSystemId: siteVolume.volume.id,
            rootDirectory: '/'
          }]
        }
      ]

  3. In the containerDefinitions section, create the application container definitions so they mount the storage. The containerPath value must match the absolute path specified in the VOLUME directive from the Dockerfile.

        mountPoints: [
          {
            containerPath: '/var/www/localhost/htdocs/openemr/sites',
            sourceVolume: 'site'
          }
        ]

I deploy the changes, but no dice. It still seems that the container files aren’t being copied to the EFS directory. I’m not sure if this is AWS prematurely closing the above ticket or if my implementation is flawed, but it seems like I’m not the only one experiencing this.

Either way, I need this urgently and don’t have time to go through the AWS support turn-around time. So I add the following lines to my Dockerfile:

RUN mv /var/www/localhost/htdocs/openemr/sites /sites
CMD [ "./entrypoint.sh" ]

This moves all the assets from the mounted directory to a separate directory. Then I write the following ./entrypoint.sh:

#!/bin/sh
DIR="/var/www/localhost/htdocs/openemr/sites"
if [ "$(ls -A "$DIR")" ]; # Non-empty volume: a previous task already hydrated it
then
    echo "Open EMR sites already initialized"
else
    mv /sites/* "$DIR"/ # Move the staged files onto the EFS volume
    chown -R apache "$DIR"
    chmod -R 755 "$DIR"
fi

./run_openemr.sh # OpenEMR original entrypoint script

Similar steps can be used whenever you need to hydrate an EFS volume once when you start a container for the first time.

First, you move the files you want to hydrate out of the mounted directory into a temporary directory. Then you create a new entrypoint to do the following steps:

  1. If the EFS volume is not already hydrated, move the hydrated files from the temporary directory to the original directory that is now backed by EFS.

  2. Ensure that ownership and permissions are adequately set for all files that you copied into your EFS volume

  3. Run the original entrypoint.
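
If your image ships Python rather than a shell you trust, the first step translates directly. This is a sketch, not what I deployed: the function name is mine, it skips the ownership and permissions step, and it assumes the mount directory already exists.

```python
import shutil
from pathlib import Path


def hydrate_once(staging_dir, mount_dir):
    """Move staged files into an empty mount; do nothing if it has content.

    Returns True if the mount was hydrated on this call.
    """
    staging, mount = Path(staging_dir), Path(mount_dir)
    if any(mount.iterdir()):
        return False  # already initialized by an earlier task
    for entry in staging.iterdir():
        # shutil.move copies across filesystems, which matters for EFS
        shutil.move(str(entry), str(mount / entry.name))
    return True
```

For the OpenEMR case this would be called as hydrate_once("/sites", "/var/www/localhost/htdocs/openemr/sites") before handing off to the original entrypoint.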

I still don’t know whether Fargate was broken or my configuration was broken. Whatever the case, if you’re struggling to get Fargate to hydrate your EFS volume from a docker container, I’ve demonstrated a technique, somewhat inelegant though it may be, that I hope helps you achieve your goal.

1

This is a lie. I’m never actually surprised when things go wrong.

2

The title of the blog and aforementioned details give it away.

3

The video would play at 50x speed. I was reading pretty slowly.

4

Or movie for that matter.

5

https://docs.docker.com/storage/volumes/#start-a-container-with-a-volume

6

https://docs.docker.com/storage/volumes/#populate-a-volume-using-a-container

7

The docker-compose reference allows you to specify nocopy: true to circumvent this default behavior.

https://yehudacohen.substack.com/p/exposing-dockerfile-contents-to-an
Extensions
Running ECS on CIS Hardened Amazon Linux
A journey deep into awsvpc container networking...
Show full content

I really like Amazon ECS and we have probably deployed it for at least 20 customers by now at Foresight Technologies. Both with Fargate and EC2 flavors depending on the use-case in question. A fully managed control plane and deep integration into various AWS services like Load Balancing, IAM, CloudMap, CloudWatch, and EventBridge make it incredibly appealing as an orchestration engine for containers.

When using ECS on EC2, my team and I mostly use the same Autoscaling ECS Cluster Terraform module that I built a couple of years ago[1] on top of Amazon Linux ECS Optimized AMIs.

This one client, however, asked for CIS Hardened EC2 images required to meet contractual obligations. Ok, so swap over the base AMIs for CIS hardened AMIs from the AWS Marketplace, install the ECS agent and easy peasy. And sure enough, the ECS agent started scheduling tasks configured with the recommended[2] awsvpc networking mode on the EC2 instances and everything worked correctly.

At least I thought it did at the time. The services started without a problem, and everything appeared to be working as expected. It was only later that I came to discover that, despite apparently seamless operation, some very important ECS features were broken.

For one, every time a call to AWS was made with a privilege granted in the Task Role, the call would time out while the ECS task tried to retrieve its temporary credentials. For another, if I killed a task forcibly from ECS, the entire docker agent froze, and I had to restart the machine before the ECS agent could launch any additional tasks on the instance.

The natural first port of call for debugging was the ECS agent logs. No dice. Everything looked normal, and no errors were being reported. Strange.

I connected to my EC2 instances with AWS Session Manager and started exploring. I take the following steps:

  1. Start a task on the ECS instance in awsvpc mode and wait for it to start.

  2. Execute docker ps to determine the id of the container running the scheduled task.

  3. Execute docker exec -it containerid /bin/bash to enter the container and explore.

  4. Execute env to print the container’s environment variables. I note the ECS_CONTAINER_METADATA_URI and the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI values.

  5. I execute wget $ECS_CONTAINER_METADATA_URI and experience a timeout. The container is not able to access its metadata endpoint.

    At this point, it is pertinent to note that the $ECS_CONTAINER_METADATA_URI points to http://169.254.170.2/v3/some-long-guid-possibly-a-task-id, and when configuring my ECS host’s iptables, I have configured the following rules per AWS’s documentation:

    sudo iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
    
    sudo iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679

    Unremarkably, 127.0.0.1:51679 points to my local ECS agent. My timeout, therefore, can be narrowed down to reaching the ECS agent from within the ECS task container.

  6. I need to determine whether the ECS agent is accessible from outside the container and operating correctly. So I exit the container’s shell and execute wget http://169.254.170.2/v3/some-long-guid-possibly-a-task-id and am immediately able to retrieve metadata. I can therefore conclude that the ECS agent seems to be reachable at least from the ECS instance if not from within the task itself.
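
The same check can be scripted and run from both inside the container and on the host, which makes the comparison less fiddly than retyping wget commands. This is a standard-library sketch with function names of my own; the only AWS-specific facts it relies on are the two environment variables noted above and that task role credentials are served from 169.254.170.2 at the relative URI.

```python
import urllib.error
import urllib.request

CREDENTIALS_HOST = "http://169.254.170.2"


def task_endpoints(env):
    """Collect the agent-served URLs advertised in a task's environment."""
    endpoints = {}
    if "ECS_CONTAINER_METADATA_URI" in env:
        endpoints["metadata"] = env["ECS_CONTAINER_METADATA_URI"]
    if "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI" in env:
        endpoints["credentials"] = (
            CREDENTIALS_HOST + env["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]
        )
    return endpoints


def is_reachable(url, timeout=2.0):
    """True if the URL returns any HTTP response before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except OSError:
        return False  # timeout, refused connection, no route, ...


def check_task_endpoints(env, timeout=2.0):
    """Map each advertised endpoint name to whether it responded."""
    return {name: is_reachable(url, timeout)
            for name, url in task_endpoints(env).items()}
```

Inside the container you would call check_task_endpoints(os.environ); on the host, pass a dict holding the values noted in step 4.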

At this point I’m stymied and need to learn more. How does the ECS agent usually expose its metadata endpoint to the docker containers if they’re deployed with different network interfaces? I discover this AWS blog entry that touches on the subject.

Take a while reading the entire article if you’re interested, as I think the background is necessary to understand the issue in full. If you are already familiar with how awsvpc container networking works, I have extracted the highly relevant portion and italicized the critical line.

The ECS agent invokes a chain of CNI plugins to ensure that the elastic network interface is configured appropriately in the container’s network namespace. You can review these plugins in the amazon-ecs-cni-plugins GitHub repo.

The first plugin invoked in this chain is the ecs-eni plugin, which ensures that the elastic network interface is attached to container’s network namespace and configured with the VPC-allocated IP addresses and the default route to use the subnet gateway. The container also needs to make HTTP requests to the credentials endpoint (hosted by the ECS agent) for getting IAM role credentials. This is handled by the ecs-bridge and ecs-ipam plugins, which are invoked next.

So maybe the ecs-bridge and ecs-ipam plugins aren’t being invoked because the CNI plugins aren’t being invoked by the ECS agent for some reason. I run the validations described at the end of the blog post and verify that the first plugin in the chain (the ecs-eni plugin) is in fact operating perfectly well, and each task has its own network interface as expected.

The ecs-bridge and ecs-ipam plugins, however, are not having their intended effect and I am still not able to reach the bridge from the container.

I conclude that one of two things is true:

  1. The ECS agent requires additional undocumented configuration to run tasks in awsvpc configuration mode on a CIS hardened Amazon Linux 2 machine.

  2. The ECS agent is not currently able to run tasks in awsvpc configuration mode on a CIS hardened Amazon Linux 2 machine.

It is at this point that my terrible understanding of networking[3] prevents me from making further progress toward resolving this issue.

I open a support ticket with AWS, and send it along to our AWS partner representative who puts us in touch with a partner Solutions Architect who initially misunderstands my problem. He tells me he is able to run tasks on the CIS hardened Amazon Linux AMI. I jump on a call to explain to him that it’s not running tasks that I’m having issues with. Talking over the phone, I’m able to better illustrate the issues I am facing, and he tells me he will reach out internally and try to find a resolution for me.

In a rather painless experience with the AWS team, they are able to very quickly help pinpoint the root cause and publish some solutions. I’m not sure whether I’m meant to be identifying the solutions engineer in question, but I am incredibly grateful to this Solutions Architect and the ECS agent team for the fantastic support on this issue.

The tldr; is that the CIS hardened image includes the following iptables INPUT chain rule:

0  0 DROP    all  --  *      *       127.0.0.0/8      0.0.0.0/0           

The ECS agent relies on a standard ACCEPT rule configured by default on regular un-hardened Amazon Linux. With the hardened image dropping these packets by default, two new (undocumented at the time) iptables rules are required to configure the ECS agent:

iptables -A INPUT -i ecs-bridge -d 127.0.0.0/8 -p tcp -m tcp --dport 51679 -j ACCEPT
iptables -A INPUT -i docker0 -d 127.0.0.0/8 -p tcp -m tcp --dport 51679 -j ACCEPT

This additional configuration made everything work immediately. See this github issue for more info.

I don’t like to be beaten, but this was a fun rabbit hole and it taught me a ton about the way ECS is designed in the process. If you’re still with me at the end of all of this, I hope you learned a bit too.

1

Albeit upgraded to Terraform 0.14 compatible syntax with modified userdata to enable operation on Amazon Linux 2 ECS Optimized AMIs.

2

Aside from being recommended by AWS, awsvpc networking mode attaches an ENI to each ECS task and allows the use of native AWS security groups to restrict ingress and egress traffic from each task at the network level, making this important for our particular use-case.

3

of the computer variety although the same description could apply to my business networking skillset.

https://yehudacohen.substack.com/p/running-ecs-on-cis-hardened-amazon
Extensions