x86.lol — GeistHaus

Polyglot NixOS: The Same Disk Image for All Architectures

Dec 19, 2025

Recently a colleague mentioned building NixOS images that run unchanged on multiple architectures. Given the past adventures on this blog with systemd-repart and cross-compiling NixOS, I decide to give this a go.

tl;dr You can find a quick’n’dirty implementation here. Check the repo for details on how to build and run it.

So do we want to do: We want to build one disk image that boots on x86_64, ARM AArch64, and RISC-V 64-bit. We limit ourselves here to UEFI platforms, which makes this pretty straight forward.

From a high-level we need to:

Have a NixOS configuration.
Build the system closure for each target.
Throw everything into one /nix/store partition.
Populate the ESP to boot the right closure depending on the architecture.

All of this is surprisingly straight-forward. The ESP has architecture-dependent default filenames for what the firmware should boot, given no other configuration. This means we can build an UKI per architecture and drop it at the right place in the ESP (/EFI/BOOT/BOOTX64.EFI for 64-bit x86) and we are done!

By linking the system’s UKI in these locations on the ESP, we skip over having an actual bootloader and thus can’t have multiple generations, but it makes for a much leaner example!

The example repo puts the closure for each architecture in a single Nix store partition. I thought this would bring some space savings, because files that are not binary code should be largely the same. This doesn’t really pan out in this small example and we only save a couple of percent. Maybe it makes a bigger difference for larger closures.

If you want to dig into the details, the example repo has the instructions how to build and boot the image. I’m also eager to see someone building a more comprehensive version of this that includes a fully functioning bootloader and multiple generations!

https://x86.lol/generic/2025/12/19/polyglot.html

Quick and Dirty Website Change Monitoring

Aug 10, 2025

Let’s say, you need to monitor a website for changes and you really don’t have a lot of time to set things up. Also solving the problem with money using services, such as changedetection.io or visualping.io, have failed you, because their accesses are probably filtered out.

I’ve come up with the following scrappy solution. First, I want to get push notifications to my phone. So I installed simplepush on my phone. There are a couple of these services, this was just the first I found and it works well.

I have a couple of Linux servers. So I just logged in to one, installed ntfy and the Links text-based web browser (probably links2 in your package manager).

Configure ntfy with your simplepush key:

# ~/.config/ntfy/ntfy.yml 
backends:
  - simplepush
simplepush:
  key: 12345

Afterwards, you can just dump the website to a text file with Links and send a push notification to your phone when something changes:

#!/usr/bin/env bash

# By starting without old.txt, we get a notification when we start the script
rm -f old.txt

# Let's be polite here and not hammer the site.
POLL_FREQ_MIN=15

URL="https://example.com/"

while true; do
    touch old.txt
    links -dump "$URL" > new.txt

    if ! diff -u old.txt new.txt > diff.txt; then
		# It's hard to condense the changes (diff.txt) into something readable,
		# so we just send the URL to easily click on on the phone.
        ntfy send "Check $URL"
    fi

    mv new.txt old.txt

    sleep $(($POLL_FREQ_MIN * 60))
done

This only works for simple websites and there is a lot left to be desired. But it is doable in the 10 minutes of productivity a newborn baby gives you and it works to get appointments at government offices in Spain. 😉

PS. This blog post was written in another 10-minute productivity window.

https://x86.lol/generic/2025/08/10/change-monitoring.html

FOSDEM Edition: Thoughts on the Microkernels

Jan 30, 2025

It’s FOSDEM time! I have fond memories of the Microkernel and Component-based OS devroom in particular. It’s a fun meetup of extremely skilled low-level software engineers. This year I cannot attend, so it’s a good time to ramble reflect on it.

Some Background

The community around this devroom has one epicenter in Dresden, where many of us met at the Operating Systems Group at the university. Dresden has a lively systems community, and microkernel enthusiasts are a big part of it. For a large part of my professional life, I was working on microkernel-based systems, too.

Microkernel-based systems are appealing. They result naturally, if you take the idea of Least Privilege to its logical end. They also arise naturally, if you architect an operating system in a rigorously modular fashion.

On paper systems based on microkernels promise clean architecture and extremely secure systems due to a tiny Trusted Computing Base. In reality, despite the conceptual advantages, the microkernel-based systems struggle to achieve traction outside of niches.

The Problem

Some years ago, I sat in a bar with a friend from the microkernel community. We talked a long time about the issues in the community. Despite the common goal of a component-based and secure systems in the community, the point he made was that the inability to work together is self-limiting for each project. Instead of everyone collaborating towards the common goal, each company is reinventing the wheel. Developers are excited to implement a new IPC path that shaves five cycles off compared to the other microkernel’s IPC path or boast about their small TCB, even though this rarely works towards what users would actually need.

This conversation stuck in my head, and I had some years to reflect on it. There are a couple of causes at the core of this problem. The main cause, in my opinion, is that the personality that is required to bootstrap a project is not the best personality to make that project grow.

Starting vs. Growing a Project

Starting a new operating system project requires strong opinions. You want your system to have certain properties. You are not willing to compromise because if you are in the compromising mindset, you could have just used Linux. Linux is everything to everyone, so you could have squinted your eyes to build your ideas on top of it. Instead, you chose to start over because your ideas were so important to you that you were not willing to compromise.

In this bootstrapping phase, you are typically alone or work with few disciples that intimately share your vision. You freely implement things your way. If you need a custom build system to fine-tune your build flow, why not write one as well!

At some point, you reach a state where it’s challenging to make meaningful progress without external contributions to work on larger use cases and iron out the kinks of a still-niche project. You need to grow a community.

To grow a community, you need an entirely different skill set. The lone hacker with a clear vision in their mind is not equipped to do this. Instead of writing beautifully crafted code yourself, you need to be the inspiring leader that rallies people to your cause and establishes structures that will outlast yourself. New developers will come with new ideas and different ways of working. Some of your initial idiosyncratic choices have to give way, while the overall vision remains and evolves.

The person who spent hours implementing their own build system now has to contend with people who say there is a better tool to do the job. And better here usually means much better for them, but worse for you because the old system was carefully polished for your own use case.

So now it’s time to compromise. Do you insist on having this special build system, or do you switch to something other people are familiar with? If you keep insisting on your idiosyncrasies that are not core to the mission of the project, you risk alienating contributors.

To summarize: The skills that you need to start a radical new project are the skills that will not help you in growing a community.

My Wishes

My wishes for the community are to find a way to collaborate instead of starting all over again and again. We need to attrach users. We need to find the “killer app”, where these kinds of systems are obviously better than Linux. And this niche cannot be just pleasing the BSI in certifications.

Systems must be trivial to use and contribute to. We need to embrace open source and open decision processes. No CLAs! Make it trivial to use and to contribute.

We must work together!

https://x86.lol/generic/2025/01/30/microkernels.html

Hardening C Against ROP: Getting CET Shadow Stacks Working

Sep 23, 2024

This post shows you how to use CET user shadow stacks on Linux. CET is a hardening technology that mitigates typical memory unsafety issues on x86. This post will not explain this security feature. If you don’t know what CET is, this post is probably not for you. For general advice on hardening C/C++, check out these guidelines.

Back to CET shadow stacks. Recent distros, such as NixOS 24.05 and Fedora 40, satisfy all the software requirements. If you’re not on one of these distros, you need to check whether you have the following prerequisites:

Linux 6.6 or later with CONFIG_X86_USER_SHADOW_STACK=y
glibc 2.39 or later
A CPU supporting CET shadow stacks:
- Intel Tiger Lake or later (?)
- AMD Zen 3 or later
GCC 8 or clang 7 or later

With this out of the way, let’s get it working. We use a tiny C program test.c that simulates ROP:

#include <stdio.h>

int hello()
{
  printf("Return address corruption worked!\n");
  return 0;
}

// "Smash" the stack to execute hello instead of returning directly. This
// should not work with shadow stacks.
int foo();
asm ("foo: mov $hello, %rax; push %rax; ret");

int main()
{
  foo();
  return 0;
}

Compile this program with -cf-protection=return (or full) to enable shadow stack support:

$ gcc -fcf-protection=return -o test test.c

If your toolchain is recent enough, you see that the binary is marked as supporting shadow stacks:

$ readelf -n test | grep SHSTK
	  Properties: x86 feature: SHSTK

Shadow stacks are not enabled by default as of glibc 2.39. So without opting in, the test program will not use shadow stacks:

$ ./test
Return address corruption worked!

You opt in to shadow stacks using a glibc tunable. When everything works, you’ll see that the stack smashing is prevented:

$ GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK ./test
[1]    14520 segmentation fault (core dumped)  GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK ./test

Now you can go out and try it out on more interesting software!

https://x86.lol/generic/2024/09/23/user-shadow-stacks.html

Immutable Systems: Cross-Compiling for RISC-V using Nix Flakes

Sep 21, 2024

In my last post, we built whole disk images for embedded systems using Nix. This approach is well suited for RISC-V or ARM systems, but you probably don’t have a powerful build box for this architecture. You wouldn’t want to build a Linux kernel for hours on a RISC-V single-board computer praying that you don’t run out of RAM…

In this blog post, we will use the same NixOS configuration to cross-compile system images for x86, RISC-V and ARM from our powerful x86 build server.

Let’s go over some theory first and then look at how this applies to our flake from the previous post. A complete example lives in here. For the version that was current when this blog post was written, check out the blog-post-2 tag.

Cross-Compiling NixOS

nixpkgs has excellent cross-compilation support. There are also excellent resources for cross-compiling individual packages. Cross-compiling whole systems is even easier, but not as well documented. There are two main ways to configure it. For a deeper discussion, check out this post.

Approach 1: nixpkgs.buildPlatform/hostPlatform

The first approach is to configure the build and host system in the NixOS configuration. The terminology that NixOS uses is:

buildPlatform for configuring what kind of system does the actual build,
hostPlatform for configuring what kind of system the resulting binaries should run on.

For me, the name hostPlatform is somewhat ambiguous, but these are the names we are stuck with.

To configure a NixOS configuration for cross-compiling, you can use a module like this:

{ ... }: {
  nixpkgs.buildPlatform = "x86_64-linux";
  nixpkgs.hostPlatform = "riscv64-linux";
}

Approach 2: Build pkgs Yourself

The second approach is to build a cross-compiling pkgs set yourself and then just use this for your NixOS configuration. Assuming nixpkgs is the nixpkgs flake input, you can create it like this:

let
  # Let's stick to the terminology from earlier.
  buildPlatform = "x86_64-linux";
  hostPlatform = "riscv64-linux";

  crossPkgs = import nixpkgs { localSystem = buildPlatform; crossSystem = hostPlatform; }

  # ...

As you can see, we re-evaluate nixpkgs with parameters that enable cross-compilation. The challenge is mostly the changed terminology 🫠. localSystem is the system to build on and crossSystem is the system where the final system needs to run.

The resulting crossPkgs can then be used to configure cross-compilation in the NixOS configuration:

{ ... }: {
  nixpkgs.pkgs = crossPkgs;
}

You cannot mix these approaches. If you set nixpkgs.pkgs, buildPlatform and hostPlatform will be ignored.

Flakes and Cross-Compilation

To always cross-compile from your local system, you can set buildPlatform to builtins.currentSystem. This doesn’t work with flakes, because they don’t allow you to call builtins.currentSystem. It would leak details of the build platform into the flake outputs. The flake would not be fully encapsulated and thus impure. This is one reason why flakes have a bad reputation when it comes to cross-compilation.

Despite the misgivings, cross-compiling with flakes works great. It’s just that the flake has to be prepared for cross-compilation. Let’s go through that for the immutable appliance example.

When I wrote the example, I aimed for the following outputs for the flake:

packages
├───riscv64-linux             # Cross-compiled
│   ├───appliance_17_image
│   ├───appliance_17_update
│   ├───appliance_18_image
│   └───appliance_18_update
└───aarch64-linux             # Cross-compiled
│   └ ...
└───x86_64-linux
    ├───appliance_17_image
    ├───appliance_17_update
    ├───appliance_18_image
    └───appliance_18_update

As you see, each version of our example appliance produces one install disk image and one update package for systemd-sysupdate (see the last post for how this is used).

To build all these images from x86, we only need to apply our theoretical knowledge from above to define crossNixos as a convenience wrapper to add the cross-compilation module to an existing NixOS configuration:

  outputs = { self, nixpkgs, flake-utils, ... }:
    let
      # The platform we want to build on. This should ideally be configurable.
      buildPlatform = "x86_64-linux";
    in
    (flake-utils.lib.eachSystem [ "x86_64-linux" "aarch64-linux"
                                  "riscv64-linux" ]
      (system:
        let
          # We treat everything as cross-compilation without a special
          # case for the build platform. Nixpkgs will do the right thing.
          crossPkgs = import "${nixpkgs}" { localSystem = buildPlatform;
                                            crossSystem = system; };

        # A convenience wrapper around lib.nixosSystem that configures
        # cross-compilation.
        crossNixos = module: nixpkgs.lib.nixosSystem {
          modules = [
            module

            {
              nixpkgs.pkgs = crossPkgs;
            }
          ];
        };

      in {
        # ...

With this out of the way, we can then define a NixOS configuration that is cross-compiled for all our target architectures like this:

        appliance_18 = crossNixos {
          imports = [
            ./base.nix
            ./version-18.nix
          ];
        }

Note that we can use the same configuration to generate system images for x86, RISC-V, and ARM and we build all of them on our beefy x86 build boxes! 🤯

It’s a nice exercise to make the build platform configurable. Check out nix-systems as a starting point.

Running the Images

If you are in the development shell, you can run the cross-compiled images in Qemu:

# uname -m
x86_64

# Enter the development shell that provides the qemu-efi convenience tool.
$ nix develop

# Build the disk image for version 17 of the appliance.
$ nix -L build .\#packages.riscv64-linux.appliance_17_image

# Run the disk image as a VM.
$ qemu-efi riscv64 result/disk.qcow2
...
<<< Welcome to ApplianceOS 24.11.20240906.574d1ea (riscv64) - ttyS0 >>>


applianceos login: root (automatic login)

root@applianceos (version 17) $ uname -m
riscv64

By the way, if you want to know how to run a RISC-V UEFI VM with Qemu, check the qemu-efi script.

Parting Words

If you have comments or suggestions about this style of cross-compilation with Nix, please reach out. I’m eager to hear them!

https://x86.lol/generic/2024/09/21/cross-compile-riscv.html

Immutable Systems: NixOS + systemd-repart + systemd-sysupdate

Aug 28, 2024

When you build software for embedded devices (your Wi-Fi router or home automation setup on your Raspberry Pi), there is always the question how to build these images and how to update them. What I want is:

A mostly immutable system with few moving parts.
A disk image that can be written to disk without a complicated installation procedure.
A simple mechanism to securely download updates from the Internet.

There are bonus points for:

A/B updates with automatic rollback.
Integrity protection for system images.

The systemd project has tooling that solves these problems: systemd-repart creates disk images during the build process and applies a partition scheme during boot. systemd-sysupdate downloads and applies system updates. They have lots of documentation, but I couldn’t find any end-to-end example.

So let’s build an end-to-end example! We’ll use NixOS, but the high-level setup is not NixOS-specific. The final example lives here. For the version referenced in this blog post, check out the blog-post tag.

Partition Layout with systemd-repart

Starting from our goals above, we want the following partition layout. We’ll do this with systemd-repart offline at build time. The sizes are somewhat arbitrary. I’m aiming for the low end here.

Name Size Mount Point Description ESP 256 MiB /boot The boot partition that holds the boot loader and Linux boot files. System A 1 GiB /nix/store The system files. System B 1 GiB /nix/store Alternate system files for the other installed version. Persistent >2 GiB /var Any files that need to survive reboots.

When we build a disk image for the initial installation, the B partition can be empty. The persistent /var/ partition could be created on the fly. However, in this example, we create it at build time for simplicity.

You can see the whole partition configuration in the partitions.nix module in the example. Here’s a shortened version:

image.repart.partitions = {
    "esp" = {
      # The NixOS repart module let's us populate partitions easily. Here we install systemd-boot
      # and the Unified Kernel Image (UKI) of the system.
      contents = {
        "/EFI/BOOT/BOOT${lib.toUpper efiArch}.EFI".source =
          "${pkgs.systemd}/lib/systemd/boot/efi/systemd-boot${efiArch}.efi";

        "/EFI/Linux/${config.system.boot.loader.ukiFile}".source =
          "${config.system.build.uki}/${config.system.boot.loader.ukiFile}";
      };
      repartConfig = {
        Type = "esp";
        Format = "vfat";
      };
    };

    "store" = {
      # We drop all Nix store paths that we require into this partition. This includes all binaries,
      # but also everything to populate /etc.
      #
      # This is our System A partition in the table above.
      storePaths = [ config.system.build.toplevel ];
      stripNixStorePrefix = true;

      repartConfig = {
        Type = "linux-generic";
        Label = "store_${config.system.image.version}";
        Format = "squashfs";
      };
    };

    # Placeholder partition for the System B partition.
    "store-empty" = {
      repartConfig = {
        Type = "linux-generic";
        Label = "_empty";
      };
    };

    # Persistent storage
    "var" = {
      repartConfig = {
        Type = "var";
        Format = "xfs";
        Label = "nixos-persistent";

        # Wiping this gives us a clean state.
        FactoryReset = "yes";
      };
    };
  };
};

With this configuration, we already get a bootable image. Here we build version 17 of our image:

$ nix build .#appliance_17_image
$ ls -l result/
total 1.1G
-r--r--r-- 2 root root 1.1G Jan  1  1970 disk.qcow2

You can then boot this image in Qemu with the provided qemu-efi script available in the development shell:

$ nix develop .
$ qemu-efi ./result/disk.qcow2
[...]
<<< Welcome to ApplianceOS 24.11.20240731.9f918d6 (x86_64) - ttyS0 >>>

applianceos login: root (automatic login)

root@applianceos (version 17) $

So far so good!

Building an Update Package

Now that we have our bootable image of version 17, we need a way to update it to version 18. As stated in the beginning, we do not want to do nixos-rebuild, because this involves Nix evaluation and potentially building code. We don’t want to mutate our system, we want to simply replace it with the new version.

For the update, we need two things:

A new version of the Nix store,
A new Linux kernel and initrd as UKI.

We already prepared our system for a second copy of the Nix store: We have an empty partition for this. We just need a new partition image for the Nix store. The image.repart module can provide individual partition images via the following in the NixOS configuration:

image.repart.split = true;

We can build the UKI for our new system version via the config.system.build.uki of an evaluated NixOS configuration:

$ nix build .#nixosConfigurations.appliance_18.config.system.build.uki
$ ls -lh result/
total 43M
-r--r--r-- 2 root root 43M Jan  1  1970 appliance_18.efi

With some minor NixOS magic, we can build our update package:

$ nix build .#appliance_18_update
$ ls -lh result/
total 318M
-r--r--r-- 2 root root  43M Jan  1  1970 appliance_18.efi.xz
-r--r--r-- 2 root root 276M Jan  1  1970 store_18.img.xz

Configuring systemd-sysupdate

Ok, we have our update, but now we need to apply it. This is where systemd-sysupdate comes in. systemd-sysupdate is a tool that scans update sources for new updates and then allows to apply them to targets.

Sources can be web servers for fetching files via the Internet or local directories. Targets can be directories or partitions on the local system.

In our example, we want to:

Place the UKI of an update package into the right directory on the ESP,
Place the new Nix store into an available partition.

For simplicity, we will tell systemd-sysupdate to look for updates in /var/updates. You can see the whole systemd-sysupdate configuration in the sysupdate.nix module in the example. Here’s the shortened version:

systemd.sysupdate = {
  enable = true;

  transfers = {
     # This section describes the UKI update procedure.
    "10-uki" = {
      Source = {
        # The name pattern of compressed UKI files to download. @v is
        # a place holder for the version number.
        MatchPattern = [
          "${config.boot.uki.name}_@v.efi.xz"
        ];

        # We could fetch updates from the network as well:
        #
        # Path = "https://download.example.com/";
        # Type = "url-file";
        Path = "/var/updates/";
        Type = "regular-file";
      };

      # We want to place the uncompressed UKI into the ESP.
      Target = {
        MatchPattern = [
          "${config.boot.uki.name}_@v.efi"
        ];

        Path = "/EFI/Linux";
        PathRelativeTo = "boot";

        Type = "regular-file";
      };

      # Prevent the currently booted version from being garbage
      # collected by systemd-sysupdate.
      Transfer = {
        ProtectVersion = "%A";
      };
    };

    # This section describes the Nix store update procedure.
    "20-store" = {
      Source = {
        MatchPattern = [
          "store_@v.img.xz"
        ];

        Path = "/var/updates/";
        Type = "regular-file";
      };

      Target = {
        # The target is an available partition on this device.
        # This can in some cases be auto-detected.
        Path = "/dev/sda";

        # The target partition will have this label.
        MatchPattern = "store_@v";
        Type = "partition";
      };
    };
  };
};

Applying the Update

To apply the update, boot the system image as before:

$ nix build .\#appliance_17_image
$ qemu-efi ./result/disk.qcow2
[ ... ]

We continue in the shell in the VM. For demo convenience, the example already has the update package for version 18 in /var/update:

$ ls -lh /var/updates/
total 324M
-r--r--r-- 1 root root  43M Aug 11 15:47 appliance_18.efi.xz
-r--r--r-- 1 root root 276M Aug 11 15:47 store_18.img.xz

systemd-sysupdate finds version 18 as an update candidate:

$ systemd-sysupdate
  VERSION INSTALLED AVAILABLE ASSESSMENT
↻ 18                    ✓     candidate
● 17          ✓               current

The update to version 18 can then be applied:

$ systemd-sysupdate update
Selected update '18' for install.
Making room for 1 updates…
Removed no instances.
⤵️ Acquiring /var/updates/appliance_18.efi.xz → /boot/EFI/Linux/appliance_18.efi...
Importing '/var/updates/appliance_18.efi.xz', saving as '/boot/EFI/Linux/.#sysupdateappliance_18.efifce0abb2fdba79a5'.
[...]
Successfully acquired '/var/updates/appliance_18.efi.xz'.
⤵️ Acquiring /var/updates/store_18.img.xz → /proc/self/fd/3p2...
Importing '/var/updates/store_18.img.xz', saving at offset 269484032 in '/dev/sda'.
[...]
Successfully acquired '/var/updates/store_18.img.xz'.
Successfully installed '/var/updates/appliance_18.efi.xz' (regular-file) as '/boot/EFI/Linux/appliance_18.efi' (regular-file).
Successfully installed '/var/updates/store_18.img.xz' (regular-file) as '/proc/self/fd/3p2' (partition).
✨ Successfully installed update '18'.

Now you can reboot the VM. Once the system is back up, you can remove the last version. This would also happen automatically when the next version is installed:

% systemd-sysupdate vacuum -m 1

Final Words

This was a whirlwind tour through systemd-repart and systemd-sysupdate that hopefully gave you an overview how they work. I invite you to explore the example!

There are lots of pieces missing in the example that I would like to add:

Growing partitions on boot,
Automatically creating /var on first boot,
Automatic rollback on boot failures,
Secure Boot,
TPM-based disk encryption,
…

If you feel like experimenting with any of these features, please open a PR or drop me a message. I would love to see what you did!

PS. If you need consulting, reach out to Cyberus Technology.

https://x86.lol/generic/2024/08/28/systemd-sysupdate.html

Confidential Computing: Complexity vs Security

Jul 7, 2024

This blog post is a continuation of my previous posts about Confidential Computing.

tl;dr

Complexity frequently leads to security issues. Adding support for a bunch of confidential computing technologies to KVM increases its complexity and thus softens its security stance.

Longer Version

While scrolling through KVM security vulnerabilities, it’s hard not to notice an uptick of vulnerabilities related to confidential computing, specifically AMD SEV. Here are some examples. These vulnerabilities typically don’t break the security promises of the confidential VM, but open up issues on the host.

I have been wondering whether the enabling of confidential computing features in KVM inadvertently lowers the security of KVM as a whole. The confidential guest may enjoy the benefits of some protection against malicious hypervisors, but the hypervisor has a harder time enforcing isolation on the whole system.

KVM on x86 is already a beast through no fault of its maintainers. x86 is notoriously hard to virtualize because it is an architecture with lots of legacy. The complexity of KVM reflects that. Also, KVM has often been the first public implementation of many virtualization features and thus can’t enjoy the benefit of hindsight. It also has many users, so rectifying any unfortunate API design or implementation choice is tough because someone’s problem is another person’s feature.

Given the complexities, our open-source virtualization stack would benefit from some big corporation money and brains to simplify, harden its security, and improve its trustworthiness. But as the incentive structures are, CPU vendors instead have started pouring money into developing mutually incompatible confidential computing solutions.

AMD, Intel, and ARM designed their confidential computing projects so they can be bolted onto the existing software stack. As such, each of these technologies adds thousands of lines of code to KVM and further increases the code base’s complexity. Due to the increased complexity, we now unsurprisingly see security issues in the modified code.

So the technology that is supposed to help to increase trust in virtualization has ultimately weakened the security of virtualization for many users. Isn’t this ironic?

https://x86.lol/generic/2024/07/07/confidential-complexity.html

RISC-V: The (Almost) Unused Bit in JALR

Dec 20, 2023

In the RISC-V architecture, you have excellent support for embedding information into code by choosing compressed or uncompressed instructions. While being a typical RISC with fixed 32-bit instruction length, RISC-V allows certain common instructions to be encoded as compressed 16-bit instructions to improve code density. Each compressed instruction has a functionally identical 32-bit cousin.

If you are interested in how that is used to embed information into a binary, you can check out my x86 instruction set steganography post from a couple of years ago, which uses a similar property of the x86 instruction set to do exactly this.

What I found more interesting, when reading the RISC-V User-Level ISA specification, is that the jalr (“Jump and Link Register”) instruction has an essentially unused bit that can be used to embed information as well.

To see why this bit is essentially unused, consider how jalr works. jalr computes its jump target by adding an immediate value to a source register. This immediate is unlike other jump immediates not encoded as multiples of 2. The specification says that the lowest bit of the sum is ignored and treated as zero. Since the source register is practically always aligned and its lowest bit is zero, this means that the lowest bit of the jalr is ignored in practice.

That there is a unused bit in the instruction encoding is unusual. Typically, all the available space is used to encode bigger immediates. But for the jalr instruction the RISC-V designers decided to go for simplicity. Here is an excerpt from the spec (page 16):

Note that the JALR instruction does not treat the 12-bit immediate as multiples of 2 bytes, unlike the conditional branch instructions. This avoids one more immediate format in hardware. In practice, most uses of JALR will have either a zero immediate or be paired with a LUI or AUIPC, so the slight reduction in range is not signiﬁcant.

The JALR instruction ignores the lowest bit of the calculated target address. This both simpliﬁes the hardware slightly and allows the low bit of function pointers to be used to store auxiliary information. Although there is potentially a slight loss of error checking in this case, in practice jumps to an incorrect instruction address will usually quickly raise an exception.

The nice thing about this unused bit is that we can use it to embed information without changing the size of the instruction itself. This makes it more useful than selecting different-length encodings of the same instruction, because we can do so after compiling an application. Choosing different instruction sizes has to be done at compilation time, because it will shift around function addresses and jump targets.

Of course, this only works as long as no one is actually storing information in the low bit of function pointers. But this is rare in practice.

So how much information can we embed using this method? Let’s look at GCC as a medium-sized application. Let’s see how much we have to work with for a RISC-V 32-bit GCC:

$ readelf -l gcc

Elf file type is EXEC (Executable file)
Entry point 0x292bc
There are 11 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  PHDR           0x000034 0x00010034 0x00010034 0x00160 0x00160 R   0x4
  INTERP         0x000194 0x00010194 0x00010194 0x00075 0x00075 R   0x1
  RISCV_ATTRIBUT 0x1948c6 0x00000000 0x00000000 0x00057 0x00000 R   0x1
> LOAD           0x000000 0x00010000 0x00010000 0x18e47c 0x18e47c R E 0x1000 <
  LOAD           0x18eb38 0x0019fb38 0x0019fb38 0x05d7c 0x0a290 RW  0x1000
  DYNAMIC        0x192ee8 0x001a3ee8 0x001a3ee8 0x00118 0x00118 RW  0x4
  NOTE           0x00020c 0x0001020c 0x0001020c 0x00020 0x00020 R   0x4
  TLS            0x18eb38 0x0019fb38 0x0019fb38 0x00000 0x00008 R   0x4
  GNU_EH_FRAME   0x1566d4 0x001666d4 0x001666d4 0x07b34 0x07b34 R   0x4
  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10
  GNU_RELRO      0x18eb38 0x0019fb38 0x0019fb38 0x044c8 0x044c8 R   0x1

There are 0x18e47c bytes of executable code (the LOAD segment with Execute permission). So there are roughly 1.5 MiB of code to work with. Let’s see how much jalr instructions we have:

$ objdump -Mno-aliases -d gcc | grep -E "[^.]jalr" | wc -l
190

There are 190 jalr instructions in these 1.5 MiB of code. That means we can embed 190 bits using this method into GCC. Not a lot. It turns out that jalr almost exclusively used for function entry stubs in the PLT. So there is also no hope of orders of magnitude more in larger binaries.

If we use the obvious method of switching between compressed instructions and normal instructions in RISC-V we have much more to work with. Let’s count the compressed instructions in the GCC binary:

$ objdump -Mno-aliases -d gcc | grep -F "c." | wc -l
186412

That make makes 186412 bits of information (around 23 KiB). Much more useful!

Finally, why would you want to embed information into binaries? I can only think of contrived examples, but they are fun. Consider an air-gapped build system that produces signed binaries. You can only put source code in on one side and you get a signed binary out on the other side. An attacker that manages to exploit this system can covertly smuggle the signing key out by embedding it into the signed binaries itself!

Maybe it is time to insist on reproducible builds instead of air-gapped build systems. 😼

https://x86.lol/2023/12/20/risc-steganography.html

Split Lock Detection VM Hangs

Nov 7, 2023

Recently, I’ve noticed strange hangs of KVM VMs on a custom VMM. As it fits the topic of this blog, I thought I make the issue more googleable. Until we dive into the issue, we have to set the scene a bit.

The Scene

Consider that we want to run a KVM vCPU on Linux, but we want it to unconditionally exit after 1ms regardless of what the guest does. To achieve this, we can create a CLOCK_MONOTONIC timer with timer_create that sends a signal to the thread that runs the vCPU (via SIGEV_THREAD_ID). We choose SIGUSR1, but other signals work as well.

We have to make sure that we do not receive the signal when the vCPU does not execute. This is important, because then the signal will not fulfill its goal of kicking the vCPU out of guest execution. For that, we mask SIGUSR1 with pthread_sigmask in the host thread and unmask it for the vCPU via KVM_SET_SIGNAL_MASK.

This setup works beautifully and in essence emulates the VMX preemption timer1. There is only one wart at this point. When KVM_RUN returns EINTR, because the timer signal was pending, we need to “consume” the signal or the next KVM_RUN will immediately exit again. We can do this with sigtimedwait with a zero timeout.

Weird VM Hangs

When I used this scheme on my Intel Tiger Lake laptop, I noticed strange hangs in VMs. The VM would sometimes get stuck on one instruction. The weird thing was that the vCPU could still receive and handle interrupts, but this one harmless looking instruction would never complete. The effect was that some Linux kernel threads would just get stuck while others continue to run.

The instruction in question was this from the set_bit function of my Linux 5.4 guest:

ffffffff810238b0 <set_bit>:
ffffffff810238b0:       f0 48 0f ab 3e          lock bts %rdi,(%rsi)
ffffffff810238b5:       c3                      ret

Way too late, I noticed the following warning in the host’s kernel log with a matching instruction point:

x86/split lock detection: #AC: vmm/61253 took a split_lock trap at address: 0xffffffff810238b0

Split lock detection is an anti-DoS feature that can find or kill processes that perform misaligned locked memory accesses, because they trigger extremely slow paths in the CPU that impact the performance of other cores in the system.

When I checked in more detail, the lock bts was indeed performing a misaligned locked memory access, but why would this warning cause a permanent hang at this instruction?

On my laptop running Linux 6.6, split lock detection was in its default setting warn. This is reasonable, because the underlying issue is not something you typically care about on a desktop system. The documentation of the relevant kernel parameter reads as follows:

split_lock_detect=
   [X86] Enable split lock detection or bus lock detection

   When enabled (and if hardware support is present), atomic
   instructions that access data across cache line
   boundaries will result in an alignment check exception
   for split lock detection or a debug exception for
   bus lock detection.

...

   warn    - the kernel will emit rate-limited warnings
             about applications triggering the #AC
             exception or the #DB exception. This mode is
             the default on CPUs that support split lock
             detection or bus lock detection. Default
             behavior is by #AC if both features are
             enabled in hardware.

There were no clues about the hang here either. 🤔

Going Deeper

When I checked the kernel function that emits the warning (called via handle_guest_split_lock), the pieces started falling together:

static void split_lock_warn(unsigned long ip)
{
	struct delayed_work *work;
	int cpu;

	if (!current->reported_split_lock)
		pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n",
				    current->comm, current->pid, ip);
	current->reported_split_lock = 1;

	if (sysctl_sld_mitigate) {
		/*
		 * misery factor #1:
		 * sleep 10ms before trying to execute split lock.
		 */
		if (msleep_interruptible(10) > 0)
			return;
		/*
		 * Misery factor #2:
		 * only allow one buslocked disabled core at a time.
		 */
		if (down_interruptible(&buslock_sem) == -EINTR)
			return;
		work = &sl_reenable_unlock;
	} else {
		work = &sl_reenable;
	}

	cpu = get_cpu();
	schedule_delayed_work_on(cpu, work, 2);

	/* Disable split lock detection on this CPU to make progress */
	sld_update_msr(false);
	put_cpu();
}

When the host detects a split lock, it will try to punish the offending thread by introducing a 10ms delay. But recall that our vCPU has a 1ms timer pending!

The situation is thus the following:

The VMM programs a 1ms timer and starts guest execution with KVM_RUN.
The guest executes a misaligned lock bts and exits with an #AC exception.
The host Linux kernel sleeps for 10ms to punish this behavior.
The sleep is interrupted and the function immediately returns with split lock detection still enabled.

At this point, the VMM sees that 10ms has passed and processes its timeout events. It programs a new timeout and we have the same sequence of events again.

I have created a minimal example of this issue here. The guest code just counts how many times it can can execute the lock bts instruction.

When you execute this test program once with split_lock_detect=warn and once with split_lock_detect=off, you get the following data:

The plot shows number of loops that the guest finished on the vertical axis and the pending timeout in ms on the horizontal axis.

You can clearly see that for timeouts below 10ms, this (artificial) guest makes no progress at all when split lock detection is enabled! On the other hand, when split lock detection is disabled, the guest makes roughly as much progress as we give it time.

Workarounds

As I already mentioned, the easiest workaround is to turn split lock detection off via split_lock_detect=off. This is safe unless you run a public cloud. Alternatively, the punishment can be disabled by writing 0 into /proc/sys/kernel/split_lock_mitigate.

A Bug?

The split_lock_warn function is clearly written to allow the offender to make some progress. But in the situation where msleep_interruptible is actually interrupted, this is not the case anymore. It looks like a bug to me.

It’s a difficult question what the correct behavior should be here. If msleep_interruptible managed to sleep at least a bit (i.e. some punishment was dealt), we should still go into the lower part of the function that disables split lock detection and allow for forward progress. This may make it possible to circumvent this punishment though.

I couldn’t find a good resource to link here. The VMX preemption timer is a simple timer that counts down a value in the VMCS proportional to the TSC frequency and generates a VM exit when it reaches zero. See chapter 24.5.1 “VMX-Preemption Timer” in the Intel SDM. ↩

https://x86.lol/generic/2023/11/07/split-lock.html

Intel TDX Doesn't Protect You from the Cloud

Jun 28, 2023

This post is a continuation of my previous post about Intel TDX. It’s worth a read before reading this post. As before, I’m not going to introduce TDX itself. If you need a refresher, Intel has good overview material available.

tl;dr

While Intel TDX does make some attacks by the cloud vendor harder, you still have to trust the cloud vendor unless you go to extreme lengths. We need to build trustworthy virtualization stacks instead of hoping for the silver bullet from CPU vendors.

Longer Version

Let’s take Intel TDX’s promises at face value. When everything goes well, TDX provides CPU state and memory integrity. This is useful because it prevents trivial attacks on VMs from a compromised hypervisor. The hypervisor cannot read secrets directly from memory or inject code.

The problem is that in the TDX trust model, the virtual machine monitor (think Qemu) is not trusted. Yet it emulates all virtual devices. This means all devices are potential machiavellian devils wanting to screw the kernel in the trusted VM. Having completely untrusted devices opens a large attack surface to driver code written in C, rarely considered security critical.

There is a real-world analogy here. If you are security-minded, you want to limit access to the external ports of your laptop. For example, malicious USB devices can exploit vulnerabilities in the operating system’s USB stack to gain code execution. But at least internal devices without exposed ports are out of the attacker’s reach.

With TDX, the attack surface includes all device drivers 😭. All devices are fair game from the attacker’s perspective. The malicious VMM can craft problematic responses from any device, such as the PCI Configuration Space or VirtIO.

So what does this mean? Running a standard OS in a TDX Trusted Domain (TD) instead of plain VMs gives little additional security if the attacker is the cloud vendor. The attacker will eventually find vulnerable device drivers to exploit because device drivers are not typically written in a way where they consider the device’s responses malicious.

But what is there to do about this? While you can minimize drivers in the VM to the bare minimum or run a custom high-security OS in the VM, this takes away the charm of running a stock OS in the trusted VM. You could rewrite all drivers and formally verify them. But that won’t happen any time soon. In reality, people will just run Ubuntu.

You could also implement device emulation in the TDX module, but there are problems:

You can’t do it because it’s not open but “shared” source.
Only Intel can sign the module so the CPU accepts it.
It would only increase the attack surface of this monolithic blob that you have to trust for the complete security of TDX.

People assume that with TDX, you don’t have to trust the cloud vendor when you run your Ubuntu there. This is clearly false. You cannot deploy a standard application into a TDX VM and expect it to be secure from the cloud vendor.

TDX limits exposure to certain classes of attacks. For example, it is hard for the on-call engineer with access to a VM host to extract secrets from a TDX TD. Yet TDX does not provide protection against an entirely malicious cloud vendor that can arbitrarily deploy device emulation code.

But then there is also the burden on the end user. Suppose you don’t do remote attestation and bind your secrets to the VM’s configuration using Trusted Computing magic. In that case, TDX brings no benefit at all. You can’t tell whether your VM runs inside a TDX TD or some software emulation of it.

Not all is lost, though. Check out my previous blog post, which shows a way that sidesteps these problems by allowing devices to be trustworthy. Ultimately, it comes down to the cloud vendor becoming trustworthy and not only trusted. Confidential computing technologies, such as Intel TDX, are a puzzle piece. Still, there is no trustworthy virtualization without a trustworthy virtualization stack.

Update 2023-08-06

The Linux Guest Hardening documentation indirectly makes the same point as the blog post above. There are multiple fun points in the document, but the main point is this:

Every time a driver performs a port IO or MMIO read, access a pci config space or reads values from MSRs or CPUIDs, there is a possibility for a malicious hypervisor to inject a malformed value.

Don’t expect to get solid security out of TDX any time soon:

While some of the hardening approaches outlined above are still a work in progress or left for the future, it provides a solid foundation for continuing this work by both the industry and the Linux community.

https://x86.lol/generic/2023/06/28/intel-tdx-2.html