a1k0n.net — GeistHaus

Pure Silicon Demo Coding: No CPU, No Memory, Just 4k Gates

Dec 19, 2025 Updated Dec 19, 2025

Show full content

In addition to the VGA donut, I submitted two other entries in the Tiny Tapeout 8 demo competition, where you submit a "tiny" ASIC design (room for about 4000 logic gates) that outputs 2-bit RGB to a VGA port and 1-bit audio to a speaker. One was a an old school C64/Amiga intro type of thing, and the other was a nyan cat.

"TT08" intro

Click to play -- video loops imperfectly but the hardware loop is seamless

The intro features a background starfield, a panning 3D checkerboard with colorful tiles, and wavy scrolling text which casts a shadow onto the checkerboard. I was inspired by lft's first AVR demo Craft to add a little cyan oscilloscope to the side of the screen as well.

The size limit was two Tiny Tapeout "tiles" which is not a lot of space for music, effects, and data; this is a very constrained platform for a demo: you get no ROM, no RAM, and every bit of state is a flipflop which takes up a decent chunk of space compared to combinational logic. Traditional "sizecoding" tricks don't apply here -- data compression doesn't help you if you can't actually store a ROM; you have no frame buffer so you must produce a pixel every clock cycle. There's no CPU to exploit, just your own state machine.

Prototyping

Like the VGA donut, this one used the weird custom video mode of 1220x480 and a 48MHz clock as it was easiest to prototype with given the only FPGA hardware I had on hand, but in retrospect I very much regret using this video mode. Most of the digital captures of the demo look bad thanks to dithering and aliasing artifacts. It does look great on a real CRT though!

The above video was made with a Verilator simulation, which compiles the Verilog source down to C++ and allowed me to do cycle-by-cycle simulation of the design, rendering to a SDL window / audio stream. This also allowed me to dump out simulated frames to create the above video. I would later use this SDL VGA verilator testbench for the other projects.

The C++ simulation runs about half-realtime on my system, though, and most of the time I synthesized it for the OrangeCrab's Lattice ECP5 FPGA strapped to an ugly R-2R DAC (it's two bits per channel, so literally R and 2R, no ladder, for RGB) hacked together on a perfboard.

prototype VGA FPGA board

orangecrab FPGA and my RGB222 VGA connector

Sound is output on a single pin through an RC filter; most people used PWM of some kind, but I implemented sigma-delta conversion which is actually simpler (with some tradeoffs; more below). The little solder blob on pin 13 of the orangecrab is where I soldered an audio cable and plugged it into a portable guitar amp for sound.

To get some more color fidelity out of RGB222, I used the same ordered dithering code as in the donut.

Synthesizing for Skywater 130nm

Tiny Tapeout provides a basic workflow for taking your Verilog design and "hardening it" -- laying out the GDS file (which is what goes to the fab) -- as a GitHub action on commit, but there's also a local hardening tool which was absolutely necessary as I alternately added elements to the demo and optimized it to fit. Either there would be too many cells to fit the area, or they would fit but routing wires between them would fail (sometimes after an hour or more!). Being able to dry run several experiments on my own PC was invaluable.

ASIC layout This design ended up using 3374 total cells (breakdown of cell use and 3D viewer here) and filling just about every square nanometer. The sort of largish bluish areas are mostly flip-flops, each representing a bit of state held by the circuit. Flip-flops are relatively big, so minimizing state bits turned out to be essential. I still ended up with 293 total flipflops.

Sine scroller

The invitation intro (which is not a real ASIC demo, just a mockup!) that Matt Venn created for the competition inspired me to do something in a similar vein, and for some reason, I was fixated on using one of the fonts for my demo. Encoding text in this font cost a ton of gate area, and so everything else was highly constrained -- eventually I had to give up encoding the color values in the font (but I kept the palette) and go for a generic diagonal striped pattern.

the font

the font used in the demo, from a large collection of ripped demo fonts

Encoding data in an ASIC

You can think of a ROM as a circuit that has a huge truth table -- address 0000000 yields some value, 0000001 another value, etc. Usually this is laid out in a big grid on a chip with some address decoders doing rows and colums and a transistor for each bit encoding high or low values.

Sometimes these are encoded as Programmable Logic Arrays (PLAs) instead of a fully laid out matrix of bits; one example of this is at the heart of the famous Pentium FDIV bug which Ken Shirrif breaks down nicely here.

But you could also just encode the ROM as a bunch of logic gates specific to the pattern of bits specified, e.g., maybe if address bit 7 is on, the result is always 0. In fact part of what the synthesis pipeline1 does is to take an arbitrarily large truth table and match it to the set of available logic cells (on an FPGA this tends to be a bunch of 4-input lookup tables but Sky130 has all kinds of assorted gates and combined logic cells - random example). So I provided a giant lookup table as a hex file and the yosys synthesis pipeline figured out a way to implement it with the standard cell library, effectively by finding patterns in the data that match up with patterns in the addresses, and then, instead of fitting them in a nice rectangle, sprayed those standard logic cells all over my available die area in whatever shape made sense for the other standard cells connected to them. This makes is very hard to predict how much "data" will fit.

What this means for a limited gate area demo is that instead of minimizing the size of data to store as lookup tables, it's better to limit the algorithmic complexity of the data as a function of its address. Repeating the same power-of-two-sized pattern is free; only variations need extra logic. This is helpful to keep in mind when composing music, for example. So encoding "TT08", if the width of a character cell is a power of two, means that second copy of 'T' is basically free in terms of logic gates.

a1 a0 data 0 0 'T' 0 1 'T' 1 0 '0' 1 1 '8'

a1 == 0 -> 'T'
a1 == 1 && a0 == 0 -> '0'
a1 == 1 && a0 == 1 -> '8'

drastically simplified example

Sines without sine tables

Usually in a sine scroller you'd expect to see a sine/cosine lookup table, but given the above, it seems wise to avoid it if possible. Just like in my donut demo, I instead rotate a vector on every clock cycle. I don't need random access to the sine table, only \sin(x + o) for each x coordinate on the screen which increments each cycle, with an offset o which increments each frame.

In fact I have a high-resolution sin/cos vector (a_cos, a_sin) that updates slowly each frame, and a low-resolution one (b_cos, b_sin) initialized from the high-resolution one at the start of each scanline that updates each clock cycle.

Cutting bits out of the per-pixel sine generator was one of many hacks done late in development to save on area (remember: flip-flops are pretty large so they had to be trimmed aggressively), and there is visible jankiness in the sine scroller as a result, if you watch closely.

This is the symplectic integrator which generates a perfect circle slightly off-center in just a few simple shift/add operations, first described by Minsky in HAKMEM 149.

reg signed [14:0] a_cos;
reg signed [14:0] a_sin;
reg signed [11:0] b_cos;
reg signed [11:0] b_sin;

// on reset:
  a_cos <= 15'h2000;
  a_sin <= 15'h0000;

// on new frame:
  wire signed [14:0] acos1 = a_cos - (a_sin >>> 6);
  wire signed [14:0] asin1 = a_sin + (acos1 >>> 6);
  a_cos <= acos1;
  a_sin <= asin1;

// on new line:
  b_cos <= a_cos >>> 3;
  b_sin <= a_sin >>> 3;

// else on each clock:
  wire signed [11:0] bcos1 = b_cos - (b_sin >>> 7);
  wire signed [11:0] bsin1 = b_sin + (bcos1 >>> 7);
  b_cos <= bcos1;
  b_sin <= bsin1;

Excerpt of Verilog code for generating the horizontal sine offset

Plane checkerboard

To do the checkerboard, I do a perspective transform of the screen x,y coordinates onto the plane's coordinate system (which I will call u,v). I use the convention that the plane is parallel to the xz-plane at a height of y_p.

I project a ray from the screen until it hits the plane, then recover the u,v coordinate hit. I actually want v and \frac{du}{dx} particularly, as they are constants on each line when looking directly toward the z axis. \frac{du}{dx} gives a fixed value to add to u each clock cycle and I don't need to do any complex math per pixel, and can just multiply that by half the screen width to get u_0 at the left side of the screen.

A ray starts at the origin at t=0 and has 3D coordinates (xt,yt,t), intersecting the plane when yt = y_p for a plane height y_p. Therefore t=\frac{y_p}{y}.

The (u,v) coordinates of the plane are the x and z coordinates at the intersection point, (u,v) = (xt, t) = (\frac{x y_p}{y}, \frac{y_p}{y}). Clearly, \frac{du}{dx} is again just \frac{y_p}{y}, so this expression is the only thing that needs computing.

So I needed to pick an appropriate plane height scale y_p that gives a field of view that looks good and that I can easily divide by y in hardware. I also add an offset y_0 to y to move the camera "down" a bit and also to avoid dividing by 0 (and I don't render the plane when y < y_0 ).

Here's an early mockup in a Jupyter notebook I did working out the math for this, trying to get sensible fixed point value ranges (i.e., number of bits I needed to use in Verilog):

def genboard2(w, h, y0, k):
    M = (1<<(k+1))-1
    y = np.arange(h)+y0
    hw = w << (k-2)
    yx = np.mgrid[:h,:w]
    dx = (1<<(k-1))//y
    xl = (-dx*w>>1)&M
    z = dx
    print('k', k, 'M', M)
    print('max dx', dx[0])
    print('dx = z =', (1<<(k-1)), '/ y')
    print('x_left <=', M+1, '-', w//2, '* dx &', M)
    px = (yx[1].T*dx + xl)>>10
    img = px.T.astype(int) ^ z.astype(int)[:, None]
    return img&32

plt.figure(figsize=(12,3))
plt.imshow(genboard2(1220, 480, 128, 17))

k 17 M 262143
max dx 512
dx = z = 65536 / y
x_left <= 262144 - 610 * dx & 262143

jupyter notebook checkerboard

This means we will either need to create a lookup table for each scanline, or actually implement a division algorithm; I tried both, and implementing non-restoring division is smaller.

Sixteen clocks before the end of each scanline, the recip16 module gets a start signal and begins dividing 65536 by the screen y coordinate. This yields a fixed point number we can use to scale the u coordinate, and it also gives the v coordinate.

reg [21:0] plane_u;
reg [10:0] plane_du; // latched current du/dx
// simplification: the vertical component happens to be
// equal to the horizontal step size
wire [10:0] plane_v = plane_du;
wire [10:0] plane_du_;  // output of the divider, next du/dx to load

// we can compute this at the end of the previous line during horizontal blank;
// we'll latch the output value into plane_du at the start of the line.
recip16 plane_dx_div (
    .clk(clk48),
    .start(h_count == H_DISPLAY - 16),
    .denom(plane_y+1),
    .recip(plane_du_)
);

// when x = 0:
  // to get plane_u at the left side of the screen, multiply plane_du_ by
  // -displaywidth/2, implemented here with shifts and adds
  plane_u <= -((plane_du_<<1) + (plane_du_<<5) + (plane_du_<<6) + (plane_du_<<9));
  plane_du <= plane_du_;

// else:
  plane_u <= plane_u + plane_du;

Verilog code generating the u,v coordinates of the checkerboard plane

The checkerboard is created by picking out bits of this (u,v) coordinate frame and XORing them, and the RGB color is some mix of those bits too.

// a_scrollx and y define the plane's checkerboard movement offset
wire signed [15:0] a_scrollx = a_cos>>>4; // re-using the per-frame cosine defined above
wire signed [15:0] a_scrolly = frame << 3;

wire [11:0] hscroll = plane_u[20:9] + a_scrollx[11:0];
wire [10:0] vscroll = plane_v[10:1] + a_scrolly[10:0];
wire checkerboard = hscroll[7] ^ vscroll[6];

wire [5:0] checker_raw_r = checkerboard ? hscroll[8:3] : 0;
wire [5:0] checker_raw_g = checkerboard ? vscroll[8:3] : 0;
wire [5:0] checker_raw_b = checkerboard ? vscroll[7:2] : 0;

6-bit RGB rainbow checkerboard generator

It's subtle, but the kick drum in the song adds to the plane y_0 offset, so you can see it "jumping" on every drum hit. And at certain points in the song, random checkers start lighting up, before finally turning white in diagonal stripes to the beat, panning the plane_y starting coordinate all the way to the top of the screen, turning the entire screen white before fading to black and looping.

Shadows

When drawing the plane, it also renders the scroll text2, except using the projected (u,v) coordinates instead of the screen (x,y) so that the shadow of the text is projected onto the plane. However, as I mentioned above, the sine wave is generated by rotating vectors each pixel and so there isn't random access to \sin(u) -- so instead of using the correct scroll height from the projected u, it uses whatever sine is currently available. This shortcut means the shadow isn't really accurate, but it's good enough.

wire shadow_active = char_active && (plane_v is within scrolltext height...);

wire [5:0] checker_r = shadow_active ? {2'b0, checker_raw_r[5:2]} : checker_raw_r;
wire [5:0] checker_g = shadow_active ? {2'b0, checker_raw_g[5:2]} : checker_raw_g;
wire [5:0] checker_b = shadow_active ? {2'b0, checker_raw_b[5:2]} : checker_raw_b;

The shadow mask just shifts the checker color right two bits

Starfield

To do the horizontal scrolling starfield, I essentially needed a constant random number sequence per scanline, so naturally I used an LFSR that resets at the top of the screen and updates at the start of each row; then bits of the LFSR state are used to specify a star's horizontal offset and speed, and then the global frame counter is added on, multiplied by speed. The stars also have longer streaks in time with the snare drum in the music track.

reg [12:0] linelfsr;

// on new frame:
  linelfsr <= 13'h1AFA;

// on new scanline:
  linelfsr <= linelfsr[0] ? (linelfsr>>1) ^ 13'h1159 : linelfsr>>1;

// define the x offset of the star on this scanline
// bits [12:2] of the LFSR state are the random x offset
// bits 1 and 0 boost the movement speed by 2x and 4x
wire [10:0] starfield_x = linelfsr[12:2] + (frame<<1) + (linelfsr[1] ? frame<<2 : 0) + (linelfsr[0] ? frame<<3 : 0);
// define whether the current pixel (h_count) is actually a star, lengthened by the snare drum timer
wire star_pixel = h_count >= starfield_x && h_count < starfield_x + 2 + (7^(audio_snare_frames[3:1]));
// stars natually loop off the right side of the screen and re-emerge on the
// left side as the starfield_x value overflows

Music

Keeping in mind the constraints on how data is stored, and using some simple music composition ideas, I decided to structure each section of music in an "ABAC" pattern (or ABACABAD). This works well with our logic gate based encoding: The reduced logic will have something like "if measure bit 0 is 0, then use pattern A". And B is just a variation of A, so it can find a lot of redundancies, etc.

Synthesizer

I also needed some very simple-to-code instruments; a square wave arpeggio was a given for this genre of intro, but for percussion I wanted to generate a nice sounding kick drum and I settled on using a triangle wave synthesizer with an exponentially decaying frequency. Then I could multiplex a smooth bassline with the kick drum -- after the kick drum decays, just set the frequency back to the current bassline. For a hihat/snare, add some LFSR noise modulated by an exponential envelope for a hihat/snare kind of sound.

All the instruments were initially mocked up in a Jupyter notebook, as I tried to find efficient ways to generate triangle-based kick drums, etc.

def tri(x):
    return ((x&65535) ^ (x&32768 and -1 or 0) & 65535) - 16384

def kick(n):
    osci = 0x3fff # 512*32 - 1
    oscp = 0x1235
    out = np.zeros(n, np.int16)
    y1 = 0
    for i in range(n):
        out[i] = tri(oscp>>5)
        oscp = (oscp + osci)&((1<<21)-1)
        osci = (osci - ((osci+2047)>>11))&((1<<14)-1)

    return out

plt.figure(figsize=(12,4))
k = kick(4410*2)
plt.plot(k)
IPython.display.display(IPython.display.Audio(k.astype(np.float32)/127.0, rate=48000))

kickdrum plot

There are three total channels:

noise w/ exponential envelope (snare)
square wave with two-note arpeggio, exponential envelope (chords/melody).
constant-volume triangle w/ exponential pitch decay (kick)

Exponential envelopes are simply implemented as right-shift counters, where volume 0 is full volume and every few ticks that counter increments, and then the final sample is right shifted by the volume.

As much as possible, I avoided storing state in registers (flipflops are expensive, see above), and tried to make the music driven by a combinatorial logic chain from a simple song position counter. So there is no register for the pulse frequency increment; that's purely derived from the song position. The triangle synth does need one, though, in order to implement the kick exponential.

This led to a lot of off-by-one errors developing the code, as triggers are defined for updating envelopes for the next song position, but the notes for the pulse channel are already set for the current position, etc. Eventually I sorted them all out though it does make the code quite confusing to read, as various counter values differ from the Python version and yet it sounds the same.

A simple trick I found: For the pulse synth, we can also just pick different bits out of the per-sample accmulator to raise the pitch by octaves. I decided to constrain it to a two-octave range.

// note this is signed, so it generates a (+0x4000 or -0x4000) >>> volume
wire signed [15:0] pulse_output = {pulse_octave ? pulse_osc_p[12] : pulse_osc_p[13], 1'h1, 14'h0} >>> (2+pulse_vol);

Sigma-delta DAC

Instead of PWM, I did the simplest sigma-delta scheme I could think of:

reg [15:0] sigma_delta_accum;
wire [15:0] audio_sample; // assigned by synth
wire [16:0] sigma_delta_accum_ = sigma_delta_accum + audio_sample;
reg audio_out; // actual output pin

always @(posedge clk48) begin
  sigma_delta_accum <= sigma_delta_accum_[15:0];
  audio_out <= sigma_delta_accum_[16];
end

We add the (unsigned) audio sample to the accumulator every clock, and the pin gets the carry. So if the audio sample is at 50% (0x8000), then we get a full clock rate square wave, and if the output sample is 0x0001, we get a constant 0 with a single 1 pulse every 65536 clock cycles and conversely for 0xfffe, we get a constant 1 with a single 0 pulse every 65536 cycles. In general it creates a dense pulse train with no specific period.

The problem with this is that the rising edges and falling edges happen at different speeds on the Tiny Tapeout chip, which introduces a bit of harmonic distortion. I wasn't too worried about harmonic distortion in my triangle/pulse/noise synth.

Composition

I took a lot of inspiration from a C64 SID tune called Crooner by Drax/Vibrants, that then later was adapted for AdLib and I had made an AdLib OPL2 emulator for it. I decided to keep my tune very simple though, so there's no real melody, just the same chord progression as Crooner in a repeating four bar pattern:

| Am      | C       | D       | F   G   |

The D major isn't really in the Am/C scale (more of a secondary dominant) but this meant there were a total of eight unique notes in the scale instead of the usual seven: C D E F F# G A B, giving me a power of two-sized table of base frequencies. This meant I could have a 3-bit wide table for notes, plus another bit for octave, so each instrument has a total of 16 possible notes to play.

Using a two-note arpeggio is a departure from conventional three-note arpeggios in most tracked music, but counting to 2 by slicing off a bit from the song tick counter is easier than counting to 3 which would require an extra counter; besides, the bass can play the root note of the chord and the melody can play the thirds and fifths.

I also swung the tempo pretty strongly, swapping between 15 and 25 ticks per beat which really improved the feel.

And then I literally wrote the song as a bunch of strings in Python, just typing out the scale notes and using the # symbol for F#:

snareA = '..x..xx...x..xx.'
kickA  = 'x...x...x...x...'

snareB = '..x...x..xx...x.'
kickB  = 'x...x..x.x..x...'

snareC = '..x...x...x..xxx'
kickC  = 'x...x..xx...x...'

snareD = '..x..xx...x...xx'
kickD  = '................'

# concatenate ABACABAD pattern
snare = snareA + snareB + snareA + snareC + snareA + snareB + snareA + snareD
kick = kickA + kickB + kickA + kickC + kickA + kickB + kickA + kickD

# bassline which repeats throughout entire song
#                   1+2+3+4+1+2+3+4+1+2+3+4+1+2+3+4+
bass_chart_notes = 'AEAAAEAACCGCECCCDDD#D#DDFCFFGDGG'
bass_chart_octs  = '01100000000110100011010101100110'

# pulse aka square wave
# the note and 'arp' note alternate in a rapid arpeggio; oct defines the octave
# the envelopes are triggered by a non-blank entry in the note string
# A section, basic
pulsAnot = 'A...A.....A...A.G...G.....G...G.#...#.....#...#.F...F...G...G...'
pulsAoct = '0000000000000000000000000000000000000000000000000000000000000000'
pulsAarp = 'EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBB'

# B section, slight embellishments
pulsBnot = 'C...C.....A...C.G...C....CG...GG##..#.....#...#FFF..FF.FGG..GG..'
pulsBoct = '1000100000000000000000000100000001000000000000000100010101000100'
pulsBarp = 'EEEEEEEEEEEEEEEEEEEEGEEEEGCEEEECAAAAAAAAAAAAAAACACAAAAAABDBBBDBB'

# C section, melodic finale
pulsCnot = 'CCA.CAA..ACACCCAGGCCGCGE..GGG.GG#DD###.#.D###.#DFF.FF.CFGG.GG.GD'
pulsCoct = '0000011001100100001000100001001000100100001010011001000001001001'
pulsCarp = 'EEEEAEEEEEEEAAECEEEECEECEEEECECEAAADADAAAAADDAAAAACCAAFABBBBDBBB'

# ABAC sections concatenated
puls_chart_note = pulsAnot + pulsBnot + pulsAnot + pulsCnot
puls_chart_oct =  pulsAoct + pulsBoct + pulsAoct + pulsCoct
puls_chart_arp =  pulsAarp + pulsBarp + pulsAarp + pulsCarp

The full Python script emulates all the instruments exactly and plays the entire song, while also dumping out the data tables used by the verilog code.

notefreq: 5b 67 73 7a 81 89 9a ad
pulsmask:  1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1
pulsfreq1: 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0 6 6 6 6 0 0 5 5 5 5 0 0 0 0 0 0 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 0 0 6 6 0 6 6 6 6 6 0 6 0 0 0 6 5 5 0 0 5 0 5 2 2 2 5 5 5 5 5 5 4 1 1 4 4 4 4 4 4 1 4 4 4 4 4 1 3 3 3 3 3 3 0 3 5 5 5 5 5 5 5 1
pulsfreq2: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 5 5 5 5 5 0 0 0 0 2 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 0 6 0 0 0 6 6 6 6 7 1 1 1 7 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 2 2 2 2 6 2 2 2 2 2 2 2 6 6 2 0 2 2 2 2 0 2 2 0 0 0 2 2 0 0 0 2 6 6 6 1 6 1 1 6 6 6 6 1 1 1 6 6 6 6 6 0 6 6 3 6 7 7 7 7 1 1 7 7
pulsoct:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 1 0 1 1 0 1
bassfreq:  6 2 6 6 6 2 6 6 0 0 5 0 2 0 0 0 1 1 1 4 1 4 1 1 3 0 3 3 5 1 5 5
bassoct:   0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0
kick:   1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
snare:  0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 1

all data needed to play the song

Retrospective

There are a few design choices I made that I regret after finishing this intro.

First, the audio track has a completely independent clock from the visual component. A lot of classic chiptune music was constrained by using 50Hz or 60Hz frame timers, limiting the available tempo choices. Additionally I didn't want to tie my sample rate to the VGA horizontal frequency (31.5kHz) which would have been a natural choice. Instead I used a 48MHz / 1024 clock divider (46875Hz) and used a tick rate asynchronous from the VGA frequency, and then I could better control tempo, etc.

This seemed like a good idea, except when I tried to do simple music-sync effects. Now at any given point on the screen, the music can tick over, resulting in "tearing". There's a speckly checkerboard effect that's in sync with the music, but it doesn't "pop" because it doesn't actually work correctly. Additionally, the music loops about halfway through the last frame of the intro, resulting in a checkerboard blinking in for half of a frame as some counter overflows before it can be properly reset at the top of the next frame.

The 1220x480 video mode was also a huge mistake. In general, the people who can test this design aren't going to do it on a CRT, so better to just sync up properly with the LCD pixels and not rely on 90s era temporal dithering and CRT smoothing.

Nyan cat

nyan.cat demo; in the actual design the musical intro only plays once, and then loops seamlessly as the sunglasses scroll off the screen.

A couple days before the Tiny Tapeout 8 deadline, Uri threw down a challenge in the Discord to make a Nyan Cat, and so I decided to take everything I learned from making the TT08 intro and the donut and put together a design using ripped art and music. I'd reuse some code, avoid previous mistakes, use 640x480 VGA, audio synced to video (31.5kHz sample rate, 60Hz tempo ticks), simpler dithering. Ideally, get it all to fit in one tile.

Well, it turns out the design I came up with worked and all the cells fit into one tile -- but it was too crowded like that for detailed routing3 to succeed. Without any workable ideas to cut down on space and without much time left before the deadline, I just grew the design to two tiles and added a "deal with it" sunglasses easter egg to consume some tiny portion of the expanded space.

nyan GDS

The layout. The red cells throughout the bottom half are actually just filler capacitors, essentially unused space. Alas, it didn't fit into one tile though.

Art

I started by grabbing the original animated .gif from nyan.cat and wrote a python script to extract the individual frames, the palette, and generate a remapped data table with all of the unique animation frames. I used a color picker to grab all the original rainbow colors and background color, then converted them to their closest match in RGB222 with a 2x2 dithering matrix.

original nyancat.gif

original nyan.cat .gif

The cat smoothly shifts back and forth in the original animation, so I reused the sine wave generator code to give him a little x-offset.

Originally there are little exploding stars/fireworks? scrolling by in the background, but I didn't get around to emulating those. Instead I just reused the LFSR-based starfield from the TT08 demo except without random star speeds. This leads to extremely visible artifacts where the star positions are obviously being right-shifted from one line to the next near the bottom of the screen.

Music

I found a MIDI rendition of the nyan.cat song and wrote yet another crappy Python script to parse the MIDI, split out the melody and bass sections, remap note indices to a minimal scale of notes (once again, only 8 notes are used in the scale), and dump out tables that the Verilog code could read.

Total length: 288 ticks
melodynotes ['C#', 'D ', 'D#', 'E ', 'F#', 'G#', 'A#', 'B ']
bassnotes ['C#', 'D#', 'E ', 'F#', 'G#', 'B ']
Remapped melody scale: ['C#', 'D ', 'D#', 'E ', 'F#', 'G#', 'A#', 'B ']
Number of unique notes after remapping: 8
melody increment table 48 4c 51 56 60 6c 79 81
bassnote 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 3 3 3 3 4 4 4 4 2 2 2 2 5 5 5 5 0 0 0 0 4 4 4 4 7 7 7 7 7 7 7 7 3 3 3 3 4 4 4 4 2 2 2 2 5 5 5 5 0 0 0 0 4 4 4 4 7 7 7 7 7 7 7 7 3 3 3 3 4 4 4 4 2 2 2 2 5 5 5 5 0 0 0 0 4 4 4 4 7 7 7 7 7 7 7 7 3 3 3 3 4 4 4 4 2 2 2 2 5 5 5 5 0 0 0 0 4 4 4 4 7 7 7 7 7 7 7 7 3 3 5 5 7 7 3 3 2 2 4 4 7 7 2 2 0 0 3 3 5 5 7 7 7 7 2 2 4 4 7 7 3 3 5 5 7 7 3 3 2 2 4 4 7 7 2 2 0 0 3 3 5 5 7 7 7 7 2 2 4 4 7 7 3 3 5 5 7 7 3 3 2 2 4 4 7 7 2 2 0 0 3 3 5 5 7 7 7 7 2 2 4 4 7 7 3 3 5 5 7 7 3 3 2 2 4 4 7 7 2 2 0 0 3 3 5 5 7 7 7 7 2 2 4 4 7 7
bassoct 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 0 0 1 1 0 0 1 1 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 0 0 1 1 0 0 1 1 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 0 0 1 1 0 0 1 1 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 0 0 1 1 0 0 1 1 2 2 2 2 2 2 3 3 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2
basstrigger 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
melodynote 2 3 4 4 7 7 2 3 4 7 0 2 0 6 7 7 4 4 2 3 4 4 7 7 0 6 7 0 3 2 3 0 4 4 5 5 2 2 2 7 1 0 7 7 7 7 0 0 1 1 1 0 7 0 2 4 5 2 4 0 2 7 0 7 2 2 4 4 5 2 4 0 2 7 1 2 1 0 7 0 1 1 7 0 2 4 0 2 0 7 0 0 7 7 0 0 4 4 5 5 2 2 2 7 1 0 7 7 7 7 0 0 1 1 1 0 7 0 2 4 5 2 4 0 2 7 0 7 2 2 4 4 5 2 4 0 2 7 1 2 1 0 7 0 1 1 7 0 2 4 0 2 0 7 0 0 7 7 0 0 7 7 4 5 7 7 4 5 7 0 2 7 3 2 3 4 7 7 7 7 4 5 7 4 3 2 0 7 4 2 3 4 7 7 4 5 7 7 4 5 7 7 0 2 7 4 5 4 7 7 7 6 7 4 5 7 3 2 3 4 7 7 6 6 7 7 4 5 7 7 4 5 7 0 2 7 3 2 3 4 7 7 7 7 4 5 7 4 3 2 0 7 4 2 3 4 7 7 4 5 7 7 4 5 7 7 0 2 7 4 5 4 7 7 7 6 7 4 5 7 3 2 3 4 7 7 0 0
melodyoct 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 2 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1
melodytrigger 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0

output of the Python script containing the full song data extracted from MIDI

Then for kicks I added the kick drum (ha ha) to the bass track, same as the TT08 intro. I'm pretty sure I just wrote the triggers for that out by hand; there wasn't time to create any fancy tools.

This has just two audio channels:

Bass/kick, pure square wave this time
Melody, 25% duty cycle pulse wave to add timbre

Both bass and melody use per-tick exponential decay envelopes, with each tick occurring at the start of a video frame (so 60Hz). The kick drum takes over the bass channel for a few frames, exponentially decreasing its frequency before returning to the original bass pitch/envelope.

All notes use the same increment table, but octaves are shifted by changing the phase bits used. The "sqr" channel is actually 25% duty cycle pulse, turning on only when both of the 2 MSBs are on.

reg [12:0] sqr_pha;
reg [5:0] sqr_vol;

wire [2:0] cur_melody_note = melody_note[songpos];
wire [7:0] sqr_inc = noteinctable[cur_melody_note];

wire sqr_on =
  cur_melody_oct == 2 ? sqr_pha[10]&sqr_pha[9] :
  cur_melody_oct == 1 ? sqr_pha[11]&sqr_pha[10] :
  sqr_pha[12]&sqr_pha[11];

wire [5:0] sqr_sample = sqr_on ? sqr_vol : 6'd0;

// On each beat (6 ticks):
  if (melody_trigger[songpos_next]) begin
    sqr_vol <= 63;
  end

// On each non-beat tick:
  sqr_vol <= sqr_vol - (sqr_vol>>3);

// On each audio sample:
  sqr_pha <= sqr_pha + {4'b0, sqr_inc};

Verilog code generating the pulse channel with envelopes

I also added a global low pass filter because the pulse melody was a bit piercing. This turned out to dull the sound a lot more than I intended, though.

Manufacturing TT08

Tiny Tapeout 8 began manufacturing in September 2024, completing production of wafers and dies. And then, in March 2025, we received the devastating news that Efabless, the company supporting Tiny Tapeout, abruptly shut down, leaving its assets, which included mid-production TT08 and TT09 chips, in an uncertain state. TT08 was just being packaged at the time of the shutdown.

This was kind of a crushing blow for me, since I had put so much effort into the designs on this chip in particular. It also meant that the judging for the demo competition was indefinitely on hold, as it required the physical hardware to be delivered to the judges.

And then, out of nowhere, the #tt08-silicon channel showed up on the Tiny Tapeout discord server in late September 2025. It seems the chips were acquired by a new company and turned over to the Tiny Tapeout team. Soon after, this arrived at my doorstep:

TT08 development board

Tiny Tapeout 8 development board, with audio and VGA Pmods attached

In fact all of my designs worked perfectly -- except it turns out I had relaxed the timing on the donut a bit in order to reduce gate area (lower clock rates require lower drive strength which means smaller transistors), so the donut needs to run at 45MHz instead of 48MHz to avoid some glitches.

TT08 running nyan cat

Tiny Tapeout 8 running Nyan Cat on my LCD and a guitar amp for audio

Wrapping up

Needless to say, I was elated when I saw my designs running perfectly on my desk after a long year of waiting and expecting the worst.

I was holding out on doing this writeup until the judges had definitively seen the hardware running. bitluni showed them all off on a recent livestream with both LCD and CRTs, so I figured it's finally time to post this.

Revisiting these designs over a year later, I had forgotten how involved they were and how many intricacies I wanted to explain in this post.

I learned so many new low-level tricks building these demos and I have even better tricks in store for the next one.

The pipeline as a whole is composed of many parts, but specifically optimizing and mapping combinational logic to a target architecture is handled by ABC; lecture 7 of this UC Berkeley class covers the process in some detail. ↩
Technically the scroll text hardware is doing something every clock cycle regardless, so might as well use it, right? ↩
Detailed routing is the step where OpenROAD draws wires in the metal layers between standard cells and is usually the most compute intensive part of hardening. When trying to squeeze designs into tight constraints, your only indication that it won't work is that this step fails after trying, in some cases, for several hours. ↩

https://www.a1k0n.net/2025/12/19/tiny-tapeout-demo.html

From ASCII to ASIC: Porting donut.c to a tiny slice of silicon

Jan 10, 2025 Updated Jan 10, 2025

Show full content

[Update 12/19/2025: I got the chip, and it works! Here is a writeup of my other designs on this chip, as well as a picture of it in action]

For many years after coming up with donut.c, I wondered in the back of my mind if it could be simplified somehow, like maybe there was a way to raytrace a donut with a small chunk of code. In October 2023, I tweeted a dumb random epiphany where I figured out another way to do it which requires no memory, no sines or cosines, no square roots, no divisions, and technically, not even any multiplications. The whole thing can be rendered with just shifts and adds, and there's an updated C version here.

It took almost another year to put this idea into action with an actual hardware implementation. In early September 2024, I submitted a 4-tile design to Tiny Tapeout 8, taking up 0.8% of a 130nm process chip shared with many other designs. This chip is currently (as of January 2025) being manufactured by SkyWater Technologies, and if my design works, it will render this to a VGA monitor:

This design is not rendering polygons! Instead, it's making an iterative approximation of a ray-traced donut, and it is racing the VGA beam — every 39 nanoseconds, the monitor is going to show the next pixel, ready or not, practical considerations limit us to a ~50MHz clock, and we have no memory buffer whatsoever — so it doesn't have enough time to do a good job; the polygonal appearance is a complete accident!

And no, I already know what you're going to ask: the actual silicon is not in the shape of a donut. Maybe next time. It takes about 7000 standard cells (i.e., flip flops or logic gates), and you can actually see a 3D view of the die here thanks to the amazing tools the Tiny Tapeout team has put together.

visualization of the chip Visualization of the actual chip layout which produces the donut on Tiny Tapeout 8; this is a small piece of a much larger die

With more die area, and/or a faster clock (which would need something more modern than the 130nm SkyWater process), this could be pipelined so that every pixel was perfectly rendered, but the point was to make it as small as I could.

Source

The TinyTapeout Verilog project is on github here with a Verilator testbench which produced the above video and an OrangeCrab FPGA version I used to develop.

Because I developed it on an OrangeCrab, I ended up using its native 48MHz clock for all timing; typically VGA would have a 25.175MHz clock, so I opted to use a slighly strange VGA mode of 1220x480 in order to have the timing work out. This looks great on an analog CRT, but LCD monitors can end up with weird aliasing artifacts. Lesson learned.

I have this and two other designs (which deserve their own write-ups!) on Tiny Tapeout 8 and they were my first real Verilog projects; the code is frankly atrocious in order to get everything to fit under tight constraints.

Rendering donuts with just shifts and adds

The trick is to represent the torus as a signed distance function and "raymarch" it, which is a shockingly powerful technique popularized by Inigo Quilez, who does incredible work. His site is a treasure trove in general, but see this excerpt from his distance functions page:

torus as signed distance function

The signed distance to a torus can be computed by two 2-dimensional length operations in sequence. Guess how you can compute the length of a 2D vector? With CORDIC, a variation on the same trick I used last time!

The implication of this is that you can find the intersection of a ray and a donut by ray marching: starting at the 3D location of the screen, calculate the above function to find the distance; then, move that distance along the ray between the origin and the pixel you are rendering, and repeat the whole process until the distance converges near 0.

That gives you the point, but what color do we actually render? We need the surface normal to compute lighting, which requires rotating the light direction vector along the two axes of the torus with whatever angle we happened to hit it at. This seemed way too complicated at first.

The random epiphany I had was this: the amount we need to rotate the lighting vector to find the surface normal is exactly the same as the rotations we undid to find the distance to the torus in the first place! And furthermore, we can do this rotation almost for free as a side effect of computing those lengths.

Let me try to explain.

Vectoring mode CORDIC

There's a trick where you can compute \sqrt{x^2 + y^2} using only shifts and adds, a profoundly clever application of CORDIC: you iteratively rotate the \overrightarrow{(x,y)} vector toward the x axis, one "bit" at a time (first by \arctan 1= 45°, by \arctan \frac{1}{2} = 26.5°, \arctan \frac{1}{4} = 14.04°, etc). This not only allows you to recover the length but also the angle, in effect transforming between Cartesian and polar coordinates.

Length/Angle here

The above widget is running this code for eight iterations and drawing all the intermediate steps in various colors, with the bold green line being the final iteration:

function cordicLength(x, y, n) {
    // x must be in right half-plane; since we only really care about distance, we
    // discard the fact that we did this, but we could set a flag to recover the
    // correct angle here.
    if (x < 0) {
        x = -x;
    }
    for (var i=0; i < n; i++) {
        var t = x;
        if (y < 0) {
            x -= y>>i;
            y += t>>i;
        } else {
            x += y>>i;
            y -= t>>i;
        }
    }
    return x;
}

The way this works is that the shifts and adds here perform a rotation and scale; effectively doing this matrix multiplication:

\left[\begin{matrix} x_{i+1} \\ y_{i+1} \end{matrix}\right] = \left[\begin{matrix} 1 & \mp 2^{-i} \\ \pm 2^{-i} & 1\end{matrix}\right] \left[\begin{matrix} x_i \\ y_i \end{matrix}\right] = \left[\begin{matrix} 1 & \mp \tan \theta_i \\ \pm \tan \theta_i & 1\end{matrix}\right] \left[\begin{matrix} x_i \\ y_i \end{matrix}\right]

If the point is above the x-axis, we rotate it clockwise, otherwise rotate it counterclockwise, making smaller and smaller adjustments. The angles we're rotating by are \theta_i = \arctan 2^{-i}.

The obvious problem, and you can see it in the iterations above, is that each one of these operations isn't just rotating but also making the vector longer. However, it always makes it longer by the same amount, because we're always doing the same rotations (just changing the sign of them). This works out to a scale factor of about 0.607. So if we keep track of how much we've pre-scaled the lengths by doing this operation, we can correct for it later.

Vectoring an extra vector

By making a small tweak, we can rotate (and scale, by the same .607 factor) a second vector by the angle of the original vector in the process of finding the original vector's length.

Now if blue vector \overrightarrow{(x_2, y_2)} is our "extra" vector, the black vector \overrightarrow{(x,y)} is the one we want to find the length of, and the green line finds the relative angle between the two vectors, independent of the length of (x,y).

function rotateCordic(x, y, x2, y2, n) {
    xsign = 1;
    // if we're in the left half-plane, then negating the input x, y and the
    // output x2, y2 will give the correct result
    if (x < 0) {
        x = -x;
        y = -y;
        xsign = -1;
    }
    for (var i=0; i < n; i++) {
        var t = x;
        var t2 = x2;
        if (y < 0) {
            x -= y>>i;
            y += t>>i;
            x2 -= y2>>i;
            y2 += t2>>i;
        } else {
            x += y>>i;
            y -= t>>i;
            x2 += y2>>i;
            y2 -= t2>>i;
        }
    }
    // x is now the length of x,y and [x2*xsign, y2*xsign] is the rotated vector
    return [x, x2*xsign, y2*xsign];
}

tweaked CORDIC function to rotate an auxiliary vector

This is exactly what we need to compute lighting: if we feed in the lighting direction unit vector and the surface normal, the x coordinate of the green line is the cosine of the angle between them; we've managed to do a vector dot product without any multiplications as a side effect while computing a distance.

How do we know what the surface normal is? Well, it turns out that when the ray-marching converges, the points we're taking lengths of are relative to the center ring of the torus, so they point in the direction of the surface normal by definition.

Here's a little animation illustrating the process of finding the closest point on a torus: first, you find the point in the center ring of the torus closest to your point, which is equivalent to projecting your point onto the torus's x-z plane and scaling the distance to r_1. This new point is the center of a ring of radius r_2 in the plane defined by that point, the original point, and the center of the torus. Then, do another projection of the original point in that plane out to a distance of r_2.

float distance_to_torus(float x, float y, float z, float r1, float r2) {
  // First, we find the distance from point p to a circle of radius r1 on the x-z plane
  float dist_to_r1_circle = length(x, z) - r1;
  // Then, find the distance in the cross section of the torus from the center ring to point p
  return length(dist_to_r1_circle, y) - r2;
}

That last projection line drawn to calculate the final distance between the torus and the point is exactly the surface normal.

It is this two-2D length property of toruses that makes this whole technique possible; I haven't been able to come up with a similar way to render a sphere for instance, which you would think is easier!

Ray-marched donut.c

With this technique in hand, I came up with yet another donut.c (source here), shifts and adds only. It's not very compact as it takes a lot more C code, but it takes a lot fewer logic gates. Soon after I tweeted it out, Bruno Levy replied with an interesting discovery: if you restrict the number of CORDIC iterations, the donut renders with facets instead of smoothly!

r1 iterations: 8 r2 iterations: 8

In the final ASIC version, I used 3 r_1 iterations and 2 for r_2, because that was as much combinational logic as I could unroll in a single cycle at 48MHz and still meet setup/hold time requirements. That's why the rendering looks the way it does.

Compound rotation without multiplication

The rotation of the object as a whole is implemented by tracking eight variables: sin/cos of angle A, sin/cos of angle B, and four specific products thereof. In the C code, this is done towards the bottom with a rotation macro R():

// HAKMEM 149 - Minsky circle algorithm
// Rotates around a point "near" the origin, without losing magnitude
// over long periods of time, as long as there are enough bits of precision in x
// and y. I use 14 bits here. Cheap way to compute approximate sines/cosines.
#define R(s,x,y) x-=(y>>s); y+=(x>>s)

    R(5, cA, sA);
    R(5, cAsB, sAsB);
    R(5, cAcB, sAcB);
    R(6, cB, sB);
    R(6, cAcB, cAsB);
    R(6, sAcB, sAsB);

These are all the products necessary to reconstruct the camera origin and ray directions without explicitly constructing a matrix or direction vector.

Unfortunately, this can drift over time1 resulting in a very skewed looking render, so there's a hack to reinitialize all of these to their zero-rotation values when a full cycle through A and B complete.

TinyTapeout considerations

TinyTapeout recommended using up to 50MHz for the clock, so I went with 48MHz (the default clock of my FPGA dev board), giving me about 1220 clocks per visible horizontal scanline. By cutting back on precision, I was able to stuff all of the CORDIC iterations into one cycle, but ray-marching needs to iterate until convergence, and any fewer than 8 iterations looked pretty bad, so I chose to iterate 8 times, reading the output of the hardware ray marching unit and loading new ray directions in every 8 clocks. But in order to keep it from looking too blocky, the horizontal offset is dithered both in space and time, which is what gives the glitchy/noisy appearance in the video.

Video output is provided through the Tiny VGA Pmod which has only two bits per RGB channel. To deal with this I chose to do ordered dithering, using 6 bits of internal color depth, and this turns out to be very easy to do in Verilog. This generates a dithering matrix which alternates every frame, yielding temporal dithering as well.

// Bayer dithering
// this is a 8x4 Bayer matrix which gets flipped every frame
wire [2:0] bayer_i = h_count[2:0] ^ {3{frame[0]}};
wire [1:0] bayer_j = v_count[1:0];
wire [2:0] bayer_x = {bayer_i[2], bayer_i[1]^bayer_j[1], bayer_i[0]^bayer_j[0]};
wire [4:0] bayer = {bayer_x[0], bayer_i[0], bayer_x[1], bayer_i[1], bayer_x[2]};

This generates a matrix that looks like this, and mirrors horizontally on alternating frames:

\text{bayer}_{[4:3],[2:0]} = \left[ \begin{matrix} 0& 24& 6& 30& 1& 25& 7& 31 \\ 16& 8& 22& 14& 17& 9& 23& 15 \\ 4& 28& 2& 26& 5& 29& 3& 27 \\ 20& 12& 18& 10& 21& 13& 19& 11 \end{matrix} \right]

This is applied to the 6-bit color like so2:

// output dithered 2 bit color from 6 bit color and 5 bit Bayer matrix
function [1:0] dither2;
    input [5:0] color6;
    input [4:0] bayer5;
    begin
        dither2 = ({1'b0, color6} + {2'b0, bayer5} + color6[0] + color6[5] + color6[5:1]) >> 5;
    end
endfunction

wire [1:0] rdither = dither2(r, bayer);
wire [1:0] gdither = dither2(g, bayer);
wire [1:0] bdither = dither2(b, bayer);

Avoiding the last multiplication

Another challenge was stepping a 3D unit vector along a specified distance in order to do the ray-marching. Ideally this would have 3 parallel 14-bit multiplies, but I was running out of gate area, and didn't want to unroll it over multiple clock cycles. Seeing as the incoming distance was approximate anyway, I opted for an approximate multiplier which finds the leading 1 bit and returns a shifted copy of the unit vector as a step direction.

Wrapping up

From obfuscated C code printing ASCII art, to bitwise operations rendering 3D graphics, and finally to a tiny slice of silicon racing an electron beam, this project shows how the same basic idea can be reimagined and refined. What started as a fun programming puzzle led to discovering an elegant way to render 3D graphics with minimal computational resources. While the original donut.c was an exercise in creative constraint — fitting a 3D renderer into as little code as possible — this hardware implementation pushed different limits: how few gates, how few operations per pixel could still produce a recognizable rotating donut?

The polygonal appearance that emerged from limiting CORDIC iterations was a happy accident, a reminder that sometimes the most interesting results come from working within tight constraints.

Tiny Tapeout 8 is estimated to deliver chips in May 2025; it is currently January and I cannot wait. This chip also hosts the inaugural Tiny Tapeout Demoscene Competition; it may be the coolest ASIC ever made. I also have a democompo entry which I will write about once judging is done.

Even though the HAKMEM 149 algorithm can rotate a point in a circle without ever losing magnitude, the products of sines and cosines together are being rotated in a Lissajous pattern instead of a circle, so they do lose magnitude. This eventually results in a very broken rotation matrix without explicit resets. ↩
The reason for the various additions and mixing of bayer and color is because to cover the full gamut of full black to full white, there are actually 33 different dithering levels and 2 bits of color to dither into, but only 64 different 6-bit colors in our input. This means we have 66 combinations of dithered values (33 with the high bit on and 33 more with the high bit off) -- I had to find some way to squeeze out two of the middle values while preserving full black and full white. ↩

https://www.a1k0n.net/2025/01/10/tiny-tapeout-donut.html

Fast indoor 2D localization using ceiling lights

Jan 22, 2021 Updated Jan 22, 2021

Show full content

My DIYRobocars autonomous RC car has come a long way since my last post on the subject. I want to highlight a localization algorithm I came up with which works really well, at least in this specific setting. I've been using it in races for a while and it really stepped up the speeds I was able to achieve without the car getting "lost" -- about 22mph on the front straight of this small track. It makes use of a fisheye camera looking towards the ceiling, and a localization update runs in about 1ms at 640x480 on a Raspberry Pi 3, with precision on the order of a few centimeters.

Here is a recent race against another car which is about as fast, except it uses LIDAR to localize using the traffic cones as landmarks. (My car previously localized using the cones as well, except using a monocular camera, but the ceiling lights are much better when we have them. Cone-based localization deserves a blog post of its own.) It's a three lap head-to-head race where there's an extra random cone on the track. We haven't developed passing strategies yet though, so there are frequent racing incidents...

DIYRobocars race on March 7, 2020. We had a little accident but my car (blue) somehow made it through unscathed... I finished my three laps just to record a time for that heat.

I put a fisheye lens Raspberry Pi camera out the front windshield of a 1/10 scale touring car. It's mounted sideways because it has a wider field of view horizontally, and I wanted more pixels looking towards the front so I could see a bit of the ground ahead. (Full hardware details are on the github repo.) Here is what it looks like without the lid:

RC car hardware

I record telemetry and video in each race, which I can then replay through a Dear ImGui-based tool:

In the lower-right corner of the video is a virtual view of the car's position and speed, entirely derived from the ceiling light localization algorithm.

With the fisheye lens, it can see a lot of ceiling lights from any point on the track, and they're all set up in a regular grid with one or two exceptions. The algorithm is robust to extra lights being added, lights being burned out, and other cars flying over the top of the camera (watch closely at 0:07), without losing tracking.

Since it's hard to tell what's going on from the raw, sideways fisheye camera, the tool has an option to use cv2.undistort and remap the image as if it were looking straight out the front windshield:

The basic assumptions: ground vehicle localization under a drop ceiling

To make this practical at high FPS on a Raspberry Pi, I chose to make some simplifying assumptions: assume all the ceiling lights are point sources on a perfectly even grid, and that the camera moves along the floor at a fixed height. This turned out to be good enough to work in practice. The method doesn't necessarily require an even grid, it just needs to be able to determine the distance to the nearest landmark from any point. In a grid, that's trivial, but it's not hard to use a map of the ceiling layout instead.

I also assumed I knew the grid size ahead of time, since I can just measure the ceiling tile sizes and count the spaces between lights, and also measure the height of the ceiling from the camera.

Further, the setting allows us to initialize the algorithm approximately at a known location, namely the starting line. If the pattern of lights is perfectly regular, there's no way to tell which grid cell you started in, so we have to assume we're at the one containing the start of the race when we reset our state.

Camera model

undistorting fisheye

The first step is to deal with the lens distortion. This is particularly obvious when using a fisheye lens, but every lens should be calibrated. OpenCV provides a module for calibrating converting from distorted/undistorted fisheye coordinates.

I won't go too much into camera calibration here, but what I do is print this ArUco calibration target, take still photos of it from several angles, and use this script to get the camera intrinsic parameters.

With those parameters, you can use several OpenCV utilities to do things like undistort images (shown above) or compute ideal pinhole camera ray vectors from each pixel with cv2.undistortPoints.

In my tracking code, the initialization step computes a lookup table with the supplied camera parameters. For each pixel, the lookup table has a \left[x, y, 1\right] vector which forms a ray in 3D space starting at the camera and heading out towards the ceiling. There is also a circular mask to filter out pixels pointing too far away from the ceiling.

Note that the tracking code does not use cv2.undistortImage or equivalent. The illustration above shows what undistorting the whole image does: objects far away from the camera with relatively small numbers of pixels in the source image take up a large number of blurry pixels in the output, and the objects directly in front taking up most of the image are a small portion of the output. What I want to do is give equal weight to each pixel in the source image, and less weight to potential outliers. That way, the lights directly above the camera are mostly what it will try to track on, which gives better accuracy.

Once the lookup table is available, the problem reduces to matching a 2D grid to all the corrected pixel locations for all pixels within the mask which look like ceiling lights, meaning the grayscale pixel is brighter than some threshold. The mask is necessary to deal with reflections from the car's body and light from nearby windows -- it makes sure we're only looking upwards for ceiling lights.

Least squares grid alignment

aligning a grid to ceiling lights
Grid alignment objective. Blue dots: initial uninitialized configuration [u, v, \theta] = (0, 0, 0); orange dots: fit after five iterations of the Levenberg-Marquardt algorithm described below. Left: the grid dots are transformed back into the fisheye distortion here to show the fit on the original image. Right: the fit in undistorted space.

We can solve this with the same hammer we pound on all the other nails with in computer vision: nonlinear least squares via Levenberg-Marquardt, which is a fancy name for the ordinary least squares algorithm with a simple tweak to make sure it converges.

The vehicle state is defined as [u, v, \theta] where u, v are the position within the grid along the x and y axes respectively, and \theta is the heading angle of the vehicle. We'll also define x_{spc} and y_{spc} as the grid spacing in each axis.

Conceptually, if we superimpose a grid with that spacing on the ceiling, rotate it by -\theta around the center and then translate it by [-u, -v], the grid should line up with the pixels in our image which are above the "white" threshold.

To do the optimization, we start with an initial guess for [u, v, \theta] and repeat two steps: First, find the nearest grid point to every transformed white pixel, and then solve for an updated [u, v, \theta] to minimize the residuals. Solving for [u, v, \theta] may cause some pixels to match different grid points, so repeating the two steps a few times is usually necessary, otherwise it would be solved in closed form. In practice, once the algorithm is "synced up", only two iterations are really necessary to keep up with tracking.

Let's define a residual for every undistorted white pixel [x_i, y_i] measuring the distance to the closest grid point [g_{xi}, g_{yi}]:

\mathbf{r}_i = \begin{bmatrix}\cos{\theta} & \sin{\theta}\\-\sin{\theta} & \cos{\theta}\end{bmatrix} \begin{bmatrix}x_i \\ y_i \end{bmatrix} - \begin{bmatrix}u \\ v \end{bmatrix} - \begin{bmatrix}g_{xi} \\ g_{yi} \end{bmatrix}

This says: Rotate each pixel [x_i, y_i] by the car's heading angle \theta, subtract the car's position [u, v], and compare it to the grid point we think it should be on [g_{xi}, g_{yi}]. We want to minimize the sum of the squares of all residuals, \sum_{i} \mathbf{r}_{i}^\top \mathbf{r}_i.

(Note: [x_i, y_i] here are the pixels after undoing the camera transform / distortion, so they are not pixel coordinates but rather normalized coordinates. You can think of it as a ray shooting from the camera intersecting a plane 1 unit in front of the camera; (1, 0) would thus be a ray 45-degree angle to the right of center shooting out of the camera.)

If we're fitting to a grid, then [g_{xi}, g_{yi}] can be found simply by taking the transformed point modulo the grid spacing x_{spc}, y_{spc}. In principle, any pattern of landmarks, not just grids, can be used here.

Here's my Python prototype implementation that finds the nearest grid point:

def moddist(x, q):
    return (x+q/2)%q - q/2

def nearest_grid(xi, yi, xspc, yspc):
    gxi = moddist(xi, xspc)
    gyi = moddist(yi, yspc)
    return gxi, gyi

To do the least-squares minimization, we need to compute the Jacobian (a fancy word for the derivative of a vector with respect to a vector) of the residuals, which is simple enough. I used SymPy in a Jupyter notebook:

from sympy import *
init_printing()

x, y, gx, gy = symbols("x_i y_i g_x_i g_y_i")
u, v, theta = symbols("u v theta")
S, C = sin(theta), cos(theta)
R  = Matrix([[C, -S], [S, C]]).T
uv = Matrix([[u], [v]])
xy = Matrix([[x], [y]])
gxy = Matrix([[gx], [gy]])
r = R*xy - uv - gxy
r

\left[\begin{matrix}- g_{x i} - u + x_{i} \cos{\left(\theta \right)} + y_{i} \sin{\left(\theta \right)}\\- g_{y i} - v - x_{i} \sin{\left(\theta \right)} + y_{i} \cos{\left(\theta \right)}\end{matrix}\right]

J = r.jacobian(Matrix([[u, v, theta]]))
J

\left[\begin{matrix}-1 & 0 & - x_{i} \sin{\left(\theta \right)} + y_{i} \cos{\left(\theta \right)}\\0 & -1 & - x_{i} \cos{\left(\theta \right)} - y_{i} \sin{\left(\theta \right)}\end{matrix}\right]

Now, to actually compute our least squares update, we need to compute \delta = \left(\sum_i \mathbf{J}_i^\top \mathbf{J}_i + \lambda \mathbf{I} \right)^{-1} \left(\sum_i \mathbf{J}_i^\top \mathbf{r}_i\right) . Looks complicated, but what it means is we need to add a up contribution for each pixel into a symmetric matrix and a vector, and then solve for an update to [u, v, \theta]. \lambda is effectively a hyperparameter of the algorithm which keeps the matrix from becoming singular. I just set it to 1 and never really needed to tune it afterward. (I tried setting it to 0, making this the Gauss-Newton iteration, but that will diverge if there aren't enough observations, like when something covers the camera for exmaple).

Our Jacobian is pretty sparse, though, so let's see what those terms look like:

simplify(J.T*J)

\left[\begin{matrix}1 & 0 & x_{i} \sin{\left(\theta \right)} - y_{i} \cos{\left(\theta \right)}\\0 & 1 & x_{i} \cos{\left(\theta \right)} + y_{i} \sin{\left(\theta \right)}\\x_{i} \sin{\left(\theta \right)} - y_{i} \cos{\left(\theta \right)} & x_{i} \cos{\left(\theta \right)} + y_{i} \sin{\left(\theta \right)} & x_{i}^{2} + y_{i}^{2}\end{matrix}\right]

SymPy's simplify was smart enough to reduce the bottom right entry which was the length of a vector rotated about \theta to just the length of the vector, as rotation wouldn't affect the length. So to compute this, we need to sum up rotated versions of each white pixel and the squared distance from the center.

Now, that was the Jacobian for just one pixel. For convenience, let's define a couple auxiliary variables to clean that up a bit:

\begin{bmatrix}\delta R x_i \\ \delta R y_i\end{bmatrix} = \begin{bmatrix} x_{i} \sin{\left(\theta \right)} - y_{i} \cos{\left(\theta \right)}\\ x_{i} \cos{\left(\theta \right)} + y_{i} \sin{\left(\theta \right)} \end{bmatrix}

These are the partial derivatives of [x_i, y_i] rotated by \theta.

If we add up all the pieces we need, our final left-hand term is:

\sum_{i=1}^N \mathbf{J}_i^\top \mathbf{J} + \lambda \mathbf{I} = \begin{bmatrix} N + \lambda & 0 & \sum_i \delta R x_i \\ 0 & N + \lambda & \sum_i \delta R y_i \\ \sum_i \delta R x_i & \sum_i \delta R y_i & \sum_i \left(x_i^2 + y_i^2\right) + \lambda \end{bmatrix}

In the end, to compute this, we only need to calculate the three running sums in the right column (or bottom row) of the matrix, and to count up the number of white pixels N.

Now, the right hand side is a bit messier, so let's assume we've already computed the residuals (the distance of each white pixel from the grid point it belongs to):

\mathbf{r}_i = \left[\begin{matrix}\Delta x \\ \Delta y\end{matrix}\right] = \left[\begin{matrix}- g_{x i} - u + x_{i} \cos{\left(\theta \right)} + y_{i} \sin{\left(\theta \right)}\\- g_{y i} - v - x_{i} \sin{\left(\theta \right)} + y_{i} \cos{\left(\theta \right)}\end{matrix}\right]

We'll also assume we computed \delta R x_i and \delta R y_i and swap in our simpler Jacobian for J.

dx, dy = symbols("\Delta\\ x_i \Delta\\ y_i")
dRx, dRy = symbols("\delta\\ Rx_i \delta\\ Ry_i")
J_simple = Matrix([[-1, 0, -dRx], [0, -1, -dRy]])
r_simple = Matrix([[dx], [dy]])
J_simple.T*r_simple

\left[\begin{matrix}- \Delta x_{i}\\- \Delta y_{i}\\- \Delta x_{i} \delta Rx_{i} - \Delta y_{i} \delta Ry_{i}\end{matrix}\right]

Again, this is the contribution from a single pixel; to get the final term we need to solve it, we have to sum up \mathbf{J}^\top\mathbf{r}:

\sum_i \mathbf{J}_i^\top\mathbf{r}_i = \begin{bmatrix} -\sum_i \Delta x_i \\ -\sum_i \Delta y_i \\ -\sum_i \left(\Delta x_i \delta R x_i + \Delta y_i \delta R y_i\right) \end{bmatrix}

If you're actually following along this far, I salute you! We're almost done deriving what we need to get a working algorithm.

Adding it all up

To recap, we're starting from an initial guess of our grid alignment [u, v, \theta], and we're doing a nonlinear least squares update to improve our fit. The things we need to compute to solve it are a few different sums, added up over each pixel [x_i, y_i] brighter than the threshold to consider it part of a ceiling light. They are:

\begin{align} \Sigma_1 & = \sum_i \left(x_i \sin{\theta} - y_i \cos{\theta} \right) = \sum_i \delta R x_i \\ \Sigma_2 & = \sum_i \left(x_i \cos{\theta} + y_i \sin{\theta} \right) = \sum_i \delta R y_i \\ \Sigma_3 & = \sum_i \left(x_i^2 + y_i^2 \right) \\ \Sigma_4 & = \sum_i \Delta x_i \\ \Sigma_5 & = \sum_i \Delta y_i \\ \Sigma_6 & = \sum_i \left(\Delta x_i \delta R x_i + \Delta y_i \delta R y_i\right) \end{align}

These are just six floats we add up while looping over the image pixels, and then we construct and solve this system of equations for our update:

\begin{bmatrix}u' \\ v' \\ \theta'\end{bmatrix} = \begin{bmatrix}u \\ v \\ \theta\end{bmatrix} - \begin{bmatrix} N + \lambda & 0 & \Sigma_1 \\ 0 & N + \lambda & \Sigma_2 \\ \Sigma_1 & \Sigma_2 & \Sigma_3 + \lambda \end{bmatrix}^{-1} \begin{bmatrix} -\Sigma_4 \\ -\Sigma_5 \\ -\Sigma_6 \end{bmatrix}

Yes, those minus signs are redundant, but I'm keeping it in the original Levenberg-Marquardt form

Solving the system

It would be relatively easy and fast at this point to use a linear algebra package to solve this. My Python prototype implementation just constructed these matrices and called np.linalg.solve here, which is definitely the way to go in Python.

But when porting it to C++, I considered using Eigen before wondering whether it was even worth using for a 3x3 matrix solve. What would happen, I wondered, if I just had SymPy spit out a solution?

# redefine our JTJ, JTr matrices in terms of the sums defined above
N, S1, S2, S3, S4, S5, S6 = symbols(
    "N Sigma1 Sigma2 Sigma3 Sigma4 Sigma5 Sigma6")
# Note: we'll add lambda to N and S3 before computing the solution,
# as it simplifies the following algebra a lot.
JTJ = Matrix([[N, 0, S1], [0, N, S2], [S1, S2, S3]])
JTr = Matrix([[-S4], [-S5], [-S6]])
JTr = Matrix([[-S4], [-S5], [-S6]])
simplify(JTJ.inv() * JTr)

\left[\begin{matrix}\frac{- N \Sigma_{1} \Sigma_{6} + \Sigma_{1} \Sigma_{2} \Sigma_{5} + \Sigma_{4} \left(N \Sigma_{3} - \Sigma_{2}^{2}\right)}{N \left(- N \Sigma_{3} + \Sigma_{1}^{2} + \Sigma_{2}^{2}\right)}\\\frac{- N \Sigma_{2} \Sigma_{6} + \Sigma_{1} \Sigma_{2} \Sigma_{4} + \Sigma_{5} \left(N \Sigma_{3} - \Sigma_{1}^{2}\right)}{N \left(- N \Sigma_{3} + \Sigma_{1}^{2} + \Sigma_{2}^{2}\right)}\\\frac{N \Sigma_{6} - \Sigma_{1} \Sigma_{4} - \Sigma_{2} \Sigma_{5}}{- N \Sigma_{3} + \Sigma_{1}^{2} + \Sigma_{2}^{2}}\end{matrix}\right]

Now I know what you're thinking: "you're kidding, right?" Are we going to type all that in instead of just using a matrix solver?

Let me show the hidden superpower of SymPy: cse, the common subexpression elimination function. SymPy can also print C code from any expression. cse returns a list of temporary variables and a list of expressions in terms of those variables, generating a very efficient routine for computing ridiculously complex expressions. Check this out:

ts, es = cse(JTJ.inv() * JTr)
for t in ts:  # output each temporary variable name and C expression
    print('float', t[0], '=', ccode(t[1]) + ';')
print('')
for i, e in enumerate(es[0]):  # output C expressions
    print(["u", "v", "theta"][i], '-=', ccode(e) + ';')

float x0 = pow(Sigma1, 2);
float x1 = -x0;
float x2 = N*Sigma3;
float x3 = pow(Sigma2, 2);
float x4 = x2 - x3;
float x5 = 1.0/(x1 + x4);
float x6 = Sigma6*x5;
float x7 = 1.0/(pow(N, 2)*Sigma3 - N*x0 - N*x3);
float x8 = Sigma5*x7;
float x9 = Sigma1*Sigma2;
float x10 = Sigma4*x7;

u -= Sigma1*x6 - x10*x4 - x8*x9;
v -= Sigma2*x6 - x10*x9 - x8*(x1 + x2);
theta -= -N*x6 + Sigma1*Sigma4*x5 + Sigma2*Sigma5*x5;

Once we've summed up our Sigma1..Sigma6 variables, the above snippet computes the amount to subtract from u, v, and theta to get the new least squares solution, without needing to call out to a matrix library. It's really not much code. (I would personally change all pow(x, 2) into x*x first, though).

One thing I glossed over: I took \lambda out temporarily from the derivation to simplify, so to put it back in you need to do N += lambda; Sigma3 += lambda; before running the above code.

The full algorithm

Let's put it all together now for the full algorithm.

Initialization

First, calibrate the camera and compute a lookup table equivalent to doing cv2.fisheye.undistort() on all pixels in the image. OpenCV wants sort of a weird matrix shape, so I use something like:

K = np.load("camera_matrix.npy")
dist = np.load("dist_coeffs.npy")

# I calibrated in 2592x1944, but use 640x480 on the car, so compensate
# by dividing the focal lengths and optical centers
K[:2] /= 4.05

# generate a (480, 640, 2)-shaped matrix where each element contains
# its own x and y coordinate
uv = np.mgrid[:480, :640][[1, 0]].transpose(1, 2, 0).astype(np.float32)
pts = cv2.fisheye.undistortPoints(uv, K=K, D=dist)

I further rotate and renormalize pts to compensate for the tilt of the camera, and then export it as a lookup table; the C++ code then loads it on initialization. Let's call it float cameraLUT[640*480][2].

We'll also initialize float u=0, v=0, theta=0;

Update from camera image

Step by step, the update algorithm is:

Get the grayscale image -- I have the camera sending native YUV420 frames, so I just take the Y channel here.
Clear a vector of white pixel undistorted ray vectors xybuf
For every pixel in the image > the white threshold:
- Look up cameraLUT for that pixel and push it onto xybuf
For each solver iteration (I use two iterations):
- Initialize float accumulators Sigma1..Sigma6 (defined above) to 0
- Precompute \sin \theta and \cos \theta -- constant for all pixels
- For each pixel x, y in xybuf:
  - Compute the distance to the nearest landmark \Delta x_i, \Delta y_i, using e.g. moddist for grids, as well as \delta R x_i, \delta R y_i, etc.
  - Add the pixel's contribution to Sigma1..Sigma6, defined above in the "Adding it all up" section
- Compute updates to u, v, theta defined above using the length of xybuf for N.

The actual code running on the car for this is here but it's a bit cluttered because there are two SIMD versions (SSE and NEON) and a bunch of other optimizations and hacks, and the variable names are a bit different than described in this article, but it should be recognizable from this description.

Efficiently implemented, the most expensive part of this algorithm is the branch misprediction penalty it incurs when thresholding white pixels and adding to xybuf.

Further applications

Making a map of the floor by tracking the motion of the ceiling

In order to actually use this for racing, I need to get a map of the track with respect to the ceiling. This is why I have the camera partially looking towards the floor in front of the car. Here, two masks are defined for the image: one for ceiling-facing pixels, and one for pixels intersecting the floor within a range of distances (not too close, or it will see the hood, and not too far, or it will just be a blur). Using the position and angle of the tracking result, the floor pixels are reprojected to a bird's eye view as I drive the car around manually along the track.

The tracking isn't perfect, and the geometry I'm using to project the floor in front of the car isn't perfect, so there's a little wonkiness in the result, but it's good enough to use as a map.

Track planner GUI. Cyan circles are ceiling light locations; blue dots are centers of curvature, orange dots are cone locations. The magenta line is a minimum curvature line, but not one used for racing anymore.

I then import this into another tool to define a racetrack using the ceiling-derived map, run an optimization procedure which computes the best racing line from any point, and send the optimized map to the car. After that, it will drive autonomously.

Conclusion

This ceiling light tracking algorithm works so well I consider localization on this track a solved problem and have been focusing on other areas, like reinforcement learning based algorithms for autonomous racing.

Unfortunately, Circuit Launch, our regular racing venue, has torn out the entire ceiling in a recent renovation, so once we can start racing again we'll have to see whether the perfect grid assumption still holds, or if it needs to be extended with an actual ceiling map. Nevertheless, I'm pretty confident that this is the best way to do 2D indoor localization.

https://www.a1k0n.net/2021/01/22/indoor-localization.html

donut.c without a math library

Jan 13, 2021 Updated Jan 13, 2021

Show full content

My little donut.c has been making the rounds again, after being featured in a couple YouTube videos (e.g., Lex Fridman and Joma Tech). If I had known how much attention this code would get over the years, I would have spent more time on it.

One thing that's always been sort of unfortunate is the heavy use of sin and cos -- both because it necessitates linking the math library (-lm), but also because it makes it much more CPU-intensive than it really needs to be. This is especially apparent if you try to port it to an older CPU or an embedded device.

So, here's a revised version with no use of sin, cos, and no need for linking the math library (though this version still does use float types).

             i,j,k,x,y,o,N;
         main(){float z[1760],a
      #define R(t,x,y) f=x;x-=t*y\
   ;y+=t*f;f=(3-x*x-y*y)/2;x*=f;y*=f;
   =0,e=1,c=1,d=0,f,g,h,G,H,A,t,D;char
 b[1760];for(;;){memset(b,32,1760);g=0,
h=1;memset(z,0,7040);for(j=0;j<90;j++){
G=0,H=1;for(i=0;i<314;i++){A=h+2,D=1/(G*
A*a+g*e+5);t=G*A        *e-g*a;x=40+30*D
*(H*A*d-t*c);y=          12+15*D*(H*A*c+
t*d);o=x+80*y;N          =8*((g*a-G*h*e)
*d-G*h*a-g*e-H*h        *c);if(22>y&&y>
 0&&x>0&&80>x&&D>z[o]){z[o]=D;b[o]=(N>0
  ?N:0)[".,-~:;=!*#$@"];}R(.02,H,G);}R(
  .07,h,g);}for(k=0;1761>k;k++)putchar
   (k%80?b[k]:10);R(.04,e,a);R(.02,d,
     c);usleep(15000);printf('\n'+(
        " donut.c! \x1b[23A"));}}
          /*no math lib needed
             .@a1k0n 2021.*/

It's a little misshapen and still has comments at the bottom. I used the first frame of its output as a template and there's slightly less code than filled pixels -- oh well. Output is pretty much the same as before:

toggle animation

Defining a rotation

So, how do we get sines and cosines without using sin and cos? Well, the code doesn't really need sine and cosine per se; what it actually does is rotate a point around the origin in two nested loops, and also rotate two angles just for the animation. If you'll recall from the other article, the inner loop is just plotting dots in a circle, which goes around another, larger circle. In each loop, the sine/cosine terms are just moving by a small, fixed angle.

So we don't need to track the angle at all, we only need to start at cos=1, sin=0 and rotate a circle around the origin to generate all the sines and cosines we need. We just have to repeatedly apply a fixed rotation matrix:

\begin{bmatrix} c' \\ s' \end{bmatrix} = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \begin{bmatrix} c \\ s \end{bmatrix}

So for example, if we were to use an angle of .02 radians in our inner loop, it would look something like:

float c=1, s=0;  // c for cos, s for sin
for (int i = 0; i < 314; i++) {  // 314 * .02 ~= 2π
  // (use c, s in code)
  float newc = 0.9998*c - 0.019998666*s;
  s = 0.019998666*c + 0.9998*s;
  c = newc;
}

Renormalizing

That works, but there's a problem: no matter how precisely we define our constants, after repeated iteration of this procedure, the magnitude of our \left(c, s\right) vector will exponentially grow or shrink over time. If we only need to make one pass around the loop, maybe we can get away with that, but if we have to make several (for the rotating animation, we do), we need to fix that.

sine and cosine magnitude creep
an exaggerated illustration of what happens when repeatedly doing low-precision rotations

The simplest way to do that would be to multiply c and s by 1/\sqrt{c^2 + s^2} , but then we're back to using the math library again. Instead, we can take advantage of the fact that our magnitude starts out very close to 1, and we're going to be iterating this procedure: we can do a Newton step after each rotation, and that will be enough to keep the magnitude "close enough" to 1 over time.

Our goal is to find the reciprocal square root (sound familiar?) of a = c^2 + s^2, our \left(c, s\right) vector magnitude. Say we define a function f(x) = \frac{1}{x^2} - a. The function is 0 when x = \frac{1}{\sqrt{a}}. We can start with an initial guess of 1 for x, perform a Newton iteration to obtain x', which will be "closer to" \frac{1}{\sqrt{a}}, the correct value to scale c and s by so that their magnitude c^2 + s^2 is "close to" 1 again.

A Newton step is defined as x' = x - \frac{f(x)}{f'(x)}. I used SymPy to do the derivative and simplification and came up with x' = \frac{x\left(3 - a x^2\right)}{2}. Since we're only doing one step, we can plug in our initial guess of 1 for x and back-substitute c^2 + s^2 for a to finally get our adjustment: x' = (3 - c^2 - s^2)/2 .

Further simplifying the rotation

But now that we don't have to worry so much about the magnitude of our result (within limits), we can take another shortcut (I got this idea studying old CORDIC algorithms). If we divide out the cosines from our original rotation matrix, we get

\begin{bmatrix} c' \\ s' \end{bmatrix} = \frac{1}{\cos \theta}\begin{bmatrix} 1 & -\tan \theta \\ \tan \theta & 1 \end{bmatrix} \begin{bmatrix} c \\ s \end{bmatrix}

using the trig identity \tan \theta = \frac{\sin \theta}{\cos \theta}. Since we're only dealing with small angles, the leading \frac{1}{\cos \theta} term is close enough to 1 that we can ignore it and have our Newton step take care of it.

And now we can finally understand how the rotation is done in the code. Towards the top of the donut code is this #define, which I've reindented:

#define R(t,x,y) \ 
  f = x; \
  x -= t*y; \
  y += t*f; \
  f = (3-x*x-y*y)/2; \
  x *= f; \
  y *= f;

This does an in-place rotation of a unit vector x, y where t is \tan \theta. f is a temporary variable; the first three lines do the "matrix multiplication" on x, y. f is then re-used to get the magnitude adjustment, and then finally x and y are multiplied by f which moves them back onto the unit circle.

With that operation in hand, I just replaced all the angles with their sines and cosines and ran the rotation operator R() instead of calling sin/cos. The code is otherwise identical.

Getting rid of float, too

We can use exactly the same ideas with integer fixed-point arithmetic, and not use any float math whatsoever. I've redone all the math with 10-bit precision and produced the following C code which runs well on embedded devices which can do 32-bit multiplications and have ~4k of available RAM:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define R(mul,shift,x,y) \
  _=x; \
  x -= mul*y>>shift; \
  y += mul*_>>shift; \
  _ = 3145728-x*x-y*y>>11; \
  x = x*_>>10; \
  y = y*_>>10;

int8_t b[1760], z[1760];

void main() {
  int sA=1024,cA=0,sB=1024,cB=0,_;
  for (;;) {
    memset(b, 32, 1760);  // text buffer
    memset(z, 127, 1760);   // z buffer
    int sj=0, cj=1024;
    for (int j = 0; j < 90; j++) {
      int si = 0, ci = 1024;  // sine and cosine of angle i
      for (int i = 0; i < 324; i++) {
        int R1 = 1, R2 = 2048, K2 = 5120*1024;

        int x0 = R1*cj + R2,
            x1 = ci*x0 >> 10,
            x2 = cA*sj >> 10,
            x3 = si*x0 >> 10,
            x4 = R1*x2 - (sA*x3 >> 10),
            x5 = sA*sj >> 10,
            x6 = K2 + R1*1024*x5 + cA*x3,
            x7 = cj*si >> 10,
            x = 40 + 30*(cB*x1 - sB*x4)/x6,
            y = 12 + 15*(cB*x4 + sB*x1)/x6,
            N = (-cA*x7 - cB*((-sA*x7>>10) + x2) - ci*(cj*sB >> 10) >> 10) - x5 >> 7;

        int o = x + 80 * y;
        int8_t zz = (x6-K2)>>15;
        if (22 > y && y > 0 && x > 0 && 80 > x && zz < z[o]) {
          z[o] = zz;
          b[o] = ".,-~:;=!*#$@"[N > 0 ? N : 0];
        }
        R(5, 8, ci, si)  // rotate i
      }
      R(9, 7, cj, sj)  // rotate j
    }
    for (int k = 0; 1761 > k; k++)
      putchar(k % 80 ? b[k] : 10);
    R(5, 7, cA, sA);
    R(5, 8, cB, sB);
    usleep(15000);
    printf("\x1b[23A");
  }
}

The output is pretty much the same.

https://www.a1k0n.net/2021/01/13/optimizing-donut.html

Fast line-following robots

Nov 13, 2018 Updated Nov 13, 2018

Show full content

Since February 2016, I have been participating in DIYRobocars meetups here in the SF Bay Area, wherein a bunch of amateur roboticists compete in autonomous time trials with RC cars. I had been doing very well in the competition with essentially just a glorified line-following robot, but have lately upgraded to a full SLAM-like approach with a planned trajectory.

In this series of posts I will try to brain-dump everything I've learned in the last two years, and explain how my car works.

Here's a race between my car and @otaviogood's Carputer, which uses an end-to-end behavioral cloning neural network to drive. It had to dodge a lot of other cars in training that day, which we think is why it sort of chokes when my car cuts it off. My car is completely oblivious to other cars; the wheel-to-wheel race here was just for fun.

In the video above you can see everything the car can see/sense: a birds-eye view of the ground, gyroscope, accelerometer, and wheel velocity measurements.

Here's a more recent video showing the current state of the same car, using a slightly different but still fundamentally line-following strategy (instead of following the line on the ground, it's following a pre-optimized racing line):

Line-following theory

Imagine you are a line following robot. Your mission is to move forward and keep the line in the middle of your sensors (maybe they're IR sensors looking at the ground, maybe it's a camera). The line can go straight or it can turn, and so can you. As far as you're concerned, the world consists of only you and the line, so we need to think about your position in space relative to the line.

Curvilinear coordinates

A line-following robot lives in a curvilinear coordinate system: all measurements of position are relative to a line which has some (probably varying) curvature. Therefore instead of saying the robot has an x, y position and maybe an angle θ, I try to follow the notation I've seen in the robotics / automotive control literature.

We assume that the car is traveling along a circle with curvature κ, which is equivalent to 1/radius r, except it can be either positive or negative -- in the convention I'm using, if it's positive, the circle curves to the left and negative curvature goes right (if this seems backwards, think about how angles conventionally run counterclockwise on the plane, as in (x, y) = (cos θ, sin θ)). 0 curvature of course means a perfectly straight line.

At the point on the track centerline closest to the car's position, the forward direction is called x and the direction toward the left is y. (Again if y seems backwards, it's because of the right-hand rule given that x goes forwards. Why is x forwards? Not really sure, that's just how they tend to do it.)

The distance from the car to the nearest point on the center line is called ye, which is positive if the car is on the left side of the center, negative on the right. The heading angle of the car with respect to the centerline is denoted Ψe, which increases as the car turns counterclockwise with respect to the centerline.

Control strategies

The most obvious thing to do is to ignore everything but ye -- if the line is on the left, then turn left, and if it's on the right, turn right. That'll get you started, but it will wander back and forth and won't work at all when you crank up the speed. Here's a little simulation:

Pure Proportional Control

Kp v control: \kappa' = -K_p y_e

This is the first thing I tried, and is about as far as many people get when making line-following robots. But there's a simple tweak that makes it work much better, if you can determine not just the distance to the centerline but also the relative angle of the line Ψe.

Pure Proportional-Differential Control

We need a way to damp out those oscillations, and the classical way to do that is to add a derivative feedback term. We could either numerically differentiate our ye error from the last frame, or we could use the sine of the angle between our heading and our centerline, which is pretty much equivalent (except it doesn't depend on velocity).

Kp Kd v control: \kappa' = -K_p (y_e + K_d \sin \psi_e)

That's a lot better, isn't it? Only it tends to get "surprised" when there's a turn, and it will always overshoot. If we know the curvature of the path we're following, we can just add that in to our control signal.

Curvature-aware Proportional-Differential Control

Kp Kd v control: \kappa' = -K_p (y_e + K_d \sin \psi_e) + \kappa

Even better. It still overshoots a bit, but it can be made to track very accurately. The only real issue with this is the curvature of the track actually depends on the car's y position; the inside of a turn has a higher curvature than the outside.

There's a paper from 1993 that studies this problem in depth that I've found very helpful: Trajectory tracking for unicycle-type and two-steering-wheels mobile robots.1 The authors derive a nice closed-form PD control law (a bit hairier than the above, but not hard to compute) adjusted for curvy target trajectories.

Kp Kd v control: \kappa' = \frac{\cos \psi_e}{1 - \kappa y_e}\left[ -K_p \frac{y_e \cos^2 \psi_e}{1 - \kappa y_e} + \sin \psi_e \left(\kappa \sin \psi_e - K_d \cos \psi_e \right) \right]

If you play around with the constants, you'll see this one gets around the track faster than any of the others above.

Target velocity in a curve

The above give us control targets for curvature, but don't say anything about exactly how fast we should be driving in a turn. In fact the above simulations are just driving the motor at a constant speed, and the simulator understeers in turns (in other words, the front wheels skid, it doesn't actually turn as far as it wants to, and it slows the car down).

Ideally we'd brake for the turns and not understeer in them. But how fast should we go?

This can be very complicated due to tire dynamics and weight transfer, but the simplest thing that works is to shoot for a given lateral acceleration -- the tires will be able to exert a certain amount of g force sideways, and you can measure this by driving the car around in circles and seeing what the accelerometer (or the product of the forward velocity and gyro yaw rate) says.

There's a simple formula for lateral acceleration, and we can solve it for velocity given a maximum lateral acceleration.

a_L = \frac{v^2}{r} = v^2 \kappa v_{Max} = \sqrt{\left|\frac{a_{LMax}}{\kappa}\right|}

There are some minor practicalities to consider here ( \kappa can be near zero) -- I compute \kappa_{min} = a_{Lmax} / v_{max}^2 and set v = v_{max} if \left|\kappa\right| < \kappa_{min} .

Kp Kd aL vmax lookahead

\kappa' = \frac{\cos \psi_e}{1 - \kappa y_e}\left[ -K_p \frac{y_e \cos^2 \psi_e}{1 - \kappa y_e} + \sin \psi_e \left(\kappa \sin \psi_e - K_d \cos \psi_e \right) \right] v = \min\left(\sqrt{\left|\frac{a_L}{\kappa'}\right|}, v_{max}\right)

Now it speeds up for straights and slows down for turns. The only problem is it doesn't have instantly-acting brakes, so it's still surprised by turns; I've added one more parameter which is how far it looks ahead before determining the \kappa to use to compute its velocity -- it will look ahead on the track and take the maximum of its control curvature and the curvature on the track coming ahead to determine its speed limit.

Future work

At this point we're reaching the limits of what we can do with simple PD-type control, and to get really great tracking we need to start doing finite-horizon planning, but that's out of the scope of what I want to cover here. Iterated Linear Quadratic Regulators / Differential Dynamic Programming are what I would use next, but we haven't talked about even making the simple algorithms practical yet!

Finding y, Ψ, κ, and all that

In a real robot, we need to use our sensors to make measurements of the line with respect to our own position / orientation. This can be as simple as some phototransistors pointing at the ground, or a front-facing (preferably wide-angle) camera.

Either way, we can at least indirectly measure these parameters. To track them over time and refine our estimates requires a Kalman filter.

For now, this is a very brief overview of how it was done in the first video above:

Calibrate the camera with OpenCV
Generate a look-up table mapping pixels in our front-facing camera to a virtual birds-eye view
Set the camera to take video in YUV color space (most cameras can do this natively) -- YUV is a perceptual color space, and makes it really easy to find colors like yellow and orange
Remap each image and use a convolutional filter on the virtual birdseye view to find the lane lines (note: not a convolutional neural network or anything like that, just a hand-designed function which "activates" on lane lines and is relatively immune to lighting changes)
Use linear regression to fit a parabola to the lines
Compute ye, Ψe, κ from the equation of the parabola at the point where the fit was best.

Of course, the measurement isn't perfect and if you completely leave the track or even just turn too far to one side, it provides no information on your rapidly increasing ye. This is why I added a Kalman filter to track the line's position relative to the car, and wheel encoders, brushless motor sensor feedback, and/or MEMS gyroscopes can help tremendously with this approach.

Alain Micaelli, Claude Samson. Trajectory tracking for unicycle-type and two-steering-wheels mobile robots. [Research Report] RR-2097, INRIA. 1993. ↩

https://www.a1k0n.net/2018/11/13/fast-line-following.html

3D Rendering on an Arduboy

Feb 1, 2017 Updated Feb 1, 2017

Show full content

The Arduboy is a cute little credit-card sized Arduino-compatible with a 128x64 black and white OLED screen. It has 32kb of flash for program memory and 2560 bytes of RAM.

(N.B.: this is "reblogged" from my post on community.arduboy.com)

Over the Christmas break I put together a little demo which renders the Utah teapot (240 faces) at >30fps on the Arduboy, with 4x4 ordered dithering.

Source code is available on Github as usual.

I started off just rendering an octahedron, as it was simple and the faces were triangles instead of squares (usually I'd use a cube...). I got it working while on the train home and posted a quick video tweet:

Playing with an #arduboy on the train home, got subpixel dithered triangle renderer working pic.twitter.com/n5oYztuSHI
— Andy Sloane (@a1k0n) December 22, 2016

And then I extended it to an icosahedron (in spaaaaace!)...

@a1k0n now with more polygons, PWM dithered pixels, and fixed point math (no floats) pic.twitter.com/F1P00NeDiH
— Andy Sloane (@a1k0n) December 26, 2016

...but then Kevin Bates, the Arduboy creator, pushed me to do more.

@a1k0n I downloaded this and it looks even better in person. What are the odds for a teapot? 😊
— Kevin Bates (@bateskecom) December 22, 2016

That required doing a ton of optimization work, but I'm happy I persevered. In the course of this project, I ended up creating a bespoke statistical profiler which runs in a timer ISR and inspects the call stack, buffers it, and sends it over the USB serial port. I'll get to that later.

Here's how it works.

Polygon rendering

The arduboy screen is a 1024 byte array where each byte corresponds to eight vertically arranged pixels; you can think of it as 8 horizontal "banks" of 8 pixels tall, 128 pixels wide. Here are the addresses and masks of the pixels on the left side of the first two "banks":

To make it fast, the triangle filler works in vertical stripes (instead of the more typical horizontal ones), and the inner loop can plot up to eight pixels at once just by writing a byte to memory and then moving down to the next "bank". It's no extra work to plot a dithered pattern; you just write a different byte value for each horizontal column (0x55, 0xaa, etc for a checkerboard).

slow motion triangle fill, one frame per byte written

Another important property to making this look good is that all screen coordinates are computed with sub-pixel accuracy (it's 4 extra subpixel bits, so the real 128x64 screen becomes a virtual 2048x1024 one). The reason for this is that when plotting lines or triangles, the interior pixels covered by the triangle can change even if the vertices of the triangle map to the same pixels. So slow rotations make subtle changes to the shape of the outline and it really improves the overall aesthetic, which is super important at low resolutions.

left: subpixel accurate rendering; right: vertices clamped to integer coordinates

Compromising on rendering accuracy would speed things up a bit, but I'm unwilling to do that and I found better ways to optimize, though ultimately I did compromise precision in 3D coordinates.

Ordered dithering

I use the same 4x4 matrix as what's on the wikipedia page, but converted into a 17-level lookup table (from pure black to 15 in-between levels to pure white). A few lines of numpy generate this.

3D math

The teapot has 137 vertices and 240 faces. Naïvely doing the rotation, projection and backface sorting on a microcontroller stretches the frame budget, even without rendering polygons.

The original octahedron used the AVR float library for all computations, and that ran pretty fast, but then I got a bit more ambitious and had to switch to fixed point math in order to render a teapot. I first did this in a straightforward way, using a lot of 32-bit integer multiplications instead of float multiplications.

However the profiler showed me I was basically spending all my time in the 32-bit multiply and divide routines.

Division is slow

The AVR does not have any efficient way to perform division -- divides are super expensive. There were two places where division was heavily used: in the triangle filler, to determine line slopes, and in 3D perspective projection.

In the triangle filler, we're making a relatively small adjustment and the quotient shouldn't be very large. I just replaced dy02 /= dx02; fy02 %= dx02; with a subtraction loop:

  // unroll divmod here; this is sadly much faster
  while (fy02 >= dx02) { ++dy02; fy02 -= dx02; }
  while (fy02 <= -dx02) { --dy02; fy02 += dx02; }

For perspective projection, I decided to linearly approximate the division by z. This looks good enough for viewing a single object, but wouldn't hold up to scrutiny if we were rendering a whole room. The computation to perform is to get the screen coordinates xs and ys from the object x, y, and z, plus a distance d between the camera and the object, and a scale factor k to account for the screen size and field of view.

x_s = f(z) = \frac{k x}{d - z}

If we just take a linear Taylor expansion around z = 0 , we end up with:

x_s = f(0) + f'(0)\ z = \left(\frac{k}{d} + \frac{k}{d^2} z\right) x

z, being an eight-bit signed object coordinate, ranges between -127 and +127. A d of 210 (i.e., 1024) gives a nice 3D effect given that z range without looking very distorted, and a scale factor of 220 puts the projection in roughly the correct range for our 1024-subpixel-unit tall display.

With these nice round numbers, it works out to x_s = (1024 + z)\ x , which means we can just re-use that subexpression and multiply for y as well. Which is good, because...

Multiplication is fast

The AVR has a variety of 2-cycle 8x8 -> 16 multiply instructions, and it behooves us to make extensive use of that, rather than letting gcc generate calls to 32-bit multiply routines. Which means we need to reduce our precision, where possible, to 8 bits.

The first step in rendering (DrawObject if you're looking at the code) is to create a rotation matrix; this is done with a 10-bit precision sine lookup table (another neat trick I discovered: 1024 * (sin(x) - x) is < 256 in the first quadrant, so a 10-bit precise sine LUT fits in 8 bits just by adding x back on), and some 32 bit math to construct a rotation matrix once up front, and then that matrix is quantized down to 8 bits. This is only done once per frame, so it's not a big deal.

The object models are also quantized down to 8 bit coordinates with a little python script which takes an .obj format, rescales the model and quantizes to 8 bits, fuses merged vertices, and also rescales / quantizes face normal vectors. Then the rotation is a 3x3 8-bit matrix multiplied with an 8-bit 3-vector; the AVR has a special instruction (fmuls) which will do a 1.7 signed fixed point multiplication which is exactly what I needed, and is available in avr-gcc as __builtin_avr_fmuls().

Then the vertices are projected with the Taylor expansion trick: \frac{2^{20} x}{2^{10} - z} is approximated as x \cdot (2^{10} + z) and it's good enough for a 128x64 screen...

I still need to do 32-bit multiplications to check the face winding order (faces that have clockwise vertices are facing away from the camera, so are not rendered), since the vertices are 16-bits in precision: both because the subpixel screen grid is 2048x1024 but also because some vertices might be off-screen, and we still want to correctly clip the triangles if they are.

Face sorting

The teapot is not a convex object; if you render it without allowing for this, the knob at the top, the handle, or the spout will show through the object like a ghost, unless you either z-buffer (HA! too slow, not enough memory anyway) or draw the faces back-to-front.

I actually implemented a heap sort to do this, which worked perfectly on my computer but would crash on the AVR. Turns out, I'm completely out of memory. Between the frame buffer (1024 bytes), projected vertices (137*2*2 = 548 bytes), stack and other globals, I definitely don't have another 377 bytes left (137 z coordinates, 240 face orders) to dynamically sort.

Instead, I pre-sort the faces, as I'm not hurting for flash space. The model comes with the faces "baked in" sorted by x, and also has auxiliary face orders sorted by y and z. At runtime, I figure out which axis is most facing toward (or away from) the camera, choose the face order from the lookup table (reversed if necessary), and render in that order. It's close enough.

Profiling on Arduboy

I was surprised to find there's no real decent arduino profiler option.

I ended up hacking together a statistical profiler on a timer interrupt which walks the stack:

static const size_t PROFILEBUFSIZ = 32;
static volatile uint8_t profilebuf_[PROFILEBUFSIZ];
static volatile uint8_t profileptr_ = 0;

ISR(TIMER4_OVF_vect) {
  if (profileptr_+2 <= PROFILEBUFSIZ) {
    uint8_t *profdata = profilebuf_ + profileptr_;
    // disassemble, count # pushes in prologue, add 1 to determine SP+x
    uint8_t *spdata = SP+17;  // pointer to data on stack when this ISR was hit
    profdata[1] = spdata[0];  // copy into profiling data buffer
    profdata[0] = spdata[1];
    profileptr_ += 2;
    for (uint8_t j = 2; j < 16 && profileptr_+2 < PROFILEBUFSIZ; j += 2) {
      // walk the stack and see if we have any return addresses available

      // subtract 2 words to get address of potential call instruction (opcode 0x940e);
      // attribute this sample to that instruction
      uint16_t stackvalue = (spdata[j] << 8) + spdata[j+1] - 2;
      if (stackvalue >= 0x4000  // is this a valid flash address?
          || pgm_read_word_near(stackvalue << 1) != 0x940e) {
        break;
      }
      profdata[j] = stackvalue & 255;
      profdata[j+1] = stackvalue >> 8;
      profdata[j-1] |= 0x80;  // add continuation bit on previous addr
      profileptr_ += 2;
    }
  }
}

I have to manually count the number of 'push' instructions in the generated assembly to figure out where the return address on the stack is to get it to work, but once it is working it loads a buffer with a stream of sampled addresses, which later in the code I stream out the USB port.

It also walks the stack a bit and probes the flash for anything that points to a CALL instruction; so that if, for instance, the avr-gcc multiply routine is interrupted, it can find the function which called it and attribute the sample to that function as well.

There is an auxiliary python script which collects the data from USB and saves the histogram out, and another one to annotate the disassembly with the histogram.

It's not great, but it's really really useful.

Future work

The Arduboy is a really constrained platform -- 128x64 black and white graphics, only 2560 bytes of RAM, no dedicated graphics hardware -- which is somewhat mitigated, compared to true retro platforms, by having a 16MHz CPU. Just because of the resolution I don't think we'll see much 3D on it, but it was fun to get it to work anyway.

https://www.a1k0n.net/2017/02/01/arduboy-teapot.html

Playing Fasttracker 2 .XM files in Javascript

Nov 9, 2015 Updated Nov 9, 2015

Show full content

Play Load

What is this thing?

Hit the play button above and it will play music; specifically, music composed with (or at least, compatible with) a program released in 1994 called FastTracker 2; this is a Javascript homage I wrote in a fit of nostalgia. The source code is on GitHub if you'd like to check it out.

Hit "Load" to load a few other .XMs I have handy on my web host, or drag and drop an .xm from your computer onto the player window.

FastTracker 2 looks like this:

FastTracker II screenshot

I using the original font to render a pastiche of the FT2 interface above.

.XM files

The ubiquitous .MOD music file format originated on the Commodore Amiga in the late 80s. It was more or less designed around the "Paula" chip inside the Amiga which plays four 8-bit PCM channels simultaneously. It is fairly amazing what artists are able to accomplish in just four voices; I've converted a few of Lizardking's old four-channel .MODs to .XM format so they can be played in the player above; try them out with the load button above.

FastTracker 2's .XM (eXtended Module) format was created as a multi-channel extension to .MOD in the 90s; it was written by PC demo group Triton. There were other contemporaneous multi-channel MOD-type music trackers, like ScreamTracker 3's .S3M and later Impulse Tracker's .IT. Nearly all PC demos and many games in the 90s and early 2000s played music in one of these formats.

Underneath the hood, it's several (14 in the case of the default song that comes up here) channels independently playing samples at various pitches and volume levels, controlled by the pattern which is what you see scrolling above.

Patterns are a lot like assembly language for playing music, complete with a slew of hexadecimal numbers and opaque syntax, except they are laid out like a spreadsheet rather than a single column of instructions. Each cell contains notations for playing or releasing a note with a certain instrument, optionally modifying the volume, panning, or pitch with effects.

Every sound that is played is one of the samples, shown at the top of this page as little waveforms and numbers in the table below the pattern. The idea is that the musician would record an instrument, like a piano or a bass, at a certain note, and then we play them back at higher or lower speeds to change the pitch.

Theoretically each sample has a name like "piano" or "bass", but in practice, musicians tended to erase those and write messages instead. Later trackers would add a song message field to avoid abuse of the instrument list.

Annotated pattern image

In the above example, we are about to play an E note in octave 5 with instrument number 2 on the first channel, and in the second channel we play the same note an octave higher with the same instrument, except we also lower the volume to 0x2C (maximum volume is 0x40, so this is between a half and 3/4ths full volume). The second channel is also going to play with effect 0, which is an arpeggio, switching between 0, 0x0c or 12, and 0 semitones higher -- meaning it's going to rapidly play a sequence of E-6, E-7, E-6, and repeat.

Tracker programs don't have any knowledge of the musical scale, so they always render accidentals as sharps, e.g. C#4 instead of Db4.

XM instruments go beyond .MOD samples in that they include volume and panning envelopes, and can have multiple different samples per instrument -- so you could record various piano notes and use a recording closest to the note you're playing rather than stretching a sample several octaves, which sounds bad. I only render the first sample per instrument in the table above though.

In .MOD-derived formats, each row of pattern data is played at a certain rate controlled by the speed and BPM which have a sort of tangential relationship to the terms you may be familiar with. speed controls the row speed in ticks, and BPM controls how long ticks are. A tick is defined as 2500/BPM milliseconds.

Typical values are BPM=125 and speed=6, corresponding to 2500/125 = 20ms per tick, or 50 ticks/second; and 50/6 = 8.333 rows per second. If we assume the downbeat happens every four rows then it works out to 125 beats per minute. But if speed is 5 then it's 150 beats per minute, so we can't really take the terms at face value.

Each tick we process effects -- modifications to the volume or pitch of a note. Effects are represented by three hexadecimal digits: the effect type (first digit) and parameter (second two digits). In XM the effect type goes beyond the hex digits and goes all the way from 0-9 and A-Z.

Examples include arpeggio (switching between three notes really fast; effect 047 plays a major chord), portamento (sliding from one note to another, like a trombone does; effect 1xx slides up 2xx down, and 3xx to a specific note), vibrato (vibrating the pitch up and down; 4xx), or ramping the volume up or down (Axy). If you watch closely while the song is playing you can try to work it out.

The intracacies of how effects work -- some affect the song playing as a whole, some happen only on the first tick of a row, some only on every tick except the first one in a row -- could be its own long and not very interesting article, so I'll leave out the details. There are decent but mutually contradictory resources online.

Real-time Audio Synthesis in Javascript

The webkitAudioContext element was introduced to HTML a while back, and is now available in the standard as AudioContext. The API is fairly flexible, and one of the things you can do with it is get a Javascript callback which writes samples more or less directly to the output device, just like we used to do with DMA loop interrupts back in the day. It's called createScriptProcessor.

You can be up and running making all kinds of noise in a jiffy:

function InitAudio() {
  // Create an AudioContext, falling back to 
  // webkitAudioContext (e.g., for Safari)
  var audioContext = window.AudioContext || window.webkitAudioContext;
  var audioctx = new audioContext();

  // Create a "gain node" which will just multiply everything input
  //  to it by some constant; this gives us a volume control.
  gainNode = audioctx.createGain();
  gainNode.gain.value = 0.1;  // master volume

  // Create a script processor node with 0 input channels,
  // 2 output channels, and a 4096-sample buffer size
  jsNode = audioctx.createScriptProcessor(4096, 0, 2);
  jsNode.onaudioprocess = function() {
    var buflen = e.outputBuffer.length;
    var dataL = e.outputBuffer.getChannelData(0);
    var dataR = e.outputBuffer.getChannelData(1);
    // dataL and dataR are Float32Arrays; fill them with
    // samples, and the framework will play them!
  };

  // Now, form a pipeline where our script feeds samples to the
  // gain node, and the gain node writes samples to the
  // "destination" which actually makes noise.
  jsNode.connect(gainNode);
  gainNode.connect(audioctx.destination);
}

Your onaudioprocess function will be called back whenever new samples are needed to fill the output buffer. With a default samplerate of 44.1kHz, a 4096-sample buffer will expire, and thus your callback will be called, every ~92 milliseconds.

Our goal is to fill this buffer up with the sum of each channel's output waveform. Each channel outputs a sample playing at a certain frequency and at a certain volume. And during a tick, the sample frequency and volume is constant1. Between ticks, we recompute the channel frequencies, volumes and envelopes; and on ticks which are even multiples of the current speed value (3 in the example below), we read a new row of pattern data -- notes, instruments, effects, etc.

The output is just the sum of the individual channels2 for each sample. Output summed up per tick

When we're done, the output will be laid out something like this: Audio buffer layout diagram

In this example, a tick is 20ms and our output is playing at 44.1kHz, so a tick is 882 samples. Of course, ticks don't evenly divide sample buffers, and the AudioContext specification requires you to use buffer sizes which are powers of two and ≥ 1024. So when we render samples to our buffer, we may have to play a partial tick at the beginning and end.

So in our audio callback, we figure out how many samples are in a tick, and fit them into the output buffer, and add up each channel's data within a tick. It looks like this3:

var cur_ticksamp = 0;  // Sample index in current tick
var channels = [ ... ];  // Channel state: sample num, offset, freq, etc

function audioCallback(e) {
  var buf_remaining = e.outputBuffer.length;
  var dataL = e.outputBuffer.getChannelData(0);
  var dataR = e.outputBuffer.getChannelData(1);

  dataL.fill(0);  // (see footnote 3)
  dataR.fill(0);

  var f_smp = audioctx.sampleRate;  // our output sample rate
  var ticklen = 0|(f_smp * 2.5 / xm.bpm);  // # samples per tick
  var offset = 0;
  while(buf_remaining > 0) {
    // tickduration is the number of samples we can produce for this
    // tick; either we finish the tick, or we run out of space at the
    // end of the buffer
    var tickduration = Math.min(buf_remaining, ticklen - cur_ticksamp);

    if (cur_ticksamp >= ticklen) {
      // Update our channel structures with new song data
      nextTick();
      cur_ticksamp -= ticklen;
    }
    for (var j = 0; j < xm.num_channels; j++) {
      // Add this channel's samples into the dataL/dataR buffers
      // starting at <offset> and spanning <tickduration> samples
      mixChannelInfoBuf(channels[j], offset, offset + tickduration,
                        dataL, dataR);
    }
    offset += tickduration;
    cur_ticksamp += tickduration;
    buf_remaining -= tickduration;
  }
}

Note frequencies

In .MOD derivatives, notes at C-4 (that is, a C on octave 4) are played at a samplerate of 8363Hz. Everything is relative to that; an octave higher, for instance, would be 16726Hz -- twice the frequency. For each semitone up, we effectively multiply the frequency by the twelfth root of two. In general, the frequency can be computed by 8363 * Math.pow(2, (note - 48) / 12.0). note here is the note number in semitones, starting at 0 for C-0 and going up to 95 for B-7.

To support fine tuning, vibrato, slides and so forth in the XM format, effects modify the note "period" which is not a period but 1/16th of a (negative4) semitone. Also, each sample has a coarse and fine tuning offset which are added on when we compute the play frequency.

Resampling

mixChannelIntoBuf's job is to write a sample played at a certain frequency onto a buffer which is played at the output frequency, usually 44.1kHz. It turns out that playing a sample recorded at a different frequency from the output frequency is a surprisingly nontrivial problem.

From the Nyquist-Shannon theorem, we know that our original sample only contains frequencies less than or equal to half the original sample rate. So ideally, when we play it back at the new frequency, the same set of frequencies should be played.

In practice, though, without using a "brick wall" or "sinc" filter, we're going to end up with some spurious harmonics due to the aliasing phenomenon. The only way to avoid that requires a convolution as long as the original sample. In other words, it's computationally expensive.

Also, since the original .XM players didn't do anything fancy, it's also typically unnecessary.

It turns out that if we use advanced resampling techniques to eliminate unwanted alias harmonics, many songs5 sound pretty bad. The original hardware and software that played these either used "zero-order hold" -- just output the sample closest to the interpolated time -- or "first-order hold" -- interpolate between the surrounding samples.

I compromised with a very simple implementation that I think sounds pretty good. I use a combination of zero-order hold and a per-channel two-pole low-pass filter, which has very low implementation complexity. I will leave the details out here.

Either way, we compute the speed at which we proceed through the sample, which is the ratio of the playback frequency to the output frequency:

function UpdateChannelPeriod(ch, period) {
  var freq = 8363 * Math.pow(2, (1152.0 - period) / 192.0);
  ch.doff = freq / f_smp;
}

The doff variable is the delta offset per sample. If we are playing at exactly 44100Hz, then doff is 1, and we are just copying the sample into the output.

Here's what zero-order hold looks like, omitting many details of sample looping or ending:

function mixChannelIntoBuf(ch, start, end, dataL, dataR) {
  var samp = ch.sample;
  for (var i = start; i < end; i++) {
    // Dumb javascript tricks:
    // we use a bitwise OR with 0 to truncate offset to integer
    var s = samp[0|ch.off];

    ch.off += ch.doff;
    dataL[i] += ch.volL * s;  // multiply by left volume
    dataR[i] += ch.volR * s;  // and right volume
    if (ch.off >= ch.sample.length) {
      if (ch.sample.loop) {
        ch.off -= ch.sample.looplen;
      } else {
        return;
      }
    }
  }
}

In practice I unroll that loop, compute how many samples I can generate before the sample ends or loops so I don't have to test in the inner loop, and also unroll short sample loops in memory.

The end result runs fast enough to play without hitches on my phone.

The main problem with zero-order hold is that it produces hard edges between samples, and so there are a lot of high pitched hisses and noise. As a result, I added an additional low-pass filter to each channel with a cutoff equal to half the playback sampling frequency.

Synchronized visualizations

It might seem reasonable that in order to show what's happening on the screen while you're hearing it, you need to use a really small buffer so that the latency between when we compute a sound buffer and when it plays is minimized. The problem with that is on a webpage, a low latency buffer can mean choppy audio as the browser might have better things to do than call your javascript audio callback within 20 milliseconds. It's pretty far from a realtime system.

But we actually don't care about latency at all; there are no external events changing the sound in an unpredictable way (other than pausing, which is handled separately). The code on this page is using a callback buffer size of 16384, over 300 milliseconds, and yet the oscilloscopes and patterns are updating at roughly the tick rate (~50Hz). How does that work?

Whenever I finish rendering a tick to the audio buffer, I push a visualization state onto a queue, keyed by the current audio time in samples. AudioContext has a currentTime attribute, which is the number of seconds that audio has been playing since the context was created. On my audio callback, I capture this and then add offset / f_smp to the time for each tick. And then I push the first few samples of each channel, plus the current pattern and current row, onto a queue. My rendering code just compares audioctx.currentTime against the head of the queue and renders what comes next.

var audio_events = [];

function redrawScreen() {
  var e;
  var t = XMPlayer.audioctx.currentTime;
  while (audio_events.length > 0 && audio_events[0].t < t) {
    e = audio_events.shift();
  }
  if (!e) return;

  // draw VU meters, oscilloscopes, and update pattern position
  // using various 2D canvas calls
}

The patterns are rendered by calling drawImage on individual letters from the original font image onto an off-screen canvas every time a new pattern is to be shown, and then compositing that canvas onto the screen above in order to highlight the currently-playing row.

End result

The player is also available here without the accompanying article. By the way, you can load arbitrary XMs on the web by using an anchor with a link, e.g. https://www.a1k0n.net/code/jsxm/#http://your.host.here/blah.xm -- but only if the host sets a CORS (Access-Control-Allow-Origin) header.

I think what was most fun about playing .MODs and their ilk back in days of yore was that you could actually see how a complex piece of music is put together, and watch your computer perform it live. I wanted to recreate that experience. There are other web-based .MOD players now, like mod.haxor.fi, but I wanted to write my own just for fun.

It was fun. Though it isn't complete effects-wise, I couldn't be happier with how it turned out.

Footnotes

The one exception being when the sample ends during the tick. ↩
I've omitted the details of stereo, but we sum up the left and right channels independently. Panning is just a different volume on the left channel vs. the right channel. ↩
Array.fill() is not available in all browsers, so in practice I use a for loop to clear the buffer. And the buffer is not already cleared when it is passed to us; it may actually contain data from the previous callback. ↩
For historical reasons I won't go into here, .MOD effects modified the sample period, or inverse frequency, as making linear changes to it sounds more linear to the ear; but really, it's an approximation of the logarithmic scale the human ear actually hears. A log scale is used in .XM. So when we increment period in XM, we are stepping down 1/16th of a semitone. ↩
Especially "chiptunes" -- songs with tiny looping square wave or triangle wave samples, to emulate simple vintage sound hardware chips. ↩

https://www.a1k0n.net/2015/11/09/javascript-ft2-player.html

Donut math: how donut.c works

Jul 20, 2011 Updated Jul 20, 2011

Show full content

[Update 1/13/2021: I wrote a follow-up with some optimizations. ]

There has been a sudden resurgence of interest in my "donut" code from 2006, and I've had a couple requests to explain this one. It's been five years now, so it's not exactly fresh in my memory, so I will reconstruct it from scratch, in great detail, and hopefully get approximately the same result.

This is the code:

             k;double sin()
         ,cos();main(){float A=
       0,B=0,i,j,z[1760];char b[
     1760];printf("\x1b[2J");for(;;
  ){memset(b,32,1760);memset(z,0,7040)
  ;for(j=0;6.28>j;j+=0.07)for(i=0;6.28
 >i;i+=0.02){float c=sin(i),d=cos(j),e=
 sin(A),f=sin(j),g=cos(A),h=d+2,D=1/(c*
 h*e+f*g+5),l=cos      (i),m=cos(B),n=s\
in(B),t=c*h*g-f*        e;int x=40+30*D*
(l*h*m-t*n),y=            12+15*D*(l*h*n
+t*m),o=x+80*y,          N=8*((f*e-c*d*g
 )*m-c*d*e-f*g-l        *d*n);if(22>y&&
 y>0&&x>0&&80>x&&D>z[o]){z[o]=D;;;b[o]=
 ".,-~:;=!*#$@"[N>0?N:0];}}/*#****!!-*/
  printf("\x1b[H");for(k=0;1761>k;k++)
   putchar(k%80?b[k]:10);A+=0.04;B+=
     0.02;}}/*****####*******!!=;:~
       ~::==!!!**********!!!==::-
         .,~~;;;========;;;:~-.
             ..,--------,*/

...and the output, animated in Javascript: toggle animation

At its core, it's a framebuffer and a Z-buffer into which I render pixels. Since it's just rendering relatively low-resolution ASCII art, I massively cheat. All it does is plot pixels along the surface of the torus at fixed-angle increments, and does it densely enough that the final result looks solid. The "pixels" it plots are ASCII characters corresponding to the illumination value of the surface at each point: .,-~:;=!*#$@ from dimmest to brightest. No raytracing required.

So how do we do that? Well, let's start with the basic math behind 3D perspective rendering. The following diagram is a side view of a person sitting in front of a screen, viewing a 3D object behind it.

To render a 3D object onto a 2D screen, we project each point (x,y,z) in 3D-space onto a plane located z' units away from the viewer, so that the corresponding 2D position is (x',y'). Since we're looking from the side, we can only see the y and z axes, but the math works the same for the x axis (just pretend this is a top view instead). This projection is really easy to obtain: notice that the origin, the y-axis, and point (x,y,z) form a right triangle, and a similar right triangle is formed with (x',y',z'). Thus the relative proportions are maintained:

\begin{aligned} \frac{y'}{z'} &= \frac{y}{z} \\ y' &= \frac{y z'}{z}. \end{aligned}

So to project a 3D coordinate to 2D, we scale a coordinate by the screen distance z'. Since z' is a fixed constant, and not functionally a coordinate, let's rename it to K1, so our projection equation becomes (x',y') = (\frac{K_1 x}{z}, \frac{K_1 y}{z}). We can choose K1 arbitrarily based on the field of view we want to show in our 2D window. For example, if we have a 100x100 window of pixels, then the view is centered at (50,50); and if we want to see an object which is 10 units wide in our 3D space, set back 5 units from the viewer, then K1 should be chosen so that the projection of the point x=10, z=5 is still on the screen with x' < 50: 10K1/5 < 50, or K1 < 25.

When we're plotting a bunch of points, we might end up plotting different points at the same (x',y') location but at different depths, so we maintain a z-buffer which stores the z coordinate of everything we draw. If we need to plot a location, we first check to see whether we're plotting in front of what's there already. It also helps to compute z-1 = \frac{1}{z} and use that when depth buffering because:

z-1 = 0 corresponds to infinite depth, so we can pre-initialize our z-buffer to 0 and have the background be infinitely far away
we can re-use z-1 when computing x' and y': Dividing once and multiplying by z-1 twice is cheaper than dividing by z twice.

Now, how do we draw a donut, AKA torus? Well, a torus is a solid of revolution, so one way to do it is to draw a 2D circle around some point in 3D space, and then rotate it around the central axis of the torus. Here is a cross-section through the center of a torus:

So we have a circle of radius R1 centered at point (R2,0,0), drawn on the xy-plane. We can draw this by sweeping an angle — let's call it θ — from 0 to 2π:

(x,y,z) = (R_2,0,0) + (R_1 \cos \theta, R_1 \sin \theta, 0)

Now we take that circle and rotate it around the y-axis by another angle — let's call it φ. To rotate an arbitrary 3D point around one of the cardinal axes, the standard technique is to multiply by a rotation matrix. So if we take the previous points and rotate about the y-axis we get:

\left( \begin{matrix} R_2 + R_1 \cos \theta, & R_1 \sin \theta, & 0 \end{matrix} \right) \cdot \left( \begin{matrix} \cos \phi & 0 & \sin \phi \\ 0 & 1 & 0 \\ -\sin \phi & 0 & \cos \phi \end{matrix} \right) = \left( \begin{matrix} (R_2 + R_1 \cos \theta)\cos \phi, & R_1 \sin \theta, & -(R_2 + R_1 \cos \theta)\sin \phi \end{matrix} \right)

But wait: we also want the whole donut to spin around on at least two more axes for the animation. They were called A and B in the original code: it was a rotation about the x-axis by A and a rotation about the z-axis by B. This is a bit hairier, so I'm not even going write the result yet, but it's a bunch of matrix multiplies.

\left( \begin{matrix} R_2 + R_1 \cos \theta, & R_1 \sin \theta, & 0 \end{matrix} \right) \cdot \left( \begin{matrix} \cos \phi & 0 & \sin \phi \\ 0 & 1 & 0 \\ -\sin \phi & 0 & \cos \phi \end{matrix} \right) \cdot \left( \begin{matrix} 1 & 0 & 0 \\ 0 & \cos A & \sin A \\ 0 & -\sin A & \cos A \end{matrix} \right) \cdot \left( \begin{matrix} \cos B & \sin B & 0 \\ -\sin B & \cos B & 0 \\ 0 & 0 & 1 \end{matrix} \right)

Churning through the above gets us an (x,y,z) point on the surface of our torus, rotated around two axes, centered at the origin. To actually get screen coordinates, we need to:

Move the torus somewhere in front of the viewer (the viewer is at the origin) — so we just add some constant to z to move it backward.
Project from 3D onto our 2D screen.

So we have another constant to pick, call it K2, for the distance of the donut from the viewer, and our projection now looks like:

\left( x', y' \right) = \left( \frac{K_1 x}{K_2 + z} , \frac{K_1 y}{K_2 + z} \right)

K1 and K2 can be tweaked together to change the field of view and flatten or exaggerate the depth of the object.

Now, we could implement a 3x3 matrix multiplication routine in our code and implement the above in a straightforward way. But if our goal is to shrink the code as much as possible, then every 0 in the matrices above is an opportunity for simplification. So let's multiply it out. Churning through a bunch of algebra (thanks Mathematica!), the full result is:

\left( \begin{matrix} x \\ y \\ z \end{matrix} \right) = \left( \begin{matrix} (R_2 + R_1 \cos \theta) (\cos B \cos \phi + \sin A \sin B \sin \phi) - R_1 \cos A \sin B \sin \theta \\ (R_2 + R_1 \cos \theta) (\cos \phi \sin B - \cos B \sin A \sin \phi) + R_1 \cos A \cos B \sin \theta \\ \cos A (R_2 + R_1 \cos \theta) \sin \phi + R_1 \sin A \sin \theta \end{matrix} \right)

Well, that looks pretty hideous, but we we can precompute some common subexpressions (e.g. all the sines and cosines, and R_2 + R_1 \cos \theta) and reuse them in the code. In fact I came up with a completely different factoring in the original code but that's left as an exercise for the reader. (The original code also swaps the sines and cosines of A, effectively rotating by 90 degrees, so I guess my initial derivation was a bit different but that's OK.)

Now we know where to put the pixel, but we still haven't even considered which shade to plot. To calculate illumination, we need to know the surface normal — the direction perpendicular to the surface at each point. If we have that, then we can take the dot product of the surface normal with the light direction, which we can choose arbitrarily. That gives us the cosine of the angle between the light direction and the surface direction: If the dot product is >0, the surface is facing the light and if it's <0, it faces away from the light. The higher the value, the more light falls on the surface.

The derivation of the surface normal direction turns out to be pretty much the same as our derivation of the point in space. We start with a point on a circle, rotate it around the torus's central axis, and then make two more rotations. The surface normal of the point on the circle is fairly obvious: it's the same as the point on a unit (radius=1) circle centered at the origin.

So our surface normal (Nx, Ny, Nz) is derived the same as above, except the point we start with is just (cos θ, sin θ, 0). Then we apply the same rotations:

\left( \begin{matrix} N_x, & N_y, & N_z \end{matrix} \right) = \left( \begin{matrix} \cos \theta, & \sin \theta, & 0 \end{matrix} \right) \cdot \left( \begin{matrix} \cos \phi & 0 & \sin \phi \\ 0 & 1 & 0 \\ -\sin \phi & 0 & \cos \phi \end{matrix} \right) \cdot \left( \begin{matrix} 1 & 0 & 0 \\ 0 & \cos A & \sin A \\ 0 & -\sin A & \cos A \end{matrix} \right) \cdot \left( \begin{matrix} \cos B & \sin B & 0 \\ -\sin B & \cos B & 0 \\ 0 & 0 & 1 \end{matrix} \right)

So which lighting direction should we choose? How about we light up surfaces facing behind and above the viewer: (0,1,-1). Technically this should be a normalized unit vector, and this vector has a magnitude of √2. That's okay -- we will compensate later. Therefore we compute the above (x,y,z), throw away the x and get our luminance L = y-z.

\begin{aligned} L &= \left( \begin{matrix} N_x, & N_y, & N_z \end{matrix} \right) \cdot \left( \begin{matrix} 0, & 1, & -1 \end{matrix} \right) \\ &= \cos \phi \cos \theta \sin B - \cos A \cos \theta \sin \phi - \sin A \sin \theta + \cos B ( \cos A \sin \theta - \cos \theta \sin A \sin \phi) \end{aligned}

Again, not too pretty, but not terrible once we've precomputed all the sines and cosines.

So now all that's left to do is to pick some values for R1, R2, K1, and K2. In the original donut code I chose R1=1 and R2=2, so it has the same geometry as my cross-section diagram above. K1 controls the scale, which depends on our pixel resolution and is in fact different for x and y in the ASCII animation. K2, the distance from the viewer to the donut, was chosen to be 5.

I've taken the above equations and written a quick and dirty canvas implementation here, just plotting the pixels and the lighting values from the equations above. The result is not exactly the same as the original as some of my rotations are in opposite directions or off by 90 degrees, but it is qualitatively doing the same thing.

Here it is: toggle animation

It's slightly mind-bending because you can see right through the torus, but the math does work! Convert that to an ASCII rendering with z-buffering, and you've got yourself a clever little program.

Now, we have all the pieces, but how do we write the code? Roughly like this (some pseudocode liberties have been taken with 2D arrays):

const float theta_spacing = 0.07;
const float phi_spacing   = 0.02;

const float R1 = 1;
const float R2 = 2;
const float K2 = 5;
// Calculate K1 based on screen size: the maximum x-distance occurs
// roughly at the edge of the torus, which is at x=R1+R2, z=0.  we
// want that to be displaced 3/8ths of the width of the screen, which
// is 3/4th of the way from the center to the side of the screen.
// screen_width*3/8 = K1*(R1+R2)/(K2+0)
// screen_width*K2*3/(8*(R1+R2)) = K1
const float K1 = screen_width*K2*3/(8*(R1+R2));

render_frame(float A, float B) {
  // precompute sines and cosines of A and B
  float cosA = cos(A), sinA = sin(A);
  float cosB = cos(B), sinB = sin(B);

  char output[0..screen_width, 0..screen_height] = ' ';
  float zbuffer[0..screen_width, 0..screen_height] = 0;

  // theta goes around the cross-sectional circle of a torus
  for (float theta=0; theta < 2*pi; theta += theta_spacing) {
    // precompute sines and cosines of theta
    float costheta = cos(theta), sintheta = sin(theta);

    // phi goes around the center of revolution of a torus
    for(float phi=0; phi < 2*pi; phi += phi_spacing) {
      // precompute sines and cosines of phi
      float cosphi = cos(phi), sinphi = sin(phi);
    
      // the x,y coordinate of the circle, before revolving (factored
      // out of the above equations)
      float circlex = R2 + R1*costheta;
      float circley = R1*sintheta;

      // final 3D (x,y,z) coordinate after rotations, directly from
      // our math above
      float x = circlex*(cosB*cosphi + sinA*sinB*sinphi)
        - circley*cosA*sinB; 
      float y = circlex*(sinB*cosphi - sinA*cosB*sinphi)
        + circley*cosA*cosB;
      float z = K2 + cosA*circlex*sinphi + circley*sinA;
      float ooz = 1/z;  // "one over z"
      
      // x and y projection.  note that y is negated here, because y
      // goes up in 3D space but down on 2D displays.
      int xp = (int) (screen_width/2 + K1*ooz*x);
      int yp = (int) (screen_height/2 - K1*ooz*y);
      
      // calculate luminance.  ugly, but correct.
      float L = cosphi*costheta*sinB - cosA*costheta*sinphi -
        sinA*sintheta + cosB*(cosA*sintheta - costheta*sinA*sinphi);
      // L ranges from -sqrt(2) to +sqrt(2).  If it's < 0, the surface
      // is pointing away from us, so we won't bother trying to plot it.
      if (L > 0) {
        // test against the z-buffer.  larger 1/z means the pixel is
        // closer to the viewer than what's already plotted.
        if(ooz > zbuffer[xp,yp]) {
          zbuffer[xp, yp] = ooz;
          int luminance_index = L*8;
          // luminance_index is now in the range 0..11 (8*sqrt(2) = 11.3)
          // now we lookup the character corresponding to the
          // luminance and plot it in our output:
          output[xp, yp] = ".,-~:;=!*#$@"[luminance_index];
        }
      }
    }
  }

  // now, dump output[] to the screen.
  // bring cursor to "home" location, in just about any currently-used
  // terminal emulation mode
  printf("\x1b[H");
  for (int j = 0; j < screen_height; j++) {
    for (int i = 0; i < screen_width; i++) {
      putchar(output[i,j]);
    }
    putchar('\n');
  }
  
}

The Javascript source for both the ASCII and canvas rendering is right here.

https://www.a1k0n.net/2011/07/20/donut-math.html

Yahoo! Logo ASCII Animation in 462 bytes of C

Jun 26, 2011 Updated Jun 26, 2011

Show full content

[Update 10/21/2015: Changed the title from "six lines of C" to "462 bytes of C" to avoid endless arguments about what constitutes a line of C code.]

[Update 6/28/2011: added Javascript version; press button below to see the output without compiling the code.]

[Update 7/9/2011: if you're using a compiler other than gcc, you might need to put a #include <math.h> at the top for it to work correctly -- I seem to be depending on the builtin behavior of sin and cos w.r.t. their return types when undeclared.]

Last week I put together another obfuscated C program and have been urged by my coworkers to post it publicly. I've made some refinements since posting it to our internal list, so here is the final version (to those who had seen it already: it's one line shorter now, and the angles are less screwy, and the animation is 2 seconds instead of 3). Go ahead, try it:

var F,S,V,tmr,doframe=function(){var k=document.getElementById("output"),c,d,e,a,f,g,h,j,b,i=[];S+=V+=(1-S)/10-V/4;for(d=0;d<24;d++){for(c=0;c<73;c++){for(a=e=0;a<3;a++){f=S*(c-27);g=S*(d*3+a-36);e^=(136*f*f+84*g*g<92033)<<a;b=0;p=6;for(m=0;m<8;){h=('O:85!fI,wfO8!yZfO8!f*hXK3&fO;:O;#hP;"i'.charCodeAt(b)-79)/14.6423;j="<[\\]O=IKNAL;KNRbF8EbGEROQ@BSXXtG!#t3!^".charCodeAt(b++)-79;if(f*Math.cos(h)+g*Math.sin(h)<j/1.165){b=p;p="<AFJPTX".charCodeAt(m++)-50}else if(b==p){e^=1<<a;m=8}}}i.push(" ''\".$u$"[e])}i.push("\n")}k.innerHTML= i.join("");if(!F--){clearInterval(tmr);tmr=undefined}};function animate(){F=40;V=S=0;if(tmr===undefined)tmr=setInterval(doframe,50)};

$ cat >yanim.c
c,p,i,j,n,F=40,k,m;float a,x,y,S=0,V=0;main(){for(;F--;usleep(50000),F?puts(
"\x1b[25A"):0)for(S+=V+=(1-S)/10-V/4,j=0;j<72;j+=3,putchar(10))for(i=0;x=S*(
i-27),i++<73;putchar(c[" ''\".$u$"]))for(c=0,n=3;n--;)for(y=S*(j+n-36),k=0,c
^=(136*x*x+84*y*y<92033)<<n,p=6,m=0;m<8;k++["<[\\]O=IKNAL;KNRbF8EbGEROQ@BSX"
"XtG!#t3!^"]/1.16-68>x*cos(a)+y*sin(a)?k=p,p="<AFJPTX"[m++]-50:k==p?c^=1<<n,
m=8:0)a=(k["O:85!fI,wfO8!yZfO8!f*hXK3&fO;:O;#hP;\"i[by asloane"]-79)/14.64;}
^D
$ gcc -o yanim yanim.c -lm
[warnings which real programmers ignore]
$ ./yanim

[you'll see - show the animation]

It's a 20fps, antialiased ASCII art animation of the Yahoo! logo. If you want to figure out how it works on your own, you're welcome to. Otherwise, read on.

I encourage you to play with the constants in the code: S+=V+=(1-S)/10-V/5 is the underdamped control system for the animation -- S is scale (=1/zoom), V is velocity, and 1/10 and 1/5 are the PD constants. S=0 corresponds to infinite zoom on the first frame. S<0 is funny. F is the frame counter. The 1.16 controls the scale of the polygon rendering (68 is an approximation of 79/1.16 so you have to adjust that too), and 136/84/92033 define the ellipse. The 14.64 is not a tunable parameter, though (it's 46/π, and for a good reason).

The antialiasing is simple: each character consists of three vertically-arranged samples and an 8-character lookup table for each arrangement of three on/off pixels. Each frame consists of 73x24 characters, or 73x72 pixels. The 73 horizontal choice was somewhat arbitrary; I suppose I could have gone up to 79.

The logo is rendered as an ellipse and eight convex polygons using a fairly neat method (I thought) with sub-pixel precision and no frame buffer. It required some design tradeoffs to fit into two printable-character arrays, but it's much less code than rendering triangles to a framebuffer, which is the typical way polygon rasterization is done.

To produce this, first I had to vectorize the "Y!" logo. I did this by taking some measurements of a reference image and writing coordinates down on graph paper. Then I wrote a utility program which takes the points and polygon definitions and turns them into angles and offsets as defined below. [I put the generator code on pastebin until I get can some code highlighting stuff set up for my blog].

The ellipse is fairly standard high-school math: x2/a2 + y2/b2 < 1. Each point is tested and if it's inside the ellipse, the pixel is plotted. (136x2 + 84y2 < 92033 was a trivial rearrangement of terms with a and b being the radii of the two axes of the ellipse measured from my source image, scaled to the pixel grid).

Each polygon is made up of a set of separating half-planes (a half-plane being all points on one side of an infinitely long line). If a given point is "inside" all of the half-planes, it's inside the polygon (which only works as long as the polygon is convex) and the pixel is toggled with the XOR operator ^ (thus it handles the "inverse" part inside the ellipse as well as the uninverted exclamation mark without any special cases). Each side of a polygon is defined by the equation ax + by > c. To represent both a and b I use an angle θ so that a = cos(θ) and b = sin(θ) and quantize the angle in π/46 increments — my angles are thus represented from -π to +π as ASCII 33 to 125 — '!' to '}' — with 'O' (ASCII 79) as zero. Then I solve for c, also quantized in scaled increments from -47 to +47, so that the midpoint of the side is considered inside the polygon.

Here's an extremely crude diagram: (I'm writing this on a plane and none of my drawing programs are working. Sorry.)

The shaded area is ax + by < c, implying it's outside the polygon, and the dashed line is ax + by = c.

(a,b) form a vector orthogonal to the line segment they represent pointing towards the inside of the polygon, so we can get them directly from the points defining the line segment by taking the vector defining the side — (x1 - x0, y1 - y0) — and rotating it 90 degrees, resulting in (a, b) = (y1 - y0, x0 - x1). Then we normalize (a, b) as the actual magnitude doesn't matter, but it will be 1 when we decode the angle and we can compensate with our choice of c later (if ax+by>c, then sax+sby>sc for some scale s>0). Then compute θ = atan2(a,b), quantize to one of our 94 angles, and get our new (a,b) = (cos(θ), sin(θ)).

c is easy to get by directly substituting any of the points making up the line on the side of the polygon into c = ax+by. I use the midpoint of the line segment on the side, (xt, yt) = ((x0 + x1)/2, (y0 + y1)/2), because the angle of the side can be slightly off after we quantize θ, and this evens the errors out across the length of the side.

You'll notice on the first couple frames (you can pause with ^S, resume with ^Q -- xon/xoff) that the bottom section of the 'Y' has little bites taken out of it due to the quantization error in the separating half-plane equations.

It could probably be made somewhat more efficient CPU-wise by careful reordering of the separating plane arrays so that most of the drawing area is rejected first. I didn't get to that in my generator code.

The animation is done by the <ESC>[25A sequence — it moves the cursor up 25 lines in just about any terminal emulation mode. I technically only need to move up 24 lines, but puts is shorter than printf and it implicitly adds a newline. If your terminal isn't at least 26 lines high, though, it does funky things to your scrollback. And usleep is there to limit it to 20fps, which is the only non-ANSI Cism about it.

And then I shrunk the code down by arranging it into clever for loops and taking unorthodox advantage of commas, conditionals, and globals being ints by default in C (which is all par for the course in obfuscated C code). And that pretty much reveals all the secrets as to how it was done.

It would be fairly easy to enhance this with a different movement sequence, or rotation (or any kind of 3D transform, as it's basically just ray-tracing the logo). I just animated the scale to prove the point that it was being rendered dynamically and not just a compressed logo, and kept the animation short and sweet.

I apologize in advance for the various sign errors I'm sure to have made when typing this up, but you get the idea.

https://www.a1k0n.net/2011/06/26/obfuscated-c-yahoo-logo.html

Google AI Challenge post-mortem

Mar 4, 2010 Updated Mar 4, 2010

Show full content

[Update 8/2/2011: you can play against a dumbed-down Javascript version of the bot here.]

I can't believe I won.

I can't believe I won decisively at all.

I've never won any programming contest before (although I did place in 3rd in the Mario AI contest but there were only about 10 entrants). Whenever I badly lose at an ICFP contest I'm always anxious to see the post mortems of the people who did better than I did, and I imagine a lot of people are curious as to how I won, exactly. So here's my post-mortem. Code

Before we get into it, note that all of my code is on github here. The commit logs might be a mildly entertaining read. Phase 1: denial

The first thing I did was attempt to ignore this contest as long as possible, because month-long programming contests get me into a lot of trouble at work and at home. The contest was the Tron light cycle game. I've played variants of this game ever since I got one of these when I was in 1st grade or so. The most fun variant was my uncle's copy of Snafu for the Intellivision which we played at my Grandma's house all day long. I've long wondered how to write a bot for it, because the AI on these usually isn't very smart. Phase 2: space-filling

But I finally gave in on the 9th, downloaded a starter pack, and attempted to make a simple semi-decent wall-hugging bot. I quickly discovered a simple useful heuristic: the best rule for efficiently filling space is to always choose the move that removes the least number of edges from the graph. In other words, go towards the space with the most walls for neighbors. But! Avoid cut vertices (AKA articulation points), and if you have to enter a cut vertex, then always choose the largest space left over. At this stage I wasn't actually calculating articulation points; I just checked the 3x3 neighborhood of the square and made a lookup table of neighbor configurations that might be articulation points based on the 3x3 neighborhood. This is what the potential_articulation function does in my code, and artictbl.h is the lookup table.

I was, however, computing the connected components of the map. This is a simple two-pass O(NM) algorithm for N squares in the map and M different components. For each non-wall square in the map, traversed in raster order, merge it with the component above it (if there is no wall above) and do the same to its left (if there is no wall to the left). If it connects two components, renumber based on the lowest index, maintaining an equivalence lookup table on the side (equivalence lookups are O(M) but really just linear scans of a tiny vector). Then scan again and fixup the equivalences. This is what the Components structure is for; it has this algorithm and simple sometimes-O(1), sometimes-O(NM) update functions based on potential_articulation above. Phase 3: minimax

I left it at that for the rest of the week and then Friday the 12th I was inspired by various posts on the official contest forum to implement minimax with alpha-beta pruning, which would let it look several moves ahead before deciding what to do. The problem with this approach is that you have to have some way of estimating who is going to win and by how much, given any possible configuration of walls and players. If the players are separated by a wall, then the one with the most open squares, for the most part, wins. If they aren't separated, then we need to somehow guess who will be able to wall in whom into a smaller space. To do that, I did what everyone else in the contest who had read the forums was doing at this point: I used the so-called Voronoi heuristic.

The "Voronoi heuristic" works like this: for each spot on the map, find whether player 1 can reach it before player 2 does or vice versa. This creates a Voronoi diagram with just two points which sort of winds around all the obstacles. The best way to explain it is to show what it looks like during a game:

The light red area are all squares the red player can reach before the blue player can. Similarly for the light blue squares. If they're white, they're equidistant. The heuristic value I used initially, and many other contestants used, was to add up the number of squares on each side and subtract.

Once the red player cuts the blue one off, they are no longer in the same connected component and then gameplay evaluation switches to "endgame" or "survival" mode, where you just try to outlast your opponent. After this point, the minimax value was 1000*(size of player 1's connected component - size of player 2's connected component). The factor of 1000 was just to reward certainty in an endgame vs. heuristic positional value. Note that this was only used to predict an endgame situation. After the bot actually reached the endgame, it simply used the greedy wall-following heuristic described above, and it performed admirably for doing no searching at all. evaluation heuristic tweaking

I next noticed that my bot would make some fairly random moves in the early game, effectively cluttering its own space. So I took a cue from my flood filling heuristic and added a territory bonus for the number of open neighbors each square in the territory had (effectively counting each "edge" twice). This led to automatic wall-hugging behavior when all else was equal.

After fixing a lot of bugs, and finally realizing that when time runs out on a minimax search, you have to throw away the ply you're in the middle of searching and use the best move from the previous ply, I had an extremely average bot. Due to the arbitrariness of the ranking up until the last week in the contest, it briefly hit the #1 spot and then settled to a random spot on the first page. It was pretty hard to tell whether it was any good, but I was losing some games, so I figured it must not be. endgame tweaking

The next realization was that my bot was doing a lot of stupid things in the endgame. So the next improvement was to do an iteratively-deepened search in the endgame. I exhaustively tried all possible moves, and at the bottom of the search tree, ran my greedy heuristic to completion. Whichever move sequence "primed" the greedy evaluator the best wins. This works great on the smallish official contest maps. It works terribly on very large random maps currently in rotation on dhartmei's server, but I didn't realize that until after the contest. data mining

I was out of ideas for a while and spent some time optimizing (I used Dijkstra's to do the Voronoi computation and I sped it up by using what I call a radix priority queue which is just a vector of stacks... see dijkstra). But it had been bothering me that my edge count/node count Voronoi heuristic was pretty arbitrary, and wondered if I could do any kind of inference to discover better ones.

Well, hundreds of thousands of games had been played on the contest server by this point, and they are extremely easy to download (the contest site's game viewer does an AJAX request to get some simple-to-parse data for the game), so I figured I'd try to do some data mining. I wrote a quick perl hack to grab games from the site and output them in a format that Tron bots recognize. Then I copied-and-pasted my code wholesale into examine.cc and marked it up so it would read in a game back-to-front, find the point at which the players enter separate components, guess what the optimal number of moves they could have made from that point forward, and then use the existing territory evaluation code on every turn before that and dump out some statistics. The goal was to discover a model that would predict, given these territory statistics, what the difference in squares will eventually be in the endgame.

I started with an extremely simple linear model (and never really changed it afterwards): the predicted difference in endgame moves is K1 (N1 - N2) + K2 (E1 - E2) where Ni is the number of nodes in player i's territory and Ei is the number of edges (double-counted actually).

Now, this model is pretty far from absolutely great, and only a little predictive. This is what the raw data looks like after analyzing 11691 of the games the top-100 players (at the time) had played:

That's the difference of nodes/edges on the x-axis and the difference of endgame moves on the y-axis. So both nodes and edges by the Voronoi metric are, of course, correlated. I did a linear regression to find approximate values for K1 (currently 0.055) and K2 (0.194) and multiplied through by 1000 to keep everything integers.

This definitely improved play in my own testing (I kept 14 different versions of my bot throughout the contest so I could compare them. Interestingly, no bot ever totally shut out a previous bot on all maps in my tests; every bot has a weakness). Once I had that, I was doing very well in the leaderboard rankings. applied graph theory

Next I noticed dmj's "mostly useless" idea on the official contest forums: Pretend the game is played on a checkerboard. Each player can only move from red to black and vice versa. Therefore, if a given space has a lot more "red" squares than "black" squares, the surplus "red" squares will necessarily be wasted. I switched out all my space counting code to count up red and black spaces, and found a tighter upper bound on the amount of space an ideal bot could fill. This let my endgame code stop searching when it had found a solution matching the upper bound, and gave slightly more realistic territory evaluations.

I had already started to think about what came to be called "chamber trees", as pointed out by iouri in the same thread: find all the articulation points on the map and construct a graph of connected spaces. I implemented the standard O(N) algorithm for finding articulation points (calc_articulations, taken from this presentation [pdf]). I messed around with this idea but nothing came to fruition until just before the deadline.

At around this point, I got extremely sick and spent all day Wednesday in bed. That day, dhartmei's server showed up, which was a huge blessing. I ran my bot on there in the background all Thursday long, and it did very well on there too, which was a very reassuring thing. But it was still losing a lot of games.

So finally, after failing to get to sleep Thursday night thanks to coughing and being unable to breathe through my nose, I was struck by an idea at around 3am. This, it turns out, was probably the contest-winning idea, though I'm not so sure that nobody else implemented it. Anyway, take a look at this game (a1k0n_ v. ecv257):

(The little circles are the articulation points found by the algorithm above.) By the so-called Voronoi heuristic, blue has a lot more space than red does. But red is ultimately going to win this game, because the only space that blue controls that matters here is the space that borders red. Blue can choose to cut off that space and fill in the two chambers on the right, or it can choose to move into the left chamber and take its claim on what I call the "battlefront": the border between blue space and red space.

I had long ago come to the realization that a better evaluation heuristic will always beat deeper minimax searches, because a deep minimax search using a flawed evaluation heuristic is self-deluded about what its opponent is actually going to do, and will occasionally favor moves that lose to moves that win, simply because it can't tell the difference. Anything you can do to make your evaluation function smarter will result in improved play in the long run.

In this case, I decided to make my evaluation function aware of the above condition: if the player is not in the chamber containing the "battlefront", then make the choice I outlined above. More formally, the new heuristic value is the same as the old one, but Ni and Ei are counted differently. First, find all cut vertices assuming the opponent's territory by the Voronoi metric is disconnected. Start a depth-first search in the player's "chamber", count Ni and Ei within the chamber, and list all the neighboring cut vertices but do not traverse them (_explore_space). Now, explore space recursively for each adjacent cut vertex. If child j's space is not a battlefront, then our potential value is the sum of the current chamber size and child j's value. Otherwise, it is a battlefront, and we ignore the current chamber's size but add only the number of steps it takes to enter the child chamber (I don't have a good formal reason for this, it just seemed intuitively right). After computing this potential value for each child, we return the maximum of them as the current chamber's value.

Therefore the new code will handle the mutual exclusion of battlefront chambers and other chambers, and force it to choose to either ignore the upper left chamber or ignore the two chambers on the right.

The idea was extremely roughly-formed when I implemented it (see max_articulated_space), but it did improve play markedly after throwing it together (again, it didn't shut out the previous version of my bot totally but it won 12-1 IIRC).

I also had the idea of negatively counting the space we ignore on a battlefront, as we are effectively handing that space to our opponent. Never got a chance to try it, though. Might be a nice improvement.

So that was it. I submitted that Friday around noon, and was subsequently afraid to touch it. (My wife and son and I left for Wisconsin Dells right afterwards, where I couldn't help but check rankings on my cellphone and keep up on the forums the whole time, which didn't go over well) The bot is still running on dhartmei's server, and still loses many games as a result of miscounting the opponent's space in the endgame, since my endgame evaluation wasn't very good. But it was good enough for the contest.

https://www.a1k0n.net/2010/03/04/google-ai-postmortem.html

Another short C program.

Aug 24, 2007 Updated Aug 24, 2007

Show full content

b[2080];main(j){for(;;){printf("\x1b[H");for(j=1;j<2080;j++)b[j]=j<
2000?(b[j+79]+b[j+80]+b[j]+b[j-1]+b[j+81])/5:rand()%4?0:512,j<1840?
putchar((j%80)==79?'\n':" .:*#$H@"[b[j]>>5]):0;usleep(20000);}}

It's supposed to be the old fire demo effect but in ASCII. It looks kinda like camoflauge instead.

https://www.a1k0n.net/2007/08/24/obfuscated-c-fire.html

Embellishing the donut: an old-school CG cliche

Sep 20, 2006 Updated Sep 20, 2006

Show full content

_,x,y,o       ,N;char       b[1840]       ;p(n,c)
{for(;n       --;x++)       c==10?y       +=80,x=
o-1:x>=       0?80>x?       c!='~'?       b[y+x]=
c:0:0:0       ;}c(q,l       ,r,o,v)       char*l,
       *r;{for       (;q>=0;       )q=("A"       "YLrZ^"
       "w^?EX"           "novne"     "bYV"       "dO}LE"
       "{yWlw"      "Jl_Ja|[ur]zovpu"   ""       "i]e|y"
       "ao_Be"   "osmIg}r]]r]m|wkZU}{O}"         "xys]]\
x|ya|y"        "sm||{uel}|r{yIcsm||ya[{uE"  "{qY\
w|gGor"      "VrVWioriI}Qac{{BIY[sXjjsVW]aM"  "T\
tXjjss"     "sV_OUkRUlSiorVXp_qOM>E{BadB"[_/6  ]-
62>>_++    %6&1?r[q]:l[q])-o;return q;}E(a){for (
       o= x=a,y=0,_=0;1095>_;)a= " <.,`'/)(\n-"  "\\_~"[
       c  (12,"!%*/')#3"  ""     "+-6,8","\"(.$" "01245"
       " &79",46)+14],  p(""       "#$%&'()0:439 "[ c(10
       , "&(*#,./1345" ,"')"       "+%-$02\"! ", 44)+12]
-34,a);  }main(k){float     A=0,B= 0,i,j,z[1840];
puts(""  "\x1b[2J");;;      for(;; ){float e=sin
(A), n=  sin(B),g=cos(      A),m=  cos(B);for(k=
0;1840>   k;k++)y=-10-k/    80   ,o=41+(k%80-40
       )* 1.3/y+n,N=A-100.0/y,b[k]=".#"[o+N&1],  z[k]=0;
       E(  80-(int)(9*B)%250);for(j=0;6.28>j;j   +=0.07)
       for  (i=0;6.28>i;i+=0.02){float c=sin(    i),  d=
       cos(  j),f=sin(j),h=d+2,D=15/(c*h*e+f     *g+5),l
=cos(i)        ,t=c*h*g-f*e;x=40+2*D*(l*h*  m-t*n
),y=12+       D  *(l*h*n+t*m),o=x+80*y,N  =8*((f*
e-c*d*g       )*m   -c*d*e-f*g-l*d*n)     ;if(D>z
[o])z[o       ]=D,b[     o]=" ."          ".,,-+"
       "+=#$@"       [N>0?N:       0];;;;}       printf(
       "%c[H",       27);for       (k=1;18       *100+41
       >k;k++)       putchar       (k%80?b       [k]:10)
       ;;;;A+=       0.053;;       B+=0.03       ;;;;;}}

A version of this (the scroll text says "IOCCC 2006" instead) was featured in the 2006 International Obfuscated C Code Contest.

Output:

(as with the first one, compile it with -lm, and it needs ANSI-ish terminal emulation)

https://www.a1k0n.net/2006/09/20/obfuscated-c-donut-2.html

Have a donut.

Sep 15, 2006 Updated Sep 15, 2006

Show full content

(compile with gcc -o donut donut.c -lm, and it needs ANSI- or VT100-like emulation)

How it works

             k;double sin()
         ,cos();main(){float A=
       0,B=0,i,j,z[1760];char b[
     1760];printf("\x1b[2J");for(;;
  ){memset(b,32,1760);memset(z,0,7040)
  ;for(j=0;6.28>j;j+=0.07)for(i=0;6.28
 >i;i+=0.02){float c=sin(i),d=cos(j),e=
 sin(A),f=sin(j),g=cos(A),h=d+2,D=1/(c*
 h*e+f*g+5),l=cos      (i),m=cos(B),n=s\
in(B),t=c*h*g-f*        e;int x=40+30*D*
(l*h*m-t*n),y=            12+15*D*(l*h*n
+t*m),o=x+80*y,          N=8*((f*e-c*d*g
 )*m-c*d*e-f*g-l        *d*n);if(22>y&&
 y>0&&x>0&&80>x&&D>z[o]){z[o]=D;;;b[o]=
 ".,-~:;=!*#$@"[N>0?N:0];}}/*#****!!-*/
  printf("\x1b[H");for(k=0;1761>k;k++)
   putchar(k%80?b[k]:10);A+=0.04;B+=
     0.02;}}/*****####*******!!=;:~
       ~::==!!!**********!!!==::-
         .,~~;;;========;;;:~-.
             ..,--------,*/

Output:

(This was my first attempt at obfuscated C and I feel it's pretty amateurish; see Donut Mark II for a more impressive demo -- though this one is simple and elegant in comparison.)

https://www.a1k0n.net/2006/09/15/obfuscated-c-donut.html

https://a1k0n.net/atom.xml

Posts