GeistHaus
log in · sign up

https://bumbershootsoft.wordpress.com/atom

atom
10 posts
Polling state
Status active
Last polled May 19, 2026 03:26 UTC
Next poll May 20, 2026 00:12 UTC
Poll interval 86400s
Last-Modified Mon, 18 May 2026 23:44:58 GMT

Posts

Simulated Evolution on the PICO-8
nonretrocoding
The articles from these past couple of weeks kinda burned me out. After wrapping it up, I decided I was going to take it easy, do something simple and brainless in a nice comfy high-level language, and just not worry about what kind of writeups I could get from it. This backfired, of course, and […]
Show full content

The articles from these past couple of weeks kinda burned me out. After wrapping it up, I decided I was going to take it easy, do something simple and brainless in a nice comfy high-level language, and just not worry about what kind of writeups I could get from it.

This backfired, of course, and now I have a whole new project I want to do. But in the meantime, I also spent a nice lazy afternoon with PICO-8 making a port of Simulated Evolution.

Getting all of this working was mostly a matter of bringing together various techniques I’ve already had to use in other contexts—some of which aren’t even PICO-8. Below the fold I’ll talk about what challenges I faced, how I solved them, and link to the related older projects that helped me along the way.

Implementing the Simulation

The first challenge, of course, was that I had to write the simulation itself. My earlier implementations were all in C or assembly language, and PICO-8 needs to be programmed in its own dialect of Lua. Happily, Lua is, itself, pretty comfy, so adapting the C code was straightforward. It was not, however, perfectly exact.

The original BASIC code I had adapted was a little bit buggy. It intended to loop through all the bugs in each simulation step, but because of the way iterations interacted with births and deaths, sometimes a bug would have its turn skipped if some other bug died or fissioned near it in memory. My C code made this more consistent while retaining much of the structure of the BASIC original. The Lua version, however, diverges a little bit in its design, and as a result while I did keep things consistent, it’s differently consistent.

All my previous implementations kept all the bug creatures in an array. If a bug died, the last bug in the array would be moved over to take its place and the total size of the array would drop. To ensure that the moved bug didn’t miss its turn—this was the issue in the BASIC original—it would meddle with the iterator to reprocess the same index with its new bug. A similar process was done when a bug fissioned: one of the newborns would replace the original parent, while the other was appended to the array. All of this was in the service of safely modifying a collection while iterating through it.

My Lua implementation elected to instead simply not do that. It instead made use of something more like a “for-each” loop, which makes it much less feasible to meddle with the iterator index in this way. However, the closest thing that Lua has to arrays are hashtables with integer keys, which in turn means they’re easy to create, grow, and shrink. For the Lua implementation I created BORN and DIED tables to collect the necessary changes along the way, and then use its table-manipulation functions after the iteration to delete any dead bugs and introduce any newly-born ones. However, that does introduce our discrepancy: in my previous implementations, the newborns had a simulation tick on the same turn they were created, while in the Lua one it does not. Neither result is truly crucial to the correctness of the simulation, but I’ll take either of these over the original behavior where exactly one of the newborns got a simulation tick on their birth turn.

Screen Memory as Real Memory

I wanted to replicate the trick I did in my Cyclic Cellular Automaton project, where the spritesheet is used as a sort of extra backbuffer for the screen. This would let me efficiently both track and render the plankton without declaring a supplemental 15,000-element array. There’s just one problem: that spritesheet is indexed as a 128×128-pixel region, and I need it to be a 150×100 one.

The solution here was to lean into some of the logic that I used on the Commodore 64 port. That port used a bitmap screen, which split up the bitmap display into 4×8-pixel blocks. To place or check for plankton, we would first find the block of pixels that our map coordinate corresponded to, and then index into that block for the final test.

PICO-8’s map facility is essentially a super-powered version of that bitmapped block placement. I set aside 247 of my 255 possible map tiles and arranged them into a 19×13 grid of 8×8-pixel blocks, into which my 150×100 world would fit. I then wrote a function that take world coordinates and return the X and Y coordinates in the 128×128 spritesheet space encoded into a single integer:

FUNCTION SCOORD(X,Y)
LOCAL CHAR=19*(Y\8)+(X\8)+1
LOCAL CY=8*(CHAR\16)+(Y&7)
LOCAL CX=8*(CHAR&15)+(X&7)
RETURN (CY<<7)|CX
END

(In PICO-8’s Lua dialect, the backslash is a truncating integer divide. This also means that X\1 is a reasonable way to convert a real number into an integer.)

Now we face our second problem: it’s not just the spritesheet that is 128×128. The screen is too. We can’t fit the whole world on-screen at once. We faced that problem on our other 8-bit version, for the Atari 800, and the high-level solution is identical: allow the user to scroll the display as desired. Vertical scrolling over a larger buffer was a trick that the Atari’s display lists made easy, and scrolling over a large map in any direction is even easier on the PICO-8. The whole map subsystem is dedicated to making this precise problem as easy as possible.

It turns out that this ease extends to sprites and custom drawing operations as well; the CAMERA() and CLIP() commands let you translate any drawing commands on their way to the screen and also restrict changes to a particular region within it. With this capability, I decided I could just leave the bugs as sprites, keeping the spritesheet bitmap exclusively for tracking the plankton. My DRAW() routine thus became this:

FUNCTION _DRAW()
CLS()
PRINT("SIMULATED EVOLUTION",26,0,10)
PRINT("POPULATION: "..#BUGS,0,120,7)
PRINT("⬅➡\F7 TO SCROLL",72,120,12)
CAMERA(CAM_X,-12)
RECTFILL(-2,-2,151,101,6)
MAP(0,0,0,0,19,13)
FOR BUG IN ALL(BUGS) DO
RECTFILL(BUG.X,BUG.Y,BUG.X+2,BUG.Y+2,15)
END
CAMERA()
END

(The emojis there are present in the source! PICO-8’s extended character set gets rendered as unicode graphics to the extent that it can when it saves it out.)

The borders of the world scroll in and out of view as necessary, and CAMERA()‘s transform did everything we needed to produce that effect.

Dodging System Limitations

One of the weirder parts of the PICO-8 platform is that its numbers are restricted to the 16-bit signed integer range of -32768-32767. That’s inconvenient for us; when I made the simulation 8-bit friendly I relied on unsigned 16-bit numbers and needed to be able to generate random numbers with values up to 40,960. That’s something we could work around with some clever extra checking, but happily we don’t need that. PICO-8’s numbers are 32-bit fixed-point, with 16 bits of fraction hard-coded in. We can just divide all our values in half and everything still works, no negative or rounded values required.

Even more happily, it turns out that its Lua dialect accepts negative numbers to its shift operations, so the expression 1 << -1 is actually 0.5 and we don’t even need to special-case anything.

Handling Metadata

One of the key aspects of Simulated Evolution is the “Garden of Eden:” the environment can be modified to produce a special region with far more food in it than normal. Its presence or absence has an effect on the behavior that ultimately ends up selected for in the bugs.

My PICO-8 port of Lights Out introduced me to the MENUITEM() facility for customizing the pause menu. I added a Garden of Eden switch there, and also a way to reset the simulation without rebooting the cartridge; that way your configuration didn’t get lost alongside it.

The reset ultimately worked exactly the same way the “New Puzzle” option did in my PICO-8 port of Lights-Out. The garden is a little more interesting, though, because it has to alter itself as part of its own handler to announce whether the garden is currently active or not. This turns out to map closely to an example in the manual, happily enough. Their sample code demonstrates the quirk where if you omit arguments from MENUITEM(), or offer them as NIL, it borrows the values from the option being updated. This is our menu code, in the end, run at the top level after defining _INIT(), _UPDATE60, and _DRAW():

MENUITEM(1|0X300,"NEW SIMULATION",_INIT)
MENUITEM(2,"GARDEN: ON",
FUNCTION()
GARDEN=NOT GARDEN
MENUITEM(NIL,"GARDEN: "..(GARDEN AND "ON" OR "OFF"))
RETURN TRUE
END)

Much of the graphical data for the project had to be supplied externally, again. The “spritesheet” data is all dynamic, but laying out all 247 tiles of the map promised to be extremely tedious. I wrote a couple lines of Python to spit out the map data for me; it’s just an ever-increasing list of hex numbers. This actually backfired the first time I tried to do it; if the hex numbers use capital letters for A-F, the reader rejects them but does so by silently replacing them with the transparent Tile 0. At least that corruption shows up in re-saved copies!

PICO-8 also includes some data import facilities of its own. The IMPORT command will let you load a 128×128 image as the spritesheet, or even put smaller images into particular places within it if you don’t want to make the atlas all at once. Even better, for me, the “cartridge label” may be imported by passing an additional -L option. That meant I could use Aseprite to create an image for it that brings in the beaker and petri dish from the Amiga and Mac icons:

I’m still not much of a pixel artist, alas, but I figured this one deserved more than a lazy screenshot.

What’s Missing

The biggest thing missing is something I didn’t want: lag! Even when I set the simulation to update at 60Hz, we still only use 60% of the PICO-8 CPU. Given the extra work I had to do in rendering here, I didn’t expect to fit the budget so cleanly. 60% is still enough that Warp Mode is a fool’s errand, though. It would just look like occasional frameskip.

I also don’t allow custom seeds. I didn’t pack my own RNG with this one, relying instead on PICO-8’s built-in RNG one. I could have fixed that—the 16-bit Xorshift PRNG I used for the 8-bit versions can be implemented easily enough in its fixed-point math—but without a keyboard to enter the seed with, the feature would be more irritation than convenience. I opted for simplicity here.

Downloads

As is traditional for PICO-8 works, the cartridge art up above includes the program within it, steganographically encoded within the cover art. I’ve also added it to the SimEvo collection on the Projects page and put the source code proper in the Bumbershoot Software GitHub repository.

Next time: a new project!

http://bumbershootsoft.wordpress.com/?p=5825
Extensions
Comparing an LZ4 Decompressor on Four Legacy CPUs
retrocodingtheorycrafting
A few years ago, I needed to save some cartridge space in a SNES project, and I did so by compressing that data with the LZ4 compression algorithm from 2012. I found that working within the constraints of the SNES allowed me to take some convenient shortcuts during decompression and I have since found those […]
Show full content

A few years ago, I needed to save some cartridge space in a SNES project, and I did so by compressing that data with the LZ4 compression algorithm from 2012. I found that working within the constraints of the SNES allowed me to take some convenient shortcuts during decompression and I have since found those constraints and shortcuts to be widely relevant on the 8- and 16-bit platforms my hobby programming often involves. I accumulated two more implementations (for Motorola’s 6809 and 68000 processors) as I worked on projects for the Tandy Color Computer and the Sega Genesis.

Now, as a consequence of my recent experiences with the Sorcerer, I found myself with four new implementations, for the Z80, two CPUs related to it (the Intel 8080 and 8086), and the 6502. As I mentioned at the time, the Z80 implementation in particular was very straightforward, and it informed the other three pretty directly.

My article about this for the 68000 was kind of desultory—it’s pretty clear on rereading it that the only thing I found interesting about it was the way the 16-bit code ended up bearing more in common with the 8-bit 6809 code than the 16-bit 65816 code. I expect this article to be my last hurrah on the subject, so I’d also like to give it more respect and more scope. I’ll start with my potted summary of how LZ4 works and what variations I made to it, but then also look at why LZ4 is so friendly to the Z80, especially as it compares to the 8080 and 6502. After that, I’ll use the LZ4 implementation as a worked example of last week’s article comparing all these CPUs by showing how the Z80 implementation can be easily transformed into the 8080 and x86 implementations, and some of the very different implementation decisions the 6502 version must make.

Fair warning: This article is very long and very dry compared to my usual fare. If you want to see the same function implemented in four different assembly languages, this is the good stuff and go on and have fun. If not, maybe come back next week when I expect to be goofing around with high level languages again.

A Review of LZ4

(I’ve repeated this section in every implementation article so that readers don’t have to click back for context; if you’ve seen this all before, feel free to skip ahead to the section “Other Restrictions on LZ4 Encoders.” You’re not missing anything.)

LZ77-class decompressors all work on the same two basic tricks:

  • If a string of data repeats itself throughout the text, you specify an offset and a length within the output to copy. This will be shorter than repeating the data itself.
  • If the length of the amount to repeat kicks you past the original end of the output, keep repeating the values that you originally wrote. This matches what happens with a naïve copy loop, anyway, so this normally works out. It also means that this class of decompressors can do run-length encoding for free, even when the run is a copy of multiple bytes instead of just one.

LZ4’s block format hinges on the insight that we can think of a compressed stream as alternating between strings of copied values and backreferenced ranges. Backreferences might follow one another directly, but if two blocks of literals are adjacent, that’s just a single block of literals. As such, the fundamental unit of decompression is a string (possibly length zero) of literals, followed by a backreference. The format specification calls this pair a sequence.

A sequence begins with a single byte giving the length of both the next literal string and the next backreference; the top four bits represent the string length directly, and the bottom four bits represent the backreference length after adding four to it. (If your backreference is shorter than four, it’s not worth the space to create a new sequence compared to just copying the literals again. Thus, four is the shortest meaningful backreference.) This is followed by that many literal characters to copy into the output, and then a two-byte little-endian value to indicate how far back in the output to copy the backreference from.

It may, of course, come to pass that your literal string is longer than 15, or your backreference is longer than 19. To represent this, the length nybble for that part is recorded as 15 and additional bytes are added to the stream to indicate how much to add to the length. These bytes occur right after the initial lengths byte for the literal string, or just after the offset for the backreference. Bytes are read and added to the length until one of them is not 255; then that value too is added in to produce the final result. (This does mean that a literals length of exactly fifteen needs an extra length byte of zero to indicate that we did in fact mean fifteen exactly.)

Our Variation

Encoders have to worry about a few extra constraints, because there are old decoders that rely on hacks to allow decompression to go way faster. Since we’re decoding, we don’t have to care about those, but one of them is extremely useful to us: We are guaranteed that the final sequence contains only literals. The compressed data ends just before what would otherwise be an offset word.

The other important fact about offset words is that they may never be zero. After all, that would mean we would need to copy from underneath the very byte we were writing!

Combined, these facts are what let us dispense with the traditional frame data: we can simply null-terminate the compressed data with a pair of zero bytes and use “our alleged backreference has offset zero” as the signal to stop decompressing.

Other Restrictions on LZ4 Decoders

There are three restrictions placed on the LZ4 encoder when creating compressed blocks. We only rely on the first.

  • The last sequence is only literals.
  • The last sequence has at least five literal characters in it, unless it is the only sequence in the entire block.
  • The backreference in the next-to-last sequence begins at least 12 bytes before the end of the block.

The specification explicitly states that a consequence of this is that no string of less than 13 bytes can be compressed. It also describes these rules as being “in place to ensure compatibility with a wide range of historical decoders which rely on these conditions for their speed-oriented design.” That’s interesting to me, because my own exploitation of these rules was to ensure correctness more easily and to reduce the amount of state the decoder needed. Speed was not really involved at all.

I can, at least, come up with an easy efficiency-oriented reason to insist on the last sequence being only literals; the LZ4 frame format (which I have discarded) signals termination by pre-declaring the size of the compressed data. Insisting on ending with literals means we only have to actually adjust or check that counter after copying a block of literals. Without this restriction, we’d have to juggle the check in with reading the extended length bytes of the backreference.

I am utterly in the dark as the the advantages the last two restrictions might grant, though, particularly since while decoding we generally won’t know that we’re in the last two sequences and neither of these restrictions apply to any previous sequences within the block. My best guess on the second is that for every block but the last you can transfer four bytes unconditionally with a 32-bit move and just adjust your destination pointer if it turns out that copied too much, and you will still (thanks to the minimum backreference size being four bytes long) never blow out your destination buffer. I don’t even have bad guesses for what the final restriction buys us.

LZ4 On Our Various Chips

Before digging into the details of the implementations, let’s take a high-level look at how the decompression algorithm interacts with the capabilities of our four CPUs.

Our Yardstick: The Zilog Z80

The LZ4 decompression algorithm revolves around making block copies, which the Z80 is quite good at thanks to the LDIR instruction. However, that is not the only way it fits well with the chip. We can only conveniently use two pointers at once on the Z80, with the remaining three general-purpose registers managing the overall computation and the index registers kept at a bit of a remove from the rest of the instruction set. LZ4 only needs two long-lived pointers as well; the source and destination pointers. A third pointer value shows up regularly as the alternate source pointer when copying from a backreference; however, this pointer effectively temporarily replaces the source pointer over its lifetime; it is, with one slight wobble, computed, used exclusively in place of the source pointer, and then thrown out. This means that to the extent that we won’t be able to fit everything in registers, our usage of the stack can restrict itself to storing temporary values.

The “wobble” shows up with long-running backreference copies. If the backreference part of a sequence is more than 19 bytes long, the remaining bytes in the length data appear after the offset. This requires us to stash the backreference pointer briefly after computing it so that we may read a few more bytes from the source pointer first. While the Z80 is not really well-suited to keeping local variables on a stack, it inherits an instruction from the 8080 that allows it to swap HL with the top value on the stack. This turns out to be sufficient to let us juggle our source pointers as needed, and everything works out.

Doing Things the Hard Way: The Intel 8080

The 8080 implementation almost exactly matches the Z80 one, except for assembler syntax. There are only two Z80-specific capabilities we use at all in the implementation: the LDIR block-copy command, and the single instruction SBC HL,BC. We replace the block copies with hand-written loops, and break out the 16-bit subtraction into a pair of 8-bit subtractions. These operations mean we make heavier use of the accumulator, and as a result we do need to save and restore registers to the stack a bit more frequently.

The translation here was otherwise very direct; we can read the 8080 version as being exactly the Z80 version with some blocks of instructions filling in for the missing ones.

Doing More With More: The Intel 8086

The 8086 has enough spare registers, and enough expressive capability, that we do not need to touch the stack at any point the 8-bits do. We do still hit the stack a bit though; the customary ABI preserves some of the registers we use, and we shouldn’t violate that without a good reason. Furthermore, I expanded the scope of the 8086 implementation to work with “far pointers”, covering the entire 1MB address range. This required some juggling of the segment registers, and I felt the code was a bit clearer if it used the stack while doing so.

Beyond that, we get extremely good use out of the “string” instructions that serve as autoincrementing loads, stores, and moves. Not only do these serve the same purpose as the Z80’s LDIR instruction, they are actually useful in even more places in the implementation. A consequence of this, however, is that we often find that the main accumulator gets overwritten by different intermediate computations than it does on the 8-bit editions, which resulted in me shifting around some parts of the order of operations as a result.

And Now For Something Completely Different: The MOS Technology 6502

The 6502 is vastly more restricted, when viewed from the perspective of the Z80 or its relatives. It has zero registers that can double as pointers, its 8-bit stack struggles to accomplish much beyond preserving values in strict stack discipline, and it offers even less direct support for 16-bit operations than the 8080.

The gap narrows considerably, however, when we shift the perspective to what the 6502 is actually good at. I dedicated an entire article to the differences in implementation philosophy encouraged by the 6502 compared to the Z80, and much of that analysis applies here. The weaker stack and smaller register bank mean that the 6502 relies far more heavily on scratch space elsewhere in RAM, but that very reliance means that we have no register pressure whatsoever. All arithmetic goes through the accumulator, and 16-bit values can’t be used out of any registers at all, so conceptually all the operands are coming out of memory and returning to it once processing is complete.

The LZ4 algorithm is definitely a less comfortable fit overall, though. The 6502 is much less comfortable with pointers than the Z80, or even the 8080, is, and we will have only limited opportunities to work around that. In my old article, I noted that the 6502 vastly prefers to work with arrays than with pointers, and that furthermore, when we have multiple arrays, it wishes to be working with corresponding indices within those arrays. I ended up writing an entire helper function to implement what the Z80 does with LDIR, and it was able to provide that abstraction of corresponding indices; however, the decompressor as a whole is not afforded that luxury. It is, after all, kind of the whole point of compressed data that we will be advancing the destination’s pointer more rapidly than the source’s!

Designing the API

The Z80 API is straightforward, and is basically dictated by the fact that we’re going to lean on LDIR for our copy operations. HL will be our source pointer, pointing to the beginning of the compressed data, and DE will be our destination pointer, pointing to the beginning of the buffer we wish to fill with decompressed data. Registers ABC will all be scratch registers for the function, and on function exit, both HL and DE will point one byte past the last byte read or written.

As I mentioned above, the 8080 implementation is extremely close to the Z80 one and as a result it will enjoy an identical API.

The 8086 was a bit more fraught. We’ve now reached a point where things like “calling conventions” actually exist, and the C compilers I’ve worked with are pretty consistent about what they expect. BP is used as a frame pointer to access arguments that were pushed on the stack prior to the call, 16 bit return values are returned in AX, the high 16 bits of 32-bit return values are returned in DX, those two along with CX are scratch registers, and everything else is preserved. This is extremely inconvenient for comparison purposes with my other implementations, so I’ve bent it heavily.

I have retained the notion that the function’s API should closely match the block-copy instruction. For the 8086, that means these must be far pointers that range over the full 1MB of address space, with DS:SI as the source pointer and ES:DI as the destination pointer. While these are normally all preserved, I let the routine update SI and DI for use as a pair of return values (again, pointing just past their final byte processed). Those two pointers aside, I do also ensure that all segment registers are preserved alongside BP and BX.

The 6502 API is, like the others, dictated by the actual processing we do, but since its architecture is so different, the API is likewise distinct. Here, we set aside four bytes of scratch space in the zero page to hold the source and destination pointers. The routine updates those memory locations directly, leaving them in their final locations on function exit. Four additional scratch bytes are necessary as well, but I implemented the function such that they do not need to be in the zero page. Like most functions of any complexity, this one trashes all three of the 6502’s registers.

Implementation on the Z80 and Intel Chips

Let’s now look at the implementations in detail. The 8080, Z80, and 8086 implementations are so similar that it’s worth simply tracking them in parallel.

We open with a function prologue… or we would, if we had one. Only the 8086 has one; it needs to save off the BX register because it will be trashing it over the course of the function and it needs to restore the caller’s value on the way out.

        Intel 8080            Zilog Z80                     Intel 8086
    ------------------    -----------------             -----------------
lz4dec:                                                     push bx

After that, we begin our main loop by reading the next byte from the compressed data and advancing the source pointer. We’ll need to split this into two 4-bit values over the course of the function, so we’ll need to stash a copy of the original byte for later.

The 8080 and Z80 code here is actually identical, though the instructions look very different. When reading a byte out of the memory location specified by HL, 8080 syntax refers to it as the pseudo-register M. Furthermore, instructions that work on register pairs are distinct from those that work on single registers, so they are usually named only by their high byte (so, H for what the Z80 calls HL, and so on). A further exception, which we also see here, is that the register pair that the Z80 calls AF—the accumulator and flags—is instead called PSW—the processor status word.

The 8086 code, on the other hand, is much different. Instead of pushing the byte read to the stack, it simply stores it in the BL register where it will be left alone until needed. More interestingly, it uses one of the CPU’s “string processing” instructions. The three basic instructions are LODSB, STOSB, and MOVSB:

  • LODSB (“Load String Byte”) is equivalent to MOV AL,[DS:SI]; ADD SI,1. This instruction replaces the first two instructions in the 8-bit case.
  • STOSB (“Store String Byte”) is the reverse; it is equivalent to MOV [ES:DI],AL; ADD DI,1.
  • MOVSB (“Move String Byte”) combines them into MOV [ES:DI], [DS:SI]; ADD SI,1; ADD DI,1. This is roughly equivalent to the Z80’s LDI instruction.
  • It is also possible to move 16-bit words (with AX as the other register in the load and store cases) by replacing the B with a W. Once the CPU goes 32-bit, a double-word variant also appears relying on EAX.
  • There is a “direction” flag that causes the pointers to be decremented instead of incremented; calling conventions all demand that this flag be set to increment mode at all function boundaries, and we don’t mess with it here.
  • The store- and move-string instructions may be repeated by prefixing the instruction with REP. This uses CX as a count register, and functions analogously to LDIR or LDDR depending on the direction flag.

The end result is that in this brief snippet the 8086 code is one instruction shorter:

        Intel 8080            Zilog Z80                     Intel 8086
    ------------------    -----------------             -----------------
.main:  mov a,m               ld a,(hl)                     lodsb
        inx h                 inc hl
        push psw              push af                       mov bl,al

Now we need to handle the literals. The 8086 code is, again, pretty straightforward; it isolates the high nybble by shifting AL right 4, then skips ahead to backreference processing if the result is zero. Otherwise it calls our .rdlen helper function to finalize the length and does the block copy with REP MOVSB. The Z80 and 8080 code is similar, with LDIR handling the block copy on the Z80 side and the 8080 (which lacks this instruction) filling it in with a more manual loop. There’s an interesting difference in the way they collect the high nybble, though:

        Intel 8080            Zilog Z80                     Intel 8086
    ------------------    -----------------             ------------------
        rrc                   rrca                          mov cl,4
        rrc                   rrca                          shr al,cl
        rrc                   rrca
        rrc                   rrca
        ani 15                and 15
        jz .bkref             jr z,.bkref                   jz .bkref
        call .rdlen           call .rdlen                   call .rdlen
.lp1:   mov a,m               ldir                          rep movsb
        stax d
        inx h
        inx d
        dcx b
        mov a,c
        ora b
        jnz .lp1

The 8080, it turns out, doesn’t have any shift instructions. Instead, it has four “rotate accumulator” instructions, two of which are 9-bit rotations through the carry and two of which are 8-bit rotations strictly within the register (with the carry bit mirroring the bit that moves from least to most significant). On the 8080, we need to rotate right four times to swap the high and low nybbles and then mask out the upper bits. On the Z80, we actually have a proper shift-right instruction SRL A but this instruction, as part of its extended instruction set, is 2 bytes long and takes 8 cycles to execute, while RRCA, which it inherited from the 8080, is only 1 byte and takes only 4 cycles to execute. Copying the 8080’s more roundabout processing is both smaller and faster. This is one of those cases I noted more abstractly last week: the Z80 rewards staying within the 8080’s constraints when it’s not very awkward to do so.

The literals of this sequence handled, we now move on to the backreference part. Step one is always to read a 16-bit value from the compressed data and end decompression if it is zero. Once again, the string operations dramatically simplify the memory access on the 8086 side, and its 16 bit registers mean that the zero test may be done with a single operation as well:

        Intel 8080            Zilog Z80                     Intel 8086
    ------------------    -----------------             ------------------
.bkref: mov c,m               ld c,(hl)                     lodsw
        inx h                 inc hl
        mov b,m               ld b,(hl)
        inx h                 inc hl
        mov a,c               ld a,c                        test ax,ax
        ora b                 or b
        jz .done              jr z,.done                    jz .done

(The TEST instruction is basically an AND that throws away the result; when passed the same register twice it basically asks the CPU to set the Zero and Sign flags according to that register’s value.)

At this point, things start to diverge. Several things need to happen, and which operations interfere with what are different on each chip:

  • Construct the backreference pointer by subtracting the offset we just read from the destination pointer.
  • Read the rest of the backreference length out of the original source pointer, and add 4 to it.
  • Do the block copy with the backreference pointer standing in for the source pointer.
  • End this sequence with the source pointer one byte past the last length byte we read (or just past the offset, if this backreference was 18 characters or less).

Keeping the source pointer happy means it will need to be saved and restored to the stack, on the 8-bits, and that also means that we’ll need to pop the length byte before pushing the source pointer onto the stack. This all works out very neatly, with a high level procedure something like this:

  • Pop the length byte from the stack and isolate the low nybble this time.
  • Save HL to the stack. This is our source pointer that we’ll be setting aside, but also then…
  • Compute HL = DE-BC to get our backreference pointer. The Z80’s extended instructions make this easy. The 8080 has to do more work, which also means re-saving and re-restoring the accumulator.
  • Swap the backreference and source pointers, finalize the backreference length by reading any extra length bytes, then swap them back and do the bulk copy.
  • Pop the source pointer back into HL, ready to begin the next sequence, and jump back to the top to begin processing the next sequence of literals.

The 8086, however, has to reorganize things a bit. The 8-bits were able to load their 16-bit offset directly into BC because that is what its 16-bit math demanded, and that meant that it was free to use A to hold the initial backreference length during the subtraction. The 8086, however, loaded into AX with its LODSW instruction, and needs to completely finish computing the backreference pointer before dealing with the length at all. Fortunately, since it isn’t using the stack to hold temporary values, this is just a matter of register moves. The backreference pointer is computed in DX and exchanged with SI as needed; the saved length value remains in BL for as long as we need it and is copied over once it’s necessary to be passed to .rdlen.

        Intel 8080            Zilog Z80                     Intel 8086
    ------------------    -----------------             ------------------
        pop     psw           pop af                        ; Below
        ani     15            and 15
        push    h             push hl
        mov     h,d           ld h,d                        mov dx,di
        mov     l,e           ld l,e
        push    psw           or a                          sub dx,ax
        mov     a,l           sbc hl,bc
        sub     c
        mov     l,a
        mov     a,h
        sbb     b
        mov     h,a
        pop     psw
        xthl                  ex (sp),hl
        ; above               ; above                       mov al,bl
                                                            and al,15
        call .rdlen           call .rdlen                   call .rdlen
        inx b                 inc bc                        add cx,4
        inx b                 inc bc
        inx b                 inc bc
        inx b                 inc bc
        xthl                  ex (sp),hl                    xchg dx,si
                                                            push ds
                                                            push es
                                                            pop ds
.lp2:   mov a,m               ldir                          rep movsb
        stax d
        inx h
        inx d
        dcx b
        mov a,c
        ora b
        jnz .lp2
        pop h                 pop hl                        mov si,dx
                                                            pop ds
        jmp .main             jr .main                      jmp .main

There’s one last subtlety we can see in the 8086 code; when we swap the pointers, we also need to copy the value of ES into DS for the copy so that the backreference pointer has the correct segment as well, then restore it once we’re done. It’s a bit inconvenient to move values in and out of segment registers; I just do stack operations here because it’s clear and simple.

We also can see some counterintuitive optimization in the 8-bit code; we add 4 to BC simply by calling INC BC four times. This turns out to be both faster and shorter than trying to add 4 to BC directly with 8- or 16-bit math.

All that’s left in the main function is the exit; each function needs to restore the stack and return. On the 8-bits, this involves popping the unused backreference length off the stack; on the 8086, it involves restoring the BX register, which we aren’t allowed to permanently trash in our calling convention.

        Intel 8080            Zilog Z80                     Intel 8086
    ------------------    -----------------             ------------------
.done:  pop psw               pop af
                                                            pop bx
        ret                   ret                           ret

Now we need to implement the .rdlen helper function. This function takes an initial length in A (AL), and returns a final length in BC (CX). If the initial length is under 15, that is also the final length; otherwise it keeps reading bytes and adding them to the total length until we read a byte whose value is not 255. The first thing to do is to copy our 8-bit (really, 4-bit) value into the 16-bit result and quit immediately if the value isn’t 15:

        Intel 8080            Zilog Z80                     Intel 8086
    ------------------    -----------------             ------------------
.rdlen: mvi b,0               ld b,0                        xor ah,ah
        mov c,a               ld c,a                        mov cx,ax
        cpi 15                cp 15                         cmp cl,15
        rnz                   ret nz                        jne .rdend

The 8-bits handle this by zeroing the destination high byte and copying the low byte; the 8086 handles it by zeroing the high byte of AX and copying the whole word over. The 8086 also lacks RET NZ, and instead jumps to the end of the function if it needs to quit immediately.

Reading an 8-bit value and adding it to a 16-bit counter is old hat for us at this point, but the 8086 gets to be a little sneaky here; since it zeroed out AH at the top of the function, it can load bytes exclusively into AL and then just do 16-bit adds. The 8-bit chips need to use the accumulator to carry out the add, but also need to preserve the byte read for the final check, so there’s some stack work here too.

        Intel 8080            Zilog Z80                     Intel 8086
    ------------------    -----------------             ------------------
.rdlp:  mov a,m               ld a,(hl)                     lodsb
        inx h                 inc hl
        push psw              push af                       add cx,ax
        add c                 add c
        jnc .rdok             jr nc,.rdok
        inr b                 inc b
.rdok:  mov c,a               ld c,a
        pop psw               pop af

Finally we loop back if the value we had read was 255. The easiest way to check this on all platforms is to do an 8-bit increment on it and see if that makes it zero.

        Intel 8080            Zilog Z80                     Intel 8086
    ------------------    -----------------             ------------------
        inr a                 inc a                         inc al
        jz .rdlp              jr z,.rdlp                    jz .rdlp
.rdend: ret                   ret                           ret

That gets us all three implementations for the ’80 cousins. Let’s shift gears to the more distinct 6502 now.

Implementation on the 6502

It turns out that, at least at first, the 6502 code doesn’t diverge that much from the Z80. While the golden rule of 6502 data processing is that we want that data to be in arrays and processing to be in corresponding elements, the only part of LZ4 compression that lets us do that is the bulk copy that the Z80 handles with LDIR. Pretty much everywhere else, we will be locking down the Y register and using the “indirect indexed” mode as an expensive pointer mechanism. With src and dest as 4 bytes of data in the zero page that also serve as our arguments and return values, we can map those to HL and DE and our handling of the literals in a sequence tracks very closely:

        MOS 6502              Zilog Z80
    ----------------      -----------------
lz4dec: ldy #$00              ld a,(hl)
        lda (src),y
        inc src               inc hl
        bne +
        inc src+1
+       pha                   push af
        lsr                   rrca
        lsr                   rrca
        lsr                   rrca
        lsr                   rrca
                              and 15
        beq .bkref            jr z,.bkref
        jsr .rdlen            call .rdlen
        jsr .ldir             ldir

(Syntax note: 6502 code requires a lot more of these short-term jumps when it’s doing its 16-bit math, so I’m using + as a temporary label. Branching to + just means “skip to the next + label” in each case.)

The only real divergence at this point, beyond INC HL and LDIR requiring more work, is that the 6502 does have proper logical-shift instructions and doesn’t need a masking step. Our replacement for LDIR turns out to have quite a lot going on in it, so we’ll hold off on the details there. Suffice to say that we have created another 16-bit variable—not necessarily in the zero page this time—named .count and it is the length value produced by .rdlen and consumed by .ldir, much like those uses of BC in the Z80 code.

The backreference processing on the 6502 diverges completely from the Z80 code, to the point that there’s no sense in putting the Z80 code alongside it for reference. It begins by reading the 16-bit offset from the compressed data stream to produce the backreference pointer, much like on the Z80, but instead of shuffling data around on the top of the stack, or even storing out the offset to more scratch RAM, we instead do the subtraction as we load and store that result to our scratch RAM under the name .bksrc:

        MOS 6502
    ----------------
.bkref: sec
        lda dest
        sbc (src),y
        sta .bksrc
        iny
        lda dest+1
        sbc (src),y
        sta .bksrc+1

As part of this read we’ve bumped up the Y register to let us quickly index the next two bytes. Our next step will be to renormalize the src pointer and the Y register so that when we visit the .rdlen function it will be ready to go:

        MOS 6502
    ----------------
        dey
        clc
        lda src
        adc #$02
        sta src
        bcc +
        inc src+1
+

Unlike the Z80 and 8080, where repeated 16-bit INC instructions are the fastest way to add small numbers, the calculus flips in favor of direct 16-bit math on the 6502 basically immediately. Note that we check the carry flag this time instead of the zero flag, since we could have skipped over the page boundary without hitting it exactly, and also note that it remains both shorter and faster to do this branch-or-increment dance than it does to add zero to the high byte with carry.

With this work done, we now need to see if the offset was zero, which is a bit more problematic here because we didn’t actually store it. We used it immediately to create .bksrc. What we can do, however, is see whether the result of the subtraction changed anything. If .bksrc != dest, then the offset wasn’t zero and we can proceed with processing:

        MOS 6502
    ----------------
        lda dest
        cmp .bksrc
        bne .bkok
        lda dest+1
        cmp .bksrc+1
        bne .bkok

Otherwise, it was zero and we need to pop our useless length byte and return, just like on the Z80.

        MOS 6502
    ----------------
        pla
        rts

Assuming everything was fine, though, it’s time to pop our still useful length byte, isolate the backreference part, and complete the length read, including adding the extra four to the counter. Like the 8086, no juggling is necessary here because we never displaced the src pointer.

        MOS 6502
    ----------------
.bkok:  pla
        and #$0f
        jsr .rdlen
        clc
        lda .count
        adc #$04
        sta .count
        bcc +
        inc .count+1
+

Now it is time to juggle some values. We will be hardcoding the pointer locations for our .ldir helper function as src and dest, so we need to stash the original src and replace it with with the value in .bksrc before the call.

        MOS 6502
    ----------------
        lda src
        ldx .bksrc
        pha
        stx src
        lda src+1
        ldx .bksrc+1
        pha
        stx src+1
        jsr .ldir

Stackwork can be expensive, but using the stack here produces shorter code than just using .bksrc itself as swap space, and if .bksrc is not on the zero page, the code is faster, too.

We are now at the very end of the loop. This is where we finally re-sync with the Z80 code, restoring the original src pointer and jumping back to the start of the loop:

        MOS 6502              Zilog Z80
    ----------------      -----------------
        pla                   pop hl
        sta     src+1
        pla
        sta     src
        jmp     lz4dec        jr lz4dec
Implementing the Length Reader

With the main function complete, it’s time to move on to the helper functions. Both of these end up relying on the fact that I’ve kept the Y register mostly locked to zero, and strictly locked to zero at all loop points and function boundaries.

We’ll look at .rdlen first. It ends up synchronizing much more closely with the Z80 version. There isn’t much to say about it that we didn’t say about the eighters’ implementations, so here I’ll just quote the two in parallel:

        MOS 6502              Zilog Z80
    ----------------      -----------------
.rdlen: sta     .count        ld c,a
        sty     .count+1      ld b,0
        cmp     #$0f          cp 15
        bne     .rdone        ret nz
.rdlp:  lda     (_src),y      ld a,(hl)
        inc     src           inc hl
        bne     +
        inc     src+1
+       tax                   push af
        clc                   add c
        adc     .count
        sta     .count
        bcc     .rdok         jr nc,.rdok
        inc     .count+1      inc b
.rdok:                        ld c,a
                              pop af
        inx                   inc a
        beq     .rdlp         jr z,.rdlp
.rdone: rts                   ret

Like the 8086, the 6502 has no conditional return statement so we use a simple branch-to-a-normal-return-statement instead. Also like the 8086, I stash the accumulator in a register (here, X) instead of relying on the stack like the Z80 does.

More interestingly, I actually played around a bit with consolidating the updates to src at the end of the function, relying instead on INY instructions inside of it. Based on a few other drafts and some napkin math, this turned out to only become worth it once your sequences extended to a kilobyte or so. Given the data I work with, I stuck with this. I think it’s clearer, anyway.

Supporting LDIR on the 6502

Now we turn to a new helper function that we’ve only needed on the 6502, and we finally get to write some code that plays well with indirect-indexed addressing. We need to write a function that does the same work that the LDIR or REP MOVSB instructions did, taking a 16-bit counter (in the .count variable) and copying that many bytes from the buffer in src to dest. At the end of the function, src and dest should each point one byte past the last byte copied in their respective buffers.

The 6502, unlike even the 8080, does not have a 16-bit decrement operator, so we’re going to need to organize this as two nested loops, one for looping the low byte of the counter, and the other for looping the high byte. We also would really prefer to let our decrement options also be our loop-end test, which on the 6502 means it has to be a test against zero. We want the loop to end with something like DEC COUNTER; BNE LOOP; DEC COUNTER+1; BNE LOOP, but that carries a subtle trap. If the initial value of COUNTER here is an even multiple of 256—that is, if the byte in COUNTER is zero to start with—this actually works. If it isn’t, you will end up iterating the wrong number. A starting value of $0100 loops 256 times, as we’d expect. A starting value of $0101, however, only loops once, and a starting value of $00d6 will loop 65,750 times!

This bug isn’t unique to the 6502; the identical bug, for the identical reason, appears in the ColecoVision BIOS in some functions, because someone decided they wanted to rely on the automatic flag updates of the 8-bit decrements instead of doing a proper 16-bit check like we did in the 8080 code. The solution is simple enough, though; we need to check that low byte and increment the high byte if, and only if, the low byte isn’t zero.

        MOS 6502
    ----------------
.ldir:  ldx     .count
        beq     .llp
        inc     .count+1

I loaded into X instead of A here because I intend to keep the low byte of the counter in a register for faster access. We’ll need Y to index the src and dest arrays, and since the copy is in sync, we use it directly for both.

The core loop is pretty simple, with an odd little division of responsibility for both; the Y register counts up, incrementing the high byte of both src and dest any time it wraps around as an offset. The low bytes of src and dest can be completely unrelated and this all still works fine. The X register, on the other hand, is counting down the total bytes copied, and it decrements the high byte of the counter once it hits zero itself. That loop wastes no space:

        MOS 6502
    ----------------
.llp:   lda     (src),y
        sta     (dest),y
        iny
        bne     +
        inc     src+1
        inc     dest+1
+       dex
        bne     .llp
        dec     .count+1
        bne     .llp

We aren’t done when this loop is, though. We need to add the value in Y to both src and dest to get their final values right. We’ve been incrementing the high bytes along the way—we had to, in order to reach later parts of the buffer with just an 8-bit offset—and now we must add what’s left over. We also reset Y to zero on the way out so that the main function can use src and dest normally with no extra work.

        MOS 6502
    ----------------
        clc
        tya
        adc     src
        sta     src
        bcc     +
        inc     src+1
+       clc
        tya
        adc     dest
        sta     dest
        bcc     +
        inc     dest+1
+       ldy     #$00
        rts

Relying on this function for both copies technically does some wasted work here; the backreference pointer doesn’t need that final update since we trash it immediately afterwards. Past a certain point, though, overall brevity and clarity have to win.

Catching Our Breath

This article ended up a lot longer than I expected, and in the service of an honestly kind of questionable goal. If you made it this far, I salute you. As for what I’ve ended up getting out of this, I’ve been collecting these drafts in their own directory in my repository so the practical outcome of this is that I now can LZ4-compress my data for Bumbershoot projects whenever I want.

From a craftsmanship standpoint, switching so rapidly between CPUs was an interesting experience for me. I’ve gotten quite a lot more experience with the Z80 lately, and working with both the 8080 and 8086 underlined a lot of the habits I’d picked up along the way. Usually I’ve stumbled a bit in my Z80 work, because the less appropriate habits of 6502 or 68000 programming would wrong-foot me from time to time, but working on this project is the first time in a long time where I’ve felt that stumbling in the other direction. I’ve had a much stronger Z80 focus over the past year or so and it’s really showing. On the plus side, I think that means at this point that my proficiency in both is up to par, and I simply need to remind myself to shift gears properly as needed from project to project.

Next week we’ll do something more freewheeling and fun, I promise.

http://bumbershootsoft.wordpress.com/?p=5810
Extensions
Comparing the Z80 and 6502 to Their Relatives
retrocodingtheorycrafting
Last week’s adventures with the Exidy Sorcerer led me to write a Z80 version of the LZ4 decompressor I’d previously used on the SNES, the CoCo, and the Genesis. At this point, this has become generic enough that I took my previous implementations and broke them out into their own little library directory so I […]
Show full content

Last week’s adventures with the Exidy Sorcerer led me to write a Z80 version of the LZ4 decompressor I’d previously used on the SNES, the CoCo, and the Genesis. At this point, this has become generic enough that I took my previous implementations and broke them out into their own little library directory so I could use them in other projects. The SNES implementation turned out to be too tightly tied to the project it was in to be made generic, but the other three were just fine.

However, while I was looking at those implementations, I rapidly found that I had not three implementations, but six; the Z80 implementation inspired versions for the earlier Intel 8080 and the later Intel 8086, and I felt the lack of a dedicated 6502 version and wrote a decompressor there as well.

My original plan was to present all four new implementations side by side as a way of highlighting the similarities and differences between the various CPUs, with maybe a little historical scene-setting at the start to map out the relationships between all the various chips. This got entirely out of hand so I’ve split it in two; this week’s article is just about comparing the CPUs in their context, and next week I’ll dig into the implementations as a worked example of why these differences matter.

The Z80 and the 8080

The Z80 is a direct descendant of the 8080; it was designed as a binary-compatible upgrade. This means that the 8080, despite not being directly used by any famous 80s-era machines, would still be very familiar to an 80s-era developer; we can describe it very well by simply treating it as a cut-down Z80. It’s pretty easy to characterize what’s missing, too:

  • Relative jumps. No JR or DJNZ instructions.
  • Shadow registers. The EX AF,AF' and EXX instructions are missing.
  • All multibyte instructions. That removes the index registers and indirect access to I/O ports, drastically restricts our options for interrupt modes and bitwise operations, and modestly restricts our ability to do 16-bit math directly on register pairs.

Interestingly, we get to keep our conditional procedure calls and returns. If you ever wondered why JR had fewer possibilities for what flag to branch on compared to JP, CALL or RET, that’s why; there weren’t enough unused opcodes in the 8080 to fit in all the possibilities somewhere that kept instruction decoding reasonable.

Speaking of instruction decoding, the 8080’s native assembly language format was very explicitly designed to closely match its instruction encoding. The full instruction set is described on its Wikipedia page, and its table makes it very clear how the opcodes directly map to the arguments the syntax permits. This is a sharp contrast to Zilog’s assembly syntax, which instead revolve around what the instructions actually do. Thus the 8080’s INR and INX instructions become the 8- and 16-bit versions of INC on the Z80, while the LD instruction on the Z80 side absorbs an absolutely dizzying number of 8080 instructions: MOV, MVI, LXI, LDA, STA, LDAX, STAX, LHLD, SHLD, SPHL, and potentially PCHL.

That collapse of instruction mnemonics on the Z80 side allowed other parts of the syntax to be more regular as well. Following the 8008, for instance, the 8080 has a notional “memory” register M that represents the byte loaded from the memory address stored in HL; the 8080’s LDAX and STAX instructions expand that capability to get the address out of BC or DE instead. Zilog followed the 6502 syntax in using parentheses to represent indirection: M becomes (HL), and the new instructions simply became places where the operands (BC) or (DE) were also legal. Less happily, it also introduces the same parsing anomaly we saw recently on the 6502; here it manifests as the instruction LD A,(20+6) translating to 8080 syntax as LDA 26 but LD A, (20)+6 translating to MVI 26 instead. The fixes are identical; there’s nothing about the processors that causes the problem.

Toolkits for 8080 development

Since the Z80 is binary-compatible, it is entirely possible to program the 8080 with a Z80 assembler as long as one avoids the instructions that the 8080 does not support. The Pasmo assembler includes a -w8080 command-line option that generates a warning if you use incompatible instructions, so it’s probably the best option for this approach. Meanwhile, WLA-DX and ASMX support traditional 8080 syntax directly.

The Z80 vs the 8086

The Z80 came out two years after the 8080, and the 8086 came out two years after the Z80. The 8086 is a full-fledged 16-bit CPU, and as such it’s in more of a class of its own compared to the 8080 or Z80. Still, the relationship is pretty clear; while the Z80 answered the question “how can we improve the 8080?”, the 8086 attacks the question “what do we get if we design a 16-bit CPU on the same principles as the 8080?”

The 8080 offers an 8-bit accumulator, supplemented by 6 less capable registers that meld into 3 16-bit “register pairs” that can also be used as pointers to access memory. It also provides a 16-bit stack pointer that allows only limited programmatic control. The Z80 supplements this with more extensive control of the stack pointer, along with two new 16-bit index registers (IX and IY) that cannot be split, and which can serve as a base address for a hardcoded displacement. It’s a little odd that the “index” register is the base address, but it’s worth noting that the Motorola 6800 (released the same year as the 8080) works the same way.

The 8086, on the other hand, has a 16-bit accumulator AX alongside three 16-bit registers named BX, CX, and DX, each of which can be split into two 8-bit registers (AX becomes AH and AL, and the other registers split similarly). These are supplemented with two index registers named SI and DI (“source” and “destination” index), and a BP “Base Pointer” to supplement the stack pointer. None of these were splittable.

Memory access is far more sophisticated in the 8086. Any of BX, BP, SI, or DI may be used directly as a pointer or with a hard-coded displacement, like Z80 index registers, but also the base registers may be added to an index register, meaning the indices are themselves proper, full 16-bit indices. Also, while the 8080 and Z80 both generally could only do arithmetic operations on A or HL, the x86 is considerably freer about what may be used where.

The 8086 also can address 1MB of memory space instead of the 64KB of the 8-bits. It manages this using a banking system similar to what we saw on the 65816; special secondary registers (CS, DS, ES, and SS, the code, data, extra, and stack segments) hold the extra necessary bits of each address. Unlike the 65816, these segments are multplied only by 16 instead of 65536 before being added to the main 16-bit address. This allows segments to partially overlap, and results in the the 16-byte “paragraph” becoming an important unit of memory measurement in 16-bit x86 systems. This segmentation system is easily the most reviled aspect of the entire 8086 design, but I must admit that I find it enormously preferable to the 65816’s bank system. The primary advantages are twofold: segment overrides may be provided to any pointer, which means that it is less necessary to juggle segment values the way it is necessary to juggle the 65816’s data bank pointer, and—even more crucially—the 8086 has two simultaneous data bank pointers (DS and ES), which allow accessing two “far” pointers simultaneously without any register juggling at all. This makes it much, much easier to write code that consumes an input buffer and produces an output buffer while working in larger memory spaces. The finer grain of the 8086’s paragraphs also means that while it shares a 64KB pointer-offsets limit with the 65816, it doesn’t have to care about buffers hitting bank boundaries nearly as much. A 64KB buffer on the 8086 is trivially fully accessible using only a 16-bit pointer as long as it’s 16-byte aligned in physical memory.

There are a few other Z80-like features the 8086 has picked up as well, including a more generic version of its LDIR family of instructions; this is where we get to see more strict assumptions from the instruction set about what registers are for. By analogy with the Z80, DS:SI serves the role of HL as the bulk source pointer; ES:DI serves the role of DE as the bulk destination pointer, and CX or CL work like BC (or just B) in holding the 16- or 8-bit counters for instructions that repeat in an LDIR-like manner.

Toolkits for 8086 development

The IBM PC, its compatible clones, and the DOS it used from 1981-1995 was kind of a big deal so there’s no shortage of toolkits for developing on the chip for any language under the sun. However, for assembly language development in the modern world, I will strongly suggest NASM over all alternatives; it uses a simplified and more regular assembler syntax than many of the period assemblers, and it felt no need to stay compatible with the older, messier systems the way that that other modern tools such as MASM did.

The Motorola 6800 and 6809

The 6800 came out in 1974, the same year as the 8080, but struggled to compete with it on price. One year later, some of its designers released the 6502 as a much cheaper competing product, and this largely turns the 6800 into a footnote. It was used as the basis of the 6809, though, which was also enormously more expensive than the 6502 or Z80, but also was quite likely the most powerful 8-bit microprocessor of its era. The 6809 saw some success, especially in arcade machines, but it did not steamroll the world the way the 6502 and Z80 did.

The 6800’s design is markedly different from the 8080 and its descendants. It offers two 8-bit accumulators (A and B), both of which may participate fully in arithmetic operations and which can interoperate in some limited ways. Unlike the 8080 design, where values are loaded out of memory into registers and then operated upon, the 6800 prefers to take memory locations directly as operands. As part of that focus, it also includes a 16-bit index register X which offers very similar capabilities to the Z80’s IX or IY register, right down to the in-instruction displacements being limited to 8 bits, thus meaning that once again our “index” register is really more properly understood as a base register.

The 6809 expands this design considerably. The two accumulators can now be glued together into a 16-bit accumulator D which can do 16-bit math. An additional index register Y and an additional stack register U supplement the pre-existing X and S. Stack registers gain all the capabilities of index registers, and index registers themselves may now take full 16-bit offsets when dereferenced. A final addition is a “direct page” register; the 6800 used shorter instructions to refer to memory addresses where the high byte was zero (the “zero page”); in the 6809, the high byte for these shorter instructions was taken from the Direct Page register.

Less immediately visible to someone working at the assembly language level instead of the machine code one is that relative addressing is much more common on the 6809, meaning that it’s significantly more viable to write position-independent code on it than any of the other chips we’ve looked at here. Only the 8086 comes close, and it achieves it by using its segment registers as a de facto relocation base.

The biggest improvement, in my experience with the 6809, is the more numerous and powerful index registers. Having both the register and the in-instruction displacement be the size of the entire 16-bit address space means that either value may be used as a base or index, which grants the programmer considerable freedom. Being able to use the stack pointers as index registers is also very important to modern practice; it lets us finally set up stack frames with local variables in the ways we can on the 8086 or on more modern systems.

Toolkits for the 6800 and 6809

I used asm6809 for my own 6809 work. I haven’t actually done anything with the 6800 yet, but of my usual stable of assemblers it looks like WLA-DX supports it.

The MOS Technology 6502

Like the Z80, the 6502 was intended to compete with the chip that inspired it. Unlike the Z80, the 6502 is a much more distinct design and it is not a strict upgrade. Compared to the 6800, we’ve lost one of our accumulators, and while we’ve gained a new index register Y, both X and Y are only 8 bits wide; these cannot be sensibly used as base registers. This obliges us to lean more heavily on the indexed modes, and since no register is large enough to hold a pointer, any kind of memory indirection has to be vectored through the zero page. Even the stack pointer is only 8-bit; the stack is hardcoded to the range $0100$01FF.

This is essentially the opposite of how the Z80 treated its predecessor, but it does make sense. The 6800 struggled; it was very difficult to manufacture and it was far too expensive. The 6502’s design emerges by identifying a set of core features, leaning very hard into them, and then using them as an excuse to remove or simplify other aspects of the design.

  • As long as we have every instruction able to use memory arguments, we don’t really need two accumulators.
  • The accumulator is 8-bit, so it’s kind of silly for the other registers to be 16-bit.
  • We’ve leaned heavily into operands being in memory; let’s put the pointers there too.
  • With the pointers in memory, we can bring the index registers back as indices, since they aren’t being used as base pointers anymore.
  • We can strip down the instruction set; no need for an “add ignoring carry” instruction when we have an “add with carry” instruction and a “clear carry” instruction.

The end result is a system where there are a lot of moving parts but where the most obvious way to express a computation within it is usually a practical one. There really isn’t an equivalent on the 6502 to the way that getting up to speed on the Z80 includes learning that OR A is how you clear the carry bit or that XOR A is better at setting A to zero than LD A,0 is. The subtleties tend to be quirks like how carry interacts with subtraction, or bugs like how Jump Indirect can’t have its address cross a page boundary.

Revisiting My Earlier Advice

The vast majority of my 8-bit work has been on either the 6502 or Z80, and as we’ve seen here, they’re from very different traditions. I do not remember where I first saw the claim, but back when I was learning either the 68000 or the Z80 one of the books I was studying divided the assembly language programming community of the time into the “sixers,” who preferred the 65xx and 68xx chips, and “eighters,” who preferred the 8080 and Z80. There’s a pretty obvious source for this split, looking at the different lineages and design decisions here.

I started out this blog very firmly as a sixer. The 6502 was the first assembly language I’d ever learned; I didn’t pick up 8086 until much later and, broadly speaking, under duress. However, a lot of the systems I’ve come to care about over the course of this blog are Z80-based, and this has obliged me to get myself up to speed with those. It’s really only in the last year or so that I think I’ve reached parity; at this point I’m confident in my ability to “switch-hit” between the sixes and the eights as needed and produce decent-to-good code in either.

My stock set of maxims for programming the two chips, however, comes from several years before I think I got there. Is there anything I’d correct?

A solid chunk of the advice, in retrospect, turns out to just be characterizing the primary differences between the sixer and eighter programming model. I think all of these leap immediately out of our design comparison above.

  • 6502: When evaluating complex expressions, decompose the operations into two-address code, with the destination as the accumulator if possible or a memory location if not. Fixed memory locations can be used freely as operands. This was a design consequence that was clearly inherited from the 6800.
  • Z80: Computation should keep all of its data either in 8-bit registers or accessible via pointer dereference through a 16-bit register. This, again, falls directly out of what the instruction set lets you directly ask the CPU to do. We see this style not merely in the Z80 but in the 8080 as well.
  • 6502: 6502 code does not want to work with pointers. It wants to work with arrays, and if the process involves accessing multiple arrays at once, it wants to work with corresponding elements in each of those arrays. This was completely unique to the 6502, as it turns out; the 6809 is much more comfortable working with pointers thanks to its 16-bit index registers, and the 6800’s X register was effectively an HL register that didn’t break apart.
  • Z80: Memory access is overwhelmingly through unindexed pointer dereference. Looping through arrays will rely on mutating the pointers themselves. We see this on the 8080 as well.
  • 6502: The stack on the 6502 is extremely weak, and is best for caching temporary values at the beginning and end of computations to make scratch space available. This is also pretty 6502-specific. The 6809 can actually use stack frames and local variables at a level we’d consider acceptable today.
  • Z80: Rely on the stack when you can. It is ideally suited for stashing temporaries of all kinds. The Z80, on the other hand, much like the 8080, cannot match the 6809 here and is pretty much restricted to stashing temporaries. I think my experience since then has underlined that the EX (SP),HL instruction (equivalent to the 8080’s XTHL) does a very significant amount of work in making the Z80 stack more usable mid-function than the 6502. We don’t see proper stack frames amongst the eights until the 8086, and even that requires jumping through a few hoops that the 386 ultimately made unnecessary.

Beyond that were some extremely specific pieces of advice for each chip. For the 6502 I had these:

  • Inner loop variables in 6502 code should generally be held in index registers, with memory taking a distant second place and the accumulator being most unsuitable. This still holds up pretty easily; it’s a direct consequence of the instruction set and the amount of time each instruction takes.
  • When working through a set of arrays, the innermost loop variable should be held in the Y register. This one, however, I would add some caveats to. The idea here is that the indexed-indirect mode requires the Y register to be its index, and you’ll be much more efficient if that’s also your loop variable. That’s great when you can get it, but you can’t always, and when you can’t, the earlier advice becomes more important. In those cases, use X as the innermost loop variable and leave Y to control the indexing.

For the Z80, I offered these:

  • The Z80 provides a grab-bag of specialized capabilities, so it’s often worth reviewing the instruction set to see if it’s got something for you to use. This is less advice, I think, and more a warning that more work has to be done to really get the whole design properly in your head. There’s just a bunch more opcodes and a larger number of total moving parts. To the extent that the design is more chaotic (for example, JP allowing branches on more conditions than JR), it’s pretty clearly due to the need to fit the new capabilities around the existing ones of the 8080.
  • When assigning registers to tasks, prefer HL for standalone or source pointers, DE for destination pointers, and BC for count values. When using 8-bit loop counters, one of them should use B so that it may take advantage of the DJNZ instruction. This is solid advice as far as it goes, but it boils down to “your own ABIs should play nice with DJNZ and LDIR.”

I think I would add one new guideline here: Z80 code is often faster when it’s also 8080 code. The fastest instructions are single-byte instructions that only touch registers, and the only Z80 instructions that do this that aren’t also in the 8080’s instruction set are the shadow-register instructions EX AF,AF' and EXX. In particular, I find I need to carefully justify any use of the Z80-specific bitwise logic instructions; 8080-friendly rephrasings are very often smaller and faster once we look at the final sizes and cycle counts.

Next week: we put theory to practice and render the same algorithm on four chips at once.

http://bumbershootsoft.wordpress.com/?p=5805
Extensions
Staying a Spell with the Exidy Sorcerer
retrocoding
The idea of what it means to be a personal computer has evolved over the years. The usual story is that Back In The Day you turned the computer on and were immediately in BASIC, while nowadays you boot into some kind of graphical operating system and choose what application to run from there. There’s […]
Show full content

The idea of what it means to be a personal computer has evolved over the years. The usual story is that Back In The Day you turned the computer on and were immediately in BASIC, while nowadays you boot into some kind of graphical operating system and choose what application to run from there.

There’s a few problems with this. “Nowadays” starts clear back in 1984 with the Apple Macintosh, while “back in the day” was only a handful of computers like the Commodore and Sinclair machines. Apple’s 8-bits expected a boot disk and needed a special command code to enter BASIC, while the Atari and PCjr needed cartridges installed. PCs proper, along with the TRS-80, would boot into a command-line DOS that you could then enter BASIC from in a single command. If we allow those two, then “back in the day” extends through 1995, as the pre-Windows-95 DOS machines of that era were still a single command away from entering the (now much fancier, IDE-adjacent) QuickBASIC environment. Finally, there’s a clear era before this where computers are primarily electronics projects built from kits. That era has a fuzzy border with our C64-and-Apple-II era, too; the earliest Sinclair machines are contemporaneous with the VIC-20 and were sold as kit computers as well.

The kit computer era is interesting because it has a clear, iconic standout in the early 70s, much the same way that the Apple/PET/TRS trio stood out in the late 1970s: the Altair 8800. The two key features of the Altair were an 8080 processor and a very useful and expandable peripheral system. The 8080, thanks in part to the Altair’s popularity, got a large library of software, and the later systems based on the Zilog Z80 (which was fully backwards-compatible at the binary level) could make full use of all of it. Meanwhile, the Altair’s peripheral system became a de facto industry standard for expansion boards as the S-100 bus. A decade later, the IBM PC AT would inspire a new Industry Standard Architecture that they would literally name that.

So let us shift focus, then, to the year 1978, a year that’s in the middle of all of this. The home computer revolution has kicked off in earnest, with the PET, Apple II, and TRS-80 all already well-established. S-100 is still an industry standard, and kit computers, while niche, are still commonplace. We even have a vision of the future, as the Xerox Alto, the machine that inspired the Macintosh, also exists and has for some time. Into this space, the arcade game company Exidy released a home computer of its own: the Exidy Sorcerer. As an arcade company, we might expect the Sorcerer to be a video game powerhouse, prefiguring the Atari 400 and 800 that their competitor would release the next year. What we actually get, though, is something quite different:

The Sorcerer is an S-100-based computer with a Z80 CPU, fully pre-assembled, and with a cartridge system that let you boot into BASIC and a programmable text system similar to the TI-99/4’s graphics mode. Unlike those systems, though, it had a remarkable 64×30 monochrome display, so even its text-mode graphics could look quite detailed, as we can see in the game Galaxians:

This machine is the first one I’ve ever seen where the default boot is into a machine code monitor. This lets us create and enter simple programs like Hello World directly:

The monitor’s ENter and GO commands. The forward slash ends data entry.

We can also save our program to tape from here; the JSorcerer emulator I am using here will let us save cassette data both to .WAV files or to a more sensible, compact .TAPE format that represents the encoded data directly. To do so in either case, I attach a cassette to the first cassette input, give it a nonexistent file name so it creates a new cassette image, enter the command SET X=0100 to tell the system where program execution should start, and finally enter the command SAVE HELLO 0100 011B to save out that memory block. After a reboot and remount, the command LOAD will load the file off the tape and prompt you with the proper address to feed to GO:

We also can combine those two commands into a single command LOG.

Making Tape Files Directly

We’d really prefer to not have to go through the monitor for this, though. It would be far nicer if we could just generate the tape files directly from our assembler.

Unfortunately, the tape format is a bit too finicky to implement directly in the assembler. There’s a sort of out-of-band memory image format that MAME calls “QuikLoad” which is simple enough to do that; I document it as part of my sample code in my Platform Guide. Creating a tape file really requires a post-processing step, which is, fortunately, easy enough to do. I’ve created a simple utility program to convert simple binaries into tape images, analogous to the similar utility I wrote for the ZX Spectrum.

The tape format itself, as understood by the system firmware, is a bit more complex. The audio data on the cassette is converted into a stream of bytes as part of the loading process; the .TAPE file format is simply that stream of bytes. The complexity comes from what it encodes. Here is the data format for a program on tape:

  • A tape lead: 100 bytes of $00 bytes, followed by a single $01 byte.
  • Five bytes naming the program. If the program name is fewer than five bytes long, the extra letters in the filename are spaces.
  • The bytes $55 $00, which give the header and file identifiers, respectively. The zero file ID represents an executable binary program.
  • The length of the file, the loading address of the file, and the “go address” where execution starats, all in order as 16-bit little-endian integers.
  • Three zero bytes to pad out the header size to 16 bytes.
  • A checksum byte for the header. This checksum is computed by starting with zero, and then, for each byte we write, subtracting the current checksum from that byte and then inverting all the bits. This process is repeated for each byte output, and the final byte in the block is the checksum.
  • Another tape lead, just like what we opened with, to separate the file header from the main file data.
  • The file data, 256 bytes at a time, finishing with a checksum byte for this block of bytes. The last block will probably have fewer bytes in it; this isn’t indicated anywhere in the format. The loader knows to expect a short block in the end because it knows the full file size in advance.

The Exidy manuals are a little vague about the checksum process; the description makes it sound like there is a single checksum for the entire file, but the checksum is in fact reset to 0 every 256 bytes.

Emulation Woes

As noted in my platform guide, there seem to be two serious emulators for the Sorcerer: MAME and JSorcerer. Both of them are difficult enough to use that it’s really hard to pick a favorite, and honestly both of them had serious problems.

MAME seems to have the more accurate hardware emulation, but I found that something about its simulation of the tape conrols did not work. Any attempt to load a tape file resulted in a CRC checksum error, unless I opened up the tape menu in the UI and manually paused, rewound, and restarted the tape controls at the appropriate times. With that done everything mostly worked out.

JSorcerer, on the other hand, had a more extensive UI that does more things for you, but I nevertheless found it strangely unresponsive and inconsistent. Buttons would need to be held down or repeatedly pressed to get any effects, and sometimes the keyboard mapping would get wildly bent out of shape, mapping the SHIFT key to the S key or forgetting that the RETURN key exists at all.

Worst of all, there are some significant gaps in JSorcerer’s hardware emulation. There’s an I/O pin that gets read along with keyboard data (the $20 bit on port $FE) that is supposed to be 1 during VSYNC and 0 at all other times. There’s a utility routine at $E1A2 that relies on that VSYNC signal to read some system data out of VRAM without flickering the screen. It turns out that JSorcerer just leaves that pin reporting that we are in VSYNC at all times, which means that not only will waits for VSYNC not introduce any frame delay, waiting to leave VSYNC, say because one is hoping to count out frames, will hang the system forever. MAME does not suffer from this.

On the other hand, it doesn’t seem like action games relied on this behavior in the first place, so perhaps it’s not as vast an oversight as it seems to be at first.

Having Some Fun

Despite the obscurity of the Sorcerer, I’ve had it glance against my experiences a few times in the past. One of the books of type-in programs I had as a child was Celestial BASIC by Eric Burgess, which was mostly written for the Apple II but which had an alternate version of one of its programs that revealed that these programs were all originally written for the Sorcerer. Second, when I first started seriously looking at the C64 as an adult, one of my long-time online acquaintances observed that getting anything done on the system seemed to involve a lot more busywork than what he had been used to on his own childhood system. That system turned out to also be the Sorcerer.

Well. If only I had some kind of graphical program that I could use as a Rosetta Stone to put the Sorcerer through its paces in a way analogous to other systems.

If only.

After putting this together, I can say that there is in fact something to my acquaintance’s claim. My program for the Sorcerer came together in a single lazy evening and the graphics are acquitting themselves nicely given the system’s capabilities. Most of the actual work here was in creating versions of the graphics that looked right when rendered with double-height pixels, since my usual ports all relied on square or mostly-square pixels for their work.

And there is, in fact, much less to worry about here. My Commodore 64 versions did a spectacular number of backflips to get the display working the way I wanted it to, even before I started relying heavily on raster effects to transcend my normal color limits. Mixing the system font and custom character graphics required me to command the memory mapper to swap out the system I/O chips and replace them with the main character ROM, which is otherwise not visible to the CPU at all, and copy the letters I wanted to use into RAM. Then I needed to swap the I/O chips back into place and tell the VIC-II to fetch its character graphics from a nonstandard location. On top of all this, I needed to be careful about where I put that data in RAM or the character generator ROM would supercede us. The end result looks great, but there are quite a few ducks to get into a row over there.

In comparison, creating this display on the Sorcerer was simplicity itself. The first 128 characters have their character definitions in ROM, starting at $F800. The remaining 128, starting at $FC00, are in RAM, and while the first 64 of those have default definitions, all of them are available for our reuse. The logo and board graphics all fit neatly within that space, and I was also able to arrange the tiles so that the “lit” cells all had character codes exactly 64 larger than the “unlit” codes, even when the “unlit” codes were in ROM. This let me take the same sorts of shortcuts rendering the board as I got on the platforms with more flexible graphics; we just found ourselves adding or subtracting 64 rather than 128.
That simplicity came at a cost, though. The C64 has a lot more moving parts, but that ends up being because it has more things to offer. The character ROM is out of the CPU’s memory space by default to allow more RAM to be available. The system ROMs can also be swapped out to reveal usable RAM underneath them, after all. The redefinable character sets need more configuration to use because we may swap between multiple entirely different tilesets instantly, with a single register write. This is before we consider the multiple graphics modes or the ability to precisely time changes to them to allow for splitscreen effects. The complexity here wasn’t spite or carelessness; all of it ultimately grew out of the facts that the C64 offered more total memory than the CPU could address all at once, and that the VIC-II expected to need to change its whole graphics configuration in single-digit microseconds.

One final trick I managed wasn’t really Sorcerer-specific; I was able to shrink the program quite a bit by LZ4-compressing the graphics data. The decompression logic turns out to be an exceptionally good fit for the Z80’s capabilities, which is pretty funny when we consider that LZ4 is a 21st-Century algorithm that cares a great deal about performance in general, but does not care even a little bit about performance on 8-bit systems from the 1970s.

Downloads

Links to documents, emulators, development systems, and so on are all available as part of my platform guide. The Sorcerer version of Lights-Out is available as a tape image as part of the overall Lights-Out collection.

http://bumbershootsoft.wordpress.com/?p=5781
Extensions
Testing Out the ARMIPS Assembler on the Xorshift PRNGs
retrocoding
As long as I’m looking at new-to-me assemblers, I ought to take a look at Kingcom’s armips. This is a patching assembler whose intended audience is ROM hackers on the GBA, Nintendo DS, N64, and Playstation 1, with an overall workflow very similar to Asar for SNES work. These platforms are all later than the […]
Show full content

As long as I’m looking at new-to-me assemblers, I ought to take a look at Kingcom’s armips. This is a patching assembler whose intended audience is ROM hackers on the GBA, Nintendo DS, N64, and Playstation 1, with an overall workflow very similar to Asar for SNES work. These platforms are all later than the period I’m generally interested in, but the GBA’s architecture is close enough to the fullsize consoles of the 16-bit era that I’ve put together a quick Platform Guide for it.

The GBA is also fun because while it uses a stock 32-bit ARMv4 CPU (the ARM7TDMI), its overall system layout is wildly heterogeneous, with something like five different memory busses of different speeds and even different bit widths. I feel like the ARMv4’s ability to switch at will between 32- and 16-bit instruction modes went to the heads of Nintendo’s hardware designers and they decided to lean into that capability as hard as they could. This also means that standard practice was to write most code using the 16-bit THUMB instruction set in the expectation that it would be run directly off the cartridge and its 16-bit bus, but write any complex or speed-critical routines in 32-bit ARM code and copy it to the 32KB of 32-bit “Internal Work RAM” for full-power, full-speed execution. (The Internal Work RAM is not to be confused with the 256KB of External Work RAM, which has a 16-bit bus and thus is slower to access a full word at a time.)

I’ve already ported a 64-bit PRNG to the ARM and that code would work just fine here; however, both it and THUMB got much more cavalier treatment for the simpler 16-bit xorshift. I mentioned that ARM was good at this work (which we then proved with the more sophisticated PRNG!) and noted that compilers produced good, compact code from the C original.

To get the hang of armips, I’m going to take that 16-bit function and make a good THUMB version of it suitable for the GBA. While I’m at it, I haven’t actually touched the MIPS architecture in decades. The R3000 in the PSX is, to a first approximation, the chip I actually learned modern computer architecture on; a radiation-hardened version of it also got launched to Pluto as part of the New Horizons mission. This is a good and honorable chip, and it deserves a good implementation of the 64-bit RNG to sit alongside my versions for the 386, ARMv4, and Motorola 68000.

I’ll look at ARM first, since we’re implementing a simpler function there.

Implementing for THUMB

32-bit ARM code is kind of neat. It offers a number of capabilities that are very unusual compared to the other architectures we’ve looked at—even modern ones.

  • Every single instruction is exactly 32 bits long.
  • Every single instruction can be conditionally executed on any condition bit.
  • Instructions that update the flags never have to, and have versions that do the work while leaving flags intact.
  • Every arithmetic instruction, including memory lookups that do some address calculation on the fly, can fold an arbitrary shift operation into the instruction along the way.

Beyond that, everything is mostly very standard three-address code within a load/store architecture. This does leave one rather obvious gap, though; when loading an immediate value into a register, we normally would be unable to load all 32 bits at once thanks to the fact that any given instruction will be the size of the constant we want to load. ARM mitigates this a bit by allowing immediate loads to include the shift operations, which in turn means that as long as our constant ends in a bunch of zero bits it might fit anyway. But when nothing else will do, the standard practice is to load the full values out of a literal pool outside of your functions but nearby them. These translate into “add a constant to the program counter, dereference that address, and load 32 bits from it into the target register”, but assemblers will simplify that for you with the syntax LDR r0, =value32.

That’s extremely handy for us when working with THUMB; THUMB instructions are all, with one arguable exception, only 16 bits long and the same trick works just as well there to load 32-bit registers. It only requires the literal pools to be a bit more nearby.

We face more restrictions than just more frequent literal pools, though. Several hardware generations later, in the ARMv6 range, ARM would provide a variable-width “THUMB2” instruction set that essentially matches the 32-bit ARM instructions exactly, just with better code density. Here in the ARMv4 generation, though, we are giving some things up in THUMB1:

  • We lose the free shifts. In their place are dedicated LSL, LSR, and ASR instructions.
  • A bunch of more complex instructions go missing. In particular we don’t have the lovely UMULL instruction that let us efficiently write that 64-bit PRNG.
  • Only the branch instruction may be conditional now, which ends up looking like a suite of instructions like BEQ, BNE, BLT, and so on.
  • All instructions that can update flags now always update flags.
  • Instructions with immediate-mode components usually have narrower ranges.
  • Most instructions can only access the first 8 registers instead of all 16. The overall ABI has not changed, however, so the high-numbered registers representing the stack pointer, link register, and program counter are often out of reach as a result. To help deal with that, a few instructions have special versions that hardcode these registers in various ways.

The end result, ironically, is an ISA that ends up looking much more traditional than ARM32 itself.

Calling Conventions

THUMB has the same ABI as traditional 32-bit ARM:

  • Registers r0-r3 are scratch registers for functions and also hold the first four words of any arguments or return values.
  • Registers r4-r12 are “local” registers that must be preserved across function calls.
  • Register r13 is the stack pointer and is often called SP as a result. ARM stacks are traditionally “full-descending”, but the general flexibility of ARM means they don’t have to be. THUMB, on the other hand, insists on that format.
  • Register r14 is the link register. The BL (“Branch and Link”) instruction places the return address of the procedure call here. If the function is to call other functions, this value must be preserved on the stack in order to return to the right place.
  • Register r15 is the program counter itself. One way of returning from a procedure call is via the instruction MOV PC,LR. This is not preferred, however, because…
  • The approved way of swapping between ARM and THUMB mode involves the interworking branch instruction BX. This takes a register argument, and if the least significant bit is a 1, it subtracts 1 from the address and jumps to that address in THUMB mode. Otherwise, BX does the jump while switching back into ARM mode. This in turn means that the preferred way to return from a function is the instruction BX LR. This works in both ARM and THUMB and will restore the correct processing mode on the way out.
Implementing Xorshift

Here is the C function we will be implementing here:

uint16_t rng(void) {
    static uint16_t x=1,y=1;
    uint16_t t=x^(x<<5); 
    x=y; 
    return y=(y^(y>>1))^(t^(t>>3));
}

This allows a pretty straightforward hand-translation, without even needing to save and restore any registers. There are really only three differences between my code and the C:

The first is that instead of keeping two independent memory variables for x and y, I instead have a single 32-bit memory location named rng_state and load 16-bit values out of it with built-in offsets. Even THUMB will let us load y as rng_state + 2 without needing extra code.

The second is that, thanks to the fact that we do all our work in registers, I can move the store-halfword instruction that x=y; translates to much earlier in the function.

The third is actually evading a subtle trap. The “obvious” hand-translation of t=x^(x<<5) will produce an incorrect answer when we compute t^(t>>3) later in the function. That is because all of these operations are 32-bit only, and when we shift x left 5, its top five bits will leak back in when we shift right again later. We need to coerce the result to an unsigned 16-bit value before we use it further. Starting in ARMv6 there is an “unsigned extend halfword” instruction variously spelled UEXT16 or UXTH; however, we do not have that instruction on ARMv4 even in ARM mode. To perform a signed or unsigned extend, we need to shift the value so far left that only the bits we want are left in the register, and then shift it back into place with an arithmetic (for a signed extend) or logical (for an unsigned extend) shift-right. For us here, that means shifting left 16, then right 16 logical. That leads us to the one place where we can improve on the “straightforward hand translation”—we can merge the shift-left-5 operation with with the 16-bit truncation by shifting x left 21 places, and then right 16.

This gives us our implementation, annotated with the computations as we make them:

rng:    ldr     r3, =rng_state        ldrh    r1, [r3, #2]            ; r1 = y        ldrh    r2, [r3, #0]            ; r2 = x        strh    r1, [r3, #0]            ; x = y (does not effect computation of r2)        lsl     r0, r2, #21             ; r0 = (uint16_t)(x << 5)        lsr     r0, #16        eor     r2, r0                  ; r2 = t = (uint16_t)(x ^ (x << 5))        lsr     r0, r2, #3              ; r2 = t ^ (t >> 3)        eor     r2, r0        lsr     r0, r1, #1              ; r0 = (y ^ (y >> 1)) ^ (t ^ (t >> 3))        eor     r0, r1        eor     r0, r2        strh    r0, [r3, #2]            ; y = r0, return y        bx      lr        .pool

This is 14 instructions, each two bytes long, a literal pool with a single four-byte entry, and then four bytes of actual RNG state. The final memory usage is 36 bytes.

Racing Against GCC

GCC’s THUMB compilation of the above C code under the -O2 option has a memory footprint of 44 bytes. Even when we allow THUMB2, it only drops it to 40. (It would have gotten to 38, but there are word-alignment restrictions so it pads it out with a NOP.) We’re doing surprisingly well here; we’re beating THUMB1 by 4 full instructions, and even when comparing against THUMB2 we still sneak in one word smaller. Our one trick is one that a compiler should be able to trivially match. What gives?

For the THUMB1 code, the issue is clearly that it first compiled it to 32-bit ARM code and then re-encoded that logic into THUMB1. As a result, there are a lot of extra shifts scattered about that it didn’t need, and as a result of that it couldn’t reuse registers as easily as we could. That led to needing to spill and restore registers which cost it quite a bit of code space. In the THUMB2 case, it seems to have taken advantage of the ability to do 32-bit loads to remove the need for a literal pool—and it turns out that this costs it an extra two bytes. It gets to take advantage of other instructions to make the rest of the code far more natural—instead of shifting a value an extra 32 total places it can just use that UXTH instruction as an incidental part of a register move—but by merging the shifts and copies as best we could with the THUMB1 code it turns out all of that came out in the wash.

The size discrepancy here is because we asked the compiler to optimize for speed instead. Despite being two bytes longer, the compiler’s code removes the need for a literal pool and that should allow the execution to remain entirely inside the instruction cache with perfect data locality. If we instead tell it to optimize for space with the -Os option, it brings back the literal pool and exactly matches our manual code results.

We have gotten really quite good at compiling ordinary code for ARM.

Implementing for the MIPS R3000

Like a lot of people, the first class I took that was taught in assembly language was based on Patterson and Hennesey’s textbook Computer Organization and Design: the Hardware/Software Interface, and it was taught using an early MIPS processor as the example CPU. My homework back then was done in a simulator called SPIM, which kept getting updates until 2023. It’s retired now, but it’s still perfectly functional and it’s still cheerfully simulating something approximating an R2000. This is fairly close to the R3000 that armips supports and that the original PlayStation used. As long as I’m here, I might as well port a xorshift PRNG to it too. As a proper 32-bit system, we should expect it to be able to efficiently implement the 64-bit xorshift-star variant like the 386, 68000, and ARM.

Before we dig into implementation, though, let’s do a quick overview of the MIPS I architecture implemented by the R2000 and R3000.

RISC Gonna Change Everything: MIPS I

The MIPS I architecture is actually from the mid-1980s, of a similar vintage to the 386. It was the flagship Reduced Instruction Set Computer (RISC) system. The idea was that with a very small number of very consistently formatted and executed instructions, you could get the cycles-per-instruction count very low and also be able to dramatically crank up the clock rate. This is what the hardware looks like to an assembly language programmer:

  • All instructions are 32 bits wide and there are only three possible instruction formats.
  • There are 32 registers, each 32 bits wide. Register 0 is always zero and writing it is a no-op. Register 31 is a link register that works roughly the way the link register works on ARM or the TMS9900. The rest of the registers are general purpose, though there are pretty strong conventions for what roles they hold.
  • There are some specialized registers beyond the core 32, but for normal logic the only relevant ones are LO and HI, which hold the 64-bit results of multiplications and the paired quotient/remainder results from divisions.
  • All instructions use the same pipeline stages in the same order. The only thing that ever actually stalls the pipeline is accessing the special LO and HI registers while an operation is still pending. Except for that, this results in a sustained execution rate of one instruction per cycle. The PlayStation’s 33MHz is already impressive compared to the Neo Geo’s 12MHz 68000 chip, but the speed difference gets even more pronounced when you recall that 68000 instructions can take dozens of cycles to execute. Even though the 68000’s instructions are doing work that can take six or more MIPS instructions to match, the MIPS CPU can work through the longer list far more quickly.

That final point, however, hides a slightly deceptive claim. It’s true that the pipeline effectively never stalls, but the reason that is true is that the CPU doesn’t wait for one instruction to finish executing before starting the next. This means that an instruction might want to make use of information computed by a previous instruction but which has not actually been computed yet. This is very common in modern systems—the general term for them is pipeline hazards. The defining feature of the MIPS I architecture, for the human programmer, is that preventing pipeline hazards is your problem, not the CPU’s. There are three main places this happens:

  • After starting a branch instruction, the system needs to fetch the next instruction before processing that branch instruction enough to know where that branch is. MIPS I—and, for that matter, all of its successors—solve this by simply executing the instruction after the branch unconditionally. That location is called the branch delay slot. Ideally you would put the last part of your loop’s main logic there since it doesn’t interact with your loop counter.
  • Similarly, a memory read will not actually deliver its result to the destination register until partway through the next instruction. That next instruction occupies the load delay slot and it must not touch the destination register either to read or write it. This constraint was onerous enough that every later iteration of the MIPS architecture removed them, replacing them with automatic pipeline stalls when necessary. Paying attention to data hazards here still would pay off, but now only execution speed is at stake instead of actual correctness.
  • After reading the LO or HI registers, there must be at least two instructions before a new multiplication or division instruction is issued. This is an extremely bizarre-looking restriction at first—it is described in the architecture manual as if MULT is reaching back in time to invalidate results—and information on it is hard to find. SPIM doesn’t simulate it, I do not remember it coming up in my old class, and I couldn’t even find a clear explanation of it in the architecture manual. All I found was the assertion. My best guess is that the multiplier circuitry uses LO and HI as scratch space, and it starts doing so two cycles before MFLO and similar instructions would actually execute the register transfer.

If no instructions in the program logic could be moved to fill a delay slot, the programmer had to insert a NOP instruction themselves, effectively stalling the pipeline by hand.

Pseudo-Instructions

Amusingly, MIPS does not actually have a dedicated NOP instruction. In fact, there are a lot of extremely basic operations that are not directly representable in the final binary. As a result, there are a stock set of pseudo-instructions, written as a single instruction in “normal” assembly code but which expand to one or more actually existing instructions. The classic NOP is SLL $0, $0, 0, which does nothing but also encodes to a zero word. Most branch instructions turn into a combination that relies on the SLT (Set if Less Than) instruction. Some assemblers will automatically generate no-op instructions to pre-fill the delay slots. Register 1 is reserved to assemblers to help assemble these higher-level instructions.

Important pseudo-instructions for us are LA and LI. These “Load Address” and “Load Immediate” instructions translate into one of several instruction sequences that result in being 32-bit constant loads.

Calling Conventions

MIPS assemblers refer to registers by number, ranging from $0 to $31, but they also provide standard names that reflect how they are used. $0 is $zero, $31 is $ra (“return address”), and the customary stack and frame pointers get symbolic names as well.

These mnemonics bake in a calling convention as well: $a0$a3 hold arguments passed in registers, $v0 and $v1 hold return values, $t0$t9 are scratch registers that functions may overwrite freely, and $s0$s7 are saved registers that a function must save and restore if used.

Our routine is pretty simple; we’ll fit everything in the scratch registers.

The Xorshift-Star PRNG

Here is the C implementation of our PRNG:

uint32_t xorshift64star(void)
{
    uint64_t x = rng_state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    rng_state = x;
    return (x * 0x2545F4914F6CDD1DLLU) >> 32;
}

Our implementation strategy for MIPS turns out to be a blend of ones we used previously. The instruction set does not include 64-bit shift operations the way the 386 does, so that code ends up strongly resembling the 68000 code. It does, however, include 32-bit multiplication with 64-bit results, so the final multiplication logic is similar to what we saw on 32-bit ARM.

Before we do any of that, though, we need to load our 8 bytes of state into registers.

rng:    la $v0,rngstate                 # Load state into t0-t1        lw $t0,0($v0)        lw $t1,4($v0)

Due to our load delay requirements, we are not permitted to use $t1 in our first computation instruction. That is fine; we weren’t going to anyway. The Xorshift operations themselves aren’t meaningfully improved from the 68000 code; our only real gains come from being able to shift any amount in one instruction, and in being able to set a destination register (the leftmost) that isn’t either of the main operands. What on the Motorola was this:

        moveq   #12,d3                  ; 68000, not MIPS!
        moveq   #20,d4
        move.l  d1,d2
        lsr.l   d3,d2
        eor.l   d2,d1
        move.l  d0,d2
        lsl.l   d4,d2
        eor.l   d2,d1
        move.l  d0,d2
        lsr.l   d3,d2
        eor.l   d2,d0

Becomes this:

        srl $t2,$t0,12                  # state ^= state >> 12
        xor $t0,$t0,$t2
        sll $t2,$t1,20
        xor $t0,$t0,$t2
        srl $t2,$t1,12
        xor $t1,$t1,$t2

There is also an endianness-switch here, so $t0 is the low word on MIPS while d1 was on the Motorola.

The remaining xorshifts translate similarly, and we then also write back the results to rngstate when we are done.

        sll $t2,$t1,25                  # state ^= state << 25
        xor $t1,$t1,$t2
        srl $t2,$t0,7
        xor $t1,$t1,$t2
        sll $t2,$t0,25
        xor $t0,$t0,$t2

        srl $t2,$t0,27                  # state ^= state >> 27
        xor $t0,$t0,$t2
        sll $t2,$t1,5
        xor $t0,$t0,$t2
        srl $t2,$t1,27
        xor $t1,$t1,$t2

        sw $t0,0($v0)                    # write state back
        sw $t1,4($v0)

We used $v0 as our pointer register here because we’ll be overwriting it anyway with our result and we don’t need to touch it until after the writeback.

Now it’s time to do the 64-bit multiplication. With the low word of our RNG state in $t0 and the high word in $t1, the final result of our function (to be put in v0) is the sum of three values:

  • The high word of $t0 times 0x4F7CDD1D.
  • The low word of $t1 times 0x4F7CDD1D.
  • The low word of $t0 times 0x2545F491.

The derivation of that result is in the article linked at the start of this section. While we’re at it, since we also have a secondary return value available in $v1, we might as well put the low word of $t0 times 0x4F7CDD1D there. This makes the complete 64-bit result available unobtrusively and enables stunts like running the PRNG backwards.

The first two multiplications are simple enough; we can use the two delay slots we need after moving the data out of HI and LO to prepare the final constant we’ll need.

        li $t2,0x4F6CDD1D
        multu $t0,$t2
        mfhi $v0
        mflo $v1
        li $t3,0x2545F491
        multu $t1,$t2
        mflo $t4
        addu $v0,$v0,$t4

This looks suspicious: there’s only one instruction between MFLO $v1 and MULTU! However, LI here expands into two instructions: a “Load Upper Immediate” instruction LUI $t1,0x2545 and an “OR Immediate” instruction ORI $t3,$t1,0xF491. If we wanted to be really explicit about this, we could spell it out ourselves. That might be worthwhile given that LI is not actually guaranteed to be two instructions if the constant loaded fits in one.

There’s no getting around the fact that ADDU is only one instruction though; we’ll need some padding before we do the final multiplication.

        nop
        multu $t0,$t3
        mflo $t4

The last thing we need to do is add $t4 to $v0 here to get our final result, and that will in fact be the very last thing we do. We actually return from the function first by doing a Jump Register to the return address, and put the final add in the branch delay slot:

        jr $ra
        addu $v0,$v0,$t4

This works out to 156 bytes, comparable to the 68000’s 162 bytes and over half again the size of the ARM or 386 versions. However, this version also is vastly faster, requiring only 70 cycles or so meaning the full routine runs in about 2 microseconds on a 33MHz R3000. The Motorola version ran into the hundreds of microseconds.

RISC in its purest form didn’t really pan out, but the hype in the 1990s was real and it wasn’t at all misplaced.

http://bumbershootsoft.wordpress.com/?p=5717
Extensions
Mega Drive development with ASMotor
retrocodingreviewssega
ASMotor is a multiplatform cross-assembler written by the author of RGBDS, a system I liked quite a lot and which I used for my own Game Boy projects. The rename sets it apart as a very different product; while ASMotor owes an obvious debt to its predecessor, it feels no need to match it closely […]
Show full content

ASMotor is a multiplatform cross-assembler written by the author of RGBDS, a system I liked quite a lot and which I used for my own Game Boy projects. The rename sets it apart as a very different product; while ASMotor owes an obvious debt to its predecessor, it feels no need to match it closely enough to remain compatible.

Beyond being later work from an author whose work I appreciated, ASMotor drew my attention because it was designed to be readily able to create binaries that mix code for multiple CPUs. This is especially important for the SNES and Genesis, where the audio coprocessor needs a program of its own to manage sound and music. My previous projects did this by relying on different toolchains for each CPU, and then embedding that code as an opaque binary blob at the top level. I shifted over to ASMX for my Genesis work because it permitted me to simply build code for both CPUs at once as part of the same overall process.

In the end, it turns out that ASMX is still a somewhat better fit for me than ASMotor, but ASMotor worked for me better than WLA-DX did.

Building a Mega Drive ROM

Like RGBDS, ASMotor has distinct assembly and link steps. Creating a Mega Drive ROM looks much like creating a program for any of its supported platforms: each source file is independently assembled into an object file, and all the object files are then linked into a final binary. What makes ASMotor special, compared to my traditional mixed-CPU workflows, is that the linker is itself multi-CPU and the object files for each assembler are thus cross-compatible. Since each CPU has its own assembler, though, we do remain obliged to keep each source file restricted to a single CPU.

Let us imagine that we have a small Mega Drive program that includes a main program main.asm that is written for the Motorola 68000 and a sound driver snddrv.asm written for the Z80 coprocessor. Converting this into a final binary is a matter of three commands:

  • motor68k -omain.obj main.asm to assemble the main 68000 code,
  • motorz80 -osnddrv.obj snddrv.asm to assemble the sound driver, and
  • xlink -csmd -fsmd -oprogram.bin main.obj snddrv.obj to produce the final program.

The xlink program is format-aware, and the -csmd -fsmd options declare the memory configuration and output format to be the Sega Mega Drive. This will automatically handle the details my smdfix program managed like properly declaring ROM size and checksum.

This looks wonderful, but nothing is free. The toolchain-specific challenges for ASMotor lie within the source files. They’re nothing we haven’t seen before, but we need to know ASMotor’s answers to use it effectively.

  • Parts of the final ROM need to exist at specific points. ASMotor follows RGBDS here in dividing the source code into individual sections, and in allowing sections to force themselves to be loaded into specific addresses. The ROM metadata, and the 68000’s interrupt table, may be placed in a single section with its location fixed at $000000. This broadly matches how we handled Game Boy metadata with RGBDS.
  • The Z80 coprocessor cannot run entirely out of ROM, and the RAM that it shares with the 68000 is not addressed the same way between the two chips. This is a special case of offset assembly. ASMotor’s ORG directive changes the virtual program counter used to compute labels and addresses, while not changing where in the final program the bytes of those instructions actually live. This is often used for situations where a program needs to be relocated into a different part of memory to actually run; since ASMotor is so agnostic about exactly what CPU a block of code is for, no additional work is required.

This may still seem a bit odd, so here is a very simple code sample to demonstrate.

        EXPORT driver, driver_size

        SECTION "SoundDriver",CODE
driver:
        ORG     $0038
driver_main:
        ;; Code goes here...
driver_end:
driver_size     EQU     driver_end-driver_main

The driver label appears before the ORG, and it will hold the address where this code will actually be in the ROM. The driver_main symbol, on the other hand, is after it and so will take the value matching the load address (here, $38, the IRQ handler). We do not have a way to get back to where we were, directly, but we may readily calculate the size of the block by subtracting driver_main from driver_end. Adding that value to driver will give us a pointer just past the end of the code block, if we need it. It turns out we usually won’t.

Pushing the Boundaries

All this makes for a breezy explanation, but if you’re used to other toolchains, it really isn’t very reassuring. We are obliged, after all, put code for each CPU in its own file, and we even assemble those files with dedicated and independent assemblers (motor68k and motorz80). Our driver label in the example above is a label for use by the 68000, not the Z80. Z80 labels are only 16 bits wide; a Z80 assembler would not normally be expected to be able to handle the 24-bit addresses the 68000 works with. We also might worry about section placement; if we insist on a particular location for our Z80 code, we might well wish to insist on a location that is at an address beyond 64KB.

It turns out that our worries here are unfounded; ASMotor simply absorbs all of this without complaint. Its technique involves deferring most label work to the linker, and the linker is guided by the output format, not by the CPU targeted by any particular object. That means that not only may labels be defined and exported that will not fit in a Z80 address, and not only may section annotations refer to the full 68000 address space, but even within the Z80 code 24-bit addresses may be imported and computed with as long as you do the necessary bit shifts and masks to extract components that fit in your registers. This, for instance, works exactly as one might hope:

        EXPORT z80drv, z80drv_size

        SECTION "Z80Driver",CODE[$14000]
z80drv:
        ORG $0000
.start:
        ld      hl,z80drv & $ffff
        ld      e,z80drv >> 16
        ld      d,z80drv_size
.lp:    jp      .lp
.end:
z80drv_size EQU z80drv.end-z80drv.start

This way of working goes against a great many of my well-rooted instincts, but in my experiments with it, it really did just work transparently in the end. Checking the final binary shows the code appearing 80KB into the ROM at offset $014000, HL being loaded with $4000, and E being loaded with $01.

Comparing to ASMX and WLA-DX

The other mixed-CPU systems I’ve considered in the past were ASMX and WLA-DX. These each are quite different from one another and from ASMotor, but all three do support the necessary CPUs for Mega Drive development.

ASMX, uniquely among the three, has no linker. Code is organized into multiple files and put together via INCLUDE or INCBIN. This design decision dictates a need for several other capabilities: it must be possible to swap what CPU you are assembling for on the fly, and when producing a block of code via offset assembly, it must be possible to “pick up where you left off” when you are done. ASMX’s CPU directive permits the former, and the latter is provided via the RORG and REND (relocation origin/end) directives. I’ve encountered these before in other Z80 assemblers under the name PHASE and DEPHASE. ASMotor is interesting here because while its ORG is roughly the equivalent of PHASE, it has no DEPHASE option. The developer is expected to put displaced code in its own section and “return to normal” by ending that section. The linker will handle the rest.

As it happens, I did not use ASMX in quite the way I have described here; I was adapting a build process from a more primitive toolchain and would assemble the Z80 code into its final binary form in advance, then have the 68000 code include it as a binary blob. It turns out that this more traditional process is how WLA-DX must be used. While it is a very powerful and generic assembler, and while it does include its own linker, the linker requires all objects to be for the same CPU. The “chain of builds that include each other as binaries” system becomes mandatory, and while it offers a broad array of mechanisms for configuring the memory layouts of your target platforms, such mechanisms are mandatory in every program as well.

All three of these systems also include at least one hiccup in their syntax that annoys me, as well. By default, ASMX does not allow $ to mark hexadecimal constants in Z80 code (and only Z80 code), while neither ASMotor nor WLA-DX permit you to specify short branches with a .s suffix, insisting instead on the data-size suffix .b.

When choosing between all three of these for what to do my Genesis work in, my main criterion was my own comfort. I am, broadly speaking, much happier with simpler and more freeform assembly language systems that do not require linkers and instead defer such decisions to the assembly language level. This is not always possible—it’s challenging on the SNES and it’s basically impossible on DOS or the Apple IIgs—but the Sega systems allow it. This preference, plus inertia, led me to stick with ASMX for my own work, but I did find ASMotor’s linker to be extremely friendly and powerful compared to WLA-DX’s and would definitely recommend people just starting out to take a look at it.

WLA-DX, on the other hand, is notable because if you are interested in the Super Nintendo as well as the Mega Drive, it’s the only toolkit of the three that offers SPC700 support for the SNES’s audio coprocessor, and it even lets you directly write for some of its most common expansion CPUs. If you want to standardize on something for a collection of projects, it probably becomes the better choice there.

http://bumbershootsoft.wordpress.com/?p=5664
Some Subtleties When Parsing 6502 Assembly Language
nonretrocoding
I recently made a new release of my Ophis assembler. I ended that announcement with a note that fixing some internal logic issues was more exciting than I’d bargained for, but I didn’t go into details. It is now time to go into details. Adding support for parentheses in math expressions turns out to have […]
Show full content

I recently made a new release of my Ophis assembler. I ended that announcement with a note that fixing some internal logic issues was more exciting than I’d bargained for, but I didn’t go into details.

It is now time to go into details. Adding support for parentheses in math expressions turns out to have modestly disastrous consequences for my parser design. The problem is simple to state: the instruction JMP ($A000)+2 is an instruction that is using the Absolute addressing mode, but the instruction JMP ($A000+2) is an instruction that is using the Indirect addressing mode. Getting the parser to handle this properly requires some care—and after discovering this issue I swiftly discovered that many other widely-used assemblers get this wrong. So in today’s article, I’m going to leave retrocomputing behind, dust off my old compiler technology expertise, and precisely characterize what went wrong here and how we can fix it. I’ll be explaining it in terms of hand-coded parsers like the one I made for Ophis, but I will also be taking a more theoretical approach using the GNU Bison parser generator.

Like many of the articles where I “leave retrocomputing behind,” the technologies that I ultimately end up discussing will long predate even the Atari 2600.

The Parts of a Compiler Front-End

Computer languages are specified by formal grammars. These grammars are usually expressed in terms of productions. These are syntax rules that we write down in a format much like the ones we use to document command-line utilities. A grammar for a subset of 6502 instructions might look like this:

  • exprID | NUMBER | ( expr ) | expr + expr | expr * expr
  • insnOPCODE | OPCODE expr | OPCODE ( expr )

The tokens like ID and NUMBER here are defined in advance. This can be part of the grammar, but it could just as easily be an outside description like “a NUMBER is a series of digits.” Inside an actual compiler, a preprocessing phase called lexical analysis distills an input text into a series of basic tokens for the parser to then consume. This is also where we often do things like strip out comments. A program or module that carries out lexical analysis is called a lexer.

The lexer’s token stream is then fed to a parser, which transforms that stream into some more complex data structure suitable for further processing. For a compiler or assembler, this is most likely an abstract syntax tree or AST. The formal theory behind doing all of this was well-enough established by the late 1970s that parsers could be generated from grammar specifications not unlike my example above. One of those, YACC (Yet Another Compiler-Compiler), was eventually the template for GNU Bison. Both have seen heavy use in actual practice over the decades, and are still in use today alongside more modern systems like ANTLR or Parsec, or alongside hand-coded parsers like the one we see in Ophis.

Classes of Parsers: LL vs LR

Interestingly, there is a dramatic difference between the actual computation performed by a parser created by a parser generator and a hand-coded one, and they correspond to mathematical properties of the grammar specifications that they may directly accept.

Hand-coded parsers tend to be implemented via recursive descent: conceptually, each symbol type in the grammar becomes a function in its own right, and that function decides which production to use, consumes the appropriate tokens for that production, and then returns the new node for the AST. When other compound symbols are involved in a production, those symbols’ functions are called as well. For things like arithmetic expressions, this usually ends up being recursive. What makes this approach mathematically distinctive is that it is a predictive parser: in order to know what functions to call, we need to know in advance what the next piece of input is going to be. To help determine this, we are allowed to look ahead in the token stream some fixed amount. A parser that looks ahead k tokens is said to be an LL(k) parser, and the language the parser processes is called an LL(k) language. The advantage of LL-type languages is that they are much easier to implement and reason about than broader classes. The main disadvantage is that most naturally-expressed grammars, including the one we gave at the start of this article, are not LL(k) grammars even though they specify an LL(1) language. There are ways to systematically transform such grammars into LL-compliant grammars, but the process is tedious and the resulting grammars are often ugly. A hand-coded parser can often finesse the worst of that, but not always.

Machine-generated parsers, on the other hand, usually maintain a stack of the tokens consumed, and follow a sequence of either pushing the next token in the stream onto that stack (shifting) or deploying a production to replace the last few tokens on the stack with a new token that is the production’s result (reducing). There is still a lookahead mechanic, like in the predictive parsers, but here we are not trying to figure out what kind of symbol we are at the beginning of, but rather what kind of symbol we are at the end of. Parser generators of this type typically permit only a single token of lookahead: the parsers and languages that work with these grammars are called LR(1). The set of LR(1) languages is a strict superset of the set of LL(1) languages.

(A historical note: shift-reduce parsers are generally table-driven, and the earliest parser generators of this type, including YACC, took some shortcuts in their table generation to save space. That results in a new language class called LALR(1); it is a subset of LR(1) and incommensurate with LL(1). LALR does not really have a concise description; even compiler textbooks characterize it by first defining LR parse tables and then documenting the corners you cut. It is, in short, a weird and no-longer-necessary hack, and Bison will generate full LR tables if it turns out you hit a sharp corner in LALR.)

Ambiguous Grammars

Let us return to the grammar we defined above. It’s defined in a fairly straightforward way, but it is also clearly incomplete. If you translate it into YACC syntax and feed it to Bison, it will give you an output parser but will also warn you that the result is surely incorrect. There are two basic issues here, and before we look into what exactly went wrong inside the parser generator, we can at least characterize the issue in human terms:

  • Our arithmetic operations are defined on pairs of elements. When parsing an expression like 1+2*3+4, it needs to parenthesize the expression so that all operations are paired. While the answer we want is ((1+(2*3))+4), we have not actually specified that in any way.
  • The grammar has the ambiguity between Indirect and Absolute instructions that kicked off this whole article: there are two entirely different ways of interpreting OPCODE (LABEL).

The first kind of ambiguity is both more common and less harmful. This manifests in the parser as a shift-reduce conflict: the productions of the grammar have a point where it would be potentially legitimate to read new tokens on to the stack, or to reduce some of what we have already. While it is possible to address these by rewriting the grammar, parser generators do not usually require this. Instead, you provide rules for resolving the conflicts outside of the grammar. In practice, this works because what you are doing is specifying the precedence and associativity of your arithmetic operators, and if you’re going to do that, it’s clearer to do it in one place. (Indeed, that sort of specification in a stack-based parser predates systems like YACC—it goes back at least to 1961 with Dijkstra’s shunting yard algorithm.)

The second kind is nastier: in these the parser knows that it needs to reduce, but it has multiple productions that are applicable at that point. These must be addressed by rewriting the grammar, and that has historically been a tricky affair. One reason that I have been namechecking GNU Bison specifically, though, instead of just talking about YACC and its many inheritors, is that Bison has a “counterexamples” mode that will explain exactly what input is ambigous and why:

Transcript of a bison run with Unicode-art diagrams

Our case is pretty simple: after shifting and reducing so that the stack holds OPCODE ( expr ), there are two paths forward:

  • Reduce the entire stack to insn via the rule insnOPCODE ( expr ).
  • First reduce ( expr ) to expr via the rule expr( expr ), then reduce the stack to insn via the rule insnOPCODE expr.

This cannot be acceptably solved with mere annotations. Even at the human scale, an annotation-based solution is awkward at best. (“Parentheses group arithmetic expressions, unless they group the entire expression, in which case they mean pointer indirection.” It’s true as far as it goes, but it’s ugly and warty even at the human level.) We’ll have to revise the grammar to make it work, and how we do that revision will depend on what kind of parser we are.

Fixing the Ambiguity with an LR(1) Parser

We’ll start by clearing up operator precedence and associativity. We define that up at the top of the grammar, in a sort of “preface” section that also defines any high-level tokens we use:

%token ID%token NUMBER%left '+'%left '*'

This wipes out all our shift-reduce conflicts. To handle the reduce-reduce conflict, we take inspiration from our informal human-level explanation we gave before. We will specifically distinguish expressions that are not parenthesized from the other kinds with a new kind of symbol, npexpr. We will then specify that the Absolute addressing mode only uses those:

  • npexprID | NUMBER | expr + expr | expr * expr
  • exprnpexpr | ( expr )
  • insnOP | OP npexpr | OP ( expr )

That’s all it takes. It’s maybe a little subtle, but it’s not difficult. It’s not even that ugly.

Doing It the Hard Way

If we want it to be ugly we can try to solve the shift-reduce conflicts by adjusting the grammar as well. Precedence levels are managed by chaining together individual symbols for each level, and associativity is expressed by restricting which part of the production is actually recursive. For expr, that gets us this:

  • atomID | NUMBER | ( expr )
  • termatom | term * atom
  • exprterm | expr + term

That’s still not too bad; in fact, it’s very close to the kinds of things that recursive-descent parsers do. If we want things to get bad we mix this with our fix to the reduce-reduce conflict, and need to thread the non-parenthesized state through the whole chain:

  • npatomID | NUMBER
  • patom( expr )
  • multitermterm * npatom | term * patom
  • nptermnpatom | multiterm
  • ptermpatom
  • termnpterm | pterm
  • multiexprexpr + npterm | expr + pterm
  • npexprnpterm | multiexpr
  • pexprpterm
  • exprnpexpr | pexpr

Yuck. Well, at least it works. Any attempts I made at collapsing away other intermediate terms made the whole grammar collapse into half a dozen or more reduce-reduce conflicts as everything potentially becomes everything else.

Fixing the Ambiguity in Ophis

Ophis uses a hand-written recursive-descent parser that is LL(1) if you squint at it. It uses a chain-of-symbols approach for arithmetic precedence (“bits,” “arith,” “term,” and “atom”) and collects all terms within a precedence level within a list instead of chaining pairs together according to associativity rules. Adding parentheses and curly braces for grouping just meant extending the atom() function to check for those the way it already checked for square brackets. This alone almost worked out of the box; it breaks only when presented with a statement like JMP ($A000)+2, and it breaks with the error “expected comma or end-of-line.”

The commas have to do with indexing. In short, Ophis needs to distinguish OP (expr), OP (expr,X), and OP (expr),Y, and that logic was already there. This meant that, since it is a predictive parser, it would commit to the instruction being some kind of Indirect mode if it saw that the first token after an OP was an open parenthesis. After reading that open paren and parsing a complete expression, it expected to see either a comma (for the Indirect-X mode) or a close paren (for the other two). It finds the close paren, assumes all is well, and then tries to distinguish Indirect from Indirect-Y and finds a plus sign instead. At this point, it is stuck and gives up with a syntax error.

To work through the failure, though, tells us what we need to know to fix it: we need to not commit to indirection until we know for sure that it’s what we face (thanks to discovering use of comma syntax or due to the whole expression being parenthesized). At the point where we expect a comma or end-of-line, we also accept any arithmetic operators, and if we find those, then we need to reinterpret the expression we just finished reading as the first term of an arithmetic expression and proceed from there.

That’s a little slipperier than I’m making it sound, but not much. Each precedence level’s parser would start by reading a term of the next lower precedence level, then read one of its operands, then keep doing this until it hit a lower-precedence operation. At that point it would collect all its terms into a single sequence. I needed to provide an optional first argument that had been already-read, and making that work with the precedence parsing meant that I basically had to dig my way back out of the precedence pit; we’d realized that we’d just finished reading an atom, so now we need to read the term-tail, take that result and use it to seed reading the arith-tail, and then take that result and use it to seed the bits-tail. A lot of words for what is ultimately not a lot of code.

http://bumbershootsoft.wordpress.com/?p=5685
Reversing and Replicating Compute!’s MLX
commodoreretrocoding
A quicker article this week, inspired both by my revisit of Compute!’s Gazette and pondering its paper-based binary software distribution. I’d already looked at its two BASIC-focused automatic proofreaders in detail, but these were each one program of a pair. Its companion program was called “MLX,” and its job was to allow you to enter […]
Show full content

A quicker article this week, inspired both by my revisit of Compute!’s Gazette and pondering its paper-based binary software distribution.

I’d already looked at its two BASIC-focused automatic proofreaders in detail, but these were each one program of a pair. Its companion program was called “MLX,” and its job was to allow you to enter programs published simply as long columns of bytes.

This kind of binary listing wasn’t uncommon, back in the day—many old computers had simple debuggers built in called machine code monitors that would let you load up memory with strings of hexadecimal values, save them to disk, and load them back later. That approach had two disadvantages, though: first, an error in the data entry could be insidious and nearly impossible to track down, and second, not all computers actually came with monitors built in. The Commodore 64, for example, did not. MLX was a self-contained, monitor-like application for entering, saving, and loading a binary listing which required nothing but BASIC (and ideally the Automatic Proofreader to help you type it in), and it also included a checksum byte every few bytes to make sure that you had not made any errors. This made it far more likely that your machine code program was entered properly the first time.

(And yes, if this process sounds like a miserable slog to you, it’s not exactly a thrill a minute, but also, it’s still a faster process than one might expect. When I was on a roll with one of these listings, I could easily end up entering a kilobyte of binary in 15-20 minutes, and it was very rare to encounter a program listing over 6KB. Listen to an album or two along the way and it’ll be done in less than an evening, even if you take regular breaks to rest your eyes and hands.)

Despite being pretty effective, Compute! always admitted that there were some serious flaws in the checksums they used, despite shifting to a far more robust one in 1986. The revived Gazette brand introduced third editions of the Automatic Proofreader and MLX in January of 2026 and while the Proofreader’s functionality was identical, MLX’s checksum algorithm was rewritten yet again.

Today, I’m going to look at these three versions and what it was like to use them, then recreate their checksum algorithms in a modern system that will let us fabricate MLX listings out of our .PRG files in the comfort of our modern systems.

But first, let’s look at them in action.

1983: MLX

The first version of MLX provides a fancy banner, then asks you for the start and end addresses of the program you are entering:

Once we give the addresses the screen clears and we can start entering our code. We could, for instance, start typing in LADS, one of the assemblers we played with two weeks ago:

As you can see, the first version of MLX worked entirely in decimal and offered six bytes of data per checksum byte. As each line is completed, the program will beep happily if the checksum works properly and buzz angrily if it’s wrong, rewinding and prompting us to re-enter the line. Two hours or so later…

… we have all 5K of the program entered, the screen clears again, and we are prompted to save.

It then sends us back into the data entry mode, past the actual end of our program. From here we can reboot, and try out our fancy new program.

It is not visible in the screenshots, but one nice feature that MLX had from the start was that while in data entry mode it remapped some of the keys to work like a numeric keypad. I never really relied on this but I must imagine that this would have been a lifesaver at the time for those folks who needed to do this regularly.

A less friendly aspect of this first version of MLX is that the data you entered was placed directly into memory in the place it would eventually be loaded. This in turn meant that if you were writing a program that expected to be loaded like a BASIC program, you would need to prepare the ground first by manually rewriting BASIC’s pointers before loading MLX to keep it somewhere that will remain safe. This sort of technique came up last week in the onefiling discussion, as well as when I discussed loader generators in the context of hybrid BASIC/machine language programs as well, and it’s just as vexing here.

The checksum algorithm in this first edition is very simple: it sums up the six bytes you just entered, adds the low byte of the address that started the line, and takes the low byte of the result as the checksum. This is much, much better than nothing, but as Compute! itself admitted, it can’t catch typing the bytes in in the wrong order. More insidiously, if you make a matching typo on both a data byte and the checksum byte (say, consistently mistyping the number 5 as 6), there’s a good chance that your errors will cancel out producing a “successful” line entry that is nevertheless erroneous.

1986: MLX II

The update to MLX is less pretty in its opening but brings several extremely welcome changes.

We’re working in hex now, and instead of having to dig through an appendix to figure out what the control keys are to load or save partial work, there’s a full-scale menu! When we choose to “Enter Data”, we also see that we’ve got eight data bytes per line:

We look like a proper monitor entry system now! The numpad feature is also back, expanded a bit around the edges so that it can enter hexadecimal. Better yet, MLX II stores every program you enter in a buffer ranging from $3200$9FFF no matter what the final load address will be. That means we no longer need to poke at BASIC’s internal memory layout in order to enter a program with a BASIC stub in it.

The checksum algorithm is also much more sophisticated.

  • At the start of each line, the checksum is initialized to the low byte of the address plus twice the high byte of the address.
  • For each of the eight bytes in the line, this value is doubled, then the new value is added to it.
  • The final sum modulo 255 is the checksum of this line. If the sum was not zero, but is congruent to zero modulo 255, the checksum is $FF, not zero.

This is a much better system: it uses 15 of the address bits instead of 8 in the checksum, and order now matters thanks to the doubling before each sum. Earlier bytes do not end up shifting out of the checksum entirely because the modulus is not a power of two.

That non-power-of-two modulus does bring us to this edition’s openly-admitted problem: it cannot distinguish the bytes $00 and $FF. The synchronized-typo problem is also present, though lessened; in MLX 2 this only poses a consistent hazard when our error is in the checksum and the final data byte.

The strange test where at the end we probably, but do not necessarily, have to change 0 to 255 is an artifact of them doing modulo by repeated subtraction and testing with > when they should have tested with >= for a true modulo operation.

2026: MLX III

Which brings us to the present day. We seem to have persisted in a pattern of each edition getting less flashy but more robust:

We’re up to ten bytes per line, using as much screen space as we can. We’re not even bothering with reverse video for our prompts anymore, much less color. But we’ve also got a checksum algorithm that solves all the problems I’ve mentioned so far:

  • We start the line with three checksum bytes. The first is the low byte of the start address, the second is the high byte, and the third is zero.
  • For each byte in the line, first add its value to the first checksum byte. Then add the first checksum byte to the second. Finally, if the value was $80 or larger, increment the third byte.
  • The checksum value is the low byte of the sum of these values, unless that ends up being zero, in which case the checksum value is $FF.

This cumulative-sum approach ends up making the first byte contribute eleven times to the sum, the second ten, and so on down to the tenth byte contributing twice. That is very similar to what the Automatic Proofreader does, but is more efficient to compute. The third checksum byte means that we can distinguish zero from 255. And the fact that the last byte contributes twice means that a synchronized typo will now basically never result in accidentally typo-ing in an erroneously-correct checksum.

Presumably, the zero checksum here is forbidden out of solidarity with the MLX II results. This narrows the space unnecessarily—we actually are using all 256 possible values here—but with only ten bytes to summarize it doesn’t seem too harmful.

MLXKit

I’ve collected these three checksum algorithms into a simple commandline application: MLXKit. It accepts a C64 program and optionally a version argument, defaulting to 2. It will then print out an appropriate listing, along with start and end addresses. MLX 1 listings will include the pointer-hacking commands to type if necessary.

This turns out to be a simple enough program that it I got it to build and run on DOS and the Amiga as well:

Why yes, this listing will give you a ready-to-run, pure machine-code version of my first project on the blog.

It’s always fun when a modern program runs unchanged on 40-year-old systems.

http://bumbershootsoft.wordpress.com/?p=5647
Ophis 2.3 Released
nonretrocodingretrocoding
I’ve released version 2.3 of the Ophis Assembler to its website and to PyPI. The changes in this release are pretty minor in the grand scheme of things, but they address some long-standing irritations that pinched me during my recent Commodore 64 adventures. Doing the work on the parser improvements actually led to some adventures […]
Show full content

I’ve released version 2.3 of the Ophis Assembler to its website and to PyPI. The changes in this release are pretty minor in the grand scheme of things, but they address some long-standing irritations that pinched me during my recent Commodore 64 adventures.

  • If you want to override the precedence of arithmetic operators, Ophis insisted that you use square brackets: 1+2*3=7, but [1+2]*3=9. You may now use parentheses in the traditional way, or even curly braces if the spirit so moves you.
  • Anonymous labels in Ophis are traditionally written as an asterisk starting the line. Many old assemblers treated this as a special form of comment, and syntax highlighters sometimes filed suit, messing up your source display. A lone colon, ca65 style, now also works.
  • Syntax error messages in Ophis have often been misleading. These errors normally state what kind of input the parser was expecting at that point, but those lists were both incomplete and often included suggestions for addressing modes that only exist on CPUs you weren’t targeting. I’ve improved the reporting on this a bit.

Doing the work on the parser improvements actually led to some adventures in their own right. That will show up as a proper article at some point in the future, I think.

http://bumbershootsoft.wordpress.com/?p=5682
Onefiling
commodoreretrocoding
The C64 Ultimate isn’t the only old brand attempting to revive itself for the modern retrocomputing scene. Compute!’s Gazette has also been bought and is well into its second volume of issues as I write this. I expect I won’t have too much to say about it on the blog overall, but that shouldn’t be […]
Show full content

The C64 Ultimate isn’t the only old brand attempting to revive itself for the modern retrocomputing scene. Compute!’s Gazette has also been bought and is well into its second volume of issues as I write this. I expect I won’t have too much to say about it on the blog overall, but that shouldn’t be understood as a knock against it—it’s gotten regular features from names I respect, and it’s been publishing a solid mix of historical retrospectives, articles about modern retrotech, and interviews with figures in the space both past and present.

It also has been publishing type-in programs, a feature I’m a little ambivalent about here in 2026. I won’t be doing any “type-in rescues” on these—for the code that I wouldn’t consider “unusually-published commercial software,” they have code repositories I could just submit patches to—but reading through them has raised the salience of some old techniques I never properly explored.

In particular, I was inspired here by Gift Drop, a C64 game published in the December 2025 issue of the Gazette. This is a very slick and smooth action game:

Like many of the fast-action type-ins, this game is a hybrid of BASIC and machine language. The program overall ends up having four components:

  • The BASIC program itself, loaded by the main interpreter by a LOAD command, and expecting to occupy $0801$414E.
  • Sprite graphics data, loaded into $BD00$BFFF by the BASIC program itself. This is pretty wild, first because it suggests that the game is remapping the VIC-II memory, and second because this data region is occupied by the BASIC interpreter itself. That’s actually fine, though, because as we noted in our C64 spritework discussion, the VIC-II gets an unobstructed view of the RAM the BASIC ROM shadows. The happy result is that this gets all the sprites the game uses in place without consuming any RAM that BASIC can itself use.
  • Machine language support routines, loaded into $C000$C9D6 by the BASIC loader. These are loaded by independent logic, which is kind of wacky because it directly abuts the sprite data; this could have been treated as a single binary blob.
  • A small, independent machine language support routine, loaded into $CA08$CA4D that lets the previous two blobs be entered into BASIC in hex. It also produces a cute little spinner as the bytes are loaded into place.

It would have been unusual for this much data to be managed by a BASIC loader in the Gazette‘s heyday. The BASIC loaders here take over a minute and a half to actually get the data into place! A game like this would probably have been printed as pure machine code and entered entirely as a hex dump. (The Gazette published a special program called “MLX” beside its automatic proofreader that would ease the entry of such dumps. Here in the modern world, you can mostly just copy-paste out of the PDF itself.) However, outside of the constraints of type-ins, we have other strategies we could employ.

Splitting the Program

The most direct approach here would be to put our binary data in a separate file. This dramatically shortens the BASIC program, since we can strip out nearly 200 lines of DATA statements totaling well over ten kilobytes, and we can replace it with the 3KB it decodes down to. We do, however, still have to get that data into place. This turns out to be an excellent place to use what I called the “LOADception” technique when I discussed strategies for creating hybrid programs. In the original code, line 110 printed out a “Loading, please wait” message, lines 120-150 called out to the loader subroutines, and line 160 replaced the message with a prompt to begin gameplay. This means converting the program could be done with the following steps:

  • Collect all the hex data from line 1690 on in the source and turn it into an actual binary file with those contents.
  • Prefix that binary file with its load address of $BD00. The loader code may have ignored the fact that these two regions are adjacent, but we don’t have to. This turns the file into a valid .PRG program that the C64’s LOAD command will put into the proper spot.
  • Add this file to the same disk as GIFT DROP with a name like GIFT DROP.DAT.
  • Delete every line in GIFT DROP from line 1390 on; this clears out not only the data we converted, but the loader routines that put it into place.
  • Also delete line 90, and lines 120-150, which call those routines.
  • Add a new line 120 with the commands OK=1:LOAD "GIFT DROP.DAT",8,1. When this line is hit, BASIC will load the binary file into place and then restart execution from the beginning of the program.
  • Put a new line at the very top of the program, say as line 5, that resumes where we left off if we discover we just finished our load: IF OK=1 THEN 160.

This is pretty nice. We could maybe improve it further by replacing the logic involving the OK variable with a few PEEK checks to see if it looks like the data was already loaded; that way, if we run the program twice, we don’t have to load the ancillary data twice.

Re-Merging the Split Program

Another way to avoid loading the ancillary data twice would be to have an initial step that loads all the code and data into its proper place before the BASIC parts of the program even start running at all. In this case, we’d delete all the same parts of the BASIC program we did for the split-file case, but would not need to add any new lines to manage the load.

That loading could be done directly with what my earlier article called a “dedicated loader program.” These essentially arranged the necessary configuration and loading commands to run in BASIC but not as part of a running BASIC program, allowing the final mixed program to be properly loaded and configured before synthesizing a RUN command to begin the program proper.

That approach assumed that the developer was primarily developing in BASIC, wished to solve their problems in BASIC, and was happy to keep each loadable module in its own file. We do not have to rely on these assumptions. Indeed, I don’t have them: in my own work, I strongly prefer shipping a single file that contains all the necessary data in one place, and which appears to the user after loading to be a one-line BASIC program that forwards to a machine language program in memory that BASIC won’t touch. We can attack this problem from that angle, too. In this case, the machine language program will hold the entire BASIC program, the graphics, and the machine language library within it as big data tables, and the program that runs when you type RUN will carry out these three steps:

  • Copy the BASIC program and the binary components into their intended locations at $0801 and $BD00.
  • Stuff the keyboard buffer so that the next time the system asks for user input, it receives the string RUN and a return key.
  • Warm-reset the BASIC interpreter so that it goes back to accepting commands without attempting to go through the normal program exit process.

This is quite easy to implement. The first step is just two block copies; if this were a Z80 system instead of a 6502 one it would be two LDIR commands. The second works just like it did in the BASIC autoboot program: write the number of characters to $C6 and the characters themselves into the keyboard buffer at $0277. The third is simply the instruction JMP ($A002).

There are two aspects of this that we have to be careful about. The first was that we must put the actual copy logic at the end of the program rather than at the beginning; moving the BASIC data into place would destroy anything in the program before it. (This is also why we had to JMP ($A002) instead of simply using RTS—by copying the BASIC part of the program into place at $0801, we obliterate our initial BASIC program as part of it and would not be returning into the same BASIC program we started from!) The second is that no matter how we organize the program, the memory range that we’re copying the BASIC program into will include some of the memory range that we’re copying the BASIC program from. We need to make sure we never overwrite any part of the program before we’ve already copied it. That’s pretty easy in this case because it turns out that a normal loop that just copies each byte in order from first to last will let this work. Certain kinds of loop unrolling would break this, so we just don’t do those.

The end result is very nice: we get a single .PRG that’s about 6KB long, and that we LOAD like any other BASIC program. If we LIST that program it gives us the usual 10 SYS 2062 BASIC stub. If we RUN it we find ourselves in the game immediately, ready to play with no further preprocessing. After exiting the program, if we LIST again, we will see, not our initial BASIC stub, but the actual text of the main BASIC program. Keeping everything in a single file also means that accelerators like the 1541 Ultimate II (which the C64 Ultimate includes within it) can load the entire program basically instantly via their “DMA Load” facilities.

This was a pretty common way to distribute software on the C64; I’ve always heard the process of packaging it up in this way as “onefiling”.

Sidebar: Bulk Memory Copies in C and C++

The standard library function for copying a block of memory from one place to another in C and C++ is memcpy(). This function insists, however, that the source and destination regions not overlap, and it makes no guarantees about what will happen if they do. When it goes wrong, what usually happens is that parts of the data get repeated over and over, replacing what should have been fresh data. The LZ77 compression algorithm and its successors like LZ4 actually rely on that sort of behavior to efficiently compress data that repeats a pattern; however, it cannot use memcpy() to do this because it does not guarantee any particular failure mode.

The more interesting case for us here is the case where the source and destination regions overlap, but we want the copy to succeed—that is, copying into the destination should destroy the source data, but the contents of the destination region should exactly match what the source region was at the start of the copy. The C and C++ standard libraries offer a different function to guarantee this, called memmove(). One simple way to implement memmove() is to check whether the source memory begins before or after the destination memory. If the source memory begins after the destination memory, as it did for Gift Drop‘s BASIC data, copy the data over a byte at a time, starting with the first byte and moving to the last. If the source memory begins before the destination memory, as it did for Gift Drop‘s graphics and machine code, start copying from the last byte and work your way backwards to the first. In each case, if a write to the destination destroys a byte in the source buffer, we’ll have already copied that byte so nothing would be lost.

As it happens, I copied first-to-last for both blocks for Gift Drop, because the graphics and machine code didn’t overlap its source data. If the source and destination buffers are disjoint, you can basically do whatever you want.

Generalizing the Technique

I made the copier above by hand, but it was simple enough to do that it should be possible to automate. Our problem statement is simple: given a program with multiple, discontinuous blocks of code and data scattered through memory, turn them into a single machine language program that is a single continous block of data and a small program that unpackages them into their necessary locations and then hands control to the true “main program.” In this more general form, we have a few more constraints on us, but all of them have straightforward solutions:

  • Copying code and data into place can’t clobber our unpackaging program. The easiest way to solve this is to copy the packager somewhere the target program won’t be and run it from there. This will likely require the use of offset assembly, but since the packaged program all fits in memory in the first place we shouldn’t have any trouble finding a spot to run from. In the worst case scenario, we could run the unpackager out of the screen memory.
  • Copying code and data into place can’t clobber code and data that we haven’t copied into place yet. The most direct way to solve this is to store the target program blocks in the order they’ll be in once they’re unpackaged, and then copy then into place from the “outside in.” Blocks that move forward are copied last to first (until we hit the first block that moves backwards), and then blocks that move backwards are copied first to last.
  • There are multiple ways to start a program. This reprises the problem that the autoboot generator had to solve, and it has the same solution. A pure machine-language program needs to supply a start address to transfer control to once unpacking is complete, and a BASIC program needs to reset the interpreter and synthesize a RUN command.
  • BASIC programs can be relocated, but not trivially. Gift Drop happened to copy its BASIC component into the normal load location of $0801, but it wasn’t unheard of to not do this. One of the tasks a dedicated loader program might do, in addition to loading and running the program’s individual components, was reconfigure BASIC’s pointers so that the BASIC program would occupy a different chunk of memory. This generally allowed it to free up space for the VIC-II to use bitmap graphics or other memory-hungry operations without competing with the BASIC program itself. We also saw how to address this when investigating autobooters. It is a two-part process: we need to update BASIC’s internal pointers to correctly point at where we want the BASIC program to begin and end, and we need to fix up any internal pointers within the program. The former involves setting appropriate values to the pointers in the $2B-$30 range, and the latter just involves calling a routine in the BASIC ROM at $A533.
Other Platforms

In the 8-bit world, this technique ends up being unusually Commodore-specific. It’s rare for BASIC to be so fundamental to the OS that a manual LOAD and RUN is the way to run every program in the world. It’s rarer still within a BASIC environment to lack a command like Apple’s or IBM’s BLOAD that manages binary components of a hybrid system. Without something like the 1541 Ultimate, there isn’t much benefit to loading one file instead of three during startup, and without something like the C64’s odd memory discontinuity where user RAM spans $0400-$9FFF and then also $C000-$CFFF, there’s rarely cause to juggle memory locations post-load anyway. The closest we get is the Atari 8-bits, and while it didn’t have the kind of BASIC integration we’ve outlined here, its binary executable format was wonderfully flexible in this regard.

Outside of the 8-bit world, things that look like this are extremely common, but it’s because they’re part of how programs are loaded in the first place. I did a brief tour of how this problem is attacked a few years ago, and if the initial relocation operation (“link-loading”) wasn’t sufficient for your needs, you almost certainly needed to use something like overlays to handle the work you wanted and thus wouldn’t fit into a single neat package like we’re offering here. The “single neat package” option on these systems has always been something morally equivalent to a ZIP file.

We can look at it the other way, too: onefiling is the technique that results when you take link-loading out of the world of proper operating systems and bring it to the world where loading a program is simply copying a chunk of data into memory.

http://bumbershootsoft.wordpress.com/?p=5635