Abandonculture — GeistHaus

Opinion/Rant: RISC isn’t about instruction set cleanness (size or redundancy)

wareya Apr 5, 2026

Show full content

There’s a pervasive understanding of the RISC vs CISC divide in tech spaces (these are kinds of processor/computer chip) that seems to take its framing from the PowerPC vs ARM vs x86 era of the late 90s and mid 200Xs. I think that this period of history lead to some misunderstandings of what the RISC vs CISC divide looked like at the time that the distinction was thought up, and leads to people attacking and defending the wrong parts of a design (e.g. RISC-V vector instructions or its Zicond extension), or describing things like modern x86 processor internals in confused ways.

The thing that I think is a misunderstanding goes something like this. RISC is about being “clean”. They don’t say it in terms of being clean, but rather some combination of having fewer instructions, making them do fewer things, avoiding compound or elaborate instructions, limiting how many operands they have, avoiding complex addressing modes, etc. (This misunderstanding doesn’t necessarily come from the name; it often comes from comparing PowerPC and ARM to x86.) When programming or compiling, operations should be broken down into atomic individual steps, and when making hardware, it should be designed to expose just those atomic individual steps. If you can break something down into two or three existing instructions, you shouldn’t add a new instruction for it, and you shouldn’t add three ways to do the same thing.

RISC designs do tend to be “cleaner” in these ways, but I think that treating it as the RISC philosophy is a horrible misunderstanding of where RISC came from and what it’s good for. Taken to its logical conclusion, you end up with a bunch of design heuristics that do you more harm than good. It means that you can’t have, for example, a cmov or csel instruction, because you can break those down into three or four bit munging instructions, which makes the instruction set cleaner. Or you can’t have a fused shift-and-add instruction, because you can break that down into separate shift and addition steps.

It is simply unreasonable to reject instructions just because you can break them down into two other instructions. RISC-V knows this, which is why it has, for example, the sh3add instruction. This looks redundant, but it’s not, because the performance is different compared to running a bitshift and addition separately. If an operation is used often enough, giving it its own instruction saves decoder and dispatch time and makes your code faster. Vroom vroom.

Where did this misunderstanding come from? Some people think the misunderstanding came from the “reduced” in the name, but I think that’s only part of the problem. I think it came from the ARM vs x86, and also PowerPC vs ARM, debates. Where people were arguing about purity and trying to one-up each other by pointing out downsides in the thing the other person was using. x86 is big and has too many instructions and it’s hard to decode so that must be why x86 processors run so hot, right? ARM is cool. RISC is cool. Less heat. More vroom per watt.

But when RISC was first envisioned, the computing world looked very different.

You had CPUs with instructions like “multiply these two decimal-coded binary numbers in memory, the sentinel is 0xFF” or “allocate a serial port ownership handle” or “calculate this entire polynomial, thanks”. These instructions were implemented almost entirely in a tiny program running in the CPU’s frontend called its microcode, and the microcode was often very, very buggy. So you had these super complicated instructions in the CPU that were programmed instead of part of the silicon, and you couldn’t access the individual operations they were running inside of those programs, and they were super slow to run, and they were buggy, and you couldn’t patch your microcode because it was baked into the silicon. Your CPU had a tiny OS in it that you couldn’t poke at or debug and it was breaking your code. And you were in this situation because the CPU’s designers decided it was OK to put that tiny, sorry, not tiny, actually giant program into the CPU in the first place. And you ran into problems because of this *constantly*.

Despite the way I’m describing it, at the time, it wasn’t immediately obvious that these incredibly task-specific gigantic microcoded instructions were a bad thing. At the worst it looked like they would just be unused features that you could ignore. But they complicated the CPU’s design and caused edge cases in how pipelines worked and inhibited optimization, and compilers only used them when they were forced to. Someone had to sit down, stare at it, with an understanding of all three of instruction set design, hardware design, and compiler design in mind, and come to the realization that these instructions were one of the main problems.

RISC was envisioned in that context. Instead of having half of your instruction set consist of weird task-specific instructions, you make the programmer use subroutines to implement those task-specific behaviors on their own. Then the programmer can debug and fix them properly. You’ll need to add a few instructions back in after removing all the task-specific ones, because they depended on functionality that you previously only exposed via those OS-like abstractions, but that’s a small price to pay.

In other words: instead of exposing behaviors, expose the tools used to build those behaviors.

You might look at this and go, okay, so we’re back to splitting instructions down to their fundamental parts, right?

No.

Do you think it’s a good idea to split a multiplication down into a series of several dozen additions, bitshifts, and masking operations? No? It’s probably going to be super slow, right? Right.

You stop cutting instructions up at the point where it starts making things slower. You stop cutting instructions up at the point where you stop looking at behaviors and start looking at foundational operations. Sure, you can represent some foundational operations in terms of others, but once you DO have a foundational operation, you should probably stop cutting it up. This is why sh3add exists in RISC-V: it’s a foundational operation. It represents the act of calculating an offset inside of a non-unitary set of items. Array indexing. Any software you run is going to do something like sh3add *constantly*.

Does this mean that you don’t want a microcode anymore?

Well, if your microcode is really simple, no, it doesn’t mean that. The goal was to get rid of the super slow bug-prone microcoded instructions, not all microcoded instructions in general. You might genuinely want to run into a MUL instruction, go “ah, crap”, and emit a bunch of bit shifts, masks, and adds, because your ALU doesn’t support multiplication, because you’re on an extremely small chip. And you want it to be a MUL instruction instead of the compiler or programmer implementing multiplication out of addition manually because code takes up space in cache, and because the frontend/decoder is one of the most expensive parts of a CPU to run (especially one simple enough to not have multiplication in its ALU).

So you don’t ENTIRELY want to get rid of the microcode. You probably want to get rid of as much of it as possible, but there are a few loose ends that you might want to leave there for practical reasons.

The microcode is not an operating system. It’s just a microcode.

But, devil’s advocate: let’s say you really can’t justify having a microcode program. Microcode emits micro-ops, which are the instructions that the processor dispatches internally without being part of the front-facing instruction set. Do you need to get rid of micro-ops too?

Answering the devil’s advocate, no, you *don’t* need to get rid of micro-ops just because you got rid of the microcode. Look at sh3add for example. Do you want to be forced to add a sh3add calculation complex to your super small microcontroller’s ALU? No? You’re probably going to want to dispatch separate addition and shift instructions, right? Right. You can throw a tiny state machine into your MCU’s frontend, and make it crack unsupported instructions into multiple backend instructions. This isn’t run by a microcode program, but baked into how the instruction decoder works as a tiny state machine. And this is entirely fine. If you can handle memory loads taking a few cycles, you can afford to handle the “wait until this is done, then do this” pattern that this causes. If you *can’t* handle memory loads taking a few cycles, you’ve gone beyond “low-spec” and into “so un-specced that it only makes sense as a sub-sub-sub-processor or an academic learning experiment”.

You don’t have to be afraid of micro-ops just because you’re working on something that’s RISC.

x86 actually has what I think is a great example of the RISC philosophy in action. There was this thing that some encryption code needed, and instead of adding accelerated support for that specific encryption code, Intel added the Galois Field New Instructions. These instructions look incredibly arcane and overly task-specific, but in truth, they’re “just” a bit of accelerated math. They ended up being incredibly useful all over the place, not just encryption but even in emulators and simulations. Intel looked at what they *could* have added, and realized that adding this one piece instead would be a better use of silicon, and maybe useful outside of the original thing they wanted to make faster (encryption), and people ran with it. Splitting the operations up any further would start to make things worse again; you’d end up in a scenario more similar to standard math primitives, and it might have still been a good addition, but not as good as what we got in the timeline we live in. (I swear I’m not handwaving here; the abstraction level they chose was the right one for genuinely complicated number theory reasons.)

This actually comes back in an interesting way. When we look for what operations are “foundational”, or where splitting things starts to make things slower or worse, we can’t do that in a vacuum looking at spec sheets or matrixes of supported numeric operations or behaviors. We would have never accepted the galois field instructions if we were doing that. We need to look at what people are actually doing with the hardware.

RISC-V has gigantic vector processing instructions. And it should! But they look, at a superficial level, like they aren’t RISCy. If you showed them to someone from the 80s they’d probably say “this is RISC? are you sure? are you sure it’s not a CISC DSP?” — but they’re RISC.

Vector instructions are genuinely useful. Universally. We know that now. It’s worth treating them as foundational items instead of splitting them up and telling people to macro-op-fuse individual additions or whatever because that feels more RISCy to us. Because we know what compilers and programmers actually do with these instructions and why they need them, and it’s faster if we give them to them. If you were fixated on a smaller instruction set, you’d try to axe the vector instructions. But you can’t. You need them.

This might seem unsatisfyingly subjective or situation-specific, but that’s also the point. The right amount of “reduced complexity” to feel RISCy depends on what people are actually doing with your hardware. A csel/cmov instruction would be insane on a tiny MCU with no branch prediction, but indispensable on a desktop/laptop CPU. An atomic increment-or-retry ritual for “just increment this atomic memory value” looks horribly wasteful on a desktop CPU, but like a godsend on a two-core data processing chip.

There’s an extent to which it looks like RISC designs and CISC designs have converged, especially on the high end. ARM has conditional execution modes that makes x86’s memory addressing units blush, and modern x86 implementations have a relatively small RISCy engine hiding behind the frontend. This is true to some degree, but I think that the convergence has some inevitability to it, and is some kind of new third thing that doesn’t have a definitive name, rather than being a halfway point between RISC and CISC. It doesn’t *really* CISCify ARM for it to have crazy conditional modes, and it doesn’t *really* RISC-ify x86’s backend hing for an x86 chip to emit multiple micro-ops in response to an atomic memory operation. It might un-RISC-ify ARM and un-CISC-ify x86, but the place it leads isn’t in the middle, or on the opposite side, it’s somewhere else. The CPU is building a new program, on the fly, out of the parts you feed it, and the instruction set is optimized for *that*, instead of for assembly programmers (CISC) or to expose the pipeline to the compiler better (RISC).

…If you wanted me to say something about RISC designs tending to have load-store architectures, I’m sorry to disappoint, but that aspect is just not that interesting to talk about. People have much more reasonable understandings about it on average. The closest I can get to a hot take on it is, it’s probably not un-RISCy to add a single-item mem-to-mem move instruction, or a specialized vector instruction that operates on an address instead of a vector register, when the situation demands it. And it’s probably un-RISCy to add complex addressing modes to every single operand of every single instruction. But in the gradient of possible designs between everything else, there aren’t very many other sweeping claims about how to do memory accesses that I’m comfortable affirming or denying.

People generally understand the memory addressing part of the RISC design space and philosophy, and I can’t add anything to that part of the topic. I feel the same about fixed-width-instructions. RISC-V compromising and having compressed instructions was philosophically and pragmatically the right choice, and doesn’t make it any less RISC in my eyes, and people with bad opinions on instruction width are either uncommon or don’t hold those opinions strongly enough for it to be a problem.

http://wareya.wordpress.com/?p=931

Extensions

Small bit: Cheap specularity masking (possibly a new technique?)

wareya Apr 3, 2026

Context: I have Physically-Based Rendering shaders for OpenMW. OpenMW is mostly used with content from Morrowind, so there aren’t a lot of useful modern niceties like bent normals or accurate baked AO maps, and most of the content was not designed for PBR at all, it’s ducktaped on. For example models don’t have unique textures, […]

Show full content

I was looking at techniques for occluding specularity where it shouldn’t be visible, like bent normals, specular occlusion, etc. Those approaches range from the extremely rigorous (bent normals) to the heuristic (specular occlusion). I think I’ve come up with an approach that’s slightly rigorous while not requiring any extra work from tool or content authors.

The metaphor is this: AO roughly tells you the size of the geometric visibility cone coming out of a texel. It’s not perfect, but it’s a reasonable approximation. In theory you could teach all your specularity math about visibility cone reduction. Instead, we can treat it as “reflections that bounce straight in and out are OK, but reflections that come in and out at a steep angle are blocked”. Diagram:

If this were a raytracer or path tracer this would be relatively easy to throw in in a physically coherent way, but unfortunately the real world runs on rasterizers. So instead I have this bodge:

    float specularMask = 1.0;
#if PBR_SPECULAR_AO_HACK
    // The squaring of ao is because some AO maps are linear and some are sRGB.
    float ao_closedness = 1.0 - ao*ao;
    // You can use just about any dot(V...L...N) here, as long as the value is mapped from 0 to 1 with tangent at 1.
    // -- The visual effect of doing so is different, but remember that this is a bodge anyway.
    specularMask = (dot(viewDir, lightDir) * -0.5 + 0.5) * ao_closedness;
    // Example alternative: specularMask = (1.0 - dot(viewDir, normalDir)*dot(lightDir, normalDir)) * ao_closedness;
    specularMask = 1.0 - specularMask;
#endif
    // ....
    specular *= specularMask;

This vaguely follows the occlusion behavior that the reduced-size cone would have. It’s not strictly correct, because the *shape* of the specularity lobe should be getting affected too, not just its brightness, but it’s close enough. The reason I’m ignoring roughness in this bodge is that the BRDF function should take into account all the microfacet roughness effects, so the remaining part of occlusion, at the level of texel normals, comes mostly from macro-facets. In other words, a roughness-aware version of this bodge would need to be *inside* of the BRDF, not a mask applied afterwards.

I scanned some similar techniques but they tend to be roughness-dependent and/or use the pow() function, or they’re intended for environment mapping instead of direct light (and are designed differently because of that).

You can see in the below screenshots that the specular reflections that are more perpendicular are unaffected, while those that are more tangent are more affected. In these images, the effect is subtle, because the AO data is purely textural, and doesn’t include anything baked from the geometry of the models. The black and white image is showing the AO maps.

(EDIT: I got nerdsniped into making a better example. In particular look at the top of the bigger barrel, where the blue candle casts a specular lobe, and the edge of the furthest right metal band. Original example follows afterwards.)

(Original example:)

http://wareya.wordpress.com/?p=885

Extensions

I think bit-parallel NFAs on their own are incompatible with leftmost-anything semantics

wareya Mar 15, 2026

Prior skim: https://swtch.com/~rsc/regexp/regexp2.html Context: a bit-parallel NFA simulation is one that is similar to the “lock step” version of the above VM approach, but instead of a threadlist with information about captures or start positions, throws away all greediness and laziness distinctions and boils the state of the parallel NFA down to an array of […]

Show full content

Prior skim: https://swtch.com/~rsc/regexp/regexp2.html

Context: a bit-parallel NFA simulation is one that is similar to the “lock step” version of the above VM approach, but instead of a threadlist with information about captures or start positions, throws away all greediness and laziness distinctions and boils the state of the parallel NFA down to an array of “is each thread alive?”; this allows it to pack the entire state of the NFA into a series of registers and use SIMD operations to operate on every state simultaneously, giving a 64~512x speedup over naiively looping over such an array directly, which allows it to bypass all the dynamic overhead of maintaining even a well-architected thread list. This is faster than the threadlist approach for any NFA that is not both very large (thousands of states) and sparse (usually only ever in 2 or 3 states at once).

Further context: the “leftmost” in “leftmost-first” and “leftmost-longest” when describing regex engines means that the engine, in search mode, always gives you a result that *starts* as early as possible in the input text.

The simplest goal of a regex NFA implementation is a “find()” operation, which gives you the start and end position of one of the earliest matches in the input text, in O(n) time where n is specifically linked to the distance to the end of the match it gives you rather than the full length of the text. (This second point is important because it prevents repeated find() operations starting at the end of each next match from causing the total repeated process to be O(n^2).) This is ignoring cases where there is no match, which bail you out after O(n) work with most engines, and alternations that e.g. end in something like (.*z)? with no z in your input, where any find()-centric engine can cause O(n^2) iteration.

The specifics of this find() operation are somewhat important. It’s why, for example, the regex abax|ba|x matches the input abax once ([abax]) instead of twice (a[ba][x]).

Internal capture group positions are not a concern in this post, only the start and end positions of the entire match.

A single scan of a bitscan NFA doesn’t give you any start position, only an end position. You have to reconstruct the start position by doing multiple scans over the input. This is because unlike other NFA simulations, bitscan NFAs do not keep track of the relative start positions of different NFA states, which means that they’re limited to a smaller set of strategies for when to detect that a scan should be exited with a match. Here’s where this leads.

The following is the most robust multiscan I can think of:
1) unanchored “first match ever” pass.
2) reverse anchored longest-match scan
3) anchored leftmost-longest scan

This fails when matching abaa|ba against abaa. It gives you a[ba]a. This happens even if the regex is ba|abaa — it is not a problem of greed or laziness.

And you can’t just replace “first match ever” with “last match that’s part of the first overlapping match section”. You might think that the “that’s part of the first overlapping match section” restriction might save you and give you leftmost semantics, but it doesn’t. That would give you, for regex(aa) for example, the output aaa[aa] on the input aaaaa, instead of [aa]aaa. You might then try to unanchor the reversed scan, but now you’re back into O(n^2) territory for iterating find() with some non-pathological regexes.

I can’t think of any alternative termination condition other than “first match ever” or “there is a match and the current overlapping match section is ending”, that works for arbitrary variable-length regexes. You can’t kill existing threads that started later than a previously seen complete match began, because you don’t know when they started. Put differently: You must know what right-side end position to start the reverse scan from. But you don’t. You simply don’t know what end positions correspond to the leftmost match. You can bodge it in for string alternations, but not for quantifiers in general, which are the thing that makes NFAs stronger than Aho-Corasick.

There’s no fix.

You must either:
1) use a normal parallel NFA with startpos tracking (or a backtracking reference implementation)
2) use a DFA, or a “lazy DFA” (building out parts of the DFA only as needed to prevent exponential startup cost)
3) use an anchored NFA instead of a search NFA (which allows the last-match-of-section condition to give you leftmost-longest semantics, but makes searches potentially O(n^2))
4) give up on leftmost-anything semantics and just return the first-ending match of any kind

Note carefully that there are literally no quantifiers here at all, and the ordering of the alternations does not matter.

It might be possible to extend bit-parallel NFAs with the metadata needed to keep track of thread start and end positions. But such NFAs on their own aren’t enough for a proper find() operation.

Hyperscan (name of a library) seems to work around this by being designed around an API that gives you, iteratively, all the match-end positions in the string, and waiting on you to tell it when to stop looking. For a given end position, it can also give you a start position (which I assume is found by doing a multi-pass thing similar to the one I described). But for the abaa|ba example from earlier, it gives you the ba end position first, and then the abaa end position, and in an arbitrary quantified regex you have no way of knowing whether it’s finally given you the leftmost-possible match in the string until you finish exhausting the entire string. Once you *finally* get to the end of the string, you can go over all your past matches, but the set you end up with needs to be cleaned up or else it’s full of a bunch of overlapping matches, and it seems to be very much nontrivial to clean them up in such a way that you get a leftmost-anything iterated find() set of matches. To me, this makes hyperscan’s approach very much task-specific, since you can’t break up the scanning process with other checks or sub-processes like you can with the find() approach.

You might look at the earlier (.*z)? example and conclude “so hyperscan’s non-find() API is right, then”, but that’s not really how this shakes out in practice. Most tasks simply *need* an iterative find() approach because they’re themselves part of other, bigger iterative processes, like a user-directed one-by-one find-replace operation in a text editor, or context-sensitive chunk-scanning in a debugging tool. The most robust approach is to fire off warnings when someone designs a regex that has overly broad greedy quantifiers in it, and eat the O(n^2) if the user insists.

http://wareya.wordpress.com/?p=844

Extensions

Beta release: Predicated Recursive Descent for Rust

wareya Feb 1, 2026

TL;DR: I made a Rust library (https://crates.io/crates/pred-recdec) where you can write BNF (Backus–Naur form) grammars that are as strong as manually-written recursive descent parsers. This is not a normal parser generator. The philosophy is different. I’ve been working on parsing-related things for several months at this point, and I ended up with a design where […]

Show full content

I’ve been working on parsing-related things for several months at this point, and I ended up with a design where I have an actual working C parser that only took a few days to build and another few days to churn through test cases, once I managed to build the abstractions needed to do it. (C parsers usually take months to write and churn through test cases for, not days.)

It’s not only for parsing C. You can write arbitrary parsers in it, like JSON, S-expressions, etc.

Performance wise, it’s slower than native code, but still fast enough to be entirely usable. The C parser is only slightly slower (10~20%) than GCC and Clang in -fsyntax-only mode. The JSON parser is “only” about 5x slower than a pure native code highly-optimized JSON implementation like serde_json, despite running in an interpreter and having a tokenizer (which JSON parsers usually don’t do). (Yes, my library has a tokenizer. No, you don’t have to write a separate lexical grammar.)

I named the end result “pred recdec”, for “predicated recursive descent”, and you can look at it and use it here:

https://github.com/wareya/pred_recdec/

https://crates.io/crates/pred-recdec

Here’s a pred recdec grammar for S-expressions:

S ::= @peek(0, "(") parenexpr
parenexpr ::=
    @peek(1, ")") "(" ")" # avoid bloating the AST with empty itemlists
    | "(" $become itemlist
itemlist ::=
    @peek(0, ")") ")"
    | @peek(0, "(") parenexpr $become itemlist
    | @auto r`[a-zA-Z_][a-zA-Z_0-9]*|[0-9\.]*`r $become itemlist

How does this work? What does it do?

This is basically a BNF-shaped domain specific language for writing handwritten recursive descent parsers.

Normally, handwritten parsers end up being a mess of macros or boilerplate, because you need to duplicate or generate the code for managing AST state and performing lookahead operations. In pred_recdec, as much as possible of that boilerplate behavior is controlled by the shape of the BNF grammar and syntactically-small guards. It also has tail call optimization (useful for left-factoring and lists), which is often hard to do in handwritten parsers because of the AST manipulation they need to do.

This is different still from how parser generators normally work. Normal parser generators handle local nondeterminism on your behalf, generating lookahead sets or state machines or performing memoization and backtracking, or some combination of those. My library does none of that: you have a direct view of the control flow that the parser actually ends up running, and you need to predicate every alternation/production manually. You don’t have to debug shift-reduce conflicts or first-follow set conflicts or table update rollback failures, just your own control flow logic.

It’s a half way point. You get most of the benefits of a formal grammar (easier to read, reason about, and dump into programs that manipulate grammars) without throwing away the capabilities of handwritten parsers (control flow is linear, so you can throw in as much impure stateful code as you want, as user-defined hooks or guards).

For that last point, here’s a snippet of my C99 grammar:

primary_expression ::=
    @peekr(0, r`[L]?\x22.*\x22`r) string_literal
    | @auto "_Generic" "(" !hook(many_balanced) ")"
    | @guard(is_ident_not_known_enum) $become_as identifier
    | $become constant

The @guard and !hook items refer to user-defined code. If you really need to, you can write an entire parser or memo engine in them. You shouldn’t, but you can. That’s where this ended up, it really is just as strong as manually-written recursive descent parsers, but shorter, easier to reason about, and confining the non-context-free-grammar parts to only where they’re really needed.

“Isn’t this just ANTLR?” No, ANTLR tries to be ambiguity-tolerant. That causes a whole bunch of deep-reaching differences.

http://wareya.wordpress.com/?p=818

Extensions

Short bit: on scannerless parsing again

wareya Jan 4, 2026

Okay so my last post (https://wareya.wordpress.com/2026/01/03/short-bit-lexing-and-non-lexing-scanners-parsing/) was hard even for my smart friends to understand, so I’m going to take another crack at this. This is going to have a slightly prescriptive tone. However, please do not take it that way. It’s just clearer when worded prescriptively. Take this as, “BEWARE: there is an ambiguity […]

Show full content

This is going to have a slightly prescriptive tone. However, please do not take it that way. It’s just clearer when worded prescriptively.

Take this as, “BEWARE: there is an ambiguity landmine here,” and NOT as, “this is how you must talk”.

People often interpret “Scannerless” in parsers as meaning either:

Doesn’t have a separate “lexical grammar”, next to the “structural grammar”
Doesn’t perform “lexing” (turning tokens into IDENTIFIER, COMMA, LEFTBRACKET, etc)

Both of these are sometimes wrong. The two most conventional kinds of parsers are the “per-character” (“real” scannerless, like regexes) and “per-lexeme” parsers (LR/yacc/bison/etc), and for those parsers, this misunderstanding fits. However, there is a third category in the middle that this misunderstanding does not work for.

You can write a grammar on a per-letter/character basis. “Real” scannerless.
You can write a grammar that chunks the input with strings and regexes, but doesn’t give those chunks “names” or “identities”.
You can have a grammar that chunks, but DOES give the chunks names/identities. This is “lexical analysis”. You are determining “which lexeme” each chunk is.

It would be very, very silly to assert that the chunking done by mode 2 doesn’t count as “scanning”. Even the Dragon Book (2006) asserts that, sometimes, you can separate the chunking from the lexing and call it its own thing, and it even (correctly!) calls it scanning. (It just assumes, implicitly, that there’s no reason to scan without also lexing, so it assumes that the scanner is always going to be taped to a lexer.)

For a few decades, only options 1 and 3 were viable. 1 for the natural language and academic people, and 3 for all real-world uses. But it turns out that 2 is viable too, once you get enough memory, and cache gets fast enough. Also, parsing strategies just got better in general. We got Packrat and parser combinators, and Earley became viable.

Point being, remember: when you see/hear/say “scannerless”, be *aware* that it can mean “1” vs “2/3”, and not just the other two things it can mean.

This isn’t pedantic. Some real world Packrat parsers are secretly type 2 (with string interning)*, despite looking like type 1 (because you don’t have to write a lexical grammar for them). Some format-specific parsers do ad-hoc chunking, like simdjson, without lexing, and look like type 2 more than 1 or 3. It’s also why some languages are hard to parse with tools from the 80s: “soft keywords” are supported poorly by lexers (type 3), UNLESS you do lexing at the last possible moment, immediately before the parser asks for the next token, with an on-demand lexer that’s aware of context (LR parsers suffer the most from this).

It’s not just trivia, either. It’s likely that someone out there dismissed a fast Packrat parser because it looked scannerless, and they know that scannerless parsers are slow, without realizing it might be type 2, which is still fast. They could have avoided this mistake. It affects correctness, too. Chunking Earley and scannerless Earley (group 2 and group 1) parse slightly different sets of inputs. The chunking does some disambiguation up front. And lexing Earley (group 3) is different from group 2 yet again (because it does even more disambiguation up front). So again, it’s not just a matter of trivia or performance.

(* String interning doesn’t imply lexing. “if” is still “if”, and not a keyword or identifier.)

http://wareya.wordpress.com/?p=781

Extensions

Short bit: lexing and non-lexing scanners (parsing)

wareya Jan 3, 2026

I’ve been poking around the internet looking at people writing about computer language grammar parsing stuff for the past few months, and there’s a point of persistent confusion that I’ve been running into, and it’s made it harder to sort through the data and opinions. Here’s what the point of confusion is and what’s actually […]

Show full content

There’s a class of parsers between the scannerless kind and the lexing kind. This class of parser scans the input (it has a scanner; it chunks the input) without performing lexical analysis (assigning identities/types to those chunks).

Scannerless parser: Operates directly on characters, like regexes. No tokenization at all.
Mysterious middle class: ?????
Lexing: A phase in some parsers that turns the input text into a sequence of tokens with strict lexical identities, like “IDENTIFIER”, “COMMA”, “OPENBRACKET”, etc.

What happens if you try to turn the input text into a sequence of tokens, but just… don’t give them lexical identities? This isn’t theoretical. There are a lot of parsers that do this. Including some fast ones.

Some people assume that “scannerless” must mean “doesn’t have a separate lexical grammar”. Some other people assume that “scannerless” must mean “doesn’t perform lexical analysis”. (These two groups have some overlap). This is not always correct. A tokenizer that takes a String and outputs an Array<String> according to the same character spans as a tokenizer for an LR grammar, can still be called a scanner. A parser that uses the output of that tokenizer is still usually called scannerful. Even if the lexical grammar is derived from the syntax tree grammar, and even if those substrings have no associated lexical identity.

This isn’t a distinction without a difference: being scannerful is usually a necessary condition for a parser to be Fast. So if you’re reading something about “scannerless” parsers and their behavior or performance, you should be aware that it probably means “doesn’t chunk at all”, rather than “doesn’t give lexical identities” (though, depending on the writer, it can mean the latter, too). And also, “lexing” can make or break a parsing algorithm. LR requires lexing, because otherwise the transition tables would be impossible to build. Earley discourages lexing, because it supports highly ambiguous grammars and you’re throwing some of that power away if you pre-disambiguate lexical identities.

So:

Scannerless (like regex)
Scannerful, but non-lexing (like some HTML/JSON parsers) (parser sees strings)*
Scannerful, but lexing (like LR parsers) (parser sees lexeme identities)

(* Yes, some HTML/JSON parsers skip lexing, yes, really, yes, I’ve seen it happen with my own eyes, yes, I know that they’re “doing it wrong” in some sense. )

Each level does more disambiguation up front than the last. Scannerful non-lexing parsers can’t parse quite as many ambiguous grammars as scannerless parsers can. Lexing parsers can’t parse quite as many ambiguous grammars as scannerful non-lexing parsers can. Performance also increases with each level. But the difference in performance between scannerless and scannerful-nonlexing is Massive. The difference between scannerful-nonlexing and lexing is real, but relatively small. You can often ignore it (aside from LR parsing requiring lexing to work at all).

If you’re worried about scannerful-nonlexing parsers being slow because of operating on strings, don’t worry. The scanner can intern strings from a cache so that all identical strings have the same pointer or identity. Yes, this is cheap. It can also precompute which regexes are capable of applying to which tokens so that it doesn’t have to check again later during parsing.

Also, if you’re wondering “wait, then how am I supposed to make my LR parser handle soft keywords (ambiguous lexical grammar)”, the answer is “on-demand lexing”. If you lex on demand then you can instruct the lexer to ignore certain lexeme classes when you’re at certain states or rules. Yes, this is different from scannerless parsing.

If you had this point of confusion, you weren’t wrong to be confused. This section of parsing theory is taught very poorly because everyone focuses on maximum-performance maximum-reliability things like LR or fully generalized things meant for natural language processing like CYK. Even more confusing, PEG/Packrat parsing suites are randomly either scannerless or scannerful, and the only easy way to know is if you benchmark them against each other, because they almost always derive the lexical grammar (for scannerful ones) from the syntax tree grammar. So it’s hard to run into this distinction organically if you aren’t yourself personally working on parsers.

Per the Dragon Book (2006 edition):

    Sometimes, lexical analyzers are divided into a cascade of two processes:
        a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
        b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

Read sections A and B carefully. For phase A, “scanning”, prior tokenization is not necessary. For phase B, “lexical analysis proper”, the “scanner” has already produced a sequence of tokens as output. Let’s bold that: it has “… where the scanner produces the sequence of tokens as output” as an earlier portion before lexical analysis. In other words, tokenization needs to have happened during (or, in even weirder niche architectures, even before) the “scanner” phase. This scanner phase is treated as, at least “sometimes”, being separate from before the “lexical analysis” phase.

The hidden assumption that the Dragon Book is making here, and why it sounds slightly confused (even though it’s not), is that it thinks that scanning without eventually doing lexical analysis is useless or rare, so A will always eventually feed into B, right? I want to note that this assumption was not wrong at the time, nor was it lacking forethought: this was the right way to explain these problems to learners at the time. But this is not true anymore: you end up with some parsing systems that do A without bothering to do B. But for a really long time, we existed in a world where almost all scanners were tightly bound to lexers, and usually implemented as the same chunk of code, so the distinction temporarily blurred together. Today, parsing architectures can be a lot weirder than you’d think!

http://wareya.wordpress.com/?p=721

Extensions

Short bit: Converting EBNF to BNF

wareya Dec 16, 2025

EBNF and BNF are ways of writing grammars. EBNF is slightly more complex than BNF, where EBNF supports some regex-like things like quantifiers/lists, but BNF supports only recursion, rule-naming (nonterminals), and tokens (terminals). Most explanations of how to convert EBNF to BNF are overly abstract or “mathematical”. Here’s the simple version, using “recursive regex”-style EBNF […]

Show full content

Most explanations of how to convert EBNF to BNF are overly abstract or “mathematical”. Here’s the simple version, using “recursive regex”-style EBNF syntax and left recursion:

Star: If you have an item X*, convert it to a reference to a new rule X_star that looks like: X_star -> X_star X | <empty>. Alternatively, you can lower it as “X_plus?”.

Plus: If you have an item X+, convert it to a reference to a new rule X_plus that looks like: X_plus -> X_plus X | X. Alternatively, you can lower it as “X_star X”.

Maybe: If you have an item X?, convert it to a reference to a new rule X_maybe that looks like: X_maybe -> X | <empty>.

Warning: If you’re using an LL or PEG or basic Packrat parser, you probably want to use right recursion instead of left recursion.

The above conversions imply greedy behavior in prefer-leftmost-match parsers. To imply laziness you can put the smaller alternation first instead of second.

Quantified groups like (A B C)* need to be pulled into their own new rules before applying the EBNF to BNF conversion, instead of trying to apply the quantifications to the individual parts.

In some situations, for some parsers, X_star might need to be lowered as, instead of the above, one of:

X_star -> X_star X | X | <empty> (…note: this lowering is “ambiguous”.)
X_plus_maybe -> X_plus | <empty> (…and also…) X_plus -> X_plus X | X

All three of these lowerings are valid and represent the same set of texts, but might have differing greediness/laziness behavior in obscure edge cases, depending on the parser being used. There are similar alternative conversions for the plus operator.

Pseudo-EBNF “exceptions” cannot be converted to BNF. So either don’t use them (you usually want to pick this), or extend your BNF implementation with them (you usually don’t want to pick this).

http://wareya.wordpress.com/?p=708

Extensions

Short bit: On Quaternions

wareya Dec 8, 2025

It is common for game engines to use quaternions, and it’s important for programmers to understand them, but this should not be taken as confirmation that quaternions are the default or main tool for handling rotations in game logic. 1) Camera angles are usually modified in euler space, even when the underlying data is quaternions […]

Show full content

1) Camera angles are usually modified in euler space, even when the underlying data is quaternions or matrices. There is a very good and obvious reason for this: gimbal lock is often the literal expected behavior for a physical camera in a physical space. If you look straight up and then try to turn 90 degrees left, you don’t want to end up looking at your west horizon with a 90 degree roll. You want to end up looking straight up but at a 90 degree roll.

2) Quaternions are NOT the main way of holding on to rotations in temporary calculations or cached calculations. They are primarily used for three things: One, storage; Two, orthonormalized animations; Three, spherical interpolation. Point three cannot be understated: easy SLERP (spherical linear interpolation) is the most important property of quaternions. But if you use quaternions instead of matrices in your shaders, or in your scene tree, or in your character controller, you’re making a mistake.

This is reflected in practice. Unreal primarily uses matrixes with quaternions only being in a few places. Unity uses a 50/50 mix of quaternions and matrixes, which is confusing but they at least try to use the right tool for the right job. Godot uses matrixes pervasively and only provides quaternions as an interface (and for some of its internal logic).

In short: do not railroad other developers into using quaternions. They are the right tool for some tasks, but not all tasks. In particular, do not be afraid of matrixes. They are not as inefficient as you probably think they are. (SIMD makes the extra scalar multiplications disappear, and you no longer have to do extra conversions to handle combined position+rotation. Also, multiplying quaternions by vectors sucks.)

http://wareya.wordpress.com/?p=686

Extensions

Short bit: measuring framerates, Time-Weighted FPS

wareya Dec 5, 2025

People who play video games frequently care about how well their games run, because it affects smoothness and input latency and a whole bunch of other random stuff like recording software etc. But when games have inconsistent performance, especially on a frame-by-frame basis, the obvious methods of measuring framerate over time start to have problems. […]

Show full content

For the sake of testing we’re going to look at a situation that “feels like” almost 50fps but is technically 100fps, where one frame lasts for 19ms and the second frame lasts 1ms; this is what gamers might call a “stuttery mess”. One of these frames only lasts for 1ms, so it should have way less of an effect on the measurement than the other, because we barely get to see it. We’ll also look at a pair of frames that are 10ms and 10ms as a sanity check: any measurement should give 100fps for this pair of frames.

TLDR: use sum(x_n)/sum(x_n^2), instead of n/sum(x_n).

The only method here here that Really Works is Method 4, and I don’t know if it has a name already or not. As an intuition aid, you can think of it as “time-weighted framerate”. You can skip straight to Method 4 if you want; Methods 1, 2, and 3 are only here to show what other ideas I explored that didn’t work, which is slightly useful but only slightly.

As a disclaimer, we’re imagining an ideal signal loop where, when a completed frame starts to “exist” and has its timestamp taken, it immediately replaces whatever came before it in its entirety. We’re intentionally ignoring any weird stuff like frame buffering and vsync. This is fine, because variable refresh rate displays exist and screen tearing exists too.

Method 1: Take a consecutive list of frametimes, sampled from consecutive frames, and divide each of them out of 1 (i.e. take their reciprocal), then average them. This gives you the “average framerate” where “framerate” is defined on a frame-by-frame basis and the average is taken by the frametime measurements at the end of each frame across the span of time that those frames cover. In our “stuttery mess” example, we have (1/0.001 + 1/0.019)/2 is (1000 + ~52)/2, which is around 526 frames per second. In the sanity check case, (1/0.01+1/0.01)/2 is 100. The sanity check passes, but the “stuttery mess” example gets adjusted in the WRONG direction, towards a higher framerate instead of a lower one. So this method is bogus.

Method 2: Take a consecutive list of frametimes, sampled from consecutive frames, average them, and then take the reciprocal of that average. This gives you the literal framerate for the course of time spanned by your frametimes. In our “stuttery mess” example, we have 1/((0.001+0.019)/2), which is exactly 100, while (1/0.01+1/0.01)/2 is 100. The sanity check passes, and the output of our “stuttery mess” measurement isn’t insane, but the “stuttery mess” number doesn’t include the influence of the fact that we’re stuttering. This method is valid, but not ideal.

Method 3: Take a consecutive list of frametimes, square them, take the average of those squares, take the square root of that, and then reciprocal that. (Doing the reciprocal early has the same problem as Method 1, so we won’t visit that method.) This gives us 1/(((0.001^2+0.019^2)/2)^(1/2)), which is about 74. The sanity check gives 100. This method is better than Method 2, but it doesn’t seem to fully express just how bad the stuttering is. We want to get a value that’s near 50, not about halfway between 50 and 100. Worse, if you feed it two frames where one is 0ms and the other is 20ms, it gives you about 70fps, even though the 0ms frame exists for zero time. But because we’re working with frametimes specifically, an input of [0ms, 20ms, 0ms, 20ms] should give the same output as an input of [20ms, 20ms]. Yes, I know, the math people reading this post are currently screaming “Stop using RMS wrong!” at their screens; don’t worry, I know that this isn’t that. We could try harder to calculate variance properly and make this work, but there’s a better way:

Method 4: Take the average frametime, define it as Q. Then, square every frametime. Then take the average of the squared frametimes. Then, divide by Q. Then, take the reciprocal. [This is how I originally derived this, but if you’re calculating it in code, do instead: sum(x_n)/sum(x_n^2).] For the stuttery mess case, this gives 1/((0.001^2+0.019^2)/2/0.01), which is about 55. The sanity check case gives 1/((0.01^2+0.01^2)/2/0.01), which is 100. Success! In fact, as the stuttery frametimes approach 0 and 0.020 respectively, the output approaches 50 exactly. For stuttering between 5ms and 15ms, which most people would consider “tolerable”, it gives 80, which is reasonable. The values being given here make intuitive sense and the idea of squaring the frametimes also makes sense; see the next paragraph. I want to emphasize that this number has real physical meaning and isn’t a statistics trick or bodge.

After poking around for a while, I learned that Method 4 is related to the Contraharmonic Mean, which, in physical terms for frametime data, relates to the area occupied by each frame’s latency if you try to chart the latency for each frame on a chart over time. You end up with a number that tells you the average latency you actually experience over time.

If that’s too complicated, the intuition remains: squaring the frametimes causes each frame to be weighted by the time that it lasts, and this is its “physical meaning” here. In other words, every moment in time gets “one vote”, not every frame, so frames that last longer and influence your perception have more of an effect. I don’t know if there’s a term for measuring framerates like this, but “framerate from normalized time-weighted frametimes”, or maybe just “time-weighted framerate”, probably makes more sense to normal people than “framerate from contraharmonic mean of frametimes”. It’s obviously closely related to framerate from RMS frametime, but the final step is different.

If you want to display Method 4 on a game HUD, something like “FPS (feels like):” would work. The most literal physical interpretation of this value is the reciprocal of the average lag since the frame has updated as a viewer experiences frame updates over time, give or take a multiple of 2. Or, put simpler, the reciprocal of the average latency of the things you see. If you had a second copy of the game with a perfectly stable framerate at this value, the average input latency over time would be the same (ignoring any hardware-side frame-buffering differences).

How does this relate to 1% lows? 1% lows tend to start out good but get worse the longer you run a benchmark, while “time-weighted framerate” reflects the total frametime variance faithfully across any measurement time, no matter how short or long. So with 1% lows you can only compare runs that are the same length, and you have to do more runs, but “time-weighted framerate” is a lot less picky, and you can compare runs with different lengths and you don’t have to do as many runs, as long as the general stress experienced by the game is the same.

What about “Animation Error”? (“Animation Error” is a concept introduced by Intel and popularized by journalists that run benchmarks, like Gamers Nexus.) Unfortunately, this method is just like Average FPS here: it can’t work with animation error. The idea of integrating a “badness over time” chart could be used to construct an FPS-like metric for animation error, though.

As a final aside: in some situations it may make sense to avoid measuring “time-weighted framerate” for spans longer than about a second or two. It may sometimes make more sense to measure it over one-second windows and then average those chunks together. It depends on what you’re trying to do.

http://wareya.wordpress.com/?p=600

Extensions

Earley Parsing Is Cheap in Principle and Practice: Motivation and Implementation

wareya Sep 27, 2025

The post describes how to write an Earley recognizer and parser. We will start by motivating the need for generalized parsing techniques and build a path to Earley’s algorithm. Then, we’ll describe the Earley recognizer, as a step-by-step algorithm. Finally, we’ll describe the modern extensions that allow you to build a real parser out of […]

Show full content

The final “Modern Earley” parser is cheap (O(n)) on sane grammars, including any grammar that LR or Recursive Descent parsers can handle, and remains cheap on real-world grammars that are only mildly ambiguous. Even on impractically insane grammars, the worst case complexity is O(n^3). Unlike GLR and some other mostly-efficient generalized parsing algorithms, the grammar-driven component of Earley’s complexity is linear, O(G * n….), instead of sometimes being a polynomial like O(G^2 * n….) (unless you go out of your way to add optimizations that require a higher upfront grammarwise cost). Industrial implementations of “Modern Earley” exist, like Marpa.

Historically motivating the need for Earley

Parsing is the process of taking a linear string of items and creating a data structure that describes its grammatical structure, according to a grammar. The string (4+5*2), for example, usually parses into a data structure that wraps parens around an addition, with a number to the left of the addition and a multiplication on the right of the addition.

For unambiguous grammars, the best techniques for parsing are recursive descent and shift-reduce/LR. A recursive descent parser starts at broad, full-text-level rules, and works its way inwards towards more granular rules by calling functions that correspond broadly to grammar rules. A recursive descent parser is likely to contain a function called parse_addition(), for example, which checks if there’s an addition wherever it’s being told to look, and returns the data structure for addition and how much of the input it maps to; and that parse_addition() function is likely to be called from within a parse_expression() function that does the same for generic expressions instead of just additions. A shift-reduce parser instead looks at the token stream and determines what rules it might be inside of, then builds up greater data structures as it goes, going from numbers to additions to expressions. In a way, a shift-reduce parser is like the “return path” from the bottom of a recursive descent parser, but because different fine-grained items may belong to multiple coarser items, it’s more complicated than that.

For unambiguous grammars, these are the only techniques you need. They’re close to optimal. But when your grammar is ambiguous, they kind of die. LR can’t handle true ambiguity at all. Recursive descent can only handle it efficiently by making arbitrary priority decisions and caching them, which turns your grammar from a “context-free grammar” into a “parse expression grammar”. This is usually fine, but there are cases where it’s not, because some locally-ambiguous rule nesting configurations might simply never get visited because of earlier priority decisions. And if recursive descent handles ambiguity inefficiently, with “backtracking” (which we’ll talk about later), you get exponential time-cost blowup.

Problems like these motivated the search for “generalized” parsing algorithms for context-free grammars.

Trivial generalized parsers for context-free-grammars are exponentially expensive and virtually unusable. Early attempts at fixing this exponential problem, like CYK (named after its creators), are better but still have catastrophically bad performance (CYK is cubic, O(n^3), on all inputs).

GLR (“generalized LR”) came later, but ended up having performance problems. Eventually the worst of its problems were fixed (“BRNGLR”, 2007), and the remaining problems only pop up in pathologically bad grammars, taking the form of the hidden quasiconstant grammar term of its complexity being quadratic with grammar size instead of linear. Or, at least, that’s what my poking around on the internet seems to imply is the case; and if this problem exists, it’s true that only pathologically bad grammars can produce it. (This is similar to, but not as bad as, how some regex implementations look fast but have exponential grammar length sensitivity. Also, people seem to think exponential grammar length sensitivity is acceptable for regexes, so quadratic grammar length sensitivity for CFGs is probably even more acceptable.)

GLL (“generalized LL”) came much later in 2010 on the back of the research that lead to BRNGLR, but has an unresolved performance problem with right recursion (reading ahead, the same problem Earley without Leo has; the GLL right recursion problem doesn’t seem to be addressed in academic literature, but it’s an O(n^2) complexity problem from unused preemptive “return stacks” piling up.)

Parsing With Derivatives was originally problematic, but got optimized; yet, even its optimized form still has problems in some cases that it should handle better (even in non-pathological situations). A successor to Parsing With Derivatives called Parsing With Zippers seems to fully fix PWD’s problems, but it’s not widely understood and hasn’t been widely applied in practice yet.

Earley parsing was first described in 1968 by someone named Earley and, in its modern form, lacks those problems. It originally had right recursive performance problems, but they were fixed by someone called Leo. Even with Leo’s optimizations, Earley parsing originally had the same pathological edge cases as GLR, but they were fixed in 2008 by one of the people who later went on to invent GLL. Think about that again: it started in 1968, and its modern form with both the right recursion and edge-case-performance fixes only came into the picture by 2008. It took 40 whole years for Earley parsing to finally get “finished”! At the same time, Earley’s algorithm didn’t come out of nowhere. It was partially motivated by the same tools as resulted in the internal structure and drawbacks of CYK. So, if you squint, it took even longer than 40 years.

Exploring the brute force approach first

The main reason that recursive descent doesn’t work for generalized parsing is the need for “backtracking”, i.e. running into a parse that doesn’t work, going back, and trying something else. For example, if you have a grammar rule like “statement ::= assignment | functioncall” and look for an assignment where there is none, you have to back out of the assignment parser, go back to where it started, and look for a function call instead. However, when these backtracking choices are nested inside of or abut each other, this will lead to insane (exponential) amounts of wasted work, leading to parsers that will simply not finish the job before the heat death of the universe.

What if you made the parser go down both choices at once, instead? Using CPU threads or parallelism or something? That works, actually. But there’s just So Much wasted work that it doesn’t solve the problem. Parallelism gives you a constant 8x or 64x reduction in cost, not a quadratic or exponential speedup. So in the grand scheme of things, you have to do as much work as backtracking makes you do. Trying to spawn 1000000 threads doesn’t work; the computer can only actually work on 8 or 64 or however many of them at a time.

A partial solution to this is to use memoization, or remembering which rules produce which outputs at which positions in the input. The common name for this is “Packrat” parsing, which is a relatively modern innovation, around 2004. Packrat parsing works on “parse expression grammars” (“PEG”).

PEG parsers unfortunately have the “problem” (intentionally) of some rule nesting configurations never being visited. As a trivial example that wouldn’t appear in any valid grammar, imagine “raw_function_call ::= expr call_args_list call_args”, where “call_args_list” is a rule that matches as many “call_args” as possible. By the time the “call_args_list” term finishes parsing, it will have consumed any “call_args” that could have been used to finish the rule. In a non-PEG parser, you would backtrack into call_args_list, undo one of the matches, then go back out and match the one you just undid against the literal call_args at the end of raw_function_call. But in a PEG parser, this is not possible and will not be done. It would be fine if this only happened in awful, stupid grammars like this, but unfortunately, it happens often in real-world grammars too. The “call_args_list” term could, for example, be something like “right_hand_extension_list”, being a list of any number of “call_args” and “array_indexes” and so on.

This particular example could be fixed by strengthening the PEG formulation with lookarounds or boolean predication, but there are examples that cannot be fixed in that way, and lookarounds/predication make the parser even more complicated and fragile, and the grammars harder to read.

I’ll add a side note here that Packrat parsers, once you add an extension to handle left recursion, are still useful. They allow recursive descent to parse highly left-recursive grammars, which is something you typically need to resort to shift-reduce parsers for. Shift-reduce requires a lot of precomputation, while Packrat does not. In my mind, for grammars where the PEG and CFG interpretations are the same, Packrat parsing is ideal. Unfortunately, that this is not the case for all grammars, and not even all real-world programming language grammars.

Let’s revisit parallelism. What if there was a way to make the different threads share work? This is what the GLL algorithm eventually figured how to do, emulating recursive descent on top of a “graph-structured stack” that caused the different threads to merge together and branch off in exactly the right places to save work. But let’s assume, thanks to our foreknowledge that GLL has unresolved problems, that we don’t know this is possible. How else could we do the work-sharing?

Vanilla Earley

All the way back in 1968, Earley stumbled onto the same intuition as Packrat parsing later stumbled onto, but went in a different direction with it. Instead of having a memo of each sub-parse for each rule at each position, Earley has what is similar to a memo of the current, local state of the parser at each position. When it runs into something like “statement ::= assignment | functioncall”, it makes “assignment” and “functioncall” parse threads, but does not give them the full context of all the items they are inside of, or the full context of everything that is inside of them. The parse threads are just the fact that “we are attempting to parse an assignment here at this position”. The former fact, not storing the full outer context, allows it to reuse the same work that pops up at the same position but in different contexts. The latter fact, not storing the entire parse at every position, allows it to avoid the priority problems of PEG. Earley needed to go further with this to be able to build a real parser, but fulfilling these constraints are what allowed the parsing algorithm to work and be efficient (in programmer terms, polynomial instead of exponential).

Now, we’ll describe every step that an Earley parser starts with during its recognition phase. This algorithm will have problems with empty rules, not handle right recursion efficiently, and lack some pointers that are needed for correctly building a parse tree from an ambiguous grammar, but we’ll fix those problems afterwards in the next section. First we’ll see the data structures being used, then we’ll see the process needed to fill them out.

The following describes the data structures being used.

Start with an empty data structure, a list of sets, where the list has one entry (set) for each input position. The positions in the outermost list are also called “columns”, and the outermost list as a whole is also called a “chart”. (Note that “list” here is the general sense and does not mean “linked list”.)
Each set is an insertion-ordered set that rejects adding items that already exist in it. In programmer terms, it can be implemented as a pair of a hashset and a vec, with the insertion operation being: only if the item we want to insert doesn’t exist in the hashset yet, add it to both the vec (at the end) and hashset (wherever).
In each set we will place “state items”, which have three values: one, a reference to a specific variant of a grammar rule; two, a “rule position” number pointing inside of that grammar rule (e.g. for “addition ::= num + num2”, the position can be before num, before +, before num2, or at the end); and three, a “start position” number saying at which input position the grammar rule originally started (i.e. which column we were at when we first produced the original version of this item that had the rule position at before the first item). Implicitly, a state item also has an “input position”, which is the column number of the state set that it is stored inside of, which corresponds to the location in the input that the “rule position” wants to match at.
For the sake of convenience, rules are limited to the expressiveness of BNF (“Backus–Naur form”), meaning that a rule can have one or more variants, but any variant can only be a list of zero or more unquantified and unqualified matches of other rules or literal text (“other rules” being “nonterminals”, and “literal text” being “terminals”). You can convert EBNF (BNF with quantifiers, but no lookaround or boolean qualifiers) into BNF in O(n) by converting quantified rules into new recursive rules.
State items may have “side pointers” that point at multiple other state items, and those other state items may be in other columns (and thus require an explicit input position instead of an implicit one). In vanilla Earley, efficient implementations only need “call” side-pointers, but later in this post we will need to add more kinds. From a programmer’s perspective, think of side pointers as being stored, per column, as a “HashMap<StateItem, Vec<Pair<StateItem, Integer>>>”. You do not need to implement them this way; your implementation just needs to be algorithmically equivalent (for example, you might decide to store them directly inside of StateItems instead of using a hashmap, or you might store their state set indexes instead of their values). (Note to people familiar with Earley already: the “origin set” is related to the idea of side pointers. They’re related abstractions.)

The following describes the process needed to fill out the Earley chart.

At the start, create a default “state item” from your root-level rule (e.g. “program” or “statement list”), and into the first set in the list. In fact, do this for every “variant” of that rule; or alternatively, wrap the user-facing root-level rule with your own hidden rootermost-rule that has only one “variant”, or require the user to provide a root-level rule with only one variant.
Assume that we are at column 0, state set position 0 (i.e. pointing at the first thing we added in step 1).
This is the start of a loop. As long as we have not gone past the end of the state set, handle the state item at our current position in the state set as follows:
If its rule position points at a raw string match (i.e. a “terminal”), do all of the following, else go to the next step (step 5). This phase is called “scan”. Check if our input position matches the “terminal”. If yes, create a copy of this state item, increase its rule position by 1, and add it to the state set of the next column (yes, its “start position” will remain at the value it used to have). Else, do nothing. Finally, in either case, increase our state set position by 1.
If its rule position points at another rule (i.e. a “nonterminal”), do all of the following, else go to the next step (step 6). This phase is called “prediction”, and corresponds to recursive descent function calls. Create a new state item for each variant of that rule, with rule position 0 and input position of current column index, and add them to the current column’s state set. Finally, after processing all the new items, increase our state set position by 1. Important performance note: during this phase, you can add a “call” side pointer from the child to the parent, to allow the next phase (“completion”) to be efficient. Note that the two sides of a “call” pointer are highly redundant, so you can find a compressed way to store them.
If its rule position points after the end, do all of the following, else skip to the next step (step 7). This phase is called “completion”, and corresponds to recursive descent function returns. For each item in the child’s start position column that matches child item (i.e. is at a rule position right before an instance of the child’s grammar rule), that matching item is a “parent” item; copy the parent item, increase its rule position by 1, and add that copy to the current state set. Finally, increase our state set position by 1. Important performance note: you can use a set of “call” side pointers built during prediction to avoid doing a linear scan for matching parent items.
If we have exhausted the current state set, set our state set position to 0 and increase our column index number by 1 (moving to the start of the next column). Either way, as long as we are not currently looking at an empty state set, go back to step 3. This is the end of a loop.
Termination check. If we ever run into an empty state set, we are done. If we don’t, either our input is infinite or we have made an implementation mistake, because the above algorithm should never produce infinitely-looping behavior. In particular, the use of insertion-ordered sets prevents us from revisiting state items, and the number of possible state items per column is the product of the grammar size and input size. Per chart, the number of possible state items is at worst O(G * n^2), where G is a linear function of the grammar size, and n is the input size. The maximum amount of work we can do is O(G * n^3), because we have at worst a linear amount of redundant, thrown-away insertions into our state sets per item. Don’t be scared! For sane grammars, even slightly ambiguous ones, this simplifies down to O(G * n) again!

If, at the end of this algorithm, we’re at the final column position (or depending on off-by-one behavior, one past the final column position), recognition has succeeded and our input could have been produced from our grammar. Otherwise, recognition has failed.

As a non-obvious note, the order in which we create and insert new state items causes it to be true that we never complete an item for whom all of its possible parents have not yet run prediction. If you were worried about this, you’re sharp, but you can relax; the academic work on Earley’s parsing algorithm was largely about proving this property true. This means that the order we’re doing work in is safe. Similarly, every state item at the start position of a left recursive “call stack” ends up being a single shared state item, and they each end up producing the same subsequent items, so they don’t have any complexity scaling issues.

For unambiguous grammars that contain no empty rules and no right recursion, this is all you need: walk the chart from right to left, probing for child items as needed. For ambiguous grammars, grammars that contain empty rules, or grammars that contain right recursion, we’re not done yet.

Modern extensions/fixes to Earley

For handling empty rules, see the “Empty Rules” section of Loup’s Earley tutorial, linked below. The technique described there is robust, minimal, and doesn’t require any additional abstractions, just some edge case checks. The simple version is “before going down into a rule variant, know whether it can match <empty>, and if it can, preemptively hallucinate the <empty> continuation version of it in the same column”. Producing a full parse tree for such hallucinated continuations requires being aware that they are hallucinated continuations and then also hallucinating their reduction pointers (see the below tree construction fix), but this is straightforward and safe; do it on demand during disambiguation. Loup also gives an algorithm for marking grammar rules as “potentially <empty>”. https://loup-vaillant.fr/tutorials/earley-parsing/empty-rules

For handling tree construction for ambiguous grammars, you need “reduction” and “predecessor” side pointers. These are described in Elizabeth Scott’s 2008 Earley SPPF (“Shared Packed Parse Forest”) construction paper. Whenever you create a rule-position-advanced version of a state item, create a “predecessor” side pointer from the advanced state item back to the one you’re advancing from. Whenever you perform a completion, create a “reduction” side pointer from completed parents to their children. Without these pointers, there are cases where you won’t know for sure which “child” state items in the Earley chart are actually associated with your parent rule state item. In the case of terminals, or unambiguous subsections of the grammar, you can skip this, but you should determine for yourself whether skipping it is worth it. Elizabeth Scott demonstrated that these additional side pointers do not change the complexity of the algorithm, and gave an algorithm for producing an SPPF using them. (As an aside, if your state sets are the right kind of data structure, you can avoid creating predecessor pointers, deriving them from the reduction pointers and state sets during disambiguation. But it’s a coin flip whether this is faster or slower than having predecessor pointers in the first place.)

For handling right recursion in linear (O(n)) instead of quadratic (O(n^2)) time, Leo came up with a solution that skips unnecessary stacks of completions in the general case. However, Leo intended to create a general overall Earley optimization, not just an optimization for right recursion, so it’s harder to understand than it needs to be. A simpler approach with the same optimization effect on right recursion is described below; once you understand this version, you can look up Leo’s original version and understand it more easily.

When completion happens at the end of a rule list, check if we’re completing into only one single rule (our “parent”), i.e. that there’s only a single call pointer instead of multiple. When this happens, and also at least when <empty> rules are not involved (this is an unresolved question to me, because <empty> at least doesn’t usually cause problems), it means that we have a deterministic return stack and can do the optimization. I have not convinced myself that this process is safe for non-end-of-rule-list predictions, so if our parent is not at the point of its own completion, skip the optimization in that case too.

If, as discovered in the previous paragraph, the optimization is safe, do the following. Check if our parent has a “tailret” pointer; if it does, create a “tailret” pointer from our current (child) state item pointing at item that the parent’s “tailret” pointer is pointing at. Else, create a “tailret” pointer from our child state item to its parent. Also form a completion of whatever we decided to point at. Do this instead of normal completion, not in conjunction. Also form an “taildown” side pointer from the top of the chain to the bottom (i.e. to the newest child state item), to allow us later to reconstruct the call stack correctly. This skips the stack of completions without preventing us from reconstructing it, turning all deterministic right-end return stacks, both recursive and non-recursive, from O(n) height into O(<constant>) height.

Warning: Be careful! Unlike the predecessor and reduction pointers, depending on how you set things up, there are likely to be edge case issues around start/item/rule position values when forming the “tailret”/”taildown” pointers (e.g. Do you “tailret” from the completed or pre-completed version of the child? It depends on the rest of your Earley implementation; in mine, I need to point away from the completed child’s predecessor! Do you “tailret” towards the leftwards or rightwards version of the parent? Left!). In some ways, this is a tailcall optimization for Earley items, hence the naming.

For this right-recursion optimization, when walking the chart from right to left, you will need to use the “taildown” pointers to hallucinate new state items to reconstruct the reduction pointers for the summarized part of the stack. Use them to go to the bottom of the call stack and hallucinate its parents (and their reduction pointers) from the bottom up (as though doing the vanilla Earley completion phase to them). Only do this for parts of the chart that you need to disambiguate; otherwise you may end up with O(n^2) in cases where you only need O(n). Generally speaking, most of the right-recursive return stacks we skipped earlier were going to result in state items that couldn’t be part of the final parse, and are not accessible at all when walking the chart from right to left, so as long as you defer fixing “taildown” reduction pointers until the right-to-left walk, you should be okay.

The one thing these fixes and extensions don’t do is fix all the edge cases in infinitely-recursive empty productions, like “expr ::= <empty> expr | integer”, but those kinds of rules are “useless” (genuinely) and don’t appear in real-world grammars (unless you’ve made a mistake when putting them into text). Even if you write them into a grammar accidentally, they result in non-degenerate output given the full version of Earley described here (and that output is still cheap to get).

Further work for real-world grammars

At this point, we have a full Earley implementation, but you might have noticed a problem. I talked a few times about walking the chart right to left. Most ambiguous grammars say that you have to disambiguate them from left to right. That’s a problem! But it’s solvable. You can work around this issue by creating reversed copies of the grammar and input stream and using them; now right-to-left on the chart means left-to-right on the original input, and the problem disappears. If recognition fails, run recognition again with unreversed copies to find the true location of the syntax error.

In fact, if you combine reversal with an additional (unrelated) two-phase context extraction trick for configuring an “invalidator” (again, unrelated to the reversal), you can accurately parse C, including typedefs (which is something general parsers usually fail at, or require extensive imperative “semantic action” machinery to address). I wrote about this in a post before this one, linked here: https://wareya.wordpress.com/2025/09/25/it-is-actually-surprising-that-earley-can-efficiently-parse-c-ambiguities-and-all/

If you’re curious, I tried the approach generating a full Elizabeth Scott SPPF for the Earley chart, where the SPPF itself is also right-to-left, and then reversing that SPPF. It seemed to work, but I was unable to prove its complexity bounds to myself, and it was slower and dramatically more complicated in code than “just” reversing the input and grammar. I consider it a dead end.

There’s room for further optimization

It’s possible to learn lessons from other parsing algorithms that give you additional tools that can be used to skip redundant information in the Earley chart. For example, learning from LR, you can use something similar to LR’s precomputed state transition tables to go all the way down the bottom-left edge of predictions. Part of the reason that such optimizations are safe is that the leftmost version of a state item, with the rule position set to zero, never needs to be visited by a parse tree generator (because of the start position and rule position in the state items); they only need to exist to produce new items and pointers for further processing, so in the case where that processing is going to be prediction, you can skip to the bottom of the chain! This doesn’t affect the complexity of the algorithm, but it does improve performance in practice by reducing the constant hidden “G” factor in “O(G * n)” or “O(G * n^3)”. If you do this, be careful with how you handle completions and “call” pointers, so that your handling is resilient against missing leftmost-rule-position state items (for example you may need additional precomputed tables). As another note, depending on how you do this, the precomputations can be O(G^2), giving you O(G * n^(1..3) + G^2), but you have more control and flexibility than you do with GLR, and the polynomial components don’t necessarily get multiplied together.

The applicability of these kinds of optimizations are another one of Earley’s strengths over its competitors, particularly GLL (which is the most similar competitor to Earley, both being ways of work-sharing on top of parallel recursive descent). It’s extremely non-obvious how a bottom-left edge optimization for GLL would work. This may, for example, be why Leo was able to find the right recursion optimizations for Earley but people haven’t found one for GLL yet. Distributing the work across both top-down-like and bottom-up-like tasks, dynamically, seems to decouple various hidden constraints in the structure of the algorithm and allow for some of the tasks to be optimized even further than is obvious at first glance.

As another thing that’s good for real-world performance, some (but not all) of the hashmaps and references or pointers can be replaced with simpler, smaller, more cache-friendly values, and string comparisons can be done upfront in the lexer. Or, you could skip over some of the state items for grammar rules that are marked as “nonproductive”/”invisible” (e.g. having been generated by EBNF->BNF conversion).

One final note is that despite the complexity-wise cheapness of the whole thing, the chart can get very big because of the G term, so if you have cheap and easy ways to preemptively invalidate state items ahead of time, it can help a lot to do so. For example, if you have a list of reserved names, it’s better to apply it in the lexer or recognizer than during disambiguation.

I haven’t discussed these tricks in detail because they’re not necessary. Production parsers will still perform fine without them, while still getting down to the ideal amount of computational complexity, which is what matters the most for Earley. In the real world, these further optimizations give a total performance improvement of something roughly like 5~10x, which can absolutely be worth it, but don’t cause the algorithm to be unusable without them; if they were more like 100x, then they’d be essential.

One final note about other work

Loup has an alternative approach to disambiguation given here: https://loup-vaillant.fr/tutorials/earley-parsing/parser

The technique described is fascinating. There are “edge” and “pred” pointers, which I think are similar in principle to Elizabeth Scott’s reduction and predecessor pointers. The first completed chart is visited over only items that contribute to the final completion and a second, reversed chart is created. The reversed chart can apparently be walked inputwise left-to-right, giving you the desired disambiguation order. For LR-Regular and almost-LR-Regular grammars, or for length-driven disambiguation rules, it is obvious that this works and gives you exactly what you want (controllable left-to-right disambiguation).

However, I’m not convinced that chart reversal is a safe way to prepare left-to-right ordered-preference disambiguation for all grammars, including psychotically ambiguous grammars with tons of mixed recursion and problematically-placed epsilons. It seems possible that some vital structural ambiguity nesting information might be lost or clobbered during the process, especially near recursion with epsilons. I’m not saying that it is definitely wrong; I am just not convinced that it is never wrong. Conversely, keeping things “kosher” but unsatisfying with the grammar and input reversal definitely gives you a safe approach even for these psychotic grammars.

So, I’m leaving Loup’s technique here as a very interesting addendum that should be given a proper academic treatment by someone more capable than myself. If it’s proven correct, maybe with a couple tweaks, you might be able to use it to replace full input-and-grammar reversal.

http://wareya.wordpress.com/?p=363

Extensions