The post describes how to write an Earley recognizer and parser. We will start by motivating the need for generalized parsing techniques and build a path to Earley’s algorithm. Then, we’ll describe the Earley recognizer, as a step-by-step algorithm. Finally, we’ll describe the modern extensions that allow you to build a real parser out of it. The modern extensions are described in enough detail to build a working parser on top of the recognizer, but not as step-by-step algorithms.
The final “Modern Earley” parser is cheap (O(n)) on sane grammars, including any grammar that LR or Recursive Descent parsers can handle, and remains cheap on real-world grammars that are only mildly ambiguous. Even on impractically insane grammars, the worst case complexity is O(n^3). Unlike GLR and some other mostly-efficient generalized parsing algorithms, the grammar-driven component of Earley’s complexity is linear, O(G * n….), instead of sometimes being a polynomial like O(G^2 * n….) (unless you go out of your way to add optimizations that require a higher upfront grammarwise cost). Industrial implementations of “Modern Earley” exist, like Marpa.
Historically motivating the need for Earley
Parsing is the process of taking a linear string of items and creating a data structure that describes its grammatical structure, according to a grammar. The string (4+5*2), for example, usually parses into a data structure that wraps parens around an addition, with a number to the left of the addition and a multiplication on the right of the addition.
For unambiguous grammars, the best techniques for parsing are recursive descent and shift-reduce/LR. A recursive descent parser starts at broad, full-text-level rules, and works its way inwards towards more granular rules by calling functions that correspond broadly to grammar rules. A recursive descent parser is likely to contain a function called parse_addition(), for example, which checks if there’s an addition wherever it’s being told to look, and returns the data structure for addition and how much of the input it maps to; and that parse_addition() function is likely to be called from within a parse_expression() function that does the same for generic expressions instead of just additions. A shift-reduce parser instead looks at the token stream and determines what rules it might be inside of, then builds up greater data structures as it goes, going from numbers to additions to expressions. In a way, a shift-reduce parser is like the “return path” from the bottom of a recursive descent parser, but because different fine-grained items may belong to multiple coarser items, it’s more complicated than that.
For unambiguous grammars, these are the only techniques you need. They’re close to optimal. But when your grammar is ambiguous, they kind of die. LR can’t handle true ambiguity at all. Recursive descent can only handle it efficiently by making arbitrary priority decisions and caching them, which turns your grammar from a “context-free grammar” into a “parse expression grammar”. This is usually fine, but there are cases where it’s not, because some locally-ambiguous rule nesting configurations might simply never get visited because of earlier priority decisions. And if recursive descent handles ambiguity inefficiently, with “backtracking” (which we’ll talk about later), you get exponential time-cost blowup.
Problems like these motivated the search for “generalized” parsing algorithms for context-free grammars.
Trivial generalized parsers for context-free-grammars are exponentially expensive and virtually unusable. Early attempts at fixing this exponential problem, like CYK (named after its creators), are better but still have catastrophically bad performance (CYK is cubic, O(n^3), on all inputs).
GLR (“generalized LR”) came later, but ended up having performance problems. Eventually the worst of its problems were fixed (“BRNGLR”, 2007), and the remaining problems only pop up in pathologically bad grammars, taking the form of the hidden quasiconstant grammar term of its complexity being quadratic with grammar size instead of linear. Or, at least, that’s what my poking around on the internet seems to imply is the case; and if this problem exists, it’s true that only pathologically bad grammars can produce it. (This is similar to, but not as bad as, how some regex implementations look fast but have exponential grammar length sensitivity. Also, people seem to think exponential grammar length sensitivity is acceptable for regexes, so quadratic grammar length sensitivity for CFGs is probably even more acceptable.)
GLL (“generalized LL”) came much later in 2010 on the back of the research that lead to BRNGLR, but has an unresolved performance problem with right recursion (reading ahead, the same problem Earley without Leo has; the GLL right recursion problem doesn’t seem to be addressed in academic literature, but it’s an O(n^2) complexity problem from unused preemptive “return stacks” piling up.)
Parsing With Derivatives was originally problematic, but got optimized; yet, even its optimized form still has problems in some cases that it should handle better (even in non-pathological situations). A successor to Parsing With Derivatives called Parsing With Zippers seems to fully fix PWD’s problems, but it’s not widely understood and hasn’t been widely applied in practice yet.
Earley parsing was first described in 1968 by someone named Earley and, in its modern form, lacks those problems. It originally had right recursive performance problems, but they were fixed by someone called Leo. Even with Leo’s optimizations, Earley parsing originally had the same pathological edge cases as GLR, but they were fixed in 2008 by one of the people who later went on to invent GLL. Think about that again: it started in 1968, and its modern form with both the right recursion and edge-case-performance fixes only came into the picture by 2008. It took 40 whole years for Earley parsing to finally get “finished”! At the same time, Earley’s algorithm didn’t come out of nowhere. It was partially motivated by the same tools as resulted in the internal structure and drawbacks of CYK. So, if you squint, it took even longer than 40 years.
Exploring the brute force approach first
The main reason that recursive descent doesn’t work for generalized parsing is the need for “backtracking”, i.e. running into a parse that doesn’t work, going back, and trying something else. For example, if you have a grammar rule like “statement ::= assignment | functioncall” and look for an assignment where there is none, you have to back out of the assignment parser, go back to where it started, and look for a function call instead. However, when these backtracking choices are nested inside of or abut each other, this will lead to insane (exponential) amounts of wasted work, leading to parsers that will simply not finish the job before the heat death of the universe.
What if you made the parser go down both choices at once, instead? Using CPU threads or parallelism or something? That works, actually. But there’s just So Much wasted work that it doesn’t solve the problem. Parallelism gives you a constant 8x or 64x reduction in cost, not a quadratic or exponential speedup. So in the grand scheme of things, you have to do as much work as backtracking makes you do. Trying to spawn 1000000 threads doesn’t work; the computer can only actually work on 8 or 64 or however many of them at a time.
A partial solution to this is to use memoization, or remembering which rules produce which outputs at which positions in the input. The common name for this is “Packrat” parsing, which is a relatively modern innovation, around 2004. Packrat parsing works on “parse expression grammars” (“PEG”).
PEG parsers unfortunately have the “problem” (intentionally) of some rule nesting configurations never being visited. As a trivial example that wouldn’t appear in any valid grammar, imagine “raw_function_call ::= expr call_args_list call_args”, where “call_args_list” is a rule that matches as many “call_args” as possible. By the time the “call_args_list” term finishes parsing, it will have consumed any “call_args” that could have been used to finish the rule. In a non-PEG parser, you would backtrack into call_args_list, undo one of the matches, then go back out and match the one you just undid against the literal call_args at the end of raw_function_call. But in a PEG parser, this is not possible and will not be done. It would be fine if this only happened in awful, stupid grammars like this, but unfortunately, it happens often in real-world grammars too. The “call_args_list” term could, for example, be something like “right_hand_extension_list”, being a list of any number of “call_args” and “array_indexes” and so on.
This particular example could be fixed by strengthening the PEG formulation with lookarounds or boolean predication, but there are examples that cannot be fixed in that way, and lookarounds/predication make the parser even more complicated and fragile, and the grammars harder to read.
I’ll add a side note here that Packrat parsers, once you add an extension to handle left recursion, are still useful. They allow recursive descent to parse highly left-recursive grammars, which is something you typically need to resort to shift-reduce parsers for. Shift-reduce requires a lot of precomputation, while Packrat does not. In my mind, for grammars where the PEG and CFG interpretations are the same, Packrat parsing is ideal. Unfortunately, that this is not the case for all grammars, and not even all real-world programming language grammars.
Let’s revisit parallelism. What if there was a way to make the different threads share work? This is what the GLL algorithm eventually figured how to do, emulating recursive descent on top of a “graph-structured stack” that caused the different threads to merge together and branch off in exactly the right places to save work. But let’s assume, thanks to our foreknowledge that GLL has unresolved problems, that we don’t know this is possible. How else could we do the work-sharing?
Vanilla Earley
All the way back in 1968, Earley stumbled onto the same intuition as Packrat parsing later stumbled onto, but went in a different direction with it. Instead of having a memo of each sub-parse for each rule at each position, Earley has what is similar to a memo of the current, local state of the parser at each position. When it runs into something like “statement ::= assignment | functioncall”, it makes “assignment” and “functioncall” parse threads, but does not give them the full context of all the items they are inside of, or the full context of everything that is inside of them. The parse threads are just the fact that “we are attempting to parse an assignment here at this position”. The former fact, not storing the full outer context, allows it to reuse the same work that pops up at the same position but in different contexts. The latter fact, not storing the entire parse at every position, allows it to avoid the priority problems of PEG. Earley needed to go further with this to be able to build a real parser, but fulfilling these constraints are what allowed the parsing algorithm to work and be efficient (in programmer terms, polynomial instead of exponential).
Now, we’ll describe every step that an Earley parser starts with during its recognition phase. This algorithm will have problems with empty rules, not handle right recursion efficiently, and lack some pointers that are needed for correctly building a parse tree from an ambiguous grammar, but we’ll fix those problems afterwards in the next section. First we’ll see the data structures being used, then we’ll see the process needed to fill them out.
The following describes the data structures being used.
- Start with an empty data structure, a list of sets, where the list has one entry (set) for each input position. The positions in the outermost list are also called “columns”, and the outermost list as a whole is also called a “chart”. (Note that “list” here is the general sense and does not mean “linked list”.)
- Each set is an insertion-ordered set that rejects adding items that already exist in it. In programmer terms, it can be implemented as a pair of a hashset and a vec, with the insertion operation being: only if the item we want to insert doesn’t exist in the hashset yet, add it to both the vec (at the end) and hashset (wherever).
- In each set we will place “state items”, which have three values: one, a reference to a specific variant of a grammar rule; two, a “rule position” number pointing inside of that grammar rule (e.g. for “addition ::= num + num2”, the position can be before num, before +, before num2, or at the end); and three, a “start position” number saying at which input position the grammar rule originally started (i.e. which column we were at when we first produced the original version of this item that had the rule position at before the first item). Implicitly, a state item also has an “input position”, which is the column number of the state set that it is stored inside of, which corresponds to the location in the input that the “rule position” wants to match at.
- For the sake of convenience, rules are limited to the expressiveness of BNF (“Backus–Naur form”), meaning that a rule can have one or more variants, but any variant can only be a list of zero or more unquantified and unqualified matches of other rules or literal text (“other rules” being “nonterminals”, and “literal text” being “terminals”). You can convert EBNF (BNF with quantifiers, but no lookaround or boolean qualifiers) into BNF in O(n) by converting quantified rules into new recursive rules.
- State items may have “side pointers” that point at multiple other state items, and those other state items may be in other columns (and thus require an explicit input position instead of an implicit one). In vanilla Earley, efficient implementations only need “call” side-pointers, but later in this post we will need to add more kinds. From a programmer’s perspective, think of side pointers as being stored, per column, as a “HashMap<StateItem, Vec<Pair<StateItem, Integer>>>”. You do not need to implement them this way; your implementation just needs to be algorithmically equivalent (for example, you might decide to store them directly inside of StateItems instead of using a hashmap, or you might store their state set indexes instead of their values). (Note to people familiar with Earley already: the “origin set” is related to the idea of side pointers. They’re related abstractions.)
The following describes the process needed to fill out the Earley chart.
- At the start, create a default “state item” from your root-level rule (e.g. “program” or “statement list”), and into the first set in the list. In fact, do this for every “variant” of that rule; or alternatively, wrap the user-facing root-level rule with your own hidden rootermost-rule that has only one “variant”, or require the user to provide a root-level rule with only one variant.
- Assume that we are at column 0, state set position 0 (i.e. pointing at the first thing we added in step 1).
- This is the start of a loop. As long as we have not gone past the end of the state set, handle the state item at our current position in the state set as follows:
- If its rule position points at a raw string match (i.e. a “terminal”), do all of the following, else go to the next step (step 5). This phase is called “scan”. Check if our input position matches the “terminal”. If yes, create a copy of this state item, increase its rule position by 1, and add it to the state set of the next column (yes, its “start position” will remain at the value it used to have). Else, do nothing. Finally, in either case, increase our state set position by 1.
- If its rule position points at another rule (i.e. a “nonterminal”), do all of the following, else go to the next step (step 6). This phase is called “prediction”, and corresponds to recursive descent function calls. Create a new state item for each variant of that rule, with rule position 0 and input position of current column index, and add them to the current column’s state set. Finally, after processing all the new items, increase our state set position by 1. Important performance note: during this phase, you can add a “call” side pointer from the child to the parent, to allow the next phase (“completion”) to be efficient. Note that the two sides of a “call” pointer are highly redundant, so you can find a compressed way to store them.
- If its rule position points after the end, do all of the following, else skip to the next step (step 7). This phase is called “completion”, and corresponds to recursive descent function returns. For each item in the child’s start position column that matches child item (i.e. is at a rule position right before an instance of the child’s grammar rule), that matching item is a “parent” item; copy the parent item, increase its rule position by 1, and add that copy to the current state set. Finally, increase our state set position by 1. Important performance note: you can use a set of “call” side pointers built during prediction to avoid doing a linear scan for matching parent items.
- If we have exhausted the current state set, set our state set position to 0 and increase our column index number by 1 (moving to the start of the next column). Either way, as long as we are not currently looking at an empty state set, go back to step 3. This is the end of a loop.
- Termination check. If we ever run into an empty state set, we are done. If we don’t, either our input is infinite or we have made an implementation mistake, because the above algorithm should never produce infinitely-looping behavior. In particular, the use of insertion-ordered sets prevents us from revisiting state items, and the number of possible state items per column is the product of the grammar size and input size. Per chart, the number of possible state items is at worst O(G * n^2), where G is a linear function of the grammar size, and n is the input size. The maximum amount of work we can do is O(G * n^3), because we have at worst a linear amount of redundant, thrown-away insertions into our state sets per item. Don’t be scared! For sane grammars, even slightly ambiguous ones, this simplifies down to O(G * n) again!
If, at the end of this algorithm, we’re at the final column position (or depending on off-by-one behavior, one past the final column position), recognition has succeeded and our input could have been produced from our grammar. Otherwise, recognition has failed.
As a non-obvious note, the order in which we create and insert new state items causes it to be true that we never complete an item for whom all of its possible parents have not yet run prediction. If you were worried about this, you’re sharp, but you can relax; the academic work on Earley’s parsing algorithm was largely about proving this property true. This means that the order we’re doing work in is safe. Similarly, every state item at the start position of a left recursive “call stack” ends up being a single shared state item, and they each end up producing the same subsequent items, so they don’t have any complexity scaling issues.
For unambiguous grammars that contain no empty rules and no right recursion, this is all you need: walk the chart from right to left, probing for child items as needed. For ambiguous grammars, grammars that contain empty rules, or grammars that contain right recursion, we’re not done yet.
Modern extensions/fixes to Earley
For handling empty rules, see the “Empty Rules” section of Loup’s Earley tutorial, linked below. The technique described there is robust, minimal, and doesn’t require any additional abstractions, just some edge case checks. The simple version is “before going down into a rule variant, know whether it can match <empty>, and if it can, preemptively hallucinate the <empty> continuation version of it in the same column”. Producing a full parse tree for such hallucinated continuations requires being aware that they are hallucinated continuations and then also hallucinating their reduction pointers (see the below tree construction fix), but this is straightforward and safe; do it on demand during disambiguation. Loup also gives an algorithm for marking grammar rules as “potentially <empty>”. https://loup-vaillant.fr/tutorials/earley-parsing/empty-rules
For handling tree construction for ambiguous grammars, you need “reduction” and “predecessor” side pointers. These are described in Elizabeth Scott’s 2008 Earley SPPF (“Shared Packed Parse Forest”) construction paper. Whenever you create a rule-position-advanced version of a state item, create a “predecessor” side pointer from the advanced state item back to the one you’re advancing from. Whenever you perform a completion, create a “reduction” side pointer from completed parents to their children. Without these pointers, there are cases where you won’t know for sure which “child” state items in the Earley chart are actually associated with your parent rule state item. In the case of terminals, or unambiguous subsections of the grammar, you can skip this, but you should determine for yourself whether skipping it is worth it. Elizabeth Scott demonstrated that these additional side pointers do not change the complexity of the algorithm, and gave an algorithm for producing an SPPF using them. (As an aside, if your state sets are the right kind of data structure, you can avoid creating predecessor pointers, deriving them from the reduction pointers and state sets during disambiguation. But it’s a coin flip whether this is faster or slower than having predecessor pointers in the first place.)
For handling right recursion in linear (O(n)) instead of quadratic (O(n^2)) time, Leo came up with a solution that skips unnecessary stacks of completions in the general case. However, Leo intended to create a general overall Earley optimization, not just an optimization for right recursion, so it’s harder to understand than it needs to be. A simpler approach with the same optimization effect on right recursion is described below; once you understand this version, you can look up Leo’s original version and understand it more easily.
When completion happens at the end of a rule list, check if we’re completing into only one single rule (our “parent”), i.e. that there’s only a single call pointer instead of multiple. When this happens, and also at least when <empty> rules are not involved (this is an unresolved question to me, because <empty> at least doesn’t usually cause problems), it means that we have a deterministic return stack and can do the optimization. I have not convinced myself that this process is safe for non-end-of-rule-list predictions, so if our parent is not at the point of its own completion, skip the optimization in that case too.
If, as discovered in the previous paragraph, the optimization is safe, do the following. Check if our parent has a “tailret” pointer; if it does, create a “tailret” pointer from our current (child) state item pointing at item that the parent’s “tailret” pointer is pointing at. Else, create a “tailret” pointer from our child state item to its parent. Also form a completion of whatever we decided to point at. Do this instead of normal completion, not in conjunction. Also form an “taildown” side pointer from the top of the chain to the bottom (i.e. to the newest child state item), to allow us later to reconstruct the call stack correctly. This skips the stack of completions without preventing us from reconstructing it, turning all deterministic right-end return stacks, both recursive and non-recursive, from O(n) height into O(<constant>) height.
Warning: Be careful! Unlike the predecessor and reduction pointers, depending on how you set things up, there are likely to be edge case issues around start/item/rule position values when forming the “tailret”/”taildown” pointers (e.g. Do you “tailret” from the completed or pre-completed version of the child? It depends on the rest of your Earley implementation; in mine, I need to point away from the completed child’s predecessor! Do you “tailret” towards the leftwards or rightwards version of the parent? Left!). In some ways, this is a tailcall optimization for Earley items, hence the naming.
For this right-recursion optimization, when walking the chart from right to left, you will need to use the “taildown” pointers to hallucinate new state items to reconstruct the reduction pointers for the summarized part of the stack. Use them to go to the bottom of the call stack and hallucinate its parents (and their reduction pointers) from the bottom up (as though doing the vanilla Earley completion phase to them). Only do this for parts of the chart that you need to disambiguate; otherwise you may end up with O(n^2) in cases where you only need O(n). Generally speaking, most of the right-recursive return stacks we skipped earlier were going to result in state items that couldn’t be part of the final parse, and are not accessible at all when walking the chart from right to left, so as long as you defer fixing “taildown” reduction pointers until the right-to-left walk, you should be okay.
The one thing these fixes and extensions don’t do is fix all the edge cases in infinitely-recursive empty productions, like “expr ::= <empty> expr | integer”, but those kinds of rules are “useless” (genuinely) and don’t appear in real-world grammars (unless you’ve made a mistake when putting them into text). Even if you write them into a grammar accidentally, they result in non-degenerate output given the full version of Earley described here (and that output is still cheap to get).
Further work for real-world grammars
At this point, we have a full Earley implementation, but you might have noticed a problem. I talked a few times about walking the chart right to left. Most ambiguous grammars say that you have to disambiguate them from left to right. That’s a problem! But it’s solvable. You can work around this issue by creating reversed copies of the grammar and input stream and using them; now right-to-left on the chart means left-to-right on the original input, and the problem disappears. If recognition fails, run recognition again with unreversed copies to find the true location of the syntax error.
In fact, if you combine reversal with an additional (unrelated) two-phase context extraction trick for configuring an “invalidator” (again, unrelated to the reversal), you can accurately parse C, including typedefs (which is something general parsers usually fail at, or require extensive imperative “semantic action” machinery to address). I wrote about this in a post before this one, linked here: https://wareya.wordpress.com/2025/09/25/it-is-actually-surprising-that-earley-can-efficiently-parse-c-ambiguities-and-all/
If you’re curious, I tried the approach generating a full Elizabeth Scott SPPF for the Earley chart, where the SPPF itself is also right-to-left, and then reversing that SPPF. It seemed to work, but I was unable to prove its complexity bounds to myself, and it was slower and dramatically more complicated in code than “just” reversing the input and grammar. I consider it a dead end.
There’s room for further optimization
It’s possible to learn lessons from other parsing algorithms that give you additional tools that can be used to skip redundant information in the Earley chart. For example, learning from LR, you can use something similar to LR’s precomputed state transition tables to go all the way down the bottom-left edge of predictions. Part of the reason that such optimizations are safe is that the leftmost version of a state item, with the rule position set to zero, never needs to be visited by a parse tree generator (because of the start position and rule position in the state items); they only need to exist to produce new items and pointers for further processing, so in the case where that processing is going to be prediction, you can skip to the bottom of the chain! This doesn’t affect the complexity of the algorithm, but it does improve performance in practice by reducing the constant hidden “G” factor in “O(G * n)” or “O(G * n^3)”. If you do this, be careful with how you handle completions and “call” pointers, so that your handling is resilient against missing leftmost-rule-position state items (for example you may need additional precomputed tables). As another note, depending on how you do this, the precomputations can be O(G^2), giving you O(G * n^(1..3) + G^2), but you have more control and flexibility than you do with GLR, and the polynomial components don’t necessarily get multiplied together.
The applicability of these kinds of optimizations are another one of Earley’s strengths over its competitors, particularly GLL (which is the most similar competitor to Earley, both being ways of work-sharing on top of parallel recursive descent). It’s extremely non-obvious how a bottom-left edge optimization for GLL would work. This may, for example, be why Leo was able to find the right recursion optimizations for Earley but people haven’t found one for GLL yet. Distributing the work across both top-down-like and bottom-up-like tasks, dynamically, seems to decouple various hidden constraints in the structure of the algorithm and allow for some of the tasks to be optimized even further than is obvious at first glance.
As another thing that’s good for real-world performance, some (but not all) of the hashmaps and references or pointers can be replaced with simpler, smaller, more cache-friendly values, and string comparisons can be done upfront in the lexer. Or, you could skip over some of the state items for grammar rules that are marked as “nonproductive”/”invisible” (e.g. having been generated by EBNF->BNF conversion).
One final note is that despite the complexity-wise cheapness of the whole thing, the chart can get very big because of the G term, so if you have cheap and easy ways to preemptively invalidate state items ahead of time, it can help a lot to do so. For example, if you have a list of reserved names, it’s better to apply it in the lexer or recognizer than during disambiguation.
I haven’t discussed these tricks in detail because they’re not necessary. Production parsers will still perform fine without them, while still getting down to the ideal amount of computational complexity, which is what matters the most for Earley. In the real world, these further optimizations give a total performance improvement of something roughly like 5~10x, which can absolutely be worth it, but don’t cause the algorithm to be unusable without them; if they were more like 100x, then they’d be essential.
One final note about other work
Loup has an alternative approach to disambiguation given here: https://loup-vaillant.fr/tutorials/earley-parsing/parser
The technique described is fascinating. There are “edge” and “pred” pointers, which I think are similar in principle to Elizabeth Scott’s reduction and predecessor pointers. The first completed chart is visited over only items that contribute to the final completion and a second, reversed chart is created. The reversed chart can apparently be walked inputwise left-to-right, giving you the desired disambiguation order. For LR-Regular and almost-LR-Regular grammars, or for length-driven disambiguation rules, it is obvious that this works and gives you exactly what you want (controllable left-to-right disambiguation).
However, I’m not convinced that chart reversal is a safe way to prepare left-to-right ordered-preference disambiguation for all grammars, including psychotically ambiguous grammars with tons of mixed recursion and problematically-placed epsilons. It seems possible that some vital structural ambiguity nesting information might be lost or clobbered during the process, especially near recursion with epsilons. I’m not saying that it is definitely wrong; I am just not convinced that it is never wrong. Conversely, keeping things “kosher” but unsatisfying with the grammar and input reversal definitely gives you a safe approach even for these psychotic grammars.
So, I’m leaving Loup’s technique here as a very interesting addendum that should be given a proper academic treatment by someone more capable than myself. If it’s proven correct, maybe with a couple tweaks, you might be able to use it to replace full input-and-grammar reversal.