PEG Roadmap
Nearly three months ago I released a parser generator which generates parsers in JavaScript from a grammar given as a PEG (parsing expression grammar).
There hasn't been a significant new release since that time, but there has been a lot of progress on features that are not quite ready for public consumption yet.
Here I explain my plans for the project over the next few months, with a little parser theory along the way.
If you just care about the roadmap, skip ahead to the end.
[Update 2010-03-30: see also the 0.0.5 release.]
Streaming
One of the goals of the project is to support streaming parsers, i.e. parsers which generate a stream of parse tree nodes or events from a stream of input chunks.
This is useful when dealing with large input texts, and once streaming is supported it becomes possible to clone the state of a parser and resume it on changed input.
This is useful for text editors and other real-time parser applications in which you want to re-parse changed text without starting over.
The initial release accepted input as a single string and produced a complete parse tree as output.
The returned JavaScript value mirrors the structure of the parse tree, with each node represented by an array whose last element is an array of its child nodes.
The following raw parse tree fragment contains three nodes:
[10,1,[[132,1,[[133,1,[]]]]]]
This format is convenient to process, uses little memory, and is relatively fast to generate, but it is not suitable for streaming output.
In streaming output, the start of a node may be returned to the caller before all of its child nodes are known.
If the parse tree nodes that we return contain their own children as an array, then we can't emit any node until we know all its children, and since every node is a descendant of the root, that means we can't stream the output at all until the entire parse finishes.
Instead, streaming output can be accommodated by emitting a start and an end "event" for each node.
In this format, we emit a "start event" for the start rule as soon as parsing begins, emit start and end events for each child node as we encounter it, and finally emit the "end event" for the start rule when its match completes.
One consequence of this change is that a streaming parser might emit many events, and only then fail and report an error.
In contrast, in the all-at-once model, the parser either succeeds on the entire input or it emits only an error, never a partial, incorrect, or incomplete parse tree.
While it is possible for the streaming parser to hit a failure point after emitting some parse nodes, we do not allow it to "partially fail".
If a parser emits a node, it is committed to either parse the entire input consistent with that node or to fail the parse entirely.
So the first change to accommodate streaming was to change the format of the generated parsers' return values.
The new format for the above parse tree fragment looks like this:
[10,132,133,-2,1,-2,1,-2,1]
In this format, a node is first opened by sending its identifier, then any children can follow until it is closed later by a -2 followed by the length of the match.
Unlike the first format, this can easily be broken up into chunks, and returned piecewise as the parser consumes more and more of the input.
Familiar streaming APIs like SAX operate on a similar principle.
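To make the relationship between the two formats concrete, here is a minimal sketch of a conversion back to the old format (illustrative only, not the project's actual conversion function), assuming, as in the fragments above, that a node in the old format is [identifier, length, children] and that each event in the new format is either a node identifier or a -2 followed by the length of the match:
function eventsToTree(events){
  var root=[0,0,[]],stack=[root],node,i;      // sentinel root to collect the top-level node
  for(i=0;i<events.length;i++){
    if(events[i]===-2){
      node=stack.pop();                       // close the most recently opened node
      node[1]=events[++i];                    // the -2 is followed by the match length
    }else{
      node=[events[i],0,[]];                  // open a new node: [identifier, length, children]
      stack[stack.length-1][2].push(node);
      stack.push(node);
    }
  }
  return root[2][0];
}
For example, eventsToTree([10,132,133,-2,1,-2,1,-2,1]) gives back [10,1,[[132,1,[[133,1,[]]]]]], the nested fragment shown earlier.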
Codegen changes
The other major change necessary for streaming is in the means of control flow in the generated parsers.
The initial release used the "v5 codegen", in which each "interesting" rule in the grammar is compiled to a function which parses that rule, possibly by calling functions corresponding to other parse rules.
(As an optimization, uninteresting "lexical" rules like whitespace could optionally be folded into the rules that use them, which speeds up the parser and also keeps the parse tree small.)
Memoization of parse rule results is used to store the result of these calls at particular positions in the input.
This memoized recursive descent is the traditional "packrat parser" approach to generating parsers from PEGs.
As an example, here is the parser for a simple arithmetic expression language, as generated by the v5 codegen, with the actual parsing code removed to better show the structure:
function ParseExpr(str){
var tbl=[],pos=0,l=str.length+1;while(l--)tbl.push([]);l=str.length;
function Expr(a){/* ... */}
function Add(a){/* ... */}
function Mult(a){/* ... */}
function Num(a){/* ... */}
function S(a){/* ... */}
var _Expr=Add
var _Add=q(Mult,r(0,0,q(r(0,1,S),sl_0,r(0,1,S),Mult)))
var _Mult=q(Num,r(0,0,q(r(0,1,S),sl_1,r(0,1,S),Num)))
var _Num=r(1,0,cs_0)
var _S=r(1,0,sl_2)
function cs_0(){/* ... */}
function sl_0(){/* ... */}
function sl_1(){/* ... */}
function sl_2(){/* ... */}
function fin(c,p,x,r,a){/* ... */}
function e(){/* ... */}
function o(){/* ... */}
function q(){/* ... */}
function r(m,n,f){/* ... */}
function n(f){/* ... */}
function p(f){/* ... */}
function t(a,n){/* ... */}
function g(p){/* ... */}
function b(p,n){/* ... */}
return Expr([])&&pos==l?[true,b(0,0)]:[false,pos,tbl]}
Here is the corresponding grammar:
Expr ← Add
Add ← Mult ( S? "+" S? Mult )*
Mult ← Num ( S? "*" S? Num )*
Num ← [0-9]+
S ← " "+
The entire parser, ParseExpr, is called with a string as input, and includes functions for each rule within the grammar.
It calls Expr, which is the start rule in the grammar, and returns the result.
PEG expressions like ordered choice ("o") and repetition ("r") are implemented as combinator functions that operate on other parser functions.
These combinator functions are called only once, when the parser is set up, and they return functions which are then used to perform the actual parse.
Lexical tokens (like "*" and "+" in this grammar, or "var" and "function" in JavaScript) are implemented as functions that recognize just that token and can be optimized, using native regexes or any other means to identify matches as quickly as possible.
The functions corresponding to named rules can call each other recursively, and the results are memoized in the result table.
This approach is fast, as the few extra function calls have negligible overhead in modern JavaScript engines, and it's easy to implement and yields a parser that mirrors the structure of the grammar itself.
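To make the combinator style concrete, here is a hedged, simplified sketch of what such helpers might look like; the names lit, o, q, and r and their bodies here are illustrative only, not the actual generated helpers, and the per-position memoization of rule results is omitted:
var s='',pos=0;                               // shared input string and position
function lit(ch){return function(){          // illustrative single-character token matcher
  if(s.charAt(pos)===ch){pos++;return true}
  return false}}
function o(){var alts=arguments;return function(){     // ordered choice: first alternative that matches wins
  var start=pos;
  for(var i=0;i<alts.length;i++){if(alts[i]())return true;pos=start}
  return false}}
function q(){var parts=arguments;return function(){    // sequence: every part must match in order
  var start=pos;
  for(var i=0;i<parts.length;i++){if(!parts[i]()){pos=start;return false}}
  return true}}
function r(min,max,f){return function(){                // repetition: min..max matches, max 0 meaning unbounded
  var start=pos,count=0;
  while((!max||count<max)&&f())count++;
  if(count<min){pos=start;return false}
  return true}}
In this style r(1,0,lit('a')) corresponds to "a"+ in the grammar notation; with s='aaab' and pos=0, calling it returns true and leaves pos at 3.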
Unfortunately this approach is also incompatible with streaming without adding considerable complexity.
A streaming parser must be able to handle reaching the end of the available chunk of input at any time.
A non-streaming parser reaches the end of its input only once, and at that point it simply signals success or failure of the entire parse; a streaming parser instead needs to suspend itself when it runs out of input and wait for more.
When further input is available, the streaming parser must continue parsing exactly as if the entire input had been present initially.
If the parser is a recursive descent parser, then the only way to store and resume the state of the parser is to tear down the JavaScript call stack and restore its state when the parser resumes.
There are ways we could do this, e.g. by returning a distinguished kind of failure value that would bubble up through each function on the stack, each of which would then add itself and its local state to an array.
This array would then be returned to the caller as the parser's saved state and used to restore the stack (by calling each function again with a special argument to restore its internal state).
In fact, this process could be automatically performed by a compiler, which would give us a transformation that can turn any JavaScript program into one which can be suspended at any time and then resumed.
This would amount to support for first-class continuations (capturing the current continuation), as a transformation from a program that uses call/cc into one that uses only features JavaScript supports, at the cost of some generated boilerplate code in each function.
However, writing the parser in a recursive style and then applying such a general-purpose transformation to it is a highly indirect solution to the problem, and it makes any further (parser-specific) optimization of the transformed result quite difficult.
(Though we do not take this approach here, the progression from a straightforward recursive-descent parser through to the most optimized state machine designs can all be seen as optimizations of the original recursive descent parser which could be automatically performed by a sufficiently advanced general-purpose optimizing compiler or specializer.)
A more direct approach to the problem is to generate a parser that uses no recursive function calls, but maintains its own call stack, pushing and popping states and local variables as the parser enters, tests, and exits the expressions appearing in the grammar.
Whereas the current-continuation support by means of program transformation reifies the call stack only as necessary and uses it only to reconstruct the actual (JavaScript) call stack when resuming, this approach reifies the call stack continuously, and refers to it to determine what to do next, rather than letting the interpreter determine the control flow by making function calls and returning from them.
We are effectively pulling the call stack out of the JavaScript interpreter and into the program, which means we have full control over it and can even perform radical optimizations on it such as tail-call optimization (which happens to correspond quite directly to some parser optimizations connected to streaming).
With the explicit call stack, then, we have transformed the grammar not into a recursive set of functions but into an explicit state machine which maintains its own stack.
This is what most LL and LR parsers look like as well: a set of states, rules for moving between them, and a stack which controls the large-scale shape of the computation and provides the power to parse the recursive structures of non-regular languages.
A packrat parser in this style shares some similarities with an LL parser (it contains a top-down set of rules similar to the call stack of a recursive descent parser) but has a somewhat more expressive set of state transitions and behaviors to support the PEG semantics.
The explicit call stack makes implementing a parser which consumes its input in chunks (which is half of the problem of streaming) trivial.
Since there is no internal JavaScript call stack inside the parser function, to save the state of the parser we need only return the explicit stack to the caller, who can then provide it in the next call to resume parsing when additional input is available.
Similarly, to enable partial re-parsing of input which is being continuously changed, the caller can store the object containing the parser's saved state and provide it, with changed input, in subsequent calls to the parser to re-parse from that point as many times as desired without reparsing any earlier input.
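As a hedged, toy illustration of this idea (far simpler than the real v6 machinery), here is a resumable recognizer for the illustrative mini-grammar Expr ← "(" Expr ")" / [0-9]; its entire state is a few plain variables plus an explicit stack of pending states, it suspends whenever it runs out of input, and it resumes when the caller supplies another chunk:
function makeParser(){
  var stack=['expr'],s='',pos=0;              // the whole parser state: pending states, input, position
  return function(chunk,eof){
    s+=chunk;
    while(stack.length){
      if(pos>=s.length){
        if(eof)return{done:true,ok:false};    // input ended in the middle of a rule
        return{done:false};                   // suspended: call again with more input
      }
      var c=s.charAt(pos),state=stack.pop();
      if(state==='expr'){
        if(c==='('){pos++;stack.push('close','expr')}   // Expr ← "(" Expr ")"
        else if(c>='0'&&c<='9')pos++;                   //      / [0-9]
        else return{done:true,ok:false};
      }else{                                  // state==='close': expecting ")"
        if(c===')')pos++;
        else return{done:true,ok:false};
      }
    }
    if(!eof)return{done:false};               // matched, but wait for the end of the input
    return{done:true,ok:pos===s.length};      // succeed only if the entire input was consumed
  };
}
var parse=makeParser();
parse('((',false);                            // {done:false} -- suspended inside the nesting
parse('7))',false);                           // {done:false} -- matched so far, waiting for end of input
parse('',true);                               // {done:true, ok:true}
Because the suspended state is nothing but plain data, it could equally well be handed back to the caller, cloned, or resumed several times on different continuations of the same prefix, which is what makes the re-parsing use case described earlier possible.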
The new code generator (the "v6 codegen") is the sixth complete rewrite of the code generator, and uses the explicit call stack approach outlined above.
The v6 codegen, on the grammar above (and without any optimizations enabled), produces the following parser:
p_arith_streaming_v6_Expr.names=[,'Expr','Add','Mult','Num','S'];
function p_arith_streaming_v6_Expr(out){var eof=false,s='',l=0,S=24576,T,M,F,R,tbl=[],x,pos=0,offset=0,buf=[],bufs=[],states=[],posns=[],chr;
T=[,28672,32768,77824,122880,139264,155648,8364,12460,40964,47106,49152,20661,43,44,65536,20661,43,12460,16556,86020,92162,94208,20661,43,45,110592,20661,43,16556,126976,46,135168,46,143360,47,151552,47,4268,163840,48]
M=[,41,41,41,41,41,41,41,36864,41,40964,60100,41,41,61440,73728,41,41,41,81920,41,86020,105156,41,41,106496,118784,41,41,41,41,131072,41,135168,41,147456,41,151552,159748,41,42]
F=[,42,42,42,42,42,42,42,42,42,41,42,53248,42,42,42,69632,42,42,42,42,41,42,98304,42,42,42,114688,42,42,42,42,42,41,42,42,42,41,42,42,41]
return function(m,x){switch(m){
case 'chunk':s+=x;l=s.length;while(tbl.length>>12)}
states.push(S)
S=T[S>>>12]}
if(S<49&&S>42){chr=s.charCodeAt(pos);switch(S){
case 43:R=true;pos--;break
case 44:R=chr<43?0:chr<44?1:0;break
case 45:R=chr<42?0:chr<43?1:0;break
case 46:R=chr<48?0:chr<58?1:0;break
case 47:R=chr<32?0:chr<33?1:0;break
case 48:R=chr<0?0:1;break}
if(R)pos++
else{
if(isNaN(chr)){if(eof)R=false;else{emit();R=undefined;S=states.pop();out('ready');return}}
else{if(chr>=0xD800&&chr<=0xDFFF)throw new Error('surrogate characters not yet handled here');else R=false}}
S=states.pop()}
while(R!=undefined){
if(S==24576){emit();return}if(R){
if(S&16/*m_emitstate*/)buf.push(S>>>12)
if(S&32/*m_emitclose*/)buf.push(-2)
if(S&64/*m_emitanon*/)buf.push(-1)
if(S&128/*m_emitlength*/)buf.push(pos-posns[posns.length-1])
if(S&1/*cache*/)tbl[posns[posns.length-1]][S]=[pos,buf]
if(S&256/*m_resetpos*/)pos=posns[posns.length-1]
if(S&4/*pushpos*/)posns.pop()
if(S&512/*m_emitbuf*/)buf=bufs.pop().concat(buf)
if(S&1024/*m_tossbuf*/)buf=bufs.pop()
if(!bufs.length&&buf.length>64)emit()
S=M[S>>>12]}
else{
if(S&1/*cache*/)tbl[posns[posns.length-1]][S]=false
if(S&4/*pushpos*/)pos=posns.pop()
if(S&2048/*f_tossbuf*/)buf=bufs.pop()
S=F[S>>>12]}
if(S==41){R=true;S=states.pop()}else if(S==42){R=false;S=states.pop()}else R=undefined;
}}}
function emit(){var x=bufs.length?bufs[0]:buf;if(x.length){out('tree segment',x);if(bufs.length)bufs[0]=[];else buf=[]}}}
This parser no longer reflects the shape of the grammar in any obvious or direct way, but looks much more like a typical generated table-based LL or LR parser.
On non-trivial grammars, the two approaches produce parsers of approximately the same size.
For example, the ES5 parser is 71KB with the v5 codegen and 67KB with v6.
Performance of the generated parsers is also about the same.
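For reference, a caller might drive the generated parser above roughly as follows; this is a hedged sketch based only on the interface visible in the abridged listing (the output callback, the 'chunk' message, and the 'tree segment' and 'ready' events), and the protocol for signalling the end of the input and the final result is not shown there, so it is omitted here:
var output=[];
var feed=p_arith_streaming_v6_Expr(function(type,data){
  if(type==='tree segment')output=output.concat(data);  // a piece of the streamed parse tree
  // 'ready' means the current chunk has been consumed and the parser is waiting for more input
});
feed('chunk','1 + 2 *');
feed('chunk',' 3');
// output accumulates events in the streamable format described earlier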
Analysis
PEGs allow unlimited lookahead, and can contain ordered choice expressions which consume all but the last character of the input, then backtrack all the way to the very beginning and reparse the entire text.
Streaming the results of a parse as it proceeds is therefore not as straightforward as it is with LR or LL parsers with bounded lookahead.
Some PEGs, on some inputs, are impossible to stream (without a complex backtracking API).
However, in most cases, much or all of a grammar can be streamed, as commit points are reached.
A "commit point" here means a point in the parse at which one of two results is certain: either the entire parse will fail, or otherwise the parse will succeed by matching the rules that are currently on the stack.
Once a commit point is reached, the parser can safely emit the nodes that are open on the stack, since either they will succeed, or the entire parse will fail anyway.
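For example, in the arithmetic grammar above, as soon as the first Num matches, the open Expr, Add, and Mult nodes can safely be emitted: Expr has no alternative to fall back to, and the trailing repetitions in Add and Mult can only extend the match or stop, so either the whole parse fails or it succeeds with exactly those nodes on the stack.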
Since streaming is not always possible, a certain amount of analysis is required to determine commit points in a grammar.
In the current streaming codegen, these commit points can be provided explicitly by the caller.
Requiring the user to provide commit points can lead to difficult bugs and poorly performing parsers, and is almost entirely unnecessary.
One of the goals of this project is to derive as much as possible from the PEG itself without requiring annotations or extra work from the user.
In accordance with this approach, I am developing automatic commit point analysis.
In the general case this is undecidable, because even some basic questions about PEGs are undecidable (such as whether two arbitrary PEGs are disjoint, i.e. whether an input exists that satisfies both).
However, a few rules seem sufficient to answer the interesting questions in many practical grammars.
Speed
The generated parsers are currently quite slow, compared to parsers in C or even handwritten parsers in JavaScript.
Currently generated parsers parse at about 1 KB/s on a reasonably "dense" grammar and input (one which produces a lot of parse tree nodes per input character), as compared to an upper bound of about 1 MB/s, obtained by running a dummy "parser" on the same hardware and JavaScript engine.
(The dummy parser simply compares each input character against a constant in a loop and constructs a dummy parse tree of about the same size as the correct output.)
This means there is room for at most a 1000x improvement in the speed of generated parsers using this approach.
The v6 codegen is entirely unoptimized, and there are a lot of straightforward optimizations and some subtle ones that can be done, so I expect performance to improve considerably.
The ultimate goal is for generated parsers to be as fast as hand-written JavaScript parsers, using all the tricks available.
JavaScript parsers may never be as fast as C parsers, but there is a lot of room for improvement, and JavaScript engines are getting faster as well.
The commit point analysis also can help performance, by combining and optimizing rules and expressions in the grammar, combining or eliminating states in the state machine, and so on.
The v6 codegen currently does none of these optimizations.
Even more radical parser simplifications are possible by eliminating rules from the parse tree output, which then frees the parser to combine rules that otherwise would need to be parsed separately.
For example, if the user of the parser is not interested in the difference between whitespace and comments, then they can instruct the parser generator to suppress comment and whitespace nodes in the parse tree, and it can then fold these two rules into the places which use them, resulting in a faster (sometimes much faster) parser.
The v5 codegen already has such options, but they need to be ported to the v6 codegen.
Error reporting
There are three interesting kinds of errors here.
The most commonly encountered errors are errors in the text given to the generated parser.
Currently, if the parsed text does not conform to the grammar, the parser will simply give up and report the position at which the parse failed.
A more user-friendly error would include a short description of the error, such as "missing expression after comma" or "unmatched opening parenthesis".
Such messages could be supplied with the grammar, as error annotations in the grammar itself, or derived from the grammar by the parser generator.
As part of the commit point analysis, I hope to generate relatively useful error messages automatically, using the rule names of the grammar and the rule characteristics determined by the analyzer.
For example, "Syntax error in function declaration: expected '{'", could be automatically derived from a FunctionDeclaration parse rule that would have succeeded had an opening curly brace occurred at that position.
Custom error messages will be supported too, but probably not by explicit annotations in the grammars, which I intend to keep clean.
More likely, the parsers will return a custom format with some indication of the rules or expressions in which the parse failed, and the user can display the default or use a custom error message.
The other two kinds of errors are errors in the grammar itself and in the parser generator.
Both kinds of errors are experienced in the same way, as a valid input which fails to parse, or which parses with an incorrect tree.
There are even JavaScript interpreter bugs which can lead to incorrect parser output on one particular engine, several of which have turned up already.
Errors in the grammar can be hard to see, and errors in the parser generator are extremely hard for a user to identify, especially with a generated state-machine parser.
The best way to isolate all of these kinds of errors seems to be by breaking the parsing process into smaller parts and testing smaller productions against smaller inputs individually.
A Web-based user interface customized for this task, in which you could enter a PEG and some sample input, see the generated parse tree and the rules which failed to match, and experiment with changes to the grammar in real time, would make writing and testing PEGs much easier.
There is some infrastructure in place to facilitate this, and I have some tools I use for this sort of thing, but they are quite immature.
I hope to see such a UI emerge as part of the project or a companion project eventually.
The parser generator also has some options to generate a debugging parser, which produces volumes of debugging output documenting its internal state throughout the parse.
This output tends to be about 100 to 1000 times larger than the parse tree itself, so this is only a practical solution for small inputs, once the problem is already isolated to a small part of the input or the grammar.
Roadmap
Three milestone releases are planned:
0.0.5
This will be the next release, probably out in a couple weeks, after some extensive testing.
[Update: as of 2010-03-30, this is now released.]
In 0.0.5, everything is moving to the new streamable output format, even non-streaming parsers.
The existing utility API for displaying parse trees and so on will be updated to handle the new format, and probably extended slightly.
In 0.0.5 the output format will be locked down, hopefully for good, so anyone that is handling the raw output format directly can rely on it staying consistent across all future releases.
There aren't going to be many new features in this release except for the change in the output format and the updated tools.
A new function is provided to convert output in the new format to complete trees in the old format.
Currently the v5 codegen and the v6 "streaming" codegen exist side by side, selected by an option to the parser generator.
The v5 codegen will probably still be the default in the 0.0.5 release, as it is more stable, and it will be updated to the new streamable output format, even though it does not support streaming.
The v6 codegen has been simplified in the interests of correctness, and needs to be absolutely correct and bug free before becoming the default.
Having two complete and mostly distinct parser generators helps testing, as the two can be automatically tested in parallel against a suite of test grammars and inputs to detect any variations in output.
In the 0.0.5 release it will be possible to generate streaming parsers, by turning on the v6 codegen and providing commit points explicitly, but this will not be documented or recommended.
The 0.0.5 release will also port a few remaining features of the v4 codegen that never made it into the v5 codegen in the first release.
0.0.x (x>5) API Enhancements
The next major release after 0.0.5 will add to the public API for dealing with parser output, and for generating parsers.
There are many ways to consume the output of a parser.
The codebase includes a DSL for processing parse trees, using pattern matching to identify and bind nodes, and JavaScript expressions to define attributes on the matched nodes.
This is used internally for dealing with the parse tree of the PEG provided by the user.
However, the language has some limitations which need to be resolved before it is exported in the public API.
There are a lot of internal tools I am using (in various stages of maturity) that are not ready for public consumption, and I plan to pick some winners from among these APIs, clean them up, remove the rest, and bring the internal code used for dealing with trees in line with the external API.
This will reduce the size of the codebase and simplify the project.
It will also certainly expose a much richer API for generating parsers and for manipulating parse trees.
This will be the API release that would make it easy to build a rich UI for exploring parse trees and parsers in real time.
This release will probably have some very basic automatic error reporting capability beyond the current "parse failed at position 12".
0.1.0 Streaming
While streaming is already possible, it requires a lot of extra work by the user to enable it.
Once the analyzer is done, it will be possible to automatically generate streaming parsers from most PEGs where streaming is theoretically possible, and in other cases, to explain which parts of the grammar will not be streamed and why.
Doing this analysis without the benefit of some of those internal tree-manipulation APIs is painful, but I don't want to invest more into those APIs until they are stable and publicly released (which is why full streaming support, with analysis, is not going in 0.0.5).
This release will probably also have some speed improvements in the generated parsers.
Probably by this time the v5 codegen will be removed entirely.
1.0
By the 0.1.0 release, most major features will be in place and the APIs should be stable.
Hopefully after some wider testing, a 1.0 release will follow soon after.
After 1.0, work will continue on API enhancements, speed improvements, better tools for dealing with trees, error reporting, and so on.
Backend languages other than JavaScript are a possibility as well.
More
As always, with any questions, or if you want to help with the project, please comment here or email inimino@inimino.org or ask (and wait a while for an answer) in #inimino on Freenode IRC.
If you want to follow the progress in excruciating detail you can look at my TODO list which includes PEG (towards the middle) and most of my other open-source projects and is updated daily.