I was reading over Darius Bacon’s The Language of Choice yesterday, which is an introduction to binary decision diagrams; while chatting with Bacon, I hit on a couple of ideas that I thought I ought to write down.
(I’m going to ignore tokenization and thus whitespace here, since it would add more heat than light.)
One is that the language generated by the CFG
<expr> ::= <var> | <expr> if <expr> else <expr>
is context-free, but the grammar itself is left-recursive and ambiguous (a if b else c if d else e can be parsed as testing b first or d first). If you want to parse it using a standard linear-time Packrat parser, you need to factor out the left-recursion and resolve the ambiguity. Bacon suggested two ways of doing this, one right-associative as it should be:
<expr> ← <var> (if <expr> else <expr>)?
and the other left-associative:
<expr> ← <var> (if <expr> else <var>)*
Of course, once you have a parse tree, it’s a simple pattern-matching exercise to construct a parse tree with a different associativity.
(Of course Bacon is aware he didn’t invent left-recursion removal; I just had never understood it until he explained it.)
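For concreteness, here is a minimal sketch of the right-associative factoring as a recursive-descent parser, together with that pattern-matching re-association exercise; the tokenization, the tuple-shaped trees, and the function names are my own assumptions, not Bacon’s.

def parse(toks):
    tree, i = expr(toks, 0)
    if i != len(toks):
        raise SyntaxError('trailing junk at token %d' % i)
    return tree

def expr(toks, i):
    # <expr> ← <var> (if <expr> else <expr>)?
    if i >= len(toks) or toks[i] in ('if', 'else'):
        raise SyntaxError('expected a variable at token %d' % i)
    tree, i = toks[i], i + 1
    if i < len(toks) and toks[i] == 'if':
        cond, i = expr(toks, i + 1)
        if i >= len(toks) or toks[i] != 'else':
            raise SyntaxError('expected else at token %d' % i)
        alt, i = expr(toks, i + 1)
        tree = ('ite', tree, cond, alt)
    return tree, i

def left_associate(tree):
    # The pattern-matching exercise: rebracket a if b else (c if d else e)
    # as (a if b else c) if d else e, repeatedly, to get the other parse.
    while isinstance(tree, tuple) and isinstance(tree[3], tuple):
        _, a, b, (_, c, d, e) = tree
        tree = ('ite', ('ite', a, b, c), d, e)
    return tree

print(parse('a if b else c if d else e'.split()))
# ('ite', 'a', 'b', ('ite', 'c', 'd', 'e'))
print(left_associate(parse('a if b else c if d else e'.split())))
# ('ite', ('ite', 'a', 'b', 'c'), 'd', 'e')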
In this form the language is missing a little bit of expressivity; it can only express an unbranching chain of single-variable choices, and it can’t express negation. You can enhance this with nesting to be able to handle arbitrary boolean functions:
<expr> ← ("(" <expr> ")" / <var> / <const>) (if <expr> else <expr>)?
<const> ← 0 / 1
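For instance, reading 1 as true and 0 as false (my assumption; the grammar itself doesn’t say which is which), the missing connectives come out as nested choices:

not x   =  0 if x else 1
x and y =  y if x else 0
x or y  =  1 if x else y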
(See also The sparsity of PEG memoization utility.)
I was thinking that this grammar might be a good example of when you benefit from Packrat’s memoization, but in fact it’s not, once the left recursion is factored out. I’d like to work this out rigorously: how can we know that a given memo-table entry can never be used? Is there a single canonical answer that’s practical to compute, or is it more a question of a series of conservative approximations, progressively less simple?
Perhaps in this case we can observe that no retry-after-backtrack of any callsite of <expr> will ever invoke <expr> a second time. This is a whole-grammar property because it flows through invocations; we could falsify it for both <expr> and <var> by, for example, adding this rule to the above grammar:
<when> ← <expr> when <expr> / <expr>
With that rule, an input that matches only the second alternative first parses <expr>, fails to find when, backtracks, and immediately re-parses <expr> at the same position, which is exactly the case memoization pays for. But if we refactor the rule as follows, we regain the desirable memorylessness property:
<when> ← <expr> (when <expr>)?
Mizushima, Maeda, and Yamaguchi published a paper in 2010 where they insert a Prolog-style “cut” operator into PEGs, spelled ↑; the construction x ↑ y / z is similar to x y / z, but although failures before the cut (in x) can backtrack to z, failures after the cut (in y) do not. Analogously, in (x ↑ y)*, if there is a failure in y, then the whole repetition fails. They first implemented this in Yapp, but in the 2010 paper they explain how they inserted cuts and negative lookahead assertions automatically in semantics-preserving locations.
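To make those semantics concrete, here is a toy combinator sketch of my own (not Mizushima et al.’s implementation), where a parser maps (string, position) to a new position or None, and a cut commits the enclosing choice:

class CutSignal(Exception):
    """Failure after a cut: the enclosing choice must not try other alternatives."""

def lit(t):
    return lambda s, i: i + len(t) if s.startswith(t, i) else None

def seq_cut(x, y):
    # x ↑ y: a failure in x backtracks normally; a failure in y raises.
    def parse(s, i):
        j = x(s, i)
        if j is None:
            return None
        k = y(s, j)
        if k is None:
            raise CutSignal()
        return k
    return parse

def choice(*alts):
    def parse(s, i):
        try:
            for alt in alts:
                j = alt(s, i)
                if j is not None:
                    return j
            return None
        except CutSignal:
            return None   # a committed alternative failed; don't fall through
    return parse

def rep(x):
    # (x)*: a post-cut failure inside x fails the whole repetition.
    def parse(s, i):
        while True:
            try:
                j = x(s, i)
            except CutSignal:
                return None
            if j is None:
                return i
            i = j
    return parse

g = choice(seq_cut(lit('if '), lit('then')), lit('if else'))
print(g('if else', 0))  # None: 'if ' matched and committed, so no fallback

Without the cut, the second alternative would have matched; the cut trades that fallback for the right to discard the backtracking state.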
They say x ↑ y / z is equivalent to x y / z as long as there is no string S such that x matches some prefix of S and z matches some (possibly different) prefix of S. Deciding this is an undecidable problem even in some trivial cases, like where x matches everything, so that the problem reduces to whether z can ever succeed on any input. So they compute a conservative approximation that lets them insert enough cuts to get “mostly sequential space” on “practical grammars”, including “a Java PEG and a JSON PEG” but not “an XML PEG.”
They also report that their cut-insertion improves error reporting from PEGs.
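One natural member of that family of conservative approximations (my sketch; I don’t know that it’s the test they used) is first-character disjointness: if no character can begin both an x-match and a z-match, then no string has prefixes matched by both, and the cut is safe:

# A toy PEG AST: ('lit', s) with s nonempty, ('choice', a, b), ('seq', a, b).
def first_chars(e):
    """Over-approximate the characters that can begin a successful match."""
    if e[0] == 'lit':
        return {e[1][0]}
    if e[0] == 'choice':
        return first_chars(e[1]) | first_chars(e[2])
    if e[0] == 'seq':
        return first_chars(e[1])  # sound here because literals are nonempty
    raise ValueError(e)

def cut_safe(x, z):
    # Disjoint first sets mean no input that commits at the cut could
    # have succeeded via z, so inserting ↑ after x preserves the meaning.
    return not (first_chars(x) & first_chars(z))

print(cut_safe(('lit', 'if '), ('lit', 'while ')))  # True: safe to cut
print(cut_safe(('lit', 'if '), ('lit', 'iffy')))    # False: can't tell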
Redziejowski wrote a 2016 paper “Cut points in PEG” where he uses a
second kind of “cut” ↓, which I’m guessing he got from an earlier
Mizushima et al. paper: a ↓ b ↑ c / d
backtracks and retries d
only if the failure happens within b
, so a failure after the
ordinary cut ↑ doesn’t backtrack, but neither does the failure before
the backwards cut ↓. I’m not yet sure what this is good for.
Cut insertion doesn’t avoid sticking things into a memo table that we are never going to need, but it does allow us to discard potential backtracking points off the stack as soon as we know we aren’t going to need them, which allows us to safely discard everything to the left of that point.
Given a candidate criterion for not memoizing a callsite at all, we could test it by memoizing it anyway, then testing the parser on some input to see which callsites the criterion erred on — those whose memo entries were expected to be useless, but got used anyway. It’s maybe also worthwhile to examine which callsites don’t need to probe the memo table because it’s impossible for a memo-table entry to exist, because no preceding attempt to parse the same text could have invoked that nonterminal at that position. I haven’t seen any research on this aspect of the problem, but it seems like it would help a lot to reduce both the time usage and space usage of the memo table.
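Here is a sketch of that instrumentation, assuming a Packrat memo table keyed on (rule, position); the class and counter names are hypothetical:

from collections import Counter

class InstrumentedMemo:
    # Records which rules' memo entries ever get re-read, so a candidate
    # don't-memoize criterion can be scored against an actual parse.
    def __init__(self):
        self.table = {}
        self.stores = Counter()
        self.hits = Counter()

    def get(self, rule, pos):
        # Returns the stored result, or None on a miss (this sketch
        # assumes None is never itself a stored result).
        key = (rule, pos)
        if key in self.table:
            self.hits[rule] += 1  # this entry turned out to be useful
            return self.table[key]
        return None

    def put(self, rule, pos, result):
        self.stores[rule] += 1
        self.table[(rule, pos)] = result

    def wasted(self):
        # Rules whose entries were stored but never re-read on this input;
        # a good criterion should predict exactly these.
        return [r for r in self.stores if self.hits[r] == 0]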
There’s a trick due to Warth, Douglass, and Millstein that lets Packrat parsers handle some left-recursive grammars at the expense of their linear-time guarantee; I think it would enable Packrat to handle the original grammar in PEG form:
<expr> ← <expr> if <expr> else <expr> / <var>
When it detects a left-recursive call, it initially fails; if it finishes parsing the rule successfully, for example with the <var> alternative here, it enters the parse result into the memo table, then restarts parsing, this time allowing the left-recursive call to succeed. So, for example, parsing a if b else c if d else e, it would initially parse a with the second alternative, and then on a second go-round, use that a for the left-recursive <expr> invocation and succeed in parsing a if b else c, which would replace a in the memo table; after restarting a second time, the initial left-recursive <expr> requirement would be fulfilled with the whole a if b else c from the previous attempt, and the whole expression would be parsed as an <expr>, though associating incorrectly.

After a third restart, an attempt to parse an <expr> that begins with a if b else c if d else e fails, because that is followed by end of input, not by if. So we keep the former parse.
Essentially, this left-recursion trick converts a Packrat parser locally from a top-down parser into a bottom-up parser.
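In code, the core of the trick is small; this is a much-simplified sketch of my own that handles only direct left recursion (Warth et al. also deal with indirect and mutual recursion):

def apply_rule(rule, s, pos, memo):
    # rule(s, pos, memo) returns (tree, end) or None, and makes its own
    # recursive <expr> calls through apply_rule; these shapes are assumptions.
    key = (rule, pos)
    if key in memo:
        return memo[key]
    memo[key] = None  # seed: the left-recursive call fails at first
    while True:
        result = rule(s, pos, memo)
        prior = memo[key]
        # Keep restarting while each attempt consumes more than the last.
        if result is None or (prior is not None and result[1] <= prior[1]):
            return prior
        memo[key] = result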
I think Medeiros, Mascarenhas, and Ierusalimschy have written a 2014 paper on this question.
An earlier draft of Bacon’s article, if I recall correctly, chose the symbols ← and → rather than 0 and 1 for the Boolean values, and used infix rather than prefix syntax, so (if we interpret ← as True and → as False) a if b else c would be written a {b} c, which is a if b evaluates to ← and c if b evaluates to →. That is, a {←} c evaluates to a, and a {→} c evaluates to c. So you got nice identities like {0 {x} 1} = x, a {1 {x} 0} b = b {x} a, and a {b} a = a. I really liked this idea, but he dropped it for the final version of the article as making it less accessible.
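The identities check out by brute force, though as written they seem to require reading 0 as ← (true) and 1 as → (false); under that assumption:

from itertools import product

def ch(a, b, c):
    # a {b} c: a if b is true, else c, with 0 playing true and 1 false.
    return a if b == 0 else c

for a, b, c, x in product((0, 1), repeat=4):
    assert ch(0, x, 1) == x                      # {0 {x} 1} = x
    assert ch(a, ch(1, x, 0), b) == ch(b, x, a)  # a {1 {x} 0} b = b {x} a
    assert ch(a, b, a) == a                      # a {b} a = a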
The other clever Boolean notation idea is the one from Laws of Form, which is optimized for handwriting rather than strings of characters; in a string-of-characters form we can use nested parentheses, which indicate negation. One interpretation is that in LoF juxtaposition represents OR and the empty string represents False, so we can write the whole canonical evaluation ruleset as follows:
()() -> ()
(()) ->
That is, two empty sets of parentheses (“crosses” in LoF jargon) juxtaposed can be rewritten as a single empty set of parentheses, and a parenthesized empty parenthesis can be rewritten as an empty string.
This is essentially the same as the Sheffer stroke or NOR logic, but with a notation that exploits a homomorphism between the free monoid, whose identity element is the empty string, and Boolean OR, whose identity element is falsehood. (LoF itself is silent on the correspondence between its crosses and logic, so you can equally well interpret the empty string as true and juxtaposition as AND.)
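A tiny evaluator for this reading (juxtaposition as OR, parentheses as negation, the empty string as false) confirms the two rules; the parsing helper is my own:

def lof_eval(s):
    # Juxtaposition ORs together the negations of the enclosed forms.
    val, rest = False, s
    while rest:
        inner, rest = split_group(rest)
        val = val or not lof_eval(inner)
    return val

def split_group(s):
    # Split '(...)tail' into (..., tail), honoring nested parentheses;
    # this toy syntax allows nothing but parentheses.
    depth = 0
    for i, c in enumerate(s):
        depth += {'(': 1, ')': -1}[c]
        if depth == 0:
            return s[1:i], s[i + 1:]
    raise ValueError('unbalanced parentheses')

assert lof_eval('()()') == lof_eval('()') == True   # ()() -> ()
assert lof_eval('(())') == lof_eval('') == False    # (()) ->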
While the notation is pleasantly straightforward for explicit evaluation, my attempts to formulate normalization rules such as CNF normalization as tree rewriting rules for LoF notation were not very fruitful — they seemed to come out a great deal less clear than in traditional Boolean logic, and at the time I wasn’t able to get them working.