Reversible parsing

Kragen Javier Sitaker, 02020-05-11 (6 minutes)

In Prolog you can write definite clause grammars, which make it very straightforward to write grammars, which can then be used both for text generation and for parsing:

: user@debian:~/devel/dev3; swipl
% library(swi_hooks) compiled into pce_swi_hooks 0.00 sec, 3,856 bytes
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 5.10.4)
Copyright (c) 1990-2011 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.

For help, use ?- help(Topic). or ?- apropos(Word).

?- [user].
det --> [the] | [a] | [that].
|: noun --> [buffalo] | [capacitor] | [philosophy].
|: vi --> [sucks] | [is, walking] | [glows].
|: vt --> [supersedes] | [clobbers] | [loves].
|: sentence --> det, noun, vi |
|:  det, noun, vt, det, noun.
|: 
% user://1 compiled 0.00 sec, 4,816 bytes
true.

?- phrase(sentence, S), append(Y, Z, S), append(X, [buffalo], Y).
S = [the, buffalo, sucks],
Y = [the, buffalo],
Z = [sucks],
X = [the] ;
S = [the, buffalo, is, walking],
Y = [the, buffalo],
Z = [is, walking],
X = [the] ;
S = [the, buffalo, glows],
Y = [the, buffalo],
Z = [glows],
X = [the] ;
S = [a, buffalo, sucks],
Y = [a, buffalo],
Z = [sucks],
X = [a] ;
S = [a, buffalo, is, walking],
Y = [a, buffalo],
Z = [is, walking],
X = [a] ;
...
S = [the, buffalo, loves, that, capacitor],
Y = [the, buffalo],
Z = [loves, that, capacitor],
X = [the] ;
S = [the, buffalo, loves, that, philosophy],
Y = [the, buffalo],
Z = [loves, that, philosophy],
X = [the] ;
S = Y, Y = [the, capacitor, supersedes, the, buffalo],
Z = [],
X = [the, capacitor, supersedes, the] ;
S = Y, Y = [the, capacitor, supersedes, a, buffalo],
Z = [],
X = [the, capacitor, supersedes, a] .

?-

A disadvantage of DCGs is that, in standard Prolog, they don’t terminate on left recursion and can take exponential time, although cuts can tame the exponential and I think tabled resolution can conquer both in some cases (“DCGs + Memoing = Packrat Parsing, But is it worth it?” by Ralph Becket and Zoltan Somogyi.)

Hmm, clearly I have a lot to learn about Prolog DCGs... Markus Triska’s tutorial, Anne Ogborn’s tutorial, the SWI-Prolog manual, and so on.

Anyway, what I was thinking was that for very straightforward kinds of “grammars”, even a perfectly ordinary imperative language suffices:

void employee_card(card *s, employee *e)
{
  int_columns(s, 0, 6, &e->empno);
  columns(s, 6, 16, e->firstname, sizeof e->firstname);
  columns(s, 16, 26, e->lastname, sizeof e->lastname);
  blank_columns(s, 26, 80);
}

This plain C function could be invoked either for input or for output, if card contains a flag that indicates the direction and the int_columns and columns functions consult that flag. And similar bidirectional serialization/deserialization functions can be built for a wider class of grammars. Field widths need not be fixed, and field concatenation can be implicit:

void employee_csv(stream *s, employee *e)
{
  int_field(s, &e->empno);
  text(s, ",");
  delim_s_field(s, e->firstname, sizeof e->firstname, ',');
  delim_s_field(s, e->lastname, sizeof e->lastname, '\n');
}

Again, this function can be used either for input or for output, and multiple such functions can be composed together. If we add a little bit of backtracking, we can get polymorphic records:

void foo(stream *s, thing *t)
{
  begin(s);
  {
    equal_int(s, &t->type, TYPE_BAR);
    byte(s, 'B');
    nbytes(s, &t->bar.contents, sizeof t->bar.contents);
  }
  or(s);
  {
    equal_int(s, &t->type, TYPE_QUUX);
    byte(s, 'Q');
    s16_le(s, &t->quux.len);
    nbytes(s, &t->quux.contents, t->quux.len);
  }
  end(s);
}

On input, the calls to equal_int function as assignments to an integer field, while the calls to byte function as assertions about which byte comes next in the input; if one of these assertions fails, its effect on the input stream is backtracked, so that a subsequent call to byte can test the same input bytes. The backtracking state is set up by begin, restored by or in case of failure, and torn down by end.

On output, the situation is precisely the other way around: the calls to equal_int function as assertions about what should be found in t->type for that branch to proceed successfully, while the calls to byte emit literal bytes on the output — bytes which are buffered so they can be retracted if the case must be backtracked due to a subsequent failed assertion.

But this is still a very simple case; in particular it does not handle allocation, which in a C-like language probably must be part of the state restored in case of backtracking.

You could consider something like

child_node(s, &t->child, sizeof struct fulano);
struct fulano *f = (struct fulano *)t->child;
equal_int(s, &f->type, TYPE_FULANO);
byte(s, 'f');
decimal_int(s, &f->x);
byte(s, ' ');
decimal_int(s, &f->y);

where child_node creates a new allocation on input (deallocated in case of backtracking) and does nothing on output. But consider the infix expression

3 + 1000/2/2/2/2/2

which in prefix notation is

(+ 3 (/ (/ (/ (/ (/ 1000 2) 2) 2) 2) 2))

so unfortunately we have to read the rest of the input before we know how deep down the tree 1000 goes. I’m not even sure Prolog DCGs handle this case in this form. I wrote a toy calculator program tonight to explore some of the above ideas, and I refactored the grammar to eliminate left recursion; here’s a simplified form of how it parses terms like 1000/2/2:

int term()
{
  int x = unary();

  begin();
  while (ok()) {
    /* save() ensures that our progress so far will not be backtracked */
    save();      /* zero or more multipliers, divisors, and modulos */
    begin();
    {
      op("*");
      int y = unary();
      if (ok()) x *= y;
    }
    or();
    {
      op("/");
      int y = unary();
      if (ok()) x /= y;
    }
    end();
  }
  end();

  return x;
}

x /= y in an AST would be something like x = new_division_node(x, y). But it’s deeply unclear to me how to make that work bidirectionally: the sequence of the input text is bottom-up, while normally the structure of an AST is top-down.

A somewhat related thing is operator precedence and associativity. If we take + to be associative, it might be reasonable to serialize both (+ 1 (+ 2 3)) and (+ (+ 1 2) 3) in infix as 1 + 2 + 3, but clearly (- 1 (- 2 3)) is 1 - (2 - 3) while (- (- 1 2) 3) is conventionally 1 - 2 - 3. Similarly, precedence dictates that (+ 3 (* 4 5)) can be 3 + 4 * 5, but (* (+ 3 4) 5) requires extra parentheses: (3 + 4) * 5.

Topics

Algorithms (17 notes)
Parsing (4 notes)
C (4 notes)
Prolog (3 notes)
Parsing expression grammars (3 notes)