Thurston claims that no previous grammar system meets all three of his requirements: generalized parsing, grammar-dependent scanning, and context-dependent parsing.
I would argue that Prolog Definite Clause Grammars, which date back to the early 1970s, have all three of these properties. Furthermore, since the context is maintained functionally, by threading additional values through the productions, no "undo actions" are required; Prolog's built-in backtracking is all that's needed.
Of course, the problem with DCGs is performance: they're exponential in the worst case. But I think they deserve mention in a dissertation like this anyway. Also, any backtracking parser risks exponential worst-case performance; it will be interesting to see how Colm avoids this fate (I've only read the first few pages so far).
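For concreteness, here is a minimal sketch of the context-threading idea. It is in Haskell rather than real DCG notation, and the names (Prod, declare, declaredName) are made up for illustration: each production maps the remaining input plus a context value to a list of alternative outcomes, so keeping all alternatives gives generalized parsing, consulting the context gives context-dependent parsing, and abandoning an alternative simply drops its version of the context, with no undo actions.

    import qualified Data.Set as Set
    import Data.Set (Set)

    -- The threaded context: here, just the set of names declared so far.
    type Ctx = Set String

    -- A "production": remaining input plus context in, a list of
    -- alternative (remaining input, new context, result) outcomes out.
    type Prod a = (String, Ctx) -> [((String, Ctx), a)]

    -- Sequence two productions, threading the context through both.
    andThen :: Prod a -> (a -> Prod b) -> Prod b
    andThen p f st = [ res | (st', x) <- p st, res <- f x st' ]

    -- Keep both alternatives; this is where backtracking lives.
    orElse :: Prod a -> Prod a -> Prod a
    orElse p q st = p st ++ q st

    -- Consume one expected character.
    char :: Char -> Prod Char
    char c (x:xs, ctx) | x == c = [((xs, ctx), c)]
    char _ _                    = []

    -- Consume a one-letter name and record it in the context passed on.
    declare :: Char -> Prod ()
    declare c st = [ ((rest, Set.insert [c] ctx), ())
                   | ((rest, ctx), _) <- char c st ]

    -- Accept a one-letter name only if the context says it was declared.
    declaredName :: Char -> Prod Char
    declaredName c st@(_, ctx)
      | [c] `Set.member` ctx = char c st
      | otherwise            = []

    -- One parse of "xx": declare the name 'x', then accept it because
    -- the threaded context now contains it.
    demoParse :: [((String, Ctx), Char)]
    demoParse = andThen (declare 'x') (\_ -> declaredName 'x') ("xx", Set.empty)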
I remember, back in 1985, trying to write a parser in Prolog on a VAX 11/750. I didn't know what I was doing and my code fell into the exponential-case trap. Even minimal examples appeared to hang because they took unreasonably long, so I wasn't getting any clues about the problem, and I just gave up.
Consequently I want to rephrase the first sentence of your third paragraph: the problem with DCGs is performance; they're exponential in the naive case. That might not sound too bad today, but back in 1985 the hype was that Prolog let you program declaratively, i.e. just declare what a parse looks like. Reasonable performance from that naive, declarative code was the popular selling point for Prolog.
Note that threading the context through the parse tree while maintaining fully generalized parsing requires keeping all versions of the parsing context in memory. Consider building a C++ parser that way, i.e. every time you modify the structures you are building in memory, you first make a copy of them.
If you know that every subfield of your state object is immutable, and you're using references instead of "local" copies (inconvenient unless you have garbage collection), then your copying cost is limited to the size of the overall state object plus whichever field changed.
Obviously this makes no sense for C++, but Clojure and OCaml get away with it, and in Haskell it's the standard way of implementing almost every stateful computation.
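A tiny illustration of that point, with a made-up ParseState record: updating one field of an immutable record copies only the record's spine, while every unchanged field is shared by reference with the old version, which stays valid.

    -- Hypothetical parser state; the field names are invented for the example.
    data ParseState = ParseState
      { symbolTable :: [(String, String)]
      , scopeDepth  :: Int
      , diagnostics :: [String]
      } deriving Show

    -- "Modifying" the state allocates one new ParseState cell; the
    -- symbolTable and diagnostics lists are shared, only scopeDepth differs.
    enterScope :: ParseState -> ParseState
    enterScope st = st { scopeDepth = scopeDepth st + 1 }

    -- The old state is untouched, so backtracking is just "keep using it".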
Indeed I did not follow what you meant. I missed the "persistent" and saw ... "don't incur the cost of copying." I took that to mean that you were referring to not maintaining any history and using a single, mutable global state. It is with that understanding that I made the previous comment.
To address what you meant ... it's more a matter of what's practical. For example, to parse C++ one needs to do quite a bit of semantic analysis during parsing; a full implementation of the type system is necessary. Those data structures then change with every statement or clause parsed. Achieving this with persistent data structures would be far too costly.
To have both generalized parsing and context-dependent parsing, you need to somehow revert changes to the global state when you backtrack. You can do this by restoring old versions, which requires copying the state and keeping the history. The Colm approach is to keep only one global state, but store instructions for programmatically reverting it as you backtrack.
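If I've followed that description correctly, a rough sketch of the shape of the idea (not Colm's actual implementation) is a single mutable state plus a stack of reverse instructions that get replayed on backtrack:

    import Data.IORef

    -- A stand-in for the single global state.
    data Global = Global { names :: [String] }

    -- The stored "instructions for programmatically reverting the state".
    type UndoStack = IORef [Global -> Global]

    -- Mutate the one global state, and push the instruction that undoes it.
    declareName :: IORef Global -> UndoStack -> String -> IO ()
    declareName st undo n = do
      modifyIORef' st   (\g -> g { names = n : names g })
      modifyIORef' undo ((\g -> g { names = drop 1 (names g) }) :)

    -- Backtrack by replaying the most recent k reverse instructions.
    revert :: IORef Global -> UndoStack -> Int -> IO ()
    revert st undo k = do
      us <- readIORef undo
      let (toRun, rest) = splitAt k us
      mapM_ (modifyIORef' st) toRun
      writeIORef undo rest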
The point of functional data structures is that when you modify one, you don't copy the whole structure; you only copy the path to the parts that changed.
A simple example would be parsing a sequence of characters and storing them in a linked list. When you get an additional character you don't destructively modify the list; instead you create a new list node that points to the existing list.
The process function takes a character and a state, and returns a new state. The old state is still usable, as nothing was mutated. On the other hand, no large data structure had to be copied.
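Spelled out, with the state being nothing more than the characters seen so far:

    -- The parser state from the example above: just the characters seen,
    -- most recent first.
    type State = [Char]

    -- Take a character and a state, return a new state: the new list node
    -- points at the existing list, so nothing is copied or mutated.
    process :: Char -> State -> State
    process c seen = c : seen

    -- Both states remain usable afterwards.
    demo :: (State, State)
    demo =
      let s0 = "cba"             -- some earlier state
          s1 = process 'd' s0    -- new state, sharing all of s0's nodes
      in (s0, s1)                -- ("cba", "dcba")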
Your method is probably more efficient. Because you don't need access to all historical versions simultaneously, you only have to keep a stack of undo operations, so you only pay an O(1) slowdown per operation. With functional data structures you have, in the worst case, an O(log n) slowdown per operation.
For example, if you implement a bit vector, then update and lookup can both be O(log n) (e.g. a red-black tree). If you want update to be O(1), then lookup becomes O(n) (e.g. a linked list); if you want lookup to be O(1), then update becomes O(n) (e.g. copying the entire vector on update).
(please correct me if I'm wrong and if you can do better than O(n) for those cases)
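To make the O(log n) case concrete, here is one way to get a persistent "bit vector" with logarithmic reads and writes, using Haskell's Data.Sequence (a balanced finger tree) as the backing structure; each update shares nearly all of its structure with the previous version:

    import qualified Data.Sequence as Seq
    import Data.Sequence (Seq)

    type Bits = Seq Bool

    -- An all-zero vector of n bits.
    emptyBits :: Int -> Bits
    emptyBits n = Seq.replicate n False

    -- Read a bit: O(log n).
    getBit :: Int -> Bits -> Bool
    getBit i bits = Seq.index bits i

    -- Functionally set a bit: O(log n); the old version remains valid.
    setBit :: Int -> Bool -> Bits -> Bits
    setBit i b bits = Seq.update i b bits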
Yes, I understand these techniques; in fact they are heavily used in Colm, just not for maintaining the global state that is used for the parsing feedback loop.
I use the term "copy" in the general sense to make the discussion easier. Conceptually, persistent data structures are copies, even if they are heavily optimized to incur very little cost.