You could have invented Parser Combinators (theorangeduck.com)
147 points by lettergram on Dec 1, 2014 | 28 comments


This is an aside, but I really liked the presentation of this article. It starts from the right place, builds at the right speed, and relates to the reader in a fun way. The format of statement followed by the same statement but expressed in code was great. Thanks for writing/linking it.


I agree. You can only write so much object-to-data packing code, followed by the mirror-image data-to-object unpacking code, before you realize there's an exploitable pattern here (a sketch of what it can collapse into is below).

(I had exactly this experience, in fact[0][1]. My name was way worse, though: "Jars".)

0: http://twistedoakstudios.com/blog/Post4708_optimizing-a-pars...

1: https://github.com/Strilanc/Tinker/blob/master/Warcraft3/Pro...
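
To make "exploitable pattern" concrete, here's a minimal sketch (in Haskell, with made-up names; not the code from the links above) of what the two directions can collapse into: a single field description that yields both the packer and the unpacker.

    import Data.Char (isDigit)

    -- One description, two directions: the "exploitable pattern" is that
    -- packing and unpacking code always mirror each other.
    data Codec a = Codec
      { pack   :: a -> String                  -- object -> data
      , unpack :: String -> Maybe (a, String)  -- data -> object, plus leftover
      }

    -- Example: a zero-padded, fixed-width integer field.
    intField :: Int -> Codec Int
    intField width = Codec
      { pack   = \n -> let s = show n in replicate (width - length s) '0' ++ s
      , unpack = \s -> case splitAt width s of
          (digits, rest)
            | length digits == width, all isDigit digits -> Just (read digits, rest)
          _ -> Nothing
      }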


It happened to me while re-implementing LPeg [0] as a pure Lua library[1].

While it targets Lua, the original is actually written in C. It compiles patterns to bytecode that is interpreted by a custom VM.

I was wary of writing an interpreter in a language that's already interpreted, so I chose to compile patterns to Lua functions instead. It turns out that the most straightforward way to do that is with closures and higher-order functions, in a style that I later learned was known as parser combinators...
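
For anyone who wants to see the shape of that, a minimal sketch (in Haskell rather than Lua, and not LuLPeg's actual code): each pattern "compiles" to a plain closure from input to an optional result plus leftover input, and combinators just compose those closures.

    import Data.List (stripPrefix)

    -- A compiled pattern is just a function (closure) over the input.
    type Parser a = String -> Maybe (a, String)

    -- Compile a literal pattern to a closure that matches it.
    lit :: String -> Parser String
    lit s = \input -> (\rest -> (s, rest)) <$> stripPrefix s input

    -- Sequencing and ordered choice are just closure composition.
    andThen :: Parser a -> Parser b -> Parser (a, b)
    andThen p q = \input -> do
      (a, rest)  <- p input
      (b, rest') <- q rest
      Just ((a, b), rest')

    orElse :: Parser a -> Parser a -> Parser a
    orElse p q = \input -> maybe (q input) Just (p input)

    -- e.g. (lit "ab" `orElse` lit "cd") "cdx"  ==  Just ("cd", "x")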

————

0. http://www.inf.puc-rio.br/~roberto/lpeg/

1. https://github.com/pygy/LuLPeg


I recently wrote a little intro to parser combinators in Haskell. It may be of interest to readers here: https://gist.github.com/tel/df3fa3df530f593646a0

In particular I exploit the structure of a parser combinator type as a monad transformer.
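
For anyone who hasn't seen that factoring, the punchline is that the usual parser function type is already a standard transformer stack. A rough sketch of the idea (not the gist's exact code):

    import Control.Monad.Trans.State (StateT (..))

    -- String -> Maybe (a, String) is exactly StateT String Maybe a, so
    -- Functor/Applicative/Monad/Alternative all come for free from the stack.
    type Parser a = StateT String Maybe a

    -- Consume one character satisfying a predicate.
    satisfy :: (Char -> Bool) -> Parser Char
    satisfy p = StateT $ \input -> case input of
      (c:rest) | p c -> Just (c, rest)
      _              -> Nothing

    -- e.g. runStateT (satisfy (== 'a')) "abc"  ==  Just ('a', "bc")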


The title seems to be based on Timothy Chow's article "You Could Have Invented Spectral Sequences", about a mathematical subject regarded as unintuitive and advanced. (http://www.ams.org/notices/200601/fea-chow.pdf)


I was going to suggest it was based on Dan Piponi's slightly more relevant "You Could Have Invented Monads (And Maybe You Already Have)" [1], but after a quick dig I found a comment seemingly by Dan Piponi [2] saying his article was indeed inspired by "You Could Have Invented Spectral Sequences". So I guess either way that's the original spark.

1: http://blog.sigfpe.com/2006/08/you-could-have-invented-monad...

2: http://blog.ezyang.com/2012/02/anatomy-of-you-could-have-inv...



I like parser combinators in theory, but I've never found a fast implementation. Recently, in JavaScript, I resorted to parsing the easy, common stuff with a fast regex and the general stuff with parser combinators.

I had the same problem in Java.
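
The shape of that hybrid, as a hedged sketch (Haskell, with a plain character scan standing in for the regex so it stays dependency-free; the names are made up): try the cheap specialized path first, and fall back to the general combinator parser only when it misses.

    import Data.Char (isDigit)

    type Parser a = String -> Maybe (a, String)

    -- Fast path: a cheap specialized scan for the common case.
    fastNumber :: Parser Int
    fastNumber s = case span isDigit s of
      ([], _)        -> Nothing
      (digits, rest) -> Just (read digits, rest)

    -- Try the fast path first; fall back to the general parser.
    withFastPath :: Parser a -> Parser a -> Parser a
    withFastPath fast general = \input ->
      maybe (general input) Just (fast input)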



I do the same in C++. I use Ragel for blasting through regular strings, where little extraction or semantic analysis is required, and use a full-blown recursive descent library for everything else.

Parser combinators are elegant but, at the end of the day, I find DSLs like PEGs, Ragel, and good old regex easier to read.


That approach is a search algorithm which spends a lot of time exploring dead ends, to solve a problem for which there are much better solutions. For some grammars, it looks like it will lead to combinatorial explosion. I did something like this in SNOBOL a long time ago. Someone commented "That's the slowest program I've ever seen that does something useful".


Well, in practice, it is a variant of recursive descent. It has pretty bad performance for certain grammars, but it is a perfectly good (and fast) alternative if you feel like spending time rewriting the grammar by hand to fit.

For example, GCC nowadays uses recursive descent to parse C.

Also, not handling arbitrary context-free grammars well is extremely common; it's a limitation that LL, LR, LALR, PEG, and others all share. Is it obviously worse to take exponential time parsing something than not to parse it at all? No, but if you need to handle ambiguity you should probably use a different algorithm.
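
The canonical example of that hand-rewriting is left recursion, which recursive descent can't execute directly; you refactor it into iteration. A hedged sketch:

    import Data.Char (isDigit)

    type Parser a = String -> Maybe (a, String)

    number :: Parser Int
    number s = case span isDigit s of
      ([], _)        -> Nothing
      (digits, rest) -> Just (read digits, rest)

    -- "expr := expr '-' term | term" is left-recursive and would loop forever
    -- in recursive descent; the standard rewrite is "expr := term ('-' term)*",
    -- a loop with a left-folded accumulator (keeping '-' left-associative).
    expr :: Parser Int
    expr input = number input >>= uncurry go
      where
        go acc rest = case rest of
          ('-':rest') | Just (n, rest'') <- number rest' -> go (acc - n) rest''
          _ -> Just (acc, rest)

    -- e.g. expr "10-2-3"  ==  Just (5, "")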


Pretty neat.

One thing I've found that's nice about recursive descent parsers written by hand is that you can provide useful and meaningful error messages. 'yacc' would always just say "error" (and if you helped it along, the lexer might even supply a line number).

A message like "Expected a number for operator +" is way better than "syntax error line 109 around column 30".
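
A minimal sketch of how a hand-written parser earns those messages (hypothetical names, not any particular library's API): let failure carry a message instead of a bare "no", and wrap each rule with what it expected.

    -- Failure carries a message instead of a bare "no".
    type Parser a = String -> Either String (a, String)

    -- Wrap a rule so its failures say what was expected and where.
    expect :: String -> Parser a -> Parser a
    expect what p = \input -> case p input of
      Left _ -> Left ("Expected " ++ what ++ " near: " ++ show (take 20 input))
      ok     -> ok

    -- e.g. expect "a number for operator +" number
    --      fails with: Expected a number for operator + near: "..."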


That was bad design in "yacc". To get decent diagnostics out of a parser generator of that type, error outputs need to indicate the last point at which the parser was on a unique, successful path, and where the parser input stream is at the error. The problem in the input is between those two points.

Between those points, the parser was working on several alternate possible parsings, none of which could be completed. Give the names of those grammar constructs to the user. So an error message should look like:

    struct point { float x, float y };
                          ^^^^^^^^^ Expected one of:
                          <structure-variable-name>
                          ;
                          }


I don't have experience with yacc, but Bison can generate errors in almost exactly that form. It's even possible to do Clang-style underlining, although it takes some hand-holding.

That said, recursive descent usually gives better errors, but in my experience that's mostly down to better error recovery.


I have played with parser combinators in Scala for a while. The post makes for an interesting read. I like it!


For those who don't know, Prolog allows parsing in very nice ways, very easily, using Definite Clause Grammars:

    parse(Output) --> [Input],
        { member(Input, [ab, cd]), atom_codes(Input, Output) }.

    test(In) :- phrase(parse(L), [In]), writeln(L).

    ?- test(ab).
    [97,98]
    true.

    ?- test(cd).
    [99,100]
    true.

    ?- test(xy).
    false.


I recently had a go at writing a parser combinator library in R, for a laugh. It might be interesting to someone here: https://github.com/chappers/Ramble


From someone not at all involved in the areas of computer science where this might be applied... where might this be used in industry? Very compelling stuff, but I don't have a particular use case...


Any time you have to read a file format that doesn't have a library available already (or if you want to write that library). Plenty of industries have a de-facto standard file format that's basically text, but with some special command sequences or bracketing (e.g. recently I had someone ask me how to read "PGN" files, which turns out to be the standard way of recording chess games). Chess is a trivial example, but the "STL" format used by a CNC milling machine I've worked with is pretty similar.

So imagine I wanted to write a "middleware" piece for that milling machine - maybe it has a slight defect where its width measure isn't exactly the same as its depth measure (I expect modern machines automatically recalibrate themselves, but you can imagine weirder problems that this might not solve). Then I could write a program that would rescale models exported from a CAD application to match the weirdness of my particular machine, by reading the file, making the adjustments, and writing a new file.

For a less industrial example that I've actually done, I worked on a fan translation of a videogame. The scripts were, of course, in a weird format that the original makers came up with, with no publicly available libraries. So I wrote a parser that could separate the actual text lines from the timing information and so forth, allowing the translators to translate.


One example from Rdio's image server code base: https://github.com/rdio/thor/blob/master/src/main/scala/com/...

It's used to parse arbitrarily complex transformations to apply to an image in the URL.


It might be applied every time you need to write a parser (though this isn't the only possible approach, and I rarely use parser combinators myself). And when do you need to write a parser? Oh, I have to write parsers all the time, seriously. It's not as if all data is passed around as JSON or the like nowadays; there are plenty of domains where obscure, custom data formats (often intended primarily to be read by humans, not processed by machines) are used. So when you need to work with such a system and its custom data format, you end up writing your own parser.


I work in industry, and I recently had to implement a special-purpose templating language for our product. For various reasons, none of the off-the-shelf ones quite did what we needed.

As a result I had to write a parser and interpreter. The thing about a lot of these computer science topics is that you don't need them until you do. And when you do, it's nice to know how they work.


Can anyone recommend an article on interpreters that's at exactly the level of the original link?



I guess http://norvig.com/lispy.html would be about right.


Thanks for this, I really enjoyed reading it :)


For some reason, this explanation makes me think about Behaviour Trees in AI more than it does about parsing.



