package pacomb

  1. Overview
  2. Docs

pacomb

  1. What is Pacomb?
  2. OCaml syntax extension
  3. Limitation
  4. Pacomb.Grammar The main module to build grammars
  5. Pacomb.Lex lexing and Pacomb.Lex.give_up
  6. Pacomb.Blank blanks and layout record
  7. Pacomb.Pos positions in file and handling of exceptions
  8. Pacomb.Input build input buffer from string, channel, ...
  9. Pacomb.Charset an efficient representation of character sets

Overview

PaComb implements a representation of grammars with semantic actions (values returned as a result of parsing). Parsing is performed by compiling grammars defined with the Pacomb.Grammar module (or indirectly though a PPX extension) to combinators. The library offers scanner less parsing, but the Pacomb.Lex module provide a notion of terminals and blanks that give a simple way to write grammars in two phases, as usual.

The main advantage of PaComb and similar solutions, contrary to ocamlyacc, is that grammars (compiled or not) are first class values. This allows using the full power of OCaml for manipulating grammars. For example, this is very useful when working with syntax extension mechanisms.

Importantly, performances of PaComb are very good: it is only two to three times slower than grammars generated by ocamlyacc.

PPX syntax extension

Defining languages using the Pacomb.Grammar module directly is cumbersome. For that reason, PaComb provides a BNF-like PPX syntax extension. An example with arithmetic expressions is given below. It can be compiled with command ocamlfind ocamlopt -package pacomb,pacomb.ppx -o calc -linkpkg calc.ml (if it is written to a file calc.ml).

type p = Atom | Prod | Sum
let%parser rec expr p =
  Atom < Prod < Sum
  ; (p=Atom) (x::FLOAT)                        => x
  ; (p=Atom) '(' (e::expr Sum) ')'             => e
  ; (p=Prod) (x::expr Prod) '*' (y::expr Atom) => x *. y
  ; (p=Prod) (x::expr Prod) '/' (y::expr Atom) => x /. y
  ; (p=Sum ) (x::expr Sum ) '+' (y::expr Prod) => x +. y
  ; (p=Sum ) (x::expr Sum ) '-' (y::expr Prod) => x -. y

This works using the extensions [%%parser ...] for structures and [%parser ...] for expression. These are also accessible by suffixing keywords with %parser as in the above example. These ppx extensions extends ocaml expressions with a syntax for grammars of type 'a Grammar.t and modifies the behaviour of let-bindings especially recursive ones to use declare_grammar, set_grammar and grammar_family. Recall that due to the limitation of ppx, we use a sub-syntax of OCaml expressions for grammars. It is therefore not a good idea to use "=>" as an infix inside [%parser ...].

We give below the BNF grammar for the extension, together with a sketch of its semantics.

grammar ::= rule                                                   itself
       | grammar ; rule                                       Grammar.alt
rule ::= qitems => expr                            A rule with its action
       | expr < ... < expr                       priority order see below
       | ERROR(m)                   report an error if parsing fails here
qitems ::= ()                                               Grammar.empty
       | non_empty_qitems                                          itself
       | cond non_empty_qitems                           conditional rule
cond ::= expr bool_op expr
       | not expr
non_empty_qitems ::= qitem                                         itself
       | non_empty_qitems qitems                              Grammar.seq
qitem ::= item                                                     itself
       | (epat :: item)                 give a name if used in the action
       | ((epat,epat) >: item)                     as above, but for dseq
item ::= '...'                                  Grammar.term(Lex.char ())
       | CHAR                                    Grammar.term(Lex.any ())
       | CHAR(c)                                 Grammar.term(Lex.char c)
       | "..."                                Grammar.term(Lex.string ())
       | STRING(s)                             Grammar.term(Lex.string s)
       | NAT                                     Grammar.term(Lex.nat ())
       | INT                                     Grammar.term(Lex.int ())
       | FLOAT                                 Grammar.term(Lex.float ())
       | UTF8                                   Grammar.term(Lex.utf8 ())
       | RE(expr)      Grammar.term(Lex.regexp (Regexp.from_string expr))
       | STRING_LIT                       Grammar.term(Lex.string_lit ())
       | CHAR_LIT                           Grammar.term(Lex.char_lit ())
       | RE(expr)      Grammar.term(Lex.regexp (Regexp.from_string expr))
       | expr                                                      itself
       | ~? expr                                      Grammar.option expr
       | ~? [expr] expr                  Grammar.option_default expr expr
       | ~* expr                                        Grammar.star expr
       | ~* [expr] expr                        Grammar.star_sep expr expr
       | ~+ expr                                        Grammar.plus expr
       | ~+ [expr] expr                        Grammar.plus_sep expr expr
epat ::= lid
       | (epat : coretype)
       | epat = lid                              encoding of [pat as lid]
       | (epat, ..., pat)
       | M.epat

epat correspond to an encoding of patterns in expressions. Beware that _ is invalid, use __ instead. cond is an expression using any test operator: "="|"<"|">"|"<="|">="|"<>"|"=="|"!="|"not"|"&&"|"||".

In action code (expression right of =>), a lid_lpos or lid_rpos will denote respectively the left and right position of the item named lid. a lid_pos will group both lid_lpos and lid_rpos in a record of type Pos.interval. If the item is matched by a tuple and you want to use its position you must use pat = lid syntax to give a name to the whole item.

Action code needs parenthesis or begin ... end if it uses if .. then, pattern matching or sequence.

Beware that inside the scope of the extension, you can use the syntax for grammars everywhere. This allows for some nesting as in:

type p = Atom | Prod | Sum
let%parser rec
     expr p = Atom < Prod < Sum
            ; (p=Atom) (x::FLOAT)                        => x
            ; (p=Atom) '(' (e::expr Sum) ')'             => e
            ; (p=Prod) (x::expr Prod) => ( '*' (y::expr Atom) => x*.y
                                         ; '/' (y::expr Atom) => x/.y)
            ; (p=Sum ) (x::expr Sum ) => ('+' (y::expr Prod) => x+.y
                                         ; '-' (y::expr Prod) => x-.y)

But when using this, beware that x is not available before the final action code, it can not be used for selecting grammar rule. (p::prio) => (x::g p) => x will report p as unbounded. To solve this, you can use dependent sequences, using (x,y)>:item will allow x (but not y) to be used both in the action and the rule. The separation with dependent and non dependent part is crucial as dependent grammar are memoised and you don't want "noise". Here is an example of grammar using this to implement an extensible calculator (see tests/calc_ext.ml):

let%parser rec
 expr pmax = ((pe,e1)>:expr pmax)((pop,b)>:op pe pmax)
               ((__,e2)::expr pop)                =>  (pop, b e1 e2)
            ; (x::FLOAT)                          => (0.0,x)
            ; '(' (e::expr_top) ')'               => (0.0,e)

Here are the meaning of let bindings for grammars through the ppx extension:

  • non recursive let bindings correspond to just a name for the grammar.
  • recursive let bindings correspond either to
  • Grammar.declare_grammar + Grammar.set_grammar (if no parameter)
  • Grammar.grammar_family + setting the grammar if parameters are given. multiple parameter and using label are supported through curryfication by the ppx extension.

For recursive grammar with exactly one parameter, a rule p_1 < p_2 < ... < p_n will automatically add rules to include the grammar parametrized by p_i in the grammar parametrized by p_(i+1). This was used by the calculator example above.

let%parser accepts three attribute:

  • [@cache] to cache the grammar (that is call Grammar.cache on the grammar)
  • [@merge f] to apply f if two parsetrees are possible for the same input. This corresponds to Grammar.cache ~merge:f.
  • [@layout blank] or [@layout blank ~config:expr] to apply Grammar.layout and change the blank characters for the grammar.

Anything which does not correspond to this grammar will be unchanged in the ocaml code (like the type definition in the example above). A mutually recursive definition can also mix the definition of grammars (parametric of not) with the definition of normal ocaml values. This means you could put the whole file inside %%parse ....

Limitations

Pacomb must eliminate left recursion in grammars in order to use combinators that would loop otherwise. However, left recursion is not supported if it traverses A Pacomb.Grammar.layout constructor to change blanks (probably possible to solve this, but probably not worth it).

Grammars are not left factorised automatically: (A B) | (A C) may parse A twice and this may result in very poor performance. Two solutions:

  • left factorise your grammar yourself,
  • Use Grammar.cache trading memory for speed.

Note: left recursion do not need and is not eliminated if the grammar uses a cache. However, this solution to use combinator is too slow for non ambiguous grammars so we do not impose a cache to all left recursive grammars.

The ppx extension is not too bad but still suffers from the fact that it uses a sub-language of OCaml to describe grammar. For instance let%parser g = ((_,x)::g) => x is not legal because _ cannot be used in an Ocaml expression. Though the following works: let%parser g = ((__,x)::g) => x. The syntax (((x,y) = z) :: g) => (x,y,z_pos) is not very nice as we use = to replace the as keyword and we also need a lot of parentheses.

OCaml

Innovation. Community. Security.