#parser-generator #parser #nom #context-free-grammar #macro

tree-builder

Parser Generator library which uses Nom and procedural macros to generate recursive-descent parsers

2 releases

0.0.3 Jul 25, 2023
0.0.2 Jul 25, 2023
0.0.1 Jul 14, 2023

#146 in Parser tooling

GPL-3.0 license

32KB
198 lines

TreeBuilder, a lightweight, nom-based parser generator library for Rust

TreeBuilder is a parser generator focused on the correct parsing of context-free grammars from ASCII input strings. It generates recursive-descent, backtracking parsers with as little code as possible and with convenient features. The parsers it generates aim to be very easy to use and to be compatable with Nom parsers if the user wishes to use Nom together with TreeBuilder. TreeBuilder will not be supporting left-recursive grammars in this version.

TreeBuilder's [Parser] trait shows the type signature of all TreeBuilder-generated parsers.

Example:

use tree_builder::{build_tree, Parser};

build_tree! {
    // Parses a Hex color
    HexColor #=> "#",
                 HexDigit,
                 HexDigit?, HexDigit?, HexDigit?, HexDigit?, HexDigit?;

    // Parses a Hex digit
    HexDigit #=> [0-9a-fA-F];
}

fn main() {
    // Will pass
    HexColor::parse("#abd325").unwrap();
    HexColor::parse("#asd000").unwrap();
    // Will fail
    HexColor::parse("#whatever").unwrap();
}

Kinds of rules

TreeBuilder can either generate parsers whose output data structure is just a String, or parsers whose output data structure is a more complicated tuple type. If you use the arrow #=> to separate the rule name from it's definition, you have created a lexical rule which outputs String, but if you use =>, you have the option to specify what parts of a rule definition you wish to keep.

You specify the parts you wish to keep using the @ operator (called Include). Data structures for parsers are based on the includes you have made, from left to right.

Let's take for example the rules Arraya and ArrayElems which parse JSON arrays below:

build_tree!{
    JValue => /* Definition of a JSON value */;

    Array => "[", #s*, @ArrayElems?, #s*, "]";
    ArrayElems => @JValue, #s*, @(",", #s*, @JValue, #s*)*;
}

The data structures generated for the rules above are:

struct Array(Optional<Box<ArrayElems>>);

struct ArrayElems(Box<JValue>, Vec<Box<JValue>>);

When you use alternations, the data structures generated are enums, whose variant names are either inferred when you only include a single non-terminal in an alternation, or need to be specified.

Alternation rule example:

build_tree!{
// Rule representing a JSON value
JValue => @JString
       |  @Number
       |  @Object
       |  @Array
       |  "true" <True>
       |  "false" <False>
       |  "null" <Null>;
}

And the data structure generated by this rule:

enum JValue {
    JString(Box<JString>),
    Number(Box<Number>),
    Object(Box<Object>),
    Array(Box<Array>),
    True(),
    False(),
    Null(),
}

Keep in mind that in the rules denoted by =>, you cannot use alternations inside of groupings at the moment. This limitation will probably be fixed in the next update.

TreeBuilder's language

TreeBuilder allows for the specification of grammars which use terminals, non-terminals, metacharacters and groupings/sub-parsers. Different operators can be applied to the aforementioned elements which define how many times one of them may be repeated.

Available terms:

  1. Terminals: Terminals are written just like strings in any C-like language. They are defined as a string of characters which can be either any character except ’\’ and ’”’, or a character escape sequence such as ’\n’, delimited by double quotes. The possible escape sequences are:

    • \n (newline)
    • \r (carriage return)
    • \t (tab)
    • \b (non-destructive backspace)
    • \f (form feed)
    • \ (backslash)
    • \” (double quote)

    Some examples of terminals:

    • ”for”
    • ”if”
    • ”a123\nb!?.\tc\””
  2. Metacharacters: A metacharacter is a special character that carries a specific meaning or functionality within a grammar specification. Unlike terminals, which parse their string equivalent, metacharacters have special interpretations and serve as building blocks for constructing powerful and elegant parser rules with as little code as possible. Here are the metacharacters and their meanings:

    • . (Dot): Matches any single character except a newline. It acts as a wildcard, repre- senting any character in the input string.
    • [ ] (Character Class): Matches any single character specified inside the square brack- ets. For example, [aeiou] matches any vowel character.
    • [^ ] (Negated Character Class): Matches any single character that is not specified inside the square brackets. For example, [^0-9] matches any non-digit character.
    • #d: Matches any digit character. It is equivalent to the character class [0-9]. For example, #d would match any digit from 0 to 9.
    • #D: Matches any non-digit character. It is equivalent to the character class [^0-9]. For example, #D would match any character which isn’t a digit.
    • #w: Matches any word character. It includes alphanumeric characters (a-z, A-Z, and 0-9) as well as the underscore () character. It is equivalent to the character class [a-zA-Z0-9].
    • #W: Matches any non-word character. It excludes alphanumeric characters (a-z, A- Z, and 0-9) as well as the underscore () character, matching anything else. It is equivalent to the character class [^a-zA-Z0-9].
    • #s: Matches any ASCII whitespace character, so matches any one of \f, \n, \r, \t.
  3. Nonterminals: Their syntax is exactly the same as ASCII Rust identifiers. A string be- ginning with either an alphabetic character or an underscore, followed by a string of either alphanumerics, an underscore, or both. Their use in the definition of grammars is to spec- ify that another rule is part of the rule you are currently specifying. So the parser which will be generated from a rule which uses a nonterminal will apply the parser related to the nonterminal at the point specified by the user.

  4. Groupings: grouping refers to the act of enclosing a sequence of elements or subexpres- sions within parentheses. It allows for establishing precedence and controlling the evalu- ation order of elements within a grammar specification. These are grouped as one of the smaller elements of the language because other operators can be applied groupings just like they can be applied to the other terms.

Available operators:

Operator Description
e | e Alternation
@e Include
e? Optional
e* Zero-or-more
e+ One-or-more

Alternations in TreeBuilder are ordered-choice ones, meaning that out of all the alternatives that you may define for a rule, only one of the alternatives can succeed. This occurs because TreeBuilder will try alternatives from the one defined first all the way to the last one in order, and if any of the alternatives successfully parses for an input string, the alternations after it are ignored completely.

Dependencies

~3.5–5MB
~89K SLoC