4 releases
0.2.0 | Nov 27, 2023 |
---|---|
0.1.3 | Nov 21, 2023 |
0.1.2 | Nov 14, 2023 |
0.1.1 | Aug 30, 2023 |
#164 in Parser tooling
33 downloads per month
67KB
2K
SLoC
tuck5: The Next Attempt
(at making a parser generator)
Tuck5 is a lexer/parser generation library written in Rust with a dual focus on
performance and ease of use. The meta
system is the main, and recommended,
method of interacting with tuck5. It provides a domain-specific language that
allows developers to easily define how to transform a set of tokens. Under the
hood, this simple syntax is executed as quick for-loops, with an unprecedented
level of developer control of execution.
Understanding the TUCK nature
tuck5 uses a system of tokens, as is to be expected by a tokenizer framework.
Tokens are either branches or leaves. If a token is a leaf, it contains a range
of indices into its corpus text. If it is a branch, it contains a vector of
tokens as its children. Tokens also contain some extra data: as the developer,
you are able to statically control this data to fit your needs. (Note: if you
are using the meta
system, this auxiliary data takes the form of a vector of
owned strings, which act as tags. A transition is planned to string pointers
or perhaps a bitflags-based system. This does not affect the experience of
using the meta
system.)
To parse a corpus string, the characters of the string are initially converted
into a series of leaf tokens where each token has two tags: a
where a
is
the content of the token, u97
where 97
is the decimal conversion of the
UTF-8 value of the character, and ws
ONLY if the character is classified as
Unicode whitespace.
Then, a series of rules are applied to the tokens. A rule consists of a
condition that may match a sequence of tokens, a transformation method
(whether a matching sequence is converted into a single leaf token, a single
branch token, or removed entirely), and a series of tags to apply to the token
that the matching tokens will be converted into. For examples of this, see the
section [[#Using the meta
system]].
Each rule is applied on each token, one-by-one. Many rules accept multiple tokens. They have specifically defined behavior for reaching the end of the corpus string. Each rule is repeated applied to the corpus string until it no longer can be applied. If specified, a group of rules can be applied in sequence, repeatedly, until none of them can be applied.
Using the meta
system
Define a meta-parser (a parser that is generated by the meta
system) with
this syntax.
Rules
'abc'
for any string abc
: matches if the next token's content exactly
equals the characters between the '
s.
a..z
for any characters a
and z
: matches if the next token is one
character long, and that character is between a
and z
inclusive.
{.abc.}
for any string abc
with x characters: matches if the next x
tokens exactly match each character in abc
.
tag
for any string tag
: matches if the next token has the tag tag
.
rule1 & rule2
: matches if rule1
applies from token x to token y, then
rule2
applies from token y + 1 to token z. Can be chained.
rule1 | rule2
: matches if either rule1
or rule2
applies. Can be chained.
rule?
: If rule
matches the next x tokens, this matches for x tokens.
Otherwise, matches 0 tokens. Not intended to be used alone (will probably lock
your parser up until warnings are implemented).
rule*
: Repeatedly matches rule
. If rule
matches tokens x to y,
tries again on tokens y to z and so on. If rule
doesn't match, matches
all the tokens that rule
did match on. Once again, not intended to be used
alone, as it will match even if rule
matched 0 times.
rule+
: Repeatedly matches rule
after matching it one time: see above.
Equivalent to rule & rule*
. Can be used alone.
Transformations
rule . tag0, tag1, etc;
: If rule
matches the next x tokens, transforms
those tokens into one leaf token (discarding the data of the "children") with
the specified tags. More performant than the alternative below.
rule : tag0, tag1, etc;
: If rule
matches the next x tokens, transforms
those tokens into one branch token (with the transformed tokens as the
children) with the specified tags. Useful for converting into an AST.
rule~;
: If rule
matches the next x tokens, removes those tokens. Most
often used for whitespace removal, but if AST generation is the goal, it can
also be used for comments.
Miscellaneous
# bla bla bla (line break)
: A one-line comment.
## bla bla bla ##
: A multi-line comment.
{ transformation transformation ... }
: Apply each of these transformations
in order until none of them match. Useful for preserving an order of
operations.
(rule)
: The same as a rule. Useful for the order of operations of the meta
system or for clarity.
Examples
See the calculator
folder in tests
for a full example of the meta
system.
Here is an explanation of a simpler version of it.
(0..9)+. positive, int, expr;
Here, we define our positive integers. 0..9
represents a range of characters,
which will match against the digits from 0 to 9. The +
signifies to match
against one or more digits.
int & '.' & int. positive, decimal, expr;
Here, int
matches against any token we recently defined with the tag int
.
If the sequence of an integer, followed by a decimal point, followed by an
integer (which, as we defined, can have leading zeroes) is found, these are
transformed into a decimal token.
'-' & positive. negative, expr;
Now we define the negative numbers. If a minus sign is matched, then a positive
number, we apply the negative
tag.
ws~;
Here, we remove whitespace, as specified by the pre-defined ws
tag.
{
'(' & expr & ')': parens, expr;
}
The braces around this statement means that there are more to follow it, and
they should be executed in sequence, repeatedly. We will see this in action
when we add the operations. For now, if parentheses surround a token with the
expr
tag (which you may have noticed was added to every number), all three
tokens will be transformed into one with the parens
and expr
tags.
{
'(' & expr & ')': parens, expr;
expr & '*' | '/' & expr: oper, expr;
}
Here, we add multiplication and division to our repeated transformations. With the braces, you'll find that any sequence of nested parentheses, multiplications, and divisions will follow the rules of PEMDAS.
{
'(' & expr & ')': parens, expr;
expr & '*' | '/' & expr: oper, expr;
expr & '+' | '-' & expr: oper, expr;
}
Finally, we complete our simple calculator by adding addition and subtraction.
To access our tagged tokens in Rust, use the tuck5::meta::eval_prog_from_text
function with text of your meta-parser and the text you'd like parsed. The
result will be a vector of tokens.