Blex is a lightweight lexing framework written in Rust, based on the T-Lex framework. Blex is built around the concept of defining a set of rules that process a string of tokens. Each rule is run over the input string successively, gradually transforming it into an output string of tokens.
Tokens
The implementation of tokens itself is the idea most heavily borrowed from Rustuck. Tokens in Blex are represented as a contiguous range of characters from the original string, along with a set of tags, themselves also represented as borrowed strings. For example, the string let x: int = 2; might be represented as:
"let": var_dec, keyword
"x": ident
":": colon, keyword
"int": ident
"=": eq, keyword
"2": int
Notice the lack of whitespace: tokens don't have to cover the entire string.
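To make the shape of a token concrete, here is a minimal sketch of a token type along those lines. This is illustrative only and uses a hypothetical name; Blex's actual Token type may be defined differently:

// Hypothetical sketch, not Blex's actual Token definition:
// a token is a range into the original string plus a set of tags.
struct SketchToken<'a> {
    source: &'a str,                // the original input string
    range: std::ops::Range<usize>,  // which characters this token covers
    tags: Vec<&'a str>,             // e.g. ["var_dec", "keyword"]
}

impl<'a> SketchToken<'a> {
    // The text this token covers, borrowed from the original string.
    fn text(&self) -> &str {
        &self.source[self.range.clone()]
    }
}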
Rules
A rule in Blex is any function that transforms a vector of tokens into an optional vector of tokens. In Rust, that corresponds to this trait bound: Fn(Vec<Token>) -> Option<Vec<Token>>. Rules may modify the input tokens without worrying about mutating the original string: tokens are cloned before the rules are applied to them (luckily, cloning is a very cheap operation for tokens, which are just a range and a few pointers).
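For instance, the simplest function satisfying that bound just hands its tokens back unchanged; this is a toy rule for illustration, not something Blex provides:

// A toy rule: accept whatever tokens are passed in and return them unchanged.
fn identity_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    Some(tokens)
}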
Rules are processed with the process_rule and process_rules functions. The process_rule function applies the rule starting on each token in a list. Rules are applied starting with a single token, which is wrapped in a Vec, passed into the function, and processed. If the function returns None, the starting token and the next token are combined into a Vec and passed into the function again. If the function returns Some, the tokens passed in are replaced with the returned Vec.
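In other words, process_rule grows a window of tokens until the rule accepts it. The loop below is only a rough sketch of that behavior as just described, assuming Token implements Clone; it is not the crate's actual implementation and may differ from it at the edges:

// Rough sketch of the scanning behavior described above (not Blex's own code).
fn process_rule_sketch<R>(rule: R, body: &mut Vec<Token>)
where
    R: Fn(Vec<Token>) -> Option<Vec<Token>>,
{
    let mut start = 0;
    while start < body.len() {
        let mut end = start + 1;
        loop {
            // Clone the current window of tokens and hand it to the rule.
            match rule(body[start..end].to_vec()) {
                Some(replacement) => {
                    // The rule finished: splice its output in place of the window.
                    let step = replacement.len().max(1);
                    body.splice(start..end, replacement);
                    start += step;
                    break;
                }
                None if end < body.len() => {
                    // The rule wants more context: extend the window by one token.
                    end += 1;
                }
                None => {
                    // No tokens left to add: leave the window as-is and move on.
                    start += 1;
                    break;
                }
            }
        }
    }
}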
Rule Processing
Rules are processed on strings of tokens, so plain strings of characters must first be transformed into tokens using the str_to_tokens function. Each character of the string is turned into a corresponding token, and one tag is added with the content of the character, a holdover from Rustuck.
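For example, converting a short string and printing it, using the same str_to_tokens and print_tokens calls as the test at the end of this document, would show one token per character, each tagged with its own text:

#[test]
fn single_characters() {
    let body = str_to_tokens("ab");
    // Expect a token "a" tagged a, a token "b" tagged b,
    // plus the empty buffer token that is added automatically (see the example below).
    print_tokens(body);
}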
An Example
Let's create a rule that takes two consecutive tokens that read ab and converts them into one token with the tag c. The rule would start out like this:
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
}
Let's say we input the string a b blex ab abab. The string will be turned into these tokens:
"a": a;
" ": ;
"b": b;
" ": ;
"b": b;
"l": l;
"e": e;
"x": x;
" ": ;
"a": a;
"b": b;
" ": ;
"a": a;
"b": b;
"a": a;
"b": b;
"":
Notice the empty token at the end. We didn't type that: it was added automatically to give some rules, like those testing for words, a buffer. Our rule will start by scanning each token individually. Remember that we are scanning for the pattern "ab". Let's chart out each possible route our rule could take.
- It will start by finding one token containing one character.
  - If the token has the tag a, we should continue the rule to include the next token.
  - Otherwise, we should stop the rule and return the token unchanged.
- If multiple tokens are found, then based on the last rule it must be a pair of tokens with the first one being a. Knowing that, there are two paths we can take:
  - If the second token has the tag b, we should combine those tokens and give the result the tag c.
  - Otherwise, we should stop the rule and return the tokens unchanged.
Luckily, Blex has idiomatic ways to express this. We'll start by matching on tokens_structure(&tokens), which has two options:
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    match tokens_structure(&tokens) {
        TokenStructure::Single(tok) => {
            // the rule has been given exactly one token
        },
        TokenStructure::Multiple => {
            // the rule has been given more than one token
        }
    }
}
The tokens_structure function takes in a borrowed Vec of tokens. If it finds that the Vec holds only one token, it returns that token wrapped in a Single. Otherwise, it returns Multiple. This is a safe way to guarantee the existence of one token.
There is also an idiomatic way to test a token for certain tags: the has_tag method. We can use this to test both cases for the a and b tags respectively.
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    match tokens_structure(&tokens) {
        TokenStructure::Single(tok) => {
            if tok.has_tag("a") {
                // case 1
            } else {
                // case 2
            }
        },
        TokenStructure::Multiple => {
            if tokens[1].has_tag("b") {
                // case 3
            } else {
                // case 4
            }
        }
    }
}
The only thing left is the return values:
- In case 1, we want to continue the rule to the next token. We do this by returning None.
- In cases 2 and 4, we want to end the rule and return the tokens unchanged. We simply do this by returning Some(tokens).
- In case 3, we want to combine both of our tokens and give them the tag c. Of course, there is an idiomatic way to do this as well: wrap. The wrap function takes a Vec of tokens (which are assumed to be from the same string of tokens) and combines all of their contents into one token. The second argument of wrap is a Vec<&str>, which contains all of the tags to add to the new token.
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    match tokens_structure(&tokens) {
        TokenStructure::Single(tok) => {
            if tok.has_tag("a") {
                None
            } else {
                Some(tokens)
            }
        },
        TokenStructure::Multiple => {
            if tokens[1].has_tag("b") {
                Some(vec![wrap(tokens, vec!["c"])])
            } else {
                Some(tokens)
            }
        }
    }
}
That completes our rule! We can test it with a script like this:
#[test]
fn apply_ab() {
    let text = "a b blex ab abab";
    let mut body = str_to_tokens(text);
    process_rule(ab_rule, &mut body);
    print_tokens(body);
}
Which gives us:
"a": a;
" ": ;
"b": b;
" ": ;
"b": b;
"l": l;
"e": e;
"x": x;
" ": ;
"ab": c;
" ": ;
"ab": c;
"ab": c;
"":