blex

A lightweight lexing framework

4 releases

0.2.2 Mar 24, 2023
0.2.1 Mar 22, 2023
0.2.0 Mar 22, 2023
0.1.0 Mar 11, 2023

#147 in Parser tooling


Used in blarse

GPL-3.0 license

22KB
355 lines

Blex is a lightweight lexing framework built in Rust, based on the T-Lex framework. Blex is built around the idea of defining a set of rules that process tokens: each rule is run over the input successively, gradually transforming the original string of characters into an output string of tokens.

Tokens

The implementation of tokens is the idea borrowed most heavily from Rustuck. Tokens in Blex are represented as a contiguous range of characters from the original string, along with a set of tags, themselves also represented as borrowed strings. For example, the string let x: int = 2; might be represented as:

"let": var_dec, keyword
"x": ident
":": colon, keyword
"int": ident
"=": eq, keyword
"2": int

Notice the lack of whitespace: the tokens don't have to cover the entire string.
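To make the model concrete, here is a rough sketch of what such a token amounts to. The type and field names below are made up for illustration; they are not blex's actual Token definition.

// An illustrative model of a token: a range into the original string
// plus a set of tags. This is a sketch, not blex's real Token type.
struct SketchToken<'a> {
	source: &'a str,               // the original input string
	range: std::ops::Range<usize>, // the characters this token covers
	tags: Vec<&'a str>,            // e.g. vec!["var_dec", "keyword"]
}

impl<'a> SketchToken<'a> {
	// The slice of the original string this token represents.
	fn content(&self) -> &'a str {
		&self.source[self.range.clone()]
	}
}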

Rules

A rule in Blex is any function that transforms a vector of tokens into an optional vector of tokens. In Rust, that corresponds to the trait bound Fn(Vec<Token>) -> Option<Vec<Token>>. Rules may modify the input tokens without worrying about mutating the original string: tokens are cloned before a rule is applied to them (luckily, cloning is a very cheap operation for tokens, which are just a range and a few pointers).
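For example, the simplest function satisfying this bound is a do-nothing rule that accepts every single token as-is. This is only a toy illustration of the signature, not part of blex:

// A toy rule satisfying Fn(Vec<Token>) -> Option<Vec<Token>>:
// it commits to every window immediately and changes nothing.
fn identity_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
	Some(tokens)
}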

Rules are processed with the process_rule and process_rules functions. The process_rule function applies a rule starting at each token in a list. The rule is first applied to a single token, which is wrapped in a Vec and passed into the function. If the function returns None, the next token is appended to the Vec and the function is called again. If the function returns Some, the tokens that were passed in are replaced with the returned Vec.
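In other words, process_rule keeps widening a window of tokens until the rule commits to a result, splices the result back into the stream, and moves on to the next starting token. The sketch below illustrates that loop; it is written generically over any cloneable token type and is only a model of the behaviour described above, not blex's actual implementation.

// An illustrative model of the widening-window loop described above,
// written generically over any cloneable token type T. Not blex's
// actual process_rule implementation.
fn process_rule_sketch<T, R>(rule: R, tokens: &mut Vec<T>)
where
	T: Clone,
	R: Fn(Vec<T>) -> Option<Vec<T>>,
{
	let mut start = 0;
	while start < tokens.len() {
		let mut end = start + 1;
		loop {
			// Hand the rule a clone of the current window of tokens.
			let window = tokens[start..end].to_vec();
			match rule(window) {
				// Some: the rule has committed; splice its output in
				// place of the window and move to the next starting token.
				Some(replacement) => {
					tokens.splice(start..end, replacement);
					start += 1;
					break;
				}
				// None: the rule wants to see one more token.
				None => {
					if end == tokens.len() {
						// Nothing left to add; give up on this start.
						start += 1;
						break;
					}
					end += 1;
				}
			}
		}
	}
}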

Rule Processing

Rules operate on strings of tokens, so strings of characters are first transformed into tokens using the str_to_tokens function. Each character of the string becomes its own token, carrying a single tag whose content is that character, a holdover from Rustuck.

An Example

Let's create a rule that takes two consecutive tokens that read ab and converts them into one token with the tag c. The rule would start out like this:

fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
	todo!() // we'll fill this body in step by step below
}

Let's say we input the string a b blex ab abab. The string will be turned into these tokens:

"a": a; 
" ":  ; 
"b": b; 
" ":  ; 
"b": b; 
"l": l; 
"e": e; 
"x": x;
" ":  ;
"a": a;
"b": b;
" ":  ;
"a": a;
"b": b;
"a": a;
"b": b;
"":

Notice the empty token at the end. We didn't type that: it was added automatically to give some rules, like those testing for words, a buffer. Our rule will start by scanning each token individually. Remember that we are scanning for the pattern "ab". Let's chart out each possible route our rule could take.

  • It will start by finding one token containing one character.
    • If the token has the tag a, we should continue the rule to include the next token.
    • Otherwise, we should stop the rule and return the token unchanged.
  • If multiple tokens are found, then based on the previous step it must be a pair of tokens whose first token has the tag a. Knowing that, there are two paths we can take:
    • If the second token has the tag b, we should combine the two tokens and give the result the tag c.
    • Otherwise, we should stop the rule and return the tokens unchanged.

Luckily, blex has idiomatic ways to express this. We'll start by matching on tokens_structure(&tokens), which has two variants:

fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
	match tokens_structure(&tokens) {
		TokenStructure::Single(tok) => {

		},
		TokenStructure::Multiple => {

		}
	}
}

The tokens_structure function takes in a borrowed Vec of tokens. If it finds that the Vec holds only one token, it returns that token wrapped in a Single. Otherwise, it returns Multiple. This is a safe way to guarantee the existence of exactly one token before working with it. There is also an idiomatic way to test a token for a given tag: the has_tag method. We can use it to handle both cases, testing for the a and b tags respectively.

fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
	match tokens_structure(&tokens) {
		TokenStructure::Single(tok) => {
			if tok.has_tag("a") {
				// case 1
			} else {
				// case 2
			}
		},
		TokenStructure::Multiple => {
			if tokens[1].has_tag("b") {
				// case 3
			} else {
				// case 4
			}
		}
	}
}

The only thing left is the return values:

  • In case 1, we want to continue the rule to the next token. We do this by returning None.
  • In cases 2 and 4, we want to end the rule and return the tokens unchanged. We simply do this by returning Some(tokens).
  • In case 3, we want to combine our two tokens into one and give it the tag c. Of course, there is an idiomatic way to do this as well: wrap. The wrap function takes a Vec of tokens (which are assumed to come from the same string of tokens) and combines all of their contents into one token. Its second argument is a Vec<&str> containing all of the tags to add to the new token.

fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
	match tokens_structure(&tokens) {
		TokenStructure::Single(tok) => {
			if tok.has_tag("a") {
				None
			} else {
				Some(tokens)
			}
		},
		TokenStructure::Multiple => {
			if tokens[1].has_tag("b") {
				Some(vec![wrap(tokens, vec!["c"])])
			} else {
				Some(tokens)
			}
		}
	}
}

That completes our rule! We can test it with a script like this:

#[test]
fn apply_ab() {
	let text = "a b blex ab abab";
	let mut body = str_to_tokens(text);
	process_rule(ab_rule, &mut body);
	print_tokens(body);
}

Which gives us:

"a": a; 
" ":  ; 
"b": b; 
" ":  ; 
"b": b; 
"l": l; 
"e": e;
"x": x;
" ":  ;
"ab": c;
" ":  ;
"ab": c;
"ab": c;
"":

No runtime deps