#lexing

lexr

Flexible, powerful and simple lexing in Rust

1 unstable release

0.1.0 Nov 19, 2023

#142 in Parser tooling

MIT license

29KB
263 lines

lexr

Lexr is a simple and flexible lexing library for Rust. It is designed to be used on its own, or in conjunction with parsr.

The syntax consists of a single macro, lex_rule!, which is used to define a lexing rule. The macro generates a function that can be called to produce a lexer. The lexer is an iterator over the input string that produces tokens and their locations as it goes.

If you encounter any issues or have suggestions, please report them here.

Here is a simple example of a lexer that recognizes the tokens A, B, and C:

use lexr::lex_rule;
#[derive(Debug, PartialEq)]
enum Token {
    A, B, C,
}
use Token::*;

lex_rule!{lex -> Token {
    "a" => |_| A,
    "b" => |_| B,
    "c" => |_| C,
}}

let tokens = lex("abc").into_token_vec();
assert_eq!(tokens, vec![A, B, C]);

Macro Syntax

The lex_rule! macro is used to define a lexer. The lex rule has a name, a token type, and any number of patterns with associated actions. The syntax is as follows:

lex_rule!{NAME(ARGS) -> TOKEN {
    PATTERN => ACTION,
    ...
}}
  • NAME is the name of the function that is generated by the macro. This function can be called to produce a lexer.
  • ARGS is an optional list of arguments that are passed to the lexer.
  • TOKEN is the type of the tokens that the lexer produces. This can be any type, including the unit type ().
  • PATTERN is a pattern that the lexer matches against the input. If the pattern matches, the action is executed.
  • ACTION is an expression that is executed if the pattern matches. The expression must either evaluate to a token, or continue, or break.

The rules consist of a pattern and an action resulting in a token.
The order of the patterns is important, as the first that matches is chosen.
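
Why the order matters can be shown with a small standalone sketch. It does not use lexr and is not how lexr is implemented: it matches literal prefixes with the standard library instead of regular expressions, and the names Tok, first_match and lex_all are hypothetical, chosen only for this illustration.

```rust
/// Tokens for this sketch (hypothetical, not part of lexr).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tok {
    Arrow,
    Minus,
    Gt,
}

/// Tries each (prefix, token) rule in order and returns the first rule that
/// matches the start of `input`, together with the number of bytes consumed.
fn first_match(rules: &[(&str, Tok)], input: &str) -> Option<(Tok, usize)> {
    rules
        .iter()
        .find(|(prefix, _)| input.starts_with(*prefix))
        .map(|(prefix, tok)| (*tok, prefix.len()))
}

/// Applies `first_match` repeatedly until the input is exhausted or no rule
/// matches. Rules are assumed to have non-empty prefixes.
fn lex_all(rules: &[(&str, Tok)], mut input: &str) -> Vec<Tok> {
    let mut out = Vec::new();
    while let Some((tok, len)) = first_match(rules, input) {
        out.push(tok);
        input = &input[len..];
    }
    out
}
```

With the rules ordered as "->", "-", ">", the input "->-" yields Arrow, Minus. If "-" is listed before "->", the same input yields Minus, Gt, Minus, because the shorter rule wins first and the arrow rule is never reached. The same reasoning suggests placing more specific patterns before more general ones in a lex_rule! definition.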

Patterns

Patterns are matched against the beginning of the remaining input, in the order they are defined.

Patterns can be the following:

  • One or more string slice literals or constants. These strings are concatenated, and the result is used for regex matching.
  • A wildcard _ that matches any single character. This does not match eof.
  • eof, which matches the end of the input. This pattern is optional; if it is not provided, the end of the input is simply ignored.
  • ws, which matches any whitespace character.

Here is an example showing the different legal patterns:

use lexr::lex_rule;
#[derive(Debug, PartialEq)]
enum Token {
    A, B, C, D, Num, Eof
}
use Token::*;

const A_RULE: &str = "a";

lex_rule!{lex -> Token {
    ws => |_| continue, // Matches whitespace
    "a" => |_| A, // Matches "a"
    "b" "a" => |_| B, // Matches "ba" (the literals are concatenated)
    "c" A_RULE => |_| C, // Matches "ca" (literal followed by a constant)
    r"[0-9]+" => |_| Num, // Matches one or more digits
    _ => |_| D, // Matches any single character
    eof => |_| Eof, // Matches the end of the input
}}

let tokens = lex("a ba ca S 42").into_token_vec();
assert_eq!(tokens, vec![A, B, C, D, Num, Eof]);

Actions

An action is a closure returning the token type provided in the macro definition.
It runs when the pattern matches, and can be used to produce a token, skip the match, or stop lexing altogether.

Signature

There are 3 different signatures for the closure, which can be used to provide different parameters to the action:

  • |s| - The action is provided with the matched string
  • |s, buf| - The action is provided with the matched string and a buffer. The buffer can be used to lex a subrule.
  • |s, buf, loc| - The action is provided with the matched string, a buffer, and a location. The location is the location of the matched string in the input.

Only the first argument is required; the rest are optional. Any of them can be ignored with an underscore _.
This means that if no arguments are needed, the signature can be written as |_|.
For instance, if only the location is of interest, the other arguments can be ignored: |_, _, loc|.
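
To illustrate what pairing a matched string with its location can look like, here is a standalone sketch that does not use lexr. The Loc type and the words_with_locs function are hypothetical; in particular, representing a location as a pair of byte offsets is an assumption made for this example, and lexr's own location type may look different.

```rust
/// A hypothetical location type: byte offsets of a match in the input.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Loc {
    start: usize,
    end: usize,
}

/// Splits `input` into whitespace-separated words, pairing each word with
/// the byte range it occupies in the original string.
fn words_with_locs(input: &str) -> Vec<(&str, Loc)> {
    let mut out = Vec::new();
    let mut start = None;
    for (i, c) in input.char_indices() {
        match (c.is_whitespace(), start) {
            // A word begins at the first non-whitespace character.
            (false, None) => start = Some(i),
            // A word ends at the next whitespace character.
            (true, Some(s)) => {
                out.push((&input[s..i], Loc { start: s, end: i }));
                start = None;
            }
            _ => {}
        }
    }
    // Flush a word that runs to the end of the input.
    if let Some(s) = start {
        out.push((&input[s..], Loc { start: s, end: input.len() }));
    }
    out
}
```

Slicing the input with a returned range gives back the matched string, which is the property that makes a location useful for error reporting.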

Action

An action body can be any expression that evaluates to a token, or that continues or breaks.

Continue and break work as follows:

  • continue - This skips the current token and returns the next token instead.
  • break - This stops the lexer and thus the iterator will return None when this is encountered.

Notably, it is possible to call sub rules (see the Sub Rules section) from an action.

Here is an example showing the different legal actions:

use lexr::lex_rule;
#[derive(Debug, PartialEq)]
enum Token {
    A, Num(i32), Eof
}
use Token::*;

lex_rule!{lex -> Token {
    // Returns A
    "a" => |_| A,
    // Matches any whitespace and skips it
    r"[ \n\t\r]" => |_| continue,
    // Stops the lexer
    "x" => |_| break,
    // Calls the sub rule and runs it until it is done
    "#" => |_, buf| { comment(buf).deplete(); continue },
    // Parses the number and returns it
    r"[0-9]+" => |s| Num(s.parse().unwrap()),
    // Detects the end of the input and returns Eof
    eof => |_| Eof,
}}

// A simple rule that ignores all characters until a '#' is encountered
lex_rule!{comment -> () {
    "#" => |_| break,
    _ => |_| continue,
}}

let tokens = lex("a # comment # 42 a").into_token_vec();
assert_eq!(tokens, vec![A, Num(42), A, Eof]);

let tokens = lex("aa 12 x aa").into_token_vec();
assert_eq!(tokens, vec![A, A, Num(12)]);
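
The continue and break behavior described in this section maps naturally onto an iterator. The following standalone sketch does not use lexr and makes no claim about what the macro actually generates: the Step enum, the Lexer struct and digit_action are hypothetical names, and the sketch works on single characters rather than regex matches.

```rust
/// What an action asks the lexer to do (hypothetical names for this sketch).
enum Step<T> {
    Emit(T), // produce a token
    Skip,    // like `continue`: drop the match and try the next one
    Stop,    // like `break`: end the token stream
}

/// A tiny character-level lexer that drives an action and implements Iterator.
struct Lexer<'a, T> {
    chars: std::str::Chars<'a>,
    action: fn(char) -> Step<T>,
    stopped: bool,
}

impl<'a, T> Lexer<'a, T> {
    fn new(input: &'a str, action: fn(char) -> Step<T>) -> Self {
        Lexer { chars: input.chars(), action, stopped: false }
    }
}

impl<'a, T> Iterator for Lexer<'a, T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        if self.stopped {
            return None;
        }
        // Keep consuming input until an action emits a token or stops the lexer.
        while let Some(c) = self.chars.next() {
            match (self.action)(c) {
                Step::Emit(tok) => return Some(tok),
                Step::Skip => continue,
                Step::Stop => {
                    self.stopped = true;
                    return None;
                }
            }
        }
        None
    }
}

/// Example action: digits become tokens, 'x' stops, everything else is skipped.
fn digit_action(c: char) -> Step<u32> {
    match c {
        'x' => Step::Stop,
        _ => match c.to_digit(10) {
            Some(d) => Step::Emit(d),
            None => Step::Skip,
        },
    }
}
```

Collecting Lexer::new("1 2x3", digit_action) gives the digits before the x only: the space is skipped, and the 3 is never reached because the iterator keeps returning None once it has stopped.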

Args

The arguments declared in ARGS become extra parameters of the generated lexer function, after the input. They can be used, for instance, to pass context information to the actions or to forward values to sub rules.

Here is an example showing how to pass an argument to a lexer:

use lexr::lex_rule;
#[derive(Debug, PartialEq)]
enum Token {
    A, B(i32), Eof
}
use Token::*;

lex_rule!{lex(arg: i32) -> Token {
    "a" => |_| A,
    "b" => |_| B(arg),
    eof => |_| Eof,
}}

let tokens = lex("ab", 12).into_token_vec();
assert_eq!(tokens, vec![A, B(12), Eof]);

Sub Rules

Sub rules are lex rules that are called from the action of another lex rule.
The sub rule operates on the same buffer as the calling rule, so any input it consumes is also consumed for the caller.
This can be used to lex for instance comments, or even entire sub languages.

Be aware that calls to sub rules are not tail recursive, so use them with caution, and not as the main way to lex.

Also make sure that the sub rule is actually run; otherwise nothing happens. This can be done by calling deplete to run it to the end, or next to produce a single token.

Here is an example showing how to call a sub rule:

use lexr::lex_rule;
#[derive(Debug, PartialEq)]
enum Token {
    A, Eof
}
use Token::*;

lex_rule!{lex -> Token {
    ws => |_| continue,
    "a" => |_| A,
    r"\(\*" => |_, buf| { comment(buf, 0).next(); continue },
    eof => |_| Eof,
}}

lex_rule!{comment(depth: u16) -> () {
    r"\(\*" => |_, buf| {comment(buf, depth + 1).next(); break},
    r"\*\)" => |_, buf|
        if depth == 0 {
            break
        } else {
            comment(buf, depth - 1).next();
            break
        },
    eof => |_| panic!("Unclosed comment!"),
    _ => |_| continue,
}}

let tokens = lex("a (* comment (* inner *) comment *) aa").into_token_vec();
assert_eq!(tokens, vec![A, A, A, Eof]);
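
The idea of a sub rule mutating the caller's buffer can also be sketched without lexr. The code below is an illustration under stated assumptions, not lexr's implementation: Buf, skip_comment and count_a are hypothetical names, matching is done on literal prefixes, and nesting is tracked with a depth counter in a loop rather than with recursive calls, which avoids the stack growth mentioned above.

```rust
/// A shared input cursor (hypothetical stand-in for the buffer passed to actions).
struct Buf<'a> {
    rest: &'a str,
}

impl<'a> Buf<'a> {
    /// Consumes `prefix` if the remaining input starts with it.
    fn eat(&mut self, prefix: &str) -> bool {
        match self.rest.strip_prefix(prefix) {
            Some(tail) => {
                self.rest = tail;
                true
            }
            None => false,
        }
    }

    /// Consumes and returns one character, or None at the end of the input.
    fn bump(&mut self) -> Option<char> {
        let c = self.rest.chars().next()?;
        self.rest = &self.rest[c.len_utf8()..];
        Some(c)
    }
}

/// "Sub rule": skips a possibly nested `(* ... *)` comment whose opening
/// delimiter has already been consumed. Returns false if the input ends first.
fn skip_comment(buf: &mut Buf<'_>) -> bool {
    let mut depth = 0u32;
    loop {
        if buf.eat("(*") {
            depth += 1;
        } else if buf.eat("*)") {
            if depth == 0 {
                return true;
            }
            depth -= 1;
        } else if buf.bump().is_none() {
            return false; // unclosed comment
        }
    }
}

/// "Main rule": counts the `a` characters outside comments, or returns None
/// if a comment is never closed.
fn count_a(input: &str) -> Option<usize> {
    let mut buf = Buf { rest: input };
    let mut count = 0;
    loop {
        if buf.eat("(*") {
            // The sub rule advances the same buffer the main loop reads from.
            if !skip_comment(&mut buf) {
                return None;
            }
        } else {
            match buf.bump() {
                Some('a') => count += 1,
                Some(_) => {}
                None => return Some(count),
            }
        }
    }
}
```

Because skip_comment receives a mutable reference to the same Buf, the main loop resumes right after the closing delimiter. For the input used in the lexr example above, count_a returns Some(3), matching the three A tokens.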

License: MIT
