
alkale

A simple LL(1) tokenizer library for Rust

4 stable releases

1.0.3 Oct 14, 2024
1.0.2 Sep 20, 2024
1.0.1 Sep 13, 2024

#185 in Programming languages


222 downloads per month

MIT license

63KB
906 lines

Alkale

This is the repository for Alkale, a Rust library to assist in making hand-written LL(1) tokenizers.

Goals

Alkale has three specific goals in mind for its design.

Goal 1: Handle Sources

Alkale should natively handle common code sources, strings and files in particular.

General-purpose parsers usually need to either operate on a file's raw bytes or read the entire file into memory, neither of which is ideal. Because Alkale doesn't need to support extensive lookahead, it can read characters directly from file buffers and treat them exactly as if a regular string were being tokenized.
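To illustrate the idea of treating a buffered reader and a string the same way, here is a minimal sketch using only the standard library. `CharStream` and `collect_source` are hypothetical names made up for this example and are not Alkale's API; the sketch also assumes valid UTF-8 input and simply ends the stream on any read error.

```rust
use std::collections::VecDeque;
use std::io::BufRead;

/// Hypothetical adapter (not part of Alkale): yields `char`s from any
/// `BufRead`, holding at most one line of the source in memory at a time.
struct CharStream<R: BufRead> {
    reader: R,
    pending: VecDeque<char>,
}

impl<R: BufRead> CharStream<R> {
    fn new(reader: R) -> Self {
        Self { reader, pending: VecDeque::new() }
    }
}

impl<R: BufRead> Iterator for CharStream<R> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if self.pending.is_empty() {
            let mut line = String::new();
            // `read_line` returns Ok(0) at end of input. In this sketch,
            // I/O errors and invalid UTF-8 also just end the stream.
            match self.reader.read_line(&mut line) {
                Ok(n) if n > 0 => self.pending.extend(line.chars()),
                _ => return None,
            }
        }
        self.pending.pop_front()
    }
}

/// Consumer that does not care where the characters come from.
fn collect_source<I: IntoIterator<Item = char>>(source: I) -> String {
    source.into_iter().collect()
}
```

With this in place, `collect_source("abc".chars())` and `collect_source(CharStream::new(BufReader::new(file)))` go through the same code path; a real implementation would likely decode UTF-8 incrementally and report errors instead of swallowing them.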

Goal 2: Provide Span Information

Span information is annoying to keep track of manually, so Alkale automatically keeps track of spans for its tokens.

Because Alkale avoids loading the source into memory, its spans store index, line, and column information. This may lead to higher-than-average memory usage for tokenizers that collect all of their tokens up front. An iterator-like tokenizer that creates tokens only as they're needed avoids this problem.
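As a rough picture of what tracking index, line, and column involves, the sketch below wraps any character iterator and attaches a position to each character. `Position` and `Tracked` are hypothetical names for illustration only, not Alkale's span types, and the conventions chosen here (index counted in characters from 0, lines and columns counted from 1) are assumptions.

```rust
/// Hypothetical position record (not Alkale's actual span type).
/// Assumes: `index` counts characters from 0; `line` and `column` start at 1.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Position {
    index: usize,
    line: usize,
    column: usize,
}

/// Wraps a character iterator and yields each character together with
/// the position at which it starts.
struct Tracked<I: Iterator<Item = char>> {
    chars: I,
    pos: Position,
}

impl<I: Iterator<Item = char>> Tracked<I> {
    fn new(chars: I) -> Self {
        Self { chars, pos: Position { index: 0, line: 1, column: 1 } }
    }
}

impl<I: Iterator<Item = char>> Iterator for Tracked<I> {
    type Item = (char, Position);

    fn next(&mut self) -> Option<Self::Item> {
        let c = self.chars.next()?;
        let here = self.pos;
        // Advance the bookkeeping for the character that follows.
        self.pos.index += 1;
        if c == '\n' {
            self.pos.line += 1;
            self.pos.column = 1;
        } else {
            self.pos.column += 1;
        }
        Some((c, here))
    }
}
```

A token's span can then be formed from the position of its first character and the position just past its last one. Because each position carries three integers rather than a single offset, storing every token at once costs more memory than producing tokens on demand.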

Goal 3: Include Many Built-Ins

Many aspects of tokenizers are extremely common and repetitive, such as string parsing, number tokenization, and error recovery. These common elements should come pre-packaged with Alkale by default. You may find a list of these in COMMON.md.

Because I have roots in esolangs, there may be some odd built-ins to assist with non-standard languages.

Structure

The core of Alkale operates on the TokenizerContext type. It is created using a BufReader<File> (for convenience), or more generally, any type that implements IntoIterator<Item = char>.

The TokenizerContext provides LL(1) access into the underlying string with the next and peek methods, as well as many helper methods. These range from peek_is, a general-purpose method to check whether the next character is equal to some characters, all the way to try_parse_simple_string, which attempts to parse an entire Rust-like string, including character escapes.
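For readers unfamiliar with the LL(1) style, the following sketch shows what tokenizing with only `peek` and `next` looks like. It uses the standard library's `Peekable` iterator rather than TokenizerContext, whose exact signatures are not shown here; the `Token` enum and `tokenize` function are hypothetical and exist only for this example.

```rust
/// Hypothetical token type for this example (not defined by Alkale).
#[derive(Debug, PartialEq)]
enum Token {
    Number(u64),
    Ident(String),
    Symbol(char),
}

/// Tokenizes any character source using a single character of lookahead.
fn tokenize<I: IntoIterator<Item = char>>(source: I) -> Vec<Token> {
    let mut chars = source.into_iter().peekable();
    let mut tokens = Vec::new();

    // Every decision is made from one peeked character: that is the LL(1) part.
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c.is_ascii_digit() {
            let mut value = 0u64;
            while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                // Saturate instead of overflowing; a real tokenizer would
                // report an error here.
                value = value.saturating_mul(10).saturating_add(u64::from(d));
                chars.next();
            }
            tokens.push(Token::Number(value));
        } else if c.is_alphabetic() || c == '_' {
            let mut name = String::new();
            while let Some(&c) = chars.peek() {
                if c.is_alphanumeric() || c == '_' {
                    name.push(c);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(Token::Ident(name));
        } else {
            // Anything else becomes a one-character symbol token.
            chars.next();
            tokens.push(Token::Symbol(c));
        }
    }
    tokens
}
```

Calling `tokenize("x1 = 42;".chars())` yields an identifier, a symbol, a number, and a symbol. Helper methods such as peek_is exist to shorten exactly this kind of peek-then-consume loop.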

Dependencies

~240KB