#parser-generator #bad #parser-combinator #parser

bad_parsers

A parser combinator library written by myself, for myself

1 unstable release

Uses new Rust 2024

new 0.1.0-unstable Apr 18, 2025

#87 in Parser tooling

GPL-3.0 license

130KB
2K SLoC

bad_parsers

WARNING: CURRENTLY NOT STABLE

This library's API is far from finalized, and different components can change or vanish without warning. It is not currently recommended to use this library for any vaguely-serious projects.

This README is very empty

I'm just getting basic things in order at the moment, this is a very green library. If you need information on using the library, you can generate the documentation after cloning with cargo doc --open.


lib.rs:

bad_parsers

A parser combinator library written by myself for myself.

Don't say I didn't warn you.

bad_parsers is a parser combinator library written entirely from scratch in safe Rust with no external dependencies. The provided parsers are able to parse from string slices, arbitrary token type slices, as well as other token collections you may wish to implement yourself.

As an added bonus, using this library in your own project is guaranteed to make it worse! For free!

Main Features

  • A Parser trait that primarily interfaces with functions and closures that act like parsers, but could also be used to treat other types as parsers as well.
  • A Tokens trait that allows things to be parsed out of arbitrary data types.[citation needed]
  • Vaguely useful error handling: ParseErrors can provide useful information when parsing is unsuccessful.
  • Lazily evaluated parsers that allow for recursive definitions.
  • A collection of common parser combinator utility functions to build custom parsers with.

Overview of Parsers

The job of a parser is to extract meaningful data from a collection of tokens. To this end, a parser can be thought of as a function from some input tokens to some kind of data the parser is looking for. To achieve this, a parser will:

  • Examine the stream of tokens
  • Fail if it cannot find what it is looking for
  • If it does find it, return the parsed data and the remaining tokens

The importance of returning the remaining tokens cannot be overstated, as this is what allows multiple simple parsers to be composed and chained together to create much more sophisticated parsers.

Tokens and parser inputs

As was already stated, parsers in this library are able to parse from arbitrary data types, as long as they implement the Tokens trait.

Implementors of this trait are thought of as containers of individual tokens. In general, a type may implement Tokens<T> if:

  • it has a notion of being 'empty'
  • it has a notion of immutably taking a single token from the 'front'

More precise implementation guidelines for this trait are available in its documentation.

All parsers accept a token container as their input to parse from. Additionally, parsers being combined together must operate on the same token container type.

Currently, Tokens is only implemented OOTB by &str and &[T]. In practice, these should be the only input types you would need to deal with.

These two types having the same interface allows parsers to parse directly from strings, as well as collections of arbitrary token types. This means that for applications such as parsing a programming language, the combinator framework can be used for both lexing and parsing.

Creating parsers

All parsers implement the trait Parser<'a, Toks, T, A>. The type arguments are as follows:

  • 'a - named lifetime, nothing special
  • Toks - the type of token stream this parser operates on, must implement Tokens<T>
  • T - the type of the individual tokens contained within Toks
  • A - the type of the value this parser tries to parse

The typical way to define a custom parser is to use a function which returns an opaque type with the appropriate Parser implementation. This is normally done by returning a closure with the approriate signature:

use bad_parsers::{Parser, ParseError};

// parses a single i32 token if it's even
fn custom_parser<'a>() -> impl Parser<'a, &'a [i32], i32, i32> {
    |input: &'a [i32]| match input.iter().next() {
        Some(x) if x % 2 == 0 => Ok((&input[1..], *x)),
        _ => Err(ParseError::no_parse("no even int", input)),
    }
}

// quirk with lifetimes: define your inputs before creating the parser
let nums = [1, 2, 3, 4, 5, 6, 7, 8, 9];

let starts_with1 = &nums[0..];
let starts_with2 = &nums[1..];
let starts_with3 = &nums[2..];

let p = custom_parser();

assert!(p.parse(starts_with1).is_err());
assert_eq!((starts_with3, 2), p.parse(starts_with2).unwrap());

Note the above example alludes to a quirk to do with the lifetimes of inputs. That is explained more in-depth in the documentation of the Parser trait.

If differing opaque types become an issue for the compiler, you may wish to wrap the parser in a BoxedParser, which can easily be done with the boxed method.

Most of the provided combinator functions in this library will accept generic parser types. However, some will require a BoxedParser. The most notable examples are the operator overloads for BoxedParser.

Error Handling

When a parser fails, it returns an Err containing an error value of type [ParseError<Toks, T>]. When implementing your own parsers, it is important to construct error values that accurately reflect the cause of the error.

If you see a constructor function in ParseError that describes the reason that your parse has failed, such as ParseError::empty_input or ParseError::not_enough, you should probably use that one. If some kind of fault occurs with a process that is not strictly a parser failure, such as an I/O operation failure or an overflow, you should use the ParseError::other constructor, which wraps an arbitrary error type from another process and behaves slightly differently in order to report the error correctly. When in doubt, the ParseError::no_parse constructor is the catch-all generic error type. Be sure to provide useful details when constructing this error type.

More detailed information can be found in the ParseError documentation.

Worked Example: Parsing an Integer

Suppose we want to find an integer at the start of a string. At a high level, this means we first need to see if there are digits at the start of the string. If not, the parsing has failed and the parser will return an error. If it succeeds, we need to return the digits it found, as well as the rest of the string.

To get started, let's just try to parse a single digit. We can do this with the parser-building function token_satisfies, which builds a simple parser out of a predicate function:

use bad_parsers::{Parser, token_satisfies};

let front_number = token_satisfies(char::is_ascii_digit);

assert_eq!(("23", '1'), front_number.parse("123").unwrap());
assert_eq!(("rest", '3'), front_number.parse("3rest").unwrap());
assert_eq!(("0rest", '3'), front_number.parse("30rest").unwrap());
assert!(front_number.parse("no front number").is_err());
assert!(front_number.parse("").is_err());

As we can see, the parser was able to extract the first digit from the front of the first three inputs, but it failed to parse anything when there is no digit at the start, or when the input is empty.

A good first step, but we need to parse multiple digits. For a task like this, bad_parsers provides the mult1 combinator. This takes a parser, and runs it multiple times over the input and collects the results in a [Vec]. This new parser will keep collecting parsed values until it is unable to parse any more. This parser will also fail if it is unable to parse at least one item from the input, hence the name:

use bad_parsers::{Parser, token_satisfies};

let front_number = token_satisfies(char::is_ascii_digit).mult1();

assert_eq!(("", vec!['1', '2', '3']), front_number.parse("123").unwrap());
assert_eq!(("rest", vec!['3']), front_number.parse("3rest").unwrap());
assert_eq!(("rest", vec!['3', '0']), front_number.parse("30rest").unwrap());
assert!(front_number.parse("no front number").is_err());
assert!(front_number.parse("").is_err());

Now we're getting somewhere!

We are now able to get all the digits from the start of a string. Unfortunately, a Vec<char> is not an integer that Rust can do arithmetic with. There are a few ways we could turn it into an integer, so I'm going to choose one of the more convoluted solutions to show off more of the library. We'll be using the [map] combinator to turn the digit characters into integers. Note that the [map] is added before the mult1, as we only want to work with one digitat a time for the moment:

use bad_parsers::{Parser, token_satisfies};

let front_number = token_satisfies(char::is_ascii_digit)
    .map(|c| c.to_digit(10).unwrap())
    .mult1();

assert_eq!(("", vec![1, 2, 3]), front_number.parse("123").unwrap());
assert_eq!(("rest", vec![3]), front_number.parse("3rest").unwrap());
assert_eq!(("rest", vec![3, 0]), front_number.parse("30rest").unwrap());
assert!(front_number.parse("no front number").is_err());
assert!(front_number.parse("").is_err());

It's safe to unwrap the result of to_digit here because it should only receive actual digit characters as inputs.

With that, we're almost there! Our digits are now actual numbers. All we have to do now is put them together.

We'll need a more involved [map] for this one:

use bad_parsers::{Parser, token_satisfies};

let front_number = token_satisfies(char::is_ascii_digit)
    .map(|c| c.to_digit(10).unwrap())
    .mult1()
    .map(|digs| {
        let mut n = 0;
        for d in digs {
            n *= 10;
            n += d;
        }
        n
    });

assert_eq!(("", 123), front_number.parse("123").unwrap());
assert_eq!(("rest", 3), front_number.parse("3rest").unwrap());
assert_eq!(("rest", 30), front_number.parse("30rest").unwrap());
assert!(front_number.parse("no front number").is_err());
assert!(front_number.parse("").is_err());

Success! We now have a parser that can extract an integer from the start of a string.

There is a caveat with this parser, however: we haven't specified any custom error messages. If we leave this parser with just the default errors to report, we probably won't get a lot of useful information as to why the parser failed.

Since our parser has been put together from various combinators, the best way to change the error value is to use the map_error function. This function works in a way similar to Result::map_err and allows for modification of the error value that the parser was going to return.

For our parser, the token_satisfies can fail if it doesn't find a digit, but as long as it finds at least one, then the parser should still succeed. We only have that pesky zero-digit case to worry about, so we can add a helpful error message onto our mult1 for when it can't find any digits:

use bad_parsers::{Parser, token_satisfies};

let front_number = token_satisfies(char::is_ascii_digit)
    .map(|c| c.to_digit(10).unwrap())
    .mult1()
    .map_error(|mut e| {
        e.set_details("front_number couldn't find any digits");
        e
    })
    .map(|digs| {
        let mut n = 0;
        for d in digs {
            n *= 10;
            n += d;
        }
        n
    });

assert_eq!(("", 123), front_number.parse("123").unwrap());

let expected = "Parser needed to parse 1 elements, but only parsed 0: front_number couldn't find any digits, Failed at: \"foo\"";
assert_eq!(expected.to_string(), front_number.parse("foo").unwrap_err().to_string());

Now that we've modified the details of the original ParseError, the parser's error message clarifies that it was looking for digits, but couldn't find any. If this parser were part of a much larger chain, we could narrow down our search for the problem to parsers that look for digits. In this situation, we can see that the original error from mult1 already stated that it was looking for at least one of something, as well as the part of the input where it failed.

I must say, this is a pretty good-looking parser!

With that being said, having to deal with the complex type of the parser and its result can be a bit clunky for more casual developers who want to use this cutting-edge integer parsing technology. If the user is just trying to parse an integer and doesn't really care about the remaining input or any error messages, then using the parser in multiple places could very quickly make the code tedious and noisy with all the Result unwrapping. We can deal with all of that by placing this parser in an easier-to-use function:

use bad_parsers::{Parser, token_satisfies};

fn parse_int(input: &str) -> Option<u32> {
    let p = token_satisfies(char::is_ascii_digit)
        .map(|c| c.to_digit(10).unwrap())
        .mult1()
        .map_error(|mut e| {
            e.set_details("front_number couldn't find any digits");
            e
        })
        .map(|digs| {
            let mut n = 0;
            for d in digs {
                n *= 10;
                n += d;
            }
            n
        });
    p.parse(input).ok().map(|(_input, n)| n)
}

assert_eq!(Some(123), parse_int("123"));
assert_eq!(Some(3), parse_int("3rest"));
assert_eq!(Some(30), parse_int("30rest"));
assert!(parse_int("no front number").is_none());
assert!(parse_int("").is_none());

This works well enough, but giving it some more thought, is the parser really correct? If an input string starts with an integer, but contains some more text after it, should it still be considered valid?

Perhaps, perhaps not. But just for fun, let's modify this function to only return an integer when the string contains no extra text. To check that there is no text left after the integer, we can use [eof]. This special parser will succeed with a () only if the input is empty. But to make it work together with our integer parser, we'll need to combine them together with plus:

use bad_parsers::{Parser, token_satisfies, eof};

fn parse_int(input: &str) -> Option<u32> {
    let p = token_satisfies(char::is_ascii_digit)
        .map(|c| c.to_digit(10).unwrap())
        .mult1()
        .map_error(|mut e| {
            e.set_details("front_number couldn't find any digits");
            e
        })
        .map(|digs| {
            let mut n = 0;
            for d in digs {
                n *= 10;
                n += d;
            }
            n
        })
        .plus(eof());
    p.parse(input).ok().map(|(_input, (n, ()))| n)
}

assert_eq!(Some(123), parse_int("123"));
assert!(parse_int("3rest").is_none());
assert!(parse_int("30rest").is_none());
assert!(parse_int("no front number").is_none());
assert!(parse_int("").is_none());

Now only the first input in our tests is parsed successfully, as the two after it have trailing text.

Also note that the returned value has changed types from [u32] to (u32, ()). This is due to the use of plus, which runs two parsers in sequence, and return the values of both when they both succeed, but will fail completely unless both parsers succeed.

Since we only care about the first value being returned in this situation, we can replace plus with a variant called left, which works the same as plus but only returns the first value.

use bad_parsers::{Parser, token_satisfies, eof};

fn parse_int(input: &str) -> Option<u32> {
    let p = token_satisfies(char::is_ascii_digit)
        .map(|c| c.to_digit(10).unwrap())
        .mult1()
        .map_error(|mut e| {
            e.set_details("front_number couldn't find any digits");
            e
        })
        .map(|digs| {
            let mut n = 0;
            for d in digs {
                n *= 10;
                n += d;
            }
            n
        })
        .left(eof());
    p.parse(input).ok().map(|(_input, n)| n)
}

assert_eq!(Some(123), parse_int("123"));
assert!(parse_int("3rest").is_none());
assert!(parse_int("30rest").is_none());
assert!(parse_int("no front number").is_none());
assert!(parse_int("").is_none());

Now the code is a little bit cleaner. Thanks left!

As you can see, our humble parser combinators have been able to solve a long-standing computer science problem, one that stumped Turing, Church, and even von Neumann!

I really ought to submit this feature to the Rust development team. The standard library could really make use of this-

let x: u32 = parse_int("123").unwrap();

let y: u32 = str::parse("123").unwrap();

assert_eq!(x, y);
assert_ne!("time taken", "well-spent");

Oh.

No runtime deps