1 unstable release
Uses new Rust 2024
new 0.1.0-unstable | Apr 18, 2025 |
---|
#87 in Parser tooling
130KB
2K
SLoC
bad_parsers
WARNING: CURRENTLY NOT STABLE
This library's API is far from finalized, and different components can change or vanish without warning. It is not currently recommended to use this library for any vaguely-serious projects.
This README is very empty
I'm just getting basic things in order at the moment, this is a very green library.
If you need information on using the library, you can generate the documentation after cloning with cargo doc --open
.
lib.rs
:
bad_parsers
A parser combinator library written by myself for myself.
Don't say I didn't warn you.
bad_parsers
is a parser combinator library written entirely from scratch in safe Rust
with no external dependencies.
The provided parsers are able to parse from string slices, arbitrary token type slices,
as well as other token collections you may wish to implement yourself.
As an added bonus, using this library in your own project is guaranteed to make it worse! For free!
Main Features
- A Parser trait that primarily interfaces with functions and closures that act like parsers, but could also be used to treat other types as parsers as well.
- A Tokens trait that allows things to be parsed out of arbitrary data types.[citation needed]
- Vaguely useful error handling:
ParseError
s can provide useful information when parsing is unsuccessful. - Lazily evaluated parsers that allow for recursive definitions.
- A collection of common parser combinator utility functions to build custom parsers with.
Overview of Parsers
The job of a parser is to extract meaningful data from a collection of tokens. To this end, a parser can be thought of as a function from some input tokens to some kind of data the parser is looking for. To achieve this, a parser will:
- Examine the stream of tokens
- Fail if it cannot find what it is looking for
- If it does find it, return the parsed data and the remaining tokens
The importance of returning the remaining tokens cannot be overstated, as this is what allows multiple simple parsers to be composed and chained together to create much more sophisticated parsers.
Tokens
and parser inputs
As was already stated, parsers in this library are able to parse from arbitrary data
types, as long as they implement the Tokens
trait.
Implementors of this trait are thought of as containers of individual tokens.
In general, a type may implement Tokens<T>
if:
- it has a notion of being 'empty'
- it has a notion of immutably taking a single token from the 'front'
More precise implementation guidelines for this trait are available in its documentation.
All parsers accept a token container as their input to parse from. Additionally, parsers being combined together must operate on the same token container type.
Currently, Tokens
is only implemented OOTB by &str
and &[T]
.
In practice, these should be the only input types you would need to deal with.
These two types having the same interface allows parsers to parse directly from strings, as well as collections of arbitrary token types. This means that for applications such as parsing a programming language, the combinator framework can be used for both lexing and parsing.
Creating parsers
All parsers implement the trait Parser<'a, Toks, T, A>
.
The type arguments are as follows:
'a
- named lifetime, nothing specialToks
- the type of token stream this parser operates on, must implementTokens<T>
T
- the type of the individual tokens contained withinToks
A
- the type of the value this parser tries to parse
The typical way to define a custom parser is to use a function which returns an opaque type
with the appropriate Parser
implementation.
This is normally done by returning a closure with the approriate
signature:
use bad_parsers::{Parser, ParseError};
// parses a single i32 token if it's even
fn custom_parser<'a>() -> impl Parser<'a, &'a [i32], i32, i32> {
|input: &'a [i32]| match input.iter().next() {
Some(x) if x % 2 == 0 => Ok((&input[1..], *x)),
_ => Err(ParseError::no_parse("no even int", input)),
}
}
// quirk with lifetimes: define your inputs before creating the parser
let nums = [1, 2, 3, 4, 5, 6, 7, 8, 9];
let starts_with1 = &nums[0..];
let starts_with2 = &nums[1..];
let starts_with3 = &nums[2..];
let p = custom_parser();
assert!(p.parse(starts_with1).is_err());
assert_eq!((starts_with3, 2), p.parse(starts_with2).unwrap());
Note the above example alludes to a quirk to do with the lifetimes of inputs.
That is explained more in-depth in the documentation of the Parser
trait.
If differing opaque types become an issue for the compiler, you may wish to wrap the parser
in a BoxedParser
, which can easily be done with the boxed
method.
Most of the provided combinator functions in this library will accept generic parser types.
However, some will require a BoxedParser
.
The most notable examples are the operator overloads for BoxedParser
.
Error Handling
When a parser fails, it returns an Err
containing an error value of type
[ParseError<Toks, T>
].
When implementing your own parsers, it is important to construct error values that
accurately reflect the cause of the error.
If you see a constructor function in ParseError
that describes the reason that your
parse has failed, such as ParseError::empty_input
or ParseError::not_enough
,
you should probably use that one.
If some kind of fault occurs with a process that is not strictly a parser failure, such as
an I/O operation failure or an overflow, you should use the ParseError::other
constructor, which wraps an arbitrary error type from another process and behaves slightly
differently in order to report the error correctly.
When in doubt, the ParseError::no_parse
constructor is the catch-all generic error type.
Be sure to provide useful details when constructing this error type.
More detailed information can be found in the ParseError
documentation.
Worked Example: Parsing an Integer
Suppose we want to find an integer at the start of a string. At a high level, this means we first need to see if there are digits at the start of the string. If not, the parsing has failed and the parser will return an error. If it succeeds, we need to return the digits it found, as well as the rest of the string.
To get started, let's just try to parse a single digit.
We can do this with the parser-building function token_satisfies
, which builds a simple
parser out of a predicate function:
use bad_parsers::{Parser, token_satisfies};
let front_number = token_satisfies(char::is_ascii_digit);
assert_eq!(("23", '1'), front_number.parse("123").unwrap());
assert_eq!(("rest", '3'), front_number.parse("3rest").unwrap());
assert_eq!(("0rest", '3'), front_number.parse("30rest").unwrap());
assert!(front_number.parse("no front number").is_err());
assert!(front_number.parse("").is_err());
As we can see, the parser was able to extract the first digit from the front of the first three inputs, but it failed to parse anything when there is no digit at the start, or when the input is empty.
A good first step, but we need to parse multiple digits.
For a task like this, bad_parsers
provides the mult1
combinator.
This takes a parser, and runs it multiple times over the input and collects the results in
a [Vec
].
This new parser will keep collecting parsed values until it is unable to parse any more.
This parser will also fail if it is unable to parse at least one item from the input,
hence the
name:
use bad_parsers::{Parser, token_satisfies};
let front_number = token_satisfies(char::is_ascii_digit).mult1();
assert_eq!(("", vec!['1', '2', '3']), front_number.parse("123").unwrap());
assert_eq!(("rest", vec!['3']), front_number.parse("3rest").unwrap());
assert_eq!(("rest", vec!['3', '0']), front_number.parse("30rest").unwrap());
assert!(front_number.parse("no front number").is_err());
assert!(front_number.parse("").is_err());
Now we're getting somewhere!
We are now able to get all the digits from the start of a string.
Unfortunately, a Vec<char>
is not an integer that Rust can do arithmetic with.
There are a few ways we could turn it into an integer, so I'm going to choose one of the
more convoluted solutions to show off more of the library.
We'll be using the [map
] combinator to turn the digit characters into integers.
Note that the [map
] is added before the mult1
, as we only want to work with one
digitat a time for the moment:
use bad_parsers::{Parser, token_satisfies};
let front_number = token_satisfies(char::is_ascii_digit)
.map(|c| c.to_digit(10).unwrap())
.mult1();
assert_eq!(("", vec![1, 2, 3]), front_number.parse("123").unwrap());
assert_eq!(("rest", vec![3]), front_number.parse("3rest").unwrap());
assert_eq!(("rest", vec![3, 0]), front_number.parse("30rest").unwrap());
assert!(front_number.parse("no front number").is_err());
assert!(front_number.parse("").is_err());
It's safe to unwrap the result of to_digit
here because it should
only receive actual digit characters as inputs.
With that, we're almost there! Our digits are now actual numbers. All we have to do now is put them together.
We'll need a more involved [map
] for this one:
use bad_parsers::{Parser, token_satisfies};
let front_number = token_satisfies(char::is_ascii_digit)
.map(|c| c.to_digit(10).unwrap())
.mult1()
.map(|digs| {
let mut n = 0;
for d in digs {
n *= 10;
n += d;
}
n
});
assert_eq!(("", 123), front_number.parse("123").unwrap());
assert_eq!(("rest", 3), front_number.parse("3rest").unwrap());
assert_eq!(("rest", 30), front_number.parse("30rest").unwrap());
assert!(front_number.parse("no front number").is_err());
assert!(front_number.parse("").is_err());
Success! We now have a parser that can extract an integer from the start of a string.
There is a caveat with this parser, however: we haven't specified any custom error messages. If we leave this parser with just the default errors to report, we probably won't get a lot of useful information as to why the parser failed.
Since our parser has been put together from various combinators, the best way to change
the error value is to use the map_error
function.
This function works in a way similar to Result::map_err
and allows for modification of
the error value that the parser was going to return.
For our parser, the token_satisfies
can fail if it doesn't find a digit, but as long
as it finds at least one, then the parser should still succeed.
We only have that pesky zero-digit case to worry about, so we can add a helpful error
message onto our mult1
for when it can't find any digits:
use bad_parsers::{Parser, token_satisfies};
let front_number = token_satisfies(char::is_ascii_digit)
.map(|c| c.to_digit(10).unwrap())
.mult1()
.map_error(|mut e| {
e.set_details("front_number couldn't find any digits");
e
})
.map(|digs| {
let mut n = 0;
for d in digs {
n *= 10;
n += d;
}
n
});
assert_eq!(("", 123), front_number.parse("123").unwrap());
let expected = "Parser needed to parse 1 elements, but only parsed 0: front_number couldn't find any digits, Failed at: \"foo\"";
assert_eq!(expected.to_string(), front_number.parse("foo").unwrap_err().to_string());
Now that we've modified the details of the original ParseError
, the parser's error
message clarifies that it was looking for digits, but couldn't find any.
If this parser were part of a much larger chain, we could narrow down our search for the
problem to parsers that look for digits.
In this situation, we can see that the original error from mult1
already stated that it
was looking for at least one of something, as well as the part of the input where it failed.
I must say, this is a pretty good-looking parser!
With that being said, having to deal with the complex type of the parser and its result
can be a bit clunky for more casual developers who want to use this cutting-edge integer
parsing technology.
If the user is just trying to parse an integer and doesn't really care about the
remaining input or any error messages, then using the parser in multiple places could very
quickly make the code tedious and noisy with all the Result
unwrapping.
We can deal with all of that by placing this parser in an easier-to-use function:
use bad_parsers::{Parser, token_satisfies};
fn parse_int(input: &str) -> Option<u32> {
let p = token_satisfies(char::is_ascii_digit)
.map(|c| c.to_digit(10).unwrap())
.mult1()
.map_error(|mut e| {
e.set_details("front_number couldn't find any digits");
e
})
.map(|digs| {
let mut n = 0;
for d in digs {
n *= 10;
n += d;
}
n
});
p.parse(input).ok().map(|(_input, n)| n)
}
assert_eq!(Some(123), parse_int("123"));
assert_eq!(Some(3), parse_int("3rest"));
assert_eq!(Some(30), parse_int("30rest"));
assert!(parse_int("no front number").is_none());
assert!(parse_int("").is_none());
This works well enough, but giving it some more thought, is the parser really correct? If an input string starts with an integer, but contains some more text after it, should it still be considered valid?
Perhaps, perhaps not.
But just for fun, let's modify this function to only return an integer when the string
contains no extra text.
To check that there is no text left after the integer, we can use [eof
].
This special parser will succeed with a ()
only if the input is empty.
But to make it work together with our integer parser, we'll need to combine them together
with plus
:
use bad_parsers::{Parser, token_satisfies, eof};
fn parse_int(input: &str) -> Option<u32> {
let p = token_satisfies(char::is_ascii_digit)
.map(|c| c.to_digit(10).unwrap())
.mult1()
.map_error(|mut e| {
e.set_details("front_number couldn't find any digits");
e
})
.map(|digs| {
let mut n = 0;
for d in digs {
n *= 10;
n += d;
}
n
})
.plus(eof());
p.parse(input).ok().map(|(_input, (n, ()))| n)
}
assert_eq!(Some(123), parse_int("123"));
assert!(parse_int("3rest").is_none());
assert!(parse_int("30rest").is_none());
assert!(parse_int("no front number").is_none());
assert!(parse_int("").is_none());
Now only the first input in our tests is parsed successfully, as the two after it have trailing text.
Also note that the returned value has changed types from [u32
] to (u32, ())
.
This is due to the use of plus
, which runs two parsers in sequence, and return the
values of both when they both succeed, but will fail completely unless both parsers succeed.
Since we only care about the first value being returned in this situation, we can replace
plus
with a variant called left
, which works the same as plus
but only returns
the first value.
use bad_parsers::{Parser, token_satisfies, eof};
fn parse_int(input: &str) -> Option<u32> {
let p = token_satisfies(char::is_ascii_digit)
.map(|c| c.to_digit(10).unwrap())
.mult1()
.map_error(|mut e| {
e.set_details("front_number couldn't find any digits");
e
})
.map(|digs| {
let mut n = 0;
for d in digs {
n *= 10;
n += d;
}
n
})
.left(eof());
p.parse(input).ok().map(|(_input, n)| n)
}
assert_eq!(Some(123), parse_int("123"));
assert!(parse_int("3rest").is_none());
assert!(parse_int("30rest").is_none());
assert!(parse_int("no front number").is_none());
assert!(parse_int("").is_none());
Now the code is a little bit cleaner. Thanks left
!
As you can see, our humble parser combinators have been able to solve a long-standing computer science problem, one that stumped Turing, Church, and even von Neumann!
I really ought to submit this feature to the Rust development team. The standard library could really make use of this-
let x: u32 = parse_int("123").unwrap();
let y: u32 = str::parse("123").unwrap();
assert_eq!(x, y);
assert_ne!("time taken", "well-spent");
Oh.