6 releases

0.0.6 Dec 6, 2024
0.0.5 Dec 6, 2024

#355 in Text processing

36 downloads per month

MIT license

455KB
1K SLoC

reggy

A friendly regular expression dialect for text analytics. Typical regex features are removed/adjusted to make natural language queries easier. Unicode-aware and able to search a stream with several patterns at once.

Should I Use reggy?

If you are working on a text processing problem with streaming datasets or hand-tuned regexes for natural language, you may find the feature set compelling.

Crate Match Streams? Case Insensitivity? Pattern Flexibility?
aho-corasick simple ASCII string set
regex Unicode best-effort full-featured regex
reggy Unicode best-effort regex subset

API Usage

Use the high-level Pattern struct for simple search.

let mut p = Pattern::new("dogs?")?;
assert_eq!(
    p.findall_spans("cat dog dogs cats"),
    vec![(4, 7), (8, 12)]
);

Use the Ast struct to transpile to normal regex syntax.

let ast = Ast::parse(r"dog(gy)?|dawg|(!CAT|KITTY CAT)")?;
assert_eq!(
    ast.to_regex(),
    r"\b(?mi:dog(?:gy)?|dawg|(?-i:CAT|KITTY\s+CAT))\b"
);

Stream a File

In this example, we will count the matches of a set of patterns within a file without loading it into memory. Use the Search struct to search a stream with several patterns at once.

Create a BufReader for the text.

use std::fs::File;
use std::io::{self, BufReader};

let f = File::open("tests/samples/republic_plato.txt")?;
let f = BufReader::new(f);

Compile the search object.

let patterns = [
    r"yes|(very )?true|certainly|quite so|I have no objection|I agree",
    r"\?",
];

let mut pattern_counts = [0; 2];

let mut search = Search::compile(&patterns)?;

Call Search::iter to create a StreamSearch. Any IO errors or malformed UTF-8 will be return a SearchStreamError.

for result in search.iter(f) {
    match result {
        Ok(m) => {
            pattern_counts[m.id] += 1;
        }
        Err(e) => {
            println!("Stream Error {e:?}");
            break;
        }
    }
}

println!("Assent Count:   {}", pattern_counts[0]);
println!("Question Count: {}", pattern_counts[1]);
// Assent Count:   1467
// Question Count: 1934

Walk a Stream Manually

let mut search = Search::compile(&[
    r"$#?#?#.##",
    r"(John|Jane) Doe"
])?;

Call Search::next to begin searching. It will yield any matches deemed definitely-complete immediately.

let jane_match = Match::new(1, (0, 8));
assert_eq!(
    search.next("Jane Doe paid John"),
    vec![jane_match]
);

Call Search::next again to continue with the same search state. Note that "John Doe" matched across the chunk boundary, and spans are relative to the start of the stream.

let john_match = Match::new(1, (14, 22));
let money_match_1 = Match::new(0, (23, 29));
let money_match_2 = Match::new(0, (41, 48));
assert_eq!(
    search.next(" Doe $45.66 instead of $499.00"),
    vec![john_match, money_match_1, money_match_2]
);

Call Search::finish to collect any not-definitely-complete matches once the stream is closed.

assert_eq!(search.finish(), vec![]);

See more in the API docs.

Pattern Language

reggy is case-insensitive by default. Spaces match any amount of whitespace (i.e. \s+). All the reserved characters mentioned below (\, (, ), {, }, ,, ?, |, #, and !) may be escaped with a backslash for a literal match. Patterns are surrounded by implicit unicode word boundaries (i.e. \b). Empty patterns or subpatterns are not permitted.

Examples

Make a character optional with ?

dogs? matches dog and dogs

Create two or more alternatives with |

dog|cat matches dog and cat

Create a sub-pattern with (...)

the qualit(y|ies) required matches the quality required and the qualities required

the only( one)? around matches the only around and the only one around

Create a case-sensitive sub-pattern with (!...)

United States of America|(!USA) matches USA, not usa

Match digits with #

#.## matches 3.14

Match exactly n times with {n}, or between n and m times with {n,m}

(very ){1,4}strange matches very very very strange

Definitely-Complete Matches

reggy follows "leftmost-longest", greedy matching semantics. A pattern may match after one step of a stream, yet may match a longer form depending on the next step. For example, abb? will match s.next("ab"), but a subsequent call to s.next("b") would create a longer match, "abb", which should supercede the match "ab".

Search only yields matches once they are definitely complete and cannot be superceded by future next calls. Each pattern has a maximum byte length L, counting contiguous whitespace as 1 byte.[^1] Once reggy has streamed at most L bytes past the start of a match without superceding it, that match will be yielded.

As a consequence, results of a given Search are the same regardless of how a given haystack stream is chunked. Search::next returns Matches as soon as it practically can while respecting this invariant.

Implementation

The pattern language is parsed with lalrpop (grammar).

The search routines use a regex_automata::dense::DFA. Compared to other regex engines, the dense DFA is memory-intensive and slow to construct, but searches are fast. Unicode word boundaries are handled by the unicode_segmentation crate.

[^1]: This is why unbounded quantifiers are absent from reggy. When a pattern requires * or +, users should choose an upper limit ({0,n}, {1,n}) instead.

Dependencies

~2.2–4.5MB
~75K SLoC