6 releases
0.0.6 | Dec 6, 2024 |
---|---|
0.0.5 | Dec 6, 2024 |
#355 in Text processing
36 downloads per month
455KB
1K
SLoC
reggy
A friendly regular expression dialect for text analytics. Typical regex features are removed/adjusted to make natural language queries easier. Unicode-aware and able to search a stream with several patterns at once.
Should I Use reggy
?
If you are working on a text processing problem with streaming datasets or hand-tuned regexes for natural language, you may find the feature set compelling.
Crate | Match Streams? | Case Insensitivity? | Pattern Flexibility? |
---|---|---|---|
aho-corasick |
✅ | simple ASCII | string set |
regex |
❌ | Unicode best-effort | full-featured regex |
reggy |
✅ | Unicode best-effort | regex subset |
API Usage
Use the high-level Pattern
struct for simple search.
let mut p = Pattern::new("dogs?")?;
assert_eq!(
p.findall_spans("cat dog dogs cats"),
vec![(4, 7), (8, 12)]
);
Use the Ast
struct to transpile to normal regex syntax.
let ast = Ast::parse(r"dog(gy)?|dawg|(!CAT|KITTY CAT)")?;
assert_eq!(
ast.to_regex(),
r"\b(?mi:dog(?:gy)?|dawg|(?-i:CAT|KITTY\s+CAT))\b"
);
Stream a File
In this example, we will count the matches of a set of patterns within a file without loading it into memory. Use the Search
struct to search a stream with several patterns at once.
Create a BufReader
for the text.
use std::fs::File;
use std::io::{self, BufReader};
let f = File::open("tests/samples/republic_plato.txt")?;
let f = BufReader::new(f);
Compile the search object.
let patterns = [
r"yes|(very )?true|certainly|quite so|I have no objection|I agree",
r"\?",
];
let mut pattern_counts = [0; 2];
let mut search = Search::compile(&patterns)?;
Call Search::iter
to create a StreamSearch
. Any IO errors or malformed UTF-8 will be return a SearchStreamError
.
for result in search.iter(f) {
match result {
Ok(m) => {
pattern_counts[m.id] += 1;
}
Err(e) => {
println!("Stream Error {e:?}");
break;
}
}
}
println!("Assent Count: {}", pattern_counts[0]);
println!("Question Count: {}", pattern_counts[1]);
// Assent Count: 1467
// Question Count: 1934
Walk a Stream Manually
let mut search = Search::compile(&[
r"$#?#?#.##",
r"(John|Jane) Doe"
])?;
Call Search::next
to begin searching. It will yield any matches deemed definitely-complete immediately.
let jane_match = Match::new(1, (0, 8));
assert_eq!(
search.next("Jane Doe paid John"),
vec![jane_match]
);
Call Search::next
again to continue with the same search state.
Note that "John Doe"
matched across the chunk boundary, and spans are relative to the start of the stream.
let john_match = Match::new(1, (14, 22));
let money_match_1 = Match::new(0, (23, 29));
let money_match_2 = Match::new(0, (41, 48));
assert_eq!(
search.next(" Doe $45.66 instead of $499.00"),
vec![john_match, money_match_1, money_match_2]
);
Call Search::finish
to collect any not-definitely-complete matches once the stream is closed.
assert_eq!(search.finish(), vec![]);
See more in the API docs.
Pattern Language
reggy
is case-insensitive by default. Spaces match any amount of whitespace (i.e. \s+
). All the reserved characters mentioned below (\
, (
, )
, {
, }
, ,
, ?
, |
, #
, and !
) may be escaped with a backslash for a literal match. Patterns are surrounded by implicit unicode word boundaries (i.e. \b
). Empty patterns or subpatterns are not permitted.
Examples
Make a character optional with ?
dogs?
matches dog
and dogs
Create two or more alternatives with |
dog|cat
matches dog
and cat
Create a sub-pattern with (...)
the qualit(y|ies) required
matches the quality required
and the qualities required
the only( one)? around
matches the only around
and the only one around
Create a case-sensitive sub-pattern with (!...)
United States of America|(!USA)
matches USA
, not usa
Match digits with #
#.##
matches 3.14
Match exactly n times with {n}
, or between n and m times with {n,m}
(very ){1,4}strange
matches very very very strange
Definitely-Complete Matches
reggy
follows "leftmost-longest", greedy matching semantics. A pattern may match after one step of a stream, yet may match a longer form depending on the next step. For example, abb?
will match s.next("ab")
, but a subsequent call to s.next("b")
would create a longer match, "abb"
, which should supercede the match "ab"
.
Search
only yields matches once they are definitely complete and cannot be superceded by future next
calls. Each pattern has a maximum byte length L
, counting contiguous whitespace as 1 byte.[^1] Once reggy
has streamed at most L
bytes past the start of a match without superceding it, that match will be yielded.
As a consequence, results of a given Search
are the same regardless of how a given haystack stream is chunked. Search::next
returns Match
es as soon as it practically can while respecting this invariant.
Implementation
The pattern language is parsed with lalrpop
(grammar).
The search routines use a regex_automata::dense::DFA
. Compared to other regex engines, the dense DFA is memory-intensive and slow to construct, but searches are fast. Unicode word boundaries are handled by the unicode_segmentation
crate.
[^1]: This is why unbounded quantifiers are absent from reggy
. When a pattern requires *
or +
, users should choose an upper limit ({0,n}
, {1,n}
) instead.
Dependencies
~2.2–4.5MB
~75K SLoC