72 releases
0.8.5 | Sep 29, 2024 |
---|---|
0.8.4 | Jun 9, 2024 |
0.8.3 | Mar 26, 2024 |
0.8.2 | Oct 14, 2023 |
0.2.1 | Jul 21, 2015 |
#43 in Text processing
16,949,881 downloads per month
Used in 48,129 crates
(158 directly)
1.5MB
50K
SLoC
regex-syntax
This crate provides a robust regular expression parser.
Documentation
Overview
There are two primary types exported by this crate: Ast
and Hir
. The former
is a faithful abstract syntax of a regular expression, and can convert regular
expressions back to their concrete syntax while mostly preserving its original
form. The latter type is a high level intermediate representation of a regular
expression that is amenable to analysis and compilation into byte codes or
automata. An Hir
achieves this by drastically simplifying the syntactic
structure of the regular expression. While an Hir
can be converted back to
its equivalent concrete syntax, the result is unlikely to resemble the original
concrete syntax that produced the Hir
.
Example
This example shows how to parse a pattern string into its HIR:
use regex_syntax::{hir::Hir, parse};
let hir = parse("a|b").unwrap();
assert_eq!(hir, Hir::alternation(vec![
Hir::literal("a".as_bytes()),
Hir::literal("b".as_bytes()),
]));
Safety
This crate has no unsafe
code and sets forbid(unsafe_code)
. While it's
possible this crate could use unsafe
code in the future, the standard
for doing so is extremely high. In general, most code in this crate is not
performance critical, since it tends to be dwarfed by the time it takes to
compile a regular expression into an automaton. Therefore, there is little need
for extreme optimization, and therefore, use of unsafe
.
The standard for using unsafe
in this crate is extremely high because this
crate is intended to be reasonably safe to use with user supplied regular
expressions. Therefore, while there may be bugs in the regex parser itself,
they should never result in memory unsafety unless there is either a bug
in the compiler or the standard library. (Since regex-syntax
has zero
dependencies.)
Crate features
By default, this crate bundles a fairly large amount of Unicode data tables
(a source size of ~750KB). Because of their large size, one can disable some
or all of these data tables. If a regular expression attempts to use Unicode
data that is not available, then an error will occur when translating the Ast
to the Hir
.
The full set of features one can disable are in the "Crate features" section of the documentation.
Testing
Simply running cargo test
will give you very good coverage. However, because
of the large number of features exposed by this crate, a test
script is
included in this directory which will test several feature combinations. This
is the same script that is run in CI.
Motivation
The primary purpose of this crate is to provide the parser used by regex
.
Specifically, this crate is treated as an implementation detail of the regex
,
and is primarily developed for the needs of regex
.
Since this crate is an implementation detail of regex
, it may experience
breaking change releases at a different cadence from regex
. This is only
possible because this crate is not a public dependency of regex
.
Another consequence of this de-coupling is that there is no direct way to
compile a regex::Regex
from a regex_syntax::hir::Hir
. Instead, one must
first convert the Hir
to a string (via its std::fmt::Display
) and then
compile that via Regex::new
. While this does repeat some work, compilation
typically takes much longer than parsing.
Stated differently, the coupling between regex
and regex-syntax
exists only
at the level of the concrete syntax.