4 releases

| Version | Released |
|---|---|
| 0.2.1 (new) | Mar 19, 2025 |
| 0.2.0 | Mar 19, 2025 |
| 0.1.1 | Mar 19, 2025 |
| 0.1.0 | Mar 19, 2025 |
regexml
To start off, don't worry. We're not using regexes to parse XML in this crate. We're not in fact dealing with XML directly at all.
If you need a general-purpose regex engine in Rust, this crate is probably not for you; use the regex crate instead. This crate implements a regex engine compliant with various XML-related standards, and it focuses on standards compliance rather than performance.
regexml implements a regular expression engine that's compliant with regular expressions as defined in appendix G of the XML Schema 1.1 standard, part 2:
https://www.w3.org/TR/xmlschema11-2/#regexs
This is the regex language that XML Schema uses so the user can define patterns as additional constraints on string data in an XML document:
https://www.w3.org/TR/xmlschema11-2/#dc-pattern
The XPath and XQuery Functions and Operators 3.1 specification defines an extension of these regular expressions for the purposes of use within the XPath and XQuery standard function library:
https://www.w3.org/TR/xpath-functions-31/#regex-syntax
regexml also implements this extension.
Origins
The Rust source code is based on the Java implementation in Saxon HE net.sf.saxon.regex, which implements a spec-compliant regex engine. In turn, this code is based on an engine implemented by Apache Jakarta:
https://blog.saxonica.com/mike/2012/01/a-new-regex-engine.html
The Java code has been translated by hand into Rust. There are some differences:
- Operation is an enum, instead of being implemented using subclasses and dynamic dispatch as in the Java version. A trait, OperationControl, provides dispatch to the enum's variants.
- The icu4x project's icu_ crates are used to provide various Unicode features, including the implementation of character classes and casing rules. CodePointInversionList and its associated builder proved especially useful. Due to the way the regex compiler is organized, CharacterClass does provide a special case for a character class that matches a single character.
- The original code had no internal tests, but a lot of integration tests were provided through the qt3tests project for testing XPath and XQuery. Most of those tests have been ported into simple Rust tests, which makes this package easier to maintain and debug.
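To illustrate the first two differences, here is a minimal sketch of enum-based dispatch in Rust. It is not the crate's actual code: the names Operation, OperationControl and CharacterClass come from the text above, but the variants, the match_at method and its signature are assumptions made up for this example, and a plain list of ranges stands in for CodePointInversionList.

```rust
/// Hypothetical character class with a fast path for a single character.
/// The real crate builds on icu4x; a range list stands in for that here.
#[derive(Debug, Clone)]
enum CharacterClass {
    Single(char),
    Ranges(Vec<(char, char)>), // inclusive ranges
}

impl CharacterClass {
    fn contains(&self, c: char) -> bool {
        match self {
            CharacterClass::Single(s) => *s == c,
            CharacterClass::Ranges(rs) => rs.iter().any(|&(lo, hi)| lo <= c && c <= hi),
        }
    }
}

/// Hypothetical set of operations; each Java subclass becomes a variant.
#[derive(Debug, Clone)]
enum Operation {
    Literal(String),
    Class(CharacterClass),
    Sequence(Vec<Operation>),
}

/// Hypothetical trait: try to match at byte offset `pos` and
/// return the byte offset just past the match.
trait OperationControl {
    fn match_at(&self, input: &str, pos: usize) -> Option<usize>;
}

impl OperationControl for Operation {
    fn match_at(&self, input: &str, pos: usize) -> Option<usize> {
        // A `match` on the variant replaces the virtual call of the Java design.
        match self {
            Operation::Literal(lit) => input[pos..]
                .starts_with(lit.as_str())
                .then(|| pos + lit.len()),
            Operation::Class(class) => {
                let c = input[pos..].chars().next()?;
                class.contains(c).then(|| pos + c.len_utf8())
            }
            Operation::Sequence(ops) => ops
                .iter()
                .try_fold(pos, |p, op| op.match_at(input, p)),
        }
    }
}

fn main() {
    let op = Operation::Sequence(vec![
        Operation::Literal("id-".to_string()),
        Operation::Class(CharacterClass::Ranges(vec![('0', '9')])),
        Operation::Class(CharacterClass::Single('x')),
    ]);
    println!("{:?}", op.match_at("id-7x", 0)); // Some(5)
    println!("{:?}", op.match_at("id-ax", 0)); // None
}
```

With an enum, the set of operations is closed and the compiler checks that every variant is handled, at the cost of touching the `match` whenever a variant is added.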
Now that the port is complete, we expect this package to evolve separately, wherever it may go; no one-to-one mapping with the original Java code will be maintained.