4 releases

new 0.2.1 Mar 19, 2025
0.2.0 Mar 19, 2025
0.1.1 Mar 19, 2025
0.1.0 Mar 19, 2025

#300 in Text processing

Download history 411/week @ 2025-03-15

411 downloads per month
Used in 9 crates (via xee-interpreter)

Apache-2.0 OR MPL-2.0

215KB
6K SLoC

regexml

To start off, don't worry. We're not using regexes to parse XML in this crate. We're not in fact dealing with XML directly at all.

If you need a Regex engine in Rust, this crate isn't likely for you; use the regex crate instead. This crate instead implements a Regex engine compliant with varous XML-related standards, and focuses on standard-compliance rather than performance.

regexml implements a regular expression engine that's compliant with regular expressions as defined in appendix G of the XML Schema 1.1 standard, part 2:

https://www.w3.org/TR/xmlschema11-2/#regexs

This is the regex language that XML Schema uses so the user can define patterns as additional constraints on string data in an XML document:

https://www.w3.org/TR/xmlschema11-2/#dc-pattern

The XPath and XQuery Functions and Operators 3.1 specification defines an extension of these regular expressions for the purposes of use within the XPath and XQuery standard function library:

https://www.w3.org/TR/xpath-functions-31/#regex-syntax

regexml also implements this extension.

Origins

The Rust source code is based on the Java implementation in Saxon HE net.sf.saxon.regex, which implements a spec-compliant Regex engine. In turn this code is based on an engine implemented by Apache Jakarta:

https://blog.saxonica.com/mike/2012/01/a-new-regex-engine.html

The Java code has been translated by hand into Rust. There are some differences:

  • Operation is an enum, instead of implemented using subclasses and dynamic dispatch as in the Java version. A trait OperationControl provides dispatch to the enums.

  • The icu4x project's icu_ crates are used to provide various unicode features, including the implementation of character classes and casing rules. Especially CodePointInversionListand its associated builder proved very useful. Due to the way the regex compiler is organizedCharacterClass does provide a special case for character class that matches with a single character.

  • The original code had no internal tests, but a lot of integration tests were provided through the qt3tests project for testing XPath and XQuery. Most of those tests have been ported into simple Rust tests, which makes this package easier to maintain and debug.

Now that the port is complete we expect this package to evolve separately wherever it may go - no 1 to 1 mapping with the original Java code is going to be maintained.

Dependencies

~2.6–3.5MB
~60K SLoC