4 releases
Version | Date |
---|---|
0.2.2 | Oct 5, 2024 |
0.2.1 | Mar 7, 2024 |
0.2.0 | Feb 27, 2024 |
0.1.0 | Feb 20, 2024 |
#795 in Parser implementations
Used in dotnet-lens
440KB, 7.5K SLoC
Simple(ish) parser and extractor of XML.
This package provides an XmlReader which can automatically determine the character encoding of UTF-8 and UTF-16 (big endian and little endian byte order) XML byte streams, and parse the XML into an immutable Element tree held within an XmlDocument. It's also possible to use a custom byte stream decoder to read XML in other character encodings.
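To make the automatic detection idea concrete, here is a minimal sketch of how the first bytes of a stream can distinguish UTF-8 from the two UTF-16 byte orders, following the byte patterns described in Appendix F of the XML 1.0 specification. The `Detected` enum and `sniff_encoding` function are hypothetical names invented for this illustration; they are not part of this package, and the package's own detection logic may well differ.

```rust
/// Encodings this sketch can recognise (hypothetical type, not the crate's).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Detected {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Guess the encoding from the first bytes of an XML byte stream.
fn sniff_encoding(bytes: &[u8]) -> Detected {
    match bytes {
        // Byte order marks.
        [0xEF, 0xBB, 0xBF, ..] => Detected::Utf8,
        [0xFE, 0xFF, ..] => Detected::Utf16Be,
        [0xFF, 0xFE, ..] => Detected::Utf16Le,
        // No byte order mark: "<?" encoded as two UTF-16 code units.
        [0x00, 0x3C, 0x00, 0x3F, ..] => Detected::Utf16Be,
        [0x3C, 0x00, 0x3F, 0x00, ..] => Detected::Utf16Le,
        // Anything else is treated as UTF-8, the XML default.
        _ => Detected::Utf8,
    }
}

fn main() {
    println!("{:?}", sniff_encoding(b"\xEF\xBB\xBF<root/>"));
    println!("{:?}", sniff_encoding(&[0xFF, 0xFE, 0x3C, 0x00]));
    println!("{:?}", sniff_encoding(b"<root/>"));
}
```

This sketch only classifies the stream; decoding the bytes into characters is a separate step.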
The aim of this package is to support as closely as possible the W3C specifications Extensible Markup Language (XML) 1.0 and Namespaces in XML 1.0 for well-formed XML. This package does not aim to support validation of XML, and consequently DTD (document type definition) is deliberately not supported.
Namespace support is always enabled, so a colon may appear in an element or attribute name only as the separator between a namespace prefix and the local part.
XML concepts already supported
- Elements
- Attributes
- Default namespaces xmlns="namespace.com"
- Prefixed namespaces xmlns:prefix="namespace.com"
- Processing instructions
- Comments (skipped and thus not retrievable)
- CDATA sections
- Element language xml:lang and filtering by language
- White space indication xml:space
- Automatic detection and decoding of UTF-8 and UTF-16 XML streams.
- Support for custom encodings where the encoding is known before parsing, and where the client supplies a custom decoder to handle the byte-to-character conversion.
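As a rough illustration of what a byte-to-character conversion for another encoding involves, the sketch below decodes ISO-8859-1 (Latin-1), where every byte maps to the Unicode code point with the same numeric value. The function `decode_latin1` is a hypothetical standalone helper written for this example; it does not implement this package's decoder interface, whose exact shape is not shown here.

```rust
/// Decode ISO-8859-1 bytes: each byte is the code point of the same value.
/// Hypothetical helper for illustration only.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    // In ISO-8859-1 the letter e-acute is the single byte 0xE9.
    let raw = b"<name>caf\xE9</name>";
    let text = decode_latin1(raw);
    println!("{}", text);
}
```

Multi-byte encodings need a stateful decoder, which is why the encoding has to be known before parsing begins.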
Examples
Reading an XML file
Suppose you want to read and extract XML from a file you know to be either UTF-8 or UTF-16 encoded. You can use XmlReader::parse_auto to read, parse, and extract the XML from the file and return either an XmlDocument or an std::io::Error.
let xml_file = File::open("test_resources/xml_utf8_BOM.xml")?;
let xml_doc = XmlReader::parse_auto(xml_file)?;
Traversing an XmlDocument
Once you have an XmlDocument you can grab an immutable reference to the root Element and then traverse through the element tree using the req (required child element) and opt (optional child element) methods to target the first child element with the specified name. And once we're pointing at the desired target, we can use element() or text() to attempt to grab the element or text-only content of the target element.
For example, let's define a simple XML structure where required elements have a name starting with "r_" and optional elements have a name starting with "o_".
<root>
  <r_Widget>
    <r_Name>Helix</r_Name>
    <o_AdditionalInfo>
      <r_ReleaseDate>2021-05-12</r_ReleaseDate>
      <r_CurrentVersion>23.10</r_CurrentVersion>
      <o_TopContributors>
        <r_Name>archseer</r_Name>
        <r_Name>the-mikedavis</r_Name>
        <r_Name>sudormrfbin</r_Name>
        <r_Name>pascalkuthe</r_Name>
        <r_Name>dsseng</r_Name>
        <r_Name>pickfire</r_Name>
      </o_TopContributors>
    </o_AdditionalInfo>
  </r_Widget>
</root>
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // Let's start by grabbing a reference to the widget element.
    // Because we use req to indicate that it should be considered
    // an error if this required element is missing, the element()
    // method will return a Result<&Element, XmlError>. So we use
    // the `?` operator to throw the XmlError if it occurs.
    let widget = xml_doc.root().req("r_Widget").element()?;

    // The name is required, so we just use req again. We also
    // expect the name to contain only simple text content (not
    // mixed with other elements or processing instructions) so we
    // call text() followed by the `?` operator to throw the
    // XmlError that will be generated if either the name element
    // is not found, or if it contains non-simple content.
    let widget_name = widget.req("r_Name").text()?;

    // The info and top contributor elements are optional (may or
    // may not appear in this type of XML document) so we can use
    // the opt method to indicate that it is not an error if
    // either element is not found. Instead of a
    // Result<&Element, XmlError> this entirely optional chain
    // will cause element() to give us an Option<&Element>
    // instead, so we use `if let` to take action only if the
    // given optional chain elements all exist.
    if let Some(top_contrib_list) = widget
        .opt("o_AdditionalInfo")
        .opt("o_TopContributors")
        .element() {
        println!("Found top {} contributors!",
            top_contrib_list.elements()
                .filter(|e| e.is_named("r_Name")).count());
    }

    // If we want the release date, that's a required element
    // within an optional element. In other words, it's not an
    // error if "o_AdditionalInfo" is missing, but if it *is*
    // found then we consider it an error if it does not contain
    // "r_ReleaseDate". This is a mixed chain, involving both
    // required and optional, which means that element() will
    // return a Result<Option<&Element>, XmlError>, an Option
    // wrapped in a Result. So we use `if let` and the `?`
    // operator together.
    if let Some(release_date) = widget
        .opt("o_AdditionalInfo")
        .req("r_ReleaseDate")
        .element()? {
        println!("Release date: {}", release_date.text()?);
    }

    Ok(())
}
Note that the return type of the element() and text() methods varies depending on whether the method chain involves req or opt or both. This table summarizes the scenarios.
Chain involves | element() returns | text() returns |
---|---|---|
only req | Result<&Element, XmlError> | Result<&str, XmlError> |
only opt | Option<&Element> | Result<Option<&str>, XmlError> |
both req and opt | Result<Option<&Element>, XmlError> | Result<Option<&str>, XmlError> |
Similarly, the return types of att_req and att_opt methods also vary depending on the method chain.
Chain involves | att_req(name) returns | att_opt(name) returns |
---|---|---|
only req | Result<&str, XmlError> | Result<Option<&str>, XmlError> |
only opt | Result<Option<&str>, XmlError> | Option<&str> |
both req and opt | Result<Option<&str>, XmlError> | Result<Option<&str>, XmlError> |
It's easier to remember this as the following: req/att_req will generate an error if the element or attribute does not exist, so their use means that the return type must involve a Result<_, XmlError> of some sort. And opt/att_opt may or may not return a value, so their use means that the return type must involve an Option<_> of some sort. And mixing the two (required and optional) means that the return type must involve a Result<Option<_>, XmlError> of some sort. And text() generates an error if the target element does not have simple content (no child elements and no processing instructions) so its use also means that the return type must involve a Result of some sort.
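The way required and optional steps combine into Result, Option, or Result<Option<_>, _> can be modelled with plain standard-library types. The sketch below is a toy model only: `Node`, `req`, `opt` and `opt_then_req` are hypothetical free functions written for this illustration, the error type is a plain String, and none of it reflects how this package implements its cursor types.

```rust
// Toy stand-in for an element tree. This is NOT the crate's Element type.
struct Node {
    name: &'static str,
    children: Vec<Node>,
}

impl Node {
    fn child(&self, name: &str) -> Option<&Node> {
        self.children.iter().find(|c| c.name == name)
    }
}

/// Required step: a missing child is an error, so the result is a Result.
fn req<'a>(from: &'a Node, name: &str) -> Result<&'a Node, String> {
    from.child(name)
        .ok_or_else(|| format!("missing required element {}", name))
}

/// Optional step: a missing child is not an error, so the result is an Option.
fn opt<'a>(from: &'a Node, name: &str) -> Option<&'a Node> {
    from.child(name)
}

/// Optional parent, then required child: an Option wrapped in a Result.
fn opt_then_req<'a>(
    from: &'a Node,
    optional: &str,
    required: &str,
) -> Result<Option<&'a Node>, String> {
    match opt(from, optional) {
        // The optional parent is absent: fine, there is just nothing to return.
        None => Ok(None),
        // The parent exists, so the required child must exist too.
        Some(parent) => req(parent, required).map(Some),
    }
}

fn main() {
    let widget = Node {
        name: "r_Widget",
        children: vec![Node {
            name: "o_AdditionalInfo",
            children: vec![Node { name: "r_ReleaseDate", children: vec![] }],
        }],
    };

    match opt_then_req(&widget, "o_AdditionalInfo", "r_ReleaseDate") {
        Ok(Some(found)) => println!("found {}", found.name),
        Ok(None) => println!("optional parent absent"),
        Err(e) => println!("error: {}", e),
    }
}
```

The three outcomes of the mixed case (found, legitimately absent, error) are exactly what `if let Some(x) = chain.element()?` separates in the traversal examples.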
More complex traversal using XmlPath
The methods req and opt always turn their attention to the first child element with the given name. It's not possible to use them to target a sibling, say the second "Widget" within a list of "Widget" elements. To target siblings, and/or to iterate multiple elements, you instead use XmlPath. (Don't confuse this with XPath which has a similar purpose but very different implementation.)
For example, if you have XML which contains a list of employees, and you want to iterate the employees' tasks' deadlines, you could use XmlPath like this:
<roster>
  <employee>
    <name>Angelica</name>
    <department>Finance</department>
    <task-list>
      <task>
        <name>Payroll</name>
        <deadline>tomorrow</deadline>
      </task>
      <task>
        <name>Reconciliation</name>
        <deadline>Friday</deadline>
      </task>
    </task-list>
  </employee>
  <employee>
    <name>Byron</name>
    <department>Sales</department>
    <task-list>
      <task>
        <name>Close the big deal</name>
        <deadline>Saturday night</deadline>
      </task>
    </task-list>
  </employee>
  <employee>
    <name>Cat</name>
    <department>Software</department>
    <task-list>
      <task>
        <name>Fix that bug</name>
        <deadline>Maybe later this month</deadline>
      </task>
      <task>
        <name>Add that new feature</name>
        <deadline>Possibly this year</deadline>
      </task>
      <task>
        <name>Make that customer happy</name>
        <deadline>Good luck with that</deadline>
      </task>
    </task-list>
  </employee>
</roster>
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    for deadline in xml_doc.root()
        .all("employee")
        .first("task-list")
        .all("task")
        .first("deadline")
        .iter() {
        println!("Found task deadline: {}", deadline.text()?);
    }
    Ok(())
}
This creates and iterates an XmlPath which represents "the first deadline element within every task within the first task-list within every employee". Based on the example XML above, this will print out all the text content of all six "deadline" elements.
Note that we could use first("employee") if we only wanted the first employee. Or we could use nth("employee", 1) if we only want the second employee (zero would point to the first). Or we could use last("employee") if we only want the last employee. Similarly, we could use first("task") if we only wanted to consider the first task in each employee's list.
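For readers who find it easier to think in terms of standard iterators, the "all / first" selection described above can be modelled with `filter`, `next` and `flat_map` over a toy tree. Everything in this sketch (`Node`, its `all` and `first` methods, and the helper constructors) is hypothetical and written only to show the selection semantics; the real XmlPath is a different type with its own implementation.

```rust
// Toy element type for illustration only; not the crate's Element or XmlPath.
struct Node {
    name: &'static str,
    text: &'static str,
    children: Vec<Node>,
}

impl Node {
    /// Every child with the given name.
    fn all<'a>(&'a self, name: &'a str) -> impl Iterator<Item = &'a Node> + 'a {
        self.children.iter().filter(move |c| c.name == name)
    }

    /// Only the first child with the given name, if any.
    fn first<'a>(&'a self, name: &'a str) -> Option<&'a Node> {
        self.all(name).next()
    }
}

fn leaf(name: &'static str, text: &'static str) -> Node {
    Node { name, text, children: vec![] }
}

fn branch(name: &'static str, children: Vec<Node>) -> Node {
    Node { name, text: "", children }
}

/// Build an employee whose single task-list holds one task per deadline.
fn employee(deadlines: &[&'static str]) -> Node {
    let tasks = deadlines
        .iter()
        .copied()
        .map(|d| branch("task", vec![leaf("deadline", d)]))
        .collect();
    branch("employee", vec![branch("task-list", tasks)])
}

/// "The first deadline within every task within the first task-list
/// within every employee".
fn first_deadlines(roster: &Node) -> Vec<&'static str> {
    roster
        .all("employee")
        .flat_map(|e| e.first("task-list"))
        .flat_map(|list| list.all("task"))
        .flat_map(|task| task.first("deadline"))
        .map(|d| d.text)
        .collect()
}

fn main() {
    let roster = branch("roster", vec![
        employee(&["tomorrow", "Friday"]),
        employee(&["Saturday night"]),
        employee(&["Maybe later this month", "Possibly this year", "Good luck with that"]),
    ]);

    for deadline in first_deadlines(&roster) {
        println!("Found task deadline: {}", deadline);
    }

    // Picking a single sibling by zero-based position, like nth("employee", 1).
    let second = roster.all("employee").nth(1);
    println!("second employee exists: {}", second.is_some());
}
```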
Filtering elements within an XmlPath
An XmlPath not only lets you specify which child element names are of interest, but also lets you specify which xml:lang patterns are of interest, and lets you specify a required attribute name-value pair which must be found within a child element in order to include it in the iterator.
<inventory>
  <box type='games'>
    <item>
      <name xml:lang='en'>C&amp;C: Tiberian Dawn</name>
      <name xml:lang='en-US'>Command &amp; Conquer</name>
      <name xml:lang='de'>C&amp;C: Teil 1</name>
    </item>
    <item>
      <name xml:lang='en'>Doom</name>
      <name xml:lang='sr'>Zla kob</name>
      <name xml:lang='ja'>ドゥーム</name>
    </item>
    <item>
      <name xml:lang='en'>Half-Life</name>
      <name xml:lang='sr'>Polu-život</name>
    </item>
  </box>
  <box type='movies'>
    <item>
      <name xml:lang='en'>Aliens</name>
      <name xml:lang='sv-SE'>Aliens - Återkomsten</name>
      <name xml:lang='vi'>Quái Vật Không Gian 2</name>
    </item>
    <item>
      <name xml:lang='en'>The Cabin In The Woods</name>
      <name xml:lang='bg'>Хижа в гората</name>
      <name xml:lang='fr'>La cabane dans les bois</name>
    </item>
  </box>
</inventory>
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    let english = ExtendedLanguageRange::new("en")?;
    // Only boxes with type='games' are searched, so every match
    // is a game title.
    for game in xml_doc.root()
        .all("box")
        .with_attribute("type", "games")
        .all("item")
        .all("name")
        .filter_lang_range(&english)
        .iter() {
        println!("Found game title in English: {}",
            game.text()?);
    }
    Ok(())
}
This will print out the names of all four English-language titles for the three games. It will skip all of the movies, and all names which are rejected by the "en" language filter. Note that this "en" filter will match both xml:lang="en" and xml:lang="en-US" so you'll get two matching name elements for the first game.
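To see why a range of "en" accepts both "en" and "en-US" but not "de", here is a minimal sketch of the simpler "basic filtering" rule from RFC 4647: a range matches a tag when the two are equal ignoring case, or when the tag begins with the range followed by a hyphen. The function `basic_lang_match` is a hypothetical helper for illustration. This package uses an ExtendedLanguageRange, which follows the richer extended filtering rules; for a plain range like "en" the two schemes should give the same answers, but that is an assumption of this sketch rather than a statement about the package's internals.

```rust
/// RFC 4647 basic filtering: does `range` match the language tag `tag`?
/// Hypothetical helper, not part of the crate.
fn basic_lang_match(range: &str, tag: &str) -> bool {
    // The wildcard range matches every tag.
    if range == "*" {
        return true;
    }
    // Language tags are compared case-insensitively.
    let range = range.to_ascii_lowercase();
    let tag = tag.to_ascii_lowercase();
    match tag.strip_prefix(range.as_str()) {
        // Either an exact match, or the range is a prefix ending at a subtag boundary.
        Some(rest) => rest.is_empty() || rest.starts_with('-'),
        None => false,
    }
}

fn main() {
    for tag in ["en", "en-US", "de", "sr"] {
        println!("en matches {}: {}", tag, basic_lang_match("en", tag));
    }
}
```

The subtag boundary check is what stops "en" from matching an unrelated tag that merely starts with the same letters.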
Attribute extraction
Getting the value of an attribute is done with the methods att_req (generate an error if the attribute is missing) and att_opt (no error if the attribute is missing).
For example, given this simple XML document, we can grab the attribute values easily.
<root generationDate='2023-02-09T18:10:00Z'>
  <record id='35517'>
    <temp locationId='23'>40.5</temp>
  </record>
  <record id='35518'>
    <temp locationId='36'>38.9</temp>
  </record>
</root>
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // Iterate the records using an XmlPath.
    for record in xml_doc.root().all("record").iter() {
        // The record@id attribute is required (we consider it an
        // error if it is missing). So use att_req and then the
        // `?` syntax to throw any XmlError generated.
        let record_id = record.att_req("id")?;
        let temp = record.req("temp").element()?;
        let temp_value = temp.text()?;

        // The temp@locationId attribute is optional (we don't
        // consider it an error if it's not found within this
        // element). So use att_opt and then `if let` to check for
        // it.
        if let Some(loc_id) = temp.att_opt("locationId") {
            println!("Found temperature {} at {}",
                temp_value, loc_id);
        } else {
            println!("Found temperature {} at ??? location.",
                temp_value);
        }
    }
    Ok(())
}
Note: the xml:lang and xml:space values cannot be read as attribute values from an Element, because these are "special attributes" whose values are inherited by child elements (and the language is inherited by an element's attributes too). To get the effective value of these language and space properties, see the methods language_tag and white_space_handling instead.
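The inheritance rule can be sketched in a few lines: the effective language of an element is the nearest xml:lang declaration found on the element itself or on one of its ancestors, and (per the XML 1.0 specification) an empty xml:lang value overrides an inherited language without specifying a new one. The function `effective_lang` below is a hypothetical helper operating on a plain slice; it only illustrates the rule and says nothing about how language_tag is implemented in this package.

```rust
/// `chain` lists the xml:lang declaration (if any) on each element from the
/// root down to the target element. Hypothetical helper for illustration.
fn effective_lang<'a>(chain: &[Option<&'a str>]) -> Option<&'a str> {
    chain
        .iter()
        .rev()
        // Nearest declaration wins, starting from the target element itself.
        .find_map(|decl| *decl)
        // An empty declaration means "no language", overriding any ancestor.
        .filter(|lang| !lang.is_empty())
}

fn main() {
    // <root xml:lang='en'><item><name/></item></root>
    println!("{:?}", effective_lang(&[Some("en"), None, None]));
    // <root xml:lang='en'><item xml:lang='de'><name/></item></root>
    println!("{:?}", effective_lang(&[Some("en"), Some("de"), None]));
    // <root xml:lang='en'><item xml:lang=''/></root>
    println!("{:?}", effective_lang(&[Some("en"), Some("")]));
}
```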
Namespace handling
All of the examples so far have used XML without any namespace declarations, which means that the element and attribute names are not within any namespace (or put another way, they have a namespace which has no value). Specifying the target name of an element or attribute can be done with a string slice &str when the namespace has no value. But when the target name has a namespace value, you must specify the namespace in order to target the desired element.
The most direct way of doing this is to use a (&str, &str) tuple which contains the local part and then namespace (not the prefix) of the element name. But you can also call the pre_ns (preset or predefined namespace) method to let a cursor or XmlPath know that it should assume the given namespace value if you don't use a tuple to directly specify the namespace for each element and attribute within the method chain. An example is probably the easiest way to explain this.
<!-- The root element declares that the default namespace for it
and its descendants should be the given URI. It also declares that
any element/attribute using prefix 'pfx' belongs to a namespace
with a different URI. -->
<root xmlns='example.com/DefaultNamespace'
      xmlns:pfx='example.com/OtherNamespace'>
  <one>This child element has no prefix, so it inherits
    the default namespace.</one>
  <pfx:two>This child element has prefix pfx, so inherits the
    other namespace.</pfx:two>
  <pfx:three pfx:key='value'>Attribute names can be prefixed
    too.</pfx:three>
  <four key2='value2'>Unprefixed attribute names do *not*
    inherit namespaces.</four>
  <five xmlns='' key3='value3'>The default namespace can be
    cleared too.</five>
</root>
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    let root = xml_doc.root();

    // You can use a tuple to specify the local part and namespace
    // of the targeted element.
    let one = root.req(("one", "example.com/DefaultNamespace"))
        .element()?;

    // Or you can call pre_ns before a chain of
    // req/opt/first/all/nth/last method calls.
    let two = root.pre_ns("example.com/OtherNamespace")
        .req("two").element()?;

    // The effect of pre_ns continues until you call element() or
    // text(), so you can keep assuming the same namespace for
    // child elements or attributes.
    let three_key = root.pre_ns("example.com/OtherNamespace")
        .req("three").att_req("key")?;

    // Be careful if the namespace changes (or is cleared) when
    // moving down through child elements and attributes. If that
    // happens, you can call pre_ns again, or you can use a tuple
    // to explicitly state the different namespace.
    let four_key = root
        .pre_ns("example.com/DefaultNamespace")
        .req("four")
        .pre_ns("")
        .att_req("key2")?;

    // When no namespace applies to an element or attribute name,
    // you don't need to specify any namespace to target it, so
    // you don't need to use pre_ns nor a tuple. But you can
    // anyway if you want to make it more explicit that there is
    // no namespace.
    let five_key = root.req(("five", "")).att_req(("key3", ""))?;

    Ok(())
}
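The reason the tuples above carry a namespace value rather than a prefix follows from the Namespaces in XML rules: a prefixed name takes the namespace bound to its prefix, an unprefixed element name takes the default namespace in scope, and an unprefixed attribute name is in no namespace at all. The sketch below models that rule with a hypothetical `expand` function and a plain HashMap of prefix bindings; it is an illustration of the specification, not of how this package resolves names internally.

```rust
use std::collections::HashMap;

/// Turn a name as written in the XML into (local part, namespace value).
/// Returns None if the prefix has no binding. Hypothetical helper.
fn expand<'a>(
    qname: &'a str,
    is_attribute: bool,
    default_ns: &'a str,
    prefixes: &HashMap<&str, &'a str>,
) -> Option<(&'a str, &'a str)> {
    match qname.split_once(':') {
        // Prefixed name: look up the namespace bound to the prefix.
        Some((prefix, local)) => prefixes.get(prefix).map(|ns| (local, *ns)),
        // Unprefixed attribute: never inherits the default namespace.
        None if is_attribute => Some((qname, "")),
        // Unprefixed element: takes the default namespace in scope.
        None => Some((qname, default_ns)),
    }
}

fn main() {
    let default_ns = "example.com/DefaultNamespace";
    let mut prefixes = HashMap::new();
    prefixes.insert("pfx", "example.com/OtherNamespace");

    println!("{:?}", expand("one", false, default_ns, &prefixes));
    println!("{:?}", expand("pfx:two", false, default_ns, &prefixes));
    println!("{:?}", expand("pfx:key", true, default_ns, &prefixes));
    println!("{:?}", expand("key2", true, default_ns, &prefixes));
    // <five xmlns=''> clears the default namespace for that element.
    println!("{:?}", expand("five", false, "", &prefixes));
}
```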
It's important to note that once you call element() the effect of pre_ns vanishes. So don't forget that if you do call element() in the middle of a method chain, you need to call pre_ns again in order to specify the preset namespace from that point forward.
<root xmlns='example.com/DefaultNamespace'>
  <topLevel>
    <innerLevel>
      <list>
        <item>something</item>
        <item>whatever</item>
        <item>more</item>
        <item>and so on</item>
      </list>
    </innerLevel>
  </topLevel>
</root>
// Defining a constant makes it quicker to type namespaces,
// and easier to read the code.
const NS_DEF: &str = "example.com/DefaultNamespace";

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // Use a chain of req calls to get to the required list, then
    // use an XmlPath to iterate however many items are found
    // within the list and count them.

    // This first attempt will actually give us the wrong number,
    // because once we call element()? we receive an `&Element`
    // reference, and the preset namespace effect is lost. So the
    // XmlPath we chain on straight after that will be searching
    // the empty namespace and won't find any matching elements
    // and will report a count of zero.
    let mistake = xml_doc
        .root()
        .pre_ns(NS_DEF)
        .req("topLevel")
        .req("innerLevel")
        .req("list")
        .element()?
        .all("item")
        .iter()
        .count();

    // You can fix the problem by either using an explicit name
    // tuple `("item", NS_DEF)` or by calling pre_ns again after
    // element() so that the XmlPath knows which namespace should
    // be used when searching for items.
    let correct = xml_doc
        .root()
        .pre_ns(NS_DEF)
        .req("topLevel")
        .req("innerLevel")
        .req("list")
        .element()?
        .pre_ns(NS_DEF)
        .all("item")
        .iter()
        .count();

    // However, to avoid confusion, it's recommended to avoid
    // including `element()` between two different method chains,
    // and to instead assign it to a variable name for clarity.
    let list = xml_doc
        .root()
        .pre_ns(NS_DEF)
        .req("topLevel")
        .req("innerLevel")
        .req("list")
        .element()?;
    let cleanest = list.all(("item", NS_DEF)).iter().count();

    Ok(())
}
Error handling
The examples above have simplified the code snippets for brevity, but in a real application you will need to handle the different error types returned by the different steps of reading/parsing and extracting from XML. Here is a compact example which shows the error handling needed for each step.
fn main() {
    // Decide what to do if either step returns an error.
    // For simplicity, we'll simply panic in this example, but in
    // a real application you may want to remap the error to the
    // type used by your application, or trigger some recovery
    // logic instead.
    let xml_doc = match read_xml() {
        Ok(d) => d,
        // std::io::Error implements Display, so include it in
        // the panic message.
        Err(e) => panic!("XML reading or parsing failed: {}", e),
    };
    match extract_xml(xml_doc) {
        Ok(()) => println!("Finished without errors!"),
        Err(_) => panic!("XML extraction failed!"),
    }
}

// The XML parsing methods might throw an std::io::Error, so they
// go into their own method.
fn read_xml() -> Result<XmlDocument, std::io::Error> {
    let xml = "<root><child/></root>";
    XmlReader::parse_auto(xml.as_bytes())
}

// The extraction methods might throw an XmlError, so they go into
// their own method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // The leading underscore marks the binding as intentionally
    // unused in this compact example.
    let _child = xml_doc.root().req("child").element()?;
    Ok(())
}
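If you would rather remap both error types into one application-level type than handle them separately, a common Rust pattern is an enum with a From implementation per source error, so that the `?` operator converts automatically. The sketch below is self-contained and uses stand-ins: the `XmlError` struct here is a placeholder defined only for this example (it is not the crate's XmlError), and `AppError`, `read_step`, `extract_step` and `run` are hypothetical names.

```rust
use std::fmt;
use std::io;

// Placeholder standing in for the crate's XmlError in this sketch.
#[derive(Debug)]
struct XmlError(String);

/// One application-level error covering both steps.
#[derive(Debug)]
enum AppError {
    Io(io::Error),
    Xml(XmlError),
}

impl From<io::Error> for AppError {
    fn from(e: io::Error) -> Self {
        AppError::Io(e)
    }
}

impl From<XmlError> for AppError {
    fn from(e: XmlError) -> Self {
        AppError::Xml(e)
    }
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::Io(e) => write!(f, "reading or parsing failed: {}", e),
            AppError::Xml(e) => write!(f, "extraction failed: {}", e.0),
        }
    }
}

// Stand-in for the reading/parsing step, which fails with std::io::Error.
fn read_step(fail: bool) -> Result<&'static str, io::Error> {
    if fail {
        Err(io::Error::new(io::ErrorKind::InvalidData, "malformed XML"))
    } else {
        Ok("<root><child/></root>")
    }
}

// Stand-in for the extraction step, which fails with the XML error type.
fn extract_step(doc: &str) -> Result<usize, XmlError> {
    if doc.contains("<child") {
        Ok(1)
    } else {
        Err(XmlError("missing child".to_string()))
    }
}

/// Both steps in one function: `?` converts each error via From.
fn run(fail_read: bool) -> Result<usize, AppError> {
    let doc = read_step(fail_read)?;
    let count = extract_step(doc)?;
    Ok(count)
}

fn main() {
    for fail in [false, true] {
        match run(fail) {
            Ok(n) => println!("extracted {} element(s)", n),
            Err(e) => eprintln!("{}", e),
        }
    }
}
```

With this shape, reading and extraction can share a single function, and the caller decides once how to report or recover.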