19 releases

Version | Released |
---|---|
0.7.0-beta.2 | Jul 6, 2024 |
0.6.4 | Jun 15, 2024 |
0.6.3 | Mar 16, 2024 |
0.5.1 | Dec 16, 2023 |
0.1.2 | Nov 20, 2021 |
Skyscraper - HTML scraping with XPath
Rust library to scrape HTML documents with XPath expressions.
This library is major-version 0 because there are still todo! calls for many XPath features. If you encounter one that you feel should be prioritized, open an issue on GitHub. See the Supported XPath Features section for details.
HTML Parsing
Skyscraper has its own HTML parser implementation. The parser outputs a tree structure that can be traversed manually with parent/child relationships.
Example: Simple HTML Parsing
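use skyscraper::html::{self, parse::ParseError};

fn main() -> Result<(), ParseError> {
    let html_text = r##"
    <html>
        <body>
            <div>Hello world</div>
        </body>
    </html>"##;

    // Parse the text into a document tree; `?` propagates any ParseError
    let document = html::parse(html_text)?;
    Ok(())
}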
Example: Traversing Parent/Child Relationships
use skyscraper::html::{self, parse::ParseError, DocumentNode};

fn main() -> Result<(), ParseError> {
    // Parse the HTML text into a document
    let text = r#"<parent><child/><child/></parent>"#;
    let document = html::parse(text)?;

    // Get the children of the root node
    let parent_node: DocumentNode = document.root_node;
    let children: Vec<DocumentNode> = parent_node.children(&document).collect();
    assert_eq!(2, children.len());

    // Get the parent of both child nodes
    let parent_of_child0: DocumentNode = children[0].parent(&document).expect("parent of child 0 missing");
    let parent_of_child1: DocumentNode = children[1].parent(&document).expect("parent of child 1 missing");
    assert_eq!(parent_node, parent_of_child0);
    assert_eq!(parent_node, parent_of_child1);
    Ok(())
}
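The traversal calls above pass the document to every node method (children(&document), parent(&document)). One common design that produces this kind of API is an arena: the document owns all node data and a node is only a small copyable handle. The sketch below illustrates that idea using only the standard library. It is an assumption about the general pattern, not Skyscraper's actual implementation, and the names NodeId, NodeData, Document, and add_child are hypothetical.

```rust
/// A lightweight, copyable handle to a node (hypothetical; not Skyscraper's type).
/// It is only an index into the arena owned by `Document`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeId(usize);

/// Per-node data stored inside the arena.
struct NodeData {
    name: String,
    parent: Option<NodeId>,
    children: Vec<NodeId>,
}

/// Owns every node. A handle means nothing without its document,
/// which is why the traversal methods below take `&Document`.
pub struct Document {
    nodes: Vec<NodeData>,
}

impl Document {
    /// Creates a document with a single root node and returns both.
    pub fn new(root_name: &str) -> (Self, NodeId) {
        let root = NodeData {
            name: root_name.to_string(),
            parent: None,
            children: Vec::new(),
        };
        (Document { nodes: vec![root] }, NodeId(0))
    }

    /// Appends a new child under `parent` and returns its handle.
    pub fn add_child(&mut self, parent: NodeId, name: &str) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(NodeData {
            name: name.to_string(),
            parent: Some(parent),
            children: Vec::new(),
        });
        self.nodes[parent.0].children.push(id);
        id
    }
}

impl NodeId {
    /// Iterates over the direct children in insertion order.
    pub fn children<'a>(self, doc: &'a Document) -> impl Iterator<Item = NodeId> + 'a {
        doc.nodes[self.0].children.iter().copied()
    }

    /// Returns the parent handle, or `None` for the root.
    pub fn parent(self, doc: &Document) -> Option<NodeId> {
        doc.nodes[self.0].parent
    }

    /// Returns the node's name.
    pub fn name<'a>(self, doc: &'a Document) -> &'a str {
        &doc.nodes[self.0].name
    }
}
```

Because handles are plain indices, they can be compared with assert_eq! and copied freely, which matches how parent_node is reused in the example above.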
XPath Expressions
Skyscraper is capable of parsing XPath strings and applying them to HTML documents.
Below is a basic XPath example. Please see the docs for more examples.
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree, grammar::{XpathItemTreeNodeData, data_model::{Node, XpathItem}}};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let html_text = r##"
    <html>
        <body>
            <div>Hello world</div>
        </body>
    </html>"##;

    let document = html::parse(html_text)?;
    let xpath_item_tree = XpathItemTree::from(&document);
    let xpath = xpath::parse("//div")?;

    let item_set = xpath.apply(&xpath_item_tree)?;

    assert_eq!(item_set.len(), 1);

    let mut items = item_set.into_iter();

    let item = items
        .next()
        .unwrap();

    let element = item
        .as_node()?
        .as_tree_node()?
        .data
        .as_element_node()?;

    assert_eq!(element.name, "div");
    Ok(())
}
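For readers new to XPath, it may help to see what //div asks for. In the XPath specification, // abbreviates /descendant-or-self::node()/, so without predicates //div selects every div element anywhere below the document node, in document order. The sketch below shows that selection on a toy tree using only the standard library. It does not use or mirror Skyscraper's internals; Element, select_descendants, and sample_document are hypothetical names introduced for illustration.

```rust
/// A toy element node (hypothetical; not Skyscraper's node type).
#[derive(Debug)]
pub struct Element {
    pub name: String,
    pub text: String,
    pub children: Vec<Element>,
}

impl Element {
    pub fn new(name: &str, text: &str, children: Vec<Element>) -> Self {
        Element {
            name: name.to_string(),
            text: text.to_string(),
            children,
        }
    }
}

/// Collects every descendant of `node` named `name`, in document
/// (pre-order) order. Called on the document node, this selects the
/// same nodes as the predicate-free expression `//name`.
pub fn select_descendants<'a>(node: &'a Element, name: &str, out: &mut Vec<&'a Element>) {
    for child in &node.children {
        if child.name == name {
            out.push(child);
        }
        // Recurse after testing the child so results stay in document order.
        select_descendants(child, name, out);
    }
}

/// Builds a small document: the "#document" node stands in for the
/// document node that sits above the root <html> element.
pub fn sample_document() -> Element {
    Element::new("#document", "", vec![
        Element::new("html", "", vec![
            Element::new("body", "", vec![
                Element::new("div", "Hello world", vec![]),
                Element::new("p", "", vec![
                    Element::new("div", "nested", vec![]),
                ]),
            ]),
        ]),
    ])
}
```

Calling select_descendants(&sample_document(), "div", &mut found) collects both div elements, the top-level one first and the nested one second.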
Supported XPath Features
Below is a non-exhaustive list of the features that are currently supported.
- Basic XPath steps: /html/body/div, //div/table//span
- Attribute selection: //div/@class
- Text selection: //div/text()
- Wildcard node selection: //body/*
- Predicates:
  - Attributes: //div[@class='hi']
  - Indexing: //div[1]
- Functions:
  - fn:root()
  - contains(haystack, needle)
- Forward axes:
  - Child: child::*
  - Descendant: descendant::*
  - Attribute: attribute::*
  - DescendantOrSelf: descendant-or-self::*
  - (more coming soon)
- Reverse axes:
  - Parent: parent::*
  - (more coming soon)
- Treat expressions: /html treat as node()
This should cover most XPath use-cases. If your use case requires an unimplemented feature, please open an issue on GitHub.
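The two predicate forms in the list above behave differently in a way that is easy to miss. Per the XPath specification, //div[@class='hi'] keeps every div whose class attribute equals hi, while //div[1] keeps every div that is the first div child of its own parent, which can be more than one node. The standard-library sketch below illustrates both semantics on a toy tree. It is not Skyscraper's code and has not been checked against Skyscraper's output; Element, select_by_attr, select_nth_per_parent, and sample_document are hypothetical names.

```rust
/// A toy element with attributes (hypothetical; not Skyscraper's node type).
#[derive(Debug)]
pub struct Element {
    pub name: String,
    pub attrs: Vec<(String, String)>,
    pub children: Vec<Element>,
}

impl Element {
    pub fn new(name: &str, attrs: &[(&str, &str)], children: Vec<Element>) -> Self {
        Element {
            name: name.to_string(),
            attrs: attrs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
            children,
        }
    }

    /// Looks up an attribute value by name.
    pub fn attr(&self, key: &str) -> Option<&str> {
        self.attrs.iter().find(|(k, _)| k == key).map(|(_, v)| v.as_str())
    }
}

/// Semantics of `//name[@key='value']`: every descendant called `name`
/// whose attribute `key` equals `value`, in document order.
pub fn select_by_attr<'a>(
    node: &'a Element,
    name: &str,
    key: &str,
    value: &str,
    out: &mut Vec<&'a Element>,
) {
    for child in &node.children {
        if child.name == name && child.attr(key) == Some(value) {
            out.push(child);
        }
        select_by_attr(child, name, key, value, out);
    }
}

/// Semantics of `//name[position]` (1-based): the position is counted
/// among the matching children of each parent separately, so several
/// nodes can be selected across the document.
pub fn select_nth_per_parent<'a>(
    node: &'a Element,
    name: &str,
    position: usize,
    out: &mut Vec<&'a Element>,
) {
    let mut seen = 0;
    for child in &node.children {
        if child.name == name {
            seen += 1;
            if seen == position {
                out.push(child);
            }
        }
        select_nth_per_parent(child, name, position, out);
    }
}

/// body contains a section holding div#a and div#b, followed by div#c.
pub fn sample_document() -> Element {
    Element::new("#document", &[], vec![
        Element::new("html", &[], vec![
            Element::new("body", &[], vec![
                Element::new("section", &[], vec![
                    Element::new("div", &[("id", "a")], vec![]),
                    Element::new("div", &[("id", "b"), ("class", "hi")], vec![]),
                ]),
                Element::new("div", &[("id", "c"), ("class", "hi")], vec![]),
            ]),
        ]),
    ])
}
```

On this sample tree the attribute predicate selects the divs with ids b and c, and the positional predicate selects a and c, because c is the first div child of body even though it is the third div in the document.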