3 unstable releases
0.2.1 | Dec 4, 2023 |
---|---|
0.2.0 |
|
0.1.55 | Nov 27, 2021 |
#211 in HTTP client
34KB
667 lines
Scraping Websites!
Examples:
Get InnerHTML:
let html = "<html><body><div>Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter("body");
println!("{}", filtered_dom.get_inner_html());
//Output: <div>Hello World!</div>
Get Text:
let html = "<html><body><div>Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter("body");
println!("{}", filtered_dom.get_text());
//Output: Hello World!
Get Text from single Tags:
use sitescraper;
let html = "<html><body><div>Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter("div");
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!
Works also with
get_inner_html()
Filter by tag-name, attribute-name and attribute-value using a tuple:
use sitescraper;
let html = "<html><body><div id='hello'>Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter(("div", "id", "hello"));
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!
Works also with a tuple consisting of two string literals
let filtered_dom = dom.filter(("div", "id"));
You can also leave arguments out by passing "*" or "":
use sitescraper;
let html = "<html><body><div id="hello">Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter(("*", "id", "hello"));
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!
or
use sitescraper;
let html = "<html><body><div id="hello">Hello World!</div></body></html>";
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = dom.filter(("", "", "hello"));
println!("{}", filtered_dom.tag[0].get_text());
//Output: Hello World!
Get Website-Content:
use sitescraper;
let html = sitescraper::http::get("http://example.com/).await.unwrap();
let dom = sitescraper::parse_html(html).unwrap();
let filtered_dom = sitescraper::filter!(dom, "div");
println!("{}", filtered_dom.get_inner_html());
Dependencies
~4–15MB
~208K SLoC