1 unstable release
0.1.0 | Dec 14, 2023
Summary
tldextract-rs is a high-performance effective top-level domain (eTLD) extraction module that splits a domain name into its subcomponents: subdomain, domain, and suffix.
Hostname
- Cargo.toml:
tld_extract = { git = "https://github.com/emo-cat/tldextract-rs" }
- example code
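use tld_extract::TLDExtract;

fn main() {
    // Build the suffix list and the extractor.
    let source = tld_extract::Source::Hardcode;
    let suffix = tld_extract::SuffixList::new(source, false, None);
    let mut extract = TLDExtract::new(suffix, true).unwrap();
    // Split the host into subdomain, domain, and suffix.
    let e = extract.extract("mirrors.tuna.tsinghua.edu.cn").unwrap();
    // Pretty-print the result as JSON (serde_json must also be listed as a dependency).
    let s = serde_json::to_string_pretty(&e).unwrap();
    println!("{}", s);
}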
- ExtractResult
{
  "subdomain": "mirrors.tuna",
  "domain": "tsinghua",
  "suffix": "edu.cn",
  "registered_domain": "tsinghua.edu.cn"
}
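To make the relationship between these fields concrete, here is a minimal sketch that derives them from a host once its eTLD is already known. `Parts` and `split_host` are hypothetical names invented for this illustration; they are not part of the tld_extract API, and the crate may compute its result differently.

```rust
/// Hypothetical mirror of the fields shown above; not the crate's own type.
#[derive(Debug, PartialEq)]
struct Parts<'a> {
    subdomain: &'a str,
    domain: &'a str,
    suffix: &'a str,
    registered_domain: &'a str,
}

/// Hypothetical helper: split `host` once its eTLD (`suffix`) is known.
/// Returns None if `host` does not end in ".<suffix>" or has no domain label.
fn split_host<'a>(host: &'a str, suffix: &'a str) -> Option<Parts<'a>> {
    // Everything to the left of ".<suffix>".
    let rest = host.strip_suffix(suffix)?.strip_suffix('.')?;
    if rest.is_empty() {
        return None;
    }
    // The last remaining label is the domain; anything before it is the subdomain.
    let (subdomain, domain) = match rest.rfind('.') {
        Some(i) => (&rest[..i], &rest[i + 1..]),
        None => ("", rest),
    };
    // registered_domain = "<domain>.<suffix>", sliced out of the original host.
    let start = host.len() - suffix.len() - 1 - domain.len();
    Some(Parts {
        subdomain,
        domain,
        suffix,
        registered_domain: &host[start..],
    })
}

fn main() {
    let parts = split_host("mirrors.tuna.tsinghua.edu.cn", "edu.cn").unwrap();
    assert_eq!(parts.subdomain, "mirrors.tuna");
    assert_eq!(parts.domain, "tsinghua");
    assert_eq!(parts.registered_domain, "tsinghua.edu.cn");
}
```

The hard part, finding the suffix in the first place, is what the public suffix list and the trie described under "Implementation details" are for.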
Implementation details
Why not split on "." and take the last element instead?
Splitting on "." and taking the last element only works for simple eTLDs like com, but not for more complex ones like oseto.nagasaki.jp.
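A short sketch of the naive approach shows where it goes wrong. `naive_suffix` is a hypothetical function written for this illustration, not something the crate provides.

```rust
/// Naive approach: treat the last dot-separated label as the suffix.
fn naive_suffix(host: &str) -> &str {
    host.rsplit('.').next().unwrap_or(host)
}

fn main() {
    // Correct for a single-label eTLD.
    assert_eq!(naive_suffix("example.com"), "com");
    // Wrong for a multi-label eTLD: the eTLD is oseto.nagasaki.jp,
    // but the naive split only sees the last label.
    assert_eq!(naive_suffix("example.oseto.nagasaki.jp"), "jp");
}
```

With the naive result, the registered domain would be computed as nagasaki.jp instead of example.oseto.nagasaki.jp.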
eTLD tries
tldextract-rs stores eTLDs in compressed tries.
Valid eTLDs from the Mozilla Public Suffix List are inserted into the compressed trie label by label in reverse order, rightmost label first.
Given the following eTLDs
au
nsw.edu.au
com.ac
edu.ac
gov.ac
and the example URL host `example.nsw.edu.au`, the compressed trie is structured as follows:
START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩
=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`
The labels of the URL host are matched from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the path gives the extracted eTLD nsw.edu.au.
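The lookup can be sketched in a few lines of Rust. This is an illustration only: it uses a plain, uncompressed trie keyed by label in a `HashMap`, and `Node`, `insert`, and `longest_suffix` are hypothetical names, not the crate's internals. It also ignores wildcard and exception rules from the Public Suffix List.

```rust
use std::collections::HashMap;

/// One trie node; `is_suffix` plays the role of the 🚩 marker.
#[derive(Default)]
struct Node {
    children: HashMap<String, Node>,
    is_suffix: bool,
}

/// Insert an eTLD label by label, rightmost label first.
fn insert(root: &mut Node, etld: &str) {
    let mut node = root;
    for label in etld.rsplit('.') {
        node = node.children.entry(label.to_string()).or_default();
    }
    node.is_suffix = true;
}

/// Walk the host right to left and return the longest flagged path as a slice of `host`.
fn longest_suffix<'a>(root: &Node, host: &'a str) -> Option<&'a str> {
    let mut node = root;
    let mut best = None;
    let mut start = host.len();
    for label in host.rsplit('.') {
        match node.children.get(label) {
            Some(child) => node = child,
            None => break, // no more matching nodes
        }
        // Byte offset at which this label starts in `host`.
        start -= label.len();
        if node.is_suffix {
            best = Some(&host[start..]);
        }
        // Step over the dot that precedes this label, if there is one.
        start = start.saturating_sub(1);
    }
    best
}

fn main() {
    let mut root = Node::default();
    for etld in ["au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"] {
        insert(&mut root, etld);
    }
    // Matches au -> edu -> nsw; the deepest flagged node wins.
    assert_eq!(longest_suffix(&root, "example.nsw.edu.au"), Some("nsw.edu.au"));
    // edu under au is not flagged, so only au counts here.
    assert_eq!(longest_suffix(&root, "example.edu.au"), Some("au"));
    // ac itself is not flagged in this sample list.
    assert_eq!(longest_suffix(&root, "example.ac"), None);
}
```

Remembering the deepest flagged node, rather than the deepest matched node, matters because an intermediate node such as edu under au can match without being an eTLD on its own.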
Acknowledgements