1 unstable release
0.1.0 | Feb 4, 2024
#8 in #sitemap
15KB
209 lines
wls
wls (web ls) makes it easy to crawl multiple sitemaps and list URLs. It can even automatically find sitemaps for a domain using robots.txt.
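wls is a standalone command-line tool written in Rust. Assuming the crate published on crates.io ships a binary target named wls (an assumption, not something stated on this page), installing it with cargo would typically look like:
$ cargo install wls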
Usage
wls accepts multiple domains/sitemaps as arguments, and will print all found URLs to stdout:
$ wls docs.rs > urls.txt
$ head -n 6 urls.txt
https://docs.rs/A-1/latest/A_1/
https://docs.rs/A-1/latest/A_1/all.html
https://docs.rs/A5/latest/A5/
https://docs.rs/A5/latest/A5/all.html
https://docs.rs/AAAA/latest/AAAA/
https://docs.rs/AAAA/latest/AAAA/all.html
$ grep /all.html urls.txt | wc -l
113191
# that's a lot of crates!
If an argument does not contain a slash, it is treated as a domain, and wls will automatically attempt to find sitemaps using robots.txt. For example, docs.rs uses the Sitemap: directive in its robots.txt file, so the following commands are equivalent:
$ wls docs.rs
$ wls https://docs.rs/robots.txt
$ wls https://docs.rs/sitemap.xml
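Discovery relies on the standard Sitemap: directive from the robots.txt convention. As the verbose output below also shows, docs.rs points at https://docs.rs/sitemap.xml this way, so checking by hand would look roughly like the following (the live file's exact contents may differ):
$ curl -s https://docs.rs/robots.txt | grep -i '^sitemap:'
Sitemap: https://docs.rs/sitemap.xml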
wls will print logs to stderr when -v/--verbose is enabled:
$ wls -v docs.rs
Found 1 sitemaps
  in robotstxt with url: https://docs.rs/robots.txt
Found 26 sitemaps
  in sitemap with url: https://docs.rs/sitemap.xml
  in robotstxt with url: https://docs.rs/robots.txt
Found 15934 URLs
  in sitemap with url: https://docs.rs/-/sitemap/a/sitemap.xml
  in sitemap with url: https://docs.rs/sitemap.xml
  in robotstxt with url: https://docs.rs/robots.txt
Found 11170 URLs
  in sitemap with url: https://docs.rs/-/sitemap/b/sitemap.xml
  in sitemap with url: https://docs.rs/sitemap.xml
  in robotstxt with url: https://docs.rs/robots.txt
...
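Because found URLs go to stdout and logs go to stderr, the two streams can be captured separately with ordinary shell redirection, for example:
$ wls -v docs.rs > urls.txt 2> crawl.log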
More options are available too:
$ wls --help
Usage: wls [OPTIONS] <URLS>...

Arguments:
  <URLS>...  Domains/sitemaps to crawl

Options:
  -T, --timeout <SECONDS>  Maximum response time [default: 30]
  -w, --wait <SECONDS>     Delay between requests [default: 0]
  -v, --verbose            Enable logs
  -h, --help               Print help
  -V, --version            Print version
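For example, to crawl more politely, the timeout and wait options above can be combined with the usual redirection to a file (the ten-second timeout and one-second delay here are arbitrary illustrative values):
$ wls -T 10 -w 1 docs.rs > urls.txt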
Dependencies
~7–19MB
~273K SLoC