1.0.2 | Mar 5, 2025 | 2 stable releases
#5 in #web-crawler | 91 downloads per month | 18KB | 189 lines
spider
A command line interface for crawling websites and storing their content.
Usage
USAGE:
    ss [FLAGS] [OPTIONS] --domain <DOMAIN>

FLAGS:
    -h, --help              Prints help information
    -r, --respect-robots    Respect the robots.txt file and skip disallowed pages
    -V, --version           Prints version information
    -v, --verbose           Turn verbose logging on

OPTIONS:
    -c, --concurrency <NUM>                   How many requests can run simultaneously
    -d, --domain <DOMAIN>                     Domain to crawl
    -p, --polite-delay <DELAY_IN_MILLIS>      Polite crawling delay in milliseconds
    -m, --max-depth <DEPTH>                   Maximum crawl depth from the starting URL
    -t, --timeout <SECONDS>                   Timeout for HTTP requests in seconds
    -u, --user-agent <USER_AGENT>             Custom User-Agent string for HTTP requests
    -o, --output-dir <OUTPUT_DIR>             Directory to store output (default: ./spider-output)
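A sketch of a typical invocation built from the flags above (the domain and flag values here are illustrative, not defaults):

```shell
# Crawl example.com politely: respect robots.txt, 4 concurrent requests,
# a 500 ms delay between requests, stop 3 links deep, write pages to ./crawl
ss --domain example.com \
   --respect-robots \
   --concurrency 4 \
   --polite-delay 500 \
   --max-depth 3 \
   --timeout 10 \
   --output-dir ./crawl
```

Raising `--concurrency` speeds up a crawl but increases load on the target site; pairing it with `--polite-delay` and `--respect-robots` keeps the crawler well-behaved.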
Dependencies
~9–21MB, ~292K SLoC