2 stable releases

1.0.2 Mar 5, 2025

#5 in #web-crawler

91 downloads per month

Custom license

18KB
189 lines

spider

A command line interface for crawling websites and storing their content.

usage

USAGE:
    ss [FLAGS] [OPTIONS] --domain <DOMAIN>

FLAGS:
    -h, --help              Prints help information
    -r, --respect-robots    Respect the robots.txt file and do not scrape disallowed files
    -V, --version           Prints version information
    -v, --verbose           Turn verbose logging on

OPTIONS:
    -c, --concurrency <NUM>                 How many requests can run simultaneously
    -d, --domain <DOMAIN>                   Domain to crawl
    -p, --polite-delay <DELAY_IN_MILLIS>    Polite crawling delay in milliseconds
    -m, --max-depth <DEPTH>                 Maximum crawl depth from the starting URL
    -t, --timeout <SECONDS>                 Timeout for HTTP requests in seconds
    -u, --user-agent <USER_AGENT>           Custom User-Agent string for HTTP requests
    -o, --output-dir <OUTPUT_DIR>           Directory to store output (default: ./spider-output)
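
For example, a polite crawl with verbose logging might look like the following (the domain and option values are placeholders for illustration, not defaults):

    ss --domain example.com --respect-robots --verbose --concurrency 4 --polite-delay 500 --max-depth 3 --output-dir ./spider-output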

Dependencies

~9–21MB
~292K SLoC