#csv #url #provider #js #pattern #json #txt #sub #otx

app urx

Extracts URLs from OSINT Archives for Security Insights

3 releases (breaking)

new 0.3.0 Apr 1, 2025
0.2.0 Mar 31, 2025
0.1.0 Mar 28, 2025

#1 in #sub

98 downloads per month

MIT license

1.5MB
3K SLoC

Urx is a command-line tool for collecting URLs from OSINT archives such as the Wayback Machine and Common Crawl. Built with Rust for efficiency, it uses asynchronous processing to query multiple data sources in parallel, making it quick to gather URL information for a given domain. The resulting dataset can serve a range of purposes, including security testing and analysis.

Features

  • Fetch URLs from multiple sources in parallel (Wayback Machine, Common Crawl, OTX)
  • Filter results by file extensions, patterns, or predefined presets (e.g., "no-image" to exclude images)
  • Support for multiple output formats: plain text, JSON, CSV
  • Output results to the console or a file, or stream to stdout for pipeline integration (see the combined example after this list)
  • URL Testing:
    • Filter and validate URLs based on HTTP status codes and patterns
    • Extract additional links from collected URLs
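
As an illustration, several of these features can be combined in a single run, using only the options documented in the Options section below:

# Query the default providers in parallel, exclude image URLs, keep only
# URLs that return HTTP 200, and save the results as JSON
urx example.com -p no-images --check-status --include-status 200 -f json -o live_urls.json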

Installation

From Cargo

cargo install urx

From Homebrew (Tap)

brew tap hahwul/urx
brew install urx

From Source

git clone https://github.com/hahwul/urx.git
cd urx
cargo build --release

The compiled binary will be available at target/release/urx.
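
For a quick sanity check, the freshly built binary can be run directly:

# Print the version to confirm the build succeeded
./target/release/urx --version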

From Docker

ghcr.io/hahwul/urx
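
A minimal sketch of running the published image, assuming its entrypoint is the urx binary (not verified here):

# Run urx from the container image (assumes the entrypoint is urx)
docker run --rm ghcr.io/hahwul/urx example.com

# Pipe domains in from the host; -i keeps stdin open for the container
cat domains.txt | docker run --rm -i ghcr.io/hahwul/urx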

Usage

Basic Usage

# Scan a single domain
urx example.com

# Scan multiple domains
urx example.com example.org

# Scan domains from a file
cat domains.txt | urx

Options

Usage: urx [OPTIONS] [DOMAINS]...

Arguments:
  [DOMAINS]...  Domains to fetch URLs for

Options:
  -h, --help     Print help
  -V, --version  Print version

Output Options:
  -o, --output <OUTPUT>  Output file to write results
  -f, --format <FORMAT>  Output format (e.g., "plain", "json", "csv") [default: plain]
      --merge-endpoint   Merge endpoints with the same path and merge URL parameters

Provider Options:
      --providers <PROVIDERS>  Providers to use (comma-separated, e.g., "wayback,cc,otx") [default: wayback,cc,otx]
      --subs                   Include subdomains when searching
      --cc-index <CC_INDEX>    Common Crawl index to use (e.g., CC-MAIN-2025-08) [default: CC-MAIN-2025-08]

Display Options:
  -v, --verbose      Show verbose output
      --silent       Silent mode (no output)
      --no-progress  No progress bar

Filter Options:
  -p, --preset <PRESET>
          Filter Presets (e.g., "no-resources,no-images,only-js,only-style")
  -e, --extensions <EXTENSIONS>
          Filter URLs to only include those with specific extensions (comma-separated, e.g., "js,php,aspx")
      --exclude-extensions <EXCLUDE_EXTENSIONS>
          Filter URLs to exclude those with specific extensions (comma-separated, e.g., "html,txt")
      --patterns <PATTERNS>
          Filter URLs to only include those containing specific patterns (comma-separated)
      --exclude-patterns <EXCLUDE_PATTERNS>
          Filter URLs to exclude those containing specific patterns (comma-separated)
      --show-only-host
          Only show the host part of the URLs
      --show-only-path
          Only show the path part of the URLs
      --show-only-param
          Only show the parameters part of the URLs
      --min-length <MIN_LENGTH>
          Minimum URL length to include
      --max-length <MAX_LENGTH>
          Maximum URL length to include

Network Options:
      --network-scope <NETWORK_SCOPE>  Control which components network settings apply to (all, providers, testers, or providers,testers) [default: all]
      --proxy <PROXY>                  Use proxy for HTTP requests (format: http://proxy.example.com:8080)
      --proxy-auth <PROXY_AUTH>        Proxy authentication credentials (format: username:password)
      --insecure                       Skip SSL certificate verification (accept self-signed certs)
      --random-agent                   Use a random User-Agent for HTTP requests
      --timeout <TIMEOUT>              Request timeout in seconds [default: 30]
      --retries <RETRIES>              Number of retries for failed requests [default: 3]
      --parallel <PARALLEL>            Maximum number of parallel requests [default: 5]
      --rate-limit <RATE_LIMIT>        Rate limit (requests per second)

Testing Options:
      --check-status
          Check HTTP status code of collected URLs [aliases: --cs]
      --include-status <INCLUDE_STATUS>
          Include URLs with specific HTTP status codes or patterns (e.g., --is=200,30x) [aliases: --is]
      --exclude-status <EXCLUDE_STATUS>
          Exclude URLs with specific HTTP status codes or patterns (e.g., --es=404,50x,5xx) [aliases: --es]
      --extract-links
          Extract additional links from collected URLs (requires HTTP requests)

Examples

# Save results to a file
urx example.com -o results.txt

# Output in JSON format
urx example.com -f json -o results.json

# Filter for JavaScript files only
urx example.com -e js

# Exclude HTML and text files
urx example.com --exclude-extensions html,txt

# Filter for API endpoints
urx example.com --patterns api,v1,graphql

# Exclude specific patterns
urx example.com --exclude-patterns static,images

# Use a filter preset (similar to --exclude-extensions=png,jpg,...)
urx example.com -p no-images

# Use specific providers
urx example.com --providers wayback,otx

# Include subdomains
urx example.com --subs

# Check status of collected URLs
urx example.com --check-status

# Extract additional links from collected URLs
urx example.com --extract-links

# Network configuration
urx example.com --proxy http://localhost:8080 --timeout 60 --parallel 10 --insecure

# Advanced filtering
urx example.com -e js,php --patterns admin,login --exclude-patterns logout,static --min-length 20

# HTTP Status code based filtering
urx example.com --include-status 200,30x,405 --exclude-status 20x
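
A few more combinations built only from the options listed above (illustrative, not exhaustive):

# Merge endpoints sharing the same path and write the results as CSV
urx example.com -f csv -o results.csv --merge-endpoint

# Show only the host part of each URL, including subdomains
urx example.com --subs --show-only-host

# Apply the proxy to provider queries only; URL status checks go direct
urx example.com --proxy http://localhost:8080 --network-scope providers --check-status

# Query a specific Common Crawl index with a rate limit of 2 requests per second
urx example.com --providers cc --cc-index CC-MAIN-2025-08 --rate-limit 2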

Integration with Other Tools

Urx works well in pipelines with other security and reconnaissance tools:

# Feed a domain, discover its URLs, and keep only those containing "login"
echo "example.com" | urx | grep "login" > potential_targets.txt

# Combine with other tools
cat domains.txt | urx --patterns api | other-tool
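
Because results are written to stdout, Urx also composes with standard Unix tooling. For example, using only documented options plus coreutils:

# Collect JavaScript URLs for many domains, de-duplicate, and save them
cat domains.txt | urx --no-progress -e js | sort -u > js_urls.txt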

Inspiration

Urx was inspired by gau (GetAllUrls), a tool that fetches known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. While sharing similar core functionality, Urx was built from the ground up in Rust with a focus on performance, concurrency, and expanded filtering capabilities.

Contribute

Urx is an open-source project made with ❤️. If you would like to contribute, please see CONTRIBUTING.md and open a pull request with your cool contributions.

Dependencies

~14–27MB
~402K SLoC