Keywords: word-count, tally, cli, word, count, word-tally

word-tally (bin+lib crate)

Output a tally of the number of times unique words appear in source input.

27 releases (breaking)

Uses Rust 2024 edition

0.19.0 Apr 13, 2025
0.17.0 Apr 10, 2025
0.16.0 Feb 13, 2025
0.15.0 Nov 21, 2024
0.8.2 Jul 25, 2024

#321 in Text processing


485 downloads per month

MIT license

42KB
819 lines

word-tally


Output a tally of the number of times unique words appear in source input.

Usage

Usage: word-tally [OPTIONS] [PATH]

Arguments:
  [PATH]  File path to use as input rather than stdin ("-") [default: -]

Options:
  -s, --sort <ORDER>       Sort order [default: desc] [possible values: desc, asc, unsorted]
  -c, --case <FORMAT>      Case normalization [default: lower] [possible values: original, upper, lower]
  -m, --min-chars <COUNT>  Exclude words containing fewer than min chars
  -M, --min-count <COUNT>  Exclude words appearing fewer than min times
  -e, --exclude <WORDS>    Exclude words from a comma-delimited list
  -d, --delimiter <VALUE>  Delimiter between keys and values [default: " "]
  -o, --output <PATH>      Write output to file rather than stdout
  -f, --format <FORMAT>    Output format [default: text] [possible values: text, json, csv]
  -v, --verbose            Print verbose details
  -p, --parallel           Use parallel processing for word counting
  -h, --help               Print help
  -V, --version            Print version

Examples

word-tally README.md | head -n3
#>> tally 22
#>> word 20
#>> https 11
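For comparison, a rough equivalent of word-tally's defaults (lowercasing, descending sort) can be assembled from standard Unix tools. This is only an approximation; word-tally's word-splitting rules (e.g. Unicode handling) may differ, so counts won't always match exactly:

```shell
# Rough equivalent of word-tally's default behavior using standard tools:
# lowercase, split on non-alphabetic characters, tally, sort descending.
printf 'The cat saw the cat\n' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -cs '[:alpha:]' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

word-tally replaces this five-stage pipeline with a single command and adds sorting, filtering, and output-format options on top.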

CSV output:

# Using delimiter (manual CSV)
word-tally --delimiter="," --output="tally.csv" README.md

# Using CSV format (with header)
word-tally --format=csv --output="tally.csv" README.md
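Once written, the CSV can be post-processed with standard tools. A small sketch, assuming a word,count header row from --format=csv (the exact header and sample values here are illustrative, not taken from real output):

```shell
# Sample file standing in for word-tally's CSV output;
# the word,count header row is an assumption for illustration.
cat > tally.csv <<'EOF'
word,count
tally,22
word,20
EOF

# Sum the count column with awk, skipping the header row.
awk -F, 'NR > 1 { total += $2 } END { print total }' tally.csv
# → 42
```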

JSON output:

word-tally --format=json --output="tally.json" README.md

Parallel processing can be much faster for large files:

word-tally --parallel README.md

# Tune with environment variables
WORD_TALLY_THREADS=4 WORD_TALLY_CHUNK_SIZE=32768 word-tally --parallel huge-file.txt

Environment Variables

  • WORD_TALLY_UNIQUENESS_RATIO - Divisor for estimating unique words from input size (default: 10)
  • WORD_TALLY_DEFAULT_CAPACITY - Default initial capacity when there is no size hint (default: 1024)

These variables only affect the program when using the --parallel flag:

  • WORD_TALLY_THREADS - Number of threads for parallel processing (default: number of cores)
  • WORD_TALLY_CHUNK_SIZE - Size of chunks for parallel processing in bytes (default: 16384)
  • WORD_TALLY_WORD_DENSITY - Multiplier for estimating unique words per chunk (default: 15)
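As a sketch of how the sizing estimate behaves (based only on the variable descriptions above; the crate's actual internals may differ), the estimated unique-word capacity is the input size divided by the uniqueness ratio:

```shell
# Hypothetical capacity estimate: input size / uniqueness ratio.
input_size=65536                          # bytes of input (example value)
ratio=${WORD_TALLY_UNIQUENESS_RATIO:-10}  # default divisor is 10
echo $((input_size / ratio))
# → 6553
```

A larger ratio assumes more repetition in the input and so reserves a smaller initial table.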

Installation

cargo install word-tally

Cargo.toml

Add word-tally as a dependency.

[dependencies]
word-tally = "0.19.0"

Documentation

https://docs.rs/word-tally

Tests & benchmarks

Clone the repository.

git clone https://github.com/havenwood/word-tally
cd word-tally

Run the tests.

cargo test

And run the benchmarks.

cargo bench

Dependencies

~6MB, ~102K SLoC