# word-tally

Output a tally of the number of times unique words appear in source input.
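The core idea can be sketched in a few lines of Rust. This is a minimal illustration of a word tally with the default behavior (lowercase normalization, descending sort), not the crate's actual implementation:

```rust
use std::collections::HashMap;

/// Lowercase the input, count each whitespace-delimited word,
/// then sort the pairs by count, descending.
fn tally(input: &str) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in input.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    let mut tally: Vec<(String, usize)> = counts.into_iter().collect();
    tally.sort_by(|a, b| b.1.cmp(&a.1));
    tally
}

fn main() {
    let result = tally("The cat and the hat");
    // "the" appears twice, so it sorts first.
    assert_eq!(result[0], ("the".to_string(), 2));
    println!("{result:?}");
}
```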
## Usage

```
Usage: word-tally [OPTIONS] [PATH]

Arguments:
  [PATH]  File path to use as input rather than stdin ("-") [default: -]

Options:
  -s, --sort <ORDER>       Sort order [default: desc] [possible values: desc, asc, unsorted]
  -c, --case <FORMAT>      Case normalization [default: lower] [possible values: original, upper, lower]
  -m, --min-chars <COUNT>  Exclude words containing fewer than min chars
  -M, --min-count <COUNT>  Exclude words appearing fewer than min times
  -e, --exclude <WORDS>    Exclude words from a comma-delimited list
  -d, --delimiter <VALUE>  Delimiter between keys and values [default: " "]
  -o, --output <PATH>      Write output to file rather than stdout
  -f, --format <FORMAT>    Output format [default: text] [possible values: text, json, csv]
  -v, --verbose            Print verbose details
  -p, --parallel           Use parallel processing for word counting
  -h, --help               Print help
  -V, --version            Print version
```
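The semantics of the filtering options (`--min-chars`, `--min-count`, `--exclude`) can be sketched as below. This is an assumed illustration of what the flags do, not the crate's own code:

```rust
use std::collections::HashMap;

/// Drop words shorter than `min_chars`, rarer than `min_count`,
/// or listed in `exclude`, then sort by count, descending.
fn filter_tally(
    counts: HashMap<String, usize>,
    min_chars: usize,
    min_count: usize,
    exclude: &[&str],
) -> Vec<(String, usize)> {
    let mut tally: Vec<(String, usize)> = counts
        .into_iter()
        .filter(|(w, n)| {
            w.chars().count() >= min_chars
                && *n >= min_count
                && !exclude.contains(&w.as_str())
        })
        .collect();
    tally.sort_by(|a, b| b.1.cmp(&a.1));
    tally
}

fn main() {
    let counts = HashMap::from([
        ("the".to_string(), 5),
        ("a".to_string(), 3),
        ("rust".to_string(), 2),
        ("word".to_string(), 1),
    ]);
    // Mirrors: word-tally --min-chars=2 --min-count=2 --exclude="the"
    let result = filter_tally(counts, 2, 2, &["the"]);
    // "a" is too short, "word" too rare, "the" excluded.
    assert_eq!(result, vec![("rust".to_string(), 2)]);
    println!("{result:?}");
}
```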
## Examples

```sh
word-tally README.md | head -n3
#>> tally 22
#>> word 20
#>> https 11
```
CSV output:

```sh
# Using delimiter (manual CSV)
word-tally --delimiter="," --output="tally.csv" README.md

# Using CSV format (with header)
word-tally --format=csv --output="tally.csv" README.md
```
JSON output:

```sh
word-tally --format=json --output="tally.json" README.md
```
Parallel processing can be much faster for large files:

```sh
word-tally --parallel README.md

# Tune with environment variables
WORD_TALLY_THREADS=4 WORD_TALLY_CHUNK_SIZE=32768 word-tally --parallel huge-file.txt
```
## Environment Variables

- `WORD_TALLY_UNIQUENESS_RATIO` - Divisor for estimating unique words from input size (default: 10)
- `WORD_TALLY_DEFAULT_CAPACITY` - Default initial capacity when there is no size hint (default: 1024)

These variables only affect the program when using the `--parallel` flag:

- `WORD_TALLY_THREADS` - Number of threads for parallel processing (default: number of cores)
- `WORD_TALLY_CHUNK_SIZE` - Size of chunks for parallel processing in bytes (default: 16384)
- `WORD_TALLY_WORD_DENSITY` - Multiplier for estimating unique words per chunk (default: 15)
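The chunk-and-merge approach behind the parallel mode can be sketched as below. This is an assumed illustration of the general technique (tally chunks on separate threads, then merge the maps), not the crate's implementation, and it splits on word boundaries rather than byte counts for simplicity:

```rust
use std::collections::HashMap;
use std::thread;

/// Split the input into roughly equal word chunks, tally each chunk
/// on its own thread, then merge the per-chunk counts.
fn parallel_tally(input: &str, threads: usize) -> HashMap<String, usize> {
    let words: Vec<String> = input
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    let chunk_size = words.len().div_ceil(threads).max(1);

    let mut handles = Vec::new();
    for chunk in words.chunks(chunk_size) {
        let chunk = chunk.to_vec();
        handles.push(thread::spawn(move || {
            let mut counts: HashMap<String, usize> = HashMap::new();
            for word in chunk {
                *counts.entry(word).or_insert(0) += 1;
            }
            counts
        }));
    }

    // Merge per-thread maps by summing counts for each word.
    let mut merged: HashMap<String, usize> = HashMap::new();
    for handle in handles {
        for (word, n) in handle.join().unwrap() {
            *merged.entry(word).or_insert(0) += n;
        }
    }
    merged
}

fn main() {
    let merged = parallel_tally("a b a c a b", 2);
    assert_eq!(merged["a"], 3);
    assert_eq!(merged["b"], 2);
    println!("{merged:?}");
}
```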
## Installation

```sh
cargo install word-tally
```
## Cargo.toml

Add `word-tally` as a dependency.

```toml
[dependencies]
word-tally = "0.19.0"
```
## Documentation
## Tests & benchmarks

Clone the repository.

```sh
git clone https://github.com/havenwood/word-tally
cd word-tally
```

Run the tests.

```sh
cargo test
```

And run the benchmarks.

```sh
cargo bench
```