Version 0.1.0, released Jul 4, 2024 (1 unstable release)
tc - Token Count
tc is a CLI tool for counting tokens in text files, built as a lightweight wrapper around the HuggingFace Tokenizers crate. It's like the Unix wc command, but for tokens instead of words.
Features
- Count tokens in files or from stdin
- Support for multiple files and glob patterns
- Uses any tokenizer in HuggingFace Tokenizers
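The wc-style behavior in the list above (one count per input, falling back to stdin when no files are given) can be sketched roughly as follows. This is not the tool's actual source. The real tc delegates tokenization to the HuggingFace Tokenizers crate, whereas this dependency-free sketch swaps in plain whitespace splitting as a stand-in. The function names count_tokens and report, the "-" label for stdin, and the output layout are all hypothetical.

```rust
use std::io::{self, Read};

/// Stand-in tokenizer (assumption): splits on whitespace.
/// The real tool hands this step to the HuggingFace Tokenizers crate.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Builds wc-style report lines: one count per input, plus a total
/// line when more than one input is given. The layout is hypothetical.
fn report(inputs: &[(String, String)]) -> Vec<String> {
    let mut lines = Vec::new();
    let mut total = 0;
    for (name, text) in inputs {
        let n = count_tokens(text);
        total += n;
        lines.push(format!("{n} {name}"));
    }
    if inputs.len() > 1 {
        lines.push(format!("{total} total"));
    }
    lines
}

fn main() -> io::Result<()> {
    let paths: Vec<String> = std::env::args().skip(1).collect();
    let mut inputs = Vec::new();
    if paths.is_empty() {
        // No file arguments: read the text from stdin instead.
        let mut text = String::new();
        io::stdin().read_to_string(&mut text)?;
        inputs.push(("-".to_string(), text));
    } else {
        // One entry per file path given on the command line.
        for path in paths {
            let text = std::fs::read_to_string(&path)?;
            inputs.push((path, text));
        }
    }
    for line in report(&inputs) {
        println!("{line}");
    }
    Ok(())
}
```

Replacing the body of count_tokens with a call into a real tokenizer would leave the rest of the loop unchanged, which is the sense in which the tool is a thin wrapper.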
Installation
cargo install token-counter
Usage
Using the default tokenizer (cl100k, the tokenizer for GPT-3.5 and GPT-4):
tc file1.md file2.md
Using globs:
tc *.md
Arguments:
-m, --model: HuggingFace ID of the model whose tokenizer should be used (e.g. google-bert/bert-base-uncased)