app token-counter

wc for tokens: count tokens in files with HF Tokenizers

1 unstable release

0.1.0 Jul 4, 2024

MIT license

12KB
74 lines

tc - Token Count

tc is a CLI tool for counting tokens in text files. It is a lightweight wrapper around the Hugging Face Tokenizers crate, and it works like the Unix wc command but counts tokens instead of words.

Features

  • Count tokens in files or from stdin
  • Support for multiple files and glob patterns
  • Works with any tokenizer that Hugging Face Tokenizers can load
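
The wc-style behavior listed above can be sketched in a few lines of Rust. This is a std-only illustration under stated assumptions, not the crate's implementation: count_tokens is a hypothetical stand-in that splits on whitespace where tc would call a Hugging Face tokenizer, and format_report is an assumed output layout modeled on wc (one count per input, plus a total when there are several).

```rust
/// Hypothetical stand-in for a real tokenizer: splits on whitespace.
/// tc itself delegates to Hugging Face Tokenizers, which produces
/// subword tokens and will therefore report different counts.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Count tokens for each (name, contents) pair and append a total line
/// when there is more than one input, in the style of `wc`.
/// The "<count> <name>" layout is an assumption for illustration.
fn format_report(inputs: &[(&str, &str)]) -> Vec<String> {
    let mut lines = Vec::new();
    let mut total = 0;
    for (name, text) in inputs {
        let n = count_tokens(text);
        total += n;
        lines.push(format!("{n} {name}"));
    }
    if inputs.len() > 1 {
        lines.push(format!("{total} total"));
    }
    lines
}

fn main() {
    // In-memory contents stand in for files read from disk or stdin.
    let inputs = [
        ("file1.md", "hello token world"),
        ("file2.md", "two tokens"),
    ];
    for line in format_report(&inputs) {
        println!("{line}");
    }
}
```

Swapping count_tokens for a call into a real tokenizer would leave the reporting logic unchanged, which is the sense in which such a tool can stay a thin wrapper.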

Installation

cargo install token-counter

Usage

Using the default tokenizer (cl100k, the tokenizer for GPT-3.5 and GPT-4):

tc file1.md file2.md

Using globs:

tc *.md

Arguments:

  • -m, --model: Hugging Face ID of the model whose tokenizer should be used (e.g. google-bert/bert-base-uncased)
