4 releases

0.1.3 Nov 26, 2024
0.1.2 Nov 26, 2024
0.1.1 Nov 26, 2024
0.1.0 Nov 26, 2024

#294 in Compression

30 downloads per month

MIT license

8KB
132 lines

Memory-efficient English language tokenizer

Applying Dearborn orthography to make English easier for machines to understand.

Dearborn orthography allows for lossless compression of English. This reduces the number of tokens required to encode meaning and removes tokens that are informationally "distracting". It also removes confusing inconsistencies of standard English while retaining its structure, and the text can be converted back to its standard English equivalent at any stage. Compressing and standardizing language down to meaning-carrying tokens in this way is ideal for training large language models.
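The round-trip property described above can be illustrated with a toy sketch. This is not the crate's API and the spelling pairs are not the actual Dearborn rules: `PAIRS`, `encode`, and `decode` are hypothetical names chosen only to show what "lossless" means for a respelling scheme, namely that decoding an encoded text returns the original exactly.

```rust
/// Hypothetical (standard, shortened) spelling pairs.
/// These are illustrative only and are NOT the actual Dearborn rules.
const PAIRS: [(&str, &str); 3] = [
    ("though", "tho"),
    ("through", "thru"),
    ("night", "nite"),
];

/// Replace each standard spelling with its shortened form.
/// Returns None if the input already contains a shortened form,
/// because decoding would then be ambiguous and no longer lossless.
fn encode(text: &str) -> Option<String> {
    let mut out = Vec::new();
    // Splitting on a single space and re-joining preserves spacing exactly.
    for word in text.split(' ') {
        if PAIRS.iter().any(|&(_, short)| short == word) {
            return None;
        }
        let mapped = PAIRS
            .iter()
            .find(|&&(long, _)| long == word)
            .map_or(word, |&(_, short)| short);
        out.push(mapped);
    }
    Some(out.join(" "))
}

/// Reverse the mapping: shortened forms go back to standard spellings.
fn decode(text: &str) -> String {
    text.split(' ')
        .map(|word| {
            PAIRS
                .iter()
                .find(|&&(_, short)| short == word)
                .map_or(word, |&(long, _)| long)
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

In this sketch, any text accepted by `encode` satisfies `decode(&encode(text)?) == text`. Words with attached punctuation or capital letters pass through unchanged, so a toy table like this compresses less than a real scheme would, but it never loses information.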

Dependencies

~3.5–5MB
~89K SLoC