#double-array #sentence-piece #darts #precompiled-charsmap

spm_precompiled

This crate aims to emulate the Dart::DoubleArray struct from https://github.com/google/sentencepiece and its Normalizer. This crate is highly specialized and not intended for general use.

5 releases

0.1.4 May 30, 2022
0.1.3 Jul 19, 2021
0.1.2 Sep 17, 2020
0.1.1 Sep 15, 2020
0.1.0 Sep 15, 2020

#993 in Development tools

Download history

36411/week @ 2024-07-26
33786/week @ 2024-08-02
35727/week @ 2024-08-09
35477/week @ 2024-08-16
37010/week @ 2024-08-23
36499/week @ 2024-08-30
41500/week @ 2024-09-06
37529/week @ 2024-09-13
47018/week @ 2024-09-20
43435/week @ 2024-09-27
45732/week @ 2024-10-04
50505/week @ 2024-10-11
54068/week @ 2024-10-18
52200/week @ 2024-10-25
50425/week @ 2024-11-01
44730/week @ 2024-11-08

211,430 downloads per month
Used in 103 crates (2 directly)

Apache-2.0

2MB
16K SLoC

Crate API

spm_precompiled

This crate aims to emulate the Dart::DoubleArray struct from https://github.com/google/sentencepiece and its Normalizer. Its main intent is to be used with tokenizers, a Rust library that provides facilities for tokenizing strings for use with HuggingFace's transformers library.

This crate is highly specialized and not intended for general use.

The core of the algorithm is to read spm's binary precompiled_charsmap.
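
As a rough illustration of what reading that binary blob involves, here is a minimal sketch. It assumes the layout described in sentencepiece's source: a 4-byte little-endian trie size, followed by the double-array trie stored as 32-bit units, followed by the normalized replacement strings. The function name `split_charsmap` is hypothetical and is not part of this crate's API; the crate's own parser may differ in details such as error handling.

```rust
/// Hypothetical helper (not this crate's API): splits a precompiled_charsmap
/// blob into its double-array trie units and the trailing normalized bytes.
///
/// Assumed layout:
///   <trie size in bytes: u32 LE><trie: u32 LE units><normalized string bytes>
fn split_charsmap(blob: &[u8]) -> Option<(Vec<u32>, &[u8])> {
    // The first 4 bytes hold the size of the trie section, in bytes.
    let header = blob.get(..4)?;
    let trie_size =
        u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;

    // The trie is an array of u32 units, so its byte size must be a multiple of 4.
    if trie_size % 4 != 0 {
        return None;
    }

    // Reject blobs that are shorter than the header claims.
    let end = 4usize.checked_add(trie_size)?;
    let trie_bytes = blob.get(4..end)?;

    // Decode each 4-byte chunk as one little-endian trie unit.
    let trie: Vec<u32> = trie_bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();

    // Everything after the trie is the pool of normalized replacement strings.
    Some((trie, &blob[end..]))
}
```

Returning `None` on a truncated or misaligned blob keeps the sketch free of panics. A real parser would probably report a dedicated error type instead.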


Dependencies

~1.7–2.5MB
~49K SLoC