22 releases

Version | Released |
---|---|
0.11.2 | Jul 22, 2023 |
0.11.1 | Mar 19, 2023 |
0.10.0 | Oct 11, 2022 |
0.8.2 | Jul 30, 2022 |
0.1.3 | Feb 7, 2020 |
#275 in Machine learning
5,048 downloads per month
Used in 11 crates (7 directly)
2MB, 26K SLoC
This crate provides Rust bindings to the sentencepiece library, an unsupervised text tokenizer.

The main data structure of this crate is `SentencePieceProcessor`, which is used to tokenize sentences:
```rust
use sentencepiece::SentencePieceProcessor;

// Load a trained sentencepiece model from disk.
let spp = SentencePieceProcessor::open("testdata/toy.model").unwrap();

// Encode a sentence and keep only the piece strings.
let pieces = spp.encode("I saw a girl with a telescope.").unwrap()
    .into_iter().map(|p| p.piece).collect::<Vec<_>>();

assert_eq!(pieces, vec!["▁I", "▁saw", "▁a", "▁girl", "▁with",
    "▁a", "▁t", "el", "es", "c", "o", "pe", "."]);
```
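
In the pieces above, the `▁` character (U+2581) is the meta symbol sentencepiece uses to mark where a space preceded a piece, so a word such as "telescope" can be split into several pieces without losing its boundaries. As a rough illustration of that notation, the sketch below rebuilds the sentence from the piece strings using only the standard library. The helper `join_pieces` and the constant `WORD_BOUNDARY` are hypothetical names for this example and are not part of the crate; for real use, prefer the decoding that the sentencepiece library itself provides.

```rust
/// The meta symbol that marks a preceding space in a piece (U+2581).
const WORD_BOUNDARY: char = '\u{2581}';

/// Hypothetical helper (not part of the crate): rebuilds plain text
/// from piece strings by concatenating them, turning each boundary
/// marker into a space, and dropping the single leading space.
fn join_pieces(pieces: &[&str]) -> String {
    let joined: String = pieces.concat();
    let text = joined.replace(WORD_BOUNDARY, " ");
    text.strip_prefix(' ').unwrap_or(&text).to_string()
}
```

Calling `join_pieces` on the thirteen pieces from the assertion above should give back the original sentence, since the sub-word pieces "▁t", "el", "es", "c", "o", "pe" carry no marker between them and therefore join into a single word.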