5 unstable releases
Version | Released |
---|---|
0.3.1 (new) | Apr 17, 2025 |
0.3.0 | Apr 9, 2025 |
0.2.1 | Mar 2, 2025 |
0.2.0 | Feb 14, 2025 |
0.1.0 | Jan 31, 2025 |
#628 in Text processing
139 downloads per month
Used in 2 crates
280KB, 6K SLoC
Alphabet Detector
Detects 400 alphabets of 323 languages in 170 scripts
One language can be written in multiple scripts, so each combination is detected as a separate ScriptLanguage (language + script).
It does not use any models; it only matches alphabets. It is not recommended as a standalone detector: it works more like a word separator and language prefilter for an actual language detector.
Splits text (a CharIndices iterator) into words and detects the ScriptLanguages (language + script) of each word from the letters (chars) it uses.
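To make the idea concrete, here is a minimal std-only sketch of splitting a CharIndices iterator into words and labelling each word by the letters it contains. It is not the crate's implementation: Script, Word, script_of and split_words are hypothetical names, and the three hard-coded Unicode ranges stand in for the crate's much larger alphabet tables.

```rust
use std::str::CharIndices;

/// Hypothetical coarse label; the crate's ScriptLanguage is far more detailed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Latin,
    Greek,
    Cyrillic,
    Other,
}

/// Hypothetical word record: byte offset of the first char, the chars, and a label.
#[derive(Debug, PartialEq)]
struct Word {
    start: usize,
    chars: Vec<char>,
    script: Script,
}

/// Assumption: a few Unicode block ranges are enough for illustration.
fn script_of(ch: char) -> Script {
    match ch {
        'A'..='Z' | 'a'..='z' | '\u{00C0}'..='\u{024F}' => Script::Latin,
        '\u{0370}'..='\u{03FF}' => Script::Greek,
        '\u{0400}'..='\u{04FF}' => Script::Cyrillic,
        _ => Script::Other,
    }
}

/// Splits on every non-alphabetic char; a word that mixes scripts is labelled Other.
fn split_words(iter: CharIndices<'_>) -> Vec<Word> {
    let mut words = Vec::new();
    let mut current: Option<Word> = None;
    for (idx, ch) in iter {
        if ch.is_alphabetic() {
            let script = script_of(ch);
            // Start a new word at this byte offset if none is open.
            let word = current.get_or_insert_with(|| Word {
                start: idx,
                chars: Vec::new(),
                script,
            });
            if word.script != script {
                word.script = Script::Other;
            }
            word.chars.push(ch);
        } else if let Some(word) = current.take() {
            // A separator closes the open word.
            words.push(word);
        }
    }
    if let Some(word) = current {
        words.push(word);
    }
    words
}

fn main() {
    for word in split_words("Hello мир, αβγ!".char_indices()) {
        let text: String = word.chars.iter().collect();
        println!("{} @ byte {}: {:?}", text, word.start, word.script);
    }
}
```

A real prefilter would keep a set of candidate languages per word rather than a single label, since many languages share the same letters.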
Examples
To split text into an iterator of WordLang:
let word_iter = words::from_ch_ind::<Vec<char>>(text.char_indices());
If you don't need individual words, but just want to analyze a full text:
let (all_words, all_langs) = fulltext_filter_with_margin_sorted::<Vec<char>, 95>(text.char_indices());
It returns all Words (Vec<Word<Vec<char>>>) of the text and a Vec<(ScriptLanguage, u32)> filtered with a less than 5% margin of error.
Instead of Vec<char> you can use other types for words.
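The exact filtering rule behind the 95 parameter is not spelled out here. The sketch below shows one possible reading, purely as an assumption: sort the language counts in descending order and keep the top entries until they cover the given percentage of all counted words. The function keep_top_share and the &str language labels are hypothetical and are not part of the crate.

```rust
/// Hypothetical helper: keeps the most frequent languages until they
/// cover at least PERCENT percent of the total count.
fn keep_top_share<const PERCENT: u64>(mut counts: Vec<(&str, u32)>) -> Vec<(&str, u32)> {
    // Highest count first; the sort is stable, so ties keep their input order.
    counts.sort_by(|a, b| b.1.cmp(&a.1));
    let total: u64 = counts.iter().map(|&(_, n)| u64::from(n)).sum();

    let mut covered = 0u64;
    let mut kept = Vec::new();
    for (lang, n) in counts {
        // Stop once the kept entries already cover the requested share.
        if covered * 100 >= total * PERCENT {
            break;
        }
        covered += u64::from(n);
        kept.push((lang, n));
    }
    kept
}

fn main() {
    let counts = vec![("deu", 4), ("eng", 90), ("fra", 6)];
    for (lang, n) in keep_top_share::<95>(counts) {
        println!("{lang}: {n}");
    }
}
```

Passing the threshold as a const generic mirrors the call shape of the example above, where 95 appears as a type-level parameter.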
Extras
Look at alphabets.rs to see which languages already have defined alphabets. Some of them need validation.
Warning: it can return words with chars from the Unicode private use area (for example for the Lingala, Nuer or Yoruba languages). This happens because of char normalization (composition with Inherited chars) in cases where Unicode defines no such composed chars.
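If such chars are a problem downstream, they can be detected with a simple range check. The sketch below uses the three private use ranges defined by Unicode (U+E000..U+F8FF and planes 15 and 16); the helper names is_private_use and has_private_use are hypothetical and not part of the crate.

```rust
/// True if the char lies in one of the three Unicode private use ranges.
fn is_private_use(ch: char) -> bool {
    matches!(
        ch,
        '\u{E000}'..='\u{F8FF}'          // BMP Private Use Area
            | '\u{F0000}'..='\u{FFFFD}'  // Supplementary Private Use Area-A
            | '\u{100000}'..='\u{10FFFD}' // Supplementary Private Use Area-B
    )
}

/// True if any char of the word is a private use char.
fn has_private_use(word: &[char]) -> bool {
    word.iter().copied().any(is_private_use)
}

fn main() {
    let plain = ['a', 'b', 'c'];
    let odd = ['a', '\u{E001}'];
    println!("{} {}", has_private_use(&plain), has_private_use(&odd));
}
```

Words flagged this way cannot be rendered or compared meaningfully outside the crate, so a caller may prefer to fall back to the original text slice for them.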
Dependencies
~3–5MB
~84K SLoC