6 releases
0.2.3 | Jun 24, 2023 |
---|---|
0.2.2 | Jun 24, 2023 |
0.1.1 | Jun 5, 2023 |
#1579 in Text processing
81 downloads per month
390KB
14K SLoC
wordfreq-model
This crate provides a loader for pre-compiled wordfreq models, allowing you to easily create wordfreq instances for various languages.
Documentation
https://docs.rs/wordfreq-model/
Licensing
The source code is licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
The model files are distributed here together with the credits.
lib.rs:
wordfreq-model
This crate provides a loader for pre-compiled wordfreq models, allowing you to easily create `WordFreq` instances for various languages.
Instructions
The provided models are the same as those distributed in the original Python package. See the original documentation for the supported languages and their sources.
You need to specify models you want to use with `features`. The feature names are in the form of `large-xx` or `small-xx`, where `xx` is the language code. For example, if you want to use the large-English and small-Japanese models, specify `large-en` and `small-ja` as follows:
```toml
# Cargo.toml
[dependencies.wordfreq-model]
version = "0.2"
features = ["large-en", "small-ja"]
```
There is no default feature. Be sure to specify features you want to use.
Examples
`load_wordfreq` can create a `WordFreq` instance from a `ModelKind` enum value. `ModelKind` will have the specified feature names in CamelCase, such as `LargeEn` or `SmallJa`. By default, only `ModelKind::ExampleEn` appears for tests.
```rust
use approx::assert_relative_eq;
use wordfreq_model::load_wordfreq;
use wordfreq_model::ModelKind;

let wf = load_wordfreq(ModelKind::ExampleEn).unwrap();
assert_relative_eq!(wf.word_frequency("las"), 0.25);
assert_relative_eq!(wf.word_frequency("vegas"), 0.75);
assert_relative_eq!(wf.word_frequency("Las"), 0.25); // Standardized
```
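The naming rule between feature names and `ModelKind` variants can be sketched with a small standalone helper. `feature_to_variant` is a hypothetical function written only for illustration; it is not part of wordfreq-model, and it assumes every feature name is a hyphen-separated string such as `large-en`.

```rust
/// Converts a Cargo feature name such as "large-en" into the CamelCase
/// form described for `ModelKind` variants, such as "LargeEn".
/// Hypothetical helper for illustration; not part of wordfreq-model.
fn feature_to_variant(feature: &str) -> String {
    feature
        .split('-')
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                // Uppercase the first character and keep the rest unchanged.
                Some(first) => first.to_uppercase().chain(chars).collect::<String>(),
                // An empty segment (e.g. from "a--b") contributes nothing.
                None => String::new(),
            }
        })
        .collect()
}
```

Under these assumptions, calling `feature_to_variant("small-ja")` yields the string `SmallJa`, which matches the variant name you would pass to `load_wordfreq` as `ModelKind::SmallJa`.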
Standardization
As the above example shows, the model automatically standardizes words before looking them up (i.e., `Las` is handled as `las`). This is done by a `Standardizer` instance set up in the `WordFreq` instance. `load_wordfreq` automatically sets up an appropriate `Standardizer` instance for each language.
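To make the idea of standardizing before lookup concrete, here is a minimal standalone sketch. It is not the crate's implementation: `ToyWordFreq` is a hypothetical type, plain lowercasing stands in for whatever rules the real `Standardizer` applies for each language, and returning `0.0` for unknown words is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Toy frequency table that standardizes words before lookup.
/// Illustration only: lowercasing stands in for the crate's `Standardizer`.
struct ToyWordFreq {
    freqs: HashMap<String, f32>,
}

impl ToyWordFreq {
    /// Builds the table, standardizing each key as it is stored.
    fn new(entries: &[(&str, f32)]) -> Self {
        let freqs = entries
            .iter()
            .map(|(word, freq)| (Self::standardize(word), *freq))
            .collect();
        Self { freqs }
    }

    /// Assumed standardization rule for this sketch: lowercase the word.
    fn standardize(word: &str) -> String {
        word.to_lowercase()
    }

    /// Standardizes the query, then looks it up.
    /// Unknown words yield 0.0 (an assumption of this sketch).
    fn word_frequency(&self, word: &str) -> f32 {
        self.freqs
            .get(&Self::standardize(word))
            .copied()
            .unwrap_or(0.0)
    }
}
```

With a table built from `[("las", 0.25), ("vegas", 0.75)]`, the queries `"las"` and `"Las"` both resolve to the same entry, because both are standardized to the same key before the lookup.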
Notes
This crate downloads the specified model files and embeds the models directly into the source code. Specify only the models you need, to avoid extra downloads and a bloated binary.
The actual model files to be used are placed here together with the credits. If you do not want automatic model downloads and binary embedding, you can create instances from these files directly. See the instructions in [wordfreq].
Dependencies
~5.5–10MB
~192K SLoC