1 unstable release
Uses old Rust 2015
0.0.1 | Mar 23, 2016 |
---|
#37 in #classification
24KB
356 lines
langid-rs
NGram-based text classifier written in Rust.
This is not fully ready for use because it lacks pre-trained models and proper documentation.
Usage
Classifying using pre-trained models
Use the glob crate to get a list of files. Filenames will be used as names for models.
extern crate langid;
extern crate glob;
use langid::Classifier;
use glob::glob;
fn main() {
let paths = glob("./language_profiles/*.json").unwrap().filter_map(Result::ok);
let classifier = Classifier::from_files(paths);
let language = classifier.classify("Sample text that you want classified.");
println!("Sample language: {}", language);
}
Training and classifying on the fly
extern crate langid;
use langid::Classifier;
fn main() {
let first_language_training_text = "...";
let second_language_training_text = "...";
let mut classifier = Classifier::new();
classifier.train(first_language_training_text, "first");
classifier.train(second_language_training_text, "second");
let language = classifier.classify("Sample in the first language.");
println!("Sample language: {}", language);
}
Training
Run cargo install langid
to get the langid
CLI utility.
langid train [-o FILE] <FILE FILE...>
Create a model based on input text files. Write to stdout or to the file specified by -o
or --output
.
Credits
Implements algorithm described by William B. Cavnar and John M. Trenkle, “N-Gram-Based Text Categorization”, 1994.
Dependencies
~2MB
~34K SLoC