4 releases
Version | Released |
---|---|
0.1.3 | Jul 6, 2023 |
0.1.2 | Oct 19, 2020 |
0.1.1 | Aug 20, 2020 |
0.1.0 | Aug 20, 2020 |
rust-pragmatic-segmenter
Rust port of pySBD v3.1.0 and the Ruby pragmatic_segmenter.
rust-pragmatic-segmenter is a rule-based sentence boundary detection (SBD) library. It relies on a large set of regular expressions to split text into sentences.
use pragmatic_segmenter::Segmenter;
let segmenter = Segmenter::new()?;
let result: Vec<_> = segmenter.segment("Hi Mr. Kim. Let's meet at 3 P.M.").collect();
//=> vec!["Hi Mr. Kim. ", "Let's meet at 3 P.M."]
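To give a feel for what "rule-based" means here, the following is a toy sketch in plain std Rust. It is not the crate's implementation: `toy_segment` and `ABBREVIATIONS` are hypothetical names, and the two rules shown (punctuation must be followed by whitespace; a period after a listed abbreviation is not a boundary) stand in for the much larger regex rule set that pySBD and this port apply.

```rust
/// Hypothetical abbreviation list, for illustration only.
const ABBREVIATIONS: &[&str] = &["Mr", "Mrs", "Ms", "Dr", "Prof"];

/// Toy rule-based splitter (an assumption-laden sketch, not the crate's API).
/// Returned slices borrow from `text` and keep their trailing whitespace.
fn toy_segment(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0; // byte offset where the current sentence begins
    let mut chars = text.char_indices().peekable();

    while let Some((i, c)) = chars.next() {
        if !matches!(c, '.' | '!' | '?') {
            continue;
        }
        // Rule 1: a boundary needs whitespace right after the punctuation,
        // so the inner period of "P.M." and a period at end of input are skipped.
        match chars.peek() {
            Some(&(_, next)) if next.is_whitespace() => {}
            _ => continue,
        }
        // Rule 2: a period directly after a known abbreviation is not a boundary.
        let word = text[start..i]
            .rsplit(char::is_whitespace)
            .next()
            .unwrap_or("");
        if c == '.' && ABBREVIATIONS.contains(&word) {
            continue;
        }
        // Attach the whitespace that follows to the sentence it terminates.
        let mut end = i + c.len_utf8();
        while let Some(&(j, ws)) = chars.peek() {
            if !ws.is_whitespace() {
                break;
            }
            end = j + ws.len_utf8();
            chars.next();
        }
        sentences.push(&text[start..end]);
        start = end;
    }
    // Whatever remains after the last boundary is the final sentence.
    if start < text.len() {
        sentences.push(&text[start..]);
    }
    sentences
}
```

On the input from the example above, `toy_segment("Hi Mr. Kim. Let's meet at 3 P.M.")` yields `["Hi Mr. Kim. ", "Let's meet at 3 P.M."]`. Real text needs many more rules (numbers, quotations, lists, ellipses), which is why the library leans on a large collection of regular expressions.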
How to build
sudo apt install -y libclang-dev
cargo build
TODOs
- Match pySBD's behavior exactly (currently about 99%)
- Support languages other than English
- Remove regexes that use lookaround and backreferences
- Try Intel Hyperscan
- Fix mistakes in pySBD and possibly send patches upstream
- Optimize copies and allocations
- Use proper error types instead of a boxed error
- Import test cases from pySBD and Ruby pragmatic_segmenter
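As an illustration of the "proper error types" item above, a dedicated error enum might look roughly like the sketch below. Everything in it is hypothetical: `SegmenterError`, its `InvalidRule` variant, and the `check_rule` helper are invented for this example and are not part of the crate.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type; the variant and field names are illustrative only.
#[derive(Debug)]
enum SegmenterError {
    /// A rule's pattern was rejected while building the segmenter.
    InvalidRule { pattern: String, reason: String },
}

impl fmt::Display for SegmenterError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SegmenterError::InvalidRule { pattern, reason } => {
                write!(f, "invalid rule `{}`: {}", pattern, reason)
            }
        }
    }
}

// Implementing `Error` keeps the type compatible with `Box<dyn Error>` callers.
impl Error for SegmenterError {}

/// Stand-in validation step (hypothetical): rejects an empty pattern.
fn check_rule(pattern: &str) -> Result<(), SegmenterError> {
    if pattern.is_empty() {
        return Err(SegmenterError::InvalidRule {
            pattern: pattern.to_string(),
            reason: "pattern is empty".to_string(),
        });
    }
    Ok(())
}
```

With a concrete type, callers can match on the variant directly instead of downcasting a boxed trait object, while `?` still converts the error into `Box<dyn Error>` where that is preferred.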
rust-pragmatic-segmenter is primarily distributed under the terms of both the Apache License (Version 2.0) and the MIT license. See COPYRIGHT for details.