#machine-learning #neural-network #deep-learning #language-model #tokenizer #pytorch #sentencizer

nnsplit

A tool to split text using a neural network. For sentence boundary detection, compound splitting and more.

29 releases

0.5.9 Mar 5, 2023
0.5.8 Jul 23, 2021
0.5.7 Mar 16, 2021
0.5.2 Nov 1, 2020
0.2.2 Feb 26, 2020

#256 in Machine learning

MIT license

33KB
665 lines

NNSplit


A tool to split text using a neural network. The main application is sentence boundary detection, but other tasks, e.g. compound splitting for German, are also supported.

Features

  • Robust: Not reliant on proper punctuation, spelling and case. See the metrics.
  • Small: NNSplit uses a byte-level LSTM, so weights are small (< 4MB) and models can be trained for any language that can be encoded in Unicode.
  • Portable: NNSplit is written in Rust with bindings for Rust, Python, and JavaScript (Browser and Node.js). See how to get started in the usage section.
  • Fast: Up to 2x faster than spaCy sentencization. See the benchmark.
  • Multilingual: NNSplit currently has models for 9 different languages (German, English, French, Norwegian, Swedish, Simplified Chinese, Turkish, Russian and Ukrainian). Try them in the demo.
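
To illustrate why a byte-level model needs no language-specific vocabulary, here is a minimal sketch in plain Python. It does not use the NNSplit API, and NNSplit's actual preprocessing may differ. The helper names `to_byte_ids` and `split_at_byte_offsets` are hypothetical, and the boundary offsets are supplied by hand where a trained model would predict them.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 byte values; every ID falls in 0..255."""
    return list(text.encode("utf-8"))


def split_at_byte_offsets(text: str, boundaries: list[int]) -> list[str]:
    """Cut text at the given byte offsets (exclusive end indices).

    `boundaries` stands in for per-byte predictions from a model.
    Each offset must fall on a UTF-8 character boundary, otherwise
    decoding raises UnicodeDecodeError.
    """
    data = text.encode("utf-8")
    parts, start = [], 0
    for end in sorted(boundaries):
        parts.append(data[start:end].decode("utf-8"))
        start = end
    if start < len(data):
        parts.append(data[start:].decode("utf-8"))
    return parts


# Mixed Latin and Cyrillic input: both map onto the same 256 possible IDs.
text = "Das ist gut. Пример текста"
ids = to_byte_ids(text)

# Hand-picked boundary after the first sentence (13 bytes, including the space).
parts = split_at_byte_offsets(text, [13])
```

Because the input alphabet is fixed at 256 byte values regardless of script, the embedding table stays small, which is consistent with the small weight size noted above.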

Documentation has moved to the NNSplit website: https://bminixhofer.github.io/nnsplit.

License

NNSplit is licensed under the MIT license.

Dependencies

~2–13MB
~171K SLoC