#korean #tf-idf #cjk #convert #issue #dictionary #document

bin+lib ragit-korean

Korean tokenizer for ragit

7 releases

Uses new Rust 2024

0.3.4 Mar 11, 2025
0.3.3 Mar 9, 2025
0.3.1 Feb 26, 2025
0.2.1 Feb 1, 2025
0.2.0 Dec 30, 2024

#1042 in Text processing

Download history 101/week @ 2024-12-25 25/week @ 2025-01-01 9/week @ 2025-01-08 122/week @ 2025-01-29 15/week @ 2025-02-05 6/week @ 2025-02-12 126/week @ 2025-02-19 353/week @ 2025-02-26 211/week @ 2025-03-05 94/week @ 2025-03-12

785 downloads per month
Used in 2 crates (via ragit)

MIT license

66KB
1K SLoC

ragit-korean

Ragit-korean is a very simple Korean tokenizer.

Ragit used to use charabia to tokenize CJK documents, but that caused two problems:

  1. Charabia bundles CJK dictionaries in the binary, which makes the binary 70 MiB larger.
  2. It silently converts precomposed (완성형) Korean to decomposed (조합형) Korean. That quietly breaks TF-IDF searches, because the tokens no longer match the original text.
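
To illustrate the second issue, here is a minimal sketch of why decomposition hurts term matching. It is not charabia's or ragit-korean's code: `decompose_hangul` and `term_counts` are hypothetical helpers written for this example. The decomposition uses the standard Unicode arithmetic for Hangul syllables, and the "index" is just a whitespace-split term counter standing in for the term-frequency side of TF-IDF.

```rust
use std::collections::HashMap;

// Constants from the Unicode Hangul syllable decomposition algorithm.
const S_BASE: u32 = 0xAC00; // first precomposed syllable, '가'
const L_BASE: u32 = 0x1100; // first leading consonant jamo
const V_BASE: u32 = 0x1161; // first vowel jamo
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant jamo
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = 19 * N_COUNT; // 11172 precomposed syllables

/// Hypothetical helper: rewrites every precomposed Hangul syllable as a
/// sequence of conjoining jamo and leaves all other characters untouched.
fn decompose_hangul(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        let code = c as u32;
        if (S_BASE..S_BASE + S_COUNT).contains(&code) {
            let s_index = code - S_BASE;
            let l = L_BASE + s_index / N_COUNT;
            let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
            let t = T_BASE + s_index % T_COUNT;
            // All three values fall inside the Hangul Jamo block,
            // so the conversions below cannot fail.
            out.push(char::from_u32(l).unwrap());
            out.push(char::from_u32(v).unwrap());
            if t != T_BASE {
                // Only syllables with a final consonant get a third jamo.
                out.push(char::from_u32(t).unwrap());
            }
        } else {
            out.push(c);
        }
    }
    out
}

/// Hypothetical stand-in for an index: counts whitespace-separated terms.
fn term_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for term in text.split_whitespace() {
        *counts.entry(term.to_string()).or_insert(0) += 1;
    }
    counts
}
```

If a document is indexed as `term_counts(&decompose_hangul(doc))`, looking up a precomposed query such as `한국어` finds nothing, even though both forms render identically on screen. The lookup only succeeds if the query is decomposed the same way, which is easy to miss when the conversion happens silently.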

Dependencies

~10KB