
Lindera Filter

License: MIT | Chat: https://gitter.im/lindera-morphology/lindera

Character and token filters for Lindera.

Character filters

Japanese iteration mark character filter

Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form; sequences of iteration marks are supported. When an iteration mark follows a character that is not of the expected script, the filter still expands it by copying that source character as-is. For example, the input "?ゝ" becomes "??" even though the question mark is not hiragana.
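
As a rough illustration of the behavior above, here is a minimal sketch (not the crate's API; the function name and simplifications are assumptions) that expands single, unvoiced iteration marks by copying the preceding character:

```rust
// Toy expansion of single iteration marks (ゝ/ヽ); the real filter also
// handles voiced marks (ゞ/ヾ) and sequences of marks.
fn expand_iteration_marks(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            'ゝ' | 'ヽ' => match out.chars().last() {
                // Copy the preceding character as-is, whatever its script.
                Some(prev) => out.push(prev),
                None => out.push(c),
            },
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(expand_iteration_marks("こゝろ"), "こころ");
    assert_eq!(expand_iteration_marks("?ゝ"), "??"); // non-hiragana source is copied as-is
}
```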

Mapping character filter

Replace characters according to the specified character mappings, correcting the token offsets for the resulting changes. Matching is greedy (the longest pattern matching at a given point wins), and the replacement may be the empty string.
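
To make the greedy matching rule concrete, here is a minimal sketch of longest-match replacement (the function name is an assumption and offset correction is omitted; this is not the crate's API):

```rust
use std::collections::HashMap;

// Toy greedy longest-match mapping: at each position the longest key wins.
fn apply_mapping(input: &str, mapping: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    let mut i = 0;
    let max_key = mapping.keys().map(|k| k.len()).max().unwrap_or(0);
    while i < input.len() {
        let mut matched = None;
        // Try the longest candidate first so the longest pattern wins.
        for len in (1..=max_key.min(input.len() - i)).rev() {
            if !input.is_char_boundary(i + len) {
                continue;
            }
            if let Some(rep) = mapping.get(&input[i..i + len]) {
                matched = Some((len, *rep));
                break;
            }
        }
        match matched {
            Some((len, rep)) => {
                out.push_str(rep); // the replacement may be the empty string
                i += len;
            }
            None => {
                let c = input[i..].chars().next().unwrap();
                out.push(c);
                i += c.len_utf8();
            }
        }
    }
    out
}

fn main() {
    let mut m = HashMap::new();
    m.insert("リンデラ", "Lindera");
    m.insert("リ", "ri");
    // The longer key wins even though "リ" also matches at the same position.
    assert_eq!(apply_mapping("リンデラ形態素解析", &m), "Lindera形態素解析");
}
```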

Regex character filter

A character filter that uses a regular expression to match the target text and replaces each match with the specified string.
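
A minimal sketch of the idea using the widely used regex crate (the function name is an assumption; this is not the lindera-filter API):

```rust
use regex::Regex;

// Replace every match of `pattern` in `input` with `replacement`.
fn regex_replace(input: &str, pattern: &str, replacement: &str) -> String {
    let re = Regex::new(pattern).expect("invalid regular expression");
    re.replace_all(input, replacement).into_owned()
}

fn main() {
    // Collapse runs of whitespace into a single space.
    assert_eq!(regex_replace("Lindera   is  fast", r"\s+", " "), "Lindera is fast");
}
```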

Unicode normalize character filter

Normalizes the input text using the specified Unicode normalization form: one of NFC, NFD, NFKC, or NFKD.
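
For illustration, a minimal sketch using the unicode-normalization crate to apply one of those forms (NFKC); the function name is an assumption, not the crate's API:

```rust
use unicode_normalization::UnicodeNormalization;

// Apply NFKC normalization to the input text.
fn normalize_nfkc(input: &str) -> String {
    input.nfkc().collect()
}

fn main() {
    // NFKC folds full-width Latin and half-width katakana into their canonical forms.
    assert_eq!(normalize_nfkc("Ｌｉｎｄｅｒａ"), "Lindera");
    assert_eq!(normalize_nfkc("ﾘﾝﾃﾞﾗ"), "リンデラ");
}
```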

Token filters

Japanese base form token filter

Replace the term text with the base form registered in the morphological dictionary. This acts as a lemmatizer for verbs and adjectives.

Japanese compound word token filter

Compound consecutive tokens that have the specified part-of-speech tags into a single token. This is useful for handling compound words that are not registered in the morphological dictionary.
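
A minimal sketch of the merging idea, with a simplified (text, part-of-speech) token type; the names and types are assumptions, not the crate's API:

```rust
// Toy merge of consecutive tokens whose part-of-speech tag is in `tags`;
// real tokens also carry offsets and full detail arrays, omitted here.
fn compound(tokens: &[(&str, &str)], tags: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    let mut in_run = false;
    for &(text, tag) in tokens {
        if tags.contains(&tag) {
            if in_run {
                out.last_mut().unwrap().push_str(text); // extend the current compound
            } else {
                out.push(text.to_string());
                in_run = true;
            }
        } else {
            out.push(text.to_string());
            in_run = false;
        }
    }
    out
}

fn main() {
    let tokens = [("株式", "名詞"), ("会社", "名詞"), ("の", "助詞"), ("犬", "名詞")];
    assert_eq!(compound(&tokens, &["名詞"]), ["株式会社", "の", "犬"]);
}
```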

Japanese katakana stem token filter

Normalizes common katakana spelling variations ending with a long sound (U+30FC) by removing that character. Only katakana words longer than the minimum length are stemmed.
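
A minimal sketch of the stemming rule, with an assumed function name and a simplified katakana check (not the crate's API):

```rust
// Toy stemming rule: strip one trailing prolonged sound mark (U+30FC, 'ー')
// from katakana tokens longer than `min_len` characters.
fn katakana_stem(token: &str, min_len: usize) -> String {
    let is_katakana = token.chars().all(|c| ('\u{30A1}'..='\u{30FC}').contains(&c));
    if is_katakana && token.chars().count() > min_len && token.ends_with('ー') {
        token[..token.len() - 'ー'.len_utf8()].to_string()
    } else {
        token.to_string()
    }
}

fn main() {
    assert_eq!(katakana_stem("コンピューター", 3), "コンピュータ");
    assert_eq!(katakana_stem("カー", 3), "カー"); // too short, left untouched
}
```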

Japanese keep tags token filter

Keep only tokens with the specified part-of-speech tag.

Japanese number token filter

Convert tokens representing Japanese numerals, including Kanji numerals, to Arabic numerals.
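
A minimal sketch of the numeral conversion for simple cases only (digits 〇-九 with 十/百/千 multipliers; no 万 or 億). The function and its limits are assumptions, not the crate's implementation:

```rust
// Toy kanji-numeral parser: "百二十三" -> 123, "千十" -> 1010.
fn kanji_to_number(text: &str) -> Option<u64> {
    let digit = |c: char| "〇一二三四五六七八九".chars().position(|d| d == c).map(|v| v as u64);
    let unit = |c: char| match c {
        '十' => Some(10),
        '百' => Some(100),
        '千' => Some(1000),
        _ => None,
    };
    let mut total = 0u64;
    let mut current = 0u64;
    for c in text.chars() {
        if let Some(d) = digit(c) {
            current = current * 10 + d;
        } else if let Some(u) = unit(c) {
            // An omitted leading digit means 1 (e.g. 十 = 10).
            total += if current == 0 { 1 } else { current } * u;
            current = 0;
        } else {
            return None; // not a supported kanji numeral
        }
    }
    Some(total + current)
}

fn main() {
    assert_eq!(kanji_to_number("百二十三"), Some(123));
    assert_eq!(kanji_to_number("千十"), Some(1010));
}
```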

Japanese reading form token filter

Replace the text of a token with the reading of the text as registered in the morphological dictionary. The reading is in katakana.

Japanese stop tags token filter

Remove tokens with the specified part-of-speech tag.

Keep words token filter

Keep only tokens whose text matches one of the specified words.

Korean keep tags token filter

Keep only tokens with the specified part-of-speech tag.

Korean reading form token filter

Replace the text of a token with the reading of the text as registered in the morphological dictionary.

Korean stop tags token filter

Remove tokens with the specified part-of-speech tag.

Length token filter

Keep only tokens whose text length, in characters, falls within the specified bounds.

Lowercase token filter

Normalizes token text to lowercase.

Mapping token filter

Replace characters with the specified character mappings.

Stop words token filter

Remove tokens whose text matches one of the specified stop words.

Uppercase token filter

Normalizes token text to uppercase.

API reference

The API reference is available. Please see the following URL:
