#morphological-analysis #sqlite #library

lindera-sqlite

Lindera tokenizer for SQLite FTS5 extension

4 releases (2 breaking)

new 0.41.0 Apr 13, 2025
0.40.2 Apr 2, 2025
0.40.1 Mar 27, 2025
0.38.1 Dec 6, 2024

#2265 in Text processing


273 downloads per month

AGPL-3.0-only

38KB
515 lines

Overview

lindera-sqlite is a C ABI library that exposes an FTS5 tokenizer function.

When used as a custom FTS5 tokenizer, it enables applications to support Chinese, Japanese, and Korean in full-text search.
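To illustrate why a morphological tokenizer matters here, the following is a minimal sketch that uses only Python's standard `sqlite3` module and the built-in FTS5 `unicode61` tokenizer, not lindera-sqlite itself. It assumes the local SQLite build includes FTS5; the table name `demo` and the helper `count_matches` are made up for illustration. Because `unicode61` splits only on whitespace and punctuation, an unsegmented Japanese sentence is indexed as one long token, so a plain search for a word inside it should find nothing, while a prefix query still matches.

```python
import sqlite3

# In-memory database; assumes this SQLite build was compiled with FTS5.
conn = sqlite3.connect(":memory:")

# 'unicode61' is the FTS5 default tokenizer, named explicitly for clarity.
conn.execute("CREATE VIRTUAL TABLE demo USING fts5(content, tokenize='unicode61')")
conn.execute(
    "INSERT INTO demo(content) VALUES (?)",
    ("Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。",),
)


def count_matches(query: str) -> int:
    """Return the number of rows whose content matches the FTS5 query."""
    row = conn.execute(
        "SELECT count(*) FROM demo WHERE content MATCH ?", (query,)
    ).fetchone()
    return row[0]


# "Lindera" is fused with the following Japanese characters into one token,
# so an exact-term query is not expected to match.
word_hits = count_matches("Lindera")

# A prefix query matches any token that starts with "lindera".
prefix_hits = count_matches("Lindera*")

print(word_hits, prefix_hits)
```

A tokenizer that segments the sentence into words, which is the role lindera-sqlite plays, is what allows the plain term query to match.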

Build extension

% cargo build --features=ipadic,ko-dic,cc-cedict,compress,extension

Set the environment variable for the Lindera configuration

% export LINDERA_CONFIG_PATH=./resources/lindera.yml

Then start SQLite

% sqlite3 example.db

Load extension

sqlite> .load ./target/debug/liblindera_sqlite lindera_fts5_tokenizer_init

Create table using FTS5 with Lindera tokenizer

sqlite> CREATE VIRTUAL TABLE example USING fts5(content, tokenize='lindera_tokenizer');

Insert data

sqlite> INSERT INTO example(content) VALUES ('Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。');

Search data

sqlite> SELECT * FROM example WHERE content MATCH 'Lindera' ORDER BY bm25(example) LIMIT 10;
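The `ORDER BY bm25(example)` clause sorts results by relevance. As a rough sketch of how that ordering behaves, the example below uses Python's standard `sqlite3` module with the default FTS5 tokenizer and space-separated English text, so it runs without the extension loaded. It assumes an SQLite build with FTS5; the table `docs` and its rows are invented for illustration. FTS5's `bm25()` returns lower (more negative) values for better matches, so an ascending sort puts the most relevant row first.

```python
import sqlite3

# In-memory database; assumes this SQLite build was compiled with FTS5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(content)")

# Illustrative rows: row 2 is short and repeats the search term,
# row 1 mentions it once in a longer text, rows 3-5 do not mention it.
conn.executemany(
    "INSERT INTO docs(rowid, content) VALUES (?, ?)",
    [
        (1, "lindera is a morphological analysis engine for text"),
        (2, "lindera lindera tokenizer"),
        (3, "sqlite is an embedded database"),
        (4, "full text search with fts5"),
        (5, "user dictionaries are supported"),
    ],
)

# Same query shape as in the README: match, rank by bm25, limit.
rows = conn.execute(
    "SELECT rowid, bm25(docs) FROM docs "
    "WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT 10",
    ("lindera",),
).fetchall()

# Row ids in ranked order, best match first.
ranked_ids = [rowid for rowid, _score in rows]

for rowid, score in rows:
    print(rowid, score)
```

With the Lindera tokenizer loaded, the same ranking applies to the tokens it produces from Chinese, Japanese, or Korean text.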

Dependencies

~18–31MB
~536K SLoC