Lindera Core
A morphological analysis core library for Lindera. This project is a fork of kuromoji-rs.
This package contains dictionary structures and the Viterbi algorithm.
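As a rough illustration of how a Viterbi search can combine word costs and context IDs from a dictionary, the sketch below finds the cheapest path through a tiny word lattice. It is not Lindera's implementation: `Node`, `connection_cost`, `best_path`, and `toy_lattice` are hypothetical names, the connection cost is a stand-in for a real connection matrix, and beginning/end-of-sentence costs are ignored.

```rust
/// A candidate word in the lattice (hypothetical type, not Lindera's API).
#[derive(Debug, Clone)]
pub struct Node {
    pub surface: &'static str,
    pub start: usize, // offset where the word begins
    pub end: usize,   // offset just past the word (must be > start)
    pub left_id: u16,
    pub right_id: u16,
    pub word_cost: i32,
}

/// Toy connection cost: a real dictionary would look this up in a matrix
/// indexed by the previous word's right ID and the next word's left ID.
pub fn connection_cost(prev_right_id: u16, next_left_id: u16) -> i32 {
    if prev_right_id == next_left_id { 0 } else { 10 }
}

/// Returns the surfaces on the cheapest path covering `0..len` and its total cost.
pub fn best_path(nodes: &[Node], len: usize) -> Option<(Vec<&'static str>, i32)> {
    // Visit nodes by ascending start so every predecessor is settled first.
    let mut order: Vec<usize> = (0..nodes.len()).collect();
    order.sort_by_key(|&i| nodes[i].start);

    // best[i] = (cheapest cost of a path ending in node i, previous node on that path)
    let mut best: Vec<Option<(i32, Option<usize>)>> = vec![None; nodes.len()];
    for &i in &order {
        let node = &nodes[i];
        if node.start == 0 {
            best[i] = Some((node.word_cost, None));
            continue;
        }
        for (j, prev) in nodes.iter().enumerate() {
            if prev.end != node.start {
                continue;
            }
            if let Some((prev_cost, _)) = best[j] {
                let cost =
                    prev_cost + connection_cost(prev.right_id, node.left_id) + node.word_cost;
                if best[i].map_or(true, |(c, _)| cost < c) {
                    best[i] = Some((cost, Some(j)));
                }
            }
        }
    }

    // Pick the cheapest reachable node that ends exactly at `len`.
    let (mut cur, total) = nodes
        .iter()
        .enumerate()
        .filter(|(_, n)| n.end == len)
        .filter_map(|(i, _)| best[i].map(|(c, _)| (i, c)))
        .min_by_key(|&(_, c)| c)?;

    // Follow the back pointers to recover the path.
    let mut path = vec![nodes[cur].surface];
    while let Some((_, Some(prev))) = best[cur] {
        path.push(nodes[prev].surface);
        cur = prev;
    }
    path.reverse();
    Some((path, total))
}

/// A made-up lattice over the 4-character input "abcd".
pub fn toy_lattice() -> Vec<Node> {
    vec![
        Node { surface: "a", start: 0, end: 1, left_id: 1, right_id: 1, word_cost: 1 },
        Node { surface: "b", start: 1, end: 2, left_id: 2, right_id: 2, word_cost: 1 },
        Node { surface: "ab", start: 0, end: 2, left_id: 1, right_id: 1, word_cost: 5 },
        Node { surface: "cd", start: 2, end: 4, left_id: 1, right_id: 1, word_cost: 3 },
    ]
}
```

With `toy_lattice()`, `best_path(&toy_lattice(), 4)` prefers "ab" + "cd" over "a" + "b" + "cd": the single words are cheaper on their own, but the mismatched context IDs add connection costs that outweigh the saving.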
Dictionary format
IPADIC
This repository uses mecab-ipadic.
IPADIC dictionary format
Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | Surface | |
1 | 左文脈ID | Left context ID | |
2 | 右文脈ID | Right context ID | |
3 | コスト | Cost | |
4 | 品詞 | Major POS classification | |
5 | 品詞細分類1 | Middle POS classification | |
6 | 品詞細分類2 | Small POS classification | |
7 | 品詞細分類3 | Fine POS classification | |
8 | 活用型 | Conjugation type | |
9 | 活用形 | Conjugation form | |
10 | 原形 | Base form | |
11 | 読み | Reading | |
12 | 発音 | Pronunciation | |
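To make the column layout concrete, here is a minimal sketch that splits one lexicon row into the fields listed above. `IpadicEntry` and `parse_ipadic_row` are hypothetical names, not part of Lindera's API, and the sketch assumes a plain comma split is enough (it does not handle quoted fields that contain commas).

```rust
/// One row of an IPADIC-style lexicon CSV (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub struct IpadicEntry {
    pub surface: String,
    pub left_id: u16,
    pub right_id: u16,
    pub cost: i16,
    /// Columns 4 through 12: POS levels, conjugation, base form, reading, pronunciation.
    pub features: Vec<String>,
}

/// Parses a 13-column row. Assumption: fields never contain quoted commas.
pub fn parse_ipadic_row(line: &str) -> Result<IpadicEntry, String> {
    let cols: Vec<&str> = line.trim_end().split(',').collect();
    if cols.len() != 13 {
        return Err(format!("expected 13 columns, found {}", cols.len()));
    }
    let left_id = cols[1]
        .parse::<u16>()
        .map_err(|e| format!("invalid left context ID {:?}: {}", cols[1], e))?;
    let right_id = cols[2]
        .parse::<u16>()
        .map_err(|e| format!("invalid right context ID {:?}: {}", cols[2], e))?;
    let cost = cols[3]
        .parse::<i16>()
        .map_err(|e| format!("invalid cost {:?}: {}", cols[3], e))?;
    Ok(IpadicEntry {
        surface: cols[0].to_string(),
        left_id,
        right_id,
        cost,
        features: cols[4..].iter().map(|s| s.to_string()).collect(),
    })
}
```

For example, a row such as `東京,1293,1293,3003,名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー` (the numeric values here are illustrative) yields the surface `東京` and nine feature columns, with the base form at `features[6]`.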
IPADIC user dictionary format (CSV)
IPADIC user dictionary simple version
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | Surface | |
1 | 品詞 | Major POS classification | |
2 | 読み | Reading | |
IPADIC user dictionary detailed version
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | Surface | |
1 | 左文脈ID | Left context ID | |
2 | 右文脈ID | Right context ID | |
3 | コスト | Cost | |
4 | 品詞 | POS | |
5 | 品詞細分類1 | POS subcategory 1 | |
6 | 品詞細分類2 | POS subcategory 2 | |
7 | 品詞細分類3 | POS subcategory 3 | |
8 | 活用型 | Conjugation type | |
9 | 活用形 | Conjugation form | |
10 | 原形 | Base form | |
11 | 読み | Reading | |
12 | 発音 | Pronunciation | |
13 | - | - | After 13, it can be freely expanded. |
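The simple and detailed layouts can be told apart by column count alone. The sketch below assumes exactly 3 columns for the simple version and 13 or more for the detailed version, as the tables above suggest; `UserDictFormat`, `classify_user_row`, and the two constants are hypothetical names, and quoted commas are not handled.

```rust
/// Which user dictionary layout a CSV row follows (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum UserDictFormat {
    /// surface, POS, reading
    Simple,
    /// The 13 standard columns plus any freely added columns after them.
    Detailed { extra_columns: usize },
}

pub const SIMPLE_COLUMNS: usize = 3;
pub const DETAILED_COLUMNS: usize = 13;

/// Classifies a row by its column count. Assumption: no quoted commas.
pub fn classify_user_row(line: &str) -> Result<UserDictFormat, String> {
    let count = line.trim_end().split(',').count();
    if count == SIMPLE_COLUMNS {
        Ok(UserDictFormat::Simple)
    } else if count >= DETAILED_COLUMNS {
        Ok(UserDictFormat::Detailed {
            extra_columns: count - DETAILED_COLUMNS,
        })
    } else {
        Err(format!(
            "expected {} or at least {} columns, found {}",
            SIMPLE_COLUMNS, DETAILED_COLUMNS, count
        ))
    }
}
```

A row like `東京スカイツリー,カスタム名詞,トウキョウスカイツリー` would be classified as `Simple`, while a 14-column row would be `Detailed { extra_columns: 1 }`.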
IPADIC NEologd
This repository uses mecab-ipadic-neologd.
IPADIC NEologd dictionary format
Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | Surface | |
1 | 左文脈ID | Left context ID | |
2 | 右文脈ID | Right context ID | |
3 | コスト | Cost | |
4 | 品詞 | Major POS classification | |
5 | 品詞細分類1 | Middle POS classification | |
6 | 品詞細分類2 | Small POS classification | |
7 | 品詞細分類3 | Fine POS classification | |
8 | 活用型 | Conjugation type | |
9 | 活用形 | Conjugation form | |
10 | 原形 | Base form | |
11 | 読み | Reading | |
12 | 発音 | Pronunciation | |
IPADIC NEologd user dictionary format (CSV)
IPADIC NEologd user dictionary simple version
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | Surface | |
1 | 品詞 | Major POS classification | |
2 | 読み | Reading | |
IPADIC NEologd user dictionary detailed version
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | Surface | |
1 | 左文脈ID | Left context ID | |
2 | 右文脈ID | Right context ID | |
3 | コスト | Cost | |
4 | 品詞 | POS | |
5 | 品詞細分類1 | POS subcategory 1 | |
6 | 品詞細分類2 | POS subcategory 2 | |
7 | 品詞細分類3 | POS subcategory 3 | |
8 | 活用型 | Conjugation type | |
9 | 活用形 | Conjugation form | |
10 | 原形 | Base form | |
11 | 読み | Reading | |
12 | 発音 | Pronunciation | |
13 | - | - | After 13, it can be freely expanded. |
UniDic
This repository uses unidic-mecab.
UniDic dictionary format
Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | Surface | |
1 | 左文脈ID | Left context ID | |
2 | 右文脈ID | Right context ID | |
3 | コスト | Cost | |
4 | 品詞大分類 | Major POS classification | |
5 | 品詞中分類 | Middle POS classification | |
6 | 品詞小分類 | Small POS classification | |
7 | 品詞細分類 | Fine POS classification | |
8 | 活用型 | Conjugation type | |
9 | 活用形 | Conjugation form | |
10 | 語彙素読み | Lexeme reading | |
11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
12 | 書字形出現形 | Orthography appearance type | |
13 | 発音形出現形 | Pronunciation appearance type | |
14 | 書字形基本形 | Orthography basic type | |
15 | 発音形基本形 | Pronunciation basic type | |
16 | 語種 | Word type | |
17 | 語頭変化型 | Prefix of a word form | |
18 | 語頭変化形 | Prefix of a word type | |
19 | 語末変化型 | Suffix of a word form | |
20 | 語末変化形 | Suffix of a word type | |
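With 21 columns, looking fields up by name is less error-prone than hard-coding indices. The sketch below mirrors the Japanese column names from the table above in a constant array; `UNIDIC_COLUMNS` and `unidic_field` are hypothetical helpers, not part of Lindera, and column 11 is shortened to `語彙素` for convenience.

```rust
/// Column names in table order (index 11 abbreviated from
/// "語彙素(語彙素表記 + 語彙素細分類)" to "語彙素").
pub const UNIDIC_COLUMNS: [&str; 21] = [
    "表層形",
    "左文脈ID",
    "右文脈ID",
    "コスト",
    "品詞大分類",
    "品詞中分類",
    "品詞小分類",
    "品詞細分類",
    "活用型",
    "活用形",
    "語彙素読み",
    "語彙素",
    "書字形出現形",
    "発音形出現形",
    "書字形基本形",
    "発音形基本形",
    "語種",
    "語頭変化型",
    "語頭変化形",
    "語末変化型",
    "語末変化形",
];

/// Returns the value of the named column, or `None` if the name is unknown
/// or the row is too short to contain that column.
pub fn unidic_field<'a>(row: &[&'a str], name: &str) -> Option<&'a str> {
    let index = UNIDIC_COLUMNS.iter().position(|&column| column == name)?;
    row.get(index).copied()
}
```

Given a row already split into columns, `unidic_field(&row, "語彙素読み")` returns the value at index 10.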
UniDic user dictionary format (CSV)
UniDic user dictionary simple version
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | Surface | |
1 | 品詞大分類 | Major POS classification | |
2 | 語彙素読み | Lexeme reading | |
UniDic user dictionary detailed version
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | Surface | |
1 | 左文脈ID | Left context ID | |
2 | 右文脈ID | Right context ID | |
3 | コスト | Cost | |
4 | 品詞大分類 | Major POS classification | |
5 | 品詞中分類 | Middle POS classification | |
6 | 品詞小分類 | Small POS classification | |
7 | 品詞細分類 | Fine POS classification | |
8 | 活用型 | Conjugation type | |
9 | 活用形 | Conjugation form | |
10 | 語彙素読み | Lexeme reading | |
11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
12 | 書字形出現形 | Orthography appearance type | |
13 | 発音形出現形 | Pronunciation appearance type | |
14 | 書字形基本形 | Orthography basic type | |
15 | 発音形基本形 | Pronunciation basic type | |
16 | 語種 | Word type | |
17 | 語頭変化型 | Prefix of a word form | |
18 | 語頭変化形 | Prefix of a word type | |
19 | 語末変化型 | Suffix of a word form | |
20 | 語末変化形 | Suffix of a word type | |
21 | - | - | After 21, it can be freely expanded. |
ko-dic
This repository uses mecab-ko-dic.
ko-dic dictionary format
Information about the dictionary format and part-of-speech tags used by mecab-ko-dic is documented in this Google Spreadsheet, linked to from mecab-ko-dic's repository readme.
Note that ko-dic has one fewer feature column than NAIST JDIC and carries a different set of information (for example, it does not provide the "original form" of the word).
The tags are a slight modification of those specified by the 세종 (Sejong) tag set. The mappings from Sejong to mecab-ko-dic's tag names are given in the `태그 v2.0` tab of the above-linked spreadsheet.
The dictionary format is specified fully (in Korean) in the `사전 형식 v2.0` tab of the spreadsheet. Any blank values default to `*`.
Index | Name (Korean) | Name (English) | Notes |
---|---|---|---|
0 | 표면 | Surface | |
1 | 왼쪽 문맥 ID | Left context ID | |
2 | 오른쪽 문맥 ID | Right context ID | |
3 | 비용 | Cost | |
4 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab of the spreadsheet |
5 | 의미 부류 | Semantic class | (too few examples for me to be sure) |
6 | 종성 유무 | Presence of a final consonant | T for true; F for false; else * |
7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
9 | 첫번째 품사 | First part-of-speech | e.g. for a part-of-speech tag of "VV+EM+VX+EP", this is VV |
10 | 마지막 품사 | Last part-of-speech | e.g. for a part-of-speech tag of "VV+EM+VX+EP", this is EP |
11 | 표현 | Expression | Describes how an inflection (활용), compound noun (복합명사), or pre-analysis (기분석) entry is composed |
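The relationship between a compound part-of-speech tag and its first and last parts, and the `*` default for blank values, can be sketched as follows. Both helpers are hypothetical and only illustrate the notes in the table; in particular, what the dictionary itself stores in columns 9 and 10 for a single-tag entry is not checked here.

```rust
/// Splits a compound tag such as "VV+EM+VX+EP" on '+' and returns its
/// first and last components. A tag without '+' is returned as both.
pub fn first_and_last_pos(tag: &str) -> (&str, &str) {
    let first = tag.split('+').next().unwrap_or(tag);
    let last = tag.rsplit('+').next().unwrap_or(tag);
    (first, last)
}

/// Applies the "blank values default to *" rule to a single column value.
pub fn or_default(value: &str) -> &str {
    if value.trim().is_empty() {
        "*"
    } else {
        value
    }
}
```

For the tag `VV+EM+VX+EP`, `first_and_last_pos` returns `("VV", "EP")`, matching the examples given for columns 9 and 10.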
ko-dic user dictionary format (CSV)
ko-dic user dictionary simple version
Index | Name (Korean) | Name (English) | Notes |
---|---|---|---|
0 | 표면 | Surface | |
1 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab of the spreadsheet |
2 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
ko-dic user dictionary detailed version
Index | Name (Korean) | Name (English) | Notes |
---|---|---|---|
0 | 표면 | Surface | |
1 | 왼쪽 문맥 ID | Left context ID | |
2 | 오른쪽 문맥 ID | Right context ID | |
3 | 비용 | Cost | |
4 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab of the spreadsheet |
5 | 의미 부류 | Semantic class | (too few examples for me to be sure) |
6 | 종성 유무 | Presence of a final consonant | T for true; F for false; else * |
7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
9 | 첫번째 품사 | First part-of-speech | e.g. for a part-of-speech tag of "VV+EM+VX+EP", this is VV |
10 | 마지막 품사 | Last part-of-speech | e.g. for a part-of-speech tag of "VV+EM+VX+EP", this is EP |
11 | 표현 | Expression | Describes how an inflection (활용), compound noun (복합명사), or pre-analysis (기분석) entry is composed |
12 | - | - | After 12, it can be freely expanded. |
CC-CEDICT
This repository uses CC-CEDICT-MeCab.
CC-CEDICT dictionary format
Refer to the manual for details on the CC-CEDICT-MeCab dictionary format and part-of-speech tags.
Index | Name (Chinese) | Name (English) | Notes |
---|---|---|---|
0 | 表面形式 | Surface | |
1 | 左语境ID | Left context ID | |
2 | 右语境ID | Right context ID | |
3 | 成本 | Cost | |
4 | 词类 | Major POS classification | |
5 | 词类1 | Middle POS classification | |
6 | 词类2 | Small POS classification | |
7 | 词类3 | Fine POS classification | |
8 | 拼音 | Pinyin | |
9 | 繁体字 | Traditional | |
10 | 简体字 | Simplified | |
11 | 定义 | Definition | |
CC-CEDICT user dictionary format (CSV)
CC-CEDICT user dictionary simple version
Index | Name (Chinese) | Name (English) | Notes |
---|---|---|---|
0 | 表面形式 | Surface | |
1 | 词类 | Major POS classification | |
2 | 拼音 | Pinyin | |
CC-CEDICT user dictionary detailed version
Index | Name (Chinese) | Name (English) | Notes |
---|---|---|---|
0 | 表面形式 | Surface | |
1 | 左语境ID | Left context ID | |
2 | 右语境ID | Right context ID | |
3 | 成本 | Cost | |
4 | 词类 | POS | |
5 | 词类1 | POS subcategory 1 | |
6 | 词类2 | POS subcategory 2 | |
7 | 词类3 | POS subcategory 3 | |
8 | 拼音 | Pinyin | |
9 | 繁体字 | Traditional | |
10 | 简体字 | Simplified | |
11 | 定义 | Definition | |
12 | - | - | After 12, it can be freely expanded. |
API reference
The API reference is available. Please see the following URL: