1 unstable release
0.1.0 | Jun 1, 2024 |
---|
#1308 in Text processing
70KB
1.5K
SLoC
vidyut-kosha
A compact Sanskrit lexicon
vidyut-kosha
defines a key-value store that can compactly map tens of
millions of Sanskrit words to their inflectional data. Depending on the
application, storage costs can be as low as 1 byte per word. This storage
efficiency comes at the cost of increased lookup time, but in practice, we have
found that this increase is negligible and well worth the efficiency gains
elsewhere.
For your convenience, vidyut-kosha
contains helper scripts that will
automatically build an interesting and comprehensive dictionary. For details,
see the Usage section below.
This crate is under active development as part of the Ambuda project. If you enjoy our work and wish to contribute to it, we encourage you to join our Discord server, where you can meet other Sanskrit programmers and enthusiasts.
Overview
Sanskrit is a highly inflected language, and any reasonable Sanskrit word list will have tens of millions of words at minimum. In practice, Sanskrit programs make various compromises on this word list so that performance doesn't suffer.
One common compromise is to store irregular words as-is and store regular words just in their stem form. Then at query time, we guess at the query's underlying stem with the help of prefix and suffix tables that we curate manually.
This approach is workable, but has two main weaknesses:
-
Speed. To illustrate, the query
ubhe
could yield the candidatesubha
,ubhA
,ubhi
,ubh
, andubhe
, all of which we must check against the store. -
Overgeneration. By using prefix and suffix tables, we will potentially accept invalid words like *
sundarA
(forsundarI
) or *putrakA
(forputrikA
). This can be useful for some applications, but often it's a decision made by need, not by choice.
vidyut-kosha
avoids these weaknesses. It stores words exactly, which avoids
overgeneration. And although single-key lookup is slightly slower, it avoids the
multiple lookup problem, which makes it faster for many applications.
In exchange, we must pay the following price:
-
We must construct the dictionary ahead of time. If we wish to change the dictionary, we must construct it again. Construction takes several minutes to complete.
-
We must be able to convert our values to 64-bit integers. This sounds daunting but is straightforward in practice. (If you plan to use
vidyut-kosha
as an inflectional dictionary, our library seamlessly manages this conversion for you already.)
Usage
Detailed notes are coming soon. For now, try running the following command:
$ make create_kosha
If you have a specific word list you want to use instead, you can use the builder API directly like so:
use vidyut_kosha::{Kosha, Builder};
use vidyut_kosha::morph::*;
let mut builder = Builder::new("output-dir").unwrap();
builder.insert("Bavati", &Pada::Tinanta(Tinanta {
dhatu: Dhatu("BU".to_string()),
purusha: Purusha::Prathama,
vacana: Vacana::Eka,
lakara: Lakara::Lat,
pada: PadaPrayoga::Parasmaipada,
}));
builder.finish().unwrap();
let kosha = Kosha::new("output-dir");
Design
Although vidyut-kosha
can support any encoding type for keys, we strongly
recommend using SLP1 or WX for better space efficiency and query
performance.
Our underlying data structure is a finite-state transducer (FST) as
implemented in the fst
crate. For details on how FSTs work, we
recommend this blog post by the creator of the fst
crate,
particularly the section titled "Finite state machines as data
structures."
Briefly, an FST generalizes the trie data structure by storing both shared prefixes and shared suffixes. The construction algorithm for FSTs will find these prefixes and suffixes automatically, and it will try reusing saved prefixes and suffixes wherever possible.
FSTs have several important constraints, two of which are most revelant here:
-
The values in the FST must satisfy a specific algebra. Rust's
fst
crate satisfies this algebra by requiring that values are 64-bit integers. So to put it simply, we must be able to convert whatever information we need to store into a 64-bit integer. -
The keys in the FST must all be unique. But Sanskrit words are not necessarily unique, and we must to be able to store duplicate keys.
Solving the encoding constraint
For the use case of an inflectional dictionary, we want to store two kinds of information: enumerated data and text data.
Enumerated data includes categories like person, number, gender, case, and
so on. Each category can take one of several possible values that we know ahead
of time. We can trivially convert such data to an integer value. For example,
we represent the vacana (number) of a word with one of four possible values:
eka
, dvi
, bahu
, or none
if the vacana is unknown for some reason.
Thus we can map eka
to 1, dvi
to 2, and so on.
Text data includes the stem or root of the underlying form, and we cannot feasibly enumerate it ahead of time. We encode this data through a lookup table approach: if we append all strings to a list, then the index of that string in the list is its integer representation. By using this approach, we pay the price of storing these strings the old-fashioned way. But a list of lemmas is much smaller than a list of words, so the space requirements are trivial.
Once we have mapped all of our information to integer values, we can treat our 64-bit integer as a bitfield and add values together with the appropriate shifts. For example, here's an early version of our scheme for tinantas, which uses only 32 bits:
OOLLLLppVVaadddddddddddddddddddd
O: part of speech (2 bits = 4 possible values)
L: lakara (4 bits = 16 possible values)
p: purusha (2 bits = 4 possible values)
V: vacana (2 bits = 4 possible values)
a: pada + prayoga (2 bits = 4 possible values)
d: dhatu ID (20 bits = ~1 million possible values)
One real consequence of the encoding constraint is that we can't associate words with completely arbitrary data. But if our data is structured carefully, we have plenty of room to associate each word with interesting information.
Solving the uniqueness constraint
A Sanskrit dictionary is practical only if it support duplicate keys. To solve the uniqueness constraint, we use the following workaround, which (surprisingly) works well in practice.
The fst
crate treats keys as a vector of unsigned 8-bit integers, which means
that each "letter" in the key can take a value in the range [0, 255]. Since the
ASCII range starts at 65, we can encode extra information in the range [0, 64],
which lets us mark duplicates.
For example, we might store the various forms of gacchati
as follows;
gacchati
,gacchati + 0
gacchati + 1
And so on up to gacchati + 64
if necessary. This simple approach lets us
store up to 65 duplicates of a given key.
Since even 65 duplicates is not enough in practice, we simply extended the range of bytes we append to the key:
gacchati
,gacchati + 0 0
gacchati + 0 1
,gacchati + 0 2
And so on up to gacchati + 64 64
if necessary. This approach lets us store
more than 4000 duplicates per key, which is more than enough for Sanskrit.
At lookup time, we would first check for the query word -- gacchati
, let's
say -- then sequentially chehck gacchati + 0 0
and so on until we no longer
find any results. This lookup might seem extremely expensive, but due to how
FSTs are structured, it is surprisingly cheap.
Dependencies
~6–14MB
~149K SLoC