9 breaking releases
0.10.0 | Aug 15, 2022 |
---|---|
0.8.0 | Aug 15, 2022 |
#1503 in Text processing
120KB
2K
SLoC
A quasi-lossless Balkanoidal meta-lingual compressor.
Background
It has long been accepted that Serbian is a compact variant of Russian, with less liberal use of vowels. Since the forfeiting of the Riviera in 1991, the loss of tourism revenue has led to further austerity in vowel use. Serbs increasingly needed economically viable ways of communicating, since vowels aren't exactly cheap!
serb.zip
is a lightweight framework for transcoding text from one lexical form to another. It presently comes with one codec — Balkanoid — a universal transcoder between Serbo-Croatian and Russian languages that is almost entirely isomorphic — it maps from one lingual domain to another with no loss of meaning and some loss of whitespace and capitalisation. Balkanoid works with both English and East-Slavic texts.
Getting Started
Installation
Install serb.zip
via cargo
:
cargo install serbzip
Unless a custom dictionary file is provided, serb.zip
requires a compiled dictionary file in ~/.serbzip/dict.blk
. A default dictionary supporting English and Russian will be automatically downloaded and installed on first use.
Compressing a file
When working with input/output, serb.zip
can be given files to read and write from, or it can be run interactively. Let's try compressing an input stream interactively first.
serbzip -c
This brings up serb.zip
, running the balkanoid
codec, in interactive mode.
██████╗ █████╗ ██╗ ██╗ ██╗ █████╗ ███╗ ██╗ ██████╗ ██╗██████╗
██╔══██╗██╔══██╗██║ ██║ ██╔╝██╔══██╗████╗ ██║██╔═══██╗██║██╔══██╗
██████╔╝███████║██║ █████╔╝ ███████║██╔██╗ ██║██║ ██║██║██║ ██║
██╔══██╗██╔══██║██║ ██╔═██╗ ██╔══██║██║╚██╗██║██║ ██║██║██║ ██║
██████╔╝██║ ██║███████╗██║ ██╗██║ ██║██║ ╚████║╚██████╔╝██║██████╔╝
╚═════╝ ╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═══╝ ╚═════╝ ╚═╝╚═════╝
Enter text; CTRL+D when done.
Let's key in the first verse of Mearns' Antigonish. Enter the following text, line by line, and hit CTRL+D when done.
Yesterday, upon the stair,
I met a man who wasn't there!
He wasn't there again today,
Oh how I wish he'd go away!
serb.zip
echoes the compressed version:
Ystrdy, pn th str,
I mt a mn wh wasn't thr!
H wasn't thr gn tdy,
H hw I wsh h'd g wy!
Most of the time, though, you'll want to compress a file and have the output written to another file:
serbzip -c -i antigonish.txt -o antigonish.sz
This reads from antigonish.txt
in the current directory and writes to antigonish.sz
. By convention, we use .sz
extensions on serb.zip
files, but you may use any extension you like.
Expanding/decompressing a file
The expansion operation is the logical reverse of compression, and uses much the same arguments. The only difference: the -x
flag instead of -c
.
serbzip -x -i antigonish.sz -o antigonish.txt
Advanced Usage
Custom dictionary
serb.zip
can be run with a custom dictionary file. The dictionary can be specified with the -d
flag – either as a newline-separated word list (.txt
) or a binary image (.blk
).
The word list is the simplest approach, but it takes time to parse the list, produce fingerprints and sort the mapping vectors. It is, nonetheless, useful for testing.
The preferred approach is to compile a binary image from a word list, and use the binary image thereafter.
To compile custom.txt
into custom.blk
:
serbzip -p -d custom.txt -m custom.blk
To later use custom.blk
during compression and expansion:
serbzip -d custom.blk ...
CAUTION: The congruency of dictionaries is essential for the reversibility of the algorithm. In other words, if you use one dictionary to compress a stream, then another dictionary to expand it, some words may not match. The default dictionary has ~50K English and ~50K Russian words, and should suffice for most people.
Switching codecs
serb.zip
encompasses several codecs, of which balkanoid
is the default. Other codecs are still in development. Use the --codec
flag to specify an alternate codec.
Getting help
If you are using this application, you should probably seek professional help.
If you get stuck, run serbzip --help
.
Using serb.zip
as a library
To embed serb.zip
in your application, follow these instructions.
How serb.zip
Works
Each codec works differently. They share some common concepts but the underlying algorithms may be entirely different.
Balkanoid
Balkanoid is the famed codec that started it all. It maps between compressed and expanded forms with no loss in meaning. It does this through a combination of dictionary mapping and a unary encoding.
Balkanoid's dictionary is a mapping from a fingerprint to a sorted vector of all dictionary words that satisfy that fingerprint. A fingerprint is taken by stripping all vowels and converting the remaining characters to lowercase. E.g., the fingerprint of "Apple"
is "ppl"
.
The dictionary is essentially a HashMap<String, Vec<String>>
. The vector is sorted by comparing words by length (shortest first), then by lexicographical order.
For the sample word list
at
came
cent
come
count
in
inn
it
no
on
the resulting dictionary is
"cm" -> ["came", "come"]
"cnt" -> ["cent", "count"]
"n" -> ["in", "no", "on"]
"t" -> ["at", "it"]
"nn" -> ["inn"]
Once the dictionary is loaded, the algorithm works by iterating over the input, line by line. Each line is subsequently tokenised by whitespace, disregarding contiguous whitespace (shown here with ␣
characters for readability) sequences. E.g., the line "To␣be...␣␣␣␣␣or␣␣␣␣␣not␣to␣be!"
is captured as six words: ["To", "be...", "or", "not", "to", "be!"]
. An empty line or a line comprising only whitespace characters is tokenised to []
.
Each resulting word is subject to three rulesets — punctuation, compaction and capitalisation — applied in that order.
The punctuation rules are used to efficiently deal with punctuated words such as "banana!"
. Normally, "banana"
would be an easy dictionary lookup, but the exclamation mark is a fly in the ointment. Punctuation splits every word into a prefix and a suffix. The prefix encompasses a contiguous string of alphabetical characters from the start of the word. The suffix contains all other characters.
- If the first character of the word is a backslash (
'\'
), then move the backslash into the head of the prefix. For the 2nd and subsequent characters, move the longest contiguous string of alphabetical characters from the start of the string into the prefix, while assigning the rest of the string to the suffix. - If the first character is not a backslash, split the word such that the prefix encompasses the longest contiguous string of alphabetical characters from the start of the word. The suffix contains all other characters.
For example, for "bananas!!!"
, the punctuation tuple is ("bananas", "!!!")
, by rule 2. For an unpunctuated word, the prefix encompasses the entire word (again, by rule 2), while the suffix is an empty string; e.g., "pear"
splits to ("pear", "")
. For a word such as "\foo?"
, the punctuation tuple is ("\foo", "?")
, by rule 1 — the first backslash is admitted to the prefix. For a single backslash "\"
, the result is ("\", "")
, as per rule 1. For a series of backslashes "\\\"
, the result is ("\", "\\")
.
When a word comprises multiple fragments separated by non-alphabetical characters, only the first fragment is assigned to the prefix; e.g., "dog@pound.com"
becomes ("dog", "@pound.com")
. This is a rare example of a relatively long suffix; ordinarily, suffixes comprise trailing punctuation symbols, which are prevalent in English and East-Slavic languages. Although one might think that multi-fragment words could be expressed as N-tuples and encoded accordingly, this would render the algorithm irreversible for some words.
The compaction rules are used to de-vowel the word, being the essence of the algorithm. It starts by taking a lowercase representation of the prefix element of the punctuation tuple and removing all vowels (excluding 'y'
in the English variant). The compaction rule does not apply to the suffix element of the tuple — suffixes are encoded as-is. In practice, the suffix is much shorter than the prefix. The rule comprises four parts:
- If the word-prefix begins with a backslash (
'\'
), then treat it as an escape sequence. Prepend another backslash and return the word-prefix. E.g.,"\sum"
encodes to"\\sum"
. This rule is mainly used for encoding papers containing typesetting commands, such as TeX and Markdown. - Convert the word-prefix to lowercase and generate its fingerprint. Resolve the (0-based) position of the lowercase prefix in the vector of strings mapped from the fingerprint. If the word-prefix is in the dictionary, then encode it by padding the output with a string of whitespace characters — the length of the string equal to the position of the prefix in the vector — then output the fingerprint. E.g., assume the word-prefix
"no"
is positioned second in the dictionary mapping for its fingerprint:"n" -> ["in", "no", "on"]
. It would then be encoded as"␣n"
— one space followed by its fingerprint. The word"on"
would have required two spaces —"␣␣n"
. - Otherwise, if the word-prefix is not in the dictionary and contains one or more vowels, it is encoded as-is. E.g., the word-prefix
"tea"
, which is not in our sample dictionary, but contains a vowel, is encoded as"tea"
. - Otherwise, if the word-prefix comprises only consonants and its fingerprint does not appear in the dictionary, it is encoded as-is. E.g., the word-prefix
"psst"
is not mapped, so it is encoded with no change —"psst"
. This case is much more frequent in East-Slavic than it is in English; in the latter, words comprising only consonants are either abbreviations, acronyms or representations of sounds. - Otherwise, prepend a backslash (
'\'
) to the word-prefix. E.g., the word-prefix"cnt"
, comprising all-consonants, having no resolvable position in"cnt" -> ["cent", "count"]
for an existing fingerprint"cnt"
, is encoded as"\cnt"
. Without this rule,"cnt"
with no leading spaces might be reversed into"cent"
... or something entirely different and less appropriate.
The capitalisation rules encode capitalisation hints into the output so that the word-prefix may have its capitalisation restored. These rules is not perfectly reversible, but cope well in the overwhelming majority of cases.
- If the input contains a capital letter anywhere after the second letter, capitalise the entire input.
- Otherwise, if the input starts with the capital letter, capitalise only the first letter and append the remainder of the input.
- Otherwise, if the input has no capital letters, feed it through to the output unchanged.
The encoded output of each prefix is subsequently combined with the suffix to form the complete, encoded word.
Once the encoded outputs of each word have been derived, the resulting line is obtained by concatenating all outputs, using a single whitespace character to join successive pairs. The outputs from our earlier examples — ["␣n", "tea", "psst", "\cnt"]
are combined into the string "␣n␣tea␣psst␣\cnt"
.
Like compression, expansion begins by tokenising each line. Only in the case of expansion, contiguous whitespace sequences are not discarded — their count is needed to decode the output of compaction rule 1. This is the unary encoding part of the algorithm. One might think of it as the reverse of run-length encoding.
For each tokenised word, we apply the punctuation rules first, followed by reverse-compaction, then by capitalisation.
Punctuation here is the same as before. The word is split into a punctuation tuple. The prefix is decoded, while the suffix is carried as-is.
The reverse-compaction rule acts as follows:
- If the word-prefix begins with a backslash, then it is removed, and the remaining substring is returned. E.g., the encoded word-prefix
"\trouble"
is decoded to"trouble"
. A single backslash"\"
decodes to an empty string. This rule reverses the output of both rule 1 and rule 5 of the compaction ruleset. - Otherwise, the lowercase version of the word-prefix is checked for vowels. If at least one vowel is found, the lowercase string is returned. E.g., the word-prefix
"german"
is passed through. - Otherwise, if the lowercase word-prefix comprises only consonants, it is considered to be a fingerprint. It is resolved in the dictionary by looking up the word at the position specified by the number of leading spaces. If the fingerprint is present in the dictionary, it should always be possible to locate the original word. (The exception is when two different dictionaries were used, which is treated as an error.) E.g., given the mapping
"n" -> ["in", "no", "on"]
, the encoded word-prefix"␣␣n"
decodes to"on"
. - Otherwise, if the fingerprint is not in the dictionary, return the word as-is. E.g.,
"kgb"
is passed through if there is no mapping for its fingerprint in the dictionary.
After reverse-compaction, the capitalisation rule is applied as per the compression path. Capitalisation is mostly reversible — it works well for words that begin with capitals or contain only capitals, such as acronyms. However, it cannot always reverse mixed-case words and words that reduce to a single consonant. Consider some examples.
"Apple"
encodes to "Ppl
, given a dictionary containing "ppl" -> ["apple", ...]
. It correctly decodes back to "Apple"
.
"KGB"
encodes to "KGB"
, given no dictionary mapping for the fingerprint "kgb"
, and decodes correctly.
"LaTeX"
encodes to "LTX"
, assuming it exists in the dictionary. It decodes to "LATEX"
, incorrectly capitalising some letters. This is an example of the mixed-case problem. However, if the word is absent from the dictionary, it will be encoded as-is, and will have its capitalisation restored correctly.
The problematic mixed-case scenario seldom occurs in practice because acronyms are generally absent from dictionaries; such words are rarely subjected to compaction — they're encoded verbatim.
"Ra"
(the sun god of ancient Egypt) is encoded to "Ra"
, given a dictionary mapping "r" -> ["ra", ...]
. Capitalisation is correctly reversed. However, if the input is the acronym "RA"
, it will also be encoded to "Ra"
— capitalisation will not be reversible in this case. Again, if "ra"
is not in the dictionary, the word will be encoded verbatim and capitalisation will be reversible.
Compression Efficacy
serb.zip
is interesting from a linguistic standpoint, but does it actually compress?
The table below shows the result of compressing a series of literary works using the Balkanoid codec. The works ranged in size from a few hundred kilobytes to several megabytes. For each text, the size of the serb.zip
output is displayed, along with a percent reduction relative to the original size. (Negative values indicate that the size increased relative to the original.)
Both the original text and its "serb-zipped" variant were subsequently subjected to gzip
and bzip2
binary compression on the --best
setting. It was instructive to observe whether binary post-compression altered the result. The sizes were recorded, along with the reduction factor.
The complete test set was split into two: English texts and Russian texts. The English test set was much larger — 21 works versus 5.
English texts
filename | size | words | gzip size | bzip2 size | sz size | sz reduction % | sz.gz size | sz+gz reduction % | sz.bz2 size | sz+bz2 reduction % |
---|---|---|---|---|---|---|---|---|---|---|
no_man_is_an_island.txt | 396 | 81 | 289 | 283 | 366 | 7.57 | 258 | 10.72 | 261 | 7.77 |
antigonish.txt | 478 | 97 | 282 | 293 | 552 | -15.48 | 263 | 6.73 | 269 | 8.19 |
a_dream_within_a_dream.txt | 652 | 141 | 414 | 404 | 648 | .61 | 381 | 7.97 | 374 | 7.42 |
the_raven.txt | 6587 | 1068 | 2753 | 2610 | 6250 | 5.11 | 2513 | 8.71 | 2440 | 6.51 |
metamorphosis.txt | 142017 | 25094 | 51125 | 41494 | 135843 | 4.34 | 47295 | 7.49 | 40236 | 3.03 |
alice_in_wonderland.txt | 174313 | 29594 | 61021 | 49027 | 167360 | 3.98 | 57207 | 6.25 | 48216 | 1.65 |
the_prince.txt | 307808 | 52982 | 112051 | 86353 | 284168 | 7.68 | 103035 | 8.04 | 84642 | 1.98 |
calculus_made_easy.txt | 404533 | 56128 | 128753 | 103437 | 376411 | 6.95 | 122490 | 4.86 | 102025 | 1.36 |
frankenstein.txt | 448821 | 78122 | 168673 | 126241 | 408328 | 9.02 | 154384 | 8.47 | 123996 | 1.77 |
sherlock_holmes.txt | 612668 | 98533 | 212198 | 153006 | 532337 | 13.11 | 192828 | 9.12 | 151134 | 1.22 |
pride_and_prejudice.txt | 798774 | 124753 | 267241 | 182367 | 671468 | 15.93 | 241914 | 9.47 | 182294 | .04 |
effective_kafka.txt | 832006 | 121588 | 260631 | 186466 | 740534 | 10.99 | 244580 | 6.15 | 185031 | .76 |
dracula.txt | 881217 | 164382 | 335208 | 245421 | 855606 | 2.90 | 315240 | 5.95 | 244016 | .57 |
new_testament.txt | 1020010 | 182644 | 344546 | 248323 | 978194 | 4.09 | 326802 | 5.14 | 247227 | .44 |
jane_eyre.txt | 1084733 | 188452 | 428029 | 317669 | 1028088 | 5.22 | 397902 | 7.03 | 311810 | 1.84 |
crime_and_punishment_eng.txt | 1201520 | 206553 | 439388 | 319534 | 1161086 | 3.36 | 412300 | 6.16 | 315694 | 1.20 |
moby_dick.txt | 1276235 | 215864 | 511491 | 389164 | 1208689 | 5.29 | 476706 | 6.80 | 383487 | 1.45 |
mormon.txt | 1588965 | 293164 | 453627 | 313667 | 1528402 | 3.81 | 425963 | 6.09 | 313170 | .15 |
anna_karenina_eng.txt | 2068079 | 352857 | 743362 | 535118 | 1958820 | 5.28 | 694156 | 6.61 | 529153 | 1.11 |
count_of_monte_cristo.txt | 2786940 | 464031 | 1012102 | 724410 | 2600665 | 6.68 | 939973 | 7.12 | 714182 | 1.41 |
war_and_peace_eng.txt | 3359372 | 566334 | 1221693 | 888312 | 3120825 | 7.10 | 1134553 | 7.13 | 880502 | .87 |
Within the English test set, the size reduction using serb.zip
alone was substantial in most cases. The greatest reduction was seen in Pride and Prejudice — 15.93%, followed by 13.11% in The Memoirs of Sherlock Holmes. In most cases, the size reduction was in the single-digit percentages. A Dream Within a Dream showed almost no difference in size, with a reduction of just .61%.
However, in one case — Antigonish — the output size increased by 15.48%.
Applying gzip
and bzip2
was where things got really interesting. In every single case, the output size decreased substantially, compared to the equivalent binary compression run without serb.zip
. For gzip
, improvements ranged from 4.86% (Calculus Made Easy) to 10.72% (No Man is an Island). For bzip2
, the smallest improvement was recorded for Pride and Prejudice, at .04%; the highest improvement was for Antigonish, at 8.19%. This is interesting because Antigonish showed a poor result for serb.zip
alone. However, it appears that although the file size increased, the information entropy decreased at a disproportionately greater rate. This helped binary compressors achieve a better overall result.
Russian texts
filename | size | words | gzip size | bzip2 size | sz size | sz reduction % | sz.gz size | sz+gz reduction % | sz.bz2 size | sz+bz2 reduction % |
---|---|---|---|---|---|---|---|---|---|---|
u_lukomorya.txt | 1599 | 169 | 754 | 609 | 1528 | 4.44 | 744 | 1.32 | 620 | -1.80 |
lyublyu_tebya.txt | 2290 | 205 | 1032 | 830 | 2189 | 4.41 | 1022 | .96 | 832 | -.24 |
odnazhdy.txt | 2359 | 237 | 1031 | 885 | 2333 | 1.10 | 1047 | -1.55 | 905 | -2.25 |
anna_karenina_rus.txt | 3072666 | 288389 | 802824 | 520413 | 3319009 | -8.01 | 841072 | -4.76 | 528303 | -1.51 |
war_and_peace_rus.txt | 5368089 | 494556 | 1436738 | 935792 | 5618619 | -4.66 | 1490941 | -3.77 | 955191 | -2.07 |
For the Russian test set, the result was very different. The size difference for the serb.zip
run varied between 4.44% and -8.01%. The result failed to improve with the use of binary post-compression. Notably, Анна Каренина and Война и Мир showed the worst results across the board. They were also the largest works in the test set.
It appears that Balkanoid is not particularly effective at compressing Russian texts. The following postulates a theory as to why.
In (modern) English, a noun doesn’t have any other forms but singular and plural. In Russian, nouns are declined — a process called «склонение» (declension), which gives the words different endings, in singular and in plural in six cases: именительный падеж (nominative), родительный падеж (genitive), дательный падеж (dative), винительный падеж (accusative), творительный падеж (instrumental), and предложный падеж (prepositional). In addition, Russian has genders: masculine, feminine, and neuter.
When encoding words in English, provided both the singular and plural forms of a word are in the dictionary, that word may be effectively compacted in every case. In Russian, words have numerous variations and the likelihood of locating a given word in the dictionary exactly as it appears in the input text is substantially lower. Dictionary-based compaction alone, in the absence of grammatical awareness, appears to be insufficient to achieve consistent and measurable compression gains — comparable to that observed with English texts.
FAQ
Does serb.zip
work?
serb.zip
(namely, the Balkanoid codec) was tested on numerous texts, including such literary masterpieces as War and Peace and Effective Kafka, and was found to correctly compress and expand each word in every case. A total of 27 texts spanning over 4 million words were assessed. Some even included typesetting commands. serb.zip
works.
What are its limitations?
The caveats and limitations are explained in the how serb.zip
works section. In summary:
- Contiguous spaces between words, as well as leading and trailing spaces will not be restored.
- Some mixed case words will not have their capitalisation restored correctly.
- Some acronyms containing one consonant will not be restored correctly.
- The algorithm requires that the same dictionary is used for compression and expansion. It cannot detect if a dictionary has changed between runs. In many cases, this will produce a legible expansion, albeit the output may not correspond to the original.
In every other respect, Balkanoid is a fully restorable codec. In other words, the compressed output can be accurately restored to match the original input upon expansion. The limitations above are, by and large, immaterial — they do not affect comprehension.
What is the difference between serb.zip
and Balkanoid?
serb.zip
is a framework for developing, testing and executing codecs that map from one lexical form to another. Balkanoid is an implementation of a codec. When running the serbzip
command without specifying a codec, Balkanoid will be used by default.
I love this project. How can I contribute?
If you're interested in this sort of piss-take, please, do join in.
It would be nice to have more reversible, human-readable transformations. Pig Latin might be a good example. The current algorithm is simple but not reversible. Perhaps, with little alteration, one might make a fully reversible/lossless Pig Latin codec.
serb.zip
is highly modular; you can plug in your own serbzip::codecs::Codec
implementation, and away you go.
Dependencies
~8–18MB
~264K SLoC