1 unstable release

0.1.0 Dec 28, 2022

#1186 in Text processing

MIT license

51KB
779 lines

bocu1

This crate serves two purposes:

  1. To provide a usable implementation of BOCU-1 to people who are looking for one, as well as some utilities related to working directly with BOCU-1 strings and "small strings" packed into 64 or 128-bit integers.

  2. To serve as a well-documented elucidation / reference for people trying to piece together how the encoding even works, from the multiple, partial bits of documentation that exist elsewhere online.

It thus has more redundancies and explanations than necessary, and has been "split apart" into as many layers as possible, in order to maximize clarity, motives, design choices, etc. It is probably too slow as a result. But it's not a fast encoding in any case: the speed win comes from encoding text once and working with it in its encoded (smaller!) form.

I've also included a bunch of test vectors and a randomizing validator for it connected to IBM's previous reference implementation.

What even is BOCU-1

BOCU-1 is a relatively nice Unicode encoding, like UTF-8 or the like, with somewhat different tradeoffs. It is not useful for all (or even many) cases and in general most applications should lean on UTF-8, but BOCU-1 gets you a few things that you might want in certain contexts:

  1. Generally small. The 'C' in 'BOCU-1' stands for compression, and the principal way to look at this encoding is as a very simple text-focused compression format. BOCU-1 strings are smaller than those in most other common Unicode encodings. In particular it is small enough to be quite useful for storing small strings "packed" in multibyte machine words (64 or 128 bits).

  2. Minimizes linguistic penalties. Latin-script languages are not encoded significantly more efficiently than non-Latin. Indeed, most scripts get an encoding nearly as good as they got in the pre-Unicode "code page" days.

  3. Minimal fixed compression overhead. The compression kicks in on the second character encoded, without any dictionaries or encoded metadata.

  4. Codepoint-order preserving. You can memcmp() BOCU-1 strings, or integer-compare them if small strings are packed into integers, and the comparison will obey the lexicographic Unicode codepoint order of the strings. This is probably the main use of the form: you can build a really fast ordered dictionary type or database key range using BOCU-1 small strings packed into integers (if you're ok with codepoint order).

  5. Reasonably (though not perfectly) compatible with existing contexts: self-synchronizing at line boundaries, avoids ASCII control characters that would mess up a terminal, permits some crude forms of byte-level word and line breaking, etc.

  6. Small, simple, deterministic code. Single representation for each input.

It has several downsides of course, which should be admitted up front:

  1. Stateful encoding. Period. The encoder state resets quite often, but you can't decode a single Unicode scalar inside a run without decoding from the most recent reset-point. This means memchr-like byte-based string-search is somewhat compromised (though you can byte-search for candidates sharing the second-and-beyond characters in a query, and then fully decode and filter those candidates, so it's not completely hopeless).

  2. Less CPU-efficient than UTF-8. It uses integer divide and modulus operations rather than just bitwise operations.

  3. Less compatible. It reuses many printable ASCII codes for other things, and does not code most values in the ASCII space as themselves (aside from the C0 control characters <= 0x20).

  4. Formerly patented in the US (US patent 6737994, filed 2002, expired November 2022). I actually wrote this crate 4 years before publishing it, tried to contact IBM to get the promised "royalty-free license" they were going to offer conforming implementations, found out that it was sold to Google in 2011, contacted them and heard back they would be willing to include it in their "open patent non-assertion pledge", but then they stopped responding to emails, the trail went cold and nothing ever happened so I just waited. The patent is now expired, so I think I'm in the clear, but that is why this package is MIT licensed rather than MIT+ASL2: I have no idea what the ASL2 patent clauses mean with respect to a formerly-patented thing.

Encoding overview

BOCU-1 consists of 3 parts:

  1. A stateful delta-encoder with a previous-char value that is normalized by some script- and block-specific rules.

  2. A variable-length code (1..4 code values) for the deltas generated in part 1, that is both lexical-order and delta-magnitude preserving. This code has separately-treated leading and trailing values.

  3. A final mapping from each trailing code value in the variable-length code of part 2 to output bytes, using a small set of range-offsets that thereby avoids bytes used for common ASCII control codes, provides synchronization bytes, and allows codepoints below 0x20 to be self-encoded.

Part of what makes existing documentation and code for BOCU-1 look complex is that these three parts appear quite intertwined (for performance), when in fact they can be understood almost completely in isolation. This crate is therefore organized into 3 main sub-modules (plus some helpers), one for each such part, exposing only the bits that need to be shared between them.

License: MIT

Dependencies

~250–360KB