#surrogate #unicode #encoding #wtf8

omgwtf8

Optimized-Matching-Generalized Wobbly Transformation Format — 8-bit

1 unstable release

Uses old Rust 2015

0.1.0 Jan 15, 2018

#1402 in Text processing

MIT license

75KB
1K SLoC

Optimized-Matching-Generalized Wobbly Transformation Format — 8-bit

This package implements the “OMG-WTF-8” encoding, making it suitable for use as a pattern via the “Pattern 1.5” API, so it doesn’t need two separate Pattern APIs from “Pattern 2.0”.

Background

Shortcomings of WTF-8

WTF-8 extends UTF-8 by allowing a lone surrogate code point using the usual UTF-8 encoding algorithm similar to CESU-8, e.g. U+D800 is encoded as the 3-byte sequence ed a0 80. Unlike CESU-8, WTF-8 encodes non-BMP characters as a 4-byte sequence, so conversion from valid UTF-8 sequence is zero-cost.

These two decisions together means we cannot split a well-formed WTF-8 string in the middle of a surrogate pair, which makes it impossible to implement some common string APIs like split().

let haystack = OsString::from_wide(&[0xd800, 0xdc00, 0xd800, 0xdc01, 0xd800, 0xdc02]);
let needle = OsString::from_wide(&[0xd800]);
for part in haystack.split(&needle) {
    println!("{:?}", part); // <-- how could we represent `part` as an `&OsStr`?
}
// should print "", "\u{dc00}", "\u{dc01}", "\u{dc02}".

Splitting a Surrogate Pair

The modified WTF-8 encoding addresses this by allowing a 4-byte sequence be split into halves, to represent a lone surrogate code point at the beginning or end. To see why this is possible, let’s check the encoding of non-BMP characters in UTF-8, -16 and -32.

(Image: Bit distribution in UTF-8, -16 and -32 for a non-BMP character)

In UTF-8, the first byte encode 3 bits, and the next three bytes each encode 6 bits. In UTF-16, each code unit encodes 10 bits. Therefore, the first 3 bytes of a UTF-8 4-byte sequence are enough to determine the high surrogate in the corresponding UTF-16 encoding, and the last 2 bytes are enough to determine the low surrogate.

Therefore, we can split a 4-byte sequence f0 9f 98 82 into two overlapping sub-strings f0 9f 98 and 98 82, each of which is sufficient to determine corresponding surrogate code point, and thus able to compare with the proper WTF-8 encoding.

(Image: Bit distribution of high surrogate)

For consistency, we will encode the low surrogate using a 3-byte substring instead of 2-.

(Image: Bit distribution of low surrogate)

Definitions

Terminology

The terminology used in OMG-WTF-8 follows those of WTF-8.

A Unicode code point is any value in the Unicode code space. It is written as “U+XXXX”, where XXXX is a hexadecimal number. A valid code point is in the range U+0000 to U+10FFFF, inclusive.

A high surrogate is a code point in the range U+D800 to U+DBFF.

A low surrogate is a code point in the range U+DC00 to U+DFFF.

A surrogate is either a high surrogate or a low surrogate, i.e. a code point in the range U+D800 to U+DFFF.

A surrogate pair is two code points, where the first code point is a high surrogate, and the second code point is a low surrogate.

An unpaired high surrogate in a string is a high surrogate not followed by a low surrogate.

An unpaired low surrogate in a string is a low surrogate not preceded by a high surrogate.

An unpaired surrogate is either an unpaired high surrogate or an unpaired low surrogate.

A 4-byte sequence is the UTF-8 encoding of a character in the range U+10000 to U+10FFFF, which occupies 4 bytes.

A 3-byte sequence is a sequence of 3 consecutive bytes.

The canonical representation of a surrogate is a 3-byte sequence, generated by encoding the said code point via generalized UTF-8.

Canonical representation Byte 0 Byte 1 Byte 2
High surrogate ed a0af 80bf
Low surrogate ed b0bf 80bf

Every character in the code point range U+10000 to U+10FFFF can be encoded as a surrogate pair in UTF-16, and as a 4-byte sequence in UTF-8. For this character, the first 3 bytes of the 4-byte sequence is the high surrogate’s split representation. Similarly, the last 3 bytes of the 4-byte sequence is the low surrogate’s split representation.

Split representation Byte 0 Byte 1 Byte 2
High surrogate f0 90bf 80bf
High surrogate f1f3 80bf 80bf
High surrogate f4 808f 80bf
Low surrogate 80bf 80bf 80bf

OMG-WTF-8

The well-formed OMG-WTF-8 sequence consists of, in order:

  1. Zero or one 3-byte sequence at the beginning of the string, which is the split representation of a low surrogate,
  2. A well-formed WTF-8 sequence, and
  3. Zero or one 3-byte sequence at the end of the string, which is the split representation of a high surrogate.

Thus, any well-formed WTF-8 string is a well-formed OMG-WTF-8 sequence.

A surrogate in split representation can never appear in the middle of an OMG-WTF-8 sequence. A high surrogate in split representation can only appear at the end of the string, and a low surrogate in split representation can only appear at the beginning of the string.

Canonicalization is a process which converts a 3-byte sequence from split representation to canonical representation. This process can be done in-place.

Any well-formed OMG-WTF-8 sequence can be converted to a well-formed WTF-8 sequence by canonicalization. Since split-representation surrogates can only appear at either ends of the string, canonicalization is O(1) and is very cheap.

OMG-WTF-8 strings are hashed and compared-with-equality through its canonicalization. The ordering of OMG-WTF-8 strings should be compatible with well-formed UTF-8. When unpaired surrogates appear, the actual ordering is unspecified. Nevertheless, OMG-WTF-8 should be totally ordered.

OMG-WTF-8 must be converted to WTF-8 when stored.

Slicing

It is possible to slice a well-formed OMG-WTF-8 string s[i..j]. In order to maintain well-formed-ness, the resulting string may be longer than j - i.

The index can point at the middle of a 4-byte sequence, e.g.

let s = OmgWtf8::from_str("\u{10000}");
let first_half = &s[..2];   // valid, returns the 3-byte sequence "\xf0\x90\x80".
let second_half = &s[2..];  // valid, returns the 3-byte sequence "\x90\x80\x80".
let empty_str = &s[2..2];   // valid, returns an empty string "".

Pointing to the middle of a 3-byte sequence is always an error.

Because of the overlapping, it is unsafe to split a mutable OMG-WTF-8 string. We recommend only use OMG-WTF-8 in immutable case, and convert to WTF-8 if mutable access is needed.

Matching

An OMG-WTF-8 sequence can be used as a needle for string matching. When matching, transform any unpaired low surrogate at the beginning (no matter in what representation) into a regular expression like this:

(\xed\xb0\x80|[\x80-\xbf][\x80\x90\xa0\xb0]\x80)

and transform any unpaired high surrogate at the end (no matter in what representation) into a regular expression like this:

(\xed\xa0\x80|\xf0\x90[\x80-\x8f])

The WTF-8 sequence in the middle is matched byte-for-byte.

After a match is found, the cursor may potentially point to byte 1 or byte 3 of a 4-byte sequence. In that case, the index needs to be readjusted to point to byte 2. The cursor itself may also needs to move backward by 2 bytes for the next match, e.g. searching for \u{dc00}\u{d800} inside \u{10000}\u{10000}\u{10000} should yield 2 matches at 2..6 and 6..10.

This package

The omgwtf8 package is a demo showing the concept does work. This is not implemented for performance nor feature completeness.

Dependencies

~3.5MB
~72K SLoC