4 releases (breaking)

0.4.0 Jul 5, 2022
0.3.0 Apr 15, 2021
0.2.0 Mar 29, 2021
0.1.0 Mar 29, 2021

#515 in Encoding

MIT license

305KB
6K SLoC

Neodyn Exchange Format Specification

Neodyn Exchange on crates.io Neodyn Exchange on docs.rs Minimum Supported Rust Version (MSRV) Downloads License Lines of Code

Neodyn Exchange is a serialization format originally derived from the Serde data model. It is designed to be an easy-to-use, portable, and efficient means of data exchange between systems. Although the reference implementation is in Rust, implementations in other languages are expected and encouraged.

Usage Quick Start

Like many Serde-compatible serialization formats, Neodyn Exchange is used primarily by calling top-level functions for different formats. The crate root re-exports the following functions:

// For serialization:
fn to_value()           // serializes to an in-memory, loosely-typed `Value`
fn to_string()          // serializes to a string, in the text representation
fn to_bytes()           // serializes to a Vec<u8>, in the binary format
fn to_writer()          // serializes to an io::Write, in the binary format
fn to_writer_buffered() // same as to_writer(), but uses buffering

// For deserialization
fn from_value()           // deserializes from an in-memory, loosely-typed Value
fn from_value_ref()       // same as above, but takes a borrowed &Value instead
fn from_str()             // deserializes from the text representation
fn from_bytes()           // deserializes from a byte slice in the binary format
fn from_reader()          // deserializes from an io::Read in the binary format
fn from_reader_buffered() // same as above, but uses buffering for efficiency

There are more functions in the conventionally-named ser and de modules, which, however, are less common (e.g. writing the text format as raw bytes).

The Value enum is also exported at the top level; it represents all possible values the Neodyn Exchange format can work with. Additionally, Value and &Value implement the Deserializer trait, as well as Serialize and Deserialize, so you can use them as a last-resort escape hatch if some types or other formats aren't flexible enough in their implementation of (de)serialization.

Furthermore, the types implementing Serializer and Deserializer are also public (however, they are not re-exported). However, some caveats apply to them (e.g. some of them must be .finalize()d manually), so please refer to their respective documentation for more information.

Representations

The Neodyn Exchange data model has three equivalent representations:

  1. An abstract, structured, in-memory value tree, realized by the Value type in the Rust reference implementation;
  2. A human-readable, pretty-printable, text-based format;
  3. And a machine-readable, compact binary format.

Each of these representations are capable of storing the same set of values. How exactly one interprets a serialized value may differ between applications. For example, due to the varying nature of the means of parsing each representation, the Serde reference implementation may produce different kinds of errors at different points in time when deserializing the structured, the human-readable, and the binary format, respectively.

The Type System

Neodyn Exchange is a self-describing format, and as such, every serialized value carries its type. This also means that as long as one is satisfied with a loosely-typed value tree as the output of the parsers, no schema is needed for decoding a serialized value.

The possible types are the following:

  1. null, denoting the absence of an optional value.
  2. opt, denoting the presence of a (wrapped) optional value.
  3. bool, just a regular Boolean, either true or false.
  4. int, 64-bit signed integer.
  5. uint, 64-bit unsigned integer.
  6. float, 64-bit IEEE-754 floating-point number, which may not be NaN.
  7. string, UTF-8 encoded, human readable text.
  8. blob, an arbitrary, raw sequence of bytes.
  9. array, an ordered collection of arbitrary nested values.
  10. map, an unordered collection of key-value pairs, where both the keys and the values are arbitrary nested Neodyn Exchange values.

The abstract value tree representation pretty much reflects this system one-to-one.

Remark: The signed integer, unsigned integer, and floating-point types are collectively called numbers. (Note in particular that bool is not considered a number.)

The Text Representation

The text representation is defined to be a UTF-8 encoded string. The following is an informal (read: not rigorous), BNF-style grammar for this format.

Whitespace (as defined by Unicode) before the first token, after the last token, and between consecutive tokens is allowed and ignored. The exception is strings, which are treated as a single token, and whitespace between characters in a string must not be ignored. (This is just the usual semantics, because of course we want to preserve whitespace within a string.)

Details of the notation below:

  • Non-terminal productions are written in all caps, without quotes, e.g. VALUE
  • Production definitions are denoted with :=, e.g. BOOL := 'true' | 'false'
  • Literal text is encapsulated between single quotes, e.g. 'null'
  • Unordered alternation is denoted by |, e.g. foo | bar
  • Round parentheses are used for grouping, e.g. (VALUE)
  • Kleene operators ?, * and + denote optional (0 or 1 repetitions), any (0 or more repetitions), and many (1 or more repetitions) items, respectively, e.g. (KEY ':' VALUE ',')*
  • Ranges of Unicode codepoints are spelled using PCRE notation, e.g. [0-9]
  • Fixed-number repetitions of consecutive characters in a range is also written in PCRE notation, e.g. [0-9a-fA-F]{2}
  • Unicode concepts and other human-readable, simplifying descriptions are given in angle brackets, e.g. <XID_continue>
VALUE := NULL | OPT | BOOL | INT | UINT | FLOAT | STRING | BLOB | ARRAY | MAP

NULL := 'null' WB
OPT := '?' VALUE

BOOL := ('true' | 'false') WB

INT := ('+' | '-') [0-9]+ WB
UINT := [0-9]+ WB
FLOAT := ('+' | '-')? (([0-9]* '.' [0-9]+) | ([0-9]+ '.' [0-9]*) | 'inf') WB

STRING := '"' ( UNESCAPED | ESCAPED )* '"'
UNESCAPED := <any Unicode codepoint except '"' and '\'>
ESCAPED := '\n' | '\r' | '\t' | '\\' | '\'' | '\"' | '\u{[0-9a-fA-F]+}'

BLOB := '#' ([0-9a-fA-F]{2})* '#'

ARRAY := '[' ( VALUE ( ',' VALUE )* ','? )? ']'
MAP := '{' ( PAIR ( ',' PAIR )* ','? )? '}'
PAIR := VALUE ':' VALUE

WB := <Unicode word boundary or ASCII punctuation>

Notes:

  • The human-readable format is case sensitive. All built-in literal words are written in lowercase. (The case of strings is of course preserved.)

  • All numbers are written in decimal. The human-readable format does not allow bases other than 10.

  • Numbers and words (e.g. literal null, true, false, and inf) must be separated from their surroundings by Unicode word boundaries or ASCII punctuation. The purpose of this constraint is to disallow strings that would be hard to parse for a human reader, e.g. 123null shouldn't be accepted as two valid tokens 123 and null, because they are not visually separate.

  • There is a distinction between signed and unsigned integers, i.e. +42 is not the same as 42. This means that the textual format can preserve the type of a positive signed integer, which is distinct from an unsigned integer.

  • Floating-point numbers must always have digits on at least one side of the decimal point. Canonically, they are printed with at least one digit on both sides on the decimal point, even if that digit happens to be a zero, e.g. -0.3 or +12.0. The sign of a floating-point number is optional, but it is always printed in the canonical representation.

  • Floating-point numbers are capable of representing positive and negative infinity, and they preserve the sign of zero, but NaN is disallowed. NaN values are converted to null in all three representations of the Neodyn Exchange format. This means, however, that NaN does not round-trip if serialized then deserialized, because attempting to deserialize null into a float is treated as an error.

  • Leading zeroes in all three numeric types are allowed and ignored, except for floating-point inf. Trailing zeroes in the fractional part of floating-point numbers are also allowed and ignored.

    The canonical representation is not to print any leading or trailing zeroes, except when the number (for integers) or its integer or fractional part (for floating-point numbers) is equal to zero. In these cases, exactly one zero digit is printed in each place where it is needed.

  • Trailing commas in arrays and maps are allowed and ignored. The canonical format is to always print trailing commas.

  • Whitespace is allowed between pairs of hex digits in a blob literal, but hex digits within a pair (i.e. those denoting the high and low nibble of the same byte) must not be separated by whitespace.

  • Hex digits in a blob literal and in Unicode escape sequences are case insensitive and the lowercase and uppercase characters a...f and A...F are pairwise equivalent. The canonical representation is lowercase.

  • String literals are allowed to contain "raw" newlines and tabs. These are interpreted as-is. The canonical representation is to always escape newlines and tabs.

  • Escape sequences are interpreted according to Rust's rules. This means that mostly the usual interpretation applies (e.g. \n denotes a newline, \t is a tab, etc.), with the addition of \u{...}, where ... denotes a sequence of consecutive hex digits (case insensitive) that is parsed as a Unicode code point. Leading zeroes are allowed, and the canonical representation is not to print any leading zeroes.

Example for the text representation:

[
  {
    +39: -.354,
    -1.: true,
    +3.142: -6.283,
    0: null,
    1: ?"an optional string",
    2: ??"two levels of optionals; even an optional null is allowed, e.g.:",
    null: ?null,
    "as you can see": "null is allowed to be a key as well",
    "escaped\nnewline": "unescaped
newline",
    ["arrays","and","maps"]:{"can":"be","keys":"too"},
    "this is a map": "with a trailing comma",
  },
  {
    "optional array": ?[
      "first",
      "second",
    ],
    "empty map": {},
    "array without a trailing comma": [1, 2, 3],
    "this is a map": "also without a trailing comma"
  },
]

The Binary Representation

The binary format tries to pack values into as little space as possible. It is, however, a strictly byte-oriented format, i.e. it operates on units of at least 8 bits and assumes 8-bit bytes. It employs techniques such as variable-width integer encoding and string interning in order to achieve small serialized object sizes. Consequently, although the conceptual type system describes 64-bit integers and floating-point numbers, values of sufficiently low magnitude and/or sufficiently few significant digits may actually use less space when encoded.

Type tags and Variable-Width Encoding

Every value starts with a one-byte tag containing at least type information, but often other data such as length or an inline small value.

The types are organized hierarchically. There are three levels of types:

  1. Every value has a major type.
  2. For some of the major types, there are one or more minor types.
  3. For certain combinations of major and minor types, there is a value tag.

The type annotations and these additional pieces of information occupy well-defined places within a byte:

  • The major type is stored in the 3 most significant bits (#5-#7).
  • The minor type, if any, is stored in the preceding 3 bits, i.e. bits #2-#4.
  • The value tag, if present, occupies the 2 least significant bits, #0-#1.
  • Certain major-minor combinations store a 2-bit logarithmic length information in the lower two bits instead of a value tag. This denotes that either 20=1, 21=2, 22=4, or 23=8 bytes of additional data follow. This is usually an integer or a floating-point number, always in little-endian byte order.
  • Yet some other major types omit the minor type as well as the value tag and store a small integer payload in the lower 5 bits (#0-#4).

To sum up, the possible formats for type/value tags are:

  • Major, Minor, Value Tag: XXX YYY ZZ, (X = major bits, Y = minor bits, Z = value tag bits)
  • Major, Minor, Log-Length: XXX YYY NN (X = major bits, Y = minor bits, N = bits of base-2 logarithm of the length of the data following)
  • Major, Small Payload: XXX VVV VV (X = major bits, V = payload data bits)

Anatomy of a Serialized Value. The Symbol Table.

A serialized Neodyn Exchange value in the binary format consists of two parts:

  1. The header or "symbol table"
  2. The body

Strings and binary blobs are stored in a compressed manner: only one copy of each identical byte array is ever written. These uniqued byte arrays are arranged into a "symbol table" at the beginning of the serialized data. If there are no interned strings or blobs in a value, this header may be omitted (and it is omitted by the reference implementation).

Note that if a string and a blob have the exact same data payload, they are uniqued to the same slot in the symbol table. Also note that empty strings and blobs are never interned and they are denoted by special inline "empty string" and "empty blob" markers, respectively, in the value body.

Furthermore, Each symbol table entry declares whether the corresponding data payload can (and will) be used as a string too, or only as a blob. This helps the deserializer avoid trying to parse arbitrary binary data as UTF-8.

If a symbol table entry is referenced more than once, the number of references, the so-called "use count" is also stored along with the payload. This makes it possible for the deserializer to hand over the owned buffer to the consumer upon the last reference to the symbol, sparing a copy and an allocation. (This optimization only matters when the deserializer works with owned buffers.)

After the symbol table, all other data are stored in the value body. It contains atomic values (e.g. booleans, numbers, etc.) as well as arrays and maps, inline. Strings and blobs are referenced by their index into the symbol table.

The only case when more than one value is present in the body is when there is a complex type (an array or a map), recursively containing other values. The values in an array follow each other sequentially. The keys and values of a map are interleaved: the first entry is the first key followed by the first value, then the second key and the second value, and so on.

Type Tags in the Header and the Body

Numerically equal type tags can have different interpretations depending on whether they are found in the header / symbol table or in the actual value body.

The following lists all of the possible major and minor types and value tags. As previously, NN denotes log-length bits, and VVV VV stands for inline small integer bits.

First, type tags shared between the header and the body:

Bits Meaning
000 000 NN Symbol Table Start

Second, type tags for the header / symbol table:

Bits Meaning
010 VVV VV Short Blob, Single-use
011 VVV VV Short Blob, Multiple uses
100 VVV VV Short String, Single-use
101 VVV VV Short String, Multiple uses
111 010 NN Long Blob, Single-use
111 011 NN Long Blob, Multiple uses
111 100 NN Long String, Single-use
111 101 NN Long String, Multiple uses

Third, type tags for the value body:

Bits Meaning
000 001 00 null
000 001 01 Optional Present
000 001 10 false
000 001 11 true
000 010 00 Empty string (inline)
000 010 01 Empty blob (inline)
001 VVV VV Small signed integer
010 VVV VV Small unsigned integer
011 VVV VV Low-index string
100 VVV VV Low-index blob
101 VVV VV Small array
110 VVV VV Small map
111 001 NN Signed integer
111 010 NN Unsigned integer
111 011 NN String
111 100 NN Blob
111 101 NN Array
111 110 NN Map
111 111 NN Floating-point number

Bit-level Details of the Binary Format

First some general notes about the variable-width number encoding:

  • Sizes, counts, payload lengths, and indices are always encoded using unsigned integers in little-endian byte order.
  • Some type tags use the same variable-width encoding structure for associated length/index data as proper unsigned integers themselves, except that they might use a different major/minor type tag.
  • This unified variable-width integer encoding is as follows.
    • If the major type of the type tag denotes a "small" value, then the 5 least significant bits (#0-#4) contain the integer value inline, either as an unsigned or as a 2's complement signed bit sequence.

      Therefore, the range of inline small integers is:

      • -16...+15 for signed
      • 0...31 for unsigned
    • If the major type of the type tag denotes a "big" value, then the minor type tag determines the actual (sub)type of the value, and the 2 least significant bits use the aforementioned log-length encoding to refer the size of the following integer or floating-point number. Thus, the lower two bits might be either of 00, 01, 10, 11, which means a subsequent number size of 1, 2, 4, or 8 bytes, respectively. 2, 4, and 8-byte numbers are stored as little-endian.

    • Thus, an encoded number may occupy a total of either 1, 2, 3, 5, or 9 bytes, depending on its magnitude.

Next, the format of the symbol table is elaborated.

Symbol Table Header Format

If the symbol table is present, the value must start with a marker byte with the major type "simple" and the minor type "symtab", i.e. 000 000 NN. If the value does not start with such a byte, then no symbol table is read.

The last two bits, NN, define the base-2 logarithm of the number of bytes needed for encoding the number of entries in the symbol table. I.e.,

Symtab start byte Width of integer encoding symbol count
000 000 00 1
000 000 01 2
000 000 10 4
000 000 11 8

Note that this means that the symbol table count doesn't have a compact, inline (1-byte) format for storing the count; the total size is always 2, 3, 5, or 9 bytes, as the "start byte" is present unconditionally.

So, the symbol count follows, as a little-endian unsigned integer. For example, a symbol table with 512 entries might begin like this:

+---------------+-------------+-------------+
|  000 000 01   |  0000 0000  |  0000 0010  |
+---------------+-------------+-------------+
| Symtab start, |  count,     |  count,     |
| 2^1 = 2 bytes |  low byte   |  high byte  |
| of length     |             |             |
| follow        |             |             |
+---------------+-------------+-------------+

Then, each entry follows in sequence, without any additional padding in between.

Symbol Table Entry Format

Symbol table entries consist of, in this order:

  1. Type tag and length (1...9 bytes), in the usual variable-width encoding;
  2. An optional use count (1...9 bytes), as an unsigned integer;
  3. The raw payload data.

The first 1...9 bytes thus contain information about the following:

  • The actual symbol type, i.e.:
    • string or blob
    • small (length encoded inline in the type tag) or big
    • single-use or multiple-use (whether a use count follows the length)
  • The length of the payload, in bytes, as an unsigned integer.

The bit pattern for each of the type tags can be found in the table of all type tags in the previous section.

  • If the major type of the type tag denotes a multi-use symbol, then the use count follows as an unsigned integer in the usual variable-width encoding.
  • If, however, the major type corresponds to a single-use symbol, then no use count is encoded, and it should be assumed to be 1 by the decoder.

So, for example, a 64-byte-long string that is only referred to once in the value body (and thus has an implicit use count) might be encoded as follows:

+--------------+-----------+------------------------+
|  111 100 00  | 0100 0000 | 64 bytes of UTF-8 data |
+--------------+-----------+------------------------+
| long string, | length,   | actual string payload  |
| single-use,  | 64 bytes  |                        |
| 2^0 = 1 byte |           |                        |
| of length    |           |                        |
| follow       |           |                        |
+--------------+-----------+------------------------+

Meanwhile a 17-byte-long blob that is used 130 times in the body can be encoded in the following manner:

+--------------+------------+-----------+---------------------+
|  011 100 01  | 111 010 00 | 1000 0010 | 17 arbitrary bytes  |
+--------------+------------+-----------+---------------------+
| short blob,  | use count: |    130    | actual blob payload |
| multi-use,   | unsigned   |           |                     |
| 17 bytes     | integer,   |           |                     |
| long         | 2^0 = 1    |           |                     |
|              | byte wide  |           |                     |
+--------------+------------+-----------+---------------------+

The Value Body

Non-interned data, i.e. values of every type other than strings and blobs, are stored inline in the body, which directly follows the header.

The exact format of each type of value is reproduced below.

Type Tag Value Following additional data
000 001 00 null None (tag-only)
000 001 01 Optional Wrapped value
000 001 10 false None (tag-only)
000 001 11 true None (tag-only)
000 010 00 Empty String None (tag-only)
000 010 01 Empty Blob None (tag-only)
001 VVV VV Small Int None (5-bit value inline)
010 VVV VV Small Uint None (5-bit value inline)
011 NNN NN Small String None (5-bit symtab index inline)
100 NNN NN Small Blob None (5-bit symtab index inline)
101 NNN NN Small Array Array items (5-bit count inline)
110 NNN NN Small Map Key-value pairs (5-bit count inline)
111 001 NN Big Int 2NN bytes of 2's complement
111 010 NN Big Uint 2NN bytes of unsigned int
111 011 NN Big String 2NN bytes of symtab index
111 100 NN Big Blob 2NN bytes of symtab index
111 101 NN Big Array 2NN bytes of count, then items
111 110 NN Big Map 2NN bytes of count, then pairs
111 111 NN Floating-point 2NN (4 or 8) bytes of IEEE-754

Example

Consider the MsgPack front page example in the human-readable format of Neodyn Exchange:

{"compact": true, "schema": 0}

In the binary format this is represented as:

Address Byte Explanation
00 000 000 00 Symtab start, 200 = 1 byte of length
01 000 000 10 2 symbols in symtab
02 100 001 11 Short string, single-use, 7 bytes
03 011 000 11 'c'
04 011 011 11 'o'
05 011 011 01 'm'
06 011 100 00 'p'
07 011 000 01 'a'
08 011 000 11 'c'
09 011 101 00 't'
10 100 001 10 Short string, single-use, 6 bytes
11 011 100 11 's'
12 011 000 11 'c'
13 011 010 00 'h'
14 011 001 01 'e'
15 011 011 01 'm'
16 011 000 01 'a'
17 110 000 10 2-element map
18 011 000 00 String #0
19 000 001 11 true
20 011 000 01 String #1
21 010 000 00 Unsigned integer 0

Dependencies

~3–10MB
~112K SLoC