#columnar #compression #serde

serde_columnar

Ergonomic columnar storage encoding crate with forward and backward compatibility

17 releases

0.3.10 Sep 11, 2024
0.3.7 Aug 9, 2024
0.3.6 Jun 14, 2024
0.3.3 Nov 21, 2023
0.1.0 Nov 17, 2022

#259 in Encoding

374 downloads per month
Used in 6 crates (4 directly)

MIT/Apache

82KB
2K SLoC

serde_columnar

serde_columnar is an ergonomic columnar storage encoding crate that offers forward and backward compatibility.

It allows the contents that need to be serialized and deserialized to be encoded into binary using columnar storage, all by just employing simple macro annotations.

For a more detailed introduction, please refer to this Notion link: Serde-Columnar.

🚧 This crate is still in progress and not yet stable. It should not be used in production environments.

Features 🚀

serde_columnar comes with several remarkable features:

  • 🗜️ Utilizes columnar storage in conjunction with various compression strategies to significantly reduce the size of the encoded content.
  • 🔄 Built-in forward and backward compatibility solutions, eliminating the need for maintaining additional version codes.
  • 🌳 Supports nested columnar storage.
  • 📦 Supports list and map containers.
  • 🔄 Supports iterator-based deserialization.
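
To illustrate why a columnar layout compresses well, here is a minimal sketch in plain Rust. It does not use serde_columnar and does not reflect its internal format; `Row`, `Columns`, `to_columns`, `rle_encode` and `rle_decode` are hypothetical names chosen for this example. Rows are transposed into one vector per field, and a column with repeated values is then run-length encoded.

```rust
// Illustrative only: a hand-written column layout plus run-length encoding.
// None of these names come from serde_columnar.

#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: u64,
    gender: String,
}

/// Column-oriented view of a slice of rows: one vector per field.
#[derive(Debug, PartialEq)]
struct Columns {
    ids: Vec<u64>,
    genders: Vec<String>,
}

/// Transpose rows into columns so that values of the same field sit together.
fn to_columns(rows: &[Row]) -> Columns {
    Columns {
        ids: rows.iter().map(|r| r.id).collect(),
        genders: rows.iter().map(|r| r.gender.clone()).collect(),
    }
}

/// Run-length encode a column as (value, run length) pairs.
fn rle_encode<T: Clone + PartialEq>(column: &[T]) -> Vec<(T, usize)> {
    let mut runs: Vec<(T, usize)> = Vec::new();
    for value in column {
        // Extend the current run when the value repeats.
        if let Some((last, count)) = runs.last_mut() {
            if *last == *value {
                *count += 1;
                continue;
            }
        }
        // Otherwise start a new run.
        runs.push((value.clone(), 1));
    }
    runs
}

/// Expand (value, run length) pairs back into the original column.
fn rle_decode<T: Clone>(runs: &[(T, usize)]) -> Vec<T> {
    runs.iter()
        .flat_map(|(value, count)| std::iter::repeat(value.clone()).take(*count))
        .collect()
}
```

In a row-oriented layout the repeated `gender` values would be interleaved with other fields. Once they are grouped into a single column, a run of equal values collapses into one pair.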

How to use

Install

cargo add serde_columnar

Or edit your Cargo.toml and add serde_columnar as a dependency:

[dependencies]
serde_columnar = "0.3.10"

Container Attribute

  • vec:
    • Declares that this struct is a row of a vec-like container.
    • Automatically derives the RowSer trait if ser is also set.
    • Automatically derives the RowDe trait if de is also set.
  • map:
    • Declares that this struct is a row of a map-like container.
    • Automatically derives the KeyRowSer trait if ser is also set.
    • Automatically derives the KeyRowDe trait if de is also set.
  • ser:
    • Automatically derives the Serialize trait for this struct.
  • de:
    • Automatically derives the Deserialize trait for this struct.
  • iterable:
    • Declares that this struct is iterable.
    • Only available for row structs.
    • See the Iterable section for more details.
Field Attribute

  • strategy:
    • The columnar compression strategy applied to this field.
    • Possible values: Rle, DeltaRle, BoolRle, DeltaOfDelta.
    • Only available for row structs.
  • class:
    • Declares that this field is a container of rows. The field's type is usually Vec, HashMap, or one of their variants.
    • Possible values: vec or map.
    • Only available for table structs.
  • skip:
    • Same as #[serde(skip)].
  • borrow:
    • Same as #[serde(borrow)]: borrows data for this field from the deserializer by using zero-copy deserialization.
    • Use #[columnar(borrow="'a + 'b")] to specify explicitly which lifetimes should be borrowed.
    • Only available for table structs for now.
  • iter:
    • Declares the iterable row type to use when deserializing in iter mode.
    • Only available for fields marked with class.
    • Only available for class="vec".
  • optional & index:
    • To achieve forward and backward compatibility, fields that may change can be marked as optional.
    • To avoid future errors, such as a change in the order of optional fields, every optional field must also be given an index.
    • All optional fields must come after the other fields.
    • The index is the unique identifier of the optional field and is encoded into the result. If the corresponding identifier cannot be found during deserialization, Default is used.
    • Optional fields can be added or removed in future versions. Compatibility requires that the field type for a given index does not change, or that the encoding format stays compatible (such as changing u32 to u64).
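
The optional and index rules above can be pictured with a small model in plain Rust. This is only a sketch under stated assumptions and not serde_columnar's wire format or API: `Encoded`, `Row`, `decode_u32` and `decode_row` are hypothetical, and the byte layout of each optional value is invented for the example. Optional values are looked up by their index, unknown indices are ignored, and a missing index falls back to Default.

```rust
use std::collections::HashMap;

// Hypothetical model, not serde_columnar's wire format: mandatory fields are
// stored directly, optional fields are stored as raw bytes keyed by index.
struct Encoded {
    id: u64,
    optional: HashMap<u32, Vec<u8>>,
}

#[derive(Debug, Default, PartialEq)]
struct Row {
    id: u64,
    note: String, // optional, index = 0
    score: u32,   // optional, index = 1
}

/// Decode a little-endian u32; returns None if the payload has the wrong size.
fn decode_u32(bytes: &[u8]) -> Option<u32> {
    if bytes.len() == 4 {
        Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
    } else {
        None
    }
}

/// Decode a row, using Default for every optional index that is absent.
fn decode_row(encoded: &Encoded) -> Row {
    let note = encoded
        .optional
        .get(&0)
        .and_then(|bytes| String::from_utf8(bytes.clone()).ok())
        .unwrap_or_default();
    let score = encoded
        .optional
        .get(&1)
        .and_then(|bytes| decode_u32(bytes))
        .unwrap_or_default();
    Row { id: encoded.id, note, score }
}
```

In this model, a payload written before `score` existed still decodes (the field becomes 0), and a payload containing an index the reader does not know is accepted because that entry is never looked up.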

Examples

use std::borrow::Cow;

use serde_columnar::{columnar, from_bytes, to_vec};

#[columnar(vec, ser, de)]                // this struct can be a row of a vec-like container
struct RowStruct {
    name: String,
    #[columnar(strategy = "DeltaRle")]   // this field will be encoded by `DeltaRle`
    id: u64,
    #[columnar(strategy = "Rle")]        // this field will be encoded by `Rle`
    gender: String,
    #[columnar(strategy = "BoolRle")]    // this field will be encoded by `BoolRle`
    married: bool,
    #[columnar(strategy = "DeltaOfDelta")] // this field will be encoded by `DeltaOfDelta`
    time: i64,
    // Optional fields must come after all other fields. This field can be
    // added in this version or deleted in a future version.
    #[columnar(optional, index = 0)]
    future: String,
}

#[columnar(ser, de)]                    // derive `Serialize` and `Deserialize`
struct TableStruct<'a> {
    #[columnar(class = "vec")]          // this field is a vec-like table container
    pub data: Vec<RowStruct>,
    #[columnar(borrow)]                 // the same as `#[serde(borrow)]`
    pub text: Cow<'a, str>,
    #[columnar(skip)]                   // the same as `#[serde(skip)]`
    pub ignore: u8,
    #[columnar(optional, index = 0)]    // table containers also support optional fields
    pub other_data: u64,
}

let table = TableStruct::new(...);
let bytes = to_vec(&table).unwrap();
let table_from_bytes = from_bytes::<TableStruct>(&bytes).unwrap();

You can find more examples of serde_columnar in examples and tests.
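
The example above selects the DeltaRle and DeltaOfDelta strategies for `id` and `time`. As a rough illustration of the general delta and delta-of-delta techniques those names refer to, here is a sketch in plain Rust. It is not the crate's implementation, and the function names are hypothetical. How serde_columnar combines these steps with RLE or variable-length integers is not shown here.

```rust
// Illustrative only: generic delta and delta-of-delta transforms.
// Wrapping arithmetic keeps the round trip exact even at the i64 limits.

/// Replace each value with its difference from the previous value.
/// The first value is stored as its difference from 0.
fn delta_encode(values: &[i64]) -> Vec<i64> {
    let mut prev = 0i64;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Rebuild the original values by accumulating the differences.
fn delta_decode(deltas: &[i64]) -> Vec<i64> {
    let mut acc = 0i64;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Apply the delta transform twice: store differences of differences.
fn delta_of_delta_encode(values: &[i64]) -> Vec<i64> {
    delta_encode(&delta_encode(values))
}

/// Undo the delta-of-delta transform.
fn delta_of_delta_decode(encoded: &[i64]) -> Vec<i64> {
    delta_decode(&delta_decode(encoded))
}
```

For evenly spaced timestamps such as 1000, 1010, 1020, 1030, the deltas are 1000, 10, 10, 10 and the deltas of deltas are 1000, -990, 0, 0. Small and repeated numbers like these are what run-length and variable-length integer encodings compress well.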

Iterable

Columnar compression encoding relies on each field being iterable. Deserialization can therefore borrow the encoded bytes and yield all the data through iterators, without allocating memory for all of the data up front. This mode is also implemented entirely through macros.

To use iter mode when deserializing, you only need to do 3 things:

  1. Mark all row structs with iterable.
  2. Mark the row container field with iter="...".
  3. Use serde_columnar::iter_from_bytes to deserialize.

#[columnar(vec, ser, de, iterable)]
struct Row {
  #[columnar(strategy = "Rle")]
  rle: String,
  #[columnar(strategy = "DeltaRle")]
  delta_rle: u64,
  other: u8,
}

#[columnar(ser, de)]
struct Table {
  #[columnar(class = "vec", iter = "Row")]
  vec: Vec<Row>,
  other: u8,
}

let table = Table::new(...);
let bytes = serde_columnar::to_vec(&table).unwrap();
let table_iter = serde_columnar::iter_from_bytes::<Table>(&bytes).unwrap();
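
The idea behind iter mode, yielding values lazily from borrowed data instead of materializing a full Vec, can be sketched in plain Rust as follows. This is a simplified model and not the code that serde_columnar generates: `RleIter` is a hypothetical type, and it walks already-parsed (value, run length) pairs rather than encoded bytes.

```rust
// Illustrative only: a lazy iterator over run-length encoded data.
// It borrows the runs and never allocates the expanded column.

struct RleIter<'a, T> {
    runs: &'a [(T, usize)],
    run: usize,     // index of the current run
    emitted: usize, // number of values already yielded from the current run
}

impl<'a, T> RleIter<'a, T> {
    fn new(runs: &'a [(T, usize)]) -> Self {
        RleIter { runs, run: 0, emitted: 0 }
    }
}

impl<'a, T> Iterator for RleIter<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        // Copy the shared slice reference so yielded items borrow from the
        // underlying data for 'a, not from the iterator itself.
        let runs: &'a [(T, usize)] = self.runs;
        loop {
            // Stop once every run has been consumed.
            let (value, len) = runs.get(self.run)?;
            if self.emitted < *len {
                self.emitted += 1;
                return Some(value);
            }
            // Current run is exhausted (or empty): move to the next one.
            self.run += 1;
            self.emitted = 0;
        }
    }
}
```

Because each item is a reference into the borrowed runs, a caller can stop early or stream through a large column while holding only the iterator's few words of state.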

Acknowledgements

  • serde: Serialization framework for Rust.
  • postcard: A #![no_std] focused serializer and deserializer for Serde. We use it as the serializer and deserializer in order to provide VLE and ZigZag encoding.
  • Automerge: An excellent CRDT framework. We reused its RLE encoding code.

Dependencies

~1.3–2.1MB
~45K SLoC