6 releases (breaking)

0.24.0 Feb 3, 2025
0.23.0 Jan 30, 2025
0.22.1 Jan 21, 2025
0.21.1 Dec 16, 2024
0.19.0 Nov 15, 2024

#426 in Database implementations


813 downloads per month
Used in 3 crates (2 directly)

Apache-2.0

2MB
39K SLoC

Vortex IPC Format

Messages:

  • Context - provides configuration context, e.g. which encodings are referenced in the stream.
  • Array - indicates the start of an array. Contains the schema.
  • Chunk - indicates the start of an array chunk. Contains the offsets for each column message.
  • ChunkColumn - contains the encoding metadata for a single column of a chunk, including offsets for each buffer.

lib.rs:

Read and write Vortex layouts, a serialization of Vortex arrays.

A layout is a serialized array which is stored in some linear and contiguous block of memory. Layouts are recursively defined in terms of one of three kinds:

  1. The FlatLayout. A contiguously serialized array of buffers, with a specific in-memory Alignment.

  2. The StructLayout. Each column of a StructArray is sequentially laid out at known offsets. This permits reading a subset of columns in time linear in the number of kept columns.

  3. The ChunkedLayout. Each chunk of a ChunkedArray is sequentially laid out at known offsets. This permits reading a subset of rows in time linear in the number of kept rows.
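The recursive definition above can be sketched as a small enum. This is a simplified, hypothetical model for illustration only: the `Layout` enum, its fields, and `ranges_for` are assumptions and not the crate's actual types. It shows how known offsets let a reader gather only the byte ranges of the kept columns.

```rust
use std::ops::Range;

/// Hypothetical, simplified model of the three layout kinds.
/// These are NOT the crate's real types.
#[derive(Debug, Clone, PartialEq)]
pub enum Layout {
    /// A contiguous run of serialized buffers at a known byte range.
    Flat { bytes: Range<u64> },
    /// One child layout per named column.
    Struct { columns: Vec<(String, Layout)> },
    /// One child layout per chunk, each tagged with its row count.
    Chunked { chunks: Vec<(u64, Layout)> },
}

impl Layout {
    /// Collect the byte ranges needed to read only the top-level columns in `keep`.
    pub fn ranges_for(&self, keep: &[&str]) -> Vec<Range<u64>> {
        let mut out = Vec::new();
        self.collect(Some(keep), &mut out);
        out
    }

    /// `keep` is `Some` until a struct has been projected; below a kept
    /// column it becomes `None`, meaning "read everything".
    fn collect(&self, keep: Option<&[&str]>, out: &mut Vec<Range<u64>>) {
        match self {
            Layout::Flat { bytes } => out.push(bytes.clone()),
            Layout::Struct { columns } => {
                for (name, child) in columns {
                    let wanted = match keep {
                        Some(names) => names.contains(&name.as_str()),
                        None => true,
                    };
                    if wanted {
                        // Dropped columns are skipped; their bytes are never requested.
                        child.collect(None, out);
                    }
                }
            }
            Layout::Chunked { chunks } => {
                for (_rows, child) in chunks {
                    child.collect(keep, out);
                }
            }
        }
    }
}
```

For a chunked layout of two struct chunks with columns `a` and `b`, asking for `["b"]` yields only the two ranges that hold `b`.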

A layout, alone, is not a standalone Vortex file because layouts are not self-describing. They neither contain a description of the kind of layout (e.g. flat, column of flat, chunked of column of flat) nor a data type (DType).

Reading

Reading is implemented by VortexFile. It's "opened" by VortexOpenOptions, which can be provided with information about the file's structure to save on IO before the actual data read. Once the file is open and has done the initial IO work to understand its own structure, it can be turned into a stream by calling VortexFile::scan with a Scan, which defines filtering and projection on the file.

The file manages IO-oriented work and CPU-oriented work on two different underlying runtimes, which are configurable and pluggable with multiple provided implementations (Tokio, Rayon, etc.). It also caches buffers between stages of the scan, which avoids duplicate IO. The cache can also be reused between scans of the same file (see SegmentCache).

File Format

Succinctly, the file format specification is as follows:

  1. Data is written first, in a form that is describable by a Layout (typically Array IPC Messages).
     a. To allow for more efficient IO and pruning, our writer implementation first writes the "data" arrays, and then writes the "metadata" arrays (i.e., per-column statistics).
  2. We then write what is collectively referred to as the "Footer", which contains:
     a. An optional Schema, which, if present, is a valid flatbuffer representing a message::Schema.
     b. The Layout, which is a valid footer::Layout flatbuffer, and describes the physical byte ranges, and the relationships amongst those byte ranges, that we wrote in part 1.
     c. The Postscript, which is a valid footer::Postscript flatbuffer, containing the absolute start offsets of the Schema and Layout flatbuffers within the file.
     d. The End-of-File marker, which is 8 bytes and contains the u16 version, the u16 postscript length, and 4 magic bytes.
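As a rough illustration of the End-of-File marker, the sketch below decodes the trailing 8 bytes and locates the Postscript that precedes them. Several details are assumptions rather than facts from the specification: the little-endian byte order, the field order within the marker (version, postscript length, magic), the placeholder `MAGIC` value, and all type and function names.

```rust
use std::ops::Range;

/// Size of the end-of-file marker, per the format description.
pub const EOF_SIZE: usize = 8;
/// ASSUMPTION: placeholder magic bytes for illustration; consult the crate for the real value.
pub const MAGIC: [u8; 4] = *b"VTXF";

/// Hypothetical decoded form of the 8-byte marker.
#[derive(Debug, PartialEq, Eq)]
pub struct EndOfFile {
    pub version: u16,
    pub postscript_len: u16,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EofError {
    TooShort,
    BadMagic,
}

/// Decode the last 8 bytes of `file`.
/// ASSUMPTION: little-endian integers, laid out as version, postscript length, magic.
pub fn parse_eof(file: &[u8]) -> Result<EndOfFile, EofError> {
    if file.len() < EOF_SIZE {
        return Err(EofError::TooShort);
    }
    let tail = &file[file.len() - EOF_SIZE..];
    if &tail[4..8] != &MAGIC[..] {
        return Err(EofError::BadMagic);
    }
    Ok(EndOfFile {
        version: u16::from_le_bytes([tail[0], tail[1]]),
        postscript_len: u16::from_le_bytes([tail[2], tail[3]]),
    })
}

/// Byte range of the Postscript, which sits immediately before the marker.
/// Returns `None` if the recorded length does not fit in the file.
pub fn postscript_range(file_len: u64, eof: &EndOfFile) -> Option<Range<u64>> {
    let end = file_len.checked_sub(EOF_SIZE as u64)?;
    let start = end.checked_sub(u64::from(eof.postscript_len))?;
    Some(start..end)
}
```

A reader following this scheme would need one small read at the end of the file, then one read of the Postscript, before it knows where the Schema and Layout flatbuffers start.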

Reified File Format

┌────────────────────────────┐
│                            │
│            Data            │
│    (Array IPC Messages)    │
│                            │
├────────────────────────────┤
│                            │
│   Per-Column Statistics    │
│                            │
├────────────────────────────┤
│                            │
│     Schema Flatbuffer      │
│                            │
├────────────────────────────┤
│                            │
│     Layout Flatbuffer      │
│                            │
├────────────────────────────┤
│                            │
│    Postscript Flatbuffer   │
│  (Schema & Layout Offsets) │
│                            │
├────────────────────────────┤
│     8-byte End of File     │
│(Version, Postscript Length,│
│       Magic Bytes)         │
└────────────────────────────┘

A Parquet-style file format is realized by using a chunked layout containing column layouts containing chunked layouts containing flat layouts. The outer chunked layout represents row groups. The inner chunked layout represents pages.
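The nesting described above can be written out with a toy shape type. The `Shape` enum and `parquet_style` helper are hypothetical names introduced only for this sketch; they model the nesting, not the crate's layout types or any real byte offsets.

```rust
/// Hypothetical shape-only model of a layout tree (no offsets, no data).
#[derive(Debug, Clone, PartialEq)]
pub enum Shape {
    Flat,
    Column(Vec<Shape>),
    Chunked(Vec<Shape>),
}

/// Build a uniform Parquet-style shape:
/// chunked (row groups) of column of chunked (pages) of flat.
pub fn parquet_style(row_groups: usize, columns: usize, pages: usize) -> Shape {
    let column = Shape::Chunked(vec![Shape::Flat; pages]);
    let row_group = Shape::Column(vec![column; columns]);
    Shape::Chunked(vec![row_group; row_groups])
}

impl Shape {
    /// Count the flat leaves, i.e. the number of pages across the whole file.
    pub fn flat_leaves(&self) -> usize {
        match self {
            Shape::Flat => 1,
            Shape::Column(children) | Shape::Chunked(children) => {
                children.iter().map(Shape::flat_leaves).sum()
            }
        }
    }
}
```

The builder produces a uniform tree only for brevity; nothing in the enum requires sibling children to share a shape.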

All the chunks of a chunked layout and all the columns of a column layout need not use the same layout.

Anything implementing VortexReadAt, for example local files, byte buffers, and cloud storage, can be used as the "linear and contiguous memory".
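To make the idea of positional reads concrete, here is a minimal synchronous stand-in. The `ReadAt` trait below is a hypothetical simplification written for this sketch; the real VortexReadAt trait may differ in its method names, signatures, and in being asynchronous.

```rust
use std::io;

/// Hypothetical, synchronous stand-in for a positional-read abstraction.
/// This is NOT the crate's VortexReadAt trait.
pub trait ReadAt {
    /// Total size of the underlying source in bytes.
    fn size(&self) -> u64;
    /// Read exactly `len` bytes starting at `offset`.
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>>;
}

/// An in-memory byte buffer is the simplest "linear and contiguous memory".
impl ReadAt for Vec<u8> {
    fn size(&self) -> u64 {
        self.len() as u64
    }

    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        if offset > self.len() as u64 {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "offset past end"));
        }
        // Safe cast: offset was just checked to be at most self.len().
        let start = offset as usize;
        let end = match start.checked_add(len) {
            Some(end) if end <= self.len() => end,
            _ => return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "read past end")),
        };
        Ok(self[start..end].to_vec())
    }
}

/// Read the trailing `n` bytes from any source, e.g. to fetch a footer.
pub fn read_tail<R: ReadAt>(source: &R, n: usize) -> io::Result<Vec<u8>> {
    let offset = source
        .size()
        .checked_sub(n as u64)
        .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "source shorter than tail"))?;
    source.read_at(offset, n)
}
```

Because `read_tail` is generic over the trait, the same code would work for a local file or an object-store client, given an implementation of the trait for each.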

Apache Arrow

If you ultimately seek Arrow arrays, VortexRecordBatchReader converts an open Vortex file into a RecordBatchReader.

Dependencies

~29–61MB
~1M SLoC