1 unstable release

Uses old Rust 2015

0.0.1 Oct 27, 2017

#36 in #columnar


Used in column

MIT/Apache

23KB
462 lines

Column

column is a Rust library to represent vectors of a certain type in a columnar format. The columnar representation is beneficial when iterating a large number of elements but only looking at a subset of a type's fields.

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

An example

To use column, add the following dependency to your project's Cargo.toml:

[dependencies]
column = { git = "https://github.com/antiguru/column-rs.git" }

This will bring in the column crate from Github (this will hopefully change!), which should allow you to use regular structs in a column-based memory layout.

extern crate column;
#[macro_use] extern crate column_derive;

#[derive(Column, Debug)]
struct Data {
    id: usize,
    val: f64,
}

fn main() {
    let mut u = <Data as Column>::new();

    let ds = vec![Data { id: 0, val: 3.141 }, Data { id: 1, val: 42.}];
    u.extend(ds);

    for e in u.iter() {
        println!("Element: {:?}", e);
    }
    for mut e in u.iter_mut() {
        *e.val *= 2.;
    }
    for e in u.iter() {
        println!("Element: {:?}", e);
    }

}

This example is contained in the column crate, you can run it from the crate's root directory by typing

% cargo run --example columnar
Running `target/debug/examples/columnar`
Element: DataRef { id: 0, val: 3.141 }
Element: DataRef { id: 1, val: 42 }
Element: DataRef { id: 0, val: 6.282 }
Element: DataRef { id: 1, val: 84 }

Generic types

The generated code dereferences values to materialize elements for to_owned. That means that the value has to implement the Copy trait for this to work. Otherwise, Rust will complain about it. For example, this will produce a working generic columnar type:

#[derive(Column)]
struct DataGen<A: Copy> {
    id: A,
}

Filtered collections

When using columnar types, they might be passed to different downstream functionality without exposing all elements in the collection. To avoid intermediate copies, this carte contains a filtered collection (FilteredCollection.) It is a wrapper around an &IntoIterator combined with a Vec<bool> that stores which items are available in the target collection.

The following example instantiates a FilteredCollection and uses its retain method to only retain a subset of elements in the collection. Note that this does not change the underlying data.

use column::bitmap::FilteredCollection;
let mut bitmap_container = FilteredCollection::new(&container, container.len());
bitmap_container.retain(|u| p(u));

Debugging

Column creates the required implementations during the compilation process. Sometimes things go wrong and debugging the generated code is rather tedious. For this reason, a flag verbose exists. It forces column to write the intermediate code to the target directory. For a type Data, it will generate the file target/derive_column_Data.rs. (If there's a more elegant way, please open an issue or send a pull request!) Insert the following snippet in Cargo.toml to enable verbose output:

[dependencies]
column = { git = "https://github.com/antiguru/column-rs.git", features = [ "verbose" ] }

The feature can also be activated on the command line when working on this project. The following cargo invocation tests and dumps the intermediate files:

cargo test --features verbose

Performance

There's a small benchmark. It shows that the columnar format can be substantially faster for certain operations. Run it using Rust nightly using something like this:

% rustup run nightly cargo bench
[...]
running 6 tests
test data_bitmap_column_add_assign   ... bench:  13,178,990 ns/iter (+/- 405,043) = 1909 MB/s
test data_bitmap_vec_add_assign      ... bench:  39,891,687 ns/iter (+/- 714,537) = 630 MB/s
test data_column                     ... bench:   1,949,104 ns/iter (+/- 79,882) = 4303 MB/s
test data_column_add_assign          ... bench:   5,495,641 ns/iter (+/- 203,841) = 4579 MB/s
test data_row                        ... bench:  16,817,910 ns/iter (+/- 264,722) = 498 MB/s
test data_row_add_assign             ... bench:  32,581,004 ns/iter (+/- 759,734) = 772 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 0 filtered out

Take the performance numbers with a grain of salt. The speedup from using a columnar representation originates from loading less and more dense data into memory. It will only show any benefit of not all elements of a struct are accessed in a tight loop, because only then we can reduce the number of bytes transferred into the CPU. However, when all data is touched, i.e. all columns have to be loaded, the speedup may be negligible or even negative. Also, there is a cost associated with transforming a row-based representation into a column representation.


lib.rs:

Derive a Column representation for a struct.

Examples

#[macro_use] extern crate column_derive;
extern crate column;
#[derive(Column)]
struct Data {x: usize, y: u64}

Dependencies

~1.8–3.5MB
~72K SLoC