#compile-time #filter #staged #programming #simd #rayon #savitzky-golay

bin+lib staged-sg-filter

A staged programming implementation for Savitzky-Golay filters. Loops go brrr.

5 releases

0.2.3 Jul 22, 2024
0.2.2 Jun 6, 2024
0.2.1 Jun 6, 2024
0.2.0 Jun 6, 2024
0.1.0 Jun 6, 2024

#206 in Hardware support

Download history 74/week @ 2024-07-23 11/week @ 2024-07-30 4/week @ 2024-09-10 1/week @ 2024-09-17 1/week @ 2024-09-24 8/week @ 2024-10-01

357 downloads per month

MIT license

320KB
942 lines

Staged Filters

  • A Savitzky-Golar filter that is fast, baby.

  • All (N,M) parameters are precomputed and pulled in at compile time.

  • rayon support is available via a rayon feature flag

  • Still some SIMD perf left on the table - newer versions will focus on perf

  • Remember to compile this with RUSTFLAGS="-C target-cpu=native".

This code is based on another code I adapted in Julia with much help from others, see StagedFilters.jl.

Example

Benchmarks

The other savgol-rs implementation offers this speed:

use savgol_rs::*;
fn main() {
    let input = SavGolInput {
        data: &vec![10.0; 500_000],
        window_length: 3,
        poly_order: 1,
        derivative: 0,
    };
    let result = savgol_filter(&input);
    let data = result.unwrap();
    println!("{:?}", &data[0..10]);
}

takes about 52s, whereas this crate

use staged_sg_filter::sav_gol;

fn main() {
    let n = 100_000_000;
    let v = vec![10.0; n];
    let mut buf = vec![0.0; n];
    sav_gol::<1, 1>(&mut buf, &v);

    println!("{:?}", &buf[0..10]);
}

runs in about 200ms in 20x the data size. We're roughly churning through about 100_000_000/0.2 ≈ 5e8 elements per second or 5e8 * 10^-90.5 elements per nanosecond.

This can still be improved by about a 4x factor, which is the current speed of the Julia code.

Notes

It's called "staged" because the computation is done in "stages", which allows the compiler to optimize the code a lot more - namely, the use of const generics in Rust provide more opportunities for profitable loop unrolling and proper SIMD lane-width usage.

You are expected to have FMA and AVX2 compatible hardware (at least). Compile with RUSTFLAGS="-C target-cpu=native" cargo run --release for best performance.

Decent efforts have been made to ensure

  • minimal dependencies and fast builds
  • auto-vectorization fires off with the help of cargo-remark
  • as much computation is pushed to compile time with the use of precomputed coefficients and const generics
  • the hot path is allocation and panic-free

Algorithm

  1. Calculate the coefficients of interest in Julia, copy/paste them into coeffs/_f32.rs appropriately and declare them as const.
  2. Do a fixed-size rolling window dot_product with half the elements of the dot product as the coeffs obtained previously.
  3. Update each element of a buffer
  4. Parallelize with Rayon

TODO

  • rayon support
  • Just calculate NxM up to 12x12 and cache that
  • fma support
  • f32/f64 float support
  • SIMD support
  • GPU support / ping Manuel Drehwald
  • no_std support see(Effective Rust link)
  • support derivatives (stretch goal - sponsor me???)

Dependencies

~0–265KB