4 releases (breaking)

0.4.0 Jan 14, 2025
0.3.0 Oct 28, 2024
0.2.0 Aug 27, 2024
0.1.1 Jul 19, 2024

#100 in Algorithms

14,290 downloads per month
Used in 32 crates (4 directly)

MIT/Apache

1MB
28K SLoC

CubeCL Linear Algebra Library.

The crate contains common linear algebra algorithms.

Algorithms

  • Tiling 2D Matrix Multiplication.

    The kernel is very flexible and can run on virtually any hardware.

  • Cooperative Matrix Multiplication.

    The kernel uses Automatic Mixed Precision (AMP) to leverage cooperative matrix-multiply-and-accumulate instructions. For f32 tensors, the inputs are cast to f16, but the accumulation is still performed in f32. This may cause a small loss in precision, but execution is much faster.
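To illustrate the tiling idea behind the first kernel, here is a minimal CPU sketch in plain Rust (not the crate's API): the output is computed block by block so that each TILE × TILE sub-problem works on a small, cache-friendly region of A and B. The function names and the row-major `Vec<f32>` layout are assumptions for this sketch only.

```rust
/// Block (tile) size; real GPU kernels tune this per hardware.
const TILE: usize = 2;

/// Tiled multiplication of two n x n row-major matrices.
fn matmul_tiled(a: &[f32], b: &[f32], n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; n * n];
    // Walk the output in TILE x TILE blocks.
    for i0 in (0..n).step_by(TILE) {
        for j0 in (0..n).step_by(TILE) {
            // Accumulate contributions from each block along the k dimension.
            for k0 in (0..n).step_by(TILE) {
                for i in i0..(i0 + TILE).min(n) {
                    for k in k0..(k0 + TILE).min(n) {
                        let a_ik = a[i * n + k];
                        for j in j0..(j0 + TILE).min(n) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    c
}

fn main() {
    let n = 4;
    let a: Vec<f32> = (0..n * n).map(|x| x as f32).collect();
    let b: Vec<f32> = (0..n * n).map(|x| (x % 3) as f32).collect();
    let c = matmul_tiled(&a, &b, n);

    // Cross-check against the naive triple loop.
    let mut expect = vec![0.0f32; n * n];
    for i in 0..n {
        for k in 0..n {
            for j in 0..n {
                expect[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
    assert_eq!(c, expect);
    println!("tiled result matches naive result");
}
```

On a GPU the same blocking maps tiles to thread groups and stages them through shared memory; the loop structure above is the scalar analogue of that schedule.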
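The AMP trade-off in the second kernel (low-precision inputs, full-precision accumulation) can be sketched on the CPU as well. Since stable Rust has no `f16` type, this sketch simulates f16-like inputs by truncating the f32 mantissa to 10 explicit bits, while the accumulator stays in f32; the helper names are hypothetical and not part of the crate.

```rust
/// Simulate an f16-precision value: keep the sign, exponent, and top 10
/// mantissa bits of an f32, zeroing the remaining 13 mantissa bits.
/// (Stand-in for a real f32 -> f16 cast, which stable Rust lacks.)
fn to_low_precision(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & 0xFFFF_E000)
}

/// Dot product with low-precision multiplies and full-f32 accumulation,
/// mirroring the AMP pattern (f16 multiply, f32 accumulate).
fn dot_mixed(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32; // accumulation stays in full precision
    for (&x, &y) in a.iter().zip(b) {
        acc += to_low_precision(x) * to_low_precision(y);
    }
    acc
}

fn main() {
    let a: Vec<f32> = (0..8).map(|i| 1.0 + i as f32 * 0.123).collect();
    let b: Vec<f32> = (0..8).map(|i| 0.5 + i as f32 * 0.077).collect();
    let exact: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    let mixed = dot_mixed(&a, &b);
    // The mixed-precision result is close to, but not identical to, the
    // full-f32 result: that gap is the "small loss in precision".
    println!("exact = {exact}, mixed = {mixed}, abs err = {}", (exact - mixed).abs());
    assert!((exact - mixed).abs() < 0.1);
}
```

Keeping the accumulator in f32 is what bounds the error: rounding happens once per input, not once per partial sum, so long reductions do not drift the way a pure-f16 accumulation would.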

Benchmarks

You can run the benchmarks from the workspace with the following:

cargo bench --bench matmul --features wgpu # for wgpu
cargo bench --bench matmul --features cuda # for cuda

On an RTX 3070 we get the following results:

matmul-wgpu-f32-tiling2d

―――――――― Result ―――――――――
  Samples     100
  Mean        13.289ms
  Variance    28.000ns
  Median      13.271ms
  Min         12.582ms
  Max         13.768ms
―――――――――――――――――――――――――
matmul-cuda-f32-tiling2d

―――――――― Result ―――――――――
  Samples     100
  Mean        12.754ms
  Variance    93.000ns
  Median      12.647ms
  Min         12.393ms
  Max         14.501ms
―――――――――――――――――――――――――
matmul-cuda-f32-cmma

―――――――― Result ―――――――――
  Samples     100
  Mean        4.996ms
  Variance    35.000ns
  Median      5.084ms
  Min         4.304ms
  Max         5.155ms
―――――――――――――――――――――――――

Dependencies

~5–19MB
~199K SLoC