Principal component analysis (PCA)
Forked from https://github.com/ekg/pca and modified from Erik Garrison's original implementation.
Consider using this library if you have more features than samples.
A Rust library providing Principal Component Analysis (PCA) functionality using either:
- A covariance-based eigen-decomposition (classical, exact PCA; fast for moderate dimensions, but less memory-efficient).
- A randomized SVD approach (approximate; more memory-efficient, suited to large-scale or high-dimensional data).
This library supports:
- Mean-centering and scaling of input data.
- Automatic selection of PCA components via a user-defined tolerance or a fixed count.
- Both "standard" PCA for moderate dimensions and a "randomized SVD" routine for very large matrices.
Installation
Add this to your Cargo.toml:

[dependencies]
efficient_pca = "0.1.3"

Or run cargo add efficient_pca to get the latest version.
Features
- Standard PCA via eigen-decomposition of the covariance matrix:
  - Suitable for data where the number of features is not prohibitively large compared to samples.
  - Automatically handles centering and scaling of your data.
  - Allows a tolerance-based cutoff for the number of principal components.
- Randomized PCA (rfit) via randomized SVD:
  - Efficient for high-dimensional or very large datasets.
  - Can approximate the principal components much faster when the dimensionality is very large.
  - Allows specifying the number of components and an oversampling parameter to improve accuracy.
- Flexible tolerances:
  - You can specify a fraction of the largest eigenvalue/singular value as a threshold to keep or reject components.
- Easy transformation:
  - Once fitted, the same PCA instance can be used to transform new data into the principal-component space.
Usage
Classical PCA (fit)
use ndarray::array;
use efficient_pca::PCA;

fn main() {
    // Suppose we have some 2D dataset (n_samples x n_features)
    let data = array![
        [1.0, 2.0],
        [3.0, 4.0],
        [5.0, 6.0]
    ];

    // Create a new PCA instance
    let mut pca = PCA::new();

    // Fit the model to the data using the classical covariance-based approach.
    // (Optionally provide a tolerance, e.g., Some(0.01), to drop small components.)
    pca.fit(data.clone(), None).unwrap();

    // Transform the data into the new PCA space
    let transformed = pca.transform(data).unwrap();
    println!("PCA-transformed data:\n{:?}", transformed);
}
Randomized SVD PCA (rfit)
use ndarray::array;
use efficient_pca::PCA;

fn main() {
    // Larger data matrix example (here just 2x2 for illustration)
    let data = array![
        [1.0, 2.0],
        [3.0, 4.0]
    ];

    // Set up PCA
    let mut pca = PCA::new();

    // Use rfit to perform randomized SVD:
    // - n_components: number of principal components you want
    // - n_oversamples: oversampling dimension
    // - seed: optional, for reproducibility
    // - tol: optional variance-based cutoff
    pca.rfit(data.clone(), 2, 10, Some(42), None).unwrap();

    // Transform into PCA space
    let transformed = pca.transform(data).unwrap();
    println!("Randomized PCA result:\n{:?}", transformed);
}
Transforming Data
Once the PCA is fitted (by either fit or rfit), you can transform new incoming data.
The PCA object internally stores the mean, scaling, and rotation matrix.
use ndarray::array;
use efficient_pca::PCA;

fn main() {
    let data_train = array![
        [1.0, 2.0],
        [3.0, 4.0],
    ];
    let data_test = array![
        [2.0, 3.0],
        [4.0, 5.0],
    ];

    let mut pca = PCA::new();
    pca.fit(data_train.clone(), Some(1e-3)).unwrap();

    // Transform both training data and new data
    let train_pcs = pca.transform(data_train).unwrap();
    let test_pcs = pca.transform(data_test).unwrap();
    println!("Train set in PCA space:\n{:?}", train_pcs);
    println!("Test set in PCA space:\n{:?}", test_pcs);
}
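Conceptually, transform centers each feature with the stored mean, scales it with the stored scale, and projects it with the stored rotation. Below is a minimal sketch of that arithmetic using plain ndarray operations; the mean, scale, and rotation values are hypothetical stand-ins, not fields read from the library:

use ndarray::array;

fn main() {
    // Hypothetical fitted parameters for a 2-feature, 1-component model.
    let mean = array![2.0, 3.0];                 // per-feature training mean
    let scale = array![1.0, 1.0];                // per-feature training scale
    let rotation = array![[0.7071], [0.7071]];   // (n_features x n_components)

    let x = array![[4.0, 5.0]];                  // one new sample

    // transform(x) conceptually computes ((x - mean) / scale) . rotation
    let centered_scaled = (&x - &mean) / &scale;
    let projected = centered_scaled.dot(&rotation);
    println!("{:?}", projected);                 // shape (1, 1)
}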
API Overview
PCA::new()
Creates a new, empty PCA struct. Before use, you must call either fit or rfit.
use efficient_pca::PCA;
let pca = PCA::new();
PCA::fit(...)
Fits PCA using the covariance eigen-decomposition approach.
- Parameters:
  - data_matrix: Your data, shape (n_samples, n_features).
  - tolerance: If Some(tol), discard all components whose eigenvalue is below tol * max_eigenvalue. Otherwise, keep all components.
- Returns: Result<(), Box<dyn Error>>; on success, the PCA stores its rotation, mean, and scale internally.
use ndarray::array;
use efficient_pca::PCA;

let data = array![[1.0, 2.0],
                  [3.0, 4.0]];
let mut pca = PCA::new();
pca.fit(data, Some(0.01)).unwrap();
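As a worked example of the tolerance rule: if the covariance eigenvalues were 4.0, 0.5, and 0.02, then Some(0.01) puts the cutoff at 0.01 * 4.0 = 0.04, so the first two components are kept and the third is dropped.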
PCA::rfit(...)
Fits PCA using randomized SVD.
- Parameters:
  - x: The input data, shape (n_samples, n_features).
  - n_components: Number of components to keep (upper bound).
  - n_oversamples: Oversampling dimension for the randomized SVD.
  - seed: Optional RNG seed for reproducibility.
  - tol: Optional fraction of the largest singular value used to drop components.
- Returns: Same as fit, but uses a different internal approach optimized for large dimensions.
use ndarray::array;
use efficient_pca::PCA;

let data = array![[1.0, 2.0],
                  [3.0, 4.0]];
let mut pca = PCA::new();
pca.rfit(data, 10, 5, Some(42_u64), Some(0.01)).unwrap();
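Note on oversampling: in typical randomized-SVD formulations, the sketch is built from roughly n_components + n_oversamples random projections, so the call above would work in about a 15-dimensional subspace before the small SVD. Whether this crate combines the two parameters in exactly that way is an assumption here, not something its documentation states.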
PCA::transform(...)
Transforms data using the previously fitted PCA's rotation, mean, and scale.
- Parameters:
  - x: A data matrix with the same number of features as the training data.
- Returns: A matrix of shape (n_samples, n_components) in principal-component space.
use ndarray::array;
use efficient_pca::PCA;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = array![[1.0, 2.0],
                      [3.0, 4.0]];
    let mut pca = PCA::new();
    pca.fit(data.clone(), None)?;
    let projected = pca.transform(data)?;
    println!("{:?}", projected);
    Ok(())
}
Performance Considerations

Standard PCA (fit):
- Computes the covariance matrix (p x p) if p <= n, else a Gram matrix (n x n).
- Faster for smaller feature counts, but can be expensive for extremely large p.

Randomized SVD (rfit):
- Best for very high-dimensional datasets.
- You can tune n_oversamples to reduce approximation error (at the cost of more computation).
- Internally, it computes a smaller SVD on a projected matrix, offering a big speed-up for large matrices, whether tall-and-skinny or short-and-wide.
Memory Usage:
- The library copies data in some places to center and scale it, and creates temporary matrices for the covariance or Gram decompositions.
- For very large, high-dimensional, or wide datasets where memory efficiency is crucial, prefer rfit, even if that sometimes means a trade-off in runtime.
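As a rough illustration of these trade-offs, the hypothetical helper below dispatches on the data's shape; the 1_000-feature cutoff is an arbitrary assumption for the sketch, not a recommendation from the library:

use ndarray::Array2;
use efficient_pca::PCA;

// Hypothetical dispatcher: exact covariance PCA for modest feature counts,
// randomized SVD once the dimensionality gets large.
fn fit_auto(data: Array2<f64>, n_components: usize) -> Result<PCA, Box<dyn std::error::Error>> {
    let mut pca = PCA::new();
    if data.ncols() <= 1_000 {
        pca.fit(data, None)?;                              // exact eigen-decomposition route
    } else {
        pca.rfit(data, n_components, 10, Some(42), None)?; // randomized route
    }
    Ok(pca)
}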
Authors
- Erik Garrison (erik.garrison@gmail.com). Original repository: https://github.com/ekg/pca
- SauersML
License
This project is licensed under the MIT License - see the LICENSE file for details.