Principal component analysis (PCA)
Forked from https://github.com/ekg/pca and modified from Erik Garrison's original implementation.
Consider using this library if you have more features than samples.
A Rust library providing Principal Component Analysis (PCA) functionality using either:
- A covariance-based eigen-decomposition (classical, exact PCA; fast for moderate dimensions, but less memory-efficient).
- A randomized SVD approach (approximate; more memory-efficient, suited to large-scale or high-dimensional data).
This library supports:
- Mean-centering and scaling of input data.
- Automatic selection of PCA components via a user-defined tolerance or a fixed count.
- Both "standard" PCA for moderate dimensions and a "randomized SVD" routine for very large matrices.
Installation
Add this to your Cargo.toml:

[dependencies]
efficient_pca = "0.1.3"

Or run cargo add efficient_pca to get the latest version.
Features
- Standard PCA via eigen-decomposition of the covariance matrix:
  - Suitable for data where the number of features is not prohibitively large compared to samples.
  - Automatically handles centering and scaling of your data.
  - Allows a tolerance-based cutoff for the number of principal components.
- Randomized PCA (rfit) via randomized SVD:
  - Efficient for high-dimensional or very large datasets.
  - Can approximate the principal components much faster when the dimensionality is very large.
  - Allows specifying the number of components and an oversampling parameter to improve accuracy.
- Flexible tolerances:
  - You can specify a fraction of the largest eigenvalue/singular value as a threshold to keep or reject components.
- Easy transformation:
  - Once fitted, the same PCA instance can be used to transform new data into the principal-component space.
Usage
Classical PCA (fit)
use ndarray::array;
use efficient_pca::PCA;

fn main() {
    // Suppose we have some 2D dataset (n_samples x n_features)
    let data = array![
        [1.0, 2.0],
        [3.0, 4.0],
        [5.0, 6.0]
    ];

    // Create a new PCA instance
    let mut pca = PCA::new();

    // Fit the model to the data using the classical covariance-based approach.
    // (Optionally provide a tolerance, e.g., Some(0.01), to drop small components.)
    pca.fit(data.clone(), None).unwrap();

    // Transform the data into the new PCA space
    let transformed = pca.transform(data).unwrap();
    println!("PCA-transformed data:\n{:?}", transformed);
}
Randomized SVD PCA (rfit)
use ndarray::array;
use efficient_pca::PCA;

fn main() {
    // Larger data matrix example (here just 2x2 for illustration)
    let data = array![
        [1.0, 2.0],
        [3.0, 4.0]
    ];

    // Set up PCA
    let mut pca = PCA::new();

    // Use rfit to perform randomized SVD:
    // - n_components: number of principal components you want
    // - n_oversamples: oversampling dimension
    // - seed: optional, for reproducibility
    // - tol: optional variance-based cutoff
    pca.rfit(data.clone(), 2, 10, Some(42), None).unwrap();

    // Transform into PCA space
    let transformed = pca.transform(data).unwrap();
    println!("Randomized PCA result:\n{:?}", transformed);
}
Transforming Data
Once the PCA is fitted (by either fit or rfit), you can transform new incoming data.
The PCA object internally stores the mean, scaling, and rotation matrix.
use ndarray::array;
use efficient_pca::PCA;

fn main() {
    let data_train = array![
        [1.0, 2.0],
        [3.0, 4.0],
    ];
    let data_test = array![
        [2.0, 3.0],
        [4.0, 5.0],
    ];

    let mut pca = PCA::new();
    pca.fit(data_train.clone(), Some(1e-3)).unwrap();

    // Transform both training data and new data
    let train_pcs = pca.transform(data_train).unwrap();
    let test_pcs = pca.transform(data_test).unwrap();
    println!("Train set in PCA space:\n{:?}", train_pcs);
    println!("Test set in PCA space:\n{:?}", test_pcs);
}
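Conceptually, transform centers each feature with the stored mean, scales it with the stored scale, and projects it with the stored rotation. Below is a minimal sketch of that arithmetic using plain ndarray operations; the mean, scale, and rotation values are hypothetical stand-ins, not fields read from the library:

use ndarray::array;

fn main() {
    // Hypothetical fitted parameters for a 2-feature, 1-component model.
    let mean = array![2.0, 3.0];                 // per-feature training mean
    let scale = array![1.0, 1.0];                // per-feature training scale
    let rotation = array![[0.7071], [0.7071]];   // (n_features x n_components)

    let x = array![[4.0, 5.0]];                  // one new sample

    // transform(x) conceptually computes ((x - mean) / scale) . rotation
    let centered_scaled = (&x - &mean) / &scale;
    let projected = centered_scaled.dot(&rotation);
    println!("{:?}", projected);                 // shape (1, 1)
}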
API Overview
PCA::new()
Creates a new, empty PCA struct. Before use, you must call either fit or rfit.
use efficient_pca::PCA;
let pca = PCA::new();
PCA::fit(...)
Fits PCA using the covariance eigen-decomposition approach.
- Parameters:
  - data_matrix: Your data, shape (n_samples, n_features).
  - tolerance: If Some(tol), discard all components whose eigenvalue is below tol * max_eigenvalue. Otherwise, keep all components.
- Returns: Result<(), Box<dyn Error>>; on success, the PCA stores its rotation, mean, and scale internally.
use ndarray::array;
use efficient_pca::PCA;

let data = array![[1.0, 2.0],
                  [3.0, 4.0]];
let mut pca = PCA::new();
pca.fit(data, Some(0.01)).unwrap();
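As a worked example of the tolerance rule: if the covariance eigenvalues were 4.0, 0.5, and 0.02, then Some(0.01) puts the cutoff at 0.01 * 4.0 = 0.04, so the first two components are kept and the third is dropped.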
PCA::rfit(...)
Fits PCA using randomized SVD.
- Parameters:
  - x: The input data, shape (n_samples, n_features).
  - n_components: Number of components to keep (upper bound).
  - n_oversamples: Oversampling dimension for the randomized SVD.
  - seed: Optional RNG seed for reproducibility.
  - tol: Optional fraction of the largest singular value used to drop components.
- Returns: Same as fit, but uses a different internal approach optimized for large dimensions.
use ndarray::array;
use efficient_pca::PCA;

let data = array![[1.0, 2.0],
                  [3.0, 4.0]];
let mut pca = PCA::new();
pca.rfit(data, 10, 5, Some(42_u64), Some(0.01)).unwrap();
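Note on oversampling: in typical randomized-SVD formulations, the sketch is built from roughly n_components + n_oversamples random projections, so the call above would work in about a 15-dimensional subspace before the small SVD. Whether this crate combines the two parameters in exactly that way is an assumption here, not something its documentation states.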
PCA::transform(...)
Transforms data using the previously fitted PCA's rotation, mean, and scale.
- Parameters:
  - x: A data matrix with the same number of features as the training data.
- Returns: A matrix of shape (n_samples, n_components) in principal-component space.
use ndarray::array;
use efficient_pca::PCA;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = array![[1.0, 2.0],
                      [3.0, 4.0]];
    let mut pca = PCA::new();
    pca.fit(data.clone(), None)?;
    let projected = pca.transform(data)?;
    println!("{:?}", projected);
    Ok(())
}
Performance Considerations

Standard PCA (fit):
- Computes the covariance matrix (p x p) if p <= n, else a Gram matrix (n x n).
- Faster for smaller feature counts, but can be expensive for extremely large p.

Randomized SVD (rfit):
- Best for very high-dimensional datasets.
- You can tune n_oversamples to reduce approximation error (at the cost of more computation).
- Internally, it computes a smaller SVD on a projected matrix, offering a big speed-up for large matrices, whether tall-and-skinny or short-and-wide.
Memory Usage:
- The library copies data in some places to center and scale it, and creates temporary matrices for the covariance or Gram decompositions.
- For very large, high-dimensional, or wide datasets where memory efficiency is crucial, prefer rfit, even if that sometimes means a trade-off in runtime.
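As a rough illustration of these trade-offs, the hypothetical helper below dispatches on the data's shape; the 1_000-feature cutoff is an arbitrary assumption for the sketch, not a recommendation from the library:

use ndarray::Array2;
use efficient_pca::PCA;

// Hypothetical dispatcher: exact covariance PCA for modest feature counts,
// randomized SVD once the dimensionality gets large.
fn fit_auto(data: Array2<f64>, n_components: usize) -> Result<PCA, Box<dyn std::error::Error>> {
    let mut pca = PCA::new();
    if data.ncols() <= 1_000 {
        pca.fit(data, None)?;                              // exact eigen-decomposition route
    } else {
        pca.rfit(data, n_components, 10, Some(42), None)?; // randomized route
    }
    Ok(pca)
}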
Authors
- Erik Garrison (erik.garrison@gmail.com). Original repository: https://github.com/ekg/pca
- SauersML
License
This project is licensed under the MIT License - see the LICENSE file for details.