3 releases
0.1.2 | Jan 26, 2024 |
---|---|
0.1.1 | Jan 25, 2024 |
0.1.0 | Jan 25, 2024 |
#1838 in Algorithms
21 downloads per month
16KB
227 lines
Description
This projects provides a simple implementation to apply Principal Coordinate Analysis
on distance (dissimilarity) matrix.
The logic inside is based on Python's scikit-bio
library.
How to use
Documentation can be found here.
Status
Experimental.
lib.rs
:
This crate provides functionality to apply Principal Coordinate Analysis.
Quick Start
A very brief introduction into Principal Coordinates Analysis.
What is Principal Coordinates Analysis (PCoA)
PCoA is a statistical method that turns distance data between items into a map-like visualization. It helps to reveal which items are close to each other and which ones are far away. This visualization can also assist in identifying groups or clusters among the items.
Principal Coordinate Analysis and Multidimensional Scaling
Multidimensional Scaling is a family of statistical methods that focus on creating mappings of items based on distance. Principal Coordinate Analysis is one type of Multidimensional Scaling which specifically works with numerical distances, in which there is no measurement error — you have one precise distance measure for each pair of items.
Principal Coordinate Analysis nad Principal Component Analysis
PCoA and Principal Component Analysis (PCA) are often confused due to their shared initials and both involving dimensionality reduction. However, they differ in their primary objectives:
- PCA concentrates on shared variance, aiming to summarize multiple variables into the fewest components while maximizing the explanation of variance in each component.
- PCoA emphasizes distances and strives to identify dimensions that capture the greatest distances between items.
For more information, please read an excellent article.
Implementation
This crate's implementation is based on scikit-bio implementation, however, most of performance optimizations are not applied here.
Usage Example
This section explains how to use the library.
Brief
Let's assume we have three points A, B, C with the respective distance matrix of 3x3 shape:
0 | AB | AC |
BA | 0 | BC |
CA | CB | 0 |
To call the PCoA
function, you need to construct DMatrix
instance from the distance matrix.
The following pseudo code demonstrates expected layout of one dimensional array which will be used later to the function:
DMatrix::from_column_slice(3, 3, [0,AB,AC,BA,0,BC,CA,CB,0])
DMatrix
is a type from nalgebra
crate, other types from this crate can be imported with pcoa::nalgebra
.
Please note that current implementation assumes a symmetric matrix, so the following equalities are expected to be
hold: AB=BA
, AC=CA
and BC=CB
.
As a second argument, you need to pass dimensionality of principal coordinates. Typically, it equals to 2.
As result, another DMatrix
instance will be returned and it represents principal coordinates which can be used to
plot original items on a map.
Code example
A minimalistic example:
use pcoa::apply_pcoa;
use pcoa::nalgebra::DMatrix;
// here, we have interest in only two coordinates (e.g. x and y)
let number_of_dimensions = 2;
// create a distance matrix from raw data. Matrix is expected to be symmetric with 3x3 shape
let distance_matrix = DMatrix::from_column_slice(3, 3, &[0_f64, 250., 450., 250., 0., 300., 450., 300., 0.]);
// apply pcoa
let coords_matrix = apply_pcoa(distance_matrix, number_of_dimensions).expect("cannot apply PCoA");
// NOTE: transpose matrix to get first column for x coordinates and the second - for y coordinates.
let coords_matrix = coords_matrix.transpose();
let xs: Vec<_> = coords_matrix.column(0).iter().copied().collect();
let ys: Vec<_> = coords_matrix.column(1).iter().copied().collect();
// these are our coordinates
assert_eq!((xs[0].round(), ys[0].round()), (213., -60.));
assert_eq!((xs[1].round(), ys[1].round()), (24., 104.));
assert_eq!((xs[2].round(), ys[2].round()), (-237., -44.));
Here, we have the following distance matrix as input:
0 | 250 | 450 |
250 | 0 | 300 |
450 | 300 | 0 |
As result, we get the following coordinates (rounded in test and here):
A: (213, -60)
B: ( 24, 104)
C: (-237, -44)
These coordinates retain original distances between points (with some precision loss) and can be used to visualize original data in 2D space.
Dependencies
~3MB
~57K SLoC