Version 0.1.0-alpha.1 (unstable), released Apr 12, 2025
SciRS2 Datasets
A collection of dataset utilities for the SciRS2 scientific computing library. This module provides functionality for loading, generating, and working with common datasets used in scientific computing, machine learning, and statistical analysis.
Features
- Data Loaders: Functions for loading datasets from various sources
- Dataset Generators: Utilities to generate synthetic datasets
- Toy Datasets: Pre-defined small datasets for testing and examples
- Caching: Efficient caching mechanism for dataset loading
- Data Sampling: Tools for sampling from datasets
Usage
Add the following to your Cargo.toml:
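[dependencies]
# Inside the SciRS2 workspace:
scirs2-datasets = { workspace = true }
# Outside the workspace, depend on the published version instead:
# scirs2-datasets = "0.1.0-alpha.1"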
Basic usage examples:
use scirs2_datasets::{loaders, generators, toy};
use scirs2_core::error::CoreResult;

// Load CSV data
fn example_csv_loading() -> CoreResult<()> {
    let csv_path = "data/example.csv";
    let data = loaders::load_csv(csv_path, true)?;
    println!("Loaded {} rows from CSV", data.nrows());
    Ok(())
}

// Generate synthetic data
fn example_data_generation() -> CoreResult<()> {
    // Generate a random classification dataset
    let (features, labels) = generators::make_classification(
        100, // n_samples
        2,   // n_features
        2,   // n_classes
        1,   // n_clusters_per_class
        0.8, // class_sep
    )?;
    println!("Generated dataset with {} samples", features.nrows());
    Ok(())
}

// Use toy datasets
fn example_toy_dataset() -> CoreResult<()> {
    // Load the iris dataset
    let iris = toy::load_iris()?;
    println!("Iris dataset: {} samples, {} features",
        iris.data.nrows(), iris.data.ncols());
    println!("Feature names: {:?}", iris.feature_names);
    println!("Target names: {:?}", iris.target_names);
    Ok(())
}
Components
Loaders
Functions for loading data from various file formats:
use scirs2_datasets::loaders::{
    load_csv,    // Load data from CSV files
    load_json,   // Load data from JSON files
    load_arff,   // Load data from ARFF files
    load_libsvm, // Load data from LIBSVM/SVMLight format
};
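The LIBSVM/SVMLight format stores each sample as a label followed by sparse index:value pairs. As a rough illustration of what a loader for that format has to handle, here is a minimal std-only sketch that parses one line. `SparseRow` and `parse_libsvm_line` are hypothetical names introduced for this example; they are not the `load_libsvm` API, whose signature is not shown in this README. The sketch also ignores extensions such as `qid:` tokens.

```rust
/// One parsed row: a label plus sparse (index, value) feature pairs.
/// (Hypothetical type for illustration, not part of scirs2-datasets.)
#[derive(Debug, PartialEq)]
pub struct SparseRow {
    pub label: f64,
    pub features: Vec<(usize, f64)>,
}

/// Parse a single LIBSVM/SVMLight line such as `1 3:0.5 7:1.25`.
/// Returns `None` for blank lines, comment-only lines, or malformed input.
pub fn parse_libsvm_line(line: &str) -> Option<SparseRow> {
    // Everything after '#' is treated as a trailing comment.
    let line = line.split('#').next().unwrap_or("").trim();
    let mut tokens = line.split_whitespace();

    // The first token is the label; the rest are index:value pairs.
    let label: f64 = tokens.next()?.parse().ok()?;
    let mut features = Vec::new();
    for tok in tokens {
        let (idx, val) = tok.split_once(':')?;
        let idx: usize = idx.parse().ok()?;
        let val: f64 = val.parse().ok()?;
        features.push((idx, val));
    }
    Some(SparseRow { label, features })
}
```

Calling `parse_libsvm_line("1 3:0.5 7:1.25")` yields a row with label `1.0` and two feature pairs. A full loader would apply this per line and then densify the pairs into a matrix.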
Generators
Functions for generating synthetic datasets:
use scirs2_datasets::generators::{
    make_classification, // Generate a random n-class classification problem
    make_regression,     // Generate a random regression problem
    make_blobs,          // Generate isotropic Gaussian blobs
    make_moons,          // Generate two interleaving half circles
    make_circles,        // Generate a large circle containing a smaller circle
    make_s_curve,        // Generate an S curve dataset
    make_swiss_roll,     // Generate a swiss roll dataset
};
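To make "two interleaving half circles" concrete, the following is a minimal, noise-free sketch of that geometry using only the standard library. `two_moons` is a hypothetical helper written for this example; it is not the crate's `make_moons`, and the actual function may take different parameters (for instance noise or a seed) and return different types.

```rust
use std::f64::consts::PI;

/// Generate `n_per_class` points on each of two interleaving half circles.
/// Returns (points, labels); label 0 is the upper arc, label 1 the lower arc.
/// (Hypothetical helper for illustration, not the scirs2-datasets API.)
pub fn two_moons(n_per_class: usize) -> (Vec<[f64; 2]>, Vec<u8>) {
    let mut points = Vec::with_capacity(2 * n_per_class);
    let mut labels = Vec::with_capacity(2 * n_per_class);
    for class in 0..2u8 {
        for i in 0..n_per_class {
            // Evenly spaced angles in [0, pi]; a single point sits at angle 0.
            let t = if n_per_class > 1 {
                PI * i as f64 / (n_per_class - 1) as f64
            } else {
                0.0
            };
            let p = if class == 0 {
                // Upper arc: unit half circle centred at the origin.
                [t.cos(), t.sin()]
            } else {
                // Lower arc: mirrored and shifted so the two arcs interleave.
                [1.0 - t.cos(), 0.5 - t.sin()]
            };
            points.push(p);
            labels.push(class);
        }
    }
    (points, labels)
}
```

Adding small Gaussian noise to each coordinate would turn this into the usual benchmark for nonlinear classifiers; the sketch leaves it out to stay deterministic.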
Toy Datasets
Pre-defined datasets for testing and examples:
use scirs2_datasets::toy::{
    load_iris,          // The classic Iris dataset
    load_digits,        // Handwritten digits dataset
    load_wine,          // Wine recognition dataset
    load_boston,        // Boston house prices dataset
    load_diabetes,      // Diabetes dataset
    load_breast_cancer, // Breast cancer wisconsin dataset
};
Sampling
Utilities for data sampling:
use scirs2_datasets::sample::{
    train_test_split, // Split arrays into random train and test subsets
    stratified_split, // Split preserving the percentage of samples for each class
    bootstrap_sample, // Generate a bootstrap sample
    resample,         // Resample arrays or matrices
};
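The ideas behind a random train/test split and a bootstrap sample can be sketched at the index level with the standard library alone. Everything below is an assumption made for illustration: `SplitMix64`, `train_test_indices`, and `bootstrap_indices` are hypothetical names, not the signatures of `train_test_split` or `bootstrap_sample` in this crate, and the small generator stands in for a proper random number crate.

```rust
/// Tiny deterministic generator (SplitMix64) so the sketch needs no external crate.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Index in 0..n (n must be > 0); the slight modulo bias is ignored here.
    pub fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Shuffle 0..n with Fisher-Yates, then split off the last
/// `test_fraction` of the shuffled indices as the test set.
pub fn train_test_indices(
    n: usize,
    test_fraction: f64,
    rng: &mut SplitMix64,
) -> (Vec<usize>, Vec<usize>) {
    let mut idx: Vec<usize> = (0..n).collect();
    for i in (1..n).rev() {
        let j = rng.below(i + 1);
        idx.swap(i, j);
    }
    let n_test = (((n as f64) * test_fraction).round() as usize).min(n);
    let test = idx.split_off(n - n_test);
    (idx, test)
}

/// Bootstrap sample: draw n indices from 0..n with replacement.
pub fn bootstrap_indices(n: usize, rng: &mut SplitMix64) -> Vec<usize> {
    (0..n).map(|_| rng.below(n)).collect()
}
```

Working with indices keeps features and labels aligned: the same index vectors select rows from both arrays. A stratified split would apply the same shuffle-and-cut step within each class separately.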
Contributing
See the CONTRIBUTING.md file for contribution guidelines.
License
This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.