feature-factory

A high-performance feature engineering library for Rust powered by Apache DataFusion

1 unstable release

new 0.1.1-alpha Mar 9, 2025
0.1.0 Mar 6, 2025

MIT/Apache

Feature Factory

Feature Factory is a feature engineering library for Rust built on top of Apache DataFusion. It uses DataFusion internally for fast, in-memory data processing. It is inspired by the Feature-engine Python library and provides a wide range of components (referred to as transformers) for common feature engineering tasks like imputation, encoding, discretization, and feature selection.

Feature Factory aims to be feature-rich and provide an API similar to Scikit-learn, with the performance benefits of Rust and Apache DataFusion. Feature Factory transformers follow a fit-transform paradigm, where each transformer provides a constructor, a fit method, and a transform method. Given an input dataframe, a transformer applies a transformation to the data and returns a new dataframe. The library also provides a pipeline API that allows users to chain multiple transformers together to create data transformation pipelines for feature engineering.
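The fit-transform paradigm can be illustrated without the crate itself. The sketch below is a minimal, hypothetical mean imputer over a plain `Vec<Option<f64>>` column; the `MeanImputer` type and its signatures are assumptions made for illustration and are not the Feature Factory API, which operates on DataFusion dataframes.

```rust
/// Hypothetical mean imputer over a single numeric column.
/// `None` stands for a missing value. This is NOT the crate's API.
#[derive(Debug, Default)]
struct MeanImputer {
    /// Learned during `fit`; stays `None` until the imputer is fitted.
    mean: Option<f64>,
}

impl MeanImputer {
    /// Constructor: no parameters are learned yet.
    fn new() -> Self {
        Self { mean: None }
    }

    /// Learn the mean of the observed (non-missing) values.
    fn fit(&mut self, column: &[Option<f64>]) -> Result<(), String> {
        let present: Vec<f64> = column.iter().flatten().copied().collect();
        if present.is_empty() {
            return Err("cannot fit on a column with no observed values".to_string());
        }
        self.mean = Some(present.iter().sum::<f64>() / present.len() as f64);
        Ok(())
    }

    /// Return a new column with missing values replaced by the learned mean.
    /// The input is left untouched, mirroring "returns a new dataframe".
    fn transform(&self, column: &[Option<f64>]) -> Result<Vec<f64>, String> {
        let mean = self.mean.ok_or("transform called before fit")?;
        Ok(column.iter().map(|v| v.unwrap_or(mean)).collect())
    }
}
```

Fitting on `[Some(1.0), None, Some(3.0)]` learns a mean of 2.0, so transforming `[None, Some(5.0)]` afterwards yields `[2.0, 5.0]`. Calling `transform` before `fit` returns an error in this sketch.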

[!IMPORTANT] Feature Factory is currently in an early stage of development. APIs are unstable and may change without notice. Inconsistencies in documentation are expected, and not all features have been implemented yet. It has not yet been thoroughly tested, benchmarked, or optimized for performance. Bug reports, feature requests, and contributions are welcome!

Features

  • High Performance: Feature Factory uses Apache DataFusion as the backend data processing engine.
  • Scikit-learn API: It provides a Scikit-learn-like API which is familiar to most data scientists.
  • Pipeline API: Users can chain multiple transformers together to build a feature engineering pipeline.
  • Large Set of Transformers: Currently, Feature Factory includes the following transformers:
| Task | Transformers | Status |
|------|--------------|--------|
| Imputation | MeanMedianImputer: Replace missing values with the mean (or median). | Tested |
| Imputation | ArbitraryNumberImputer: Replace missing values with an arbitrary number. | Tested |
| Imputation | EndTailImputer: Replace missing values with values at distribution tails. | Tested |
| Imputation | CategoricalImputer: Replace missing values with an arbitrary string or most frequent category. | Tested |
| Imputation | AddMissingIndicator: Add a binary indicator for missing values. | Tested |
| Imputation | DropMissingData: Remove rows with missing values. | Tested |
| Categorical Encoding | OneHotEncoder: Perform one-hot encoding. | Tested |
| Categorical Encoding | CountFrequencyEncoder: Replace categories with their frequencies. | Tested |
| Categorical Encoding | OrdinalEncoder: Replace categories with ordered numbers. | Tested |
| Categorical Encoding | MeanEncoder: Replace categories with target mean. | Tested |
| Categorical Encoding | WoEEncoder: Replace categories with the weight of evidence. | Tested |
| Categorical Encoding | RareLabelEncoder: Group infrequent categories. | Tested |
| Variable Discretization | ArbitraryDiscretizer: Discretize based on user-defined intervals. | Tested |
| Variable Discretization | EqualFrequencyDiscretizer: Discretize into equal-frequency bins. | Tested |
| Variable Discretization | EqualWidthDiscretizer: Discretize into equal-width bins. | Tested |
| Variable Discretization | GeometricWidthDiscretizer: Discretize into geometric intervals. | Tested |
| Outlier Handling | ArbitraryOutlierCapper: Cap outliers at user-defined bounds. | Tested |
| Outlier Handling | Winsorizer: Cap outliers using percentile thresholds. | Tested |
| Outlier Handling | OutlierTrimmer: Remove outliers from the dataset. | Tested |
| Numerical Transformations | LogTransformer: Apply logarithmic transformation. | Tested |
| Numerical Transformations | LogCpTransformer: Apply log transformation with a constant. | Tested |
| Numerical Transformations | ReciprocalTransformer: Apply reciprocal transformation. | Tested |
| Numerical Transformations | PowerTransformer: Apply power transformation. | Tested |
| Numerical Transformations | BoxCoxTransformer: Apply Box-Cox transformation. | Tested |
| Numerical Transformations | YeoJohnsonTransformer: Apply Yeo-Johnson transformation. | Tested |
| Numerical Transformations | ArcsinTransformer: Apply arcsin transformation. | Tested |
| Feature Creation | MathFeatures: Create new features with mathematical operations. | Tested |
| Feature Creation | RelativeFeatures: Combine features with reference features. | Tested |
| Feature Creation | CyclicalFeatures: Encode cyclical features using sine or cosine. | Tested |
| Datetime Features | DatetimeFeatures: Extract features from datetime values. | Tested |
| Datetime Features | DatetimeSubtraction: Compute time differences between datetime values. | Tested |
| Feature Selection | DropFeatures: Drop specific features. | Tested |
| Feature Selection | DropConstantFeatures: Remove constant and quasi-constant features. | Tested |
| Feature Selection | DropDuplicateFeatures: Remove duplicate features. | Tested |
| Feature Selection | DropCorrelatedFeatures: Remove highly correlated features. | Tested |
| Feature Selection | SmartCorrelatedSelection: Select the best features from correlated groups. | Tested |
| Feature Selection | DropHighPSIFeatures: Drop features based on Population Stability Index (PSI). | Tested |
| Feature Selection | SelectByInformationValue: Select features based on information value. | Tested |
| Feature Selection | SelectBySingleFeaturePerformance: Select features based on univariate estimators. | Tested |
| Feature Selection | SelectByTargetMeanPerformance: Select features based on target mean encoding. | Tested |
| Feature Selection | MRMR: Select features using Maximum Relevance Minimum Redundancy. | Tested |
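To make two of the listed tasks concrete, here is a rough sketch of the arithmetic behind equal-width discretization and cyclical encoding. The function names, the clamping of out-of-range values, and the use of an explicit `period` argument are assumptions for illustration; the crate's own transformers may define bin edges and periods differently.

```rust
use std::f64::consts::PI;

/// Equal-width binning: split [min, max] into `bins` intervals of the same
/// width and return the zero-based bin index of `x`.
/// (Hypothetical helper, not part of the crate.)
fn equal_width_bin(x: f64, min: f64, max: f64, bins: usize) -> usize {
    assert!(bins > 0 && max > min);
    let width = (max - min) / bins as f64;
    let idx = ((x - min) / width).floor();
    // Clamp so that out-of-range values and x == max land in the edge bins.
    idx.clamp(0.0, (bins - 1) as f64) as usize
}

/// Cyclical encoding: map a value with the given period (e.g. 24 for hours)
/// onto the unit circle and return its (sine, cosine) pair.
/// (Hypothetical helper, not part of the crate.)
fn cyclical(x: f64, period: f64) -> (f64, f64) {
    let angle = 2.0 * PI * x / period;
    (angle.sin(), angle.cos())
}
```

With five bins over the range 0 to 10, each bin is 2 wide, so the value 2.0 falls into bin 1 and the maximum 10.0 is clamped into the last bin, 4. For hours of the day, `cyclical(0.0, 24.0)` and `cyclical(24.0, 24.0)` map to nearly the same point, which is the property that makes the encoding useful for periodic features.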

[!NOTE] Status shows whether the module is Tested (unit, integration, and documentation tests) and Benchmarked. Empty status means the module has not yet been tested and benchmarked.

Installation

cargo add feature-factory

Or add this to your Cargo.toml:

[dependencies]
feature-factory = "0.1"

Feature Factory requires Rust 1.83 or later.

Documentation

You can find the latest API documentation at docs.rs/feature-factory.

Architecture

The main building blocks of Feature Factory are transformers and pipelines.

Transformers

A transformer takes one or more columns from an input DataFrame and creates new columns based on a transformation. Transformers can be stateful or stateless:

  • A stateful transformer needs to learn one or more parameters from the data during training (via calling fit) before it can transform the data. A stateful transformer with learned parameters is referred to as a fitted transformer.
  • A stateless transformer can directly transform the data without needing to learn any parameters.

All transformers implement the Transformer trait, which includes:

| Method | Description |
|--------|-------------|
| new | Creates a new transformer instance. Can accept hyperparameters and column names as input arguments. |
| fit | Learns parameters from data. For stateless transformers this is a no-op. |
| transform | Applies the transformation to data. Stateful transformers require calling fit first. |
| is_stateful | Returns true if the transformer is stateful, otherwise false. |
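The difference between stateful and stateless transformers can be sketched with a small trait. The `ColumnTransformer` trait below is a hypothetical stand-in that works on a plain `&[f64]` column; the real `Transformer` trait operates on DataFusion dataframes and its exact signatures are not reproduced here. `LogTransform` and `RangeCapper` are likewise illustrative names only.

```rust
/// Hypothetical stand-in for a transformer trait over one numeric column.
trait ColumnTransformer {
    fn fit(&mut self, column: &[f64]) -> Result<(), String>;
    fn transform(&self, column: &[f64]) -> Result<Vec<f64>, String>;
    fn is_stateful(&self) -> bool;
}

/// Stateless: the natural logarithm needs no learned parameters.
struct LogTransform;

impl ColumnTransformer for LogTransform {
    fn fit(&mut self, _column: &[f64]) -> Result<(), String> {
        Ok(()) // no-op: nothing to learn
    }
    fn transform(&self, column: &[f64]) -> Result<Vec<f64>, String> {
        if column.iter().any(|&x| x <= 0.0) {
            return Err("log is undefined for non-positive values".to_string());
        }
        Ok(column.iter().map(|x| x.ln()).collect())
    }
    fn is_stateful(&self) -> bool {
        false
    }
}

/// Stateful: caps values at the minimum and maximum seen during `fit`.
struct RangeCapper {
    bounds: Option<(f64, f64)>,
}

impl RangeCapper {
    fn new() -> Self {
        Self { bounds: None }
    }
}

impl ColumnTransformer for RangeCapper {
    fn fit(&mut self, column: &[f64]) -> Result<(), String> {
        if column.is_empty() || column.iter().any(|x| x.is_nan()) {
            return Err("fit needs a non-empty column without NaN".to_string());
        }
        let lo = column.iter().copied().fold(f64::INFINITY, f64::min);
        let hi = column.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        self.bounds = Some((lo, hi));
        Ok(())
    }
    fn transform(&self, column: &[f64]) -> Result<Vec<f64>, String> {
        // A stateful transformer refuses to run until it has been fitted.
        let (lo, hi) = self.bounds.ok_or("transform called before fit")?;
        Ok(column.iter().map(|x| x.clamp(lo, hi)).collect())
    }
    fn is_stateful(&self) -> bool {
        true
    }
}
```

In this sketch, `LogTransform` can transform data immediately, while `RangeCapper` returns an error until `fit` has stored its bounds. After fitting on `[1.0, 5.0, 3.0]`, transforming `[0.0, 4.0, 9.0]` gives `[1.0, 4.0, 5.0]`.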

The figure below shows a high-level overview of how a single Feature Factory transformer works:

[Figure: Feature Factory Transformer]

[!IMPORTANT] In most cases, to avoid data leakage, the data used for training a transformer must not be the same as the data that is going to be transformed.
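Target-based encoders are a common place where leakage happens, so they make a useful illustration. The sketch below fits a hypothetical target-mean encoder on training rows only and then applies it to held-out rows whose targets it never sees. The `TargetMeanEncoder` type, its constructor-style `fit`, and the fallback to the global training mean for unseen categories are all assumptions for illustration, not the crate's `MeanEncoder` behavior.

```rust
use std::collections::HashMap;

/// Hypothetical target-mean encoder: replaces each category with the mean
/// of the target observed for that category in the TRAINING data.
struct TargetMeanEncoder {
    means: HashMap<String, f64>,
    /// Fallback for categories that were not seen during fitting (assumption).
    global_mean: f64,
}

impl TargetMeanEncoder {
    /// Learn per-category target means from training rows only.
    fn fit(categories: &[&str], target: &[f64]) -> Self {
        assert_eq!(categories.len(), target.len());
        assert!(!target.is_empty());
        // Accumulate (sum, count) of the target per category.
        let mut sums: HashMap<String, (f64, usize)> = HashMap::new();
        for (c, y) in categories.iter().zip(target) {
            let entry = sums.entry(c.to_string()).or_insert((0.0, 0));
            entry.0 += y;
            entry.1 += 1;
        }
        let global_mean = target.iter().sum::<f64>() / target.len() as f64;
        let means = sums
            .into_iter()
            .map(|(c, (sum, n))| (c, sum / n as f64))
            .collect();
        Self { means, global_mean }
    }

    /// Encode new rows. No target is passed in, so held-out targets
    /// cannot influence the encoding.
    fn transform(&self, categories: &[&str]) -> Vec<f64> {
        categories
            .iter()
            .map(|c| self.means.get(*c).copied().unwrap_or(self.global_mean))
            .collect()
    }
}
```

Fitting on training categories `["a", "a", "b"]` with targets `[1.0, 0.0, 1.0]` learns 0.5 for `a` and 1.0 for `b`. Held-out rows are then encoded from those learned values alone, which is the separation the note above asks for.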

Pipelines

A pipeline chains multiple transformers together. Pipelines are created using the make_pipeline macro, which accepts a list of (name, transformer) tuples. Stateful transformers must be fitted before they're used in a pipeline.

The figure below shows a high-level overview of how a Feature Factory pipeline works:

[Figure: Feature Factory Pipeline]

[!IMPORTANT] Currently, to use a stateful transformer in a pipeline, it must be already fitted.
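The chaining idea can be sketched independently of the crate. In the example below, plain closures stand in for already-fitted transformers, and a hypothetical `Pipeline` struct applies named steps in order. The `Step`, `step`, `Pipeline`, and `example_pipeline` names are assumptions for illustration; the `make_pipeline` macro itself and its exact syntax are not reproduced here.

```rust
/// A named, already-fitted step: a function from a column to a new column.
/// (Hypothetical; closures stand in for fitted transformers.)
type Step = (&'static str, Box<dyn Fn(Vec<f64>) -> Vec<f64>>);

/// Helper that boxes a closure into a named step.
fn step(name: &'static str, f: impl Fn(Vec<f64>) -> Vec<f64> + 'static) -> Step {
    (name, Box::new(f))
}

/// Hypothetical pipeline: applies its steps in insertion order.
struct Pipeline {
    steps: Vec<Step>,
}

impl Pipeline {
    fn new(steps: Vec<Step>) -> Self {
        Self { steps }
    }

    /// Feed the output of each step into the next one.
    fn transform(&self, column: Vec<f64>) -> Vec<f64> {
        self.steps
            .iter()
            .fold(column, |data, (_name, apply)| apply(data))
    }

    fn step_names(&self) -> Vec<&'static str> {
        self.steps.iter().map(|(name, _)| *name).collect()
    }
}

/// Example: cap values at an upper bound that is assumed to have been
/// learned earlier (here simply hard-coded), then apply ln(1 + x).
fn example_pipeline() -> Pipeline {
    let learned_upper = 10.0; // stands in for a parameter learned during fit
    Pipeline::new(vec![
        step("cap", move |col: Vec<f64>| {
            col.into_iter().map(|x| x.min(learned_upper)).collect()
        }),
        step("log1p", |col: Vec<f64>| {
            col.into_iter().map(|x| x.ln_1p()).collect()
        }),
    ])
}
```

Because the capping step already carries its learned bound when the pipeline is built, the pipeline only needs a `transform` method. This mirrors the requirement above that stateful transformers be fitted before they are placed in a pipeline.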

Examples

Check out the examples and tests directories for examples of how to use Feature Factory.

Contributing

See CONTRIBUTING.md for details on how to make a contribution.

The mascot of this project is named "Weldon the Penguin". He is a Rustacean penguin who loves to swim in the sea and play video games, and he is always ready to help you with your data.

The logo was created using Gimp, ComfyUI, and a Flux Schnell v2 model.

Licensing

Feature Factory is available under the terms of either the MIT license or the Apache license.
