Releases: 0.1.1-alpha (Mar 9, 2025), 0.1.0

Feature Factory
Feature Factory is a feature engineering library for Rust built on top of Apache DataFusion. It uses DataFusion internally for fast, in-memory data processing. It is inspired by the Feature-engine Python library and provides a wide range of components (referred to as transformers) for common feature engineering tasks like imputation, encoding, discretization, and feature selection.
Feature Factory aims to be feature-rich and provide an API similar to Scikit-learn, with the performance benefits of Rust and Apache DataFusion. Feature Factory transformers follow a fit-transform paradigm, where each transformer provides a constructor, a fit method, and a transform method. Given an input dataframe, a transformer applies a transformation to the data and returns a new dataframe.
The library also provides a pipeline API that allows users to chain multiple transformers together to create data transformation pipelines for feature engineering.
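To make the fit-transform paradigm concrete, here is a minimal sketch in plain Rust. It does not use Feature Factory or DataFusion: the `MeanImputer` type, its method signatures, and the use of `Vec<Option<f64>>` as a stand-in for a dataframe column are all assumptions made for illustration, not the library's actual API.

```rust
// Illustrative only: a column is modelled as a slice of Option<f64>,
// where None marks a missing value. This is NOT Feature Factory's API.

/// Hypothetical imputer that replaces missing values with the column mean.
struct MeanImputer {
    mean: Option<f64>, // learned by `fit`, None until then
}

impl MeanImputer {
    /// Constructor: creates an unfitted imputer.
    fn new() -> Self {
        MeanImputer { mean: None }
    }

    /// Learn the mean of the observed (non-missing) values.
    fn fit(&mut self, column: &[Option<f64>]) -> Result<(), String> {
        let present: Vec<f64> = column.iter().flatten().copied().collect();
        if present.is_empty() {
            return Err("cannot fit on a column with no observed values".to_string());
        }
        self.mean = Some(present.iter().sum::<f64>() / present.len() as f64);
        Ok(())
    }

    /// Return a new column with missing values replaced by the learned mean.
    fn transform(&self, column: &[Option<f64>]) -> Result<Vec<f64>, String> {
        let mean = self.mean.ok_or("transform called before fit")?;
        Ok(column.iter().map(|v| v.unwrap_or(mean)).collect())
    }
}

fn main() {
    let column = vec![Some(1.0), None, Some(3.0)];
    let mut imputer = MeanImputer::new();
    imputer.fit(&column).unwrap();
    let filled = imputer.transform(&column).unwrap();
    println!("{:?}", filled); // prints [1.0, 2.0, 3.0]
}
```

The input is never modified: `transform` returns a new vector, mirroring the "returns a new dataframe" behaviour described above.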
[!IMPORTANT] Feature Factory is currently at an early stage of development. APIs are unstable and may change without notice. Inconsistencies in documentation are expected, and not all features have been implemented yet. It has not yet been thoroughly tested, benchmarked, or optimized for performance. Bug reports, feature requests, and contributions are welcome!
Features
- High Performance: Feature Factory uses Apache DataFusion as the backend data processing engine.
- Scikit-learn API: It provides a Scikit-learn-like API which is familiar to most data scientists.
- Pipeline API: Users can chain multiple transformers together to build a feature engineering pipeline.
- Large Set of Transformers: Currently, Feature Factory includes the following transformers:
Task | Transformers | Status |
---|---|---|
Imputation | - MeanMedianImputer: Replace missing values with the mean (or median). - ArbitraryNumberImputer: Replace missing values with an arbitrary number. - EndTailImputer: Replace missing values with values at distribution tails. - CategoricalImputer: Replace missing values with an arbitrary string or most frequent category. - AddMissingIndicator: Add a binary indicator for missing values. - DropMissingData: Remove rows with missing values. | Tested |
Categorical Encoding | - OneHotEncoder: Perform one-hot encoding. - CountFrequencyEncoder: Replace categories with their frequencies. - OrdinalEncoder: Replace categories with ordered numbers. - MeanEncoder: Replace categories with target mean. - WoEEncoder: Replace categories with the weight of evidence. - RareLabelEncoder: Group infrequent categories. | Tested |
Variable Discretization | - ArbitraryDiscretizer: Discretize based on user-defined intervals. - EqualFrequencyDiscretizer: Discretize into equal-frequency bins. - EqualWidthDiscretizer: Discretize into equal-width bins. - GeometricWidthDiscretizer: Discretize into geometric intervals. | Tested |
Outlier Handling | - ArbitraryOutlierCapper: Cap outliers at user-defined bounds. - Winsorizer: Cap outliers using percentile thresholds. - OutlierTrimmer: Remove outliers from the dataset. | Tested |
Numerical Transformations | - LogTransformer: Apply logarithmic transformation. - LogCpTransformer: Apply log transformation with a constant. - ReciprocalTransformer: Apply reciprocal transformation. - PowerTransformer: Apply power transformation. - BoxCoxTransformer: Apply Box-Cox transformation. - YeoJohnsonTransformer: Apply Yeo-Johnson transformation. - ArcsinTransformer: Apply arcsin transformation. | Tested |
Feature Creation | - MathFeatures: Create new features with mathematical operations. - RelativeFeatures: Combine features with reference features. - CyclicalFeatures: Encode cyclical features using sine or cosine. | Tested |
Datetime Features | - DatetimeFeatures: Extract features from datetime values. - DatetimeSubtraction: Compute time differences between datetime values. | Tested |
Feature Selection | - DropFeatures: Drop specific features. - DropConstantFeatures: Remove constant and quasi-constant features. - DropDuplicateFeatures: Remove duplicate features. - DropCorrelatedFeatures: Remove highly correlated features. - SmartCorrelatedSelection: Select the best features from correlated groups. - DropHighPSIFeatures: Drop features based on Population Stability Index (PSI). - SelectByInformationValue: Select features based on information value. - SelectBySingleFeaturePerformance: Select features based on univariate estimators. - SelectByTargetMeanPerformance: Select features based on target mean encoding. - MRMR: Select features using Maximum Relevance Minimum Redundancy. | Tested |
[!NOTE] Status shows whether the module is Tested (unit, integration, and documentation tests) and Benchmarked. An empty status means the module has not yet been tested or benchmarked.
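As a rough illustration of what one entry in the table does conceptually, the sketch below implements equal-width discretization in plain Rust. The `EqualWidthBins` type and its methods are hypothetical and are not the library's `EqualWidthDiscretizer`; how the library handles out-of-range values or constant columns is not stated in this document, so the clamping shown here is an assumption.

```rust
/// Hypothetical equal-width discretizer for a single numeric column.
/// Not Feature Factory's implementation; a sketch of the technique only.
struct EqualWidthBins {
    min: f64,
    width: f64,
    bins: usize,
}

impl EqualWidthBins {
    /// Learn `bins` intervals of equal width spanning [min, max] of the data.
    /// Returns None for empty input or zero bins.
    fn fit(values: &[f64], bins: usize) -> Option<Self> {
        if values.is_empty() || bins == 0 {
            return None;
        }
        let min = values.iter().copied().fold(f64::INFINITY, f64::min);
        let max = values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        Some(EqualWidthBins { min, width: (max - min) / bins as f64, bins })
    }

    /// Map each value to a bin index in 0..bins.
    /// Assumption: values outside the learned range are clamped to the edge bins.
    fn transform(&self, values: &[f64]) -> Vec<usize> {
        values
            .iter()
            .map(|&v| {
                if self.width == 0.0 {
                    return 0; // constant column: everything falls in one bin
                }
                let idx = ((v - self.min) / self.width).floor();
                idx.clamp(0.0, (self.bins - 1) as f64) as usize
            })
            .collect()
    }
}

fn main() {
    let train = [0.0, 2.5, 5.0, 7.5, 10.0];
    let disc = EqualWidthBins::fit(&train, 4).expect("non-empty data");
    println!("{:?}", disc.transform(&train)); // prints [0, 1, 2, 3, 3]
}
```

The maximum value lands in the last bin rather than opening a fifth one, which is why the final two outputs are both 3.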
Installation
cargo add feature-factory
Or add this to your Cargo.toml:
[dependencies]
feature-factory = "0.1"
Feature Factory requires Rust 1.83 or later.
Documentation
You can find the latest API documentation at docs.rs/feature-factory.
Architecture
The main building blocks of Feature Factory are transformers and pipelines.
Transformers
A transformer takes one or more columns from an input DataFrame and creates new columns based on a transformation. Transformers can be stateful or stateless:
- A stateful transformer needs to learn one or more parameters from the data during training (by calling fit) before it can transform the data. A stateful transformer with learned parameters is referred to as a fitted transformer.
- A stateless transformer can transform the data directly without needing to learn any parameters.
All transformers implement the Transformer trait, which includes:
Method | Description |
---|---|
new | Creates a new transformer instance. Can accept hyperparameters and column names as input arguments. |
fit | Learns parameters from data. For stateless transformers this is a no-op. |
transform | Applies the transformation to data. Stateful transformers require calling fit first. |
is_stateful | Returns true if the transformer is stateful, otherwise false. |
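The difference between stateful and stateless transformers can be sketched with a small trait in plain Rust. Everything here is hypothetical: the `ColumnTransformer` trait, its signatures over `&[f64]`, and the `LogTransform` and `RangeCapper` types are stand-ins chosen for illustration. The real Transformer trait operates on DataFusion dataframes, and its exact signatures should be taken from the API documentation.

```rust
/// Hypothetical trait mirroring the fit / transform / is_stateful methods.
/// The real trait works on dataframes; this sketch uses slices of f64.
trait ColumnTransformer {
    fn fit(&mut self, data: &[f64]) -> Result<(), String>;
    fn transform(&self, data: &[f64]) -> Result<Vec<f64>, String>;
    fn is_stateful(&self) -> bool;
}

/// Stateless: natural logarithm, nothing to learn.
struct LogTransform;

impl LogTransform {
    fn new() -> Self {
        LogTransform
    }
}

impl ColumnTransformer for LogTransform {
    fn fit(&mut self, _data: &[f64]) -> Result<(), String> {
        Ok(()) // no-op for a stateless transformer
    }
    fn transform(&self, data: &[f64]) -> Result<Vec<f64>, String> {
        if data.iter().any(|&v| v <= 0.0) {
            return Err("log requires strictly positive values".to_string());
        }
        Ok(data.iter().map(|v| v.ln()).collect())
    }
    fn is_stateful(&self) -> bool {
        false
    }
}

/// Stateful: learns the min and max seen during fit, then caps new values.
struct RangeCapper {
    bounds: Option<(f64, f64)>,
}

impl RangeCapper {
    fn new() -> Self {
        RangeCapper { bounds: None }
    }
}

impl ColumnTransformer for RangeCapper {
    fn fit(&mut self, data: &[f64]) -> Result<(), String> {
        if data.is_empty() {
            return Err("cannot fit on empty data".to_string());
        }
        let lo = data.iter().copied().fold(f64::INFINITY, f64::min);
        let hi = data.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        self.bounds = Some((lo, hi));
        Ok(())
    }
    fn transform(&self, data: &[f64]) -> Result<Vec<f64>, String> {
        let (lo, hi) = self.bounds.ok_or("RangeCapper must be fitted first")?;
        Ok(data.iter().map(|v| v.max(lo).min(hi)).collect())
    }
    fn is_stateful(&self) -> bool {
        true
    }
}

fn main() {
    let log = LogTransform::new();
    println!("{:?}", log.transform(&[1.0]).unwrap()); // prints [0.0]

    let mut capper = RangeCapper::new();
    capper.fit(&[1.0, 5.0]).unwrap();
    println!("{:?}", capper.transform(&[0.0, 3.0, 9.0]).unwrap()); // prints [1.0, 3.0, 5.0]
}
```

Returning an error from `transform` on an unfitted stateful transformer is one possible design; whether the library reports this case the same way is not stated here.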
[Figure: high-level overview of how a single Feature Factory transformer works]
[!IMPORTANT] In most cases, to avoid data leakage, the data used for training a transformer must not be the same as the data that is going to be transformed.
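The leakage concern can be shown with a small, library-independent sketch. The helper functions `observed`, `mean`, and `impute` are hypothetical and exist only for this example. The fill value for mean imputation is learned from the training split alone and then applied to the test split; computing it over both splits would let test values influence the learned parameter.

```rust
/// Collect the non-missing values of a column.
fn observed(column: &[Option<f64>]) -> Vec<f64> {
    column.iter().flatten().copied().collect()
}

/// Arithmetic mean; returns NaN for an empty slice.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

/// Replace missing values with `fill`.
fn impute(column: &[Option<f64>], fill: f64) -> Vec<f64> {
    column.iter().map(|v| v.unwrap_or(fill)).collect()
}

fn main() {
    let train = vec![Some(1.0), Some(3.0), None];
    let test = vec![Some(100.0), None];

    // Sound: the parameter is learned from the training split only.
    let fill = mean(&observed(&train));
    println!("fill = {fill}, test = {:?}", impute(&test, fill));

    // Leaky: the parameter also sees the test split, so the extreme
    // test value 100.0 shifts the fill value used everywhere.
    let all: Vec<Option<f64>> = train.iter().chain(test.iter()).copied().collect();
    let leaky_fill = mean(&observed(&all));
    println!("leaky fill = {leaky_fill}");
}
```

With these numbers the training-only fill value is 2.0, while the leaky one is pulled well above it by the single large test observation.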
Pipelines
A pipeline chains multiple transformers together. Pipelines are created using the make_pipeline macro, which accepts a list of (name, transformer) tuples.
Stateful transformers must be fitted before they're used in a pipeline.
[Figure: high-level overview of how a Feature Factory pipeline works]
[!IMPORTANT] Currently, a stateful transformer must already be fitted before it can be used in a pipeline.
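The idea of chaining named, already-fitted steps can be sketched without the library. The code below does not use the make_pipeline macro: the `FittedStep` trait, the `Pipeline` struct, its `with_step` builder, and the `CapAt` and `Square` steps are all hypothetical names introduced for this illustration, and data is modelled as a `Vec<f64>` rather than a dataframe.

```rust
/// Hypothetical stand-in for a transformer that is already fitted
/// (or stateless) and can therefore be applied directly.
trait FittedStep {
    fn transform(&self, data: Vec<f64>) -> Vec<f64>;
}

/// Caps values at a fixed upper bound (assumed to have been learned earlier).
struct CapAt {
    upper: f64,
}

impl FittedStep for CapAt {
    fn transform(&self, data: Vec<f64>) -> Vec<f64> {
        data.into_iter().map(|v| v.min(self.upper)).collect()
    }
}

/// Stateless power transformation with exponent 2.
struct Square;

impl FittedStep for Square {
    fn transform(&self, data: Vec<f64>) -> Vec<f64> {
        data.into_iter().map(|v| v * v).collect()
    }
}

/// An ordered list of (name, step) pairs applied one after another.
struct Pipeline {
    steps: Vec<(String, Box<dyn FittedStep>)>,
}

impl Pipeline {
    fn new() -> Self {
        Pipeline { steps: Vec::new() }
    }

    /// Append a named step and return the pipeline for chaining.
    fn with_step(mut self, name: &str, step: Box<dyn FittedStep>) -> Self {
        self.steps.push((name.to_string(), step));
        self
    }

    /// Names of the steps in application order.
    fn step_names(&self) -> Vec<&str> {
        self.steps.iter().map(|(name, _)| name.as_str()).collect()
    }

    /// Feed the output of each step into the next one.
    fn transform(&self, data: Vec<f64>) -> Vec<f64> {
        self.steps
            .iter()
            .fold(data, |acc, (_name, step)| step.transform(acc))
    }
}

fn main() {
    let pipeline = Pipeline::new()
        .with_step("cap", Box::new(CapAt { upper: 10.0 }))
        .with_step("square", Box::new(Square));
    println!("{:?}", pipeline.step_names()); // prints ["cap", "square"]
    println!("{:?}", pipeline.transform(vec![2.0, 50.0])); // prints [4.0, 100.0]
}
```

Step order matters: squaring first and capping second would turn the same input into [4.0, 10.0].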
Examples
Check out the examples and tests directories for examples of how to use Feature Factory.
Contributing
See CONTRIBUTING.md for details on how to make a contribution.
Logo
The mascot of this project is named "Weldon the Penguin". He is a Rustacean penguin who loves to swim in the sea and play video games—and is always ready to help you with your data.
The logo was created using GIMP, ComfyUI, and a Flux Schnell v2 model.
Licensing
Feature Factory is available under the terms of either of these licenses:
- MIT License (LICENSE-MIT)
- Apache License, Version 2.0 (LICENSE-APACHE)