Composite Estimators using Pipeline & FeatureUnions

In machine learning workflows, data often requires various preprocessing steps before it can be fed into a model. Composite estimators, such as Pipelines and FeatureUnions, provide a way to combine these preprocessing steps with the model training process. This blog post will explore the concepts of composite estimators and demonstrate their usage in scikit-learn (version 0.20).

Agenda

Introduction to Composite Estimators
Pipelines
TransformedTargetRegressor
FeatureUnions
ColumnTransformer
GridSearch for Pipelines

1. Introduction to Composite Estimators

Composite estimators combine one or more transformers with an estimator, creating a single unified model. This approach offers several benefits, such as code reusability and modularity. In scikit-learn, a Pipeline chains transformers and a final estimator in sequence, while a FeatureUnion concatenates the outputs of several transformers to create derived features.

2. Pipelines

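Pipelines automate the workflow of applying a series of transformations to the data before fitting the estimator. Every intermediate step must be a transformer, i.e., implement both the fit() and transform() methods; only the final step can be an arbitrary estimator. Calling fit() on the pipeline fits each transformer in turn and passes its output to the next step. Once trained, the same pipeline applies the same transformations when making predictions.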

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

pipeline = make_pipeline(
    CountVectorizer(stop_words='english'),
    TfidfTransformer(),
    LogisticRegression()
)

# trainX and testX are iterables of raw text documents; trainY holds the labels.
pipeline.fit(trainX, trainY)
predictions = pipeline.predict(testX)
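
To make the pipeline above concrete, here is a minimal self-contained sketch. The toy corpus and labels are hypothetical stand-ins for trainX and trainY, and the variable name text_clf is chosen only for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for trainX / trainY.
train_texts = [
    "great movie, loved the acting",
    "wonderful plot and great cast",
    "terrible film, boring plot",
    "awful acting and a boring script",
]
train_labels = [1, 1, 0, 0]

text_clf = make_pipeline(
    CountVectorizer(stop_words='english'),
    TfidfTransformer(),
    LogisticRegression(),
)

# fit() runs fit_transform on each transformer, then fits the classifier.
text_clf.fit(train_texts, train_labels)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in text_clf.steps]
print(step_names)

# predict() pushes raw text through the fitted transformers before classifying.
predictions = text_clf.predict(["boring and terrible", "great cast"])
print(predictions)
```

The automatically generated step names matter later: they are the prefixes used to address parameters when tuning a pipeline.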

3. TransformedTargetRegressor

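In regression tasks, it is sometimes beneficial to transform the target variable, for example to make it closer to normally distributed. TransformedTargetRegressor automates this: it transforms the target before fitting the regressor and applies the inverse transform to the predictions, so predictions and error metrics are expressed in the original units of the target.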

from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression

regr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=QuantileTransformer(output_distribution='normal')
)

regr.fit(X_train, y_train)
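
As a rough illustration of the effect, the sketch below uses synthetic data (an assumption made for this example, not data from the post) in which the target grows exponentially with the feature. Instead of a transformer object it passes the func and inverse_func arguments, which TransformedTargetRegressor also accepts.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: y is exponential in X, so log(y) is linear in X.
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.exp(1.5 * X[:, 0] + 0.1 * rng.normal(size=200))

log_regr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,          # applied to y before fitting
    inverse_func=np.exp,  # applied to predictions before they are returned
)
log_regr.fit(X, y)

# Baseline: the same regressor fitted on the untransformed target.
plain = LinearRegression().fit(X, y)

# Both scores are R^2 computed against y in its original units.
print(log_regr.score(X, y), plain.score(X, y))

# The fitted inner regressor is exposed as regressor_; it was trained on log(y).
print(log_regr.regressor_.coef_)
```

Because the inverse transform is applied inside predict(), the two scores are directly comparable even though the models were fitted on different target scales.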

4. FeatureUnions

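FeatureUnions combine several transformer objects into a single transformer. During fitting, each transformer is fit independently on the same input, and during transform, the outputs are concatenated side by side into one feature matrix. ColumnTransformer, used in the example below, follows the same idea but sends a different subset of columns to each transformer.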

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

num_cols = ['Age', 'Fare']
cat_cols = ['Embarked', 'Sex', 'Pclass']

pipeline_num = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaling', StandardScaler())
])

pipeline_cat = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoding', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', pipeline_num, num_cols),
        ('cat', pipeline_cat, cat_cols)
    ]
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=10))
])
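
The snippet above routes columns through ColumnTransformer. For FeatureUnion on its own, a minimal sketch could look like the following; the choice of the iris dataset, PCA, and SelectKBest is an assumption made purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Both transformers receive the full feature matrix X.
combined = FeatureUnion([
    ('pca', PCA(n_components=2)),   # contributes 2 derived columns
    ('kbest', SelectKBest(k=1)),    # contributes 1 selected original column
])

# Outputs are stacked column-wise: 2 PCA components + 1 selected feature.
X_combined = combined.fit_transform(X, y)
print(X_combined.shape)  # (150, 3)
```

A FeatureUnion is itself a transformer, so it can be used as a step inside a Pipeline.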

5. ColumnTransformer (experimental in version 0.20)

The ColumnTransformer allows for mapping different preprocessing steps to different columns in the dataset, providing an easy way to apply specific transformations to specific columns.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(), cat_cols)
    ]
)
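
To see what the transformer produces, here is a small sketch on a hand-made DataFrame. The rows are invented for illustration and only loosely mimic the Titanic columns used earlier; the Name column is included to show what happens to columns that are not listed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical rows, made up for this example.
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', 'Q'],
})

ct = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age', 'Fare']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex', 'Embarked']),
    ]
)

# Columns not named in any transformer (here: Name) are dropped by default.
Xt = ct.fit_transform(df)

# 2 scaled numeric columns + 2 one-hot columns for Sex + 3 for Embarked.
print(Xt.shape)  # (4, 7)
```

Passing remainder='passthrough' instead would keep the unlisted columns unchanged in the output.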

6. GridSearch for Pipelines

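GridSearchCV can be used to search over specified parameter values for both transformers and estimators in a pipeline. This allows for fine-tuning the entire pipeline, including preprocessing steps and the model itself. Parameters of a step are addressed as <step name>__<parameter name>, and the double underscores chain through nested objects, so 'preprocessor__num__imputer__strategy' refers to the strategy of the imputer inside the num pipeline inside the preprocessor step.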

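from sklearn.model_selection import GridSearchCV

# `pipeline` is the preprocessor + classifier pipeline built in section 4.
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [10, 15, 20],
}

# iid is left at its default; the argument was deprecated in 0.22 and removed in 0.24.
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(trainX, trainY)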
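
For a version that runs without the Titanic data, the sketch below tunes a small pipeline on the iris dataset. The dataset, the step names scaler, pca, and clf, and the grid values are all assumptions chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Keys follow the <step name>__<parameter name> convention.
param_grid = {
    'pca__n_components': [2, 3],
    'clf__C': [0.1, 1.0, 10.0],
}

# Every combination (2 x 3 = 6) is refit from scratch on each of the 5 folds,
# so the scaler and PCA never see the held-out fold during fitting.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search.best_estimator_ holds the whole pipeline refitted on all the data with the best parameter combination, so it can be used directly for prediction.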

By using composite estimators like Pipelines and FeatureUnions, along with tools like GridSearchCV, you can create powerful and flexible machine learning workflows that automate and streamline the process of building and evaluating models.