In machine learning workflows, data often requires various preprocessing steps before it can be fed into a model. Composite estimators, such as Pipelines and FeatureUnions, provide a way to combine these preprocessing steps with the model training process. This blog post will explore the concepts of composite estimators and demonstrate their usage in scikit-learn (version 0.20).
Agenda
- Introduction to Composite Estimators
- Pipelines
- TransformedTargetRegressor
- FeatureUnions
- ColumnTransformer
- GridSearch for Pipelines
1. Introduction to Composite Estimators
Composite estimators combine one or more transformers with a final estimator into a single object that is fit and used like any other model. This keeps the preprocessing and the model together, which makes the code reusable and modular. In scikit-learn, transformers and estimators are chained with Pipelines, while FeatureUnions concatenate the outputs of several transformers to create derived features.
2. Pipelines
Pipelines automate the workflow of applying a series of transformations to the data before fitting the final estimator. Every intermediate step must be a transformer, i.e., it must implement both the fit() and transform() methods; only the last step can be a plain estimator. Once trained, the same pipeline applies the fitted transformations to new data and makes predictions.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Steps run in order: token counts -> TF-IDF weights -> classifier
pipeline = make_pipeline(
    CountVectorizer(stop_words='english'),
    TfidfTransformer(),
    LogisticRegression()
)

# trainX/testX are collections of raw text documents, trainY the labels
pipeline.fit(trainX, trainY)
pipeline.predict(testX)
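To make the chaining concrete, here is a minimal sketch that compares a pipeline with the same steps applied by hand. The toy corpus, labels, and variable names are made up for illustration, and stop-word removal is left out so the tiny vocabulary stays intact.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not from the post): 1 = positive, 0 = negative
train_docs = [
    "good movie", "great film", "great acting good plot",
    "bad movie", "terrible film", "bad acting terrible plot",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = ["good film", "terrible movie"]

# Pipeline version: one fit call, one predict call
pipe = make_pipeline(CountVectorizer(), TfidfTransformer(), LogisticRegression())
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(test_docs)

# Manual version: fit_transform on each transformer, then fit the classifier
vec, tfidf, clf = CountVectorizer(), TfidfTransformer(), LogisticRegression()
train_features = tfidf.fit_transform(vec.fit_transform(train_docs))
clf.fit(train_features, train_labels)
# At prediction time only transform() is called on the fitted transformers
manual_pred = clf.predict(tfidf.transform(vec.transform(test_docs)))

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe.steps]
print(step_names)
print(list(pipe_pred) == list(manual_pred))
```

Both versions should give the same predictions; the pipeline simply removes the bookkeeping of calling fit_transform() during training and transform() during prediction in the right order.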
3. TransformedTargetRegressor
In regression tasks, it’s sometimes beneficial to transform the target variable to make it more normally distributed. The TransformedTargetRegressor automates this process: it fits the regressor on the transformed target and maps predictions back to the original scale with the inverse transformation, so errors can be computed in the original units.
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression

# The transformer is applied to y only; X is passed to the regressor unchanged
regr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=QuantileTransformer(output_distribution='normal')
)

# X_train and y_train are assumed to be defined already
regr.fit(X_train, y_train)
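The remapping of predictions can be checked by hand. The sketch below swaps the QuantileTransformer for a plain log/exp pair, passed through the func and inverse_func parameters, because that makes the manual comparison easy to follow. The synthetic data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (illustrative only): log(y) is linear in X
rng = np.random.RandomState(0)
X = rng.uniform(0, 2, size=(200, 1))
y = np.exp(1.5 * X[:, 0] + 0.5)

# Wrapped version: the regressor is trained on log(y),
# and predict() applies exp() to return values on the original scale
regr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
regr.fit(X, y)
wrapped_pred = regr.predict(X)

# Manual version of the same two steps
manual = LinearRegression().fit(X, np.log(y))
manual_pred = np.exp(manual.predict(X))

print(np.allclose(wrapped_pred, manual_pred))
```

The fitted inner model is available as regr.regressor_, which is useful when you want to inspect coefficients on the transformed scale.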
4. FeatureUnions
FeatureUnions combine several transformer objects into a single transformer. During fitting, each transformer is fit independently on the same input, and during transform, their outputs are concatenated side by side into one feature matrix.
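from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

# Note: this listing combines the per-type pipelines with ColumnTransformer,
# which concatenates outputs like a FeatureUnion but works on column subsets
num_cols = ['Age', 'Fare']
cat_cols = ['Embarked', 'Sex', 'Pclass']

# Numeric columns: fill missing values with the median, then standardize
pipeline_num = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaling', StandardScaler())
])

# Categorical columns: fill missing values with a constant, then one-hot encode
pipeline_cat = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoding', OneHotEncoder(handle_unknown='ignore'))
])

# Route each group of columns to its own pipeline and concatenate the results
preprocessor = ColumnTransformer(
    transformers=[
        ('num', pipeline_num, num_cols),
        ('cat', pipeline_cat, cat_cols)
    ]
)

# Full model: preprocessing followed by the classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=10))
])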
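The listing above relies on ColumnTransformer, so here is a minimal sketch of a FeatureUnion itself. The choice of the iris dataset, PCA, and SelectKBest is an assumption made for illustration; any transformers that accept the same input would work.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Both transformers receive the full input X
union = FeatureUnion([
    ("pca", PCA(n_components=2)),    # 2 derived features
    ("kbest", SelectKBest(k=1)),     # 1 original feature, chosen using y
])

# Outputs are stacked column-wise: 2 PCA columns followed by 1 selected column
combined = union.fit_transform(X, y)
print(combined.shape)
```

Because a FeatureUnion is itself a transformer, it can be used as a step inside a Pipeline in the same way as the preprocessor above.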
5. ColumnTransformer (Beta stage)
The ColumnTransformer maps different preprocessing steps to different columns of the dataset, providing an easy way to apply specific transformations to specific columns.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# num_cols and cat_cols are the column lists defined in the previous section
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),   # scale numeric columns
        ('cat', OneHotEncoder(), cat_cols)     # one-hot encode categorical columns
    ]
)
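To see what the transformed output looks like, here is a small self-contained sketch. It uses a NumPy array with integer column indices instead of a DataFrame with column names, and the values are made up for illustration. Setting sparse_threshold=0 is a choice made here to force a dense result that is easy to inspect.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: columns 0 and 1 are numeric,
# column 2 is a category code with three distinct values
X = np.array([
    [22.0, 7.25, 0],
    [38.0, 71.28, 1],
    [26.0, 7.92, 2],
    [35.0, 53.10, 0],
])

ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), [0, 1]),  # 2 scaled columns
        ("cat", OneHotEncoder(), [2]),      # 3 indicator columns
    ],
    sparse_threshold=0,  # always return a dense array
)
Xt = ct.fit_transform(X)

# Output columns follow the order of the transformers list
print(Xt.shape)
```

Columns that are not listed in any transformer are dropped by default; passing remainder='passthrough' keeps them unchanged.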
6. GridSearch for Pipelines
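GridSearchCV can be used to search over specified parameter values for both transformers and estimators in a pipeline. This allows for fine-tuning the entire pipeline, including preprocessing steps and the model itself. Parameters of nested steps are addressed by joining the step names and the parameter name with double underscores, e.g. classifier__n_estimators.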
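from sklearn.model_selection import GridSearchCV

# Keys follow the <step>__<nested step>__<parameter> naming of the pipeline above
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [10, 15, 20],
}

# iid exists in scikit-learn 0.20; it was removed in 0.24, so drop it on newer versions
grid_search = GridSearchCV(pipeline, param_grid, cv=5, iid=False)

# trainX must be a DataFrame containing the columns in num_cols and cat_cols
grid_search.fit(trainX, trainY)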
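A runnable version of the same idea, as a minimal sketch: the iris dataset, the step names scaler and clf, and the parameter values are all assumptions chosen for illustration. The iid argument is omitted so the code also runs on recent scikit-learn releases.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# One transformer parameter and one estimator parameter: 2 x 3 = 6 candidates
param_grid = {
    "scaler__with_mean": [True, False],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)
print(search.best_params_)
```

After fitting, search.best_estimator_ is the whole pipeline refit on all the data with the best parameter combination, so it can be used directly for prediction.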
By using composite estimators like Pipelines and FeatureUnions, along with tools like GridSearchCV, you can create powerful and flexible machine learning workflows that automate and streamline the process of building and evaluating models.