With the increasing demand for machine learning and data science in businesses, and the push towards better data strategies, there is a need for a more robust workflow for data modelling. A machine learning project follows a sequence of steps: data collection, data preprocessing (cleaning and feature engineering), model training, validation, and prediction on test data (which the model has not previously seen).

The test data needs to go through the same preprocessing as the training data. For this iterative process, pipelines are used to automate the entire sequence for both training and testing data. A pipeline makes the model reusable and removes redundant code, thereby speeding up the process. This proves especially effective in a production workflow.

[Image: Pipelines (Source: YouTube – PyData)]

In this article, I’ll be discussing how to implement a machine learning pipeline using scikit-learn.

Advantages of using Pipeline:

  • Automates the iterative workflow
  • Easier to fix bugs
  • Production ready
  • Encourages clean, standardized code
  • Helpful in iterative hyperparameter tuning and cross-validation evaluation (see the sketch below)
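
As a quick illustration of the last point, a pipeline can be passed directly to scikit-learn's cross-validation utilities, so the preprocessing is re-fitted inside every fold rather than leaking information from the validation folds. This is a minimal sketch, assuming the Iris data that is used later in this article:

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The scaler is fitted only on the training folds of each split,
# so the validation folds stay unseen during preprocessing.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('lr', LogisticRegression(max_iter=200))])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())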

Challenges in using Pipeline:

  • Proper data cleaning
  • Data Exploration and Analysis
  • Efficient feature engineering

Scikit-Learn Pipeline

The sklearn.pipeline module implements utilities to build a composite estimator, as a chain of transforms and estimators.

I’ve used the Iris dataset, which is readily available in scikit-learn’s datasets module. It has four feature columns – SepalLength (cm), SepalWidth (cm), PetalLength (cm) and PetalWidth (cm) – and a target column, Species. There are 150 samples covering 3 classes (50 each): Iris setosa, Iris virginica and Iris versicolor.


After loading the data, split it into training and testing sets, then build the pipeline object. Standardization is done with StandardScaler() and dimensionality reduction with PCA (principal component analysis); both of these are transformers, which will be fitted and then used to transform the data. The last step declares the model to use, here LogisticRegression, which is the estimator. The pipeline is then fitted and the model's performance score is determined.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset and split it into train and test sets
iris_df = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_df.data, iris_df.target,
                                                    test_size=0.3, random_state=0)

# Chain the transformers (scaling, PCA) with the final estimator
pipeline_lr = Pipeline([('scaler1', StandardScaler()),
                        ('pca1', PCA(n_components=2)),
                        ('lr_classifier', LogisticRegression(random_state=0))])

# Fit the whole pipeline on the training data and score it on the test data
model = pipeline_lr.fit(X_train, y_train)
model.score(X_test, y_test)

OUTPUT - 0.8666666666666667

With the pipeline, we preprocess the training data and fit the model in a single line of code. In contrast, without a pipeline, we have to do normalization, dimensionality reduction, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables.
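
For comparison, this is roughly what the same workflow looks like without a pipeline (a sketch, reusing the X_train/X_test split from above): every transformer has to be fitted on the training data and then applied again, separately, to the test data.

# Without a pipeline, each preprocessing step is handled manually
scaler = StandardScaler().fit(X_train)       # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # reuse the fitted scaler

pca = PCA(n_components=2).fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

clf = LogisticRegression(random_state=0).fit(X_train_pca, y_train)
clf.score(X_test_pca, y_test)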

If the data contains missing values or categorical columns, steps like the following two can be added inside the Pipeline object to fill missing values and convert categorical values to numeric. (Since the Iris dataset contains neither, we are not using them here.)

('imputer', SimpleImputer(strategy='most_frequent'))   # fill missing values

('onehot', OneHotEncoder(handle_unknown='ignore'))      # encode categorical columns

Make sure to import OneHotEncoder (from sklearn.preprocessing) and SimpleImputer (from sklearn.impute) first!
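
To tie this back to the earlier point about handling numerical and categorical variables together, the usual pattern is to wrap separate sub-pipelines in a ColumnTransformer and then use that as the preprocessing step of the main pipeline. The sketch below assumes a hypothetical DataFrame df with numeric columns 'age' and 'fare', categorical columns 'sex' and 'embarked', and a target y; the column names are only for illustration.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Hypothetical column names -- replace with those of your own dataset
numeric_cols = ['age', 'fare']
categorical_cols = ['sex', 'embarked']

numeric_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                         ('scaler', StandardScaler())])
categorical_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                             ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Apply each sub-pipeline to its own subset of columns
preprocess = ColumnTransformer([('num', numeric_pipe, numeric_cols),
                                ('cat', categorical_pipe, categorical_cols)])

model = Pipeline([('preprocess', preprocess),
                  ('classifier', LogisticRegression())])
# model.fit(df[numeric_cols + categorical_cols], y)  # fit once df and y exist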

Stacking Multiple Pipelines to Find the Model with the Best Accuracy

We build a different pipeline for each algorithm and then fit each one to see which performs better.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

# One pipeline per algorithm, all sharing the same preprocessing steps
pipeline_lr = Pipeline([('scaler1', StandardScaler()),
                        ('pca1', PCA(n_components=2)),
                        ('lr_classifier', LogisticRegression())])
pipeline_dt = Pipeline([('scaler2', StandardScaler()),
                        ('pca2', PCA(n_components=2)),
                        ('dt_classifier', DecisionTreeClassifier())])
pipeline_svm = Pipeline([('scaler3', StandardScaler()),
                         ('pca3', PCA(n_components=2)),
                         ('clf', svm.SVC())])
pipeline_knn = Pipeline([('scaler4', StandardScaler()),
                         ('pca4', PCA(n_components=2)),
                         ('knn_classifier', KNeighborsClassifier())])

pipelines = [pipeline_lr, pipeline_dt, pipeline_svm, pipeline_knn]
pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'Support Vector Machine', 3: 'K Nearest Neighbor'}

# Fit every pipeline and report its accuracy on the test set
for pipe in pipelines:
    pipe.fit(X_train, y_train)
for i, model in enumerate(pipelines):
    print("{} Test Accuracy: {}".format(pipe_dict[i], model.score(X_test, y_test)))
OUTPUT:
Logistic Regression Test Accuracy: 0.8666666666666667
Decision Tree Test Accuracy: 0.9111111111111111
Support Vector Machine Test Accuracy: 0.9333333333333333
K Nearest Neighbor Test Accuracy: 0.9111111111111111

From these results, it’s clear that the Support Vector Machine (SVM) performs better than the other models on this split.

Hyperparameter Tuning in Pipeline

With pipelines, you can easily perform a grid search over a set of parameters for each step of this meta-estimator to find the best-performing parameters. To do this, you first need to create a parameter grid for your chosen model. One important thing to note is that each parameter name must be prefixed with the name of the classifier step in the pipeline, followed by a double underscore. In the code below, make_pipeline automatically names the step 'randomforestclassifier', so randomforestclassifier__ is added to each parameter. Next, a grid-search object is created that wraps the pipeline. When fit is called, the transformations are applied to the data before a cross-validated grid search is performed over the parameter grid.

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

# make_pipeline names the step after the estimator class, in lowercase:
# 'randomforestclassifier'
pipe = make_pipeline(RandomForestClassifier())

# Parameter names are prefixed with the step name and a double underscore
grid_param = [{
    "randomforestclassifier": [RandomForestClassifier()],
    "randomforestclassifier__n_estimators": [10, 100, 1000],
    "randomforestclassifier__max_depth": [5, 8, 15, 25, 30, None],
    "randomforestclassifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
    "randomforestclassifier__max_leaf_nodes": [2, 5, 10]
}]

# 5-fold cross-validated grid search over the pipeline
gridsearch = GridSearchCV(pipe, grid_param, cv=5, verbose=0, n_jobs=-1)
best_model = gridsearch.fit(X_train, y_train)
best_model.score(X_test, y_test)

OUTPUT - 0.9777777777777777
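
After the search finishes, the fitted GridSearchCV object exposes the winning configuration, which is handy for reporting or for refitting later. A quick way to inspect it:

# Inspect the hyperparameters and estimator selected by the grid search
print(best_model.best_params_)
print(best_model.best_estimator_)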

Conclusion

This is a basic pipeline implementation. In a real-life data science scenario, the data would first need to be prepared, and then a pipeline applied for the remaining steps. Pipelines exist to help build quick and efficient machine learning models. They are in high demand because they encourage better code and scale well to big data projects, automating the applied machine learning workflow and saving the time otherwise spent on redundant preprocessing work.
The complete code for the above implementation is available in AIM's GitHub repository. Please visit this link to find the notebook with the code.
