Operationalizing SKLearn with Azure Machine Learning

Hello World!

So I just completed an incredible project with Brain Thermal Tunnel Genix, where I learned a great deal about pattern recognition, machine learning, and taking research algorithms and pushing them into a production environment where they can be integrated into a real product.  Today's article takes those lessons and provides a sample of how to perform complex modelling and operationalize it in the cloud.  The accompanying Gallery Example can be found here.

Why Model Outside Azure ML?

Sometimes you run into limitations around speed or data size, or perhaps you just iterate better on your own workstation.  I find myself significantly faster on my workstation or in a Jupyter notebook that lives on a big ol' server when running my experiments.  Modelling outside Azure ML lets me use the full capabilities of whatever infrastructure and framework I want for training.

So Why Operationalize with Azure ML?

Azure ML has several benefits: auto-scale, token generation, high speed Python execution modules, API versioning, sharing, and tight PaaS integration with things like Stream Analytics, among many others.  This really does make life easier for me.  Sure, I can deploy a Flask app via Docker somewhere, but then I need to worry about things like load balancing and security, and I really just don't want to do that.  I want to build a model, deploy it, and move on to the next one.  My value is A.I., not web management, so the more time I spend delivering my value, the more impactful I can be.

Enough Why, Just Show me How!

Where did the data come from?

You can download this gallery example to get the data.  Alternatively, the data is simply the Energy Efficiency Regression sample data that ships with Azure Machine Learning Studio.  I didn't do anything special with it; just download that as your starting point.

Setup for Experiment Repeatability

One of the first challenges is working with a team that develops models on different infrastructure.  In our scenario, some used AML and others used workstations.  We cannot simply randomly split the data for every experiment; that would mean everybody is using completely different training and testing data sets.  So we first download the data, split it into training and testing data sets, and share those via Azure's blob storage capabilities.  The Python code below reads the data in, randomly shuffles and splits it, and finally saves it to separate .csv files which can be shared.

import pandas as pd
import numpy as np

raw_data = pd.read_csv('/home/drcrook/Downloads/EE_Regression_Data.csv')

###############################
# Split into Test/Train Sets  #
###############################
#Shuffle the row order, then carve off 25% of the rows as the test set
indices = np.random.permutation(len(raw_data))
valid_cnt = int(len(raw_data) * 0.25)
test = raw_data.iloc[indices[:valid_cnt]]
train = raw_data.iloc[indices[valid_cnt:]]

#############################
# Save Test/Train Data Set  #
#############################
test.to_csv('/home/drcrook/data/EE_Regression_Test.csv')
train.to_csv('/home/drcrook/data/EE_Regression_Train.csv')
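If you want to share those .csv files through Azure blob storage as mentioned above, the upload itself is only a few lines.  Below is a minimal sketch using the azure-storage-blob package; the connection string and container name are placeholders you would swap for your own.

from azure.storage.blob import BlobServiceClient

#Placeholders: use your own storage connection string and container name
conn_str = '<your storage account connection string>'
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client('experiment-data')

#Upload the shared train/test splits so the whole team pulls the same files
for path in ['/home/drcrook/data/EE_Regression_Train.csv',
             '/home/drcrook/data/EE_Regression_Test.csv']:
    with open(path, 'rb') as f:
        container.upload_blob(name=path.split('/')[-1], data=f, overwrite=True)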

Data Pre-Processing Functions & Parameter Distribution

For any real modelling you do, there will be some series of pre-processing or feature extraction steps, and there will likely be a set of parameters associated with them.  These parameters are best saved and distributed as well.  For this sample, I simply showcase a Min/Max normalization across the data set, which is a fairly common way to normalize data.  There will usually be other steps, but this at least showcases a single basic one.  The code below extracts the parameters and then saves them to .csv for distribution and consumption in Azure ML.

#########################
# Preprocess Functions #
########################
#Get Min/Max Params
def MinMaxParams(df):
    return df.min(), df.max()

#Min/Max Normalization
def MM_Normalize(df, min_vals, max_vals):
    return (df - min_vals) / (max_vals - min_vals)

#Inverse Min/Max Normalization
def Inverse_MM_Normalize(df, min_vals, max_vals):
    return ((max_vals - min_vals) * df) + min_vals

###############################################
# Get Params and save them, will need in AML #
##############################################
params = MinMaxParams(train)
params[0].to_csv('/home/drcrook/data/EE_Regression_Min.csv')
params[1].to_csv('/home/drcrook/data/EE_Regression_Max.csv')
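As a quick sanity check (not part of the original workflow, just an illustrative extra step), you can verify that normalizing and then inverse-normalizing reproduces the original values.  Note that if any column is constant, its min and max are equal and the normalization divides by zero, so a real pipeline should guard against that.

#Illustrative check: MM_Normalize followed by Inverse_MM_Normalize should
#round-trip the training data up to floating point error.
roundtrip = Inverse_MM_Normalize(MM_Normalize(train, params[0], params[1]),
                                 params[0], params[1])
assert (roundtrip - train).abs().max().max() < 1e-9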

Data Pre-processing and Model Training

Since we already have train and test sets from the earlier processing, we can simply pre-process those and train the model.  I add an _n suffix to show they have been normalized.  We save our y values off, drop those columns from the training data, convert to numpy matrices, and finally train a very simple linear regression model.  See the code below.

###############################
# Pre-Process Data for Model #
##############################
#Normalize Data
train_n = MM_Normalize(train, params[0], params[1])
test_n = MM_Normalize(test, params[0], params[1])

#Extract y's as own set & drop from inputs
train_y = train_n['Heating Load']
train_n.drop('Heating Load', axis = 1, inplace=True)

#we want true test_y's, we don't want to judge against inversed normalization.
#we want to test against the original answers, so snag that, and drop labels
#from normalized data set.
test_y = test['Heating Load']
test_n.drop('Heating Load', axis = 1, inplace=True)

#Convert to numpy matrices, as this is what SKLearn wants.
train_n = train_n.as_matrix()
test_n = test_n.as_matrix()

############################
# Now for Machine Learning #
############################
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(train_n, train_y)
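Once fit, it can be worth a quick look at what the model actually learned before moving on.  This is optional and just uses standard scikit-learn attributes.

#Optional: inspect the fitted model
print(model.coef_)       #one learned weight per input feature
print(model.intercept_)  #bias term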

Model Inference, Testing & Performance

With all workstation development, you should generate your metrics on your workstation so you can view your results.  First we perform our predictions, which come out normalized, so we then inverse the normalization using the min/max values for that specific column.  Finally we use SKLearn's metrics package to calculate the root mean squared error (a common metric for regression).  See the code below.

#############
# Inference #
############
predictions_n = model.predict(test_n)
predictions = Inverse_MM_Normalize(predictions_n, 
                                   params[0]['Heating Load'], 
                                   params[1]['Heating Load'])

######################
# Calculate Metrics #
#####################
from sklearn.metrics import mean_squared_error
RMSE = mean_squared_error(predictions, test_y)**0.5
RMSE
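RMSE alone can be hard to interpret, so it can help to look at a couple of additional regression metrics alongside it.  The snippet below is an optional extra and not part of the original walkthrough.

#Optional extra metrics for context
from sklearn.metrics import mean_absolute_error, r2_score
MAE = mean_absolute_error(test_y, predictions)
R2 = r2_score(test_y, predictions)
print('RMSE: %.4f  MAE: %.4f  R^2: %.4f' % (RMSE, MAE, R2))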

Persist Model & Test Persisted Model

We now need to export the model to a file that we will eventually upload to Azure ML.  But like any good scientist, we should reload the persisted model and test that it produces the same results as the one we just trained.  The code below shows how to do this.

#############################################
# Persist Model for AML Operationalization #
############################################
from sklearn.externals import joblib
joblib.dump(model, '/home/drcrook/data/EE_Model/model.pkl')

#test model
model2 = joblib.load('/home/drcrook/data/EE_Model/model.pkl')
p = Inverse_MM_Normalize(model2.predict(test_n), 
                         params[0]['Heating Load'],
                         params[1]['Heating Load'])
#ensure same RMSE
RMSE = mean_squared_error(p, test_y)**0.5
RMSE
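As noted later, joblib may write extra .npy files alongside model.pkl depending on the model and library versions, and all of them need to end up in the zip for Azure ML.  A quick way to see exactly what was produced:

#See every file joblib produced; all of these belong in the zip for AML
import os
print(os.listdir('/home/drcrook/data/EE_Model'))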

Complete Code for Workstation Training & Model Persistence

Below is the complete code for training and persistence on a workstation.

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn import preprocessing, linear_model

raw_data = pd.read_csv('/home/drcrook/Downloads/EE_Regression_Data.csv')

###############################
# Split into Test/Train Sets #
##############################
#Shuffle the row order, then carve off 25% of the rows as the test set
indices = np.random.permutation(len(raw_data))
valid_cnt = int(len(raw_data) * 0.25)
test = raw_data.iloc[indices[:valid_cnt]]
train = raw_data.iloc[indices[valid_cnt:]]

#############################
# Save Test/Train Data Set #
############################
test.to_csv('/home/drcrook/data/EE_Regression_Test.csv')
train.to_csv('/home/drcrook/data/EE_Regression_Train.csv')

#########################
# Preprocess Functions #
########################
#Get Min/Max Params
def MinMaxParams(df):
    return df.min(), df.max()

#Min/Max Normalization
def MM_Normalize(df, min_vals, max_vals):
    return (df - min_vals) / (max_vals - min_vals)

#Inverse Min/Max Normalization
def Inverse_MM_Normalize(df, min_vals, max_vals):
    return ((max_vals - min_vals) * df) + min_vals

###############################################
# Get Params and save them, will need in AML #
##############################################
params = MinMaxParams(train)
params[0].to_csv('/home/drcrook/data/EE_Regression_Min.csv')
params[1].to_csv('/home/drcrook/data/EE_Regression_Max.csv')

###############################
# Pre-Process Data for Model #
##############################
#Normalize Data
train_n = MM_Normalize(train, params[0], params[1])
test_n = MM_Normalize(test, params[0], params[1])

#Extract y's as own set & drop from inputs
train_y = train_n['Heating Load']
train_n.drop('Heating Load', axis = 1, inplace=True)

#we want true test_y's, we don't want to judge against inversed normalization.
#we want to test against the original answers, so snag that, and drop labels
#from normalized data set.
test_y = test['Heating Load']
test_n.drop('Heating Load', axis = 1, inplace=True)

#Convert to numpy matrices, as this is what SKLearn wants.
train_n = train_n.as_matrix()
test_n = test_n.as_matrix()

############################
# Now for Machine Learning #
############################
model = linear_model.LinearRegression()
model.fit(train_n, train_y)

#############
# Inference #
############
predictions_n = model.predict(test_n)
predictions = Inverse_MM_Normalize(predictions_n, 
                                   params[0]['Heating Load'], 
                                   params[1]['Heating Load'])

######################
# Calculate Metrics #
#####################
from sklearn.metrics import mean_squared_error
RMSE = mean_squared_error(predictions, test_y)**0.5
RMSE


#############################################
# Persist Model for AML Operationalization #
############################################
from sklearn.externals import joblib
joblib.dump(model, '/home/drcrook/data/EE_Model/model.pkl')

#test model
model2 = joblib.load('/home/drcrook/data/EE_Model/model.pkl')
p = Inverse_MM_Normalize(model2.predict(test_n), 
                         params[0]['Heating Load'],
                         params[1]['Heating Load'])
#ensure same RMSE
RMSE = mean_squared_error(p, test_y)**0.5
RMSE


Initial Understanding Before Operationalizing with Azure ML

We now have a full experiment and an inference model persisted on our workstation, so we should understand how this maps to Azure ML.  Since we have already built our training and validation model, we simply need to integrate it as a production model into Azure ML.  If we think about how AML works, it is simply: read data in, process it in a module, infer in a module, return a result.  Should be pretty easy.  There is also a published gallery example here.

[Screenshot: the operationalized model experiment graph]

Preparing Prepped Materials for Upload

All persisted data will need to be uploaded as a data set into Azure ML.  I know it's weird that it is uploaded as a "data set", since it is not necessarily treated as a data set once it's up there; think of it as just uploading zip files.  So we need to zip everything up.  Some models produce significantly more .npy files than others when you persist them.  The easiest way to upload sklearn models is to dump them to a folder, just like we did with the EE_Model folder, and zip up the whole folder.

Separately, we need to zip up the two .csv files holding our parameters for min/max normalization.
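If you prefer to script the zipping rather than use a desktop tool, the sketch below builds both archives.  The zip file names here are my own choice; what matters is that the files sit at the root of each archive, because the AML modules read them from the './Script Bundle' folder by bare file name.

import shutil
import zipfile

#Zip the whole model folder (model.pkl plus any .npy side files)
shutil.make_archive('/home/drcrook/data/EE_Model', 'zip',
                    '/home/drcrook/data/EE_Model')

#Zip the two normalization parameter files at the root of the archive
with zipfile.ZipFile('/home/drcrook/data/EE_Params.zip', 'w') as z:
    z.write('/home/drcrook/data/EE_Regression_Min.csv',
            arcname='EE_Regression_Min.csv')
    z.write('/home/drcrook/data/EE_Regression_Max.csv',
            arcname='EE_Regression_Max.csv')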

Finally, we need to upload our EE_Regression_Test.csv as well as our EE_Regression_Train.csv.  We upload these so we can pipe data through the experiment and verify it completes, but also in case we wish to train other models on the same splits we used on our workstation.  These two files do not need to be zipped.

See screen shots below for results of zipping.

[Screenshot: EE_Params zipped]

[Screenshot: EE_Model zipped]

Uploading Assets

In the bottom left-hand corner of Azure ML Studio is a giant plus sign.  Select that and you will see a set of options; select Data Set, then From Local File, and choose your local files.  Do this for the zipped-up parameters, the zipped-up model, and the test and train data sets.

[Screenshot: uploading a data set]

Pre-process data in AML

Preprocessing the data in AML is a little different from doing it on our workstation, but it is more or less the same.  The code is below.  Notice in the image above that we have brought the zipped parameter file into our experiment space and passed it as an input to the Execute Python Script module.  These files are unzipped directly into the folder './Script Bundle', so we can simply read them in with pandas.  Notice that we use pd.Series.from_csv() and not pd.read_csv().  The reason is that each min/max file is only a single row of values, and we want the normalization function to treat it as a Series so it is aligned with the columns of the incoming DataFrame and applied row by row using the defined operators.  We then set dataframe1 to the normalized data, which is passed out of the module.

import pandas as pd

#Min/Max Normalization
def MM_Normalize(df, min_vals, max_vals):
    return (df - min_vals) / (max_vals - min_vals)


def azureml_main(dataframe1 = None, dataframe2 = None):

    #Read Data from the zip file that was passed into the module
    min_vals = pd.Series.from_csv('./Script Bundle/EE_Regression_Min.csv')
    max_vals = pd.Series.from_csv('./Script Bundle/EE_Regression_Max.csv')
    dataframe1 = MM_Normalize(dataframe1, min_vals, max_vals)
    
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1,
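One note if you test this module locally on a newer pandas: pd.Series.from_csv has since been removed.  Assuming the parameter files were written without a header, as Series.to_csv did at the time, an equivalent read looks like this:

#Equivalent to pd.Series.from_csv on newer pandas versions
min_vals = pd.read_csv('./Script Bundle/EE_Regression_Min.csv',
                       index_col=0, header=None).iloc[:, 0]
max_vals = pd.read_csv('./Script Bundle/EE_Regression_Max.csv',
                       index_col=0, header=None).iloc[:, 0]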

Model Hydration & Inference in AML

We now simply bring the zip file for the model into our workspace, create a new Execute Python Script module, and pass the zip file into it.  Notice that between the normalization code and the inference code there is a column-select module that removes column 0 and Heating Load; those are an index column and our y value.  We don't want to include those columns in the prediction, otherwise the feature matrix will not match the shape the model was trained on and we will get shape errors (think linear algebra: the matrix multiplication inside the vectorized algorithm needs matching dimensions).  See the hydration and inference code for this module below.

# The script MUST contain a function named azureml_main
# which is the entry point for this module.

# imports up here can be used to import any python modules
import pandas as pd
from sklearn.externals import joblib
from sklearn import linear_model
# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):
    # Execution logic goes here
    model = joblib.load('./Script Bundle/model.pkl')
    predictions = model.predict(dataframe1)
    dataframe1 = pd.DataFrame(predictions)    
    return dataframe1,

Again, notice we read the .pkl just like we did in our workstation code, except we read it from ./Script Bundle.  We then use model.predict on the data frame, convert the predictions, which come back as a numpy array, into a DataFrame, and return that DataFrame.  All data passed between Azure ML modules must be data frames.

Inverse Normalization on Predictions

The last step is to inverse the normalization.  Because our model is built with and for normalized data, the predictions it outputs are normalized, so we must create one final module to inverse that process.  These values will be the ones returned by our web endpoint.  See the code below.

import pandas as pd

def Inverse_MM_Normalize(df, min_vals, max_vals):
    return ((max_vals - min_vals) * df) + min_vals

def azureml_main(dataframe1 = None, dataframe2 = None):

    min_vals = pd.Series.from_csv('./Script Bundle/EE_Regression_Min.csv')
    max_vals = pd.Series.from_csv('./Script Bundle/EE_Regression_Max.csv')
    print(min_vals)
    min_vals = min_vals['Heating Load']
    max_vals = max_vals['Heating Load']
    dataframe1 = Inverse_MM_Normalize(dataframe1, min_vals, max_vals)

    return dataframe1,

This looks almost exactly like our normalization process.
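Before publishing, I find it reassuring to mimic the whole experiment graph locally: unzip the bundles into a './Script Bundle' folder next to a scratch script, paste the three azureml_main functions in under distinct names, and pipe the test data through them in the same order as the modules.  A rough sketch, where preprocess_main, inference_main and inverse_main are hypothetical names for the three entry points above:

import pandas as pd

#Read the shared test split (column 0 holds the saved index)
test = pd.read_csv('/home/drcrook/data/EE_Regression_Test.csv', index_col=0)

normalized, = preprocess_main(test)                 #normalization module
features = normalized.drop('Heating Load', axis=1)  #stand-in for the column-select module
predictions_n, = inference_main(features)           #model hydration + inference module
predictions, = inverse_main(predictions_n)          #inverse normalization module

print(predictions.head())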

Why Separate out all Processes?

This is a good question, and it was a lesson learned while working on the BTT project.  Basically, it can be nice to use the out-of-the-box AML models as well, to get an idea of which models deserve more time and effort.  Is it a model problem, or is it a feature extraction and pre-processing problem?  By separating out extraction, normalization, inference, and the inverse of those processes for predictions, you can very easily swap out any step to isolate issues and decide where to spend your time.  I find it pretty nice to do this type of work in the AML ecosystem.

Where to put my input and outputs?

This is easy.  Put your input before your processing and your output after your inverse-normalization and column filtration steps.  I always keep my feature extraction and pre-processing behind the endpoint.  Most of the algorithm's magic sauce is in how you pose the question, which is analogous to feature extraction and normalization; I want to own that part of the processing and not hand the reins over to another owner.  See the image below with the inputs and outputs defined.

[Screenshot: the experiment with web service input and output attached]

Publishing

Super easy: there is a button on the bottom bar, "Deploy Web Service".  If it's greyed out, run the experiment and then push the button.  Voilà, you are done!  You also get sample code, a docs page, tokens, token management, etc.  My next step is usually to wrap that sucker with Azure API Management, put a payment gateway in there, and start selling :D.
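Once deployed, the service page gives you the scoring URL, the API key, and generated sample code.  Purely as an illustration of the classic AML Studio request-response format, a call from Python looks roughly like the sketch below; the URL, key, column names and values are placeholders that must match your own service's input schema.

import requests

url = 'https://<region>.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0&details=true'
api_key = '<your api key>'

payload = {
    'Inputs': {
        'input1': {
            #Must match the columns your web service input expects
            'ColumnNames': ['<feature 1>', '<feature 2>', '<feature 3>'],
            'Values': [['<value 1>', '<value 2>', '<value 3>']]
        }
    },
    'GlobalParameters': {}
}

response = requests.post(url, json=payload,
                         headers={'Authorization': 'Bearer ' + api_key})
print(response.json())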

Summary

So we covered not only how to operationalize a simple SKLearn model with Azure Machine Learning, but also the whys.  SKLearn plays a large role in my customer engagements, and it is really nice to be able to operationalize it in this way and hand the reins over in this fashion.  It reduces the skills needed for deployment, maintenance, management, etc.

 

Comments


Question: Hello David, great article, one question so far: why did you normalize the target variable?  Is it necessary?  If so, why?  If not, did you do it for convenience reasons?  Thanks, and please keep up the good work!

Reply: Always normalize your input data.  It helps keep your gradients smooth and even.  It becomes more obvious and apparent as the variation between input feature magnitudes gets larger and larger.  In this example it probably wasn't that necessary, but it is good to show how to add the extra step.
