Optimizing ML Classification Algorithms

Hello World!

Today we are going to do a little exercise around optimizing an algorithm.  I was working with a customer who was using open data (and we know how that can be) to produce an initial set of predictions and show some value, while also adding data collection capabilities so they can roll their own model with more reliable data later.

The data can be collected from here: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_pums_csv_2015&prodType=document

The sample notebook can be located here: https://github.com/drcrook1/MLPythonNotebooks

The Problem

We want to correctly place folks into a wealth bucket based on a simple survey that, on the surface, does not appear terribly revealing.

Operationalizing the Results

I already have an article on this: http://dacrook.com/operationalizing-sklearn-with-azure-machine-learning/. That is probably the fastest and easiest way to operationalize an sklearn model unless you are a savvy Flask developer.
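If you do go the Flask route instead, a minimal scoring-endpoint sketch might look something like the code below; the endpoint name and pickled model file are hypothetical, and the Azure ML route in that article handles all of this plumbing for you.

# Minimal Flask scoring sketch (hypothetical endpoint and model file name)
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# assumes the trained sklearn model was pickled elsewhere
with open('income_model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    # expects JSON like {"features": [[age, cow, mar, sex, schl]]}
    features = request.get_json()['features']
    preds = model.predict(features)
    return jsonify({'predictions': preds.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)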

The Starting Point

So there already existed a survey covering Age, General Working Category, Marital Status, Gender and School attainment.  Let's simply start there.  The customer took all of that data, stuck it into a decision tree with some other modifications, and yielded a total f1-score of .47; I'm not sure what those modifications were, but I took the same data, did the same thing in sklearn and got a .58; so that's the number to beat.

So let's take a look at what we start with…

             precision    recall  f1-score   support

       high       0.18      0.27      0.21        11
        mid       0.64      0.60      0.62       138
        low       0.56      0.57      0.57       112

avg / total       0.59      0.57      0.58       261

[Figure: initial_1 — initial decision tree results]
So What to do first?
I’ve never seen the data before in my life, so the first thing I need to do is pop open my Jupyter server and a notebook and bring in a variety of libraries.  I am a big fan of Jupyter notebooks.  I have a notebook in Azure on an N-Series VM so I can play with the GPUs (that might be overkill for this project, but I just leave it always on).  You can always shift between A series, D series and N series depending on your project (pay for only what you need).

import pandas as pd
import numpy as np
import scipy.stats.stats as stats
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from tqdm import tqdm
import math
init_notebook_mode(connected=True)

Alright, so this is a ton of imports that are very critical.  Pandas is for data manipulation (table/dataframe type stuff).  Numpy is for numerical analysis: vector math, matrix math, that sort of thing.  scipy.stats has a variety of useful functions we will dive into later; plotly is for graphing (PLOTLY KICKS BUTT); tqdm gives me progress bars so I know how long things will take; we definitely need math; and we initialize the Jupyter notebook mode so we can display graphics inline instead of saving to files and displaying them in a separate browser tab.

Load some data

# Data Definition and Loading
data_file = 'usr/data/PUMS/csv_pmo/ss15pmo.csv'

data = pd.read_csv(data_file)
data = data.drop_duplicates()
print(data[0:5])

Notice here that I dropped duplicate rows immediately.  I do this out of habit and forgot to call it out; it may be a partial contributor to my initial performance gains over the customer with the same decision tree (more on that later, as it turned out to matter).

I’ve got a Linux box and I just drop stuff into that data directory.  I’ve never seen this data in my life, so let's just print the top 5 rows to see what we are working with here…

  RT  SERIALNO  SPORDER  PUMA  ST   ADJINC  PWGTP  AGEP  CIT  CITWP   ...
0  P        76        1  1100  29  1001264     79    77    1    NaN   ...
1  P       179        1  2800  29  1001264     70    30    1    NaN   ...      
2  P       326        1  1200  29  1001264     11    65    1    NaN   ...      
3  P       326        2  1200  29  1001264     12    69    1    NaN   ...      
4  P       360        1   600  29  1001264    177    24    1    NaN   ...      

   pwgtp71  pwgtp72  pwgtp73  pwgtp74  pwgtp75  pwgtp76  pwgtp77  pwgtp78
0      163       79       79      145        7       81        6        6
1      129       69       69       73       16       71       16       70   
2        4       13       21        4        2       21        9        9   
3        3       13       21        5        2       20       11       11   
4      260       67      186       44      168      152      196       70   

   pwgtp79  pwgtp80  
0       80       79  
1       74      125  
2       19       22  
3       16       23  
4       53       47  

[5 rows x 284 columns]

Hrm; yeah; it's open data…great…no idea what any of that means, and I have 284 columns like that.  But I was working with a customer, so I knew the column we were predicting on: PINCP (total person's income).  So let's start by visualizing that thing.  I know it is financial, so let's just do a histogram and set 50 buckets; that sounds good…
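The plotting code for that first histogram isn't shown above, but a quick plotly sketch of it, assuming the raw frame is still in `data`, would look roughly like this:

# Rough sketch: histogram of raw income (PINCP) with ~50 buckets
trace = go.Histogram(x=data['PINCP'].dropna().as_matrix(), nbinsx=50)
iplot([trace])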

[Figure: income_histogram_initial — histogram of raw PINCP income values]

Alright, so we already know one major problem; looking at the confusion matrix we can also see a lot of low/mid mix-up happening.  The buckets were defined as: under $50,000 is low, and between that and $250,000 is mid.  So that cutoff alone could be a huge contributor…  But more importantly, we have a data skew problem.

Re-balance the data

So there are way more sophisticated ways you can do this, but here is a simple one.  We will simply section off the low income area and then use pandas' built-in sampling to take 15% of it.  We then want to visualize the before and after on the same probability scale to ensure the shape of the curve looks about the same.

# Section off the low-income rows and sample 15% of them
ml_df_low_pincp = ml_df[ml_df['PINCP'] < 75000]
ml_df_low_pincp_sample = ml_df_low_pincp.sample(n = math.floor(len(ml_df_low_pincp) * .15))
trace = go.Histogram(x=ml_df_low_pincp['PINCP'].as_matrix(), histnorm='probability', name = 'all', nbinsx=25)
trace1 = go.Histogram(x=ml_df_low_pincp_sample['PINCP'].as_matrix(), histnorm='probability', name = 'sample', nbinsx=25)
iplot([trace, trace1])
print(len(ml_df_low_pincp))
print(len(ml_df_low_pincp_sample))

So we saw the distribution dropped off quite a bit after $75,000, so let's begin by sampling just that section.  Like I said, there are way more sophisticated ways to do it, but we are just seeing if this will be fairly effective to begin with.  Here is what the visual looks like…

[Figure: income_sampled_down — low-income segment, full set vs. 15% sample, as probability histograms]

Alright, the curves look close enough, and we dropped from 3755 examples to 563 examples.  We lost some data in that, but the point is to drop some of it out.  Also note that you could totally separate this data out into a held-out test set as well (see the sketch below).
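As an aside, if you did want to keep the rows you sampled away rather than just dropping them, a rough sketch of carving them off as an extra hold-out set (the variable name is mine) could be:

# Rough sketch: keep the unsampled low-income rows as an extra hold-out set
low_holdout = ml_df_low_pincp.drop(ml_df_low_pincp_sample.index)
print(len(low_holdout))  # rows we could later score the final model against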

Let's go ahead and append all of the segments together and re-visualize the overall wealth distribution…

#We have a more normalized and sampled data set...
# ml_df_richfolks and ml_df_high_pincp are the other income segments sectioned off in the full notebook
df = ml_df_richfolks.append(ml_df_high_pincp).append(ml_df_low_pincp_sample)
ml_df = df[df['PINCP'] < 500000]
trace1 = go.Histogram(x=ml_df['PINCP'].as_matrix(), histnorm='probability', name = 'sample', nbinsx=50)
iplot([trace1])
print(len(ml_df))

Notice I also filter out anybody over $500,000 in income; they were just too far out there and there were only one or two of them.  From a business impact perspective, we can afford to miss those folks.

[Figure: income_hist_final — PINCP distribution after re-balancing]

Ok, so we know about where our main buckets will sit, and it looks like we will have about an even split between low income and medium income, with still very little representation in high income.  You can also see some possible flaws; we would re-sample this depending on how many buckets we want (a quick sketch of one simple approach follows), but for now let's just move on and see how it does…
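For reference, if we did decide to fully even out the buckets, one simple (and fairly blunt) way to do it would be to downsample every bucket to the size of the smallest one; this is just a sketch, not what I actually did here:

# Sketch: downsample every income bucket to the size of the smallest bucket
buckets = ml_df['PINCP'].apply(lambda p: 'low' if p < 50000 else ('mid' if p < 250000 else 'high'))
smallest = buckets.value_counts().min()
balanced = ml_df.groupby(buckets).apply(lambda g: g.sample(n=smallest)).reset_index(drop=True)
print(balanced['PINCP'].describe())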

Checking out our performance

Now that we have all the data back together, let's write up some code for the decision tree and see how it does…

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn import tree

# Map raw income into our three buckets: 0 = high, 1 = mid, 2 = low
def map_pincp_to_bucket(pincp):
    if(pincp < 50000):
        return 2
    elif(pincp >= 50000 and pincp < 250000):
        return 1
    else:
        return 0

df['PINCP'] = df['PINCP'].apply(lambda x: map_pincp_to_bucket(x))

# Features: age, class of worker, marital status, sex, school attainment
X = df[['AGEP', 'COW', 'MAR', 'SEX', 'SCHL']].as_matrix()
Y = df['PINCP'].as_matrix()

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

model = tree.DecisionTreeClassifier()
model = model.fit(x_train, y_train)

y_pred = model.predict(x_test)
conf = confusion_matrix(y_test, y_pred)
trace = go.Heatmap(z=conf,
                  x = ['high', 'mid', 'low'],
                  y = ['high', 'mid', 'low'])
print(classification_report(y_test, y_pred, target_names = ['high', 'mid', 'low']))
iplot([trace])

And the results are printed below…

             precision    recall  f1-score   support

       high       0.12      0.27      0.16        11
        mid       0.65      0.55      0.60       118
        low       0.69      0.71      0.70       133

avg / total       0.65      0.62      0.63       262

Alright, a 5% improvement!  Not bad from simply rebalancing the data.  We are now in the category of being able to make real business impact…

Can we bump performance any more?

Well, let's see if there is another question we could ask that might give us a good bump in the right direction.  To do this we want to start by visualizing a correlation chart.  We want to see how all of our variables are correlated with each other, so we can spot candidates we can immediately drop.

#Calculate Correlation Coefficient and Plot on a heatmap
corr_coe = data.corr(method='pearson', min_periods=1000)
trace = go.Heatmap(z=corr_coe.as_matrix())
iplot([trace])

We start with our original dataframe and use the built-in correlation function, specifying Pearson's and a minimum of 1000 samples for a pair to even be considered (if fewer, we get NaNs, and we know we can drop those columns as well).

[Figure: corr_1 — correlation heatmap across all columns]

Alright, so we know that dark blue is extremely negatively correlated and dark red is extremely positively correlated (hence the dark red line running diagonally down the middle).  This is a reason I really like plotly for this: I can zoom and see exactly which column indices that big red block starts at…

[Figure: corr_zoomed — zoomed correlation heatmap around the highly correlated block]

Alright, so all that crazy correlation starts at index 201 and goes to the end on both axes; let's see what in the world we have there…

#corr_coe.columns[201:]
print(corr_coe[201:].index.values)
## Turns out these are replica records of weight...
## (likely 204 here vs 201 above because data has a few non-numeric columns that corr() skipped)
weight = data.iloc[:, 204:]
print(weight[0:5])

If you look at the data dictionary, these are all replica values of weight…but when you print them, it looks like a bunch of garbage.  So although asking somebody's weight might be a good question, I'm not confident the data here is accurate, and I'm not sure I want to train my model on it; so I just dropped it all…

data = data.drop(data.columns[201:], axis = 1)
data = data.drop_duplicates()

Notice I also drop duplicates again.  As soon as I remove a ton of columns, I want to double back and drop duplicates, especially since many rows might not have been dropped earlier only because of random weird values in the weight columns.

Alright, back to it; let's re-visualize our correlations, but this time without the weight replicas in there…

[Figure: corr_2 — correlation heatmap after dropping the weight replica columns]

So now we have this red zone of variables that are also correlated with one another; let's zero in on it and see if we can easily extract one or two variables that might have some impact on our financial target.  Do the same zoom-in with plotly to get our column indices and we see it starts at 127 and runs through the end.

Collect Correlations, PValues and Column Names

So the first thing we want to do is grab all the columns in question, iterate through them, and ask: are any of these related to income, and if so, is it due to random chance or is there really something here?  The correlation coefficient answers the first question, while the p-value answers the second.

#What is this red corridor of index 128 or so and up?
q_cols = corr_coe[127:].index.values

#before we ditch these; lets check out if any have a relationship with our target metric...
#Build a dataframe to collect the correlation and pvals
corr_df = pd.DataFrame(columns = ('COL', 'CORR', 'PVAL'))
i = 0
for col in tqdm(q_cols):
    # pair each candidate column with PINCP and drop rows with missing values
    t_dict = {'PINCP' : data['PINCP'], col : data[col]}
    t_df = pd.DataFrame(data = t_dict)
    t_df = t_df.dropna()
    corr, p_val = stats.pearsonr(t_df['PINCP'].as_matrix(), t_df[col].as_matrix().reshape(-1))
    corr_df.loc[i] = [col, corr, p_val]
    i += 1
    print('{} corr: {}'.format(col, corr))

So the above code prints everything as it goes, but that is a painful way to look at it.  It turns out that many of these columns let us reject the null hypothesis, so I decided to filter for VERY low p-values and then grab the top 5 most positively correlated and the top 5 most negatively correlated…

corr_df = corr_df.dropna()
#keep only items with very low p-vals (we can reject the null hypothesis)
possible_corr = corr_df[corr_df['PVAL'] < 0.001]

#Sort descending and show the top 5
#This shows the top 5 most positively correlated
pos_corr = possible_corr.sort_values(by=['CORR'], ascending=False)[0:5]
print(pos_corr)

#Sort ascending and show the top 5
#This shows the top 5 most negatively correlated
neg_corr = possible_corr.sort_values(by=['CORR'], ascending=True)[0:5]
print(neg_corr)
final_corr = pos_corr.append(neg_corr)

#Double back for variable independence.
corr_coe = data[final_corr['COL']].corr(method='pearson', min_periods=1)
trace = go.Heatmap(z=corr_coe.as_matrix())
iplot([trace])

So let's see what we got…

        COL      CORR          PVAL
16    FFODP  0.092791  4.472024e-98
24  FHINS4C  0.083444  1.497299e-10
33    FJWDP  0.059646  1.715382e-41
34   FJWMNP  0.047489  6.627835e-27
22  FHINS3C  0.041797  8.325575e-07

       COL      CORR          PVAL
44   FMARP -0.056567  1.737167e-37
62  FSCHLP -0.055245  7.839212e-36
14   FESRP -0.053826  4.239744e-34
61  FSCHGP -0.052272  2.977992e-32
70   FWKLP -0.048750  2.892478e-28

Ok, so our top positive correlation is FFODP and our top negative correlation is FMARP.

Let's take our top 10 candidates and find the variable with the lowest correlation to the others.  Note that if you find variables which are highly correlated with each other, you can just pick one of them.  But whatever; let's just pick the one with the lowest cross-correlation for fun…

[Figure: var_cand_cov — correlation heatmap of the top candidate columns]

Good news; look at the legend scale; dark blue means essentially complete independence from each other.  Good SHOW!  Alright, picking the column off the chart by eye is painful, so let's just write some code to get it for us…

# Find the (row, column) position of the lowest correlation value in the matrix
mat = corr_coe.as_matrix()
print(np.unravel_index(mat.argmin(), mat.shape))
# One of the returned indices was 4; look up which column that is
corr_coe.iloc[4].name

This is where it is nice to use numpy to grab indices of things.  Numpy has some really nice matrix manipulation, so we shove the data into a matrix, grab our index (4) and then push that back into the original dataframe to get the column name: FHINS3C.  What in the world is that thing?  It is “Medicare coverage given through the eligibility coverage”.  Well, I can certainly see that being related to income for sure…
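If `np.unravel_index` is new to you, here is a tiny, self-contained toy example (not part of the original notebook) of what it is doing: `argmin` returns a position in the flattened array, and `unravel_index` converts that back into a (row, column) pair.

# Toy example of argmin + unravel_index (illustration only)
toy = np.array([[0.9, 0.5],
                [0.1, 0.7]])
flat_pos = toy.argmin()                       # 2: position in the flattened array
print(np.unravel_index(flat_pos, toy.shape))  # (1, 0): row 1, column 0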

Before we go too crazy, let's just look at this thing…

#Lets just check this distribution first...
med_cov_col = 'FHINS3C'
trace = go.Histogram(x=data[med_cov_col].as_matrix(), histnorm='probability')
iplot([trace])

Well, so 95% of the respondents to the government survey do not have medical coverage, and a vast majority of them were also poor.  Remember that this is a voluntary survey.

Let's get a little more crazy and visualize what impact coverage vs. no coverage has on the probability distribution over income (hence whether it is a decent predictor or not).  We will also visualize the difference as raw totals (so you can see how probability is definitely different from the actual count distribution).

#interesting; 95% of respondents did not have medical coverage;
#this is strongly correlated to income and independent from various other variables
#Lets just check out some rows where they do have med coverage...

has_med_cov = data[data[med_cov_col] == 1]
no_med_cov = data[data[med_cov_col] == 0]
all_folks = data[(data[med_cov_col] == 0) | (data[med_cov_col] == 1)]
#Still not good enough; lets do that heatmap thing again...
print('NORMALIZED DISTRIBUTIONS')
trace1 = go.Histogram(x=all_folks['PINCP'].as_matrix(), nbinsx=25, opacity=0.5, name='all', histnorm='probability')
trace2 = go.Histogram(x=has_med_cov['PINCP'].as_matrix(), nbinsx=25, opacity=0.5, name='has_cov', histnorm='probability')
trace3 = go.Histogram(x=no_med_cov['PINCP'].as_matrix(), nbinsx=25, opacity=0.5, name='no_cov', histnorm='probability')
iplot([trace1, trace2, trace3])

print('UN-NORMALIZED DISTRIBUTIONS')
trace1 = go.Histogram(x=all_folks['PINCP'].as_matrix(), nbinsx=25, opacity=0.5, name='all')
trace2 = go.Histogram(x=has_med_cov['PINCP'].as_matrix(), nbinsx=25, opacity=0.5, name='has_cov')
trace3 = go.Histogram(x=no_med_cov['PINCP'].as_matrix(), nbinsx=25, opacity=0.5, name='no_cov')
iplot([trace1, trace2, trace3])

[Figure: prob_cov_income — normalized income distributions for coverage vs. no coverage]

Here we can see that for the lowest income brackets you might as well be guessing, and the same is true even in the $75k bucket; but in the $125k bucket you have 2:1 odds that folks have coverage as opposed to no coverage.  Before we draw social conclusions, let's also visualize the raw quantity of respondents…

[Figure: prob_cov_income_unnomralized — un-normalized (raw count) income distributions for coverage vs. no coverage]

So like 95% of the folks responding to this didn’t have coverage; don’t forget that…

[Figure: 125k_coverage_bucket — raw response counts around the $125k bucket]

Even in the $125k bucket you have nearly 10:1 in raw responses.  Anyway, it looks like it will likely make a good predictor, and it is a question that isn't terribly invasive, so you might get good responses since folks usually know whether they have coverage or not.

Now let's go ahead and add the extra column in there.

#Data Prep; select some columns.
ml_df = data[['PINCP', 'AGEP', 'COW', 'MAR', 'SCHL', 'SEX', 'FHINS3C']]
print(len(ml_df))
ml_df = ml_df.dropna().drop_duplicates()
print(len(ml_df))
print(ml_df[0:5])

Notice how we drop duplicates after the final scope-down.  This literally dropped the quantity of data from 61586 rows to 4243; again, I do this every time out of habit, so it could also account for why I got a ~10% performance bump over the customer from the very beginning using the same algorithm.

Visualize your inputs

So since I’m banking on this one beating the customer's model, I wanted to really double check my inputs, so I visualized the distribution of every input before writing the final algorithm.

#HUGE DROP in rows; let's view that histogram again...
ml_df = ml_df[ml_df['PINCP'] >= 0]
trace1 = go.Histogram(x=ml_df['PINCP'].as_matrix(), histnorm='probability', nbinsx=25, name = 'pincp')
trace2 = go.Histogram(x=ml_df['AGEP'].as_matrix(), histnorm='probability', nbinsx=25, name = 'age')
trace3 = go.Histogram(x=ml_df['COW'].as_matrix(), histnorm='probability', nbinsx=25, name = 'cow')
trace4 = go.Histogram(x=ml_df['MAR'].as_matrix(), histnorm='probability', nbinsx=25, name = 'mar')
trace5 = go.Histogram(x=ml_df['SCHL'].as_matrix(), histnorm='probability', nbinsx=25, name = 'schl')
trace6 = go.Histogram(x=ml_df['SEX'].as_matrix(), histnorm='probability', nbinsx=25, name = 'sex')
trace7 = go.Histogram(x=ml_df['FHINS3C'].as_matrix(), histnorm='probability', nbinsx=25, name = 'med_cov')

fig = tools.make_subplots(rows = 3, cols = 3)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig.append_trace(trace4, 2, 1)
fig.append_trace(trace5, 2, 2)
fig.append_trace(trace6, 2, 3)
fig.append_trace(trace7, 3, 1)
fig['layout'].update()
iplot(fig)

And this yielded the charts below (I plotted this before applying our sampling on income):

[Figure: hists_b4_smple — input feature histograms before sampling]

Now because the sampling can impact our distributions; lets just go ahead and see what that impact looks like…

[Figure: hists_after_smple — input feature histograms after sampling]

So we look more or less the same with a few differences, but overall that should be ok.  We may need to revisit marital status type 4, “Separated”, since type 3 is “Divorced”; there is some work we could likely do here as far as consolidating feature categories (a quick sketch of that follows).
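A minimal sketch of that kind of consolidation, assuming we simply fold “Separated” (4) into “Divorced” (3), might look like this; whether it actually helps is something you would want to verify against the test set:

# Sketch: fold MAR category 4 (Separated) into 3 (Divorced) before training
ml_df_consolidated = ml_df.copy()
ml_df_consolidated['MAR'] = ml_df_consolidated['MAR'].replace(4, 3)
print(ml_df_consolidated['MAR'].value_counts())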

Onto the Model!

# PINCP here is assumed to already be mapped into the 0/1/2 buckets as before
X = df[['AGEP', 'COW', 'MAR', 'SEX', 'SCHL', 'FHINS3C']].as_matrix()
Y = df['PINCP'].as_matrix()

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

model = tree.DecisionTreeClassifier()
model = model.fit(x_train, y_train)

y_pred = model.predict(x_test)
conf = confusion_matrix(y_test, y_pred)
trace = go.Heatmap(z=conf,
                  x = ['high', 'mid', 'low'],
                  y = ['high', 'mid', 'low'])
print(classification_report(y_test, y_pred, target_names = ['high', 'mid', 'low']))
iplot([trace])

Alright; what did that get us?

             precision    recall  f1-score   support

       high       0.12      0.27      0.16        11
        mid       0.66      0.55      0.60       118
        low       0.70      0.72      0.71       133

avg / total       0.66      0.63      0.64       262

Our high income folks (over $250k) performance went down by 1%, but our mid and low performance each went up by 1%, for an average gain of about 1.33% across all categories, and more importantly an increase in our most impactful categories.

Let's see where in our confusion matrix we might be able to have some impact…

[Figure: conf_desc_after — confusion matrix for the decision tree with FHINS3C added]

Looking at this, the biggest thing we can do to increase our score is to resolve our issues between high and middle income: we predict high when the person is in fact middle.

Try a different Model

So here we try an SVM with the same data…

from sklearn import svm

model2 = svm.SVC()
model2 = model2.fit(x_train, y_train)

y_pred = model2.predict(x_test)
conf = confusion_matrix(y_test, y_pred)
trace = go.Heatmap(z=conf,
                  x = ['high', 'mid', 'low'],
                  y = ['high', 'mid', 'low'])
print(classification_report(y_test, y_pred, target_names = ['high', 'mid', 'low']))
iplot([trace])

             precision    recall  f1-score   support

       high       0.00      0.00      0.00        11
        mid       0.66      0.65      0.66       118
        low       0.68      0.74      0.71       133

avg / total       0.64      0.67      0.66       262

Check that out…another 2% average increase, but we lost everything on predicting high income folks…OH NO!  But remember the business impact here.  I did try consolidating down to two classes but got worse results, so keeping 3 classes and knowing that you are getting better performance on the two that matter is not a bad trade-off…
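For reference, the two-class experiment is just a re-mapping of the buckets before splitting; a rough sketch of what I mean (variable names are mine) is below, and as noted it scored worse here than the three-class version.

# Sketch: collapse the three buckets down to two -- 'low' (1) vs. everyone else (0)
def map_to_two_buckets(bucket):
    return 1 if bucket == 2 else 0   # 2 was 'low' in map_pincp_to_bucket above

Y2 = df['PINCP'].apply(map_to_two_buckets).as_matrix()
x_train2, x_test2, y_train2, y_test2 = train_test_split(X, Y2, test_size=0.25, random_state=42)
model3 = svm.SVC().fit(x_train2, y_train2)
print(classification_report(y_test2, model3.predict(x_test2), target_names=['not_low', 'low']))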

Going Forward – Next Steps

Ok, so we took a problem where the customer came in at 47% and have been able to get up to roughly 64-66% without even diving into the other variables or trying out complex polynomial features or neural nets or any of that stuff.  But what would I do to take this even further with just this one data set?

If you recall the histograms here:

[Figure: hists_b4_smple — input feature histograms before sampling]

Notice that 81% of the survey respondents are over the age of 62, while 95% have no medical coverage and 40% report an income level that is oddly the same as social security…

I would look at grabbing data from more states than just the one we used and then sectioning off age ranges.  I would perhaps look across the entire data set for the columns most correlated to PINCP; I would then run through the variables we are currently using to see if any of them are representative of the same information, and perhaps find other questions we could ask by performing the same Pearson's analysis.
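Pulling in more states is mostly just a matter of reading each state's person-level PUMS csv and concatenating them; a rough sketch, with the extra file paths assumed to follow the same ss15p&lt;state&gt;.csv convention as the Missouri file, would be:

# Sketch: combine several states' person-level PUMS files (extra paths are assumed)
state_files = ['usr/data/PUMS/csv_pmo/ss15pmo.csv',   # Missouri (the file used above)
               'usr/data/PUMS/csv_pil/ss15pil.csv',   # Illinois (assumed path)
               'usr/data/PUMS/csv_pks/ss15pks.csv']   # Kansas (assumed path)
multi_state = pd.concat([pd.read_csv(f) for f in state_files], ignore_index=True)
multi_state = multi_state.drop_duplicates()
print(len(multi_state))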

Most importantly, I would just start pushing this thing out into production and start leveraging it alongside a data collection system that picks up on whether or not the predictions yielded the target business goal you intended.  In this instance the prediction is being made on something that is believed to impact the business; however, that is a guess.  Once this goes into production you can collect real data directly associated with your business.  That's the real data you want to build these predictions on; not some data set where the source age distribution does not match your target age distribution.

Summary

So there we have it; some good, solid ways to go about increasing your algorithm's performance!  This article just touches the surface; it is possible to drive much better results with further analysis and more sophisticated methods.
