If you are practicing machine learning, you will likely run into this at some point. The reason we use feature scaling is to help our algorithms train faster and converge more reliably. Let's begin with a standard theta optimization equation to better understand the problem.
θ_j := θ_j − α · ∂/∂θ_j J(θ)

Where ∂/∂θ_j J(θ) is the partial derivative of the sum-of-squares error function with respect to each theta.
Don’t worry too much about what a partial derivative is; just know that this equation is the optimizer for each weighted value theta. What is important here is that our alpha is constant across all thetas and x’s. Consider for a minute that we are looking to predict home price. The price of a home is in the hundreds of thousands, the age is in tens, the acreage is in decimals, and the square footage is in the thousands. Every time we take a step, our error with respect to every theta (at these various magnitudes) is multiplied by the exact same alpha, therefore possibly overstepping some features and under-stepping others if their values are not within a similar range.
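A quick sketch can make this concrete. The code below takes one batch gradient-descent step on hypothetical home-price data (the feature values and magnitudes are illustrative, not from the article) and prints the gradient per theta, showing how the large-magnitude square-footage column swamps the others under a single shared alpha:

```python
import numpy as np

# Hypothetical home features at very different magnitudes:
# columns are [bias, age (tens), acreage (decimals), sqft (thousands)].
X = np.array([
    [1.0, 32.0, 0.25, 1800.0],
    [1.0,  5.0, 0.80, 2600.0],
    [1.0, 71.0, 0.10, 1100.0],
])
y = np.array([210_000.0, 340_000.0, 150_000.0])  # price (hundreds of thousands)

theta = np.zeros(4)
alpha = 1e-8  # one learning rate shared by every theta

# One gradient-descent step on the sum-of-squares error:
# theta_j := theta_j - alpha * sum((h(x) - y) * x_j)
errors = X @ theta - y
gradient = X.T @ errors
theta -= alpha * gradient

# The sqft column dominates the gradient, so alpha must be tiny to avoid
# overshooting that feature -- which makes the small-magnitude features crawl.
print(gradient)
```

Because the square-footage gradient is thousands of times larger than the acreage gradient, no single alpha suits both, which is exactly the problem scaling solves.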
Mean Range Normalization
Let's now introduce “Mean Range Normalization”. Apply the function defined below to every component of your training data. This will produce new data, useful for training, within the range of -1 to +1.

x'_j = (x_j − μ) / r

μ = the mean (or average) of x over the training set
r = the range (or max − min) of x over the training set
x_j = the j-th observation of x in this training set
So how does this work? No value can ever reach -1 or +1 unless all observations are exactly the same. Feel free to pick a variety of numbers and try it out. Assume observations ranging from 1 to 10 (so r = 9) and a mean of 1.01, and say you get an x_j of 10. The normalized version of this observation is (10 − 1.01) / 9 = 8.99 / 9, or very close to +1. When doing this, it is also important to scale your target values: all features, including the prediction column, should be scaled.
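The normalization step above is a one-liner in practice. Here is a minimal sketch (the function name and sample values are my own):

```python
import numpy as np

def mean_range_normalize(x):
    """Mean range normalization: (x_j - mean) / (max - min)."""
    mean = x.mean()
    rng = x.max() - x.min()
    return (x - mean) / rng

# Example: home ages in tens of years.
ages = np.array([5.0, 32.0, 71.0])
scaled = mean_range_normalize(ages)

# Every scaled value falls strictly between -1 and +1,
# and the scaled column is centered at zero.
print(scaled)
```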
Scaling Predictions Back to the Original Magnitude
Now the question comes up: does that mean my prediction for a home is going to be between -1 and +1? Yes, it does. But you can simply apply the inverse to regain the original magnitude:

y = y' · r + μ
Feel free to go ahead and try it. It works :).
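Here is a round trip showing the inverse in code, using the same mean and range that were used to scale (function names and prices are illustrative):

```python
import numpy as np

def mean_range_normalize(x, mean, rng):
    # Forward transform: (x_j - mean) / range
    return (x - mean) / rng

def denormalize(x_scaled, mean, rng):
    # Inverse transform: x_j = x'_j * range + mean
    return x_scaled * rng + mean

prices = np.array([210_000.0, 340_000.0, 150_000.0])
mean, rng = prices.mean(), prices.max() - prices.min()

scaled = mean_range_normalize(prices, mean, rng)
restored = denormalize(scaled, mean, rng)

# restored matches the original prices exactly (up to float precision)
print(restored)
```

Note that you must keep the training-set mean and range around at prediction time; the same statistics are used both to scale new inputs and to un-scale outputs.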
What about New Predictions?
Yup, you can run into some issues here. What if you receive an input whose x observation falls outside the range of the training data? This probably deserves its own article, but nine times out of ten you can assume it is an odd observation that is not indicative of your training data. Since you have no training data around that type of observation, if it is a crucial input carrying significant weight, you may not be able to rely on your model to produce an accurate prediction, and that limitation should be conveyed to the user of your model. The nice thing is that these types of requests should be rare if you trained on a significant amount of data.
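One simple way to surface this to the caller is to flag any feature of a new input that falls outside the training range. This guard is my own suggestion, not part of the article; the feature bounds are hypothetical:

```python
import numpy as np

# Per-feature minima and maxima observed in training:
# [age, acreage, sqft] (illustrative values).
train_min = np.array([5.0, 0.10, 1100.0])
train_max = np.array([71.0, 0.80, 2600.0])

def out_of_range(x_new):
    """Boolean mask of features outside the training range."""
    return (x_new < train_min) | (x_new > train_max)

# An age of 120 was never seen during training.
sample = np.array([120.0, 0.5, 2000.0])
mask = out_of_range(sample)
print(mask)
```

If any entry of the mask is true, the prediction can still be made, but it should be reported as low-confidence rather than silently returned.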
There you go: that is feature scaling for machine learning applications, letting you train your model well and resurface predictions at the proper magnitude!