# Feature Scaling & Machine Learning

If you are practicing machine learning, you will likely run into this at some point.  The reason we use feature scaling is to help our algorithms train faster and more reliably.  Let's begin by taking a standard theta optimization equation to help better understand the problem.
$\theta_j = \theta_j - \alpha \cdot \frac{ \sum_{i=1}^{m} \left(H_{\theta}\left(x^{(i)}\right) - y^{(i)}\right) \cdot x_j^{(i)} } { m }$

Where $\frac{1}{m} \sum_{i=1}^{m} \left(H_{\theta}\left(x^{(i)}\right) - y^{(i)}\right) \cdot x_j^{(i)}$ is the partial derivative of the sum of squares error function with respect to each $\theta_j$.

Don’t worry too much about what a partial derivative is, just that this equation is the optimizer for each weighted value theta.  What is important here is that our alpha $\alpha$ is constant across all thetas and x’s.  Let's consider for a minute that we are looking to predict home price.  The price of a home is in the hundreds of thousands, the age is in tens, the acreage is in decimals, and finally the square footage is in the thousands.  Every time we take a step, our error with respect to every theta (at those various magnitudes) is multiplied by the same exact alpha, therefore possibly overstepping some features and under-stepping others if their values are not within a similar range.
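To make this concrete, here is a minimal sketch of a single gradient step on the home-price example.  The feature matrix and prices below are made-up numbers chosen only to match the magnitudes described above (age in tens, acreage in decimals, square footage in thousands).

```python
import numpy as np

# Hypothetical training set for home-price prediction (made-up numbers):
# columns: age (tens), acreage (decimals), square footage (thousands)
X = np.array([
    [12.0, 0.25, 1.4],
    [35.0, 0.80, 2.6],
    [ 8.0, 0.40, 1.9],
])
y = np.array([250_000.0, 410_000.0, 330_000.0])  # price in dollars

theta = np.zeros(3)
m = len(y)

# One gradient evaluation: sum((H(x) - y) * x_j) / m for each theta_j
errors = X @ theta - y
gradient = X.T @ errors / m
print(gradient)  # the components differ by orders of magnitude
```

Because the gradient components sit at wildly different magnitudes, no single $\alpha$ suits them all: one that is safe for the largest component barely moves the smallest.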

## Mean Range Normalization

Let's now introduce “Mean Range Normalization”.  Simply apply the function defined below to every component of your training data.  This will produce new values, useful for training, that fall between -1 and +1.
$x_j = \frac { x_j - M_x} { R_x }$

where
$M_x$ = the mean (or average) of $x$ over this training set.
$R_x$ = the range (max − min) of $x$ over this training set.
$x_j$ = the j-th observation of $x$ in this training set.

So how does this work?  No normalized value can ever reach -1 or +1 exactly unless every other observation sits at the opposite extreme.  Feel free to pick a variety of numbers and try it out.  Assume a range of 1 to 10 and a mean of 1.01.  Let's say you get x(j) to be 10.  The normalized version of this observation is (10 − 1.01) / 9 = 8.99 / 9, or very close to +1.  When doing this, it is also important to scale your prediction target as well.  So all features, including the prediction column, should be scaled.
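The formula above can be sketched in a few lines.  The feature column here is a made-up example spanning 1 to 10, just to show that every scaled value lands strictly inside (-1, +1).

```python
import numpy as np

def mean_range_normalize(x):
    """Mean range normalization: (x - M_x) / R_x."""
    m_x = x.mean()           # M_x: the mean of the feature
    r_x = x.max() - x.min()  # R_x: the range (max - min) of the feature
    return (x - m_x) / r_x, m_x, r_x

# Made-up feature column spanning 1 to 10
x = np.array([1.0, 4.0, 10.0])
scaled, m_x, r_x = mean_range_normalize(x)
print(scaled)  # every value strictly between -1 and +1
```

Keeping $M_x$ and $R_x$ around matters: you need them later to map predictions back to the original magnitude.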

## Scaling Predictions back to original Magnitude

Now the question comes up: does that mean my prediction for a home is going to be between -1 and +1?  Yes, it does.  But you can simply apply the inverse to regain the original number: multiply by the range, then add the mean back.
$x_j = x_j \cdot R_x + M_x$

Feel free to go ahead and try it.  It works :).
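Here is a quick round-trip sketch of that inverse, using a made-up price column: normalize it, invert, and confirm the original values come back.

```python
import numpy as np

# Round-trip check on a hypothetical price column (made-up numbers)
prices = np.array([250_000.0, 410_000.0, 330_000.0])

m_y = prices.mean()                # M_y: mean of the target
r_y = prices.max() - prices.min()  # R_y: range of the target

scaled = (prices - m_y) / r_y      # forward: (y - M_y) / R_y
restored = scaled * r_y + m_y      # inverse: y * R_y + M_y

print(np.allclose(restored, prices))  # True
```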