Hello World,

So today we will do a quick conversion from mathematical notations of Algebra into a real algorithm that can be executed. Note we will not be covering gradient descent, but rather only cost functions, errors and execution of these to provide the framework for gradient descent. Gradient descent has so many flavors that it deserves its own article.

So to the mathematical representation.

So lets start with a few key items and definitions…

- Hypothesis: This is basically our template for a linear model. A linear model is basically just the function of a line. Remember y = mx + b. Thats it. For some x, we have an m and a b that predicts y. We will fit this line to our training data.
- Cost function is the function we want to reduce to zero. In the equation above, we can see that this equation is the sum of squares. but why 1/2…We will discuss.
- Goal is to minimize the cost function with the parameters thetaZero and thetaOne, we will discuss this.
- Parameters are thetaZero and thetaOne. Definately a to discuss item.

So lets talk first about thetaOne, thetaZero, the cost function and our hypothesis. So basically thetaOne is m and thetaZero is b in the equation y = mx + b. Thats it, nothing fancy, just some extra convoluded jargon to confuse the masses.

In our linear regression the goal is to reduce the error of our predictions on average against the training data. Since we chose a linear regression, the model template for a linear line is y = mx + b and we need to reduce the error, which is basically the output of our function minus the real value squared. The reason we square this is to remove the counter effect of negative error with positive error. For example if we are off by -2 and then off by +2, if we do not square the results we are off by zero. If we square the results, we are off by 8. Hence the sum of squares error calculation. I suppose theoretically you could do sum of absolute value to the function, but it is usually done sum of square, I do not know why, perhaps for use in gradient descent later to help in reducing complexity of the derivatives (will discuss for sure).

Ok, so I get it, we have this cost function, which we always want to reduce cost, and we have this hypothesis function which is really no more than the function of a line with varying m’s and b’s. So thats the concept. Gradient descent basically is a way to step wise come to a minimum in the cost function (goal = 0).

Now we get to convert to code. Note that this is for a very simple linear regression.

Below is the basics for everything except the implementation of gradient descent.

type UnivariateDataPoint = { X:float; Y:float } type IPredictionModel = abstract member PredictValue: float -> float type LinearPredictionModel(thetaOne:float, thetaTwo:float) = member this.ThetaOne = thetaOne member this.ThetaTwo = thetaTwo interface IPredictionModel with member this.PredictValue (x) = thetaOne * x + thetaTwo let square (x) = x * x let subtract (x) (y) = y - x let divideBy (x) (y) = y / x let getSeqLength (a:'a seq) = a |> Seq.length |> float let SumSquaredError (predictionModel:IPredictionModel) (trainingData:UnivariateDataPoint seq) = trainingData |> Seq.map(fun point -> point.X |> predictionModel.PredictValue |> subtract point.Y |> square ) |> Seq.sumBy(fun point -> point) let LinearCostFunction (thetaOne) (thetaTwo) (trainingData:UnivariateDataPoint seq) = let model = new LinearPredictionModel(thetaOne, thetaTwo) trainingData |> SumSquaredError model |> divideBy ((trainingData |> getSeqLength) * 2.0)

So this is a bit complicated because I rolled it as if I were writing production code; which I hope you always do. Note it is not fully generalized or as fast as it could be, but I’ll hopefully get a chance to write a follow up article on how we optimize this, for now its written in a way in which hopefully folks can understand just the basics of what is happening.

First we need to define a type of data, which contains X an observation with Y a value we want to predict on. We then define an interface for a prediction model which takes in a particular predictor value and produces a prediction. Note that this only works for univariate data sets (data sets with only one determining value (x)). We will fix this in a later article. We then implement with a class definition of a linear model (line function) and a prediction which takes in x and applies it to the model and produces a prediction.

This particular class implementation can be optimized using curried functions and partial application, but is a subject to be covered later. Just note that this is sub-optimal and in the final implementation this will be implemented as a series of stateless functions instead of stateful objects.

We define a few basic mathematical functions which enable a functional coding style and cleaner code.

SumSquaredError takes the prediction model as input as well as the training data and basically this is the implmentation of the summation of 1 to m in the training data of prediction – actual to get our sum of squared error. This section of code really showcases some of the value of functional coding with the cleanliness.

LinearCostFunction creates our prediction model for a particular hypothesis and calculates the SumSquaredError for that hypothesis. Note that we divide by 2 * m. because we are calculating the slope, which is essentially a derivative, the square bit in the sum of squared errors, we bring the two down into the numerator while we calculate the rate, therefor we divide by 2 * m to counteract that rate effect and have a true average.

So to do a quick test on this, lets take a look at the following code:

let tData = [| {X = 5.0; Y = 4.0}; {X = 3.0; Y = 4.0}; {X = 0.0; Y = 1.0}; {X = 4.0; Y = 3.0} |] |> Seq.ofArray tData |> LinearCostFunction 1.2 0.0

Here we have inputs 5, 3, 0 and 4 with the Y values being the actual’s. We predict a function of 1.2x + 0.0 and receive back as a result what the cost value is. The concept behind a gradient descent is that we start with some initial thetaOne and thetaTwo and define a step size and hopefully converge on a minimum value.

In a later article we will discuss gradient descent algorithms and options which yeild very different results.

Pingback: F# Weekly #19, 2016 – Sergey Tihon's Blog

The difference between using Some of Squared differences which is L2 norm and Maximum Value which is L1 norm is described here

http://www.chioka.in/differences-between-the-l1-norm-and-the-l2-norm-least-absolute-deviations-and-least-squares/

This is a summary not an exhaustive list.

You can also use the inifinity norm, which would minimize the maximum absolute deviation. This usually is approximated by Chebyshev orthogonal polynomials

Very interesting read. Definitely something to keep in mind while developing algorithms.

If you are not familiar with the matrix solution to this problem

https://en.wikipedia.org/wiki/Polynomial_least_squares

There are many ways to solve teh resulting matrix equation but my favorite is cholesky factorization, but the naive approach will not be stable past a 5 – 6 order polynomial in most cases of regular data

After that use Orthogonal polynomials that results in a tribanded matrix that is more easily and stably solved

Very glad you took the time to read and respond to this entry. I am actually working on a follow up to this entry on converting this into a Matrix Math approach using the Math.Numerics library. Keep an eye out for it this week :).

Pingback: Linear Regression from Scratch using Linear Algebra | DaCrook