Machine Learning Study Group Recap – Week 4

Hello World,

So here we go with another recap. This week we did a deep dive into binary classification using logistic regression. Logistic regression and binary classification are the underpinnings of modern neural networks, so a deep and complete understanding of them is necessary to be proficient in machine learning.

Here are some of the key topics.

Element-Wise Matrix Math

We discussed element-wise matrix mathematics, which helps in certain scenarios when vectorizing our operations. You will often see vectorized operations written as \theta .* X (element-wise) as opposed to \theta * X (matrix multiplication). Those two operations produce very different results.
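
To make that concrete, here is a quick NumPy illustration of the difference (my own example, not code from the session; the .* notation above is Octave/MATLAB style, and the NumPy equivalent of element-wise multiplication is the plain * operator):

import numpy as np

theta = np.array([[1.0], [2.0], [3.0]])    # 3x1 column vector of parameters
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 0.5]])            # 2x3 design matrix, one row per example

# Matrix multiplication (Octave's X * theta): each row of X is dotted with
# theta, producing one prediction per training example.
predictions = X @ theta                    # shape (2, 1)

# Element-wise multiplication (Octave's .*): the operands are broadcast and
# every entry is multiplied independently -- a very different result.
elementwise = X * theta.T                  # shape (2, 3)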

Logistic Cost, Sigmoid and Partial Derivatives

This topic probably consumed the largest share of our time. Here is a quick synopsis. The logistic cost function is denoted as follows:
J_{\theta} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log\left(H_{\theta}\left(x_i\right)\right) + \left(1-y_i\right) \log\left(1-H_{\theta}\left(x_i\right)\right) \right]

This is complex, as our cost function is typically something along these lines:
\sum_{i=1}^{m} \left(H_{\theta}\left(x_i\right) - y_i\right)^2

We could use that squared-error cost and it would still be a valid measure, but because we are pairing it with gradient descent and a sigmoid hypothesis, the classic cost function will not do: its shape becomes very bumpy (non-convex), creating many local minima, which puts gradient descent into a tizzy. The more complex log-based cost function produces a smooth, convex curve, enabling gradient descent to reliably find its minimum.
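
For concreteness, here is a minimal vectorized sketch of that log-based cost in NumPy (my own illustration, not code from the session; X, y, and theta are assumed to be a design matrix, a 0/1 label vector, and a parameter vector):

import numpy as np

def sigmoid(z):
    # The logistic (sigmoid) hypothesis squashes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # Vectorized J_theta for logistic regression with 0/1 labels y.
    m = len(y)
    h = sigmoid(X @ theta)                 # H_theta(x_i) for every example at once
    return (-1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))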

The reason this put us into a state of confusion for a while is that in many scenarios you do not need to code the cost function at all if you are doing iteration-based descent. The partial derivative of the more complex cost function ended up having the same form as our familiar partial derivative, so we were all very happy. The confusion likely stemmed from the phrase "derive the cost function from our hypothesis function": it is not a derivative in the calculus sense, but rather derived in the sense that some fancy math produces an equivalent measure with a different curve (cool, huh!).
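
For reference, that partial derivative works out to the same familiar form we had before, just with the sigmoid hypothesis plugged in:

\frac{\partial J_{\theta}}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(H_{\theta}\left(x_i\right) - y_i\right) x_{i,j}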

Sigmoid

I recapped that here already.

Increasing Gradient Descent Performance

Eric has been doing some very interesting research into speeding up gradient descent by combining a few techniques. He showed off his thoughts, charts, and code for that. Basically, it's a combination of bold driver and momentum. Here is an article on a few of those techniques. Eric has been combining them, and the code repository can be found here. The current results appear promising for improving performance while simultaneously dealing with the local-minimum problem more effectively.
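
Eric's actual implementation lives in his repository; purely as a rough sketch of how the two ideas combine (hypothetical names and constants, not his code), bold driver adapts the learning rate based on whether the last step helped, while momentum carries part of the previous step forward:

import numpy as np

def gradient_descent_bold_momentum(theta, X, y, cost_fn, grad_fn,
                                   alpha=0.1, mu=0.9, boost=1.05, cut=0.5,
                                   iters=500):
    # cost_fn and grad_fn are assumed to return J_theta and its gradient.
    velocity = np.zeros_like(theta)
    prev_cost = cost_fn(theta, X, y)
    for _ in range(iters):
        # Momentum: blend the previous step into the new one.
        velocity = mu * velocity - alpha * grad_fn(theta, X, y)
        theta = theta + velocity
        cost = cost_fn(theta, X, y)
        if cost < prev_cost:
            alpha *= boost                 # the step helped: be a little bolder
            prev_cost = cost
        else:
            theta = theta - velocity       # undo the step that made things worse
            alpha *= cut                   # and cut the learning rate sharply
            velocity[:] = 0.0              # reset momentum so we stop climbing
    return theta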

Regularization

We spent a fair amount of time talking about regularization. Regularization is a technique for smoothing your functions to reduce overfitting. I will write a completely separate article on this at some point, but basically it is an additional penalty term in your cost function (and therefore in your gradient descent update) that forces the theta values down, creating smoother prediction curves so the algorithm deals with unseen data more appropriately.
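
As a preview, in the usual formulation the extra term is just a scaled sum of squared thetas tacked onto the cost above (conventionally skipping \theta_0), with \lambda controlling how hard the thetas are pushed down:

J_{\theta} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log\left(H_{\theta}\left(x_i\right)\right) + \left(1-y_i\right) \log\left(1-H_{\theta}\left(x_i\right)\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2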

Non-Linear Decision Boundaries

This was an interesting topic. The answer is basically to use more complex non-linear polynomial features and let gradient descent work its magic. The trick is vectorizing the equation correctly. This requires a pre-processing step, and I will cover it in an article this week; a rough sketch of that step follows below.
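
Here is that rough preview (my own sketch, assuming two input features x1 and x2 and a chosen maximum degree): the idea is to expand the raw features into all polynomial combinations so that a linear model in the expanded features draws a non-linear boundary in the original space.

import numpy as np

def map_polynomial_features(x1, x2, degree=6):
    # Expand two feature columns into every term x1^a * x2^b with a + b <= degree
    # (1, x1, x2, x1^2, x1*x2, x2^2, ...), including the bias column of ones.
    x1, x2 = np.asarray(x1).ravel(), np.asarray(x2).ravel()
    columns = []
    for total in range(degree + 1):
        for j in range(total + 1):
            columns.append((x1 ** (total - j)) * (x2 ** j))
    return np.column_stack(columns)        # shape (m, number_of_polynomial_terms)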

Summary

Wow. That was an incredible session. These study sessions just keep getting better and better. Everybody's skills are shooting through the roof. I am truly excited to see the progress and to be a part of this.

 
