Tuesday, 1 January 2013

The relation between Gradient Descent and Newton-Raphson methods

To say the least, this is something that came to me quite late.
What's the difference between the two? The visualization!

Consider a function f(theta) that we want to either maximize or minimize.

Gradient Descent is simple:
1. Take the 1st derivative (the gradient). It gives the direction of steepest ascent/increase of the function. This is its nature.
2. The negative of it obviously gives the direction of steepest decrease.
3. Update the parameters (theta) like: theta = theta - (gradient) * (rate)
4. Note the minus sign: it says we move in the direction of decreasing function value. This is for minimization.
5. For maximization we do the opposite and use: theta = theta + (gradient) * (rate)
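
Here is a tiny Python sketch of the rule above on a toy function of my own choosing, f(theta) = (theta - 3)^2 (the function, starting point and rate are just for illustration):

def f(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)          # 1st derivative of f

theta = 0.0                             # initial guess
rate = 0.1                              # learning rate
for step in range(100):
    theta = theta - rate * grad(theta)  # minus sign => minimization

print(theta)                            # ends up close to 3.0, the minimizer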


So what does minimization/maximization mean???

We find a point where the slope is zero. It's time for some memory recall. What did the Newton-Raphson method do? It found the root of a function. What's that, you ask? Well, it's finding a 'theta' such that f(theta) = 0, using the update theta = theta - f(theta) / f'(theta).
So if I want to maximize or minimize, I apply the same idea to the derivative: I find the root of f'(theta). The formula becomes theta = theta - f'(theta) / f''(theta), or in vector notation: theta = theta - (Hessian-inverse) * (gradient).
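
A small sketch of that update on a toy non-quadratic function I picked for illustration, f(theta) = exp(theta) + exp(-2*theta), whose slope is zero at theta = ln(2)/3:

import math

def fprime(theta):                      # f'(theta)
    return math.exp(theta) - 2.0 * math.exp(-2.0 * theta)

def fsecond(theta):                     # f''(theta)
    return math.exp(theta) + 4.0 * math.exp(-2.0 * theta)

theta = 0.0
for step in range(10):
    theta = theta - fprime(theta) / fsecond(theta)   # theta - f'(theta) / f''(theta)

print(theta, math.log(2.0) / 3.0)       # both ~0.231: the point where the slope is zero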

How are they similar, you ask?
Simple: for functions with an explicit maximum the 2nd derivative is negative (the Hessian is negative (semi-)definite), which makes the update rule a maximizing one. For the ones with a minimum the 2nd derivative is positive (the Hessian is positive (semi-)definite), which makes the update rule a minimizing one. The Hessian dictates both the direction and the rate.
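
To see that one formula does both jobs, here is the same Newton update on two toy functions of my own: a convex one with a minimum and a concave one with a maximum at the same point:

def newton_step(theta, fprime, fsecond):
    return theta - fprime(theta) / fsecond(theta)

# convex  f1(theta) =  (theta - 3)^2 : f1'' = +2  -> step lands on the minimum
# concave f2(theta) = -(theta - 3)^2 : f2'' = -2  -> step lands on the maximum
theta1 = newton_step(0.0, lambda t:  2.0 * (t - 3.0), lambda t:  2.0)
theta2 = newton_step(0.0, lambda t: -2.0 * (t - 3.0), lambda t: -2.0)

print(theta1, theta2)                   # both 3.0: same rule, minimization and maximization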

Note: The direction given by the Newton step (Hessian-scaled gradient) and by the plain gradient are different, because the first accounts for curvature while the other only for the slope. The end is the same but the means are different.
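
A quick numerical illustration of that difference, on an ill-conditioned quadratic f(theta) = 0.5 * theta^T A theta with a matrix A I made up:

import numpy as np

A = np.array([[10.0, 0.0],
              [ 0.0, 1.0]])
theta = np.array([1.0, 1.0])

gradient = A @ theta                      # [10, 1]
newton = np.linalg.solve(A, gradient)     # [1, 1]

print(-gradient)                          # steepest-descent direction: mostly along the steep axis
print(-newton)                            # Newton direction: points straight at the minimum (the origin)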

For quadratic functions the NR method is one shot, that is, it converges in a single step, while gradient descent takes many more iterations. (For general convex functions it is not one-shot, but it still needs far fewer steps than gradient descent near the optimum.)
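
A toy comparison on a quadratic of my own choosing, f(theta) = 0.5 * theta^T A theta - b^T theta:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
exact = np.linalg.solve(A, b)             # the true minimizer

theta_nr = np.zeros(2)
theta_nr = theta_nr - np.linalg.solve(A, A @ theta_nr - b)   # one Newton step

theta_gd = np.zeros(2)
for _ in range(200):                      # many gradient steps
    theta_gd = theta_gd - 0.1 * (A @ theta_gd - b)

print(exact)
print(theta_nr)                           # exact after a single step
print(theta_gd)                           # still only approaching it after 200 steps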

The problem with the Hessian method is the cost of forming and inverting the Hessian matrix (roughly cubic in the number of parameters).
For that we have intermediate methods, namely the Conjugate Gradient methods and the quasi-Newton methods such as L-BFGS and the rank-one (SR1) update, which approximate the Newton step without ever building the full inverse Hessian.
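
If SciPy is around, its minimize function exposes all three families, so you can compare them on a toy objective (here the Rosenbrock function, my choice):

import numpy as np
from scipy.optimize import minimize

def f(theta):
    return (1.0 - theta[0]) ** 2 + 100.0 * (theta[1] - theta[0] ** 2) ** 2

def grad(theta):
    return np.array([
        -2.0 * (1.0 - theta[0]) - 400.0 * theta[0] * (theta[1] - theta[0] ** 2),
        200.0 * (theta[1] - theta[0] ** 2),
    ])

x0 = np.zeros(2)
for method in ["CG", "L-BFGS-B", "Newton-CG"]:
    res = minimize(f, x0, jac=grad, method=method)
    print(method, res.x, res.nit)         # solution and iteration count per method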

To sum it up: All the methods are different only in their path to the goal: OPTIMIZATION.

Gradient Descent -----------> Conjugate Gradient / Quasi-Newton Methods ----------------> Newton-Raphson Method.

Banzai!!!