In lesson 1, we were introduced to the basics of linear regression in a univariate context.

Now in lesson 2, we start to introduce models that have a number of different input features (multivariate).

We also cover the Normal equation, mean normalisation, and feature scaling.

For multivariate linear regression, the hypothesis is this, for one training example. We now have multiple x features, and multiple theta weights.
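Assuming the usual notation with n features (the symbol n is an assumption here), the hypothesis for one training example can be written as:

```latex
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
```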

We can convert this hypothesis into matrix form by doing this. Note that this is for ONE training example.

The above is more concisely expressed in this vectorised formula for the hypothesis
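As a sketch, assuming theta and x are both column vectors of length n + 1 with x_0 = 1, the vectorised hypothesis for one training example is:

```latex
h_\theta(x) = \theta^T x, \qquad
\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}, \quad
x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad x_0 = 1
```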

Summary of the info in the previous list

Notice that there is no x_0 term in the hypothesis. For convenience we define x_0 = 1, paired with the weight theta_0, so that we can use matrix multiplication

Vectorised Hypothesis

With multiple training examples, we use this vectorised formula

Now with multiple features, with multiple training examples, you can construct vectors of X and Theta like this.

Note: each row of the X matrix represents one training example
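One common layout, assuming m training examples and n features (with x_0 = 1 in the first column), looks like this:

```latex
X = \begin{bmatrix}
x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\
\vdots    & \vdots    & \ddots & \vdots    \\
x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)}
\end{bmatrix}, \qquad
\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}, \qquad
h_\theta(X) = X\theta
```

Here X is m by (n + 1), theta is (n + 1) by 1, and X * theta is an m by 1 vector holding one prediction per training example.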

Recall that the cost function over all training examples is this (also known as the "mean squared error").
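Assuming the common convention that includes an extra factor of 1/2 (which simplifies the derivative), the cost over m training examples is:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```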

We can replace h(x) now with the vectorised form in the previous list. This is the form for multiple features, and multiple training examples.

Vectorised Cost Function - Therefore, the vectorised version of the cost function now becomes this, where X * Theta is the hypothesis

Remembering that X * Theta is the vectorised hypothesis
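Under the same assumptions (a 1/(2m) scaling, X as the m by (n + 1) matrix of examples, y as the m by 1 vector of targets), the vectorised cost can be sketched as:

```latex
J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)
```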

Now, because we have arranged theta, X and y as matrices, we can translate the hypothesis into vectorised code
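As a rough illustration in plain Python (rather than Matlab/Octave), the hypothesis and cost can be computed as below. The function names and the tiny data set are hypothetical, and the cost assumes the 1/(2m) scaling.

```python
def hypothesis(X, theta):
    """Return X * theta: one prediction per training example (row of X)."""
    return [sum(x_j * t_j for x_j, t_j in zip(row, theta)) for row in X]


def cost(X, y, theta):
    """Cost J(theta) = 1/(2m) * sum of squared errors (assumed scaling)."""
    m = len(y)
    errors = [h - y_i for h, y_i in zip(hypothesis(X, theta), y)]
    return sum(e * e for e in errors) / (2 * m)


# Hypothetical data: 3 examples, x_0 = 1 plus one feature, generated by y = 1 + 2x.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

predictions = hypothesis(X, [1.0, 2.0])
cost_at_true_theta = cost(X, y, [1.0, 2.0])
cost_at_zero_theta = cost(X, y, [0.0, 0.0])
```

With theta equal to the values that generated the data, every prediction matches its target and the cost is zero; any other theta gives a positive cost.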

The general form for gradient descent

Gradient Descent is a process that lets you "descend" down the cost function in order to find the minimum/optimal theta values of the cost function

Gradient descent algorithm. Repeat this until it converges, for j = 0 to n, where n is the number of features. Theta j, x j, and y are all matrices.
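Assuming the cost function uses the 1/(2m) scaling, the update rule is usually written as follows, with every theta_j updated simultaneously:

```latex
\text{repeat until convergence:} \quad
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m}
\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\qquad \text{for } j = 0, 1, \ldots, n
```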

Breaking down the above into its separate components for each theta. Similar to univariate, but now generally we compute the descent for each theta in the cost function (remember that for each theta j you want to get the partial derivative with respect to theta j)

Remember also that this section of the function is the partial derivative of J with respect to theta j

The vectorised form of Gradient Descent is this.

Where X, theta and y are all matrices. Alpha is the learning rate and is a scalar. And m is the number of training examples.
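With the shapes described above (X is m by (n + 1), theta is (n + 1) by 1, y is m by 1), one way to write the vectorised update is:

```latex
\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)
```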

Gradient Descent - Vectorized!

A single gradient descent step in Matlab code. We perform this for each gradient step iteration. X, hypothesis and theta are all matrices.

Full code example, with iteration loop
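As a stand-in sketch in plain Python (not the Matlab listing these notes refer to), a single step and the surrounding iteration loop might look like this. The function names, data, alpha, and iteration count are all hypothetical.

```python
def gradient_step(X, y, theta, alpha):
    """One update: theta := theta - (alpha / m) * X^T (X * theta - y)."""
    m = len(y)
    # errors_i = h(x_i) - y_i for every training example
    errors = [sum(x_j * t_j for x_j, t_j in zip(row, theta)) - y_i
              for row, y_i in zip(X, y)]
    # gradient_j = (1 / m) * sum_i errors_i * X[i][j]
    gradient = [sum(e * row[j] for e, row in zip(errors, X)) / m
                for j in range(len(theta))]
    # All theta_j are updated simultaneously from the same errors.
    return [t_j - alpha * g_j for t_j, g_j in zip(theta, gradient)]


def gradient_descent(X, y, theta, alpha, num_iters):
    """Apply gradient_step a fixed number of times and return the final theta."""
    for _ in range(num_iters):
        theta = gradient_step(X, y, theta, alpha)
    return theta


# Hypothetical data generated by y = 1 + 2x, with x_0 = 1 in the first column.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

first_step = gradient_step(X, y, [0.0, 0.0], alpha=0.1)
theta = gradient_descent(X, y, [0.0, 0.0], alpha=0.1, num_iters=5000)
```

On this toy data, theta approaches the intercept and slope used to generate y (1 and 2).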

The learning rate is a scalar value, alpha

You need to choose an appropriate value of the learning rate alpha to tune gradient descent, so that it converges (i.e. the cost stops decreasing by non-trivial amounts)

If you set the learning rate too small, convergence will be slow. If you set it too high, it may not converge at all
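The effect can be seen on a toy problem. The plain-Python sketch below runs the same number of iterations with three hypothetical learning rates and compares the final cost; the data, the alpha values, and the function names are all assumptions made for illustration.

```python
def cost(X, y, theta):
    """Cost J(theta) with the assumed 1/(2m) scaling."""
    m = len(y)
    return sum((sum(a * b for a, b in zip(row, theta)) - y_i) ** 2
               for row, y_i in zip(X, y)) / (2 * m)


def run(X, y, alpha, num_iters):
    """Run batch gradient descent from theta = 0 and return the final cost."""
    theta = [0.0] * len(X[0])
    m = len(y)
    for _ in range(num_iters):
        errors = [sum(a * b for a, b in zip(row, theta)) - y_i
                  for row, y_i in zip(X, y)]
        theta = [t - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
                 for j, t in enumerate(theta)]
    return cost(X, y, theta)


# Hypothetical data generated by y = 1 + 2x, with x_0 = 1 in the first column.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

initial_cost = cost(X, y, [0.0, 0.0])
cost_small_alpha = run(X, y, alpha=0.01, num_iters=50)  # converges, but slowly
cost_good_alpha = run(X, y, alpha=0.1, num_iters=50)    # converges faster
cost_large_alpha = run(X, y, alpha=0.5, num_iters=50)   # overshoots and diverges
```

On this data the smallest alpha lowers the cost only slowly, the middle one lowers it further in the same number of steps, and the largest one ends with a cost higher than where it started.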

To make sure gradient descent is working, plot the cost J(theta) against the number of iterations, and make sure it is decreasing on every iteration. If it decreases by less than a sufficiently small value (say 10^-3) in one iteration, then we can declare that gradient descent has converged.
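The same check can be automated instead of read off a plot. This is a minimal plain-Python sketch; the function name, the default tolerance of 10^-3, and the recorded cost values are hypothetical.

```python
def check_convergence(cost_history, tolerance=1e-3):
    """Return (always_decreasing, converged_at) for a list of J(theta) values.

    converged_at is the first iteration index whose decrease in cost is
    non-negative and smaller than tolerance, or None if there is none.
    """
    always_decreasing = all(later <= earlier
                            for earlier, later in zip(cost_history, cost_history[1:]))
    converged_at = None
    for i in range(1, len(cost_history)):
        if 0 <= cost_history[i - 1] - cost_history[i] < tolerance:
            converged_at = i
            break
    return always_decreasing, converged_at


# Hypothetical cost values recorded after each iteration.
costs = [13.8, 4.2, 1.5, 0.6, 0.5995, 0.5994]
result = check_convergence(costs)

# A cost that grows suggests the learning rate is too high.
diverging = check_convergence([1.0, 2.0, 4.0])
```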

Summarising learning rate guidelines for alpha

In Octave, the transpose of X is written X' (prime). So to compute inverse(transpose(X) * X) * transpose(X) * Y in Octave, we can write pinv(X' * X) * X' * Y (pinv computes the pseudo-inverse, so it still returns a result when X' * X is not invertible)

Gradient Descent vs Normal

Summary

The normal equation solves directly for the theta that minimises the cost function J

Basically, set the partial derivative of J(theta) with respect to theta j equal to 0 for every j, then solve for theta
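A sketch of that derivation, assuming the vectorised cost with 1/(2m) scaling and assuming X^T X is invertible:

```latex
\nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y) = 0
\;\Longrightarrow\;
X^T X \theta = X^T y
\;\Longrightarrow\;
\theta = (X^T X)^{-1} X^T y
```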

Example ...

If for one training example ...

Matrix X - To calculate the design matrix X ...

Matrix Y - To calculate the matrix Y (it's simply the vector of all results in the training set)

In summary, for X and Y ....

X is the design matrix
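As an illustration in plain Python, the design matrix and the vector of results can be built from a list of training examples by adding x_0 = 1 to the front of each feature list. The function name and the training values below are hypothetical.

```python
def build_design_matrix(examples):
    """Prepend x_0 = 1 to each example's feature list; one row per example."""
    return [[1.0] + list(features) for features in examples]


# Hypothetical training set of (features, result) pairs.
training_set = [
    ([2104.0, 3.0], 400.0),
    ([1600.0, 3.0], 330.0),
    ([2400.0, 4.0], 369.0),
]

X = build_design_matrix(features for features, _ in training_set)
y = [result for _, result in training_set]
```

X then has one row per training example and one column per feature plus the leading column of ones, and y has one entry per training example.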

Dividing by the range (max - min) should bring x into approximately the range -1 < x < 1

We use feature scaling so gradient descent needs fewer steps to converge

Polynomial Regression. If you choose your features in this way, then feature scaling becomes very important.
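For example, assuming a single input x and a cubic model (the degree is an arbitrary choice for illustration), the features are powers of x:

```latex
h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3,
\qquad x_1 = x, \quad x_2 = x^2, \quad x_3 = x^3
```

If x ranges from 1 to 1000, then x^2 ranges from 1 to 10^6 and x^3 from 1 to 10^9, so the features end up on very different scales.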

As an example, you can combine features to simplify a hypothesis expression

Summary:

Take (x - average of x) / (max - min)

... in other words
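A minimal plain-Python sketch of this mean normalisation for a single feature follows. The function name and the sample values are hypothetical, and returning zeros for a constant feature is an added assumption.

```python
def mean_normalise(values):
    """Return (x - mean) / (max - min) for every x in values."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    if value_range == 0:
        # A constant feature carries no information; map it to zeros (assumption).
        return [0.0 for _ in values]
    return [(v - mean) / value_range for v in values]


# Hypothetical feature values: mean 1875, range 2000.
sizes = [1000.0, 1500.0, 2000.0, 3000.0]
scaled = mean_normalise(sizes)
```

The scaled values are centred on zero and span a range of exactly 1, so they fall comfortably within -1 < x < 1.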