Gradient Descent & Loss Function

Loss Functions

1. Define loss functions (cost functions) and state their main purpose.

A loss function quantifies the error, or discrepancy, between a model's predicted values and the actual target values.

The main purpose is to guide the learning process by providing a measure of how well or poorly a model is performing.

2. List the two common regression loss functions.
  • Mean Squared Error (MSE) - the average of the squared differences between predicted and actual values.
  • Mean Absolute Error (MAE) - the average of the absolute differences between predicted and actual values. (Both are sketched below.)
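A minimal sketch of both losses in Python, assuming NumPy arrays of targets and predictions (the sample values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of the absolute differences
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
print(mse(y_true, y_pred))  # (0.25 + 0.0 + 2.25) / 3 = 0.8333...
print(mae(y_true, y_pred))  # (0.5 + 0.0 + 1.5) / 3 = 0.6666...
```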
3. State the common loss function for a binary classification problem.
Binary cross-entropy loss.
4. State the common loss function for a multi-class classification problem.
Categorical cross-entropy loss.
5. What is the training objective for the loss function in supervised learning?
The objective of the loss function in supervised learning is to minimise the loss.
6. Does the loss function have to be the evaluation metric when assessing model performance? Provide an example.
The loss function is not always the evaluation metric. For example, classification models are usually evaluated with F1 score, precision, and recall, while the loss function used for training is cross-entropy.
7. Why is it important to select an appropriate loss function?
It is important to select the right loss function because different loss functions emphasise different aspects of model error, such as how outliers are handled, how strongly larger errors are penalised, and whether certain types of errors are preferred over others.
8. Provide the sum of squared errors formula for a linear equation $\hat{y} = a + \beta x_i$.
$SSE = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \left(y_i - (a + \beta x_i)\right)^2$
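A quick numeric check of the formula, with illustrative data and illustrative values for $a$ and $\beta$:

```python
import numpy as np

# Illustrative data and parameters, not taken from the text
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])
a, beta = 0.5, 1.5

y_hat = a + beta * x            # predicted values for each x_i
sse = np.sum((y - y_hat) ** 2)  # sum of squared residuals
print(sse)  # residuals are 0.0, 0.5, 0.0 -> SSE = 0.25
```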

Gradient Descent

1. Explain gradient descent and its main purpose.
Gradient descent is an optimisation algorithm whose main goal is to find the parameters (weights and biases) of a model that minimise the error between the predicted outputs and the actual target values.
2. Why is the gradient descent algorithm intuitively called "gradient descent"?
Gradient descent uses the gradient (slope) to descend to the lowest point of the loss function, hence the name.
3. State how gradient descent finds the optimal values efficiently.
Gradient descent finds the optimal value efficiently by taking large steps when it is far from the minimum and progressively smaller steps as it approaches the local minimum.
4. State the difference between Ordinary Least Squares Technique (OLS) and gradient descent (GD) and explain the advantage of GD over OLS.
OLS solves for the optimal parameter value analytically, by finding where the slope of the loss curve equals 0, whereas gradient descent approaches the minimum iteratively, taking steps until it reaches the best value. This makes gradient descent more practical than OLS in cases where solving for the point where the derivative equals zero is not feasible.
5. List the 4 key steps of gradient descent. (Hint: I-D-U-R)
  1. Initialise: Randomly initialise the parameter value.
  2. Derive: Find the slope of the loss function by computing the derivative of the loss function at that point (parameter value).
  3. Update: Update the parameter value with the step size.
  4. Repeat: Repeat steps 2 and 3 until GD reaches a local minimum or the stopping criterion is met (see the sketch below).
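A minimal sketch of the four I-D-U-R steps on the simple quadratic loss $L(w) = (w - 3)^2$; the learning rate and stopping values are illustrative:

```python
def dL_dw(w):
    # Derivative of L(w) = (w - 3)^2, whose minimum is at w = 3
    return 2 * (w - 3)

w = 0.0                          # 1. Initialise the parameter
learning_rate = 0.1
for _ in range(1000):            # 4. Repeat (up to 1000 steps)
    gradient = dL_dw(w)          # 2. Derive the slope at the current value
    step_size = learning_rate * gradient
    w -= step_size               # 3. Update: move opposite to the slope
    if abs(step_size) < 0.001:   # stop once the steps become tiny
        break
print(w)  # converges close to 3
```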
6. Given $\hat{y} = \beta_0 + \beta_1 x$, explain how to optimise β0 using the gradient descent approach.
  1. Initialise the intercept value β0 = 0.

  2. Calculate the partial derivative of the loss function at that point (parameter value) to find the slope with respect to the parameter.

    $\frac{\partial L}{\partial \beta_0}$

    Note: A partial derivative is the derivative of a function of several variables with respect to a change in just one of its variables; all other variables are treated as constants. For example, to find the partial derivative with respect to β0, we treat β1 as a constant, so the term involving β1 differentiates to 0.

  3. Update the new intercept by moving a proportional step in the opposite direction of the derivative.

    $\beta_0^{(k+1)} = \beta_0^{(k)} - \eta \, \frac{\partial L}{\partial \beta_0}\left(\beta_0^{(k)}\right)$, i.e. New β0 = Old β0 − Step Size

    where $\eta$ is the learning rate.

    Note:

    1. The step size is proportional to the derivative value, scaled by the chosen learning rate ($\eta$).
    2. The updated intercept value will be used for the next iteration.
  4. Repeat steps 2 and 3 until GD reaches the local minimum of the loss function (a runnable sketch follows below).
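The same procedure as a runnable sketch, assuming SSE as the loss and a fixed, known slope β1 (the data are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])   # generated by y = 1 + 2x
beta1 = 2.0                     # slope treated as a known constant here

beta0 = 0.0                     # 1. Initialise the intercept
eta = 0.1                       # learning rate
for _ in range(1000):           # 4. Repeat
    # 2. Partial derivative of SSE w.r.t. beta0: -2 * sum(y_i - (beta0 + beta1 * x_i))
    gradient = -2 * np.sum(y - (beta0 + beta1 * x))
    step_size = eta * gradient
    beta0 -= step_size          # 3. Update in the opposite direction of the slope
    if abs(step_size) < 0.001:
        break
print(beta0)  # close to 1.0, the true intercept
```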

7. Explain why we move in the opposite direction of the derivative result, that is, we increase the parameter value when the derivative is negative.
When the derivative is negative, the slope is negative. This implies that a higher parameter value lies closer to the minimum of the loss function, and therefore we increase the parameter value to move toward the minimum.
8. Explain the negative implications of small learning rate.
  • Slow Convergence: Small learning rates can result in slow convergence, meaning it takes a long time for the optimisation algorithm to reach the minimum of the loss function.
  • Getting Stuck in Local Minima: With a very small learning rate, the optimization algorithm may become overly cautious and get stuck in local minima or saddle points.
9. Explain the negative implications of large learning rate.
  • Overshooting the Minimum: Large learning rates can cause the algorithm to take excessively large steps in the parameter space during each iteration.
  • Slow Convergence: A large learning rate can actually lead to slower convergence if the algorithm consistently overshoots the minimum and oscillates around it. (Both effects are sketched below.)
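Both failure modes can be seen on the simple loss $L(w) = w^2$; the learning rates below are chosen purely for illustration:

```python
def run_gd(learning_rate, steps=20, w=5.0):
    # Gradient descent on L(w) = w^2, whose gradient is 2w
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

print(run_gd(0.001))  # too small: barely moves toward 0 (slow convergence)
print(run_gd(0.4))    # moderate: lands very close to the minimum at 0
print(run_gd(1.1))    # too large: each step overshoots and w diverges
```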
10. When the slope (gradient) of a function is equal or close to 0, it indicates a potential minimum at that point. Explain the underlying concept.
The minimum point of a function corresponds to a valley in its graph. At the minimum, the function is momentarily flat before it starts increasing again, which corresponds to a 0 rate of change, that is, a 0 slope.
11. State the general minimum step size and maximum number of steps to take before stopping the gradient descent optimisation.
  • Minimum Step Size: 0.001 or smaller
  • Maximum Number of Steps: 1000
12. Given $\hat{y} = \beta_0 + \beta_1 x$, briefly describe how to use gradient descent to co-optimise β0 and β1.

Simply run the same procedure for both parameters simultaneously:

  • Initialise the two variables with random values.
  • Compute the gradients for the loss function with respect to both variables using the partial derivatives.
  • Update both variables simultaneously (see the sketch below).
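A minimal sketch of the simultaneous update, with illustrative data and hyperparameters:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated by y = 1 + 2x

beta0, beta1 = 0.0, 0.0              # initialise both parameters
eta = 0.01                           # learning rate
for _ in range(10000):
    residuals = y - (beta0 + beta1 * x)
    grad_b0 = -2 * np.sum(residuals)       # dSSE/dbeta0
    grad_b1 = -2 * np.sum(residuals * x)   # dSSE/dbeta1
    # Update both parameters from the same residuals, i.e. simultaneously
    beta0 -= eta * grad_b0
    beta1 -= eta * grad_b1
print(beta0, beta1)  # approach 1.0 and 2.0
```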
13. Given $\hat{y} = \beta_0 + \beta_1 x$, explain the process of deriving the SSE for β0 (which will be similar for β1).

SSE Formula

$SSE = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$

Deriving SSE to get the loss function’s gradient

  1. Applying the sum rule for derivatives, the derivative of the SSE is simply the sum of the derivatives of the individual errors.

    $\frac{\partial L}{\partial \beta_0} = \frac{\partial}{\partial \beta_0}\left(y_1 - (\beta_0 + \beta_1 x_1)\right)^2 + \frac{\partial}{\partial \beta_0}\left(y_2 - (\beta_0 + \beta_1 x_2)\right)^2 + \dots + \frac{\partial}{\partial \beta_0}\left(y_n - (\beta_0 + \beta_1 x_n)\right)^2$

  2. Apply the chain rule to derive the individual errors.

    $\frac{\partial}{\partial \beta_0}\left(y_1 - (\beta_0 + \beta_1 x_1)\right)^2$

    Outer derivative: $2\left(y_1 - (\beta_0 + \beta_1 x_1)\right)$

    Inner derivative: $\frac{\partial}{\partial \beta_0}\left(y_1 - \beta_0 - \beta_1 x_1\right) = -1$

    Outer × Inner: $-2\left(y_1 - (\beta_0 + \beta_1 x_1)\right)$

    Note: For the inner function's derivative, because we are taking a partial derivative, all variables except β0 are treated as constants.

  3. Substitute the values into the derived expression for each sample, then aggregate the results to find the gradient (see the sketch below).
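Aggregating the per-sample terms gives the full gradient $\frac{\partial SSE}{\partial \beta_0} = -2\sum_i \left(y_i - (\beta_0 + \beta_1 x_i)\right)$. A small numeric sketch of step 3, with illustrative values:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
beta0, beta1 = 0.0, 2.0   # current (illustrative) parameter values

# Per-sample derivative terms: -2 * (y_i - (beta0 + beta1 * x_i))
terms = -2 * (y - (beta0 + beta1 * x))
print(terms)              # [-4. -4.]
gradient = terms.sum()    # aggregate the samples to get dSSE/dbeta0
print(gradient)           # -8.0: negative slope, so beta0 should increase
```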

14. Explain batch gradient descent (BGD).
The model calculates the gradient on the full training set at each step, which can be extremely computationally expensive and slow.
15. Explain stochastic gradient descent (SGD).
SGD uses a single sample (one point) from the dataset to approximate the gradient and update the parameters accordingly, which speeds up optimisation at the cost of noisier updates.
16. Explain mini-batch gradient descent (MBGD).
Mini-batch GD uses a subset of samples (multiple points) from the dataset to approximate the gradient and update the parameters accordingly, balancing the speed-up of SGD against the stability of batch updates (see the sketch below).
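The three variants differ only in how many samples feed each gradient estimate; a hedged sketch (the data and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=100)

def sse_gradient(idx, beta0, beta1):
    # SSE gradients for beta0 and beta1 over the chosen sample indices
    r = y[idx] - (beta0 + beta1 * x[idx])
    return -2 * r.sum(), -2 * (r * x[idx]).sum()

all_idx = np.arange(len(x))
g_bgd  = sse_gradient(all_idx, 0.0, 0.0)                          # BGD: all 100 samples
g_sgd  = sse_gradient(rng.choice(all_idx, size=1), 0.0, 0.0)      # SGD: a single sample
g_mbgd = sse_gradient(rng.choice(all_idx, size=16, replace=False),
                      0.0, 0.0)                                   # MBGD: a mini-batch
```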
17. Illustrate the graphs for batch, stochastic and mini-batch gradient descent cost function as the number of iteration increases.
[Figure: cost vs. number of iterations for batch, stochastic, and mini-batch gradient descent]

Image Source: https://www.geeksforgeeks.org/ml-mini-batch-gradient-descent-with-python/

  • BGD - The gradient is averaged over the entire dataset, so the loss curve is relatively smooth.
  • SGD - The loss fluctuates from update to update because each gradient is estimated from a single sample, making it more volatile.
  • MBGD - A balance between the speed of convergence and the noise associated with each gradient update.
Last updated on 23 Aug 2024