Training

1. Define generalisation.
Generalisation refers to a model’s ability to adapt properly to new, previously unseen data. The performance of an ML model is evaluated on its ability to generalise when predicting unseen data.
2. Define hyperparameters.
Hyperparameters are parameters that are not learned from the data during training but are set before training begins. They control various aspects of the machine learning model’s behavior and are essential for configuring and fine-tuning the model’s performance.
3. State the difference between parameters and hyperparameters.
Parameters are the internal variables learned by the model during training, whereas hyperparameters are external settings that control the model’s behavior and are set by the user.
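As a concrete illustration, here is a minimal sketch using scikit-learn (the Ridge model and its alpha value are illustrative choices, not part of the original answer):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)

# Hyperparameter: chosen by the user before training begins.
model = Ridge(alpha=1.0)

# Parameters: learned from the data during training.
model.fit(X, y)
print(model.coef_, model.intercept_)  # the learned parameters
```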
4. Define the holdout method (Train-Test Split).
It involves dividing a dataset into two distinct subsets: one for training the model and another for testing the model’s performance.
5. Explain how holdout method (Train-Test Split) works.
  1. Split the data into a training set (70-80%) and a test set (the remaining 20-30%).
  2. The training set is used to train the machine learning model.
  3. The testing set is used to evaluate the model’s performance using an appropriate metric (see the sketch below).
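A minimal sketch of the holdout method with scikit-learn (the dataset, the 70/30 split, and the classifier are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 1. Hold out 30% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Train on the training set only.
model = KNeighborsClassifier()
model.fit(X_train, y_train)

# 3. Evaluate on the held-out test set.
print(accuracy_score(y_test, model.predict(X_test)))
```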
6. Define data leakage and the implications.
Data leakage occurs when information from outside the training dataset is improperly used to train a model or make predictions. It can lead to overly optimistic or unrealistic model performance during development, which may not hold up when the model is deployed in production.
7. Data leakage can occur when future information is included in the training data. Elaborate on this concept and provide an example.

Training data inadvertently includes information from the future that the model would not have access to in a real-world scenario.

For example, if you’re predicting stock prices, using features that include future stock prices or news articles published after the date you want to predict would be considered data leakage.
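For time-ordered data, one safeguard is a chronological split rather than a random one. A sketch using scikit-learn’s TimeSeriesSplit (the price array is a hypothetical stand-in for real market data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily price series, one row per day.
prices = np.arange(100, dtype=float).reshape(-1, 1)

# TimeSeriesSplit guarantees the validation fold always comes *after*
# the training fold in time, so no future information leaks backwards.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(prices):
    assert train_idx.max() < val_idx.min()  # train strictly precedes validation
    print(f"train: days {train_idx.min()}-{train_idx.max()}, "
          f"validate: days {val_idx.min()}-{val_idx.max()}")
```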

8. Data leakage can occur because of data preprocessing errors. Elaborate on this concept and provide an example.

Errors in data preprocessing, such as normalising or scaling features, can introduce data leakage.

For example, if you scale the entire dataset, including the testing set, based on statistics calculated from both sets, it can lead to information leakage from the test set into the training set.
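A sketch of the correct pattern with scikit-learn (StandardScaler and the iris dataset are illustrative choices): fit the scaler on the training set only, then reuse it on the test set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Leaky: StandardScaler().fit(X) would compute the mean and standard
# deviation over the training AND test rows together.

# Correct: fit the scaler on the training set only, then apply the
# same (train-derived) transformation to the test set.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```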

9. Data leakage can occur due to improper cross-validation implementation. Elaborate on this concept and provide an example.

When performing cross-validation, it’s essential to ensure that the validation set in each fold does not leak information from the training set.

For instance, if you fit preprocessing steps (e.g., missing-value imputation) on the entire fold rather than on its training portion alone, information from the validation set will influence the training data within that fold and introduce data leakage.
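The usual safeguard is to bundle preprocessing and model into a single pipeline, so each fold re-fits the preprocessing on its own training portion. A sketch with scikit-learn (the imputer, scaler, and classifier are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Inside each CV iteration, the imputer and scaler are fitted on the
# training folds only, so no statistics leak from the validation fold.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```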

10. State the main issue with using only train-test split without a separate validation set.
Risk of overfitting to the test set: when you use only a training set and a test set, you end up training your model and tuning its hyperparameters until it performs well on the test set. The test score then stops being an unbiased estimate of generalisation, because the model has effectively been fitted to the test set rather than to generalisable patterns. A separate validation set absorbs this tuning, so the test set is touched only once, for the final evaluation.
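A minimal sketch of the usual remedy, a three-way split (the 60/20/20 proportions and the dataset are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set, then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0  # 0.25 * 0.8 = 20% overall
)

# Tune hyperparameters against (X_val, y_val); touch (X_test, y_test)
# only once, at the very end, for an unbiased estimate of generalisation.
```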
11. Explain how K-fold cross validation works.
[Figure: k-fold cross-validation diagram]

Image Source: https://towardsdatascience.com/why-do-we-need-a-validation-set-in-addition-to-training-and-test-sets-5cf4a65550e0

  1. Split the training dataset into K folds; K is also the number of iterations.
  2. For each iteration
    1. Hold out one fold as the test set; the remaining K-1 folds form the training set.
    2. Fit a model on the training set and evaluate it on the held-out fold.
    3. Retain the score.
  3. Average the scores across all K folds to get the cross-validated score of the model (see the sketch below).
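The steps above map directly onto a short loop; a sketch with scikit-learn’s KFold (K = 5, the dataset, and the classifier are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 1. Split the dataset into K = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # 2.1 Hold out one fold; train on the remaining K-1 folds.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # 2.2-2.3 Evaluate on the held-out fold and retain the score.
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# 3. Average the fold scores to get the cross-validated score.
print(np.mean(scores))
```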
12. List the K-fold cross validation advantages.
  • It reduces the impact of randomness in the data split because the model is evaluated on multiple different test sets.
  • It provides a more accurate and stable estimate of how well the model generalizes to new data.
  • It helps identify if the model’s performance varies significantly across different subsets of the data.
  • It is especially useful when working with limited data.
13. Does K-fold cross validation return a trained model?
No. Cross validation does not output a trained model; it provides an estimate of the score that a model trained on the entire dataset would achieve. The final model is typically retrained on the full dataset afterwards.
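This is visible in scikit-learn, where cross_val_score clones the estimator for each fold and discards the fitted copies; a minimal sketch (the model and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Returns one score per fold; `model` itself is left unfitted.
scores = cross_val_score(model, X, y, cv=5)

# To obtain a deployable model, fit it on the entire dataset afterwards.
model.fit(X, y)
```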