K-Nearest Neighbours
1. Define KNN.
A supervised, non-linear, distance-based ML algorithm capable of solving both classification and regression problems.
2. Explain KNN’s main assumption.
KNN assumes that similar things exist in close proximity. In other words, KNN assumes that data points that are close to each other in the feature space are more likely to belong to the same class (for classification) or have similar target values (for regression tasks).
3. KNN is a simple and intuitive algorithm. Explain the advantage.
- KNN is easy to understand and implement because it does not involve complex mathematical concepts or require extensive parameter tuning.
- KNN doesn’t require a separate training phase. It simply stores the training data and uses it directly for predictions.
4. KNN is a non-parametric algorithm. Explain the advantage.
KNN is non-parametric, which means it makes no assumptions about the underlying data distribution, allowing KNN to be used to classify or predict data from any distribution.
5. KNN is a versatile algorithm. Explain the advantage.
KNN can be used for both classification and regression tasks. It can handle various types of data, including numerical and categorical variables.
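A minimal sketch of this versatility, assuming scikit-learn is available (the toy data is purely illustrative):

```python
# The same KNN idea applied to classification and regression.
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]

# Classification: predict a discrete label by majority vote of the neighbours.
y_class = [0, 0, 0, 1, 1, 1]
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[2.5]]))   # -> [0]

# Regression: predict a continuous value by averaging the neighbours' targets.
y_reg = [1.1, 2.1, 2.9, 10.2, 11.1, 11.8]
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
print(reg.predict([[2.5]]))   # -> roughly the mean of the 3 closest targets
```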
6. List the advantages of KNN.
- Simple and Intuitive.
- Non-Parametric.
- Versatile.
- No Training Period required.
7. KNN can be computationally expensive. Explain the disadvantage.
KNN can be computationally expensive, especially for large datasets, as it requires calculating distances to all data points in the training set for each prediction.
8. KNN is sensitive to outliers. Explain the disadvantage.
KNN is sensitive to outliers because it relies on distance calculations. Outliers can significantly impact the neighbours selected and the resulting predictions.
9. KNN has the curse of dimensionality. Explain the disadvantage.
KNN’s performance can degrade as the dimensionality of the feature space increases. As the number of features increases, the data becomes more sparse, meaning that there are fewer data points in close proximity to each other. This makes it more difficult to identify the k-nearest neighbours of a given query point.
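A small simulation (not from the source, purely illustrative) that shows why this happens: as the number of dimensions grows, the nearest random point is barely closer than the farthest one, so "nearest neighbours" lose their meaning.

```python
# As dimensionality grows, pairwise distances between random points concentrate,
# so the nearest neighbour is hardly closer than any other point.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))           # 500 random points in the unit hypercube
    q = rng.random(d)                  # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    ratio = dists.min() / dists.max()  # approaches 1 as d grows
    print(f"d={d:4d}  nearest/farthest distance ratio = {ratio:.2f}")
```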
10. List the disadvantages of KNN.
- Computationally expensive.
- Sensitive to Outliers.
- Curse of Dimensionality.
11. State and explain the steps to perform KNN for both regression and classification problems.
- Preprocess dataset: Scale the dataset.
- Select K: The number of nearest neighbours to consider.
- Compute distance:
- Calculate the distance between the query point and every data point in the training set.
- Select the K nearest neighbours, i.e. the K training points with the smallest distances to the query point.
- Prediction:
- For classification, use mode (majority class based on the k-nearest neighbours) to determine the predicted class for the data point.
- For regression, use mean of the target value of the k-nearest neighbours to predict the target value for the data point.
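A from-scratch sketch of these steps, assuming NumPy; the toy dataset and the standardisation choice are illustrative only:

```python
# Steps: scale the data, compute distances, take the K nearest points,
# then vote (classification) or average (regression).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    # Step 1 - Preprocess: standardise features using the training statistics.
    mean, std = X_train.mean(axis=0), X_train.std(axis=0) + 1e-12
    X_scaled = (X_train - mean) / std
    q_scaled = (np.asarray(x_query) - mean) / std

    # Steps 2-3 - Compute Euclidean distances and pick the K smallest.
    dists = np.linalg.norm(X_scaled - q_scaled, axis=1)
    nearest = np.argsort(dists)[:k]

    # Step 4 - Predict: mode for classification, mean for regression.
    if task == "classification":
        return Counter(y_train[nearest]).most_common(1)[0][0]
    return y_train[nearest].mean()

X = np.array([[1.0, 200.0], [2.0, 180.0], [9.0, 10.0], [10.0, 12.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, [1.5, 190.0], k=3))  # -> 0
```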
12. Explain the importance of scaling before computing the distance.
KNN is a distance-based algorithm. Without scaling, variables with higher magnitude may dominate the distance metric, giving those variables greater importance. Therefore, to avoid biasing the model towards variables with higher magnitude, the features need to be scaled so that they contribute equally to the distance calculations.
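A small illustration with made-up numbers, assuming scikit-learn's StandardScaler: before scaling, the salary feature dwarfs the age feature in the Euclidean distance; after scaling, both contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 30_000],    # [age, salary]
              [26, 90_000],
              [50, 31_000]])

print(np.linalg.norm(X[0] - X[1]))  # ~60000: salary difference dominates
print(np.linalg.norm(X[0] - X[2]))  # ~1000: the large age gap barely matters

X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # both features now matter
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))
```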
13. State the primary hyperparameter in KNN.
K (Number of Neighbours): This hyperparameter determines the number of nearest neighbours to consider when making predictions.
14. List the two most common distance metrics to compute the distance in KNN.
- Euclidean distance
- Manhattan distance
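A quick sketch of both metrics with NumPy (the points are illustrative):

```python
# Euclidean: sqrt(sum((a_i - b_i)^2))    Manhattan: sum(|a_i - b_i|)
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))  # = 5.0
manhattan = np.sum(np.abs(a - b))          # = 7.0
print(euclidean, manhattan)
```

In scikit-learn's KNN estimators the metric can be switched via the `metric` parameter (e.g. `metric="manhattan"`).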
15. State the negative implications of choosing a K that is too large and a K that is too small.
- Choosing Too Many Neighbours (Large K):
- Underfitting Risk: May exhibit high bias because predictions are averaged over many data points, smoothing away local patterns.
- Increased Computational Complexity: Calculating distances to a large number of neighbours can be computationally expensive, especially for large datasets
- Choosing Too Few Neighbours (Small K):
- Overfitting Risk: Prone to overfitting because predictions rely on very few neighbours, making them sensitive to noise in the training data.
- Less Stable Predictions: Predictions can be less stable and more subject to changes in the training data.
16. Explain the elbow method’s use case and the steps to perform it.
- The elbow method is a heuristic technique for finding the optimal K based on the trade-off between model complexity and performance.
- Steps:
- Calculate the model’s performance (e.g., error) for a range of K values.
- Plot the performance values against the corresponding K values.
- Look for the “elbow point” on the curve, which represents the K value where the performance improvement starts to diminish. This point is often considered the optimal K.
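A sketch of these steps using scikit-learn's cross-validation on the Iris dataset (the dataset, K range, and error measure are illustrative choices, not from the source):

```python
# Plot cross-validated error against K and look for the elbow.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_values = range(1, 31)
errors = []
for k in k_values:
    # Mean cross-validated error (1 - accuracy) for each candidate K.
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    errors.append(1 - acc)

plt.plot(k_values, errors, marker="o")
plt.xlabel("K (number of neighbours)")
plt.ylabel("Cross-validated error")
plt.title("Elbow method: pick K where the error curve flattens")
plt.show()
```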