Decision Tree
1. Explain Decision Tree Algorithms.
2. Describe the first step in the decision tree algorithm.
- The algorithm evaluates all the attributes (features) and selects the one that best distinguishes the data based on a certain criterion (for instance, the Gini index).
- The most important attribute becomes the root node.
3. What does the depth of the tree represent intuitively?
4. Provide the common criteria to select the features as a decision node for classification and regression.
- Classification: Gini Impurity
- Regression: Mean Squared Error
5. Gini Impurity is typically used to measure the impurity of the features. Provide the formula to calculate the Gini index.
It measures the impurity of each node on a scale of [0, 1]. The lower, the better (the purer the node).
$$ \text{Gini(node)} = 1 - \sum_i p^2_i $$
Where $p_i$ is the proportion of observations of class $i$ at the node.
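The formula can be checked with a small pure-Python sketch (the `gini` helper and the example labels are illustrative, not from the source):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum(p_i^2) over class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A pure node (all one class) has impurity 0; a 50/50 node has 0.5.
print(gini(["yes"] * 100))               # 0.0
print(gini(["yes"] * 50 + ["no"] * 50))  # 0.5
```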
6. Why does a lower Gini Impurity imply a better feature? Provide a simple example using the formula.
If a feature can predict a class with 100% accuracy, the Gini index will be 0:
$$ 1 - (\text{probability of “yes”})^2 - (\text{probability of “no”})^2 = 1 - \left(\frac{100}{100}\right)^2 - \left(\frac{0}{100}\right)^2 = 0 $$
The feature has no impurity, as it predicts the class perfectly.
7. A binary feature will have two nodes. However, one node may have a higher number of samples than the other (imbalanced sample sizes). Provide a formula to calculate the total Gini Impurity for that feature.
The total Gini Impurity is the weighted average of the leaf Impurities.
$$ \text{Gini Impurity} = \left(\frac{n_1}{n_1+n_2}\right)g_1 + \left(\frac{n_2}{n_1+n_2}\right)g_2 $$
Where,
- $n_1$ : Sample size for node 1
- $n_2$: Sample size for node 2
- $g_1$: Gini index for node 1
- $g_2$: Gini index for node 2
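As a quick sanity check, the weighted average can be computed directly (a minimal sketch; the sample sizes and impurities below are made up for illustration):

```python
def weighted_gini(n1, g1, n2, g2):
    """Total Gini impurity: sample-size-weighted average of the node impurities."""
    total = n1 + n2
    return (n1 / total) * g1 + (n2 / total) * g2

# Node 1: 90 samples with impurity 0.1; node 2: 10 samples with impurity 0.5.
# The larger node dominates: 0.9 * 0.1 + 0.1 * 0.5 = 0.14
print(weighted_gini(90, 0.1, 10, 0.5))
```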
8. Explain how to find the optimal split for a numeric / continuous variable?
- Sort the feature values (lowest to highest)
- Compute the midpoint between each pair of adjacent values.
- Calculate the weighted impurity for each candidate midpoint split.
- Select the midpoint with the lowest impurity as the optimal split.
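The steps above can be sketched in pure Python (the `doses`/`effective` toy data and helper names are illustrative assumptions):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gini_split(values, labels):
    """Try every midpoint between adjacent sorted values; return the
    (threshold, weighted impurity) pair with the lowest impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values give no midpoint to split on
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best

doses = [10, 20, 35, 50, 70, 90]
effective = ["no", "no", "yes", "yes", "yes", "no"]
print(best_gini_split(doses, effective))
```

With this toy data, the best threshold falls between 20 and 35 (midpoint 27.5), isolating the two low "no" doses in a pure left node.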
9. For Decision Tree Regressor (DTR), explain the process of splitting the dataset optimally using the sum of squared errors (SSE) methodology.
Similar to classification, DTR splits the dataset by finding the optimal threshold through an iterative process.
- Split the dataset into two subsets: the first data point and the remaining data points.
- For each subset, calculate the sum of squared errors, then add the two together.
- Shift the split point and repeat steps 1 and 2 for every candidate threshold.
- Select the optimal split (threshold) that has the lowest total SSE.
- To complete the decision tree, repeat steps 1 to 4 on each resulting subset until a predefined stopping condition (such as min_samples_split) is met.
10. For Decision Tree Regressor (DTR), briefly describe how to select the best feature (predictor) when there are multiple features.
- For each feature, compute the sum of squared errors for different thresholds and select the optimal threshold that yields the lowest SSE.
- Compare the lowest SSE of each feature and select the feature with the smallest value.
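This comparison across features can be sketched as a small loop (the `age`/`noise` feature names and toy data are hypothetical, chosen so one feature clearly tracks the target):

```python
def sse(ys):
    """Sum of squared errors around the subset mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_threshold_sse(xs, ys):
    """Lowest combined SSE over all midpoint thresholds for one feature."""
    pairs = sorted(zip(xs, ys))
    best = float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        best = min(best, sse([y for x, y in pairs if x <= t])
                       + sse([y for x, y in pairs if x > t]))
    return best

def best_feature(features, ys):
    """Pick the feature whose optimal split yields the lowest SSE."""
    return min(features, key=lambda name: best_threshold_sse(features[name], ys))

features = {"age": [1, 2, 3, 10, 11, 12], "noise": [3, 11, 1, 12, 2, 10]}
ys = [5.0, 5.5, 5.2, 20.0, 19.5, 20.5]
print(best_feature(features, ys))  # 'age'
```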
11. Briefly describe the steps to perform the decision tree algorithm.
- Select the best attribute to be the root node.
- The dataset is split into subsets based on their values of that attribute.
- Repeat the process of selecting the best attribute for each subset until a stopping condition is met.
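The three steps above can be sketched as one recursive function (a minimal pure-Python sketch, not a production implementation; `build_tree`, `predict`, and the toy `dose` data are illustrative assumptions):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, min_samples_split=2):
    """rows: list of feature dicts; labels: class label per row.
    Returns a nested dict (internal node) or a class label (leaf)."""
    # Stop if the node is pure or too small to split further.
    if len(set(labels)) == 1 or len(rows) < min_samples_split:
        return Counter(labels).most_common(1)[0][0]
    best = None  # (weighted impurity, feature, threshold)
    for feat in rows[0]:
        values = sorted({r[feat] for r in rows})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2
            left = [y for r, y in zip(rows, labels) if r[feat] <= t]
            right = [y for r, y in zip(rows, labels) if r[feat] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feat, t)
    if best is None:  # no usable split (all feature values identical)
        return Counter(labels).most_common(1)[0][0]
    _, feat, t = best
    left_idx = [i for i, r in enumerate(rows) if r[feat] <= t]
    right_idx = [i for i, r in enumerate(rows) if r[feat] > t]
    return {
        "feature": feat, "threshold": t,
        "left": build_tree([rows[i] for i in left_idx],
                           [labels[i] for i in left_idx], min_samples_split),
        "right": build_tree([rows[i] for i in right_idx],
                            [labels[i] for i in right_idx], min_samples_split),
    }

def predict(tree, row):
    """Walk from the root to a leaf by answering each node's question."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

rows = [{"dose": d} for d in [10, 20, 35, 50, 70, 90]]
labels = ["no", "no", "yes", "yes", "yes", "no"]
tree = build_tree(rows, labels)
print(predict(tree, {"dose": 40}))  # "yes"
```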
12. Why is the decision tree algorithm intuitive and easy to interpret?
- It reflects a sequential decision-making process, where each internal node corresponds to a decision based on a specific feature.
- Decision trees are visually represented as tree structures, making it easier to follow the decision making process.
- No complex mathematical formulas are used.
13. Why is the decision tree algorithm considered a versatile algorithm?
- It can handle both classification and regression problems.
- It can process numerical and categorical data together.
- It can handle non-linear relationships.
- It is robust to outliers.
- No assumptions required on the data distribution.
14. An advantage of the decision tree lies in its automatic feature selection. Briefly explain the mechanism through which this selection occurs.
15. List the advantages of a decision tree algorithm.
- Pretty intuitive and easy to interpret
- Versatile
- No scaling necessary
- Robust to outliers
- Automated feature selection
16. List and explain the disadvantages of a decision tree algorithm.
- Local optimum vs global optimum: the algorithm greedily makes the best decision at each step but does not account for the global optimum. Choosing the best split at a given step does not guarantee the overall best tree.
- Prone to overfitting: as the tree grows deeper and more specific, each split is based on an ever-smaller sample of observations that satisfy the previous conditions. Conclusions drawn from such small samples can be sub-optimal and unreliable.
Supplementary questions to help you understand the concepts better.
1. Explain decision nodes (internal nodes) in the context of decision tree algorithms.
2. Explain leaf nodes (terminal nodes) in the context of decision tree algorithms.
Leaf nodes (terminal nodes) represent the final outcomes or predictions.
- In a classification tree, each leaf node corresponds to a class label.
- In a regression tree, each leaf node contains the predicted numerical value.
3. Provide the formula to calculate the total Gini Impurity if the categorical feature has three categories.
4. Explain the hyperparameter min_samples_split and how to use it to mitigate overfitting.
- min_samples_split sets the minimum number of samples required to split an internal node.
- To prevent overfitting, set a higher value.
5. Explain the hyperparameter min_samples_leaf and how to use it to mitigate overfitting.
- min_samples_leaf sets the minimum number of samples required to be at a leaf node.
- To prevent overfitting, set a higher value.
6. Explain the hyperparameter max_depth and how to use it to mitigate overfitting.
- max_depth controls the maximum depth of the tree, which is the longest path from the root node to a leaf node.
- To prevent overfitting, set a lower value, which limits the complexity of the tree.
7. Between the hyperparameters min_samples_split and min_samples_leaf, which one takes precedence over the other?
min_samples_leaf overrides min_samples_split. A split won’t be allowed if it would leave fewer than min_samples_leaf samples in a child node.