Information

Author(s): Pierre Dupont, Benoit Ronval
Deadline: 23/02/2025 23:00:00
Submission limit: No submission limit


A1.3 - Decision Tree Learning: the impact of pruning

In this task, we will evaluate how pruning the decision tree can affect the classification performance.


Question 1: Pre-pruning the tree

Among other possible pre-pruning parameters, sklearn's CART implementation has a meta-parameter that prevents a branch from growing if the decrease in impurity is too small.

Study (offline, on your machine) how the test classification accuracy evolves when changing the min_impurity_decrease meta-parameter. Leave all other meta-parameters equal to their default values but, for reproducibility on Inginious, also set random_state=0. Use the entire training set to learn your models, and evaluate the training and testing accuracy on the entire train/test sets.

Once you have trained different models and analyzed the effect of this meta-parameter in terms of test classification performance, choose 3 of them in order to illustrate the impact: a model that performs best, one that performs worst, and one in between. Implement these 3 models in the code box below. Report for each one the number of nodes in the learned tree, as well as the training and testing accuracies.

Store these values in a pandas DataFrame named frame with the following column names: min_impurity_decrease, NodeCount, TrainAcc, TestAcc; where each row represents the result of a particular value for min_impurity_decrease.
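The steps above can be sketched as follows. The dataset here is a synthetic placeholder (the assignment's own train/test sets should be used instead), and the three min_impurity_decrease values are illustrative, not the best/worst/in-between values you are asked to find offline:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; replace with the assignment's actual train/test sets.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rows = []
for mid in [0.0, 0.01, 0.1]:  # illustrative values only
    clf = DecisionTreeClassifier(min_impurity_decrease=mid, random_state=0)
    clf.fit(X_train, y_train)
    rows.append({
        "min_impurity_decrease": mid,
        "NodeCount": clf.tree_.node_count,   # number of nodes in the tree
        "TrainAcc": clf.score(X_train, y_train),
        "TestAcc": clf.score(X_test, y_test),
    })

frame = pd.DataFrame(rows, columns=["min_impurity_decrease", "NodeCount",
                                    "TrainAcc", "TestAcc"])
print(frame)
```

Larger min_impurity_decrease values cut splits earlier, so NodeCount shrinks as the value grows; the unpruned tree (0.0) typically fits the training set perfectly.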

Question 2: Learning curves

Test question [20 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered question 1 correctly.

Perform the same experiment as in the previous question, but use different training sizes: 5%, 10%, 20%, 50%, 99% (similarly to what has been done in the previous task, perform 100 independent runs for each size to reduce the variance of the result). Create comparative plots of the learning curves and tree sizes without pruning and with pruning, using several values of min_impurity_decrease. Leave all other meta-parameters equal to their default values.

Based on your analysis, answer the following questions:

Select all valid statements.

Question 3: My best tree

Test question [80 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered question 1 correctly.

Sklearn's CART implementation contains other pruning mechanisms. For this question, your goal is to train a decision tree classifier on the training set such that it provides the highest classification accuracy on the test set.

Your score for this question will be proportional to the accuracy obtained on the test set.

How to proceed (do this offline, on your machine):

  • Study the combined effect of adapting the following meta-parameters: min_impurity_decrease, min_samples_split, max_depth, min_samples_leaf and try different combinations of these meta-parameters.
  • Leave all other meta-parameters set to their default values. In particular, to be consistent with the original CART algorithm, it is important to stick to criterion='gini', splitter='best', max_features=None.
  • Remember that random_state is not a learning meta-parameter and thus should not be fine-tuned to the data (as described in the introduction task). This parameter can however have an effect on the tree produced (in case of ties, when several features offer the same drop in impurity at a given node). For each combination of actual meta-parameter values (among the 4 listed above), you should therefore repeat the training over 10 different runs (set random_state=i with i in range(10)). The quality of each combination of meta-parameter values must be evaluated as the median test accuracy obtained over these 10 runs.

Once you think that you have found good meta-parameter values (among the 4 listed above) for learning a decision tree, assign a classifier built with these meta-parameter values to the my_best_tree variable. For this specific step, fix random_state=0. After this variable assignment, you should not fit the decision tree to the training data here: this will be done automatically by the grading procedure.
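The final assignment then reduces to constructing an unfitted classifier. The meta-parameter values below are hypothetical placeholders; substitute the combination found by your own search:

```python
from sklearn.tree import DecisionTreeClassifier

# Placeholder values: replace with your own best combination.
# Note: do NOT call .fit() here; the grader fits the tree itself.
my_best_tree = DecisionTreeClassifier(
    min_impurity_decrease=0.001,
    min_samples_split=2,
    max_depth=None,
    min_samples_leaf=1,
    random_state=0,   # required for this step
)
```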

To evaluate the model, my_best_tree will be trained using the entire train set and the accuracy will be computed on the entire test set.