Information

Author(s): Pierre Dupont, Benoit Ronval
Deadline: 23/02/2025 23:00:00
Submission limit: No submission limit


A1.3 - Decision Tree Learning: the impact of pruning

In this task, we will evaluate how pruning the decision tree can affect the classification performance.


Question 1: Pre-pruning the tree

Among other possible pre-pruning parameters, sklearn's CART implementation has a meta-parameter that prevents a branch from growing if the decrease in impurity is too small.

Study (offline, on your machine) how the test classification accuracy evolves when changing the min_impurity_decrease meta-parameter. Leave all other meta-parameters equal to their default values but, for reproducibility on Inginious, also set random_state=0. Use the entire training set to learn your models, and evaluate the training and testing accuracy on the entire train/test sets.

Once you have trained different models and analyzed the effect of this meta-parameter in terms of test classification performance, choose 3 of them in order to illustrate the impact: a model that performs best, one that performs worst, and one in between. Implement these 3 models in the code box below. Report for each one the number of nodes in the learned tree, as well as the training and testing accuracies.

Store these values in a pandas DataFrame named frame with the following column names: min_impurity_decrease, NodeCount, TrainAcc, TestAcc; where each row represents the result of a particular value for min_impurity_decrease.
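The steps above can be sketched as follows. The dataset here is a synthetic placeholder (the assignment's own train/test sets should be used instead), and the three min_impurity_decrease values are illustrative, not the best/worst/in-between values you are asked to find offline:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; replace with the assignment's actual train/test sets.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rows = []
for mid in [0.0, 0.01, 0.1]:  # illustrative values only
    clf = DecisionTreeClassifier(min_impurity_decrease=mid, random_state=0)
    clf.fit(X_train, y_train)
    rows.append({
        "min_impurity_decrease": mid,
        "NodeCount": clf.tree_.node_count,   # number of nodes in the tree
        "TrainAcc": clf.score(X_train, y_train),
        "TestAcc": clf.score(X_test, y_test),
    })

frame = pd.DataFrame(rows, columns=["min_impurity_decrease", "NodeCount",
                                    "TrainAcc", "TestAcc"])
print(frame)
```

Larger min_impurity_decrease values cut splits earlier, so NodeCount shrinks as the value grows; the unpruned tree (0.0) typically fits the training set perfectly.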

Question 2: Learning curves

Test question [20 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered question 1 correctly.

Perform the same experiment as in the previous question, but use different training sizes: 5%, 10%, 20%, 50%, 99% (similarly to what has been done in the previous task, perform 100 independent runs for each size to reduce the variance of the result). Create comparative plots of the learning curves and tree sizes without pruning and with pruning, using several values of min_impurity_decrease. Leave all other meta-parameters equal to their default values.

Based on your analysis, answer the following questions:

Select all valid statements.

Question 3: My best tree

Test question [80 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered question 1 correctly.

Sklearn's CART implementation contains other pruning mechanisms. For this question, your goal is to train a decision tree classifier on the training set such that it provides the highest classification accuracy on the test set.

Your score for this question will be proportional to the accuracy obtained on the test set.

How to proceed (do this offline, on your machine):

  • Study the combined effect of adapting the following meta-parameters: min_impurity_decrease, min_samples_split, max_depth, min_samples_leaf and try different combinations of these meta-parameters.
  • Leave all other meta-parameters set to their default values. In particular, to be consistent with the original CART algorithm, it is important to stick to criterion='gini', splitter='best', max_features=None.
  • Remember that random_state is not a learning meta-parameter and thus should not be fine-tuned to the data (as described in the introduction task). This parameter can however have an effect on the tree produced (in case of ties, when several features offer the same drop in impurity at a given node). For each combination of actual meta-parameter values (among the 4 listed above), you should therefore repeat the training over 10 different runs (set random_state=i with i in range(10)). The quality of each combination of meta-parameter values must be evaluated as the median test accuracy obtained over these 10 runs.

Once you think that you have found good meta-parameter values (among the 4 listed above) for learning a decision tree, assign a classifier built with these meta-parameter values to the my_best_tree variable. For this specific step, fix random_state=0. After this variable assignment, you should not fit the decision tree to the training data here: this will be done automatically by the grading procedure.
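The final assignment then reduces to constructing an unfitted classifier. The meta-parameter values below are hypothetical placeholders; substitute the combination found by your own search:

```python
from sklearn.tree import DecisionTreeClassifier

# Placeholder values: replace with your own best combination.
# Note: do NOT call .fit() here; the grader fits the tree itself.
my_best_tree = DecisionTreeClassifier(
    min_impurity_decrease=0.001,
    min_samples_split=2,
    max_depth=None,
    min_samples_leaf=1,
    random_state=0,   # required for this step
)
```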

To evaluate the model, my_best_tree will be trained using the entire train set and the accuracy will be computed on the entire test set.