
Information

Authors: Pierre Dupont, Benoit Ronval
Deadline: 23/02/2025 23:00:00
Submission limit: No limitation


A1.4 - Decision Tree Learning: bagging and random forests

In this last task, we will investigate how combining multiple trees can affect the classification performance.


Question 1: A weak learner

Let us first analyze a so-called weak learner. Consider here a DecisionTreeClassifier estimated with all meta-parameters set to their default values, i.e. a decision tree learned without pruning.

In order to analyze the test performance of this model (and of all subsequent models in this task) in a robust fashion, the learning should be repeated over 30 runs. Concretely, this means that the only control parameter used when defining the model should be random_state=i, with i in range(30).

Report below the median test accuracy over these 30 runs of such unpruned trees learned on the entire training set.

When rounding, give at least 3 decimal places. Use the decimal representation (e.g. 0.942, not 94.2%).
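For illustration only, a minimal sketch of such an evaluation loop, assuming the splits are available as x_train, y_train, x_test, y_test (hypothetical names, not provided on this page):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

accuracies = []
for i in range(30):
    # Unpruned tree: all meta-parameters at their defaults, only the seed is set
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(x_train, y_train)
    accuracies.append(accuracy_score(y_test, tree.predict(x_test)))

print(round(np.median(accuracies), 3))  # median test accuracy over the 30 runs
```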

Question 2: Bagging

Graded question with feedback [20 points]

The word "bagging" is a contraction of two terms:

  • Bootstrapping: A bootstrap sample has the same size as the original set (i.e. the number of examples in the original training set) but, as it is drawn with replacement, it can contain replicated examples while some original examples may not be included.
  • Aggregating: Several bootstrap samples are drawn, and a decision tree is trained on each one. The result is an ensemble of models that can be used to classify the test set. For each test example, the predicted class is obtained by a majority vote among the specific predictions of all the models.

If bagging works, one would expect to get a better test accuracy (as compared to using a single unpruned decision tree, the so-called weak learner mentioned in the previous question).

We ask you to implement such a bagging classifier, namely two functions:

  • bagging_fit(x_train, y_train, nb_trees, i_run) which takes a training set (x_train is a DataFrame of input values and y_train their corresponding class labels) as input and returns a list containing nb_trees different instances of DecisionTreeClassifier.

    • To generate the bootstrap sample of index i (for i in range(nb_trees)), you must use the pandas sample method with random_state=i.
    • The DecisionTreeClassifier must be estimated without pruning and the only control parameter used when defining the model must be random_state=i_run. Therefore, this control parameter is set to the same value for all trees learned when estimating a bagging model, but will depend on a specific run index (i_run) whenever repeating, over several runs, the whole bagging estimation.
  • bagging_predict(tree_list, x_test) which takes the list of trees returned by bagging_fit and an unlabeled test set as inputs and returns a list with the corresponding class labels. (Note: in case of a tie between two labels during the majority voting, pick the first one according to the numerical or lexicographical order, depending on the type (numerical or string) of the class labels.)

Note that you should implement your own version of the bagging process and not use the one already implemented in the scikit-learn library, as it may give you different results (for your information, it performs a majority vote weighted by predicted probabilities, which is not what is asked here).
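As an illustration, a minimal sketch of one possible implementation is given below. It assumes x_train and y_train are a pandas DataFrame and Series sharing a unique, aligned index, and it breaks voting ties by sorting the labels:

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier


def bagging_fit(x_train, y_train, nb_trees, i_run):
    """Fit nb_trees unpruned trees, each on its own bootstrap sample."""
    trees = []
    for i in range(nb_trees):
        # Bootstrap sample of index i: same size as the training set, drawn with replacement
        x_boot = x_train.sample(n=len(x_train), replace=True, random_state=i)
        # Matching labels; assumes y_train shares x_train's (unique) index
        y_boot = y_train.loc[x_boot.index]
        tree = DecisionTreeClassifier(random_state=i_run)  # unpruned, same seed for all trees of a run
        tree.fit(x_boot, y_boot)
        trees.append(tree)
    return trees


def bagging_predict(tree_list, x_test):
    """Majority vote over the trees; ties broken by label order."""
    all_preds = [tree.predict(x_test) for tree in tree_list]
    predictions = []
    for j in range(len(x_test)):
        votes = Counter(preds[j] for preds in all_preds)
        # Most votes first; in case of a tie, the first label in sorted order wins
        best_label = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
        predictions.append(best_label)
    return predictions
```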

Please do not use Inginious as your debugger. First implement your functions offline, and upload them once you are confident they work correctly.

Question 3: Bagging (continued)

Test question [20 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 and 2 correctly.

Analyze the evolution of the test set classification accuracy on this task as a function of the number of trees used in the bagging (plotting a curve will be useful). For each value of nb_trees you choose, perform 30 different runs to obtain more robust results and evaluate the median test accuracy of the bagging models learned over these 30 runs.
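A possible structure for this analysis, reusing the bagging_fit and bagging_predict functions from the previous question (the list of ensemble sizes and the split names x_train, y_train, x_test, y_test are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

tree_counts = [1, 5, 10, 20, 50, 100]  # hypothetical choice of ensemble sizes
median_accuracies = []

for nb_trees in tree_counts:
    run_accuracies = []
    for i_run in range(30):
        trees = bagging_fit(x_train, y_train, nb_trees, i_run)
        y_pred = bagging_predict(trees, x_test)
        run_accuracies.append(accuracy_score(y_test, y_pred))
    median_accuracies.append(np.median(run_accuracies))

plt.plot(tree_counts, median_accuracies, marker="o")
plt.xlabel("Number of trees")
plt.ylabel("Median test accuracy (30 runs)")
plt.show()
```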

Select all valid sentences

Question 4: Random forest

Test question [40 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 and 2 correctly.

Random forests consist in bagging decision trees combined with a random attribute pre-selection: at each node of each tree, the attribute that maximizes the drop of impurity is chosen among a random subset of the attributes (by default, √d out of the d input variables are considered). Such a classifier is implemented in Python by sklearn.ensemble.RandomForestClassifier.

Tune the meta-parameters of a Random Forest (RF) to obtain the best possible test accuracy (do this offline, on your machine). Restrict your attention to the following meta-parameters: n_estimators, max_depth, min_samples_split, min_samples_leaf, min_impurity_decrease, and leave all other RF meta-parameters fixed to their default values. Note that, for now, we use the same test set to tune the meta-parameters and to evaluate the models; we will see better strategies later in the course.

Your score for this question will be proportional to the obtained median test accuracy over 30 runs. To specify the seed of a specific run, use the control parameter random_state=i with i in range(30).
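One way to organize this offline tuning is an exhaustive sweep over a small grid of candidate values, scoring each configuration by its median test accuracy over 30 runs. The candidate values below are placeholders, not recommended settings:

```python
import numpy as np
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "min_impurity_decrease": [0.0, 0.001],
}

best_params, best_median = None, -1.0
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    accuracies = []
    for i in range(30):  # 30 runs per configuration, seeded as required
        rf = RandomForestClassifier(random_state=i, **params)
        rf.fit(x_train, y_train)
        accuracies.append(accuracy_score(y_test, rf.predict(x_test)))
    median_acc = np.median(accuracies)
    if median_acc > best_median:
        best_params, best_median = params, median_acc

print(best_params, best_median)
```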

Encode the meta-parameters of the best random forest you found by assigning them to the variable my_best_forest. After this variable assignment, you should not fit the random forest to the training data here, as this will be done automatically by the grading procedure on the Inginious server.
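For illustration, such an assignment might look like the sketch below; the parameter values shown are placeholders, to be replaced by the best ones you found offline:

```python
from sklearn.ensemble import RandomForestClassifier

# Placeholder meta-parameter values: substitute the ones you tuned offline.
my_best_forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_impurity_decrease=0.0,
)
# Do not call my_best_forest.fit(...) here; fitting (and, presumably, setting
# random_state for each run) is handled by the grading procedure on the server.
```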

Question 5: Random forest (continued)

Test question [20 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 and 2 correctly.

Based on your previous experiments, what do you conclude?

Select all valid sentences