
Information

Authors: Pierre Dupont, Benoit Ronval
Deadline: 23/02/2025 23:00:00
Submission limit: No limitation


A1.4 - Decision Tree Learning: bagging and random forests

In this last task, we will investigate how combining multiple trees can affect the classification performance.


Question 1: A weak learner

Let us first analyze a so-called weak learner. Consider here a DecisionTreeClassifier estimated with all meta-parameters set to their default values, i.e. a decision tree learned without pruning.

In order to analyze the test performance of this model (and of all subsequent models in this task) in a robust fashion, the learning should be repeated over 30 runs. Concretely, this means that the only control parameter used when defining the model should be random_state=i, with i in range(30).

Report below the median test accuracy over these 30 runs of such unpruned trees learned on the entire training set.

When rounding, give at least 3 decimal places. Use the decimal representation (e.g. 0.942, not 94.2%).
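For illustration only, a minimal sketch of such an evaluation loop, assuming the splits are available as x_train, y_train, x_test, y_test (hypothetical names, not provided on this page):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

accuracies = []
for i in range(30):
    # Unpruned tree: all meta-parameters at their defaults, only the seed is set
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(x_train, y_train)
    accuracies.append(accuracy_score(y_test, tree.predict(x_test)))

print(round(np.median(accuracies), 3))  # median test accuracy over the 30 runs
```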

Question 2: Bagging

Graded question with feedback [20 points]

The word "bagging" is a contraction of two terms:

  • Bootstrapping: A bootstrap sample has the same size as the original set (i.e. the number of examples in the original training set) but, as it is drawn with replacement, it can contain replicated examples while some original examples may not be included.
  • Aggregating: Several bootstrap samples are drawn, and a decision tree is trained on each one. The result is an ensemble of models that can be used to classify the test set. For each test example, the predicted class is obtained by a majority vote among the specific predictions of all the models.

If bagging works, one would expect to get a better test accuracy (as compared to using a single unpruned decision tree, the so-called weak learner mentioned in the previous question).

We ask you to implement such a bagging classifier, namely two functions:

  • bagging_fit(x_train, y_train, nb_trees, i_run) which takes a training set (x_train is a DataFrame of input values and y_train their corresponding class labels) as input and returns a list containing nb_trees different instances of DecisionTreeClassifier.

    • To generate the bootstrap sample of index i (for i in range(nb_trees)), you must use the pandas sample method with random_state=i.
    • The DecisionTreeClassifier must be estimated without pruning and the only control parameter used when defining the model must be random_state=i_run. Therefore, this control parameter is set to the same value for all trees learned when estimating a bagging model, but will depend on a specific run index (i_run) whenever repeating, over several runs, the whole bagging estimation.
  • bagging_predict(tree_list, x_test) which takes the list of trees returned by bagging_fit and an unlabeled test set as inputs and returns a list with the corresponding class labels. (Note: in case of a tie between two labels during the majority voting, pick the first one according to the numerical or lexicographical order, depending on the type (numerical or string) of the class labels.)

Note that you should implement your own version of the bagging process and not use the one already implemented in the scikit-learn library, as it may give you different results (for your information, it performs a majority vote weighted by predicted probabilities, which is not what is asked here).
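As an illustration, a minimal sketch of one possible implementation is given below. It assumes x_train and y_train are a pandas DataFrame and Series sharing a unique, aligned index, and it breaks voting ties by sorting the labels:

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier


def bagging_fit(x_train, y_train, nb_trees, i_run):
    """Fit nb_trees unpruned trees, each on its own bootstrap sample."""
    trees = []
    for i in range(nb_trees):
        # Bootstrap sample of index i: same size as the training set, drawn with replacement
        x_boot = x_train.sample(n=len(x_train), replace=True, random_state=i)
        # Matching labels; assumes y_train shares x_train's (unique) index
        y_boot = y_train.loc[x_boot.index]
        tree = DecisionTreeClassifier(random_state=i_run)  # unpruned, same seed for all trees of a run
        tree.fit(x_boot, y_boot)
        trees.append(tree)
    return trees


def bagging_predict(tree_list, x_test):
    """Majority vote over the trees; ties broken by label order."""
    all_preds = [tree.predict(x_test) for tree in tree_list]
    predictions = []
    for j in range(len(x_test)):
        votes = Counter(preds[j] for preds in all_preds)
        # Most votes first; in case of a tie, the first label in sorted order wins
        best_label = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
        predictions.append(best_label)
    return predictions
```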

Please do not use Inginious as your debugger. First implement your functions offline, and upload them once you are confident they work correctly.

Question 3: Bagging (continued)

Test question [20 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 and 2 correctly.

Analyze the evolution of the test set classification accuracy on this task as a function of the number of trees used in the bagging (plotting a curve will be useful). For each value of nb_trees you choose, perform 30 different runs to obtain more robust results and evaluate the median test accuracy of the bagging models learned over these 30 runs.
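A possible structure for this analysis, reusing the bagging_fit and bagging_predict functions from the previous question (the list of ensemble sizes and the split names x_train, y_train, x_test, y_test are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

tree_counts = [1, 5, 10, 20, 50, 100]  # hypothetical choice of ensemble sizes
median_accuracies = []

for nb_trees in tree_counts:
    run_accuracies = []
    for i_run in range(30):
        trees = bagging_fit(x_train, y_train, nb_trees, i_run)
        y_pred = bagging_predict(trees, x_test)
        run_accuracies.append(accuracy_score(y_test, y_pred))
    median_accuracies.append(np.median(run_accuracies))

plt.plot(tree_counts, median_accuracies, marker="o")
plt.xlabel("Number of trees")
plt.ylabel("Median test accuracy (30 runs)")
plt.show()
```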

Select all valid sentences

Question 4: Random forest

Test question [40 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 and 2 correctly.

Random forests consist in bagging decision trees combined with a random attribute pre-selection: at each node of each tree, the attribute that maximizes the drop of impurity is chosen among a random subset of the attributes (by default, √d out of the d input variables are considered). Such a classifier is implemented in Python by sklearn.ensemble.RandomForestClassifier.

Tune the meta-parameters of a Random Forest (RF) to obtain the best possible test accuracy (do this offline, on your machine). Restrict your attention to the following meta-parameters: n_estimators, max_depth, min_samples_split, min_samples_leaf, min_impurity_decrease, and leave all other RF meta-parameters fixed to their default values. Note that, for now, we use the same test set to tune the meta-parameters and to evaluate the models; we will see better strategies later in the course.

Your score for this question will be proportional to the obtained median test accuracy over 30 runs. To specify the seed of a specific run, use the control parameter random_state=i with i in range(30).
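One way to organize this offline tuning is an exhaustive sweep over a small grid of candidate values, scoring each configuration by its median test accuracy over 30 runs. The candidate values below are placeholders, not recommended settings:

```python
import numpy as np
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "min_impurity_decrease": [0.0, 0.001],
}

best_params, best_median = None, -1.0
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    accuracies = []
    for i in range(30):  # 30 runs per configuration, seeded as required
        rf = RandomForestClassifier(random_state=i, **params)
        rf.fit(x_train, y_train)
        accuracies.append(accuracy_score(y_test, rf.predict(x_test)))
    median_acc = np.median(accuracies)
    if median_acc > best_median:
        best_params, best_median = params, median_acc

print(best_params, best_median)
```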

Encode the meta-parameters of the best random forest you found by assigning them to the variable my_best_forest. After this variable assignment, you should not fit the random forest to the training data here, as this will be done automatically by the grading procedure on the Inginious server.
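For illustration, such an assignment might look like the sketch below; the parameter values shown are placeholders, to be replaced by the best ones you found offline:

```python
from sklearn.ensemble import RandomForestClassifier

# Placeholder meta-parameter values: substitute the ones you tuned offline.
my_best_forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_impurity_decrease=0.0,
)
# Do not call my_best_forest.fit(...) here; fitting (and, presumably, setting
# random_state for each run) is handled by the grading procedure on the server.
```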

Question 5: Random forest (continued)

Test question [20 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 and 2 correctly.

Based on your previous experiments, what do you conclude?

Select all valid sentences