Information

Author(s): Benoît Ronval, Pierre Dupont
Deadline: 22/03/2026 23:00:00
Submission limit: No limit

A3.3 - Performance Assessment: practice


Figure: https://inginious.info.ucl.ac.be/course/LINFO2262/A3-3/HeartFailure.jpg

You are a data analyst in charge of predicting the occurrence of Heart Failure from clinical and patient status variables:

  • The original dataset consists of 111 input variables and a binary target class. It has been partitioned for you into two files, HeartFailure_train.csv (800 examples) and HeartFailure_test.csv (305 examples), used respectively for training and testing your models.
  • The classification task consists of building a decision tree from the training set to predict the HeartFailure status (either \(0\) or \(1\)) on the test examples.

You can load the datasets with

import pandas as pd

# Load the provided train/test splits.
train_df = pd.read_csv("HeartFailure_train.csv")
test_df = pd.read_csv("HeartFailure_test.csv")

We consider here a Decision Tree Learning algorithm as implemented in sklearn.tree.DecisionTreeClassifier. More specifically, we want to study how the classification results fluctuate depending on the specific training or test set considered. For this, the scipy.stats package may be a useful tool.

For this task, we consider the Balanced Classification Rate (BCR) (see the course slides) as performance metric. It is particularly relevant for such a medical application because the heart failure "positive" examples form a minority class: BCR treats these "positive" examples as being as important as the negative ones (no heart failure), which form the majority class.

When learning a DecisionTreeClassifier for this task, you must always use the following meta-parameters: criterion = 'entropy', min_impurity_decrease = 0.01, max_depth = 10, class_weight = 'balanced'. The last one balances the importance of the various classes when learning the tree, in the same spirit as BCR does when evaluating the performance of the learned tree. All other meta-parameters should always be left to their default values, except for random_state as described below.
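
As an illustration, the required instantiation could look as follows (a minimal sketch; the variable name tree is ours):

from sklearn.tree import DecisionTreeClassifier

# Meta-parameters required for every tree in this assignment;
# random_state is set as prescribed by each question (0 here).
tree = DecisionTreeClassifier(
    criterion='entropy',
    min_impurity_decrease=0.01,
    max_depth=10,
    class_weight='balanced',
    random_state=0,
)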


Question 1: Test accuracy

Consider a training set made of 90% of the total training set size, sampled uniformly at random from train_df without replacement.

Select one test set containing 80 distinct examples by sampling these examples uniformly at random from test_df, and report the test BCR of the decision tree.

You should assume that the data frames train_df and test_df already exist in memory and you should not reload them.

When using pandas' sample method and sklearn's DecisionTreeClassifier, use random_state=0 to reproduce the precomputed results used for grading.

Assign the test performance to the float variable test_bcr.
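
A minimal sketch of this question, assuming the binary target column is named HeartFailure (adapt if your CSV uses a different name) and reusing the tree instantiated above:

from sklearn.metrics import balanced_accuracy_score

# 90% of the training examples, sampled uniformly without replacement.
train_sample = train_df.sample(frac=0.9, random_state=0)

# 80 distinct test examples (pandas samples without replacement by default).
test_sample = test_df.sample(n=80, random_state=0)

# Assumption: the target column is named 'HeartFailure'.
X_train = train_sample.drop(columns=['HeartFailure'])
y_train = train_sample['HeartFailure']
X_test = test_sample.drop(columns=['HeartFailure'])
y_test = test_sample['HeartFailure']

tree.fit(X_train, y_train)

# balanced_accuracy_score is sklearn's implementation of the BCR
# (the average of the recalls obtained on each class).
test_bcr = float(balanced_accuracy_score(y_test, tree.predict(X_test)))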

Question 2: Confidence interval

Using the result of the previous question, compute the 95% confidence interval for the BCR obtained with the decision tree on this test set.

BCR is the arithmetic average of two proportions, \(\hat{p}_1\) (the True Positive Rate) and \(\hat{p}_2\) (the True Negative Rate). The estimated BCR is therefore \(\frac{1}{2}(\hat{p}_1 + \hat{p}_2)\).
The variance of this estimator can be approximated by \(\frac{1}{4}\left(\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}\right)\), where \(n_1\) and \(n_2\) are the actual numbers of positive and negative examples in the test set considered.

You must use the above properties to compute a confidence interval for the BCR through a Normal approximation.

Note also that BCR is sometimes corrected for chance (i.e. for the probability of getting the correct class by guessing uniformly at random), but we ignore this correction in this task.
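
A minimal sketch of the computation, reusing y_test, X_test and tree from the sketch of question 1 (the intermediate names p1_hat, p2_hat, etc. are ours; the bound names follow the answer format below):

import numpy as np
from scipy.stats import norm

y_pred = tree.predict(X_test)

# n1/n2: numbers of positive/negative examples in the sampled test set.
n1 = int((y_test == 1).sum())
n2 = int((y_test == 0).sum())

p1_hat = ((y_pred == 1) & (y_test == 1)).sum() / n1  # True Positive Rate
p2_hat = ((y_pred == 0) & (y_test == 0)).sum() / n2  # True Negative Rate

bcr_hat = 0.5 * (p1_hat + p2_hat)
var_hat = 0.25 * (p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

# 95% two-sided interval under the Normal approximation.
z = norm.ppf(0.975)  # about 1.96
CI_lower_bound = bcr_hat - z * np.sqrt(var_hat)
CI_upper_bound = bcr_hat + z * np.sqrt(var_hat)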

Write your answer following this format: CI_lower_bound, CI_upper_bound

Use the decimal representation (e.g. use 0.426 and not 42.6%). When rounding, give at least 3 decimals.

Question 3: Resampling data

Test question [25 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 and 2 correctly.

Build 200 different test sets using the same procedure as in question 1: sample 80 distinct test examples uniformly at random from test_df, and repeat this random sampling 200 times to form 200 test folds. Since each fold contains 80 test examples and test_df does not include \(200 \times 80\) examples, the various test sets are expected to overlap partially.

Use a specific seed for each resampling of the test set: to obtain the i-th test fold (starting at 0), use pandas' sample method with the argument random_state=i.

Consider the same decision tree as in question 1 (i.e. with all specified meta-parameters and random_state=0, trained on 90% of the train set), and compute the mean of the balanced classification rates obtained with this model over the 200 different test folds.

You must reuse the tree you instantiated in question 1 (do not re-implement or re-train it): INGInious will actually run your code submitted to question 1 right before your code submitted to this question.

You should assume that the data frames train_df and test_df already exist in memory and you should not reload them.

Assign the balanced accuracy of each individual test fold to a list of float values called test_bcrs (e.g. [0.91, 0.89, ...]). The list should thus contain 200 values.

Assign the mean test BCR to the float variable mean_test_bcr.
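
A minimal sketch, reusing the fitted tree from question 1 and the same HeartFailure column-name assumption:

import numpy as np
from sklearn.metrics import balanced_accuracy_score

test_bcrs = []
for i in range(200):
    # i-th test fold: 80 distinct examples, seeded with random_state=i.
    fold = test_df.sample(n=80, random_state=i)
    X_fold = fold.drop(columns=['HeartFailure'])
    y_fold = fold['HeartFailure']
    # Reuse the tree fitted in question 1; do not re-train it.
    test_bcrs.append(float(balanced_accuracy_score(y_fold, tree.predict(X_fold))))

mean_test_bcr = float(np.mean(test_bcrs))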

Question 4: Resampling data (continued)

Test question [25 points]

Compute the observed lower and upper bounds for the results obtained in the previous question. The observed lower (respectively upper) bound is such that 2.5% of the 200 balanced classification rates are below it (respectively above it). In other words, these observed bounds are the 2.5 and 97.5 percentiles of the 200 balanced classification rates. Compare the lower and upper bounds of the confidence interval computed in question 2 with your observed bounds.
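
A minimal sketch, reusing the test_bcrs list from question 3 (the bound names follow the answer format below):

import numpy as np

# Observed bounds: 2.5 and 97.5 percentiles of the 200 fold BCRs.
observed_lower_bound = float(np.percentile(test_bcrs, 2.5))
observed_upper_bound = float(np.percentile(test_bcrs, 97.5))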

Write your answer following this format: observed_lower_bound, observed_upper_bound

Use the decimal representation (e.g. use 0.426 and not 42.6%). When rounding, give at least 3 decimals.

Question 5: Repeating the experiment

Test question [25 points]

The above sub-questions refer to the comparison between a confidence interval computed on a single test set and the variability observed across 200 test folds.

Could your conclusion be affected by the random sampling of the initial single test set and/or the initial training set? If you repeat the whole protocol (i.e. from question 1 up to question 4 included) 100 times, does it change your conclusion?

Summarize your results in a data frame named frame, with one row for each of the 100 experiments (the row names should be 0, 1, ..., 99). For each experiment, give the results according to the following column names: indiv_test_bcr, CI_lower_bound, CI_upper_bound, mean_test_bcr, observed_lower_bound and observed_upper_bound.

You should assume that the data frames train_df and test_df already exist in memory and you should not reload them.

To reproduce the results used for grading, you must use specific seeds for each random event:

  • To compute the i th row of the data frame (starting at zero), use random_state=i when sampling the train data and the single test set (to compute indiv_test_bcr, CI_lower_bound and CI_upper_bound like in Q1).
  • To compute mean_test_bcr, observed_lower_bound and observed_upper_bound (like in Q3), use random_state=(i+1)*j when sampling the 200 test folds with \(j\in[0, 199]\) the index of the test fold.
  • Recall that you must use the exact same meta-parameters for the instantiation of the decision trees as in Q1, including random_state=0. A minimal sketch of the whole protocol is given below.
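
Under the same assumption as before (target column named HeartFailure), the whole protocol might be sketched as follows:

import numpy as np
import pandas as pd
from scipy.stats import norm
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score

TARGET = 'HeartFailure'  # assumption: name of the binary target column

rows = []
for i in range(100):
    # Q1 step: resample the train set and the single test set with random_state=i.
    train_sample = train_df.sample(frac=0.9, random_state=i)
    test_sample = test_df.sample(n=80, random_state=i)

    tree_i = DecisionTreeClassifier(
        criterion='entropy', min_impurity_decrease=0.01,
        max_depth=10, class_weight='balanced', random_state=0,
    ).fit(train_sample.drop(columns=[TARGET]), train_sample[TARGET])

    y_true = test_sample[TARGET]
    y_pred = tree_i.predict(test_sample.drop(columns=[TARGET]))
    indiv_test_bcr = balanced_accuracy_score(y_true, y_pred)

    # Q2 step: Normal-approximation CI from the two per-class proportions.
    n1 = int((y_true == 1).sum())
    n2 = int((y_true == 0).sum())
    p1 = ((y_pred == 1) & (y_true == 1)).sum() / n1
    p2 = ((y_pred == 0) & (y_true == 0)).sum() / n2
    half = norm.ppf(0.975) * np.sqrt(0.25 * (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2))

    # Q3/Q4 steps: 200 test folds, the j-th one seeded with random_state=(i+1)*j.
    bcrs = []
    for j in range(200):
        fold = test_df.sample(n=80, random_state=(i + 1) * j)
        bcrs.append(balanced_accuracy_score(
            fold[TARGET], tree_i.predict(fold.drop(columns=[TARGET]))))

    rows.append({
        'indiv_test_bcr': indiv_test_bcr,
        'CI_lower_bound': indiv_test_bcr - half,
        'CI_upper_bound': indiv_test_bcr + half,
        'mean_test_bcr': np.mean(bcrs),
        'observed_lower_bound': np.percentile(bcrs, 2.5),
        'observed_upper_bound': np.percentile(bcrs, 97.5),
    })

frame = pd.DataFrame(rows)  # default integer index gives row names 0, 1, ..., 99
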
Question 6: Conclusion

Test question [25 points]

Based on the results you have obtained, select all valid sentences.