You are a data analyst in charge of predicting the occurrence of Heart Failure from clinical and patient status variables:
- The original dataset consists of 111 input variables and a binary target class. Here, this dataset has been partitioned for you into two files, HeartFailure_train.csv (800 examples) and HeartFailure_test.csv (305 examples), used respectively for training and testing your models.
- The classification task consists of building a decision tree from the training set to predict the HeartFailure status (either 0 or 1) on the test examples.
You can load the datasets with
import pandas as pd
train_df = pd.read_csv("HeartFailure_train.csv")
test_df = pd.read_csv("HeartFailure_test.csv")
We consider here a Decision Tree Learning algorithm as implemented in sklearn.tree.DecisionTreeClassifier. More specifically, we would like to study the possible fluctuations of the classification results according to the specific training or test set considered. For this, the scipy.stats package may be a useful tool.
For this task, we consider the Balanced Classification Rate (BCR) (see the course slides) as the performance metric. It is particularly relevant for such a medical application because the heart failure "positive" examples form a minority class, yet they are considered as important as the "negative" examples (no heart failure), which form the majority class.
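As a reminder, the BCR of a binary classifier is the average of its per-class recalls (sensitivity and specificity). In scikit-learn this corresponds to balanced_accuracy_score. A minimal sketch with hypothetical label vectors (not taken from the HeartFailure data):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical true labels and predictions, with a minority positive class
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 0])

# BCR = (recall on class 0 + recall on class 1) / 2
recall_neg = recall_score(y_true, y_pred, pos_label=0)  # 4/6
recall_pos = recall_score(y_true, y_pred, pos_label=1)  # 1/2
bcr = (recall_neg + recall_pos) / 2

# balanced_accuracy_score computes the same quantity
assert np.isclose(bcr, balanced_accuracy_score(y_true, y_pred))
```

Note that the plain accuracy here would be 5/8 = 0.625, driven mostly by the majority class, whereas the BCR of 7/12 ≈ 0.583 weighs both classes equally.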
When learning a DecisionTreeClassifier for this task, you must always use the following meta-parameters:
criterion = 'entropy', min_impurity_decrease = 0.01, max_depth = 10, class_weight = 'balanced'.
The last one balances the importance of the classes when learning the tree, in the same spirit as BCR does when evaluating the performance of the learned tree.
All other meta-parameters should always be left at their default values, except for random_state as described below.
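To illustrate the kind of fluctuation study intended, the sketch below trains the tree with the required meta-parameters over several values of random_state and summarizes the resulting BCRs with scipy.stats. It uses a synthetic imbalanced dataset from sklearn.datasets.make_classification (with the same 800/305 split sizes) as a stand-in for the HeartFailure files, so the numbers are illustrative only:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HeartFailure data: 111 features, minority positives
X, y = make_classification(n_samples=1105, n_features=111,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=800, stratify=y, random_state=0)

bcrs = []
for seed in range(10):  # vary only random_state; keep the required meta-parameters
    clf = DecisionTreeClassifier(criterion='entropy',
                                 min_impurity_decrease=0.01,
                                 max_depth=10,
                                 class_weight='balanced',
                                 random_state=seed)
    clf.fit(X_train, y_train)
    bcrs.append(balanced_accuracy_score(y_test, clf.predict(X_test)))

bcrs = np.array(bcrs)
# Summarize the fluctuation across seeds: mean BCR and a 95% Student-t interval
mean_bcr = bcrs.mean()
ci = stats.t.interval(0.95, df=len(bcrs) - 1,
                      loc=mean_bcr, scale=stats.sem(bcrs))
print(f"mean BCR = {mean_bcr:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

For the real task, replace the synthetic data with the train/test DataFrames loaded above. Note that random_state in DecisionTreeClassifier only breaks ties between equally good splits, so on a given dataset the spread across seeds may be small or even zero.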