
Information

Author(s): Pierre Dupont
Deadline: 23/03/2025 23:00:00
Submission limit: No limitation


A3.2 - Performance Assessment: theory

This task will be graded after the deadline


Question 1: Comparing two models

A random forest correctly classifies 122 out of 140 test examples.

Compute the 95 % confidence interval for the accuracy of this model.

When rounding, give at least 3 decimals. Example of correctly formatted answer: 0.707, 0.867.
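One possible way to approach this (a minimal sketch; whether the expected answer uses this normal-approximation interval, a Wilson interval, or an exact binomial interval is an assumption) is the interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ with $z \approx 1.96$ for 95 % coverage. The helper name wald_ci is hypothetical:

```python
from math import sqrt

def wald_ci(correct, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = correct / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Random forest: 122 correct classifications out of 140 test examples.
low, high = wald_ci(122, 140)
print(f"{low:.3f}, {high:.3f}")
```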

Question 2: Comparing two models (continued)

An SVM with an RBF kernel correctly classifies 109 out of 140 independent test examples.

Compute the 95 % confidence interval for the accuracy of this model.

When rounding, give at least 3 decimals. Example of correctly formatted answer: 0.707, 0.867.
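The hypothetical wald_ci sketch given under Question 1 applies here unchanged, e.g. wald_ci(109, 140), with the same caveat about which type of interval is expected.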

Question 3: Comparing two models (continued)

Given your answers to the previous questions, can you conclude that the RF model is better than the SVM model?

Beware: you will only receive credit for this question if you answered the two previous ones correctly.

Question 4: Comparing two models (continued)

We will use a statistical test to decide whether the performances of the two models differ.

Since the sample sizes here are relatively small, we can use a χ² test to compare the proportions (remark: a Fisher exact test is a common alternative that would likely give similar numerical results).

What is the p-value of the χ² test on these proportions?

When rounding, give at least 3 decimals.
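A minimal sketch of one way to run this test in Python, assuming the correct/misclassified counts of the two models form a 2×2 contingency table. Note that scipy's chi2_contingency applies Yates' continuity correction to 2×2 tables by default (pass correction=False for the uncorrected statistic); which variant the expected answer uses is an assumption:

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table: rows = models, columns = (correct, misclassified)
table = [[122, 140 - 122],   # random forest
         [109, 140 - 109]]   # SVM with RBF kernel

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value: {p_value:.3f}")
```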

Question 5: Comparing two models (continued)

Can you conclude from this test that there is a statistically significant difference between our two models?

Beware: you will only receive credit for this question if you answered the previous one correctly.

Question 6: Comparing two models (continued)

What is the minimal number of test examples per model (instead of the 140 examples originally used) that would be needed in the previous test to obtain a p-value below 1%, assuming the test classification rates do not change (87.14% versus 77.86%)?

Please report the number of examples needed for each model (not twice this number).
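One brute-force way to explore this (a sketch only) is to scale the 2×2 table while holding the classification rates fixed and to increase n until the p-value drops below 1%. This keeps scipy's default continuity correction and allows fractional cell counts; whether the intended solution rounds the counts to integers or uses the uncorrected statistic is an assumption:

```python
from scipy.stats import chi2_contingency

rate_rf, rate_svm = 0.8714, 0.7786   # classification rates assumed unchanged

def p_value(n):
    """p-value of the chi-squared test when each model is evaluated on n examples."""
    table = [[rate_rf * n, (1 - rate_rf) * n],
             [rate_svm * n, (1 - rate_svm) * n]]
    return chi2_contingency(table)[1]

n = 10
while p_value(n) >= 0.01:
    n += 1
print(n)
```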

Question 7: An evaluation protocol

A well-informed data analyst observes that a machine learning package is apparently buggy because it produces models that, once tested on independent test examples, have classification accuracies distributed uniformly in the interval [0%, 100%].

To support his hypothesis, the data analyst implements the following protocol. He repeatedly learns models with the package under study and observes their accuracies on independent test samples.

More specifically, he reproduces this experiment (learn and test) over several independent training/test sets and reports, as a quality measure, the average of the test accuracies observed on k such test sets. Since his results could depend on the particular test sets, he repeats the whole protocol over 100 distinct runs. He then plots the distribution of this quality measure over the 100 runs and monitors how this distribution evolves as a function of the number k = 2, 5, 10, 20, 50, 100 of test sets considered.

From what you know about the problem at hand, how do you expect the distribution of the quality measure to behave as a function of k?

Select all valid answers.
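For reference, if each test accuracy $A_i$ really were uniform on $[0,1]$, the quality measure would be the average $\bar{A}_k = \frac{1}{k}\sum_{i=1}^{k} A_i$, and the standard facts about such an average are:

$$\mathbb{E}[\bar{A}_k] = \frac{1}{2}, \qquad \mathrm{Var}(\bar{A}_k) = \frac{\mathrm{Var}(A_i)}{k} = \frac{1}{12k},$$

with the distribution of $\bar{A}_k$ approaching a normal shape as $k$ grows, by the central limit theorem.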

Question 8: An evaluation protocol (continued)

Numerically simulate the data analyst's protocol and check whether these simulations match the results expected from your theoretical analysis of the problem. Some plots should be helpful!

Select all valid answers.
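A minimal sketch of one possible simulation, assuming (as in the analyst's hypothesis) that individual test accuracies are drawn uniformly from $[0,1]$; the learn-and-test step is abstracted into a single uniform draw since only the accuracies matter here:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_runs = 100

fig, axes = plt.subplots(1, 6, figsize=(18, 3), sharey=True)
for ax, k in zip(axes, [2, 5, 10, 20, 50, 100]):
    # One run = the average of k test accuracies drawn uniformly in [0, 1];
    # repeat for 100 independent runs and plot the resulting distribution.
    quality = rng.uniform(0.0, 1.0, size=(n_runs, k)).mean(axis=1)
    ax.hist(quality, bins=20, range=(0, 1))
    ax.set_title(f"k = {k}")
plt.tight_layout()
plt.show()
```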

Question 9: An evaluation protocol (continued)

What would differ in the previous analysis if the test classification accuracies were distributed non-uniformly, while keeping the same mean and variance?

Select all valid answers.
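To experiment with this, one hypothetical non-uniform choice with the same mean (1/2) and variance (1/12) as the uniform distribution on $[0,1]$ is a two-point distribution; swapping it into the sampling step of the sketch above is enough:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-point distribution: probability 1/2 on each of
# 0.5 - sqrt(1/12) and 0.5 + sqrt(1/12), so mean = 1/2 and variance = 1/12.
values = np.array([0.5 - np.sqrt(1 / 12), 0.5 + np.sqrt(1 / 12)])
accuracies = rng.choice(values, size=(100, 20))   # 100 runs, k = 20 test sets
quality = accuracies.mean(axis=1)
print(quality.mean(), quality.var())
```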

Question 10: An evaluation protocol (continued)

Based on your answer to the previous question, can you conclude that such a protocol is adequate to assess how the individual test accuracies are distributed?

Beware: you will only receive credit for this question if you answered the previous one correctly.