Information

Author(s): Pierre Dupont
Deadline: 22/03/2026 23:00:00
Submission limit: no limit

A3.2 - Performance Assessment: theory

This task will be graded after the deadline


Question 1: Comparing two models

A random forest classifies correctly 134 out of 180 test examples.

Compute the 95 % confidence interval (using the Normal approximation) for the accuracy of this model.

When rounding, give at least 3 decimals. Example of correctly formatted answer: 0.707, 0.867
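As a reminder of the mechanics (a sketch, not the graded answer): the Normal-approximation (Wald) interval for an observed accuracy \(\hat{p} = k/n\) is \(\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}\), with \(z \approx 1.96\) at the 95% level. A minimal sketch:

```python
import math

def wald_ci(k, n, z=1.96):
    """95% Normal-approximation (Wald) confidence interval
    for a proportion observed as k successes out of n trials."""
    p_hat = k / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# e.g. for a model classifying 134 out of 180 test examples correctly:
low, high = wald_ci(134, 180)
print(f"{low:.3f}, {high:.3f}")
```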

Question 2: Comparing two models (continued)

An SVM with an RBF kernel classifies correctly 104 out of 160 independent test examples.

Compute the 95 % confidence interval (using the Normal approximation) for the accuracy of this model.

When rounding, give at least 3 decimals. Example of correctly formatted answer: 0.707, 0.867

Question 3: Comparing two models (continued)

Given your answers to the previous questions, can you conclude that the RF model is better than the SVM model?

Beware: you will only receive credit for this question if you answered the two previous ones correctly.

Question 4: Comparing two models (continued)

We will use a statistical test to decide whether the performances of the two models differ.

Since the sample sizes here are relatively small, we can use a \(\chi^2\) test to compare the proportions (note: a Fisher exact test is a common alternative that would likely give similar numerical results).

What is the \(p\)-value of the \(\chi^2\) test on these proportions?

When rounding, give at least 3 decimals.
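A sketch of how such a test can be computed with only the standard library (note: whether to apply Yates' continuity correction is a choice — SciPy's `chi2_contingency` applies it to 2×2 tables by default, which changes the p-value slightly; the version below is uncorrected by default, so match the convention your course expects):

```python
import math

def chi2_2x2_pvalue(a, b, c, d, yates=False):
    """Pearson chi-square test (1 df) on the 2x2 table [[a, b], [c, d]].
    The p-value uses the chi2 (1 df) survival function
    P(X > x) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # expected counts under the null hypothesis of equal proportions
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    corr = 0.5 if yates else 0.0
    chi2 = sum((abs(o - e) - corr) ** 2 / e
               for o, e in zip(observed, expected))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# rows: (correct, incorrect) for each model
chi2, p = chi2_2x2_pvalue(134, 46, 104, 56)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```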

Question 5: Comparing two models (continued)

Can you conclude from this test that there is a statistically significant difference between our two models?

Beware: you will only receive credit for this question if you answered the previous one correctly.

Question 6: Comparing two models (continued)

What is the minimal number of test examples for each model (instead of the 180 and 160 examples originally used) that would be needed in the previous statistical test to obtain a \(p\)-value as close as possible to \(0.01\), but \(< 0.01\)? To address this question, we further assume that both test classification rates, 74% and 65% respectively, stay unchanged, or change by at most 1% in absolute terms due to a possible rounding effect (since the numbers of examples must be integer values).

Report the integer confusion matrix entries for each model. Write your answer in this format: n1_positive, n2_positive, n1_negative, n2_negative where n1_positive is the number of correctly classified examples for model 1, n2_positive is the number of correctly classified examples for model 2, n1_negative is the number of incorrectly classified examples for model 1 and n2_negative is the number of incorrectly classified examples for model 2. These 4 values are necessarily positive integers.
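One way to explore this numerically (a sketch, not the graded answer): fix the two accuracy rates, scan candidate test-set sizes, round the counts of correct answers to integers, and recompute the test's p-value at each size. The sketch below assumes a common size \(n\) for both models and an uncorrected chi-square statistic — both are assumptions you must match to the question's actual wording:

```python
import math

def chi2_pvalue(n1_pos, n1_neg, n2_pos, n2_neg):
    """Uncorrected Pearson chi-square p-value (1 df) on the 2x2 table."""
    obs = [n1_pos, n1_neg, n2_pos, n2_neg]
    n = sum(obs)
    r1, r2 = n1_pos + n1_neg, n2_pos + n2_neg
    c1, c2 = n1_pos + n2_pos, n1_neg + n2_neg
    exp = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return math.erfc(math.sqrt(chi2 / 2))

# assumption: both models are evaluated on a common test-set size n,
# with accuracies fixed at 74% and 65% and counts rounded to integers
for n in range(50, 1000):
    k1, k2 = round(0.74 * n), round(0.65 * n)
    p = chi2_pvalue(k1, n - k1, k2, n - k2)
    if p < 0.01:
        print(n, k1, k2, f"p = {p:.4f}")
        break
```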

Question 7: An evaluation protocol

A well-informed data analyst observes that a machine learning package is apparently buggy, as it produces models that, once tested on independent test examples, have classification accuracies distributed uniformly in the interval \([0\%, 100 \%]\).

To support his hypothesis, the data analyst implements the following protocol. He repeatedly trains models with the package under study and observes their accuracies on independent test samples.

More specifically, he reproduces this experiment (train and test) over several independent training/test sets and reports, as quality measure, the average of the test accuracies observed on \(k\) such test sets. As his results could depend on the particular test sets, he repeats the whole protocol over 100 distinct runs. He then plots the distribution of this quality measure over the 100 runs and monitors how this distribution evolves as a function of the number \(k=2, 5, 10, 20, 50, 100\) of test sets considered.

From what you know about the problem at hand, how do you expect the distribution of the quality measure to behave as a function of \(k\)?

Select all valid answers.

Question 8: An evaluation protocol (continued)

Simulate the data analyst's protocol numerically and check whether the simulations match the results expected from your theoretical analysis of the problem. Some plots should be helpful!

Select all valid answers.
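A minimal sketch of such a simulation, assuming (as stated in the protocol) that the individual test accuracies are i.i.d. Uniform(0, 1): for each \(k\), each of the 100 runs averages \(k\) simulated accuracies. By the central limit theorem this average should concentrate around 0.5 with standard deviation \(\sqrt{1/(12k)}\):

```python
import math
import random
import statistics

random.seed(0)

def simulate(k, runs=100):
    """Quality measure per run: mean of k accuracies ~ Uniform(0, 1)."""
    return [statistics.mean(random.random() for _ in range(k))
            for _ in range(runs)]

spreads = {}
for k in (2, 5, 10, 20, 50, 100):
    measures = simulate(k)
    spreads[k] = statistics.stdev(measures)
    theory = math.sqrt(1 / (12 * k))
    print(f"k={k:3d}: empirical sd={spreads[k]:.4f}, CLT sd={theory:.4f}")
```

Plotting a histogram of `measures` for each `k` (e.g. with matplotlib) would make the narrowing of the distribution visually obvious.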

Question 9: An evaluation protocol (continued)

What would differ in the previous analysis if the classification test accuracies were distributed non-uniformly, while keeping the same mean and variance?

Select all valid answers.
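One way to probe this question numerically (a sketch under an assumed alternative distribution): compare the uniform case against a symmetric two-point distribution taking the values \(0.5 \pm \sqrt{1/12}\), which has the same mean (0.5) and variance (1/12). By the central limit theorem, the averages over \(k\) test sets should behave increasingly alike as \(k\) grows:

```python
import math
import random
import statistics

random.seed(1)
# two-point values 0.5 +/- sqrt(1/12): mean 0.5, variance 1/12
HALF_WIDTH = math.sqrt(1 / 12)

def mean_of_k(draw, k, runs=2000):
    """Distribution of the quality measure: mean of k draws, over many runs."""
    return [statistics.mean(draw() for _ in range(k)) for _ in range(runs)]

k = 50
uniform_means = mean_of_k(random.random, k)
twopoint_means = mean_of_k(lambda: 0.5 + random.choice((-1, 1)) * HALF_WIDTH, k)

sd_uniform = statistics.stdev(uniform_means)
sd_twopoint = statistics.stdev(twopoint_means)
print(f"uniform : sd = {sd_uniform:.4f}")
print(f"twopoint: sd = {sd_twopoint:.4f}")
```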

Question 10: An evaluation protocol (continued)

Based on your answer to the previous question, can you conclude that such a protocol is adequate to assess how the individual test accuracies are distributed?

Beware: you will only receive credit for this question if you answered the previous one correctly.

Question 11: Public Resources
If you used public resources to answer some of the questions above, please quote all used resources below.

If such a resource is a generative AI, please share link(s) to the chat(s) (e.g. https://chatgpt.com/share/698dac4c-f32c-800e-af13-f9ee2ebfdb0b).
You can generate such link(s) by clicking on "share" or "partager".

Otherwise, specify None in the field below.