
Information

Author(s): Pierre Dupont
Deadline: 23/03/2025 23:00:00
Submission limit: No limitation


A3.2 - Performance Assessment: theory

This task will be graded after the deadline


Question 1: Comparing two models

A random forest correctly classifies 122 out of 140 test examples.

Compute the 95 % confidence interval for the accuracy of this model.

When rounding, give at least 3 decimals. Example of correctly formatted answer: 0.707, 0.867.
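One possible way to approach this (a minimal sketch; whether the expected answer uses this normal-approximation interval, a Wilson interval, or an exact binomial interval is an assumption) is the interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ with $z \approx 1.96$ for 95 % coverage. The helper name wald_ci is hypothetical:

```python
from math import sqrt

def wald_ci(correct, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = correct / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Random forest: 122 correct classifications out of 140 test examples.
low, high = wald_ci(122, 140)
print(f"{low:.3f}, {high:.3f}")
```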

Question 2: Comparing two models (continued)

An SVM with an RBF kernel correctly classifies 109 out of 140 independent test examples.

Compute the 95 % confidence interval for the accuracy of this model.

When rounding, give at least 3 decimals. Example of correctly formatted answer: 0.707, 0.867.
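The hypothetical wald_ci sketch given under Question 1 applies here unchanged, e.g. wald_ci(109, 140), with the same caveat about which type of interval is expected.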

Question 3: Comparing two models (continued)

Given your answers to the previous questions, can you conclude that the RF model is better than the SVM model?

Beware: you will only receive credit for this question if you answered the two previous ones correctly.

Question 4: Comparing two models (continued)

We will use a statistical test to decide whether the performances of the two models differ.

Since the sample sizes here are relatively small, we can use a χ² test to compare the proportions (remark: a Fisher exact test is a common alternative that would likely give similar numerical results).

What is the p-value of the χ² test on these proportions?

When rounding, give at least 3 decimals.
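A minimal sketch of one way to run this test in Python, assuming the correct/misclassified counts of the two models form a 2×2 contingency table. Note that scipy's chi2_contingency applies Yates' continuity correction to 2×2 tables by default (pass correction=False for the uncorrected statistic); which variant the expected answer uses is an assumption:

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table: rows = models, columns = (correct, misclassified)
table = [[122, 140 - 122],   # random forest
         [109, 140 - 109]]   # SVM with RBF kernel

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value: {p_value:.3f}")
```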

Question 5: Comparing two models (continued)

Can you conclude from this test that there is a statistically significant difference between our two models?

Beware: you will only receive credit for this question if you answered the previous one correctly.

Question 6: Comparing two models (continued)

What is the minimal number of test examples per model (instead of the 140 examples originally used) that would be needed in the previous test to obtain a p-value below 1%, assuming the test classification rates do not change (87.14% versus 77.86%)?

Please report the number of examples needed for each model (not twice this number).
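One brute-force way to explore this (a sketch only) is to scale the 2×2 table while holding the classification rates fixed and to increase n until the p-value drops below 1%. This keeps scipy's default continuity correction and allows fractional cell counts; whether the intended solution rounds the counts to integers or uses the uncorrected statistic is an assumption:

```python
from scipy.stats import chi2_contingency

rate_rf, rate_svm = 0.8714, 0.7786   # classification rates assumed unchanged

def p_value(n):
    """p-value of the chi-squared test when each model is evaluated on n examples."""
    table = [[rate_rf * n, (1 - rate_rf) * n],
             [rate_svm * n, (1 - rate_svm) * n]]
    return chi2_contingency(table)[1]

n = 10
while p_value(n) >= 0.01:
    n += 1
print(n)
```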

Question 7: An evaluation protocol

A well-informed data analyst observes that a machine learning package is apparently buggy because it produces models that, once tested on independent test examples, have classification accuracies distributed uniformly in the interval [0%, 100%].

To support his hypothesis, the data analyst implements the following protocol. He repeatedly learns models with the package under study and observes their accuracies on independent test samples.

More specifically, he reproduces this experiment (learn and test) over several independent training/test sets and reports, as a quality measure, the average of the test accuracies observed on k such test sets. Since his results could depend on the particular test sets, he repeats the whole protocol over 100 distinct runs. He then plots the distribution of this quality measure over the 100 runs and monitors how this distribution evolves as a function of the number k = 2, 5, 10, 20, 50, 100 of test sets considered.

From what you know about the problem at hand, how do you expect the distribution of the quality measure to behave as a function of k?

Select all valid answers.
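For reference, if each test accuracy $A_i$ really were uniform on $[0,1]$, the quality measure would be the average $\bar{A}_k = \frac{1}{k}\sum_{i=1}^{k} A_i$, and the standard facts about such an average are:

$$\mathbb{E}[\bar{A}_k] = \frac{1}{2}, \qquad \mathrm{Var}(\bar{A}_k) = \frac{\mathrm{Var}(A_i)}{k} = \frac{1}{12k},$$

with the distribution of $\bar{A}_k$ approaching a normal shape as $k$ grows, by the central limit theorem.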

Question 8: An evaluation protocol (continued)

Numerically simulate the data analyst's protocol and check whether these simulations match the results expected from your theoretical analysis of the problem. Some plots should be helpful!

Select all valid answers.
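A minimal sketch of one possible simulation, assuming (as in the analyst's hypothesis) that individual test accuracies are drawn uniformly from $[0,1]$; the learn-and-test step is abstracted into a single uniform draw since only the accuracies matter here:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_runs = 100

fig, axes = plt.subplots(1, 6, figsize=(18, 3), sharey=True)
for ax, k in zip(axes, [2, 5, 10, 20, 50, 100]):
    # One run = the average of k test accuracies drawn uniformly in [0, 1];
    # repeat for 100 independent runs and plot the resulting distribution.
    quality = rng.uniform(0.0, 1.0, size=(n_runs, k)).mean(axis=1)
    ax.hist(quality, bins=20, range=(0, 1))
    ax.set_title(f"k = {k}")
plt.tight_layout()
plt.show()
```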

Question 9: An evaluation protocol (continued)

What would differ in the previous analysis if the test classification accuracies were distributed non-uniformly, while keeping the same mean and variance?

Select all valid answers.
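To experiment with this, one hypothetical non-uniform choice with the same mean (1/2) and variance (1/12) as the uniform distribution on $[0,1]$ is a two-point distribution; swapping it into the sampling step of the sketch above is enough:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-point distribution: probability 1/2 on each of
# 0.5 - sqrt(1/12) and 0.5 + sqrt(1/12), so mean = 1/2 and variance = 1/12.
values = np.array([0.5 - np.sqrt(1 / 12), 0.5 + np.sqrt(1 / 12)])
accuracies = rng.choice(values, size=(100, 20))   # 100 runs, k = 20 test sets
quality = accuracies.mean(axis=1)
print(quality.mean(), quality.var())
```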

Question 10: An evaluation protocol (continued)

Based on your answer to the previous question, can you conclude that such a protocol is adequate to assess how the individual test accuracies are distributed?

Beware: you will only receive credit for this question if you answered the previous one correctly.