Information

Author(s): Pierre Dupont, Benoit Ronval
Deadline: 08/03/2026 23:00:00
Submission limit: No limitation

A2.2 - Linear Discriminants and SVMs: in practice

Single-photon emission computed tomography (SPECT) is a nuclear medicine imaging technique. It builds 3-D images of targeted organs using a radioactive substance and a special gamma camera, making it possible to assess whether the imaged organs are functioning properly. These 3-D pictures are often represented as a set of 2-D slices:

https://inginious.info.ucl.ac.be/course/LINFO2262/A2-2/heartSPECT.jpeg

You are in charge of implementing machine learning models that analyze SPECT images of the heart and predict whether or not the heart is in good condition. Each patient is classified into one of two categories: normal and abnormal. The dataset consists of 417 SPECT image sets and was processed to extract 44 continuous features that summarize the original images. The dataset has been partitioned for you into two files, HeartTrain.csv and HeartTest.csv, used respectively for training and testing your models.

These .csv files should be read and loaded into data frames. The columns named F_1 to F_44 correspond to the numerical input features. The last column, named labels, represents the class label in binary encoding (0: abnormal, 1: normal).

This data can be loaded into DataFrames using Python's pandas library:

import pandas as pd

train_df = pd.read_csv("HeartTrain.csv")
test_df = pd.read_csv("HeartTest.csv")

In this task, we will evaluate how well support vector machines, fitted on the training data, can distinguish abnormal from normal heart SPECT images in the test set.

SVMs are implemented in Python as sklearn.svm.SVC.


Question 1: Basic SVM

Let's start by training a basic linear SVM. One metaparameter of SVC must be set to make sure you obtain a linear SVM (leave all other metaparameters fixed at their default values).

To evaluate the performance of this classifier, 10-fold cross-validation will be used. It consists of:

  • Splitting the training set into 10 different consecutive folds of the same size (do not shuffle the data, such that the first fold contains the first 10% of examples, the second fold the next 10% of examples, etc.)
  • Using each fold in turn as validation set: train the SVM using the 9 other folds together, and evaluate the classification accuracy on the validation set. Repeat this such that each fold is used once for validation.

Report the cross-validation accuracy, i.e. the mean validation accuracy over the 10 folds.

You may assume that the data frame train_df exists in memory. Note that you must include a preprocessing step that scales the input features (to 0 mean and 1 standard deviation) by fitting the scaling parameters on the training fraction (90% of the data in a 10-fold CV protocol) and by applying them also on the validation fraction. This can be easily implemented by including StandardScaler in a Pipeline.

You should apply the same scaling logic to all subsequent tasks: always include a preprocessing step fitted on the training fraction (the fraction on which you fit a model: either the training fraction of a CV protocol or the full training set, depending on the task) and then apply the fitted scaling parameters to the evaluation fraction (either the validation fraction of a CV protocol or the test set, depending on the task). You should not perform any other (pre-)processing step on the data.

Report here your full code for the cross-validation process. Assign the cross-validation accuracy to the variable cv_acc.
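A sketch of this protocol could look as follows. It uses a synthetic stand-in for train_df so the snippet runs on its own; in the assignment, the data frame comes from pd.read_csv("HeartTrain.csv") instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same column layout (F_1..F_44, labels);
# replace with train_df = pd.read_csv("HeartTrain.csv").
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(120, 44)),
                        columns=[f"F_{i}" for i in range(1, 45)])
train_df["labels"] = rng.integers(0, 2, size=120)

X = train_df.drop(columns=["labels"])
y = train_df["labels"]

# The Pipeline refits StandardScaler on the 9 training folds only and
# applies those fitted parameters to the held-out fold (no leakage).
model = Pipeline([("scale", StandardScaler()),
                  ("svm", SVC(kernel="linear"))])

# shuffle=False keeps consecutive folds of equal size, as required.
cv = KFold(n_splits=10, shuffle=False)
cv_acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
```

The key point is that the scaler lives inside the pipeline, so cross_val_score refits it on each training fraction rather than on the whole dataset.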

Note
For this task and all subsequent ones: some predefined functions, e.g. in scikit-learn, offer the option to specify n_jobs=-1 in order to distribute the computation over all available cores on a machine. While you could use this option locally on your machine, you should not include it in any code you submit on Inginious. This is because such parallelization is incompatible with the virtualization process used on Inginious to evaluate many submissions in parallel and could therefore result in a runtime error when evaluating your code.

Question 2: Polynomial kernel degree

Now we will evaluate the impact of using a kernel function, starting with a polynomial kernel. This kernel has several metaparameters, among them the degree of the polynomial. Find the optimal value of this parameter in the range \([2, 10]\), leaving all other kernel metaparameters at their default values (do this offline, on your machine).

Note that the smallest degree to consider should be 2 as a degree-1 polynomial (with all other kernel metaparameters set at their default value) is simply a linear kernel.

Once you have selected a value, implement the SVM below and report its 10-fold cross-validation accuracy. Use the same procedure as in Question 1 above (i.e. each fold containing 10% of the examples, without shuffling, and used only once as the validation set while the model is trained on the 9 other folds).

Assign the cross-validation accuracy to the variable cv_acc.
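The offline degree scan could be sketched as follows, again with a synthetic stand-in for the real training data; only the degree of the polynomial kernel changes between runs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real training data (see Question 1).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(120, 44)),
                 columns=[f"F_{i}" for i in range(1, 45)])
y = pd.Series(rng.integers(0, 2, size=120))

cv = KFold(n_splits=10, shuffle=False)
scores = {}
for degree in range(2, 11):  # degrees 2..10, degree 1 being linear
    model = Pipeline([("scale", StandardScaler()),
                      ("svm", SVC(kernel="poly", degree=degree))])
    scores[degree] = cross_val_score(model, X, y, cv=cv).mean()

# Keep the degree with the highest cross-validation accuracy.
best_degree = max(scores, key=scores.get)
cv_acc = scores[best_degree]
```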

Question 3: Polynomial kernel parameters

Test question [20 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 and 2 correctly.

An SVM with a polynomial kernel has other metaparameters. Try tuning them to obtain the best cross-validation accuracy.

Once you have tuned (offline, on your machine) your polynomial SVM to your satisfaction, instantiate it to a variable named my_poly_svm (which must thus contain an instance of SVC created with the best kernel metaparameters you have found). After this variable assignment, you should not fit the SVM in any way as it will be done automatically by the grading procedure.

Your score for this question will be proportional to the obtained cross-validation accuracy (computed by Inginious with the same procedure as in Question 1 and Question 2; don't implement the cross-validation here).
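One possible way to tune the remaining kernel metaparameters offline is a grid search over C, gamma and coef0. The parameter ranges below are purely illustrative, not the values you should submit; the degree is fixed here at its default for the sake of the sketch.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (see Question 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 44))
y = rng.integers(0, 2, size=120)

pipe = Pipeline([("scale", StandardScaler()),
                 ("svm", SVC(kernel="poly", degree=3))])

# Illustrative grid only; widen these ranges in your own tuning.
grid = {"svm__C": [0.1, 1, 10],
        "svm__gamma": ["scale", 0.01, 0.1],
        "svm__coef0": [0.0, 1.0]}

search = GridSearchCV(pipe, grid, cv=KFold(n_splits=10, shuffle=False))
search.fit(X, y)

# Instantiate an *unfitted* SVC with the best parameters found,
# as required by the grading procedure.
best = search.best_params_
my_poly_svm = SVC(kernel="poly", degree=3,
                  C=best["svm__C"], gamma=best["svm__gamma"],
                  coef0=best["svm__coef0"])
```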

Question 4: Other kernels

Test question [40 points]

Do the same exercise as in Question 3 using RBF and sigmoid kernels, to be stored respectively in variables my_rbf_svm and my_sigm_svm.

Again, your score for this question will be proportional to the obtained cross-validation accuracies (computed by Inginious with the same procedure as in Question 1 and Question 2; don't implement the cross-validation here).

Question 5: Test set performance

Test question [10 points]

Now that you have selected the best metaparameters for each kernel, let us see whether the cross-validation accuracy accurately predicts the performance on the test set.

Compare (offline, on your machine) the test accuracy of the models you defined in the two previous questions (i.e. the polynomial, RBF and sigmoid SVMs which you tuned) as well as the linear model from question 1. Train these 4 SVMs (with the metaparameters you chose in the previous questions, don't change any of them) using the entire training set and compute their classification accuracy on the test set. Which one of the 4 performs best on the test set?

Indicate which model performs best by assigning the name of its kernel, i.e. "linear", "poly", "rbf" or "sigmoid", to the string variable kernel_choice.

Note: you need not provide any code other than the kernel_choice assignment and (optionally) the code to preprocess the test set.
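The offline comparison could be sketched as follows. Synthetic stand-ins replace the real train/test frames, and the candidate SVCs below use default metaparameters purely as placeholders for the tuned models from the previous questions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-ins for the real HeartTrain/HeartTest data.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(120, 44)), rng.integers(0, 2, 120)
X_test, y_test = rng.normal(size=(40, 44)), rng.integers(0, 2, 40)

# Placeholder models; substitute the metaparameters you chose.
candidates = {"linear": SVC(kernel="linear"),
              "poly": SVC(kernel="poly"),
              "rbf": SVC(kernel="rbf"),
              "sigmoid": SVC(kernel="sigmoid")}

test_acc = {}
for name, svm in candidates.items():
    # Scaler fitted on the full training set, then applied to the
    # test set by the pipeline at prediction time.
    model = Pipeline([("scale", StandardScaler()), ("svm", svm)])
    model.fit(X_train, y_train)
    test_acc[name] = model.score(X_test, y_test)

kernel_choice = max(test_acc, key=test_acc.get)
```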

Question 6: Support vectors and overfitting

Test question [20 points]

Is the optimal performance of an SVM on this dataset correlated with the minimal number of support vectors obtained when varying the metaparameters? Train different models with significantly different metaparameter settings, and analyze the corresponding numbers of support vectors, together with the training and test set classification results, to answer the following questions.
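The number of support vectors of a fitted SVC is available through its n_support_ attribute. A minimal sketch of the suggested analysis, with synthetic stand-in data and illustrative C values only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, standardized up front for simplicity.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(120, 44)))
y = rng.integers(0, 2, size=120)

# Illustrative settings only; vary C (and the kernel) more widely
# in your own analysis, and also record test set accuracy.
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="rbf", C=C).fit(X, y)
    n_sv = svm.n_support_.sum()   # support vectors per class, summed
    train_acc = svm.score(X, y)
```

Recall that a small C widens the margin and typically yields more support vectors, while a large C can overfit with fewer of them.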

Select all valid statements.

Question 7: Explainable AI

Test question [10 points]

The EU General Data Protection Regulation (GDPR) enshrines a "right to an explanation" for AI-driven decisions into law. This means that whenever AI is used to make a decision involving a person (typically the case in a medical application), that person has the right to know what drove the algorithm to its decision.

If the user of your software asks which features are the most important to predict the class of an example, which 4 features would you give them?

Hint: We are looking for 4 features that interact well together. One particular model is especially convenient to answer this question.

Give your answer in the following format: feat_1,feat_2,feat_3,feat_4 where each feat_i is the (unquoted) name of a specific input feature in this dataset (the order between the 4 chosen features does not matter).
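One convenient candidate model is the linear SVM, whose coef_ attribute holds one weight per (standardized) input feature. A sketch of how the 4 largest-magnitude weights could be extracted; it uses synthetic stand-in data, so the printed features are not the answer to this question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with the dataset's column names F_1..F_44.
rng = np.random.default_rng(0)
feature_names = [f"F_{i}" for i in range(1, 45)]
X = StandardScaler().fit_transform(rng.normal(size=(120, 44)))
y = rng.integers(0, 2, size=120)

# With standardized inputs, the magnitude of each linear-SVM weight
# is a natural measure of the corresponding feature's importance.
svm = SVC(kernel="linear").fit(X, y)
weights = np.abs(svm.coef_[0])
top4 = [feature_names[i] for i in np.argsort(weights)[::-1][:4]]
print(",".join(top4))
```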