A2.2 - Linear Discriminants and SVMs: in practice - [LINFO2262] Machine Learning: classification and evaluation

You are in charge of implementing machine learning models to classify waveforms into one of 3 possible classes from a set of 40 continuous input features. These input features are corrupted by noise.

https://inginious.info.ucl.ac.be/course/LINFO2262/A2-2/Waveform.jpg

The dataset has been partitioned for you into 2 files, used respectively for training and testing your models:

A training set with 4000 examples, available here: Waveform_train.csv

A test set with 1000 examples, available here: Waveform_test.csv

Those .csv files should be read and loaded in data frames:

import pandas as pd

train_df = pd.read_csv("Waveform_train.csv")
test_df = pd.read_csv("Waveform_test.csv")

The first 40 columns correspond to the input features, named x_1 to x_40, and the last column, named labels, encodes the class label of each example.
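A minimal sketch of separating the input features from the class labels, assuming the column layout described above. The small frame demo_df is a hypothetical stand-in with random values, used only so the snippet runs without the CSV files; the integer label encoding is also an assumption. With the real data, train_df would take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df with the same column layout
# (x_1 ... x_40 followed by labels); values are random, not real waveforms.
rng = np.random.default_rng(0)
feature_names = [f"x_{i}" for i in range(1, 41)]
demo_df = pd.DataFrame(rng.normal(size=(6, 40)), columns=feature_names)
demo_df["labels"] = [0, 1, 2, 0, 1, 2]  # assumed encoding, for illustration

# Separate the 40 input features from the class labels.
X = demo_df.drop(columns="labels").to_numpy()
y = demo_df["labels"].to_numpy()

print(X.shape, y.shape)
```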

In this task, we will evaluate how support vector machines fitted on the training set can be used to predict the class labels of the test examples.

SVMs are implemented in Python in sklearn.svm.SVC.

Question 1: Basic SVM

Let's start by training a basic linear SVM. One meta-parameter of SVC must be set to make sure you get a linear SVM (leave all other meta-parameters at their default values).

To evaluate the performance of this classifier, 10-fold cross-validation will be used. It consists of:

Splitting the training set into 10 different consecutive folds of the same size (do not shuffle the data, such that the first fold contains the first 10% of examples, the second fold the next 10% of examples, etc.)

Using each fold in turn as validation set: train the SVM using the 9 other folds together, and evaluate the classification accuracy on the validation set. Repeat this such that each fold is used once for validation.

Report the cross-validation accuracy, i.e. the mean validation accuracy over the 10 folds.

You may assume that the data frame train_df exists in memory. Note that you must include a preprocessing step that scales the input features (to zero mean and unit standard deviation): fit the scaling parameters on the training fraction (90% of the data in a 10-fold CV protocol) and apply them to the validation fraction as well. This is easily implemented by including StandardScaler in a Pipeline.

You should apply the same scaling logic to all subsequent tasks: always fit the preprocessing step on the training fraction (the fraction on which you fit a model: either the training fraction of a CV protocol or the full training set, depending on the task), then apply the fitted scaling parameters to the evaluation fraction (either the validation fraction of a CV protocol or the test set, depending on the task). You should not perform any other (pre-)processing step on the data.
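The protocol described above can be sketched as follows. This is only one possible implementation, written against synthetic data from make_classification (an assumption, standing in for the features and labels of train_df). Note that an explicit KFold without shuffling is passed: giving cross_val_score a plain integer would use stratified folds for a classifier, which are not the consecutive folds requested here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training features and labels (assumption:
# replace with the arrays extracted from train_df).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Inside a pipeline, the scaler is refit on the 9 training folds at every
# split and then applied to the held-out fold, so no validation statistics
# leak into training.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Consecutive, non-shuffled folds: fold 1 is the first 10% of the rows, etc.
folds = KFold(n_splits=10, shuffle=False)
scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

cv_acc = scores.mean()
print(f"10-fold CV accuracy: {cv_acc:.3f}")
```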

Report your full code for the cross-validation process here. Assign the cross-validation accuracy to the variable cv_acc.

from sklearn.svm import SVC

# TODO: replace by your own python code

cv_acc = 0.0

Question 2: Polynomial kernel degree

Now, we will evaluate the impact of using a kernel function. For now, we will use a polynomial kernel. This kernel has several metaparameters, among which is the degree of the polynomial. Find the optimal value of this parameter in the range $[2, 10]$, leaving all other kernel metaparameters at their default values. Do this offline, on your machine.

Note that the smallest degree to consider should be 2 as a degree-1 polynomial (with all other kernel metaparameters set at their default value) is simply a linear kernel.
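For reference, this remark can be checked against the polynomial kernel as documented in scikit-learn, where $d$ is set by degree, $\gamma$ by gamma and $r$ by coef0 (whose default is 0):

$$K(x, x') = \left(\gamma \langle x, x' \rangle + r\right)^{d}$$

With $d = 1$ and $r = 0$ this reduces to $K(x, x') = \gamma \langle x, x' \rangle$, i.e. the linear kernel up to the positive scaling factor $\gamma$.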

Once you have selected a value, implement the SVM below and report its 10-fold cross-validation accuracy. Use the same procedure as in Question 1 above (i.e. each fold consists of 10% of the examples, taken consecutively without shuffling, and is used exactly once as validation set while the model is trained on the 9 other folds).
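A possible way to run the offline search over the degree is sketched below, again on synthetic stand-in data (an assumption). It reuses the same non-shuffled 10-fold protocol and keeps every other SVC metaparameter at its default value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
folds = KFold(n_splits=10, shuffle=False)

# Mean validation accuracy for each candidate degree in [2, 10].
degree_scores = {}
for degree in range(2, 11):
    model = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=degree))
    degree_scores[degree] = cross_val_score(model, X, y, cv=folds).mean()

best_degree = max(degree_scores, key=degree_scores.get)
print(f"best degree: {best_degree} (CV accuracy {degree_scores[best_degree]:.3f})")
```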

Assign the cross-validation accuracy to the variable cv_acc.

from sklearn.svm import SVC

# TODO: replace by your own python code

cv_acc = 0.0

Question 3: Polynomial kernel parameters

Test question [20 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 and 2 correctly.

An SVM with polynomial kernel has other metaparameters. Try tuning them in order to obtain the best cross-validation accuracy.

Once you have tuned (offline, on your machine) your polynomial SVM to your satisfaction, instantiate it to a variable named my_poly_svm (which must thus contain an instance of SVC created with the best kernel metaparameters you have found). After this variable assignment, you should not fit the SVM in any way as it will be done automatically by the grading procedure.

Your score for this question will be proportional to the obtained cross-validation accuracy (computed by Inginious with the same procedure as in Question 1 and Question 2; don't implement the cross-validation here).
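One way to organise the offline tuning is a grid search over the pipeline, sketched below under stated assumptions: the data is synthetic, and the grid values are arbitrary illustrations rather than recommended settings. The step name svc is the one make_pipeline assigns automatically. The final line creates an unfitted SVC, as the question requires.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

pipeline = make_pipeline(StandardScaler(), SVC(kernel="poly"))

# Illustrative grid only; the ranges worth exploring depend on the data.
param_grid = {
    "svc__degree": [2, 3],
    "svc__C": [0.1, 1.0, 10.0],
    "svc__gamma": ["scale", 0.01],
    "svc__coef0": [0.0, 1.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="accuracy",
                      cv=KFold(n_splits=10, shuffle=False))
search.fit(X, y)

# Strip the pipeline prefix and build a fresh, unfitted SVC.
best = {name.replace("svc__", ""): value
        for name, value in search.best_params_.items()}
my_poly_svm = SVC(kernel="poly", **best)
print(best, f"CV accuracy {search.best_score_:.3f}")
```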

from sklearn.svm import SVC

my_poly_svm = SVC(kernel="poly", ...)  # TODO: replace by your own python code

Question 4: Other kernels

Test question [40 points]

Do the same exercise as in Question 3 using RBF and sigmoid kernels, to be stored respectively in variables my_rbf_svm and my_sigm_svm.

Again, your score for this question will be proportional to the obtained cross-validation accuracies (computed by Inginious with the same procedure as in Question 1 and Question 2; don't implement the cross-validation here).
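For reference, the two kernels as documented in scikit-learn, with $\gamma$ set by gamma and $r$ by coef0:

$$K_{\text{rbf}}(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right)$$

$$K_{\text{sigmoid}}(x, x') = \tanh\left(\gamma \langle x, x' \rangle + r\right)$$

Besides C, the RBF kernel therefore exposes only gamma, while the sigmoid kernel exposes gamma and coef0.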

from sklearn.svm import SVC

my_rbf_svm = SVC(kernel="rbf", ...)  # TODO: replace by your own python code

my_sigm_svm = SVC(kernel="sigmoid", ...)  # TODO: replace by your own python code

Question 5: Test set performance

Test question [10 points]

Now that you have selected the best metaparameters for each kernel, let us see whether the cross-validation accuracy accurately predicts the performance on the test set.

Compare (offline, on your machine) the test accuracy of the models you defined in the two previous questions (i.e. the polynomial, RBF and sigmoid SVMs which you tuned) as well as the linear model from Question 1. Train these 4 SVMs (with the metaparameters you chose in the previous questions, without changing any of them) using the entire training set and compute their classification accuracy on the test set. Which one of the 4 performs best on the test set?

Indicate which model performs best by assigning the name of its kernel, i.e. "linear", "poly", "rbf" or "sigmoid", to the string variable kernel_choice.

Remark: you need not give any code other than the kernel_choice variable and (optionally) the code to preprocess the test set.
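The offline comparison could be organised as in the sketch below. The data is synthetic and split by hand into a training and a test part (an assumption standing in for the two CSV files), and the four SVMs use default metaparameters as placeholders for the tuned ones. The scaler is fitted on the training part only and reused on the test part through the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training and test sets (assumption).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# Placeholders: substitute the metaparameters tuned in the earlier questions.
candidates = {
    "linear": SVC(kernel="linear"),
    "poly": SVC(kernel="poly"),
    "rbf": SVC(kernel="rbf"),
    "sigmoid": SVC(kernel="sigmoid"),
}

test_acc = {}
for name, svm in candidates.items():
    model = make_pipeline(StandardScaler(), svm)
    model.fit(X_train, y_train)                   # scaler fitted on training data only
    test_acc[name] = model.score(X_test, y_test)  # same scaling applied to the test set

kernel_choice = max(test_acc, key=test_acc.get)
print(test_acc, kernel_choice)
```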

kernel_choice = ""  # TODO: replace by either "linear", "poly", "rbf" or "sigmoid"

Question 6: Support vectors and overfitting

Test question [20 points]

Is the optimal performance of an SVM on the waveform dataset correlated with the minimal number of support vectors obtained when varying the metaparameters? Train different models with significantly different metaparameter settings, and analyze the corresponding number of support vectors, training and test set classification results to answer the following questions.

Select all valid statements:

It is possible to learn an RBF-SVM with a hard margin (i.e. C on the order of $10^4$), which means that the data is linearly separable in the feature space induced by the corresponding RBF kernel.

When the RBF kernel overfits due to a poor choice of $\gamma$, generalization is impossible because every training point is essentially similar (high kernel value) only to itself and dissimilar (low kernel value) to all other points. This can be seen as the number of support vectors tends towards the number of training points.

When varying the $\gamma$ metaparameter ( $\gamma \in$ [0.0001, 0.0005, 0.001, 0.005, 0.01]) of a sigmoid kernel SVM with C=100 (all other metaparameters at default values), one can observe that there is a negative Pearson's correlation between the test accuracy and the number of support vectors.

The number of support vectors strictly increases when increasing the degree of a polynomial SVM (all other metaparameters at default values).

Learning a hard margin (i.e. C on the order of $10^4$) linear SVM on the training set looks difficult (i.e. the computation time is much longer than the one observed with the default metaparameters), hence providing evidence of non-linear separability in the original input space.

Massive overfitting is observed when there are too few support vectors.

When the RBF kernel overfits due to a poor choice of $\gamma$, generalization is impossible because every training point is similar (high kernel value) to all other points. This can be seen as the number of support vectors tends towards the number of training points.
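The kind of experiment this question asks for can be sketched as follows. The data is synthetic and the list of $\gamma$ values is purely illustrative (both are assumptions); the point is only to show how the number of support vectors, the training accuracy and the test accuracy can be collected side by side for each setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), split into train and test parts.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

results = []
for gamma in [0.001, 0.01, 0.1, 1.0, 10.0]:  # illustrative values only
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma))
    model.fit(X_train, y_train)
    # n_support_ holds one count per class; the sum is the total number of SVs.
    n_sv = int(model.named_steps["svc"].n_support_.sum())
    results.append((gamma, n_sv,
                    model.score(X_train, y_train),
                    model.score(X_test, y_test)))

for gamma, n_sv, train_acc, test_acc in results:
    print(f"gamma={gamma:<6} SVs={n_sv:<4} train={train_acc:.3f} test={test_acc:.3f}")

# Pearson correlation between the number of support vectors and the test
# accuracy (nan if either quantity is constant across the settings).
n_svs = np.array([r[1] for r in results], dtype=float)
test_accs = np.array([r[3] for r in results])
pearson_r = np.corrcoef(n_svs, test_accs)[0, 1]
print(f"Pearson r = {pearson_r:.3f}")
```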

Question 7: Explainable AI

Test question [10 points]

The EU General Data Protection Regulation (GDPR) puts the "right to an explanation" when dealing with AI into law. This means that whenever AI is used to make a decision involving a person (this is the case for such a medical application), the latter has a right to know what drove the algorithm to make the decision.

If the user of your software asks you which features are most important for determining the class of the waveform, which 4 features would you give them?

Hint: We are looking for 4 features that interact well together. One particular model is especially convenient to answer this question.

Give your answer in the following format: feat_1,feat_2,feat_3,feat_4 where each feat_i is the (unquoted) name of a specific input feature in this dataset (the order between the 4 chosen features does not matter).
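One possible reading of the hint, which is an assumption and not confirmed by the assignment, is that a linear SVM is the convenient model because its weights can be inspected directly through coef_. The sketch below ranks features by the summed absolute weight across the pairwise classifiers, on synthetic data with hypothetical feature names x_1 to x_10. Other ways of aggregating the weights are equally defensible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with hypothetical feature names (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
feature_names = [f"x_{i}" for i in range(1, 11)]

model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X, y)

# coef_ is only available for the linear kernel; with 3 classes it has one
# row per pair of classes (3 rows) and one column per feature.
weights = model.named_steps["svc"].coef_

# Aggregate the weight magnitudes over the pairwise classifiers and keep
# the 4 features with the largest totals.
importance = np.abs(weights).sum(axis=0)
top4 = [feature_names[i] for i in np.argsort(importance)[::-1][:4]]
print(",".join(top4))
```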

Author(s): Pierre Dupont, Benoit Ronval
Deadline: 09/03/2025 23:00:00
Submission limit: No limitation
