A1.2 - Decision Tree Learning: in practice - [LINFO2262] Machine Learning: classification and evaluation


https://inginious.info.ucl.ac.be/course/LINFO2262/A1-2/ParkinsonB.png

You are a data analyst in charge of diagnosing Parkinson's disease from speech signals recorded from real patients and healthy control subjects. To do so, you have access to 756 samples.

The data contains 754 input variables, including some patient characteristics such as age and gender, and a large collection of acoustic features (Time Frequency, MEL Cepstrum, Vocal Fold, etc.) from the sustained phonation of the vowel /a/.

The last variable, representing the label of the patient or control sample, has been coded with 2 values: 1 (Parkinson) or 0 (Healthy).

The data have been partitioned for you into two files:

A training set with 498 examples, available here: Parkinson_train.csv

A test set with 258 examples, available here: Parkinson_test.csv

The data can be loaded into DataFrames using Python's pandas library:

import pandas as pd

train_df = pd.read_csv("Parkinson_train.csv")
test_df = pd.read_csv("Parkinson_test.csv")

Here is a pandas tutorial on how to use data frames, and in particular how to select subsets of rows and columns.
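As a minimal sketch of the kind of row and column selection this task relies on, the snippet below separates the input variables from the label on a tiny made-up frame. It assumes, as stated above, that the label is the last column; the column names age, f1 and label are hypothetical and are not the names used in the CSV files.

```python
import pandas as pd

# Tiny made-up frame standing in for train_df (hypothetical column names).
toy_df = pd.DataFrame(
    {"age": [61, 58, 70], "f1": [0.12, 0.40, 0.33], "label": [1, 0, 1]}
)

# All columns except the last one are input variables.
X = toy_df.iloc[:, :-1]
# The last column holds the class label.
y = toy_df.iloc[:, -1]

# Boolean indexing selects a subset of rows, here the examples labelled 1.
positives = toy_df[toy_df.iloc[:, -1] == 1]
```

Positional selection with iloc avoids depending on the exact column names of the files.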

Your main task is to build a model from the training set to predict the label value on the test examples using decision tree learning.

Part of the work has been done for you in Python's sklearn package. For this task, and all subsequent tasks of this assignment, you must use version 1.5.X (as some of the more recent versions introduced a bug in decision tree learning). In a nutshell, you must use the following installation procedure (or an equivalent using conda and/or your favorite IDE):

pip install scikit-learn==1.5.0
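Before running any experiment, it may be worth checking which version is actually installed. The sketch below is one possible check; the helper name is_supported_version is an assumption made for illustration and is not part of sklearn.

```python
import sklearn


def is_supported_version(version: str) -> bool:
    """Return True when the version string belongs to the 1.5.X series."""
    return version.split(".")[:2] == ["1", "5"]


# Print the installed version and whether it matches the required series.
print(sklearn.__version__, is_supported_version(sklearn.__version__))
```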


Question 1: A basic tree

Use the Parkinson_train.csv training file and the sklearn.tree.DecisionTreeClassifier CART implementation to build a decision tree without pruning to predict whether the class label is 1 or 0. Use it to predict the class labels of both the train and test set.

You can assume that the data frames train_df and test_df already exist in memory. Use the parameter random_state=0 when calling the classifier's constructor (since the features considered at each split may be randomly permuted, this ensures the reproducibility of your results on INGInious).

Assign the prediction vectors to the variables y_train_pred and y_test_pred.

from sklearn.tree import DecisionTreeClassifier

# TODO: replace by your own python code

y_train_pred = []
y_test_pred = []
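For orientation, here is a minimal sketch of the fit and predict calls on synthetic data generated with make_classification. The data, the 150/50 split and the array names are assumptions made for illustration; this is not the Parkinson data and not a reference solution.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the Parkinson files.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# With the default settings (max_depth=None, ccp_alpha=0.0) no pruning is applied.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# predict returns one class label per row of its input.
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
```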


Question 2: A basic tree (continued)

What is:

The number of nodes of the learned tree?

The classification accuracy on the training data?

The classification accuracy on the test data?

Give your answer in the following format: nb_nodes, train_accur, test_accur.

For accuracy values, use the decimal representation (e.g. 0.942, not 94.2%). When rounding, give at least 3 decimals.
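The three quantities can be read from a fitted classifier roughly as sketched below. The example uses synthetic data (an assumption, not the Parkinson files), so the numbers it prints say nothing about the expected answer; it only shows where the node count lives and how the answer string could be formatted.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the Parkinson files.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X[:150], y[:150])

# Total number of nodes (internal nodes plus leaves) of the fitted tree.
nb_nodes = clf.tree_.node_count
train_accur = accuracy_score(y[:150], clf.predict(X[:150]))
test_accur = accuracy_score(y[150:], clf.predict(X[150:]))

# Decimal representation with 3 decimals, in the requested order.
answer = f"{nb_nodes}, {train_accur:.3f}, {test_accur:.3f}"
print(answer)
```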

Question 3: Training accuracy

Is the training set accuracy guaranteed to be 100 % with CART without pruning?

Only if the data is consistent (no 2 examples with same features $\mathbf{x}_1 = \mathbf{x}_2$ but different outcome $y_1 \neq y_2$ ).

No, even if the data is consistent (no 2 examples with same features $\mathbf{x}_1 = \mathbf{x}_2$ but different outcome $y_1 \neq y_2$ ).

Yes, always.
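To make the notion of consistency concrete, the sketch below flags examples that share identical feature values but carry different labels. The helper name inconsistent_mask and the toy columns f1, f2 and label are assumptions made for illustration.

```python
import pandas as pd


def inconsistent_mask(df: pd.DataFrame, label_col: str) -> pd.Series:
    """Flag rows whose feature values also occur with a different label."""
    feature_cols = [c for c in df.columns if c != label_col]
    # Number of distinct labels observed for each distinct feature vector.
    n_labels = df.groupby(feature_cols)[label_col].transform("nunique")
    return n_labels > 1


# Rows 0 and 1 have the same features but different labels.
toy_df = pd.DataFrame({"f1": [1, 1, 3], "f2": [2, 2, 4], "label": [0, 1, 0]})
mask = inconsistent_mask(toy_df, "label")
print(mask.tolist())  # [True, True, False]
```

A data set is consistent in the sense used here when this mask contains no True value.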

Question 4: Resampling data

Use 25 % of the training data (sampled without replacement) to build a decision tree with CART (without pruning; use exactly the same meta-parameters as before, including random_state=0) and all the test data to assess its performance. The sample method of the pandas.DataFrame object may be very useful to extract a random subset of examples from a training set.

Run this experiment 100 times. What is the mean number of nodes and classification accuracy (both on the 25% training and the whole test set) of these trees over the 100 distinct runs?

To facilitate grading, we ask you to use a specific random seed for each run, only for sampling the training data. To obtain the training data of the i-th run (starting at 0), use the sample method of the pandas.DataFrame object with the argument random_state=i.
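The seeded sampling described above can be sketched on a toy frame as follows. The frame and its column names are made up for illustration; only the sample call with frac and random_state mirrors the text.

```python
import pandas as pd

# Toy frame standing in for train_df (hypothetical column names).
toy_df = pd.DataFrame({"f1": range(8), "label": [0, 1] * 4})

# frac=0.25 keeps a quarter of the rows; replace=False is the default,
# so rows are drawn without replacement.
samples = [toy_df.sample(frac=0.25, random_state=i) for i in range(3)]

# Re-using a seed reproduces exactly the same subset.
again = toy_df.sample(frac=0.25, random_state=0)
```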

Report the results of each individual run in a pandas dataframe named frame with the following column names: Run (which will contain the index i of the run), NodeCount, TrainAcc, TestAcc. From this dataframe, compute the means and store these values in the float variables mean_nb_nodes, mean_train_accur and mean_test_accur.

from sklearn.tree import DecisionTreeClassifier
import pandas as pd

frame = pd.DataFrame(columns = ["Run", "NodeCount", "TrainAcc", "TestAcc"])

# TODO: replace by your own python code

mean_nb_nodes = 0
mean_train_accur = 0.0
mean_test_accur = 0.0
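One possible way to organise the runs is sketched below on synthetic data: collect one record per run in a list, build the dataframe once, then take column means. The synthetic data, the 200/100 split, the feature names and the reduced number of runs are all assumptions made for illustration, not the setup of the graded question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the Parkinson files; the label is the last column.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
names = [f"f{j}" for j in range(X.shape[1])]  # hypothetical feature names
data = pd.DataFrame(X, columns=names).assign(label=y)
toy_train, toy_test = data.iloc[:200], data.iloc[200:]

rows = []
for i in range(5):  # 5 runs keep the sketch fast; the question asks for 100
    sub = toy_train.sample(frac=0.25, random_state=i)
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(sub.iloc[:, :-1], sub.iloc[:, -1])
    rows.append({
        "Run": i,
        "NodeCount": clf.tree_.node_count,
        "TrainAcc": clf.score(sub.iloc[:, :-1], sub.iloc[:, -1]),
        "TestAcc": clf.score(toy_test.iloc[:, :-1], toy_test.iloc[:, -1]),
    })

# Building the frame once from a list of records avoids row-by-row appends.
frame = pd.DataFrame(rows, columns=["Run", "NodeCount", "TrainAcc", "TestAcc"])

mean_nb_nodes = float(frame["NodeCount"].mean())
mean_train_accur = float(frame["TrainAcc"].mean())
mean_test_accur = float(frame["TestAcc"].mean())
```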


Question 5: Learning curve

Test question [60 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 to 4 correctly.

Measure the impact of training set size on the accuracy and the size of the learned tree (without pruning; use the whole test set for testing). Consider the following fractions of the total training set: 1%, 5%, 10%, 20%, 50%, 99%, without replacement.

Because of the high variance due to random splits, repeat the experiment with 100 independent random samples for each training set size. Compute the number of nodes and the train and test accuracies for the 100 runs for each training set size. (Note: by train accuracy, we mean the accuracy on the dataset that was used to train a particular model, i.e. a subset of the entire training set.)

To facilitate grading, we ask you to use a specific random seed for each run when sampling the training data. To obtain the training data of the i-th run (starting at 0), use the sample method of the pandas.DataFrame object with the argument random_state=i (runs should go from 0 to 99 for each different training fraction). As before, use random_state=0 for your decision tree model.

Report the results of each individual run in a pandas dataframe named frame with the following column names: Frac (corresponding to the training fraction, i.e. 0.01, 0.05, etc.), Run, NodeCount, TrainAcc, TestAcc.

from sklearn.tree import DecisionTreeClassifier
import pandas as pd

frame = pd.DataFrame(columns = ["Frac", "Run", "NodeCount", "TrainAcc", "TestAcc"])

# TODO: replace by your own python code
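The double loop over fractions and runs could be structured as in the sketch below, shown on synthetic data with only two fractions and three runs to keep it fast. The function name learning_curve_runs, the synthetic data and the feature names are assumptions made for illustration; the function also assumes that the label is the last column.

```python
from itertools import product

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def learning_curve_runs(train, test, fracs, n_runs):
    """Fit one unpruned tree per (fraction, run) pair; label is the last column."""
    X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]
    rows = []
    for frac, i in product(fracs, range(n_runs)):
        # The run index doubles as the sampling seed.
        sub = train.sample(frac=frac, random_state=i)
        X_sub, y_sub = sub.iloc[:, :-1], sub.iloc[:, -1]
        clf = DecisionTreeClassifier(random_state=0).fit(X_sub, y_sub)
        rows.append({
            "Frac": frac,
            "Run": i,
            "NodeCount": clf.tree_.node_count,
            "TrainAcc": clf.score(X_sub, y_sub),
            "TestAcc": clf.score(X_test, y_test),
        })
    columns = ["Frac", "Run", "NodeCount", "TrainAcc", "TestAcc"]
    return pd.DataFrame(rows, columns=columns)


# Synthetic stand-in data, NOT the Parkinson files.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
names = [f"f{j}" for j in range(X.shape[1])]  # hypothetical feature names
data = pd.DataFrame(X, columns=names).assign(label=y)
frame = learning_curve_runs(data.iloc[:300], data.iloc[300:], [0.05, 0.5], 3)
```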


Question 6: Learning curve (continued)

Test question [40 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 to 4 correctly.

Using the results you obtained in the previous question, create a plot showing how the number of nodes varies with the training set size. The pandas boxplot function is a convenient tool for this task. Do you observe the expected evolution of the tree sizes as a function of the number of training examples? (To match your expectation, make sure trees are learned without pre- or post-pruning.)

Create a plot showing how accuracy on the test set varies with the training set size. This plot, called a learning curve, is fundamental for estimating whether a program learns, that is, whether it actually improves with experience. The boxplot function could also be useful here.
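A grouped boxplot of this kind can be drawn roughly as follows. The frame below is filled with random placeholder values, not results from the Parkinson data, so the shape of the resulting plot carries no meaning; the Agg backend is an assumption chosen so that the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder values drawn at random, NOT results from the Parkinson data.
rng = np.random.default_rng(0)
fracs = np.repeat([0.01, 0.05, 0.1], 20)
frame = pd.DataFrame({
    "Frac": fracs,
    "NodeCount": rng.integers(1, 30, size=fracs.size),
    "TestAcc": rng.uniform(0.5, 1.0, size=fracs.size),
})

fig, ax = plt.subplots()
# One box per training fraction, summarising the runs at that fraction.
frame.boxplot(column="TestAcc", by="Frac", ax=ax)
ax.set_xlabel("Training fraction")
ax.set_ylabel("TestAcc")
ax.set_title("")
fig.suptitle("")  # drop the title that pandas adds automatically
```

Replacing column="TestAcc" with column="NodeCount" gives the corresponding plot for tree sizes.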

Due to the random selection of training examples, results may vary even for a fixed number of training examples. How do you expect this variance to depend on the training set size? Does the plot confirm your expectation?
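Besides reading the spread off a plot, the variability per training fraction can be summarised numerically, as in the sketch below. The frame again holds random placeholder values, not results from the Parkinson data, so the printed numbers carry no meaning.

```python
import numpy as np
import pandas as pd

# Placeholder values drawn at random, NOT results from the Parkinson data.
rng = np.random.default_rng(0)
fracs = np.repeat([0.01, 0.5], 50)
frame = pd.DataFrame({
    "Frac": fracs,
    "TestAcc": rng.uniform(0.5, 1.0, size=fracs.size),
})

# Mean and standard deviation of the test accuracy over the runs, per fraction.
summary = frame.groupby("Frac")["TestAcc"].agg(["mean", "std"])
print(summary)
```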

Select all valid statements:

The test accuracy is roughly proportional to the number of nodes in the tree.

The variance of the test accuracy decreases when the training set size increases.

The test accuracy increases as a function of the training set size.

The number of nodes of the trees increases with the training set size because more splits are needed to classify all training examples.

The number of nodes of the trees is approximately constant with the training set size, as only few nodes are needed to have a good model.

The test accuracy first increases with the training set size, then reaches a plateau before largely decreasing due to an overfitting issue.

The variance of the number of nodes clearly decreases when the training set size increases.

Author(s)	Pierre Dupont, Benoit Ronval
Deadline	23/02/2025 23:00:00
Submission limit	No limitation
