A1.2 - Decision Tree Learning: in practice - [LINFO2262] Machine Learning: classification and evaluation

https://inginious.info.ucl.ac.be/course/LINFO2262/A1-2/DIA_2.jpeg

https://inginious.info.ucl.ac.be/course/LINFO2262/A1-2/DIA.jpeg

You are a data analyst in charge of predicting to what extent some specific drugs induce autoimmune diseases. In such diseases, medication acts as a trigger that impairs immune tolerance, leading to an inappropriate autoimmune attack on the host tissues after drug ingestion. To do so, you have access to 397 samples (a subset of the data used in this original study):

The data contains 196 input variables representing molecular properties and structural characteristics of some drugs.

Additionally, a class variable denoted Label represents the drug-induced autoimmunity (DIA) outcome. It is coded with 2 values: 1 (DIA positive) or 0 (DIA negative).

The data have been partitioned for you into two files:

A training set with 197 examples, available here: DrugImmunity_train.csv

A test set with 200 examples, available here: DrugImmunity_test.csv

The data can be loaded into DataFrames using Python's pandas library:

import pandas as pd

train_df = pd.read_csv("DrugImmunity_train.csv")
test_df = pd.read_csv("DrugImmunity_test.csv")

Here is a pandas tutorial on how to use data frames, and in particular how to select subsets of rows and columns.
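As a minimal sketch of the row and column selection mentioned above, the snippet below uses a tiny synthetic data frame. The name toy_df and the feature columns feat_a and feat_b are made up for illustration; only the Label column name comes from the assignment.

```python
import pandas as pd

# Tiny synthetic stand-in for train_df: two made-up feature columns plus Label.
toy_df = pd.DataFrame({
    "feat_a": [0.1, 0.4, 0.35, 0.8],
    "feat_b": [1, 0, 1, 0],
    "Label": [0, 1, 0, 1],
})

# Column selection: everything except the class variable forms the inputs.
X = toy_df.drop(columns=["Label"])
y = toy_df["Label"]

# Row selection with a boolean mask: keep only the positive examples.
positives = toy_df[toy_df["Label"] == 1]

# Positional row selection: the first two rows, all columns.
head_rows = toy_df.iloc[:2]
```

The same drop and indexing calls should carry over to train_df and test_df, assuming Label is the only non-feature column in the CSV files.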

Your main task is to build a model from the training set to predict the Label value on the test examples using decision tree learning.

Part of the work has been done for you in the sklearn package. For this task, and all subsequent tasks of this assignment, you must use version 1.8.0 to guarantee reproducibility of the computed results on Inginious. In a nutshell, you must use the following installation procedure (or an equivalent using conda and/or your favorite IDE):

pip install scikit-learn==1.8.0


Question 1: A basic tree

Use the DrugImmunity_train.csv training file and the sklearn.tree.DecisionTreeClassifier CART implementation to build a decision tree without pruning to predict whether the class label is 1 or 0. Use it to predict the class labels of both the train and test set.

You can assume that the data frames train_df and test_df already exist in memory. Use the parameter random_state=0 when calling the classifier's constructor (as features could be reordered, it ensures reproducibility of your results on Inginious).

Assign the prediction vectors to the variables y_train_pred and y_test_pred.

from sklearn.tree import DecisionTreeClassifier

# TODO: replace by your own python code

y_train_pred = []
y_test_pred = []
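The fit and predict calls involved can be sketched on synthetic data as follows. The helper make_toy, the feature names f0 to f4, and the toy_ variables are hypothetical stand-ins for the real data frames. The sketch relies on the default constructor arguments of DecisionTreeClassifier, which apply no depth limit and no cost-complexity pruning.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_toy(n):
    """Build a synthetic data frame with 5 numeric features and a binary Label."""
    values = rng.normal(size=(n, 5))
    df = pd.DataFrame(values, columns=[f"f{j}" for j in range(5)])
    # Noisy threshold on the first feature (purely illustrative).
    df["Label"] = (values[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return df

toy_train, toy_test = make_toy(60), make_toy(40)

X_train, y_train = toy_train.drop(columns=["Label"]), toy_train["Label"]
X_test = toy_test.drop(columns=["Label"])

# Default arguments grow the tree fully; random_state fixes tie-breaking.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

toy_train_pred = clf.predict(X_train)
toy_test_pred = clf.predict(X_test)
```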

Question 2: A basic tree (continued)

What is:

The number of nodes of the learned tree?

The classification accuracy on the training data?

The classification accuracy on the test data?

Give your answer in the following format: nb_nodes, train_accur, test_accur.

For accuracy values, use the decimal representation (e.g. 0.942, not 94.2%). When rounding, give at least 3 decimals.
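One possible way to read these three quantities from a fitted tree is sketched below on synthetic data generated with make_classification. The data and the variable names are assumptions for illustration; tree_.node_count, accuracy_score and score are existing sklearn attributes and functions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, split by hand into train and test.
X, y = make_classification(n_samples=100, n_features=8, random_state=0)
X_train, y_train = X[:60], y[:60]
X_test, y_test = X[60:], y[60:]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Total number of nodes (internal nodes plus leaves) of the fitted tree.
nb_nodes = clf.tree_.node_count

# Two equivalent ways of computing classification accuracy.
train_accur = accuracy_score(y_train, clf.predict(X_train))
test_accur = clf.score(X_test, y_test)

# Print in the requested "nb_nodes, train_accur, test_accur" format.
print(f"{nb_nodes}, {train_accur:.3f}, {test_accur:.3f}")
```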

Question 3: Training accuracy

Is the training set accuracy guaranteed to be 100 % with CART without pruning?

Only if the data is consistent (no 2 examples with same features \(\mathbf{x}_1 = \mathbf{x}_2\) but different outcome \(y_1 \neq y_2\)).

No, even if the data is consistent (no 2 examples with same features \(\mathbf{x}_1 = \mathbf{x}_2\) but different outcome \(y_1 \neq y_2\)).

Yes, always.
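The notion of consistency used in the answers above can be checked mechanically. The sketch below, on a made-up data frame toy_df with hypothetical features f0 and f1, groups the examples by their feature values and counts the distinct labels in each group.

```python
import pandas as pd

# Synthetic data: rows 0 and 1 share the same features but differ in Label.
toy_df = pd.DataFrame({
    "f0": [0, 0, 1, 1],
    "f1": [1, 1, 0, 1],
    "Label": [0, 1, 0, 1],
})

feature_cols = [c for c in toy_df.columns if c != "Label"]

# For each distinct feature vector, count how many different labels occur.
labels_per_input = toy_df.groupby(feature_cols)["Label"].nunique()

# The data is consistent when every feature vector maps to a single label.
is_consistent = bool((labels_per_input <= 1).all())
```

Note that groupby drops rows with missing values in the grouping columns by default, so this check assumes complete feature values.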

Question 4: Resampling data

Use 25 % of the training data (sampled without replacement) to build a decision tree with CART (without pruning; use the exact same meta-parameters as before, including random_state=0) and all the test data to assess its performance. The sample method of the pandas.DataFrame object may be very useful to extract a random subset of examples from a training set.

Run this experiment 100 times. What is the mean number of nodes and classification accuracy (both on the 25% training and the whole test set) of these trees over the 100 distinct runs?

To facilitate grading, we ask you to use a specific random seed for each run, only for sampling the train data. To obtain the train data of the i-th run (starting at 0), use the sample method of the pandas.DataFrame object with the argument random_state=i.

Report the results of each individual run in a pandas dataframe named frame with the following column names: Run (which will contain the index i of the run), NodeCount, TrainAcc, TestAcc. From this dataframe, compute the means and store these values in the float variables mean_nb_nodes, mean_train_accur and mean_test_accur.

from sklearn.tree import DecisionTreeClassifier
import pandas as pd

frame = pd.DataFrame(columns = ["Run", "NodeCount", "TrainAcc", "TestAcc"])

# TODO: replace by your own python code

mean_nb_nodes = 0
mean_train_accur = 0.0
mean_test_accur = 0.0
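The resampling loop can be sketched as below, on synthetic data and with only 5 runs instead of 100. The helper make_toy and all toy_ names are hypothetical. Rows are collected in a list and converted to a data frame at the end, which is one possible design among others.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_toy(n):
    """Build a synthetic data frame with 5 numeric features and a binary Label."""
    values = rng.normal(size=(n, 5))
    df = pd.DataFrame(values, columns=[f"f{j}" for j in range(5)])
    df["Label"] = (values[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return df

toy_train, toy_test = make_toy(80), make_toy(50)
X_test, y_test = toy_test.drop(columns=["Label"]), toy_test["Label"]

rows = []
for i in range(5):  # illustrative; the assignment asks for 100 runs
    # sample() draws without replacement by default; the seed is the run index.
    sub = toy_train.sample(frac=0.25, random_state=i)
    X_sub, y_sub = sub.drop(columns=["Label"]), sub["Label"]

    clf = DecisionTreeClassifier(random_state=0).fit(X_sub, y_sub)
    rows.append({
        "Run": i,
        "NodeCount": clf.tree_.node_count,
        "TrainAcc": clf.score(X_sub, y_sub),    # accuracy on the 25 % subset
        "TestAcc": clf.score(X_test, y_test),   # accuracy on the whole test set
    })

toy_frame = pd.DataFrame(rows, columns=["Run", "NodeCount", "TrainAcc", "TestAcc"])

# Column-wise means over the runs.
toy_means = toy_frame[["NodeCount", "TrainAcc", "TestAcc"]].mean()
```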

Question 5: Learning curve

Test question [60 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 to 4 correctly.

Measure the impact of training set size on the accuracy and the size of the learned tree (without pruning; use the whole test set for testing). Consider the following fractions of the total training set: 1%, 5%, 10%, 20%, 50%, 99%, without replacement.

Because of the high variance due to random splits, repeat the experiment with 100 independent random samples for each training set size. Compute the number of nodes, train and test accuracies for the 100 runs for each training set size. (Remark: by train accuracy, we mean the accuracy on the dataset that was used to train a particular model, i.e. a subset of the entire training set.)

To facilitate grading, we ask you to use a specific random seed for each run when sampling the train data. To obtain the train data of the i-th run (starting at 0), use the sample method of the pandas.DataFrame object with the argument random_state=i (runs should go from 0 to 99 for each different training fraction). As before, use random_state=0 for your decision tree model.

Report the results of each individual run in a pandas dataframe named frame with the following column names: Frac (corresponding to the training fraction, i.e. 0.01, 0.05, etc.), Run, NodeCount, TrainAcc, TestAcc.

from sklearn.tree import DecisionTreeClassifier
import pandas as pd

frame = pd.DataFrame(columns = ["Frac", "Run", "NodeCount", "TrainAcc", "TestAcc"])

# TODO: replace by your own python code
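Before running the full experiment, it may help to see how many examples each fraction actually draws. The sketch below uses a dummy data frame with 197 rows (the size of the training set stated above); the names dummy and sizes are illustrative. The number of sampled rows is the fraction times the number of rows, rounded to an integer by pandas.

```python
import pandas as pd

# Training fractions listed in the question.
fractions = [0.01, 0.05, 0.10, 0.20, 0.50, 0.99]

# Dummy frame with as many rows as the training set (197 examples).
dummy = pd.DataFrame({"row": range(197)})

# Number of examples drawn for each fraction (sampling without replacement).
sizes = {frac: len(dummy.sample(frac=frac, random_state=0)) for frac in fractions}

for frac, n in sizes.items():
    print(f"frac={frac:.2f} -> {n} examples")
```

With the smallest fractions only a handful of examples are drawn, so some subsets may contain a single class.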

Question 6: Learning curve (continued)

Test question [40 points]: This question will be graded after the deadline. You will only receive credit for this question if you answered questions 1 to 4 correctly.

Using the results you obtained in the previous question, create a plot showing how the number of nodes varies with the training set size. The pandas boxplot function is a convenient tool for this task. Do you observe the expected evolution of the tree sizes as a function of the number of training examples? (To match your expectation, make sure trees are learned without pre- or post-pruning.)

Create a plot showing how accuracy on the test set varies with the training set size. This plot, called a learning curve, is fundamental for estimating whether a program learns, that is, whether it actually improves with experience. The boxplot function could also be useful here.

Due to the random selection of training examples, results may vary even for a fixed number of training examples. How do you expect this variance to be a function of the training size? Do you confirm your expectation on the plot?
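A minimal sketch of the plotting calls follows. The values in toy_frame are random placeholders drawn independently of Frac, so the resulting figure says nothing about the real experiment; only the column names come from the previous question. The Agg backend is an assumption that lets the script run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
fractions = [0.01, 0.05, 0.10, 0.20, 0.50, 0.99]

# Placeholder results: 10 runs per fraction, values unrelated to Frac.
toy_frame = pd.DataFrame({
    "Frac": np.repeat(fractions, 10),
    "Run": np.tile(np.arange(10), len(fractions)),
    "NodeCount": rng.integers(1, 30, size=60),
    "TestAcc": rng.uniform(0.4, 0.9, size=60),
})

# One box per training fraction, drawn on an explicitly created axis.
fig, ax = plt.subplots()
toy_frame.boxplot(column="NodeCount", by="Frac", ax=ax)
ax.set_xlabel("Training fraction")
ax.set_ylabel("Number of nodes")

# Numeric summary of what the boxes display: central tendency and spread.
spread = toy_frame.groupby("Frac")["TestAcc"].agg(["median", "std"])
```

Replacing column="NodeCount" with column="TestAcc" gives the learning curve; fig.savefig or plt.show can then be used to view the figure.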

Select all valid sentences

The median test accuracy clearly tends to increase as a function of the training set size.

The median test accuracy does not tend to increase with the training set size, illustrating a possible overfitting issue.

The variance of the test accuracy decreases when the training set size increases.

The number of nodes of the trees increases with the training set size because more splits are needed to classify all training examples.

The number of nodes of the trees is approximately constant with the training set size, as only few nodes are needed to have a good model.

The test accuracy is roughly proportional to the number of nodes in the tree.

The variance of the number of nodes clearly decreases when the training set size increases.

Author(s)	Pierre Dupont, Benoit Ronval
Deadline	22/02/2026 23:00:00
Submission limit	No limitation
