You are a data analyst in charge of predicting the occurrence of Heart Failure from clinical and patient status variables:
- The original dataset consists of 111 input variables and a binary target class. Here, this dataset has been partitioned for you into two files, HeartFailure_train.csv (800 examples) and HeartFailure_test.csv (305 examples), used respectively for training and testing your models.
- The classification task consists of building a decision tree from the training set to predict the HeartFailure status (either 0 or 1) on the test examples.
You can load the datasets with
import pandas as pd
train_df = pd.read_csv("HeartFailure_train.csv")
test_df = pd.read_csv("HeartFailure_test.csv")
We consider here a Decision Tree Learning algorithm as implemented in sklearn.tree.DecisionTreeClassifier. More specifically, we would like to study the possible fluctuations of the classification results according to the specific training or test set considered. For this, the scipy.stats package may be a useful tool.
For this task, we consider the Balanced Classification Rate (BCR) (see the course slides) as the performance metric. It is particularly relevant for such a medical application because the heart failure "positive" examples form a minority class, yet they are considered as important as the "negative" examples (no heart failure), which form the majority class.
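As a reminder, the BCR of a binary classifier is the average of its per-class recalls (sensitivity and specificity). In scikit-learn this corresponds to balanced_accuracy_score. A minimal sketch with hypothetical label vectors (not taken from the HeartFailure data):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical true labels and predictions, with a minority positive class
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 0])

# BCR = (recall on class 0 + recall on class 1) / 2
recall_neg = recall_score(y_true, y_pred, pos_label=0)  # 4/6
recall_pos = recall_score(y_true, y_pred, pos_label=1)  # 1/2
bcr = (recall_neg + recall_pos) / 2

# balanced_accuracy_score computes the same quantity
assert np.isclose(bcr, balanced_accuracy_score(y_true, y_pred))
```

Note that the plain accuracy here would be 5/8 = 0.625, driven mostly by the majority class, whereas the BCR of 7/12 ≈ 0.583 weighs both classes equally.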
When learning a DecisionTreeClassifier for this task, you must always use the following meta-parameters:
criterion = 'entropy', min_impurity_decrease = 0.01, max_depth = 10, class_weight = 'balanced'.
The last one balances the importance of the classes when learning the tree, in the same spirit as BCR does when evaluating the performance of the learned tree.
All other meta-parameters should always be left at their default values, except for random_state as described below.
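To illustrate the kind of fluctuation study intended, the sketch below trains the tree with the required meta-parameters over several values of random_state and summarizes the resulting BCRs with scipy.stats. It uses a synthetic imbalanced dataset from sklearn.datasets.make_classification (with the same 800/305 split sizes) as a stand-in for the HeartFailure files, so the numbers are illustrative only:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HeartFailure data: 111 features, minority positives
X, y = make_classification(n_samples=1105, n_features=111,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=800, stratify=y, random_state=0)

bcrs = []
for seed in range(10):  # vary only random_state; keep the required meta-parameters
    clf = DecisionTreeClassifier(criterion='entropy',
                                 min_impurity_decrease=0.01,
                                 max_depth=10,
                                 class_weight='balanced',
                                 random_state=seed)
    clf.fit(X_train, y_train)
    bcrs.append(balanced_accuracy_score(y_test, clf.predict(X_test)))

bcrs = np.array(bcrs)
# Summarize the fluctuation across seeds: mean BCR and a 95% Student-t interval
mean_bcr = bcrs.mean()
ci = stats.t.interval(0.95, df=len(bcrs) - 1,
                      loc=mean_bcr, scale=stats.sem(bcrs))
print(f"mean BCR = {mean_bcr:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

For the real task, replace the synthetic data with the train/test DataFrames loaded above. Note that random_state in DecisionTreeClassifier only breaks ties between equally good splits, so on a given dataset the spread across seeds may be small or even zero.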