
You are a data analyst in charge of predicting the occurrence of Heart Failure from clinical and patient status variables:
- The original dataset consists of 1700 samples described by 112 input variables and a binary target class. This dataset has been partitioned for you into two files, HeartFailure_train.csv (1275 samples) and HeartFailure_test.csv (425 samples), used respectively for training and testing your models.
- The classification task consists in building a decision tree from the training set to predict the HeartFailure status on the test samples.
You can load the datasets with:

    import pandas as pd
    train_df = pd.read_csv("HeartFailure_train.csv")
    test_df = pd.read_csv("HeartFailure_test.csv")

Our default learning algorithm is CART, as implemented in the sklearn.tree.DecisionTreeClassifier class, and we use it without pruning.
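As a minimal sketch of what "without pruning" can look like in practice, the snippet below fits a fully grown tree and reports its training and test accuracy. It runs on a synthetic stand-in generated with make_classification (an assumption made only so the example is self-contained), with the same sizes as the dataset described above. With the real files, one would presumably build X and y from train_df and test_df instead; the name of the target column (for instance "HeartFailure") is an assumption to check against the CSV header.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 1700 samples, 112 inputs, binary target.
# With the real data, X and y would come from train_df / test_df instead.
X, y = make_classification(
    n_samples=1700, n_features=112, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=425, stratify=y, random_state=0
)

# No pruning: max_depth=None and ccp_alpha=0.0 are the scikit-learn defaults,
# written out here to make the choice explicit.
tree = DecisionTreeClassifier(max_depth=None, ccp_alpha=0.0, random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on the training set
test_acc = tree.score(X_test, y_test)     # accuracy on the held-out test set
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

An unpruned tree typically fits its training set almost perfectly, so the gap between the two accuracies gives a first indication of overfitting.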
More specifically, we would like to study how the classification results fluctuate depending on the particular training set or test set considered. For this, the scipy.stats module may be a useful tool.
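One possible way to quantify such fluctuations, sketched here under assumptions and not necessarily the protocol intended by the exercise, is to retrain the tree on bootstrap resamples of the training set, record the test accuracy each time, and summarize the results with a Student t confidence interval from scipy.stats. The synthetic data, the number of rounds (n_rounds = 30) and the 95% level are illustrative choices, not values taken from the problem statement.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption) with the same sizes as the real dataset.
X, y = make_classification(
    n_samples=1700, n_features=112, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=425, stratify=y, random_state=0
)

rng = np.random.default_rng(0)
n_rounds = 30  # illustrative number of bootstrap rounds
accuracies = []
for _ in range(n_rounds):
    # Bootstrap: draw training indices with replacement, same size as the set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    accuracies.append(tree.score(X_test, y_test))
accuracies = np.asarray(accuracies)

# 95% t-interval for the mean test accuracy across training resamples.
mean_acc = accuracies.mean()
sem = stats.sem(accuracies)  # standard error of the mean (ddof=1)
low, high = stats.t.interval(0.95, df=n_rounds - 1, loc=mean_acc, scale=sem)
print(f"mean accuracy: {mean_acc:.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
```

The same loop can be adapted to resample the test set instead of the training set, which separates the variability due to the fitted tree from the variability due to the evaluation sample. The t-interval assumes the accuracies are roughly normally distributed, which should be checked before relying on it.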