
You are a data analyst in charge of diagnosing Parkinson's disease from speech signals recorded from real patients and healthy controls. To do so, you have access to 756 samples:
- The data contains 754 input variables, including some patient characteristics such as age and gender, and a large collection of acoustic features (Time Frequency, MEL Cepstrum, Vocal Fold, etc.) from the sustained phonation of the vowel /a/.
- The last variable, representing the label of the patient or control sample, has been coded with 2 values: 1 (Parkinson) or 0 (Healthy).
The data have been partitioned for you into two files:
- A training set with 498 examples, available here: Parkinson_train.csv
- A test set with 258 examples, available here: Parkinson_test.csv
The data can be loaded into DataFrames using Python's pandas library:
import pandas as pd
train_df = pd.read_csv("Parkinson_train.csv")
test_df = pd.read_csv("Parkinson_test.csv")
Here is a pandas tutorial on how to use data frames, and in particular how to select subsets of rows and columns.
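As an illustration of row and column selection, here is a minimal sketch that separates the input variables from the label. It assumes the label is the last column of the data frame, as described above. The helper name `split_features_label`, the toy frame and its column names are made up for the example and are not part of the provided files.

```python
import pandas as pd


def split_features_label(df: pd.DataFrame):
    """Split a data frame into inputs X and label y (assumes the label is the last column)."""
    X = df.iloc[:, :-1]  # every column except the last one
    y = df.iloc[:, -1]   # last column: 1 (Parkinson) or 0 (Healthy)
    return X, y


# Tiny made-up frame standing in for the real CSV (hypothetical column names).
toy_df = pd.DataFrame({
    "age": [61, 70, 55, 48],
    "feature_a": [0.12, 0.40, 0.33, 0.25],
    "label": [1, 1, 0, 0],
})

X, y = split_features_label(toy_df)
parkinson_rows = toy_df[y == 1]  # boolean mask: keep only the rows labelled 1
first_two = toy_df.iloc[:2]      # first two rows, selected by position
```

The same helper should apply unchanged to `train_df` and `test_df` once they are loaded, provided the label really is their last column.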
Your main task is to build a model from the training set to predict the label value on the test examples using decision tree learning.
Part of the work has been done for you in python's sklearn package.
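A possible starting point is sketched below using sklearn's `DecisionTreeClassifier` with default settings. This is only one way to set up the task, not the expected solution. The function name `fit_and_predict`, the `random_state` value and the toy data are assumptions made for the example, and the label is again taken to be the last column.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def fit_and_predict(train_df, test_df, random_state=0):
    """Fit a decision tree on train_df and predict the label of each row of test_df."""
    # Assumption: the label is the last column, all others are inputs.
    X_train, y_train = train_df.iloc[:, :-1], train_df.iloc[:, -1]
    X_test = test_df.iloc[:, :-1]
    tree = DecisionTreeClassifier(random_state=random_state)
    tree.fit(X_train, y_train)
    return tree, tree.predict(X_test)


# Made-up, trivially separable data standing in for the real files.
toy_train = pd.DataFrame({"x": [0.1, 0.2, 0.8, 0.9], "label": [0, 0, 1, 1]})
toy_test = pd.DataFrame({"x": [0.15, 0.85], "label": [0, 1]})

tree, predictions = fit_and_predict(toy_train, toy_test)
# Fraction of test rows whose predicted label matches the true one.
accuracy = (predictions == toy_test.iloc[:, -1].to_numpy()).mean()
```

With the real data, the call would presumably be `fit_and_predict(train_df, test_df)`. Fixing `random_state` keeps the run reproducible when several splits are equally good.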
For this task, and all subsequent tasks of this assignment, you must use version 1.5.X (as some of the more recent versions introduced a bug in decision tree learning).
In a nutshell, you must use the following installation procedure (or an equivalent using conda and/or your favorite IDE):
pip install scikit-learn==1.5.0
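After installing, it can be worth confirming which version is actually in use. The helper below is a hypothetical convenience, not part of sklearn: it only checks that a version string belongs to the 1.5.X series.

```python
def is_supported_version(version: str) -> bool:
    """Return True for any 1.5.X release string, e.g. "1.5.0" or "1.5.2"."""
    parts = version.split(".")
    return parts[:2] == ["1", "5"]


# Typical use after installation (left commented so the helper stands alone):
# import sklearn
# assert is_supported_version(sklearn.__version__), sklearn.__version__
```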