You are a data analyst in charge of predicting to what extent specific drugs induce autoimmune diseases. In such diseases, medication acts as a trigger that impairs immune tolerance, leading to an inappropriate autoimmune attack on the host tissues after drug ingestion.
To do so, you have access to 397 samples (a subset of the data used in this original study):
- The data contains 196 input variables representing molecular properties and structural characteristics of some drugs.
- Additionally, a class variable denoted Label represents the drug-induced autoimmunity (DIA) outcome. It has been coded with 2 values: 1 (DIA positive) or 0 (DIA negative).
The data have been partitioned for you into two files:
- A training set with 197 examples, available here: DrugImmunity_train.csv
- A test set with 200 examples, available here: DrugImmunity_test.csv
The data can be loaded into DataFrames using Python's pandas library:
import pandas as pd
train_df = pd.read_csv("DrugImmunity_train.csv")
test_df = pd.read_csv("DrugImmunity_test.csv")
Here is a pandas tutorial on how to use data frames, and in particular how to select subsets of rows and columns.
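As a quick illustration of row and column selection, the sketch below separates the input variables from the Label column and filters rows by class. The helper name split_features_label and the toy frame (with made-up columns feat_a and feat_b) are assumptions for illustration only; with the real data you would pass train_df or test_df instead.

```python
import pandas as pd


def split_features_label(df, label_col="Label"):
    """Return (X, y): every column except label_col, and the label column itself."""
    X = df.drop(columns=[label_col])
    y = df[label_col]
    return X, y


# Tiny made-up frame standing in for DrugImmunity_train.csv (values are illustrative only).
toy_df = pd.DataFrame(
    {
        "feat_a": [0.1, 0.5, 0.9, 0.3],
        "feat_b": [1, 0, 1, 0],
        "Label": [0, 1, 1, 0],
    }
)

X, y = split_features_label(toy_df)

# A boolean mask keeps only the DIA-positive rows.
positives = toy_df[toy_df["Label"] == 1]

# Position-based selection: the first two rows, all columns.
first_rows = toy_df.iloc[:2]
```

The same calls should work unchanged on train_df and test_df, provided both files contain a column named Label.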
Your main task is to build a model from the training set to predict the Label value on the test examples using decision tree learning.
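A minimal sketch of this workflow, assuming that both files contain the Label column and that all input variables are numeric, could look as follows. The helper name fit_and_score, the use of default hyperparameters, and random_state=0 are choices made for this example, not settings prescribed by the assignment.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(train_df, test_df, label_col="Label", random_state=0):
    """Fit a decision tree on train_df and return (model, accuracy on test_df)."""
    X_train = train_df.drop(columns=[label_col])
    y_train = train_df[label_col]
    X_test = test_df.drop(columns=[label_col])
    y_test = test_df[label_col]

    # Default hyperparameters; random_state fixes the tie-breaking between splits.
    clf = DecisionTreeClassifier(random_state=random_state)
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    return clf, accuracy_score(y_test, y_pred)


# Intended usage with the assignment files (not executed here):
# train_df = pd.read_csv("DrugImmunity_train.csv")
# test_df = pd.read_csv("DrugImmunity_test.csv")
# clf, test_accuracy = fit_and_score(train_df, test_df)
```

Accuracy is only one possible score; it is used here because it is the simplest to read, not because the assignment requires it.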
Part of the work has been done for you in the sklearn package.
For this task, and all subsequent tasks of this assignment, you must use version 1.8.0 to guarantee reproducibility of the computed results on INGInious.
In a nutshell, you must use the following installation procedure (or an equivalent using conda and/or your favorite IDE):
pip install scikit-learn==1.8.0
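To confirm that the right version is active in your environment, a small check such as the one below may help. It relies on the standard-library importlib.metadata module; the names REQUIRED, installed_sklearn_version and matches_required are made up for this sketch.

```python
from importlib.metadata import version

REQUIRED = "1.8.0"  # version requested by the assignment


def installed_sklearn_version():
    """Return the installed scikit-learn version string (raises if not installed)."""
    return version("scikit-learn")


def matches_required(installed, required=REQUIRED):
    """True when the installed version string is exactly the required one."""
    return installed == required


# Intended usage (not executed here):
# print(matches_required(installed_sklearn_version()))
```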
INGInious