Information

Authors: Alexander Gerniers, Pierre Dupont
Submission deadline: No deadline
Submission limit: No limitation

A touch of randomness

A lot of processes in Machine Learning rely on randomness: randomly dividing the data into a training set and a test set, selecting variables in a Random Forest, etc.

A computer uses pseudorandomness, meaning the result appears to be random, while actually being produced in a deterministic way. By setting the seed of a pseudorandom generator, one can reproduce the exact same results when repeating an experiment.
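As a minimal sketch of this idea, the snippet below uses Python's standard `random` module rather than Scikit-learn. The helper name `draw` and the seed value 42 are illustrative assumptions, not part of the assignment.

```python
import random


def draw(seed, k=5):
    """Return k pseudorandom integers in [0, 99] from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator with its own seed
    return [rng.randint(0, 99) for _ in range(k)]


# Two runs with the same seed produce exactly the same sequence.
first = draw(42)
second = draw(42)
print(first == second)
```

Changing the seed will generally change the sequence, which is why a fixed seed is needed to reproduce an experiment.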

This is used throughout the assignments to ensure accurate grading. Therefore, we will from time to time ask you to set the random_state argument to a certain value when using functions from the Scikit-learn library. It is important that you comply exactly with the instructions, as improperly setting the seed may cause your code to fail the test (e.g. omitting the seed could change the train set / test set partitioning, and thus give a slightly different testing accuracy).
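The effect on a train/test partition can be sketched with Scikit-learn's `train_test_split`. The toy data, the 30% test size, and the seed value 0 are assumptions chosen for illustration; in an assignment, use the value given in the instructions.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 observations with binary labels.
X = list(range(10))
y = [v % 2 for v in X]

# Same random_state -> identical partition on every call.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.3, random_state=0
)

same_split = (X_test_a == X_test_b) and (X_train_a == X_train_b)
print(same_split)
```

Without random_state, each call may shuffle the data differently, so the test set (and any accuracy measured on it) can vary from run to run.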

Note that this is only a technical requirement to facilitate grading. The random_state argument is not a metaparameter of a learning algorithm! Your job is to produce Machine Learning models that generalize well, i.e. work well on new, unseen data. It would make no sense to "tune" the random_state argument to produce a better accuracy on certain data. Therefore, you should not use random_state when not explicitly asked (actually, INGInious frequently overrides the random_state argument during the grading).

There is a simple exercise below to verify you understand how you should use random_state.


A simple exercise

Assume the df variable contains a pandas DataFrame with 100 observations (it already exists in memory). In order to test a learning algorithm, you wish to create 10 different datasets, each one containing 25 observations sampled at random from df.

To perform this sampling, you should use the sample method of DataFrame. Besides the n parameter to specify the size of the sample, we ask you to set random_state=i, where i is the index of the data sample (i.e. from 0 to 9).

Put each of the 10 data samples into the dictionary called data (use the value of i as the key for the corresponding data sample).
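To get a feel for the sample method before writing your answer, here is a small sketch on a toy DataFrame. The name `toy`, its single column `x`, and the seed value 3 are hypothetical stand-ins and not the df of the exercise.

```python
import pandas as pd

# Hypothetical stand-in for df: 100 observations in one column.
toy = pd.DataFrame({"x": range(100)})

# Draw 25 rows at random (without replacement by default), with a fixed seed.
s1 = toy.sample(n=25, random_state=3)
s2 = toy.sample(n=25, random_state=3)

# The same seed yields the same rows in the same order.
print(s1.equals(s2))
print(len(s1))
```

Using a different random_state for each sample, as the exercise asks, is what makes the 10 datasets differ from one another while remaining reproducible.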