A lot of processes in Machine Learning rely on randomness: randomly dividing the data into a training set and a test set, selecting variables in a Random Forest, etc.
A computer uses pseudorandomness, meaning the result appears to be random, while actually being produced in a deterministic way. By setting the seed of a pseudorandom generator, one can reproduce the exact same results when repeating an experiment.
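As a minimal sketch of this idea, the snippet below uses Python's standard `random` module rather than Scikit-learn. The seed value 42 and the variable names are arbitrary choices made for illustration. Two generators created with the same seed produce exactly the same sequence of "random" numbers.

```python
import random

# Two independent generators initialised with the same (arbitrary) seed.
rng_a = random.Random(42)
rng_b = random.Random(42)

# Draw five integers in [0, 99] from each generator.
draws_a = [rng_a.randrange(100) for _ in range(5)]
draws_b = [rng_b.randrange(100) for _ in range(5)]

# The sequences are identical because the generators are deterministic.
same_sequence = draws_a == draws_b
print(draws_a, draws_b, same_sequence)
```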
This technique is used throughout the assignments to ensure accurate grading. Therefore, we will from time to time ask you to set the random_state argument to a certain value when using functions from the Scikit-learn library. It is important that you follow the instructions exactly, as improperly setting the seed may cause your code to fail the tests (e.g. omitting the seed could change the train set / test set partitioning, and thus give a slightly different test accuracy).
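The following sketch illustrates the effect on a train/test split. The toy data, the split size and the value `random_state=0` are assumptions made for this example only; in an assignment, use the value given in the instructions. Calling `train_test_split` twice with the same `random_state` yields the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data set: 20 samples with 2 features each, and one label per sample.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two calls with the same random_state (0 is an arbitrary illustrative value).
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(
    X, y, test_size=0.25, random_state=0
)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Both calls select exactly the same test samples, in the same order.
same_split = bool(np.array_equal(y_test_1, y_test_2))
print(same_split)
```

Without the `random_state` argument, each call may shuffle the data differently, so the measured test accuracy may vary from run to run.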
Note that this is only a technical requirement to facilitate grading. The random_state argument is not a metaparameter of a learning algorithm! Your job is to produce Machine Learning models that generalize well, i.e. work well on new, unseen data. It would make no sense to "tune" the random_state argument to obtain a better accuracy on certain data. Therefore, you should not set random_state unless explicitly asked to (in fact, Inginious frequently overrides the random_state argument during grading).
There is a simple exercise below to verify that you understand how to use random_state.