What is cross-validation in Machine Learning?

by keshav


Cross-validation is a technique for evaluating predictive models by splitting the available data into a training set, used to train the model, and a test set, used to evaluate it. It is a resampling procedure that is especially useful when we have a limited amount of data, and it is one of the most widely used techniques for testing the effectiveness of machine learning models. To perform cross-validation, we set aside one portion of the given data as a training dataset on which we train the model, and we use the remaining portion as a test dataset for testing/validating it.
Cross-validation is also known as rotation estimation.

test-train data split

There are many methods of cross-validation. A few of them are as follows:

  • Train-Test split: In this method, we split the complete data randomly into training and test datasets. We then train the model on the training set and use the test set for validation. The data is usually split 70:30 or 80:20 (train:test). With a limited amount of data, this method carries a high risk of bias, because the model never sees the information contained in the samples held out for testing. If the dataset is large and the training and test samples follow the same distribution (i.e. the train data and test data have almost the same nature), this approach is acceptable.
  • K-fold Cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds”, D1, D2, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set to obtain a first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on. Here each sample is used the same number of times for training and exactly once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.

      Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is “left out” at a time for the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data. 
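The splitting schemes above can be sketched in plain Python. This is only an illustrative sketch using the standard library: the helper names (train_test_split_indices, k_fold_indices, cross_val_accuracy) and the toy majority-label model are assumptions made for this example, not part of any particular library.

```python
import random

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and cut them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(n, k, seed=0):
    """Yield one (train, test) pair of index lists per fold."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_val_accuracy(X, y, fit, predict, k, seed=0):
    """Correct predictions summed over all k folds, divided by the sample count."""
    correct = 0
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)

# Toy model (hypothetical): always predict the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

X = list(range(10))
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

train, test = train_test_split_indices(len(X), test_fraction=0.2)  # 80:20 split
print(len(train), len(test))
print(cross_val_accuracy(X, y, fit_majority, predict_majority, k=5))
# Leave-one-out is simply k equal to the number of samples.
print(cross_val_accuracy(X, y, fit_majority, predict_majority, k=len(X)))
```

Any model can be plugged in through the fit and predict callables. Setting k to len(X) gives leave-one-out, at the cost of training one model per sample.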

In general, stratified 10-fold cross-validation is recommended for estimating accuracy (even if computation power allows using more folds) due to its relatively low bias and variance.
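To make the idea of stratified folds concrete, here is a minimal sketch that deals the samples of each class round-robin across the folds. The function name stratified_k_fold_indices and the spam/ham labels are hypothetical, chosen only for this illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_k_fold_indices(labels, k, seed=0):
    """Yield (train, test) index lists whose class mix mirrors `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for members in by_class.values():
        rng.shuffle(members)
        # Deal this class round-robin, continuing where the previous class
        # stopped so that leftover samples do not pile up in the first folds.
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)

    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Imbalanced toy data: 20% "spam", 80% "ham".
labels = ["spam"] * 20 + ["ham"] * 80
for train, test in stratified_k_fold_indices(labels, k=10):
    print(Counter(labels[i] for i in test))
```

With this toy data every test fold holds 2 spam and 8 ham samples, the same 20:80 ratio as the full dataset. A purely random split gives no such guarantee and could leave some folds with almost no spam samples.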

Why do we use cross-validation?

Cross-validation is used to evaluate the performance of a machine learning model. It is a very useful technique for assessing the effectiveness of your model, particularly when you need to detect and mitigate overfitting. It is also useful for choosing the hyperparameters of your model: we compare candidate values and keep the ones that give the lowest cross-validated error.
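As a rough illustration of hyperparameter selection, the sketch below picks the number of neighbours for a tiny one-dimensional nearest-neighbour classifier by comparing 5-fold cross-validated error. The classifier, the synthetic data, and the candidate values are all assumptions made for this example.

```python
import random

def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = [train_y[i] for i in order[:n_neighbors]]
    return max(set(votes), key=votes.count)

def cv_error(xs, ys, n_neighbors, k=5, seed=0):
    """Fraction of misclassified samples, pooled over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = 0
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        tx = [xs[t] for t in train]
        ty = [ys[t] for t in train]
        errors += sum(knn_predict(tx, ty, xs[t], n_neighbors) != ys[t] for t in test)
    return errors / len(xs)

# Synthetic data (assumption): the label is 1 when x > 5, with about 10% flipped.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [int(x > 5) if rng.random() > 0.1 else int(x <= 5) for x in xs]

# Odd candidate values avoid tied votes between the two classes.
candidates = [1, 3, 5, 9, 15]
scores = {n: cv_error(xs, ys, n) for n in candidates}
best = min(scores, key=scores.get)
print(scores)
print("selected n_neighbors:", best)
```

Every candidate is scored on the same folds (same seed), so the comparison is fair. Because the winning value was chosen using these folds, its cross-validated error tends to be slightly optimistic; a final check on data that played no part in the selection gives a more honest estimate.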
