Cross Validation — Basic concepts

Machine learning is the process of teaching machines to learn the underlying patterns in data-sets. Instead of encoding answers somewhere in memory (i.e. memorizing), we want computers to use prior knowledge about a particular subject to draw new conclusions.

“Effective learning must progress from individual examples to broad generalizations.”

  • Learning: the acquisition of knowledge or skills through study, experience, or being taught.
  • Validation: the process of evaluating the accuracy of your machine learning models, e.g. computing the mean squared error of a regression model.
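To make the validation bullet concrete, here is a minimal sketch of mean squared error computed by hand and with scikit-learn's `mean_squared_error`. The true values and predictions are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and model predictions (illustrative only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# MSE is the average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

Both values should agree, since the library function computes the same average of squared errors.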

Training and validation using the same data-set

If you use the same data-set to both train and evaluate a model, you get a model that will most likely fail on new information. Take a model that predicts the price of a house from its age and size: scored on its own training data it looks accurate, yet it will probably mispredict the price for age and size values not included in the original data-set.

In simple terms, your model will fail to generalize to new or unseen feature values once it has been deployed. Cross validation addresses this problem by splitting the data into training and test sets in the modeling stage, which gives us an idea of how well the final model will perform on new feature values.
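The gap between the two kinds of evaluation can be seen in a small sketch. It assumes synthetic data (a noisy linear signal) and an unconstrained decision tree, both chosen only because they make memorization easy to see; neither comes from the house price example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear signal
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# A fully grown tree can memorize the training observations
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

print("MSE on training data:", train_mse)
print("MSE on held-out data:", test_mse)
```

The error measured on the training data is far more optimistic than the error on the held-out rows, which is why the training score alone says little about performance after deployment.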

Several cross validation methods exist for supervised learning, including the hold-out method, k-fold, stratified k-fold and leave-p-out cross validation.
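Leave-p-out is not covered in the sections that follow, so here is a minimal sketch using scikit-learn's `LeavePOut` on a toy array. Every possible group of p observations is used once as the test set, which is why the number of splits grows quickly with the size of the data.

```python
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape((4, 2))  # 4 toy observations
lpo = LeavePOut(p=2)

# With n=4 observations and p=2 there are C(4, 2) = 6 possible test sets
n_splits = lpo.get_n_splits(X)
print(n_splits)

for train_index, test_index in lpo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```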

1. Train-test split or hold-out method


  • The split ratio is usually 30% for the test/validation set and 70% for the training set. If your data-set has a large number of observations, you generally don’t need the whole 30% for validation; you could use 20%, 15%, etc.


  • Information loss: some useful patterns may be missed in the training phase because observations containing useful information end up in the validation set. This can lead to high generalization error and consequently high bias.

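Sklearn example

# Example with train/test = 70/30
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 5 observations with 2 features each, plus 5 target values
X, y = np.arange(10).reshape((5, 2)), range(5)
# Hold out 30% of the rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)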
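One practical consequence of relying on a single split is that the score depends on which observations happen to land in the test set. The sketch below repeats the 70/30 split with different seeds. The synthetic data and the choice of `LinearRegression` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=1.0, size=50)

holdout_mse = []
for seed in range(5):
    # Same data, different random 70/30 partition each time
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    holdout_mse.append(mean_squared_error(y_test, model.predict(X_test)))

print(holdout_mse)
```

The five numbers differ even though the data and the model are identical, which is part of the motivation for averaging over several folds.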

2. K-fold method


  • Run k training / testing iterations to ensure that each observation appears exactly once in the testing sets, and k-1 times in the training sets.


  • Plain k-fold can be unreliable on imbalanced data-sets, because each of the k folds might contain different class proportions.

Sklearn example

# k-fold cross validation with k=2
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)

# Each iteration yields the row indices of one train/test partition
for train_index, test_index in kf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
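The loop above only produces the index splits. As a hedged sketch of the full procedure, `cross_val_score` can fit and score a model once per fold. The synthetic data, the `LinearRegression` model and k=5 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated MSE so that a higher score is always better
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)
mse_per_fold = -scores

print("MSE per fold:", mse_per_fold)
print("Mean MSE:", mse_per_fold.mean())
```

Averaging the per-fold errors gives a single estimate in which every observation has been used for testing exactly once.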

3. Stratified K-fold method


  • Stratified k-fold makes each fold representative of the whole data-set by maintaining the class proportions of the original data-set in each fold. That is, if the data contained 60% 0 labels and 40% 1 labels, the folds are selected such that these ratios are maintained.

Sklearn example

# stratified k-fold example
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
print(skf.get_n_splits(X, y))  # output = 2

# Each test fold keeps the 50/50 class balance of y
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
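To see the difference between the two methods, the sketch below compares the share of 1 labels in each test fold for `KFold` and `StratifiedKFold`, using the 60/40 ratio mentioned above. The labels are sorted on purpose, which is an assumed worst case for unshuffled k-fold, and `class1_share` is a helper written just for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 60% class 0 and 40% class 1, sorted by label (worst case for plain k-fold)
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((100, 1))  # features are irrelevant for this illustration

def class1_share(cv):
    """Fraction of label 1 in each test fold produced by the splitter."""
    return [float(y[test_idx].mean()) for _, test_idx in cv.split(X, y)]

kfold_shares = class1_share(KFold(n_splits=5))
strat_shares = class1_share(StratifiedKFold(n_splits=5))

print("KFold:          ", kfold_shares)
print("StratifiedKFold:", strat_shares)
```

With plain k-fold every test fold here contains a single class, while the stratified folds each keep the original 40% share of 1 labels.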

Please feel free to drop any constructive comments below.

Written by

Former Glorified Electrician (aka Electrical Engineer). Now a Software Developer working on complex Enterprise Software. Let's connect on Twitter @NdamuleloNemakh
