Cross Validation — Basic concepts
Machine learning is the process of teaching machines to learn the underlying patterns in data-sets. Instead of encoding answers somewhere in memory (i.e. memorizing), we want computers to be able to use prior knowledge about a particular subject to draw new conclusions.
“Effective learning must progress from individual examples to broad generalizations.”
- Learning: the acquisition of knowledge or skills through study, experience, or being taught.
- Validation: the process of evaluating the accuracy of your machine learning models, e.g. computing the mean squared error of a regression model.
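To make the mean squared error mentioned above concrete, here is a minimal sketch that computes it by hand and with scikit-learn's `mean_squared_error`. The values in `y_true` and `y_pred` are made up for illustration and do not come from any real model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true responses and model predictions (illustrative values only)
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Mean of the squared differences: (1^2 + 0^2 + (-2)^2) / 3 = 5/3
mse_manual = np.mean((y_true - y_pred) ** 2)

# The same quantity computed by scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

The lower the value, the closer the predictions are to the true responses on the data used for the evaluation.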
Training and validation using the same data-set
Say we were to fit a simple linear regression model to a data-set showing the house prices in a particular location given the age and size of the house.
If you use the same data-set to evaluate or validate the accuracy of the model, the error you measure will be overly optimistic: the model has already seen those observations, so a good score tells you little about whether it will correctly predict the price of a house given new information, that is, an age and size not included in the original data-set.
In simple terms your model will fail to generalize response values on new or unseen features once it has been deployed. Cross validation solves this problem by splitting the data into training and test sets in the modeling stage, to allow us to get an idea of how well our final model will perform given new feature values.
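The gap between the error on data the model has seen and the error on unseen data can be sketched as follows. Everything here is an assumption made for illustration: the house data is synthetic, and a decision tree is swapped in for linear regression because it memorizes the training set and so makes the gap easy to see.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic house data (made up): price depends on age and size plus noise
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 50, n)
size = rng.uniform(50, 250, n)
X = np.column_stack([age, size])
y = 1000 * size - 500 * age + rng.normal(0, 20000, n)

# Keep 30% of the observations away from the model during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# A fully grown tree can fit the training observations almost perfectly
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

print("MSE on the training data:", train_mse)
print("MSE on the held-out data:", test_mse)
```

The training error alone would suggest a near-perfect model, while the held-out error gives a more honest picture of how it handles new houses.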
Different cross validation methods exist for supervised learning. These include the hold-out method, k-fold, stratified k-fold and leave-p-out cross validation.
1. Train-test split or hold-out method
Technique
- Split the original data-set into two subsets, named test set and training set respectively.
- The split is usually 30% for the test/validation set and 70% for the training set. If your data-set has a large number of observations you generally don’t need the whole 30% for validation; you could use 20%, 15%, etc.
Limitations
- Model instability: model performance varies depending on which observations end up in the validation set, i.e. the performance estimate has high variance.
- Information loss: some useful patterns may be missed in the training phase since some observations containing useful information end up in the validation set. This can lead to a higher generalization error and, consequently, higher induced bias.
Sklearn example
# Example with train/test = 70/30
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 5 observations with 2 features each
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
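The model instability limitation listed above can be illustrated by repeating the hold-out split with different seeds and watching the test error move. This is only a sketch: the data is synthetic, and the sample size, coefficients and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Small synthetic regression problem (made-up coefficients and noise)
rng = np.random.default_rng(1)
n = 40
X = rng.uniform(0, 10, size=(n, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 5.0, n)

# Same data, same model, five different 70/30 splits
test_errors = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

print(np.round(test_errors, 2))
```

Each seed puts different observations in the test set, so a single hold-out score depends partly on the luck of the split.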
2. K-fold method
Technique
- Divide your data-set into k equal folds or subsets
- Run k training / testing iterations to ensure that each observation appears exactly once in the testing sets, and k-1 times in the training sets.
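The counting claim in the steps above (each observation is tested exactly once and trained on k-1 times) can be checked with a short sketch. The values `n_samples = 10` and `k = 5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 10, 5  # arbitrary illustrative sizes
X = np.arange(n_samples).reshape(-1, 1)

# Count how often each observation lands in a training or a testing set
train_counts = np.zeros(n_samples, dtype=int)
test_counts = np.zeros(n_samples, dtype=int)
for train_index, test_index in KFold(n_splits=k).split(X):
    train_counts[train_index] += 1
    test_counts[test_index] += 1

print("times tested: ", test_counts)
print("times trained:", train_counts)
```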
Limitations
- Misleading accuracy measure: for classification problems with class imbalance (e.g. 0:1 = 90%:10%) we may get misleadingly high accuracy measures. This is because the folds may contain different class proportions.
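To see how class proportions can drift between folds, the sketch below runs plain k-fold on a made-up label vector with a 90%:10% imbalance. The labels are deliberately left sorted and unshuffled, which is an assumption chosen to show the worst case.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical imbalanced labels: 90 observations of class 0, then 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features, irrelevant for the split

# Fraction of class 1 in each test fold
positive_fraction = []
for _, test_index in KFold(n_splits=5).split(X):
    positive_fraction.append(float(y[test_index].mean()))

print(positive_fraction)
```

Here the first four test folds contain no class 1 observations at all, while the last one is half class 1, so none of the folds reflects the 10% proportion of the full data-set.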
Sklearn example
# k-fold cross validation with k=2
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
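The loop above only produces the index splits. One common way to go from there to a performance estimate is scikit-learn's `cross_val_score`, which fits and scores a model once per fold. The sketch below assumes a synthetic regression data-set and a linear model, neither of which comes from the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (made-up coefficients and noise)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, 100)

# 5 folds, shuffled so that fold membership does not depend on row order
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports the negated MSE so that higher is always better
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)
mse_per_fold = -scores

print("MSE per fold:", np.round(mse_per_fold, 3))
print("mean:", mse_per_fold.mean(), "std:", mse_per_fold.std())
```

Averaging over the folds gives a steadier estimate than a single hold-out split, and the spread across folds hints at how much the estimate depends on the split.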
3. Stratified K-fold method
Technique
- If your classes are not balanced, you want to divide the data such that each fold or subset contains approximately equal class proportions, i.e. “mean response value must be equal in all folds”.
- This is achieved by maintaining the class proportions of the original data-set in each subset. That is, if the data contained 60% 0 labels and 40% 1 labels, the folds are selected such that these ratios are maintained.
Sklearn example
# stratified k-fold example
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
print(skf.get_n_splits(X, y))  # output = 2
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
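The 60%:40% case described under Technique can be checked directly. The sketch below uses a made-up label vector with those proportions and measures the share of 1 labels in every test fold; the fold count of 5 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 60% class 0 and 40% class 1
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # placeholder features, irrelevant for the split

# Fraction of class 1 in each test fold
fractions = []
for _, test_index in StratifiedKFold(n_splits=5).split(X, y):
    fractions.append(float(y[test_index].mean()))

print(fractions)
```

Every fold keeps the 40% share of 1 labels found in the full data-set, even though the labels were sorted and not shuffled.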
Please feel free to drop any constructive comments below.