What is model validation?
Model validation is the step that comes after training.
During model validation, we evaluate the accuracy of our model by seeing how it performs on data it hasn't been trained on. This way we verify that it has learned the right relationships from the training samples and can generalize.
Model validation analogy
Suppose I am a professor who must teach my students how to solve equations. I have already lectured in class using simple examples (data preprocessing) and assigned homework (the training phase).
Today is the test day.
Obviously, I will not put exercises from the homework in the test, because my students have already seen them and would have an unfair advantage. By giving them exercises they have never done, I can see whether they have truly understood the topic and can put it into practice.
How does model validation work?
Model validation works just like in school tests.
- Give the model the input validation data.
- Compare its predictions with the real output values.
- Assign a score based on a comparison metric.
For example, let’s say we’ve built a linear regression model for predicting the price of a property given its area.
In this case, we use the MSE (mean squared error) metric. The squared error for a single prediction is:

(y - ŷ)²

where y is the real value and ŷ is the model's prediction. We then calculate the mean of all the squared errors:

MSE = (1/n) × Σ (yᵢ - ŷᵢ)²
| y | prediction | squared error |
|---|---|---|
| 5 | 3 | (5-3)² = 4 |
| 7 | 6 | (7-6)² = 1 |
| 10 | 7 | (10-7)² = 9 |
The MSE is (4 + 1 + 9) / 3 ≈ 4.67.
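We can reproduce this calculation with scikit-learn's mean_squared_error, the same function used later in this article:

```python
from sklearn.metrics import mean_squared_error

y_true = [5, 7, 10]  # real values
y_pred = [3, 6, 7]   # model predictions

# (4 + 1 + 9) / 3 ≈ 4.67
print(mean_squared_error(y_true, y_pred))
```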
Model validation techniques
Now a question may arise: where do we get this unseen data for validation?
Hold-out validation
This is the classic method for evaluating a model's performance. We randomly divide our dataset into two subsets: the training dataset and the validation (or testing) dataset.
The model is trained on the training dataset and validated on the validation dataset.
We must choose the size ratio between the two subsets. Common splits are 70/30, 80/20, or 90/10.
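As a minimal sketch of hold-out validation, using scikit-learn's train_test_split (which also appears later in this article) on a toy dataset:

```python
from sklearn.model_selection import train_test_split

# toy dataset: 10 samples with one feature each
X = [[i] for i in range(10)]
y = [2 * i for i in range(10)]

# hold out 20% of the samples for validation (an 80/20 split)
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=0)

print(len(train_X), len(val_X))  # 8 2
```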
Cross-validation
Hold-out validation works, but it has a flaw: the estimate of our model's accuracy is imprecise because it is based on only one portion of our data.
To solve this issue we can use cross-validation. The dataset is split into k subsets (called folds), and each fold is used once as the validation set while the remaining k-1 folds form the training set. This is repeated k times, and the output of cross-validation is the average of the k validation errors.
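A minimal sketch with scikit-learn's cross_val_score, assuming a decision tree regressor scored with MSE (scikit-learn reports this score negated, so we flip the sign):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# toy dataset: 20 samples with one feature each
X = [[i] for i in range(20)]
y = [3 * i + 1 for i in range(20)]

model = DecisionTreeRegressor(random_state=0)

# k = 5: each of the 5 folds is used once as the validation set
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

# the cross-validation result is the average error over all folds
print(-scores.mean())
```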
Model validation in Python
Problem statement
We have a regression problem: predicting house prices. We will build a decision tree model and validate it.
1. Import necessary libraries
# pandas, to load the dataset
import pandas as pd
# train_test_split, to split the dataset with hold-out validation
from sklearn.model_selection import train_test_split
# mean_squared_error, to calculate the MSE
from sklearn.metrics import mean_squared_error
2. Import the dataset
# load the dataset from a CSV file
file_path = "C:\\...\\realestate_dataset.csv"
dataset = pd.read_csv(file_path)
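A quick optional sanity check confirms the data loaded correctly:

```python
# inspect the first rows and some summary statistics
print(dataset.head())
print(dataset.describe())
```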
3. Train test split
#feature engineering...
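# a sketch of one possible outcome of feature engineering; the column
# names "Area" and "Price" are hypothetical placeholders, not taken from
# the real realestate_dataset.csv
X = dataset[["Area"]]  # feature matrix
y = dataset["Price"]   # target vector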
train_X, val_X, train_y, val_y = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
4. Build the model
# build and train the model (see the sketch below)
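A minimal sketch of this step, assuming scikit-learn's DecisionTreeRegressor (matching the decision tree named in the problem statement):

```python
from sklearn.tree import DecisionTreeRegressor

# create the decision tree model; random_state makes the result reproducible
model = DecisionTreeRegressor(random_state=0)

# fit the model on the training data only
model.fit(train_X, train_y)
```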
5. Model validation
We evaluate how well our model performs using the mean squared error metric: the lower the MSE, the better.
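# MSE on the validation set: compare the real prices (val_y)
# with the model's predictions on the validation features (val_X)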
MSE = mean_squared_error(val_y, model.predict(val_X))
print(MSE)