What is model validation?

Model validation is the step that follows training, in which we evaluate how well the model we have built actually performs.

During model validation, we evaluate the accuracy of our model by seeing how it performs on data it hasn't been trained on. This way we verify that it has learned the right relationships from the training samples and can generalize.

Model validation analogy

Assume that I am a professor and must teach my students how to solve equations. I have already lectured in class using simple examples (data preprocessing) and assigned homework (the training phase).

Today is the day of the test.

Obviously, the test will not include exercises I have already assigned as homework, because my students have already seen them and would be at an advantage. By giving them exercises they have never done, I can see whether they have really understood the topic and can put it into practice.

How does model validation work?

Model validation works just like in school tests.

We hold out a part of the dataset, with both inputs and outputs. We feed each input into the model and receive a list of predictions. These predictions are then compared with the actual values according to a metric (which depends on the problem type).

For example, let's say we've built a linear regression model that predicts the price of a property given its area. In this case, we use the MSE (mean squared error) metric. The formula for the squared error is:

E = (y – prediction)²

And then we calculate the mean of all the squared errors.

y     prediction    squared error
5     3             (5-3)² = 4
7     6             (7-6)² = 1
10    7             (10-7)² = 9

The MSE is (4 + 1 + 9) / 3 ≈ 4.67
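The worked example above can be reproduced with a few lines of Python:

```python
# Actual values and model predictions from the table above
actuals = [5, 7, 10]
predictions = [3, 6, 7]

# Squared error for each pair: (y - prediction)²
squared_errors = [(y - p) ** 2 for y, p in zip(actuals, predictions)]

# MSE is the mean of the squared errors
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [4, 1, 9]
print(mse)             # 4.666...
```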

Model validation techniques

A question may now arise: where do we get this unseen data for validation?

Hold-out validation

This is the classic method for evaluating a model's performance. We randomly divide our dataset into two subsets: the training dataset and the validation (or testing) dataset.

The model is trained on the training dataset and validated on the validation dataset.

The training subset must contain more observations than the validation one, so we must choose the ratio between the sizes of the two datasets. Common splits are 70-30, 80-20, or 90-10.
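As a minimal sketch, here is what an 80-20 hold-out split looks like with scikit-learn's `train_test_split` on a toy dataset (the data itself is made up for illustration):

```python
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples, one feature each
X = [[i] for i in range(10)]
y = [2 * i for i in range(10)]

# 80-20 split; random_state fixes the shuffle for reproducibility
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=0)

print(len(X_train), len(X_val))  # 8 2
```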

Cross-validation

Hold-out validation is simple but has a flaw: the estimate of the model's accuracy isn't very precise, because it is based on only one part of our data.

To solve this issue we can use cross-validation. The dataset is split into k subsets (folds), and each one is used in turn as the validation set while the remaining data serves as the training set. After all k rounds, the output of cross-validation is the average of the k validation errors.
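A minimal sketch of k-fold cross-validation with scikit-learn's `cross_val_score`, here using a decision tree on made-up data (the dataset and the choice of 5 folds are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: 20 samples, one feature
X = np.arange(20).reshape(-1, 1)
y = 3 * np.arange(20) + 1

model = DecisionTreeRegressor(random_state=0)

# 5-fold cross-validation; each fold serves once as the validation set.
# scikit-learn returns negated MSE so that higher is always better.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

# Average error across the folds (negate to recover the MSE)
print(-scores.mean())
```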

Model validation in Python

Problem statement

We have a regression problem: predicting house prices. We need to build a decision tree model and validate it.

1. Import necessary libraries

#import the dataset
import pandas as pd

#split the dataset
from sklearn.model_selection import train_test_split

#calculate the MSE
from sklearn.metrics import mean_squared_error

2. Import the dataset

#upload the dataset

file_path = "C:\\...\\realestate_dataset.csv"

dataset = pd.read_csv(file_path)

3. Train test split

#feature engineering
#hypothetical column names: "Area" as the feature, "Price" as the target
X = dataset[["Area"]]
y = dataset["Price"]

train_X, val_X, train_y, val_y = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

4. Build the model

#load and train the model
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)

5. Model validation

We evaluate how well our model performs using the mean squared error metric.

MSE = mean_squared_error(val_y, model.predict(val_X))

print(MSE)