How to build a machine learning model in 6 steps

This article walks through the steps to build a machine learning model, with explanations, examples, and Python code that anyone can run and modify as they like.

In this article, I will use the following Python libraries to automate much of the work of programming our model.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import joblib

1. Data collection

Data collection is the process of gathering data from various sources. Besides accumulating data, we must structure it so that a program can manage and process it.

However, this structure depends on the data we have available and the nature of the problem we need to solve.

For example, for house price prediction, data about each building would be organized in a table that looks something like this. Since this is a regression problem (supervised learning), we have input columns, such as the area and the number of bedrooms, and an output column, the price.

Index | House area in m² | Number of bedrooms | Number of bathrooms | Price
1     | 200              | 2                  | 1                   |
2     | 300              | 3                  | 3                   |
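
As a rough illustration, a table like this could be held in a pandas DataFrame, with the input columns separated from the output column. The column names and the price values below are made up for the example; they are not taken from a real dataset.

```python
import pandas as pd

# Hypothetical house data: column names and prices are invented for illustration
houses = pd.DataFrame(
    {
        "area_m2": [200, 300],
        "bedrooms": [2, 3],
        "bathrooms": [1, 3],
        "price": [250_000, 400_000],
    }
)

# Input columns (features) and output column (label)
X = houses[["area_m2", "bedrooms", "bathrooms"]]
y = houses["price"]

print(X.shape)  # (2, 3): two houses, three input columns
print(y.shape)  # (2,): one price per house
```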

If the problem we need to solve requires an unsupervised learning approach, the data will not be labeled, that is, it will not have an output column (i.e. label).

Index | Sex    | Age | Height
1     | Male   | 30  | 185 cm
2     | Female | 23  | 167 cm

Data collection is crucial for building a model because the better the quality of the input data, the more accurate the predictions in real-world scenarios.

2. Data preprocessing

After collecting all the data, we need to focus on data preprocessing, the process of transforming raw data into a format suitable for training and testing our models.

Some common data preprocessing techniques are:

  • Handling missing values
  • Transforming (or encoding) categorical values into numerical ones
  • Creating new features by combining existing ones
  • Choosing the most relevant features for the prediction (feature selection)
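
As a rough sketch, the first three techniques could look like this on a small made-up table. The column names (area_m2, bedrooms, city) and all the values are assumptions chosen for illustration, and filling gaps with the median is only one of several possible strategies.

```python
import pandas as pd

# Hypothetical raw data with one missing value and one categorical column
raw = pd.DataFrame(
    {
        "area_m2": [200.0, None, 120.0],
        "bedrooms": [2, 3, 1],
        "city": ["Rome", "Milan", "Rome"],
    }
)

clean = raw.copy()

# Handling missing values: fill the gap with the median of the column
clean["area_m2"] = clean["area_m2"].fillna(clean["area_m2"].median())

# Encoding categorical values: one 0/1 column per category
clean = pd.get_dummies(clean, columns=["city"], dtype=int)

# Creating a new feature by combining existing ones
clean["area_per_bedroom"] = clean["area_m2"] / clean["bedrooms"]

print(clean)
```
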

# the path of the file with the data
file_path = "C:\\...\\data.csv"

# read the CSV file into a pandas DataFrame (a table)
table = pd.read_csv(file_path)

# remove rows with missing values (there are other methods, such as replacing them)
table = table.dropna(axis=0)

# select only the x variables affecting the y
table_variables = ["variable1", "variable2"]  # ... add the other column names here

# X = all the useful variables
X = table[table_variables]

# y = the variable of the table we want to predict
y = table["Price"]

# separate the training data from the validation data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
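
To make the last step more concrete, here is a small sketch of what train_test_split does on a made-up table of 8 rows (the column names and values are assumptions for illustration). When no size is given, it holds out 25% of the rows for validation, and a fixed random_state makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table with 8 rows; the values are invented for illustration
table = pd.DataFrame(
    {
        "area_m2": [50, 80, 100, 120, 150, 200, 250, 300],
        "price": [1, 2, 3, 4, 5, 6, 7, 8],
    }
)
X = table[["area_m2"]]
y = table["price"]

# Default split: 75% of the rows for training, 25% for validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
print(len(train_X), len(val_X))  # 6 2

# The same random_state gives the same split again
_, val_X_again, _, _ = train_test_split(X, y, random_state=1)
print(val_X.index.equals(val_X_again.index))  # True
```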

3. Model training

Training is the actual learning phase of the model.
We feed the model the previously selected and processed data; through mathematical algorithms, it iterates over the data points and adjusts its internal structure.

# after defining the variables, we first decide what model we want to use,
# in this case a decision tree, and then we train it using the training X and y
model = DecisionTreeRegressor()

model.fit(train_X, train_y)
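
To get a feel for what the model's structure looks like after training, the sketch below fits a decision tree on four made-up data points and prints the rules it learned. The data and the column name area_m2 are assumptions for illustration; export_text is a helper from sklearn.tree that is not used elsewhere in this article.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical training data, invented for illustration
train_X = pd.DataFrame({"area_m2": [50, 100, 200, 300]})
train_y = pd.Series([100, 200, 400, 600], name="price")

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)

# Print the if/else rules the tree has learned from the data
print(export_text(model, feature_names=["area_m2"]))

# With no depth limit, the tree reproduces the training targets exactly
print(model.predict(train_X))
```

Reproducing the training data perfectly does not mean the model is good, which is why the next step evaluates it on data it has not seen.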

For those interested in finding out how a machine can learn, I have written a separate article, with examples, that is within everyone's reach.

4. Model validation

Model validation is the evaluation of a model’s accuracy and reliability.

To do this, we first have the model predict the values in the validation subset, on which it has not been trained.
Then we compare the model's outputs with the actual values using various metrics, such as:

Mean absolute error

The mean value of the absolute differences between predicted and actual values. Used for regression.

Mean squared error

The mean value of the squared differences between predicted and actual values. Used for regression.

Accuracy

The number of correctly predicted values divided by the total number of samples. Used for classification.

Precision

The number of correctly predicted positive values (true positives) divided by the total number of positive predictions. Used for classification.
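
The four metrics above can be computed with functions from sklearn.metrics. The sketch below uses small hand-made lists of actual and predicted values (invented for illustration) so that each result can be checked by hand. In the workflow of this article, the same call would look like mean_absolute_error(val_y, model.predict(val_X)).

```python
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
)

# Regression: hypothetical actual and predicted prices
actual = [100, 200, 300]
predicted = [110, 190, 330]

mae = mean_absolute_error(actual, predicted)  # (10 + 10 + 30) / 3
mse = mean_squared_error(actual, predicted)   # (100 + 100 + 900) / 3
print(mae, mse)

# Classification: hypothetical true and predicted labels (1 = positive)
true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 1, 1, 0, 0]

accuracy = accuracy_score(true_labels, predicted_labels)    # 3 correct out of 5
precision = precision_score(true_labels, predicted_labels)  # 2 true positives out of 3 positive predictions
print(accuracy, precision)
```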

5. Save the model in a file/checkpoint

We can save a trained (or partially trained) model in a .sav file for two main reasons:

  • So that every time we need to make predictions we don’t have to go through the training process again.
  • To make checkpoints: if our training process is very long, we can save the model and continue the training later.

# Save the model to a folder of your choice using the joblib library
joblib.dump(model, "C:\\...\\my_model.sav")

# Load the model in the same script or in another one
loaded_model = joblib.load("C:\\...\\my_model.sav")

# Use the loaded model as you wish
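
As a quick check that saving and loading works, the sketch below trains a small model on made-up data, saves it to a temporary folder, loads it back, and confirms that both copies give the same predictions. The data and the use of a temporary folder are assumptions made to keep the example self-contained.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data, invented for illustration
X = pd.DataFrame({"area_m2": [50, 100, 200, 300]})
y = pd.Series([100, 200, 400, 600], name="price")
model = DecisionTreeRegressor(random_state=0).fit(X, y)

# Save the model to a temporary folder and load it back
with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "my_model.sav")
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# The loaded model should predict exactly like the original one
same = bool((model.predict(X) == loaded_model.predict(X)).all())
print(same)  # True
```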

6. Make predictions

Now that our model is trained and saved, it’s ready to make predictions.

print(loaded_model.predict(...))
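
The input to predict must have the same columns, in the same order, as the data used for training. Here is a minimal sketch under that assumption, with a made-up training table and one new house (all names and values are invented for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data, invented for illustration
train_X = pd.DataFrame({"area_m2": [50, 100, 200, 300], "bedrooms": [1, 2, 3, 4]})
train_y = pd.Series([100, 200, 400, 600], name="price")
model = DecisionTreeRegressor(random_state=0).fit(train_X, train_y)

# A new, unseen house described with the same columns as the training data
new_houses = pd.DataFrame({"area_m2": [210], "bedrooms": [3]})

prediction = model.predict(new_houses)
print(prediction.shape)  # (1,): one prediction per input row
print(prediction[0])
```

A decision tree regressor always returns one of the values (or an average of values) it saw during training, so the prediction here is one of the four training prices.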