Random forest: definition, inner workings, and Python code

What a random forest is

A random forest is an ensemble machine-learning model that combines the predictions of many decision trees, each trained on a different random sample of the data, into a single output.

The theory behind a random forest: how it works

Model main parameters

These are the main parameters, taken from the scikit-learn documentation. The decision tree parameters, which also apply to this model, are covered in their dedicated article. A short instantiation sketch follows the list.

  • n_estimators: the number of trees in the forest.
  • max_depth: the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • max_leaf_nodes: the maximum number of leaf nodes per tree. If None, the number of leaf nodes is unlimited.
  • min_samples_split: the minimum number of samples required to split an internal node.
  • min_samples_leaf: the minimum number of samples required to be at a leaf node.
  • n_jobs: the number of processors used to train the model and make predictions. -1 means use all processors.
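
As a quick illustration, here is how these parameters might look when instantiating a model in scikit-learn (the values below are arbitrary placeholders, not recommendations):

from sklearn.ensemble import RandomForestClassifier

#illustrative values only; tune them on your own data
model = RandomForestClassifier(n_estimators=100,     #100 trees in the forest
                               max_depth=10,         #each tree grows at most 10 levels deep
                               max_leaf_nodes=50,    #at most 50 leaves per tree
                               min_samples_split=2,  #a node needs at least 2 samples to be split
                               min_samples_leaf=1,   #every leaf must hold at least 1 sample
                               n_jobs=-1)            #use all available processors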

The bagging algorithm: how random forests work

1. Bootstrapping

The first step of creating a random forest model is bootstrapping. During this phase, new datasets are built by randomly sampling rows of the original dataset with replacement, so the same row can appear more than once. The number of new datasets matches the number of trees in the forest.
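
Here is a minimal sketch of bootstrapping with pandas (the tiny DataFrame is made up for illustration):

import pandas as pd

df = pd.DataFrame({"rooms": [8, 5, 6, 4], "price": [650, 300, 130, 75]})

#sample as many rows as the original dataset, with replacement,
#so a row can appear several times while another may not appear at all
bootstrap_sample = df.sample(n=len(df), replace=True)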

2. Random feature selection

In addition, each new dataset created during bootstrapping keeps only a subset of the features, chosen at random by the algorithm. This is called random feature selection.

Machine learning practitioners have found that setting the number of features chosen for each dataset to the square root of the total number of features works well in practice.


Phases 1 and 2, which determine the data each tree is trained on, are together also referred to as feature bagging.
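
A minimal sketch of random feature selection with NumPy (the feature names are invented for the example):

import numpy as np

features = np.array(["rooms", "area", "year_built", "garden", "garage", "bathrooms"])

#rule of thumb: keep roughly the square root of the number of features
n_selected = int(np.sqrt(len(features)))  #sqrt(6) -> 2

rng = np.random.default_rng(0)
selected = rng.choice(features, size=n_selected, replace=False)

In scikit-learn this behaviour is controlled by the max_features parameter (for example, max_features="sqrt").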

3. Tree construction

Once the features have been chosen, a decision tree is built and trained independently on each new dataset.

4. Output prediction through aggregation

To predict the output for a given input, that input is passed through every tree in the forest.

In classification, the output is the class predicted most often across the trees: a majority vote.

In regression, it is the average of all the tree predictions.

In machine learning, this process of combining results from multiple models is called aggregation.
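
A minimal sketch of both aggregation rules, assuming the per-tree predictions for one input are already available (the values are invented):

import numpy as np
from collections import Counter

class_predictions = ["cat", "dog", "cat", "cat", "dog"]  #hypothetical classifier outputs
value_predictions = [210000, 198500, 205000, 215000, 201000]  #hypothetical regressor outputs

#classification: majority vote across the trees
majority_class = Counter(class_predictions).most_common(1)[0][0]  #"cat"

#regression: mean of the tree outputs
average_value = np.mean(value_predictions)  #205900.0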

Why is randomness in a random forest so important?

Bootstrapping and random feature selection are vital in a random forest model because they ensure that every tree is trained on different data. This makes the model less sensitive to, and less dependent on, the original training data, thereby mitigating the overfitting problem of single decision trees.

They are also useful because they prevent the trees from being too similar to one another: if they shared the same features, they would probably end up with similar decision nodes.

Random forest advantages and disadvantages

Advantages

A random forest model keeps the benefits of a decision tree and adds the following:

  • As already mentioned, single decision trees can be oversensitive to the training data and run the risk of overfitting. A random forest is far less prone to overfitting, thanks to its number of trees and to feature bagging.
  • Feature bagging also makes the random forest classifier an effective tool for handling missing values, as it maintains accuracy when a portion of the data is missing. To be clear, a missing value in a specific row does not affect the trees whose datasets do not contain that row.
  • It is possible to use parallel processing to distribute training and prediction across processors, reducing computation time. The number of processors is set with the n_jobs parameter seen earlier.

Disadvantages

  • Since random forest algorithms can handle large datasets, they can provide more accurate predictions, but they can be slow to process the data and can require more resources.
  • It is difficult to fully visualize, analyze, and interpret the model, as it is composed of numerous trees.

Programming a random forest model

1. Import necessary libraries

The libraries used in this project are:

  • Pandas for handling input and output data
  • Sklearn for importing the random forest algorithm, the validation metric and the preprocessing techniques
  • Matplotlib for visualizing the model structure
  • Joblib for saving the model
import pandas as pd
import joblib
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from matplotlib import pyplot as plt

2. Load the dataset

#load the dataset
file_path = "C:\\...\\melb_data.csv" #real estate data with house prices and property details
table = pd.read_csv(file_path)

The data used to train this model look something like this:

Rooms   Building area   Year Built   Sale price
8       150             1987         650 000
5       95              2015         300 000
6       105             1967         130 000
4       75              2001         75 000

The dataset I used is a real estate dataset that reports the sale prices of properties together with their building characteristics.

3. Select input and output features

#define the input and the output
table_variables = ['LotArea', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']
X = table[table_variables]
y = table[["SalePrice"]]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0) #split the data into training and validation data

4. Feature preprocessing

#handle non-numeric columns
cols_without_numbers = [col for col in train_X.columns
                        if train_X[col].dtype == "object"]

ordinal_encoder = OrdinalEncoder() #assign a number to each possible value

for col in cols_without_numbers:
    train_X[[col]] = ordinal_encoder.fit_transform(train_X[[col]])
    val_X[[col]] = ordinal_encoder.transform(val_X[[col]])

#handle missing data
cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]

imputer = SimpleImputer() #fill each missing cell with the mean value of its column

for col in cols_with_missing:
    train_X[[col]] = imputer.fit_transform(train_X[[col]])
    val_X[[col]] = imputer.transform(val_X[[col]])
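
As a side note, the same preprocessing can also be expressed with a scikit-learn Pipeline, which applies the encoder and the imputer automatically at both training and prediction time. This is a minimal sketch of that alternative, not part of the original script:

from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

#encode only the text columns and pass the numeric ones through unchanged
preprocessing = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     make_column_selector(dtype_include="object")),
    remainder="passthrough")

pipeline = make_pipeline(preprocessing,
                         SimpleImputer(),  #mean imputation, as above
                         RandomForestRegressor(random_state=0))
#pipeline.fit(train_X, train_y.values.ravel()) would then replace the manual loops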

5. Find the best parameter values

#find the best values for the max_depth and n_estimators parameters
max_depth_list = []
max_depth_mae = []

def get_depth(max_depth, train_X, val_X, train_y, val_y):
    #train a forest with the given depth and record its validation error
    model = RandomForestRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    max_depth_list.append(max_depth)
    max_depth_mae.append(mae)

for max_depth in range(2, 10):
    get_depth(max_depth, train_X, val_X, train_y, val_y)

n_estimators_list = []
estimators_mae = []

def get_estimators(n_estimators, train_X, val_X, train_y, val_y):
    #train a forest with the given number of trees and record its validation error
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    n_estimators_list.append(n_estimators)
    estimators_mae.append(mae)

for n_estimators in range(2, 100):
    get_estimators(n_estimators, train_X, val_X, train_y, val_y)
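
Note that this approach tunes the two parameters independently, so it can miss combinations where they interact. scikit-learn's GridSearchCV can search both at once; a minimal sketch with a coarser grid to keep it fast:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": list(range(2, 10)),
              "n_estimators": list(range(2, 100, 10))}

search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid,
                      scoring="neg_mean_absolute_error",  #GridSearchCV maximizes, hence the negation
                      cv=5)
#search.fit(train_X, train_y.values.ravel()); search.best_params_ then holds the winning combination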

6. Build, train and evaluate the model

#build and train the model with the best parameter values found above
model = RandomForestRegressor(max_depth=max_depth_list[max_depth_mae.index(min(max_depth_mae))],
                              n_estimators=n_estimators_list[estimators_mae.index(min(estimators_mae))],
                              random_state=0)

model.fit(train_X, train_y.values.ravel())

print(mean_absolute_error(val_y, model.predict(val_X))) #evaluate its performance

Wow! The mean absolute error of our model is 20 363. This means that, on average, the predicted price differs from the real value by $ 20 363. For estimating house prices, this is an excellent result.

7. Visualize the model structure

#plot the structure of a single tree from the forest, selected by index
plt.figure(figsize=(25, 20))
tree.plot_tree(model.estimators_[0], filled=True)
plt.show()
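
For a textual rather than visual representation of the same tree, scikit-learn also offers export_text:

from sklearn.tree import export_text

#print the decision rules of the first tree as plain text
print(export_text(model.estimators_[0], feature_names=list(train_X.columns)))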

8. Save the model on a .sav file

#save the trained model to a .sav file
joblib.dump(model, "C:\\...\\my_model.sav")
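
Loading the model back later is symmetric:

loaded_model = joblib.load("C:\\...\\my_model.sav")  #same path used above
#loaded_model.predict(val_X) now behaves exactly like the original model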

Regression model full code

import pandas as pd
import joblib
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from matplotlib import pyplot as plt


#load the dataset
file_path = "C:\\Users\\ciaos\\Documents\\blog\\posts\\blog post information\\Random forest\\realestate_dataset.csv"
table = pd.read_csv(file_path)


#define the input and the output
input_variables = ['LotArea', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']
X = table[input_variables]
y = table[["SalePrice"]]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0) #split the data into training and validation data


#handle non-numeric columns
cols_without_numbers = [col for col in train_X.columns
                        if train_X[col].dtype == "object"]

ordinal_encoder = OrdinalEncoder() #assign a number to each possible value

for col in cols_without_numbers:
    train_X[[col]] = ordinal_encoder.fit_transform(train_X[[col]])
    val_X[[col]] = ordinal_encoder.transform(val_X[[col]])

#handle missing data
cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]

imputer = SimpleImputer() #fill each missing cell with the mean value of its column

for col in cols_with_missing:
    train_X[[col]] = imputer.fit_transform(train_X[[col]])
    val_X[[col]] = imputer.transform(val_X[[col]])


#find the best values for the max_depth and n_estimators parameters
max_depth_list = []
max_depth_mae = []

def get_depth(max_depth, train_X, val_X, train_y, val_y):
    model = RandomForestRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    max_depth_list.append(max_depth)
    max_depth_mae.append(mae)

for max_depth in range(2, 10):
    get_depth(max_depth, train_X, val_X, train_y, val_y)

n_estimators_list = []
estimators_mae = []

def get_estimators(n_estimators, train_X, val_X, train_y, val_y):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    n_estimators_list.append(n_estimators)
    estimators_mae.append(mae)

for n_estimators in range(2, 100):
    get_estimators(n_estimators, train_X, val_X, train_y, val_y)


#build and train the model with the best parameter values found above
model = RandomForestRegressor(max_depth=max_depth_list[max_depth_mae.index(min(max_depth_mae))],
                              n_estimators=n_estimators_list[estimators_mae.index(min(estimators_mae))],
                              random_state=0)

model.fit(train_X, train_y.values.ravel())

print(mean_absolute_error(val_y, model.predict(val_X))) #evaluate its performance


#plot the structure of a single tree from the forest, selected by index
plt.figure(figsize=(25, 20))
tree.plot_tree(model.estimators_[0], filled=True)
plt.show()

#save the trained model to a .sav file
joblib.dump(model, "C:\\Users\\ciaos\\Documents\\Informatica\\Coding\\pyhton\\AI\\machine learning (kaggle)\\my_model.sav")

Classifier model full code

import pandas as pd
import joblib
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from matplotlib import pyplot as plt


#load the dataset
file_path = "C:\\Users\\ciaos\\Documents\\Informatica\\Coding\\pyhton\\AI\\machine learning (kaggle)\\archive\\mushroom_cleaned.csv"
table = pd.read_csv(file_path)


#define the input and the output
input_variables = ["cap-diameter", "cap-shape", "gill-attachment", "gill-color", "stem-height", "stem-width", "stem-color", "season"]
X = table[input_variables]
y = table[["class"]]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0) #split the data into training and validation data


#handle non-numeric columns
cols_without_numbers = [col for col in train_X.columns
                        if train_X[col].dtype == "object"]

ordinal_encoder = OrdinalEncoder() #assign a number to each possible value

for col in cols_without_numbers:
    train_X[[col]] = ordinal_encoder.fit_transform(train_X[[col]])
    val_X[[col]] = ordinal_encoder.transform(val_X[[col]])

#handle missing data
cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]

imputer = SimpleImputer() #fill each missing cell with the mean value of its column

for col in cols_with_missing:
    train_X[[col]] = imputer.fit_transform(train_X[[col]])
    val_X[[col]] = imputer.transform(val_X[[col]])


#find the best values for the max_depth and n_estimators parameters
max_depth_list = []
max_depth_accuracy = []

def get_depth(max_depth, train_X, val_X, train_y, val_y):
    model = RandomForestClassifier(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    accuracy = accuracy_score(val_y, preds_val)
    max_depth_list.append(max_depth)
    max_depth_accuracy.append(accuracy)

for max_depth in range(2, 10):
    get_depth(max_depth, train_X, val_X, train_y, val_y)

n_estimators_list = []
estimators_accuracy = []

def get_estimators(n_estimators, train_X, val_X, train_y, val_y):
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    accuracy = accuracy_score(val_y, preds_val)
    n_estimators_list.append(n_estimators)
    estimators_accuracy.append(accuracy)

for n_estimators in range(2, 100):
    get_estimators(n_estimators, train_X, val_X, train_y, val_y)


#build and train the model (for accuracy, higher is better, hence max)
model = RandomForestClassifier(max_depth=max_depth_list[max_depth_accuracy.index(max(max_depth_accuracy))],
                               n_estimators=n_estimators_list[estimators_accuracy.index(max(estimators_accuracy))],
                               random_state=0)

model.fit(train_X, train_y.values.ravel())

print(accuracy_score(val_y, model.predict(val_X))) #evaluate its performance


#plot the structure of a single tree from the forest, selected by index
plt.figure(figsize=(25, 20))
tree.plot_tree(model.estimators_[0], filled=True)
plt.show()

#save the trained model to a .sav file
joblib.dump(model, "C:\\...\\my_model.sav")