What a random forest is
A random forest is an ensemble machine-learning model that combines the output of numerous decision trees trained on input data to make predictions.
The theory behind a random forest: how it works
Model main parameters
These are the main parameters, taken from the scikit-learn documentation. The decision tree parameters also apply to this model; you can check them in the decision tree article.
- n_estimators: the number of trees in the forest.
- max_depth: the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- max_leaf_nodes: the maximum number of leaf nodes in each tree.
- min_samples_split: the minimum number of samples required to split an internal node.
- min_samples_leaf: the minimum number of samples required to be at a leaf node.
- n_jobs: the number of processors used to train the model and make the predictions. A value of -1 means using all processors.
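As a rough illustration of where each of these parameters goes, here is a minimal sketch. The values are hypothetical and chosen only for the example; they are not tuned for any dataset. Note that scikit-learn's name for the leaf limit is max_leaf_nodes.

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical values, chosen only to show where each parameter goes
model = RandomForestRegressor(
    n_estimators=100,      # number of trees in the forest
    max_depth=8,           # maximum depth of each tree
    max_leaf_nodes=None,   # None = no limit on the leaves of each tree
    min_samples_split=2,   # samples required to split an internal node
    min_samples_leaf=1,    # samples required at a leaf node
    n_jobs=-1,             # use all available processors
    random_state=0,        # make the results reproducible
)

params = model.get_params()
print(params["n_estimators"], params["max_depth"])
```

The same keyword arguments are accepted by RandomForestClassifier.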
The bagging algorithm: how random forests work
1. Bootstrapping
The first step in creating a random forest model is bootstrapping. During this phase, new datasets are created by randomly sampling rows of the original dataset with replacement, so the same row can appear more than once. The number of datasets matches the number of trees in the forest.
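Bootstrapping can be sketched with NumPy as follows. The toy array and the number of trees are assumptions made up for the example; scikit-learn does this step internally.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10).reshape(5, 2)  # toy dataset: 5 rows, 2 columns
n_trees = 3                         # hypothetical forest size

bootstrapped = []
for _ in range(n_trees):
    # draw as many row indices as there are rows, with replacement
    idx = rng.integers(0, len(data), size=len(data))
    bootstrapped.append(data[idx])

for sample in bootstrapped:
    print(sample.tolist())
```

Each bootstrapped dataset has the same size as the original, but some rows are repeated and others are left out.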
2. Random feature selection
However, each new dataset created during bootstrapping takes into account only some of the features, chosen randomly by the algorithm. This is called random feature selection. In scikit-learn, the random subset is drawn again at every split of a tree, and its size is controlled by the max_features parameter.
In practice, choosing a number of features equal to the square root of the total number of features works well.
Phases 1 and 2, which prepare the data for each tree, are also called feature bagging.
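The square-root rule can be sketched like this. The feature names are the nine input columns used later in this article; drawing the subset once per dataset follows the description above and is a simplification.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
feature_names = ['LotArea', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
                 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']

# square root of the number of features: sqrt(9) = 3
n_selected = int(math.sqrt(len(feature_names)))

# pick that many distinct features at random
chosen = rng.choice(feature_names, size=n_selected, replace=False)
print(n_selected, list(chosen))
```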
3. Tree construction
After feature selection, a decision tree is built and trained independently on each new dataset.
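Phases 1 to 3 can be put together in a small hand-made forest. This is only a sketch under stated assumptions: the data is synthetic (make_regression), the sizes are hypothetical, and scikit-learn's own implementation differs in its details.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# synthetic data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
n_trees, n_selected = 5, 3  # hypothetical sizes

forest = []  # list of (fitted tree, indices of the features it was trained on)
for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))                    # bootstrapping
    cols = rng.choice(X.shape[1], size=n_selected, replace=False)  # feature selection
    tree_model = DecisionTreeRegressor(random_state=0)
    tree_model.fit(X[rows][:, cols], y[rows])                      # tree construction
    forest.append((tree_model, cols))

print(len(forest), "trees trained")
```

Each tree has to remember which columns it was trained on, because the same columns must be selected again at prediction time.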
4. Output prediction through aggregation
To predict the output for a given input, the input is passed through each tree.
In classification, the output is the class predicted by the largest number of trees.
In regression, it is the average of all tree predictions.
In machine learning, this process of combining results from multiple models is called aggregation.
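Both kinds of aggregation fit in a few lines of NumPy. The per-tree predictions below are made up for the example. Note also that scikit-learn's RandomForestClassifier averages the class probabilities predicted by the trees rather than counting hard votes, so the majority vote here is a simplification.

```python
import numpy as np

# Classification: majority vote over made-up predictions of five trees
class_votes = np.array(["edible", "poisonous", "edible", "edible", "poisonous"])
labels, counts = np.unique(class_votes, return_counts=True)
predicted_class = labels[np.argmax(counts)]

# Regression: average of made-up predictions of five trees
price_preds = np.array([300_000, 320_000, 310_000, 290_000, 330_000])
predicted_price = price_preds.mean()

print(predicted_class)   # edible (3 votes against 2)
print(predicted_price)   # 310000.0
```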
Why is randomness in a random forest so important?
Bootstrapping and random feature selection are vital in a random forest model because they ensure that every tree is trained on different data. This makes the model less sensitive to, and less dependent on, the original training data, which reduces the overfitting problem of single decision trees.
They also prevent the trees from being too similar: with the same features, the trees would probably share similar decision nodes.
![](https://www.insidealgorithms.com/wp-content/uploads/2024/05/Random-forest-image1-1024x576.png)
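To see this variety in practice, one can look at what each tree of a fitted forest predicts for the same input. The sketch below uses synthetic data and a hypothetical forest of 10 trees; estimators_ is the scikit-learn attribute that holds the fitted trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)
X_train, y_train = X[:150], y[:150]
sample = X[150:151]  # one row the forest has not seen during training

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_train, y_train)

# prediction of every single tree for the same input
per_tree = np.array([t.predict(sample)[0] for t in model.estimators_])
print(per_tree.round(1))

# the forest's prediction is the average of the individual trees
print(per_tree.mean(), model.predict(sample)[0])
```

The individual predictions differ from tree to tree, and averaging them smooths out the errors of any single tree.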
Random forest advantages and disadvantages
Advantages
A random forest model keeps the benefits of a decision tree and adds the following ones.
- As already mentioned, decision trees can be oversensitive to their training data and run the risk of overfitting. A random forest model is far less prone to overfitting, thanks to the number of trees and feature bagging.
- Feature bagging also makes the random forest classifier an effective tool for estimating missing values, as it maintains accuracy when a portion of the data is missing. To be clear, missing values present in a specific row will not affect the bootstrapped datasets that do not contain that row.
- It is possible to use parallel processing to distribute the training and prediction process among processors, thus decreasing the time needed. The number of processors is adjustable with the n_jobs parameter seen earlier.
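The effect of n_jobs can be checked with a short sketch on synthetic data (the dataset and the sizes are assumptions for the example). With the same random_state, the forest trained on one processor and the one trained on all processors should give the same predictions; only the running time changes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=0)

# same forest, trained on one processor and on all available processors
single = RandomForestRegressor(n_estimators=50, n_jobs=1, random_state=0).fit(X, y)
parallel = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)

same = np.allclose(single.predict(X), parallel.predict(X))
print(same)
```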
Disadvantages
- Random forests can handle large datasets and usually provide more accurate predictions than a single tree, but they are slower to train and to run, and they require more resources.
- It is difficult to fully visualize, analyze, and interpret the model as it is composed of numerous trees.
Programming a random forest model
1. Import necessary libraries
The libraries used in this project are:
- Pandas for handling input and output data
- Sklearn for importing the random forest algorithm, the validation metric and the preprocessing techniques
- Matplotlib for visualizing the model structure
- Joblib for saving the model
import pandas as pd
import joblib
import sklearn
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from matplotlib import pyplot as plt
2. Upload the dataset
#upload the dataset
file_path = "C:\\...\\melb_data.csv" #real estate data with house prices and input details
table = pd.read_csv(file_path)
The data used to train this model look something like this:
Row | Rooms | Building area | Year Built | … | Sale price
1 | 8 | 150 | 1987 | … | 650 000
2 | 5 | 95 | 2015 | … | 300 000
3 | 6 | 105 | 1967 | … | 130 000
4 | 4 | 75 | 2001 | … | 75 000
The dataset I used is a real estate dataset that reports the sales values of properties with their respective building characteristics.
3. Select input and output features
#define the input and the output
table_variables = ['LotArea', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']
X = table[table_variables]
y = table[["SalePrice"]]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0) #split the data into training and validation data
4. Feature preprocessing
#handle non-numeric columns
cols_without_numbers = [col for col in train_X.columns
                        if train_X[col].dtype == "object"]
ordinal_encoder = OrdinalEncoder() #assign a number to each possible value
for col in cols_without_numbers:
    train_X[[col]] = ordinal_encoder.fit_transform(train_X[[col]])
    val_X[[col]] = ordinal_encoder.transform(val_X[[col]])
#handle missing data
cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]
imputer = SimpleImputer() #fill each cell with missing data with the mean value of the entire feature
for col in cols_with_missing:
    train_X[[col]] = imputer.fit_transform(train_X[[col]])
    val_X[[col]] = imputer.transform(val_X[[col]])
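To make the two preprocessing steps more concrete, here is a minimal sketch on a made-up table. The column names and values are hypothetical and are not part of the real estate dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# made-up table with one text column and one column with a missing value
toy = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "area": [100.0, np.nan, 80.0, 120.0],
})

# categories are sorted alphabetically: blue = 0, green = 1, red = 2
encoded = OrdinalEncoder().fit_transform(toy[["color"]])
print(encoded.ravel())   # [2. 0. 2. 1.]

# the missing value is replaced by the column mean: (100 + 80 + 120) / 3 = 100
imputed = SimpleImputer().fit_transform(toy[["area"]])
print(imputed.ravel())   # [100. 100.  80. 120.]
```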
5. Find the best parameter values
#find the best value for the max_depth and n_estimators parameters
max_depth_list = []
max_depth_mae = []
def get_depth(max_depth, train_X, val_X, train_y, val_y):
    model = RandomForestRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    max_depth_list.append(max_depth)
    max_depth_mae.append(mae)
for max_depth in range(2, 10):
    get_depth(max_depth, train_X, val_X, train_y, val_y)
n_estimators_list = []
estimators_mae = []
def get_estimators(n_estimators, train_X, val_X, train_y, val_y):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    n_estimators_list.append(n_estimators)
    estimators_mae.append(mae)
for n_estimators in range(2, 100):
    get_estimators(n_estimators, train_X, val_X, train_y, val_y)
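The loops above tune each parameter separately. As an alternative, and not the method used in this article, scikit-learn's GridSearchCV can try the combinations of both parameters together with cross-validation. The sketch below uses synthetic data and a deliberately small, hypothetical grid to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)

# hypothetical grid: every combination of these values is evaluated
param_grid = {"max_depth": [2, 4, 6], "n_estimators": [10, 50]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so the MAE is negated
    cv=3,                               # 3-fold cross-validation
)
search.fit(X, y)

best = search.best_params_
print(best, -search.best_score_)
```

The trade-off is running time: the number of models to train grows with the product of the grid sizes and the number of folds.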
6. Build, train and evaluate the model
#build and train the model with the best parameter values found
best_depth = max_depth_list[max_depth_mae.index(min(max_depth_mae))]
best_estimators = n_estimators_list[estimators_mae.index(min(estimators_mae))]
model = RandomForestRegressor(max_depth=best_depth, n_estimators=best_estimators, random_state=0)
model.fit(train_X, train_y.values.ravel())
print(mean_absolute_error(val_y, model.predict(val_X))) #evaluate its performance
Wow! The mean absolute error of our model is 20 363. It means that, on average, the predicted price differs from the real price by $20 363. For estimating house prices, this is a good result.
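The mean absolute error is easy to compute by hand, which helps to read the number above. In this sketch the real prices are the ones from the illustrative table earlier in the article, while the predictions are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

real = np.array([650_000, 300_000, 130_000, 75_000])
pred = np.array([630_000, 310_000, 150_000, 70_000])  # made-up predictions

# average of the absolute differences: (20000 + 10000 + 20000 + 5000) / 4
manual = np.abs(real - pred).mean()

print(manual)                           # 13750.0
print(mean_absolute_error(real, pred))  # same value
```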
7. Visualize the model structure
#show the structure of a single tree of the model, selected by its index
plt.figure(figsize=(25,20))
tree.plot_tree(model.estimators_[0],
               filled = True)
plt.show()
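For a textual representation of the same kind of tree, scikit-learn also provides export_text. The sketch below is an assumption-laden example: synthetic data, hypothetical feature names (f0 to f3) and a shallow max_depth so that the output stays short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

feature_names = [f"f{i}" for i in range(4)]  # hypothetical feature names
X, y = make_regression(n_samples=100, n_features=4, random_state=0)
model = RandomForestRegressor(n_estimators=5, max_depth=2, random_state=0).fit(X, y)

# textual structure of the first tree of the forest
text = export_text(model.estimators_[0], feature_names=feature_names)
print(text)
```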
8. Save the model on a .sav file
#save the model on desktop in a .sav file
joblib.dump(model, "C:\\...\\my_model.sav")
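A saved model can be loaded back with joblib.load and used straight away. This sketch trains a small forest on synthetic data and writes it to a temporary folder, which stands in for the path used above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=9, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# a temporary folder stands in for the real destination
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_model.sav")
    joblib.dump(model, path)    # save the trained model
    loaded = joblib.load(path)  # load it back

# the loaded model gives the same predictions as the original one
same = np.array_equal(model.predict(X), loaded.predict(X))
print(same)
```

A file saved this way should only be loaded from a trusted source and, ideally, with the same scikit-learn version that produced it.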
Regression model full code
import pandas as pd
import joblib
import sklearn
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
#upload the dataset
file_path = "C:\\Users\\ciaos\\Documents\\blog\\posts\\blog post information\\Random forest\\realestate_dataset.csv"
table = pd.read_csv(file_path)
#define the input and the output
input_variables = ['LotArea', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']
X = table[input_variables]
y = table[["SalePrice"]]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0) #split the data into training and validation data
#handle non-numeric columns
cols_without_numbers = [col for col in train_X.columns
                        if train_X[col].dtype == "object"]
ordinal_encoder = OrdinalEncoder() #assign a number to each possible value
for col in cols_without_numbers:
    train_X[[col]] = ordinal_encoder.fit_transform(train_X[[col]])
    val_X[[col]] = ordinal_encoder.transform(val_X[[col]])
#handle missing data
cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]
imputer = SimpleImputer() #fill each cell with missing data with the mean value of the entire feature
for col in cols_with_missing:
    train_X[[col]] = imputer.fit_transform(train_X[[col]])
    val_X[[col]] = imputer.transform(val_X[[col]])
#find the best value for the max_depth and n_estimators parameters
max_depth_list = []
max_depth_mae = []
def get_depth(max_depth, train_X, val_X, train_y, val_y):
    model = RandomForestRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    max_depth_list.append(max_depth)
    max_depth_mae.append(mae)
for max_depth in range(2, 10):
    get_depth(max_depth, train_X, val_X, train_y, val_y)
n_estimators_list = []
estimators_mae = []
def get_estimators(n_estimators, train_X, val_X, train_y, val_y):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    n_estimators_list.append(n_estimators)
    estimators_mae.append(mae)
for n_estimators in range(2, 100):
    get_estimators(n_estimators, train_X, val_X, train_y, val_y)
#build and train the model with the best parameter values found
best_depth = max_depth_list[max_depth_mae.index(min(max_depth_mae))]
best_estimators = n_estimators_list[estimators_mae.index(min(estimators_mae))]
model = RandomForestRegressor(max_depth=best_depth, n_estimators=best_estimators, random_state=0)
model.fit(train_X, train_y.values.ravel())
print(mean_absolute_error(val_y, model.predict(val_X))) #evaluate its performance
#show the structure of a single tree of the model, selected by its index
plt.figure(figsize=(25,20))
tree.plot_tree(model.estimators_[0],
               filled = True
               )
plt.show()
#save the model on desktop in a .sav file
joblib.dump(model, "C:\\Users\\ciaos\\Documents\\Informatica\\Coding\\pyhton\\AI\\machine learning (kaggle)\\my_model.sav")
Classifier model full code
import pandas as pd
import joblib
import sklearn
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
#upload the dataset
file_path = "C:\\Users\\ciaos\\Documents\\Informatica\\Coding\\pyhton\\AI\\machine learning (kaggle)\\archive\\mushroom_cleaned.csv"
table = pd.read_csv(file_path)
#define the input and the output
input_variables = ["cap-diameter", "cap-shape", "gill-attachment", "gill-color", "stem-height", "stem-width", "stem-color", "season"]
X = table[input_variables]
y = table[["class"]]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0) #split the data into training and validation data
#handle non-numeric columns
cols_without_numbers = [col for col in train_X.columns
                        if train_X[col].dtype == "object"]
ordinal_encoder = OrdinalEncoder() #assign a number to each possible value
for col in cols_without_numbers:
    train_X[[col]] = ordinal_encoder.fit_transform(train_X[[col]])
    val_X[[col]] = ordinal_encoder.transform(val_X[[col]])
#handle missing data
cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]
imputer = SimpleImputer() #fill each cell with missing data with the mean value of the entire feature
for col in cols_with_missing:
    train_X[[col]] = imputer.fit_transform(train_X[[col]])
    val_X[[col]] = imputer.transform(val_X[[col]])
#find the best value for the max_depth and n_estimators parameters
max_depth_list = []
max_depth_accuracy = []
def get_depth(max_depth, train_X, val_X, train_y, val_y):
    model = RandomForestClassifier(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    accuracy = accuracy_score(val_y, preds_val)
    max_depth_list.append(max_depth)
    max_depth_accuracy.append(accuracy)
for max_depth in range(2, 10):
    get_depth(max_depth, train_X, val_X, train_y, val_y)
n_estimators_list = []
estimators_accuracy = []
def get_estimators(n_estimators, train_X, val_X, train_y, val_y):
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(train_X, train_y.values.ravel())
    preds_val = model.predict(val_X)
    accuracy = accuracy_score(val_y, preds_val)
    n_estimators_list.append(n_estimators)
    estimators_accuracy.append(accuracy)
for n_estimators in range(2, 100):
    get_estimators(n_estimators, train_X, val_X, train_y, val_y)
#build and train the model with the best parameter values found
best_depth = max_depth_list[max_depth_accuracy.index(max(max_depth_accuracy))]
best_estimators = n_estimators_list[estimators_accuracy.index(max(estimators_accuracy))]
model = RandomForestClassifier(max_depth=best_depth, n_estimators=best_estimators, random_state=0)
model.fit(train_X, train_y.values.ravel())
print(accuracy_score(val_y, model.predict(val_X))) #evaluate its performance
#show the structure of a single tree of the model, selected by its index
plt.figure(figsize=(25,20))
tree.plot_tree(model.estimators_[0],
               filled = True
               )
plt.show()
#save the model on desktop in a .sav file
joblib.dump(model, "C:\\...\\my_model.sav")