Decision tree: definition, deep functioning, and Python code

What a decision tree is

A decision tree is a non-parametric supervised machine learning model that predicts outputs by extracting conditions from the data that an input may or may not satisfy.
These conditions determine the "decisions" that the algorithm makes on its way to a final prediction.
It is called a decision tree because its structure resembles that of a real tree, with the "decisions" branching out from one another.

The theory behind a decision tree: how it works

Decision tree terminology

  • Splitting is the process of dividing the dataset into subgroups (called nodes) based on a condition that the algorithm extracts from the data itself.
  • Pruning is the process of removing useless nodes to prevent overfitting.
  • The root node is the entire data population that is being analyzed and split into sub-nodes.
  • Decision nodes are further sub-nodes of the first split.
  • Branches are the possible choices or values that an attribute can take. They start from a node; conventionally, the left branch means the condition is true and the right one means it is false.
  • Leaves (or terminal nodes) represent the final predictions associated with a certain path of decisions.

Model main parameters

These are the model’s main parameters, taken from the scikit-learn documentation; a short example of setting them follows the list.

  • max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • max_leaf_nodes: The maximum number of leaves in the tree.
  • min_samples_split: The minimum number of samples required to split an internal node.
  • min_samples_leaf: The minimum number of samples required to be at a leaf node.
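
As a quick illustration, this is how those parameters map onto a scikit-learn estimator; the values below are arbitrary examples, not recommendations:

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(
    max_depth=5,           # at most 5 levels of splits
    max_leaf_nodes=20,     # at most 20 leaves
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    random_state=0,
)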

The CART algorithm: how decision trees work

1. Extract thresholds (conditions) in given data features

For each feature, the algorithm evaluates all possible split points. For continuous features, this means evaluating every unique value in the feature as a potential split point. For example, in a dataset about house prices, the possible thresholds of the building area feature are all the values present in that column.

For categorical features, it evaluates all possible ways to partition the categories based on all the values. For example, in a dataset about the efficiency of a drug on various patients, the possible thresholds of the sex feature are all the categorical values present in that column, i.e., male and female.
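
As a minimal sketch of the idea, this is one way candidate thresholds can be enumerated for a continuous feature (here the split points are taken as midpoints between consecutive sorted unique values, a common convention; the numbers are just an example):

import numpy as np

def candidate_thresholds(values):
    # every distinct value, sorted, then the midpoints between neighbours
    unique_sorted = np.unique(values)
    return (unique_sorted[:-1] + unique_sorted[1:]) / 2

building_area = np.array([150, 95, 105, 75])
print(candidate_thresholds(building_area))  # [ 85.  100.  127.5]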

2.1 Choose the optimal thresholds (classification)

In classification, the algorithm iterates through each condition and calculates its information gain: the difference between the impurity of the parent node and the sample-weighted sum of the impurities of the sub-nodes.

Impurity is calculated using the Gini impurity measure. It expresses the class homogeneity of the samples present in a node.

Gini impurity = 1 – (p_1)^2 – (p_2)^2 – …

Where:

  • p_i = (number of samples of class i in the node) / (total number of samples in the node)

Intuitively, the lower the impurity of a node, the more examples in that node belong to the same class. This indicates that the condition captures a characteristic specific to that class, which helps identification.

The condition with the highest information gain is chosen to split the data.
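
As a minimal sketch, assuming a binary target array y and a boolean mask condition, the Gini impurity and information gain of a candidate split can be computed like this:

import numpy as np

def gini(y):
    # 1 minus the sum of the squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def information_gain(y, condition):
    # parent impurity minus the sample-weighted impurity of the two sub-nodes
    left, right = y[condition], y[~condition]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    return gini(y) - weighted

y = np.array([0, 0, 1, 1, 1])
condition = np.array([True, True, False, False, False])  # a perfect split
print(information_gain(y, condition))  # 0.48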

2.2 Choose the optimal thresholds (regression)

In regression problems, the algorithm iterates through each condition and calculates the average output of the samples in each sub-node.

It then calculates the squared residual for each sample in the sub-nodes.

Squared residual = (y_sample – y_node_average)^2

The lower the sum of the squared residuals, the more similar the outputs of the examples in a node are. This indicates that the condition captures a characteristic of subjects with similar outputs, which is useful for prediction.

The condition with the lowest sum of squared residuals across the sub-nodes is chosen to split the data.
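
As a minimal sketch, the sum of squared residuals of one candidate split can be computed like this (the arrays are just example values):

import numpy as np

def sum_squared_residuals(y):
    # sum of (y_sample - y_node_average)^2 over the samples in a node
    return np.sum((y - y.mean()) ** 2)

def split_score(y, condition):
    # total squared residuals of the two sub-nodes created by a condition
    return sum_squared_residuals(y[condition]) + sum_squared_residuals(y[~condition])

prices = np.array([650_000, 300_000, 130_000, 75_000])
condition = np.array([True, True, False, False])  # samples that satisfy the candidate condition
print(split_score(prices, condition))  # the lower, the better the split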

3. Stop the tree from growing

The splitting process of a node ends when its impurity or squared-residual sum cannot be reduced any further, or when a further split would create a node with fewer samples than min_samples_leaf. At this point, the node becomes a leaf, and the predicted value of the leaf is the average output of its samples (for regression) or its most common class (for classification).

The learning process stops entirely (the tree stops growing) when the max_depth or max_leaf_nodes limit is reached.

4. Predicting outputs

For each input, the model follows a decision path. The model output is the value stored in the leaf in which the input falls: the average output of its training samples for regression, or its majority class for classification.
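
With a fitted scikit-learn tree you can inspect this behaviour directly: apply() returns the leaf each input lands in, and predict() returns the value stored in that leaf. A minimal sketch, assuming a fitted DecisionTreeRegressor called model and a validation set val_X as in the code later in this article:

leaf_ids = model.apply(val_X)       # index of the leaf reached by each input
predictions = model.predict(val_X)  # value stored in that leaf
print(leaf_ids[:5], predictions[:5])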

Decision tree advantages and disadvantages

Advantages

  • Simple to understand and to interpret. Trees can be visualized.
  • Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values need to be removed. Some tree and algorithm combinations support missing values.
  • The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
  • Able to handle both numerical and categorical data.
  • Able to handle multi-output problems.
  • Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic (true or false). By contrast, results may be more difficult to interpret in a black box model (e.g., in an artificial neural network).
  • Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
  • Decision trees make it easy to evaluate feature importance, or contribution, to the model.

    There are a few ways to evaluate feature importance, such as the Gini importance of a feature, which measures the total impurity reduction contributed by the splits on that feature (see the short sketch after this list).
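
For example, a fitted scikit-learn tree exposes its (normalized) impurity-based importances through the feature_importances_ attribute; a minimal sketch, assuming the fitted model and the input_variables list used later in this article:

for name, importance in zip(input_variables, model.feature_importances_):
    print(f"{name}: {importance:.3f}")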

Disadvantages

  • Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
  • Decision trees can be oversensitive and unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble (such as the random forest model or XGBoost).
  • Some concepts are hard for decision trees to learn because they cannot express them easily, such as XOR (a logical operation that returns true only if the inputs are different), parity, or multiplexer problems, where the output is one of two input values, A or B, depending on a third value called the selector (e.g., if S == true then output = A, else output = B). A short sketch illustrating the XOR case follows below.
  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

These aspects are taken from the general scikit-learn documentation about decision trees.
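
To make the XOR limitation concrete, here is a minimal sketch: no single axis-aligned split can separate the four XOR points, so a tree limited to one split performs no better than chance (a deeper tree is needed to fit them):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# XOR truth table: the class depends on both inputs at once
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

stump = DecisionTreeClassifier(max_depth=1, random_state=0)  # a single split
stump.fit(X, y)
print(stump.score(X, y))  # 0.5 even on the training points themselves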

Programming a decision tree

From here on, you will learn how to build a decision tree model in Python using the scikit-learn library.

1. Import necessary libraries

The libraries used in this project are:

  • Pandas for handling input and output data
  • Scikit-learn (sklearn) for the decision tree algorithm, evaluation metrics and preprocessing techniques
  • Matplotlib for visualizing the model structure
  • Joblib for saving the model
import pandas as pd

import joblib

import sklearn

from sklearn import tree

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import OrdinalEncoder

from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_absolute_error

from sklearn.model_selection import train_test_split

from matplotlib import pyplot as plt

2. Upload the dataset

#upload the dataset

file_path = "C:\\...\\melb_data.csv" #real estate data with house prices and input details

table = pd.read_csv(file_path)

The data used to train this model look something like this:

     Rooms   Building area   Year Built   Sale price
1    8       150             1987         650 000
2    5       95              2015         300 000
3    6       105             1967         130 000
4    4       75              2001         75 000

The dataset I used is a real estate dataset that reports the sales values of properties with their respective building characteristics.

3. Select input and output features

#define the input and the output

input_variables = ['LotArea', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']

X = table[input_variables]

y = table[["SalePrice"]]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0) #split the data into training and testing data

4. Feature preprocessing

#handle non-numeric columns

cols_without_numbers = [col for col in train_X.columns
                     if train_X[col].dtype == "object"]

ordinal_encoder = OrdinalEncoder() #assign a number to each possible categorical value

for col in cols_without_numbers:

    train_X[[col]] = ordinal_encoder.fit_transform(train_X[[col]])
    
    val_X[[col]] = ordinal_encoder.transform(val_X[[col]])

#handle missing data

cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]

imputer = SimpleImputer() #fill each missing cell with the mean value of the entire feature

for col in cols_with_missing:

    train_X[[col]] = imputer.fit_transform(train_X[[col]])

    val_X[[col]] = imputer.transform(val_X[[col]])

In this step, we convert non-numeric (categorical) columns into numbers with an ordinal encoder and fill missing cells with the column mean, because scikit-learn estimators require fully numeric input without missing values.

5. Find the best parameter values

#Find the best value for the max_leaf_nodes parameter

leaves_list = []

leaves_mae = []

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    leaves_list.append(max_leaf_nodes)
    leaves_mae.append(mae)


for max_leaf_nodes in range(2 , 100):
    get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)

I decided to use max_leaf_nodes as a parameter to prevent overfitting. I could have also used max_depth.
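
Optionally, you can plot the validation MAE against the number of leaves to see where the error bottoms out, reusing the lists filled in the loop above:

plt.plot(leaves_list, leaves_mae)
plt.xlabel("max_leaf_nodes")
plt.ylabel("validation MAE")
plt.show()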

6. Build, train and evaluate the model

#load and train the model

model = DecisionTreeRegressor(max_leaf_nodes=leaves_list[leaves_mae.index(min(leaves_mae))], random_state=0)

model.fit(train_X, train_y)

print(mean_absolute_error(val_y, model.predict(val_X))) #evaluate its performance

Wow! The mean absolute error of our model is 25 542. This means that, on average, the predicted price differs from the real value by $25 542. For predicting house prices, this is an excellent result.

7. Visualize the model structure

#show model structure textual and visual representation

text_representation = tree.export_text(model)
print(text_representation)

plt.figure(figsize=(25,20))
tree.plot_tree(model, feature_names=input_variables, filled=True)
plt.show()

[Figure: visual representation of the decision tree splits. The image shows a smaller tree than the one we've just created; in the actual representation, the nodes are too small to read.]

8. Save the model on a .sav file

#save the model on desktop in a .sav file

joblib.dump(model, "C:\\...\\my_model.sav")
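
Later on, the saved file can be loaded back with joblib to make predictions without retraining (the path is just a placeholder, as above):

loaded_model = joblib.load("C:\\...\\my_model.sav")
print(loaded_model.predict(val_X[:5]))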

Regression model full code

import pandas as pd

import joblib

import sklearn

from sklearn import tree

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import OrdinalEncoder

from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_absolute_error

from sklearn.model_selection import train_test_split

from matplotlib import pyplot as plt


#upload the dataset

file_path = "C:\\...\\realestate_dataset.csv"

table = pd.read_csv(file_path)


#define the input and the output

input_variables = ['LotArea', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']

X = table[input_variables]

y = table[["SalePrice"]]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0) #split the data into training and testing data


#handle non-numeric columns

cols_without_numbers = [col for col in train_X.columns
                     if train_X[col].dtype == "object"]

ordinal_encoder = OrdinalEncoder() #assign a number to each possible categorical value

for col in cols_without_numbers:

    train_X[[col]] = ordinal_encoder.fit_transform(train_X[[col]])
    
    val_X[[col]] = ordinal_encoder.transform(val_X[[col]])

#handle missing data

cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]

imputer = SimpleImputer() #fill each missing cell with the mean value of the entire feature

for col in cols_with_missing:

    train_X[[col]] = imputer.fit_transform(train_X[[col]])

    val_X[[col]] = imputer.transform(val_X[[col]])


#Find the best value for the max_leaf_nodes parameter

leaves_list = []

leaves_mae = []

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    leaves_list.append(max_leaf_nodes)
    leaves_mae.append(mae)


for max_leaf_nodes in range(2 , 100):
    get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
 

#load and train the model

model = DecisionTreeRegressor(max_leaf_nodes=leaves_list[leaves_mae.index(min(leaves_mae))], random_state=0)

model.fit(train_X, train_y)

print(mean_absolute_error(val_y, model.predict(val_X))) #evaluate its performance


#show model structure textual and visual representation

text_representation = tree.export_text(model)
print(text_representation)

plt.figure(figsize=(25,20))
tree.plot_tree(model, feature_names=input_variables, filled=True)
plt.show()


#save the model on desktop in a .sav file

joblib.dump(model, "C:\\...\\my_model.sav")

Classifier model full code

import pandas as pd

import joblib

import sklearn

from sklearn import tree

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import OrdinalEncoder

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

from matplotlib import pyplot as plt


#upload the dataset

file_path = "C:\\...\\mushroom_cleaned.csv" #poisonousness of mushrooms according to their characteristics 

table = pd.read_csv(file_path)


#define the input and the output

input_variables = ["cap-diameter", "cap-shape", "gill-attachment", "gill-color", "stem-height", "stem-width", "stem-color", "season"]

X = table[input_variables]

y = table[["class"]]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0) #split the data into training and testing data


#handle non-numeric columns

cols_without_numbers = [col for col in train_X.columns
                     if train_X[col].dtype == "object"]

ordinal_encoder = OrdinalEncoder() #assign a number to each possible categorical value

for col in cols_without_numbers:

    train_X[[col]] = ordinal_encoder.fit_transform(train_X[[col]])
    
    val_X[[col]] = ordinal_encoder.transform(val_X[[col]])

#handle missing data

cols_with_missing = [col for col in train_X.columns
                     if train_X[col].isnull().any()]

imputer = SimpleImputer() #fill each missing cell with the mean value of the entire feature

for col in cols_with_missing:

    train_X[[col]] = imputer.fit_transform(train_X[[col]])

    val_X[[col]] = imputer.transform(val_X[[col]])


#Find the best value for the max_leaf_nodes parameter

leaves_list = []

leaves_accuracy = []

def get_accuracy(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)  
    preds_val = model.predict(val_X)
    accuracy = accuracy_score(val_y, preds_val)
    leaves_list.append(max_leaf_nodes)
    leaves_accuracy.append(accuracy)


for max_leaf_nodes in range(2, 100):
    get_accuracy(max_leaf_nodes, train_X, val_X, train_y, val_y)
 
#load and train the model

model = DecisionTreeClassifier(max_leaf_nodes=leaves_list[leaves_accuracy.index(max(leaves_accuracy))], random_state=0)

model.fit(train_X, train_y)

print(accuracy_score(val_y, model.predict(val_X))) #evaluate its performance


#show model structure textual and visual representation

text_representation = tree.export_text(model)
print(text_representation)


tree.plot_tree(model, feature_names=input_variables, filled=True)
plt.show()

#save the model on desktop in a .sav file

joblib.dump(model, "C:\\...\\my_model.sav")