Linear regression: how it works, types and Python code

What linear regression is

Linear regression is a supervised machine-learning algorithm that fits a straight line to the input data to represent the relationship between x and y.

Linear regression structure

Suppose we have a dataset that contains two columns: one with the area in m² of some houses and the other with their prices.

We can plot the data on a graph, which would look something like this:

We are investors and want to predict the price of a house based on its area. This means that the column with the area is the input column X. The price is the output column y.

The algorithm aims to find the line that best represents the relationship between X and y. Extending a vertical line from a point on the x-axis makes it possible to read its predicted value on the y-axis: the height at which it intersects the line.

But for the predictions to be accurate, the line has to pass as close as possible to the points: otherwise, as in the example below, the difference between the actual y and the predicted y is $30,000.

This means that the error of our prediction is the vertical distance between the point and the line.

The task of the algorithm is then to move the line so that it is as close as possible to all the data points by adjusting two parameters:

  • The slope, the tangent of the angle of the line with the x-axis.
  • The intercept, the point where the line intersects with the y-axis.

So the mathematical formula for y is:

y = intercept + slope * X
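For instance, with made-up numbers (a hypothetical intercept of $50,000 and slope of $3,000 per m²), the prediction for a 120 m² house works out like this:

#hypothetical values, just to illustrate the formula
intercept = 50000 #base price in $
slope = 3000 #price increase in $ per m² of area
area = 120 #house area in m²

price = intercept + slope * area
print(price) #410000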

How linear regression works

Problem statement

We have a dataset with a numerical label (a regression problem). We want to fit a straight line as close as possible to all the data points to represent the relationship between X and y.

1. Initialize the intercept and slope

Initially, the algorithm sets both parameters, the intercept and the slope, to 0.

2. Fit the line with gradient descent

To fit the line, it uses gradient descent, a powerful algorithm that optimizes parameter values to minimize a given function.

The function we want to minimize in this case is the mean squared error: the average of the squared vertical distances between the data points and our line.

E = 1/N * Σᵢ (yᵢ – (w * xᵢ + b))²

Where:

  • N is the number of data points
  • xᵢ and yᵢ are the coordinates of the i-th data point
  • w is the slope and b is the intercept

2.1 Calculate the sum of gradients

The algorithm calculates the partial derivatives of E with respect to w and b by summing the derivatives at each data point.

∂E / ∂w = –2/N * Σᵢ xᵢ * (yᵢ – (w * xᵢ + b))

∂E / ∂b = –2/N * Σᵢ (yᵢ – (w * xᵢ + b))

2.2 Update the parameters

To update the parameters, the algorithm subtracts from the current value the derivative multiplied by a small learning rate α.

w = w – α * (∂E / ∂w)

b = b – α * (∂E / ∂b)


The algorithm keeps updating the parameters until the change becomes negligible or a maximum number of iterations is reached.

3. Predict outputs

To predict an output, we just plug the input into the formula y = b + w * x, where b and w are the constants we have just found.
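Putting the three steps together, here is a minimal NumPy sketch of the whole procedure (the synthetic data, the learning rate, and the number of iterations are all assumptions chosen for illustration):

import numpy as np

#synthetic data: y ≈ 3x + 4 plus some noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 4 + rng.normal(0, 1, 100)

w, b = 0.0, 0.0 #1. initialize the slope and the intercept to 0
alpha = 0.01 #learning rate

for _ in range(10000): #2. fit the line with gradient descent
    error = y - (w * x + b)
    dw = -2 / len(x) * np.sum(x * error) #∂E / ∂w
    db = -2 / len(x) * np.sum(error) #∂E / ∂b
    w = w - alpha * dw
    b = b - alpha * db

print(w, b) #should end up close to 3 and 4
print(b + w * 5) #3. predict the output for x = 5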

Linear regression advantages and disadvantages

Linear regression advantages

  • Since linear regression is a simple model, it is computationally cheap and fast to execute.
  • Linear regression is perfect for representing linear relationships in the data.
  • Linear regression is a white-box model: its internal workings and decision-making process are transparent, because we know the exact values of w and b.

Linear regression disadvantages

  • Linear regression is very sensitive to outliers, rare data points that differ significantly from the rest of the observations.
  • Linear regression can’t represent non-linear relationships in the data.

Linear regression types

Simple linear regression

Describes the relationship between one independent and one dependent variable.

Multiple linear regression

Describes the relationship between multiple independent variables and one dependent variable.

The formula is:

y = w1 * x1 + w2 * x2 + … + b

Where:

  • b is the intercept
  • each independent variable is multiplied by a different slope
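As a quick illustration with made-up numbers (two features, say area in m² and number of rooms, with hypothetical slopes):

#hypothetical multiple linear regression with two features
b = 20000 #intercept
w1, w2 = 2500, 10000 #one slope per feature
x1, x2 = 100, 3 #area = 100 m², rooms = 3

y = w1 * x1 + w2 * x2 + b
print(y) #300000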

Linear regression in Python

1. Import necessary libraries
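Judging from the functions used in the snippets below, the imports would be:

import pandas as pd
import joblib

from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error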

2. Upload the dataset

#upload the dataset
file_path = "C:\\...\\melb_data.csv" #real estate data with house prices and input details

dataset = pd.read_csv(file_path)

The data used to train this model look something like this:

Rooms | Building area | Year built | Sale price
------|---------------|------------|-----------
8     | 150           | 1987       | 650 000
5     | 95            | 2015       | 300 000
6     | 105           | 1967       | 130 000
4     | 75            | 2001       | 75 000

The dataset I used is a real estate dataset that reports the price of each house along with its characteristics.

3. Feature engineering

#handle non-numeric columns

cols_without_numbers = [col for col in dataset.columns
                        if dataset[col].dtype == "object"]

ordinal_encoder = OrdinalEncoder() #assigns a number to each possible value

for col in cols_without_numbers:
    dataset[[col]] = ordinal_encoder.fit_transform(dataset[[col]])

#handle missing data

cols_with_missing = [col for col in dataset.columns
                     if dataset[col].isnull().any()]

imputer = SimpleImputer() #fills each missing cell with the mean value of the entire feature

for col in cols_with_missing:
    dataset[[col]] = imputer.fit_transform(dataset[[col]])

4. Select input and output features and split the data

#define the features and the label

input_variables = ['LotArea', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']

X = dataset[input_variables]

y = dataset[["SalePrice"]]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0) #split the data into training and validation sets

5. Train and evaluate the model

#load and train the model

model = LinearRegression()

model.fit(train_X, train_y)

print(mean_absolute_error(val_y, model.predict(val_X))) #evaluate its performance

Wow! The mean absolute error of our model is 25,542. That means that, on average, the predicted price differs from the real value by $25,542. For guessing house prices, this is a very good result.

6. Visualize the model
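The original plot is not reproduced here, but a minimal way to visualize the model (assuming matplotlib is available) is to plot predicted against actual prices on the validation set:

import matplotlib.pyplot as plt

#plot predicted vs. actual prices on the validation set
predictions = model.predict(val_X)
plt.scatter(val_y, predictions, alpha=0.5)
plt.xlabel("Actual price")
plt.ylabel("Predicted price")
plt.title("Predicted vs. actual sale prices")
plt.show()

The closer the points lie to the diagonal, the better the model's predictions.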

7. Save the model to a .sav file

#save the model to the desktop as a .sav file
joblib.dump(model, "C:\\...\\my_model.sav")
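To reuse the model later, it can be loaded back with joblib.load:

#load the saved model and make a prediction
loaded_model = joblib.load("C:\\...\\my_model.sav")
print(loaded_model.predict(val_X[:5])) #predict the first five validation houses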