What linear regression is
Linear regression is a supervised machine-learning algorithm that fits a straight line to the input data to represent the relationship between x and y.
Linear regression structure
Suppose we have a dataset that contains two columns, one with the area in m² of some houses and the other with their prices.
We can plot the data on a graph, which would look something like this:
We are investors and want to predict the price of a house based on its area. This means that the column with the area is the input column X. The price is the output column y.
The algorithm aims to find a line that best represents the relationship between X and y. Extending a vertical line from a point on the x-axis up to where it meets the fitted line gives the predicted value on the y-axis.
But for the predictions to be accurate, the line has to pass as close as possible to the points: otherwise, as in the example below, the difference between the actual y and the predicted y can be large ($30,000 here).
This means that the error of our calculations is equal to the distance between the point and the line.
The task of the algorithm is then to move the line so that it is as close as possible to all the data points by adjusting two parameters:
- The slope, the tangent of the angle of the line with the x-axis.
- The intercept, the point where the line intersects with the y-axis.
So the mathematical formula for y is:
y = intercept + slope * X
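Plugging numbers into this formula makes it concrete. A minimal sketch, with a made-up intercept and slope (not values fitted from real data):

```python
# hypothetical parameters: price = intercept + slope * area
intercept = 50_000   # base price in dollars (assumed)
slope = 1_200        # dollars per m² (assumed)

def predict_price(area_m2):
    """Predict a house price from its area using the line's equation."""
    return intercept + slope * area_m2

print(predict_price(100))  # price for a 100 m² house: 170000
```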
How linear regression works
Problem statement
We have a dataset with a numerical label (a regression problem). We want to fit a straight line as close as possible to all the data points to represent the relationship between X and y.
1. Initialize the intercept and slope
Initially, the algorithm sets both parameters, the intercept and the slope, to 0.
2. Fit the line with gradient descent
To fit the line, it uses gradient descent, a powerful algorithm that iteratively adjusts parameter values to minimize a given function.
The function we want to minimize in this case is the mean squared error (MSE), the average of the squared distances between the data points and our line:
E = 1/N * Σ (yᵢ – (w * xᵢ + b))²
Where:
- N is the number of data points
- xᵢ and yᵢ are the coordinates of the i-th data point
- w is the slope and b is the intercept
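The error function above translates directly into code. A minimal sketch:

```python
def mean_squared_error(x, y, w, b):
    """Average of the squared residuals between the points and the line y = w*x + b."""
    n = len(x)
    return sum((yi - (w * xi + b)) ** 2 for xi, yi in zip(x, y)) / n

# a line that passes exactly through every point has zero error
print(mean_squared_error([1, 2, 3], [2, 4, 6], w=2, b=0))  # 0.0
```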
2.1 Calculate the sum of gradients
The algorithm calculates the partial derivatives of E with respect to w and b by summing the contribution of each data point.
∂E / ∂w = –2 / N * Σ xᵢ * (yᵢ – (w * xᵢ + b))
∂E / ∂b = –2 / N * Σ (yᵢ – (w * xᵢ + b))
2.2 Update the parameters
To update the parameters, the algorithm subtracts from the initial value the derivative multiplied by a small learning rate.
w = w – α * (∂E / ∂w)
b = b – α * (∂E / ∂b)
The algorithm continues to update the parameters until the change is minimal or a maximum number of iterations is reached.
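The whole loop (initialize, sum the gradients, update, stop when the change is minimal) can be sketched as follows; the learning rate, iteration cap, and tolerance are illustrative choices, not prescribed values:

```python
def fit_line(x, y, lr=0.01, max_iters=10_000, tol=1e-9):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w = b = 0.0  # step 1: initialize both parameters to 0
    n = len(x)
    for _ in range(max_iters):
        # step 2.1: sum the gradients over all data points
        dw = (-2 / n) * sum(xi * (yi - (w * xi + b)) for xi, yi in zip(x, y))
        db = (-2 / n) * sum(yi - (w * xi + b) for xi, yi in zip(x, y))
        # step 2.2: move against the gradient, scaled by the learning rate
        w -= lr * dw
        b -= lr * db
        if abs(lr * dw) < tol and abs(lr * db) < tol:  # change is minimal: stop
            break
    return w, b

w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points lying on y = 2x + 1
```

On these four points the loop recovers a slope close to 2 and an intercept close to 1.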
3. Predict outputs
To predict an output, we just plug the input into the formula y = b + w * x, where b and w are the constants we have just found.
Linear regression advantages and disadvantages
Linear regression advantages
- Since linear regression is a simple model, it is computationally cheap and fast to execute.
- Linear regression is perfect for representing linear relationships between data.
- Linear regression is a white-box model. Its internal workings and decision-making process are transparent: we know the exact values of w and b.
Linear regression disadvantages
- Linear regression is very sensitive to outliers. Outliers are rare observations that differ significantly from the rest of the data.
- Linear regression can’t represent non-linear relationships between data.
Linear regression types
Simple linear regression
Describes the relationship between one independent and one dependent variable.
Multiple linear regression
Describes the relationship between multiple independent variables and one dependent variable.
The formula is:
y = w1 * x1 + w2 * x2 + … + wn * xn + b
Where:
- b is the intercept
- each independent variable is multiplied by a different slope
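The prediction is just the weighted sum above. A minimal sketch with made-up slopes and intercept, purely for illustration:

```python
# hypothetical parameters: one slope per input feature, plus the intercept
weights = [2.0, 3.0]   # w1, w2 (assumed)
intercept = 1.0        # b (assumed)

def predict(features):
    """y = w1*x1 + w2*x2 + ... + b"""
    return sum(w * x for w, x in zip(weights, features)) + intercept

print(predict([10, 1]))  # 2*10 + 3*1 + 1 = 24.0
```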
Linear regression in Python
1. Import necessary libraries
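The snippets below rely on pandas, scikit-learn, and joblib. A plausible set of imports covering the rest of the pipeline (both error metrics are included, since either can be used for evaluation):

```python
import pandas as pd
import joblib
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
```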
2. Upload the dataset
#load the dataset
file_path = "C:\\...\\melb_data.csv" #real estate data with house prices and property details
dataset = pd.read_csv(file_path)
The data used to train this model look something like this:
|   | Rooms | Building area | Year built | … | Sale price |
|---|-------|---------------|------------|---|------------|
| 1 | 8     | 150           | 1987       | … | 650,000    |
| 2 | 5     | 95            | 2015       | … | 300,000    |
| 3 | 6     | 105           | 1967       | … | 130,000    |
| 4 | 4     | 75            | 2001       | … | 75,000     |
The dataset I used is a real estate dataset with house prices and property details.
3. Feature engineering
#handle non-numeric columns
cols_without_numbers = [col for col in dataset.columns
                        if dataset[col].dtype == "object"]

ordinal_encoder = OrdinalEncoder() #assigns a number to each possible value
for col in cols_without_numbers:
    dataset[[col]] = ordinal_encoder.fit_transform(dataset[[col]])

#handle missing data
cols_with_missing = [col for col in dataset.columns
                     if dataset[col].isnull().any()]

imputer = SimpleImputer() #fills each missing cell with the mean value of the feature
for col in cols_with_missing:
    dataset[[col]] = imputer.fit_transform(dataset[[col]])
4. Select input and output features and split the data
#define the features and the label
input_variables = ['LotArea', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']
X = dataset[input_variables]
y = dataset[["SalePrice"]]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0) #split the data into training and testing data
5. Train and evaluate the model
#load and train the model
model = LinearRegression()
model.fit(train_X, train_y)
print(mean_absolute_error(val_y, model.predict(val_X))) #evaluate its performance
Wow! The mean absolute error of our model is 25,542. This means that, on average, the predicted price differs from the real value by $25,542. For guessing house prices, this is an excellent result.
6. Visualize the model
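A common way to visualize a regression model is to plot predicted values against actual values: the closer the points sit to the diagonal, the better the fit. A minimal sketch with a small synthetic dataset (in the pipeline above you would plot `val_y` against `model.predict(val_X)` instead):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs anywhere
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# tiny synthetic example standing in for the validation data
X = [[50], [80], [120], [200]]  # areas in m² (made up)
y = [110_000, 180_000, 250_000, 410_000]  # prices (made up)

model = LinearRegression().fit(X, y)
predicted = model.predict(X)

plt.scatter(y, predicted)
plt.plot([min(y), max(y)], [min(y), max(y)])  # diagonal = perfect predictions
plt.xlabel("Actual price")
plt.ylabel("Predicted price")
plt.savefig("model_fit.png")
```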
7. Save the model to a .sav file
#save the model on desktop in a .sav file
joblib.dump(model, "C:\\...\\my_model.sav")