This lecture introduces a method to train linear regression models in which the input $x$ is a row vector, the parameter $\theta$ is a vector of regression coefficients, and $x\theta$ is the prediction of the output $y$.
The method is called boosting, and a linear regression model trained with this method is called boosted linear regression.
We are going to assume that both the output and the entries of the input vector have zero mean. In other words, we assume that all the variables have been demeaned (centered) before training the linear regression model.
Boosting is an iterative procedure that yields a sequence of increasingly complex regression models.
We start from a model in which all the coefficients are equal to zero ($\theta_0 = 0$), so that the initial prediction is constantly equal to zero. Then, at each iteration $j$, we perform the following steps:
1. we compute the regression residuals from the previous iteration, $\eta_{j-1} = y - x\theta_{j-1}$;
2. we find the input variable that has the highest correlation (in absolute value) with the residuals (on the training sample);
3. we estimate by ordinary least squares (on the training sample) the coefficient $\hat{\beta}_k$ of the uni-variate regression of the residuals on the chosen variable (suppose it is the $k$-th);
4. we set $\theta_{j,k} = \theta_{j-1,k} + \lambda \hat{\beta}_k$, where $\lambda$ is the learning rate (usually a small value such as $\lambda = 0.1$, the one used in the code below); a learning rate less than 1 is used so as to increase model complexity gradually and avoid overfitting; all the other entries of $\theta_j$ are left unchanged;
5. we compute the mean squared error (MSE) of the regression on the validation sample;
6. if the MSE has not decreased for a pre-set number of iterations, we stop the algorithm.
The boosted regression model that we use to make predictions is the most complex one, that is, the one produced in the last boosting round (the last iteration of the algorithm).
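To make steps 1-4 concrete, here is a minimal sketch (not part of the original lecture) of a single boosting update on synthetic data. Because the inputs are standardized to have unit variance, the OLS coefficient of the uni-variate regression of the residuals on an input equals the sample mean of their product, which is also the quantity used to rank the inputs by correlation; the class implemented below relies on the same fact.
# A minimal, self-contained sketch of one boosting update (illustrative, not from the lecture)
import numpy as np
rng = np.random.default_rng(0)
# Synthetic data: 100 observations, 5 inputs; demean and standardize as the lecture assumes
n_obs, n_inputs = 100, 5
x = rng.standard_normal((n_obs, n_inputs))
x = (x - x.mean(axis=0)) / x.std(axis=0)
y = 0.8 * x[:, [2]] + 0.1 * rng.standard_normal((n_obs, 1))
y = y - y.mean()
theta = np.zeros((n_inputs, 1)) # starting model: all coefficients equal to zero
learning_rate = 0.1
# One boosting iteration (steps 1-4)
eta = y - x @ theta # step 1: residuals of the current model
betas = np.mean(x * eta, axis=0) # uni-variate OLS coefficients (inputs have unit variance)
index_best = np.argmax(np.abs(betas)) # step 2: input most correlated with the residuals
theta[index_best] += learning_rate * betas[index_best] # steps 3-4: damped update of one coefficient
print('Chosen input:', index_best, '- updated coefficient:', theta[index_best, 0])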
Boosting usually works very well and yields highly accurate predictive models.
Why? Basically, because it is able to reduce a regression problem, which is usually high-dimensional and plagued by the curse of dimensionality, to a sequence of uni-dimensional problems that can be solved with high precision.
The stopping rule in step 6 of the algorithm is called early stopping.
It is a rule used in many iterative machine learning algorithms.
Roughly speaking, we gradually increase model complexity until the performance of the model on the validation sample starts to degrade.
Early stopping is extremely important and is one of the ingredients that explain the good forecasting performance of many machine learning models.
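As a stand-alone illustration (not part of the original lecture, and using made-up validation MSEs), the following snippet sketches a patience-based early-stopping loop similar in spirit to the rule used in the fit method below: training stops once the validation MSE has failed to improve for a pre-set number of consecutive iterations.
# A minimal sketch of patience-based early stopping (hypothetical validation MSEs)
import numpy as np
val_mses = [1.00, 0.80, 0.70, 0.65, 0.66, 0.64, 0.67, 0.68, 0.69, 0.70]
patience = 3 # pre-set number of unproductive iterations
best_mse = np.inf
no_improvement = 0
stopped_after = len(val_mses)
for iteration, mse in enumerate(val_mses, start=1):
    if mse < best_mse: # the model is still improving on the validation sample
        best_mse = mse
        no_improvement = 0
    else: # an unproductive iteration
        no_improvement += 1
    if no_improvement >= patience:
        stopped_after = iteration
        break
print('Early stopping after', stopped_after, 'iterations')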
In our example, we continue to use the same inflation data set used previously.
We first import the data and split it into train-val-test.
# Import the packages used to load and manipulate the data
import numpy as np # Numpy is a Matlab-like package for array manipulation and linear algebra
import pandas as pd # Pandas is a data-analysis and table-manipulation tool
import urllib.request # urllib will be used to download the dataset
# Import the function that performs sample splits from scikit-learn
from sklearn.model_selection import train_test_split
# Load the output variable with pandas (download with urllib if not downloaded previously)
remoteAddress = 'https://www.statlect.com/ml-assets/y_hicp.csv'
localAddress = './y_hicp.csv'
try:
    y = pd.read_csv(localAddress, header=None)
except:
    urllib.request.urlretrieve(remoteAddress, localAddress)
    y = pd.read_csv(localAddress, header=None)
y = y.values # Transform y into a numpy array
# Print some information about the output variable
print('Class and dimension of output variable:')
print(type(y))
print(y.shape)
# Load the input variables with pandas
remoteAddress = 'https://www.statlect.com/ml-assets/x_hicp.csv'
localAddress = './x_hicp.csv'
try:
    x = pd.read_csv(localAddress, header=None)
except:
    urllib.request.urlretrieve(remoteAddress, localAddress)
    x = pd.read_csv(localAddress, header=None)
x = x.values
# Print some information about the input variables
print('Class and dimension of input variables:')
print(type(x))
print(x.shape)
# Create the training sample
x_train, x_val_test, y_train, y_val_test \
    = train_test_split(x, y, test_size=0.4, random_state=1)
# Split the remaining observations into validation and test
x_val, x_test, y_val, y_test \
    = train_test_split(x_val_test, y_val_test, test_size=0.5, random_state=1)
# Print the numerosities of the three samples
print('Numerosities of training, validation and test samples:')
print(x_train.shape[0], x_val.shape[0], x_test.shape[0])
The output is:
Class and dimension of output variable:
<class 'numpy.ndarray'>
(270, 1)
Class and dimension of input variables:
<class 'numpy.ndarray'>
(270, 113)
Numerosities of training, validation and test samples:
162 54 54
We create our own class for training boosted linear regression models.
# Import package used to make copies of objects
from copy import deepcopy
# Our boosted linear regression (blr) class will implement 3 methods
# (constructor, fit, and predict), as previously seen in scikit-learn
class blr:
    def __init__(self, learning_rate, max_iter, early_stopping):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.early = early_stopping
        self.y_mean = 0
        self.y_std = 1
        self.x_mean = 0
        self.x_std = 1
        self.theta = 0
        self.mses = []
    def fit(self, x_train_0, y_train_0, x_val_0, y_val_0):
        # Make copies of data to avoid over-writing original dataset
        x_train = deepcopy(x_train_0)
        y_train = deepcopy(y_train_0)
        x_val = deepcopy(x_val_0)
        y_val = deepcopy(y_val_0)
        # De-mean the output variable
        self.y_mean = np.mean(y_train)
        y_train -= self.y_mean
        y_val -= self.y_mean
        # Standardize the output variable
        self.y_std = np.std(y_train)
        y_train /= self.y_std
        y_val /= self.y_std
        # De-mean the input variables
        self.x_mean = np.mean(x_train, axis=0, keepdims=True)
        x_train -= self.x_mean
        x_val -= self.x_mean
        # Standardize the input variables
        self.x_std = np.std(x_train, axis=0, keepdims=True)
        x_train /= self.x_std
        x_val /= self.x_std
        # Initialize counters (total boosting iterations and unproductive iterations)
        current_iter = 0
        no_improvement = 0
        # The starting model has all coefficients equal to zero and predicts a constant zero output
        self.theta = np.zeros((x_train.shape[1], 1))
        y_train_pred = 0 * y_train
        y_val_pred = 0 * y_val
        eta = y_train - y_train_pred
        mses = [np.var(y_val - y_val_pred)]
        # Boosting iterations
        while no_improvement < self.early and current_iter < self.max_iter:
            current_iter += 1
            corr_coeffs = np.mean(x_train * eta, axis=0)  # Correlations (equal to betas) between residuals and inputs
            index_best = np.argmax(np.abs(corr_coeffs))  # Choose the variable that has maximum correlation with the residuals
            self.theta[index_best] += self.lr * corr_coeffs[index_best]  # Parameter update
            y_train_pred += self.lr * corr_coeffs[index_best] * x_train[:, [index_best]]  # Prediction update
            eta = y_train - y_train_pred  # Residuals update
            y_val_pred += self.lr * corr_coeffs[index_best] * x_val[:, [index_best]]  # Validation prediction update
            mses.append(np.var(y_val - y_val_pred))  # New validation MSE
            if mses[-1] > np.min(mses[0:-1]):  # Stopping criterion to avoid over-fitting
                no_improvement += 1
            else:
                no_improvement = 0
        # Final output message
        print('Boosting stopped after ' + str(current_iter) + ' iterations')
    def predict(self, x_test_0):
        # Make copies of the data to avoid over-writing original dataset
        x_test = deepcopy(x_test_0)
        # De-mean input variables using means computed on the training sample
        x_test = x_test - self.x_mean
        # Standardize input variables using standard deviations computed on the training sample
        x_test = x_test / self.x_std
        # Return prediction
        return self.y_mean + self.y_std * np.dot(x_test, self.theta)
We train the boosted regression model with all 113 input variables.
# Import model-evaluation metrics from scikit-learn
from sklearn.metrics import mean_squared_error, r2_score
# Create a boosted linear regression object
lr = blr(0.1, 10000, 20)
# Train the model
lr.fit(x_train, y_train, x_val, y_val)
# Make predictions on the train, validation and test sets
y_train_pred = lr.predict(x_train)
y_val_pred = lr.predict(x_val)
y_test_pred = lr.predict(x_test)
# Print empirical risk on all sets
print('MSE on training set:')
print(mean_squared_error(y_train, y_train_pred))
print('MSE on validation set:')
print(mean_squared_error(y_val, y_val_pred))
print('MSE on test set:')
print(mean_squared_error(y_test, y_test_pred))
print('')
# Print R squared on all sets
print('R squared on training set:')
print(r2_score(y_train, y_train_pred))
print('R squared on validation set:')
print(r2_score(y_val, y_val_pred))
print('R squared on test set:')
print(r2_score(y_test, y_test_pred))
The output is:
Boosting stopped after 181 iterations
MSE on training set:
0.03676763521269099
MSE on validation set:
0.08231588238762148
MSE on test set:
0.09441771372808147
R squared on training set:
0.7661747762416133
R squared on validation set:
0.6517679287094578
R squared on test set:
0.5446686738671733
This is the best result thus far, better than both 1) selection of the best model among a set of randomly generated ones and 2) selection of a regularized regression model.
Why? Not only did we minimize overfitting on the validation set, because we basically used it to choose a single parameter (the number of boosting rounds), but we also managed to reduce overfitting on the training set by using a smart training strategy (updating only a single coefficient at a time).
Please cite as:
Taboga, Marco (2021). "Boosted linear regression", Lectures on machine learning. https://www.statlect.com/machine-learning/boosted-linear-regression.