This lecture deals with standardized linear regressions, that is, regression models in which the variables are standardized.
A variable is standardized by subtracting from it its sample mean and by dividing it by its standard deviation. After being standardized, the variable has zero mean and unit standard deviation.
We are going to deal with linear regressions
$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + \varepsilon_i$$
where $i = 1, \ldots, N$ indexes the observations in the sample, there are $K$ regressors $x_{i1}, \ldots, x_{iK}$ and $K$ regression coefficients $\beta_1, \ldots, \beta_K$, $y_i$ is the dependent variable and $\varepsilon_i$ is the error term.
In a standardized regression all the variables have zero mean and unit standard deviation or, equivalently, unit variance. More precisely,
$$\frac{1}{N}\sum_{i=1}^{N} x_{ij} = 0 \quad \text{and} \quad \frac{1}{N}\sum_{i=1}^{N} x_{ij}^2 = 1$$
for $j = 1, \ldots, K$. Furthermore, we assume that the dependent variable is also standardized:
$$\frac{1}{N}\sum_{i=1}^{N} y_i = 0 \quad \text{and} \quad \frac{1}{N}\sum_{i=1}^{N} y_i^2 = 1.$$
In general, a variable to be included in a regression model does not have zero mean and unit variance. Denote by $x_{ij}^{u}$ such a variable (where the superscript $u$ indicates that the variable is unstandardized). Then, we standardize it before including it in the regression.
We compute the sample mean and variance of $x_{ij}^{u}$:
$$\bar{x}_j^{u} = \frac{1}{N}\sum_{i=1}^{N} x_{ij}^{u}, \qquad s_j^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_{ij}^{u} - \bar{x}_j^{u}\right)^2.$$
Then, we compute the standardized variable $x_{ij}$ to be used in the regression:
$$x_{ij} = \frac{x_{ij}^{u} - \bar{x}_j^{u}}{s_j}$$
for $i = 1, \ldots, N$ and $j = 1, \ldots, K$.
The same process is performed on the dependent variable $y_i$ if it does not have zero mean and unit variance.
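As a minimal sketch of this procedure (assuming NumPy and hypothetical simulated data; not part of the original lecture), the standardization can be carried out as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unstandardized data: N = 100 observations, K = 3 regressors.
Xu = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
yu = rng.normal(loc=10.0, scale=4.0, size=100)

def standardize(v):
    # Subtract the sample mean and divide by the sample standard deviation
    # (np.std uses the 1/N convention by default, matching the formulas above).
    return (v - v.mean(axis=0)) / v.std(axis=0)

X = standardize(Xu)
y = standardize(yu)

# Each standardized column now has zero mean and unit variance (up to rounding).
print(X.mean(axis=0), X.var(axis=0))
print(y.mean(), y.var())
```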
Particular care needs to be taken if the regression includes an intercept, that is, if one of the regressors is constant and equal to 1.
Clearly, the constant cannot be standardized because it has zero variance and division by zero is not allowed.
We have two possibilities: either we leave the constant as it is, that is, we do not standardize it, or we drop the constant from the regression.
If all the variables, including the dependent variable $y$, are standardized, as we have assumed above, then there is no need to include a constant in the regression because the OLS estimate of its coefficient would in any case be equal to zero (proof below).
Therefore, in what follows we are always going to drop the constant.
Write the regression in matrix form
$$y = X\beta + \varepsilon$$
where $y$ is the $N \times 1$ vector of dependent variables, $X$ is the $N \times K$ matrix of regressors, $\beta$ is the $K \times 1$ vector of regression coefficients and $\varepsilon$ is the $N \times 1$ vector of error terms.
The OLS estimator of $\beta$ is
$$\widehat{\beta} = \left(X^{\top} X\right)^{-1} X^{\top} y.$$
Suppose the first regressor is constant and equal to 1, and all the other regressors are standardized. Denote by $\widetilde{X}$ the matrix obtained by deleting the first column of $X$ (i.e., the column containing the constant) and by $\iota$ the $N \times 1$ vector of ones, so that $X = \begin{bmatrix} \iota & \widetilde{X} \end{bmatrix}$. Then, $X^{\top} X$ is block diagonal:
$$X^{\top} X = \begin{bmatrix} \iota^{\top}\iota & \iota^{\top}\widetilde{X} \\ \widetilde{X}^{\top}\iota & \widetilde{X}^{\top}\widetilde{X} \end{bmatrix} = \begin{bmatrix} N & 0 \\ 0 & \widetilde{X}^{\top}\widetilde{X} \end{bmatrix}$$
where the off-diagonal blocks are zero because the variables are standardized.
As a consequence, $\left(X^{\top} X\right)^{-1}$ is block diagonal:
$$\left(X^{\top} X\right)^{-1} = \begin{bmatrix} 1/N & 0 \\ 0 & \left(\widetilde{X}^{\top}\widetilde{X}\right)^{-1} \end{bmatrix}.$$
Furthermore,
$$X^{\top} y = \begin{bmatrix} \iota^{\top} y \\ \widetilde{X}^{\top} y \end{bmatrix} = \begin{bmatrix} 0 \\ \widetilde{X}^{\top} y \end{bmatrix}$$
where $\iota^{\top} y = \sum_{i=1}^{N} y_i = 0$ because $y$ is standardized.
Thus, by carrying out the multiplication of the two block matrices $\left(X^{\top} X\right)^{-1}$ and $X^{\top} y$, we get
$$\widehat{\beta} = \left(X^{\top} X\right)^{-1} X^{\top} y = \begin{bmatrix} 1/N & 0 \\ 0 & \left(\widetilde{X}^{\top}\widetilde{X}\right)^{-1} \end{bmatrix} \begin{bmatrix} 0 \\ \widetilde{X}^{\top} y \end{bmatrix} = \begin{bmatrix} 0 \\ \left(\widetilde{X}^{\top}\widetilde{X}\right)^{-1} \widetilde{X}^{\top} y \end{bmatrix}.$$
In other words, when we add an intercept, the OLS estimator of the other regressors does not change and the estimated intercept is always equal to zero.
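A quick numerical check of this result (a sketch assuming NumPy and simulated standardized data; not part of the original lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 3

# Simulated regressors and dependent variable, standardized with the 1/N convention.
X = rng.normal(size=(N, K))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([0.5, -0.3, 0.8]) + rng.normal(size=N)
y = (y - y.mean()) / y.std()

# OLS without a constant.
beta_no_const = np.linalg.solve(X.T @ X, X.T @ y)

# OLS with a column of ones prepended to the regressors.
X1 = np.column_stack([np.ones(N), X])
beta_with_const = np.linalg.solve(X1.T @ X1, X1.T @ y)

print(beta_with_const[0])                               # estimated intercept: ~0
print(np.allclose(beta_with_const[1:], beta_no_const))  # True: other coefficients unchanged
```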
Standardizing the variables in the regression greatly simplifies the computation of their sample covariances and correlations.
The sample covariance between two regressors $x_j$ and $x_k$ is
$$\widehat{\operatorname{Cov}}\left(x_j, x_k\right) = \frac{1}{N}\sum_{i=1}^{N} \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right) = \frac{1}{N}\sum_{i=1}^{N} x_{ij} x_{ik}$$
where the sample means $\bar{x}_j$ and $\bar{x}_k$ are zero because the two regressors are standardized.
For the same reason, the sample covariance between $x_j$ and $y$ is
$$\widehat{\operatorname{Cov}}\left(x_j, y\right) = \frac{1}{N}\sum_{i=1}^{N} x_{ij} y_i.$$
The sample correlation between $x_j$ and $x_k$ is
$$\widehat{\operatorname{Corr}}\left(x_j, x_k\right) = \frac{\widehat{\operatorname{Cov}}\left(x_j, x_k\right)}{\sqrt{\widehat{\operatorname{Var}}\left(x_j\right)}\sqrt{\widehat{\operatorname{Var}}\left(x_k\right)}} = \widehat{\operatorname{Cov}}\left(x_j, x_k\right) = \frac{1}{N}\sum_{i=1}^{N} x_{ij} x_{ik}$$
where the sample variances $\widehat{\operatorname{Var}}\left(x_j\right)$ and $\widehat{\operatorname{Var}}\left(x_k\right)$ are equal to 1 because the two regressors are standardized.
By the same token, the sample correlation between $x_j$ and $y$ is
$$\widehat{\operatorname{Corr}}\left(x_j, y\right) = \widehat{\operatorname{Cov}}\left(x_j, y\right) = \frac{1}{N}\sum_{i=1}^{N} x_{ij} y_i.$$
Thus, in a standardized regression, sample correlations and sample covariances coincide.
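This coincidence is easy to verify numerically (a sketch assuming NumPy and simulated data; not part of the original lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(500, 2))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardized regressors (1/N convention)

# Sample covariance matrix under the 1/N convention used above.
cov = (Z.T @ Z) / Z.shape[0]

# Sample correlation matrix (scale-free, so the convention does not matter).
corr = np.corrcoef(Z, rowvar=False)

print(np.allclose(cov, corr))  # True: covariances and correlations coincide
```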
Denote by $y$ the $N \times 1$ vector of dependent variables and by $X$ the $N \times K$ matrix of regressors, so that the regression equation can be written in matrix form as
$$y = X\beta + \varepsilon$$
where $\beta$ is the $K \times 1$ vector of regression coefficients and $\varepsilon$ is the $N \times 1$ vector of error terms.
The OLS estimator of $\beta$ is
$$\widehat{\beta} = \left(X^{\top} X\right)^{-1} X^{\top} y.$$
When all the variables are standardized, the OLS estimator can be written as a function of their sample correlations.
Denote by $x_{i\cdot}$ the $i$-th row of $X$. Note that the $(j,k)$-th element of $X^{\top} X$ is
$$\left(X^{\top} X\right)_{jk} = \sum_{i=1}^{N} x_{ij} x_{ik} = N\,\widehat{\operatorname{Corr}}\left(x_j, x_k\right).$$
Furthermore, the $j$-th element of $X^{\top} y$ is
$$\left(X^{\top} y\right)_{j} = \sum_{i=1}^{N} x_{ij} y_i = N\,\widehat{\operatorname{Corr}}\left(x_j, y\right).$$
Denote by $R$ the sample correlation matrix of $X$, that is, the $K \times K$ matrix whose $(j,k)$-th entry is equal to $\widehat{\operatorname{Corr}}\left(x_j, x_k\right)$. Then,
$$X^{\top} X = N R.$$
Similarly, denote by $r$ the $K \times 1$ vector whose $j$-th entry is equal to $\widehat{\operatorname{Corr}}\left(x_j, y\right)$, so that
$$X^{\top} y = N r.$$
Thus, we can write the OLS estimator as a function of the sample correlation matrices:
$$\widehat{\beta} = \left(X^{\top} X\right)^{-1} X^{\top} y = \left(N R\right)^{-1} \left(N r\right) = R^{-1} r.$$
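As a sketch of this identity (assuming NumPy, simulated standardized data and the 1/N convention used above; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 300, 4

# Simulated data, standardized with the 1/N convention.
X = rng.normal(size=(N, K))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ rng.normal(size=K) + rng.normal(size=N)
y = (y - y.mean()) / y.std()

# Ordinary OLS estimate.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimate computed from sample correlations: beta = R^{-1} r.
R = (X.T @ X) / N   # sample correlation matrix of the regressors
r = (X.T @ y) / N   # sample correlations between each regressor and y
beta_corr = np.linalg.solve(R, r)

print(np.allclose(beta_ols, beta_corr))  # True
```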
The estimated coefficients of a linear regression model with standardized variables are called standardized coefficients. They are sometimes deemed easier to interpret than the coefficients of an unstandardized regression.
In general, a regression coefficient $\beta_j$ is interpreted as the effect that is produced on the dependent variable when the $j$-th regressor is increased by one unit.
Sometimes, for example, when we read the output of a regression estimated by someone else, we are unable to tell whether a unit increase in the regressor is a lot or a little, or we are uncertain about the relevance of the effect on the dependent variable. In these situations, standardized coefficients are easier to interpret.
In a standardized regression, a unit increase in a variable corresponds to one standard deviation of the original, unstandardized variable. Roughly speaking, the standard deviation is the average
deviation of a random variable from its mean. So, when a variable differs from
its mean by one standard deviation, that is in a sense a "typical" deviation.
Then, a standardized coefficient $\beta_j$ tells you what multiple or fraction of a typical deviation in $y$ is caused by a typical deviation in the $j$-th regressor.
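For instance (with an illustrative number that is not taken from the lecture), if the standardized coefficient of the $j$-th regressor is $\beta_j = 0.4$, then a one-standard-deviation increase in that regressor produces a change of $0.4$ standard deviations in the dependent variable; measured in the original units of the unstandardized dependent variable, this is a change of $0.4\,s_y$, where $s_y$ is its sample standard deviation.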
Another benefit of standardization is that it is easier to make comparisons among regressors. In particular, if we ask which regressor has the largest impact on the dependent variable, then we have an easy answer: it is the regressor whose coefficient is the largest in absolute value. In fact, a typical deviation of that regressor from its mean will produce the largest effect, as compared to the effects produced by typical deviations of the other regressors from their means.
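As a small hypothetical sketch (the coefficient values below are made up for illustration), the comparison amounts to ranking the standardized coefficients by absolute value:

```python
import numpy as np

# Hypothetical standardized coefficients for three regressors.
beta = np.array([0.15, -0.62, 0.30])

# The regressor with the largest impact is the one whose standardized
# coefficient is largest in absolute value.
largest = int(np.argmax(np.abs(beta)))
print(f"Regressor {largest + 1} has the largest impact: |beta| = {abs(beta[largest]):.2f}")
```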
Please cite as:
Taboga, Marco (2021). "Linear regression with standardized variables", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/linear-regression-with-standardized-variables.