This lecture deals with maximum likelihood estimation of the logistic classification model (also called logit model or logistic regression).
Before proceeding, you might want to revise the introductions to maximum likelihood estimation (MLE) and to the logit model.
In the logit model, the output variable $y_i$ is a Bernoulli random variable (it can take only two values, either 1 or 0) and
$$P(y_i = 1 \mid x_i) = S(x_i\beta)$$
where $S$ is the logistic function
$$S(t) = \frac{1}{1+\exp(-t)},$$
$x_i$ is a $1\times K$ vector of inputs and $\beta$ is a $K\times 1$ vector of coefficients.
Furthermore,
$$P(y_i = 0 \mid x_i) = 1 - S(x_i\beta).$$
The vector of coefficients $\beta$ is the parameter to be estimated by maximum likelihood.
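For concreteness, here is a minimal Python/NumPy sketch of this computation (the helper name `logistic` and the numerical values are invented for illustration):

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical example with K = 3 inputs (including a constant term).
x_i = np.array([1.0, 0.5, -2.0])    # 1xK vector of inputs
beta = np.array([0.2, 1.0, 0.3])    # Kx1 vector of coefficients

p_i = logistic(x_i @ beta)          # P(y_i = 1 | x_i) = S(x_i beta)
print(p_i)                          # a probability strictly between 0 and 1
```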
We assume that the estimation is carried out with an IID sample comprising $N$ data points
$$(y_1, x_1), \ldots, (y_N, x_N).$$
The likelihood of an observation $(y_i, x_i)$ can be written as
$$L(\beta; y_i, x_i) = \left[S(x_i\beta)\right]^{y_i}\left[1 - S(x_i\beta)\right]^{1-y_i}.$$
If you are wondering about the exponents $y_i$ and $1-y_i$ or, more generally, about this formula for the likelihood, you are advised to revise the lecture on Classification models and their maximum likelihood estimation.
Denote the $N\times 1$ vector of all outputs by $y$ and the $N\times K$ matrix of all inputs by $X$.
Since the observations are IID, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:
$$L(\beta; y, X) = \prod_{i=1}^{N} \left[S(x_i\beta)\right]^{y_i}\left[1 - S(x_i\beta)\right]^{1-y_i}.$$
The log-likelihood of the logistic model is
$$l(\beta; y, X) = \sum_{i=1}^{N} \left[ y_i \ln S(x_i\beta) + (1-y_i) \ln\left(1 - S(x_i\beta)\right) \right].$$
It is computed as follows:
$$\begin{aligned}
l(\beta; y, X) &= \ln L(\beta; y, X) \\
&= \ln \prod_{i=1}^{N} \left[S(x_i\beta)\right]^{y_i}\left[1 - S(x_i\beta)\right]^{1-y_i} \\
&= \sum_{i=1}^{N} \left[ y_i \ln S(x_i\beta) + (1-y_i) \ln\left(1 - S(x_i\beta)\right) \right].
\end{aligned}$$
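As a hedged illustration, the log-likelihood can be computed with a few lines of NumPy; the arrays below are invented toy data and `X` stacks the input vectors $x_i$ row by row:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta, y, X):
    """Log-likelihood: sum of y_i*ln(S(x_i beta)) + (1 - y_i)*ln(1 - S(x_i beta))."""
    p = logistic(X @ beta)                  # vector of S(x_i beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Invented toy data with N = 4 observations and K = 2 inputs.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(log_likelihood(np.zeros(2), y, X))    # equals 4*ln(0.5) when beta = 0
```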
The score vector, that is the vector of first derivatives of the log-likelihood with respect to the parameter $\beta$, is
$$\nabla_{\beta} l(\beta; y, X) = \sum_{i=1}^{N} x_i' \left( y_i - S(x_i\beta) \right).$$
This is obtained as follows:
$$\begin{aligned}
\nabla_{\beta} l(\beta; y, X)
&= \sum_{i=1}^{N} \left[ \frac{y_i}{S(x_i\beta)} - \frac{1-y_i}{1-S(x_i\beta)} \right] S(x_i\beta)\left(1-S(x_i\beta)\right) x_i' \\
&= \sum_{i=1}^{N} \left[ y_i\left(1-S(x_i\beta)\right) - (1-y_i)S(x_i\beta) \right] x_i' \\
&= \sum_{i=1}^{N} x_i' \left( y_i - S(x_i\beta) \right).
\end{aligned}$$
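The same formula in code, together with a finite-difference check that it is indeed the gradient of the log-likelihood (toy data and helper names are the invented ones from the previous sketch):

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta, y, X):
    p = logistic(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def score(beta, y, X):
    """Score vector: sum_i x_i'(y_i - S(x_i beta)) = X'(y - p)."""
    return X.T @ (y - logistic(X @ beta))

# Invented toy data: the analytical score matches a numerical gradient.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta, eps = np.array([0.1, -0.2]), 1e-6
numerical = np.array([(log_likelihood(beta + eps * e, y, X)
                       - log_likelihood(beta - eps * e, y, X)) / (2 * eps)
                      for e in np.eye(2)])
print(score(beta, y, X), numerical)   # the two vectors agree closely
```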
The Hessian, that is the matrix of second derivatives, is
$$\nabla^2_{\beta\beta} l(\beta; y, X) = -\sum_{i=1}^{N} S(x_i\beta)\left(1-S(x_i\beta)\right) x_i' x_i.$$
It can be proved as follows:
$$\begin{aligned}
\nabla^2_{\beta\beta} l(\beta; y, X)
&= \nabla_{\beta'} \sum_{i=1}^{N} x_i' \left( y_i - S(x_i\beta) \right) \\
&= -\sum_{i=1}^{N} x_i' \, S(x_i\beta)\left(1-S(x_i\beta)\right) x_i \\
&= -\sum_{i=1}^{N} S(x_i\beta)\left(1-S(x_i\beta)\right) x_i' x_i,
\end{aligned}$$
where we have used the fact that the derivative of the logistic function is
$$\frac{dS(t)}{dt} = S(t)\left(1-S(t)\right).$$
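A short sketch of this computation with invented toy data, which also checks that the Hessian is negative definite when the columns of the inputs matrix are not collinear:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def hessian(beta, X):
    """Hessian: -sum_i S(x_i beta)(1 - S(x_i beta)) x_i' x_i, i.e. -X' W X."""
    p = logistic(X @ beta)
    return -X.T @ np.diag(p * (1 - p)) @ X

# Invented toy data with full-rank inputs.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]])
print(np.linalg.eigvalsh(hessian(np.zeros(2), X)))   # all eigenvalues are negative
```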
The maximum likelihood estimator $\widehat{\beta}$ of the parameter $\beta$ solves
$$\widehat{\beta} = \operatorname*{argmax}_{\beta} \; l(\beta; y, X).$$
In general, there is no analytical solution of this maximization problem and a solution must be found numerically (see the lecture entitled Maximum likelihood algorithm for an introduction to the numerical maximization of the likelihood).
The maximization problem is not guaranteed to have a solution: pathological situations can arise in which the log-likelihood never attains its supremum. In these situations the log-likelihood can be brought arbitrarily close to its supremum (which is zero) by letting the norm of $\beta$ grow without bound, but no finite $\beta$ maximizes it.
This happens when the residuals can be made as small as desired, the so-called perfect separation of classes.
It is not a common situation: it means that the model can perfectly fit the observed classes.
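A small numerical illustration of this pathology, with an invented, perfectly separated one-regressor dataset: the log-likelihood keeps increasing towards its supremum as the coefficient grows, so no finite maximizer exists.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta, y, X):
    p = logistic(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Perfectly separated toy data: y = 0 whenever x < 0 and y = 1 whenever x > 0.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

for b in [1.0, 5.0, 20.0]:
    print(b, log_likelihood(np.array([b]), y, X))
# The log-likelihood approaches 0 from below as b grows; it never attains it.
```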
In all other situations, the maximization problem has a solution, and at the maximum the score vector satisfies the first order condition
$$\nabla_{\beta} l(\widehat{\beta}; y, X) = 0,$$
that is,
$$\sum_{i=1}^{N} x_i' \left( y_i - S(x_i\widehat{\beta}) \right) = 0.$$
Note that $y_i - S(x_i\widehat{\beta})$ is the error committed by using $S(x_i\widehat{\beta})$ as a predictor of $y_i$.
It is similar to a regression residual (see Linear regression).
Furthermore, the first order condition above is similar to the first order condition that is found when estimating a linear regression model by ordinary least squares: it says that the residuals need to be orthogonal to the predictors $x_i$.
The first order condition above has no explicit solution. In most statistical software packages it is solved by using the Newton-Raphson method. The method is pretty simple: we start from a guess of the solution (e.g., $\beta_0 = 0$), and then we recursively update the guess with the equation
$$\beta_{t+1} = \beta_t - \left[ \nabla^2_{\beta\beta} l(\beta_t; y, X) \right]^{-1} \nabla_{\beta} l(\beta_t; y, X)$$
until numerical convergence (of $\beta_t$ to the solution $\widehat{\beta}$).
Denote by $p_t$ the $N\times 1$ vector of conditional probabilities of the outputs computed by using $\beta_t$ as parameter:
$$p_t = \begin{bmatrix} S(x_1\beta_t) \\ \vdots \\ S(x_N\beta_t) \end{bmatrix}.$$
Denote by $W_t$ the $N\times N$ diagonal matrix (i.e., having all off-diagonal elements equal to $0$) such that the elements on its diagonal are $S(x_1\beta_t)\left(1-S(x_1\beta_t)\right)$, ..., $S(x_N\beta_t)\left(1-S(x_N\beta_t)\right)$:
$$W_t = \begin{bmatrix} S(x_1\beta_t)\left(1-S(x_1\beta_t)\right) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & S(x_N\beta_t)\left(1-S(x_N\beta_t)\right) \end{bmatrix}.$$
The $N\times K$ matrix of inputs $X$, which is called the design matrix (as in linear regression), is assumed to be a full-rank matrix.
By using this notation, the score in the Newton-Raphson recursive formula can be written as
$$\nabla_{\beta} l(\beta_t; y, X) = X'(y - p_t)$$
and the Hessian as
$$\nabla^2_{\beta\beta} l(\beta_t; y, X) = -X' W_t X.$$
Therefore, the Newton-Raphson formula becomes
$$\beta_{t+1} = \beta_t + \left( X' W_t X \right)^{-1} X'(y - p_t),$$
where the existence of the inverse $\left( X' W_t X \right)^{-1}$ is guaranteed by the assumption that $X$ has full rank (the assumption also guarantees that the log-likelihood is concave and the maximum likelihood problem has a unique solution).
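Putting the pieces together, here is a hedged sketch of the full iteration (function names, tolerances and toy data are invented; production code would add safeguards such as step halving or a line search):

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit_newton(y, X, max_iter=100, tol=1e-10):
    """Maximize the logit log-likelihood by Newton-Raphson, starting from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = logistic(X @ beta)                              # p_t
        W = np.diag(p * (1 - p))                            # W_t
        step = np.linalg.solve(X.T @ W @ X, X.T @ (y - p))  # (X'WX)^{-1} X'(y - p)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Invented toy data (not perfectly separated, so the maximum exists).
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0],
              [1.0, 0.0], [1.0, 1.0], [1.0, -0.5]])
y = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
beta_hat = fit_logit_newton(y, X)
print(beta_hat)
print(X.T @ (y - logistic(X @ beta_hat)))   # first order condition: approximately zero
```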
If you deal with logit models, you will often read that they can be estimated by Iteratively Reweighted Least Squares (IRLS). The Newton-Raphson formula above is equivalent to the IRLS formula
$$\beta_{t+1} = \left( X' W_t X \right)^{-1} X' W_t z_t,$$
that is obtained by performing a Weighted Least Squares (WLS) estimation, with weights $W_t$, of a linear regression of the dependent variables $z_t$ on the regressors $X$.
Write $z_t$ as
$$z_t = X\beta_t + W_t^{-1}(y - p_t).$$
Then, we can re-write the Newton-Raphson formula as follows:
$$\begin{aligned}
\beta_{t+1} &= \beta_t + \left( X' W_t X \right)^{-1} X'(y - p_t) \\
&= \left( X' W_t X \right)^{-1} X' W_t \left[ X\beta_t + W_t^{-1}(y - p_t) \right] \\
&= \left( X' W_t X \right)^{-1} X' W_t z_t.
\end{aligned}$$
The IRLS formula can alternatively be written as
$$\beta_{t+1} = \operatorname*{argmin}_{\beta} \; (z_t - X\beta)' W_t (z_t - X\beta),$$
that is, as the solution of a weighted least squares problem in which $z_t$ is regressed on $X$ with weights $W_t$.
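The equivalence can also be checked numerically; here is a sketch of a single update computed both ways, with the same invented toy data and conventions as in the previous sketches:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_step(beta, y, X):
    p = logistic(X @ beta)
    W = np.diag(p * (1 - p))
    return beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - p))

def irls_step(beta, y, X):
    p = logistic(X @ beta)
    W = np.diag(p * (1 - p))
    z = X @ beta + np.linalg.solve(W, y - p)          # working response z_t
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ z)  # WLS regression of z_t on X

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.array([0.1, -0.2])
print(newton_step(beta, y, X))   # the two updates coincide
print(irls_step(beta, y, X))     # up to numerical precision
```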
The asymptotic covariance matrix of the maximum likelihood estimator $\widehat{\beta}$ is usually estimated with the Hessian (see the lecture on the covariance matrix of MLE estimators), as follows:
$$\widehat{\operatorname{Var}}\left[ \widehat{\beta} \right] = -\left[ \nabla^2_{\beta\beta} l(\widehat{\beta}; y, X) \right]^{-1} = \left( X' \widehat{W} X \right)^{-1},$$
where $\widehat{W} = W_T$ and $\widehat{p} = p_T$ ($T$ is the last step of the iterative procedure used to maximize the likelihood).
As a consequence, the distribution of $\widehat{\beta}$ can be approximated by a normal distribution with mean equal to the true parameter value and variance equal to
$$\left( X' \widehat{W} X \right)^{-1}.$$
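A sketch of the corresponding computation: invert $X'\widehat{W}X$ at the last iterate and take the square roots of its diagonal as approximate standard errors (the toy data and the `fit_logit_newton` helper are the invented ones from the Newton-Raphson sketch above):

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit_newton(y, X, max_iter=100, tol=1e-10):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = logistic(X @ beta)
        W = np.diag(p * (1 - p))
        step = np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Invented toy data.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0],
              [1.0, 0.0], [1.0, 1.0], [1.0, -0.5]])
y = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
beta_hat = fit_logit_newton(y, X)

p_hat = logistic(X @ beta_hat)
W_hat = np.diag(p_hat * (1 - p_hat))
cov_hat = np.linalg.inv(X.T @ W_hat @ X)    # estimated asymptotic covariance matrix
std_err = np.sqrt(np.diag(cov_hat))         # approximate standard errors of beta_hat
print(beta_hat, std_err)
```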
StatLect has several MLE examples. Learn how to find the estimators of the parameters of the following models and distributions.
| Model or distribution | Type | Solution |
|---|---|---|
| Exponential distribution | Univariate distribution | Analytical |
| Normal distribution | Univariate distribution | Analytical |
| Poisson distribution | Univariate distribution | Analytical |
| T distribution | Univariate distribution | Numerical |
| Multivariate normal distribution | Multivariate distribution | Analytical |
| Normal linear regression model | Regression model | Analytical |
| Probit classification model | Classification model | Numerical |
Please cite as:
Taboga, Marco (2021). "Logistic regression - Maximum Likelihood Estimation", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/logistic-model-maximum-likelihood.