This lecture explains how to perform maximum likelihood estimation of the coefficients of a probit model (also called probit regression).
Before reading this lecture, it may be helpful to read the introductory lectures about maximum likelihood estimation and about the probit model.
In a probit model, the output variable $y_i$ is a Bernoulli random variable (i.e., a discrete variable that can take only two values, either $1$ or $0$). Conditional on a $1\times K$ vector of inputs $x_i$, we have that
$$P(y_i = 1 \mid x_i) = \Phi(x_i\beta)$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution and $\beta$ is a $K\times 1$ vector of coefficients.
We assume that a sample of independently and identically distributed input-output couples $(x_i, y_i)$, for $i = 1, \ldots, N$, is observed and used to estimate the vector $\beta$.
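To make the setup concrete, here is a minimal simulation sketch in Python (assuming NumPy and SciPy are available). The sample size N, the number of regressors K and the vector beta_true are arbitrary illustrative choices, not quantities taken from the lecture; the later snippets reuse the objects defined here.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

N, K = 500, 3                              # illustrative sample size and number of inputs
beta_true = np.array([0.5, -1.0, 0.8])     # hypothetical coefficient vector (K x 1)

# N x K design matrix: a constant plus K-1 standard normal regressors
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])

p = norm.cdf(X @ beta_true)                # P(y_i = 1 | x_i) = Phi(x_i * beta)
y = rng.binomial(1, p)                     # Bernoulli outputs
```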
The likelihood of a single observation $(x_i, y_i)$ is
$$L(\beta; y_i, x_i) = [\Phi(x_i\beta)]^{y_i}\,[1-\Phi(x_i\beta)]^{1-y_i}$$
In fact, note that when $y_i = 1$, then
$$L(\beta; y_i, x_i) = \Phi(x_i\beta)$$
and
$$P(y_i = 1 \mid x_i) = \Phi(x_i\beta)$$
while when $y_i = 0$, then
$$L(\beta; y_i, x_i) = 1 - \Phi(x_i\beta)$$
and
$$P(y_i = 0 \mid x_i) = 1 - \Phi(x_i\beta)$$
Since the observations are IID, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:
$$L(\beta; y, X) = \prod_{i=1}^N [\Phi(x_i\beta)]^{y_i}\,[1-\Phi(x_i\beta)]^{1-y_i}$$
where $y$ is the $N\times 1$ vector of all outputs and $X$ is the $N\times K$ matrix of all inputs.
Now, define
$$\bar{y}_i = 2y_i - 1$$
so that
$$\bar{y}_i = 1 \text{ if } y_i = 1; \qquad \bar{y}_i = -1 \text{ if } y_i = 0.$$
By using the newly defined variables $\bar{y}_i$, we can also write the likelihood in the following more compact form:
$$L(\beta; y, X) = \prod_{i=1}^N \Phi(\bar{y}_i x_i\beta)$$
First note that when $y_i = 1$, then $1 - y_i = 0$ and
$$[\Phi(x_i\beta)]^{y_i}\,[1-\Phi(x_i\beta)]^{1-y_i} = \Phi(x_i\beta) \qquad \text{(a)}$$
Furthermore, when $y_i = 0$, then $1 - y_i = 1$ and
$$[\Phi(x_i\beta)]^{y_i}\,[1-\Phi(x_i\beta)]^{1-y_i} = 1 - \Phi(x_i\beta) \qquad \text{(b)}$$
Since $y_i$ can take only two values ($0$ and $1$), (a) and (b) imply that
$$[\Phi(x_i\beta)]^{y_i}\,[1-\Phi(x_i\beta)]^{1-y_i} = \begin{cases} \Phi(x_i\beta) & \text{if } y_i = 1 \\ 1 - \Phi(x_i\beta) & \text{if } y_i = 0 \end{cases}$$
for all $i$. Moreover, the symmetry of the standard normal distribution around $0$ implies that
$$\Phi(-t) = 1 - \Phi(t)$$
So, when $y_i = 1$, then $\bar{y}_i x_i\beta = x_i\beta$ and
$$\Phi(\bar{y}_i x_i\beta) = \Phi(x_i\beta) \qquad \text{(c)}$$
When $y_i = 0$, then $\bar{y}_i x_i\beta = -x_i\beta$ and
$$\Phi(\bar{y}_i x_i\beta) = \Phi(-x_i\beta) = 1 - \Phi(x_i\beta) \qquad \text{(d)}$$
Thus, it follows from (c) and (d) that
$$\Phi(\bar{y}_i x_i\beta) = [\Phi(x_i\beta)]^{y_i}\,[1-\Phi(x_i\beta)]^{1-y_i}$$
for all $i$. Thanks to these facts, we can write the likelihood as
$$L(\beta; y, X) = \prod_{i=1}^N \Phi(\bar{y}_i x_i\beta)$$
The log-likelihood is
$$\ell(\beta; y, X) = \sum_{i=1}^N \left[ y_i \ln\Phi(x_i\beta) + (1-y_i)\ln\bigl(1-\Phi(x_i\beta)\bigr) \right]$$
It is computed as follows:
$$\begin{aligned} \ell(\beta; y, X) &= \ln L(\beta; y, X) \\ &= \ln \prod_{i=1}^N [\Phi(x_i\beta)]^{y_i}\,[1-\Phi(x_i\beta)]^{1-y_i} \\ &= \sum_{i=1}^N \ln\left( [\Phi(x_i\beta)]^{y_i}\,[1-\Phi(x_i\beta)]^{1-y_i} \right) \\ &= \sum_{i=1}^N \left[ y_i \ln\Phi(x_i\beta) + (1-y_i)\ln\bigl(1-\Phi(x_i\beta)\bigr) \right] \end{aligned}$$
By using the $\bar{y}_i$ variables, the log-likelihood can also be written as
$$\ell(\beta; y, X) = \sum_{i=1}^N \ln\Phi(\bar{y}_i x_i\beta)$$
This is derived from the compact form of the likelihood:
$$\ell(\beta; y, X) = \ln \prod_{i=1}^N \Phi(\bar{y}_i x_i\beta) = \sum_{i=1}^N \ln\Phi(\bar{y}_i x_i\beta)$$
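As an illustration, the two equivalent expressions for the log-likelihood translate directly into code (a sketch reusing X, y and norm from the simulation above):

```python
def probit_loglik(beta, y, X):
    """Log-likelihood: sum of y*log(Phi(x*beta)) + (1-y)*log(1-Phi(x*beta))."""
    p = norm.cdf(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def probit_loglik_compact(beta, y, X):
    """Compact form: sum of log Phi(ybar_i * x_i * beta), with ybar_i = 2*y_i - 1."""
    ybar = 2 * y - 1
    return np.sum(norm.logcdf(ybar * (X @ beta)))

# The two expressions coincide up to floating-point error
b = np.zeros(X.shape[1])
assert np.isclose(probit_loglik(b, y, X), probit_loglik_compact(b, y, X))
```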
The score vector, that is, the vector of first derivatives of the log-likelihood with respect to the parameter $\beta$, is
$$\nabla_\beta \ell(\beta; y, X) = \sum_{i=1}^N \frac{\phi(x_i\beta)}{\Phi(x_i\beta)\bigl[1-\Phi(x_i\beta)\bigr]} \bigl[ y_i - \Phi(x_i\beta) \bigr] x_i'$$
where $\phi$ is the probability density function of the standard normal distribution. This is obtained as follows:
$$\begin{aligned} \nabla_\beta \ell(\beta; y, X) &= \frac{\partial}{\partial\beta} \sum_{i=1}^N \left[ y_i \ln\Phi(x_i\beta) + (1-y_i)\ln\bigl(1-\Phi(x_i\beta)\bigr) \right] \\ &\overset{(*)}{=} \sum_{i=1}^N \left[ y_i \frac{\phi(x_i\beta)}{\Phi(x_i\beta)} - (1-y_i)\frac{\phi(x_i\beta)}{1-\Phi(x_i\beta)} \right] x_i' \\ &= \sum_{i=1}^N \frac{\phi(x_i\beta)\left[ y_i\bigl(1-\Phi(x_i\beta)\bigr) - (1-y_i)\Phi(x_i\beta) \right]}{\Phi(x_i\beta)\bigl[1-\Phi(x_i\beta)\bigr]}\, x_i' \\ &= \sum_{i=1}^N \frac{\phi(x_i\beta)}{\Phi(x_i\beta)\bigl[1-\Phi(x_i\beta)\bigr]} \bigl[ y_i - \Phi(x_i\beta) \bigr] x_i' \end{aligned}$$
where in step $(*)$ we have used the fact that the probability density function is the derivative of the cumulative distribution function, that is,
$$\phi(t) = \frac{d\Phi(t)}{dt}$$
By using the $\bar{y}_i$ variables, the score can also be written as
$$\nabla_\beta \ell(\beta; y, X) = \sum_{i=1}^N \lambda_i\, x_i'$$
where
$$\lambda_i = \frac{\bar{y}_i\, \phi(x_i\beta)}{\Phi(\bar{y}_i x_i\beta)}$$
This is demonstrated as follows:
$$\nabla_\beta \ell(\beta; y, X) = \frac{\partial}{\partial\beta} \sum_{i=1}^N \ln\Phi(\bar{y}_i x_i\beta) = \sum_{i=1}^N \frac{\bar{y}_i\, \phi(\bar{y}_i x_i\beta)}{\Phi(\bar{y}_i x_i\beta)}\, x_i' = \sum_{i=1}^N \frac{\bar{y}_i\, \phi(x_i\beta)}{\Phi(\bar{y}_i x_i\beta)}\, x_i' = \sum_{i=1}^N \lambda_i\, x_i'$$
where the third equality follows from the symmetry of $\phi$ around $0$, which implies $\phi(\bar{y}_i x_i\beta) = \phi(x_i\beta)$.
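The two expressions for the score can be checked in the same way (a sketch continuing the example above; the finite-difference comparison at the end is only a numerical sanity check, not part of the lecture):

```python
def probit_score(beta, y, X):
    """Score: sum over i of phi/(Phi*(1-Phi)) * (y_i - Phi) * x_i'."""
    xb = X @ beta
    w = norm.pdf(xb) / (norm.cdf(xb) * (1 - norm.cdf(xb)))
    return X.T @ (w * (y - norm.cdf(xb)))

def probit_score_compact(beta, y, X):
    """Compact form: sum over i of lambda_i * x_i'."""
    ybar = 2 * y - 1
    lam = ybar * norm.pdf(X @ beta) / norm.cdf(ybar * (X @ beta))
    return X.T @ lam

b = np.zeros(X.shape[1])
assert np.allclose(probit_score(b, y, X), probit_score_compact(b, y, X))

# Finite-difference check of the first coordinate of the gradient
eps = 1e-6
e0 = np.zeros_like(b)
e0[0] = eps
fd = (probit_loglik(b + e0, y, X) - probit_loglik(b - e0, y, X)) / (2 * eps)
assert np.isclose(fd, probit_score(b, y, X)[0], rtol=1e-4)
```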
The Hessian, that is, the matrix of second derivatives, is
$$\nabla^2_{\beta\beta} \ell(\beta; y, X) = -\sum_{i=1}^N \lambda_i \bigl( \lambda_i + x_i\beta \bigr)\, x_i' x_i$$
It can be proved as follows:
$$\begin{aligned} \nabla^2_{\beta\beta} \ell(\beta; y, X) &= \frac{\partial}{\partial\beta} \sum_{i=1}^N \lambda_i\, x_i' = \sum_{i=1}^N x_i'\, \frac{\partial \lambda_i}{\partial\beta} \\ \frac{\partial \lambda_i}{\partial\beta} &= \frac{\bar{y}_i\, \phi'(x_i\beta)\, \Phi(\bar{y}_i x_i\beta) - \bar{y}_i^2\, \phi(x_i\beta)\, \phi(\bar{y}_i x_i\beta)}{\bigl[\Phi(\bar{y}_i x_i\beta)\bigr]^2}\, x_i \\ &= \left[ -\lambda_i\, x_i\beta - \lambda_i^2 \right] x_i = -\lambda_i \bigl( \lambda_i + x_i\beta \bigr)\, x_i \end{aligned}$$
where we have used the facts that $\phi'(t) = -t\,\phi(t)$, that $\phi(\bar{y}_i x_i\beta) = \phi(x_i\beta)$ and that $\bar{y}_i^2 = 1$. It can be proved (see, e.g., Amemiya 1985) that the quantity
$$\lambda_i \bigl( \lambda_i + x_i\beta \bigr)$$
is always positive.
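The Hessian can likewise be coded from the weights $\lambda_i(\lambda_i + x_i\beta)$; the sketch below (reusing the objects above) also verifies numerically that the matrix is negative definite:

```python
def probit_hessian(beta, y, X):
    """Hessian: -X' W X, with W diagonal and W_ii = lambda_i * (lambda_i + x_i * beta)."""
    ybar = 2 * y - 1
    xb = X @ beta
    lam = ybar * norm.pdf(xb) / norm.cdf(ybar * xb)
    w = lam * (lam + xb)          # always positive (Amemiya 1985)
    return -(X.T * w) @ X         # equivalent to -X' diag(w) X

b = np.zeros(X.shape[1])
H = probit_hessian(b, y, X)
assert np.all(np.linalg.eigvalsh(H) < 0)   # negative definite when X has full rank
```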
The maximum likelihood estimator $\widehat{\beta}$ of the parameter $\beta$ is obtained as a solution of the following maximization problem:
$$\widehat{\beta} = \operatorname*{arg\,max}_{\beta}\, \ell(\beta; y, X)$$
As in the case of the logit model, the maximization problem is not guaranteed to have a solution, but when it does, at the maximum the score vector satisfies the first order condition
$$\nabla_\beta \ell(\beta; y, X) = 0$$
that is,
$$\sum_{i=1}^N \frac{\phi(x_i\beta)}{\Phi(x_i\beta)\bigl[1-\Phi(x_i\beta)\bigr]} \bigl[ y_i - \Phi(x_i\beta) \bigr] x_i' = 0$$
The quantity
$$y_i - \Phi(x_i\beta)$$
is the residual, that is, the forecasting error committed by using $\Phi(x_i\beta)$ to predict $y_i$. Note the difference with respect to the logit model: in the logit model, the residuals need to be orthogonal to the predictors $x_i$; in the probit model, the orthogonality condition holds for weighted residuals, where the weight assigned to each residual is
$$\frac{\phi(x_i\beta)}{\Phi(x_i\beta)\bigl[1-\Phi(x_i\beta)\bigr]}$$
By using the $\bar{y}_i$ variables and the second expression for the score derived above, the first order condition can also be written as
$$\sum_{i=1}^N \lambda_i\, x_i' = 0$$
where
$$\lambda_i = \frac{\bar{y}_i\, \phi(x_i\beta)}{\Phi(\bar{y}_i x_i\beta)}$$
There is no analytical solution of the first order condition. One of the most common ways of solving it numerically is the Newton-Raphson method, an iterative method. Starting from an initial guess of the solution $\beta_0$ (e.g., $\beta_0 = 0$), we generate a sequence of guesses
$$\beta_{t+1} = \beta_t - \left[ \nabla^2_{\beta\beta} \ell(\beta_t; y, X) \right]^{-1} \nabla_\beta \ell(\beta_t; y, X)$$
and we stop when numerical convergence is achieved (see Maximum likelihood algorithm for an introduction to numerical optimization methods and numerical convergence).
Define
$$\lambda_{i,t} = \frac{\bar{y}_i\, \phi(x_i\beta_t)}{\Phi(\bar{y}_i x_i\beta_t)}$$
and the $N\times 1$ vector
$$\lambda_t = \begin{bmatrix} \lambda_{1,t} \\ \vdots \\ \lambda_{N,t} \end{bmatrix}$$
Denote by $W_t$ the $N\times N$ diagonal matrix (i.e., having all off-diagonal elements equal to $0$) such that the elements on its diagonal are $\lambda_{1,t}(\lambda_{1,t} + x_1\beta_t)$, ..., $\lambda_{N,t}(\lambda_{N,t} + x_N\beta_t)$:
$$W_t = \begin{bmatrix} \lambda_{1,t}(\lambda_{1,t} + x_1\beta_t) & 0 & \cdots & 0 \\ 0 & \lambda_{2,t}(\lambda_{2,t} + x_2\beta_t) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{N,t}(\lambda_{N,t} + x_N\beta_t) \end{bmatrix}$$
The matrix $W_t$ is positive definite because all its diagonal entries are positive (see the comments about the Hessian above). Finally, the $N\times K$ matrix of inputs (the design matrix), defined by
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$
is assumed to have full rank.
With the notation just introduced, we can write the score as
$$\nabla_\beta \ell(\beta_t; y, X) = X'\lambda_t$$
and the Hessian as
$$\nabla^2_{\beta\beta} \ell(\beta_t; y, X) = -X'W_tX$$
Therefore, the Newton-Raphson recursive formula becomes
$$\beta_{t+1} = \beta_t + \left( X'W_tX \right)^{-1} X'\lambda_t$$
The assumption that $X$ has full rank guarantees the existence of the inverse $\left( X'W_tX \right)^{-1}$. Furthermore, it ensures that the Hessian is negative definite, so that the log-likelihood is concave.
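Putting the pieces together, a minimal Newton-Raphson loop for the probit MLE might look as follows (a sketch reusing the simulated data above; the tolerance and iteration cap are arbitrary, and production code would add safeguards such as step halving):

```python
def probit_newton_raphson(y, X, tol=1e-10, max_iter=100):
    """Iterate beta_{t+1} = beta_t + (X' W_t X)^{-1} X' lambda_t until convergence."""
    ybar = 2 * y - 1
    beta = np.zeros(X.shape[1])                            # initial guess beta_0 = 0
    for _ in range(max_iter):
        xb = X @ beta
        lam = ybar * norm.pdf(xb) / norm.cdf(ybar * xb)    # lambda_t
        w = lam * (lam + xb)                               # diagonal of W_t
        step = np.linalg.solve((X.T * w) @ X, X.T @ lam)   # (X' W_t X)^{-1} X' lambda_t
        beta = beta + step
        if np.max(np.abs(step)) < tol:                     # numerical convergence
            break
    return beta

beta_hat = probit_newton_raphson(y, X)
```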
As in the case of the logit classification model, it is straightforward to prove that the Newton-Raphson iterations for the probit model are equivalent to Iteratively Reweighted Least Squares (IRLS) iterations:
$$\beta_{t+1} = \left( X'W_tX \right)^{-1} X'W_t z_t$$
where we perform a Weighted Least Squares (WLS) estimation, with weights $W_t$, of a linear regression of the dependent variables $z_t$ on the regressors $X$. Write $z_t$ as
$$z_t = X\beta_t + W_t^{-1}\lambda_t$$
Then, the Newton-Raphson formula can be written as
$$\begin{aligned} \beta_{t+1} &= \beta_t + \left( X'W_tX \right)^{-1} X'\lambda_t \\ &= \left( X'W_tX \right)^{-1} X'W_tX\beta_t + \left( X'W_tX \right)^{-1} X'W_t W_t^{-1}\lambda_t \\ &= \left( X'W_tX \right)^{-1} X'W_t \left( X\beta_t + W_t^{-1}\lambda_t \right) \\ &= \left( X'W_tX \right)^{-1} X'W_t z_t \end{aligned}$$
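The equivalence can also be verified numerically: one IRLS update, computed as a weighted least squares regression of $z_t$ on $X$, coincides with one Newton-Raphson update (a sketch reusing the objects defined above):

```python
def one_irls_step(beta_t, y, X):
    """IRLS update: (X' W_t X)^{-1} X' W_t z_t, with z_t = X beta_t + W_t^{-1} lambda_t."""
    ybar = 2 * y - 1
    xb = X @ beta_t
    lam = ybar * norm.pdf(xb) / norm.cdf(ybar * xb)
    w = lam * (lam + xb)
    z = xb + lam / w                                       # working dependent variable z_t
    return np.linalg.solve((X.T * w) @ X, X.T @ (w * z))

def one_newton_step(beta_t, y, X):
    """Newton-Raphson update: beta_t + (X' W_t X)^{-1} X' lambda_t."""
    ybar = 2 * y - 1
    xb = X @ beta_t
    lam = ybar * norm.pdf(xb) / norm.cdf(ybar * xb)
    w = lam * (lam + xb)
    return beta_t + np.linalg.solve((X.T * w) @ X, X.T @ lam)

b0 = np.zeros(X.shape[1])
assert np.allclose(one_irls_step(b0, y, X), one_newton_step(b0, y, X))
```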
The Hessian matrix derived above is usually employed to estimate the asymptotic covariance matrix of the maximum likelihood estimator $\widehat{\beta}$:
$$\widehat{\mathrm{Var}}\bigl[ \widehat{\beta} \bigr] = -\left[ \nabla^2_{\beta\beta} \ell(\widehat{\beta}; y, X) \right]^{-1} = \left( X'\widehat{W}X \right)^{-1}$$
where $\widehat{W} = W_T$ and $\widehat{\beta} = \beta_T$ ($T$ is the last step of the iterative procedure used to maximize the likelihood). A proof of the fact that the inverse of the negative Hessian, divided by the sample size, converges to the asymptotic covariance matrix can be found in the lecture on estimating the covariance matrix of MLE estimators. Given the above estimate of the asymptotic covariance matrix, the distribution of $\widehat{\beta}$ can be approximated by a normal distribution having mean equal to the true parameter and covariance matrix
$$\left( X'\widehat{W}X \right)^{-1}$$
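In code, the estimated covariance matrix and the corresponding standard errors can be obtained by inverting the negative Hessian at the last iterate (a sketch reusing probit_hessian and beta_hat from the snippets above; in practice one would cross-check the results against a tested implementation):

```python
# Estimated covariance matrix: inverse of the negative Hessian at beta_hat,
# which equals (X' W_hat X)^{-1}
cov_hat = np.linalg.inv(-probit_hessian(beta_hat, y, X))
std_err = np.sqrt(np.diag(cov_hat))

# Approximate 95% confidence intervals based on the normal approximation
z = norm.ppf(0.975)
ci_lower = beta_hat - z * std_err
ci_upper = beta_hat + z * std_err
print(np.column_stack([beta_hat, std_err, ci_lower, ci_upper]))
```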
Amemiya, T. (1985) Advanced econometrics, Harvard University Press.