Maximum likelihood estimation (MLE) is a method that uses a sample to estimate the parameters of the probability distribution that generated the sample.
This lecture provides an introduction to the theory of maximum likelihood, focusing on its mathematical aspects, in particular on:
its asymptotic properties;
the assumptions that are needed to prove the properties.
At the end of the lecture, we provide links to pages that contain examples and that treat practically relevant aspects of the theory, such as numerical optimization and hypothesis testing.
The main elements of a maximum likelihood estimation problem are the following:

a sample $\xi$, that we use to make statements about the probability distribution that generated the sample;

the sample $\xi$ is regarded as the realization of a random vector $\Xi$, whose distribution is unknown and needs to be estimated;

there is a set $\Theta\subseteq\mathbb{R}^{p}$ of real vectors (called the parameter space) whose elements (called parameters) are put into correspondence with the possible distributions of $\Xi$; in particular:

if $\Xi$ is a discrete random vector, we assume that its joint probability mass function belongs to a set of joint probability mass functions $\{p(x;\theta):\theta\in\Theta\}$ indexed by the parameter $\theta$; when the joint probability mass function is considered as a function of $\theta$ for fixed $\xi$, it is called likelihood (or likelihood function) and it is denoted by
$$L(\theta;\xi)=p(\xi;\theta)$$

if $\Xi$ is a continuous random vector, we assume that its joint probability density function belongs to a set of joint probability density functions $\{f(x;\theta):\theta\in\Theta\}$ indexed by the parameter $\theta$; when the joint probability density function is considered as a function of $\theta$ for fixed $\xi$, it is called likelihood and it is denoted by
$$L(\theta;\xi)=f(\xi;\theta)$$

we need to estimate the true parameter $\theta_{0}$, which is associated with the unknown distribution that actually generated the sample (we rule out the possibility that several different parameters are put into correspondence with the true distribution).
A maximum likelihood estimator $\widehat{\theta}$ of $\theta_{0}$ is obtained as a solution of a maximization problem:
$$\widehat{\theta}=\operatorname*{arg\,max}_{\theta\in\Theta}\,L(\theta;\xi)$$
In other words, $\widehat{\theta}$ is the parameter that maximizes the likelihood of the sample $\xi$, and it is called the maximum likelihood estimator of $\theta_{0}$.
In what follows, the symbol $\widehat{\theta}$ will be used to denote both a maximum likelihood estimator (a random variable) and a maximum likelihood estimate (a realization of a random variable): the meaning will be clear from the context.
The same estimator $\widehat{\theta}$ is obtained as a solution of
$$\widehat{\theta}=\operatorname*{arg\,max}_{\theta\in\Theta}\,\ln L(\theta;\xi)$$
i.e., by maximizing the natural logarithm of the likelihood function. Solving this problem is equivalent to solving the original one, because the logarithm is a strictly increasing function. The logarithm of the likelihood is called log-likelihood and it is denoted by
$$l(\theta;\xi)=\ln L(\theta;\xi)$$
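As a concrete illustration of these definitions, the following minimal Python sketch maximizes the log-likelihood of a simulated IID exponential sample over its rate parameter (the exponential model, the true rate of 2 and the use of SciPy are illustrative assumptions, not part of the lecture); the analytical MLE, the reciprocal of the sample mean, serves as a check.

```python
# Minimal sketch: numerical maximization of a log-likelihood.
# Illustrative setup: an IID exponential sample with unknown rate lam;
# the analytical MLE (1 / sample mean) is used to check the numerical result.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=500)  # true rate lam0 = 2

def neg_log_likelihood(lam):
    # log-likelihood of an exponential sample: n*ln(lam) - lam*sum(x)
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print("numerical MLE :", res.x)
print("analytical MLE:", 1 / x.mean())
```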
To derive the (asymptotic) properties of maximum likelihood estimators, one needs to specify a set of assumptions about the sample $\xi$ and the parameter space $\Theta$.
The next section presents a set of assumptions that allows us to easily derive the asymptotic properties of the maximum likelihood estimator. Some of the assumptions are quite restrictive, while others are very generic. Therefore, the subsequent sections discuss how the most restrictive assumptions can be weakened and how the most generic ones can be made more specific.
Note: the presentation in this section does not aim at being one hundred per cent rigorous. Its aim is rather to introduce the reader to the main steps that are necessary to derive the asymptotic properties of maximum likelihood estimators. Therefore, some technical details are either skipped or de-emphasized. After getting a grasp of the main issues related to the asymptotic properties of MLE, the interested reader can refer to other sources (e.g., Newey and McFadden - 1994, Ruud - 2000) for a fully rigorous presentation of MLE theory.
Let $\{X_{n}\}$ be a sequence of random vectors. Denote by $\xi_{n}$ the sample comprising the first $n$ realizations of the sequence
$$\xi_{n}=\left(x_{1},\ldots,x_{n}\right)$$
which is a realization of the random vector
$$\Xi_{n}=\left(X_{1},\ldots,X_{n}\right)$$
We assume that:
IID. $\{X_{n}\}$ is an IID sequence.
Continuous variables. A generic term $X_{n}$ of the sequence is a continuous random vector, whose joint probability density function $f(x;\theta_{0})$ belongs to a set of joint probability density functions $\{f(x;\theta):\theta\in\Theta\}$ indexed by a parameter $\theta$ (where we have dropped the subscript $n$ to highlight the fact that the terms of the sequence are identically distributed).
Identification. If $\theta\neq\theta_{0}$, then the ratio
$$\frac{f(X;\theta)}{f(X;\theta_{0})}$$
is not almost surely constant. This also implies that the parametric family $\{f(x;\theta):\theta\in\Theta\}$ is identifiable: there does not exist another parameter $\theta\neq\theta_{0}$ such that $f(x;\theta)$ is the true probability density function of $X$.
Integrable log-likelihood. The log-likelihood is integrable:
$$\mathrm{E}\left[\left\vert \ln f(X;\theta)\right\vert \right]<\infty\quad\text{for all }\theta\in\Theta$$
Maximum. The density functions $f(x;\theta)$ and the parameter space $\Theta$ are such that there always exists a unique solution $\widehat{\theta}_{n}$ of the maximization problem:
$$\widehat{\theta}_{n}=\operatorname*{arg\,max}_{\theta\in\Theta}\,\frac{1}{n}\ln L(\theta;\xi_{n})=\operatorname*{arg\,max}_{\theta\in\Theta}\,\frac{1}{n}\sum_{i=1}^{n}\ln f(x_{i};\theta)$$
where the rightmost equality is a consequence of independence (see the IID assumption above). Of course, this is the same as
$$\widehat{\theta}_{n}=\operatorname*{arg\,max}_{\theta\in\Theta}\,\frac{1}{n}\,l(\theta;\xi_{n})$$
where
$$l(\theta;\xi_{n})=\sum_{i=1}^{n}l(\theta;x_{i})$$
is the log-likelihood and
$$l(\theta;x_{i})=\ln f(x_{i};\theta)$$
are the contributions of the individual observations to the log-likelihood. It is also the same as
$$\widehat{\theta}_{n}=\operatorname*{arg\,max}_{\theta\in\Theta}\,l(\theta;\xi_{n})$$
Exchangeability of limit. The density functions $f(x;\theta)$ and the parameter space $\Theta$ are such that
$$\operatorname*{plim}_{n\rightarrow\infty}\left(\operatorname*{arg\,max}_{\theta\in\Theta}\,\frac{1}{n}\sum_{i=1}^{n}\ln f(x_{i};\theta)\right)=\operatorname*{arg\,max}_{\theta\in\Theta}\left(\operatorname*{plim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\ln f(x_{i};\theta)\right)$$
where $\operatorname{plim}$ denotes a limit in probability. Roughly speaking, the probability limit can be brought inside the $\arg\max$ operator.
Differentiability. The log-likelihood $l(\theta;\xi_{n})$ is two times continuously differentiable with respect to $\theta$ in a neighborhood of $\theta_{0}$.
Other technical conditions. The derivatives of the log-likelihood $l(\theta;\xi_{n})$ are well-behaved, so that integration and differentiation can be exchanged, their first and second moments can be computed, and probability limits involving their entries are also well-behaved.
Given the assumptions made above, we can derive an important fact about the expected value of the log-likelihood:
$$\mathrm{E}\left[\ln f(X;\theta_{0})\right]>\mathrm{E}\left[\ln f(X;\theta)\right]\quad\forall\,\theta\neq\theta_{0}$$
First of all,
$$\mathrm{E}\left[\ln f(X;\theta)\right]-\mathrm{E}\left[\ln f(X;\theta_{0})\right]=\mathrm{E}\left[\ln\frac{f(X;\theta)}{f(X;\theta_{0})}\right]$$
Therefore, the inequality
$$\mathrm{E}\left[\ln f(X;\theta_{0})\right]>\mathrm{E}\left[\ln f(X;\theta)\right]$$
is satisfied if and only if
$$\mathrm{E}\left[\ln f(X;\theta)\right]-\mathrm{E}\left[\ln f(X;\theta_{0})\right]<0$$
which can be also written as
$$\mathrm{E}\left[\ln\frac{f(X;\theta)}{f(X;\theta_{0})}\right]<0$$
(note that everything we have done so far is legitimate because we have assumed that the log-likelihoods are integrable). Thus, proving our claim is equivalent to demonstrating that this last inequality holds. In order to do this, we need to use Jensen's inequality. Since the logarithm is a strictly concave function and, by our assumptions, the ratio
$$\frac{f(X;\theta)}{f(X;\theta_{0})}$$
is not almost surely constant, by Jensen's inequality we have
$$\mathrm{E}\left[\ln\frac{f(X;\theta)}{f(X;\theta_{0})}\right]<\ln\mathrm{E}\left[\frac{f(X;\theta)}{f(X;\theta_{0})}\right]$$
But
$$\mathrm{E}\left[\frac{f(X;\theta)}{f(X;\theta_{0})}\right]=\int\frac{f(x;\theta)}{f(x;\theta_{0})}\,f(x;\theta_{0})\,dx=\int f(x;\theta)\,dx=1$$
Therefore,
$$\mathrm{E}\left[\ln\frac{f(X;\theta)}{f(X;\theta_{0})}\right]<\ln 1=0$$
which is exactly what we needed to prove.
This inequality, called information inequality by many authors, is essential for proving the consistency of the maximum likelihood estimator.
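The inequality can be illustrated with a rough Monte Carlo check. The sketch below assumes a hypothetical $N(\theta,1)$ model with true mean $\theta_{0}=0$ and approximates $\mathrm{E}\left[\ln f(X;\theta)\right]$ by a sample average; the average is largest at $\theta=\theta_{0}$, as the inequality predicts.

```python
# Rough Monte Carlo illustration of the information inequality, assuming a
# hypothetical N(theta, 1) model with true mean theta0 = 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # draws from f(x; theta0)

for theta in (0.0, 0.5, -1.0, 2.0):
    # sample average of ln f(X; theta) approximates E[ln f(X; theta)]
    avg = norm.logpdf(x, loc=theta, scale=1.0).mean()
    print(f"theta = {theta:+.1f}  average log-density = {avg:.4f}")
# The largest average occurs at theta = theta0 = 0.
```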
Given the assumptions above, the maximum likelihood estimator $\widehat{\theta}_{n}$ is a consistent estimator of the true parameter $\theta_{0}$:
$$\operatorname*{plim}_{n\rightarrow\infty}\widehat{\theta}_{n}=\theta_{0}$$
where $\operatorname{plim}$ denotes a limit in probability.
We have assumed that the density functions $f(x;\theta)$ and the parameter space $\Theta$ are such that
$$\operatorname*{plim}_{n\rightarrow\infty}\widehat{\theta}_{n}=\operatorname*{plim}_{n\rightarrow\infty}\left(\operatorname*{arg\,max}_{\theta\in\Theta}\,\frac{1}{n}\sum_{i=1}^{n}\ln f(x_{i};\theta)\right)=\operatorname*{arg\,max}_{\theta\in\Theta}\left(\operatorname*{plim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\ln f(x_{i};\theta)\right)$$
But
$$\operatorname*{plim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\ln f(x_{i};\theta)=\mathrm{E}\left[\ln f(X;\theta)\right]$$
This equality is true because, by Kolmogorov's Strong Law of Large Numbers (we have an IID sequence with finite mean), the sample average
$$\frac{1}{n}\sum_{i=1}^{n}\ln f(x_{i};\theta)$$
converges almost surely to $\mathrm{E}\left[\ln f(X;\theta)\right]$ and, therefore, it converges also in probability (convergence almost surely implies convergence in probability). Thus, putting things together, we obtain
$$\operatorname*{plim}_{n\rightarrow\infty}\widehat{\theta}_{n}=\operatorname*{arg\,max}_{\theta\in\Theta}\,\mathrm{E}\left[\ln f(X;\theta)\right]$$
In the proof of the information inequality (see above), we have seen that
$$\mathrm{E}\left[\ln f(X;\theta_{0})\right]>\mathrm{E}\left[\ln f(X;\theta)\right]\quad\forall\,\theta\neq\theta_{0}$$
which, obviously, implies
$$\theta_{0}=\operatorname*{arg\,max}_{\theta\in\Theta}\,\mathrm{E}\left[\ln f(X;\theta)\right]$$
Thus,
$$\operatorname*{plim}_{n\rightarrow\infty}\widehat{\theta}_{n}=\theta_{0}$$
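Consistency can be visualized with a small simulation. The sketch below assumes a hypothetical exponential model with true rate $\lambda_{0}=2$, whose MLE is the reciprocal of the sample mean; as the sample size grows, the estimate drifts toward the true value.

```python
# Sketch of consistency: for a hypothetical exponential model with true rate
# lam0 = 2, the MLE (1 / sample mean) approaches lam0 as n grows.
import numpy as np

rng = np.random.default_rng(2)
lam0 = 2.0
for n in (10, 100, 1_000, 10_000, 100_000):
    x = rng.exponential(scale=1 / lam0, size=n)
    print(f"n = {n:>6d}  MLE = {1 / x.mean():.4f}  (true value {lam0})")
```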
Denote by
$$\nabla_{\theta}\,l(\theta;\xi_{n})$$
the gradient of the log-likelihood, that is, the vector of first derivatives of the log-likelihood, evaluated at the point $\theta$. This vector is often called the score vector.
Given the assumptions above, the score has zero expected value:
$$\mathrm{E}\left[\nabla_{\theta}\,l(\theta_{0};\Xi_{n})\right]=0$$
First of all, note that
$$\int f(x;\theta)\,dx=1$$
because probability density functions integrate to $1$. Now, taking the first derivative of both sides with respect to any component $\theta_{k}$ of $\theta$ and bringing the derivative inside the integral:
$$\int\frac{\partial f(x;\theta)}{\partial\theta_{k}}\,dx=0$$
Now, multiply and divide the integrand function by $f(x;\theta)$:
$$\int\frac{\partial f(x;\theta)/\partial\theta_{k}}{f(x;\theta)}\,f(x;\theta)\,dx=0$$
Since
$$\frac{\partial\ln f(x;\theta)}{\partial\theta_{k}}=\frac{\partial f(x;\theta)/\partial\theta_{k}}{f(x;\theta)}$$
we can write
$$\int\frac{\partial\ln f(x;\theta)}{\partial\theta_{k}}\,f(x;\theta)\,dx=0$$
or, using the definition of expected value:
$$\mathrm{E}\left[\frac{\partial\ln f(X;\theta)}{\partial\theta_{k}}\right]=0$$
which can be written in vector form using the gradient notation as
$$\mathrm{E}\left[\nabla_{\theta}\ln f(X;\theta)\right]=0$$
This result can be used to derive the expected value of the score as follows:
$$\mathrm{E}\left[\nabla_{\theta}\,l(\theta_{0};\Xi_{n})\right]=\mathrm{E}\left[\sum_{i=1}^{n}\nabla_{\theta}\ln f(X_{i};\theta_{0})\right]=\sum_{i=1}^{n}\mathrm{E}\left[\nabla_{\theta}\ln f(X_{i};\theta_{0})\right]=0$$
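A quick simulation makes the zero-mean property tangible. The sketch below assumes a hypothetical exponential model $f(x;\lambda)=\lambda e^{-\lambda x}$, for which the score of one observation is $1/\lambda-x$; its sample average is close to zero at the true rate $\lambda_{0}$ but not at other parameter values.

```python
# Sketch of the zero-mean property of the score, assuming a hypothetical
# exponential model f(x; lam) = lam*exp(-lam*x); the score of one observation
# is d/d lam ln f(x; lam) = 1/lam - x.
import numpy as np

rng = np.random.default_rng(6)
lam0 = 2.0
x = rng.exponential(scale=1 / lam0, size=200_000)

for lam in (lam0, 1.0, 3.0):
    print(f"lam = {lam}:  average score = {np.mean(1 / lam - x):+.4f}")
# The average is close to zero only at the true parameter lam0.
```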
Given the assumptions above, the covariance matrix of the score (called information matrix or Fisher information matrix) is
$$\mathrm{Var}\left[\nabla_{\theta}\,l(\theta_{0};\Xi_{n})\right]=-\mathrm{E}\left[\nabla_{\theta\theta}\,l(\theta_{0};\Xi_{n})\right]$$
where $\nabla_{\theta\theta}\,l(\theta_{0};\Xi_{n})$ is the Hessian of the log-likelihood, that is, the matrix of second derivatives of the log-likelihood, evaluated at the point $\theta_{0}$.
From the previous proof, we know that
$$\int\frac{\partial\ln f(x;\theta)}{\partial\theta_{k}}\,f(x;\theta)\,dx=0$$
Now, taking the first derivative of both sides with respect to any component $\theta_{l}$ of $\theta$, and using the fact that $\partial f(x;\theta)/\partial\theta_{l}=\left(\partial\ln f(x;\theta)/\partial\theta_{l}\right)f(x;\theta)$, we obtain
$$\int\frac{\partial^{2}\ln f(x;\theta)}{\partial\theta_{l}\partial\theta_{k}}\,f(x;\theta)\,dx+\int\frac{\partial\ln f(x;\theta)}{\partial\theta_{k}}\,\frac{\partial\ln f(x;\theta)}{\partial\theta_{l}}\,f(x;\theta)\,dx=0$$
Rearranging, we get
$$\mathrm{E}\left[\frac{\partial\ln f(X;\theta)}{\partial\theta_{k}}\,\frac{\partial\ln f(X;\theta)}{\partial\theta_{l}}\right]=-\mathrm{E}\left[\frac{\partial^{2}\ln f(X;\theta)}{\partial\theta_{l}\partial\theta_{k}}\right]$$
Since this is true for any $\theta_{k}$ and any $\theta_{l}$, we can express it in matrix form as
$$\mathrm{E}\left[\nabla_{\theta}\ln f(X;\theta)\,\nabla_{\theta}\ln f(X;\theta)^{\top}\right]=-\mathrm{E}\left[\nabla_{\theta\theta}\ln f(X;\theta)\right]$$
where the left hand side is the covariance matrix of the gradient (which has zero expected value, as shown above). This result is equivalent to the result we need to prove because, by independence,
$$\mathrm{Var}\left[\nabla_{\theta}\,l(\theta_{0};\Xi_{n})\right]=\sum_{i=1}^{n}\mathrm{Var}\left[\nabla_{\theta}\ln f(X_{i};\theta_{0})\right]=-\sum_{i=1}^{n}\mathrm{E}\left[\nabla_{\theta\theta}\ln f(X_{i};\theta_{0})\right]=-\mathrm{E}\left[\nabla_{\theta\theta}\,l(\theta_{0};\Xi_{n})\right]$$
The latter equality is often called information equality.
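For a one-parameter case, the information equality can be checked numerically. The sketch below again assumes a hypothetical exponential model $f(x;\lambda)=\lambda e^{-\lambda x}$, whose score at $\lambda_{0}$ is $1/\lambda_{0}-x$ and whose second derivative of $\ln f$ is $-1/\lambda^{2}$; both $\mathrm{E}[\text{score}^{2}]$ and $-\mathrm{E}[\text{Hessian}]$ should be close to $1/\lambda_{0}^{2}$.

```python
# Numeric check of the information equality for a hypothetical exponential
# model f(x; lam) = lam*exp(-lam*x): E[score^2] should match -E[Hessian],
# and both should be close to the theoretical value 1/lam0**2.
import numpy as np

rng = np.random.default_rng(3)
lam0 = 2.0
x = rng.exponential(scale=1 / lam0, size=200_000)

score = 1 / lam0 - x                                  # d/d lam ln f at lam0
fisher_from_score = np.mean(score ** 2)               # E[score^2]
fisher_from_hessian = -np.mean(np.full_like(x, -1 / lam0 ** 2))  # -E[d2 ln f]

print("E[score^2]          :", fisher_from_score)
print("-E[Hessian]         :", fisher_from_hessian)
print("theoretical 1/lam0^2:", 1 / lam0 ** 2)
```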
The maximum likelihood estimator is asymptotically normal:
$$\sqrt{n}\left(\widehat{\theta}_{n}-\theta_{0}\right)\xrightarrow{\;d\;}N\left(0,V\right)\quad\text{with}\quad V=\left(\mathrm{E}\left[\nabla_{\theta}\ln f(X;\theta_{0})\,\nabla_{\theta}\ln f(X;\theta_{0})^{\top}\right]\right)^{-1}$$
In other words, the distribution of the maximum likelihood estimator $\widehat{\theta}_{n}$ can be approximated by a multivariate normal distribution with mean $\theta_{0}$ and covariance matrix
$$\frac{1}{n}\,V$$
Denote by
$$\nabla_{\theta}\,l(\theta;\xi_{n})$$
the gradient of the log-likelihood, i.e., the vector of first derivatives of the log-likelihood. Denote by
$$\nabla_{\theta\theta}\,l(\theta;\xi_{n})$$
the Hessian of the log-likelihood, i.e., the matrix of second derivatives of the log-likelihood. Since the maximum likelihood estimator $\widehat{\theta}_{n}$ maximizes the log-likelihood, it satisfies the first order condition
$$\nabla_{\theta}\,l(\widehat{\theta}_{n};\xi_{n})=0$$
Furthermore, by the Mean Value Theorem, we have
$$\nabla_{\theta}\,l(\widehat{\theta}_{n};\xi_{n})=\nabla_{\theta}\,l(\theta_{0};\xi_{n})+\nabla_{\theta\theta}\,l(\bar{\theta};\xi_{n})\left(\widehat{\theta}_{n}-\theta_{0}\right)$$
where, for each $k$, the intermediate points $\bar{\theta}_{k}$ satisfy
$$\left\Vert \bar{\theta}_{k}-\theta_{0}\right\Vert \leq\left\Vert \widehat{\theta}_{n}-\theta_{0}\right\Vert$$
and the notation $\nabla_{\theta\theta}\,l(\bar{\theta};\xi_{n})$ indicates that each row of the Hessian is evaluated at a different point (row $k$ is evaluated at the point $\bar{\theta}_{k}$).
Substituting the first order condition in the mean value equation, we obtain
$$0=\nabla_{\theta}\,l(\theta_{0};\xi_{n})+\nabla_{\theta\theta}\,l(\bar{\theta};\xi_{n})\left(\widehat{\theta}_{n}-\theta_{0}\right)$$
which, by solving for $\widehat{\theta}_{n}-\theta_{0}$, becomes
$$\widehat{\theta}_{n}-\theta_{0}=-\left[\nabla_{\theta\theta}\,l(\bar{\theta};\xi_{n})\right]^{-1}\nabla_{\theta}\,l(\theta_{0};\xi_{n})$$
which can be rewritten as
$$\sqrt{n}\left(\widehat{\theta}_{n}-\theta_{0}\right)=\left[-\frac{1}{n}\nabla_{\theta\theta}\,l(\bar{\theta};\xi_{n})\right]^{-1}\left[\frac{1}{\sqrt{n}}\nabla_{\theta}\,l(\theta_{0};\xi_{n})\right]$$
We will show that the term in the first pair of square brackets converges in probability to a constant, invertible matrix and that the term in the second pair of square brackets converges in distribution to a normal distribution. The consequence will be that their product also converges in distribution to a normal distribution (by Slutsky's theorem).
As far as the first term is concerned, note that the intermediate points $\bar{\theta}_{k}$ converge in probability to $\theta_{0}$ (a consequence of the consistency of $\widehat{\theta}_{n}$):
$$\operatorname*{plim}_{n\rightarrow\infty}\bar{\theta}_{k}=\theta_{0}$$
Therefore, skipping some technical details, we get
$$\operatorname*{plim}_{n\rightarrow\infty}\left[-\frac{1}{n}\nabla_{\theta\theta}\,l(\bar{\theta};\xi_{n})\right]=-\mathrm{E}\left[\nabla_{\theta\theta}\ln f(X;\theta_{0})\right]$$
As far as the second term is concerned, the Central Limit Theorem applies (the score of a single observation has zero mean and its covariance matrix is the information matrix of one observation), so we get
$$\frac{1}{\sqrt{n}}\nabla_{\theta}\,l(\theta_{0};\xi_{n})=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}\nabla_{\theta}\ln f(x_{i};\theta_{0})\right)\xrightarrow{\;d\;}N\left(0,\mathrm{E}\left[\nabla_{\theta}\ln f(X;\theta_{0})\,\nabla_{\theta}\ln f(X;\theta_{0})^{\top}\right]\right)$$
By putting things together and using the Continuous Mapping Theorem and Slutsky's theorem (see also the exercises in the lecture on Slutsky's theorem), we obtain
$$\sqrt{n}\left(\widehat{\theta}_{n}-\theta_{0}\right)\xrightarrow{\;d\;}N\left(0,\left(\mathrm{E}\left[\nabla_{\theta\theta}\ln f(X;\theta_{0})\right]\right)^{-1}\mathrm{E}\left[\nabla_{\theta}\ln f(X;\theta_{0})\,\nabla_{\theta}\ln f(X;\theta_{0})^{\top}\right]\left(\mathrm{E}\left[\nabla_{\theta\theta}\ln f(X;\theta_{0})\right]\right)^{-1}\right)$$
By the information equality (see its proof), the asymptotic covariance matrix is equal to the inverse of the negative expected value of the Hessian matrix:
$$V=-\left(\mathrm{E}\left[\nabla_{\theta\theta}\ln f(X;\theta_{0})\right]\right)^{-1}=\left(\mathrm{E}\left[\nabla_{\theta}\ln f(X;\theta_{0})\,\nabla_{\theta}\ln f(X;\theta_{0})^{\top}\right]\right)^{-1}$$
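The normal approximation can be checked by simulation. The sketch below again assumes a hypothetical exponential-rate model: the Fisher information of one observation is $1/\lambda_{0}^{2}$, so the standard deviation of $\sqrt{n}\left(\widehat{\lambda}_{n}-\lambda_{0}\right)$ should be close to $\lambda_{0}$.

```python
# Sketch of asymptotic normality for a hypothetical exponential-rate MLE:
# the asymptotic variance is lam0**2 (inverse Fisher information), so
# sqrt(n)*(MLE - lam0) should have standard deviation close to lam0.
import numpy as np

rng = np.random.default_rng(4)
lam0, n, reps = 2.0, 2_000, 5_000
draws = rng.exponential(scale=1 / lam0, size=(reps, n))
mle = 1 / draws.mean(axis=1)                  # MLE in each simulated sample
z = np.sqrt(n) * (mle - lam0)

print("empirical std of sqrt(n)*(MLE - lam0):", z.std())
print("asymptotic std (lam0)                :", lam0)
```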
As previously mentioned, some of the assumptions made above are quite restrictive, while others are very generic. We now discuss how the former can be weakened and how the latter can be made more specific.
Assumption 1 (IID). It is possible to relax the assumption that $\{X_{n}\}$ is IID and allow for some dependence among the terms of the sequence (see,
e.g., Bierens - 2004 for a discussion). In case
dependence is present, the formula for the asymptotic covariance matrix of the
MLE given above is no longer valid and needs to be replaced by a formula that
takes serial correlation into account.
Assumption 2 (continuous variables). It is possible to prove consistency and asymptotic normality also when the terms of the sequence $\{X_{n}\}$ are extracted from a discrete distribution, or from a distribution that is
neither discrete nor continuous (see, e.g., Newey and
McFadden - 1994).
Assumption 3 (identification). Typically, different identification conditions are needed when the IID assumption is relaxed (e.g., Bierens - 2004).
Assumption 5 (maximum). To ensure the existence of a maximum, requirements are typically imposed both on the parameter space and on the log-likelihood function. For example, it can be required that the parameter space be compact (closed and bounded) and the log-likelihood function be continuous. Also, the parameter space can be required to be convex and the log-likelihood function strictly concave (e.g., Newey and McFadden - 1994).
Assumption 6 (exchangeability of limit). To ensure the exchangeability of the limit and the $\arg\max$ operator, the following uniform convergence condition is often imposed:
$$\sup_{\theta\in\Theta}\left\vert \frac{1}{n}\sum_{i=1}^{n}\ln f(x_{i};\theta)-\mathrm{E}\left[\ln f(X;\theta)\right]\right\vert \overset{P}{\longrightarrow}0$$
Assumption 8 (other technical conditions). See, for example, Newey and McFadden (1994) for a discussion of these technical conditions.
In some cases, the maximum likelihood problem has an analytical solution. That is, it is possible to write the maximum likelihood estimator $\widehat{\theta}$ explicitly as a function of the data.
However, in many cases there is no explicit solution. In these cases, numerical optimization algorithms are used to maximize the log-likelihood. The lecture entitled Maximum likelihood - Algorithm discusses these algorithms.
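As an illustration of the numerical route, the following sketch (a hypothetical logistic model with synthetic data, not one of the lecture's worked examples) maximizes a Bernoulli log-likelihood with SciPy's BFGS optimizer, since the coefficients have no closed-form solution.

```python
# Minimal sketch of numerical ML estimation when no closed form exists,
# assuming a hypothetical logistic model P(y=1|x) = 1/(1 + exp(-(b0 + b1*x)))
# with synthetic data; BFGS maximizes the log-likelihood via its negative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 1_000
x = rng.normal(size=n)
b_true = np.array([0.5, -1.0])
p = 1 / (1 + np.exp(-(b_true[0] + b_true[1] * x)))
y = rng.binomial(1, p)

def neg_log_likelihood(b):
    eta = b[0] + b[1] * x
    # Bernoulli log-likelihood with logit link; logaddexp avoids overflow
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

res = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print("estimated coefficients:", res.x)   # should be close to b_true
```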
The following lectures provide detailed examples of how to derive analytically the maximum likelihood (ML) estimators and their asymptotic variance:
ML estimation of the parameter of the exponential distribution
ML estimation of the parameters of the multivariate normal distribution
ML estimation of the parameters of a normal linear regression model
The following lectures provide examples of how to perform maximum likelihood estimation numerically:
ML estimation of the degrees of freedom of a standard t distribution (MATLAB example)
ML estimation of the coefficients of a logistic classification model
ML estimation of the coefficients of a probit classification model
The following sections contain more details about the theory of maximum likelihood estimation.
Methods to estimate the asymptotic covariance matrix of maximum likelihood estimators, including OPG, Hessian and Sandwich estimators, are discussed in the lecture entitled Maximum likelihood - Covariance matrix estimation.
Tests of hypotheses on parameters estimated by maximum likelihood are discussed in the lecture entitled Maximum likelihood - Hypothesis testing, as well as in the lectures on the three classical tests: the Wald test, the likelihood ratio test, and the Lagrange multiplier (score) test.
Bierens, H. J. (2004) Introduction to the mathematical and statistical foundations of econometrics, Cambridge University Press.
Newey, W. K. and D. McFadden (1994) "Chapter 35: Large sample estimation and hypothesis testing", in Handbook of Econometrics, Elsevier.
Ruud, P. A. (2000) An introduction to classical econometric theory, Oxford University Press.
Please cite as:
Taboga, Marco (2021). "Maximum likelihood estimation", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/maximum-likelihood.